# A Journey Through Fastbook (AJTFB) - Chapter 4

The fourth in a weekly-ish series where I revisit the fast.ai book, "Deep Learning for Coders with fastai & PyTorch", and provide commentary on the bits that jumped out to me chapter by chapter. So without further ado, let's go!

```
mnist_path = untar_data(URLs.MNIST_SAMPLE)
mnist_path.ls()
```

```
sample_3 = Image.open((mnist_path/'train/3').ls().sorted()[1])
sample_3
```

```
sample_3_t = tensor(sample_3)
df = pd.DataFrame(sample_3_t[4:15, 4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')
```

### What is a baseline model and why do you want one?

A simple model that you are confident should perform reasonably well. It should be simple to implement and easy to test.

**Why do you want to start with a baseline model?**

... without starting with a sensible baseline, it is difficult to know whether your super-fancy models are any good

**How do you build/find one of these models?**

You can search online for folks who have trained models to solve a problem similar to yours, and/or you can start with one of the high-level examples in the fastai docs and run it against your data. There are a bunch covering core vision, text, tabular, and collaborative filtering tasks.

### Tensors

**What is a "Tensor"?**

Like a numpy array, but with GPU support. The data it contains must be of the **same type** and must conform in *rectangular shape*.

**Important:** "try to avoid as much as possible writing loops, and replace them by commands that work directly on arrays or tensors"

Let's take a look ..

```
threes = (mnist_path/'train/3').ls().sorted()
len(threes), threes[0]
```

```
all_threes = [ tensor(Image.open(fp)) for fp in threes ]
len(all_threes), all_threes[0].shape
```

```
stacked_threes = torch.stack(all_threes).float()/255
stacked_threes.shape
```

Important information about a tensor includes its `shape`, `rank`, and `type`:

```
print('shape: ', stacked_threes.shape)
# rank = the total number of axes
print('rank: ', stacked_threes.ndim)
# type = the datatype of its contents
print('type: ', stacked_threes.dtype)
```

Important things you can do to a tensor: `view`, `@`, and `where`

```
stacked_threes_rank_2 = stacked_threes.view(-1, 28*28)
print('orig. shape: ', stacked_threes.shape)
print('make into a rank 2 tensor', stacked_threes_rank_2.shape)
# @ = operator for matrix multiplication
print('result of matrix multiplication: ', (stacked_threes @ torch.randn((1,28,28))).shape)
# where = torch.where(a,b,c) => [b[i] if a[i] else c[i] for i in range(len(a))] ... see p.167
trgs = tensor([1,0,1])
preds = tensor([0.9, 0.4, 0.2])
def mnist_loss(preds, targs):
    # distance of each prediction from its target (1 -> 1 - pred, 0 -> pred)
    return torch.where(targs == 1, 1 - preds, preds).mean()
print('output of where: ', mnist_loss(preds, trgs))
```

For an interactive lesson on matrix multiplication, this is the best!

Check out pp.145-148 to learn about "broadcasting", a critical piece to understanding how you can and should manipulate tensors or numpy arrays!
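To make "broadcasting" a bit more concrete, here is a minimal sketch of my own (the tiny tensors are made up for illustration, they aren't from the book). A rank 2 tensor is subtracted from a rank 3 tensor, and PyTorch behaves as if the smaller one had been repeated along the missing axis:

```python
import torch

batch = torch.ones(3, 2, 2)                  # rank 3: three 2x2 "images"
mean_img = torch.tensor([[0.0, 1.0],
                         [2.0, 3.0]])        # rank 2: a single 2x2 "image"

# mean_img is treated as if it were repeated 3 times along the first axis,
# so no Python loop over the batch is needed
diff = batch - mean_img
print(diff.shape)            # torch.Size([3, 2, 2])
print(diff[0].tolist())      # [[1.0, 0.0], [-1.0, -2.0]]
```

This is the same mechanism that lets you compare every image in `stacked_threes` against a single "ideal" image in one expression.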

Here are the steps:

1. **INITIALIZE** the weights = initializing parameters to random values
2. For each image, **PREDICT** whether it is a 3 or 7
3. Based on the predictions, calculate how good the model is by calculating its **LOSS** (small is good)
4. Calculate the **GRADIENT**, *"which measures for each weight how changing the weight would change the loss"*
5. **STEP**, change all the weights based on the gradient
6. Starting at step 2, **REPEAT**
7. **STOP** when you don't want to train any longer or the model is good enough
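The steps above can be sketched end to end on a toy problem. This is my own illustration, not the book's MNIST code: the data (`y = 3x + 2`), the learning rate, and the stopping threshold are all assumptions picked so the loop converges quickly.

```python
import torch

torch.manual_seed(42)

# Hypothetical toy data: the "true" relationship is y = 3x + 2
x = torch.linspace(-1, 1, 20)
y = 3 * x + 2

# 1. INITIALIZE the parameters to random values
w = torch.randn(1).requires_grad_()
b = torch.randn(1).requires_grad_()
lr = 0.1

for epoch in range(200):              # 6. REPEAT
    preds = w * x + b                 # 2. PREDICT
    loss = ((preds - y) ** 2).mean()  # 3. LOSS (mean squared error)
    loss.backward()                   # 4. GRADIENT for every parameter
    with torch.no_grad():             # 5. STEP
        w -= w.grad * lr
        b -= b.grad * lr
        w.grad.zero_()                # reset so gradients don't accumulate
        b.grad.zero_()
    if loss.item() < 1e-6:            # 7. STOP when the model is good enough
        break

print(w.item(), b.item(), loss.item())
```

After training, `w` and `b` should land close to 3 and 2. Each numbered comment maps to one of the steps in the list.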

Below, we'll delve deeper into these steps. We'll do this by getting a bit more into the sample code beginning on p.150 ...

### Step 3: Calculating the loss

**Important:** "For continuous data, it’s common to use *mean squared error*". In order to understand how to write this, read it right-to-left (e.g., error -> square -> mean)

```
def mse(preds, targs): return ((preds-targs)**2).mean()
# in PyTorch
loss = F.mse_loss(preds, targs)
```
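Here is the "error -> square -> mean" reading spelled out one step at a time, as a small sketch with made-up numbers, checked against PyTorch's `F.mse_loss`:

```python
import torch
import torch.nn.functional as F

preds = torch.tensor([2.0, 4.0, 6.0])   # made-up predictions
targs = torch.tensor([1.0, 4.0, 8.0])   # made-up targets

error = preds - targs        # error:  [1, 0, -2]
squared = error ** 2         # square: [1, 0, 4]
manual = squared.mean()      # mean:   5/3

print(manual.item(), F.mse_loss(preds, targs).item())
```

Both numbers should match, since `F.mse_loss` averages the squared errors by default.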

**Important:** Accuracy is a bad loss function

**Why is accuracy a poor loss function?**

"The gradient of a function is its slope, or its steepness ... how much the value of the function goes up or down, divided by how much we changed the input `(y_new - y_old) / (x_new - x_old)` .... The problem with [accuracy] is that a small change in weights from `x_old` to `x_new` isn't likely to cause any prediction to change, so `(y_new - y_old)` will almost always be 0 ... **the gradient is 0 almost everywhere.** A very small change in the value of a weight will often not change the accuracy at all"

A gradient = 0 will mean that the weights aren't updated.

**Important:** "We need a loss function that, when our weights result in slightly better predictions, gives us a slightly better loss"
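A quick way to see this is to nudge a weight and watch both numbers. The sketch below is my own plain-Python toy: the one-weight "model" (`sigmoid(w * x)`), the data, and the helper names are all assumptions for illustration. The smooth loss follows the same idea as `mnist_loss` above.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical one-weight "model": prediction = sigmoid(w * x)
xs = [-2.0, -1.0, 1.0, 2.0]
targs = [0, 0, 1, 1]

def accuracy(w):
    preds = [sigmoid(w * x) for x in xs]
    return sum((p > 0.5) == bool(t) for p, t in zip(preds, targs)) / len(xs)

def smooth_loss(w):
    # distance of each prediction from its target, averaged
    preds = [sigmoid(w * x) for x in xs]
    return sum((1 - p) if t == 1 else p for p, t in zip(preds, targs)) / len(xs)

# Nudge the weight a little: accuracy doesn't move, the loss does
for w in (0.5, 0.51):
    print(w, accuracy(w), smooth_loss(w))
```

The accuracy is identical for both weights (so its slope is 0 here), while the smooth loss gets slightly better, which is exactly the signal gradient descent needs.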

**Metrics v. Loss**

**Important:** "... *the metric is to drive human understanding* **and** *the loss is to drive automated learning*."

**Important:** "... focus on these metrics, rather than the loss, when judging the performance of a model."

**Important:** "... the loss must be a function that has a meaningful derivative ... must be reasonably smooth [so] that [it] would respond to small changes in confidence level."

The loss function is one that can be optimized using its gradient!

### Step 4: Calculating the gradients

**Important:** "the gradients *tell us how much we have to change each weight* to make our model better ... allows us to more quickly calculate whether our loss will go up or down when we make those adjustments"

**Important:** "The gradients *tell us only the slope of our function*; they don’t tell us exactly how far to adjust the parameters. But they do give us some idea of how far" (large slope = bigger adjustments needed whereas a small slope suggests we are close to the optimal value)

"The **derivative** of a function tells you how much a change in its parameters will change its result"

Remember: We are calculating a gradient for *EVERY* weight so we know how to adjust it to make our model better (i.e., lower the LOSS)

`requires_grad` tells PyTorch "that we want to calculate gradients with respect to that variable at that value"

```
def plot_function(f, tx=None, ty=None, title=None, min=-2, max=2, figsize=(6,4)):
    # plot f over [min, max] with optional axis labels and title
    x = torch.linspace(min,max)
    fig,ax = plt.subplots(figsize=figsize)
    ax.plot(x,f(x))
    if tx is not None: ax.set_xlabel(tx)
    if ty is not None: ax.set_ylabel(ty)
    if title is not None: ax.set_title(title)
```

Here we pretend that the function below is our **loss function**. Running a number (our **weight**) through it produces a result, an **activation** ... in this case, our **loss** (which again is a value telling us how good or bad our model is; smaller = good).

```
xt = tensor(-1.5).requires_grad_(); xt
```

```
def f(x): return x**2
loss = f(xt)
plot_function(f, 'x', 'x**2')
plt.scatter(xt.detach().numpy(), loss.detach().numpy(), color='red')
print('Loss: ', loss.item())
```

So if our parameter is `-1.5` we get a loss = `2.25`. Since the direction of our slope is downward (negative), by changing its value to be a bit more positive, we get closer to achieving our goal of *minimizing our loss*

```
xt = tensor(-1.).requires_grad_(); xt
loss = f(xt)
plot_function(f, 'x', 'x**2')
plt.scatter(xt.detach().numpy(), loss.detach().numpy(), color='red')
print('Loss: ', loss.item())
```

And yes, our loss has improved! If the direction of our slope were upwards (positive), we would conversely want `x` to be smaller.

**BUT** now ... imagine having to figure all this out for a million parameters. Obviously, we wouldn't want to try doing this manually as we did before, and thanks to PyTorch, we don't have to :)

Remember that by utilizing the `requires_grad_()` function, we have told PyTorch to keep track of how to compute the gradients based on the other calculations we perform, like running it through our loss function above. Let's see what that looks like.

```
xt = tensor(-1.).requires_grad_();
print(xt)
loss = f(xt)
print(loss)
```

That `<PowBackward0>` is the gradient function it will use to calculate the gradients when needed. And when we need it, we call the `backward` method to do so.

```
loss.backward()
print(xt.grad)
```

And the calculated gradient is exactly what we expected, given that the derivative of `x**2` is `2x` ... `2*-1 = -2`.
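If you want to convince yourself without any calculus, you can estimate the slope numerically with the `(y_new - y_old) / (x_new - x_old)` idea from earlier. This is a plain-Python sketch of my own; the `slope` helper and the step size `h` are assumptions, not something PyTorch does internally.

```python
def f(x):
    return x ** 2

def slope(f, x, h=1e-4):
    # (y_new - y_old) / (x_new - x_old) over a tiny interval centered on x
    return (f(x + h) - f(x - h)) / (2 * h)

print(slope(f, -1.0))   # close to -2, matching xt.grad
print(slope(f, -1.5))   # close to -3
```

PyTorch gets the same answer exactly by applying the derivative rule for each operation it recorded, rather than by nudging the input.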

Again, the gradient tells us **the slope of our function**. Here we have a negative/downward slope, so at the very least we know that moving in that direction will get us closer to the minimum.

The question is now, **How far do we move in that direction?**

### Step 5: Change all the weights based on the gradient using a "Learning Rate"

The **learning rate** (or LR) is a number (usually a small number like 1e-3 or 0.1) that we multiply the gradient by to get a better parameter value. For a given parameter/weight `w`, the calculation looks like this:

`w -= w.grad * lr`

Notice that we subtract `w.grad * lr` because we want to move in the direction opposite to the gradient.

**Important:** We do this in a `with torch.no_grad()` block so that we don’t calculate the gradient for the gradient calculating operation
```
lr = 0.01
with torch.no_grad():
    xt -= xt.grad * lr
print('New value for xt: ', xt)
print('New loss: ', f(xt))
```

You can see the loss get smaller which is exactly what we want! "The magnitude of the gradient (i.e., the steepness of the slope) [tells] us how big a step to take."

The above operation is also called the **optimization step**.

See pp.156-157 for examples of what using a too small or too large LR might look like when training. This could help you troubleshoot things if yours looks wonky.
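To get a feel for it without the book in front of you, here is a small plain-Python sketch of my own that runs the same update on `x**2` with three learning rates. The specific rates and the `descend` helper are assumptions chosen to show the three behaviors.

```python
def descend(lr, steps=20, x=-1.5):
    """Run gradient descent on f(x) = x**2 and return the final loss."""
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2
        x -= grad * lr        # the optimization step
    return x ** 2

# starting loss is (-1.5)**2 = 2.25
for lr in (0.001, 0.1, 1.1):
    print(lr, descend(lr))
```

With the tiny rate the loss barely moves from 2.25, with 0.1 it drops close to 0, and with 1.1 every step overshoots the minimum so the loss ends up far larger than where it started.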

A **Dataset** contains tuples of independent and dependent variables

```
ds = L(enumerate(string.ascii_lowercase))
ds
```

A **DataLoader** receives a dataset and gives us back as many *mini-batches* as are necessary based on the *batch size* we specify

```
dl = DataLoader(ds, bs=6, shuffle=True)
list(dl)
```
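Conceptually, the batching part boils down to "shuffle, then slice". Here is a rough plain-Python sketch; `mini_batches` is a hypothetical helper of mine, and unlike the real `DataLoader` it doesn't collate items into tensors.

```python
import random
import string

def mini_batches(dataset, bs, shuffle=False, seed=None):
    """Hypothetical helper: yield lists of at most `bs` items."""
    items = list(dataset)
    if shuffle:
        random.Random(seed).shuffle(items)
    for i in range(0, len(items), bs):
        yield items[i:i + bs]

ds = list(enumerate(string.ascii_lowercase))   # 26 (index, letter) tuples
batches = list(mini_batches(ds, bs=6, shuffle=True, seed=0))
print([len(b) for b in batches])               # [6, 6, 6, 6, 2]
```

With 26 items and a batch size of 6 you get four full batches plus a final, smaller one holding the 2 leftovers.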

### Measuring distances

See pp.141-142. There are two main ways to measure distances.

**L1 norm** (or mean absolute difference): Take the mean of the absolute value of differences

`l1_loss = (tensor_a - tensor_b).abs().mean()`

**L2 norm** (or root mean squared error, RMSE): Take the square root of the mean of the square differences. The squaring of differences makes everything positive and the square root undoes the squaring.

**Important:**"... the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes)"

`l2_loss = ((tensor_a - tensor_b) ** 2).mean().sqrt()`
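To see the "penalizes bigger mistakes more heavily" part in numbers, here is a small plain-Python sketch with two made-up sets of errors that have the same total error, one spread out and one concentrated in a single big miss:

```python
import math

def l1(errors):
    # mean absolute difference
    return sum(abs(e) for e in errors) / len(errors)

def l2(errors):
    # root mean squared error
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

spread_out = [1.0, 1.0, 1.0, 1.0]   # four small mistakes
one_big    = [0.0, 0.0, 0.0, 4.0]   # one large mistake, same total error

print(l1(spread_out), l2(spread_out))   # 1.0 1.0
print(l1(one_big), l2(one_big))         # 1.0 2.0
```

The L1 norm can't tell the two cases apart, while the L2 norm doubles for the single big mistake.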

**nn.Linear**: Initializes its parameters and performs a linear operation. It contains both the weights and biases in a single class

```
lin1 = nn.Linear(28*28, 1)
# the trainable parameters
weights, bias = lin1.parameters()
print(weights.shape, bias.shape)
```

**nn.ReLU**: Allows us to add a non-linearity between linear layers. Simply put, it replaces every negative activation passed to it with a 0 and leaves positive ones unchanged.

Notice below that it has no trainable parameters!

```
plot_function(F.relu)
```

```
non_lin1 = nn.ReLU()
print(list(non_lin1.parameters()))
print('using nn.ReLU: ', non_lin1(tensor(-1)), non_lin1(tensor(4)))
print('using max()', tensor(-1).max(tensor(0.0)), tensor(4).max(tensor(0.0)))
```

**Why do you want to have non-linearities?**

Because "there's no point in just putting one linear layer directly after another one, because when we multiply things together and then add them up multiple times, that could be replaced by multiplying different things together and adding them up just once .... BUT if we put a non-linear [function] between them ... this is no longer true. Now each linear layer is somewhat decoupled from the other ones and can do its own useful work."
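You can check this claim with one-input, one-output "layers" in plain Python. This is a sketch of my own: the weights and biases are arbitrary numbers, and `linear` and `relu` are tiny stand-ins for `nn.Linear` and `nn.ReLU`.

```python
def linear(w, b):
    return lambda x: w * x + b

def relu(x):
    return max(x, 0.0)

lin_a = linear(2.0, 1.0)
lin_b = linear(-3.0, 0.5)

# Two linear functions in a row collapse into one:
#   w = w_b * w_a    and    b = w_b * b_a + b_b
collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

xs = [-2.0, -0.5, 0.0, 1.0, 3.0]
stacked = [lin_b(lin_a(x)) for x in xs]
single = [collapsed(x) for x in xs]
with_relu = [lin_b(relu(lin_a(x))) for x in xs]

print(stacked == single)      # True: the second layer added nothing new
print(with_relu == single)    # False: the ReLU breaks the collapse
```

Once the ReLU sits between them, no single `w` and `b` reproduces the output for every input, which is what lets the stacked layers represent something a single linear layer can't.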

These kinds of functions are also called **"activation functions"**, because they only operate on and produce activations ... there are no trainable parameters.

**nn.Sequential**: A module that is passed other modules (layers) and, when called, calls each of those layers in turn.

```
lin2 = nn.Linear(1, 10)
seq_model = nn.Sequential(lin1, non_lin1, lin2)
seq_params = list(seq_model.parameters())
print(len(seq_params))
for p in seq_params: print(p.shape)
```

Why `4`? Simple, remember that each `nn.Linear` above has two trainable parameters (the weights and bias), 2+2 = 4.
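The "calls each of those layers in turn" part can be checked by looping over the layers by hand. A minimal sketch, assuming a small made-up model and a random input batch (the layer sizes here are mine, not the ones used above):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
x = torch.randn(5, 4)          # hypothetical batch of 5 inputs

# Calling the model ...
out_model = model(x)

# ... is the same as calling each layer in turn
out_manual = x
for layer in model:
    out_manual = layer(out_manual)

print(torch.equal(out_model, out_manual))
# individual trainable numbers: (4*3 + 3) + (3*2 + 2) = 23
print(sum(p.numel() for p in model.parameters()))
```

Note the difference between the 4 parameter *tensors* (two weights, two biases) and the 23 individual numbers they hold in this toy model.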

### Summary

This chapter walks you through everything from creating a baseline model to a full-blown training loop in PyTorch. **Read it, and read it again and again!** (I do and have).

**Important Vocab/Concepts**

*Activations*: Numbers that are calculated by both linear and non-linear layers

*Parameters*: Randomly initialized parameters that can be trained.

*Neural Network*: A chain of linear and non-linear functions your data runs through to produce a result.

*Gradient*: "The derivative of the loss with respect to some parameter of the model"

*Backpropagation*: The computing of the gradients "of the loss with respect to all model parameters"

*Gradient Descent*: "Taking a step in the direction opposite to the gradients to make the model parameters a little bit better"

## Resources

- https://book.fast.ai - The book's website; it's updated regularly with new content and recommendations on everything from which GPUs to use to how to run things locally and in the cloud, etc.