- Negative Log-Likelihood & CrossEntropy Loss
- So why not use accuracy?
!pip install fastai
import torch
from torch.nn import functional as F
from fastai.vision.all import *
We've been doing multiclass classification since week one, and last week we learned how a neural network "learns" by evaluating its predictions with something called a "loss function."
So for multi-classification tasks, what is our loss function?
path = untar_data(URLs.PETS)/'images'

# In the Pets dataset, cat breeds have filenames that start with an uppercase letter
def is_cat(x): return x[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.loss_func
Downloading: "https://download.pytorch.org/models/resnet34-333f7ec4.pth" to /root/.cache/torch/checkpoints/resnet34-333f7ec4.pth
FlattenedLoss of CrossEntropyLoss()
Let's imagine a model whose objective is to predict the label of an example given five possible classes to choose from. Our predictions might look like this ...
preds = torch.randn(3, 5); preds
tensor([[-0.3139,  0.6737, -0.0143,  1.9929, -0.6949],
        [ 0.5285,  0.1311,  0.2628,  0.6450,  1.7745],
        [-1.7458,  2.0199, -0.1365,  1.4622, -0.0940]])
Because this is a supervised task, we know the actual labels of our three training examples above (e.g., the label of the first example is the first class, the label of the 2nd example the 4th class, and so forth)
targets = torch.tensor([0, 3, 4])
Step 1: Convert the predictions for each example into probabilities using softmax. For each example, this describes how confident the model is that the example belongs to each class
probs = F.softmax(preds, dim=1); probs
tensor([[0.0635, 0.1704, 0.0856, 0.6372, 0.0433],
        [0.1421, 0.0955, 0.1089, 0.1596, 0.4939],
        [0.0126, 0.5458, 0.0632, 0.3125, 0.0659]])
If we sum the probabilities for each example with probs.sum(dim=1), you'll see they add up to 1
tensor([1.0000, 1.0000, 1.0000])
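To see why the rows sum to 1, here is a minimal sketch of softmax done by hand for the first example: exponentiate each prediction, then divide by the sum of the exponentials. The names logits, manual_probs, and builtin_probs are just illustrative; the values are copied from the first row of preds above.

```python
import torch
from torch.nn import functional as F

# Raw predictions for the first example, copied from the first row of preds
logits = torch.tensor([-0.3139, 0.6737, -0.0143, 1.9929, -0.6949])

# Softmax by hand: exponentiate, then normalize so the values sum to 1
exps = torch.exp(logits)
manual_probs = exps / exps.sum()

# The built-in version should give the same numbers
builtin_probs = F.softmax(logits, dim=0)

print(manual_probs)
print(manual_probs.sum())
print(torch.allclose(manual_probs, builtin_probs))
```

Because every value is divided by the same total, the results are all positive and sum to 1, which is what lets us read them as probabilities.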
Step 2: Calculate the "negative log likelihood" for each example where
y = the probability of the correct class
loss = -log(y)
We can do this in one line using something called tensor/array indexing
example_idxs = range(len(preds)); example_idxs
correct_class_probs = probs[example_idxs, targets]; correct_class_probs
tensor([0.0635, 0.1596, 0.0659])
nll = -torch.log(correct_class_probs); nll
tensor([2.7574, 1.8349, 2.7194])
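If the indexing trick looks a bit magical, here is a small sketch showing that it is just a compact way of picking one probability per row. It compares an explicit loop, the indexing approach, and torch.gather (an alternative not used above). The probs values are copied from the softmax output, and the variable names are only for illustration.

```python
import torch

# Probabilities copied from the softmax output above
probs = torch.tensor([[0.0635, 0.1704, 0.0856, 0.6372, 0.0433],
                      [0.1421, 0.0955, 0.1089, 0.1596, 0.4939],
                      [0.0126, 0.5458, 0.0632, 0.3125, 0.0659]])
targets = torch.tensor([0, 3, 4])

# 1) Explicit loop: for row i, take the column given by that row's target
picked_loop = torch.stack([probs[i, t] for i, t in enumerate(targets)])

# 2) Indexing with a list of row indices and a list of column indices
picked_index = probs[torch.arange(len(probs)), targets]

# 3) torch.gather along the class dimension
picked_gather = probs.gather(1, targets.unsqueeze(1)).squeeze(1)

print(picked_loop, picked_index, picked_gather)
```

All three pick out the same three values, one per example.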
Step 3: The loss is the mean of the individual NLLs
... or using PyTorch's F.nll_loss on the log of the probabilities
... or we can do this all at once using PyTorch's cross entropy loss
As you can see, cross entropy loss simply combines the log_softmax operation with the negative log-likelihood loss
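The three routes to the same number can be sketched as follows. This is a minimal example assuming the preds and targets values from above (hard-coded here so it runs on its own); the names loss_manual, loss_nll, and loss_ce are just for illustration.

```python
import torch
from torch.nn import functional as F

# Predictions and targets copied from the example above
preds = torch.tensor([[-0.3139,  0.6737, -0.0143,  1.9929, -0.6949],
                      [ 0.5285,  0.1311,  0.2628,  0.6450,  1.7745],
                      [-1.7458,  2.0199, -0.1365,  1.4622, -0.0940]])
targets = torch.tensor([0, 3, 4])

# Log of the softmax probabilities
log_probs = F.log_softmax(preds, dim=1)

# 1) By hand: pick the log-probability of the correct class, negate, average
loss_manual = -log_probs[torch.arange(len(preds)), targets].mean()

# 2) PyTorch's negative log-likelihood loss, which expects log-probabilities
loss_nll = F.nll_loss(log_probs, targets)

# 3) Cross entropy, which takes the raw predictions and does both steps
loss_ce = F.cross_entropy(preds, targets)

# All three should agree (roughly 2.44 for these values)
print(loss_manual, loss_nll, loss_ce)
```

Note that F.nll_loss does not take the log for you, so it needs log-probabilities as input, while F.cross_entropy expects the raw, un-softmaxed predictions.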
def plot_function(f, tx=None, ty=None, title=None, min=-2, max=2, figsize=(6,4)):
    # Recent PyTorch versions require an explicit number of steps
    x = torch.linspace(min, max, 100)
    fig,ax = plt.subplots(figsize=figsize)
    ax.plot(x, f(x))
    if tx is not None: ax.set_xlabel(tx)
    if ty is not None: ax.set_ylabel(ty)
    if title is not None: ax.set_title(title)

def f(x): return -torch.log(x)
plot_function(f, 'x (prob correct class)', '-log(x)', title='Negative Log-Likelihood', min=0, max=1)
NLL loss will be higher the smaller the probability of the correct class
What does this all mean? The less confident the model is in the correct class, the higher the loss. It will:
1) Penalize correct predictions that it isn't confident about more than correct predictions it is very confident about.
2) And vice versa, it will penalize incorrect predictions it is very confident about more than incorrect predictions it isn't very confident about.
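To put some numbers on the two points above, here is a quick sketch using made-up probabilities for the correct class. With 0.9 and 0.6 the prediction is right (the correct class has more than half the probability); with 0.4 and 0.1 we assume some other class got more probability, so the prediction is wrong.

```python
import torch

# Hypothetical probabilities the model assigned to the correct class
p_correct = torch.tensor([0.9, 0.6, 0.4, 0.1])

# Negative log-likelihood for each case
losses = -torch.log(p_correct)

for p, l in zip(p_correct.tolist(), losses.tolist()):
    print(f"p(correct class) = {p:.1f} -> loss = {l:.3f}")
```

The confident correct prediction (0.9) gets a loss of about 0.105, while the hesitant correct one (0.6) gets about 0.511. On the wrong side, putting only 0.1 on the correct class costs about 2.303, much more than the 0.916 for 0.4.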
Why is this better than accuracy?
Because accuracy simply tells you whether you got it right or wrong (a 1 or a 0), whereas NLL incorporates the confidence as well. That information gives your model much better insight into how well it is really doing, in a single number (from infinity down to 0), resulting in gradients that the model can actually use!
Remember that a loss function returns a number. That's it!
Or the more technical explanation from fastbook:
"The gradient of a function is its slope, or its steepness, which can be defined as rise over run -- that is, how much the value of function goes up or down, divided by how much you changed the input. We can write this in maths: (y_new-y_old) / (x_new-x_old). Specifically, it is defined when x_new is very similar to x_old, meaning that their difference is very small. But accuracy only changes at all when a prediction changes from a 3 to a 7, or vice versa. So the problem is that a small change in weights from x_old to x_new isn't likely to cause any prediction to change, so (y_new - y_old) will be zero. In other words, the gradient is zero almost everywhere. As a result, a very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss function. When we use accuracy as a loss function, most of the time our gradients will actually be zero, and the model will not be able to learn from that number. That is not much use at all!" 1
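Here is a small sketch of that idea, using a made-up linear model rather than anything from the lesson. The inputs x, labels y, weights w, and step size lr are all hypothetical values chosen for illustration. We nudge the weights slightly and compare what happens to accuracy and to cross entropy loss.

```python
import torch
from torch.nn import functional as F

# Hypothetical data: 3 examples, 2 features, 3 classes
x = torch.tensor([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
y = torch.tensor([0, 1, 1])
w = torch.tensor([[0.2, -0.4, 0.1],
                  [0.7, 0.3, -0.5]], requires_grad=True)

def accuracy(weights):
    # Fraction of examples where the highest-scoring class is the target
    return ((x @ weights).argmax(dim=1) == y).float().mean().item()

def ce_loss(weights):
    return F.cross_entropy(x @ weights, y)

# The loss gives us a usable (non-zero) gradient
loss = ce_loss(w)
loss.backward()
grad_is_nonzero = bool(w.grad.abs().sum() > 0)

# Take one tiny step against the gradient
lr = 1e-3
with torch.no_grad():
    w_new = w - lr * w.grad
    acc_before, acc_after = accuracy(w), accuracy(w_new)
    loss_before, loss_after = loss.item(), ce_loss(w_new).item()

print(f"accuracy: {acc_before:.4f} -> {acc_after:.4f}")
print(f"loss:     {loss_before:.6f} -> {loss_after:.6f}")
```

With these values the tiny step leaves accuracy unchanged, because no prediction flips, while the loss moves a little. That small movement is exactly the signal the model needs to keep improving.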
So to summarize,
accuracy is a great metric for human intuition but not so much for your model. If you're doing multi-classification, your model will do much better with something that provides gradients it can actually use to improve its parameters, and that something is