from fastai.vision.all import *
path = untar_data(URLs.IMAGENETTE)
A Journey Through Fastbook (AJTFB) - Chapter 7: Advanced techniques for training image classification models
Other posts in this series:
A Journey Through Fastbook (AJTFB) - Chapter 1
A Journey Through Fastbook (AJTFB) - Chapter 2
A Journey Through Fastbook (AJTFB) - Chapter 3
A Journey Through Fastbook (AJTFB) - Chapter 4
A Journey Through Fastbook (AJTFB) - Chapter 5
A Journey Through Fastbook (AJTFB) - Chapter 6a
A Journey Through Fastbook (AJTFB) - Chapter 6b
A Journey Through Fastbook (AJTFB) - Chapter 8
A Journey Through Fastbook (AJTFB) - Chapter 9
Imagenette
Imagenette is a dataset that “contains a subset of 10 very different categories from the original ImageNet dataset, making for quicker training when we want to experiment”
Start with small datasets and models for initial experimentation and prototyping. Both will allow you to iterate over your experiments more quickly and verify your code works from beginning to end without having to wait hours for your training/validation loops to finish. “You should aim to have an iteration speed of no more than a couple of minutes …. If it’s taking longer to do an experiment, think about how you could cut down your dataset, or simplify your model, to improve your experimentation speed.”
Tip 1: Use the “presizing trick”
See chapter 5, pp.189-191. The idea here is to first crop the image to a size larger than the final training size so that further augmentations can be applied without creating empty space (via item_tfms), with those augmentations applied to batches of images on the GPU for speed (via batch_tfms).
On the training set, the initial crop area is chosen randomly, with its size set to cover the entire width or height of the image; the random crop and other augmentations are then done on the GPU.
On the validation set, a center square is always chosen in the first step, and only a resize to the final needed size is applied on the GPU.
dblock = DataBlock(
    blocks=(ImageBlock(), CategoryBlock()),
    get_items=get_image_files,
    get_y=parent_label,
    item_tfms=Resize(460),
    batch_tfms=aug_transforms(size=224, min_scale=0.75)
)

dls = dblock.dataloaders(path, bs=64)
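As a quick sanity check (not required for training), you can display a few augmented training images to confirm the presizing pipeline behaves as expected:

# display a few augmented images from the training set
dls.show_batch(max_n=4, nrows=1)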
Tip 2: Create a “baseline”
Note that we are not using a pretrained model here; we are training one from scratch.
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.654316 | 4.794844 | 0.320015 | 05:21 |
1 | 1.217274 | 1.211676 | 0.612024 | 05:19 |
2 | 0.964628 | 1.417025 | 0.617252 | 05:06 |
3 | 0.736836 | 0.677910 | 0.787155 | 05:12 |
4 | 0.596578 | 0.539180 | 0.833831 | 05:05 |
Tip 3: Normalize your data
“When training a model, it helps if your input data is normalized - that is, has a mean of 0 and a standard deviation of 1.”
For images, we do this per channel (dimension 1 of a batch tensor), averaging over all other axes. In fastai, we can use the Normalize transform to apply this a batch at a time.
If we don’t tell this transform what mean/std to use, “fastai will automatically calculate them from a single batch of your data”. If we are using ImageNet images, we can use imagenet_stats instead of calculating the mean/std ourselves.
# an example of the per-channel statistics calculated on a batch of images
# (because we aren't using normalization yet, you'll see the mean and standard
# deviation are not very close to 0 and 1 respectively)
x, y = dls.one_batch()
x.mean(dim=[0,2,3]), x.std(dim=[0,2,3])
(TensorImage([0.4518, 0.4554, 0.4344], device='cuda:0'),
TensorImage([0.2868, 0.2783, 0.2998], device='cuda:0'))
def get_dls(batch_size, image_size):
    dblock = DataBlock(
        blocks=(ImageBlock(), CategoryBlock()),
        get_items=get_image_files,
        get_y=parent_label,
        item_tfms=Resize(460),
        batch_tfms=[*aug_transforms(size=image_size, min_scale=0.75),
                    Normalize.from_stats(*imagenet_stats)]
    )
    dls = dblock.dataloaders(path, bs=batch_size)
    return dls
dls = get_dls(64, 224)
# an example of the per-channel statistics calculated on a batch of images
# (because we are using normalization now, the mean and standard deviation are
# very close to 0 and 1 respectively)
x, y = dls.one_batch()
print(x.mean(dim=[0,2,3]), x.std(dim=[0,2,3]))
# does this normalization improve our model? Let's see ...
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
TensorImage([-0.0816, -0.0114, 0.0695], device='cuda:0') TensorImage([1.1806, 1.1762, 1.2825], device='cuda:0')
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.701530 | 1.856552 | 0.468633 | 05:07 |
1 | 1.280709 | 1.384676 | 0.573562 | 05:05 |
2 | 1.007325 | 1.073023 | 0.656460 | 05:06 |
3 | 0.762624 | 0.666320 | 0.784541 | 05:06 |
4 | 0.606407 | 0.573812 | 0.823376 | 05:02 |
“… when you distribute a model, you need to also distribute the statistics used for normalization, since anyone using it for inference or transfer learning will need to use the same statistics …. If you’re using a model that someone else has trained, make sure you find out what normalization statistics they used and match them.”
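As a hedged sketch of how you might recover those statistics from a fastai DataLoaders (this assumes the batch transforms live in the after_batch pipeline's fs list and that a single Normalize transform is present):

# pull the Normalize transform out of the batch pipeline so its mean/std can be
# saved and shipped alongside the exported model weights
norm = next(tfm for tfm in dls.after_batch.fs if isinstance(tfm, Normalize))
print(norm.mean, norm.std)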
Tip 4: Use “progressive resizing”
“… start training using small images, and end training using large images. Spending most of the epochs training with small images helps training complete faster.”
Think of this as a form of transfer learning:
“… the kinds of features that are learned by convolutional neural networks are not in any way specific to the size of the image …. So, when we change the image size in the middle of training, it doesn’t mean that we have to find totally different parameters for our model.”
“Progressive resizing has an additional benefit: it is another form of data augmentation. Therefore, you should expect to see better generalization”
dls = get_dls(128, 128)
learn = Learner(dls, xresnet50(), loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)
# simply replace `Learner.dls` with the new `DataLoaders` and continue training
learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.848093 | 1.582196 | 0.526512 | 03:02 |
1 | 1.297791 | 1.205059 | 0.616878 | 03:01 |
2 | 0.985249 | 1.022758 | 0.690067 | 02:55 |
3 | 0.762485 | 0.688779 | 0.787155 | 02:53 |
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.845315 | 1.171858 | 0.650112 | 05:08 |
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.635858 | 0.834369 | 0.751307 | 05:06 |
1 | 0.664283 | 0.665261 | 0.796117 | 05:10 |
2 | 0.585543 | 0.634785 | 0.796490 | 05:11 |
3 | 0.478250 | 0.495538 | 0.840926 | 05:02 |
4 | 0.429075 | 0.448893 | 0.855489 | 05:08 |
To use the DataLoaders with bigger images, we simply assign them to Learner.dls.
Bigger images will require smaller batch sizes. Also, you will get no benefit from using images sized larger than your images on disk!
“… for transfer learning, progressive resizing may actually hurt performance …. This is most likely to happen if your pretrained model was quite similar to your transfer learning task and dataset and was trained on similar-sized images, so the weights don’t need to be changed much. In that case, training on smaller images may damage the pretrained weights.”
“On the other hand, if the transfer learning task is going to use images that are of different sizes, shapes, or styles than those used in the pretraining task, progressive resizing will probably help”
If you are unsure, try it!
Tip 5: Use Test Time Augmentation (TTA)
TTA is a form of data augmentation applied to the validation set that adds augmented versions of the images. “During inference or validation, creating multiple versions of each image using data augmentation, and then taking the average or maximum of the predictions for each augmented version of the image.”
“… select a number of areas to crop from the original rectangular image, pass each of them through our model, and take the maximum or average of the predictions. In fact, we can do this not just for different crops, but for different values across all of our test time augmentation parameters”
What is the problem TTA addresses and why use it?
“When we use random cropping, fastai will automatically use center-cropping for the validation set”, which can be problematic, for example, in multi-label tasks where “sometimes there are small objects toward the edges of an image” that might be cropped out entirely, or where features on the fringe of the image are required for the classification task.
# you can pass any `DataLoaders` to `tta()` (by default it uses your validation `DataLoader`)
preds, targs = learn.tta()
accuracy(preds, targs).item()
0.861090362071991
“TTA gives us a good boost in performance, with no additional training required.”
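As a usage sketch: tta() also accepts a specific DataLoader, and can take the maximum rather than the average of the augmented predictions. The n and use_max keyword arguments below are real tta() parameters, but test_files is a hypothetical list of image paths standing in for your own test set:

# run TTA over a hypothetical test set, taking the max of the predictions
test_dl = learn.dls.test_dl(test_files)
preds, _ = learn.tta(dl=test_dl, n=4, use_max=True)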
Tip 6: Use MixUp
“Mixup … is a powerful data augmentation technique that can provide dramatically higher accuracy, especially when you don’t have much data and don’t have a pretrained model that was trained on data similar to your dataset”
It is a dataset-independent form of data augmentation = it can be applied without the domain knowledge of the dataset required to configure other forms of data augmentation (e.g., whether to flip and to what degree, etc.)
How does Mixup work?
1. Select another random image
2. Pick a weight at random
3. Take a weighted average of the selected image with your image = your independent variable
4. Take a weighted average of the selected image’s labels with your image’s labels = your dependent variable
5. Use #3 to predict #4
In pseudocode:
img2, targ2 = dataset[randint(0, len(dataset))]
t = random_float(0.5, 1.0)
new_img = t * img1 + (1-t) * img2
new_targ = t * targ1 + (1-t) * targ2
“For this to work, our targets need to be one-hot encoded”
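To make the pseudocode concrete, here is a minimal, runnable PyTorch sketch of the idea. Note that mixup_batch is a hypothetical helper written for illustration, not fastai's MixUp callback (which applies the same trick inside the training loop):

import torch
import torch.nn.functional as F

def mixup_batch(x, y, n_classes, alpha=0.4):
    "Mix each image (and its one-hot target) with another random item in the batch."
    # one weight per item, drawn from a Beta distribution as in the Mixup paper
    t = torch.distributions.Beta(alpha, alpha).sample((x.size(0),))
    t = torch.max(t, 1 - t)      # keep t >= 0.5 so the original image dominates
    shuffle = torch.randperm(x.size(0))
    x2, y2 = x[shuffle], y[shuffle]
    t_img = t.view(-1, 1, 1, 1)  # broadcast the weight over channel/height/width
    new_x = t_img * x + (1 - t_img) * x2
    # the targets must be one-hot encoded before they can be averaged
    y1h = F.one_hot(y, n_classes).float()
    y2h = F.one_hot(y2, n_classes).float()
    new_y = t.view(-1, 1) * y1h + (1 - t.view(-1, 1)) * y2h
    return new_x, new_y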
“Mixup requires far more epochs”
“One of the reasons Mixup is so exciting is that it can be applied to types of data other than photos. In fact, some people have even shown good results by using Mixup on activations inside their models, not just on inputs - this allows Mixup to be used for NLP and other data types too.”
See pp.247-249 for a detailed example of how Mixup works and is used in fastai
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy, cbs=MixUp)
learn.fit_one_cycle(5, 3e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 2.183965 | 2.523945 | 0.320762 | 02:57 |
1 | 1.729223 | 1.974045 | 0.461538 | 03:01 |
2 | 1.479313 | 1.131723 | 0.630695 | 03:07 |
3 | 1.294975 | 0.872954 | 0.724421 | 03:08 |
4 | 1.183486 | 0.731506 | 0.776699 | 03:06 |
“… it’s going to be hard to train, because … the model has to predict two labels per image rather than just one …. Overfitting seems less likely to be a problem.”
Tip 7: Use “Label Smoothing”
“In the theoretical expression of loss, in classification problems, our targets are one hot encoded …. That means the model is trained to return 0 for all categories but one, for which it is trained to return 1…. This encourages overfitting and gives you at inference time a model that is not going to give meaningful probabilities: it will always say 1 for the predicted category even if it’s not too sure, just because it was trained that way.”
“This can become very harmful if your data is not perfectly labeled.”
“In general, your data will never be perfect. Even if the labels were manually produced by humans, they could make mistakes, or have differences of opinions on images that are harder to label”
What is the solution to this?
“… we could replace all our 1s with a number a bit less than 1, and our 0s with a number a bit more than 0, and then train. This is” = label smoothing. “By encouraging your model to be less confident, label smoothing will make your training more robust, even if there is mislabeled data. The result will be a model that generalizes better at inference.”
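As a minimal sketch of what those smoothed targets look like (smooth_targets is a hypothetical helper written for illustration; fastai's LabelSmoothingCrossEntropy folds the same idea directly into the loss): with N classes and a smoothing amount eps, the correct class gets 1 - eps + eps/N and every other class gets eps/N.

import torch

def smooth_targets(y, n_classes, eps=0.1):
    "Replace one-hot targets with label-smoothed targets."
    smoothed = torch.full((y.size(0), n_classes), eps / n_classes)
    smoothed[torch.arange(y.size(0)), y] += 1 - eps
    return smoothed

# with 10 classes and eps=0.1: the true class gets 0.91, the others 0.01 each
print(smooth_targets(torch.tensor([2]), n_classes=10))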
See pp.249-251 for a detailed explanation and example of how label smoothing operates. To use it, we just have to change our loss function.
model = xresnet50()
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 2.757509 | 3.500999 | 0.253921 | 03:06 |
1 | 2.257501 | 2.817133 | 0.440627 | 03:00 |
2 | 1.968483 | 2.138581 | 0.617625 | 02:59 |
3 | 1.781833 | 1.700527 | 0.772591 | 03:05 |
4 | 1.648491 | 1.632251 | 0.798357 | 03:01 |
“As with Mixup, you won’t generally see significant improvements from label smoothing until you train more epochs.”
Resources
https://book.fast.ai - The book’s website; it’s updated regularly with new content and recommendations on everything from which GPUs to use, to how to run things locally and in the cloud, etc.
Bag of Tricks for Image Classification with Convolutional Neural Networks discusses a variety of techniques you can use with CNNs
How to Train Your ResNet 8: Bag of Tricks discusses a variety of techniques you can use to train ResNets.
IceVision is a great resource for all things computer vision and a fastai-friendly library. You may want to follow these Twitter accounts as well: @ai_fast_track and @Fra_Pochetti (creator of IceVision).