from fastai.vision.all import *
path = untar_data(URLs.IMAGENETTE)
A Journey Through Fastbook (AJTFB) - Chapter 7: Advanced techniques for training image classification models
Other posts in this series:
A Journey Through Fastbook (AJTFB) - Chapter 1
A Journey Through Fastbook (AJTFB) - Chapter 2
A Journey Through Fastbook (AJTFB) - Chapter 3
A Journey Through Fastbook (AJTFB) - Chapter 4
A Journey Through Fastbook (AJTFB) - Chapter 5
A Journey Through Fastbook (AJTFB) - Chapter 6a
A Journey Through Fastbook (AJTFB) - Chapter 6b
A Journey Through Fastbook (AJTFB) - Chapter 8
A Journey Through Fastbook (AJTFB) - Chapter 9
Imagenette
Imagenette is a dataset that “contains a subset of 10 very different categories from the original ImageNet dataset, making for quicker training when we want to experiment”
Start with small datasets and models for initial experimentation and prototyping. Both will allow you to iterate over your experiments more quickly and verify your code works from beginning to end without having to wait hours for your training/validation loops to finish. “You should aim to have an iteration speed of no more than a couple of minutes …. If it’s taking longer to do an experiment, think about how you could cut down your dataset, or simplify your model, to improve your experimentation speed.”
Tip 1: Use the “presizing trick”
See chapter 5, pp.189-191. The idea here is to first crop the image to a size larger than the final training size so that further augmentations can be applied without creating empty space (via item_tfms), with those augmentations applied to batches of images on the GPU for speed (via batch_tfms).
On the training set, the initial crop area is chosen randomly, with its size set to cover the entire width or height of the image; the random crop and other augmentations are then done on the GPU.
On the validation set, a center square is always chosen in the first step, and only a resize to the final needed size is applied on the GPU.
dblock = DataBlock(
    blocks=(ImageBlock(), CategoryBlock()),
    get_items=get_image_files,
    get_y=parent_label,
    item_tfms=Resize(460),
    batch_tfms=aug_transforms(size=224, min_scale=0.75)
)

dls = dblock.dataloaders(path, bs=64)
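As a quick sanity check (not required for training), you can display a few augmented training images to confirm the presizing pipeline behaves as expected:

# display a few augmented images from the training set
dls.show_batch(max_n=4, nrows=1)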
Tip 2: Create a “baseline”
Note that we are not using a pretrained model here; we are training one from scratch.
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.654316 | 4.794844 | 0.320015 | 05:21 |
1 | 1.217274 | 1.211676 | 0.612024 | 05:19 |
2 | 0.964628 | 1.417025 | 0.617252 | 05:06 |
3 | 0.736836 | 0.677910 | 0.787155 | 05:12 |
4 | 0.596578 | 0.539180 | 0.833831 | 05:05 |
Tip 3: Normalize your data
“When training a model, it helps if your input data is normalized - that is, has a mean of 0 and a standard deviation of 1.”
For images, we do this per channel (dimension 1 of a batch tensor), averaging over all other axes. In fastai, we can use the Normalize transform to apply this a batch at a time.
If we don’t tell this transform what mean/std to use, “fastai will automatically calculate them from a single batch of your data”. If we are using ImageNet images, we can use imagenet_stats instead of calculating the mean/std ourselves.
# an example of the per-channel statistics calculated on a batch of images
# (because we aren't using normalization yet, you'll see the mean and standard
# deviation are not very close to 0 and 1 respectively)
x, y = dls.one_batch()
x.mean(dim=[0,2,3]), x.std(dim=[0,2,3])
(TensorImage([0.4518, 0.4554, 0.4344], device='cuda:0'),
TensorImage([0.2868, 0.2783, 0.2998], device='cuda:0'))
def get_dls(batch_size, image_size):
    dblock = DataBlock(
        blocks=(ImageBlock(), CategoryBlock()),
        get_items=get_image_files,
        get_y=parent_label,
        item_tfms=Resize(460),
        batch_tfms=[*aug_transforms(size=image_size, min_scale=0.75),
                    Normalize.from_stats(*imagenet_stats)]
    )
    dls = dblock.dataloaders(path, bs=batch_size)
    return dls
dls = get_dls(64, 224)
# an example of the per-channel statistics calculated on a batch of images
# (because we are using normalization now, the mean and standard deviation are
# very close to 0 and 1 respectively)
x, y = dls.one_batch()
print(x.mean(dim=[0,2,3]), x.std(dim=[0,2,3]))
# does this normalization improve our model? Let's see ...
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
TensorImage([-0.0816, -0.0114, 0.0695], device='cuda:0') TensorImage([1.1806, 1.1762, 1.2825], device='cuda:0')
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.701530 | 1.856552 | 0.468633 | 05:07 |
1 | 1.280709 | 1.384676 | 0.573562 | 05:05 |
2 | 1.007325 | 1.073023 | 0.656460 | 05:06 |
3 | 0.762624 | 0.666320 | 0.784541 | 05:06 |
4 | 0.606407 | 0.573812 | 0.823376 | 05:02 |
“… when you distribute a model, you need to also distribute the statistics used for normalization, since anyone using it for inference or transfer learning will need to use the same statistics …. If you’re using a model that someone else has trained, make sure you find out what normalization statistics they used and match them.”
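As a hedged sketch of how you might recover those statistics from a fastai DataLoaders (this assumes the batch transforms live in the after_batch pipeline's fs list and that a single Normalize transform is present):

# pull the Normalize transform out of the batch pipeline so its mean/std can be
# saved and shipped alongside the exported model weights
norm = next(tfm for tfm in dls.after_batch.fs if isinstance(tfm, Normalize))
print(norm.mean, norm.std)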
Tip 4: Use “progressive resizing”
“… start training using small images, and end training using large images. Spending most of the epochs training with small images helps training complete faster.”
Think of this as a form of transfer learning:
“… the kinds of features that are learned by convolutional neural networks are not in any way specific to the size of the image …. So, when we change the image size in the middle of training, it doesn’t mean that we have to find totally different parameters for our model.”
“Progressive resizing has an additional benefit: it is another form of data augmentation. Therefore, you should expect to see better generalization”
dls = get_dls(128, 128)
learn = Learner(dls, xresnet50(), loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)
# simply replace `Learner.dls` with the new `DataLoaders` and continue training
learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.848093 | 1.582196 | 0.526512 | 03:02 |
1 | 1.297791 | 1.205059 | 0.616878 | 03:01 |
2 | 0.985249 | 1.022758 | 0.690067 | 02:55 |
3 | 0.762485 | 0.688779 | 0.787155 | 02:53 |
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.845315 | 1.171858 | 0.650112 | 05:08 |
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.635858 | 0.834369 | 0.751307 | 05:06 |
1 | 0.664283 | 0.665261 | 0.796117 | 05:10 |
2 | 0.585543 | 0.634785 | 0.796490 | 05:11 |
3 | 0.478250 | 0.495538 | 0.840926 | 05:02 |
4 | 0.429075 | 0.448893 | 0.855489 | 05:08 |
To use the DataLoaders with bigger images, we simply assign them to Learner.dls.
Bigger images will require smaller batch sizes. Also, you will get no benefit from using images sized larger than your images on disk!
“… for transfer learning, progressive resizing may actually hurt performance …. This is most likely to happen if your pretrained model was quite similar to your transfer learning task and dataset and was trained on similar-sized images, so the weights don’t need to be changed much. In that case, training on smaller images may damage the pretrained weights.”
“On the other hand, if the transfer learning task is going to use images that are of different sizes, shapes, or styles than those used in the pretraining task, progressive resizing will probably help”
If you are unsure, try it!
Tip 5: Use Test Time Augmentation (TTA)
TTA is a form of data augmentation applied to the validation set that adds augmented versions of the images. “During inference or validation, creating multiple versions of each image using data augmentation, and then taking the average or maximum of the predictions for each augmented version of the image.”
“… select a number of areas to crop from the original rectangular image, pass each of them through our model, and take the maximum or average of the predictions. In fact, we can do this not just for different crops, but for different values across all of our test time augmentation parameters”
What is the problem TTA addresses and why use it?
“When we use random cropping, fastai will automatically use center-cropping for the validation set”, which can be problematic, for example, in multi-label tasks where “sometimes there are small objects toward the edges of an image” that might be cropped out entirely, or where features on the fringe of the image are required for the classification task.
# you can pass any `DataLoaders` to `tta()` (by default it uses your validation `DataLoader`)
preds, targs = learn.tta()
accuracy(preds, targs).item()
0.861090362071991
“TTA gives us a good boost in performance, with no additional training required.”
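As a usage sketch: tta() also accepts a specific DataLoader, and can take the maximum rather than the average of the augmented predictions. The n and use_max keyword arguments below are real tta() parameters, but test_files is a hypothetical list of image paths standing in for your own test set:

# run TTA over a hypothetical test set, taking the max of the predictions
test_dl = learn.dls.test_dl(test_files)
preds, _ = learn.tta(dl=test_dl, n=4, use_max=True)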
Tip 6: Use MixUp
“Mixup … is a powerful data augmentation technique that can provide dramatically higher accuracy, especially when you don’t have much data and don’t have a pretrained model that was trained on data similar to your dataset”
It is a dataset-independent form of data augmentation = it can be applied without the domain knowledge of the dataset required to configure other forms of data augmentation (e.g., whether to flip and to what degree, etc.)
How does Mixup work?
1. Select another random image
2. Pick a weight at random
3. Take a weighted average of the selected image with your image = your independent variable
4. Take a weighted average of the selected image’s labels with your image’s labels = your dependent variable
5. Use #3 to predict #4
In pseudocode:
img2, targ2 = dataset[randint(0, len(dataset))]
t = random_float(0.5, 1.0)
new_img = t * img1 + (1-t) * img2
new_targ = t * targ1 + (1-t) * targ2
“For this to work, our targets need to be one-hot encoded”
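To make the pseudocode concrete, here is a minimal, runnable PyTorch sketch of the idea. Note that mixup_batch is a hypothetical helper written for illustration, not fastai's MixUp callback (which applies the same trick inside the training loop):

import torch
import torch.nn.functional as F

def mixup_batch(x, y, n_classes, alpha=0.4):
    "Mix each image (and its one-hot target) with another random item in the batch."
    # one weight per item, drawn from a Beta distribution as in the Mixup paper
    t = torch.distributions.Beta(alpha, alpha).sample((x.size(0),))
    t = torch.max(t, 1 - t)      # keep t >= 0.5 so the original image dominates
    shuffle = torch.randperm(x.size(0))
    x2, y2 = x[shuffle], y[shuffle]
    t_img = t.view(-1, 1, 1, 1)  # broadcast the weight over channel/height/width
    new_x = t_img * x + (1 - t_img) * x2
    # the targets must be one-hot encoded before they can be averaged
    y1h = F.one_hot(y, n_classes).float()
    y2h = F.one_hot(y2, n_classes).float()
    new_y = t.view(-1, 1) * y1h + (1 - t.view(-1, 1)) * y2h
    return new_x, new_y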
“Mixup requires far more epochs”
“One of the reasons Mixup is so exciting is that it can be applied to types of data other than photos. In fact, some people have even shown good results by using Mixup on activations inside their models, not just on inputs - this allows Mixup to be used for NLP and other data types too.”
See pp.247-249 for a detailed example of how Mixup works and is used in fastai
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy, cbs=MixUp)
learn.fit_one_cycle(5, 3e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 2.183965 | 2.523945 | 0.320762 | 02:57 |
1 | 1.729223 | 1.974045 | 0.461538 | 03:01 |
2 | 1.479313 | 1.131723 | 0.630695 | 03:07 |
3 | 1.294975 | 0.872954 | 0.724421 | 03:08 |
4 | 1.183486 | 0.731506 | 0.776699 | 03:06 |
“… it’s going to be hard to train, because … the model has to predict two labels per image rather than just one …. Overfitting seems less likely to be a problem.”
Tip 7: Use “Label Smoothing”
“In the theoretical expression of loss, in classification problems, our targets are one hot encoded …. That means the model is trained to return 0 for all categories but one, for which it is trained to return 1…. This encourages overfitting and gives you at inference time a model that is not going to give meaningful probabilities: it will always say 1 for the predicted category even if it’s not too sure, just because it was trained that way.”
“This can become very harmful if your data is not perfectly labeled.”
“In general, your data will never be perfect. Even if the labels were manually produced by humans, they could make mistakes, or have differences of opinions on images that are harder to label”
What is the solution to this?
“… we could replace all our 1s with a number a bit less than 1, and our 0s with a number a bit more than 0, and then train. This is” = label smoothing. “By encouraging your model to be less confident, label smoothing will make your training more robust, even if there is mislabeled data. The result will be a model that generalizes better at inference.”
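As a minimal sketch of what those smoothed targets look like (smooth_targets is a hypothetical helper written for illustration; fastai's LabelSmoothingCrossEntropy folds the same idea directly into the loss): with N classes and a smoothing amount eps, the correct class gets 1 - eps + eps/N and every other class gets eps/N.

import torch

def smooth_targets(y, n_classes, eps=0.1):
    "Replace one-hot targets with label-smoothed targets."
    smoothed = torch.full((y.size(0), n_classes), eps / n_classes)
    smoothed[torch.arange(y.size(0)), y] += 1 - eps
    return smoothed

# with 10 classes and eps=0.1: the true class gets 0.91, the others 0.01 each
print(smooth_targets(torch.tensor([2]), n_classes=10))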
See pp.249-251 for a detailed explanation and example of how label smoothing operates. To use it, we just have to change our loss function.
model = xresnet50()
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 2.757509 | 3.500999 | 0.253921 | 03:06 |
1 | 2.257501 | 2.817133 | 0.440627 | 03:00 |
2 | 1.968483 | 2.138581 | 0.617625 | 02:59 |
3 | 1.781833 | 1.700527 | 0.772591 | 03:05 |
4 | 1.648491 | 1.632251 | 0.798357 | 03:01 |
“As with Mixup, you won’t generally see significant improvements from label smoothing until you train more epochs.”
Resources
https://book.fast.ai - The book’s website; it’s updated regularly with new content and recommendations on everything from which GPUs to use, to how to run things locally and in the cloud, etc.
Bag of Tricks for Image Classification with Convolutional Neural Networks discusses a variety of techniques you can use with CNNs
How to Train Your ResNet 8: Bag of Tricks discusses a variety of techniques you can use to train ResNets.
IceVision is a great resource for all things computer vision and a fastai-friendly library. You may want to follow these Twitter accounts as well: @ai_fast_track and @Fra_Pochetti (creator of IceVision).