A Journey Through Fastbook (AJTFB) - Chapter 7: Advanced techniques for training image classification models

fastai
fastbook
classification
computer vision
techniques
bag of tricks
This chapter of "Deep Learning for Coders with fastai & PyTorch" details several techniques you can apply to get SOTA results with your image classification models! It’s the last chapter dedicated to computer vision before diving into collaborative filtering, tabular, and NLP models.
Author

Wayde Gilliam

Published

March 28, 2022

Other posts in this series:

A Journey Through Fastbook (AJTFB) - Chapter 1
A Journey Through Fastbook (AJTFB) - Chapter 2
A Journey Through Fastbook (AJTFB) - Chapter 3
A Journey Through Fastbook (AJTFB) - Chapter 4
A Journey Through Fastbook (AJTFB) - Chapter 5
A Journey Through Fastbook (AJTFB) - Chapter 6a
A Journey Through Fastbook (AJTFB) - Chapter 6b
A Journey Through Fastbook (AJTFB) - Chapter 8
A Journey Through Fastbook (AJTFB) - Chapter 9

Imagenette

Imagenette is a subset of the ImageNet dataset that “contains a subset of 10 very different categories from the original ImageNet dataset, making for quicker training when we want to experiment”.

Tip

Start with small datasets and models for initial experimentation and prototyping. Both will allow you to iterate over your experiments more quickly and verify your code works from beginning to end without having to wait hours for your training/validation loops to finish. “You should aim to have an iteration speed of no more than a couple of minutes …. If it’s taking longer to do an experiment, think about how you could cut down your dataset, or simplify your model, to improve your experimentation speed.”

from fastai.vision.all import *
path = untar_data(URLs.IMAGENETTE)

Tip 1: Use the “presizing trick”

See chapter 5, pp.189-191. The idea here is to first resize the image to dimensions significantly larger than the final training size (via item_tfms, on the CPU per item), so that subsequent augmentations can be applied without creating empty zones, with those augmentations and the final resize applied on the GPU on batches of images for speed (via batch_tfms).

On the training set, the initial crop area is chosen randomly, with its size selected to cover the entire width or height of the image; the random crop and other augmentations are then done on the GPU.

On the validation set, a center square is always used in the first step, and only a resize to the final size needed is applied on the GPU.

dblock = DataBlock(
    blocks=(ImageBlock(), CategoryBlock()),
    get_items = get_image_files,
    get_y = parent_label,
    item_tfms = Resize(460),                                # presize: large resize on the CPU, per item
    batch_tfms = aug_transforms(size=224, min_scale=0.75)   # augmentations + final resize on the GPU, per batch
)

dls = dblock.dataloaders(path, bs=64)

Tip 2: Create a “baseline”

Note

We are not using a pretrained model here; we are training one from scratch.

model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.654316 4.794844 0.320015 05:21
1 1.217274 1.211676 0.612024 05:19
2 0.964628 1.417025 0.617252 05:06
3 0.736836 0.677910 0.787155 05:12
4 0.596578 0.539180 0.833831 05:05

Tip 3: Normalize your data

Tip

“When training a model, it helps if your input data is normalized - that is, has a mean of 0 and a standard deviation of 1.”

For images, we do this over each channel (dimension 1 of the batch tensor) by averaging over all axes except the channel axis. In fastai, we can utilize the Normalize transform to apply this a batch at a time.

Important

If we don’t tell this transform what mean/std to use, “fastai will automatically calculate them from a single batch of your data”.

Important

If we are using ImageNet images, we can use imagenet_stats instead of calculating the mean/std ourselves.

# an example of normalization calculated on a batch of images
# (because we aren't using normalization yet, you'll see the mean and standard deviation are not very close to
# 0 and 1 respectively)
x, y = dls.one_batch()

x.mean(dim=[0,2,3]), x.std(dim=[0,2,3])
(TensorImage([0.4518, 0.4554, 0.4344], device='cuda:0'),
 TensorImage([0.2868, 0.2783, 0.2998], device='cuda:0'))
def get_dls(batch_size, image_size):
  dblock = DataBlock(
      blocks=(ImageBlock(), CategoryBlock()),
      get_items = get_image_files,
      get_y = parent_label,
      item_tfms = Resize(460),
      batch_tfms = [*aug_transforms(size=image_size, min_scale=0.75), Normalize.from_stats(*imagenet_stats)]
  )

  dls = dblock.dataloaders(path, bs=batch_size)
  return dls
dls = get_dls(64, 224)

# an example of normalization calculated on a batch of images
# (because we are using normalization now, the mean and standard deviation are very close to 0 and 1 respectively)
x, y = dls.one_batch()
print(x.mean(dim=[0,2,3]), x.std(dim=[0,2,3]))

# does this normalization improve our model? Let's see ...
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
TensorImage([-0.0816, -0.0114,  0.0695], device='cuda:0') TensorImage([1.1806, 1.1762, 1.2825], device='cuda:0')
epoch train_loss valid_loss accuracy time
0 1.701530 1.856552 0.468633 05:07
1 1.280709 1.384676 0.573562 05:05
2 1.007325 1.073023 0.656460 05:06
3 0.762624 0.666320 0.784541 05:06
4 0.606407 0.573812 0.823376 05:02
Important

“… when you distribute a model, you need to also distribute the statistics used for normalization, since anyone using it for inference or transfer learning will need to use the same statistics …. If you’re using a model that someone else has trained, make sure you find out what normalization statistics they used and match them.”

Tip 4: Use “progressive resizing”

“… start training using small images, and end training using large images. Spending most of the epochs training with small images helps training complete faster.”

Note

Think of this as a form of transfer learning

“… the kinds of features that are learned by convolutional neural networks are not in any way specific to the size of the image …. So, when we change the image size in the middle of training, it doesn’t mean that we have to find totally different parameters for our model.”

Note

“Progressive resizing has an additional benefit: it is another form of data augmentation. Therefore, you should expect to see better generalization”

dls = get_dls(128,128)
learn = Learner(dls, xresnet50(), loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)

# simply replace the `Learner.dls` with new `DataLoaders` and continue training.
learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.848093 1.582196 0.526512 03:02
1 1.297791 1.205059 0.616878 03:01
2 0.985249 1.022758 0.690067 02:55
3 0.762485 0.688779 0.787155 02:53
epoch train_loss valid_loss accuracy time
0 0.845315 1.171858 0.650112 05:08
epoch train_loss valid_loss accuracy time
0 0.635858 0.834369 0.751307 05:06
1 0.664283 0.665261 0.796117 05:10
2 0.585543 0.634785 0.796490 05:11
3 0.478250 0.495538 0.840926 05:02
4 0.429075 0.448893 0.855489 05:08
Note

To use the DataLoaders with bigger images, we simply assign them to Learner.dls.

Important

Bigger images will require smaller batch sizes. Also, you will not get any benefit from using images sized larger than your images on disk!
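For example, using the get_dls helper defined earlier, a further progressive-resizing step to even larger images would pair the larger image size with a smaller batch size. A minimal sketch (the values here are illustrative, not tuned):

# larger images -> smaller batch size, so batches still fit in GPU memory
# (illustrative values; won't help if your images on disk are smaller than 320px)
learn.dls = get_dls(32, 320)
learn.fine_tune(3, 1e-3)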

Important

“… for transfer learning, progressive resizing may actually hurt performance …. This is most likely to happen if your pretrained model was quite similar to your transfer learning task and dataset, and was trained on similar-sized images, so the weights don’t need to be changed much. In that case, training on smaller images may damage the pretrained weights.”

“On the other hand, if the transfer learning task is going to use images that are of different sizes, shapes, or styles than those used in the pretraining task, progressive resizing will probably help”

Tip

If you are unsure, try it!

Tip 5: Use Test Time Augmentation (TTA)

Important

TTA is a form of data augmentation applied at validation or inference time: “During inference or validation, creating multiple versions of each image using data augmentation, and then taking the average or maximum of the predictions for each augmented version of the image.”

“… select a number of areas to crop from the original rectangular image, pass each of them through our model, and take the maximum or average of the predictions. In fact, we can do this not just for different crops, but for different values across all of our test time augmentation parameters”

What is the problem TTA addresses and why use it?

“When we use random cropping, fastai will automatically use center-cropping for the validation set”, which can be problematic, for example, in multi-label tasks where “sometimes there are small objects toward the edges of an image” that might be cropped out entirely, or where features near the edges are required for the classification task.

# you can pass any `DataLoader` to `tta()` via its `dl` argument (by default it uses your validation `DataLoader`)
preds, targs = learn.tta()
accuracy(preds, targs).item()
0.861090362071991

“TTA gives us a good boost in performance, with no additional training required.”
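By default, tta averages the predictions over a handful of augmented versions of each image. If you want the maximum instead of the average, or more augmented copies, fastai’s tta exposes this through its n and use_max parameters; a minimal sketch:

# `n` controls how many augmented versions of each image are predicted on,
# `use_max=True` takes the maximum of the predictions instead of the average
preds, targs = learn.tta(n=8, use_max=True)
print(accuracy(preds, targs).item())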

Tip 6: Use MixUp

“Mixup … is a powerful data augmentation technique that can provide dramatically higher accuracy, especially when you don’t have much data and don’t have a pretrained model that was trained on data similar to your dataset”

It is a dataset-independent form of data augmentation = it can be applied without the domain knowledge of the dataset needed to configure other forms of data augmentation (e.g., whether to flip and to what degree, etc.)

How does Mixup work?

  1. Select another random image
  2. Pick a weight at random
  3. Take a weighted average of the selected image with your image = Your independent variable
  4. Take a weighted average of the selected image’s labels with your image’s labels = Your dependent variable
  5. Use #3 to predict #4

In pseudocode:

# img1, targ1 = the current image and its target
img2, targ2 = dataset[randint(0, len(dataset))]
t = random_float(0.5, 1.0)
new_img = t * img1 + (1-t) * img2
new_targ = t * targ1 + (1-t) * targ2
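To make the pseudocode concrete, here’s a minimal runnable sketch of the same idea in plain PyTorch (the tensors x1/x2 and one-hot targets y1/y2 are hypothetical stand-ins for two images and their labels; in practice fastai’s MixUp callback does all of this for you, sampling the weight from a Beta distribution):

import torch

# two "images" and their one-hot encoded targets (hypothetical stand-ins)
x1, x2 = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
y1 = torch.tensor([1., 0., 0.])  # class 0
y2 = torch.tensor([0., 0., 1.])  # class 2

# pick a random weight (the pseudocode above samples uniformly from [0.5, 1.0))
t = torch.empty(1).uniform_(0.5, 1.0)

new_x = t * x1 + (1 - t) * x2  # weighted average of the images -> independent variable
new_y = t * y1 + (1 - t) * y2  # weighted average of the labels -> dependent variable
print(new_y)                   # e.g., tensor([0.7000, 0.0000, 0.3000]) when t = 0.7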
Note

“For this to work, our targets need to be one-hot encoded”

Important

“Mixup requires far more epochs”

Note

“One of the reasons Mixup is so exciting is that it can be applied to types of data other than photos. In fact, some people have even shown good results by using Mixup on activations inside their models, not just on inputs - this allows Mixup to be used for NLP and other data types too.”

See pp.247-249 for a detailed example of how Mixup works and is used in fastai.

model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy, cbs=MixUp())
learn.fit_one_cycle(5, 3e-3)
epoch train_loss valid_loss accuracy time
0 2.183965 2.523945 0.320762 02:57
1 1.729223 1.974045 0.461538 03:01
2 1.479313 1.131723 0.630695 03:07
3 1.294975 0.872954 0.724421 03:08
4 1.183486 0.731506 0.776699 03:06
Note

“… it’s going to be hard to train, because … the model has to predict two labels per image rather than just one …. Overfitting seems less likely to be a problem.”

Tip 7: Use “Label Smoothing”

“In the theoretical expression of loss, in classification problems, our targets are one-hot encoded …. That means the model is trained to return 0 for all categories but one, for which it is trained to return 1 …. This encourages overfitting and gives you at inference time a model that is not going to give meaningful probabilities: it will always say 1 for the predicted category even if it’s not too sure, just because it was trained that way.”

Important

“This can become very harmful if your data is not perfectly labeled.”

“In general, your data will never be perfect. Even if the labels were manually produced by humans, they could make mistakes, or have differences of opinions on images that are harder to label”

What is the solution to this?

“… we could replace all our 1s with a number a bit less than 1, and our 0s with a number a bit more than 0, and then train. This is” = label smoothing. “By encouraging your model to be less confident, label smoothing will make your training more robust, even if there is mislabeled data. The result will be a model that generalizes better at inference.”
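As a quick illustration of what those smoothed targets look like: with N classes and smoothing parameter ε (0.1 is a common default), the correct class becomes 1 - ε + ε/N and every other class becomes ε/N. A minimal sketch:

import torch

eps, n_classes = 0.1, 10
target = 3  # hypothetical correct class

# one-hot target: 1 for the correct class, 0 everywhere else
one_hot = torch.zeros(n_classes)
one_hot[target] = 1.

# label smoothing: blend the one-hot target with a uniform distribution
smoothed = one_hot * (1 - eps) + eps / n_classes
print(smoothed[target], smoothed[0])  # tensor(0.9100) tensor(0.0100)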

See pp.249-251 for a detailed explanation and example of how label smoothing operates. To use it, we just have to change our loss function.

model = xresnet50()
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
epoch train_loss valid_loss accuracy time
0 2.757509 3.500999 0.253921 03:06
1 2.257501 2.817133 0.440627 03:00
2 1.968483 2.138581 0.617625 02:59
3 1.781833 1.700527 0.772591 03:05
4 1.648491 1.632251 0.798357 03:01
Important

“As with Mixup, you won’t generally see significant improvements from label smoothing until you train more epochs.”

Resources

  1. https://book.fast.ai - The book’s website; it’s updated regularly with new content and recommendations on everything from which GPUs to use, to how to run things locally and on the cloud, etc.

  2. Bag of Tricks for Image Classification with Convolutional Neural Networks discusses a variety of techniques you can use with CNNs

  3. How to Train Your ResNet 8: Bag of Tricks discusses a variety of techniques you can use to train ResNets.

  4. IceVision is a great resource for all things computer vision and a fastai-friendly library. You may want to follow these Twitter accounts as well: @ai_fast_track and @Fra_Pochetti (creator of IceVision).