ResNet is arguably the best architecture for most computer vision tasks. Here we take a look at what a ResNet is and how it can be used in fastai for a variety of such tasks.


What is a ResNet & Why use it for computer vision tasks?

A ResNet is a model architecture that has proven to work well in CV tasks. Several variants exist with different numbers of layers; the larger architectures take longer to train and are more prone to overfitting, especially on smaller datasets.

The number in a variant's name (e.g. ResNet-34) represents the number of layers in that particular ResNet variant ... "(other options are 18, 50, 101, and 152) ... model architectures with more layers take longer to train and are more prone to overfitting ... on the other hand, when using more data, they can be quite a bit more accurate." 2

What else can we use image recognizers for besides image tasks?

Sound, time series, malware classification ... "a good rule of thumb for converting a dataset into an image representation: if the human eye can recognize categories from the images, then a deep learning model should be able to do so too." 1
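One way to picture this for time series: stack overlapping windows of a 1-D signal into the rows of a 2-D array, so repeating patterns become visible texture. The sketch below is a minimal, illustrative conversion (the `window` and `hop` values are arbitrary assumptions, not from the book, which shows several other encodings on pp. 36-39):

```python
import numpy as np

def series_to_image(signal, window=32, hop=16):
    """Stack overlapping windows of a 1-D signal into rows of a 2-D array.

    Each row is one window of the signal, so periodic structure shows up
    as visible bands, the kind of pattern a CNN can pick up on.
    """
    rows = []
    for start in range(0, len(signal) - window + 1, hop):
        rows.append(signal[start:start + window])
    img = np.stack(rows)
    # Scale to 0-255 so the array could be saved as a grayscale image.
    img = (img - img.min()) / (img.max() - img.min() + 1e-8) * 255
    return img.astype(np.uint8)

# A sine wave becomes a banded 2-D "image": one row per window.
t = np.linspace(0, 8 * np.pi, 512)
img = series_to_image(np.sin(t))
print(img.shape)  # (31, 32)
```

If the bands are distinguishable by eye between classes, that's a good sign (per the rule of thumb above) that a ResNet can learn them too.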

How does it fare against more recent architectures like vision transformers?

Pretty well apparently (at least at the time this post was written) ...


ResNet best practices

Tip: Start with a smaller ResNet (like 18 or 34) and move up as needed.
Note: If you have a lot of data, the bigger ResNets will likely give you better results.

An example using the high-level API

Step 1: Build our DataLoaders

from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
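The label function works because of a quirk of the Oxford-IIIT Pet dataset's filenames: cat breeds are capitalized and dog breeds are lowercase, so checking the first character is enough. A quick standalone check (the example filenames are real breed-name patterns from that dataset):

```python
# In the Oxford-IIIT Pet dataset, cat-breed filenames start with an
# uppercase letter ("Bengal_192.jpg") and dog breeds with a lowercase
# one ("beagle_32.jpg"), so the label is just the case of character 0.
def is_cat(x): return x[0].isupper()

print(is_cat('Bengal_192.jpg'))  # True: cat breed
print(is_cat('beagle_32.jpg'))   # False: dog breed
```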

Why do we make images 224x224 pixels?

"This is the standard size for historical reasons (old pretrained models require this size exactly) ... If you increase the size, you'll often get a model with better results since it will be able to focus on more details." 3

Tip: Train on progressively larger image sizes using the weights trained on smaller sizes as a kind of pretrained model.

Step 2: Build our cnn_learner

learn = cnn_learner(dls, resnet18, metrics=error_rate)

As you can see above, the architecture being used is a ResNet with 18 layers.

Step 3: Train

learn.fine_tune(1)
epoch  train_loss  valid_loss  error_rate  time
0      0.161614    0.040670    0.013532    01:03

epoch  train_loss  valid_loss  error_rate  time
0      0.062475    0.020072    0.006766    01:04

For more information on how transfer learning works, and the fine_tune method in particular, see this section in my "What is machine learning" post.

For more metrics like error_rate, see my "What is a metric" post.


2. "Chaper 1: Your Deep Learning Journey". In The Fastbook pp.30-31.

1. Ibid., p.39. Pages 36-39 provides several examples of how non-image data can be converted to an image for such a purpose.

3. Ibid., p.28