Other posts in this series:
A Journey Through Fastbook (AJTFB) - Chapter 1

Chapter 2


Starting Your Project

Things to think about when deciding on project feasibility

When selecting a project, the most important consideration is data availability

If you don't have enough quality data ... good luck :)

Important: Data augmentation can alleviate the need for more manual labeling and also protect you from problems with out-of-domain data (e.g., when unexpected image types arise once the model is used in production) by synthetically creating more of the data you are likely to see but that isn't in your dataset as is.

... iterate from end to end in your project; don't spend months fine-tuning your model, or polishing the perfect GUI, or labeling the perfect dataset

This is good advice for any software project ... fail early and fail often. If you don't, you're likely to uncover critical problems much later than you otherwise would have, and even worse, you may not produce anything at all! In the world of deep learning there are a number of tools that, while helpful, can get you so bogged down that you never deploy something usable (e.g., experiment tracking tools, hyperparameter optimization libraries, etc.). Also, remember that getting something into production is a different task from winning a Kaggle competition, where the latter may require some of those aforementioned tools and the ensembling of dozens of models. For production, something better than human performance is often good enough to get out there, and then improve through refactoring.


The Drivetrain Approach

Four Steps

Step 1: Define your objective(s)

It's amazing how rarely, in my 20+ years as a developer, a customer is able to clearly define what they want! In my experience, more often than not, it is the developers who end up defining the goals. Not having a clear objective is likely to waste time, energy, and money producing something that won't even see the light of day. You can't gauge the completion or quality of any software project without clear objective(s).

Ex.1: Show most relevant search results.
Ex.2: Drive additional sales by recommending items customers wouldn't otherwise purchase.

Step 2: What actions can you take to achieve those objective(s)?

What things can make your goals a reality? Pretty simple.

Ex.1: Ranking the search results will help show the most relevant ones first.
Ex.2: Ranking the recommendations will help.

Step 3: What data is needed to take those actions?

If you don't have the data, you'll need to get it ... because the data pulls the levers which get you closer to your objective(s).

Ex.1: Seeing how pages link to other pages.
Ex.2: Collecting data on what customers purchased, what was recommended, and what they did with that info.

Step 4: Build models

Only once you have the data and know what actions you want to take based on the information within it do you begin modeling ... first, defining what models you can build with the data you have and second, what data you need to collect for the models you can't.

Ex.1: A model that takes the page relation data and predicts a ranking given a query.
Ex.2: Two models that predict the purchasing probabilities conditional on seeing or not seeing a recommendation (see the sketch below).
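To make Ex.2 concrete, here is a minimal sketch of how the outputs of those two models might be combined to rank recommendations; the function, items, and numbers are all hypothetical:

# Hypothetical sketch: rank items by the expected uplift a recommendation would cause.
# p_with / p_without would come from the two purchase-probability models above.
def expected_uplift(p_with_rec, p_without_rec, price):
    "Expected incremental revenue from showing the recommendation."
    return (p_with_rec - p_without_rec) * price

candidates = [
    {'item': 'hiking boots', 'p_with': 0.12, 'p_without': 0.05, 'price': 120.0},
    {'item': 'water bottle', 'p_with': 0.30, 'p_without': 0.28, 'price': 15.0},
]

# items whose purchase probability barely changes aren't worth recommending,
# even if the customer is very likely to buy them anyway
ranked = sorted(candidates,
                key=lambda c: expected_uplift(c['p_with'], c['p_without'], c['price']),
                reverse=True)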

! pip install fastai -q
from fastai.vision.all import *

Downloading images, getting the downloaded images, & removing those that are corrupt

  1. Use download_images to download the images listed as URLs in a text file (here, image_urls.txt) locally.
  2. Get the file paths to the downloaded images via get_image_files as an L object.
  3. Get rid of the corrupt images using verify_images and Path.unlink.
path = Path('bears/grizzly')
path.mkdir(parents=True, exist_ok=True)

# read the URLs (one per line) and download the images into `path`
urls = Path('image_urls.txt').read_text().strip().split('\n')
download_images(path, urls=urls)

file_paths = get_image_files(path)
failed = verify_images(file_paths)
failed.map(Path.unlink)

Notice how L's map method is used to apply the Path.unlink function to each item in-place.
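If you haven't used fastcore's L class before, here's a quick illustration of its map method (the values are made up purely for demonstration):

from fastcore.all import L

nums = L(1, 2, 3)
nums.map(lambda x: x * 2)  # => (#3) [2,4,6]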


Getting help

A few ways ...

download_images?
download_images??
doc(download_images)

You can also use pdb.set_trace (in code) or %debug (in a new cell following the one with the error) to step through your code. I use the former all the time ... it's a great way to debug and also learn what the code is doing and why. For example, I use it to look at the shape of things as they travel through and out of different layers in my NNs.

import pdb
def div_by_zero():
  pdb.set_trace()
  x = 1/0
  print('here')

# uncomment this to see what I'm talking about ...
# div_by_zero()

DataBlock API Basics

In order to make your data "modelable" (via DataLoaders), you need to tell fastai 4 things:

  1. What kind of data you are working with
  2. How to get the data
  3. How to label the data
  4. How to create a validation set

Here's an example of how this is done with the DataBlock API:

d_block = DataBlock(
  blocks=(ImageBlock, CategoryBlock),              #=> our independent and dependent variable datatypes
  get_items=get_image_files,                       #=> how to get our data
  splitter=RandomSplitter(valid_pct=0.2, seed=42), #=> how to create the validation set
  get_y=parent_label,                              #=> how to label our data
  item_tfms=Resize(128)                            #=> code that runs against each item as it is fetched
)

Important: Use the seed argument to ensure you get the same training/validation split each time you run that code; otherwise you won't be able to tell whether, as you change hyperparameter values, your model's performance changed because of those values or because of differences in your training/validation sets!
> ... a DataBlock object ... is like a template for creating a DataLoaders object

For more detailed discussion of the DataBlock API, see the following resources:

  1. My article "Finding DataBlock Nirvana with fast.ai v2 - Part 1"

  2. The "Walk with fastai2" videos

  3. The fastai docs and the DataBlock Tutorials

Once you've defined your blueprint for how to get your modelable data (i.e., your DataLoaders), you need to pass it the "actual source" of your data, which can be a path, a DataFrame, or whatever.

dls = d_block.dataloaders(path)

Use dls.show_batch(...) or dls.valid.show_batch(...) to visualize your training/validation data.
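For example (max_n and nrows here are just illustrative values):

dls.show_batch(max_n=8, nrows=2)        # sample of the training set
dls.valid.show_batch(max_n=8, nrows=2)  # sample of the validation set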

Important: You can change the transforms in your DataBlock by reusing the existing DataBlock using d_block.new

d_block = d_block.new(item_tfms=Resize(128, ResizeMethod.Squish))
dls = d_block.dataloaders(path)
...
d_block = d_block.new(item_tfms=Resize(128, ResizeMethod.Pad, pad_mode='zeros'))
dls = d_block.dataloaders(path)
...

Important: For resizing ... "what we normally do in practice is to randomly select part of the image and then crop to just that part. On each epoch ... we randomly select a different part of each image. This means that our model can learn to focus on, and recognize, different features in our images. It also reflects how images work in the real world: different photos of the same thing may be framed in slightly different ways." This is done using the RandomResizedCrop transform.

d_block = d_block.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))

min_scale: "how much of the image to select at minimum each time." 0.5 = select 50% of the image at minimum.

Important: Pass unique=True to your show_batch functions "to have the same image repeated with different versions" of the transforms you’ve defined.
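For example, to see several augmented versions of the same image (a minimal sketch; the argument values are illustrative):

dls.train.show_batch(max_n=4, nrows=1, unique=True)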

Data augmentation transforms (e.g., rotation, flipping, perspective warping, brightness changes, contrast changes, etc.) are defined as batch transforms and run on the GPU.

Important: Item Transforms are applied to an item from your dataset when it is fetched, Batch Transforms are applied to a collection of items on the GPU after they have been collated into the same shape.

d_block = d_block.new(item_tfms=RandomResizedCrop(128, min_scale=0.3), batch_tfms=aug_transforms(mult=2))

batch_tfms: Your batch transforms, or more correctly your after batch transforms

Important: aug_transforms are "a standard set of augmentations that we have found work pretty well"


Cleaning your model through training

Important: "It’s helpful to see where exactly our errors are occuring, to see whether they’re due to a dataset problem ... or a model problem" using plot_top_losses

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(5, nrows=1) #> show the 5 examples with the highest loss

Important: A "model can help you find data issues more quickly ... so we normally prefer to train a quick and simple model first, and then use it to help with data cleaning."


Inference

"a model consists of two parts:the architecture and the trained parameters." You can use it just like any other function

# saves the architecture, the trained parameters, and the definition of how to create your DataLoaders
learn.export()

fastai ... uses your validation set DataLoader for inference by default, so your data augmentation will not be applied.

inf_learn = load_learner(path/'export.pkl')
inf_learn.predict('images/grizzly.jpg')
inf_learn.dls.vocab # => To view possible classification categories/labels
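predict returns three things ... the predicted category, the index of that category, and the probabilities for each category ... so you can unpack it like this (same image path as above):

pred_class, pred_idx, probs = inf_learn.predict('images/grizzly.jpg')
print(pred_class, probs[pred_idx])  # the predicted label and its probability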

For options on how to deploy your app, see the Deployment section in the course website. I personally like to use FastAPI and there is a good starter template here for that.
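To give a sense of what that looks like, here is a minimal sketch of a FastAPI prediction endpoint (the route and response field names are hypothetical; it assumes the export.pkl produced above sits next to the app):

from fastapi import FastAPI, UploadFile
from fastai.vision.all import load_learner, PILImage

app = FastAPI()
learn = load_learner('export.pkl')

@app.post('/predict')
async def predict(file: UploadFile):
    img = PILImage.create(await file.read())  # PILImage.create accepts raw bytes
    pred, _, probs = learn.predict(img)
    return {'prediction': str(pred), 'probability': float(probs.max())}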


How to Avoid Disaster

Important: Your model is only as good as the data it was trained on

Two problems to watch out for:

  1. out-of-domain data: "data that our model sees in production that is very different to what it saw during training."
  2. domain shift: "whereby the type of data that our model sees changes over time."

Mitigation steps:

Where possible, the first step is to use an entirely manual process with your model running in parallel and not being used to directly drive any actions.

The second step is to try and limit the scope of the model.

The third step is to gradually increase the scope of your rollout.

Important:"Try to think about all the ways in which your system could go wrong, and then think about what measure or report or picture could reflect that problem, and ensure that your regular reporting includes that information."


Resources

  1. https://book.fast.ai - The book's website; it's updated regularly with new content and recommendations on everything from which GPUs to use, to how to run things locally and on the cloud, etc.

  2. https://docs.fast.ai/ - The library's documentation; includes tutorials and other tips for development.

  3. https://forums.fast.ai/ - If you're not part of the community yet, you should be. Before posting a question, search to see if it has been answered already (95% of the time, it has).