The DataBlock API is fastai's high-level approach for building DataLoaders from your raw data sources. It is a reusable blueprint for how data is used both during model training and at inference time and, along with the fastai callback system, it is one of the core pieces of the fastai framework.

"... a DataBlock object ... is like a template for creating a DataLoaders object" 1

"A DataLoader is a class that provides batches of a few items at a time to the GPU" 3


Defining your "blueprint" using the DataBlock API

There are four things you need to specify to make your data usable for training (e.g., to build at minimum a training and validation DataLoader). 2

  1. What kind of data you are working with
  2. How to get the data
  3. How to label the data
  4. How to create a validation set
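
As a rough illustration of steps 2 and 3, the plain-Python sketch below labels each item by the name of the folder that contains it, which is the convention parent_label relies on. The file paths and the helper name label_from_parent are hypothetical, and this is an approximation of the idea rather than fastai's own code.

```python
from pathlib import PurePosixPath

# Hypothetical image paths, laid out as <root>/<label>/<file>
items = [
    PurePosixPath("bears/grizzly/001.jpg"),
    PurePosixPath("bears/black/002.jpg"),
    PurePosixPath("bears/teddy/003.jpg"),
]

def label_from_parent(path):
    """Return the name of the folder that directly contains `path`."""
    return path.parent.name

labels = [label_from_parent(p) for p in items]
print(labels)  # ['grizzly', 'black', 'teddy']
```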

Here's an example of how this is done with the DataBlock API:

from fastai.vision.all import *

d_block = DataBlock(
  blocks=(ImageBlock, CategoryBlock),              #=> our independent and dependent variable datatypes
  get_items=get_image_files,                       #=> how to get our data
  splitter=RandomSplitter(valid_pct=0.2, seed=42), #=> how to create the validation set
  get_y=parent_label,                              #=> how to label our data
  item_tfms=Resize(128)                            #=> code that runs against each item as it is fetched
)

Tip: Use the seed argument to ensure you get the same training/validation split each time you run that code. Otherwise, as you change hyperparameter values, you won’t be able to tell whether your model performance changed because of those values or because of differences in your training/validation sets!
Note: To ensure reproducibility in your fastai training, follow the tips/tricks laid out in the Reproducibility: Where is the randomness coming in? forum post.
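
To see why a fixed seed matters, here is a minimal plain-Python sketch of a seeded random split. The function name seeded_split is hypothetical, and it uses Python's random module instead of whatever RandomSplitter uses internally, so treat it as an illustration of the idea only.

```python
import random

def seeded_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle indices 0..n_items-1; the first `valid_pct` share becomes validation."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(n_items * valid_pct)
    return idxs[cut:], idxs[:cut]  # (train indices, validation indices)

train_a, valid_a = seeded_split(10, valid_pct=0.2, seed=42)
train_b, valid_b = seeded_split(10, valid_pct=0.2, seed=42)

# Same seed, same split, so a change in results comes from your other changes
assert (train_a, valid_a) == (train_b, valid_b)
print(len(train_a), len(valid_a))  # 8 2
```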

Using your "blueprint" to build your DataLoaders

Once you've defined your blueprint for how to get your modelable data (i.e., your DataLoaders), you need to pass it the "actual source" of your data, which can be a path or a DataFrame or whatever.

dls = d_block.dataloaders(path)

Note: Use dls.show_batch(...) or dls.valid.show_batch(...) to visualize your training/validation data.

Transforms

The DataBlock API relies heavily on the use of fastai transforms. They are used in the blocks you see above as well as inline, as you'll see below.

What is a "Transform"?

A Transform contains code that is applied automatically during training.

What kinds of transforms are there?

There are two kinds of transforms:

Item Transforms: Applied to each individual item in your dataset when it is fetched.

Note: Use the item_tfms argument to define your item transforms. It is more technically correct to think of them as your after item transforms since that is when they are applied.

Batch Transforms: Applied on the GPU to a collection of items after they have been collated into a batch.

Note: Use the batch_tfms argument to define your batch transforms. It is more technically correct to think of them as your after batch transforms since that is when they are applied.
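
The order of operations can be sketched in plain Python. Everything here (the toy dataset and the item_tfm, collate, and batch_tfm helpers) is hypothetical and runs on the CPU with lists; it only illustrates why items are made the same size before they are collated, and is not how fastai implements its pipeline.

```python
# Toy "images": lists of pixel values with different lengths
dataset = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]

def item_tfm(item, size=3):
    """Item transform: crop a single item to a fixed size as it is fetched."""
    return item[:size]

def collate(items):
    """Stack same-sized items into one batch (a list of rows here)."""
    assert len({len(i) for i in items}) == 1, "items must share a shape"
    return [list(i) for i in items]

def batch_tfm(batch, scale=10):
    """Batch transform: one operation applied over the whole collated batch."""
    return [[x * scale for x in row] for row in batch]

batch = batch_tfm(collate([item_tfm(i) for i in dataset]))
print(batch)  # [[10, 20, 30], [60, 70, 80], [90, 100, 110]]
```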

An example:

d_block = d_block.new(item_tfms=RandomResizedCrop(128, min_scale=0.3), batch_tfms=aug_transforms(mult=2))

Note: aug_transforms are "a standard set of augmentations that we have found work pretty well"

When should I use an item transform?

TODO

When should I use a batch transform?

Data augmentation

Data augmentation transforms (e.g., rotation, flipping, perspective warping, brightness changes, contrast changes, etc.) are defined as batch transforms and run on the GPU.
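
As a toy picture of augmentation applied at the batch level, the sketch below randomly reverses rows of an already collated batch, standing in for a horizontal flip. The function random_flip_batch is hypothetical and works on plain lists on the CPU; it is not how aug_transforms is implemented.

```python
import random

def random_flip_batch(batch, p=0.5, rng=None):
    """Reverse each row with probability p (a stand-in for horizontal flipping)."""
    rng = rng or random.Random()
    return [row[::-1] if rng.random() < p else list(row) for row in batch]

batch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
augmented = random_flip_batch(batch, p=0.5, rng=random.Random(0))

# Every row is either unchanged or reversed; the batch shape is preserved
assert all(a == b or a == b[::-1] for a, b in zip(augmented, batch))
```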


Tips & Tricks

Changing your transforms without having to redefine your DataBlock from scratch

You can change the transforms in your DataBlock by reusing an existing DataBlock via d_block.new.

d_block = d_block.new(item_tfms=Resize(128, ResizeMethod.Squish))
dls = d_block.dataloaders(path)
...
d_block = d_block.new(item_tfms=Resize(128, ResizeMethod.Pad, pad_mode='zeros'))
dls = d_block.dataloaders(path)
...

Resources for learning the DataBlock API

For more detailed discussion of the DataBlock API, see the following resources:

  1. My article "Finding DataBlock Nirvana with fast.ai v2 - Part 1"

  2. The "Walk with fastai2" videos

  3. The fastai docs and the DataBlock Tutorials


1. "Chapter 2: From Model to Production". In The Fastbook, p. 72

3. Ibid.

2. Ibid., pp. 70-74 provide more detail on using the DataBlock API for a multiclass classification computer vision task