Other posts in this series:
A Journey Through Fastbook (AJTFB) - Chapter 1
When selecting a project, the most important consideration is data availability
If you don't have enough quality data ... good luck :)
... iterate from end to end in your project; don't spend months fine-tuning your model, or polishing the perfect GUI, or labeling the perfect dataset
This is good advice for any software project: fail early and fail often. If you don't, you're likely to uncover critical problems much later than you otherwise would have and, even worse, you may not produce anything at all! In the world of deep learning there are a number of tools that, while helpful, can get you so bogged down that you never deploy something usable (e.g., experiment tracking tools, hyperparameter optimization libraries, etc.). Also, remember that getting something into production is a different task from winning a Kaggle competition, where the latter may require some of those aforementioned tools and the ensembling of dozens of models. For production, something better than human is often good enough to get out there and then improve through refactoring.
It's amazing how rare it has been, in my 20+ years as a developer, for a customer to be able to clearly define what they want! In my experience, more often than not, it is the developers who end up defining the goals. Not having a clear objective is likely to waste time, energy, and money producing something that won't even see the light of day. You can't gauge the completion or quality of any software project without clear objective(s).
Ex.1: Show most relevant search results.
Ex.2: Drive additional sales by recommending items to customers that they otherwise wouldn't purchase.
What things can you do to make your goals a reality? Pretty simple.
Ex.1: Ranking the search results will help show the most relevant ones first.
Ex.2: Ranking the recommendations will help.
If you don't have the data, you'll need to get it ... because the data pulls the levers which get you closer to your objective(s).
Ex.1: Seeing how pages linked to other pages.
Ex.2: Collecting data on what customers purchased, what was recommended, and what they did with that info.
Only once you have the data and know what actions you want to be able to take based on the information within it do you begin modeling ... first, defining what models you can even build with that data and second, what data you need to collect for the models you can't.
Ex.1: A model that takes the page relation data and predicts a ranking given a query.
Ex.2: Two models that predict the purchasing probabilities conditional on seeing or not seeing a recommendation.
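To make Ex.2 a bit more concrete, here is a minimal sketch in plain Python of how the two purchase probabilities could be combined to rank recommendations. The function name, item names, and probabilities are all made up for illustration; in practice the two probabilities would come from the two models described above.

```python
def rank_by_uplift(p_with_rec, p_without_rec):
    """Rank items by how much a recommendation raises purchase probability.

    Both arguments are dicts mapping item -> predicted purchase probability
    (hypothetical outputs of the two models).
    """
    uplift = {item: p_with_rec[item] - p_without_rec[item] for item in p_with_rec}
    # Highest uplift first: purchases the customer would not have made otherwise.
    return sorted(uplift, key=uplift.get, reverse=True)


# Made-up probabilities for three items.
p_seen = {"tent": 0.30, "socks": 0.90, "stove": 0.25}
p_not_seen = {"tent": 0.05, "socks": 0.88, "stove": 0.20}

ranking = rank_by_uplift(p_seen, p_not_seen)
print(ranking)
```

Note that socks have the highest purchase probability but the lowest uplift, since customers buy them anyway. That difference is exactly what the objective in Ex.2 cares about.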
    !pip install fastai -q
    from fastai.vision.all import *
- Use download_images, with the images listed as URLs in a text file or passed via urls, to download the actual images locally.
- Get the file paths to the images via get_image_files.
- Get rid of the corrupt images using verify_images.
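    path = Path('bears/grizzly')
    # url_file points at a text file with one image URL per line
    download_images(path, url_file=Path('image_urls.txt'))
    file_paths = get_image_files(path)
    failed = verify_images(file_paths)   # the images that could not be opened
    failed.map(Path.unlink)              # delete them from disk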
The map method is used to apply the Path.unlink function to each item, deleting the corrupt image files from disk.
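The same find-and-delete pattern can be sketched with only the standard library. This is not how verify_images works internally (it tries to open each file as an image); the is_corrupt check below is a stand-in I made up that treats empty files as corrupt, and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def is_corrupt(file_path):
    """Stand-in check (an assumption for this sketch): empty files are corrupt."""
    return file_path.stat().st_size == 0


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "good.jpg").write_bytes(b"not really a jpeg, but not empty")
    (folder / "bad.jpg").write_bytes(b"")

    file_paths = sorted(folder.iterdir())
    failed = [p for p in file_paths if is_corrupt(p)]

    # Equivalent in spirit to failed.map(Path.unlink): delete each failed file.
    for p in failed:
        p.unlink()

    remaining = sorted(p.name for p in folder.iterdir())

print(remaining)
```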
You can also use pdb.set_trace() (in code) or %debug (in a new cell following the one with the error) to step through your code. I use the former all the time ... it's a great way to debug and also learn what the code is doing and why. For example, I use it to look at the shape of things as they travel through and out of different layers in my NNs.
    import pdb

    def div_by_zero():
        pdb.set_trace()
        x = 1/0
        print('here')

    # uncomment this to see what I'm talking about ...
    # div_by_zero()
In order to make your data "modelable" (via DataLoaders), you need to tell fastai 4 things:
- What kind of data you are working with
- How to get the data
- How to label the data
- How to create a validation set
Here's an example of how this is done with the DataBlock API:
    d_block = DataBlock(
        blocks=(ImageBlock, CategoryBlock),               #=> our independent and dependent variable datatypes
        get_items=get_image_files,                        #=> how to get our data
        splitter=RandomSplitter(valid_pct=0.2, seed=42),  #=> how to create the validation set
        get_y=parent_label,                               #=> how to label our data
        item_tfms=Resize(128)                             #=> code that runs against each item as it is fetched
    )
Use the seed argument to ensure you get the same training/validation set each time you run that code; otherwise you won’t be able to know whether, as you change hyperparameter values, your model performance changed because of those values and/or because of differences in your training/validation sets!
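To see why the seed matters, here is a minimal sketch of a seeded random split in plain Python. It only mimics the idea behind RandomSplitter; the function name and implementation are mine, not fastai's.

```python
import random


def random_split(n_items, valid_pct=0.2, seed=None):
    """Return (train_idxs, valid_idxs) for n_items, shuffled with an optional seed."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)  # a private RNG, so global random state is untouched
    n_valid = int(n_items * valid_pct)
    return idxs[n_valid:], idxs[:n_valid]


train_a, valid_a = random_split(100, valid_pct=0.2, seed=42)
train_b, valid_b = random_split(100, valid_pct=0.2, seed=42)

print(valid_a == valid_b)          # same seed, same validation set
print(len(train_a), len(valid_a))
```

With the same seed, both calls pick the same validation items, so any change in metrics between runs comes from what you changed and not from a reshuffled split.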
A DataBlock object ... is like a template for creating a DataLoaders.
For more detailed discussion of the DataBlock API, see the following resources:
1. My article "Finding DataBlock Nirvana with fast.ai v2 - Part 1"
Once you've defined your blueprint for how to get your modelable data (i.e., your DataLoaders), you need to pass it the "actual source" of your data, which can be a path or a DataFrame or whatever.
dls = d_block.dataloaders(path)
Use dls.show_batch(...) or dls.valid.show_batch(...) to visualize your training/validation data.
    d_block = d_block.new(item_tfms=Resize(128, ResizeMethod.Squish))
    dls = d_block.dataloaders(path)
    ...
    d_block = d_block.new(item_tfms=Resize(128, ResizeMethod.Pad, pad_mode='zeros'))
    dls = d_block.dataloaders(path)
    ...
d_block = d_block.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
min_scale: "how much of the image to select at minimum each time." 0.5 = select 50% of the image at minimum.
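A rough sketch of what min_scale controls, in plain Python. It only samples the crop's size as a fraction of the image area; the real RandomResizedCrop also varies the aspect ratio and position of the crop and then resizes it to the target size. The function name is hypothetical.

```python
import random


def sample_crop_area(width, height, min_scale=0.3, rng=random):
    """Pick a crop area (in pixels) covering between min_scale and 100% of the image."""
    scale = rng.uniform(min_scale, 1.0)
    return scale * width * height


rng = random.Random(0)
full_area = 640 * 480
areas = [sample_crop_area(640, 480, min_scale=0.3, rng=rng) for _ in range(1000)]

# Every sampled crop keeps at least 30% of the original image.
print(min(areas) / full_area >= 0.3)
```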
Pass unique=True to the show_batch functions "to have the same image repeated with different versions" of the transforms you’ve defined.
Data augmentation transforms (e.g., rotation, flipping, perspective warping, brightness changes, contrast changes, etc.) are defined as batch transforms and run on the GPU.
d_block = d_block.new(item_tfms=RandomResizedCrop(128, min_scale=0.3), batch_tfms=aug_transforms(mult=2))
batch_tfms: Your batch transforms, or more correctly your after batch transforms
aug_transforms are "a standard set of augmentations that we have found work pretty well"
    interp = ClassificationInterpretation.from_learner(learn)
    interp.plot_top_losses(5, nrows=1)  #> show the 5 examples with the highest loss
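Conceptually, plot_top_losses sorts the validation items by their individual loss and shows the worst ones. Here is a minimal sketch of that ranking in plain Python, using made-up predicted probabilities and cross-entropy as the loss. The helper name and the data are assumptions for illustration, not fastai internals.

```python
import math


def top_losses(probs, targets, k=5):
    """Return (index, loss) pairs for the k items with the highest cross-entropy loss.

    probs:   list of per-class probability lists, one per item
    targets: list of true class indices
    """
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    ranked = sorted(enumerate(losses), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up predictions over three hypothetical classes; the true class is always 0.
probs = [
    [0.90, 0.05, 0.05],  # confident and correct -> low loss
    [0.10, 0.80, 0.10],  # confident and wrong   -> high loss
    [0.40, 0.35, 0.25],  # unsure but correct    -> medium loss
]
targets = [0, 0, 0]

worst = top_losses(probs, targets, k=2)
print([idx for idx, _ in worst])
```

The loss is high when the model is wrong (especially if it is also confident) or right but unsure, which is why the top losses are a good place to look for mislabeled or odd images.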
"a model consists of two parts: the architecture and the trained parameters." You can use it just like any other function.
    # saves the architecture, the trained parameters, and the definition of how to create your DataLoaders
    learn.export()
fastai ... uses your validation set DataLoader for inference by default, so your data augmentation will not be applied.
    inf_learn = load_learner(path/'export.pkl')
    inf_learn.predict('images/grizzly.jpg')
    inf_learn.dls.vocab  # => To view possible classification categories/labels
Two problems to watch out for:
- out-of-domain data: "data that our model sees in production that is very different to what it saw during training."
- domain shift: "whereby the type of data that our model sees changes over time."
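One simple report that can help surface domain shift is a comparison of the predicted-label distribution in production against the one seen at validation time. Below is a minimal sketch with made-up labels and an arbitrary alert threshold; both are assumptions for illustration, not recommendations from the book.

```python
from collections import Counter


def label_shares(labels):
    """Map each label to its share of the total."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def max_share_drift(reference, current):
    """Largest absolute change in any label's share between two label lists."""
    ref, cur = label_shares(reference), label_shares(current)
    all_labels = set(ref) | set(cur)
    return max(abs(ref.get(label, 0.0) - cur.get(label, 0.0)) for label in all_labels)


validation_preds = ["grizzly"] * 50 + ["black"] * 50
production_preds = ["grizzly"] * 90 + ["black"] * 10

drift = max_share_drift(validation_preds, production_preds)
ALERT_THRESHOLD = 0.2  # arbitrary value for this sketch
print(drift > ALERT_THRESHOLD)
```

A shift in predicted labels doesn't prove the inputs changed, but it's a cheap signal to include in regular reporting.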
Where possible, the first step is to use an entirely manual process with your model running in parallel and not being used to directly drive any actions.
The second step is to try and limit the scope of the model.
The third step is to gradually increase the scope of your rollout.
Important: "Try to think about all the ways in which your system could go wrong, and then think about what measure or report or picture could reflect that problem, and ensure that your regular reporting includes that information."
https://book.fast.ai - The book's website; it's updated regularly with new content and recommendations on everything from which GPUs to use to how to run things locally and in the cloud, etc.
https://docs.fast.ai/ - The library's documentation; includes tutorials and other tips for development.
https://forums.fast.ai/ - If you're not part of the community yet, you should be. Before posting a question, search to see if it has been answered already (95% of the time, it has).