Cervantes once wrote that "the journey is better than the inn", but I rather like to think that the journey is the inn.

It means that the journey, irrespective of its difficulties (and likely because of them), is what you look back on with fondness at its end rather than the end itself. It's why I enjoy reading "The Lord of the Rings" every five years or so, where as I age and experience the hand life has dealt me, I find myself appreciating different aspects of the story from the time before and gaining new insights into what I value and want to be as a human being. I find my journey with deep learning to be roughly analogous to that.

I've been a part of the fast.ai community for several years. I've been through the course multiple times (since it was using Theano back in the old days), I've contributed to the library, and I use it as the basis for one of my own. And as with each course, with a re-reading of the book I find myself deriving new insights and appreciating different ideas than those I had before.

And so, while your journey may bring you different revelations, here are the meandering thoughts of one 49-year-old married father of 3 living in San Diego, California, USA, as I embark upon the first chapter in what I consider "The Lord of the Rings" of deep learning.

Chapter 1

How to Learn Deep Learning

You can do this!

Hi, everybody; I'm Jeremy ... I do not have any formal technical education ... didn't have great grades. I was much more interested in doing real projects.

This is meaningful to me as someone with a BA in History and an MA in Theology. It's a reminder that if you want something, it's within your grasp to make it happen if you are willing to put in the work. It's also a reminder that the key to getting there is actually doing something! I find too many people thinking that if they just get into that school, or if they can just take that class, then they'll be a good software engineer or deep learning practitioner. The reality is that the only way you get there is by doing it ... just like pull-ups (which aren't much fun when you're starting out and/or you're 49 and overweight).

The problem with traditional education

... how math is taught - we require students to spend years doing rote memorization and learning dry disconnected fundamentals that we claim will pay off later, long after most of them quit the subject.

This also is the problem with higher education in general, where young people spend at least four to five years learning things they already learned in High School or else things they don't really care about and will be forgotten right after finals, spending in excess of $100,000 for the privilege of it and likely going into debt in the tens of thousands of dollars, all with this idea that having done it they will be prepared for the real world. Unfortunately, that's not how it works. Whether you are in a university or even go to university, what matters is what you do ... not what classes you took or what your GPA is.

Deep Learning (and coding in general) is an art maybe more so than a science

The hardest part of deep learning is artisanal.

I remember going to an iOS conference way back in the day and a conference speaker asking how many folks in the session I was sitting in had a background in music. 80-90% of the audience raised their hands. Sure, there is math and stats and a science to deep learning, but like any coding enterprise, it's an art ... with some artists being better than others along with room for improvement regardless of whether you're Van Gogh or painting by the numbers.

Doing is how you learn, and what you've done is what matters

... focus on your hobbies and passions ... Common character traits in the people who do well at deep learning include playfulness and curiosity.

at Tesla .. CEO Elon Musk says 'A PhD is definitely not required. All that matters is a deep understanding of AI & ability to implement NNs in a way that is actually useful .... Don't care if you even graduated High School.'

... the most important thing for learning deep learning is writing code and experimenting.

Getting Started

Training & Transfer Learning

... a model is a special kind of program: it's one that can do many different things, depending on the weights.

Weights are just variables, and a weight assignment is a particular choice of values for those variables. [Weights] are generally referred to as model parameters ... the term weights being reserved for a particular type of model parameter.

  • The functional form of the model is called its architecture.

It is "the template of the model that we're trying to fit; i.e., the actual mathematical function that we're passing the input data and parameters to" ... whereas the model is a particular set of parameters + the architecture.

  • The weights are called parameters.

These are the things that are "learnt"; the values that can change

  • The predictions are calculated from your independent variables [your X]
  • The [model's] measure of performance is called the loss ... [which depends on how well your model is able to predict] the correct labels (also known as targets or the dependent variable) [your y] ... [given the independent variables as input].

The loss is a measure of model performance that SGD can use to make your model better. A good loss function provides good gradients (slopes) that can be used to make even very minor changes to your weights so as to improve things. Visually, you want gentle rolling hills rather than abrupt steps or jagged peaks.
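
To make these terms concrete for myself, here is a minimal sketch in plain Python (this is not fastai code). The names `architecture`, `mse_loss`, and `gradient_step`, the toy data, and the learning rate are all my own assumptions for illustration. It also uses the whole tiny dataset at every step rather than the mini-batches that real SGD would use.

```python
# Toy data: independent variable xs and targets ys (generated from y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def architecture(x, params):
    """The functional form: a straight line. Architecture + parameters = model."""
    w, b = params
    return w * x + b

def mse_loss(params):
    """Mean squared error between the predictions and the targets."""
    return sum((architecture(x, params) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_step(params, lr=0.05):
    """One gradient descent step using the analytic gradients of the loss."""
    w, b = params
    n = len(xs)
    errors = [architecture(x, params) - y for x, y in zip(xs, ys)]
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(2 * e for e in errors) / n
    return (w - lr * grad_w, b - lr * grad_b)

params = (0.0, 0.0)  # an arbitrary starting weight assignment (hypothetical)
initial_loss = mse_loss(params)
for _ in range(2000):
    params = gradient_step(params)
final_loss = mse_loss(params)
```

The smooth, bowl-shaped squared-error loss is what lets each small step nudge `w` and `b` in a useful direction; those are the "gentle rolling hills" mentioned above.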

Transfer learning is the process of taking a "pretrained model" that has been trained on a very large dataset with proven SOTA results, and "fine tuning" it for your specific task, which while likely similar to the task the pretrained model was trained for to one degree or another, is not necessarily the same.

What does this mean?

  1. The head of your model (the newly added part specific to your dataset/task) should be trained first since it is the only one with completely random weights.
  2. The degree to which the weights of the pretrained model will need to be updated is proportional to how dissimilar your data is to the data it was trained on. The more dissimilar, the more the weights will need to be changed.
  3. Your model will only be as good as the data it was trained on, so make sure what you have is representative of what it will see in the real world. It "can learn to operate on only the patterns seen in the input data used to train it."
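
Point 1 can be sketched with a deliberately tiny two-parameter "network" in plain Python. Everything here is a hypothetical stand-in of my own (the `body`/`head` names, the toy task, the learning rates, the step counts); it is not how fastai implements fine-tuning, just the freeze-then-unfreeze idea in miniature.

```python
# New task data: targets follow y = 6x (made up for illustration).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [6.0 * x for x in xs]

# "body" plays the pretrained weights; "head" is new and effectively random
# (fixed at 0.1 here so the run is reproducible).
params = {"body": 2.0, "head": 0.1}

def predict(x, p):
    return p["head"] * (p["body"] * x)

def loss(p):
    return sum((predict(x, p) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(p, trainable, lr, steps):
    """Gradient descent that only updates the parameters named in `trainable`."""
    p = dict(p)  # work on a copy so the caller's parameters are untouched
    n = len(xs)
    for _ in range(steps):
        errors = [predict(x, p) - y for x, y in zip(xs, ys)]
        grads = {
            "head": sum(2 * e * p["body"] * x for e, x in zip(errors, xs)) / n,
            "body": sum(2 * e * p["head"] * x for e, x in zip(errors, xs)) / n,
        }
        for name in trainable:
            p[name] -= lr * grads[name]
    return p

loss_before = loss(params)
# Stage 1: body frozen, only the random head is trained.
stage1 = train(params, trainable={"head"}, lr=0.01, steps=100)
# Stage 2: everything unfrozen, trained with a smaller learning rate.
stage2 = train(stage1, trainable={"head", "body"}, lr=0.001, steps=100)
```

In stage 1 the pretrained `body` value never moves, so the random head cannot wreck it with large early gradients.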

The process of training (or fitting) the model is the process of finding a set of parameter values (or weights) that specialize that general architecture into a model that works well for our particular kind of data [and task]

fastai's fine_tune method uses proven tricks and hyperparameters for various DL tasks that the authors have found work well most of the time. See p.33 for more info on what it does.

... once the model is trained - that is, once we've chosen our final weight assignments - then we can think of the weights as being part of the model since we're not varying them anymore.

This means a trained model can be treated like a typical function.


Metrics are human-understandable measures of model quality whereas the loss is the machine's. They are based on your validation set and are what you really care about, whereas the loss is "a measure of performance" that the training system can use to update weights automatically.

A good choice for loss is a function "that is easy for stochastic gradient descent (SGD) to use," whereas good choices for your metrics are functions that your business users will care about. Seldom are they the same because most metrics don't provide smooth gradients that SGD can use to update your model's weights.

Examples of common metrics:

error rate = "the proportion of images that were incorrectly identified."

accuracy = the proportion of images that were correctly identified (1 - error rate)
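
As a quick sanity check on those two definitions, here is a plain-Python sketch. fastai ships its own metrics that work on tensors; the functions below are my own simplified stand-ins operating on lists of labels, and the predictions are made up.

```python
def error_rate(preds, targets):
    """Proportion of items that were incorrectly identified."""
    wrong = sum(p != t for p, t in zip(preds, targets))
    return wrong / len(targets)

def accuracy(preds, targets):
    """Proportion of items that were correctly identified."""
    return 1 - error_rate(preds, targets)

# Made-up predictions and labels for four images.
preds = ["cat", "dog", "cat", "cat"]
targets = ["cat", "cat", "cat", "dog"]

err = error_rate(preds, targets)
acc = accuracy(preds, targets)
```

Notice that flipping a single prediction changes these numbers in a jump, which is why they make poor loss functions: there is no smooth slope for SGD to follow.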

Validation & Test Sets

What is a validation set?

A validation set (also known as the "development set") does not include any data from the training set. Its purpose is to gauge the generalization prowess of your model and also to ensure you are neither overfitting nor underfitting.

If [the model] makes an accurate prediction for a data item, that should be because it has learned characteristics of that kind of item, and not because the model has been shaped by actually having seen that particular item.

Why do we need a validation set?

[because] what we care about is how well our model works on previously unseen images ... the longer you train for, the better your accuracy will get on the training set ... as the model starts to memorize the training set rather than finding generalizable underlying patterns in the data = overfitting

Overfitting happens when the model "remembers specific features of the input data, rather than generalizing well to data not seen during training."
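
One simple way to read loss curves for this is to flag the epochs where the training loss went down while the validation loss went up. The sketch below does just that in plain Python; the function name and the loss values are made up for illustration, and real curves are noisier, so treat it as a rough heuristic rather than a rule.

```python
def overfitting_epochs(train_losses, valid_losses):
    """Return the epochs where training loss fell while validation loss rose."""
    flagged = []
    for epoch in range(1, len(train_losses)):
        train_improved = train_losses[epoch] < train_losses[epoch - 1]
        valid_worsened = valid_losses[epoch] > valid_losses[epoch - 1]
        if train_improved and valid_worsened:
            flagged.append(epoch)
    return flagged

# Made-up loss curves (one value per epoch), for illustration only.
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15]
valid_losses = [0.95, 0.70, 0.55, 0.60, 0.68]

flagged = overfitting_epochs(train_losses, valid_losses)
```

Through epoch 2 the validation loss is still improving, just more slowly than the training loss, so nothing is flagged there.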

Important: Your models should always overfit before anything else. Overfitting is when your training loss gets better while your validation loss gets worse ... in other words, if your validation loss is improving, even if not to the extent of your training loss, you are not overfitting.
Important: ALWAYS include a validation set.
Important: ALWAYS measure your accuracy (or any metrics) on the validation set.
Important: Set the seed parameter so that you "get the same validation set every time" so that "if we change our model and retrain it, we know any differences are due to the changes to the model, not due to having a different random validation set."
The validation set also informs us how we may change the hyperparameters (e.g., model architecture, learning rates, data augmentation, etc...) to improve results. These parameters are NOT learned ... they are choices WE make that affect the learning of the model parameters.
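
Here is a small plain-Python sketch of why the seed matters. `split_indices`, `valid_pct`, and the default values are hypothetical names and numbers of my own, not the fastai API; the point is only that a fixed seed yields the same held-out items on every run.

```python
import random

def split_indices(n_items, valid_pct=0.2, seed=42):
    """Shuffle item indices with a fixed seed and hold out a validation slice."""
    rng = random.Random(seed)      # private generator, so the split is repeatable
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_valid = int(n_items * valid_pct)
    return indices[n_valid:], indices[:n_valid]  # (train, valid)

# Two separate "runs" with the same seed produce the same split.
train_a, valid_a = split_indices(100)
train_b, valid_b = split_indices(100)
```

With the split pinned down, a change in the validation metric between two runs can be attributed to the model change rather than to a luckier or unluckier validation set.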

What is a test set?

A test set ensures that we aren't overfitting our hyperparameter choices; it is held back even from ourselves and used to evaluate the model at the very end.

Important: If evaluating 3rd party solutions, know how to create a good test set and how to create a good baseline model. Hold these out from the potential consultants and use them to fairly evaluate their work.

How do you define good validation and test sets?

See pp.50-54 ...

A key property of the validation and test sets is that they must be representative of the new data you will see in the future.

For time series, that means you'll likely want to make your validation set a continuous section with the latest dates.
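
A minimal sketch of that idea in plain Python, assuming each row is a `(date, value)` tuple. The function name, the 25% default, and the data are my own made-up choices.

```python
from datetime import date

def split_by_date(rows, valid_pct=0.25):
    """Sort rows by date and keep the most recent slice as the validation set."""
    ordered = sorted(rows, key=lambda row: row[0])
    n_valid = max(1, int(len(ordered) * valid_pct))
    return ordered[:-n_valid], ordered[-n_valid:]  # (train, valid)

# Made-up monthly observations, deliberately out of order.
rows = [(date(2020, month, 1), month * 10) for month in (3, 1, 4, 2, 8, 6, 5, 7)]

train_rows, valid_rows = split_by_date(rows)
```

A random split here would let the model peek at dates on both sides of each validation point, which is easier than the forecasting job it will face in production.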

You'll also want to make sure your model isn't learning particular ancillary features of particular things in your images (e.g., you want to see how your model performs on a person or boat it hasn't seen before ... see pp.53-54 for examples).

Q & A & Best Practices

What is a Transform

A Transform contains code that is applied automatically during training. There are two kinds ...

  1. item_tfms: Applied to each item
  2. batch_tfms: Applied to a batch of items at a time using the GPU
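
The ordering of the two kinds can be mimicked in plain Python, with lists of numbers standing in for images and no GPU involved. `resize_item` and `normalize_batch` are hypothetical functions of my own, not fastai transforms. My assumption is that the per-item step exists largely so that every item ends up the same size and can be stacked into a batch.

```python
def resize_item(item, size=4):
    """Item transform: crop or zero-pad ONE item to a fixed length."""
    return (item + [0.0] * size)[:size]

def normalize_batch(batch):
    """Batch transform: scale every value by the largest value in the batch."""
    peak = max(max(row) for row in batch) or 1.0  # fall back to 1.0 if peak is 0
    return [[value / peak for value in row] for row in batch]

# Made-up "images" of different sizes.
items = [[2.0, 4.0], [1.0, 3.0, 5.0, 7.0, 9.0], [8.0]]

batch = [resize_item(item) for item in items]  # applied item by item
batch = normalize_batch(batch)                 # applied once to the whole batch
```

Only after the item step do all rows have the same length, which is what makes a single whole-batch operation possible.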

Why do we make images 224x224 pixels?

This is the standard size for historical reasons (old pretrained models require this size exactly) ... If you increase the size, you'll often get a model with better results since it will be able to focus on more details.

Important: Train on progressively larger image sizes using the weights trained on smaller sizes as a kind of pretrained model.

What is a ResNet & Why use it for computer vision tasks?

A ResNet is a model architecture that has proven to work well in CV tasks. Several variants exist with different numbers of layers with the larger architectures taking longer to train and more prone to overfitting especially with smaller datasets.

Important: Start with a smaller ResNet (like 18 or 34) and move up as needed.
Important: If you have a lot of data, the bigger resnets will likely give you better results.
And what other things can you use image recognizers for besides image tasks? Sound, time series, malware classification ...

... a good rule of thumb for converting a dataset into an image representation: if the human eye can recognize categories from the images, then a deep learning model should be able to do so too. See pp.36-39

How can we see what our NNs are actually learning/doing?

See pp.33-36. Being able to inspect what your NN is doing (e.g., looking at the activations and the gradients) is one of the most important things you can learn as they are often the key to improving results.

Important: Learn how to visualize and understand your activations and gradients!

What is the difference between categorical and continuous datatypes?

Categorical data "contains values that are one of a discrete set of choices" such as gender, occupation, day of week, etc...

Continuous data is numerical data that represents a quantity such as age, salary, prices, etc...

Important: For tasks that predict a continuous number, consider using y_range to constrain the network to predicting a value in the known range of valid values. (see p.47)
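
My understanding is that y_range works by squashing the network's raw output through a sigmoid and then rescaling it to the requested range. The plain-Python sketch below assumes exactly that; the 0.5 to 5.5 range and the raw outputs are made-up values for illustration.

```python
import math

def sigmoid_range(x, low, high):
    """Squash a raw output x into the interval between low and high."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

# Made-up raw network outputs, from very negative to very positive.
raw_outputs = [-50.0, -2.0, 0.0, 2.0, 50.0]
constrained = [sigmoid_range(x, 0.5, 5.5) for x in raw_outputs]
```

However extreme the raw output, the prediction stays inside the valid range, so the network doesn't have to spend capacity learning the limits on its own.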


  1. https://book.fast.ai - The book's website; it's updated regularly with new content and recommendations on everything from which GPUs to use to how to run things locally and in the cloud, etc...

  2. https://course.fast.ai/datasets - A variety of slimmed down datasets you can use for various DL tasks that support "rapid prototyping and experimentation."

  3. https://huggingface.co/docs/datasets/ - Serves a similar purpose to the fastai datasets but for the NLP domain. Includes metrics and full/sub-set datasets that you can use to benchmark your results against the top guns of deep learning.

    Important: Start with a smaller dataset and scale up to full size to accelerate modeling!