- What is a training set?
- What is a validation set?
- What is a test set?
- How to create good validation and test sets
Here we look at the differences between training, validation, and test sets, as well as strategies and best practices for building each.
A training set consists of the data your model sees during training. These are the inputs and labels your model uses to compute the loss and update its parameters in a way that will hopefully lead to a model that works well for its given task.
Because a model needs something to train on. The training set should be representative of the data the model will see in the future, and it should be updated if/when you discover that is not the case.
To train a model on examples resembling those the model will see in the future. More data is generally better, but quality is king (bad data in, bad data out).
To provide augmented examples for your model to see, both to increase the number of training examples and to better reflect what the model may see in the real world.
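The mechanics described above can be sketched as a single training step. This is a minimal illustration, not the source's code: the one-parameter model, the squared-error loss, and all the names here are assumptions made for the example.

```python
# A minimal sketch of a training step: compute the loss on the training
# examples, then nudge the parameter in the direction that reduces it.
# Hypothetical one-parameter model: prediction = w * x.

def train_step(w, inputs, labels, lr=0.05):
    # Mean squared error over the training examples
    loss = sum((w * x - y) ** 2 for x, y in zip(inputs, labels)) / len(inputs)
    # Gradient of the loss with respect to w (derived by hand here)
    grad = sum(2 * (w * x - y) * x for x, y in zip(inputs, labels)) / len(inputs)
    # Update the parameter to lower the loss
    return w - lr * grad, loss

inputs, labels = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # true relation: y = 2x
w = 0.0
for _ in range(100):
    w, loss = train_step(w, inputs, labels)
```

After repeated steps, `w` converges toward the underlying relation (here, 2.0), which is exactly what "using the loss to update the parameters" means in practice.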
A validation set (also known as the "development set") does not include any data from the training set. Its purpose is to gauge the generalization prowess of your model and to ensure you are neither overfitting nor underfitting.
"If [the model] makes an accurate prediction for a data item, that should be because it has learned characteristics of that kind of item, and not because the model has been shaped by actually having seen that particular item." 1
"[because] what we care about is how well our model works on previously unseen images ... the longer you train for, the better your accuracy will get on the training set ... as the model starts to memorize the training set rather than finding generalizable underlying patterns in the data = overfitting" 2
Overfitting happens when the model "remembers specific features of the input data, rather than generalizing well to data not seen during training." 3
Use the seed parameter so that you "get the same validation set every time," so that "if we change our model and retrain it, we know any differences are due to the changes to the model, not due to having a different random validation set."
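The effect of seeding the split can be sketched in a few lines of plain Python (a minimal illustration; the function name and split fraction are assumptions, not from the source):

```python
import random

def make_split(items, valid_frac=0.2, seed=42):
    # Shuffle a copy with a fixed seed so the split is reproducible:
    # rerunning with the same seed yields the same validation set.
    items = list(items)
    random.Random(seed).shuffle(items)
    n_valid = int(len(items) * valid_frac)
    return items[n_valid:], items[:n_valid]  # (train, valid)

data = list(range(100))
train_a, valid_a = make_split(data)
train_b, valid_b = make_split(data)

assert valid_a == valid_b                # same seed -> same validation set
assert not set(train_a) & set(valid_a)   # no leakage between the sets
```

With a fixed seed, any change in results between runs can be attributed to the model, not to a different random split.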
It gives us a sense of how well our model is doing on examples it hasn't seen, which makes sense since the ultimate worth of a model is in how well it generalizes to things unseen in the future.
The validation set also informs us how we may change the hyperparameters (e.g., model architecture, learning rates, data augmentation, etc...) to improve results. These parameters are NOT learned ... they are choices WE make that affect the learning of the model parameters. 5
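Tuning a hyperparameter against the validation set might look like the sketch below. The tiny model, the data, and the candidate learning rates are all stand-ins invented for this example:

```python
# Sketch: choosing a hyperparameter (the learning rate) by validation loss.
# Hypothetical one-parameter model fit by gradient descent.

def fit(inputs, labels, lr, steps=50):
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(inputs, labels)) / len(inputs)
        w -= lr * grad
    return w

def mse(w, inputs, labels):
    return sum((w * x - y) ** 2 for x, y in zip(inputs, labels)) / len(inputs)

train_x, train_y = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
valid_x, valid_y = [4.0, 5.0], [8.1, 9.8]

# The learning rate is OUR choice, not a learned parameter; we compare
# candidates on the validation set (never on the test set).
best_lr = min([0.001, 0.01, 0.05],
              key=lambda lr: mse(fit(train_x, train_y, lr), valid_x, valid_y))
```

Note that because we chose `best_lr` by looking at validation performance, the validation set has now indirectly shaped the model, which is precisely why a separate test set is needed.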
A test set ensures that we aren't overfitting our hyperparameter choices; it is held back even from ourselves and used to evaluate the model at the very end.
"[Since we] are evaluating the model by looking at predictions on the validation data when we decide to explore new hyperparameter values ... subsequent version of the model are, indirectly, shaped by us having seen the validation data ... [and therefore], we are in danger of overfitting the validation data through human trial and error and exploration." 6
Note: A key property of the validation and test sets is that they must be representative of the new data you will see in the future.
To ensure we aren't inadvertently causing the model to overfit via our hyperparameter tuning which happens as a result of us looking at the validation set. It is a completely hidden dataset; it isn't used for training or tuning, only for measuring performance.
If evaluating third-party solutions, you'll want to know how to create a good test set and how to create a good baseline model. Hold these back from the potential consultants and use them to fairly evaluate their work.
To ensure you aren't overfitting your model as a result of examining the validation set. As with the validation set, a good test set offers further assurance that your model isn't learning ancillary features of specific items in your images.
It isn't always as easy as randomly shuffling your data!
Again, what both of these sets should have in common is that they "must be representative of the new data you will see in the future." And what this looks like often depends on your use case and task.
First, consider cases where historical data is required to predict the future; for example, quant traders use "backtesting to check whether their models are predictive of future periods, based on past data" 7
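For time-ordered data like this, a common approach (a sketch under assumed field names, not the source's code) is to split by a cutoff date rather than shuffling randomly:

```python
from datetime import date

# Hypothetical time-stamped records; the field names are assumptions.
records = [{"date": date(2023, 1, d), "value": d} for d in range(1, 31)]

# Split by a cutoff date: train on the past, validate on the "future".
# A random shuffle here would leak future information into training.
cutoff = date(2023, 1, 21)
train = [r for r in records if r["date"] < cutoff]
valid = [r for r in records if r["date"] >= cutoff]

# Every training record precedes every validation record.
assert max(r["date"] for r in train) < min(r["date"] for r in valid)
```

This mirrors backtesting: the model is always evaluated on a period strictly later than anything it trained on.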
"A second common case occurs when you can easily anticipate ways the data you will be making predictions for in production may be qualitatively different from the data you have to train your model with." 9
As an example of this, the Kaggle distracted driver competition is used. In it, based on pictures of drivers, you need to predict categories of distraction. Since the goal of such a model would be to make good predictions against drivers the model hasn't seen, it would make sense to create a validation set and also a test set consisting of specific drivers the training set doesn't include (in fact, the competition's test set is exactly that!). "If you used all the people in training your model, your model might be overfitting to particularities of those specific people and not just learning the states (texting, eating, etc.)." 10
Another example of this is the Kaggle fisheries competition where the objective is to predict the species of fish caught on fishing boats. As the goal of such a model is to predict the species on other/future boats, it makes sense then that "the test set consisted of images from boats that didn't appear in the training data, so in this case you'd want your validation set to also include boats that are not in the training set." 11
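Splitting by driver (or boat) rather than by individual image can be sketched as a group-wise split. This is a minimal plain-Python illustration; the group labels and record shape are invented for the example:

```python
import random

# Hypothetical labeled images keyed by the driver (or boat) they came from.
images = [{"group": g, "img": f"img_{g}_{i}"} for g in "ABCDE" for i in range(4)]

# Split by GROUP, not by example: every image from a given driver/boat
# lands entirely in train or entirely in validation.
groups = sorted({im["group"] for im in images})
random.Random(0).shuffle(groups)
valid_groups = set(groups[:2])

train = [im for im in images if im["group"] not in valid_groups]
valid = [im for im in images if im["group"] in valid_groups]

# No driver/boat appears on both sides of the split.
assert not {im["group"] for im in train} & {im["group"] for im in valid}
```

A naive random shuffle of the images would put some photos of each driver in both sets, letting the model score well by recognizing people rather than distraction states.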
For a stellar example of how this looks in practice, see this thread from Boris Dayma on an issue he noticed when looking at his results on the training and validation sets. Note how his EDA was directed via training a model ... and also make sure to read through all the comments, replies, etc... for other things to pay attention to when seeing unusual results during training (there is a lot of good stuff there). Ultimately, in his case, what he found was that the dataset was imbalanced and the imbalanced data was getting lumped together in the same batches due to a poor shuffling strategy. He documents his fix in a subsequent thread, so check that out too.
2. Ibid., p.29↩
5. Ibid., p.49↩
7. Ibid., p.53↩
8. Ibid., p.51. There are some really good illustrations on pp.51 and 52 with some follow-up intuition on page 53 with respect to time series splits↩
9. Ibid., p.53↩
11. Ibid. pp.53-54↩