A Journey Through Fastbook (AJTFB) - Chapter 9: Tabular Modeling
In chapter 8 of "Deep Learning for Coders with fastai & PyTorch" we saw how categorical data can be represented as *numbers*. "Structured" or "tabular" data describes datasets that look like an Excel spreadsheet or a relational database table, which may be composed of categorical and/or real-number columns. Working with such data is the subject of chapter 9, so let's go!
- Tabular Modeling
- Categorical Embeddings
- Imports
- Data preparation
- Creating our TabularPandas
- Approach 1: Decision Trees
- Approach 2: Random Forests
- Step 1: Define your Random Forest
- Step 2: Determine why our validation set is worse than training
- Model Interpretation
- How confident are we in our predictions using a particular row of data?
- Which columns are the strongest predictors (and which can we ignore)?
- Which columns are effectively redundant?
- How do we find the relationship between two predictors (columns)?
- For predicting a specific row of data, what were the most important factors and how did they influence the prediction?
- The "Extrapolation" problem
- Approach 3: Neural Networks
- Approach 4: Boosting
- Other things to try
- In summary
- Resources
Other posts in this series:
A Journey Through Fastbook (AJTFB) - Chapter 1
A Journey Through Fastbook (AJTFB) - Chapter 2
A Journey Through Fastbook (AJTFB) - Chapter 3
A Journey Through Fastbook (AJTFB) - Chapter 4
A Journey Through Fastbook (AJTFB) - Chapter 5
A Journey Through Fastbook (AJTFB) - Chapter 6a
A Journey Through Fastbook (AJTFB) - Chapter 6b
A Journey Through Fastbook (AJTFB) - Chapter 7
A Journey Through Fastbook (AJTFB) - Chapter 8
Tabular Modeling
What is it?
"Tabular modeling takes data in the form of a table (like a spreadsheet or CSV). The objective is to predict the value of one column based on the values in the other columns." Tabular data is also called "structured data" while "unstructured data" represents things like text, images, audio, etc...
Why is it important?
Though it is reported that 80-90% of data is unstructured (think images, text, audio), ironically, it appears that the vast majority of "real world" machine learning is concerned with tabular/structured data.
If you are just starting out with Data Science you should know that still the vast majority of DS problems in the industry concern structured/tabular data. This is what you should focus on in order to make a professional inroad.
— Bojan Tunguz (@tunguz) February 23, 2022
What approaches work best?
- For structured data: ensembles of decision trees (e.g., random forests and gradient boosting machines like XGBoost).
- For unstructured data: multilayered neural networks learned with SGD.
Note: "... ensembles of decision trees tend to train faster, are often easier to interpret, do not require special GPU hardware for inference at scale, and often require less hyperparameter tuning.Important: Since a "critical step of interpreting a model of tabular data is significantly easier for decesion tree ensembles ... ensembles of decision trees are our first approach for analyzing a new tabular dataset"Important: Neural networks will considered when "there are some high-cardinality categorical variables" or there are columns with unstructured data. A example of a high "cardinality" (e.g., the number of discrete levels representing the categories) would be something like zip code.
See pages 282-284 for more discussion on the pros/cons of decision trees and neural networks for tabular data.
Categorical Embeddings
Continuous v. Categorical
Continuous variables "contain real numbers that can be fed into a model directly and are meaningful in and of themselves." Examples include "age" and "price".
Categorical variables "contain a number of discrete levels, such as 'movie ID,' for which addition and multiplication don't have any meaning (even if they're stored as numbers). Other examples include dates, columns indicating "sex", "gender", "department", etc...
How do we represent "categorical" data?
As we learned in chapter 8, we represent such data via entity embeddings.
Because "an embedding layer is exactly equivalent to placing an ordinary linear layer after every one-hot-encoded input layer ... the embedding transforms the categorical variables into inputs that are both continuous and meaningful."
In other words ...
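Here's a minimal sketch of that equivalence (my own illustration, not code from the book): indexing into an embedding gives the same result as multiplying a one-hot encoding of the category indexes by the embedding's weight matrix.

```python
import torch
import torch.nn.functional as F

# an embedding for a categorical variable with 5 levels, each mapped to a 3-d vector
emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)

idxs = torch.tensor([0, 3, 4])  # three category indexes
one_hot = F.one_hot(idxs, num_classes=5).float()

# embedding lookup == one-hot matrix @ embedding weights
print(torch.allclose(emb(idxs), one_hot @ emb.weight))  # True
```

The embedding layer simply does the lookup directly, which is much faster than materializing the one-hot matrix.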
Imports
from kaggle import api
from dtreeviz.trees import *
from fastai.tabular.all import *
from IPython.display import Image, display_svg, SVG
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor,export_graphviz
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
Step 1: Get the data
We'll be getting the data from Kaggle. If you're running on Colab, check out these instructions for getting set up with the kaggle API.
path = Path("bluebook")
path
if not path.exists():
path.mkdir()
api.competition_download_cli("bluebook-for-bulldozers", path=path)
file_extract(path/"bluebook-for-bulldozers.zip")
path.ls(file_type="text")
Step 2: EDA
Specify `low_memory=False` "unless Pandas actually runs out of memory." The default (`low_memory=True`) will look only at the first few rows of data to infer column datatypes.
train_df = pd.read_csv(path/"TrainAndValid.csv", low_memory=False)
test_df = pd.read_csv(path/"Test.csv", low_memory=False)
train_df.columns
`describe()` is a method that gives you some basic stats for each column.
train_df.describe().T
`advanced_describe()` is a method I created that builds on top of the default `describe()` method to include stats on missing and unique values (both very helpful for cleanup, for spotting potential issues, and for determining the size of your embeddings for categorical data). For categorical variables with few discrete levels, it will also show you what those levels are.
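The helper itself isn't reproduced here, but a minimal sketch of what such a function might look like (my own approximation, not necessarily the author's implementation) is:

```python
def advanced_describe(df, max_levels=10):
    "describe() plus missing-value and unique-value stats for each column."
    desc = df.describe(include="all").T
    desc["n_missing"] = df.isna().sum()
    desc["pct_missing"] = (df.isna().mean() * 100).round(2)
    desc["n_unique"] = df.nunique()
    # for low-cardinality (categorical-ish) columns, show the actual levels
    desc["levels"] = [
        df[col].dropna().unique().tolist() if df[col].nunique() <= max_levels else ""
        for col in df.columns
    ]
    return desc
```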
advanced_describe(train_df)
train_df.ProductSize.unique()
sizes = ['Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']
train_df.ProductSize = train_df.ProductSize.astype("category")
train_df.ProductSize = train_df.ProductSize.cat.set_categories(sizes, ordered=True) # note: "inplace=True" is deprecated as of pandas 1.3.0
train_df.ProductSize.unique()
Handling Your Dependent Variable(s)
"You should think carefully about which metric, or set of metrics, actually measures the notion of model quality that matters to you ... in this case, Kaggle tells us [our measure is] root mean squared log error (RMLSE)" and because of this we need to make our target the log of the price "so that the m_rmse
of that value will give us what we ultimately need."
dep_var = "SalePrice"
train_df[dep_var] = np.log(train_df[dep_var])
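A quick sanity check of why this works (a small illustration of my own, not from the book): the RMSLE of the raw prices is exactly the plain RMSE of the logged prices, so once the target is logged we can simply optimize and report RMSE as usual.

```python
import numpy as np

actual_price = np.array([10_000., 25_000., 60_000.])
pred_price   = np.array([12_000., 24_000., 50_000.])

# RMSLE on the raw prices ...
rmsle = np.sqrt(np.mean((np.log(pred_price) - np.log(actual_price)) ** 2))

# ... is the same as plain RMSE on the logged prices (what the model now predicts)
log_actual, log_pred = np.log(actual_price), np.log(pred_price)
rmse_of_logs = np.sqrt(np.mean((log_pred - log_actual) ** 2))

print(np.isclose(rmsle, rmse_of_logs))  # True
```

(Kaggle's formal RMSLE definition uses log(1 + x), but for prices of this magnitude the difference is negligible, so `np.log` above is effectively the same thing.)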
Handling Dates
Dates are "different from most ordinal values in that some dates are qualitatively different from others in a way that is often relevant to the systems we are modeling." As such, we want the model to know if whether the day is a holiday, or part of the weekend, or in a certain month, etc... is important. To do this, "we **replace every date column with a set of date metadata columns, such as holiday, day of week, and month" = categorical data that might be very useful!
We can use fastai's `add_datepart()` function to do this.
train_df = add_datepart(train_df, "saledate")
test_df = add_datepart(test_df, "saledate")
[col for col in train_df.columns if col.startswith("sale")]
Creating our TabularPandas
To prepare the data, we can use fastai's `TabularPandas` class, which allows us to apply `TabularProc` transforms to the DataFrame it wraps to do things like fill missing values, make columns categorical, etc.
- `Categorify`: "a TabularProc that replaces a column with a numeric categorical column"
- `FillMissing`: "a TabularProc that replaces missing values with the median of the column, and creates a new Boolean column that is set to True for any row where the value was missing." You can change this fill strategy via the `fill_strategy` argument.
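For example (a hedged sketch of fastai's API as I understand it; the defaults below are worth verifying against the docs), you can fill with the most frequent value instead of the median:

```python
# default behaviour: fill with the column median and add a boolean *_na indicator column
fill_median = FillMissing(fill_strategy=FillStrategy.median, add_col=True)

# alternative: fill with the most frequent value instead
fill_mode = FillMissing(fill_strategy=FillStrategy.mode)
```

(`FillStrategy.constant` is also available if you want to fill with a fixed value.)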
procs = [Categorify, FillMissing]
cont, cat = cont_cat_split(train_df, 1, dep_var=dep_var)
Step 2: Define our training and validation splits
What is the difference between validation and test sets again?
Recall that a validation set "is data we hold back from training in order to ensure that the training process does not overfit on the training data" ... while a test set "is data that is held back from ourselves in order to ensure that we don't overfit on the validation data as we explore various model architectures and hyperparameters."
Because this is a time series problem, we'll make the validation set include data for the last 6 months of the full training set, and the training set everything before that. See p.291 for more on this!
cond = (train_df.saleYear < 2011) | (train_df.saleMonth < 10)
train_idxs = np.where(cond)[0]
valid_idxs = np.where(~cond)[0]
splits = (list(train_idxs), list(valid_idxs))
to = TabularPandas(train_df, procs, cat, cont, y_names=dep_var, splits=splits)
type(to)
A `TabularPandas` object "behaves a lot like a fastai `Datasets` object, including `train` and `valid` attributes."
len(to.train), len(to.valid)
to.show(3)
to.items.head(3)
Categorical columns are replaced with numeric codes under the hood; we can see the mapping from codes to levels via the `classes` attribute:
to.classes["ProductSize"]
Save your `TabularPandas` object so you don't have to process the data again:
save_pickle(path/"to.pkl", to)
# load from filesystem
to = load_pickle(path/"to.pkl")
Approach 1: Decision Trees
"A decision tree asks a series of binary (yes or no) questions about the data. After each question, the data at that part of the tree is split between a Yes and a No branch.... After one or more questions, either a prediction can be made on the basis of all previous answers or another question is required."
"... for regression, we take the target mean of the items in the group"
train_xs, train_y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(train_xs, train_y)
draw_tree(m, train_xs, size=7, leaves_parallel=True, precision=2)
What is in each box above?
- The decision criterion for the best split that was found
- "samples": The # of examples in the group
- "value": The average value of the target for that group
- "squared_error": The MSE for that group
Note: "The top node represents the initial model before any splits have been done, when all the data is in one group. This is the simplest possible model. It is the result of asking zero questions and will always predict the value to be the average value of the whole dataset."
A "leaf node" is a node "with no answers coming out of them, because there are no more questions to be answered."
See p.293 for more on intrepreting the diagram above.
samp_idx = np.random.permutation(len(train_y))[:500]
dtreeviz(
m,
train_xs.iloc[samp_idx],
train_y.iloc[samp_idx],
train_xs.columns,
dep_var,
fontname="DejaVu Sans",
scale=1.6,
label_fontsize=10,
orientation="LR"
)
"This shows a cart of the distribution of the data for each split point"
dtreeviz
to find problems with the data.
For example, you can see there is a problem with "YearMade": a bunch of tractors show they were made in the year 1000. The likely explanation is that when the year a tractor was made isn't known, it is set to 1000 to indicate "unknown".
We can replace this with something like 1950 to make the visuals clearer ...
train_xs.loc[train_xs["YearMade"] < 1900, "YearMade"] = 1950
valid_xs.loc[valid_xs["YearMade"] < 1900, "YearMade"] = 1950
samp_idx = np.random.permutation(len(train_y))[:500]
dtreeviz(
m,
train_xs.iloc[samp_idx],
train_y.iloc[samp_idx],
train_xs.columns,
dep_var,
fontname="DejaVu Sans",
scale=1.6,
label_fontsize=10,
orientation="LR"
)
m = DecisionTreeRegressor()
m.fit(train_xs, train_y)
def r_mse(preds, targs):
return round(math.sqrt(((preds - targs)**2).mean()), 6)
def m_rmse(m, xs, y):
return r_mse(m.predict(xs), y)
m_rmse(m, train_xs, train_y)
m_rmse(m, valid_xs, valid_y)
What does the above indicate?
That we might be overfitting ... badly (because our training set is perfect and our validation set not so much).
Why are we overfitting?
Because "we have nearly as many leaf nodes as data points ... sklearn's default settings allow it to continue splitting nodes until there is only one item in each leaf node." See pp.295-296 for more intuition on why trees overfit.
m.get_n_leaves(), len(train_xs)
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)
m_rmse(m, to.train.xs, to.train.y), m_rmse(m, to.valid.xs, to.valid.y)
`min_samples_leaf=25` = "Stop when all leaf nodes have a minimum of 25 samples."
A note on categorical variables
Decision trees don't have embedding layers, "so how can these untreated categorical variables do anything useful?"
Answer: "It just works!"
You could one-hot encode the categorical variables (e.g., with pandas' `get_dummies`), but "there is not really any evidence that such an approach improves the end result."
See p.297 for more about why one-hot encoding isn't necessary and why decision trees just work with categorical variables out-of-the-box.
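For reference, a one-hot encoding with pandas would look something like the sketch below (my own illustration, not the book's code); the point above is that the tree can split on the integer codes directly, so none of this is required.

```python
import pandas as pd

df = pd.DataFrame({"ProductSize": ["Large", "Small", "Mini", "Small"]})

# the integer codes a decision tree can split on directly ...
codes = df["ProductSize"].astype("category").cat.codes

# ... versus one boolean column per level via one-hot encoding
dummies = pd.get_dummies(df, columns=["ProductSize"])
print(dummies.columns.tolist())
# ['ProductSize_Large', 'ProductSize_Mini', 'ProductSize_Small']
```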
Approach 2: Random Forests
A random forest "is a model that averages the predictions of a large number of decision trees, which are generated by randomly varying various parameters that specify what data is used to train the tree (what columns and rows are included in each tree) and other tree parameters"
Why does it work so well?
Because it utilizes bagging.
What is "bagging"?
From the "Bagging Predictors" paper ... "Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor.... The multiple versions are formed by making bootstrap replicates (a randomly chosen subset of rows) of the learning set."
This means that we can improve the performance of a model by training it multiple times with a different random subset of the data each time, and then averaging the predictions.
See p.298 for more details on how bagging works.
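To make the idea concrete, here is a hand-rolled sketch of bagging (my own simplification, not the book's or sklearn's implementation): train several trees, each on a different random subset of the rows, then average their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(xs, y, n_trees=10, sample_frac=0.75):
    "Train `n_trees` decision trees, each on a random bootstrap sample of the rows."
    trees = []
    for _ in range(n_trees):
        idxs = np.random.choice(len(y), int(len(y) * sample_frac), replace=True)
        trees.append(DecisionTreeRegressor(min_samples_leaf=25).fit(xs.iloc[idxs], y.iloc[idxs]))
    return trees

def bagged_predict(trees, xs):
    "Average the predictions of all the trees."
    return np.stack([t.predict(xs) for t in trees]).mean(0)

# usage (on the data prepared above):
# trees = bagged_trees(train_xs, train_y)
# r_mse(bagged_predict(trees, valid_xs), valid_y)
```

A random forest does essentially this, but also randomly subsets the columns considered at each split.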
Step 1: Define your Random Forest
def fit_rf(xs, y, n_estimators=40, max_samples=200_000, max_features=0.5, min_samples_leaf=5, **kwargs):
return RandomForestRegressor(
n_jobs=-1, # Tells sklearn to use all our CPUs to build the trees in parallel
n_estimators=n_estimators, # The number of trees
max_samples=max_samples, # The number of rows to sample for training ea. tree
max_features=max_features, # The number of columns to sample at each split
min_samples_leaf=min_samples_leaf, # Stop when all leaf nodes have at least this number of samples
oob_score=True
).fit(xs, y)
m = fit_rf(train_xs, train_y)
m_rmse(m, train_xs, train_y), m_rmse(m, valid_xs, valid_y)
Recommended hyperparameter values:
- `n_estimators`: "as high a number as you have time to train" ... more trees = more accurate
- `max_samples`: default (200,000)
- `max_features`: default ("auto") or 0.5
- `min_samples_leaf`: default (1) or 4
How to get the predictions for a SINGLE tree?
tree_preds = np.stack([t.predict(valid_xs.values) for t in m.estimators_]) # added .values (see: https://stackoverflow.com/a/69378867)
r_mse(tree_preds.mean(0), valid_y)
How does `n_estimators` impact model performance?
To answer this, we can increment the number of trees we use in our predictions one at a time like so:
plt.plot([r_mse(tree_preds[:i+1].mean(0), valid_y) for i in range(40)])
Step 2: Determine why our validation set is worse than training
"The OOB error is a way of measuring prediction error in the training dataset" based on rows not used in the training of a particular tree. "This allows us to see whether the model is overfitting, without needing a separate validation set."
"... out-of-bag error is a little like imagining that every tree therefore also has its own validation set" based on the prediction of rows not used in its training.
r_mse(m.oob_prediction_, train_y)
How confident are we in our predictions using a particular row of data?
Answer: "use the standard deviation of predictions across the trees, instead of just the mean. This tells us the relative confidence of predictions"
preds = np.stack([t.predict(valid_xs.values) for t in m.estimators_])
preds.shape #=> (# of trees, # of predictions)
preds_std = preds.std(0) # standard deviation across the trees (axis 0) -> one value per validation row
preds_std[:5]
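One way to put this to use (a sketch of my own, not from the book): attach the per-row standard deviation to the validation data and look at the auctions the forest is least sure about.

```python
# combine predictions and confidence into one DataFrame for inspection
conf_df = valid_xs.copy()
conf_df["pred_log_price"] = preds.mean(0)
conf_df["pred_std"] = preds_std

# the rows the trees disagree on most - predictions to treat with caution
conf_df.sort_values("pred_std", ascending=False).head()
```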
Which columns are the strongest predictors (and which can we ignore)?
def rf_feature_importance(m, df):
return pd.DataFrame({"cols": df.columns, "imp": m.feature_importances_}).sort_values("imp", ascending=False)
def plot_fi(fi_df):
return fi_df.plot("cols", "imp", "barh", figsize=(12,7), legend=False)
fi_df = rf_feature_importance(m, train_xs)
# Let's look at the 10 most important features
fi_df[:10]
plot_fi(fi_df[:30])
See p.304 for how feature importance is calculated.
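As a rough sketch of the idea (my own approximation of the algorithm, not sklearn's actual code): walk every split in every tree, credit the splitting feature with the impurity improvement that split produced (weighted by how many rows reach it), then normalize.

```python
import numpy as np

def tree_feature_importance(tree, n_features):
    "Approximate per-feature importance for a single fitted sklearn tree."
    t = tree.tree_
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node, no split here
            continue
        # impurity decrease produced by this split, weighted by the samples reaching it
        imp[t.feature[node]] += (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
    imp /= t.weighted_n_node_samples[0]  # scale by the total number of samples
    return imp / imp.sum()               # normalize so the importances sum to 1

# averaging this over every tree in the forest approximates m.feature_importances_
```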
cols_to_keep = fi_df[fi_df.imp > 0.005].cols
len(cols_to_keep)
train_xs_keep = train_xs[cols_to_keep]
valid_xs_keep = valid_xs[cols_to_keep]
m = fit_rf(train_xs_keep, train_y)
m_rmse(m, train_xs_keep, train_y), m_rmse(m, valid_xs_keep, valid_y)
The accuracy is about the same as before, but the model is much more interpretable ...
len(train_xs.columns), len(train_xs_keep.columns)
plot_fi(rf_feature_importance(m, train_xs_keep)[:30])
Which columns are effectively redundant?
cluster_columns(train_xs_keep)