Tabular Modeling

What is it?

"Tabular modeling takes data in the form of a table (like a spreadsheet or CSV). The objective is to predict the value of one column based on the values in the other columns." Tabular data is also called "structured data" while "unstructured data" represents things like text, images, audio, etc...

Why is it important?

Though it is reported that 80-90% of data is unstructured (think images, text, audio), ironically, it appears that the vast majority of "real world" machine learning is concerened with tabular/structured data.

Note: And here’s the good news ... "recent studies have shown that the vast majority of datasets can be be modeled with just two methods."

What are they?

  1. For structured data, ensembles of decision trees (e.g., random forests and gradient boosting machines like XGBoost).

  2. For unstructured data, multilayered neural networks learned with SGD.

    Note: "... ensembles of decision trees tend to train faster, are often easier to interpret, do not require special GPU hardware for inference at scale, and often require less hyperparameter tuning.
    Important: Since a "critical step of interpreting a model of tabular data is significantly easier for decesion tree ensembles ... ensembles of decision trees are our first approach for analyzing a new tabular dataset"
    Important: Neural networks will considered when "there are some high-cardinality categorical variables" or there are columns with unstructured data. A example of a high "cardinality" (e.g., the number of discrete levels representing the categories) would be something like zip code.

See pages 282-284 for more discussion on the pros/cons of decision trees and neural networks for tabular data.


Categorical Embeddings

Continuous v. Categorical

Continuous variables "contain a real numbers that be fed into a model directly and are meaningful in and out of themselves. Examples include "age" and "price".

Categorical variables "contain a number of discrete levels, such as 'movie ID,' for which addition and multiplication don't have any meaning (even if they're stored as numbers). Other examples include dates, columns indicating "sex", "gender", "department", etc...

How do we represent "categorical" data?

We learned this in chapter 8, we represent such data via entity embeddings.

Important: Entity embeddings allow for a more complex and learnt numerical representation of a thing. This representation is likely task/data specific to one degree or another such that "Sunday" may have one representation in a task predicting how many hours people work on a day, and another for a task attempting to predict the number of trades that will be executed on each day of the week.
Note: "... by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variable. [It] is especially useful for datasets with lots of high cardinality features.. See pp.278-282 for examples of this in relation to the Rossmann sales competition on kaggle.

Because "an embedding layer is exactly equivalent to placing an ordinary linear layer after every one-hot-encoded input layer ... the embedding transforms the categorical variables into inputs that are both continuous and meaningful."

In other words ...

Important: "... we provide the model fundamentally categorical data about discrete entities ... and the model learns an embedding for these entities that defines a continuous notion of distance between them."
Note: Given all this, "we can combine our continous embedding values with truly continuous input data [by just concatentating] the variables and feed[ing] the concatenation into our final dense layers. For an example of this, see the "Wide & Deep Learning for Recommender Systems" paper. See the below from page 282 on what that approach looks like.

Imports

from kaggle import api
from dtreeviz.trees import *
from fastai.tabular.all import *
from IPython.display import Image, display_svg, SVG
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor,export_graphviz

pd.options.display.max_rows = 20
pd.options.display.max_columns = 8

Data preparation

Step 1: Get the data

We'll be getting the data from kaggle. If you're running on colab, check out these instructions for getting setup with the kaggle API

path = Path("bluebook")
path
Path('bluebook')
if not path.exists():
  path.mkdir()
  api.competition_download_cli("bluebook-for-bulldozers", path=path)
  file_extract(path/"bluebook-for-bulldozers.zip")

path.ls(file_type="text")
Downloading bluebook-for-bulldozers.zip to bluebook
100%|██████████| 48.4M/48.4M [00:00<00:00, 112MB/s] 

(#7) [Path('bluebook/Machine_Appendix.csv'),Path('bluebook/Test.csv'),Path('bluebook/ValidSolution.csv'),Path('bluebook/median_benchmark.csv'),Path('bluebook/TrainAndValid.csv'),Path('bluebook/Valid.csv'),Path('bluebook/random_forest_benchmark_test.csv')]

Step 2: EDA

Important: "In any sort of data science work, it’s important to look at your data directly to make sure you understand the format, how it’s stored, what types of values it holds, etc."
Tip: "... it’s a good idea to also specify low_memory=False unless Pandas acutally runs out of memory." The default = True (will look only at the first few rows of data to infer column datatypes).
train_df = pd.read_csv(path/"TrainAndValid.csv", low_memory=False)
test_df = pd.read_csv(path/"Test.csv", low_memory=False)

train_df.columns
Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
       'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
       'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
       'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
       'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
       'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
       'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
       'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
       'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
       'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
       'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
       'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
       'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
      dtype='object')

describe() is a method that gives you some basic stats for each column.

train_df.describe().T
count mean std min 25% 50% 75% max
SalesID 412698.0 2.011161e+06 1.080068e+06 1139246.0 1421897.75 1645852.5 2261012.50 6333349.0
SalePrice 412698.0 3.121518e+04 2.314174e+04 4750.0 14500.00 24000.0 40000.00 142000.0
MachineID 412698.0 1.230061e+06 4.539533e+05 0.0 1088593.25 1284397.0 1478079.25 2486330.0
ModelID 412698.0 6.947202e+03 6.280825e+03 28.0 3261.00 4605.0 8899.00 37198.0
datasource 412698.0 1.351694e+02 9.646749e+00 121.0 132.00 132.0 136.00 173.0
auctioneerID 392562.0 6.585268e+00 1.715841e+01 0.0 1.00 2.0 4.00 99.0
YearMade 412698.0 1.899050e+03 2.921902e+02 1000.0 1985.00 1995.0 2001.00 2014.0
MachineHoursCurrentMeter 147504.0 3.522988e+03 2.716993e+04 0.0 0.00 0.0 3209.00 2483300.0

advanced_describe() is a method I created that builds on top of the default describe() method to include stats on missing and unique values (which are both very helpful in terms of cleanup, understanding potential issues, and in determining the size of your embeddings for categorical data). For categorical variables with few discrete levels, this method will also show you what they are.

advanced_describe(train_df)
count mean std min ... unique unique% unique_values dtype
SalesID 412698 2011161.16364 1080067.724498 1139246.0 ... 412698 7786.75 NaN int64
SalePrice 412698 31215.181414 23141.743695 4750.0 ... 954 18.00 NaN float64
MachineID 412698 1230061.436646 453953.25795 0.0 ... 348808 6581.28 NaN int64
ModelID 412698 6947.201828 6280.824982 28.0 ... 5281 99.64 NaN int64
datasource 412698 135.169361 9.646749 121.0 ... 6 0.11 [121, 132, 136, 149, 172, 173] int64
... ... ... ... ... ... ... ... ... ...
Backhoe_Mounting 80712 NaN NaN NaN ... 3 0.06 [nan, None or Unspecified, Yes] object
Blade_Type 81875 NaN NaN NaN ... 11 0.21 [nan, PAT, None or Unspecified, Semi U, VPAT, Straight, Angle, No, U, Landfill, Coal] object
Travel_Controls 81877 NaN NaN NaN ... 8 0.15 [nan, None or Unspecified, Differential Steer, Lever, Finger Tip, 2 Pedal, Pedal, 1 Speed] object
Differential_Type 71564 NaN NaN NaN ... 5 0.09 [Standard, nan, Limited Slip, No Spin, Locking] object
Steering_Controls 71522 NaN NaN NaN ... 6 0.11 [Conventional, nan, Command Control, Four Wheel Standard, Wheel, No] object

53 rows × 14 columns

Step 3: Preprocessing

Handling Ordinal columns

Tip: "... a good next step is to handle ordinal columns ... columns containing strings or similar, **but where those strings have a natural ordering."
train_df.ProductSize.unique()
array([nan, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large',
       'Compact'], dtype=object)

Important: "tell Pandas about a suitable ordering of these levels"
sizes = ['Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']
train_df.ProductSize = train_df.ProductSize.astype("category")
train_df.ProductSize = train_df.ProductSize.cat.set_categories(sizes, ordered=True) # note: "inplace=True" is depreciated as of 1.30
train_df.ProductSize.unique()
[NaN, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large', 'Compact']
Categories (6, object): ['Large' < 'Large / Medium' < 'Medium' < 'Small' < 'Mini' < 'Compact']

Handling Your Dependent Variable(s)

Tip: Update the dependent variable to suit your objective.

"You should think carefully about which metric, or set of metrics, actually measures the notion of model quality that matters to you ... in this case, Kaggle tells us [our measure is] root mean squared log error (RMLSE)" and because of this we need to make our target the log of the price "so that the m_rmse of that value will give us what we ultimately need."

dep_var = "SalePrice"
train_df[dep_var] = np.log(train_df[dep_var])

Handling Dates

Important: "... enrich our representation of dates"

Dates are "different from most ordinal values in that some dates are qualitatively different from others in a way that is often relevant to the systems we are modeling." As such, we want the model to know if whether the day is a holiday, or part of the weekend, or in a certain month, etc... is important. To do this, "we **replace every date column with a set of date metadata columns, such as holiday, day of week, and month" = categorical data that might be very useful!

Can use fastai's add_datepart() function to do this.

Important: Apply same preprocessing to both your train/evaluation and test sets!
train_df = add_datepart(train_df, "saledate")
test_df = add_datepart(test_df, "saledate")
[col for col in train_df.columns if col.startswith("sale")]
['saleYear',
 'saleMonth',
 'saleWeek',
 'saleDay',
 'saleDayofweek',
 'saleDayofyear',
 'saleIs_month_end',
 'saleIs_month_start',
 'saleIs_quarter_end',
 'saleIs_quarter_start',
 'saleIs_year_end',
 'saleIs_year_start',
 'saleElapsed']

Handling Strings and Missing Data

For this we can use fastai's TabularPandas class (allows us to apply TabularProc transforms to the DataFrame it wraps to do things like fill missing values, make columns categorical, etc...).

Categorify: "a TabularProc that replaces a column with a numeric categorical column"

FillMissing: "a TabularProc that replaces missing values with the median of the column, and **creates a new Boolean column that is set to True for any row where the value was missing. You can change this fill strategy via thefill_strategy` argument.

procs = [Categorify, FillMissing]

Creating our TabularPandas

Step 1: Define our continuous and categorical columns

We need to tell TabularPandas what columns are continumous and which are categorical, and we can use fastai's cont_cat_split to do that like so ...

cont, cat = cont_cat_split(train_df, 1, dep_var=dep_var)

Step 2: Define our training and validation splits

What is the difference between validation and test sets again?

Recall that a validation set "is data we hold back from training in order to ensure that the training process does not overfit on the training data" ... while a test set "is data that is held back from ourselves in order to ensure that we don't overfit on the validation data as we export various model architectures and hyperparameters."

Important: "define our validation data so that it has the same sort of relationship wot the training data as the test set will have."

Because this is a time series problem, we'll make the validation set include data for the last 6 months of the full training set, and the training set everything before that. See p.291 for more on this!

cond = (train_df.saleYear < 2011) | (train_df.saleMonth < 10)
train_idxs = np.where(cond)[0]
valid_idxs = np.where(~cond)[0]

splits = (list(train_idxs), list(valid_idxs))

Step 3: Build our TabularPandas object

And finally, we instantiate our TabularPandas object, passing in our data, procs, splits and dependent variables as such.

to = TabularPandas(train_df, procs, cat, cont, y_names=dep_var, splits=splits)
type(to)
fastai.tabular.core.TabularPandas

Note: "A TabularPandas behaves a lot like a fastai Datasets object, including train and valid attributes"
len(to.train), len(to.valid)
(404710, 7988)
to.show(3)
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc fiModelSeries fiModelDescriptor ProductSize fiProductClassDesc state ProductGroup ProductGroupDesc Drive_System Enclosure Forks Pad_Type Ride_Control Stick Transmission Turbocharged Blade_Extension Blade_Width Enclosure_Type Engine_Horsepower Hydraulics Pushblock Ripper Scarifier Tip_Control Tire_Size Coupler Coupler_System Grouser_Tracks Hydraulics_Flow Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start auctioneerID_na MachineHoursCurrentMeter_na SalesID MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleElapsed SalePrice
0 Low 521D 521 D #na# #na# #na# Wheel Loader - 110.0 to 120.0 Horsepower Alabama WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# None or Unspecified None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False False False 1139246 999089 3157 121 3.0 2004 68.0 2006 11 46 16 3 320 1.163635e+09 11.097410
1 Low 950FII 950 F II #na# Medium Wheel Loader - 150.0 to 175.0 Horsepower North Carolina WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# 23.5 None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False False False 1139248 117657 77 121 3.0 1996 4640.0 2004 3 13 26 4 86 1.080259e+09 10.950807
2 High 226 226 #na# #na# #na# #na# Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity New York SSL Skid Steer Loaders #na# OROPS None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# Auxiliary #na# #na# #na# #na# #na# None or Unspecified None or Unspecified None or Unspecified Standard #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# False False False False False False False False 1139249 434808 7009 121 3.0 2001 2838.0 2004 2 9 26 3 57 1.077754e+09 9.210340
to.items.head(3)
SalesID SalePrice MachineID ModelID ... saleIs_year_start saleElapsed auctioneerID_na MachineHoursCurrentMeter_na
0 1139246 11.097410 999089 3157 ... 1 1.163635e+09 1 1
1 1139248 10.950807 117657 77 ... 1 1.080259e+09 1 1
2 1139249 9.210340 434808 7009 ... 1 1.077754e+09 1 1

3 rows × 67 columns

Note: As you can see above, the underlying string values (the levels) have been replaced with a unique number associated to that level.

We can see that mapping via the classes attribute:

to.classes["ProductSize"]
['#na#', 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']

Tip: Save your preprocessed TabularPandas object so you don’t have to process the data again
save_pickle(path/"to.pkl", to)

# load from filesystem
to = load_pickle(path/"to.pkl")

Approach 1: Decision Trees

"A decision tree asks a series of binary (yes or no) questions about the data. After each question, the data at that part of the tree is split between a Yes and a No branch.... After one or more questions, either a prediction can be made on the basis of all previous answers or another question is required."

"... for regression, we take the target mean of the items in the group"

Note: See p.288 for how we find the right questions to ask.

Step 1: Define independent/dependent variables

After this, our data is completely numeric and there are no missing values!

train_xs, train_y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

Step 2: Build our tree

max_leaf_nodes=4 = "Stop when you have 4 leaf nodes"

m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(train_xs, train_y)
DecisionTreeRegressor(max_leaf_nodes=4)
draw_tree(m, train_xs, size=7, leaves_parallel=True, precision=2)
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> Tree 0 Coupler_System ≤ 0.5 squared_error = 0.48 samples = 404710 value = 10.1 1 YearMade ≤ 1991.5 squared_error = 0.42 samples = 360847 value = 10.21 0->1 True 2 squared_error = 0.12 samples = 43863 value = 9.21 0->2 False 3 squared_error = 0.37 samples = 155724 value = 9.97 1->3 4 ProductSize ≤ 4.5 squared_error = 0.37 samples = 205123 value = 10.4 1->4 5 squared_error = 0.31 samples = 182403 value = 10.5 4->5 6 squared_error = 0.17 samples = 22720 value = 9.62 4->6

What is in each box above?:

  1. The decision criterion for the best split that was found
  2. "samples": The # of examples in the group
  3. "value": The average value of the target for that group
  4. "squared_error": The MSE for that group
    Note: "The top node represents the initial model before any splits have been done, when all the data is in one group. This is the simplest possible model. It is the result of asking zero questions and will always predict the value to be the average value of the whole dataset."

A "leaf node" is a node "with no answers coming out of them, because there are no more questions to be answered."

See p.293 for more on intrepreting the diagram above.

samp_idx = np.random.permutation(len(train_y))[:500]

dtreeviz(
    m, 
    train_xs.iloc[samp_idx], 
    train_y.iloc[samp_idx], 
    train_xs.columns, 
    dep_var, 
    fontname="DejaVu Sans", 
    scale=1.6, 
    label_fontsize=10, 
    orientation="LR"
  )
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but DecisionTreeRegressor was fitted with feature names
  "X does not have valid feature names, but"
G node4 leaf5 node4->leaf5 leaf6 node4->leaf6 node1 node1->node4 leaf3 node1->leaf3 leaf2 node0 node0->node1 node0->leaf2 >

"This shows a cart of the distribution of the data for each split point"

Tip: Use dtreeviz to find problems with the data.

For example, you can see there is a problem with "YearMade" as a bunch of tractors show they are made in the year 1000. The likely explanation is that if they don't have the info on a tractor, they set it = 1000 to indicate "Unknown".

We can replace this with something like 1950 to make the visuals more clear ...

train_xs.loc[train_xs["YearMade"] < 1900, "YearMade"] = 1950
valid_xs.loc[valid_xs["YearMade"] < 1900, "YearMade"] = 1950
samp_idx = np.random.permutation(len(train_y))[:500]

dtreeviz(
    m, 
    train_xs.iloc[samp_idx], 
    train_y.iloc[samp_idx], 
    train_xs.columns, 
    dep_var, 
    fontname="DejaVu Sans", 
    scale=1.6, 
    label_fontsize=10, 
    orientation="LR"
  )
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but DecisionTreeRegressor was fitted with feature names
  "X does not have valid feature names, but"
G node4 leaf5 node4->leaf5 leaf6 node4->leaf6 node1 node1->node4 leaf3 node1->leaf3 leaf2 node0 node0->node1 node0->leaf2 >

Step 3: Build other trees

m = DecisionTreeRegressor()
m.fit(train_xs, train_y)
DecisionTreeRegressor()

Important: Create/Use the metric used by the competition (or your work)
def r_mse(preds, targs):
  return round(math.sqrt(((preds - targs)**2).mean()), 6)

def m_rmse(m, xs, y):
  return r_mse(m.predict(xs), y)
m_rmse(m, train_xs, train_y)
0.0
m_rmse(m, valid_xs, valid_y)
0.333111

What does the above indicate?

That we might be overfitting ... badly (because our training set is perfect and our validation set not so much).

Why are we overfitting?

Because "we have nearly as many leaf nodes as data points ... sklearn's default settings allow it to continue splitting nodes until there is only one item in each leaf node." See pp.295-296 for more intuition on why trees overfit.

m.get_n_leaves(), len(train_xs)
(324559, 404710)
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)

m_rmse(m, to.train.xs, to.train.y), m_rmse(m, to.valid.xs, to.valid.y)
(0.211677, 0.268059)

min_samples_leaf=25 = "Stop when all leaf nodes have a minimum of 25 samples"

A note on categorical variables

Decision trees **don't have embedding layers - "so how can these untreated categorical variables do anything useful"?

Answer: "It just works!"

Note: While it is possible to replace a categorical variable with multiple OHE columns using something like get_dummies, "there is not really any evidence that such an approach improves the end result."

See p.297 for more about why OHE aren't necessary and why decision tree just work with categorical variables out-of-the-box.


Approach 2: Random Forests

A random forest "is a model that averages the predictions of a large number of decision trees, which are generated by randomly varying various parameters that specify what data is used to train the tree (what columns and rows are included in each tree) and other tree parameters"

Why does it work so well?

Because it utlizes bagging.

What is "bagging"?

From the "Bagging Predictors" paper ... "Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor.... The multiple versions are formed by making bootstrap replicates (a randomly chosen subset of rows) of the learning set."

Note: "... although each of the models trained on a subset of data will make more errors than a model trained on the full dataset, those errors will not be correlated with each other. **Different models make different errors ... the average of those errors, therefore, is zero!"

This means that we can improve the performance of a model by training it multiple times with a different random subset of the data each time, and then averaging the predictions.

Note: Ensembling is a form of bagging

See p.298 for more details on how bagging works.

Step 1: Define your Random Forest

Since we need to define a variety of parameters that indicate the number of trees we want, how subsets of rows should be randomly chosen, how subsets of columns should likewise be randomly chosen, etc., we'll put the creation behind a function we can call.

def fit_rf(xs, y, n_estimators=40, max_samples=200_000, max_features=0.5, min_samples_leaf=5, **kwargs):
  return RandomForestRegressor(
      n_jobs=-1,                              # Tells sklearn to use all our CPUs to build the trees in parallel
      n_estimators=n_estimators,              # The number of trees
      max_samples=max_samples,                # The number of rows to sample for training ea. tree
      max_features=max_features,              # The number of columns to sample at each split
      min_samples_leaf=min_samples_leaf,      # Stop when all leaf nodes have at least this number of samples
      oob_score=True
  ).fit(xs, y)
m = fit_rf(train_xs, train_y)
m_rmse(m, train_xs, train_y), m_rmse(m, valid_xs, valid_y)
(0.170771, 0.232215)

Important: Random Forests are very sensitive to hyperparameter choices!

Recommended hyperparameter values:

  • n_estimators: "as high a number as you have time to train ... more trees = more accurate

  • max_samples: default (200,000)

  • max_features: default ("auto") or 0.5

  • min_samples_leaf: default (1) or 4

    Tip: Bigger forests using a smalle subset of features tens to be better (see chart below)

How to get the predictions for a SINGLE tree?

tree_preds = np.stack([t.predict(valid_xs.values) for t in m.estimators_]) # added .values (see: https://stackoverflow.com/a/69378867)

r_mse(tree_preds.mean(0), valid_y)
0.232215

How does n_estimators impact model performance?

To answer this, we can increment the number of trees we use in our predictions one at a time like so:

plt.plot([r_mse(tree_preds[:i+1].mean(0), valid_y) for i in range(40)])
[<matplotlib.lines.Line2D at 0x7f6562cc3cd0>]

Tip: Use the above technique to determine a good range of trees to try.

Step 2: Determine why our validation set is worse than training

Tip: Use the out-of-bag (OOB) error to determine if we’re overfitting, or if the validation set covers a different time period, or if it’s a bit of both.

"The OOB error is a way of measuring prediction error in the training dataset" based on rows not used in the training of a particular tree. "This allows us to see whether the model is overfitting, without needing a separate validation set."

"... out-of-bag error is a little like imagining that every tree therefore also has its own validation set" based on the prediction of rows not used in its training.

Important: "This is particularly beneficial in cases where we have only a small amount of training data" (we don’t necessarily have to remove items to create a validation set).
Note: If your OOB error is << than our validation set error, "something else is causing the error"
Note: "we compoare [OOB predictions] to our training labels "since this is being calculated on trees using the training set."
r_mse(m.oob_prediction_, train_y)
0.210601

Model Interpretation

How confident are we in our predictions using a particular row of data?

Answer: "use the standard deviation of predictions across the trees, instead of just the mean. This tells us the relative confidence of predictions"

Important: "...be more cautious of using the reulsts for rows where trees give very different results (higher standard deviations)"
Tip: This information is helpful in production where "if you were using this model to decide which items to bid on at auction, a low-confidence prediction might cause you to look more carefully at an item before you made a bid."
preds = np.stack([t.predict(valid_xs.values) for t in m.estimators_])
preds.shape #=> (# of trees, # of predictions)
(40, 7988)
preds_std = preds.std(0) # get rid of first dimension (the trees)
preds_std[:5]
array([0.25911739, 0.08550421, 0.11939131, 0.29835108, 0.16490343])

Which columns are the strongest predictors (and which can we ignore)?

"It's not normally enough to just know that a model can make accurate predictions - we also want to know how it's making predictions"

Answer: "feature importances give us this insight."

def rf_feature_importance(m, df):
  return pd.DataFrame({"cols": df.columns, "imp": m.feature_importances_}).sort_values("imp", ascending=False)

def plot_fi(fi_df):
  return fi_df.plot("cols", "imp", "barh", figsize=(12,7), legend=False)
fi_df = rf_feature_importance(m, train_xs)

# Let's look at the 10 most important features
fi_df[:10]
cols imp
57 YearMade 0.186662
6 ProductSize 0.133412
30 Coupler_System 0.098091
7 fiProductClassDesc 0.071924
32 Hydraulics_Flow 0.064008
65 saleElapsed 0.050607
54 ModelID 0.049313
3 fiSecondaryDesc 0.045434
1 fiModelDesc 0.033899
31 Grouser_Tracks 0.026686
plot_fi(fi_df[:30])
<matplotlib.axes._subplots.AxesSubplot at 0x7f6562b8dd90>

See p.304 for how feature importance is calculated.

Important: By removing variables of low importance we can still get good results while making our model simpler, more interpretable, and easier to maintain
cols_to_keep = fi_df[fi_df.imp > 0.005].cols
len(cols_to_keep)
21
train_xs_keep = train_xs[cols_to_keep]
valid_xs_keep = valid_xs[cols_to_keep]
m = fit_rf(train_xs_keep, train_y)
m_rmse(m, train_xs_keep, train_y), m_rmse(m, valid_xs_keep, valid_y)
(0.181366, 0.231985)

Important: Compare accuracy of full model with column subset to ensure equitable performance!

The accuracy is about the same as before, but the model is much more interpretable ...

len(train_xs.columns), len(train_xs_keep.columns)
(66, 21)
plot_fi(rf_feature_importance(m, train_xs_keep[:30]))
<matplotlib.axes._subplots.AxesSubplot at 0x7f6562afe190>

Which columns are effectively redundant?

Important: Like removing unimportant features, by removing redundant information, we make our model simpler, more interpretable, and easier to maintain
cluster_columns(train_xs_keep)