In this blog post, we will predict the sale price of bulldozers sold at auctions based on their usage, equipment type, and configuration. The data is sourced from auction result postings.
The data for this competition on Kaggle is split into three parts:
- train.csv is the training set, which contains data through the end of 2011.
- valid.csv is the validation set, which contains data from January 1, 2012 through April 30, 2012.
- test.csv is the test set, which contains data from May 1, 2012 through November 2012.
The key fields in train.csv are:

- `SalesID`: the unique identifier of the sale.
- `MachineID`: the unique identifier of a machine. A machine can be sold multiple times.
- `SalePrice`: what the machine sold for at auction.
- `saledate`: the date of the sale.
import fastbook
fastbook.setup_book()
Let’s import the libraries we’ll need throughout the notebook.
from fastbook import *
from kaggle import api
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
We’ll download the data from Kaggle using the Kaggle API.
= '{"username":"xxxx","key":"xxxx"}' creds
= Path('~/.kaggle/kaggle.json').expanduser()
cred_path if not cred_path.exists():
=True)
cred_path.parent.mkdir(exist_ok
cred_path.write_text(creds)0o600) cred_path.chmod(
Let’s pick a path to download the dataset to.
path = URLs.path('bluebook')
path
Path('/home/pranav/.fastai/archive/bluebook')
We’ll use the Kaggle API to download the dataset to that path and extract it.
if not path.exists():
    path.mkdir()
    api.competition_download_cli('bluebook-for-bulldozers', path=path)
    file_extract(path/'bluebook-for-bulldozers.zip')

Path.BASE_PATH = path
path.ls(file_type='text')
(#7) [Path('TrainAndValid.csv'),Path('Valid.csv'),Path('random_forest_benchmark_test.csv'),Path('Test.csv'),Path('median_benchmark.csv'),Path('ValidSolution.csv'),Path('Machine_Appendix.csv')]
The Dataset
Now that we have downloaded our dataset, let’s take a look at it. We’ll read the training set into a Pandas dataframe. We’ll specify `low_memory=False` unless Pandas actually runs out of memory and returns an error. The `low_memory` parameter, which is `True` by default, tells Pandas to look at only a few rows of data at a time to figure out what type of data is in each column.
df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
df.head()
| | SalesID | SalePrice | MachineID | ModelID | ... | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls |
|---|---|---|---|---|---|---|---|---|---|
0 | 1139246 | 66000.0 | 999089 | 3157 | ... | NaN | NaN | Standard | Conventional |
1 | 1139248 | 57000.0 | 117657 | 77 | ... | NaN | NaN | Standard | Conventional |
2 | 1139249 | 10000.0 | 434808 | 7009 | ... | NaN | NaN | NaN | NaN |
3 | 1139251 | 38500.0 | 1026470 | 332 | ... | NaN | NaN | NaN | NaN |
4 | 1139253 | 11000.0 | 1057373 | 17311 | ... | NaN | NaN | NaN | NaN |
5 rows × 53 columns
df.shape
(412698, 53)
The dataset contains 412,698 rows of data and 53 columns. That’s a lot of data for us to look at!
df.columns
Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
dtype='object')
Looking at the data, we can try to gauge what kind of information is in each column.
Let’s have a look at the ordinal columns, i.e. columns containing strings or similar but where those strings have a natural ordering. `ProductSize` seems like it could be one of those columns.
df['ProductSize'].unique()
array([nan, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large', 'Compact'], dtype=object)
We can tell Pandas about a suitable ordering of these levels.
sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'

df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
The metric we’ll be using is the root mean squared log error (RMSLE) between the actual and predicted prices, since that is what Kaggle suggests for this dataset. We take the log of the prices, so that the RMSE of that value will give us what we ultimately need.
dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])
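As a sanity check, here is a minimal sketch (with made-up prices and a hypothetical `rmsle` helper, not part of the original notebook) showing that the RMSE of the logged prices is exactly the RMSLE of the raw prices:

```python
import numpy as np

def rmsle(pred, actual):
    # hypothetical helper: RMSLE computed directly on raw prices
    return np.sqrt(np.mean((np.log(pred) - np.log(actual))**2))

def rmse(pred, actual):
    # plain RMSE
    return np.sqrt(np.mean((pred - actual)**2))

actual_prices = np.array([10_000., 25_000., 66_000.])
pred_prices   = np.array([12_000., 20_000., 70_000.])

# RMSE of the logged values equals RMSLE of the raw values
assert np.isclose(rmse(np.log(pred_prices), np.log(actual_prices)),
                  rmsle(pred_prices, actual_prices))
```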
Data Preprocessing
Handling Dates
Our model should know more than whether a date is more recent or less recent than another. We might want our model to make decisions based on that date’s day of the week, on whether a day is a holiday, on what month it is in, and so forth. To do this, we’ll replace every date column with a set of date metadata columns, such as holiday, day of week, and month. These columns provide categorical data that might be useful.
fastai comes with a function `add_datepart` that will do this for us when we pass a column name that contains dates.
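Under the hood this is roughly equivalent to expanding the date with plain pandas accessors; here is a toy sketch of the idea (it is not the full list of columns `add_datepart` creates):

```python
import pandas as pd
import numpy as np

tmp = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-16', '2012-03-26'])})
tmp['saleYear'] = tmp.saledate.dt.year
tmp['saleMonth'] = tmp.saledate.dt.month
tmp['saleDayofweek'] = tmp.saledate.dt.dayofweek
tmp['saleElapsed'] = tmp.saledate.astype(np.int64) // 10**9  # seconds since the epoch
tmp = tmp.drop('saledate', axis=1)  # the original date column is removed
```

With fastai, this is just a single call: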
df = add_datepart(df, 'saledate')
Let’s do the same for our test set.
df_test = pd.read_csv(path/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')
df.shape
(412698, 65)
We can see that instead of the original 53, we now have 65 columns in our dataset. Let’s see what the new columns are.
' '.join(o for o in df.columns if o.startswith('sale'))
'saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed'
Handling Strings and Missing Data
fastai’s class `TabularPandas` wraps a Pandas dataframe and provides a few conveniences. To populate a `TabularPandas`, we will use two `TabularProc`s: `Categorify` and `FillMissing`.
A `TabularProc` is like a regular `Transform`, except:

- it returns the exact same object that’s passed to it, after modifying the object in place.
- it runs the transform once, when the data is first passed in, rather than as the data is accessed.
`Categorify` is a `TabularProc` that replaces a categorical column with a numeric column. `FillMissing` is a `TabularProc` that replaces missing values with the median of the column, and creates a new Boolean column that is set to `True` for any row where the value was missing.
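To make that concrete, here is a minimal sketch on a toy dataframe (the column names are made up for illustration):

```python
small_df = pd.DataFrame({'color': ['red', 'green', None, 'red'],
                         'size_cm': [10.0, None, 12.0, 9.0],
                         'price': [1.0, 2.0, 3.0, 4.0]})
small_to = TabularPandas(small_df, [Categorify, FillMissing],
                         cat_names=['color'], cont_names=['size_cm'], y_names='price')
small_to.items               # 'color' is now numeric codes; the missing size_cm is filled
small_to.classes['color']    # mapping from codes back to the original strings
```

For our dataset, we just list the procs we want: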
procs = [Categorify, FillMissing]
`TabularPandas` will also handle splitting the dataset into training and validation sets for us. But we’ll want to define our validation data so that it has the same sort of relationship to the training data as the test set will have. As mentioned in the data description above, the test set covers a six-month period from May 2012, which is later in time than any date in the training set. It means that if we’re going to have a useful validation set, we’ll want it to be later in time than the training set. The Kaggle training data ends in April 2012, so we will define a narrower training set that consists only of the training data from before November 2011, and we’ll define a validation set consisting of data from after November 2011.
To do this we use `np.where`, which will return (as the first element of a tuple) the indices of all `True` values.
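For instance, a quick illustration of that behaviour on a small boolean array:

```python
np.where(np.array([True, False, True, True]))
# (array([0, 2, 3]),) -- so [0] gives us the indices themselves
```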
cond = (df.saleYear < 2011) | (df.saleMonth < 10)
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))
`TabularPandas` needs to be told which columns are continuous and which are categorical. We can handle that automatically using the helper function `cont_cat_split`.
cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
`TabularPandas` provides `train` and `valid` attributes.
len(to.train), len(to.valid)
(404710, 7988)
to.show(5)
UsageBand | fiModelDesc | fiBaseModel | fiSecondaryDesc | fiModelSeries | fiModelDescriptor | ProductSize | fiProductClassDesc | state | ProductGroup | ProductGroupDesc | Drive_System | Enclosure | Forks | Pad_Type | Ride_Control | Stick | Transmission | Turbocharged | Blade_Extension | Blade_Width | Enclosure_Type | Engine_Horsepower | Hydraulics | Pushblock | Ripper | Scarifier | Tip_Control | Tire_Size | Coupler | Coupler_System | Grouser_Tracks | Hydraulics_Flow | Track_Type | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls | saleIs_month_end | saleIs_month_start | saleIs_quarter_end | saleIs_quarter_start | saleIs_year_end | saleIs_year_start | auctioneerID_na | MachineHoursCurrentMeter_na | SalesID | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | saleYear | saleMonth | saleWeek | saleDay | saleDayofweek | saleDayofyear | saleElapsed | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Low | 521D | 521 | D | #na# | #na# | #na# | Wheel Loader - 110.0 to 120.0 Horsepower | Alabama | WL | Wheel Loader | #na# | EROPS w AC | None or Unspecified | #na# | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | 2 Valve | #na# | #na# | #na# | #na# | None or Unspecified | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Standard | Conventional | False | False | False | False | False | False | False | False | 1139246 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | 2006 | 11 | 46 | 16 | 3 | 320 | 1.163635e+09 | 11.097410 |
1 | Low | 950FII | 950 | F | II | #na# | Medium | Wheel Loader - 150.0 to 175.0 Horsepower | North Carolina | WL | Wheel Loader | #na# | EROPS w AC | None or Unspecified | #na# | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | 2 Valve | #na# | #na# | #na# | #na# | 23.5 | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Standard | Conventional | False | False | False | False | False | False | False | False | 1139248 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | 2004 | 3 | 13 | 26 | 4 | 86 | 1.080259e+09 | 10.950807 |
2 | High | 226 | 226 | #na# | #na# | #na# | #na# | Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity | New York | SSL | Skid Steer Loaders | #na# | OROPS | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Auxiliary | #na# | #na# | #na# | #na# | #na# | None or Unspecified | None or Unspecified | None or Unspecified | Standard | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | False | False | False | False | False | False | False | False | 1139249 | 434808 | 7009 | 121 | 3.0 | 2001 | 2838.0 | 2004 | 2 | 9 | 26 | 3 | 57 | 1.077754e+09 | 9.210340 |
3 | High | PC120-6E | PC120 | #na# | -6E | #na# | Small | Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons | Texas | TEX | Track Excavators | #na# | EROPS w AC | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | 2 Valve | #na# | #na# | #na# | #na# | #na# | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | False | False | False | False | False | False | False | False | 1139251 | 1026470 | 332 | 121 | 3.0 | 2001 | 3486.0 | 2011 | 5 | 20 | 19 | 3 | 139 | 1.305763e+09 | 10.558414 |
4 | Medium | S175 | S175 | #na# | #na# | #na# | #na# | Skid Steer Loader - 1601.0 to 1751.0 Lb Operating Capacity | New York | SSL | Skid Steer Loaders | #na# | EROPS | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Auxiliary | #na# | #na# | #na# | #na# | #na# | None or Unspecified | None or Unspecified | None or Unspecified | Standard | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | False | False | False | False | False | False | False | False | 1139253 | 1057373 | 17311 | 121 | 3.0 | 2007 | 722.0 | 2009 | 7 | 30 | 23 | 3 | 204 | 1.248307e+09 | 9.305651 |
We can see that the data is still displayed as strings for categories.
However, the underlying items are all numeric.
to.items.head(5)
| | SalesID | SalePrice | MachineID | ModelID | ... | saleIs_year_start | saleElapsed | auctioneerID_na | MachineHoursCurrentMeter_na |
|---|---|---|---|---|---|---|---|---|---|
0 | 1139246 | 11.097410 | 999089 | 3157 | ... | 1 | 1.163635e+09 | 1 | 1 |
1 | 1139248 | 10.950807 | 117657 | 77 | ... | 1 | 1.080259e+09 | 1 | 1 |
2 | 1139249 | 9.210340 | 434808 | 7009 | ... | 1 | 1.077754e+09 | 1 | 1 |
3 | 1139251 | 10.558414 | 1026470 | 332 | ... | 1 | 1.305763e+09 | 1 | 1 |
4 | 1139253 | 9.305651 | 1057373 | 17311 | ... | 1 | 1.248307e+09 | 1 | 1 |
5 rows × 67 columns
The conversion of categorical columns to numbers is done by simply replacing each unique level with a number. The numbers associated with the levels are chosen consecutively as they are seen in a column, so there’s no particular meaning to the numbers in categorical columns after conversion. The exception is if you first convert a column to a Pandas ordered category (as we did with `ProductSize`), in which case the ordering you chose is used.
We can see the mapping by looking at the `classes` attribute.
to.classes['ProductSize']
['#na#', 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']
Now that we are done with the preprocessing stage, let’s save our data.
save_pickle(path/'to.pkl', to)
Model Training
Now that our data is all numeric, and there are no missing values, we can train our model.
Let’s define our independent and dependent variables.
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
Decision Trees
Let us create a Decision Tree.
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs, y)
DecisionTreeRegressor(max_leaf_nodes=4)
We’ve deliberately kept this first model simple, with only four leaf nodes.
We can display the tree to see what it’s learned.
draw_tree(m, xs, size=7, leaves_parallel=True, precision=2)
We can also show the information using the dtreeviz library.
samp_idx = np.random.permutation(len(y))[:500]

dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
         fontname='DejaVu Sans', scale=1.6, label_fontsize=10, orientation='LR')
This shows a chart of the distribution of the data for each split point. Here we see one of the benefits of creating a simple model first: we can clearly see that there’s a problem with our `YearMade` data, which shows that there are bulldozers made in the year 1000. This is probably just a missing value code (a value that doesn’t otherwise appear in the data and that is used as a placeholder in cases where a value is missing). For modeling purposes, 1000 is fine, but this outlier makes visualizing the values we are interested in more difficult. So we’ll replace it with 1950.
xs.loc[xs['YearMade'] < 1900, 'YearMade'] = 1950
valid_xs.loc[valid_xs['YearMade'] < 1900, 'YearMade'] = 1950
Even though it doesn’t change the result of the model in any significant way, that change makes the split much clearer in the tree visualization.
m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs, y)

dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
         fontname='DejaVu Sans', scale=1.6, label_fontsize=10, orientation='LR')
Now let’s have the decision tree algorithm build a bigger tree. We’ll not pass in any stopping criteria.
m = DecisionTreeRegressor()
m.fit(xs, y)
DecisionTreeRegressor()
We’ll create a little function (`m_rmse`) to check the root mean squared error of our model.
def r_mse(pred, y):
    return round(math.sqrt(((pred-y)**2).mean()), 6)

def m_rmse(m, xs, y):
    return r_mse(m.predict(xs), y)
m_rmse(m, xs, y)
0.0
Hmm, here our model is showing an error of 0. But that doesn’t mean our model is perfect. It probably means that our model is badly overfitting. Let’s check the error on the validation set.
m_rmse(m, valid_xs, valid_y)
0.334935
Here’s why we’re overfitting so badly.
# no. of leaf nodes, no. of datapoints
m.get_n_leaves(), len(xs)
(324560, 404710)
We’ve got nearly as many leaf nodes as data points! That’s a little too much. sklearn’s default settings allow it to continue splitting nodes until there is only one item in each leaf node. Let’s change the stopping rule.
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
(0.248593, 0.323339)
That looks much better. Let’s check the number of leaves again.
m.get_n_leaves()
12397
Random Forests
A random forest is a model that averages the predictions of a large number of decision trees, which are generated by randomly varying various parameters that specify what data is used to train the tree and other tree parameters.
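To make the averaging idea concrete, here is a minimal hand-rolled sketch of bagging (assuming the `xs`, `y`, `valid_xs`, `valid_y` and `r_mse` defined above). It is only an illustration, not the model we’ll actually use:

```python
def one_tree(xs, y, n=50_000):
    # train a tree on a random bootstrap sample of the rows
    idx = np.random.choice(len(y), n, replace=True)
    return DecisionTreeRegressor(min_samples_leaf=25).fit(xs.iloc[idx], y.iloc[idx])

trees = [one_tree(xs, y) for _ in range(10)]
avg_preds = np.stack([t.predict(valid_xs) for t in trees]).mean(0)
r_mse(avg_preds, valid_y)  # averaging the trees usually beats any single one of them
```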
We create a random forest just like we create a decision tree, except now we’re also specifying the parameters that indicate how many trees should be in the forest, how we should subset the data items (the rows), and how we should subset the fields (the columns).
def rf(xs, y, n_estimators=100, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)
m = rf(xs, y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
(0.169731, 0.232174)
Our validation RMSE is much improved compared to the one produced by the single decision tree.
To see the impact of `n_estimators`, let’s get the predictions from each individual tree in our forest (these are in the `estimators_` attribute).
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
r_mse(preds.mean(0), valid_y)
0.232174
As you can see, `preds.mean(0)` gives the same results as our random forest.
Let’s see what happens to the RMSE as we add more and more trees.
plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(100)])
[<matplotlib.lines.Line2D at 0x7f039ea0bf10>]
As you can see, the improvement levels off quite a bit after around 30 trees.
The performance on our validation set is worse than on our training set. Random forests have out-of-bag (OOB) error, which can help us with this.
Model Interpretation
Out-of-Bag Error
In a random forest, each tree is trained on a different subset of the training data. The OOB error is a way of measuring prediction error on the training set by including in the calculation of a row’s error only those trees where that row was not used for training. This allows us to see whether the model is overfitting, without needing a separate validation set. This is particularly beneficial in cases where we have only a small amount of training data, as it allows us to see whether our model generalizes without removing items to create a validation set. The OOB predictions are available in the `oob_prediction_` attribute. We compare them to the training labels, since this is being calculated on trees using the training set.
r_mse(m.oob_prediction_, y)
0.208747
We can see that our OOB error is much lower than our validation set error. This means that something else is causing that error, in addition to normal generalization error.
Tree Variance for Prediction Confidence
How confident are we in our predictions using a particular row of data?
We saw how the model averages the individual tree’s predictions to get an overall prediction, i.e. an estimate of the value. But how can we know the confidence of the estimate? One simple way is to use the standard deviation of predictions across the trees, instead of just the mean. This tells us the relative confidence of the predictions. In general, we would want to be more cautious of using the results for rows where trees give very different results (higher standard deviations), compared to cases where they are more consistent (lower standard deviations).
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
preds.shape
(100, 7988)
Now we have a prediction for every tree and every auction (100 trees and 7,988 auctions) in the validation set. Using this, we can get the standard deviation of the predictions over all the trees, for every auction.
preds_std = preds.std(0)
Let’s find the standard deviations for the predictions for the first five auctions, i.e. the first five rows of the validation set.
preds_std[:5]
array([0.253279 , 0.10988732, 0.10787489, 0.25965299, 0.12552832])
As you can see, the confidence in the predictions varies widely. For some auctions, there is a low standard deviation because the trees agree. For others, it’s higher, as the trees don’t agree. If you were using this model to decide what items to bid on at auction, a low-confidence prediction might cause you to look more carefully at an item before you made a bid.
Feature Importance
Which columns are the strongest predictors?
Feature importance gives us insight into how a model is making predictions. We can use the `feature_importances_` attribute. Let’s create a function to pop them into a dataframe and sort them.
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                       ).sort_values('imp', ascending=False)
fi = rf_feat_importance(m, xs)
fi[:10]
| | cols | imp |
|---|---|---|
57 | YearMade | 0.176429 |
6 | ProductSize | 0.119205 |
30 | Coupler_System | 0.117753 |
7 | fiProductClassDesc | 0.072875 |
54 | ModelID | 0.054168 |
65 | saleElapsed | 0.049664 |
31 | Grouser_Tracks | 0.048707 |
3 | fiSecondaryDesc | 0.045853 |
12 | Enclosure | 0.034020 |
1 | fiModelDesc | 0.033681 |
The feature importances of our model show that the first few columns have much higher importance scores than the rest, with `YearMade` and `ProductSize` at the top of the list.
A plot of the feature importances shows the relative importance more clearly.
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)
plot_fi(fi[:30])
<AxesSubplot:ylabel='cols'>
Removing Low-Importance Variables
Which columns can we ignore?
We could just use a subset of the columns by removing the variables of low importance and still get good results. Let’s try just keeping those with a feature importance greater than 0.005.
to_keep = fi[fi.imp > 0.005].cols
len(to_keep)
21
We can retrain our model using just this subset of the columns.
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]
m = rf(xs_imp, y)
And here’s the result:
m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)
(0.180239, 0.230229)
Our validation RMSE is about the same, but we have far fewer columns to study.
len(xs.columns), len(xs_imp.columns)
(66, 21)
Let’s look at our feature importance plot again.
plot_fi(rf_feat_importance(m, xs_imp))
<AxesSubplot:ylabel='cols'>
Removing Redundant Features
Which columns are effectively redundant with each other, for purposes of prediction?
There seem to be variables with similar meanings, for example `ProductGroup` and `ProductGroupDesc`. Let’s try to remove any redundant features.
cluster_columns(xs_imp)
In this chart, the pairs of columns that are most similar are the ones that were merged together early. Unsurprisingly, the fields `ProductGroup` and `ProductGroupDesc` were merged quite early, along with `saleYear` and `saleElapsed`, and `fiBaseModel` and `fiModelDesc`. Let’s try removing some of these closely related features to see if the model can be simplified without impacting the accuracy.
First, we create a function that quickly trains a random forest and returns the OOB score, by using a lower `max_samples` and a higher `min_samples_leaf`. The OOB score is a number returned by sklearn that ranges between 1.0 for a perfect model and 0.0 for a random model. We don’t need it to be very accurate; we’re just going to use it to compare different models, based on removing some of the possibly redundant columns.
def get_oob(df):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
        max_samples=50000, max_features=0.5,
        n_jobs=-1, oob_score=True)
    m.fit(df, y)
    return m.oob_score_
Here’s our baseline:
get_oob(xs_imp)
0.8773453848775988
Now we’ll try removing each of our potentially redundant variables, one at a time.
{c: get_oob(xs_imp.drop(c, axis=1)) for c in (
    'saleYear', 'saleElapsed',
    'ProductGroupDesc', 'ProductGroup',
    'fiModelDesc', 'fiBaseModel',
    'Hydraulics_Flow', 'Grouser_Tracks', 'Coupler_System')}
{'saleYear': 0.8768014266597242,
'saleElapsed': 0.8719167973771567,
'ProductGroupDesc': 0.8770285539401924,
'ProductGroup': 0.8780539740440588,
'fiModelDesc': 0.8753778428292395,
'fiBaseModel': 0.8762011220745426,
'Hydraulics_Flow': 0.8773475825106756,
'Grouser_Tracks': 0.8776377009405991,
'Coupler_System': 0.8771692298411222}
Now let’s try dropping multiple variables. We’ll drop one from each of the tightly aligned pairs we noticed earlier.
# maybe drop columns with higher oob score
to_drop = ['saleYear', 'ProductGroupDesc', 'fiBaseModel', 'Grouser_Tracks']
get_oob(xs_imp.drop(to_drop, axis=1))
0.8743584495396625
This is not much worse than the model with all the fields.
Let’s create dataframes without these columns.
xs_final = xs_imp.drop(to_drop, axis=1)
valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)
Let’s save the dataframes.
save_pickle(path/'xs_final.pkl', xs_final)
save_pickle(path/'valid_xs_final.pkl', valid_xs_final)
Now we can check our RMSE again to confirm that the accuracy hasn’t changed substantially.
m = rf(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)
(0.18224, 0.231327)
By focusing on the most important variables, and removing some redundant ones, we’ve greatly simplified our model. Now, let’s see how those variables affect our predictions using partial dependence plots.
Partial Dependence
How do predictions vary, as we vary the columns?
As we’ve seen, the two most important predictors are `ProductSize` and `YearMade`. We’d like to understand the relationship between these predictors and sale price. It’s a good idea to first check the count of values per category to see how common each category is. For this, we’ll use the Pandas `value_counts` method.
p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)), c)
The largest group is `#na#`, which is the label fastai applies to missing values.
Let’s do the same thing for `YearMade`. Since this is a numeric feature, we’ll need to draw a histogram, which groups the year values into a few discrete bins.
ax = valid_xs_final['YearMade'].hist()
Other than the special value 1950, which we used for coding missing values, most of the data is from after 1990.
Partial dependence plots try to answer the question: if a row varied on nothing other than the feature in question, how would it impact the dependent variable? For example, how does `YearMade` impact sale price, all other things being equal?
We’ll replace every single value in the `YearMade` column with 1950, calculate the predicted sale price for every auction, and take the average over all the auctions. Then we do the same for 1951, 1952, and so forth until 2011. This isolates the effect of `YearMade` alone. With these averages, we can then plot each year on the x-axis and each of the predictions on the y-axis. That gives us a partial dependence plot.
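Here is a minimal sketch of that computation done by hand (assuming the forest `m` and `valid_xs_final` from above), before we use sklearn’s built-in helper:

```python
years = np.arange(1950, 2012)
pdp = []
for yr in years:
    tmp = valid_xs_final.copy()
    tmp['YearMade'] = yr                  # force every row to the same YearMade
    pdp.append(m.predict(tmp).mean())     # average predicted log price for that year
plt.plot(years, pdp);
```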
from sklearn.inspection import plot_partial_dependence

fig, ax = plt.subplots(figsize=(12,4))
plot_partial_dependence(m, valid_xs_final, ['YearMade', 'ProductSize'],
                        grid_resolution=20, ax=ax)
<sklearn.inspection._plot.partial_dependence.PartialDependenceDisplay at 0x7f039c5286d0>
In the `YearMade` plot, we can clearly see a nearly linear relationship between year and price. Remember that our dependent variable is the logarithm of the price, so in practice this corresponds to an exponential increase in price. Since depreciation is generally recognized as being a multiplicative factor over time, for a given sale date, varying the year made ought to show an exponential relationship with sale price.
The `ProductSize` plot shows that the final group, which is for missing values, has the lowest price. This is concerning: we would want to find out why it’s missing so often, and what that means. Missing values can sometimes be useful predictors, and sometimes they can indicate data leakage.
Tree Interpreter
For predicting with a particular row of data, what were the most important factors, and how did they influence the prediction?
Tree Interpreters help you identify which factors influence specific predictions.
import warnings
warnings.simplefilter('ignore', FutureWarning)
from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall
Let’s say we’re looking at some particular item in the auction. Our model might predict that this item will be very expensive, and we want to know why. So, we take that one row of data and put it through the first decision tree, looking to see what split is used at each point throughout the tree. For each split, we see what the increase or decrease in the prediction is, compared to the parent node of the tree. We do this for every tree, and add up the total change in prediction by split variable.
For instance, let’s pick the first few rows of our validation set.
row = valid_xs_final.iloc[:5]
We can pass these to `treeinterpreter`.
prediction, bias, contributions = treeinterpreter.predict(m, row.values)
`prediction` is the prediction that the random forest makes. `bias` is the prediction based on taking the mean of the dependent variable (i.e. the model that is the root of every tree). `contributions` tells us the total change in prediction due to each of the independent variables. So, for each row, the sum of `contributions` plus `bias` must equal `prediction`.
Let’s look at just the first row.
prediction[0], bias[0], contributions[0].sum()
(array([10.02467957]), 10.104720584542706, -0.0800410097588184)
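For intuition, here is a rough sketch of where those contributions come from, walking the decision path of the first row through a single tree of the forest. This mirrors the idea treeinterpreter uses internally, but is only an illustration:

```python
tree = m.estimators_[0].tree_   # the first tree in the forest
x = row.values[0]               # the first row we're explaining
node = 0
while tree.children_left[node] != -1:                 # keep going until we reach a leaf
    feat, thresh = tree.feature[node], tree.threshold[node]
    nxt = tree.children_left[node] if x[feat] <= thresh else tree.children_right[node]
    delta = tree.value[nxt][0][0] - tree.value[node][0][0]
    print(valid_xs_final.columns[feat], round(delta, 4))   # this split's contribution
    node = nxt
```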
The clearest way to display the contributions is with a waterfall plot.
waterfall(valid_xs_final.columns, contributions[0], threshold=0.08,
          rotation_value=45, formatting='{:,.3f}');
This shows how the positive and negative contributions from all the independent variables sum up to create the final prediction, which is the right-hand column labeled “net”.
This kind of information is most useful in production, where you can use it to tell the users of your data product about the underlying reasoning behind the predictions.
Extrapolation Problem
A random forest just averages the predictions of a number of trees. And a tree simply predicts the average value of the rows in a leaf. Thus, a tree and a random forest can never predict values outside of the range of the training data. This is particularly problematic for data where there is a trend over time, such as inflation, and you wish to make predictions for a future time. Your predictions will be systematically too low. Moreover, random forests are not able to extrapolate outside of the types of data they have seen, in a more general sense. That’s why we need to make sure that our validation set does not contain out-of-domain data.
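A small synthetic illustration of that limitation (made-up data, not part of the bulldozers dataset): a forest trained on the first part of a linear trend simply flatlines beyond the largest value it has seen.

```python
x_lin = np.linspace(0, 40, 41)
y_lin = x_lin + np.random.normal(0, 0.5, len(x_lin))   # a simple upward trend

m_lin = RandomForestRegressor().fit(x_lin[:30].reshape(-1, 1), y_lin[:30])

plt.scatter(x_lin, y_lin, label='actual')
plt.scatter(x_lin, m_lin.predict(x_lin.reshape(-1, 1)), label='prediction')
plt.legend();
```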
Finding Out-Of-Domain Data
We’ll try to predict whether a row is in the validation set or training set. For this, we’ll combine our training and validation sets together, create a dependent variable that represents which dataset each row comes from, build a random forest using that data, and get its feature importance.
df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))

m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:6]
| | cols | imp |
|---|---|---|
5 | saleElapsed | 0.896474 |
10 | SalesID | 0.078574 |
13 | MachineID | 0.019742 |
0 | YearMade | 0.001183 |
4 | ModelID | 0.000789 |
16 | Tire_Size | 0.000525 |
This shows that three columns differ significantly between the training and validation sets: `saleElapsed`, `SalesID`, and `MachineID`.

- `saleElapsed`: the number of days between the start of the dataset and each row, so it directly encodes the date.
- `SalesID`: the identifiers for auction sales might increment over time.
- `MachineID`: the same might be happening for individual items sold in the auctions.
Let’s get a baseline of the original random forest model’s RMSE, then see what the effect is of removing each of these columns in turn.
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))

for c in ('SalesID', 'saleElapsed', 'MachineID'):
    m = rf(xs_final.drop(c, axis=1), y)
    print(c, m_rmse(m, valid_xs_final.drop(c, axis=1), valid_y))
orig 0.231707
SalesID 0.229693
saleElapsed 0.234533
MachineID 0.230408
It looks like we should be able to remove `SalesID` and `MachineID` without losing any accuracy. Let’s check.
time_vars = ['SalesID', 'MachineID']
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)

m = rf(xs_final_time, y)
m_rmse(m, valid_xs_time, valid_y)
0.228485
Removing these variables has slightly improved the model’s accuracy; but more importantly, it should make it more resilient over time, and easier to maintain and understand.
One thing that might help in our case is to simply avoid using old data. Often, old data shows relationships that just aren’t valid any more. Let’s just try using the most recent few years of the data.
xs['saleYear'].hist()
<AxesSubplot:>
Let’s consider only the sales after the year 2004.
filt = xs['saleYear'] > 2004
xs_filt = xs_final_time[filt]
y_filt = y[filt]
Here’s the result of training on this subset.
m = rf(xs_filt, y_filt)
m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)
(0.176994, 0.228504)
It’s a little better, which shows that you shouldn’t always just use your entire dataset; sometimes a subset can be better.
Using a Neural Network
We’ll now try using a neural network. We can use the same approach to build a neural network model. Let’s first replicate the steps we took to set up the `TabularPandas` object.
df_nn = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
df_nn['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
df_nn[dep_var] = np.log(df_nn[dep_var])
df_nn = add_datepart(df_nn, 'saledate')
We can leverage the work we did to trim unwanted columns in the random forest by using the same set of columns for our neural network.
df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]
Categorical columns are handled very differently in neural networks, compared to the decision tree approach. In a neural network, a great way to handle categorical variables is by using embeddings. To create embeddings, fastai needs to determine which columns should be treated as categorical variables. It does this by comparing the number of distinct levels in the variable to the value of the `max_card` parameter. If it’s lower, fastai will treat the variable as categorical. Embedding sizes larger than 10,000 should generally only be used after you’ve tested whether there are better ways to group the variable, so we’ll use 9,000 as our `max_card`.
cont_nn, cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
There is one variable that we absolutely do not want to treat as categorical: the `saleElapsed` variable. A categorical variable, by definition, cannot extrapolate outside the range of values that it has seen, but we want to be able to predict auction sales in the future.
Let’s verify that `cont_cat_split` did the correct thing.
cont_nn
['saleElapsed']
Let’s look at the cardinality of each of the categorical variables that we have chosen so far.
df_nn_final[cat_nn].nunique()
YearMade 73
ProductSize 6
Coupler_System 2
fiProductClassDesc 74
ModelID 5281
fiSecondaryDesc 177
Enclosure 6
fiModelDesc 5059
Hydraulics_Flow 3
fiModelDescriptor 140
ProductGroup 6
Hydraulics 12
Drive_System 4
Tire_Size 17
dtype: int64
The fact that there are two variables pertaining to the “model” of the equipment, both with similar high cardinalities, suggests that they may contain similar, redundant information. Note that we would not necessarily see this when analyzing redundant features, since that relies on similar variables being sorted in the same order, i.e. they need to have similarly named levels. Having a column with 5,000 levels means needing 5,000 columns in our embedding matrix, which would be nice to avoid if possible.
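As a rough illustration of that cost (hypothetical sizes; fastai’s default heuristic would pick an embedding width of roughly 190 for about 5,000 levels), the embedding matrix for a single such column already holds close to a million parameters:

```python
import torch.nn as nn

emb = nn.Embedding(num_embeddings=5_000, embedding_dim=190)  # one row per level
sum(p.numel() for p in emb.parameters())                     # 950,000 parameters
```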
Let’s see what impact removing one of these model columns has on the random forest.
xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
(0.175986, 0.229453)
There’s minimal impact, so we will remove it as a predictor for our neural network.
cat_nn.remove('fiModelDescriptor')
We can create our `TabularPandas` object in the same way as when we created our random forest, with one very important addition: normalization. A random forest does not need any normalization, since the tree building procedure cares only about the order of values in a variable, not at all about how they are scaled. But a neural network does care about this, so we add the `Normalize` processor when we build our `TabularPandas` object.
procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
                      splits=splits, y_names=dep_var)
Since tabular models and data don’t generally require much GPU RAM, we can use larger batch sizes.
dls = to_nn.dataloaders(1024)
It’s a good idea to set `y_range` for regression models, so let’s find the min and max of our dependent variable.
y = to_nn.train.y
y.min(), y.max()
(8.465899, 11.863583)
We can now create the `Learner` for this tabular model. By default, for tabular data, fastai creates a neural network with two hidden layers, with 200 and 100 activations, respectively. This works well for small datasets, but here we’ve got quite a large dataset, so we increase the layer sizes to 500 and 250.
learn = tabular_learner(dls, y_range=(8,12), layers=[500, 250],
                        n_out=1, loss_func=F.mse_loss)
learn.lr_find()
SuggestedLRs(lr_min=0.0033113110810518267, lr_steep=0.00015848931798245758)
There’s no need to use `fine_tune`, so we’ll train with `fit_one_cycle` for a few epochs and see how it looks.
learn.fit_one_cycle(5, 3.3e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.069181 | 0.062983 | 01:33 |
1 | 0.057557 | 0.062395 | 01:27 |
2 | 0.050993 | 0.056399 | 01:29 |
3 | 0.045013 | 0.051584 | 01:26 |
4 | 0.041281 | 0.051339 | 01:37 |
We can use our `r_mse` function to compare the result to the random forest result we got earlier.
preds, targs = learn.get_preds()
r_mse(preds, targs)
0.226582
It’s quite a bit better than the random forest, although it took longer to train.
Ensembling
Another thing that can help with generalization is to use several models and average their predictions: a technique known as ensembling.
We have trained two very different models, using very different algorithms: a random forest and a neural network. It would be reasonable to expect that the kinds of errors each one makes would be quite different. Thus, we might expect that the average of their predictions would be better than either one’s individual predictions. We can create an ensemble of the random forest and the neural network.
The PyTorch model and the sklearn model produce data of different types: PyTorch gives us a rank-2 tensor (i.e. a column matrix), whereas NumPy gives us a rank-1 array (a vector). `squeeze` removes any unit axes from a tensor, and `to_np` converts it into a NumPy array.
rf_preds = m.predict(valid_xs_time)
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
r_mse(ens_preds, valid_y)
0.222424
This is a much better result. In fact, this result is better than any score shown on the Kaggle leaderboard. However, it isn’t directly comparable since the leaderboard uses a separate test dataset that we do not have access to. But our results are certainly encouraging!