Case Study: Pipelines

Recall our preprocessed (imputed and scaled) training data:

X_train_scaled.head()
longitude latitude housing_median_age households median_income rooms_per_household bedrooms_per_household population_per_household
6051 0.908140 -0.743917 -0.526078 0.266135 -0.389736 -0.210591 -0.083813 0.126398
20113 -0.002057 1.083123 -0.923283 -1.253312 -0.198924 4.726412 11.166631 -0.050132
14289 1.218207 -1.352930 1.380504 0.542873 -0.635239 -0.273606 -0.025391 -0.099240
13665 1.128188 -0.753286 -0.843842 -0.561467 0.714077 0.122307 -0.280310 0.010183
14471 1.168196 -1.287344 -0.843842 2.500924 -1.059242 -0.640266 -0.190617 0.126808


from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train);
round(knn.score(X_train_scaled, y_train), 3)
0.798

How do we carry out cross-validation?

from sklearn.model_selection import cross_validate

knn = KNeighborsRegressor()
scores = cross_validate(knn, X_train_scaled, y_train, return_train_score=True)
pd.DataFrame(scores)
fit_time score_time test_score train_score
0 0.011471 0.170820 0.696373 0.794236
1 0.011446 0.149618 0.684447 0.791467
2 0.011852 0.168145 0.695532 0.789436
3 0.011272 0.164569 0.679478 0.793243
4 0.011536 0.103350 0.680657 0.794820
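
Each row above is one fold; averaging across folds gives a single summary (the same pattern we use again below):

pd.DataFrame(scores).mean()

Note, though, that this cross-validation is still subtly optimistic: X_train_scaled was produced by a scaler fit on the entire training set, so each validation fold already influenced the scaling. The rest of this section builds up to doing this properly.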

Bad methodology 1: Scaling the data separately

# Fit the scaler on the imputed training data - this part is fine
scaler = StandardScaler()
scaler.fit(X_train_imp);
X_train_scaled = scaler.transform(X_train_imp)


# Creating a separate object for scaling test data - Not a good idea.
scaler = StandardScaler()
scaler.fit(X_test_imp);  # Calling fit on the test data - Yikes!
X_test_scaled = scaler.transform(X_test_imp)  # Transforming the test data using the scaler fit on test data ... Bad!


knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train);
print("Training score: ", round(knn.score(X_train_scaled, y_train), 2))
print("Test score: ", round(knn.score(X_test_scaled, y_test), 2))
Training score:  0.8
Test score:  0.7
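
For contrast, a sound manual version fits the scaler on the training data only and reuses that same fitted scaler on the test data. A minimal sketch with the variables above:

# Fit on training data only, then transform BOTH splits with the same scaler
scaler = StandardScaler()
scaler.fit(X_train_imp);
X_train_scaled = scaler.transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)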

Bad methodology 2: Scaling the data together

X_train_imp.shape, X_test_imp.shape
((18576, 8), (2064, 8))


# Join the train and test sets back together
X_train_imp_df = pd.DataFrame(X_train_imp, columns=X_train.columns, index=X_train.index)
X_test_imp_df = pd.DataFrame(X_test_imp, columns=X_test.columns, index=X_test.index)
XX = pd.concat([X_train_imp_df, X_test_imp_df], axis=0)  # Don't do it!
XX.shape 
(20640, 8)


scaler = StandardScaler()
scaler.fit(XX);  # Fit on train and test together - the test set leaks into the scaling
XX_scaled = scaler.transform(XX)
XX_train, XX_test = XX_scaled[:18576], XX_scaled[18576:]  # Split back at the original training-set size
knn = KNeighborsRegressor()
knn.fit(XX_train, y_train);
print('Train score: ', round(knn.score(XX_train, y_train), 2))  # Misleading score
print('Test score: ', round(knn.score(XX_test, y_test), 2))  # Misleading score
Train score:  0.8
Test score:  0.71
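
We can observe the leakage directly: the statistics of a scaler fit on train and test together differ from those of a scaler fit on the training data alone. A quick check (scaler_train_only is an illustrative name):

# Compare the leaky scaler's per-feature means against a train-only scaler
scaler_train_only = StandardScaler()
scaler_train_only.fit(X_train_imp);
print(scaler.mean_ - scaler_train_only.mean_)  # nonzero entries = test rows shifted the scaling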



So, what can we do?

We can create a scikit-learn Pipeline!

Pipelines let us chain a sequence of transformers together with a final estimator, so that a single fit or predict call runs every step in order.

Let’s see it in action

from sklearn.pipeline import Pipeline

pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("reg", KNeighborsRegressor())
])
pipe.fit(X_train, y_train);
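
Aside: if you don't want to name the steps yourself, make_pipeline builds an equivalent pipeline with auto-generated step names (pipe_alt is just an illustrative name):

from sklearn.pipeline import make_pipeline

# Step names are derived from the class names, e.g. "standardscaler"
pipe_alt = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsRegressor()
)
pipe_alt.fit(X_train, y_train);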


What’s happening: calling pipe.fit(X_train, y_train) runs the equivalent of:

imputer.fit(X_train)
X_train_imp = imputer.transform(X_train)
scaler.fit(X_train_imp)
X_train_imp_scaled = scaler.transform(X_train_imp)
knn.fit(X_train_imp_scaled, y_train)

Similarly, we can call predict on the pipeline directly:

pipe.predict(X_train)
array([126500., 117380., 187700., ..., 259500., 308120.,  60860.], shape=(18576,))


Under the hood, pipe.predict(X_train) runs:

X_train_imp = imputer.transform(X_train)
X_train_imp_scaled = scaler.transform(X_train_imp)
knn.predict(X_train_imp_scaled)
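
Aside: the fitted steps live on the pipeline and can be inspected through named_steps, using the step names we chose when constructing it:

pipe.named_steps["scaler"].mean_  # per-feature means the scaler learned during pipe.fit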

Now the real payoff: we can cross-validate the entire pipeline on the unprocessed training data. Within each fold, cross_validate fits the imputer and scaler on that fold's training portion only, so nothing leaks from the validation folds:

scores_processed = cross_validate(pipe, X_train, y_train, return_train_score=True)
pd.DataFrame(scores_processed)
fit_time score_time test_score train_score
0 0.019747 0.169661 0.693883 0.792395
1 0.019282 0.165191 0.685017 0.789108
2 0.018938 0.167123 0.694409 0.787796
3 0.019154 0.170876 0.677055 0.792444
4 0.018790 0.136934 0.714494 0.823421
pd.DataFrame(scores_processed).mean()
fit_time       0.019182
score_time     0.161957
test_score     0.692972
train_score    0.797033
dtype: float64


How good are these scores, really? A DummyRegressor baseline gives us a point of comparison:

from sklearn.dummy import DummyRegressor

dummy = DummyRegressor(strategy="median")
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
pd.DataFrame(scores).mean()
fit_time       0.001186
score_time     0.000478
test_score    -0.055115
train_score   -0.054611
dtype: float64
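
Once we are satisfied with the cross-validation results, we can score the pipeline directly on the raw test set; it automatically applies the imputation and scaling learned from the training data:

# The pipeline transforms X_test with statistics learned from X_train only
pipe.fit(X_train, y_train);
print("Test score: ", round(pipe.score(X_test, y_test), 2))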

Let’s apply what we learned!