Case Study: Pipelines

Recall our preprocessed (imputed and scaled) training data:

X_train_scaled.head()
longitude latitude housing_median_age households median_income rooms_per_household bedrooms_per_household population_per_household
6051 0.908140 -0.743917 -0.526078 0.266135 -0.389736 -0.210591 -0.083813 0.126398
20113 -0.002057 1.083123 -0.923283 -1.253312 -0.198924 4.726412 11.166631 -0.050132
14289 1.218207 -1.352930 1.380504 0.542873 -0.635239 -0.273606 -0.025391 -0.099240
13665 1.128188 -0.753286 -0.843842 -0.561467 0.714077 0.122307 -0.280310 0.010183
14471 1.168196 -1.287344 -0.843842 2.500924 -1.059242 -0.640266 -0.190617 0.126808


from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train);
round(knn.score(X_train_scaled, y_train), 3)
0.798

How do we carry out cross-validation?

from sklearn.model_selection import cross_validate

knn = KNeighborsRegressor()
scores = cross_validate(knn, X_train_scaled, y_train, return_train_score=True)
pd.DataFrame(scores)
fit_time score_time test_score train_score
0 0.011471 0.170820 0.696373 0.794236
1 0.011446 0.149618 0.684447 0.791467
2 0.011852 0.168145 0.695532 0.789436
3 0.011272 0.164569 0.679478 0.793243
4 0.011536 0.103350 0.680657 0.794820
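
Each row above is one fold; averaging across folds gives a single summary (the same pattern we use again below):

pd.DataFrame(scores).mean()

Note, though, that this cross-validation is still subtly optimistic: X_train_scaled was produced by a scaler fit on the entire training set, so each validation fold already influenced the scaling. The rest of this section builds up to doing this properly.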

Bad methodology 1: Scaling the data separately

# Fit the scaler on the imputed training data - this part is fine
scaler = StandardScaler()
scaler.fit(X_train_imp);
X_train_scaled = scaler.transform(X_train_imp)


# Creating a separate object for scaling test data - Not a good idea.
scaler = StandardScaler()
scaler.fit(X_test_imp);  # Calling fit on the test data - Yikes!
X_test_scaled = scaler.transform(X_test_imp)  # Transforming the test data using the scaler fit on test data ... Bad!


knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train);
print("Training score: ", round(knn.score(X_train_scaled, y_train), 2))
print("Test score: ", round(knn.score(X_test_scaled, y_test), 2))
Training score:  0.8
Test score:  0.7
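
For contrast, a sound manual version fits the scaler on the training data only and reuses that same fitted scaler on the test data. A minimal sketch with the variables above:

# Fit on training data only, then transform BOTH splits with the same scaler
scaler = StandardScaler()
scaler.fit(X_train_imp);
X_train_scaled = scaler.transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)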

Bad methodology 2: Scaling the data together

X_train_imp.shape, X_test_imp.shape
((18576, 8), (2064, 8))


# Join the train and test sets back together
X_train_imp_df = pd.DataFrame(X_train_imp, columns=X_train.columns, index=X_train.index)
X_test_imp_df = pd.DataFrame(X_test_imp, columns=X_test.columns, index=X_test.index)
XX = pd.concat([X_train_imp_df, X_test_imp_df], axis=0)  # Don't do it!
XX.shape 
(20640, 8)


scaler = StandardScaler()
scaler.fit(XX);  # Fit on train and test together - the test set leaks into the scaling
XX_scaled = scaler.transform(XX)
XX_train, XX_test = XX_scaled[:18576], XX_scaled[18576:]  # Split back at the original training-set size
knn = KNeighborsRegressor()
knn.fit(XX_train, y_train);
print('Train score: ', round(knn.score(XX_train, y_train), 2))  # Misleading score
print('Test score: ', round(knn.score(XX_test, y_test), 2))  # Misleading score
Train score:  0.8
Test score:  0.71
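
We can observe the leakage directly: the statistics of a scaler fit on train and test together differ from those of a scaler fit on the training data alone. A quick check (scaler_train_only is an illustrative name):

# Compare the leaky scaler's per-feature means against a train-only scaler
scaler_train_only = StandardScaler()
scaler_train_only.fit(X_train_imp);
print(scaler.mean_ - scaler_train_only.mean_)  # nonzero entries = test rows shifted the scaling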



So, what can we do?

We can create a scikit-learn Pipeline!

Pipelines let us chain a sequence of transformers together with a final estimator, so that a single fit or predict call runs every step in order.

Let’s see it in action

from sklearn.pipeline import Pipeline

pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("reg", KNeighborsRegressor())
])
pipe.fit(X_train, y_train);
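
Aside: if you don't want to name the steps yourself, make_pipeline builds an equivalent pipeline with auto-generated step names (pipe_alt is just an illustrative name):

from sklearn.pipeline import make_pipeline

# Step names are derived from the class names, e.g. "standardscaler"
pipe_alt = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsRegressor()
)
pipe_alt.fit(X_train, y_train);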


What’s happening: calling pipe.fit(X_train, y_train) runs the equivalent of:

imputer.fit(X_train)
X_train_imp = imputer.transform(X_train)
scaler.fit(X_train_imp)
X_train_imp_scaled = scaler.transform(X_train_imp)
knn.fit(X_train_imp_scaled, y_train)

Similarly, we can call predict on the pipeline directly:

pipe.predict(X_train)
array([126500., 117380., 187700., ..., 259500., 308120.,  60860.], shape=(18576,))


Under the hood, pipe.predict(X_train) runs:

X_train_imp = imputer.transform(X_train)
X_train_imp_scaled = scaler.transform(X_train_imp)
knn.predict(X_train_imp_scaled)
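
Aside: the fitted steps live on the pipeline and can be inspected through named_steps, using the step names we chose when constructing it:

pipe.named_steps["scaler"].mean_  # per-feature means the scaler learned during pipe.fit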

Now the real payoff: we can cross-validate the entire pipeline on the unprocessed training data. Within each fold, cross_validate fits the imputer and scaler on that fold's training portion only, so nothing leaks from the validation folds:

scores_processed = cross_validate(pipe, X_train, y_train, return_train_score=True)
pd.DataFrame(scores_processed)
fit_time score_time test_score train_score
0 0.019747 0.169661 0.693883 0.792395
1 0.019282 0.165191 0.685017 0.789108
2 0.018938 0.167123 0.694409 0.787796
3 0.019154 0.170876 0.677055 0.792444
4 0.018790 0.136934 0.714494 0.823421
pd.DataFrame(scores_processed).mean()
fit_time       0.019182
score_time     0.161957
test_score     0.692972
train_score    0.797033
dtype: float64


How good are these scores, really? A DummyRegressor baseline gives us a point of comparison:

from sklearn.dummy import DummyRegressor

dummy = DummyRegressor(strategy="median")
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
pd.DataFrame(scores).mean()
fit_time       0.001186
score_time     0.000478
test_score    -0.055115
train_score   -0.054611
dtype: float64
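
Once we are satisfied with the cross-validation results, we can score the pipeline directly on the raw test set; it automatically applies the imputation and scaling learned from the training data:

# The pipeline transforms X_test with statistics learned from X_train only
pipe.fit(X_train, y_train);
print("Test score: ", round(pipe.score(X_test, y_test), 2))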

Let’s apply what we learned!