Case Study: Preprocessing with Scaling

from sklearn.neighbors import KNeighborsClassifier

knn_unscaled = KNeighborsClassifier()
knn_unscaled.fit(X_train, y_train);
print('Train score: ', round(knn_unscaled.score(X_train, y_train), 2))
print('Test score: ', round(knn_unscaled.score(X_test, y_test), 2))
Train score:  0.71
Test score:  0.45


knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train);
print('Train score: ', round(knn_scaled.score(X_train_scaled, y_train), 2))
print('Test score: ', round(knn_scaled.score(X_test_scaled, y_test), 2))
Train score:  0.94
Test score:  0.89
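The gap between the two runs comes from KNN's reliance on Euclidean distance: a feature with a much larger numeric range dominates the distance computation, drowning out informative features. A minimal synthetic sketch (not the case-study data; the features and seed here are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data: feature 0 carries the class signal on a scale of ~1;
# feature 1 is pure noise on a scale of ~1000, so unscaled Euclidean
# distances are dominated by the uninformative feature.
rng = np.random.default_rng(0)
signal = rng.normal(0, 1, 200)
noise = rng.normal(0, 1000, 200)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

acc_unscaled = KNeighborsClassifier().fit(X, y).score(X, y)

# After standardization both features have comparable scale, so the
# informative feature actually influences which neighbors are chosen.
X_std = StandardScaler().fit_transform(X)
acc_scaled = KNeighborsClassifier().fit(X_std, y).score(X_std, y)
print(acc_unscaled, acc_scaled)
```

On this toy data the scaled accuracy comes out well above the unscaled one, mirroring the case-study scores above.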

Scaling


| Approach | What it does | How to update `X` (but see below!) | sklearn implementation |
|---|---|---|---|
| Normalization | sets range to [0, 1] | `X -= np.min(X, axis=0)`<br>`X /= np.max(X, axis=0)` | `MinMaxScaler()` |
| Standardization | sets sample mean to 0, s.d. to 1 | `X -= np.mean(X, axis=0)`<br>`X /= np.std(X, axis=0)` | `StandardScaler()` |

There are many articles comparing these two approaches and discussing when to prefer one over the other.
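The update rules in the table can be checked directly against the sklearn transformers. A minimal sketch with made-up numbers (note that the normalization step divides by the max of the already-shifted columns, i.e. by max − min):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 800.0]])

# Normalization: shift each column to start at 0, then divide by the
# max of the shifted column (= original max - min) to land in [0, 1].
X_norm = X - np.min(X, axis=0)
X_norm /= np.max(X_norm, axis=0)
print(np.allclose(X_norm, MinMaxScaler().fit_transform(X)))  # True

# Standardization: zero mean, unit (population) standard deviation
# per column -- np.std uses ddof=0, matching StandardScaler.
X_std = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
print(np.allclose(X_std, StandardScaler().fit_transform(X)))  # True
```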

pd.DataFrame(X_train_imp, columns=X_train.columns, index=X_train.index).head()
longitude latitude housing_median_age households median_income rooms_per_household bedrooms_per_household population_per_household
6051 -117.75 34.04 22.0 602.0 3.1250 4.897010 1.056478 4.318937
20113 -119.57 37.94 17.0 20.0 3.4861 17.300000 6.500000 2.550000
14289 -117.13 32.74 46.0 708.0 2.6604 4.738701 1.084746 2.057910
13665 -117.31 34.02 18.0 285.0 5.2139 5.733333 0.961404 3.154386
14471 -117.23 32.88 18.0 1458.0 1.8580 3.817558 1.004801 4.323045


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)
pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index).head()
longitude latitude housing_median_age households median_income rooms_per_household bedrooms_per_household population_per_household
6051 0.908140 -0.743917 -0.526078 0.266135 -0.389736 -0.210591 -0.083813 0.126398
20113 -0.002057 1.083123 -0.923283 -1.253312 -0.198924 4.726412 11.166631 -0.050132
14289 1.218207 -1.352930 1.380504 0.542873 -0.635239 -0.273606 -0.025391 -0.099240
13665 1.128188 -0.753286 -0.843842 -0.561467 0.714077 0.122307 -0.280310 0.010183
14471 1.168196 -1.287344 -0.843842 2.500924 -1.059242 -0.640266 -0.190617 0.126808
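Note the pattern above: the scaler is fit on the training data only (`fit_transform`) and then merely applied to the test data (`transform`); fitting on the test set would leak information. One way to make this harder to get wrong is scikit-learn's `Pipeline`, which refits the scaler inside every `fit` call. A sketch on synthetic data (the case-study variables are not reused here):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data used above.
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on X_tr only, then transforms X_te
# automatically inside score() -- no manual fit_transform/transform pair.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
print(round(score, 3))
```

This also plays well with cross-validation, where the scaler must be refit on each training fold.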

Unscaled data

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()
knn.fit(X_train_imp, y_train);
print(round(knn.score(X_train_imp, y_train), 3))
print(round(knn.score(X_test_imp, y_test), 3))
0.561
0.32


Scaled data

knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train);
print(round(knn.score(X_train_scaled, y_train), 3))
print(round(knn.score(X_test_scaled, y_test), 3))
0.798
0.712
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)
pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index).head()
longitude latitude housing_median_age households median_income rooms_per_household bedrooms_per_household population_per_household
6051 0.657371 0.159405 0.411765 0.098832 0.181039 0.028717 0.021437 0.002918
20113 0.476096 0.573858 0.313725 0.003124 0.205942 0.116642 0.182806 0.001495
14289 0.719124 0.021254 0.882353 0.116264 0.148998 0.027594 0.022275 0.001099
13665 0.701195 0.157279 0.333333 0.046703 0.325099 0.034645 0.018619 0.001981
14471 0.709163 0.036132 0.333333 0.239599 0.093661 0.021064 0.019905 0.002922
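A property of `MinMaxScaler` worth knowing: each *training* column spans exactly [0, 1] after transformation, but transformed *test* values can fall outside that range whenever the test set contains values beyond the training min or max. A sketch on made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_tr = rng.uniform(0, 10, size=(100, 3))
X_te = rng.uniform(-2, 12, size=(20, 3))  # wider range than training data

scaler = MinMaxScaler()
Z_tr = scaler.fit_transform(X_tr)
Z_te = scaler.transform(X_te)

print(Z_tr.min(axis=0), Z_tr.max(axis=0))  # exactly 0 and 1 per column
print(Z_te.min(), Z_te.max())              # can fall outside [0, 1]
```

Compare this with the table above: a few test rows of the housing data could likewise land slightly outside [0, 1] after min-max scaling.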

Unscaled data

knn = KNeighborsRegressor()
knn.fit(X_train_imp, y_train);
print(round(knn.score(X_train_imp, y_train), 3))
print(round(knn.score(X_test_imp, y_test), 3))
0.561
0.32


Scaled data

knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train);
print(round(knn.score(X_train_scaled, y_train), 3))
print(round(knn.score(X_test_scaled, y_test), 3))
0.801
0.723

Let’s apply what we learned!