The Importance of Preprocessing

So far …

  • Models: Decision trees, 𝑘-NNs, SVMs with RBF kernel.
  • Fundamentals: Train-validation-test split, cross-validation, the fundamental tradeoff, the golden rule.



Now …

Preprocessing: Transforming input data into a format a machine learning model can use and understand.

Basketball dataset

import pandas as pd

bball_df = pd.read_csv('data/bball.csv')
bball_df.head(3)
full_name rating jersey ... draft_round draft_peak college
0 LeBron James 97 #23 ... 1 1 NaN
1 Kawhi Leonard 97 #2 ... 1 15 San Diego State
2 Giannis Antetokounmpo 96 #34 ... 1 15 NaN

3 rows × 14 columns


from sklearn.model_selection import train_test_split

bball_df = bball_df[(bball_df['position'] == 'G') | (bball_df['position'] == 'F')]
X = bball_df[['weight', 'height', 'salary']]
y = bball_df['position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)


X_train.head(3)
weight height salary
152 79.4 1.88 1588231.0
337 82.1 1.91 2149560.0
130 106.6 2.03 6500000.0

Dummy Classifier

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dummy = DummyClassifier(strategy="most_frequent")
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
print('Mean validation score', scores['test_score'].mean().round(2))
Mean validation score 0.57
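
The 0.57 baseline is just the proportion of the majority class: a `most_frequent` dummy always predicts that class, so its accuracy equals the majority-class fraction. A minimal sketch with hypothetical labels (the real `bball.csv` is not bundled here, so the 57/43 split below is assumed to mirror the score above):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in for y_train: 57% guards, 43% forwards.
y_toy = pd.Series(['G'] * 57 + ['F'] * 43)
X_toy = pd.DataFrame({'weight': range(100)})

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

# The dummy always predicts the majority class, so its accuracy
# equals the majority-class proportion.
print(y_toy.value_counts(normalize=True).max())  # 0.57
print(dummy.score(X_toy, y_toy))                 # 0.57
```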


from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
scores = cross_validate(knn, X_train, y_train, return_train_score=True)
print('Mean validation score', scores['test_score'].mean().round(2))
Mean validation score 0.5


two_players = X_train.sample(2, random_state=42)
two_players
weight height salary
285 91.2 1.98 1882867.0
236 112.0 2.08 2000000.0


from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(two_players)[1,0]
np.float64(117133.00184682972)


two_players_subset = two_players[["salary"]]
two_players_subset
salary
285 1882867.0
236 2000000.0


euclidean_distances(two_players_subset)[1,0]
np.float64(117133.0)
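
The two distances are nearly identical because salary is on a vastly larger scale than weight and height, so it dominates the sum of squared differences. A sketch recomputing both distances by hand from the feature values in the table above:

```python
import numpy as np

# Feature values of the two sampled players (weight, height, salary).
p1 = np.array([91.2, 1.98, 1882867.0])
p2 = np.array([112.0, 2.08, 2000000.0])

# Full Euclidean distance vs. the salary difference alone.
dist_all = np.sqrt(((p1 - p2) ** 2).sum())
dist_salary = abs(p1[2] - p2[2])

print(dist_all)     # ≈ 117133.0018 — essentially just the salary gap
print(dist_salary)  # 117133.0
```

The weight and height differences contribute almost nothing, which is exactly why 𝑘-NN needs scaled features.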

Transformers: Scaling example

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()   # Create feature transformer object
scaler.fit(X_train); # Fitting the transformer on the train split
X_train_scaled = scaler.transform(X_train) # Transforming the train split
X_test_scaled = scaler.transform(X_test) # Transforming the test split
pd.DataFrame(X_train_scaled, columns = X_train.columns).head()
weight height salary
0 -1.552775 -1.236056 -0.728809
1 -1.257147 -0.800950 -0.670086
2 1.425407 0.939473 -0.214967
3 1.370661 1.664650 -0.585185
4 0.286690 -0.510879 -0.386408
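
Under the hood, `StandardScaler` learns each column's mean and standard deviation during `fit`, then subtracts and divides during `transform`. A sketch verifying this on a small stand-in array (the real `X_train` is not bundled here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train (weight, height, salary).
X_toy = np.array([[79.4, 1.88, 1_588_231.0],
                  [82.1, 1.91, 2_149_560.0],
                  [106.6, 2.03, 6_500_000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Manual standardization: subtract the per-column mean, divide by the
# per-column (population) standard deviation.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, manual))          # True
print(np.allclose(X_scaled.mean(axis=0), 0))  # True: each column has mean ~0
print(np.allclose(X_scaled.std(axis=0), 1))   # True: and std ~1
```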

Sklearn’s predict vs transform

model.fit(X_train, y_train)
X_train_predictions = model.predict(X_train)


transformer.fit(X_train)  # y is optional here; transformers ignore it
X_train_transformed = transformer.transform(X_train)

or, equivalently:

X_train_transformed = transformer.fit_transform(X_train)
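
A quick sketch (with synthetic data, since the dataset is not bundled here) confirming that `fit_transform` is just `fit` followed by `transform` in one call:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))  # hypothetical stand-in for X_train

# fit followed by transform ...
scaler = StandardScaler()
scaler.fit(X_toy)
out_two_step = scaler.transform(X_toy)

# ... gives the same result as fit_transform in one call.
out_one_step = StandardScaler().fit_transform(X_toy)
print(np.allclose(out_two_step, out_one_step))  # True
```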


knn_unscaled = KNeighborsClassifier()
knn_unscaled.fit(X_train, y_train);
print('Train score: ', (round(knn_unscaled.score(X_train, y_train), 2)))
print('Test score: ', (round(knn_unscaled.score(X_test, y_test), 2)))
Train score:  0.71
Test score:  0.45


knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train);
print('Train score: ', (round(knn_scaled.score(X_train_scaled, y_train), 2)))
print('Test score: ', (round(knn_scaled.score(X_test_scaled, y_test), 2)))
Train score:  0.94
Test score:  0.89
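
Transforming each split by hand works, but it becomes error-prone once cross-validation enters the picture: the scaler must be refit on every training fold. One common way to keep the transformer and the model in sync is sklearn's `make_pipeline` — a sketch with synthetic data standing in for the basketball features (the pipeline itself is an assumption here, not part of the code above):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate

# Hypothetical stand-in for the basketball data: three features on
# weight-, height-, and salary-like scales, with G/F labels.
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(40, 3)) * np.array([10, 0.1, 1e6])
y_toy = rng.choice(['G', 'F'], size=40)

# The pipeline scales, then classifies; cross_validate refits the
# scaler on each training fold, so nothing leaks from the validation fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_validate(pipe, X_toy, y_toy)
print(scores['test_score'].mean())
```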

Let’s apply what we learned!