The Importance of Preprocessing

So far …

  • Models: Decision trees, 𝑘-NNs, SVMs with RBF kernel.
  • Fundamentals: Train-validation-test split, cross-validation, the fundamental tradeoff, the golden rule.



Now …

Preprocessing: Transforming input data into a format a machine learning model can use and understand.

Basketball dataset

import pandas as pd

bball_df = pd.read_csv('data/bball.csv')
bball_df.head(3)
full_name rating jersey ... draft_round draft_peak college
0 LeBron James 97 #23 ... 1 1 NaN
1 Kawhi Leonard 97 #2 ... 1 15 San Diego State
2 Giannis Antetokounmpo 96 #34 ... 1 15 NaN

3 rows × 14 columns


from sklearn.model_selection import train_test_split

bball_df = bball_df[(bball_df['position'] == 'G') | (bball_df['position'] == 'F')]
X = bball_df[['weight', 'height', 'salary']]
y = bball_df['position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)


X_train.head(3)
weight height salary
152 79.4 1.88 1588231.0
337 82.1 1.91 2149560.0
130 106.6 2.03 6500000.0

Dummy Classifier

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dummy = DummyClassifier(strategy="most_frequent")
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
print('Mean validation score', scores['test_score'].mean().round(2))
Mean validation score 0.57
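
The 0.57 baseline is just the proportion of the majority class: a `most_frequent` dummy always predicts that class, so its accuracy equals the majority-class fraction. A minimal sketch with hypothetical labels (the real `bball.csv` is not bundled here, so the 57/43 split below is assumed to mirror the score above):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in for y_train: 57% guards, 43% forwards.
y_toy = pd.Series(['G'] * 57 + ['F'] * 43)
X_toy = pd.DataFrame({'weight': range(100)})

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

# The dummy always predicts the majority class, so its accuracy
# equals the majority-class proportion.
print(y_toy.value_counts(normalize=True).max())  # 0.57
print(dummy.score(X_toy, y_toy))                 # 0.57
```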


from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
scores = cross_validate(knn, X_train, y_train, return_train_score=True)
print('Mean validation score', scores['test_score'].mean().round(2))
Mean validation score 0.5


two_players = X_train.sample(2, random_state=42)
two_players
weight height salary
285 91.2 1.98 1882867.0
236 112.0 2.08 2000000.0


from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(two_players)[1,0]
np.float64(117133.00184682972)


two_players_subset = two_players[["salary"]]
two_players_subset
salary
285 1882867.0
236 2000000.0


euclidean_distances(two_players_subset)[1,0]
np.float64(117133.0)
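
The two distances are nearly identical because salary is on a vastly larger scale than weight and height, so it dominates the sum of squared differences. A sketch recomputing both distances by hand from the feature values in the table above:

```python
import numpy as np

# Feature values of the two sampled players (weight, height, salary).
p1 = np.array([91.2, 1.98, 1882867.0])
p2 = np.array([112.0, 2.08, 2000000.0])

# Full Euclidean distance vs. the salary difference alone.
dist_all = np.sqrt(((p1 - p2) ** 2).sum())
dist_salary = abs(p1[2] - p2[2])

print(dist_all)     # ≈ 117133.0018 — essentially just the salary gap
print(dist_salary)  # 117133.0
```

The weight and height differences contribute almost nothing, which is exactly why 𝑘-NN needs scaled features.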

Transformers: Scaling example

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()   # Create feature transformer object
scaler.fit(X_train); # Fitting the transformer on the train split
X_train_scaled = scaler.transform(X_train) # Transforming the train split
X_test_scaled = scaler.transform(X_test) # Transforming the test split
pd.DataFrame(X_train_scaled, columns = X_train.columns).head()
weight height salary
0 -1.552775 -1.236056 -0.728809
1 -1.257147 -0.800950 -0.670086
2 1.425407 0.939473 -0.214967
3 1.370661 1.664650 -0.585185
4 0.286690 -0.510879 -0.386408
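
Under the hood, `StandardScaler` learns each column's mean and standard deviation during `fit`, then subtracts and divides during `transform`. A sketch verifying this on a small stand-in array (the real `X_train` is not bundled here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train (weight, height, salary).
X_toy = np.array([[79.4, 1.88, 1_588_231.0],
                  [82.1, 1.91, 2_149_560.0],
                  [106.6, 2.03, 6_500_000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Manual standardization: subtract the per-column mean, divide by the
# per-column (population) standard deviation.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, manual))          # True
print(np.allclose(X_scaled.mean(axis=0), 0))  # True: each column has mean ~0
print(np.allclose(X_scaled.std(axis=0), 1))   # True: and std ~1
```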

Sklearn’s predict vs transform

model.fit(X_train, y_train)
X_train_predictions = model.predict(X_train)


transformer.fit(X_train)  # y is optional here; transformers ignore it
X_train_transformed = transformer.transform(X_train)

or, equivalently:

X_train_transformed = transformer.fit_transform(X_train)
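
A quick sketch (with synthetic data, since the dataset is not bundled here) confirming that `fit_transform` is just `fit` followed by `transform` in one call:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))  # hypothetical stand-in for X_train

# fit followed by transform ...
scaler = StandardScaler()
scaler.fit(X_toy)
out_two_step = scaler.transform(X_toy)

# ... gives the same result as fit_transform in one call.
out_one_step = StandardScaler().fit_transform(X_toy)
print(np.allclose(out_two_step, out_one_step))  # True
```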


knn_unscaled = KNeighborsClassifier()
knn_unscaled.fit(X_train, y_train);
print('Train score: ', (round(knn_unscaled.score(X_train, y_train), 2)))
print('Test score: ', (round(knn_unscaled.score(X_test, y_test), 2)))
Train score:  0.71
Test score:  0.45


knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train);
print('Train score: ', (round(knn_scaled.score(X_train_scaled, y_train), 2)))
print('Test score: ', (round(knn_scaled.score(X_test_scaled, y_test), 2)))
Train score:  0.94
Test score:  0.89
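
Transforming each split by hand works, but it becomes error-prone once cross-validation enters the picture: the scaler must be refit on every training fold. One common way to keep the transformer and the model in sync is sklearn's `make_pipeline` — a sketch with synthetic data standing in for the basketball features (the pipeline itself is an assumption here, not part of the code above):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate

# Hypothetical stand-in for the basketball data: three features on
# weight-, height-, and salary-like scales, with G/F labels.
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(40, 3)) * np.array([10, 0.1, 1e6])
y_toy = rng.choice(['G', 'F'], size=40)

# The pipeline scales, then classifies; cross_validate refits the
# scaler on each training fold, so nothing leaks from the validation fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_validate(pipe, X_toy, y_toy)
print(scores['test_score'].mean())
```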

Let’s apply what we learned!