Cross-validation

Single split problems

With a single train/validation split, our estimate of generalization performance rests on one random partition of the data. A lucky split can make the model look better than it really is, an unlucky one worse, and with a small dataset the validation set may be too small to give a reliable score.

So what do we do?

k-fold cross-validation

In k-fold cross-validation, we split the training data into k folds of roughly equal size. Each fold takes one turn as the validation set while the model is trained on the remaining k - 1 folds, so we get k validation scores (and a sense of their spread) instead of a single number.
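To make the idea concrete, here is a minimal sketch using scikit-learn's KFold splitter on a toy array of 10 examples (the array and the choice of 5 folds are just for illustration):

import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(10, 1)  # 10 toy examples, one feature each
for fold, (train_idx, valid_idx) in enumerate(KFold(n_splits=5).split(toy_X)):
    # each of the 5 folds takes one turn as the validation set
    print(f"Fold {fold}: train on {train_idx}, validate on {valid_idx}")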

Cross-validation using scikit-learn

import pandas as pd

df = pd.read_csv("data/canada_usa_cities.csv")
X = df.drop(columns=["country"])  # features
y = df["country"]                 # target: the country label for each city
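Before splitting, it can be worth a quick sanity check of the data (a minimal sketch; the feature columns are whatever the CSV contains besides country):

df.head()                     # peek at the first few rows
df["country"].value_counts()  # check the class balance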


from sklearn.model_selection import train_test_split

# hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=4)
# 5-fold cross-validation: one validation score per fold
cv_score = cross_val_score(model, X_train, y_train, cv=5)
cv_score
array([0.76470588, 0.82352941, 0.78787879, 0.78787879, 0.84848485])
# with cv=10 we get 10 scores, each from a smaller validation fold
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
cv_scores
array([0.76470588, 0.82352941, 0.70588235, 0.94117647, 0.82352941, 0.82352941, 0.70588235, 0.9375    , 0.9375    , 0.9375    ])


cv_scores.mean()
np.float64(0.8400735294117647)
cross_validate is similar to cross_val_score, but it also reports fit and score times, and can return the training scores:

from sklearn.model_selection import cross_validate

scores = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
scores
{'fit_time': array([0.00168347, 0.00167537, 0.00131798, 0.00130844, 0.00128198, 0.00131845, 0.00128341, 0.0012989 , 0.00128961, 0.00132298]),
 'score_time': array([0.00122404, 0.0011344 , 0.00103498, 0.00102401, 0.00101638, 0.00103545, 0.00100684, 0.00101876, 0.00103331, 0.00102472]),
 'test_score': array([0.76470588, 0.82352941, 0.70588235, 0.94117647, 0.82352941, 0.82352941, 0.70588235, 0.9375    , 0.9375    , 0.9375    ]),
 'train_score': array([0.91333333, 0.90666667, 0.90666667, 0.9       , 0.90666667, 0.91333333, 0.92      , 0.90066225, 0.90066225, 0.90066225])}


pd.DataFrame(scores)
fit_time score_time test_score train_score
0 0.001683 0.001224 0.764706 0.913333
1 0.001675 0.001134 0.823529 0.906667
2 0.001318 0.001035 0.705882 0.906667
... ... ... ... ...
7 0.001299 0.001019 0.937500 0.900662
8 0.001290 0.001033 0.937500 0.900662
9 0.001323 0.001025 0.937500 0.900662

10 rows × 4 columns

pd.DataFrame(scores).mean()
fit_time       0.001378
score_time     0.001055
test_score     0.840074
train_score    0.906865
dtype: float64


Note that cross_val_score returns only the validation scores; its mean matches the test_score mean above:

cross_val_score(model, X_train, y_train, cv=10).mean()
np.float64(0.8400735294117647)


We can also look at how much the scores vary across folds:

pd.DataFrame(scores).std()
fit_time       0.000160
score_time     0.000069
test_score     0.094993
train_score    0.006822
dtype: float64
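The test scores vary far more than the train scores because each validation fold is small (only 16 or 17 examples here). A common way to report cross-validation results is mean plus or minus standard deviation (a minimal sketch using the scores computed above; pandas' std is used so the numbers match the table):

df_scores = pd.DataFrame(scores)
print(f"CV accuracy: {df_scores['test_score'].mean():.3f} +/- {df_scores['test_score'].std():.3f}")
# e.g., 0.840 +/- 0.095 for the run above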

Our typical supervised learning setup is as follows (a code sketch of the full recipe appears after the list):


  1. We are given training data with features X and target y.
  2. We split our data into X_train, y_train, X_test, y_test.
  3. We optimize hyperparameters using cross-validation on X_train and y_train.
  4. We assess the best model using X_test and y_test.
  5. The test score tells us how well our model generalizes.
  6. If the test score is reasonable, we deploy the model.
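
As a concrete sketch of steps 3-5, using only the tools from this section (the candidate depths are arbitrary illustrative values):

# Step 3: pick the depth with the best mean cross-validation score
best_depth, best_score = None, -1.0
for depth in [1, 2, 3, 4, 5, 10]:  # illustrative candidate values
    mean_score = cross_val_score(
        DecisionTreeClassifier(max_depth=depth), X_train, y_train, cv=10
    ).mean()
    if mean_score > best_score:
        best_depth, best_score = depth, mean_score

# Step 4: refit the best model on all of the training data,
# then assess it on the held-out test set
best_model = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)

# Step 5: the test score estimates how well the model generalizes
print(best_depth, best_model.score(X_test, y_test))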

Let’s apply what we learned!