The fundamental tradeoff and the golden rule

Reminder:

  • score_train is our training score (or the mean train score from cross-validation).

  • score_valid is our validation score (or mean validation score from cross-validation).

  • score_test is our test score.

The “fundamental tradeoff” of supervised learning


As model complexity ↑, score_train ↑ and the gap score_train − score_valid tends to ↑.
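This can be seen concretely on a small synthetic dataset (a sketch, not the lecture's data: the dataset and the two depths compared here are illustrative):

```python
# Sketch of the fundamental tradeoff on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

gaps = {}
for depth in (1, 20):  # a very simple vs. a very complex tree
    scores = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X, y, cv=5, return_train_score=True,
    )
    train = scores["train_score"].mean()
    valid = scores["test_score"].mean()
    gaps[depth] = (train, train - valid)

# As depth grows, the train score rises and so does the train-validation gap.
print(gaps)
```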

How do we pick a model that will generalize better?

import pandas as pd
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data/canada_usa_cities.csv")
X = df.drop(columns=["country"])
y = df["country"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

results_dict = {"depth": [], "mean_train_score": [], "mean_cv_score": []}

for depth in range(1, 20):
    model = DecisionTreeClassifier(max_depth=depth)
    scores = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
    results_dict["depth"].append(depth)
    results_dict["mean_cv_score"].append(scores["test_score"].mean())
    results_dict["mean_train_score"].append(scores["train_score"].mean())

results_df = pd.DataFrame(results_dict)
results_df
    depth  mean_train_score  mean_cv_score
0       1          0.834349       0.809926
1       2          0.844989       0.804044
2       3          0.862967       0.804412
..    ...               ...            ...
16     17          1.000000       0.815074
17     18          1.000000       0.803309
18     19          1.000000       0.803309

19 rows × 3 columns

import altair as alt

source = results_df.melt(id_vars=['depth'],
                         value_vars=['mean_train_score', 'mean_cv_score'],
                         var_name='plot', value_name='score')
chart1 = alt.Chart(source).mark_line().encode(
    alt.X('depth:Q', axis=alt.Axis(title="Tree Depth")),
    alt.Y('score:Q'),
    alt.Color('plot:N', scale=alt.Scale(domain=['mean_train_score', 'mean_cv_score'],
                                        range=['teal', 'gold'])))
chart1
Train vs. cross-validation score as a function of tree depth.
results_df.sort_values('mean_cv_score', ascending=False).iloc[0]
depth               5.000000
mean_train_score    0.918848
mean_cv_score       0.845956
Name: 4, dtype: float64


best_depth = int(results_df.sort_values('mean_cv_score', ascending=False).iloc[0]['depth'])
best_depth
5
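As an aside, pandas can look up the best depth without a full sort via idxmax. A sketch, rebuilt here on toy numbers so the snippet runs on its own (the scores below are illustrative, not the lecture's output):

```python
import pandas as pd

# Toy stand-in for results_df; the numbers are made up for illustration.
results_df = pd.DataFrame({
    "depth": [1, 2, 3, 4, 5],
    "mean_cv_score": [0.81, 0.80, 0.80, 0.83, 0.85],
})

# idxmax returns the row label of the best CV score directly.
best_depth = int(results_df.loc[results_df["mean_cv_score"].idxmax(), "depth"])
print(best_depth)  # → 5 for this toy table
```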


model = DecisionTreeClassifier(max_depth=best_depth)
model.fit(X_train, y_train)
print("Score on test set: " + str(round(model.score(X_test, y_test), 2)))
Score on test set: 0.83

The Golden Rule

Even though we care most about the test score:

THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY
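To see why, here is a sketch of what goes wrong when the test set leaks into model selection. On pure-noise data (labels independent of the features), picking a hyperparameter by peeking at the test score can still produce a respectable-looking number, even though no real signal exists:

```python
# Sketch of a golden-rule violation on pure-noise data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = rng.randint(0, 2, size=200)  # labels carry no signal at all

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# VIOLATION: choosing max_depth by looking at the test score.
best = max(
    range(1, 15),
    key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
        .fit(X_train, y_train).score(X_test, y_test),
)
cheated = (DecisionTreeClassifier(max_depth=best, random_state=0)
           .fit(X_train, y_train).score(X_test, y_test))
print(round(cheated, 2))  # typically optimistic, despite random labels
```

Because the test set influenced the choice of max_depth, the final number is no longer an honest estimate of generalization performance.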



Golden rule violation: Example 1


Attribution: The Register - Katyanna Quach

Golden rule violation: Example 2


Attribution: MIT Technology Review - Tom Simonite

How can we avoid violating the golden rule?



Here is the workflow we’ll generally follow.

  • Splitting: Before doing anything, split the data X and y into X_train, X_test, y_train, y_test or train_df and test_df using train_test_split.
  • Select the best model using cross-validation: Use cross_validate with return_train_score=True so that we have access to the training scores in each fold (if we want to plot train vs. validation error, for instance).
  • Scoring on test data: Finally score on the test data with the chosen hyperparameters to examine the generalization performance.
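Put together, the three steps look roughly like this. A sketch on synthetic data so it runs stand-alone (the synthetic dataset is a stand-in for the cities data):

```python
# Sketch of the full workflow: split -> select by CV -> score once on test.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=123)

# 1. Split before doing anything else.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# 2. Select the hyperparameter by cross-validation on the training set only.
cv_scores = {
    depth: cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=123),
        X_train, y_train, cv=10, return_train_score=True,
    )["test_score"].mean()
    for depth in range(1, 20)
}
best_depth = max(cv_scores, key=cv_scores.get)

# 3. Refit on the full training set and score once on the test set.
model = DecisionTreeClassifier(max_depth=best_depth, random_state=123)
model.fit(X_train, y_train)
test_score = round(model.score(X_test, y_test), 2)
print(test_score)
```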

Let’s apply what we learned!