The fundamental tradeoff and the golden rule

Reminder:

  • score_train is our training score (or the mean train score from cross-validation).

  • score_valid is our validation score (or mean validation score from cross-validation).

  • score_test is our test score.

The “fundamental tradeoff” of supervised learning


As model complexity ↑, score_train ↑ and the gap score_train − score_valid tends to ↑.
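This can be seen concretely on a small synthetic dataset (a sketch, not the lecture's data: the dataset and the two depths compared here are illustrative):

```python
# Sketch of the fundamental tradeoff on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

gaps = {}
for depth in (1, 20):  # a very simple vs. a very complex tree
    scores = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X, y, cv=5, return_train_score=True,
    )
    train = scores["train_score"].mean()
    valid = scores["test_score"].mean()
    gaps[depth] = (train, train - valid)

# As depth grows, the train score rises and so does the train-validation gap.
print(gaps)
```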

How do we pick a model that will generalize better?

import pandas as pd
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data/canada_usa_cities.csv")
X = df.drop(columns=["country"])
y = df["country"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

results_dict = {"depth": [], "mean_train_score": [], "mean_cv_score": []}

for depth in range(1, 20):
    model = DecisionTreeClassifier(max_depth=depth)
    scores = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
    results_dict["depth"].append(depth)
    results_dict["mean_cv_score"].append(scores["test_score"].mean())
    results_dict["mean_train_score"].append(scores["train_score"].mean())

results_df = pd.DataFrame(results_dict)
results_df
    depth  mean_train_score  mean_cv_score
0       1          0.834349       0.809926
1       2          0.844989       0.804044
2       3          0.862967       0.804412
..    ...               ...            ...
16     17          1.000000       0.815074
17     18          1.000000       0.803309
18     19          1.000000       0.803309

19 rows × 3 columns

import altair as alt

source = results_df.melt(id_vars=['depth'],
                         value_vars=['mean_train_score', 'mean_cv_score'],
                         var_name='plot', value_name='score')
chart1 = alt.Chart(source).mark_line().encode(
    alt.X('depth:Q', axis=alt.Axis(title="Tree Depth")),
    alt.Y('score:Q'),
    alt.Color('plot:N', scale=alt.Scale(domain=['mean_train_score', 'mean_cv_score'],
                                        range=['teal', 'gold'])))
chart1
Train vs. cross-validation score as a function of tree depth.
results_df.sort_values('mean_cv_score', ascending=False).iloc[0]
depth               5.000000
mean_train_score    0.918848
mean_cv_score       0.845956
Name: 4, dtype: float64


best_depth = int(results_df.sort_values('mean_cv_score', ascending=False).iloc[0]['depth'])
best_depth
5
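As an aside, pandas can look up the best depth without a full sort via idxmax. A sketch, rebuilt here on toy numbers so the snippet runs on its own (the scores below are illustrative, not the lecture's output):

```python
import pandas as pd

# Toy stand-in for results_df; the numbers are made up for illustration.
results_df = pd.DataFrame({
    "depth": [1, 2, 3, 4, 5],
    "mean_cv_score": [0.81, 0.80, 0.80, 0.83, 0.85],
})

# idxmax returns the row label of the best CV score directly.
best_depth = int(results_df.loc[results_df["mean_cv_score"].idxmax(), "depth"])
print(best_depth)  # → 5 for this toy table
```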


model = DecisionTreeClassifier(max_depth=best_depth)
model.fit(X_train, y_train)
print("Score on test set: " + str(round(model.score(X_test, y_test), 2)))
Score on test set: 0.83

The Golden Rule

Even though we care most about the test score:

THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY
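To see why, here is a sketch of what goes wrong when the test set leaks into model selection. On pure-noise data (labels independent of the features), picking a hyperparameter by peeking at the test score can still produce a respectable-looking number, even though no real signal exists:

```python
# Sketch of a golden-rule violation on pure-noise data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = rng.randint(0, 2, size=200)  # labels carry no signal at all

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# VIOLATION: choosing max_depth by looking at the test score.
best = max(
    range(1, 15),
    key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
        .fit(X_train, y_train).score(X_test, y_test),
)
cheated = (DecisionTreeClassifier(max_depth=best, random_state=0)
           .fit(X_train, y_train).score(X_test, y_test))
print(round(cheated, 2))  # typically optimistic, despite random labels
```

Because the test set influenced the choice of max_depth, the final number is no longer an honest estimate of generalization performance.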



Golden rule violation: Example 1


Attribution: The Register - Katyanna Quach

Golden rule violation: Example 2


Attribution: MIT Technology Review - Tom Simonite

How can we avoid violating the golden rule?



Here is the workflow we’ll generally follow.

  • Splitting: Before doing anything, split the data X and y into X_train, X_test, y_train, y_test or train_df and test_df using train_test_split.
  • Select the best model using cross-validation: Use cross_validate with return_train_score=True so that we have access to the training scores in each fold (if we want to plot train vs. validation error, for instance).
  • Scoring on test data: Finally score on the test data with the chosen hyperparameters to examine the generalization performance.
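Put together, the three steps look roughly like this. A sketch on synthetic data so it runs stand-alone (the synthetic dataset is a stand-in for the cities data):

```python
# Sketch of the full workflow: split -> select by CV -> score once on test.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=123)

# 1. Split before doing anything else.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# 2. Select the hyperparameter by cross-validation on the training set only.
cv_scores = {
    depth: cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=123),
        X_train, y_train, cv=10, return_train_score=True,
    )["test_score"].mean()
    for depth in range(1, 20)
}
best_depth = max(cv_scores, key=cv_scores.get)

# 3. Refit on the full training set and score once on the test set.
model = DecisionTreeClassifier(max_depth=best_depth, random_state=123)
model.fit(X_train, y_train)
test_score = round(model.score(X_test, y_test), 2)
print(test_score)
```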

Let’s apply what we learned!