Automated Hyperparameter Optimization

The problem with hyperparameters

  • We may have a lot of them.
  • Picking reasonable hyperparameters is important: it helps avoid underfit or overfit models.
  • Nobody knows exactly how to choose them.
  • They may interact with each other in unexpected ways.
  • The best settings depend on the specific data/problem.
  • Searching over them can take a long time.

How to pick hyperparameters

Manual hyperparameter optimization

Advantages:

  • We may have some intuition about what might work.

Disadvantages:

  • It takes a lot of work.
  • In some cases, intuition might be worse than a data-driven approach.

Automated hyperparameter optimization

Advantages:

  • Reduce human effort.
  • Less prone to error.
  • Data-driven approaches may be effective.

Disadvantages:

  • It may be hard to incorporate intuition.
  • Overfitting on the validation set.

Automated hyperparameter optimization


Bring in the data

import pandas as pd
from sklearn.model_selection import train_test_split

cities_df = pd.read_csv("data/canada_usa_cities.csv")
train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
X_train, y_train = train_df.drop(columns=['country']), train_df['country']
X_test, y_test = test_df.drop(columns=['country']), test_df['country']
X_train.head()
     longitude  latitude
160   -76.4813   44.2307
127   -81.2496   42.9837
169   -66.0580   45.2788
188   -73.2533   45.3057
187   -67.9245   47.1652
param_grid = {
    "gamma": [0.1, 1.0, 10, 100],
    "C": [0.1, 1.0, 10, 100]
}


from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svc = SVC()
grid_search = GridSearchCV(svc, param_grid, cv=5, verbose=1, n_jobs=-1)


grid_search.fit(X_train, y_train);
Fitting 5 folds for each of 16 candidates, totalling 80 fits
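After fitting, the full grid of results can be inspected via `cv_results_` (a minimal sketch on toy data standing in for the cities dataset; the column names come from scikit-learn):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for the cities dataset
X, y = make_classification(n_samples=100, random_state=123)

param_grid = {"gamma": [0.1, 1.0, 10, 100], "C": [0.1, 1.0, 10, 100]}
gs = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
gs.fit(X, y)

# cv_results_ holds one row per candidate; sort by rank to see the winners first
results = pd.DataFrame(gs.cv_results_)[
    ["param_gamma", "param_C", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.head())
```

This is a handy way to see not just the best candidate but how sensitive the score is across the grid.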
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("svc", SVC())])


param_grid = {
    "svc__gamma": [0.1, 1.0, 10, 100],
    "svc__C": [0.1, 1.0, 10, 100]
}


grid_search = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train);
Fitting 5 folds for each of 16 candidates, totalling 80 fits
param_grid = {
    "svc__gamma": [0.1, 1.0, 10, 100],
    "svc__C": [0.1, 1.0, 10, 100]}
    
grid_search = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True, verbose=1)


for gamma in [0.1, 1.0, 10, 100]:
    for C in [0.1, 1.0, 10, 100]:
        for fold in folds:
            fit on the training portion with the given C and gamma
            score on the validation portion
        compute the average score across folds
pick the hyperparameter values with the best average score
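The pseudocode above amounts to the following (a minimal sketch using `cross_val_score` on toy data in place of the cities dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=123)

best_score, best_params = -1.0, None
for gamma in [0.1, 1.0, 10, 100]:
    for C in [0.1, 1.0, 10, 100]:
        # cross_val_score fits on each training portion and
        # scores on the corresponding validation portion
        scores = cross_val_score(SVC(gamma=gamma, C=C), X, y, cv=5)
        if scores.mean() > best_score:
            best_score, best_params = scores.mean(), {"gamma": gamma, "C": C}

print(best_params, best_score)
```

`GridSearchCV` does exactly this, plus parallelization (`n_jobs`) and a final refit on the whole training set.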


grid_search.fit(X_train, y_train);
Fitting 5 folds for each of 16 candidates, totalling 80 fits

Now what?

grid_search.best_params_
{'svc__C': 10, 'svc__gamma': 1.0}


grid_search.best_score_
np.float64(0.8208556149732621)
best_model = grid_search.best_estimator_

# Note: with the default refit=True, best_estimator_ has already been fit
# on all of X_train, so this refit is redundant (but harmless).
best_model.fit(X_train, y_train)

best_model.score(X_test, y_test)
0.8333333333333334


grid_search.score(X_test, y_test)
0.8333333333333334
best_model.predict(X_test)
array(['Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada',
       'Canada'], dtype=object)


grid_search.predict(X_test)
array(['Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada',
       'Canada'], dtype=object)



Notice any problems?

  • The required number of models to evaluate grows exponentially with the dimensionality of the configuration space.
  • Exhaustive search may become infeasible fairly quickly.
  • Example: suppose we have 5 hyperparameters and 10 candidate values for each hyperparameter.
    • That means we’ll be evaluating \(10^5=100,000\) models! That is, we’ll be calling cross_validate 100,000 times!
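The growth is easy to quantify: with \(k\) hyperparameters and \(v\) candidate values each, grid search evaluates \(v^k\) candidates, times the number of CV folds. A quick check with a hypothetical grid:

```python
from itertools import product

# Hypothetical grid: 5 hyperparameters, 10 candidate values each
grid = {f"hp{i}": list(range(10)) for i in range(5)}

n_candidates = len(list(product(*grid.values())))
print(n_candidates)        # 10**5 = 100,000 candidates
print(n_candidates * 5)    # with 5-fold CV: 500,000 model fits
```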

Enter randomized hyperparameter search!

from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "svc__gamma": [0.1, 1.0, 10, 100],
    "svc__C": [0.1, 1.0, 10, 100]
}


random_search = RandomizedSearchCV(pipe, param_grid, cv=5, verbose=1, n_jobs=-1, n_iter=10)
random_search.fit(X_train, y_train);
Fitting 5 folds for each of 10 candidates, totalling 50 fits


random_search.score(X_test, y_test)
0.8333333333333334

Extra (optional slide)

import scipy

param_grid = {
    "svc__C": scipy.stats.uniform(0, 100),
    "svc__gamma": scipy.stats.uniform(0, 100)}


random_gs = RandomizedSearchCV(pipe, param_grid, n_jobs=-1, cv=10, return_train_score=True, n_iter=10)
random_gs.fit(X_train, y_train);
random_gs.best_params_
{'svc__C': np.float64(88.18996062865577),
 'svc__gamma': np.float64(9.929233744844247)}


random_gs.best_score_
np.float64(0.7569852941176471)


random_gs.score(X_test, y_test)
0.7380952380952381
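Since `C` and `gamma` are usually tuned over several orders of magnitude, sampling them log-uniformly with `scipy.stats.loguniform` is often a better choice than `uniform` (a sketch on toy data; the ranges here are illustrative, not tuned):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=123)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Sample C and gamma log-uniformly over four orders of magnitude,
# so small values like 0.01 are as likely as large ones like 100
param_dist = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-2, 1e2),
}

rs = RandomizedSearchCV(pipe, param_dist, n_iter=10, cv=5, random_state=123)
rs.fit(X, y)
print(rs.best_params_)
```

With `uniform(0, 100)`, roughly 99% of the samples land above 1, so small values of `C` and `gamma` are barely explored; `loguniform` fixes that.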

How different do they score?

grid_search.score(X_test, y_test)
0.8333333333333334


random_search.score(X_test, y_test)
0.8333333333333334

Overfitting on the validation set

Overfitting of the training score (parameter learning):

  • During learning, we could search over tons of different decision trees.
  • So, we can get “lucky” and find one with a high training score by chance.
    • This is “overfitting of the training score”.

Overfitting of the validation score (hyperparameter learning):

  • Here, we might optimize the validation score over 100 values of max_depth.
  • One of the 100 trees might have a high validation score by chance.

Consider a multiple-choice (a,b,c,d) “test” with 10 questions:

  • If you choose answers randomly, the expected grade is 25% (no bias).
  • If you fill out two tests randomly and pick the best, the expected grade is 33%.
    • That’s an optimization bias of ~8%.
  • If you take the best among 10 random tests, the expected grade is ~47%.
  • If you take the best among 100, the expected grade is ~62%.
  • If you take the best among 1000, the expected grade is ~73%.
    • You have so many “chances” that you expect to do well.

But on new questions, the “random choice” accuracy is still 25%.

  • If we instead used a 100-question test then:

    • Expected grade from best over 1 randomly-filled test is 25%.
    • Expected grade from best over 2 randomly-filled tests is ~27%.
    • Expected grade from best over 10 randomly-filled tests is ~32%.
    • Expected grade from best over 100 randomly-filled tests is ~36%.
    • Expected grade from best over 1000 randomly-filled tests is ~40%.
  • The optimization bias grows with the number of things we try.

  • But, optimization bias shrinks quickly with the number of validation examples.

    • Still, it’s non-zero and it grows if you over-use your validation set!
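The test analogy can be simulated directly (a minimal sketch with NumPy; the trial count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_random_grade(n_questions, n_tests, n_trials=2000):
    """Expected best grade over n_tests randomly-filled 4-choice tests."""
    # Each answer is independently correct with probability 1/4
    correct = rng.random((n_trials, n_tests, n_questions)) < 0.25
    # Grade each test, keep the best per trial, average over trials
    return correct.mean(axis=2).max(axis=1).mean()

for n_tests in [1, 10, 100]:
    # Should roughly reproduce the ~25%, ~47%, ~62% figures above
    print(n_tests, round(best_random_grade(10, n_tests), 2))
```

The simulation makes the mechanism concrete: the max over more random tests drifts further above the true 25% accuracy, which is exactly what happens when we pick the best of many hyperparameter settings on the same validation set.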




Let’s apply what we learned!