Automated Hyperparameter Optimization

The problem with hyperparameters

  • We may have a lot of them.
  • Picking reasonable hyperparameters is important: it helps avoid underfit or overfit models.
  • Nobody knows exactly how to choose them.
  • They may interact with each other in unexpected ways.
  • The best settings depend on the specific data/problem.
  • Searching over them can take a long time.

How to pick hyperparameters

Manual hyperparameter optimization

Advantages:

  • We may have some intuition about what might work.

Disadvantages:

  • It takes a lot of work.
  • In some cases, intuition might be worse than a data-driven approach.

Automated hyperparameter optimization

Advantages:

  • Reduce human effort.
  • Less prone to error.
  • Data-driven approaches may be effective.

Disadvantages:

  • It may be hard to incorporate intuition.
  • Overfitting on the validation set.

Automated hyperparameter optimization


Bring in the data

import pandas as pd
from sklearn.model_selection import train_test_split

cities_df = pd.read_csv("data/canada_usa_cities.csv")
train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
X_train, y_train = train_df.drop(columns=['country']), train_df['country']
X_test, y_test = test_df.drop(columns=['country']), test_df['country']
X_train.head()
     longitude  latitude
160   -76.4813   44.2307
127   -81.2496   42.9837
169   -66.0580   45.2788
188   -73.2533   45.3057
187   -67.9245   47.1652
param_grid = {
    "gamma": [0.1, 1.0, 10, 100],
    "C": [0.1, 1.0, 10, 100]
}


from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svc = SVC()
grid_search = GridSearchCV(svc, param_grid, cv=5, verbose=1, n_jobs=-1)


grid_search.fit(X_train, y_train);
Fitting 5 folds for each of 16 candidates, totalling 80 fits
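After fitting, the full grid of results can be inspected via `cv_results_` (a minimal sketch on toy data standing in for the cities dataset; the column names come from scikit-learn):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for the cities dataset
X, y = make_classification(n_samples=100, random_state=123)

param_grid = {"gamma": [0.1, 1.0, 10, 100], "C": [0.1, 1.0, 10, 100]}
gs = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
gs.fit(X, y)

# cv_results_ holds one row per candidate; sort by rank to see the winners first
results = pd.DataFrame(gs.cv_results_)[
    ["param_gamma", "param_C", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.head())
```

This is a handy way to see not just the best candidate but how sensitive the score is across the grid.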
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("svc", SVC())])


param_grid = {
    "svc__gamma": [0.1, 1.0, 10, 100],
    "svc__C": [0.1, 1.0, 10, 100]
}


grid_search = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train);
Fitting 5 folds for each of 16 candidates, totalling 80 fits
param_grid = {
    "svc__gamma": [0.1, 1.0, 10, 100],
    "svc__C": [0.1, 1.0, 10, 100]}
    
grid_search = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True, verbose=1)


for gamma in [0.1, 1.0, 10, 100]:
    for C in [0.1, 1.0, 10, 100]:
        for fold in folds:
            fit on the training portion with the given C and gamma
            score on the validation portion
        compute the average score across folds
pick the hyperparameter values with the best average score
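The pseudocode above amounts to the following (a minimal sketch using `cross_val_score` on toy data in place of the cities dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=123)

best_score, best_params = -1.0, None
for gamma in [0.1, 1.0, 10, 100]:
    for C in [0.1, 1.0, 10, 100]:
        # cross_val_score fits on each training portion and
        # scores on the corresponding validation portion
        scores = cross_val_score(SVC(gamma=gamma, C=C), X, y, cv=5)
        if scores.mean() > best_score:
            best_score, best_params = scores.mean(), {"gamma": gamma, "C": C}

print(best_params, best_score)
```

`GridSearchCV` does exactly this, plus parallelization (`n_jobs`) and a final refit on the whole training set.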


grid_search.fit(X_train, y_train);
Fitting 5 folds for each of 16 candidates, totalling 80 fits

Now what?

grid_search.best_params_
{'svc__C': 10, 'svc__gamma': 1.0}


grid_search.best_score_
np.float64(0.8208556149732621)
best_model = grid_search.best_estimator_

# Note: with the default refit=True, best_estimator_ has already been fit
# on all of X_train, so this refit is redundant (but harmless).
best_model.fit(X_train, y_train)

best_model.score(X_test, y_test)
0.8333333333333334


grid_search.score(X_test, y_test)
0.8333333333333334
best_model.predict(X_test)
array(['Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada',
       'Canada'], dtype=object)


grid_search.predict(X_test)
array(['Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada',
       'Canada'], dtype=object)



Notice any problems?

  • The required number of models to evaluate grows exponentially with the dimensionality of the configuration space.
  • Exhaustive search may become infeasible fairly quickly.
  • Example: suppose we have 5 hyperparameters and 10 candidate values for each hyperparameter.
    • That means we’ll be evaluating \(10^5=100,000\) models! That is, we’ll be calling cross_validate 100,000 times!
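The growth is easy to quantify: with \(k\) hyperparameters and \(v\) candidate values each, grid search evaluates \(v^k\) candidates, times the number of CV folds. A quick check with a hypothetical grid:

```python
from itertools import product

# Hypothetical grid: 5 hyperparameters, 10 candidate values each
grid = {f"hp{i}": list(range(10)) for i in range(5)}

n_candidates = len(list(product(*grid.values())))
print(n_candidates)        # 10**5 = 100,000 candidates
print(n_candidates * 5)    # with 5-fold CV: 500,000 model fits
```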

Enter randomized hyperparameter search!

from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "svc__gamma": [0.1, 1.0, 10, 100],
    "svc__C": [0.1, 1.0, 10, 100]
}


random_search = RandomizedSearchCV(pipe, param_grid, cv=5, verbose=1, n_jobs=-1, n_iter=10)
random_search.fit(X_train, y_train);
Fitting 5 folds for each of 10 candidates, totalling 50 fits


random_search.score(X_test, y_test)
0.8333333333333334

Extra (optional slide)

import scipy

param_grid = {
    "svc__C": scipy.stats.uniform(0, 100),
    "svc__gamma": scipy.stats.uniform(0, 100)}


random_gs = RandomizedSearchCV(pipe, param_grid, n_jobs=-1, cv=10, return_train_score=True, n_iter=10)
random_gs.fit(X_train, y_train);
random_gs.best_params_
{'svc__C': np.float64(88.18996062865577),
 'svc__gamma': np.float64(9.929233744844247)}


random_gs.best_score_
np.float64(0.7569852941176471)


random_gs.score(X_test, y_test)
0.7380952380952381
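Since `C` and `gamma` are usually tuned over several orders of magnitude, sampling them log-uniformly with `scipy.stats.loguniform` is often a better choice than `uniform` (a sketch on toy data; the ranges here are illustrative, not tuned):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=123)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Sample C and gamma log-uniformly over four orders of magnitude,
# so small values like 0.01 are as likely as large ones like 100
param_dist = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-2, 1e2),
}

rs = RandomizedSearchCV(pipe, param_dist, n_iter=10, cv=5, random_state=123)
rs.fit(X, y)
print(rs.best_params_)
```

With `uniform(0, 100)`, roughly 99% of the samples land above 1, so small values of `C` and `gamma` are barely explored; `loguniform` fixes that.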

How different do they score?

grid_search.score(X_test, y_test)
0.8333333333333334


random_search.score(X_test, y_test)
0.8333333333333334

Overfitting on the validation set

Overfitting of the training score (parameter learning):

  • During learning, we could search over tons of different decision trees.
  • So, we can get “lucky” and find one with a high training score by chance.
    • This is “overfitting of the training score”.

Overfitting of the validation score (hyperparameter learning):

  • Here, we might optimize the validation score over 100 values of max_depth.
  • One of the 100 trees might have a high validation score by chance.

Consider a multiple-choice (a,b,c,d) “test” with 10 questions:

  • If you choose answers randomly, the expected grade is 25% (no bias).
  • If you fill out two tests randomly and pick the best, the expected grade is 33%.
    • That’s an optimization bias of ~8%.
  • If you take the best among 10 random tests, the expected grade is ~47%.
  • If you take the best among 100, the expected grade is ~62%.
  • If you take the best among 1000, the expected grade is ~73%.
    • You have so many “chances” that you expect to do well.

But on new questions, the “random choice” accuracy is still 25%.

  • If we instead used a 100-question test then:

    • Expected grade from best over 1 randomly-filled test is 25%.
    • Expected grade from best over 2 randomly-filled tests is ~27%.
    • Expected grade from best over 10 randomly-filled tests is ~32%.
    • Expected grade from best over 100 randomly-filled tests is ~36%.
    • Expected grade from best over 1000 randomly-filled tests is ~40%.
  • The optimization bias grows with the number of things we try.

  • But, optimization bias shrinks quickly with the number of validation examples.

    • Still, it’s non-zero and it grows if you over-use your validation set!
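The test analogy can be simulated directly (a minimal sketch with NumPy; the trial count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_random_grade(n_questions, n_tests, n_trials=2000):
    """Expected best grade over n_tests randomly-filled 4-choice tests."""
    # Each answer is independently correct with probability 1/4
    correct = rng.random((n_trials, n_tests, n_questions)) < 0.25
    # Grade each test, keep the best per trial, average over trials
    return correct.mean(axis=2).max(axis=1).mean()

for n_tests in [1, 10, 100]:
    # Should roughly reproduce the ~25%, ~47%, ~62% figures above
    print(n_tests, round(best_random_grade(10, n_tests), 2))
```

The simulation makes the mechanism concrete: the max over more random tests drifts further above the true 25% accuracy, which is exactly what happens when we pick the best of many hyperparameter settings on the same validation set.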




Let’s apply what we learned!