Grid search is a method for performing hyperparameter tuning for a model. The technique involves identifying one or more hyperparameters that we would like to tune and selecting a set of candidate values for each. We then evaluate every possible combination of these values by performing some type of validation, typically cross-validation, to generate an out-of-sample performance estimate for each combination. Finally, we select the combination of hyperparameter values that produces the highest cross-validation score.
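Before turning to Scikit-Learn's built-in tool for this, it can be helpful to see what grid search amounts to when done by hand. The sketch below is illustrative only: it builds its own small synthetic dataset, loops over a handful of hypothetical candidate values for the regularization parameter `C` of a logistic regression model, cross-validates each one, and keeps the value with the best average score. This is exactly the bookkeeping that `GridSearchCV` automates for us.
# A minimal, hand-rolled version of grid search (illustrative sketch only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

best_score, best_C = -1, None
for C in [0.01, 0.1, 1, 10]:                          # hypothetical candidate values
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000),
                             X_demo, y_demo, cv=5, scoring='accuracy')
    if scores.mean() > best_score:                    # keep the best average CV score
        best_score, best_C = scores.mean(), C

print('Best C:', best_C, 'CV accuracy:', best_score)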
We will illustrate how to perform grid search in Scikit-Learn in this lesson. We begin by importing a few packages and tools that are not directly related to grid search.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
In this lesson, we will work with a synthetic dataset created for a classification problem. The dataset contains 400 observations, each of which has 6 features and is assigned one of 10 possible classes. The features are stored in an array named `X`, while the labels are stored in an array named `y`. We start by viewing the contents of `X` in a DataFrame format.
np.random.seed(1)
X, y = make_classification(n_samples=400, n_features=6, n_informative=6,
                           n_redundant=0, n_classes=10, class_sep=2)
pd.DataFrame(X)
Let's now view the first few elements of `y`.
print(y[:20])
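Since the dataset was generated with 10 classes, we can also quickly check how the 400 observations are distributed across those classes by counting the label frequencies.
print(np.bincount(y))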
Grid search is performed in Scikit-Learn using the `GridSearchCV` class. We will import this class in the cell below.
from sklearn.model_selection import GridSearchCV
We will illustrate the usage of `GridSearchCV` by first performing hyperparameter tuning to select the optimal value of the regularization parameter `C` in a logistic regression model.
We start by defining a parameter grid. This is a dictionary whose keys are the names of the hyperparameters we wish to tune; the value associated with each key is a list or array of values to consider for that hyperparameter. `GridSearchCV` also accepts a list of such dictionaries, which is the form we use below.
param_grid = [
    {'C': 10**np.linspace(-3,3,20)}
]
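The expression `10**np.linspace(-3,3,20)` generates 20 candidate values of `C` that are evenly spaced on a logarithmic scale between 0.001 and 1000. We can print them to see exactly which values the grid search will consider.
print(param_grid[0]['C'])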
We then create an instance of the estimator that we wish to tune. In this case, that is the `LogisticRegression` class. Note that we do not fit the model to the training data yet.
lin_reg = LogisticRegression(solver='lbfgs', max_iter=1000)
We then create an instance of the `GridSearchCV` class. When creating this instance, we must provide an estimator, a parameter grid, the number of folds to use in cross-validation, and an evaluation metric to use when scoring each model. If we specify `refit=True` (which is the default value), then `GridSearchCV` will automatically fit the best model found to the entire data set. We will discuss this more later.
After creating an instance of `GridSearchCV`, we train it using the `fit` method.
A trained `GridSearchCV` object has many attributes and methods that we might be interested in. We will explore these in more detail later, but for now, the most important attributes are `best_score_` and `best_params_`. The `best_score_` attribute will contain the cross-validation score for the best model found, while `best_params_` will be a dictionary of the hyperparameter values that generated the optimal cross-validation score.
lr_gridsearch = GridSearchCV(lin_reg, param_grid, cv=10, scoring='accuracy',
                             refit=True)
lr_gridsearch.fit(X, y)
print(lr_gridsearch.best_score_)
print(lr_gridsearch.best_params_)
We see that the highest cross-validation score obtained for any of the values of `C` considered was 62.7%. This was obtained by using `C = 0.16237767`.
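To get a sense of how sensitive the model is to the choice of `C`, we can plot the average cross-validation score for each candidate value. The sketch below reads these averages from the `cv_results_` attribute of the fitted grid search object, which we discuss in more detail near the end of this lesson; the rows of `cv_results_` appear in the same order as the values in our parameter grid.
c_values = 10**np.linspace(-3,3,20)                    # same values as in param_grid
cv_scores = lr_gridsearch.cv_results_['mean_test_score']

plt.figure(figsize=[6,4])
plt.plot(c_values, cv_scores)
plt.xscale('log')                                      # C was generated on a log scale
plt.xlabel('C')
plt.ylabel('Mean CV Accuracy')
plt.show()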
When trained, the `GridSearchCV` class will automatically refit a final model to the full training set using the optimal hyperparameter values found. This refit model is stored in the attribute `best_estimator_`.
In the cell below, we extract the best model from our `GridSearchCV` object and use it to calculate the training accuracy for this model.
lr_model = lr_gridsearch.best_estimator_
print('Training Score:', lr_model.score(X, y))
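Because we set `refit=True`, the `GridSearchCV` object itself also exposes `predict` and `score` methods that delegate to the refit best model, so extracting `best_estimator_` is convenient but not strictly necessary. For example, the following line produces the same training accuracy as above.
print('Training Score:', lr_gridsearch.score(X, y))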
We will now illustrate how to use `GridSearchCV` to perform hyperparameter tuning for a decision tree. We will tune over two hyperparameters: `max_depth` and `min_samples_leaf`.
param_grid = [{
    'max_depth': [2, 4, 8, 16, 32, 64],
    'min_samples_leaf': [2, 4, 8, 16]
}]
tree = DecisionTreeClassifier()
np.random.seed(1)
dt_gridsearch = GridSearchCV(tree, param_grid, cv=10, scoring='accuracy',
                             refit=True)
dt_gridsearch.fit(X, y)
print(dt_gridsearch.best_score_)
print(dt_gridsearch.best_params_)
The decision tree with the highest cross-validation score had a `max_depth` of 32 and a `min_samples_leaf` of 8. Notice that this model outperforms the best logistic regression model that we found above. In the cell below, we extract the best model from the `GridSearchCV` object, and calculate its score on the training set.
dt_model = dt_gridsearch.best_estimator_
print('Training Score:', dt_model.score(X, y))
We will now illustrate how to use `GridSearchCV` to perform hyperparameter tuning for a random forest. We will tune over two hyperparameters: `max_depth` and `min_samples_leaf`. We will set the `n_estimators` hyperparameter to 200.
param_grid = [{
    'max_depth': [2, 4, 8, 16, 32, 64],
    'min_samples_leaf': [2, 4, 8, 16]
}]
forest = RandomForestClassifier(n_estimators=200)
np.random.seed(1)
rf_gridsearch = GridSearchCV(forest, param_grid, cv=10, scoring='accuracy',
                             refit=True)
rf_gridsearch.fit(X, y)
print(rf_gridsearch.best_score_)
print(rf_gridsearch.best_params_)
The random forest with the highest cross-validation score had a `max_depth` of 8 and a `min_samples_leaf` of 4. This model outperforms either of our previous two models. In the cell below, we extract the best model from the `GridSearchCV` object, and calculate its score on the training set.
rf_model = rf_gridsearch.best_estimator_
print('Training Score:', rf_model.score(X, y))
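Now that all three grid searches have been fit, we can put their best cross-validation scores side by side to compare the tuned models directly.
print('Logistic Regression:', lr_gridsearch.best_score_)
print('Decision Tree:      ', dt_gridsearch.best_score_)
print('Random Forest:      ', rf_gridsearch.best_score_)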
If we would like to see more detailed information about the grid search process, we can find it in the `cv_results_` attribute of a trained instance of the `GridSearchCV` class. This attribute contains a dictionary with several pieces of information pertaining to the results of the cross-validation steps. We will start by looking at the keys of the items stored in this dictionary.
cv_res = rf_gridsearch.cv_results_
print(cv_res.keys())
The items `split0_test_score` through `split9_test_score` each contain the validation scores of every candidate model on one particular fold. The average validation score for each individual model can be found in the `mean_test_score` item.
print(cv_res['mean_test_score'])
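As a quick sanity check, each entry of `mean_test_score` is simply the average of the corresponding values in `split0_test_score` through `split9_test_score`. Below we recompute that average by hand for the first hyperparameter combination.
split_scores = [cv_res['split%i_test_score' % i][0] for i in range(10)]
print(np.mean(split_scores))
print(cv_res['mean_test_score'][0])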
In the cell below, we print the average test scores along with the hyperparameter values for the models that generated them.
for score, params in zip(cv_res['mean_test_score'], cv_res['params']):
    print(score, params)
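Since `cv_results_` is a dictionary of equal-length arrays, it can also be loaded into a DataFrame, which makes it easy to sort the candidate models by their average validation scores. The sketch below keeps only a few of the available columns and shows the five best combinations.
res_df = pd.DataFrame(cv_res)
cols = ['param_max_depth', 'param_min_samples_leaf', 'mean_test_score', 'std_test_score']
res_df[cols].sort_values('mean_test_score', ascending=False).head()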
We see that although the `max_depth=8`, `min_samples_leaf=4` model performed the best, there were a few other models that had very similar results.
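One way to judge whether those differences are meaningful is to look at the `std_test_score` entry of `cv_results_`, which holds the standard deviation of each model's scores across the ten folds. Models whose average scores differ by much less than these standard deviations are difficult to distinguish from one another.
for mean, std, params in zip(cv_res['mean_test_score'],
                             cv_res['std_test_score'],
                             cv_res['params']):
    print('%.3f +/- %.3f' % (mean, std), params)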