Comprehensive Guide on Grid Search

Last updated: Mar 9, 2022
Tags: Machine Learning, Python

What is grid search?

Grid search is a brute-force technique to find the optimal hyper-parameters for model building. Finding the optimal hyper-parameters is extremely important in machine learning because the final performance of a model will depend largely on the hyper-parameters. Grid search simply trains and evaluates a model based on the chosen values of hyper-parameters, and then selects the model that performs the best.
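At its core, grid search is just a loop: for each candidate hyper-parameter value, train and score a model, and keep the best one. The sketch below illustrates this for a single hyper-parameter; evaluate() is a hypothetical placeholder for training a model and returning its score:

def evaluate(value):
    # Hypothetical placeholder: in practice this would train a model using
    # the candidate hyper-parameter value and return its score (e.g. accuracy)
    dummy_scores = {2: 0.90, 3: 0.95, 4: 0.93}
    return dummy_scores[value]

best_value, best_score = None, float("-inf")
for value in [2, 3, 4]:   # candidate values of a single hyper-parameter
    score = evaluate(value)
    if score > best_score:
        best_value, best_score = value, score

print(best_value, best_score)   # 3 0.95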

As an example, consider a Random Forest classifier. Two of its hyper-parameters are as follows:

  • max_depth: the maximum depth that each decision tree in the forest is allowed to grow to.

  • max_features: the number of features considered at random when looking for the best split.

For grid search, we supply values for these hyper-parameters that we want to test. For instance, suppose we wanted to test out the following values:

max_depth: [2,3]
max_features: [1,2,3]

Grid search will then select every combination of the hyper-parameters (like a grid) and build the model for each combination using cross validation to obtain its performance. In this case, grid search will test out the following 6 combinations of hyper-parameters:

max_depth   max_features
2           1
2           2
2           3
3           1
3           2
3           3

For each model built, cross validation will return a performance metric, and therefore grid search will return the combination of hyper-parameters with the best performance.
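To make the enumeration concrete, the short sketch below lists the same 6 combinations using Python's itertools (the variable names are just for illustration):

from itertools import product

param_grid = {
    "max_depth": [2, 3],
    "max_features": [1, 2, 3]
}

# Every pairing of the supplied values - this is the "grid" in grid search
for max_depth, max_features in product(param_grid["max_depth"], param_grid["max_features"]):
    print(max_depth, max_features)

Running this prints the 6 pairs from the table above, one per line.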

Using Python's sklearn to implement grid search

Suppose we wanted to classify the type of an iris given four features (e.g. sepal length) using a Random Forest classifier. As explained above, we can use grid search to tune the two hyper-parameters: max_depth and max_features.

We begin by importing the relevant modules:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn import datasets
import pandas as pd
import numpy as np

We then load the Iris dataset and convert it into a Pandas DataFrame:

bunch_iris = datasets.load_iris()
# Construct a DataFrame from the Bunch Object
data = pd.DataFrame(data=np.c_[bunch_iris['data'], bunch_iris['target']],
                    columns=bunch_iris['feature_names'] + ['target'])
data.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2     0.0
1                4.9               3.0                1.4               0.2     0.0
2                4.7               3.2                1.3               0.2     0.0
3                4.6               3.1                1.5               0.2     0.0
4                5.0               3.6                1.4               0.2     0.0

We then split the data into training and testing sets:

# Break into X (features) and y (target)
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2000)
print("Number of rows of X_train:", X_train.shape[0])
print("Number of rows of y_train:", y_train.shape[0])
print("Number of rows of X_test:", X_test.shape[0])
print("Number of rows of y_test:", y_test.shape[0])
Number of rows of X_train: 120
Number of rows of y_train: 120
Number of rows of X_test: 30
Number of rows of y_test: 30

We then use the training set to perform grid search:

param_grid = {
"max_depth":[2,3],
"max_features": [1,2,3]
}

# random_state is like the seed - this is for reproducible results
model = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)
print("Best hyper-parameters:" + str(grid_search.best_params_))
print("Best score:" + str(grid_search.best_score_))
Best hyper-parameters:{'max_depth': 2, 'max_features': 1}
Best score:0.9833333333333334

Here, note the following:

  • Just like in our previous example, we are testing out 6 different combinations of hyper-parameters. Make sure that the keys of param_grid match the keyword arguments of the model's constructor - in this case, RandomForestClassifier accepts max_depth and max_features as keyword arguments.

  • Since GridSearchCV uses cross validation to obtain the performance metric, we need to specify the number of folds with cv.

  • The results of the grid search tell us that the best combination of hyper-parameters is max_depth=2 and max_features=1. With these hyper-parameters, the cross validation accuracy is roughly 0.98. The snippet after this list shows how to inspect the score of every combination, not just the best one.
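Should you want the full picture rather than just the best result, GridSearchCV stores the outcome of every tested combination in its cv_results_ attribute. A minimal sketch, assuming grid_search has been fitted as above:

# Mean cross validation score and ranking of each of the 6 combinations
results = pd.DataFrame(grid_search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])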

Now that we have obtained the optimal hyper-parameters, we can build our model using the entire training set and get our final performance metric using the testing set:

model_optimal = RandomForestClassifier(max_depth=2, max_features=1, random_state=42)
model_optimal.fit(X_train, y_train)
y_test_predicted = model_optimal.predict(X_test)
print(classification_report(y_test, y_test_predicted))
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00         8
         1.0       0.88      0.70      0.78        10
         2.0       0.79      0.92      0.85        12

    accuracy                           0.87        30
   macro avg       0.89      0.87      0.87        30
weighted avg       0.87      0.87      0.86        30

We see that the classification accuracy based on the testing set is 0.87.
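As a side note, GridSearchCV refits the model with the best hyper-parameters on the entire training set by default (refit=True), so the tuned model is also available directly without constructing a new classifier by hand:

# The refitted best model - equivalent to model_optimal above
y_test_predicted = grid_search.best_estimator_.predict(X_test)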

WARNING

It is generally recommended that only the training set is used for hyper-parameter tuning and model selection.

Published by Isshin Inada