Andrew Roth (Slides adapted from Varada Kolhatkar and Firas Moosvi)
Announcements
Important information about midterm 1
https://piazza.com/class/m4ujp0s4xgm5o5/post/204
HW3 is due next week Monday, Feb 3rd, 11:59 pm.
Reminder: my office hours are
Tuesdays from 12:30 to 1:30 in my office, ICCS 353
Learning outcomes
Explain the need for hyperparameter optimization
Carry out hyperparameter optimization using sklearn’s GridSearchCV and RandomizedSearchCV
Explain different hyperparameters of GridSearchCV
Explain the importance of selecting a good range of values for each hyperparameter.
Explain optimization bias
Identify when reported accuracies can and cannot be trusted, and reason about why
Recap
Recap: Logistic regression
A linear model used for binary classification tasks.
(Optional) There is an extension of logistic regression called multinomial logistic regression for multiclass classification.
Parameters:
Coefficients (Weights): The model learns a coefficient or a weight associated with each feature that represents its importance.
Bias (Intercept): A constant term added to the linear combination of features and their coefficients.
Recap: Logistic regression
The model computes a weighted sum of the input features’ values, adjusted by their respective coefficients and the bias term.
This weighted sum is passed through a sigmoid function to transform it into a probability score, indicating the likelihood of the input belonging to the “positive” class.
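In symbols, the predicted probability is

\[
\hat{p} = \frac{1}{1 + \exp\left(-\left(\sum_{j=1}^{d} w_j x_j + b\right)\right)}
\]

where: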
\(\hat{p}\) is the predicted probability of the example belonging to the positive class.
\(w_j\) is the learned weight associated with feature \(j\)
\(x_j\) is the value of the input feature \(j\)
\(b\) is the bias term
Recap: Logistic regression
For a dataset with \(d\) features, the decision boundary that separates the classes is a \(d-1\) dimensional hyperplane.
Complexity hyperparameter: C in sklearn.
Higher C \(\rightarrow\) more complex model, meaning larger coefficients
Lower C \(\rightarrow\) less complex model, meaning smaller coefficients
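A minimal sketch of this effect on synthetic data (the dataset and the C values below are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

for C in [0.01, 1.0, 100.0]:
    lr = LogisticRegression(C=C).fit(X, y)
    # Lower C -> stronger regularization -> smaller coefficient magnitudes
    print(f"C={C:>6}: mean |coefficient| = {np.abs(lr.coef_).mean():.3f}")
```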
Interpretation of coefficients in linear models
The \(j\)th coefficient tells us how feature \(j\) affects the prediction
if \(w_j > 0\) then increasing \(x_{ij}\) moves us toward predicting \(+1\)
if \(w_j < 0\) then increasing \(x_{ij}\) moves us toward predicting \(-1\)
if \(w_j = 0\) then the feature is not used in making a prediction
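A small sketch on a hypothetical toy dataset, reading off the learned coefficients:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two numeric features, binary target
X = pd.DataFrame({"feat_a": [1.0, 2.0, 3.0, 4.0],
                  "feat_b": [4.0, 3.0, 2.0, 1.0]})
y = [0, 0, 1, 1]

lr = LogisticRegression().fit(X, y)
# A positive weight means larger feature values push the prediction toward
# the positive class; a negative weight pushes it toward the other class.
for name, w in zip(lr.feature_names_in_, lr.coef_[0]):
    print(f"{name}: {w:+.3f}")
```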
Importance of scaling
When you are interpreting the model coefficients, scaling is crucial.
If you do not scale the data, features with smaller magnitudes tend to get coefficients with larger magnitudes, whereas features with larger magnitudes tend to get coefficients with smaller magnitudes, which makes the coefficients hard to compare.
That said, when you scale the data, feature values become hard to interpret for humans!
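A sketch on synthetic data (the exaggerated feature scale is artificial) contrasting coefficients with and without scaling:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data where one feature is on a much larger scale
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X[:, 0] *= 1000

# Without scaling, the large-scale feature tends to get a tiny coefficient
unscaled = LogisticRegression(max_iter=1000).fit(X, y)
print("unscaled coefficients:", unscaled.coef_.round(3))

# With scaling, coefficient magnitudes become comparable across features
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
print("scaled coefficients:  ", scaled.named_steps["logisticregression"].coef_.round(3))
```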
Limitations of linear models
Is your data “linearly separable”? Can you draw a hyperplane between the data points that separates them with 0 error?
If the training examples can be separated by a linear decision rule, they are linearly separable.
Recap: CountVectorizer input
CountVectorizer is primarily designed to accept a pandas.Series of text data or a 1D numpy array; it can also process a list of strings directly.
Unlike many transformers that handle multiple features (a DataFrame or 2D numpy array), CountVectorizer handles a single text column at a time.
If your dataset contains multiple text columns, you will need to instantiate separate CountVectorizer objects for each text feature.
This approach ensures that the unique vocabulary and tokenization processes are correctly applied to each specific text column without interference.
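A minimal sketch of this (the column names below are hypothetical), using a ColumnTransformer so that each text column gets its own CountVectorizer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data frame with two separate text columns
df = pd.DataFrame({
    "review_text": ["good movie", "bad plot"],
    "summary_text": ["great acting", "weak script"],
})

# Each text column gets its own CountVectorizer (and its own vocabulary).
# Note that each column is passed as a string, not a list, because
# CountVectorizer expects 1D input.
preprocessor = ColumnTransformer([
    ("review", CountVectorizer(), "review_text"),
    ("summary", CountVectorizer(), "summary_text"),
])
X_counts = preprocessor.fit_transform(df)
print(X_counts.shape)
```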
To improve generalization performance, finding good values for a model's important hyperparameters is necessary for almost all models and datasets.
Picking good hyperparameters is important because if we don’t do it, we might end up with an underfit or overfit model.
Manual hyperparameter optimization procedure
Define a parameter space.
Iterate through possible combinations.
Evaluate model performance.
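A sketch of this manual procedure (the grid of max_depth values is illustrative, and X_train / y_train are assumed to already exist):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Manually iterate through a small, hand-picked parameter space.
best_score, best_depth = -1.0, None
for max_depth in [1, 3, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=123)
    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_score, best_depth = score, max_depth
print("best max_depth:", best_depth, "mean CV score:", best_score)
```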
What are some limitations of this approach?
Manual hyperparameter optimization
Advantage: we may have some intuition about what might work.
E.g. if I’m massively overfitting, try decreasing max_depth or C.
Disadvantages
It takes a lot of work
Not reproducible
In very complicated cases, our intuition might be worse than a data-driven approach
Automated hyperparameter optimization
Formulate hyperparameter optimization as one big search problem.
Often we have many hyperparameters of different types: categorical, integer, and continuous.
Often, the search space is quite big and systematic search for optimal values is infeasible.
sklearn methods
sklearn provides two main methods for hyperparameter optimization
Grid Search
Random Search
Grid search
Grid search overview
Covers all possible combinations from the provided grid.
Can be parallelized easily.
Integrates cross-validation.
Grid search in practice
For GridSearchCV we need
An instantiated model or a pipeline
A parameter grid: A user specifies a set of values for each hyperparameter.
Other optional arguments
The method considers the Cartesian product of these sets and evaluates each combination one by one.
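A minimal sketch of what this looks like in code (the pipeline and grid values below are illustrative, and X_train / y_train are assumed to exist):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC())

# Parameter grid: a set of candidate values for each hyperparameter.
# Pipeline hyperparameters are addressed as "stepname__parameter".
param_grid = {
    "svc__C": [0.1, 1.0, 10.0, 100.0],
    "svc__gamma": [0.01, 0.1, 1.0],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)  # X_train, y_train assumed to exist
print(grid_search.best_params_, grid_search.best_score_)
```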
Randomized search in practice
For RandomizedSearchCV, note the n_iter argument; we didn't need this for GridSearchCV.
Larger n_iter will take longer but it’ll do more searching.
Remember you still need to multiply by the number of folds!
You can set random_state for reproducibility but you don’t have to do it.
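A corresponding sketch with RandomizedSearchCV (again, the pipeline and distributions are illustrative, and X_train / y_train are assumed to exist):

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC())

# Instead of a fixed grid, we can give distributions to sample from.
param_dist = {
    "svc__C": loguniform(1e-3, 1e3),
    "svc__gamma": loguniform(1e-4, 1e1),
}

# n_iter=20 samples 20 hyperparameter combinations; with cv=5 that is
# 20 * 5 = 100 model fits.
random_search = RandomizedSearchCV(
    pipe, param_dist, n_iter=20, cv=5, n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)  # X_train, y_train assumed to exist
print(random_search.best_params_, random_search.best_score_)
```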
Advantages of RandomizedSearchCV
Faster compared to GridSearchCV.
Adding parameters that do not influence the performance does not affect efficiency.
Works better when some parameters are more important than others.
In general, I recommend using RandomizedSearchCV rather than GridSearchCV.
Questions for class discussion
Suppose you have 10 hyperparameters, each with 4 possible values. If you run GridSearchCV with this parameter grid, how many cross-validation experiments will be carried out?
Suppose you have 10 hyperparameters and each takes 4 values. If you run RandomizedSearchCV with this parameter grid with n_iter=20, how many cross-validation experiments will be carried out?
Select all of the following statements which are TRUE.
If you get best results at the edges of your parameter grid, it might be a good idea to adjust the range of values in your parameter grid.
Grid search is guaranteed to find the best hyperparameter values.
It is possible to get different hyperparameters in different runs of RandomizedSearchCV.
Optimization bias
Optimization bias (motivation)
Why do we need to evaluate the model on the test set in the end?
Why not just use cross-validation on the whole dataset?
While carrying out hyperparameter optimization, we usually try over many possibilities.
If our dataset is small and our validation set is hit too many times, we suffer from optimization bias, i.e., overfitting of the validation set.
Optimization bias of parameter learning
Overfitting of the training error
An example:
During training, we could search over tons of different decision trees.
So we can get “lucky” and find a tree with low training error by chance.
Optimization bias of hyperparameter learning
Overfitting of the validation error
An example:
Here, we might optimize the validation error over 1000 values of max_depth.
One of the 1000 trees might have low validation error by chance.
(Optional) Example 1: Optimization bias
Consider a multiple-choice (a,b,c,d) “test” with 10 questions:
If you choose answers randomly, expected grade is 25% (no bias).
If you fill out two tests randomly and pick the best, expected grade is 33%.
Optimization bias of ~8%.
If you take the best among 10 random tests, expected grade is ~47%.
If you take the best among 100, expected grade is ~62%.
If you take the best among 1000, expected grade is ~73%.
If you take the best among 10000, expected grade is ~82%.
You have so many “chances” that you expect to do well.
But on new questions the “random choice” accuracy is still 25%.
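As a quick check, a small simulation sketch approximately reproduces these numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, p_correct = 10, 0.25  # 10 questions, 4 options each

def expected_best_grade(n_tests, n_trials=10_000):
    # Grade of one randomly filled test ~ Binomial(n_questions, 0.25) / n_questions.
    grades = rng.binomial(n_questions, p_correct, size=(n_trials, n_tests)) / n_questions
    # Keep the best grade among the n_tests attempts, then average over trials.
    return grades.max(axis=1).mean()

for n_tests in [1, 2, 10, 100, 1000]:
    print(f"best of {n_tests:>4} random tests: {expected_best_grade(n_tests):.2f}")
```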
(Optional) Example 2: Optimization bias
If we instead used a 100-question test then:
Expected grade from best over 1 randomly-filled test is 25%.
Expected grade from best over 2 randomly-filled tests is ~27%.
Expected grade from best over 10 randomly-filled tests is ~32%.
Expected grade from best over 100 randomly-filled tests is ~36%.
Expected grade from best over 1000 randomly-filled tests is ~40%.
Expected grade from best over 10000 randomly-filled tests is ~43%.
The optimization bias grows with the number of things we try.
“Complexity” of the set of models we search over.
But, optimization bias shrinks quickly with the number of examples.
But it’s still non-zero and growing if you over-use your validation set!
Optimization bias overview
Why do we need separate validation and test datasets?
This is why we need a test set
The frustrating part is that if our dataset is small then our test set is also small 😔.
But we don’t have a lot of better alternatives, unfortunately, if we have a small dataset.
When test score is much lower than CV score
What to do if your test score is much lower than your cross-validation score:
Try simpler models and use the test set a couple of times; it’s not the end of the world.
Communicate this clearly when you report the results.
Mitigating optimization bias
Cross-validation
Ensembles
Regularization and choosing a simpler model
Questions for you
You have a dataset and you give me 1/10th of it. The dataset given to me is rather small, so I split it into a 96% train and 4% validation split. I carry out hyperparameter optimization using this single 4% validation split and report a validation accuracy of 0.97. Would the model classify the rest of your data with similar accuracy?
Probably
Probably not
Summary
Automated hyperparameter optimization
Advantages
Reduce human effort
Less prone to error and improves reproducibility
Data-driven approaches may be effective
Disadvantages
May be hard to incorporate intuition
Be careful about overfitting on the validation set
Extra
Discussion
Let’s say that, for a particular feature, the histograms of that feature are identical for the two target classes. Does that mean the feature is not useful for predicting the target class?