CPSC 330 Lecture 7: Linear models

Andrew Roth (Slides adapted from Varada Kolhatkar and Firas Moosvi)

Announcements

  • Important information about midterm 1
    • https://piazza.com/class/m4ujp0s4xgm5o5/post/204
  • HW3 is due next week Monday, Feb 3rd, 11:59 pm.
    • You can work in pairs for this assignment.
  • Where to find slides?
    • https://aroth85.github.io/cpsc330-slides/lecture.html

Learning outcomes

From this lecture, students are expected to be able to:

  • Explain the advantages of getting probability scores instead of hard predictions during classification
  • Explain the general intuition behind linear models
  • Explain the difference between linear regression and logistic regression
  • Explain how predict works for linear regression
  • Explain how you can interpret model predictions using the coefficients learned by a linear model
  • Explain the advantages and limitations of linear classifiers

Learning outcomes (contd)

  • Broadly describe linear SVMs
  • Demonstrate how the alpha hyperparameter of Ridge is related to the fundamental tradeoff
  • Use scikit-learn’s Ridge model
  • Use scikit-learn’s LogisticRegression model and predict_proba to get probability scores

Recap: Dealing with text features

  • Preprocessing text, using text vectorization, so that it can be fed into machine learning models.
  • Bag of words representation

Recap: sklearn CountVectorizer

  • Use scikit-learn’s CountVectorizer to encode text data
  • CountVectorizer: Transforms text into a matrix of token counts
  • Important parameters:
    • max_features: Control the number of features used in the model
    • max_df, min_df: Control document frequency thresholds
    • ngram_range: Defines the range of n-grams to be extracted
    • stop_words: Enables the removal of common words that are typically uninformative in most applications, such as “and”, “the”, etc.
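A minimal sketch, on a made-up toy corpus, illustrating how these parameters might be set (the corpus and values are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration
corpus = [
    "the quick brown fox",
    "the lazy brown dog",
    "the quick lazy fox and the dog",
]

vec = CountVectorizer(
    max_features=10,        # cap the vocabulary at 10 features
    min_df=1,               # keep tokens that appear in at least 1 document
    max_df=1.0,             # drop tokens whose document frequency is above this fraction
    ngram_range=(1, 2),     # extract unigrams and bigrams
    stop_words="english",   # remove common English stop words
)
X = vec.fit_transform(corpus)        # sparse matrix of token counts
print(vec.get_feature_names_out())   # the learned vocabulary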

Recap: Incorporating text features in a machine learning pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

text_pipeline = make_pipeline(
    CountVectorizer(),
    SVC()
)
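A sketch of how this pipeline might be evaluated with cross-validation, assuming X_train holds the raw text documents and y_train the labels (both names are placeholders):

from sklearn.model_selection import cross_validate

# X_train: raw text documents, y_train: labels (placeholder names)
scores = cross_validate(text_pipeline, X_train, y_train, return_train_score=True)
print(scores["test_score"].mean())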

(iClicker) Exercise 6.1

iClicker cloud join link: https://join.iclicker.com/HTRZ

Select all of the following statements which are TRUE.

    1. You could carry out cross-validation by passing a ColumnTransformer object to cross_validate.
    2. After applying a column transformer, the order of the columns in the transformed data has to be the same as the order of the columns in the original data.
    3. After applying a column transformer, the transformed data is always going to be of a different shape than the original data.
    4. When you call fit_transform on a ColumnTransformer object, you get a numpy ndarray.

(iClicker) Exercise 6.2

iClicker cloud join link: https://join.iclicker.com/HTRZ

Select all of the following statements which are TRUE.

    1. handle_unknown="ignore" would treat all unknown categories equally.
    2. As you increase the value of the max_features hyperparameter of CountVectorizer, the training score is likely to go up.
    3. Suppose you are encoding text data using CountVectorizer. If you encounter a word in the validation or test split that is not present in the training data, you will get an error.
    4. In the code below, inside cross_validate, each fold might have a slightly different number of features (columns).
pipe = make_pipeline(CountVectorizer(), SVC())
cross_validate(pipe, X_train, y_train)

Linear models

Linear models

  • Linear models assume that the relationship between X and y is linear.
  • In this case, with only one feature, our model is a straight line.
  • What do we need to represent a line?
    • Slope (\(w_1\)): Determines the angle of the line.
    • Y-intercept (\(w_0\)): Where the line crosses the y-axis.

  • Making predictions

\[ \hat{y} = w_1 \times \text{\# hours studied} + w_0\]
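A minimal sketch, using made-up data, showing that predict simply evaluates this line; Ridge is used here since that is the model we'll use in the course:

import numpy as np
from sklearn.linear_model import Ridge

# Made-up toy data: hours studied vs. grade
X = np.array([[2], [4], [6], [8]])   # one feature: # hours studied
y = np.array([55, 65, 75, 85])       # target: grade

model = Ridge()
model.fit(X, y)

w1, w0 = model.coef_[0], model.intercept_
# predict computes w1 * x + w0
print(model.predict([[5]]))
print(w1 * 5 + w0)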

Ridge vs. LinearRegression

  • Ordinary linear regression is sensitive to multicollinearity and overfitting.
  • Multicollinearity: overlapping and redundant features. Many real-world datasets have collinear features.
  • Linear regression may produce large and unstable coefficients in such cases.
  • Ridge adds a hyperparameter (alpha) to control the complexity of the model: it finds a line that balances fitting the data against keeping the coefficients small.
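A minimal sketch on synthetic, deliberately collinear data (everything here is made up for illustration), showing how ordinary least squares coefficients can blow up while Ridge keeps them small:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
x = rng.normal(size=(100, 1))
X = np.hstack([x, x + rng.normal(scale=0.01, size=(100, 1))])  # two nearly identical features
y = x.ravel() + rng.normal(scale=0.1, size=100)

print(LinearRegression().fit(X, y).coef_)   # coefficients can be large and unstable
print(Ridge(alpha=10.0).fit(X, y).coef_)    # coefficients shrunk toward smaller values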

When to use what?

  • LinearRegression
    • When interpretability is key, and no multicollinearity exists
  • Ridge
    • When you have multicollinearity (highly correlated features).
    • When you want to prevent overfitting in linear models.
  • In this course, we’ll use Ridge.

(iClicker) Exercise 7.1

iClicker cloud join link: https://join.iclicker.com/HTRZ

Select all of the following statements which are TRUE.

    1. Increasing the hyperparameter alpha of Ridge is likely to decrease model complexity.
    2. Ridge can be used with datasets that have multiple features.
    3. With Ridge, we learn one coefficient per training example.
    4. If you train a linear regression model on a 2-dimensional problem (2 features), the model will learn 3 parameters: one for each feature and one for the bias term.

Logistic regression

  • Suppose your target is binary: pass or fail
  • Logistic regression is used for such binary classification tasks.
  • Logistic regression predicts a probability that the given example belongs to a particular class.
  • It uses the sigmoid function to map any real-valued input to a value between 0 and 1, representing the probability of a specific outcome.
  • A threshold (usually 0.5) is applied to the predicted probability to decide the final class label.
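The sigmoid function is \(\sigma(z) = \frac{1}{1 + e^{-z}}\). A minimal sketch on made-up pass/fail data showing predict_proba and the default 0.5 threshold:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression()
lr.fit(X, y)

print(lr.predict_proba([[3.5]]))   # [P(fail), P(pass)]
print(lr.predict([[3.5]]))         # applies the 0.5 threshold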

Logistic regression: Decision boundary

  • The decision boundary is the point on the x-axis where the corresponding predicted probability on the y-axis is 0.5.
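Since \(\sigma(0) = 0.5\), with a single feature the decision boundary sits where the raw score is zero:

\[ w_1 x + w_0 = 0 \iff x = -\frac{w_0}{w_1} \]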

Decision boundary of logistic regression

Parametric vs. non-parametric models (high-level)

  • Imagine you are training a logistic regression model. For each of the following scenarios, identify how many parameters (weights and biases) will be learned.
  • Scenario 1: 100 features and 1,000 examples
  • Scenario 2: 100 features and 1 million examples

Parametric vs. non-parametric models (high-level)

Parametric

  • Examples: Logistic regression, linear regression, linear SVM
  • Models with a fixed number of parameters, regardless of the dataset size
  • Simple, computationally efficient, less prone to overfitting
  • Less flexible, may not capture complex relationships

Non-parametric

  • Examples: KNN, SVM with an RBF kernel, a decision tree with no maximum depth specified
  • Models where the number of parameters grows with the dataset size. They do not assume a fixed form for the function being learned.
  • Flexible, can adapt to complex patterns
  • Computationally expensive, risk of overfitting with noisy data

Class demo

(iClicker) Exercise 7.2

iClicker cloud join link: https://join.iclicker.com/HTRZ

Select all of the following statements which are TRUE.

    1. Increasing logistic regression’s C hyperparameter increases model complexity.
    2. The raw output score can be used to calculate the probability score for a given prediction.
    3. For a linear classifier trained on \(d\) features, the decision boundary is a \(d-1\)-dimensional hyperplane.
    4. A linear model is likely to be uncertain about the data points close to the decision boundary.

Linear SVM

  • We have seen non-linear SVM with RBF kernel before. This is the default SVC model in sklearn because it tends to work better in many cases.
  • There is also a linear SVM. You can pass kernel="linear" to create a linear SVM.
  • The predict method of a linear SVM works the same way as that of logistic regression.
  • We can get the coef_ associated with the features and the intercept_ from a linear SVM model.
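A minimal sketch on made-up 2-feature data, showing a linear SVM and its learned coef_ and intercept_:

import numpy as np
from sklearn.svm import SVC

# Made-up toy data with two features
X = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

linear_svm = SVC(kernel="linear")
linear_svm.fit(X, y)

print(linear_svm.coef_)        # one coefficient per feature
print(linear_svm.intercept_)   # the bias term
print(linear_svm.predict([[4, 4]]))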

Summary

Summary of linear models

  • Linear regression is a linear model for regression whereas logistic regression is a linear model for classification.
  • Both these models learn one coefficient per feature, plus an intercept.

Interpretation of coefficients in linear models

  • The \(j\)th coefficient tells us how feature \(j\) affects the prediction
    • if \(w_j > 0\) then increasing \(x_{ij}\) moves us toward predicting \(+1\)
    • if \(w_j < 0\) then increasing \(x_{ij}\) moves us toward predicting \(-1\)
    • if \(w_j = 0\) then the feature is not used in making a prediction
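A sketch of inspecting coefficients alongside feature names; vec and lr are placeholder names for a fitted CountVectorizer and a LogisticRegression trained on the transformed data:

import pandas as pd

# vec: fitted CountVectorizer, lr: fitted LogisticRegression (placeholder names)
coefs = pd.DataFrame(
    {"coefficient": lr.coef_.flatten()},
    index=vec.get_feature_names_out(),
).sort_values("coefficient")
print(coefs.head())   # features pushing predictions toward the negative class
print(coefs.tail())   # features pushing predictions toward the positive class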

Importance of scaling

  • When you are interpreting the model coefficients, scaling is crucial.
  • If you do not scale the data, features with smaller magnitudes tend to get coefficients with larger magnitudes, whereas features with larger magnitudes tend to get coefficients with smaller magnitudes.
  • That said, once you scale the data, the feature values themselves become harder for humans to interpret!
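A minimal sketch of scaling inside a pipeline so that the learned coefficients end up on a comparable scale (X_train and y_train are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

scaled_ridge = make_pipeline(StandardScaler(), Ridge())
# scaled_ridge.fit(X_train, y_train)   # X_train / y_train are placeholders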

Limitations of linear models

  • Is your data “linearly separable”? Can you draw a hyperplane between these data points that separates them with 0 error?
  • If the training examples can be separated by a linear decision rule, they are linearly separable.