Lecture 5: Preprocessing and sklearn pipelines

Andrew Roth (Slides adapted from Varada Kolhatkar and Firas Moosvi)

Announcements

HW1 grades have been posted.
- See syllabus about regrade request etiquette.
Homework 1 solutions have been posted on Canvas under Files tab. Please do not share them with anyone or do not post them anywhere.
Syllabus quiz is due Jan 24.
HW3 should be available.

Learning outcomes

From this lecture, you will be able to

Explain motivation for preprocessing in supervised machine learning;
Discuss golden rule in the context of feature transformations;
Identify when to implement feature transformations such as imputation, scaling, and one-hot encoding in a machine learning model development pipeline;
Use sklearn transformers for applying feature transformations on your dataset;
Use sklearn.pipeline.Pipeline and sklearn.pipeline.make_pipeline to build a preliminary machine learning pipeline.

Recap

Decision trees: Split data into subsets based on feature values to create decision rules
\(k\)-NNs: Classify based on the majority vote from k nearest neighbors
SVM RBFs: Create a boundary using an RBF kernel to separate classes

Motivation

So far we have seen
- Three ML models (decision trees, \(k\)-NNs, SVMs with RBF kernel)
- ML fundamentals (train-validation-test split, cross-validation, the fundamental tradeoff, the golden rule)
Are we ready to do machine learning on real-world datasets?
- Very often real-world datasets need preprocessing before we use them to build ML models.

(iClicker) Exercise 5.1

iClicker cloud join link: https://join.iclicker.com/HTRZ

Take a guess: In your machine learning project, how much time will you typically spend on data preparation and transformation?

1. ~80% of the project time
1. ~20% of the project time
1. ~50% of the project time
1. None. Most of the time will be spent on model building

The question is adapted from here.

Preprocessing motivation: example

You’re trying to find a suitable date based on:

Age (closer to yours is better).
Number of Facebook Friends (how should we interpret?).

Preprocessing motivation: example

You are 30 years old and have 250 Facebook friends.

Person	Age	#FB Friends	Euclidean Distance Calculation	Distance
A	25	400	√(5² + 150²)	150.08
B	27	300	√(3² + 50²)	50.09
C	30	500	√(0² + 250²)	250.00
D	60	250	√(30² + 0²)	30.00

Based on the distances, the two nearest neighbors (2-NN) are:

Person D (Distance: 30.00)
Person B (Distance: 50.09)

What’s the problem here?

Common transformations

Imputation: Fill the gaps! (🟩 🟧 🟦)

Fill in missing data using a chosen strategy:

Mean: Replace missing values with the average of the available data.
Median: Use the middle value.
Most Frequent: Use the most common value (mode).
KNN Imputation: Fill based on similar neighbors.

Example:

Imputation is like filling in your average or median or most frequent grade for an assessment you missed.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

Scaling: Everything to the same range! (📉 📈)

Ensure all features have a comparable range.

StandardScaler: Mean = 0, Standard Deviation = 1.

Example:

Scaling is like adjusting the number of everyone’s Facebook friends so that both the number of friends and their age are on a comparable scale. This way, one feature doesn’t dominate the other when making comparisons.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

(iClicker) Exercise 5.2

iClicker cloud join link: https://join.iclicker.com/HTRZ

Select all of the following statements which are TRUE.

1. StandardScaler ensures a fixed range (i.e., minimum and maximum values) for the features.
1. StandardScaler calculates mean and standard deviation for each feature separately.
1. In general, it’s a good idea to apply scaling on numeric features before training \(k\)-NN or SVM RBF models.
1. The transformed feature values might be hard to interpret for humans.
1. After applying SimpleImputer the transformed data has a different shape than the original data.

One-Hot encoding: 🍎 → 1️⃣ 0️⃣ 0️⃣

Convert categorical features into binary columns.

Creates new binary columns for each category.
Useful for handling categorical data in machine learning models.

Example:

Turn “Apple, Banana, Orange” into binary columns:

Fruit	🍎	🍌	🍊
Apple 🍎	1	0	0
Banana 🍌	0	1	0
Orange 🍊	0	0	1

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)

Ordinal encoding: Ranking matters! (⭐️⭐️⭐️ → 3️⃣)

Convert categories into integer values that have a meaningful order.

Assign integers based on order or rank.
Useful when there is an inherent ranking in the data.

Example:

Turn “Poor, Average, Good” into 1, 2, 3:

Rating	Ordinal
Poor	1
Average	2
Good	3

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
X_ordinal = encoder.fit_transform(X)

`sklearn` Transformers vs Estimators

Transformers

Are used to transform or preprocess data.
Implement the fit and transform methods.
- fit(X): Learns parameters from the data.
- transform(X): Applies the learned transformation to the data.
Examples:
- Imputation (SimpleImputer): Fills missing values.
- Scaling (StandardScaler): Standardizes features.

fit_transform(X): Convenience method for calling fit and then transform on the same data.

Estimators

Used to make predictions.
Implement fit and predict methods.
- fit(X, y): Learns from labeled data.
- predict(X): Makes predictions on new data.
Examples: DecisionTreeClassifier, SVC, KNeighborsClassifier

Regression models are also estimators

Feature transformations and the golden rule

How to carry out cross-validation? (improper)

How to carry out cross-validation? (proper)

The golden rule in feature transformations

Never transform the entire dataset at once!
Why? It leads to data leakage — using information from the test set in your training process, which can artificially inflate model performance.
Fit transformers like scalers and imputers on the training set only.
Apply the transformations to both the training and test sets separately.

Example:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

`sklearn` Pipelines

Pipeline is a way to chain multiple steps (e.g., preprocessing + model fitting) into a single workflow.
Simplify the code and improves readability.
Reduce the risk of data leakage by ensuring proper transformation of the training and test sets.
Automatically apply transformations in sequence.

Example:

Chaining a StandardScaler with a KNeighborsClassifier model.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

(iClicker) Exercise 5.3

iClicker cloud join link: https://join.iclicker.com/HTRZ

Select all of the following statements which are TRUE.

1. You can have scaling of numeric features, one-hot encoding of categorical features, and scikit-learn estimator within a single pipeline.
1. Once you have a scikit-learn pipeline object with an estimator as the last step, you can call fit, predict, and score on it.
1. You can carry out data splitting within scikit-learn pipeline.
1. We have to be careful of the order we put each transformation and model in a pipeline.

Lecture 5: Preprocessing and sklearn pipelines

Announcements

Learning outcomes

Recap

Motivation

(iClicker) Exercise 5.1

Preprocessing motivation: example

Preprocessing motivation: example

Common transformations

Imputation: Fill the gaps! (🟩 🟧 🟦)

Example:

Scaling: Everything to the same range! (📉 📈)

Example:

(iClicker) Exercise 5.2

One-Hot encoding: 🍎 → 1️⃣ 0️⃣ 0️⃣

Example:

Ordinal encoding: Ranking matters! (⭐️⭐️⭐️ → 3️⃣)

Example:

sklearn Transformers vs Estimators

Transformers

Estimators

Feature transformations and the golden rule

How to carry out cross-validation? (improper)

How to carry out cross-validation? (proper)

The golden rule in feature transformations

Example:

sklearn Pipelines

Example:

(iClicker) Exercise 5.3

`sklearn` Transformers vs Estimators

`sklearn` Pipelines