From this lecture, you will be able to:
- use sklearn transformers to apply feature transformations to your dataset;
- use sklearn.pipeline.Pipeline and sklearn.pipeline.make_pipeline to build a preliminary machine learning pipeline.

iClicker cloud join link: https://join.iclicker.com/HTRZ
Take a guess: In your machine learning project, how much time will you typically spend on data preparation and transformation?
The question is adapted from here.
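You’re trying to find a suitable date based on two features: age and number of Facebook friends. The distance calculations in the table imply a reference point of age 30 with 250 Facebook friends.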
| Person | Age | #FB Friends | Euclidean Distance Calculation | Distance |
|---|---|---|---|---|
| A | 25 | 400 | √(5² + 150²) | 150.08 |
| B | 27 | 300 | √(3² + 50²) | 50.09 |
| C | 30 | 500 | √(0² + 250²) | 250.00 |
| D | 60 | 250 | √(30² + 0²) | 30.00 |
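The distance column can be reproduced with a few lines of NumPy. This is a minimal sketch: the query point (age 30, 250 friends) is an assumption inferred from the zero differences in the table, and the variable names are illustrative.

```python
import numpy as np

# Rows: persons A-D; columns: age, number of Facebook friends (from the table).
names = ["A", "B", "C", "D"]
X = np.array([[25, 400], [27, 300], [30, 500], [60, 250]], dtype=float)

# Assumed query point, inferred from the zero differences in the table.
query = np.array([30, 250], dtype=float)

# Euclidean distance from each person to the query point.
distances = np.sqrt(((X - query) ** 2).sum(axis=1))

for name, d in zip(names, distances):
    print(f"{name}: {d:.2f}")

# Names of the two people with the smallest distances.
nearest = [names[i] for i in np.argsort(distances)[:2]]
print("2-NN:", nearest)
```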
Based on the distances, the two nearest neighbors (2-NN) are D (distance 30.00) and B (distance 50.09).
What’s the problem here?
Fill in missing data using a chosen strategy, such as the mean, median, or most frequent value.
Imputation is like filling in your average or median or most frequent grade for an assessment you missed.
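As a rough sketch of the grade analogy, sklearn's SimpleImputer can fill a missing value with the mean of the observed ones. The grades below are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical grades; np.nan marks a missed assessment.
grades = np.array([[80.0], [90.0], [np.nan], [70.0]])

# strategy can also be "median" or "most_frequent".
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(grades)

print(filled.ravel())
```

Here the missing grade is replaced by the mean of the three observed grades.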
Ensure all features have a comparable range.
Scaling is like adjusting the number of everyone’s Facebook friends so that both the number of friends and their age are on a comparable scale. This way, one feature doesn’t dominate the other when making comparisons.
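To see the effect on the dating example, here is a minimal sketch that standardizes both features with StandardScaler and recomputes the distances. The query point (age 30, 250 friends) is an assumption inferred from the table, and fitting the scaler on only four people is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

names = ["A", "B", "C", "D"]
X = np.array([[25, 400], [27, 300], [30, 500], [60, 250]], dtype=float)
query = np.array([[30, 250]], dtype=float)  # assumed query point

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)      # learn mean and std from the four people
query_scaled = scaler.transform(query)  # reuse the same mean and std

# Euclidean distances in the scaled feature space.
scaled_distances = np.sqrt(((X_scaled - query_scaled) ** 2).sum(axis=1))
nearest_scaled = [names[i] for i in np.argsort(scaled_distances)[:2]]
print("2-NN after scaling:", nearest_scaled)
```

On the raw features the friend count dominates the distance; after scaling, the 30-year age gap to D counts for much more, and the two nearest neighbors in this toy example become B and A.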
Select all of the following statements which are TRUE.
- StandardScaler ensures a fixed range (i.e., minimum and maximum values) for the features.
- StandardScaler calculates mean and standard deviation for each feature separately.
- After applying SimpleImputer, the transformed data has a different shape than the original data.

Convert categorical features into binary columns.
Turn “Apple, Banana, Orange” into binary columns:
| Fruit | 🍎 | 🍌 | 🍊 |
|---|---|---|---|
| Apple 🍎 | 1 | 0 | 0 |
| Banana 🍌 | 0 | 1 | 0 |
| Orange 🍊 | 0 | 0 | 1 |
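The table above can be sketched with sklearn's OneHotEncoder. The array name is illustrative; by default the encoder orders the categories alphabetically, which happens to match the table.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with three fruit values.
fruits = np.array([["Apple"], ["Banana"], ["Orange"]])

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it easy to print.
encoded = encoder.fit_transform(fruits).toarray()

print(encoder.categories_)  # learned categories, one array per column
print(encoded)              # one binary column per category
```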
Convert categories into integer values that have a meaningful order.
Turn “Poor, Average, Good” into 1, 2, 3:
| Rating | Ordinal |
|---|---|
| Poor | 1 |
| Average | 2 |
| Good | 3 |
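A minimal sketch of the same idea with sklearn's OrdinalEncoder follows. Note one difference from the table: sklearn's integer codes start at 0, so the ratings map to 0, 1, 2 rather than 1, 2, 3. The order is passed explicitly because the default alphabetical order would not match the intended ranking.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ratings column, deliberately not in rank order.
ratings = np.array([["Good"], ["Poor"], ["Average"]])

# Pass the order explicitly; otherwise categories are sorted alphabetically.
encoder = OrdinalEncoder(categories=[["Poor", "Average", "Good"]])
codes = encoder.fit_transform(ratings)

print(codes.ravel())
```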
sklearn Transformers vs Estimators
Transformers have fit and transform methods.
- fit(X): Learns parameters from the data.
- transform(X): Applies the learned transformation to the data.
- Examples: imputation (SimpleImputer) fills missing values; scaling (StandardScaler) standardizes features.
- fit_transform(X): Convenience method for calling fit and then transform on the same data.
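The transformer methods can be sketched with StandardScaler on a tiny, made-up dataset. The train/test arrays are hypothetical; the point is that fit learns the statistics once and transform reuses them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (a single column).
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # learns mean_ and scale_ from X_train
X_train_scaled = scaler.transform(X_train)  # applies the learned statistics
X_test_scaled = scaler.transform(X_test)    # reuses the training statistics

# fit_transform does fit and then transform in one call on the same data.
same = StandardScaler().fit_transform(X_train)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```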
Estimators have fit and predict methods.
- fit(X, y): Learns from labeled data.
- predict(X): Makes predictions on new data.
- Examples: DecisionTreeClassifier, SVC, KNeighborsClassifier.
- Regression models are also estimators.
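The estimator methods can be sketched with KNeighborsClassifier. The toy data and labels below are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two classes.
X = np.array([[0.0], [1.0], [9.0], [10.0]])
y = np.array([0, 0, 1, 1])

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)  # learns from labeled data

# Predict the class of two new points.
predictions = model.predict(np.array([[0.5], [9.5]]))
print(predictions)
```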
sklearn Pipelines
Chaining a StandardScaler with a KNeighborsClassifier model.
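A minimal sketch of such a chain with make_pipeline is shown here. The features reuse the dating example and the labels are made up; n_neighbors=1 is chosen only to keep the toy example predictable.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: age, number of friends, and a made-up label.
X = np.array([[25, 400], [27, 300], [30, 500], [60, 250]], dtype=float)
y = np.array([1, 1, 0, 0])

# Scaling runs first, then its output is passed to the classifier.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
pipe.fit(X, y)

print(list(pipe.named_steps))  # step names generated by make_pipeline
print(pipe.predict(X))
print(pipe.score(X, y))
```

make_pipeline names each step after its class in lowercase; sklearn.pipeline.Pipeline builds the same object from explicit (name, step) pairs.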
Select all of the following statements which are TRUE.
- Given a scikit-learn pipeline object with an estimator as the last step, you can call fit, predict, and score on it.
- scikit-learn pipeline.