Lecture 4: \(k\)-nearest neighbours and SVM RBFs
Andrew Roth (Slides adapted from Varada Kolhatkar and Firas Moosvi)
Announcements
- Homework 2 due Jan 20
- Syllabus quiz due Jan 24
- The lecture notes within these notebooks align with the content presented in the videos. Even though we do not cover all the content from these notebooks during lectures, it’s your responsibility to go through them on your own.
Learning outcomes
From this lecture, you will be able to
- Describe the curse of dimensionality
- Explain the notion of similarity-based algorithms
- Describe how \(k\)-NNs and SVMs with RBF kernel work
- Describe the effects of hyper-parameters for \(k\)-NNs and SVMs
Recap
Which of the following scenarios do NOT necessarily imply overfitting?
- Training accuracy is 0.98 while validation accuracy is 0.60.
- The model is too specific to the training data.
- The decision boundary of a classifier is wiggly and highly irregular.
- Training and validation accuracies are both approximately 0.88.
Recap
Which of the following statements about overfitting is true?
- Overfitting is always beneficial for model performance on unseen data.
- Some degree of overfitting is common in most real-world problems.
- Overfitting ensures the model will perform well in real-world scenarios.
- Overfitting occurs when the model learns the training data too closely, including its noise and outliers.
Recap
How might one address the issue of underfitting in a machine learning model?
- Introduce more noise to the training data.
- Remove features that might be relevant to the prediction.
- Increase the model’s complexity, possibly by adding more parameters or features.
- Use a smaller dataset for training.
Overfitting and underfitting
- An overfit model matches the training set so closely that it fails to make correct predictions on new unseen data.
- An underfit model is too simple and does not even make good predictions on the training data.
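To make the overfitting/underfitting contrast concrete, here is a small hedged sketch (the synthetic dataset and decision-tree models are illustrative choices, not from the lecture): an unrestricted tree memorizes the training set, while a depth-1 stump is too simple to fit even the training data well.

```python
# Hypothetical illustration: overfit vs. underfit models on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# A very deep tree tends to overfit: near-perfect train score, lower validation score.
deep = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)

# A depth-1 tree (a decision stump) tends to underfit: mediocre scores on both splits.
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_train, y_train)

print(f"deep tree: train={deep.score(X_train, y_train):.2f}, valid={deep.score(X_valid, y_valid):.2f}")
print(f"stump:     train={stump.score(X_train, y_train):.2f}, valid={stump.score(X_valid, y_valid):.2f}")
```
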
Recap
- Why do we split the data? What are train/valid/test splits?
- What are the benefits of cross-validation?
- What’s the fundamental trade-off in supervised machine learning?
- What is the golden rule of machine learning?
Cross validation
Summary of train, validation, test, and deployment data
|            | `.fit` | `.score` | `.predict` |
|------------|--------|----------|------------|
| Train      | ✔️     | ✔️       | ✔️         |
| Validation |        | ✔️       | ✔️         |
| Test       |        | once     | once       |
| Deployment |        |          | ✔️         |
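As a minimal sketch of this split discipline in scikit-learn (the dataset and model here are illustrative choices, not from the lecture): fit on the training set, score on validation as often as needed, and touch the test set only once at the very end.

```python
# Hedged sketch: train / validation / test usage in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# First carve off the test set; it gets scored only once, at the very end.
X_trainvalid, X_test, y_trainvalid, y_test = train_test_split(X, y, random_state=123)

# Split the remainder into train (fit + score) and validation (score only).
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainvalid, y_trainvalid, random_state=123
)

model = KNeighborsClassifier()
model.fit(X_train, y_train)                   # fit on train only
valid_score = model.score(X_valid, y_valid)   # score on validation, repeatedly if needed
test_score = model.score(X_test, y_test)      # score on test, exactly once
print(valid_score, test_score)
```
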
Recap: The fundamental tradeoff
As you increase the model complexity, training score tends to go up and the gap between train and validation scores tends to go up.
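For \(k\)-NN specifically, smaller `n_neighbors` means a more complex model, so the tradeoff can be seen by sweeping \(k\). The sketch below (example values chosen here, not from the slides) shows the train score and the train–validation gap both growing as \(k\) shrinks.

```python
# Hedged illustration of the fundamental tradeoff with k-NN: lower k
# (more complex model) raises the train score and widens the gap.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

for k in [1, 5, 25, 100]:
    scores = cross_validate(
        KNeighborsClassifier(n_neighbors=k), X, y, return_train_score=True
    )
    train = scores["train_score"].mean()
    valid = scores["test_score"].mean()
    print(f"k={k:3d}  train={train:.2f}  valid={valid:.2f}  gap={train - valid:.2f}")
```
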
Motivation