CPSC 330 Lecture 16: DBSCAN, Hierarchical Clustering

Andrew Roth (Slides adapted from Varada Kolhatkar and Firas Moosvi)

Announcements

  • MT2 next week! See Piazza.
  • Final exam dates and registration posted. See Piazza.
  • HW6 is posted.

16.1 Select all of the following statements which are True (iClicker)

iClicker join link: https://join.iclicker.com/HTRZ

    1. Similar to K-nearest neighbours, K-Means is a non-parametric model.
    2. The meaning of K in K-nearest neighbours and K-Means clustering is very similar.
    3. Scaling of input features is crucial in clustering.
    4. In clustering, it’s almost always a good idea to find equal-sized clusters.
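
The scaling point above is easiest to see in code. Below is a minimal sketch (not part of the original slides) using sklearn's StandardScaler and KMeans on synthetic data; the feature scales, cluster count, and random seeds are arbitrary choices for illustration.

```python
# Minimal sketch: why scaling matters for K-Means. With features on
# very different scales, one feature dominates the Euclidean distances
# the algorithm relies on.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two features on wildly different scales (e.g., metres vs. millimetres).
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])

# Without scaling, the large-scale feature dominates the distance metric,
# so clusters effectively form along that axis only.
labels_raw = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# After standardization, both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X)
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
print(labels_raw[:10], labels_scaled[:10])
```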

Class demo

Class notebook

16.2 Select all of the following statements which are True (iClicker)

iClicker join link: https://join.iclicker.com/HTRZ

    1. With a tiny epsilon (eps in sklearn) and min samples of 1 (min_samples=1 in sklearn), we are likely to end up with each point in its own cluster.
    2. With a smaller value of eps and a larger value of min_samples, we are likely to end up with one big cluster.
    3. K-Means is more susceptible to outliers compared to DBSCAN.
    4. In DBSCAN, to be part of a cluster, each point must have at least min_samples neighbours in a given radius (including itself).
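
To build intuition for the eps and min_samples statements above, here is a minimal sketch (not part of the original slides) using sklearn's DBSCAN on synthetic blobs; the parameter values and the injected outlier are illustrative assumptions, not canonical settings.

```python
# Minimal sketch: how eps and min_samples shape DBSCAN's output, and
# how DBSCAN flags outliers as noise (label -1), which K-Means cannot do.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
X = np.vstack([X, [[50.0, 50.0]]])  # inject one extreme outlier

# A tiny eps with min_samples=1 tends to put each point in its own cluster.
tiny = DBSCAN(eps=0.01, min_samples=1).fit(X)
print(len(set(tiny.labels_)))  # roughly one cluster per point

# With moderate settings, the outlier receives the noise label -1.
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print((db.labels_ == -1).sum())  # number of points flagged as noise

# K-Means has no notion of noise: the outlier is forced into a cluster
# and can drag a centroid toward it.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
```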

Class notebook

16.3 Select all of the following statements which are True (iClicker)

iClicker join link: https://join.iclicker.com/HTRZ

    1. In hierarchical clustering we do not have to worry about initialization.
    2. Hierarchical clustering can only be applied to smaller datasets because dendrograms are hard to visualize for large datasets.
    3. In all the clustering methods we have seen (K-Means, DBSCAN, hierarchical clustering), there is a way to decide the granularity of clustering (i.e., how many clusters to pick).
    4. To get robust clustering, we can naively ensemble cluster labels (e.g., pick the most popular label) produced by different clustering methods.
    5. If you have a high Silhouette score and very clean and robust clusters, it means that the algorithm has captured the semantic meaning of interest in the data.
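
As a companion to the statements above, here is a minimal sketch (not part of the original slides) of hierarchical clustering with SciPy's linkage/fcluster and sklearn's silhouette_score; the Ward linkage, the cut at three clusters, and the synthetic data are all illustrative assumptions.

```python
# Minimal sketch: hierarchical clustering involves no random
# initialization, and the granularity is chosen after the fact by
# cutting the dendrogram at some level.
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Build the full merge tree; 'ward' merges to minimize within-cluster variance.
Z = linkage(X, method="ward")

# Cut the tree into a flat clustering at the chosen granularity.
labels = fcluster(Z, t=3, criterion="maxclust")

# Silhouette measures geometric separation only: a high score does not
# guarantee the clusters match any semantic meaning of interest.
print(silhouette_score(X, labels))

# dendrogram(Z)  # plots the merge tree; readable mainly for small datasets
```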