DBSCAN Clustering
DBSCAN is a popular clustering algorithm that takes a fundamentally different approach from k-means.
In k-means clustering, each cluster is represented by a centroid, and points are assigned to whichever centroid they are closest to. In DBSCAN, there are no centroids, and clusters are formed by linking nearby points to one another.
k-means requires specifying the number of clusters, ‘k’. DBSCAN does not, but it does require two parameters that govern whether two nearby points should be linked into the same cluster: a distance threshold, \( \varepsilon \) (epsilon), and “MinPts” (the minimum number of points), both explained below.
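As a quick point of reference, scikit-learn’s DBSCAN implementation exposes these same two parameters as `eps` and `min_samples`. Here’s a minimal usage sketch with made-up toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: three nearby points and one far-away outlier.
X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [8.0, 8.0]])

# eps is the distance threshold (epsilon); min_samples is MinPts.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # one cluster label per point; -1 marks noise -> [0 0 0 -1]
```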
k-means runs over many iterations to converge on a good set of clusters, and cluster assignments can change on each iteration. DBSCAN makes only a single pass through the data, and once a point has been assigned to a particular cluster, it never changes.
I like the language of trees for describing cluster growth in DBSCAN. The algorithm starts with an arbitrary seed point which has at least MinPts points nearby within a distance (or “radius”) of \( \varepsilon \). We then do a breadth-first search along each of these nearby points. For a given nearby point, we check how many points lie within its radius. If it has fewer than MinPts neighbors, this point becomes a leaf, and we don’t continue to grow the cluster from it. If it does have at least MinPts, however, then it’s a branch, and we add all of its neighbors to the FIFO queue of our breadth-first search.
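To make the branch/leaf distinction concrete, here’s a minimal sketch of that breadth-first expansion. The helper names (`region_query`, `grow_cluster`) and the label convention (0 = unclaimed, -1 = noise) are my own assumptions, not necessarily those of the implementation below:

```python
from collections import deque

import numpy as np

def region_query(X, idx, eps):
    """Return the indices of all points within distance eps of point idx."""
    return np.where(np.linalg.norm(X - X[idx], axis=1) <= eps)[0].tolist()

def grow_cluster(X, labels, seed, cluster_id, eps, min_pts):
    """Grow one cluster outward from a seed point via breadth-first search."""
    labels[seed] = cluster_id
    queue = deque([seed])
    while queue:
        point = queue.popleft()
        neighbors = region_query(X, point, eps)
        if len(neighbors) < min_pts:
            continue               # a leaf: it stays in the cluster, but we don't expand from it
        for n in neighbors:        # a branch: enqueue its unclaimed neighbors
            if labels[n] == 0:     # not yet claimed by any cluster
                labels[n] = cluster_id
                queue.append(n)
            elif labels[n] == -1:  # a noise point absorbed as a leaf
                labels[n] = cluster_id
```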
Once the breadth-first search is complete, we’re done with that cluster, and we never revisit any of the points in it. We pick a new arbitrary seed point (which isn’t already part of another cluster), and grow the next cluster. This continues until all of the points have been assigned.
There is one other novel aspect of DBSCAN: some points end up belonging to no cluster at all. If a point has fewer than MinPts neighbors, AND it’s not a leaf node of another cluster, then it’s labeled as a “Noise” point that doesn’t belong to any cluster.
Noise points are identified during seed selection: if a candidate seed point doesn’t have enough neighbors, it’s labeled as a Noise point. This label is often temporary, however, since these points may later be picked up by some cluster as a leaf node.
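Putting the pieces together, the outer loop (seed selection plus noise labeling) might look something like the following sketch, reusing the hypothetical `region_query` and `grow_cluster` helpers from above:

```python
def dbscan(X, eps, min_pts):
    """A single pass over the data: every point is considered as a seed exactly once."""
    labels = [0] * len(X)  # 0 = not yet claimed by any cluster
    cluster_id = 0
    for i in range(len(X)):
        if labels[i] != 0:
            continue  # already in a cluster, or already marked as noise
        if len(region_query(X, i, eps)) < min_pts:
            labels[i] = -1  # noise (possibly temporary; a later cluster may absorb it)
        else:
            cluster_id += 1
            grow_cluster(X, labels, i, cluster_id, eps, min_pts)
    return labels
```

Note how this mirrors the single-pass behavior described earlier: once a point carries a positive cluster label, it is never reassigned.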
Visualization
Naftali Harris has created a great web-based visualization of running DBSCAN on a 2-dimensional dataset. Try clicking on the “Smiley” dataset and hitting the GO button. (That’s where the image from this post came from.) Very cool!
Algorithm in Python
To fully understand the algorithm, I think it’s best to just look at some code.
Below is a working implementation in Python. Note that the emphasis in this implementation is on illustrating the algorithm; the distance calculations, for example, could be optimized significantly.
You can also find this code (along with an example that validates its correctness) on GitHub here.