Clustering comparison


On this page the centroid-based clustering method k-Means will be compared to the density-based clustering method DBSCAN.

k-Means was introduced by James MacQueen in 1967. It aims to partition all observations in to k clusters so that the within-cluster sum of squares is minimized while the between-cluster sum of squares is maximized.

DBSCAN or Density-based spatial clustering of applications with noise was introduced in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu. It forms clusters of closely neighboring points and is able to detect outliers as noise.

Algorithms


k-Means

The k-Means Algorithm is based on the method of least squares and optimizes \[ F = \sum_{i=1}^{k}\sum_{x_{j} \in S_{i}}\left \| x_{j} - \mu_{i} \right \|^{2} \] where \(x_i\) are data points, \(\mu_i\) are Centroids or Means and \(S_i\) are corresponding clusters.
The Algorithm has three steps.

1. Initialisation
Choose \(k\) Centroids (also called Means)
\( m_1^{(0)},\cdots,m_k^{(0)} \)

kmeans initialisation

2. Assign Points
Each data Point is assigned to the "nearest" cluster (cluster variance is increased the least).
\( S_i^{\left( t \right)} = \left\{ x_j : \left\| x_j - m_{i}^{\left( t \right)} \right\|^2 \leq \left\| x_j - m_{i^*}^{\left( t \right)} \right\|^2 \text{ for all } i^*=1,\cdots,k \right\} \)

kmeans assign points

3. Update Centroids
Recalculate the new cluster means.
\( m_i^{\left( t+1 \right)} = \frac{1}{\left| S_{i}^{\left( t \right)} \right|} \sum_{x_j \in S_{i}^{\left( t \right)}} x_j \)

kmeans update centroids

Repeat steps 2 and 3 until the centroids don't change anymore.

kmeans iteration

DBSCAN

In DBSCAN there are three different kinds of points:

  • core points which are dense themselfs
  • density-reachable points that are reachable from core points but are not dense themselfs
  • outliers or noise points

The DBSCAN method has two parameters e and minPts, where e is the maximum neigborhood radius and minPts is the minimum number of neighbors to be a core point.

From one point another point is reachable if their distance is smaller than e. A Point is dense (=core point) if it has at least minPts in its e-reachable neighborhood.

Density-reachable points are reachable from core points but are not themselfs dense.

All reachable points around at least one core point form a cluster. Other points are outliers.

e minPts:3
blue: core points, green: density-reachable points, orange: outliers

Algorithm

  1. Start with arbitrary, not yet visited point
  2. Get the point's e-neighborhood
  3. If the point is dense, start a cluster
  4. If not, the point is labeled as noise (it can get part of a cluster later)
  5. Get the e-neighborhood of all unvisited points in the cluster and add them to the cluster
  6. Again add the e-neighborhood of the neighbors to the cluster if the neighbor is dense itself
  7. Continue until the density-connected cluster is completely found
  8. Start with a new unvisited point
  9. Continue until all points are either part of a cluster or labeled as noise

Advantages of DBSCAN over k-Means

  • DBSCAN does not require a priori knowled about the number of classes
  • DBSCAN can separate arbitrarily shaped and non-linearly separable clusters
  • DBSCAN is robust against noise

Disadvantages of DBSCAN

  • One has to understand the data to choose meaningful parameters
  • DBSCAN has difficulties at clustering datasets with large differences in densities
  • DBSCAN is not entirely deterministic (Assignment of points reachable from more than one cluster is dependant on the processing order)

Hands-On


There are three tabs Data, k-Means and DBSCAN.
In the Data tab you can toggle to draw own data points by clicking in the drawing area below, you can load random and predefined datasets and clear the drawing area.
In the k-Means tab you can change the configuration of k-Means by adding own centroids, adding a number of random centroids and clearing all centroids. In addition you can change the distance measure. Then you can run the k-Means algorithm either step by step or in a loop.
In the DBSCAN tab you can change the parameters e and minPts and choose the distance measure. Then you can perform the classification and clear the clustering again.
Each cluster is displayed in its own color. Outliers are displayed white with gray outline.

References