Here are some points, which happen to have been sampled from six round Gaussian distributions. Can we figure out where the centers of those Gaussians were, and which points came from which cluster? In theory, this is a hard problem: there's no way to know we have the best clustering without checking every possible assignment of points to clusters.
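Data like this can be generated in a few lines. This is a sketch, not the demo's actual code; the point counts, spread, and coordinate range are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Six random cluster centers (the coordinate range is an assumption).
true_centers = rng.uniform(0, 10, size=(6, 2))

# Sample 50 points from a round (isotropic) Gaussian around each center.
points = np.concatenate(
    [rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in true_centers]
)
```

"Round" here means the Gaussian has the same standard deviation in every direction, so each blob is circular.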
We use an iterative algorithm, k-means. Rather than solving the hard problem of finding the best cluster assignments directly, this algorithm alternates between two easy problems: finding the cluster center closest to each point, and calculating an average.
Click the “Random Clusters” button to drop in six random cluster centers. The light gray lines represent the “continental divide” between clusters (a Voronoi diagram): any point in the region associated with a cluster is closer to that cluster's center than to any other center.
Click the “Cluster Points” button to assign each point to its nearest cluster. You should see cluster numbers appear. Each time a point's cluster assignment changes, its number is highlighted for a second.
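The assignment step is just a nearest-center lookup. A minimal sketch (the function name and array shapes are my own, not the demo's):

```python
import numpy as np

def assign_points(points, centers):
    """Return, for each point, the index of its nearest cluster center."""
    # Pairwise squared distances via broadcasting: shape (n_points, n_centers).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

Squared distance is enough here, since the nearest center under distance and squared distance is the same.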
Click the “Move Clusters” button to shift each cluster to the centroid of the points it is responsible for. The centroid is just the average of those points' x and y coordinates.
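The update step, continuing the same sketch (again, names are my own assumptions):

```python
import numpy as np

def move_centers(points, labels, centers):
    """Move each center to the centroid (mean) of its assigned points."""
    new_centers = centers.copy()
    for k in range(len(centers)):
        mine = points[labels == k]
        if len(mine):  # leave an empty cluster in place for now
            new_centers[k] = mine.mean(axis=0)
    return new_centers
```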
Now alternate between clicking those two buttons. If a cluster loses all of its points, I move it to a random location. Eventually, you should reach a state where no point wants to change clusters, so the clusters stop moving. The algorithm has converged. This may not be the best overall clustering, but it is a local optimum: no reassignment or recentering step can improve it further.
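Putting the two steps in a loop gives plain k-means. This is a generic sketch of the algorithm under the assumptions above, not the demo's implementation; like the demo, it re-seeds an empty cluster at a random point:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and averaging until no point changes cluster."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct random points.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: no point wants to change clusters
        labels = new_labels
        # Update step: move each center to the mean of its points.
        for j in range(k):
            mine = points[labels == j]
            if len(mine):
                centers[j] = mine.mean(axis=0)
            else:
                # Empty cluster: drop it at a random point.
                centers[j] = points[rng.integers(len(points))]
    return centers, labels
```

Because convergence is only to a local optimum, the result depends on the random initialization; running with several seeds and keeping the lowest-distortion result is a common remedy.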