Jack Dermody

K-means clustering

Text Clustering Four Ways

K-means clustering tries to partition the set of vectors into k randomly initialized clusters, so the results change from run to run.

With NNMF (non-negative matrix factorization), the results will again change each time, but a cursory examination of the clustering results shows that it seems to do a better job on the dataset than k-means.

With random projections, the projected document vectors are now length 512 (down from around 1500), yet the result is much the same as the initial k-means clustering, while the clustering computation is reduced by about two thirds.

With LSA (latent semantic analysis), the latent document vectors are now length 256, half the length of the random projections, so k-means runs about twice as fast. The final result is arguably better as well, with LSA finding some interesting sets of documents that were missed by vanilla k-means.

We haven't formally evaluated the results in this tutorial, but a cursory examination of the four sets of results suggests that NNMF is well suited to text clustering, while k-means in its three variants gives good but somewhat varied results.
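To make the k-means and random projection steps described above more concrete, here is a minimal sketch in plain Python using only the standard library. It is not the tutorial's own implementation: the function names (`kmeans`, `random_projection`), the Gaussian projection matrix, and the toy four-dimensional "document vectors" are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iterations=100, seed=None):
    """Hypothetical helper: cluster `vectors` into k groups.

    Centroids start as k randomly chosen input vectors, which is why
    results can differ between runs unless a seed is fixed.
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so we have converged
        assignments = new_assignments
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignments, centroids


def random_projection(vectors, target_dim, seed=None):
    """Hypothetical helper: project vectors to `target_dim` dimensions.

    Uses a random Gaussian matrix scaled by 1/sqrt(target_dim), which
    approximately preserves pairwise distances on average.
    """
    rng = random.Random(seed)
    source_dim = len(vectors[0])
    scale = 1.0 / target_dim ** 0.5
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
        for _ in range(target_dim)
    ]
    return [[sum(m * x for m, x in zip(row, v)) for row in matrix] for v in vectors]


# Toy data (assumed): two obvious groups of four-dimensional vectors.
docs = [
    [1.0, 0.0, 0.0, 0.9],
    [0.9, 0.1, 0.0, 1.0],
    [0.0, 1.0, 0.9, 0.0],
    [0.1, 0.9, 1.0, 0.0],
]
assignments, centroids = kmeans(docs, k=2, seed=0)
projected = random_projection(docs, target_dim=2, seed=1)
projected_assignments, _ = kmeans(projected, k=2, seed=0)
print(assignments, projected_assignments)
```

Because each distance calculation costs time proportional to the vector length, clustering shorter projected vectors is cheaper per iteration, which is the saving the text attributes to the 512- and 256-dimensional representations. Cluster labels themselves are arbitrary, so two runs should be compared by which documents are grouped together rather than by label number.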