Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
Moses Charikar, Monika Henzinger, Lunjia Hu, Maximilian Vötsch, Erik Waingarten
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and k-means++ can take Ω(ndk) time when clustering n points in a d-dimensional space (represented by an n×d matrix X) into k clusters. On massive datasets with moderate to large k, the multiplicative k factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time O(nnz(X)+nlogn) for arbitrary k. Here nnz(X) is the total number of non-zero entries in the input dataset X, which is upper bounded by nd and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio ˜O(k4) on any input dataset for the k-means objective, and our experiments show that the quality of the clusters found by our algorithm is usually much better than this worst-case bound. We use our algorithm for k-means clustering and for coreset construction; our experiments show that it gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks. Our theoretical analysis is based on novel results of independent interest. We show that the approximation ratio achieved after a random one-dimensional projection can be lifted to the original points and that k-means++ seeding can be implemented in expected time O(nlogn) in one dimension.