{"title": "Clustering Stable Instances of Euclidean k-means.", "book": "Advances in Neural Information Processing Systems", "page_first": 6500, "page_last": 6509, "abstract": "The Euclidean k-means problem is arguably the most widely-studied clustering problem in machine learning. While the k-means objective is NP-hard in the worst-case, practitioners have enjoyed remarkable success in applying heuristics like Lloyd's algorithm for this problem. To address this disconnect, we study the following question: what properties of real-world instances will enable us to design efficient algorithms and prove guarantees for finding the optimal clustering? We consider a natural notion called additive perturbation stability that we believe captures many practical instances of Euclidean k-means clustering. Stable instances have unique optimal k-means solutions that does not change even when each point is perturbed a little (in Euclidean distance). This captures the property that k-means optimal solution should be tolerant to measurement errors and uncertainty in the points. We design efficient algorithms that provably recover the optimal clustering for instances that are additive perturbation stable. When the instance has some additional separation, we can design a simple, efficient algorithm with provable guarantees that is also robust to outliers. We also complement these results by studying the amount of stability in real datasets, and demonstrating that our algorithm performs well on these benchmark datasets.", "full_text": "Clustering Stable Instances of Euclidean k-means\n\nAbhratanu Dutta\u2217\n\nNorthwestern University\n\nadutta@u.northwestern.edu\n\nAravindan Vijayaraghavan\u2217\n\nNorthwestern University\n\naravindv@northwestern.edu\n\nAlex Wang\u2020\n\nCarnegie Mellon University\n\nalexwang@u.northwestern.edu\n\nAbstract\n\nThe Euclidean k-means problem is arguably the most widely-studied clustering\nproblem in machine learning. 
While the k-means objective is NP-hard in the worst case, practitioners have enjoyed remarkable success in applying heuristics like Lloyd's algorithm to this problem. To address this disconnect, we study the following question: what properties of real-world instances will enable us to design efficient algorithms and prove guarantees for finding the optimal clustering? We consider a natural notion called additive perturbation stability that we believe captures many practical instances of Euclidean k-means clustering. Stable instances have unique optimal k-means solutions that do not change even when each point is perturbed a little (in Euclidean distance). This captures the property that the optimal k-means solution should be tolerant to measurement errors and uncertainty in the points. We design efficient algorithms that provably recover the optimal clustering for instances that are additive perturbation stable. When the instance has some additional separation, we can design a simple, efficient algorithm with provable guarantees that is also robust to outliers. We also complement these results by studying the amount of stability in real datasets, and demonstrating that our algorithm performs well on these benchmark datasets.

1 Introduction

One of the major challenges in the theory of clustering is to bridge the large disconnect between our theoretical and practical understanding of the complexity of clustering. While theory tells us that most common clustering objectives like the k-means or k-median problems are intractable in the worst case, many heuristics like Lloyd's algorithm or k-means++ seem to be effective in practice. In fact, this has led to the "CDNM" thesis [11, 9]: "Clustering is difficult only when it does not matter".

We try to address the following natural questions in this paper: Why are real-world instances of clustering easy?
Can we identify properties of real-world instances that make them tractable?

We focus on the Euclidean k-means clustering problem where we are given n points X = { x1, . . . , xn } ⊂ Rd, and we need to find k centers μ1, μ2, . . . , μk ∈ Rd minimizing the objective ∑x∈X mini∈[k] ‖x − μi‖². The k-means clustering problem is the most well-studied objective for clustering points in Euclidean space [18, 3]. The problem is NP-hard in the worst case [14] even for k = 2, and a constant factor hardness of approximation is known for larger k [5].

∗Supported by the National Science Foundation (NSF) under Grant No. CCF-1637585.
†Part of the work was done while the author was at Northwestern University.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

One way to model real-world instances of clustering problems is through instance stability, which is an implicit structural assumption about the instance. Practically interesting instances of the k-means clustering problem often have a clear optimal clustering solution (usually the ground-truth clustering) that is stable: i.e., it remains optimal even under small perturbations of the instance. As argued in [7], clustering objectives like k-means are often just a proxy for recovering a ground-truth clustering that is close to the optimal solution. Instances in practice always have measurement errors, and optimizing the k-means objective is meaningful only when the optimal solution is stable to these perturbations.

This notion of stability was formalized independently in a pair of influential works [11, 7]. The predominant strand of work on instance stability assumes that the optimal solution is resilient to multiplicative perturbations of the distances [11].
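The k-means objective defined above is straightforward to compute directly. The following minimal sketch evaluates ∑x∈X mini∈[k] ‖x − μi‖²; the point set and centers are illustrative toy data, not from the paper's experiments.

```python
import numpy as np

def kmeans_cost(X, centers):
    """k-means objective: sum over points of squared distance to the nearest center."""
    # dists[i, j] = ||X[i] - centers[j]||^2
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()

# Toy example: two well-separated groups on the line y = 0.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
centers = np.array([[0.5, 0.0], [10.5, 0.0]])
print(kmeans_cost(X, centers))  # each point is 0.5 from its center -> 4 * 0.25 = 1.0
```

Finding centers that minimize this objective is the hard part; evaluating it for candidate centers, as here, takes O(nkd) time.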
For any γ ≥ 1, a metric clustering instance (X, d) on point set X ⊂ Rd and metric d : X × X → R+ is said to be γ-factor stable iff the (unique) optimal clustering C1, . . . , Ck of X remains the optimal solution for any instance (X, d′) where any (subset) of the distances are increased by up to a γ factor, i.e., d(x, y) ≤ d′(x, y) ≤ γd(x, y) for any x, y ∈ X. In a series of recent works [4, 8] culminating in [2], it was shown that 2-factor perturbation stable (i.e., γ ≥ 2) instances of k-means can be solved in polynomial time.

Multiplicative perturbation stability represents an elegant, well-motivated formalism that captures robustness to measurement errors for clustering problems in general metric spaces (γ = 1.1 captures relative errors of 10% in the distances). However, multiplicative perturbation stability has the following drawbacks in the case of Euclidean clustering problems:

• Measurement errors in Euclidean instances are better captured using additive perturbations. Uncertainty of δ in the positions of x, y leads to an additive error of δ in ‖x − y‖₂, irrespective of how large or small ‖x − y‖₂ is.

• The amount of stability γ needed to enable efficient algorithms (i.e., γ ≥ 2) often implies strong structural conditions that are unlikely to be satisfied by many real-world datasets. For instance, γ-factor perturbation stability implies that every point is a multiplicative factor of γ closer to its own center (say μi) than to any other cluster center μj.

• Algorithms that are known to have provable guarantees under multiplicative perturbation stability are based on single-linkage or MST algorithms that are very non-robust by nature. In the presence of a few outliers or noise, any incorrect decision in the lower layers gets propagated up to the
higher levels.

In this work, we consider a natural additive notion of stability for Euclidean instances, where the optimal k-means clustering solution does not change even when each point is moved by a small Euclidean distance of up to δ. Moving each point by at most δ corresponds to a small additive perturbation of the pairwise distances between the points³. Unlike multiplicative notions of perturbation stability [11, 4], this notion of additive perturbation is not scale invariant. Hence the normalization or scale of the perturbation is important.

Ackerman and Ben-David [1] initiated the study of additive perturbation stability when the distance between any pair of points can be changed by at most δ = ε diam(X), with diam(X) being the diameter of the whole dataset. Their algorithms take time n^O(k/ε²) = n^O(k diam²(X)/δ²) and correspond to polynomial time algorithms when k, 1/ε are constants. However, this dependence of k diam²(X)/δ² in the exponent is not desirable, since the diameter is a very non-robust quantity: the presence of one outlier (even one that is far away from the decision boundary) can increase the diameter arbitrarily. Hence, these guarantees are useful mainly when the whole instance lies within a small ball and for a small number of clusters [1, 10]. Our notion of additive perturbation stability will use a different scale parameter that is closely related to the distance between the centers, instead of the diameter diam(X). As we will see soon, our results for additive perturbation stability have no explicit dependence on the diameter, and allow instances to have potentially unbounded clusters (as in the case of far-away outliers). Further, with some additional assumptions, we also obtain polynomial time algorithmic guarantees for large k.

³Note that not all additive perturbations to the distances can be captured by an appropriate movement of the points in the cluster.
Hence the notion we consider in our paper is a weaker assumption on the instance.

Figure 1: a) Left: the figure shows an instance with k = 2 satisfying ε-APS, with D being the separation between the means. The half-angle of the cone is arctan(1/ε) and the distance between μ1 and the apex of the cone (Δ) is at most D/2. b) Right: the figure shows a (ρ, Δ, ε)-separated instance, with scale parameter Δ. All the points lie inside the cones of half-angle arctan(1/ε), whose apexes are separated by a margin of at least ρ.

1.1 Additive Perturbation Stability and Our Contributions

We consider a notion of additive stability where the points in the instance can be moved by at most δ = εD, where ε ∈ (0, 1) is a parameter, and D = max_{i≠j} D_{i,j} = max_{i≠j} ‖μi − μj‖ is the maximum distance between pairs of means. Suppose X is a k-means clustering instance with optimal clustering C1, C2, . . . , Ck. We say that X is ε-APS (additive perturbation stable) iff every δ = εD additive perturbation of X has C1, C2, . . . , Ck as an optimal clustering solution. (See Definition 2.3 for a formal definition.) Note that there is no restriction on the diameter of the instance, or even the diameters of the individual clusters. Hence, our notion of additive perturbation stability allows the instance to be unbounded.

Geometric property of ε-APS instances. Clusters in the optimal solution of an ε-APS instance satisfy a natural geometric condition, which implies an "angular separation" between every pair of clusters.

Proposition 1.1 (Geometric Implication of ε-APS). Consider an ε-APS instance X, and let Ci, Cj be two clusters of the optimal solution. Any point x ∈ Ci lies in a cone whose axis is along the direction (μi − μj) with half-angle arctan(1/ε). Hence if u is the unit vector along μi − μj then
Hence if u is the unit vector along \u00b5i \u2212 \u00b5j then\n\n\u2200x \u2208 Ci,\n\n|(cid:104)x \u2212 \u00b5i+\u00b5j\n(cid:107)x \u2212 \u00b5i+\u00b5j\n\n2\n\n, u(cid:105)|\n(cid:107)2\n\n2\n\n>\n\n\u03b5\u221a\n1 + \u03b52\n\n.\n\n(1)\n\nFor any j \u2208 [k], all the points in cluster Ci lie inside the cone with its axis along (\u00b5i \u2212 \u00b5j) as in\n2 \u2212 \u03b5)D. We will call \u2206 the\nFigure 1. The distance between \u00b5i and the apex of the cone is \u2206 = ( 1\nscale parameter of the clustering.\nWe believe that many clustering instances in practice satisfy \u03b5-APS condition for reasonable constants\n\u03b5. In fact, our experiments in Section 4 suggest that the above geometric condition is satis\ufb01ed for\nreasonable values e.g., \u03b5 \u2208 (0.001, 0.2).\nWhile the points can be arbitrarily far away from their own means, the above angular separation (1)\nis crucial in proving the polynomial time guarantees for our algorithms. For instance, this implies\nthat at least 1/2 of the points in a cluster Ci are within a Euclidean distance of at most O(\u2206/\u03b5) from\n\u00b5i. This geometric condition (1) of the dataset enables the design of a tractable algorithm for k = 2\nwith provable guarantees. This algorithm is based on a modi\ufb01cation of the perceptron algorithm in\nsupervised learning, and is inspired by [13].\nInformal Theorem 1.2. For any \ufb01xed \u03b5 > 0, there exists an dnpoly(1/\u03b5) time algorithm that correctly\nclusters all \u03b5-APS 2-means instances.\n\nFor k-means clustering, similar techniques can be used to learn the separating halfspace for each\npair of clusters. But this incurs an exponential dependence on k2, which renders this approach\n\n3\n\n\finef\ufb01cient for large k.4 We now consider a natural strengthening of this assumption that allows us to\nget poly(n, d, k) guarantees.\n\nAngular Separation with additional Margin Separation. 
We consider a natural strengthening of additive perturbation stability where there is an additional margin between any pair of clusters. This is reminiscent of margin assumptions in supervised learning of halfspaces and the spectral clustering guarantees of Kumar and Kannan [15] (see Section 1.2). Consider a k-means clustering instance X having an optimal solution C1, C2, . . . , Ck. This instance is (ρ, Δ, ε)-separated iff for each i ≠ j ∈ [k], the subinstance induced by Ci, Cj has scale parameter Δ, and all points in the clusters Ci, Cj lie inside cones of half-angle arctan(1/ε), which are separated by a margin of at least ρ. This is implied by the stronger condition that the subinstance induced by Ci, Cj is ε-additive perturbation stable with scale parameter Δ even when Ci and Cj are moved towards each other by ρ. Please see Figure 1 for an illustration. We formally define (ρ, Δ, ε)-separated stable instances geometrically in Section 2.

Informal Theorem 1.3 (Polytime algorithm for (ρ, Δ, ε)-separated instances). There is an algorithm running in time⁵ Õ(n²kd) that, given any instance X that is (ρ, Δ, ε)-separated with ρ ≥ Ω(Δ/ε²), recovers the optimal clustering C1, . . . , Ck.

A formal statement of the theorem (with unequal sized clusters), and its proof, are given in Section 3. We prove these polynomial time guarantees for a new, simple algorithm (Algorithm 3.1). The algorithm constructs a graph with one vertex for each point, and edges between points that are within a distance of at most r (for an appropriate threshold r). The algorithm then finds the k largest connected components.
It then uses the k empirical means of these k components to cluster all the points.

In addition to having provable guarantees, the algorithm also seems efficient in practice, and performs well on standard clustering datasets. Experiments that we conducted on some standard clustering datasets from UCI suggest that our algorithm manages to almost recover the ground truth, and achieves a k-means objective cost that is very comparable to Lloyd's algorithm and k-means++ (see Section 4). In fact, our algorithm can also be used to initialize Lloyd's algorithm: our guarantees show that when the instance is (ρ, Δ, ε)-separated, one iteration of Lloyd's algorithm already finds the optimal clustering. Experiments suggest that our algorithm finds initializers of smaller k-means cost compared to the initializers of k-means++ [3], and also recovers the ground truth to good accuracy (see Section 4 and the Supplementary material for details).

Robustness to Outliers. Perturbation stability requires the optimal solution to remain completely unchanged under any valid perturbation. In practice, the stability of an instance may be dramatically reduced by a few outliers. We show provable guarantees for a slight modification of the algorithm in the setting where an η-fraction of the points can be arbitrary outliers and do not lie in the stable regions. More formally, we assume that we are given an instance X ∪ Z where there is an (unknown) set of points Z with |Z| ≤ η|X| such that X is a (ρ, Δ, ε)-separated stable instance. Here ηn is assumed to be smaller than the size of the smallest cluster by a constant factor. This is similar to the robust perturbation resilience considered in [8, 16].
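The graph-based procedure described above, together with a low-degree pruning step for the outlier setting, can be sketched as follows. This is a minimal illustration only: the threshold r, the degree cutoff, and the toy data are our own assumptions, not the paper's tuned choices.

```python
import numpy as np
from collections import defaultdict

def threshold_graph_cluster(X, k, r, min_degree=0):
    """Sketch: join points within distance r, optionally drop low-degree
    vertices (outlier pruning), take the k largest connected components,
    then assign every point to the nearest component mean."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = (dist < r) & ~np.eye(n, dtype=bool)
    # Outlier pruning: keep only vertices with enough close neighbors.
    keep = np.where(adj.sum(axis=1) >= min_degree)[0]
    # Connected components via union-find over the kept vertices.
    parent = {i: i for i in keep}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in keep:
        for j in keep:
            if adj[i, j]:
                parent[find(i)] = find(j)
    comps = defaultdict(list)
    for i in keep:
        comps[find(i)].append(i)
    largest = sorted(comps.values(), key=len, reverse=True)[:k]
    centers = np.array([X[idx].mean(axis=0) for idx in largest])
    # Assign all points (including pruned ones) to the nearest center.
    return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)

# Two separated groups plus one far outlier that must not form its own cluster.
X = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0],
              [10.0, 0.0], [10.5, 0.0], [11.0, 0.0],
              [100.0, 100.0]])
labels = threshold_graph_cluster(X, k=2, r=1.0, min_degree=1)
```

In the toy run, the pruning step drops the isolated outlier before components are extracted, so the two dense groups supply the k component means, and the outlier is simply assigned to the nearest of them.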
Our experiments in Section 4 indicate that the stability or separation can increase a lot after ignoring a few points close to the margin.

In what follows, wmax = maxi |Ci|/n and wmin = mini |Ci|/n are the maximum and minimum weights of clusters, and η < wmin.

Theorem 1.4. Given X ∪ Z where X is (ρ, Δ, ε)-separated with η < wmin and

ρ = Ω( ((wmax + η)/(wmin − η)) · (Δ/ε²) ),

there is a polynomial time algorithm running in time Õ(n²dk) that returns a clustering consistent with C1, . . . , Ck on X.

This robust algorithm is effectively the same as Algorithm 3.1 with one additional step that removes all low-degree vertices in the graph. This step removes bad outliers in Z without removing too many points from X.

⁴We remark that the results of [1] also incur an exponential dependence on k.
⁵The Õ hides an inverse Ackermann function of n.

1.2 Comparisons to Other Related Work

Awasthi et al. [4] showed that γ-multiplicative perturbation stable instances also satisfy the notion of γ-center based stability (every point is a γ-factor closer to its center than to any other center). They showed that an algorithm based on the classic single linkage algorithm works under this weaker notion when γ ≥ 3. This was subsequently improved by [8], and the best result along these lines [2] gives a polynomial time algorithm that works for γ ≥ 2. A robust version of (γ, η)-perturbation resilience was explored for center-based clustering objectives [8]. As such, the notions of additive perturbation stability and of (ρ, Δ, ε)-separated instances are incomparable to the various notions of multiplicative perturbation stability.
Further, as argued in [9], we believe that additive perturbation stability is more realistic for Euclidean clustering problems.

Ackerman and Ben-David [1] initiated the study of various deterministic assumptions for clustering instances. The measure of stability most related to this work is Center Perturbation (CP) clusterability (an instance is δ-CP-clusterable if perturbing the centers by a distance of δ does not increase the cost much). A subtle difference is their focus on obtaining solutions with small objective cost [1], while our goal is to recover the optimal clustering. However, the main qualitative difference is how the length scale is defined; this is crucial for additive perturbations. The run time of the algorithm in [1] is n^poly(k, diam(X)/δ), where the length scale of the perturbations is diam(X), the diameter of the whole instance. Our notion of additive perturbations uses a much smaller length scale of Δ (essentially the inter-mean distance; see Prop. 1.1 for a geometric interpretation), and Theorem 1.2 gives a run-time guarantee of n^poly(Δ/δ) for k = 2 (Theorem 1.2 is stated in terms of ε = δ/Δ). By using the largest inter-mean distance instead of the diameter as the length scale, our algorithmic guarantees can also handle unbounded clusters with arbitrarily large diameters and outliers.

The exciting results of Kumar and Kannan [15] and Awasthi and Sheffet [6] also gave deterministic margin-separation conditions under which spectral clustering (PCA followed by k-means) finds the optimum clusters. Suppose σ = ‖X − C‖²op/n is the "spectral radius" of the dataset, where C is the matrix given by the centers.
In the case of equal-sized clusters, the improved results of [6] prove approximate recovery of the optimal clustering if the margin ρ between the clusters along the line joining the centers satisfies ρ = Ω(√k σ). Our notion of margin ρ in (ρ, Δ, ε)-separated instances is analogous to the margin separation notion used by the above results on spectral clustering [15, 6]. In particular, we require a margin of ρ = Ω(Δ/ε²) where Δ is our scale parameter, with no extra √k factor. However, we emphasize that the two margin conditions are incomparable, since the spectral radius σ is incomparable to the scale parameter Δ.

We now illustrate the difference between these deterministic conditions by presenting a couple of examples. Consider an instance with n points drawn from a mixture of k Gaussians in d dimensions with identical diagonal covariance matrices with variance 1 in the first O(1) coordinates and roughly 1/d in the others, and all the means lying in the subspace spanned by these first O(1) coordinates. In this setting, the results of [15, 6] require a margin separation of at least √(k log n) between clusters. On the other hand, these instances satisfy our geometric conditions with ε = Ω(1) and Δ ∼ √(log n), and therefore our algorithm only needs a margin separation of ρ ∼ √(log n) (hence, saving a factor of √k).⁶ However, if the n points were drawn from a mixture of spherical Gaussians in high dimensions (with d ≫ k), then the margin condition required for [15, 6] is weaker.

2 Stability definitions and geometric properties

X ⊆ Rd will denote a k-means clustering instance and C1, . . . , Ck will often refer to its optimal clustering. It is well-known that given a cluster C, the value of μ minimizing ∑x∈C ‖x − μ‖² is given by μ = (1/|C|) ∑x∈C x, the mean of the points in the set. We give more preliminaries about the k-means problem in the Supplementary Material.

2.1 Balance Parameter

We define an instance parameter, β, capturing how balanced a given instance's clusters are.

⁶Further, while algorithms for learning GMM models may work here, adding some outliers far from the decision boundary will cause many of these algorithms to fail, while our algorithm is robust to such outliers.

Figure 2: An example of the family of perturbations considered by Lemma 2.4. Here v is in the upwards direction. If a is to the right of the diagonal solid line, then a′ will be to the right of the slanted dashed line and will lie on the wrong side of the separating hyperplane.

Definition 2.1 (Balance parameter). Given an instance X with optimal clustering C1, . . . , Ck, we say X satisfies balance parameter β ≥ 1 if for all i ≠ j, β|Ci| > |Cj|.

We will write β in place of β(X) when the instance is clear from context.

2.2 Additive perturbation stability

Definition 2.2 (ε-additive perturbation). Let X = { x1, . . . , xn } be a k-means clustering instance with optimal clustering C1, C2, . . . , Ck whose means are given by μ1, μ2, . . . , μk. Let D = max_{i,j} ‖μi − μj‖. We say that the instance X′ = { x′1, . . . , x′n } is an ε-additive perturbation of X if for all i, ‖x′i − xi‖ ≤ εD.

Definition 2.3 (ε-additive perturbation stability). Let X be a k-means clustering instance with optimal clustering C1, C2, . . . , Ck. We say that X is ε-additive perturbation stable (APS) if every ε-additive perturbation of X has a unique optimal clustering given by C1, C2, . . . , Ck.

Intuitively, the difficulty of the clustering task increases as the stability parameter ε decreases. For example, when ε = 0 the set of ε-APS instances contains any instance with a unique solution. In the following we will only consider ε > 0.

2.3 Geometric implication of ε-APS

Let X be an ε-APS k-means clustering instance such that each cluster has at least 4 points. Fix i ≠ j, consider a pair of clusters Ci, Cj with means μi, μj, and define the following notation.

• Let Di,j = ‖μi − μj‖ be the distance between μi and μj, and let D = max_{i′,j′} ‖μi′ − μj′‖ be the maximum distance between any pair of means.

• Let u = (μi − μj)/‖μi − μj‖ be the unit vector in the intermean direction. Let V = u⊥ be the space orthogonal to u. For x ∈ Rd, let x(u) and x(V) be the projections of x onto u and V.

• Let p = (μi + μj)/2 be the midpoint between μi and μj.

A simple perturbation that we can use will move all points in Ci and Cj along the direction μi − μj by a δ amount, while another perturbation moves these points along μj − μi; these will allow us to conclude that there exists a margin of size 2δ. To establish Proposition 1.1, we will choose a clever ε-perturbation that allows us to show that clusters must live in cone regions (see Figure 1, left). This perturbation chooses two clusters and moves their means in opposite directions orthogonal to u while moving a single point towards the other cluster (see Figure 2). The following lemma establishes Proposition 1.1.

Lemma 2.4. For any x ∈ Ci ∪ Cj, ‖(x − p)(V)‖ < (1/ε)(‖(x − p)(u)‖ − εDi,j).

Proof. Let v ∈ V be a unit vector perpendicular to u. Without loss of generality, let a, b, c, d ∈ Ci be distinct. Note that Di,j ≤ D and consider the ε-additive perturbation given by

X′ = { a − δu, b + δu, c − δv, d − δv } ∪ { x − (δ/2)v | x ∈ Ci \ { a, b, c, d } } ∪ { x + (δ/2)v | x ∈ Cj }

and X \ {Ci ∪ Cj}, where δ = εDi,j (see Figure 2). By assumption, { Ci, Cj } remains the optimal clustering of Ci ∪ Cj. We have constructed X′ such that the new means are at μ′i = μi − (εDi,j/2)v and μ′j = μj + (εDi,j/2)v, and the midpoint between the means is p′ = p. The halfspace containing μ′i given by the linear separator between μ′i and μ′j is ⟨x − p′, μ′i − μ′j⟩ > 0. Hence, as a′ is classified correctly by the ε-APS assumption,

⟨a′ − p′, μ′i − μ′j⟩ = ⟨a − p − εDi,j u, Di,j u − εDi,j v⟩ = Di,j (⟨a − p, u⟩ − ε⟨a − p, v⟩ − εDi,j) > 0.

Then noting that ⟨u, a − p⟩ is positive, we have that ⟨a − p, v⟩ < (1/ε)(‖(a − p)(u)‖ − εDi,j).

Note that this property follows from perturbations which only affect two clusters at a time. Our results follow from this weaker notion.

2.4 (ρ, Δ, ε)-separation

Motivated by Lemma 2.4, we define a geometric condition where the angular separation and margin separation are parametrized separately.
This notion of separation is implied by a stronger stability assumption where any pair of clusters is ε-APS with scale parameter Δ even after being moved towards each other a distance of ρ.

We say that a pair of clusters is (ρ, Δ, ε)-separated if their points lie in cones with axes along the intermean direction, half-angle arctan(1/ε), and apexes at distance Δ from their means and at least ρ from each other (see Figure 1, right). Formally, we require the following.

Definition 2.5 (Pairwise (ρ, Δ, ε)-separation). Given a pair of clusters Ci, Cj with means μi, μj, let u = (μi − μj)/‖μi − μj‖ be the unit vector in the intermean direction and let p = (μi + μj)/2. We say that Ci and Cj are (ρ, Δ, ε)-separated if Di,j ≥ ρ + 2Δ and, for all x ∈ Ci ∪ Cj,

‖(x − p)(V)‖ ≤ (1/ε)(‖(x − p)(u)‖ − (Di,j/2 − Δ)).

Definition 2.6 ((ρ, Δ, ε)-separation). We say that an instance X is (ρ, Δ, ε)-separated if every pair of clusters in the optimal clustering is (ρ, Δ, ε)-separated.

3 k-means clustering for general k

We assume that our instance has balance parameter β. Our algorithm takes as input the set of points X and k, and outputs a clustering of all the points.

Algorithm 3.1.
Input: X = { x1, . . . , xn }, k.
1: for all pairs a, b of distinct points in { xi } do
2:   Let r = ‖a − b‖ be our guess for ρ
3:   procedure INITIALIZE
4:     Create graph G on vertex set { x1, . . . , xn } where xi and xj have an edge iff ‖xi − xj‖ < r
5:     Let a1, . . . , ak ∈ Rd where ai is the mean of the ith largest connected component of G
6:   procedure ASSIGN
7:     Let C1, . . . , Ck be the clusters obtained by assigning each point in X to the closest ai
8:   Calculate the k-means objective of C1, . . . , Ck
9: Return the clustering with the smallest k-means objective found above

Theorem 3.2. Algorithm 3.1 recovers C1, . . . , Ck for any (ρ, Δ, ε)-separated instance with ρ = Ω(Δ/ε² + βΔ/ε), and the running time is Õ(n²kd).

We maintain the connected components and their centers via a union-find data structure and keep them updated as we increase r and add edges to the dynamic graph. Since we go over n² possible choices of r and each pass takes O(kd) time, the algorithm runs in Õ(n²kd).

The rest of the section is devoted to proving Theorem 3.2. Define the following regions of Rd for every pair i, j. Given i, j, let Ci, Cj be the corresponding clusters with means μi, μj. Let u = (μi − μj)/‖μi − μj‖ be the unit vector in the inter-mean direction and p = (μi + μj)/2 be the point between the two means. We first define formally S(cone)_{i,j}, which was described in the introduction (the feasible region), and two other regions of the clusters that will be useful in the analysis (see Figure 1b).
We observe that Ci belongs to the intersection of all the cones S^(cone)_{i,j} over j ≠ i.

Definition 3.3.
• S^(cone)_{i,j} = { x ∈ Rd : ‖(x − (µi − Δu))^(V)‖ ≤ (1/ε) ⟨x − (µi − Δu), u⟩ },
• S^(nice)_{i,j} = { x ∈ S^(cone)_{i,j} : ⟨x − µi, u⟩ ≤ 0 },
• S^(good)_i = ⋂_{j≠i} S^(nice)_{i,j}.

The nice area of i with respect to j, i.e., S^(nice)_{i,j}, is defined as all points in the cap of S^(cone)_{i,j} "above" µi. The good area of a cluster, S^(good)_i, is the intersection of its nice areas with respect to all other clusters.

It suffices to prove the following two main lemmas. Lemma 3.4 states that the ASSIGN subroutine correctly clusters all points given an initialization satisfying certain properties. Lemma 3.5 states that the initialization returned by the INITIALIZE subroutine satisfies these properties when we guess r = ρ correctly. As ρ is only used as a threshold on edge lengths, testing the distances between all pairs of data points, i.e., { ‖a − b‖ : a, b ∈ X }, suffices.

Lemma 3.4. For a (ρ, Δ, ε)-separated instance with ρ = Ω(Δ/ε²), the ASSIGN subroutine recovers C1, C2, · · · Ck correctly when initialized with k points { a1, a2, . . . , ak } where ai ∈ S^(good)_i.

Lemma 3.5. For a (ρ, Δ, ε)-separated instance with balance parameter β and ρ = Ω(βΔ/ε), the INITIALIZE subroutine outputs one point each from { S^(good)_i : i ∈ [k] } when r = ρ.

To prove Lemma 3.5 we define a region of each cluster, S^(core)_i, the core, such that most (at least a β/(1 + β) fraction) of the points in Ci will belong to the connected component containing S^(core)_i. Hence, any large connected component (in particular, the k largest ones) must contain the core of one of the clusters. Meanwhile, the margin ensures points across clusters are not connected. Further, since S^(core)_i accounts for most points in Ci, the angular separation ensures that the empirical mean of the connected component is in S^(good)_i.

4 Experimental results

We evaluate Algorithm 3.1 on multiple real-world datasets, compare its performance to that of k-means++, and check how well these datasets satisfy our geometric conditions. See the supplementary results for details about ground-truth recovery.

Datasets. Experiments were run on unnormalized and normalized versions of four labeled datasets from the UCI Machine Learning Repository: Wine (n = 178, k = 3, d = 13), Iris (n = 150, k = 3, d = 4), Banknote Authentication (n = 1372, k = 2, d = 5), and Letter Recognition (n = 20,000, k = 26, d = 16). Normalization was used to scale each feature to unit range.

Performance. We ran Algorithm 3.1 on the datasets. The cost of the returned solution for each of the normalized and unnormalized versions of the datasets is recorded in column 2 of Table 1. Our guarantees show that under (ρ, Δ, ε)-separation for appropriate values of ρ (see Section 3), the algorithm will find the optimal clustering after a single iteration of Lloyd's algorithm.
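The single Lloyd iteration referred to above can be written out explicitly; this is the generic textbook update (assign each point to its nearest center, then recenter), not code from the paper:

```python
import numpy as np

def lloyd_iteration(X, centers):
    """One iteration of Lloyd's algorithm: assign each point to its
    nearest center, then move each center to the mean of its points."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centers = np.array([
        X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
        for c in range(len(centers))
    ])
    return new_centers, labels
```

Starting from centers in the good areas, one such step already snaps the centers onto the true cluster means.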
Even when ρ does not satisfy our requirement, we can use our algorithm as an initialization heuristic for Lloyd's algorithm. We compare our initialization with the k-means++ initialization heuristic (D² weighting). Table 1 reports the cost of our initialization, the smallest initialization cost over 1000 trials of k-means++ on each dataset, the solution found by Lloyd's algorithm using our initialization, and the smallest k-means cost over 100 trials of Lloyd's algorithm using a k-means++ initialization.

Separation in real data sets. As the ground truth clusterings in our datasets are not in general linearly separable, we consider the clusters given by Lloyd's algorithm initialized with the ground truth solutions.

Table 1: Comparison of k-means cost for Alg 3.1 and k-means++

Dataset             | Alg 3.1   | k-means++ | Alg 3.1 with Lloyd's | k-means++ with Lloyd's
Wine                | 2.376e+06 | 2.426e+06 | 2.371e+06            | 2.371e+06
Wine (normalized)   | 48.99     | 65.50     | 48.99                | 48.95
Iris                | 81.04     | 86.45     | 78.95                | 78.94
Iris (normalized)   | 7.035     | 7.676     | 6.998                | 6.998
Banknote Auth.      | 44808.9   | 49959.9   | 44049.4              | 44049.4
Banknote (norm.)    | 138.4     | 155.7     | 138.1                | 138.1
Letter Recognition  | 744707    | 921643    | 629407               | 611268
Letter Rec. (norm.) | 3367.8    | 4092.1    | 2767.5               | 2742.3

Table 2: Values of (ρ, Δ, ε) satisfied by a (1 − η)-fraction of points

Dataset            | η   | ε    | minimum ρ/Δ | average ρ/Δ | maximum ρ/Δ
Wine               | 0.1 | 0.1  | 0.566       | 1.5         | 3.05
Wine               | 0.1 | 0.01 | 0.609       | 1.53        | 3.07
Iris               | 0.1 | 0.1  | 0.398       | 4.35        | 7.7
Iris               | 0.1 | 0.01 | 0.496       | 5.04        | 9.06
Banknote Auth.     | 0.1 | 0.1  | 0.264       | 0.264       | 0.264
Banknote Auth.     | 0.1 | 0.01 | 0.398       | 0.398       | 0.398
Letter Recognition | 0.1 | 0.1  | 0.018       | 2.19        | 7.11
Letter Recognition | 0.1 | 0.01 | 0.378       | 3.07        | 11.4

Values of ε for Lemma 2.4. We calculate the maximum value of ε such that a given pair of clusters satisfies the geometric condition in Proposition 1.1.
The results are collected in Table 3 in the Supplementary material. We see that the average value of ε lies approximately in the range (0.01, 0.1).

Values of (ρ, Δ, ε)-separation. We attempt to measure the values of ρ, Δ, and ε in the datasets. For η = 0.05, 0.1, ε = 0.1, 0.01, and a pair of clusters Ci, Cj, we calculate ρ as the maximum margin separation a pair of axis-aligned cones with half-angle arctan(1/ε) can have while capturing a (1 − η)-fraction of all points. For some datasets and values of η and ε, there may not be any such value of ρ; in this case we leave the row blank. The results for the unnormalized datasets with η = 0.1 are collected in Table 2. (See the Supplementary material for the full table.)

5 Conclusion and Future Directions

We studied a natural notion of additive perturbation stability that we believe captures many real-world instances of Euclidean k-means clustering. We first gave a polynomial-time algorithm for k = 2. For large k, under an additional margin assumption, we gave a fast algorithm of independent interest that provably recovers the optimal clustering under these assumptions (in fact, the algorithm is also robust to noise and outliers). An appealing aspect of this algorithm is that it is not tailored to the model; our experiments indicate that it works well in practice even when the assumptions do not hold. Our results with the margin assumption hence give an algorithm that (A) has provable guarantees (under reasonable assumptions), (B) is efficient and practical, and (C) is robust to errors. While the margin assumption seems qualitatively realistic, we believe that the exact condition we assume is not optimal. An interesting open question is understanding whether such a margin is necessary for designing tractable algorithms for large k.
We conjecture that for higher k, clustering remains hard even with ε-additive perturbation resilience (without any additional margin assumption). Improving the margin condition or proving lower bounds on the amount of additive stability required are interesting future directions.