{"title": "How to tell when a clustering is (approximately) correct using convex relaxations", "book": "Advances in Neural Information Processing Systems", "page_first": 7407, "page_last": 7418, "abstract": "We introduce the Sublevel Set (SS) method, a generic method to obtain sufficient guarantees of near-optimality and uniqueness (up to small perturbations) for a clustering. This method can be instantiated for a variety of clustering loss functions for which convex relaxations exist. Obtaining the guarantees in practice amounts to solving a convex optimization. We demonstrate the applicability of this method by obtaining distribution free guarantees for K-means clustering on realistic data sets.", "full_text": "How to tell when a clustering is (approximately)\n\ncorrect using convex relaxations\n\nMarina Meila\u21e4\n\nUniversity of Washington\n\nSeattle, WA 98195\n\nAbstract\n\nWe introduce the Sublevel Set (SS) method, a generic method to obtain suf\ufb01cient\nguarantees of near-optimality and uniqueness (up to small perturbations) for a\nclustering. This method can be instantiated for a variety of clustering loss functions\nfor which convex relaxations exist. Obtaining the guarantees in practice amounts to\nsolving a convex optimization. We demonstrate the applicability of this method by\nobtaining distribution free guarantees for K-means clustering on realistic data sets.\n\n1\n\nIntroduction\n\nThis paper proposes a framework for providing theoretical guarantees for clustering, without making\n(untestable) assumptions about the data generating process. The main question we address is: can a\nuser tell, with no prior knowledge, if the clustering C returned by a clustering algorithm is meaningful?\nThis is the fundamental problem of cluster validation. We build on the idea of [Mei06] who proposed\nthat: to be meaninful, a clustering C must be both \u201cgood\u201d and the only \u201cgood\u201d clustering of the data\nD, up to small perturbations. 
Such a clustering is called stable. Data that contains a stable clustering is said to be clusterable.

In this paper, we adopt the loss-based clustering framework, where for a given number of clusters K, the best clustering of the data is the one that minimizes a loss function Loss(D, C). This framework includes K-means, K-medians, graph partitioning, as well as model-based clustering (by letting the loss function be the (negative) data log-likelihood²). Consequently, a good clustering C has low Loss(D, C), in a sense that will be made precise later. Supposing that it is possible to find a good clustering, the challenge is to verify that C is stable without enumerating all the other clusterings. Hence this work will show how to obtain theorems like the following.

Stability Theorem (*). Given a clustering C of data set D, a function Loss(D, C), and technical conditions T, there is an ε such that d(C, C′) ≤ ε whenever Loss(D, C′) ≤ Loss(D, C).

If we can show this for a clustering C, it means that C captures structure existing in the data, and is therefore meaningful. It should also be evident that it is not possible to obtain such guarantees in general; they can only exist for clusterable data, as illustrated in Figure 1.

The rest of the paper describes a generic method for obtaining technical conditions T and stability theorems such as Stability Theorem (*) above for loss-based clustering (Section 2). We illustrate the working of this method for the K-means cost function (Sections 3 and 6), and present further instantiations in Section 4 and related work in Section 5. Section 7 concludes the paper.
While the idea of using stability as in (*) was introduced by [Mei06], who used spectral bounds, the contributions of this paper are (1) to greatly expand the scope of [Mei06] from spectral bounds to general tractable relaxations and to a much wider class of clustering problems, (2) to obtain new results in the case of the K-means loss by using the Semidefinite Programming (SDP) relaxations of K-means, and (3) to demonstrate that these are much tighter than the previous ones.

*www.stat.washington.edu/mmp
²We restrict attention to hard clusterings only.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Top, left: A data set that is not clusterable: the clustering shown is nearly optimal, but not stable. Bottom, left: A clusterable data set, described in the Experiments section 6 (n = 1024), with a stable and nearly optimal clustering C w.r.t. the K-means loss. The spherical clusters have σ = 0.9, the cluster centers are 4√2 ≈ 5.67 apart; the method described in Section 3 guarantees that all clusterings of this data set that are at least as "good" as C differ from it in at most 1.44% of the points. Right: ε and εSp for K = 4, n = 1024 and various σ values (over 10 replications). The values of εSp exceeding pmin are not valid BTRs.

2 Proving stability via convex relaxations

2.1 Preliminaries and definitions

Representing clusterings as matrices. Let D = {x1, . . . , xn} be the data to be clustered. We make no assumptions about the distribution of these data. A clustering C = {C1, . . . , CK} of the dataset D is a partition of the indices {1, 2, . . . , n} := [n] into K non-empty mutually disjoint subsets C1, . . . , CK, called clusters. Let nk = |Ck| for k ∈ [K]; then Σ_{k=1}^K nk = n, and we set pmin = min_{k∈[K]} nk/n, pmax = max_{k∈[K]} nk/n. We denote by C_K the space of all clusterings with K clusters.
A clustering can be represented by an n × n clustering matrix X defined as

Xij = 1/nk if i, j ∈ Ck for some k ∈ [K], and Xij = 0 otherwise;  X = [Xij]_{i,j=1}^n.   (1)

The following proposition lists the properties of the matrix X. Its proof, as well as all other proofs, can be found in the supplement.

Proposition 1. For any clustering C of n data points, the matrix X defined by (1) has elements Xij in [0, 1]; trace X = K; X1 = 1, where 1 = [1 . . . 1]^T; ||X||²_F = trace(X^T X) = K. Moreover, X ⪰ 0, i.e. X is a positive semidefinite (PSD) matrix (X is symmetric and x^T X x ≥ 0 for all vectors x).

To distinguish a clustering matrix X from other n × n symmetric matrices satisfying Proposition 1, we sometimes denote the former by X(C).

Measuring the distance between two clusterings. The earthmover's distance (also called the misclassification error distance) between two clusterings C, C′ over the same set of n points is

dEM(C, C′) = 1 − (1/n) max_{π∈S_K} Σ_{k=1}^K |Ck ∩ C′_{π(k)}|,   (2)

where π ranges over the set S_K of all permutations of K elements, and π(k) indexes a cluster in C′. This definition can be generalized to clusterings with different numbers of clusters and to unequally weighted data points.

2.2 Sublevel set problems and outline of the method

Losses and convex relaxations. A loss function Loss(D, C) (such as the K-means or K-medians loss) specifies what kind of clusters the user is interested in, via the optimization problem below.

Clustering problem:  Lopt = min_{C∈C_K} Loss(D, C), with solution Copt.   (3)

As most loss functions require a number of clusters K as input, we assume that K is fixed and given. In Section 7 we return to the issue of choosing K.
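Both objects above can be computed directly. Below is a minimal numpy/scipy sketch (the function names are our own, not from the paper) that builds the clustering matrix X(C) of (1) from a label vector, checks the properties listed in Proposition 1, and evaluates the earthmover's distance (2) by solving the maximization over permutations with the Hungarian method:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_matrix(labels):
    """X from (1): X_ij = 1/n_k if i, j are in the same cluster k, else 0."""
    labels = np.asarray(labels)
    # Z has orthonormal columns, one per cluster; X = Z Z^T.
    Z = np.stack([(labels == k) / np.sqrt((labels == k).sum())
                  for k in np.unique(labels)], axis=1)
    return Z @ Z.T

def d_EM(labels1, labels2):
    """Earthmover / misclassification-error distance (2), Hungarian method."""
    ks1, ks2 = np.unique(labels1), np.unique(labels2)
    overlap = np.array([[np.sum((labels1 == a) & (labels2 == b)) for b in ks2]
                        for a in ks1])
    row, col = linear_sum_assignment(-overlap)   # maximize total overlap
    return 1.0 - overlap[row, col].sum() / len(labels1)

labels = np.array([0, 0, 1, 1, 1, 2])
X, K = clustering_matrix(labels), 3
assert np.isclose(np.trace(X), K)                    # trace X = K
assert np.allclose(X.sum(axis=1), 1.0)               # X 1 = 1
assert np.isclose(np.linalg.norm(X, 'fro')**2, K)    # ||X||_F^2 = K
assert np.all(np.linalg.eigvalsh(X) > -1e-10)        # X is PSD
# The same partition with permuted labels is at distance 0.
assert d_EM(labels, np.array([2, 2, 0, 0, 0, 1])) == 0.0
```

The assertions mirror Proposition 1; the last line illustrates that dEM is invariant to relabeling the clusters.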
The majority of interesting loss functions lead to combinatorial optimization problems (3) known to be hard in the worst case.

Let X be a convex set in Euclidean space such that X ⊇ {X(C), C ∈ C_K}. If we extend Loss(D, C) to a function Loss(D, X) convex in X for all X ∈ X, then the problem

L* = min_{X∈X} Loss(D, X), with solution X*   (4)

is a convex relaxation of the problem (3). In the above, the representation X(C) can be the one defined in (1), or a different injective mapping of C_K into a Euclidean space. Because X ⊇ {X(C), C ∈ C_K}, we have L* ≤ Lopt, and X* is generally not a clustering matrix. On the other hand, finding X* is typically tractable, while finding Copt is not.

Convex relaxations for clustering have received considerable interest. For graph partitioning problems, [XJ03] introduced two SDP relaxations, [HS11] proposed non-convex but tight relaxations to a class of graph partitioning problems, while [RMH14] proposed a continuous relaxed balanced cut convex problem that depends on a submodular set function S. By judiciously choosing S, one obtains tight relaxations to various classes of graph cut problems. Correlation clustering, a graph clustering problem appearing in image analysis, has been given an SDP relaxation in [Swa04, AS16b]. For community detection under the Stochastic Block Model [HLL83], several SDP relaxations have recently been introduced [CX14, VOH14, JHDF16], as well as Sum-of-Squares relaxations for finding hidden cliques [DM15]. For centroid based clustering, we have LP relaxations for K-medians [CG99] and K-means [ACKS15] and more recent, tighter relaxations via SDPs in [ABC+14]. The SDP relaxations [ABC+14, IMPV15b] for K-means have guarantees under the specific generative model called the Stochastic Ball Model [IMPV15a]. For hierarchical clustering in the cost-based paradigm introduced by [Das16], we have LP relaxations introduced by [RP16, CG99, CC16].
We also mention the related area of convex clustering [BH07], where loss functions are designed to be convex. These can often be seen as relaxations of standard K-center or exemplar based losses.

The sublevel set method. Now we show how to use an existing relaxation to obtain guarantees of the form (*) for clustering. Given a Loss, its clustering problem (3), and a convex relaxation (4) for it, we proceed as follows:

Step 1 Use the convex relaxation to find a set of good clusterings that contains a given C. This set is X_{≤l} = {X ∈ X, Loss(D, X) ≤ l}, the sublevel set of Loss at the value l = Loss(D, C). This set is convex when Loss is convex in X.

Step 2 Show that if X_{≤l} has sufficiently small diameter, then all clusterings in it are contained in the ball {C′, dEM(C, C′) ≤ ε}. We will call this ε a Bound from Tractable Relaxation (BTR).

In more detail, consider a dataset D with a clustering C ∈ C_K. If for the given Loss a convex relaxation exists, let its feasible set be X, and let X(C) be the image of C in X. To accomplish Step 1, we modify (4) into the optimization problem below, which we call the Sublevel Set (SS) problem.

(SS)  ε₀ = max_{X′∈X} ||X(C) − X′||  s.t. Loss(D, X′) ≤ Loss(D, C).   (5)

In the above, the norm || · || can be chosen conveniently; in this paper it will be the Frobenius norm. The feasible set for (5) is X_{≤Loss(D,C)}, a convex set. The tractability of (5) depends on the objective ||X(C) − X′||, which in general is not concave. In the next sections it will be shown that the mapping X(C) in (1) always leads to tractable SS problems, and we will present other examples of such mappings. When SS is tractable, by solving it we obtain that ||X(C′) − X(C)|| ≤ ε₀ for all clusterings C′ with Loss(D, C′) ≤ Loss(D, X(C)). Thus, SS finds a ball centered at C that contains all the good clusterings.
Here "good" means "at least as good as C" w.r.t. Loss, but any sublevel set can be considered, even for levels lower than Loss(D, C). The radii of these sublevel sets tell us how clusterable the data is, in a norm that depends on the relaxation X.

||X(C′) − X(C)|| could be considered a distance between partitions, but this "distance" is less intuitive, and has the added disadvantage that it depends on the mapping X used. In Step 2, we transform the bound ε₀ into a bound for the earthmover's distance dEM. In the next sections we provide examples and sufficient conditions under which this is possible by existing methods.

3 BTR bounds for the K-means loss

The K-means clustering paradigm. In K-means clustering, the data points are vectors x_{1:n} ∈ R^d. The objective is to minimize the squared error loss, also known as the K-means loss,

Loss(D, C) = Σ_{k=1}^K Σ_{i∈Ck} ||xi − μk||²,  with μk = (1/nk) Σ_{i∈Ck} xi,  for k ∈ [K].   (6)

If one substitutes the expressions of the centers μ_{1:K} into Loss, one obtains a function of the matrix X and the squared distances matrix D,

Loss(D, C) ≡ Loss(D, X(C)) = (1/2)⟨D, X⟩,   (7)

where

D = [Dij]_{i,j∈[n]},  Dij = ||xi − xj||²,   (8)

and ⟨A, B⟩ denotes the Frobenius scalar product ⟨A, B⟩ := trace(A^T B). The norm ||x|| denotes the Euclidean norm of x. Finding the best clustering of a data set D is thus equivalent to solving the following optimization problem [ABC+14], which we will refer to as the K-means problem.

min_{C∈C_K} ⟨D, X(C)⟩   (9)

A SDP relaxation for the K-means problem. The K-means loss is hard to optimize in general, because of the presence of local minima. But, since K-means is one of the most widely used and well studied methods for finding groups in data, several tractable relaxations for the K-means problem have been developed.
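The identity (7) between the centroid form (6) of the K-means loss and the matrix form ⟨D, X⟩/2 can be checked numerically; a small self-contained sketch (variable names our own):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 5))               # n = 30 points in R^5
labels = rng.integers(0, 3, size=30)       # an arbitrary 3-clustering

# Loss(D, C) = sum_k sum_{i in C_k} ||x_i - mu_k||^2   (eq. 6)
loss_centroid = sum(((x[labels == k] - x[labels == k].mean(axis=0))**2).sum()
                    for k in np.unique(labels))

# Loss(D, X(C)) = (1/2) <D, X>  with D_ij = ||x_i - x_j||^2   (eqs. 7-8)
D = ((x[:, None, :] - x[None, :, :])**2).sum(axis=-1)
X = np.zeros((30, 30))
for k in np.unique(labels):
    idx = labels == k
    X[np.ix_(idx, idx)] = 1.0 / idx.sum()
loss_matrix = 0.5 * np.sum(D * X)

assert np.isclose(loss_centroid, loss_matrix)   # the two forms agree
```

This is the standard identity behind (7): within each cluster, the average of pairwise squared distances equals twice the sum of squared deviations from the centroid, divided by the cluster size.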
[DH04, DGK04] introduced a spectral relaxation, and [ABC+14] introduced two convex relaxations, one resulting in a Linear Program (LP), the other in a Semi-Definite Program (SDP). We present the latter here:

min_{X∈X} ⟨D, X⟩  s.t.  X = {X ⪰ 0, trace X = K, X1 = 1, Xij ≥ 0 for i, j ∈ [n]} ⊆ R^{n×n}.   (10)

By Proposition 1, X(C) ∈ X for all C ∈ C_K. In [ABC+14] it was shown that problem (10) is convex, and that it can be cast as a SDP. In general, the optimal solution X* of (10) may not be a clustering matrix. [ABC+14] showed that when data are sampled from well-separated discs, X* is a clustering matrix corresponding to the optimal clustering C* of the data (and C* assigns the points in each disc to a different cluster).

A BTR for K-means. In this section we explain how we obtain BTR bounds in the K-means clustering paradigm, by exploiting the relaxation (10). We shall assume that a data set D is given, and that the user has already found a clustering C of this data set (e.g. by running the K-means algorithm). The user would like to know: (a) is C optimal (in other words, is it the globally optimal solution to the non-convex problem (9))? and (b) could there be other clusterings of the data D that are very different from C but are similar or better w.r.t. Loss? As Figure 1 shows, the two questions are both important if the goal of the clustering is to capture the structure of the data (i.e. the clustering) instead of merely minimizing the clustering Loss. We introduce the following SDP, instantiating the generic SS problem (5):

(SSKm)  θ = min_{X′∈X} ⟨X(C), X′⟩  s.t. ⟨D, X′⟩ ≤ ⟨D, X(C)⟩   (11)

In the above, X(C) is the clustering matrix of the known clustering, X is defined in (10) and D is the squared distance matrix of D. If C is given, the number of clusters is implicitly given by K = |C|. Hence a user has all the information available for solving this SDP in practice.
Note that the value θ is not given, but is an output of the optimization algorithm solving (SSKm).

Let us now examine what the optimal solution X′ and the optimal value θ mean. First note that, because ||X||²_F = ||X′||²_F = K (by Proposition 1), ||X − X′||²_F = 2K − 2⟨X, X′⟩. Hence, the minimizer of ⟨X, X′⟩ is the same X′ that maximizes ||X − X′||_F.

In comparison with (10), (SSKm) adds an inequality constraint, thus restricting the feasible set of (10) to matrices X′ that have Loss no larger than the loss of C. Both X* and X(C) are feasible for (SSKm), but clusterings with higher Loss than C are not. Hence, (SSKm) finds, among the feasible matrices X′ which have low loss, the one that is furthest away from X(C). Typically X′ is not in C_K, and we are not interested in X′ itself, but in how far it is from X. As Theorem 2 below will show, the optimal value θ of (SSKm) determines this distance, measured in Frobenius norm, and θ ≤ K. Consequently, if K − θ is small, then ||X(C′) − X(C)||²_F ≤ 2(K − θ) is small for every good clustering C′. Therefore, our main result below states that when the value θ is near its maximum K, it controls the deviation from C of any other good clustering.

Theorem 2. Let D be represented by its squared distance matrix D, let C be a clustering of D, with K, pmin, pmax defined as in Section 2.1, and let θ(C) be the optimal value of problem (SSKm). Then, if ε = (K − θ(C)) pmax ≤ pmin, any clustering C′ with Loss(C′) ≤ Loss(C) is at distance dEM(C, C′) ≤ ε.

We summarize the validation procedure below.

Input: Data set with D ∈ R^{n×n} defined as in (8), clustering C with K clusters.
Preprocess: Calculate pmin, pmax, and the clustering matrix X(C).
1. Solve problem (SSKm) numerically (e.g. by calling a SDP solver); let θ be the optimal value obtained.
2. Set ε₀ = K − θ and ε = ε₀ pmax.
3.
If ε ≤ pmin, then Theorem 2 holds and ε is a BTR for C; else, no guarantees for C are obtained by this method.

Theorem 2 instantiates Stability Theorem (*). When the theorem's conditions hold, it provides a certificate of stability for C. It is also evident that this can only happen if C is stable, which moreover is predicated on the data being clusterable.

4 For what other clustering paradigms can we obtain BTRs?

Now we show that the framework of Section 2 can be readily applied to several other clustering paradigms with very little extra work.

Define the following injective mappings of C_K into sets of matrices. The X mapping is given in (1). The mapping X̃ ∈ R^{n×n} is given by X̃ij = 1 if i, j ∈ Ck for some k ∈ [K] and 0 otherwise. The mapping Z ∈ R^{n×K} is given by Zik = 1/√nk if i ∈ Ck, for k ∈ [K], and 0 otherwise.

Theorem 3. Let Loss define a clustering paradigm that has a convex relaxation in which clustering C is mapped to one of the matrices X, X̃, Z above. Then the following statements hold. (1) There exists a convex SS problem of the form θ = min_{X′∈X_{≤l}} ⟨X(C), X′⟩ (and similarly for X̃, Z). (2) From the optimal value θ a BTR ε can be obtained.

For the X mapping, ε = (K − θ) pmax; for the X̃ mapping, ε = (Σ_{k∈[K]} nk² + (n − K + 1)² + (K − 1)² − 2θ) / (2 pmin); for the Z mapping, ε = (K − θ²/2) pmax. In all three cases, ε is a BTR whenever ε ≤ pmin.

This theorem can also be extended to cover weighted representations such as those used for graph partitioning in [MSX05].
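The three mappings can be built directly from a label vector; a small numpy sketch (function names our own) checking the norm identities that enter the bounds: ||X||²_F = K, ||X̃||²_F = Σ_k nk², and that Z has orthonormal columns with X = ZZᵀ:

```python
import numpy as np

def maps(labels):
    """The three injective mappings of a clustering used in Theorem 3."""
    labels = np.asarray(labels)
    ks = np.unique(labels)
    # Z: n x K, Z_ik = 1/sqrt(n_k) iff i is in cluster k.
    Z = np.stack([(labels == k) / np.sqrt((labels == k).sum()) for k in ks],
                 axis=1)
    X = Z @ Z.T                                               # X_ij = 1/n_k on blocks
    Xt = (labels[:, None] == labels[None, :]).astype(float)   # 0/1 co-membership
    return X, Xt, Z

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])   # n_k = (3, 2, 4), K = 3
X, Xt, Z = maps(labels)
nk = np.array([3, 2, 4])
assert np.allclose(Z.T @ Z, np.eye(3))           # Z has orthonormal columns
assert np.allclose(X, Z @ Z.T)                   # X factors through Z
assert np.isclose((X**2).sum(), 3)               # ||X||_F^2 = K
assert np.isclose((Xt**2).sum(), (nk**2).sum())  # ||X~||_F^2 = sum_k n_k^2
```

These identities are what make the SS objectives ⟨X(C), X′⟩ linear, hence concave, over the respective relaxation sets.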
Theorem 3 shows that getting bounds for a clustering paradigm does not depend directly on the Loss or clustering paradigm, but on the space of the convex relaxation. Moreover, somebody who already uses one of the clustering relaxations cited above would have very little to do to also obtain bounds.

Of the previously mentioned relaxations, the X mapping is used by [PW07, IMPV15b] for K-means in SDP relaxations, by [RP16, CC16] for cost-based hierarchical clustering in an LP relaxation, and by [Swa04] for correlation clustering. The X̃ mapping is used by [CX14, JHDF16] for the Stochastic Block Model [HLL83], respectively by [VOH14] for the Degree-Corrected Stochastic Block Model [KN11]. The Z mapping is used by the spectral relaxations [DH04, Mei06] for K-means. Finally, note that the relaxations in [HS11, RMH14] are not covered by Theorem 3, and obtaining BTR bounds from them looks both challenging and promising.

5 Related work

Existing distribution free guarantees for clustering. All the previous explicit BTR bounds we are aware of are based on spectral relaxations: [Mei06] gives a spectral bound for K-means and [MSX05, WM15] give BTRs for graph partitioning. The work of [LGT14] relates the existence of a good r-way graph partitioning, for r between K and 3K, to a large eigengap of the graph normalized Laplacian. More precisely, if λ_{K+1}/λ_K > c(log K)²/9 then this partition is "better" than λ_{K/3} × c′ (for c, c′ unspecified). While these results are remarkable for their generality, they require an extremely large λ_{K+1}/λ_K to produce non-trivial bounds, no matter what c, c′ are; moreover, because λ_{K+1} ≤ 1, they also require λ_K ≪ 1.
In [PSZ15], a BTR bound for spectral clustering is given, which depends on unspecified constants.

Algorithmic results under clusterability assumptions. For finite mixtures, a series of remarkable results from the 2000s [Das00, AM05, VW04, DS07] and a few more recent ones [BMvL12, BWY14] established theoretical guarantees for recovery, together with tractable clustering algorithms. These papers are important because, for the first time, recovery is not tied to maximizing the likelihood, but to the separation of the cluster centers, and to the relative sizes and spreads of the clusters. Recovery guarantees have also been obtained in block models for network data, such as the Stochastic Block Model (SBM) [AS15, AS16a], the Degree-Corrected SBM (DC-SBM) [QR13] and the Preference Frame Model (PFM) [WM15].

Other researchers have shown that under resilience [BL09] or other clusterability assumptions on the data, one can find a (nearly) optimal clustering efficiently: [ACKS15, ABS12, ABV14, ABS10, BL09, CFR15, BL16, BHW16, BLG14] and [Ben15], which offers a critical survey of this area and underscores that the clusterability conditions are in general very restrictive. Similar results for graph clustering are given in e.g. [KVV00]. We have already mentioned the recent [ABC+14, IMPV15b]. Our results are complementary to the work in this area. This body of work provides very strong evidence that if D is clusterable, a good clustering C is easy to find. This corroborates a large body of empirical evidence, including our own experiments in Section 6. In future work we aim to close the loop by providing end-to-end algorithms that both cluster data efficiently and give BTRs for the resulting C. Second, our work grounds this area of research.
By the SS method one could hope to prove the assumptions that (some of) the aforementioned algorithms rely on, making them more relevant.

Other work in unsupervised learning. Recently, in [HM16] a PAC-like framework for unsupervised learning was proposed. Similarly to our paper, the framework of [HM16] argues for the need of a hypothesis class, of an assumption that the data fits the model class (i.e., the (k, ε)-decodability condition), and for the use of problem specific tractable relaxations as vehicles for both tractable algorithms and error bounds. The difference is that they concentrate on prediction, not on clustering structure. For instance, under the framework of [HM16] one could provide very good guarantees for K-means clustering of data that is not clusterable (such as Figure 1, top left).

6 Experimental evaluations for the K-means guarantees

We implemented SSKm via the SDP solver SDPNAL+ [ZST10, YST15]. We also implemented the spectral bound of [Mei06], the only other method offering BTR bounds. The main questions of interest were: (1) do our BTRs exist in realistic situations? (2) how tight are the bounds obtained? and (3) given that SDP solvers are computationally demanding, can this approach be applied to reasonably large data sets?

Figure 2: Separation statistics for the K = 4 data, n = 1024, all σ values. Left: histogram of min_k ||xi − μk|| / min_{k,k′} ||μk − μk′|| (i.e. the distance of a point to its center over the minimum center separation), colored by σ. Note that when the clusters are contained in equal non-intersecting balls this ratio is strictly smaller than 0.5. Right: boxplot of the distance to the second closest center over the distance to the own center, versus σ.

6.1 Synthetic data

We sampled data from a mixture of K = 4 normal distributions with equal spherical covariances σ²I_d, for d = 15 dimensions. The cluster sizes nk were approximately equal to ⌊n/K⌋.
The cluster means were at the corners of a regular tetrahedron, with center separation ||μk − μk′|| = 4√2 ≈ 5.67. The data was clustered by K-means with random initialization, then the bounds ε and εSp were computed.

In the experiments we also performed outlier removal, as follows. For each xi, we computed the sum of the distances to its pmin/2 nearest neighbors. We then removed the n0 data points with the largest values of this sum. For good measure, we first added 20 outliers, then removed n0 = 4%n respectively n0 = 2%n points, depending on whether n < 525 or n ≥ 525, before computing the bounds ε, εSp. Consequently, these bounds do not refer to all possible clusterings of the original D, but to the "cleaned" dataset. Note however that the outlier removal does not depend on the cluster labels; in fact, we perform it before clustering the data.

Figure 1 displays the bounds ε, εSp for data with K = 4, n = 1024, as well as one instance of the data used. The ε BTR is much tighter than the spectral bound εSp, and, surprisingly enough, holds even when the clusters "touch", i.e. when there is no region of low density between the clusters. Otherwise put, the distribution free BTR bounds hold even when the data are not contained in non-intersecting balls, which is the best known condition for clusterability under model assumptions [ABC+14, ABV14]. Figure 2 (left) shows that, when σ ≥ 0.8, the minimal spheres containing the clusters intersect; on the right we see that there are points which are almost equidistant from two cluster centers.

Next, we performed experiments with unequal cluster sizes p1:4 = 0.1, 0.2, 0.3, 0.4, and with non-gaussian clusters (details in the Supplement). We also performed experiments with K = 6 clusters, with p1:6 = 0.1, 0.18, 0.18, 0.18, 0.18, 0.18.
For K = 6 we placed the cluster centers along a line (see Figure 3 in the Supplement); this hurts the spectral bound, which depends on a stable (K − 1)-subspace, but does not hurt, and may even help, the SDP bound ε.

The results are shown in Table 1. The experiments reported in Table 1 are chosen to illustrate the limits of what is achievable by this SS method. In experiments with dispersion values σ smaller than 0.6, respectively 0.06, the bounds ε were very near 0. The table also shows that ε takes similar values in the case of equal and unequal clusters. However, in the latter case, the condition ε ≤ pmin/pmax is more stringent, hence some of the bounds obtained are not valid.

It is not unexpected that more dispersed clusters, i.e. large σ, or imbalanced cluster sizes, reflected in a small pmin, limit the range of the stability guarantees we can obtain. Noise equalizes the Loss landscape, increasing the instability. A bound must be tight enough to "preserve the smallest cluster",

K = 4     Unequal normal clusters                    Unequal non-normal clusters
σ         n = 200      n = 400      n = 800          n = 200        n = 400        n = 800
0.6       0.00 (0.00)  0.00 (0.00)  0.00 (0.00)      0.001 (0.001)  0.001 (0.000)  0.002 (0.007)
0.8       0.01 (0.01)  0.01 (0.01)  0.01 (0.01)      0.006 (0.004)  0.004 (0.002)  0.007 (0.003)
1.0       0.09 (0.05)  0.06 (0.01)  0.07 (0.02)      0.04 (0.02)    0.03 (0.01)    0.03 (0.01)
1.2       0.28 (0.08)  0.21 (0.05)  0.21 (0.03)      0.16 (0.06)    0.14 (0.03)    0.13 (0.03)

K = 6     normal         non-normal
σ         n = 525        n = 525
0.06      0.00 (0.00)    0.005 (0.001)
0.08      0.01 (0.00)    0.006 (0.001)
0.1       0.01 (0.00)    0.009 (0.003)

Table 1: BTR bound ε for K = 4 (top), respectively K = 6 (bottom), clusters of unequal sizes (mean and standard deviation over 10 replications). The values in gray are not valid, owing to the fact that εpmax > pmin in these cases. Bounds for smaller σ values were almost all 0.
Bottom, right: the K = 6, non-normal data for σ = 0.1.

or the clustering is unstable. We see the use of our method in tandem with (existing or future) instability detection methods, such as resampling.

Over all experiments, we have found that the BTR ε is virtually insensitive to the value of n, and degrades slowly when pmin decreases. The main limitation to obtaining a BTR is the requirement that ε ≤ pmin/pmax; this requirement is based (see the proof of Theorem 2 in the Supplement) on the relationships from [Mei12] between ||X(C) − X(C′)|| and dEM(C, C′). These are not tight, meaning that the regime in which provable guarantees exist is even larger. From a practical perspective, it is likely that the ε values marked in gray are valid BTR bounds even though we cannot prove it at this time.

6.2 Real data: configurations of the aspirin (C9H8O4) molecule

These n = 2118 samples (see Figure 4 in the Supplement) were obtained via Molecular Dynamics (MD) simulation at T = 500 Kelvin by [CTS+17] and represent the 3D positions of the 21 atoms of aspirin. It was discovered recently that aspirin's potential energy surface has two energy wells, so we cluster these data with K = 2, after having removed n0 = 5%n = 106 outliers. The clusters found have relative sizes pmin = .26, pmax = .74, and the BTR bound is ε = .065, an informative bound. However, this took over 10 hours; that encouraged us to try the following heuristic: instead of removing outliers, we removed the 60% of the data points closest to their centers. The motivation was that the difficulty of the SDP depends on the cluster boundaries and not on the easy points. The run time was reduced to 42 minutes and the bound obtained, ε = 0.047, is comparable with the original one.
While this speed-up method is ad-hoc, we are confident that it can be made rigorous in the future, opening up the SS method to larger data sets.

7 Conclusion

We have introduced a generic method for obtaining distribution free, worst case guarantees for a variety of clustering algorithms. The method exploits the vast amount of existing work on convex relaxations for clustering; as more results and tighter relaxations appear in this area, the SS method will be able to take advantage of them. For the case of K-means clustering, we have shown empirically that the bounds obtained apply to realistic cases, far surpassing the existing results. It is extremely rare in machine learning to have worst case bounds that are relevant (VC bounds are typically above 1, when they can be computed). However, when the relaxations used by the SS method are tight, we obtain bounds that are not only informative, they are near 0 in non-trivial situations.

The SS method depends only on observed and computable quantities, does not contain undefined constants and does not make any assumptions about the data. However, connections with probabilistic models of the data are possible, and we plan to explore this avenue in future work. BTRs exist only when the data is clusterable. Currently we cannot show that all the clusterable cases can be given guarantees; this depends on the tightness of the relaxation.

The SDP relaxation for K-means has been instrumental in obtaining tight guarantees. In many practical situations, the computational demands of the SDP solver are justified by the guarantees offered. We believe that expanding the method to larger data, beyond the scope of generic SDP solvers, is possible by exploiting the special structure of the SS problem.

Throughout the paper, we have assumed that K is fixed. In practice, K is not known, and it is chosen after clusterings with K = 1, 2, . . . , Kmax have been obtained.
Our SS method could replace the (more or less ad-hoc) methods for selecting K with the following: for each K, try to find a BTR ε(K); if successful, the respective clustering and its K are selected. It is of course possible to select more than one K, but only if the data indeed supports both clusterings. Thus, indirectly, the BTR can provide a theoretically sound method of selecting K.

Acknowledgments

The author gratefully acknowledges Maryam Fazel for her early interest in this work, for reading a previous version of this paper and for many discussions; the Simons Institute for the Theory of Computing, where part of this research was performed; and partial support from the NSF DMS PD 08-1269 and NSF IIS-0313339 awards.

References

[ABC+14] P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, and R. Ward. Relax, no need to round: integrality of clustering formulations. ArXiv e-prints, August 2014.

[ABS10] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. ArXiv, 1009.3594, 2010.

[ABS12] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Inf. Process. Lett., 112(1-2):49–54, 2012.

[ABV14] Pranjal Awasthi, Maria-Florina Balcan, and Konstantin Voevodski. Local algorithms for interactive clustering. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 550–558, 2014.

[ACKS15] Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of euclidean k-means. In Lars Arge and János Pach, editors, 31st International Symposium on Computational Geometry, SoCG 2015, June 22-25, 2015, Eindhoven, The Netherlands, volume 34 of LIPIcs, pages 754–767. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2015.

[AM05] Dimitris Achlioptas and Frank McSherry.
On spectral learning of mixtures of distributions. In Peter Auer and Ron Meir, editors, 18th Annual Conference on Learning Theory, COLT 2005, pages 458–471, Berlin/Heidelberg, 2005. Springer.

[AS15] Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms. arXiv preprint arXiv:1503.00609, 2015.

[AS16a] Emmanuel Abbe and Colin Sandon. Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1334–1342. Curran Associates, Inc., 2016.

[AS16b] Sara Ahmadian and Chaitanya Swamy. Approximation algorithms for clustering problems with lower bounds and outliers. In Ioannis Chatzigiannakis, Michael Mitzenmacher, Yuval Rabani, and Davide Sangiorgi, editors, 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, volume 55 of LIPIcs, pages 69:1–69:15. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016.

[Ben15] Shai Ben-David. Computational feasibility of clustering under clusterability assumptions. Technical Report arXiv:1501.00437, 2015.

[BH07] Francis R. Bach and Zaïd Harchaoui. DIFFRAC: a discriminative and flexible framework for clustering. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 49–56. Curran Associates, Inc., 2007.

[BHW16] Maria-Florina Balcan, Nika Haghtalab, and Colin White. k-center clustering under perturbation resilience. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 68:1–68:14, 2016.

[BL09] Yonatan Bilu and Nathan Linial. Are stable instances easy?
CoRR, abs/0906.3162, 2009.

[BL16] Maria-Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. SIAM J. Comput., 45(1):102–155, 2016.

[BLG14] Maria-Florina Balcan, Yingyu Liang, and Pramod Gupta. Robust hierarchical clustering. Journal of Machine Learning Research, 15(1):3831–3871, 2014.

[BMvL12] Sebastien Bubeck, Marina Meilă, and Ulrike von Luxburg. How the initialization affects the stability of the k-means algorithm. ESAIM: Probability and Statistics, 16:436–452, 2012.

[BWY14] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. CoRR, abs/1408.2156, 2014.

[CC16] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. Technical Report arXiv:1609.09548, 2016.

[CFR15] Sam Cole, Shmuel Friedland, and Lev Reyzin. A simple spectral algorithm for recovering planted partitions. CoRR, abs/1503.00423, 2015.

[CG99] M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-median problems. In 40th Annual Symposium on Foundations of Computer Science, pages 378–388, 1999.

[CTS+17] Stefan Chmiela, Alexandre Tkatchenko, Huziel E. Sauceda, Igor Poltavsky, Kristof T. Schütt, and Klaus-Robert Müller. Machine learning of accurate energy-conserving molecular force fields. Science Advances, 3(5):e1603015, 2017.

[CX14] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv preprint arXiv:1402.1267, 2014.

[Das00] Sanjoy Dasgupta. Experiments with random projection. In UAI '00: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 143–151, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[Das16] Sanjoy Dasgupta.
A cost function for similarity-based hierarchical clustering. In Daniel Wichs and Yishay Mansour, editors, Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 118–127. ACM, 2016.

[DGK04] Inderjit S. Dhillon, Y. Guan, and Brian Kulis. Kernel K-means, spectral clustering and normalized cuts. In Ronny Kohavi, Johannes Gehrke, and Joydeep Ghosh, editors, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 551–556. ACM Press, 2004.

[DH04] Chris Ding and Xiaofeng He. K-means clustering via principal component analysis. In Carla E. Brodley, editor, Proceedings of the International Machine Learning Conference (ICML). Morgan Kaufmann, 2004.

[DM15] Y. Deshpande and A. Montanari. Improved sum-of-squares lower bounds for hidden clique and hidden submatrix problems. ArXiv e-prints, February 2015.

[DS07] Sanjoy Dasgupta and Leonard Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8:203–226, Feb 2007.

[HLL83] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[HM16] Elad Hazan and Tengyu Ma. A non-generative framework and convex relaxations for unsupervised learning. CoRR, abs/1610.01132, 2016.

[HS11] Matthias Hein and Simon Setzer. Beyond spectral clustering - tight relaxations of balanced graph cuts. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2366–2374. Curran Associates, Inc., 2011.

[IMPV15a] T. Iguchi, D. G. Mixon, J. Peterson, and S. Villar. On the tightness of an SDP relaxation of k-means. ArXiv e-prints, May 2015.

[IMPV15b] T. Iguchi, D. G. Mixon, J. Peterson, and S. Villar.
Probably certifiably correct k-means clustering. ArXiv e-prints, September 2015.

[JHDF16] A. Jalali, Q. Han, I. Dumitriu, and M. Fazel. Relative density and exact recovery in heterogeneous stochastic block models. In Proc. of NIPS 2016, December 2016.

[KN11] B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Physical Review, 83:16107, 2011.

[KVV00] Ravi Kannan, Santosh Vempala, and Adrian Vetta. On clusterings: good, bad and spectral. In Proc. of 41st Symposium on the Foundations of Computer Science, FOCS 2000, 2000.

[LGT14] James R. Lee, Shayan Oveis Gharan, and Luca Trevisan. Multi-way spectral partitioning and higher-order Cheeger inequalities. arXiv preprint arXiv:1111.1055, 2014.

[Mei06] Marina Meilă. The uniqueness of a good optimum for K-means. In Andrew Moore and William Cohen, editors, Proceedings of the International Machine Learning Conference (ICML), pages 625–632. International Machine Learning Society, 2006.

[Mei12] Marina Meilă. Local equivalence of distances between clusterings – a geometric perspective. Machine Learning, 86(3):369–389, 2012.

[MSX05] Marina Meilă, Susan Shortreed, and Liang Xu. Regularized spectral learning. In Robert Cowell and Zoubin Ghahramani, editors, Proceedings of the Artificial Intelligence and Statistics Workshop (AISTATS 05), 2005.

[PSZ15] Richard Peng, He Sun, and Luca Zanetti. Partitioning well-clustered graphs: Spectral clustering works! In Peter Grünwald and Elad Hazan, editors, Proceedings of The 28th Conference on Learning Theory (COLT), volume 40, pages 1–33, 2015.

[PW07] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 2007.

[QR13] Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel.
In Advances in Neural Information Processing Systems, 2013.

[RMH14] Syama Sundar Rangapuram, Pramod Kaushik Mudrakarta, and Matthias Hein. Tight continuous relaxation of the balanced k-cut problem. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3131–3139. Curran Associates, Inc., 2014.

[RP16] Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. In Isabelle Guyon and Ulrike von Luxburg, editors, Advances in Neural Information Processing Systems (NIPS), 2016.

[Swa04] Chaitanya Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In J. Ian Munro, editor, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 526–527. SIAM, 2004.

[VOH14] Ramya Korlakai Vinayak, Samet Oymak, and Babak Hassibi. Graph clustering with missing data: Convex algorithms and analysis. In Advances in Neural Information Processing Systems (NIPS), pages 2996–3004, 2014.

[VW04] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixtures of distributions. Journal of Computer and System Sciences, 68(4):841–860, 2004.

[WM15] Yali Wan and Marina Meila. A class of network models recoverable by spectral clustering. In Daniel Lee and Masashi Sugiyama, editors, Advances in Neural Information Processing Systems (NIPS), 2015.

[XJ03] Eric P. Xing and Michael I. Jordan. On semidefinite relaxation for normalized k-cut and connections to spectral clustering. Technical Report UCB/CSD-03-1265, EECS Department, University of California, Berkeley, Jun 2003.

[YST15] L. Q. Yang, D. F. Sun, and K. C. Toh. SDPNAL+: a majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Mathematical Programming Computation, 7:331–366, 2015.
arXiv:1406.0942.

[ZST10] Xinyuan Zhao, Defeng Sun, and Kim-Chuan Toh. A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. Optimization, 20:1737–1765, 2010.