{"title": "A Constant-Factor Bi-Criteria Approximation Guarantee for k-means++", "book": "Advances in Neural Information Processing Systems", "page_first": 604, "page_last": 612, "abstract": "This paper studies the $k$-means++ algorithm for clustering as well as the class of $D^\\ell$ sampling algorithms to which $k$-means++ belongs.  It is shown that for any constant factor $\\beta > 1$, selecting $\\beta k$ cluster centers by $D^\\ell$ sampling yields a constant-factor approximation to the optimal clustering with $k$ centers, in expectation and without conditions on the dataset.  This result extends the previously known $O(\\log k)$ guarantee for the case $\\beta = 1$ to the constant-factor bi-criteria regime.  It also improves upon an existing constant-factor bi-criteria result that holds only with constant probability.", "full_text": "A Constant-Factor Bi-Criteria Approximation\n\nGuarantee for k-means++\n\nDennis Wei\nIBM Research\n\nYorktown Heights, NY 10598, USA\n\ndwei@us.ibm.com\n\nAbstract\n\nThis paper studies the k-means++ algorithm for clustering as well as the class of D(cid:96)\nsampling algorithms to which k-means++ belongs. It is shown that for any constant\nfactor \u03b2 > 1, selecting \u03b2k cluster centers by D(cid:96) sampling yields a constant-factor\napproximation to the optimal clustering with k centers, in expectation and without\nconditions on the dataset. This result extends the previously known O(log k)\nguarantee for the case \u03b2 = 1 to the constant-factor bi-criteria regime. It also\nimproves upon an existing constant-factor bi-criteria result that holds only with\nconstant probability.\n\n1\n\nIntroduction\n\nThe k-means problem and its variants constitute one of the most popular paradigms for clustering\n[15]. 
Given a set of n data points, the task is to group them into k clusters, each defined by a cluster\ncenter, such that the sum of distances from points to cluster centers (raised to a power $\\ell$) is minimized.\nOptimal clustering in this sense is known to be NP-hard [11, 3, 20, 6]. In practice, the most widely\nused algorithm remains Lloyd\u2019s [19] (often referred to as the k-means algorithm), which alternates\nbetween updating centers given cluster assignments and re-assigning points to clusters.\nIn this paper, we study an enhancement to Lloyd\u2019s algorithm known as k-means++ [4] and the more\ngeneral class of $D^\\ell$ sampling algorithms to which k-means++ belongs. These algorithms select\ncluster centers randomly from the given data points with probabilities proportional to their current\ncosts. The clustering can then be refined using Lloyd\u2019s algorithm. $D^\\ell$ sampling is attractive for\ntwo reasons: First, it is guaranteed to yield an expected O(log k) approximation to the optimal\nclustering with k centers [4]. Second, it is as simple as Lloyd\u2019s algorithm, both conceptually and\ncomputationally, with O(nkd) running time in d dimensions.\nThe particular focus of this paper is on the setting where an optimal k-clustering remains the\nbenchmark but more than k cluster centers can be sampled to improve the approximation. Specifically,\nit is shown that for any constant factor \u03b2 > 1, if \u03b2k centers are chosen by $D^\\ell$ sampling, then a\nconstant-factor approximation to the optimal k-clustering is obtained. 
This guarantee holds in\nexpectation and for all datasets, like the one in [4], and improves upon the O(log k) factor therein.\nSuch a result is known as a constant-factor bi-criteria approximation since both the optimal cost and\nthe relevant degrees of freedom (k in this case) are exceeded but only by constant factors.\nIn the context of clustering, bi-criteria approximation guarantees can be valuable because an appropriate\nnumber of clusters k is almost never known or pre-specified in practice. Approaches to\ndetermining k from the data are ideally based on knowing how the optimal cost decreases as k\nincreases, but obtaining this optimal trade-off between cost and k is NP-hard as mentioned earlier.\nAlternatively, a simpler algorithm (like k-means++) that has a constant-factor bi-criteria guarantee\nwould ensure that the trade-off curve generated by this algorithm deviates by no more than constant\nfactors along both axes from the optimal curve. This may be more appealing than a deviation along\nthe cost axis that grows as O(log k). Furthermore, if a solution with a specified number of clusters k\nis truly required, then linear programming techniques can be used to select a k-subset from the \u03b2k\ncluster centers while still maintaining a constant-factor approximation [1, 8].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe next section reviews existing work on $D^\\ell$ sampling and other clustering approximations. Section 2\nformally states the problem, the $D^\\ell$ sampling algorithm, and existing lemmas regarding the algorithm.\nSection 3 states the main results and compares them to previous results. 
Proofs are presented in\nSection 4 with more algebraic proofs deferred to the supplementary material.\n\n1.1 Related Work\n\nApproximation algorithms for k-means ($\\ell = 2$), k-medians ($\\ell = 1$), and related problems span a\nwide range in the trade-off between tighter approximation factors and lower algorithm complexity.\nAt one end, while exact algorithms [14] and polynomial-time approximation schemes (PTAS)\n(see [22, 18, 9, 12, 13, 10] and references therein) may have polynomial running times in n, the\ndependence on k and/or the dimension d is exponential or worse. Simpler local search [17, 5] and\nlinear programming [8, 16] algorithms offer constant-factor approximations but still with high-order\npolynomial running times in n, and some rely on dense discretizations of size $O(n \\epsilon^{-d} \\log(1/\\epsilon))$.\nIn contrast to the above, this paper focuses on highly practical algorithms in the $D^\\ell$ sampling class,\nincluding k-means++. As mentioned, it was proved in [4] that $D^\\ell$ sampling results in an O(log k)\napproximation, in expectation and for all datasets. The current work extends this guarantee to the\nconstant-factor bi-criteria regime, also for all datasets. The authors of [4] also provided a matching\nlower bound, exhibiting a dataset on which k-means++ achieves an expected \u2126(log k) approximation.\nImproved O(1) approximation factors have been shown for sampling algorithms like k-means++\nprovided that the dataset satisfies certain conditions. 
Such results were established in [24] for k-means++\nand other variants of Lloyd\u2019s algorithm under the condition that the dataset is well-suited\nin a sense to partitioning into k clusters, and for an algorithm called successive sampling [23] with\n$O(n(k + \\log n) + k^2 \\log^2 n)$ running time subject to a bound on the dispersion of the points.\nIn a similar direction to the one pursued in the present work, [1] showed that if the number of cluster\ncenters is increased to a constant factor times k, then k-means++ can achieve a constant-factor\napproximation, albeit only with constant probability. An O(1) factor was also obtained independently\nby [2] using more centers, of order O(k log k). It is important to note that the constant-probability\nresult of [1] in no way implies the main results herein, which are true in expectation and are therefore\nstronger guarantees. Furthermore, Section 3.1 shows that a constant-probability corollary of Theorem 1\nimproves upon [1] by more than a factor of 2.\nRecently, [21, 7] have also established constant-factor bi-criteria results for the k-means problem.\nThese works differ from the present paper in studying more complex local search and linear programming\nalgorithms applied to large discretizations, of size $n^{O(\\log(1/\\epsilon)/\\epsilon^2)}$ (a high-order polynomial)\nin [21] and $O(n \\epsilon^{-d} \\log(1/\\epsilon))$ in [7], the latter the same as in [17]. Moreover, [7] employs search\nneighborhoods that are also of exponential size in d (requiring doubly exponential running time).\n\n2 Preliminaries\n\n2.1 Problem Definition\nWe are given n points x1, . . . , xn in a real metric space X with metric D(x, y). The objective is to\nchoose t cluster centers c1, . . . , ct in X and assign points to the nearest cluster center to minimize the\npotential function\n\n$$\\phi = \\sum_{i=1}^{n} \\min_{j=1,\\dots,t} D(x_i, c_j)^\\ell. \\tag{1}$$\n\nA cluster is thus defined by the points xi assigned to a center cj, where ties (multiple closest centers)\nare broken arbitrarily. 
For a subset of points S, define $\\phi(S) = \\sum_{x_i \\in S} \\min_{j=1,\\dots,t} D(x_i, c_j)^\\ell$ to be\nthe contribution to the potential from S; \u03c6(xi) is the contribution from a single point xi.\nThe exponent $\\ell \\geq 1$ in (1) is regarded as a problem parameter. Letting $\\ell = 2$ and D be Euclidean\ndistance, we have what is usually known as the k-means problem, so-called because the optimal\ncluster centers are means of the points assigned to them. The choice $\\ell = 1$ is also popular and\ncorresponds to the k-medians problem.\nThroughout this paper, an optimal clustering will always refer to one that minimizes (1) over solutions\nwith t = k clusters, where k \u2265 2 is given. Likewise, the term optimal cluster and symbol A will refer\nto one of the k clusters from this optimal solution. The goal is to approximate the potential \u03c6\u2217 of this\noptimal k-clustering using t = \u03b2k cluster centers for \u03b2 \u2265 1.\n\n2.2 $D^\\ell$ Sampling Algorithm\n\nAlgorithm 1 $D^\\ell$ Sampling\nInput: Data points x1, . . . , xn, number of clusters t. Initialize \u03c6(xi) = 1 for i = 1, . . . , n.\nfor j = 1 to t do\n    Select jth center cj = xi with probability \u03c6(xi)/\u03c6.\n    Update \u03c6(xi) for i = 1, . . . , n.\n\nThe $D^\\ell$ sampling algorithm chooses cluster centers randomly from x1, . . . , xn with probabilities\nproportional to their current contributions to the potential, as detailed in Algorithm 1. Following [4],\nthe case $\\ell = 2$ is referred to as the k-means++ algorithm and the non-uniform probabilities used after\nthe first iteration are referred to as $D^2$ weighting (hence $D^\\ell$ in general). 
For t cluster centers, the\nrunning time of $D^\\ell$ sampling is O(ntd) in d dimensions.\nIn practice, Algorithm 1 is used as an initialization to Lloyd\u2019s algorithm, which usually produces\nfurther decreases in the potential. The analysis herein pertains only to Algorithm 1 and not to the\nsubsequent improvement due to Lloyd\u2019s algorithm.\n\n2.3 Existing Lemmas Regarding $D^\\ell$ Sampling\n\nThe following lemmas synthesize useful results from [4] that bound the expected potential within a\nsingle optimal cluster due to selecting a center from that cluster with uniform or $D^\\ell$ weighting.\nLemma 1. [4, Lemmas 3.1 and 5.1] Given an optimal cluster A, let \u03c6 be the potential resulting from\nselecting a first cluster center randomly from A with uniform weighting. Then $E[\\phi(A)] \\leq r_u^{(\\ell)} \\phi^*(A)$\nfor any A, where\n\n$$r_u^{(\\ell)} = \\begin{cases} 2, & \\ell = 2 \\text{ and } D \\text{ is Euclidean}, \\\\ 2^\\ell, & \\text{otherwise}. \\end{cases}$$\n\nLemma 2. [4, Lemma 3.2] Given an optimal cluster A and an initial potential \u03c6, let \u03c6\u2032 be the\npotential resulting from adding a cluster center selected randomly from A with $D^\\ell$ weighting. Then\n$E[\\phi'(A)] \\leq r_D^{(\\ell)} \\phi^*(A)$ for any A, where $r_D^{(\\ell)} = 2^\\ell r_u^{(\\ell)}$.\n\nThe factor of $2^\\ell$ between $r_u^{(\\ell)}$ and $r_D^{(\\ell)}$ for general $\\ell$ is explained just before Theorem 5.1 in [4].\n\n3 Main Results\n\nThe main results of this paper are stated below in terms of the single-cluster approximation ratio $r_D^{(\\ell)}$\ndefined by Lemma 2. Subsequently in Section 3.1, the results are discussed in the context of previous\nwork.\nTheorem 1. Let \u03c6 be the potential resulting from selecting \u03b2k cluster centers according to Algorithm 1,\nwhere \u03b2 \u2265 1. 
The expected approximation ratio is then bounded as\n\n$$\\frac{E[\\phi]}{\\phi^*} \\leq r_D^{(\\ell)} \\left( 1 + \\min\\left\\{ \\frac{\\varphi(k-2)}{(\\beta-1)k + \\varphi}, H_{k-1} \\right\\} \\right) - \\Theta\\left( \\frac{1}{n} \\right),$$\n\nwhere $\\varphi = (1 + \\sqrt{5})/2 \\doteq 1.618$ is the golden ratio and $H_k = 1 + \\frac{1}{2} + \\cdots + \\frac{1}{k} \\sim \\log k$ is the kth\nharmonic number.\n\nIn the proof of Theorem 1 in Section 4.2, it is shown that the 1/n term is indeed non-positive and can\ntherefore be omitted, with negligible loss for large n.\n\nThe approximation ratio bound in Theorem 1 is stated as a function of k. The following corollary\nconfirms that the theorem also implies a constant-factor bi-criteria approximation.\nCorollary 1. With the same definitions as in Theorem 1, the expected approximation ratio is bounded\nas\n\n$$\\frac{E[\\phi]}{\\phi^*} \\leq r_D^{(\\ell)} \\left( 1 + \\frac{\\varphi}{\\beta - 1} \\right).$$\n\nProof. The minimum in Theorem 1 is bounded by its first term. This term is in turn increasing in k\nwith asymptote $\\varphi/(\\beta - 1)$, which can therefore be taken as a k-independent bound.\n\nIt follows from Corollary 1 that a constant \u201coversampling\u201d ratio \u03b2 > 1 leads to a constant-factor\napproximation. Theorem 1 offers a further refinement for finite k.\nThe bounds in Theorem 1 and Corollary 1 consist of two factors. As \u03b2 increases, the second,\nparenthesized factor decreases to 1 either exactly or approximately as 1/(\u03b2 \u2212 1). The first factor\nof $r_D^{(\\ell)}$ however is no smaller than 4, and is a direct consequence of Lemma 2. 
Any future work on\nimproving Lemma 2 would therefore strengthen the approximation factors above.\n\n3.1 Comparisons to Existing Results\n\nA comparison of Theorem 1 to results in [4] is implicit in its statement since the $H_{k-1}$ term in the\nminimum comes directly from [4, Theorems 3.1 and 5.1]. For k = 2, 3, the first term in the minimum\nis smaller than $H_{k-1}$ for any \u03b2 \u2265 1, and hence Theorem 1 is always an improvement. For k > 3,\nTheorem 1 improves upon [4] for \u03b2 greater than the critical value\n\n$$\\beta_c = 1 + \\frac{\\varphi(k - 2 - H_{k-1})}{k H_{k-1}}.$$\n\nNumerical evaluation of $\\beta_c$ shows that it reaches a maximum value of 1.204 at k = 22 and then\ndecreases back toward 1 roughly as $1/H_{k-1}$. It can be concluded that for any k, at most 20%\noversampling is required for Theorem 1 to guarantee a better approximation than [4].\nThe most closely related result to Theorem 1 and Corollary 1 is found in [1, Theorem 1]. The latter\nestablishes a constant-factor bi-criteria approximation that holds only with constant probability, as\nopposed to in expectation. Since a bound on the expectation implies a bound with constant probability\nvia Markov\u2019s inequality (but not the other way around), a direct comparison with [1] is possible.\nSpecifically, for $\\ell = 2$ and the $t = \\lceil 16(k + \\sqrt{k}) \\rceil$ cluster centers assumed in [1], Theorem 1 in the\npresent work implies that\n\n$$\\frac{E[\\phi]}{\\phi^*} \\leq 8 \\left( 1 + \\min\\left\\{ \\frac{\\varphi(k-2)}{\\lceil 15k + 16\\sqrt{k} \\rceil + \\varphi}, H_{k-1} \\right\\} \\right) \\leq 8 \\left( 1 + \\frac{\\varphi}{15} \\right),$$\n\nafter taking k \u2192 \u221e. Then by Markov\u2019s inequality,\n\n$$\\frac{\\phi}{\\phi^*} \\leq \\frac{8}{0.97} \\left( 1 + \\frac{\\varphi}{15} \\right) = 9.137$$\n\nwith probability at least 1 \u2212 0.97 = 0.03 as in [1]. 
This 9.137 approximation factor is less than half\nthe factor of 20 in [1].\nCorollary 1 may also be compared to the results in [21], which are obtained through more complex\nalgorithms applied to a large discretization, of size $n^{O(\\log(1/\\epsilon)/\\epsilon^2)}$ for reasonably small $\\epsilon$. The main\ndifference between Corollary 1 and the bounds in [21] is the extra factor of $r_D^{(\\ell)}$. As discussed above,\nthis factor is due to Lemma 2 and is unlikely to be intrinsic to the $D^\\ell$ sampling algorithm.\n\n4 Proofs\n\nThe overall strategy used to prove Theorem 1 is similar to that in [4]. The key intermediate result is\nLemma 3 below, which relates the potential at a later iteration in Algorithm 1 to the potential at an\nearlier iteration. Section 4.1 is devoted to proving Lemma 3. Subsequently in Section 4.2, Theorem 1\nis proven by an application of Lemma 3.\nIn the sequel, we say that an optimal cluster A is covered by a set of cluster centers if at least one of\nthe centers lies in A. Otherwise A is uncovered. Also define $\\rho = r_D^{(\\ell)} \\phi^*$ as an abbreviation.\nLemma 3. For an initial set of centers leaving u optimal clusters uncovered, let \u03c6 denote the\npotential, U the union of uncovered clusters, and V the union of covered clusters. Let \u03c6\u2032 denote\nthe potential resulting from adding t \u2265 u centers, each selected randomly with $D^\\ell$ weighting as in\nAlgorithm 1. Then the new potential is bounded in expectation as\n\n$$E[\\phi' \\mid \\phi] \\leq c_V(t, u) \\phi(V) + c_U(t, u) \\rho(U)$$\n\nfor coefficients $c_V(t, u)$ and $c_U(t, u)$ that depend only on t, u. This holds in particular for\n\n$$c_V(t, u) = \\frac{t + au + b}{t - u + b} = 1 + \\frac{(a + 1)u}{t - u + b}, \\tag{2a}$$\n\n$$c_U(t, u) = \\begin{cases} c_V(t - 1, u - 1), & u > 0, \\\\ 0, & u = 0, \\end{cases} \\tag{2b}$$\n\nwhere the parameters a and b satisfy a + 1 \u2265 b > 0 and ab \u2265 1. 
The choice of a, b that minimizes\n$c_V(t, u)$ in (2a) is $a + 1 = b = \\varphi$.\n\n4.1 Proof of Lemma 3\n\nLemma 3 is proven using induction, showing that if it holds for (t, u) and (t, u + 1), then it also\nholds for (t + 1, u + 1), similar to the proof of [4, Lemma 3.3]. The proof is organized into three\nparts. Section 4.1.1 provides base cases. In Section 4.1.2, sufficient conditions on the coefficients\n$c_V(t, u)$, $c_U(t, u)$ are derived that allow the inductive step to be completed. In Section 4.1.3, it is\nshown that the closed-form expressions in (2) are consistent with the base cases in Section 4.1.1 and\nsatisfy the sufficient conditions from Section 4.1.2, thus completing the proof.\n\n4.1.1 Base cases\n\nThis subsection exhibits two base cases of Lemma 3. The first case corresponds to u = 0, for which\nwe have \u03c6(V) = \u03c6. Since adding centers cannot increase the potential, i.e. \u03c6\u2032 \u2264 \u03c6 deterministically,\nLemma 3 holds with\n\n$$c_V(t, 0) = 1, \\quad c_U(t, 0) = 0, \\quad t \\geq 0. \\tag{3}$$\n\nThe second base case occurs for t = u, u \u2265 1. For this purpose, a slightly strengthened version of [4,\nLemma 3.3] is used, as given next.\nLemma 4. With the same definitions as in Lemma 3 except with t \u2264 u, we have\n\n$$E[\\phi' \\mid \\phi] \\leq (1 + H_t) \\phi(V) + (1 + H_{t-1}) \\rho(U) + \\frac{u - t}{u} \\phi(U),$$\n\nwhere we define $H_0 = 0$ and $H_{-1} = -1$ for convenience.\nThe improvement is in the coefficient in front of \u03c1(U), from $(1 + H_t)$ to $(1 + H_{t-1})$. The proof\nfollows that of [4, Lemma 3.3] with some differences and is deferred to the supplementary material.\nSpecializing to the case t = u, Lemma 4 coincides with Lemma 3 with coefficients\n\n$$c_V(u, u) = 1 + H_u, \\quad c_U(u, u) = 1 + H_{u-1}. \\tag{4}$$\n\n4.1.2 Sufficient conditions on coefficients\n\nWe now assume inductively that Lemma 3 holds for (t, u) and (t, u + 1). 
The induction to the case\n(t + 1, u + 1) is then completed under the following sufficient conditions on the coefficients:\n\n$$c_V(t, u + 1) \\geq 1, \\tag{5a}$$\n\n$$\\left( c_V(t, u + 1) - c_U(t, u + 1) \\right) c_V(t, u)^2 \\geq \\left( c_U(t, u + 1) - c_V(t, u) \\right)^2, \\tag{5b}$$\n\nand\n\n$$c_V(t + 1, u + 1) \\geq \\frac{1}{2} \\left[ c_V(t, u) + \\left( c_V(t, u)^2 + 4 \\max\\{ c_V(t, u + 1) - c_V(t, u), 0 \\} \\right)^{1/2} \\right], \\tag{6a}$$\n\n$$c_U(t + 1, u + 1) \\geq c_V(t, u). \\tag{6b}$$\n\nThe first pair of conditions (5) applies to the coefficients involved in the inductive hypothesis for (t, u)\nand (t, u + 1). The second pair (6) can be seen as a recursive specification of the new coefficients\nfor (t + 1, u + 1). This inductive step together with base cases (3) and (4) is sufficient to extend\nLemma 3 to all t > u, starting with (t, u) = (1, 0) and (t, u + 1) = (1, 1).\nThe inductive step is broken down into a series of three lemmas, each building upon the last. The first\nlemma applies the inductive hypothesis to derive a bound on the potential that depends not only on\n\u03c6(V) and \u03c1(U) but also on \u03c6(U).\nLemma 5. Assume that Lemma 3 holds for (t, u) and (t, u + 1). Then for the case (t + 1, u + 1),\ni.e. \u03c6 corresponding to u + 1 uncovered clusters and \u03c6\u2032 resulting after adding t + 1 centers,\n\n$$E[\\phi' \\mid \\phi] \\leq \\min\\left\\{ \\frac{c_V(t, u) \\phi(U) + c_V(t, u + 1) \\phi(V)}{\\phi(U) + \\phi(V)} \\phi(V) + \\frac{c_V(t, u) \\phi(U) + c_U(t, u + 1) \\phi(V)}{\\phi(U) + \\phi(V)} \\rho(U), \\; \\phi(U) + \\phi(V) \\right\\}.$$\n\nProof. We consider the two cases in which the first of the t + 1 new centers is chosen from either the\ncovered set V or the uncovered set U. Denote by \u03c61 the potential after adding the first new center.\nCovered case: This case occurs with probability \u03c6(V)/\u03c6 and leaves the covered and uncovered sets\nunchanged. 
We then invoke Lemma 3 with (t, u + 1) (one fewer center to add) and \u03c61 playing the\nrole of \u03c6. The contribution to E[\u03c6\u2032 | \u03c6] from this case is then bounded by\n\n$$\\frac{\\phi(V)}{\\phi} \\left( c_V(t, u + 1) \\phi_1(V) + c_U(t, u + 1) \\rho(U) \\right) \\leq \\frac{\\phi(V)}{\\phi} \\left( c_V(t, u + 1) \\phi(V) + c_U(t, u + 1) \\rho(U) \\right), \\tag{7}$$\n\nnoting that \u03c61(S) \u2264 \u03c6(S) for any set S.\nUncovered case: We consider each uncovered cluster A \u2286 U separately. With probability \u03c6(A)/\u03c6,\nthe first new center is selected from A, moving A from the uncovered to the covered set and reducing\nthe number of uncovered clusters by one. Applying Lemma 3 for (t, u), the contribution to E[\u03c6\u2032 | \u03c6]\nis bounded by\n\n$$\\frac{\\phi(A)}{\\phi} \\left[ c_V(t, u) \\left( \\phi_1(V) + \\phi_1(A) \\right) + c_U(t, u) \\left( \\rho(U) - \\rho(A) \\right) \\right].$$\n\nTaking the expectation with respect to possible centers in A and using Lemma 2 and \u03c61(V) \u2264 \u03c6(V),\nwe obtain the further bound\n\n$$\\frac{\\phi(A)}{\\phi} \\left[ c_V(t, u) \\left( \\phi(V) + \\rho(A) \\right) + c_U(t, u) \\left( \\rho(U) - \\rho(A) \\right) \\right].$$\n\nSumming over A \u2286 U yields\n\n$$\\frac{\\phi(U)}{\\phi} \\left( c_V(t, u) \\phi(V) + c_U(t, u) \\rho(U) \\right) + \\frac{c_V(t, u) - c_U(t, u)}{\\phi} \\sum_{A \\subseteq U} \\phi(A) \\rho(A) \\leq \\frac{\\phi(U)}{\\phi} c_V(t, u) \\left( \\phi(V) + \\rho(U) \\right), \\tag{8}$$\n\nusing the inner product bound $\\sum_{A \\subseteq U} \\phi(A) \\rho(A) \\leq \\phi(U) \\rho(U)$.\n\nThe result follows from summing (7) and (8) and combining with the trivial bound E[\u03c6\u2032 | \u03c6] \u2264 \u03c6 =\n\u03c6(U) + \u03c6(V).\nThe bound in Lemma 5 depends on \u03c6(U), the potential over uncovered clusters, which can be\narbitrarily large or small. In the next lemma, \u03c6(U) is eliminated by maximizing with respect to it.\nLemma 6. Assume that Lemma 3 holds for (t, u) and (t, u + 1) with $c_V(t, u + 1) \\geq 1$. 
Then for the\ncase (t + 1, u + 1) in the sense of Lemma 5,\n\n$$E[\\phi' \\mid \\phi] \\leq \\frac{1}{2} c_V(t, u) \\left( \\phi(V) + \\rho(U) \\right) + \\frac{1}{2} \\max\\left\\{ c_V(t, u) \\left( \\phi(V) + \\rho(U) \\right), \\sqrt{Q} \\right\\},$$\n\nwhere\n\n$$Q = \\left( c_V(t, u)^2 - 4 c_V(t, u) + 4 c_V(t, u + 1) \\right) \\phi(V)^2 + 2 \\left( c_V(t, u)^2 - 2 c_V(t, u) + 2 c_U(t, u + 1) \\right) \\phi(V) \\rho(U) + c_V(t, u)^2 \\rho(U)^2.$$\n\nProof. Let B1(\u03c6(U)) and B2(\u03c6(U)) denote the two terms inside the minimum in Lemma 5 (i.e.\nB2(\u03c6(U)) = \u03c6(U) + \u03c6(V)). The derivative of B1(\u03c6(U)) with respect to \u03c6(U) is given by\n\n$$B_1'(\\phi(U)) = \\frac{\\phi(V)}{(\\phi(U) + \\phi(V))^2} \\left[ \\left( c_V(t, u) - c_V(t, u + 1) \\right) \\phi(V) + \\left( c_V(t, u) - c_U(t, u + 1) \\right) \\rho(U) \\right],$$\n\nwhich does not change sign as a function of \u03c6(U). The two cases $B_1'(\\phi(U)) \\geq 0$ and $B_1'(\\phi(U)) < 0$\nare considered separately below. Taking the maximum of the resulting bounds (9), (10) establishes\nthe lemma.\nCase $B_1'(\\phi(U)) \\geq 0$: Both B1(\u03c6(U)) and B2(\u03c6(U)) are non-decreasing functions of \u03c6(U). The\nformer has the finite supremum\n\n$$c_V(t, u) \\left( \\phi(V) + \\rho(U) \\right), \\tag{9}$$\n\nwhereas the latter increases without bound. Therefore B1(\u03c6(U)) eventually becomes the smaller of\nthe two and (9) can be taken as an upper bound on min{B1(\u03c6(U)), B2(\u03c6(U))}.\nCase $B_1'(\\phi(U)) < 0$: At \u03c6(U) = 0, we have $B_1(0) = c_V(t, u + 1) \\phi(V) + c_U(t, u + 1) \\rho(U)$ and\nB2(0) = \u03c6(V). The assumption $c_V(t, u + 1) \\geq 1$ implies that B1(0) \u2265 B2(0). Since B1(\u03c6(U))\nis now a decreasing function, the two functions must intersect and the point of intersection then\nprovides an upper bound on min{B1(\u03c6(U)), B2(\u03c6(U))}. The supplementary material provides some\nalgebraic details on solving for the intersection. 
The resulting bound is\n\n$$\\frac{1}{2} c_V(t, u) \\left( \\phi(V) + \\rho(U) \\right) + \\frac{1}{2} \\sqrt{Q}. \\tag{10}$$\n\nThe bound in Lemma 6 is a nonlinear function of \u03c6(V) and \u03c1(U), in contrast to the desired form in\nLemma 3. The next step is to linearize the bound by imposing additional conditions (5).\nLemma 7. Assume that Lemma 3 holds for (t, u) and (t, u + 1) with coefficients satisfying (5). Then\nfor the case (t + 1, u + 1) in the sense of Lemma 5,\n\n$$E[\\phi' \\mid \\phi] \\leq \\frac{1}{2} \\left[ c_V(t, u) + \\left( c_V(t, u)^2 + 4 \\max\\{ c_V(t, u + 1) - c_V(t, u), 0 \\} \\right)^{1/2} \\right] \\phi(V) + c_V(t, u) \\rho(U).$$\n\nProof. It suffices to linearize the $\\sqrt{Q}$ term in Lemma 6, specifically by showing that\n$Q \\leq (a \\phi(V) + b \\rho(U))^2$ for all \u03c6(V), \u03c1(U) with $a = \\left[ c_V(t, u)^2 + 4 (c_V(t, u + 1) - c_V(t, u)) \\right]^{1/2}$ and $b = c_V(t, u)$.\nProof of this inequality is provided in the supplementary material. Incorporating the inequality into\nLemma 6 proves the result.\n\nGiven conditions (5) and Lemma 7, the inductive step for Lemma 3 can be completed by defining\n$c_V(t + 1, u + 1)$ and $c_U(t + 1, u + 1)$ recursively as in (6).\n\n4.1.3 Proof with specific form for coefficients\nWe now prove that Lemma 3 holds for coefficients $c_V(t, u)$, $c_U(t, u)$ given by (2) with a + 1 \u2265 b > 0\nand ab \u2265 1. Given the inductive approach and the results established in Sections 4.1.1 and 4.1.2,\nthe proof requires the remaining steps below. First, it is shown that the base cases (3), (4) from\nSection 4.1.1 imply that Lemma 3 is true for the same base cases but with $c_V(t, u)$, $c_U(t, u)$ given\nby (2) instead. Second, (2) is shown to satisfy conditions (5) for all t > u, thus permitting Lemma\n7 to be used. 
Third, (2) is also shown to satisfy (6), which combined with Lemma 7 completes the\ninduction.\nConsidering the base cases, for u = 0, (3) and (2) coincide so there is nothing to prove. For the case\nt = u, u \u2265 1, Lemma 3 with coefficients given by (4) implies the same with coefficients given by (2)\nprovided that\n\n$$(1 + H_u) \\phi(V) + (1 + H_{u-1}) \\rho(U) \\leq \\left( 1 + \\frac{(a + 1)u}{b} \\right) \\phi(V) + \\left( 1 + \\frac{(a + 1)(u - 1)}{b} \\right) \\rho(U)$$\n\nfor all \u03c6(V), \u03c1(U). This in turn is ensured if the coefficients satisfy $H_u \\leq (a + 1)u/b$ for all u \u2265 1.\nThe most stringent case is u = 1 and corresponds to the assumption a + 1 \u2265 b.\nFor the second step of establishing (5), it is clear that (5a) is satisfied by (2a). A direct calculation\npresented in the supplementary material shows that (5b) is also true.\nLemma 8. Condition (5b) is satisfied for all t > u if $c_V(t, u)$, $c_U(t, u)$ are given by (2) and ab \u2265 1.\nSimilarly for the third step, it suffices to show that (2a) satisfies recursion (6a) since (2b) automatically\nsatisfies (6b). A proof is provided in the supplementary material.\nLemma 9. Recursion (6a) is satisfied for all t > u if $c_V(t, u)$ is given by (2a) and ab \u2265 1.\nLastly, we minimize $c_V(t, u)$ in (2a) with respect to a, b, subject to a + 1 \u2265 b > 0 and ab \u2265 1. For\nfixed a, minimizing with respect to b yields b = a + 1 and $c_V(t, u) = 1 + ((a + 1)u)/(t - u + a + 1)$.\nMinimizing with respect to a then results in setting ab = a(a + 1) = 1. The solution satisfying\na + 1 > 0 is $a = \\varphi - 1$ and $b = \\varphi$.\n\n4.2 Proof of Theorem 1\nDenote by nA the number of points in optimal cluster A. In the first iteration of Algorithm 1, the first\ncluster center is selected from some A with probability nA/n. 
Conditioned on this event, Lemma 3\nis applied with covered set V = A, u = k \u2212 1 uncovered clusters, and t = \u03b2k \u2212 1 remaining cluster\ncenters. This bounds the final potential \u03c6\u2032 as\n\n$$E[\\phi' \\mid \\phi] \\leq c_V(\\beta k - 1, k - 1) \\phi(A) + c_U(\\beta k - 1, k - 1) (\\rho - \\rho(A)),$$\n\nwhere $c_V(t, u)$, $c_U(t, u)$ are given by (2) with $a + 1 = b = \\varphi$. Taking the expectation over possible\ncenters in A and using Lemma 1,\n\n$$E[\\phi' \\mid A] \\leq r_u^{(\\ell)} c_V(\\beta k - 1, k - 1) \\phi^*(A) + c_U(\\beta k - 1, k - 1) (\\rho - \\rho(A)).$$\n\nTaking the expectation over clusters A and recalling that $\\rho = r_D^{(\\ell)} \\phi^*$,\n\n$$E[\\phi'] \\leq r_D^{(\\ell)} c_U(\\beta k - 1, k - 1) \\phi^* - C \\sum_A \\frac{n_A}{n} \\phi^*(A), \\tag{11}$$\n\nwhere $C = r_D^{(\\ell)} c_U(\\beta k - 1, k - 1) - r_u^{(\\ell)} c_V(\\beta k - 1, k - 1)$. Using (2) and $r_D^{(\\ell)} = 2^\\ell r_u^{(\\ell)}$ from Lemma 2,\n\n$$C = r_u^{(\\ell)} \\frac{2^\\ell \\left( (\\beta - 1)k + \\varphi(k - 1) \\right) - (\\beta - 1 + \\varphi)k}{(\\beta - 1)k + \\varphi} = r_u^{(\\ell)} \\frac{(2^\\ell - 1)(\\beta - 1)k + \\varphi \\left( (2^\\ell - 1)(k - 1) - 1 \\right)}{(\\beta - 1)k + \\varphi}.$$\n\nThe last expression for C is seen to be non-negative for \u03b2 \u2265 1, k \u2265 2, and $\\ell \\geq 1$. 
Furthermore, since\nnA = 1 (a singleton cluster) implies that \u03c6\u2217(A) = 0, we have\n\n$$\\sum_A n_A \\phi^*(A) = \\sum_{A : n_A \\geq 2} n_A \\phi^*(A) \\geq 2 \\phi^*. \\tag{12}$$\n\nSubstituting (2) and (12) into (11), we obtain\n\n$$\\frac{E[\\phi']}{\\phi^*} \\leq r_D^{(\\ell)} \\left( 1 + \\frac{\\varphi(k - 2)}{(\\beta - 1)k + \\varphi} \\right) - \\frac{2C}{n}. \\tag{13}$$\n\nThe last step is to recall [4, Theorems 3.1 and 5.1], which together state that\n\n$$\\frac{E[\\phi']}{\\phi^*} \\leq r_D^{(\\ell)} (1 + H_{k-1}) \\tag{14}$$\n\nfor \u03c6\u2032 resulting from selecting exactly k cluster centers. In fact, (14) also holds for \u03b2k centers, \u03b2 \u2265 1,\nsince adding centers cannot increase the potential. The proof is completed by taking the minimum of\n(13) and (14).\n\nReferences\n[1] A. Aggarwal, A. Deshpande, and R. Kannan. Adaptive sampling for k-means clustering. In Proc. 12th Int.\nWorkshop and 13th Int. Workshop on Approximation, Randomization, and Combinatorial Optimization.\nAlgorithms and Techniques, pages 15\u201328, August 2009.\n\n[2] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In Adv. Neural Information\nProcessing Systems 22, pages 10\u201318, December 2009.\n\n[3] D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering.\nMach. Learn., 75(2):245\u2013248, May 2009.\n\n[4] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proc. 18th ACM-SIAM\nSymp. Discrete Algorithms, pages 1027\u20131035, January 2007.\n\n[5] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for\nk-median and facility location problems. SIAM J. Comput., 33(3):544\u2013562, March 2004.\n\n[6] P. Awasthi, M. Charikar, R. Krishnaswamy, and A. K. Sinop. 
The hardness of approximation of Euclidean\n\nk-means. In Proc. 31st Int. Symp. Computational Geometry, pages 754\u2013767, June 2015.\n\n[7] S. Bandyapadhyay and K. Varadarajan. On variants of k-means clustering. Technical Report\n\narXiv:1512.02985, December 2015.\n\n[8] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the\n\nk-median problem. J. Comput. Syst. Sci., 65(1):129\u2013149, August 2002.\n\n[9] K. Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their\n\napplications. SIAM J. Comput., 39(3):923\u2013947, September 2009.\n\n[10] V. Cohen-Addad, P. N. Klein, and C. Mathieu. Local search yields approximation schemes for k-means\n\nand k-median in Euclidean and minor-free metrics. Technical Report arXiv:1603.09535, March 2016.\n\n[11] S. Dasgupta. The hardness of k-means clustering. Technical Report CS2008-0916, Department of\n\nComputer Science and Engineering, University of California, San Diego, 2008.\n\n[12] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In\n\nProc. 23rd Int. Symp. Computational Geometry, pages 11\u201318, June 2007.\n\n[13] Z. Friggstad, M. Rezapour, and M. R. Salavatipour. Local search yields a PTAS for k-means in doubling\n\nmetrics. Technical Report arXiv:1603.08976, March 2016.\n\n[14] M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to\n\nvariance-based k-clustering. In Proc. 10th Int. Symp. Computational Geometry, pages 332\u2013339, 1994.\n\n[15] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recogn. Lett., 31(8):651\u2013666, June 2010.\n\n[16] K. Jain and V. V. Vazirani. Approximation algorithms for metric facility location and k-median problems\n\nusing the primal-dual schema and Lagrangian relaxation. J. ACM, 48(2):274\u2013296, March 2001.\n\n[17] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. 
A local search\n\napproximation algorithm for k-means clustering. Comput. Geom., 28(2\u20133):89\u2013112, June 2004.\n\n[18] A. Kumar, Y. Sabharwal, and S. Sen. Linear-time approximation schemes for clustering problems in any\n\ndimensions. J. ACM, 57(2):5:1\u20135:32, January 2010.\n\n[19] S. Lloyd. Least squares quantization in PCM. Technical report, Bell Laboratories, 1957.\n\n[20] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. In Proc. 3rd\n\nInt. Workshop Algorithms and Computation, pages 274\u2013285, February 2009.\n\n[21] K. Makarychev, Y. Makarychev, M. Sviridenko, and J. Ward. A bi-criteria approximation algorithm for k\n\nmeans. Technical Report arXiv:1507.04227, August 2015.\n\n[22] J. Matou\u0161ek. On approximate geometric k-clustering. Discrete & Comput. Geom., 24(1):61\u201384, January\n\n2000.\n\n[23] R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Mach. Learn., 56(1\u2013\n\n3):35\u201360, June 2004.\n\n[24] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the\n\nk-means problem. J. ACM, 59(6):28, December 2012.\n\n9\n\n\f", "award": [], "sourceid": 337, "authors": [{"given_name": "Dennis", "family_name": "Wei", "institution": "IBM Research"}]}