{"title": "Fast and Provably Good Seedings for k-Means", "book": "Advances in Neural Information Processing Systems", "page_first": 55, "page_last": 63, "abstract": "Seeding - the task of finding initial cluster centers - is critical in obtaining high-quality clusterings for k-Means. However, k-means++ seeding, the state of the art algorithm, does not scale well to massive datasets as it is inherently sequential and requires k full passes through the data. It was recently shown that Markov chain Monte Carlo sampling can be used to efficiently approximate the seeding step of k-means++. However, this result requires assumptions on the data generating distribution. We propose a simple yet fast seeding algorithm that produces *provably* good clusterings even *without assumptions* on the data. Our analysis shows that the algorithm allows for a favourable trade-off between solution quality and computational cost, speeding up k-means++ seeding by up to several orders of magnitude. We validate our theoretical results in extensive experiments on a variety of real-world data sets.", "full_text": "Fast and Provably Good Seedings for k-Means\n\nOlivier Bachem\n\nETH Zurich\n\nolivier.bachem@inf.ethz.ch\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nMario Lucic\n\nETH Zurich\n\nlucic@inf.ethz.ch\n\nS. Hamed Hassani\n\nDepartment of Computer Science\n\nETH Zurich\n\nhamed@inf.ethz.ch\n\nAndreas Krause\n\nDepartment of Computer Science\n\nETH Zurich\n\nkrausea@ethz.ch\n\nAbstract\n\nSeeding \u2013 the task of \ufb01nding initial cluster centers \u2013 is critical in obtaining high-\nquality clusterings for k-Means. However, k-means++ seeding, the state of the\nart algorithm, does not scale well to massive datasets as it is inherently sequential\nand requires k full passes through the data. It was recently shown that Markov\nchain Monte Carlo sampling can be used to ef\ufb01ciently approximate the seeding\nstep of k-means++. 
However, this result requires assumptions on the data gener-\nating distribution. We propose a simple yet fast seeding algorithm that produces\nprovably good clusterings even without assumptions on the data. Our analysis\nshows that the algorithm allows for a favourable trade-off between solution quality\nand computational cost, speeding up k-means++ seeding by up to several orders\nof magnitude. We validate our theoretical results in extensive experiments on a\nvariety of real-world data sets.\n\n1\n\nIntroduction\n\nk-means++ (Arthur & Vassilvitskii, 2007) is one of the most widely used methods to solve k-Means\nclustering. The algorithm is simple and consists of two steps: In the seeding step, initial cluster\ncenters are found using an adaptive sampling scheme called D2-sampling. In the second step, this\nsolution is re\ufb01ned using Lloyd\u2019s algorithm (Lloyd, 1982), the classic iterative algorithm for k-Means.\nThe key advantages of k-means++ are its strong empirical performance, theoretical guarantees on\nthe solution quality, and ease of use. Arthur & Vassilvitskii (2007) show that k-means++ produces\nclusterings that are in expectation O(log k)-competitive with the optimal solution without any\nassumptions on the data. Furthermore, this theoretical guarantee already holds after the seeding\nstep. The subsequent use of Lloyd\u2019s algorithm to re\ufb01ne the solution only guarantees that the solution\nquality does not deteriorate and that it converges to a locally optimal solution in \ufb01nite time. In\ncontrast, using naive seeding such as selecting data points uniformly at random followed by Lloyd\u2019s\nalgorithm can produce solutions that are arbitrarily bad compared to the optimal solution.\nThe drawback of k-means++ is that it does not scale easily to massive data sets since both its\nseeding step and every iteration of Lloyd\u2019s algorithm require the computation of all pairwise distances\nbetween cluster centers and data points. 
Lloyd\u2019s algorithm can be parallelized in the MapReduce\nframework (Zhao et al., 2009) or even replaced by fast stochastic optimization techniques such as\nonline or mini-batch k-Means (Bottou & Bengio, 1994; Sculley, 2010). However, the seeding step\nrequires k inherently sequential passes through the data, making it impractical even for moderate k.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fThis highlights the need for a fast and scalable seeding algorithm. Ideally, it should also retain the\ntheoretical guarantees of k-means++ and provide equally competitive clusterings in practice. Such\nan approach was presented by Bachem et al. (2016) who propose to approximate k-means++ using a\nMarkov chain Monte Carlo (MCMC) approach and provide a fast seeding algorithm. Under natural\nassumptions on the data generating distribution, the authors show that the computational complexity\nof k-means++ can be greatly decreased while retaining the same O(log k) guarantee on the solution\nquality. The drawback of this approach is that these assumptions may not hold and that checking\ntheir validity is expensive (see detailed discussion in Section 3).\nOur contributions. The goal of this paper is to provide fast and competitive seedings for k-Means\nclustering without prior assumptions on the data. As our key contributions, we\n(1) propose a simple yet fast seeding algorithm for k-Means,\n(2) show that it produces provably good clusterings without assumptions on the data,\n(3) provide stronger theoretical guarantees under assumptions on the data generating distribution,\n(4) extend the algorithm to arbitrary distance metrics and various divergence measures,\n(5) compare the algorithm to previous results, both theoretically and empirically, and\n(6) demonstrate its effectiveness on several real-world data sets.\n\n2 Background and related work\n\nWe will start by formalizing the problem and reviewing several recent results. 
Let X denote a set of n points in R^d. For any finite set C ⊂ R^d and x ∈ X, we define\n\nd(x, C)² = min_{c ∈ C} ‖x − c‖_2².\n\nThe objective of k-Means clustering is to find a set C of k cluster centers in R^d such that the quantization error φ_C(X) is minimized, where\n\nφ_C(X) = Σ_{x ∈ X} d(x, C)².\n\nWe denote the optimal quantization error with k centers by φ_OPT^k(X), the mean of X by μ(X), and the variance of X by Var(X) = Σ_{x ∈ X} d(x, μ(X))². We note that φ_OPT^1(X) = Var(X).\nD2-sampling. Given a set of centers C, the D2-sampling strategy, as the name suggests, is to sample each point x ∈ X with probability proportional to the squared distance to the selected centers,\n\np(x | C) = d(x, C)² / Σ_{x′ ∈ X} d(x′, C)².   (1)\n\nThe seeding step of k-means++ builds upon D2-sampling: It first samples an initial center uniformly at random. Then, k − 1 additional centers are sequentially added to the previously sampled centers using D2-sampling. The resulting computational complexity is Θ(nkd), as for each x ∈ X the distance d(x, C)² in (1) needs to be updated whenever a center is added to C.\nMetropolis-Hastings. The Metropolis-Hastings algorithm (Hastings, 1970) is an MCMC method for sampling from a probability distribution p(x) whose density is known only up to constants. Consider the following variant that uses an independent proposal distribution q(x) to build a Markov chain: Start with an arbitrary initial state x_1 and in each iteration j ∈ [2, ..., m] sample a candidate y_j using q(x). Then, either accept this candidate (i.e., x_j = y_j) with probability\n\nπ(x_{j−1}, y_j) = min( (p(y_j) / p(x_{j−1})) · (q(x_{j−1}) / q(y_j)), 1 )   (2)\n\nor reject it otherwise (i.e., x_j = x_{j−1}). The stationary distribution of this Markov chain is p(x). Hence, for m sufficiently large, the distribution of x_m is approximately p(x).\nApproximation using MCMC (K-MC2). Bachem et al. 
(2016) propose to speed up k-means++ by replacing the exact D2-sampling in (1) with a fast approximation based on MCMC sampling. In each iteration j ∈ [2, 3, ..., k], one constructs a Markov chain of length m using the Metropolis-Hastings algorithm with an independent and uniform proposal distribution q(x) = 1/n. The key advantage is that the acceptance probability in (2) only depends on d(y_j, C)² and d(x_{j−1}, C)² since\n\nmin( (p(y_j)/p(x_{j−1})) · (q(x_{j−1})/q(y_j)), 1 ) = min( d(y_j, C)² / d(x_{j−1}, C)², 1 ).\n\nCritically, in each of the k − 1 iterations, the algorithm does not require a full pass through the data, but only needs to compute the distances between m points and up to k − 1 centers. As a consequence, the complexity of K-MC2 is O(mk²d) compared to O(nkd) for k-means++ seeding.\nTo bound the quality of the solutions produced by K-MC2, Bachem et al. (2016) analyze the mixing time of the described Markov chains. To this end, the authors define the two data-dependent quantities:\n\nα(X) = max_{x ∈ X} d(x, μ(X))² / Σ_{x′ ∈ X} d(x′, μ(X))²,  and  β(X) = φ_OPT^1(X) / φ_OPT^k(X).   (3)\n\nIn order to bound each term, the authors assume that the data is generated i.i.d. from a distribution F and impose two conditions on F. First, they assume that F exhibits exponential tails and prove that in this case α(X) ∈ O(log² n) with high probability. Second, they assume that “F is approximately uniform on a hypersphere”. This in turn implies that β(X) ∈ O(k) with high probability. Under these assumptions, the authors prove that the solution generated by K-MC2 is in expectation O(log k)-competitive with the optimal solution if m ∈ Θ(k log² n log k). In this case, the total computational complexity of K-MC2 is O(k³d log² n log k), which is sublinear in the number of data points.\n\nOther related work. A survey on seeding methods for k-Means was provided by Celebi et al. (2013). 
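The exact D2-sampling seeding step of k-means++ described above is easy to state in code. The following is a minimal NumPy sketch for illustration; the function and variable names are ours, not taken from any released implementation:

```python
import numpy as np

def d2_seeding(X, k, rng):
    """k-means++ seeding: sample k centers from X by D^2-sampling, eq. (1)."""
    n = X.shape[0]
    # First center: chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    # Squared distance of every point to its closest chosen center.
    dist_sq = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        # p(x | C) proportional to d(x, C)^2, eq. (1).
        probs = dist_sq / dist_sq.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        # Incrementally update d(x, C)^2 with the newly added center.
        dist_sq = np.minimum(dist_sq, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)

rng = np.random.default_rng(0)
# Toy data: three well-separated 2-d clusters.
X = np.concatenate([rng.normal(c, 0.1, size=(50, 2)) for c in (-10.0, 0.0, 10.0)])
C = d2_seeding(X, k=3, rng=rng)
```

The incremental update of `dist_sq` is exactly why each of the k sequential passes costs O(nd), for Θ(nkd) in total.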
D2-sampling and k-means++ have been extensively studied in the literature. Previous work was primarily focused on related algorithms (Arthur & Vassilvitskii, 2007; Ostrovsky et al., 2006; Jaiswal et al., 2014, 2015), its theoretical properties (Ailon et al., 2009; Aggarwal et al., 2009) and bad instances (Arthur & Vassilvitskii, 2007; Brunsch & Röglin, 2011). As such, these results are complementary to the ones presented in this paper.\nAn alternative approach to scalable seeding was investigated by Bahmani et al. (2012). The authors propose the k-means|| algorithm that retains the same O(log k) guarantee in expectation as k-means++. k-means|| reduces the number of sequential passes through the data to O(log n) by oversampling cluster centers in each of the rounds. While this allows one to parallelize each of the O(log n) rounds, it also increases the total computational complexity from O(nkd) to O(nkd log n). This method is feasible if substantial computational resources are available in the form of a cluster. Our approach, on the other hand, has an orthogonal use case: It aims to efficiently approximate k-means++ seeding with a substantially lower complexity.\n\n3 Assumption-free K-MC2\n\nBuilding on the MCMC strategy introduced by Bachem et al. (2016), we propose an algorithm which addresses the drawbacks of the K-MC2 algorithm, namely:\n(1) The theoretical results of K-MC2 hold only if the data is drawn independently from a distribution satisfying the assumptions stated in Section 2. For example, the results do not extend to heavy-tailed distributions which are often observed in real-world data.\n(2) Verifying the assumptions, which in turn imply the required chain length, is computationally hard and potentially more expensive than running the algorithm. In fact, calculating α(X) already requires two full passes through the data, while computing β(X) is NP-hard.\n(3) Theorem 2 of Bachem et al. 
(2016) does not characterize the tradeoff between m and the expected solution quality: It is only valid for the specific choice of chain length m = Θ(k log² n log k). As a consequence, if the assumptions do not hold, we obtain no theoretical guarantee with regards to the solution quality. Furthermore, the constants in Theorem 2 are not known and may be large.\nOur approach addresses these shortcomings using three key elements. Firstly, we provide a proposal distribution that renders the assumption on α(X) obsolete. Secondly, a novel theoretic analysis allows us to obtain theoretical guarantees on the solution quality even without assumptions on β(X). Finally, our results characterize the tradeoff between increasing the chain length m and improving the expected solution quality.\n\nAlgorithm 1 ASSUMPTION-FREE K-MC2 (AFK-MC2)\nRequire: Data set X, # of centers k, chain length m\n// Preprocessing step\n1: c1 ← point uniformly sampled from X\n2: for all x ∈ X do\n3:   q(x) ← (1/2) · d(x, c1)² / Σ_{x′∈X} d(x′, c1)² + 1/(2n)\n// Main loop\n4: C_1 ← {c1}\n5: for i = 2, 3, ..., k do\n6:   x ← point sampled from X using q(x)\n7:   dx ← d(x, C_{i−1})²\n8:   for j = 2, 3, ..., m do\n9:     y ← point sampled from X using q(y)\n10:    dy ← d(y, C_{i−1})²\n11:    if (dy · q(x)) / (dx · q(y)) > Unif(0, 1) then x ← y, dx ← dy\n12:  C_i ← C_{i−1} ∪ {x}\n13: return C_k\n\nProposal distribution. We argue that the choice of the proposal distribution is critical. Intuitively, the uniform distribution can be a very bad choice if, in any iteration, the true D2-sampling distribution is “highly” nonuniform. We suggest the following proposal distribution: We first sample a center c1 ∈ X uniformly at random and define for all x ∈ X the nonuniform proposal\n\nq(x | c1) = (1/2) · d(x, c1)² / Σ_{x′∈X} d(x′, c1)²  [term (A)]  +  (1/2) · 1/|X|  [term (B)].   (4)\n\nThe term (A) is the true D2-sampling distribution with regards to the first center c1. 
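Algorithm 1 together with the proposal in (4) can be sketched in a few lines of NumPy. This is an illustrative re-implementation of ours, not the authors' released code, and the names are of our choosing:

```python
import numpy as np

def afk_mc2(X, k, m, rng):
    """ASSUMPTION-FREE K-MC2 (Algorithm 1): MCMC approximation of k-means++ seeding."""
    n = X.shape[0]
    # Preprocessing: one pass over the data to build the proposal q(x | c1) of eq. (4).
    c1 = X[rng.integers(n)]
    d_c1 = np.sum((X - c1) ** 2, axis=1)
    q = 0.5 * d_c1 / d_c1.sum() + 0.5 / n   # term (A) + term (B); sums to 1
    centers = [c1]
    for _ in range(1, k):
        C = np.array(centers)
        # Start the chain from a point sampled from q.
        x = rng.choice(n, p=q)
        dx = np.min(np.sum((X[x] - C) ** 2, axis=1))
        for _ in range(1, m):
            y = rng.choice(n, p=q)
            dy = np.min(np.sum((X[y] - C) ** 2, axis=1))
            # Metropolis-Hastings acceptance min(dy * q(x) / (dx * q(y)), 1),
            # written in cross-multiplied form to avoid division by zero.
            if dy * q[x] > dx * q[y] * rng.random():
                x, dx = y, dy
        centers.append(X[x])
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(c, 0.1, size=(50, 2)) for c in (-10.0, 0.0, 10.0)])
C = afk_mc2(X, k=3, m=30, rng=rng)
```

Note that only the preprocessing touches all n points; each of the k − 1 chains evaluates distances for m points against at most k − 1 centers.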
For any data set, it ensures that we start with the best possible proposal distribution in the second iteration. We will show that this proposal is sufficient even for later iterations, rendering any assumptions on α obsolete. The term (B) regularizes the proposal distribution and ensures that the mixing time of K-MC2 is always matched up to a factor of two.\nAlgorithm. Algorithm 1 details the proposed fast seeding algorithm ASSUMPTION-FREE K-MC2. In the preprocessing step, it first samples an initial center c1 uniformly at random and then computes the proposal distribution q(· | c1). In the main loop, it then uses independent Markov chains of length m to sample centers in each of the k − 1 iterations. The complexity of the main loop is O(mk²d).\nThe preprocessing step of ASSUMPTION-FREE K-MC2 requires a single pass through the data to compute the proposal q(· | c1). There are several reasons why this additional complexity of O(nd) is not an issue in practice: (1) The preprocessing step only requires a single pass through the data compared to k passes for the seeding of k-means++. (2) It is easily parallelized. (3) Given random access to the data, the proposal distribution can be calculated online when saving or copying the data. (4) As we will see in Section 4, the effort spent in the preprocessing step pays off: It often allows for shorter Markov chains in the main loop. (5) Computing α(X) to verify the first assumption of K-MC2 is already more expensive than the preprocessing step of ASSUMPTION-FREE K-MC2.\nTheorem 1. Let ε ∈ (0, 1) and k ∈ N. Let X be any set of n points in R^d and C be the output of Algorithm 1 with m = 1 + (8/ε) log(4k/ε). 
Then, it holds that\n\nE[φ_C(X)] ≤ 8(log₂ k + 2) φ_OPT^k(X) + ε Var(X).\n\nThe computational complexity of the preprocessing step is O(nd) and the computational complexity of the main loop is O((1/ε) k²d log(k/ε)).\nThis result shows that ASSUMPTION-FREE K-MC2 produces provably good clusterings for arbitrary data sets without assumptions. The guarantee consists of two terms: The first term, i.e., 8(log₂ k + 2) φ_OPT^k(X), is the theoretical guarantee of k-means++. The second term, ε Var(X), quantifies the potential additional error due to the approximation. The variance is a natural notion as the mean is the optimal quantizer for k = 1. Intuitively, the second term may be interpreted as a scale-invariant and additive approximation error.\n\nTheorem 1 directly characterizes the tradeoff between improving the solution quality and the resulting increase in computational complexity. As m is increased, the solution quality converges to the theoretical guarantee of k-means++. At the same time, even for smaller chain lengths m, we obtain a provable bound on the solution quality. In contrast, the guarantee of K-MC2 on the solution quality only holds for a specific choice of m.\nFor completeness, ASSUMPTION-FREE K-MC2 may also be analyzed under the assumptions made in Bachem et al. (2016). While for K-MC2 the required chain length m is linear in α(X), ASSUMPTION-FREE K-MC2 does not require this assumption. In fact, we will see in Section 4 that this lack of dependence on α(X) leads to a better empirical performance. If we assume β(X) ∈ O(k), we obtain the following result similar to the one of K-MC2 (albeit with a shorter chain length m).\nCorollary 1. Let k ∈ N and X be a set of n points in R^d satisfying β(X) ∈ O(k). Let C be the output of Algorithm 1 with m = Θ(k log k). 
Then it holds that\n\nE[φ_C(X)] ≤ 8(log₂ k + 3) φ_OPT^k(X).\n\nThe computational complexity of the preprocessing is O(nd) and the computational complexity of the main loop is O(k³d log k).\n\n3.1 Proof sketch for Theorem 1\nIn this subsection, we provide a sketch of the proof of Theorem 1 and defer the full proof to Section A of the supplementary materials. Intuitively, we first bound how well a single Markov chain approximates one iteration of exact D2-sampling. Then, we analyze how the approximation error accumulates across iterations and provide a bound on the expected solution quality.\nFor the first step, consider any set C ⊆ X of previously sampled centers. Let c1 ∈ C denote the first sampled center that was used to construct the proposal distribution q(x | c1) in (4). In a single iteration, we would ideally sample a new center x ∈ X using D2-sampling, i.e., from p(x | C) as defined in (1). Instead, Algorithm 1 constructs a Markov chain to sample a new center x ∈ X as the next cluster center. We denote by p̃_m^{c1}(x | C) the implied probability of sampling a point x ∈ X using this Markov chain of length m.\nThe following result shows that in any iteration either C is ε1-competitive compared to c1 or the Markov chain approximates D2-sampling well in terms of total variation distance¹.\nLemma 1. Let ε1, ε2 ∈ (0, 1) and c1 ∈ X. Consider any set C ⊆ X with c1 ∈ C. For m ≥ 1 + (2/ε1) log(1/ε2), at least one of the following holds:\n\n(i) φ_C(X) < ε1 φ_{c1}(X), or\n(ii) ‖p(· | C) − p̃_m^{c1}(· | C)‖_TV ≤ ε2.\n\nIn the second step, we bound the expected solution quality of Algorithm 1 based on Lemma 1. 
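The total variation guarantee in case (ii) of Lemma 1 can be checked empirically in the simplest setting C = {c1}, where term (A) of the proposal coincides with the target distribution itself. The snippet below is an illustrative experiment of our own (all parameter values are arbitrary): it runs many independent chains and compares the empirical marginal of the final state against exact D2-sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, reps = 30, 25, 5000
X = rng.normal(size=(n, 2))

# Exact D^2-sampling target w.r.t. C = {c1}, and the proposal q(. | c1) of eq. (4).
c1 = X[0]
d = np.sum((X - c1) ** 2, axis=1)
p = d / d.sum()            # p(x | C) for C = {c1}, eq. (1)
q = 0.5 * p + 0.5 / n      # eq. (4): here term (A) equals the target itself

# Run `reps` independent chains of length m, vectorized over chains.
states = rng.choice(n, size=reps, p=q)
for _ in range(m - 1):
    prop = rng.choice(n, size=reps, p=q)
    # MH acceptance p(y) q(x) / (p(x) q(y)) >= Unif(0,1), cross-multiplied.
    accept = d[prop] * q[states] > d[states] * q[prop] * rng.random(reps)
    states = np.where(accept, prop, states)

# Empirical marginal of the final chain state vs. the exact target, in TV distance.
freq = np.bincount(states, minlength=n) / reps
tv = 0.5 * np.abs(freq - p).sum()
```

In this special case the importance ratio p/q is at most 2, so the chain mixes geometrically and the measured `tv` is dominated by the Monte Carlo noise of the 5,000 replicates rather than by chain bias.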
While the full proof requires careful propagation of errors across iterations and a corresponding inductive argument, the intuition is based on distinguishing between two possible cases of sampled solutions. First, consider the realizations of the solution C that are ε1-competitive compared to c1. By definition, φ_C(X) < ε1 φ_{c1}(X). Furthermore, the expected solution quality of these realizations can be bounded by 2ε1 Var(X) since c1 is chosen uniformly at random and hence in expectation φ_{c1}(X) ≤ 2 Var(X).\nSecond, consider the realizations that are not ε1-competitive compared to c1. Since the quantization error is non-increasing in sampled centers, Lemma 1 implies that all k − 1 Markov chains result in a good approximation of the corresponding D2-sampling. Intuitively, this implies that the approximation error in terms of total variation distance across all k − 1 iterations is at most ε2(k − 1). Informally, the expected solution quality is thus bounded with probability 1 − ε2(k − 1) by the expected quality of k-means++ and with probability ε2(k − 1) by φ_{c1}(X).\nTheorem 1 can then be proven by setting ε1 = ε/4 and ε2 = ε/(4k) and choosing m sufficiently large.\n\n¹Let Ω be a finite sample space on which two probability distributions p and q are defined. The total variation distance ‖p − q‖_TV between p and q is given by (1/2) Σ_{x∈Ω} |p(x) − q(x)|.\n\nTable 1: Data sets used in experimental evaluation\n\n| DATA SET | N | D | K | EVAL | α(X) |\n| CSN (EARTHQUAKES) | 80,000 | 17 | 200 | T | 546 |\n| KDD (PROTEIN HOMOLOGY) | 145,751 | 74 | 200 | T | 1,268 |\n| RNA (RNA SEQUENCES) | 488,565 | 8 | 200 | T | 69 |\n| SONG (MUSIC SONGS) | 515,345 | 90 | 2,000 | H | 526 |\n| SUSY (SUPERSYM. PARTICLES) | 5,000,000 | 18 | 2,000 | H | 201 |\n| WEB (WEB USERS) | 45,811,883 | 5 | 2,000 | H | 2 |\n\nTable 2: Relative error of ASSUMPTION-FREE K-MC2 and K-MC2 in relation to k-means++.\n\nK-MEANS++\nRANDOM\nK-MC2 (m = 20)\nK-MC2 (m = 100)\nK-MC2 (m = 200)\nAFK-MC2 (m = 20)\nAFK-MC2 (m = 100)\nAFK-MC2 (m = 200)\n\nCSN\n0.00%\n\nSONG\nRNA\n0.00%\n0.00%\n9.67%\n399.54% 314.78% 915.46%\n0.41% -0.03%\n32.51%\n65.34%\n0.04% -0.08%\n9.84%\n14.81%\n0.02% -0.04%\n5.97%\n5.48%\n8.31%\n1.45%\n0.00%\n0.01%\n0.81% -0.02% -0.06%\n0.25%\n0.24%\n-0.29%\n0.04% -0.05%\n\nWEB\nSUSY\n0.00%\n0.00%\n4.30% 107.57%\n0.86%\n-0.01%\n0.09%\n1.32%\n0.06%\n-0.16%\n\nKDD\n0.00%\n\n31.91%\n3.39%\n0.65%\n-0.12%\n-0.11%\n-0.03%\n\n3.2 Extension to other clustering problems\nWhile we only consider k-Means clustering and the Euclidean distance in this paper, the results are more general. They can be directly applied, by transforming the data, to any metric space for which there exists a global isometry on Euclidean spaces. Examples would be the Mahalanobis distance and Generalized Symmetrized Bregman divergences (Acharyya et al., 2013).\nThe results also apply to arbitrary distance measures (albeit with different constants) as D2-sampling can be generalized to arbitrary distance measures (Arthur & Vassilvitskii, 2007). However, Var(X) needs to be replaced by φ_OPT^1(X) in Theorem 1 since the mean may not be the optimal quantizer (for k = 1) for a different distance metric. The proposed algorithm can further be extended to different potential functions of the form ‖·‖^l and used to approximate the corresponding D^l-sampling (Arthur & Vassilvitskii, 2007), again with different constants. 
Similarly, the results also apply to bregman++ (Ackermann & Blömer, 2010) which provides provably competitive solutions for clustering with a broad class of Bregman divergences (including the KL-divergence and the Itakura-Saito distance).\n\n4 Experimental results\n\nIn this section², we empirically validate our theoretical results and compare the proposed algorithm ASSUMPTION-FREE K-MC2 (AFK-MC2) to three alternative seeding strategies: (1) RANDOM, a “naive” baseline that samples k centers from X uniformly at random, (2) the full seeding step of k-means++, and (3) K-MC2. For both ASSUMPTION-FREE K-MC2 and K-MC2, we consider the different chain lengths m ∈ {1, 2, 5, 10, 20, 50, 100, 150, 200}.\nTable 1 shows the six data sets used in the experiments with their corresponding values for k. We choose an experimental setup similar to Bachem et al. (2016): For half of the data sets, we both train the algorithm and evaluate the corresponding solution on the full data set (denoted by T in the EVAL column of Table 1). This corresponds to the classical k-Means setting. In practice, however, one is often also interested in the generalization error. For the other half of the data sets, we retain 250,000 data points as the holdout set for the evaluation (denoted by H in the EVAL column of Table 1).\nFor all methods, we record the solution quality (either on the full data set or the holdout set) and measure the number of distance evaluations needed to run the algorithm. For ASSUMPTION-FREE K-MC2 this includes both the preprocessing and the main loop. We run every algorithm 200 times with different random seeds and average the results. 
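The recorded solution quality and the relative errors reported later can be computed with a small helper of our own devising (the relative-error convention below is one natural reading of how Table 2 reports results; all names and the toy data are illustrative):

```python
import numpy as np

def quantization_error(X, C):
    """phi_C(X): sum of squared distances from each point to its closest center."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def relative_error(phi_algo, phi_kmeanspp):
    """Relative error (in %) of a seeding compared to the k-means++ baseline."""
    return 100.0 * (phi_algo - phi_kmeanspp) / phi_kmeanspp

# Toy data: two clusters on a line, and two hand-picked center sets.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
good = np.array([[0.5, 0.0], [10.5, 0.0]])   # one center per cluster
bad = np.array([[0.0, 0.0], [1.0, 0.0]])     # both centers in one cluster
err = relative_error(quantization_error(X, bad), quantization_error(X, good))
```

Here `err` comes out to 18000.0, i.e., the bad seeding is 180× worse than the good one, illustrating why the seeding step matters.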
We further compute and display 95% confidence intervals for the solution quality.\n\n²An implementation of ASSUMPTION-FREE K-MC2 has been released at http://olivierbachem.ch.\n\nFigure 1: Quantization error in relation to the chain length m for ASSUMPTION-FREE K-MC2 and K-MC2 as well as the quantization error for k-means++ and RANDOM (with no dependence on m). ASSUMPTION-FREE K-MC2 substantially outperforms K-MC2 except on WEB. Results are averaged across 200 iterations and shaded areas denote 95% confidence intervals.\n\nFigure 2: Quantization error in relation to the number of distance evaluations for ASSUMPTION-FREE K-MC2, K-MC2 and k-means++. ASSUMPTION-FREE K-MC2 provides a speedup of up to several orders of magnitude compared to k-means++. Results are averaged across 200 iterations and shaded areas denote 95% confidence intervals.\n\nTable 3: Relative speedup (in terms of distance evaluations) in relation to 
k-means++.\n\n| | CSN | KDD | RNA | SONG | SUSY | WEB |\n| K-MEANS++ | 1.0× | 1.0× | 1.0× | 1.0× | 1.0× | 1.0× |\n| K-MC2 (m = 20) | 40.0× | 72.9× | 244.3× | 13.3× | 237.5× | 2278.1× |\n| K-MC2 (m = 100) | 8.0× | 14.6× | 48.9× | 2.7× | 47.5× | 455.6× |\n| K-MC2 (m = 200) | 4.0× | 7.3× | 24.4× | 1.3× | 23.8× | 227.8× |\n| AFK-MC2 (m = 20) | 33.3× | 53.3× | 109.7× | 13.2× | 212.3× | 1064.7× |\n| AFK-MC2 (m = 100) | 7.7× | 13.6× | 39.2× | 2.6× | 46.4× | 371.0× |\n| AFK-MC2 (m = 200) | 3.9× | 7.0× | 21.8× | 1.3× | 23.5× | 204.5× |\n\nDiscussion. Figure 1 shows the expected quantization error for the two baselines, RANDOM and k-means++, and for the MCMC methods with different chain lengths m. As expected, the seeding step of k-means++ strongly outperforms RANDOM on all data sets. As the chain length m increases, the quality of solutions produced by both ASSUMPTION-FREE K-MC2 and K-MC2 quickly converges to that of k-means++ seeding.\nOn all data sets except WEB, ASSUMPTION-FREE K-MC2 starts with a lower initial error due to the improved proposal distribution and outperforms K-MC2 for any given chain length m. For WEB, both algorithms exhibit approximately the same performance. This is expected as α(X) of WEB is very low (see Table 1). Hence, there is only a minor difference between the nonuniform proposal of ASSUMPTION-FREE K-MC2 and the uniform proposal of K-MC2. In fact, one of the key advantages of ASSUMPTION-FREE K-MC2 is that its proposal adapts to the data set at hand.\nAs discussed in Section 3, ASSUMPTION-FREE K-MC2 requires an additional preprocessing step to compute the nonuniform proposal. Figure 2 shows the expected solution quality in relation to the total computational complexity in terms of number of distance evaluations. 
Both K-MC2 and ASSUMPTION-FREE K-MC2 generate solutions that are competitive with those produced by the seeding step of k-means++. At the same time, they do this at a fraction of the computational cost. Despite the preprocessing, ASSUMPTION-FREE K-MC2 clearly outperforms K-MC2 on the data sets with large values for α(X) (CSN, KDD and SONG). The additional effort of computing the nonuniform proposal is compensated by a substantially lower expected quantization error for a given chain size. For the other data sets, ASSUMPTION-FREE K-MC2 is initially disadvantaged by the cost of computing the proposal distribution. However, as m increases and more time is spent computing the Markov chains, it either outperforms K-MC2 (RNA and SUSY) or matches its performance (WEB).\nTable 3 details the practical significance of the proposed algorithm. The results indicate that in practice it is sufficient to run ASSUMPTION-FREE K-MC2 with a chain length independent of n. Even with a small chain length, ASSUMPTION-FREE K-MC2 produces competitive clusterings at a fraction of the computational cost of the seeding step of k-means++. For example, on CSN, ASSUMPTION-FREE K-MC2 with m = 20 achieves a relative error of 1.45% and a speedup of 33.3×. At the same time, K-MC2 would have exhibited a substantial relative error of 65.34% while only obtaining a slightly better speedup of 40.0×.\n\n5 Conclusion\n\nIn this paper, we propose ASSUMPTION-FREE K-MC2, a simple and fast seeding algorithm for k-Means. In contrast to the previously introduced algorithm K-MC2, it produces provably good clusterings even without assumptions on the data. As a key advantage, ASSUMPTION-FREE K-MC2 allows one to provably trade off solution quality for a decreased computational effort. 
Extensive experi-\nments illustrate the practical signi\ufb01cance of the proposed algorithm: It obtains competitive clusterings\nat a fraction of the cost of k-means++ seeding and it outperforms or matches its main competitor\nK-MC2 on all considered data sets.\n\nAcknowledgments\nThis research was partially supported by ERC StG 307036, a Google Ph.D. Fellowship and an IBM\nPh.D. Fellowship.\n\n8\n\n\fReferences\nAcharyya, Sreangsu, Banerjee, Arindam, and Boley, Daniel. Bregman divergences and triangle\n\ninequality. In SIAM International Conference on Data Mining (SDM), pp. 476\u2013484, 2013.\n\nAckermann, Marcel R and Bl\u00f6mer, Johannes. Bregman clustering for separable instances. In SWAT,\n\npp. 212\u2013223. Springer, 2010.\n\nAggarwal, Ankit, Deshpande, Amit, and Kannan, Ravi. Adaptive sampling for k-means clustering.\nIn Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques,\npp. 15\u201328. Springer, 2009.\n\nAilon, Nir, Jaiswal, Ragesh, and Monteleoni, Claire. Streaming k-means approximation. In Neural\n\nInformation Processing Systems (NIPS), pp. 10\u201318, 2009.\n\nArthur, David and Vassilvitskii, Sergei. k-means++: The advantages of careful seeding. In Symposium\non Discrete Algorithms (SODA), pp. 1027\u20131035. Society for Industrial and Applied Mathematics,\n2007.\n\nBachem, Olivier, Lucic, Mario, Hassani, S. Hamed, and Krause, Andreas. Approximate k-means++\n\nin sublinear time. In Conference on Arti\ufb01cial Intelligence (AAAI), February 2016.\n\nBahmani, Bahman, Moseley, Benjamin, Vattani, Andrea, Kumar, Ravi, and Vassilvitskii, Sergei.\n\nScalable k-means++. Very Large Data Bases (VLDB), 5(7):622\u2013633, 2012.\n\nBottou, Leon and Bengio, Yoshua. Convergence properties of the k-means algorithms. In Neural\n\nInformation Processing Systems (NIPS), pp. 585\u2013592, 1994.\n\nBrunsch, Tobias and R\u00f6glin, Heiko. A bad instance for k-means++. In Theory and Applications of\n\nModels of Computation, pp. 
344\u2013352. Springer, 2011.\n\nCai, Haiyan. Exact bound for the convergence of Metropolis chains. Stochastic Analysis and\n\nApplications, 18(1):63\u201371, 2000.\n\nCelebi, M Emre, Kingravi, Hassan A, and Vela, Patricio A. A comparative study of ef\ufb01cient\ninitialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40\n(1):200\u2013210, 2013.\n\nHastings, W Keith. Monte Carlo sampling methods using Markov chains and their applications.\n\nBiometrika, 57(1):97\u2013109, 1970.\n\nJaiswal, Ragesh, Kumar, Amit, and Sen, Sandeep. A simple D2-sampling based PTAS for k-means\n\nand other clustering problems. Algorithmica, 70(1):22\u201346, 2014.\n\nJaiswal, Ragesh, Kumar, Mehul, and Yadav, Pulkit. Improved analysis of D2-sampling based PTAS\nfor k-means and other clustering problems. Information Processing Letters, 115(2):100\u2013103, 2015.\nLloyd, Stuart. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):\n\n129\u2013137, 1982.\n\nOstrovsky, Rafail, Rabani, Yuval, Schulman, Leonard J, and Swamy, Chaitanya. The effectiveness of\nLloyd-type methods for the k-means problem. In Foundations of Computer Science (FOCS), pp.\n165\u2013176. IEEE, 2006.\n\nSculley, D. Web-scale k-means clustering. In World Wide Web (WWW), pp. 1177\u20131178. ACM, 2010.\nZhao, Weizhong, Ma, Huifang, and He, Qing. Parallel k-means clustering based on MapReduce. In\n\nCloud Computing, pp. 674\u2013679. Springer, 2009.\n\n9\n\n\f", "award": [], "sourceid": 39, "authors": [{"given_name": "Olivier", "family_name": "Bachem", "institution": "ETH Zurich"}, {"given_name": "Mario", "family_name": "Lucic", "institution": "ETH Zurich"}, {"given_name": "Hamed", "family_name": "Hassani", "institution": "ETH Zurich"}, {"given_name": "Andreas", "family_name": "Krause", "institution": "ETHZ"}]}