{"title": "Streaming k-means approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 10, "page_last": 18, "abstract": "We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very light-weight in terms of memory, and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means, in which the algorithm is allowed to output more than k centers (based on the recent k-means++\"), and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.\"", "full_text": "Streaming k-means approximation\n\nNir Ailon\n\nGoogle Research\n\nnailon@google.com\n\nRagesh Jaiswal\u2217\nColumbia University\n\nrjaiswal@gmail.com\n\nClaire Monteleoni\u2020\nColumbia University\n\ncmontel@ccls.columbia.edu\n\nAbstract\n\nWe provide a clustering algorithm that approximately optimizes the k-means ob-\njective, in the one-pass streaming setting. We make no assumptions about the\ndata, and our algorithm is very light-weight in terms of memory, and computa-\ntion. This setting is applicable to unsupervised learning on massive data sets, or\nresource-constrained devices. 
The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.

1 Introduction

As commercial, social, and scientific data sources continue to grow at an unprecedented rate, it is increasingly important that algorithms to process and analyze this data operate in online, or one-pass streaming, settings. The goal is to design light-weight algorithms that make only one pass over the data. Clustering techniques are widely used in machine learning applications as a way to summarize large quantities of high-dimensional data, by partitioning them into "clusters" that are useful for the specific application. The problem with many heuristics designed to implement some notion of clustering is that their outputs can be hard to evaluate. Approximation guarantees, with respect to some reasonable objective, are therefore useful. The k-means objective is a simple, intuitive, and widely-cited clustering objective for data in Euclidean space. However, although many clustering algorithms have been designed with the k-means objective in mind, very few have approximation guarantees with respect to this objective.

In this work, we give a one-pass streaming algorithm for the k-means problem. We are not aware of previous approximation guarantees with respect to the k-means objective that have been shown for simple clustering algorithms that operate in either online or streaming settings.
We extend work of Arthur and Vassilvitskii [AV07] to provide a bi-criterion approximation algorithm for k-means in the batch setting. They define a seeding procedure which chooses a subset of k points from a batch of points, and they show that this subset gives an expected O(log k)-approximation to the k-means objective. This seeding procedure is followed by Lloyd's algorithm,¹ which works very well in practice with the seeding. The combined algorithm is called k-means++, and is an O(log k)-approximation algorithm in expectation.² We modify k-means++ to obtain a new algorithm, k-means#, which chooses a subset of O(k log k) points, and we show that the chosen subset of points gives a constant approximation to the k-means objective. Apart from giving us a bi-criterion approximation algorithm, our modified seeding procedure is very simple to analyze.

[GMMM+03] defines a divide-and-conquer strategy to combine multiple bi-criterion approximation algorithms for the k-medoid problem to yield a one-pass streaming approximation algorithm for k-median. We extend their analysis to the k-means problem and then use k-means++ and k-means# in the divide-and-conquer strategy, yielding an extremely efficient single-pass streaming algorithm with an O(c^α log k)-approximation guarantee, where α ≈ log n / log M, n is the number of input points in the stream, and M is the amount of working memory available to the algorithm.

*Department of Computer Science. Research supported by DARPA award HR0011-08-1-0069.
†Center for Computational Learning Systems.
¹Lloyd's algorithm is popularly known as the k-means algorithm.
²Since the approximation guarantee is proven based on the seeding procedure alone, for the purposes of this exposition we denote the seeding procedure as k-means++.
Empirical evaluations, on simulated and real data, demonstrate the practical utility of our techniques.

1.1 Related work

There is much literature on both clustering algorithms [Gon85, Ind99, VW02, GMMM+03, KMNP+04, ORSS06, AV07, CR08, BBG09, AL09] and streaming algorithms [Ind99, GMMM+03, M05, McG07].³ There has also been work on combining these settings: designing clustering algorithms that operate in the streaming setting [Ind99, GMMM+03, CCP03]. Our work is inspired by that of Arthur and Vassilvitskii [AV07] and Guha et al. [GMMM+03], which we mentioned above and will discuss in further detail. k-means++, the seeding procedure in [AV07], had previously been analyzed by [ORSS06], under special assumptions on the input data.

In order to be useful in machine learning applications, we are concerned with designing algorithms that are extremely light-weight and practical. k-means++ is efficient, very simple, and performs well in practice. There do exist constant approximations to the k-means objective in the non-streaming setting, such as a local search technique due to [KMNP+04].⁴ A number of works [LV92, CG99, Ind99, CMTS02, AGKM+04] give constant approximation algorithms for the related k-median problem, in which the objective is to minimize the sum of distances of the points to their nearest centers (rather than the squares of the distances, as in k-means), and the centers must be a subset of the input points. It is popularly believed that most of these algorithms can be extended to work for the k-means problem without too much degradation of the approximation; however, there is no formal evidence for this yet. Moreover, the running times of most of these algorithms depend worse than linearly on the parameters (n, k, and d), which makes these algorithms less useful in practice.
As future work, we propose analyzing variants of these algorithms in our streaming clustering algorithm, with the goal of yielding a streaming clustering algorithm with a constant approximation to the k-means objective.

Finally, it is important to make a distinction from some lines of clustering research which involve assumptions on the data to be clustered. Common assumptions include i.i.d. data, e.g. [BL08], and data that admits a clustering with well-separated means, e.g. in [VW02, ORSS06, CR08]. Recent work [BBG09] assumes a "target" clustering for the specific application and data set that is close to any constant approximation of the clustering objective. In contrast, we prove approximation guarantees with respect to the optimal k-means clustering, with no assumptions on the input data.⁵ As in [AV07], our probabilistic guarantees are only with respect to randomness in the algorithm.

1.1.1 Preliminaries

The k-means clustering problem is defined as follows: given n points X ⊂ R^d and a weight function w : X → R, the goal is to find a subset C ⊆ R^d with |C| = k such that the following quantity is minimized:⁶ φ_C = Σ_{x∈X} w(x) · D(x, C)², where D(x, C) denotes the ℓ2 distance of x to the nearest point in C. When the subset C is clear from the context, we denote this distance by D(x). Also, for two points x, y, D(x, y) denotes the ℓ2 distance between x and y. The subset C is alternatively called a clustering of X, and φ_C is called the potential function corresponding to the clustering. We will use the term "center" to refer to any c ∈ C.

³For a comprehensive survey of streaming results and literature, refer to [M05].
⁴In recent, independent work, Aggarwal, Deshpande, and Kannan [ADK09] extend the seeding procedure of k-means++ to obtain a constant factor approximation algorithm which outputs O(k) centers. They use similar techniques to ours, but reduce the number of centers by using a stronger concentration property.
⁵It may be interesting future work to analyze our algorithm in special cases, such as well-separated clusters.
⁶For the unweighted case, we can assume that w(x) = 1 for all x.

Definition 1.1 (Competitive ratio, b-approximation). Given an algorithm B for the k-means problem, let φ_C be the potential of the clustering C returned by B (on some input set which is implicit) and let φ_{C_OPT} denote the potential of the optimal clustering C_OPT. The competitive ratio is defined to be the worst-case ratio φ_C / φ_{C_OPT}. The algorithm B is said to be a b-approximation algorithm if φ_C / φ_{C_OPT} ≤ b.

The previous definition might be too strong for some purposes. For example, a clustering algorithm may perform poorly when it is constrained to output k centers, but become competitive when it is allowed to output more centers.

Definition 1.2 ((a, b)-approximation). We call an algorithm B an (a, b)-approximation for the k-means problem, where a > 1 and b > 1, if it outputs a clustering C with at most ak centers whose potential φ_C satisfies φ_C / φ_{C_OPT} ≤ b in the worst case.

Note that for simplicity we measure memory in terms of words, which essentially means that we assume a point in R^d can be stored in O(1) space.

2 k-means#: The advantages of careful and liberal seeding

The k-means++ algorithm is an expected Θ(log k)-approximation algorithm. In this section, we extend the ideas in [AV07] to get an (O(log k), O(1))-approximation algorithm. Here is the k-means++ algorithm:

1. Choose an initial center c1 uniformly at random from X.
2. Repeat (k − 1) times:
3.   Choose the next center ci, selecting ci = x′ ∈ X with probability D(x′)² / Σ_{x∈X} D(x)²
     (here D(·) denotes the distances w.r.t. the subset of centers chosen in the previous rounds).

Algorithm 1: k-means++

In the original definition of k-means++ in [AV07], the above algorithm is followed by Lloyd's algorithm; that is, the above algorithm is used as a seeding step for Lloyd's algorithm, and this combination is known to give the best results in practice. On the other hand, the theoretical guarantee of k-means++ comes from analyzing this seeding step and not Lloyd's algorithm, so for our analysis we focus on the seeding step. The running time of the algorithm is O(nkd). In the above algorithm, X denotes the set of given points and, for any point x, D(x) denotes the distance of this point from the nearest center among the centers chosen in the previous rounds. To get an (O(log k), O(1))-approximation algorithm, we make a simple change to the above algorithm.

We first set up the tools for analysis. These are the basic lemmas from [AV07]. We will need the following definition first:

Definition 2.1 (Potential w.r.t. a set). Given a clustering C, its potential with respect to some set A is denoted by φ_C(A) and is defined as φ_C(A) = Σ_{x∈A} D(x)², where D(x) is the distance of the point x from the nearest point in C.

Lemma 2.2 ([AV07], Lemma 3.1). Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, chosen uniformly at random from A. Then Exp[φ_C(A)] = 2 · φ_{C_OPT}(A).

Corollary 2.3. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, chosen uniformly at random from A. Then Pr[φ_C(A) < 8 · φ_{C_OPT}(A)] ≥ 3/4.

Proof. The proof follows from Markov's inequality.

Lemma 2.4 ([AV07], Lemma 3.2). Let A be an arbitrary cluster in C_OPT, and let C be an arbitrary clustering. If we add a random center from A to C, chosen with D² weighting, to get C′, then Exp[φ_{C′}(A)] ≤ 8 · φ_{C_OPT}(A).

Corollary 2.5. Let A be an arbitrary cluster in C_OPT, and let C be an arbitrary clustering. If we add a random center from A to C, chosen with D² weighting, to get C′, then Pr[φ_{C′}(A) < 32 · φ_{C_OPT}(A)] ≥ 3/4.

We will use k-means++ and the above two lemmas to obtain an (O(log k), O(1))-approximation algorithm for the k-means problem. Consider the following algorithm:

1. Choose 3 · log k centers independently and uniformly at random from X.
2. Repeat (k − 1) times:
3.   Choose 3 · log k centers independently, each selected to be x′ ∈ X with probability D(x′)² / Σ_{x∈X} D(x)²
     (here D(·) denotes the distances w.r.t. the subset of centers chosen in the previous rounds).

Algorithm 2: k-means#

Note that the algorithm is almost the same as the k-means++ algorithm, except that in each round of choosing centers we pick O(log k) centers rather than a single center. The running time of the above algorithm is clearly O(ndk log k).

Let A = {A1, ..., Ak} denote the set of clusters in the optimal clustering C_OPT. Let C^i denote the clustering after the i-th round of choosing centers. Let A^i_c denote the subset of clusters in A such that

  for all A ∈ A^i_c,  φ_{C^i}(A) ≤ 32 · φ_{C_OPT}(A).

We call this subset of clusters the "covered" clusters. Let A^i_u = A \ A^i_c be the subset of "uncovered" clusters. Let X^i_c = ∪_{A∈A^i_c} A and let X^i_u = X \ X^i_c. Now after the i-th round, either φ_{C^i}(X^i_c) ≤ φ_{C^i}(X^i_u) or otherwise. In the former case, using Corollary 2.5, we show that the probability of covering an uncovered cluster in the (i + 1)-th round is large. In the latter case, we will show that the current set of centers is already competitive, with constant approximation ratio. The following simple lemma shows that, with constant probability, step (1) of k-means# picks a center such that at least one of the clusters gets covered, or in other words |A^1_c| ≥ 1. Let us call this event E.

Lemma 2.6. Pr[E] ≥ (1 − 1/k).

Proof. The proof easily follows from Corollary 2.3.

Let us start with the latter case.

Lemma 2.7. If event E occurs (i.e., |A^1_c| ≥ 1) and for any i > 1, φ_{C^i}(X^i_c) > φ_{C^i}(X^i_u), then φ_{C^i} ≤ 64 · φ_{C_OPT}.

Proof. We get the main result using the following sequence of inequalities: φ_{C^i} = φ_{C^i}(X^i_c) + φ_{C^i}(X^i_u) ≤ φ_{C^i}(X^i_c) + φ_{C^i}(X^i_c) ≤ 2 · 32 · φ_{C_OPT}(X^i_c) ≤ 64 · φ_{C_OPT} (using the definition of X^i_c).

Lemma 2.8. If for any i ≥ 1, φ_{C^i}(X^i_c) ≤ φ_{C^i}(X^i_u), then Pr[|A^{i+1}_c| ≥ |A^i_c| + 1] ≥ (1 − 1/k).

Proof. Note that in the (i + 1)-th round, the probability that a center is chosen from a cluster not in A^i_c is at least φ_{C^i}(X^i_u) / (φ_{C^i}(X^i_u) + φ_{C^i}(X^i_c)) ≥ 1/2. Conditioned on this event, by Corollary 2.5, with probability at least 3/4 any given center x chosen in round (i + 1) satisfies φ_{C^i ∪ x}(A) ≤ 32 · φ_{C_OPT}(A) for some uncovered cluster A ∈ A^i_u. This means that with probability at least 3/8 any given chosen center x in round (i + 1) satisfies φ_{C^i ∪ x}(A) ≤ 32 · φ_{C_OPT}(A) for some uncovered cluster A ∈ A^i_u. Since the 3 · log k centers of the round are chosen independently, the probability that none of them does so is at most (1 − 3/8)^{3 log k} ≤ 1/k, so with probability at least (1 − 1/k) at least one of the chosen centers x in round (i + 1) satisfies φ_{C^i ∪ x}(A) ≤ 32 · φ_{C_OPT}(A) for some uncovered cluster A ∈ A^i_u.

We use the above two lemmas to prove our main theorem.

Theorem 2.9. k-means# is an (O(log k), O(1))-approximation algorithm.

Proof. From Lemma 2.6 we know that event E (i.e., |A^1_c| ≥ 1) occurs with probability at least (1 − 1/k). Given this, suppose that for some i > 1, after the i-th round, φ_{C^i}(X^i_c) > φ_{C^i}(X^i_u). Then from Lemma 2.7 we have φ_C ≤ φ_{C^i} ≤ 64 · φ_{C_OPT}. If no such i exists, then from Lemma 2.8 we get that the probability that there exists a cluster A ∈ A such that A is not covered even after k rounds (i.e., at the end of the algorithm) is at most 1 − (1 − 1/k)^k ≤ 3/4. So with probability at least 1/4, the algorithm covers all the clusters in A, and in this case we have φ_C = φ_{C^k} ≤ 32 · φ_{C_OPT}.

We have shown that k-means# is a randomized algorithm for clustering which, with probability at least 1/4, gives a clustering with competitive ratio 64.

3 A single pass streaming algorithm for k-means

In this section, we provide a single-pass streaming algorithm. The basic ingredient of the algorithm is a divide-and-conquer strategy defined by [GMMM+03], which uses bi-criterion approximation algorithms in the batch setting. We will use k-means++, which is a (1, O(log k))-approximation algorithm, and k-means#, which is an (O(log k), O(1))-approximation algorithm, to construct a single-pass streaming O(log k)-approximation algorithm for the k-means problem. In the next subsection, we develop some of the tools needed for the above.

3.1 A streaming (a, b)-approximation for k-means

We will show that a simple streaming divide-and-conquer scheme, analyzed by [GMMM+03] with respect to the k-medoid objective, can be used to approximate the k-means objective. First we present the scheme due to [GMMM+03], where in this case we use k-means-approximating algorithms as input.

Inputs: (a) Point set S ⊂ R^d; let n = |S|.
        (b) Number of desired clusters, k ∈ N.
        (c) A, an (a, b)-approximation algorithm to the k-means objective.
        (d) A′, an (a′, b′)-approximation algorithm to the k-means objective.

1. Divide S into groups S1, S2, . . .
, Sℓ
2. For each i ∈ {1, 2, . . . , ℓ}:
3.   Run A on Si to get ≤ ak centers Ti = {t_{i1}, t_{i2}, . . .}
4.   Denote the induced clusters of Si as S_{i1} ∪ S_{i2} ∪ · · ·
5. Sw ← T1 ∪ T2 ∪ · · · ∪ Tℓ, with weights w(t_{ij}) ← |S_{ij}|
6. Run A′ on Sw to get ≤ a′k centers T
7. Return T

Algorithm 3: [GMMM+03] Streaming divide-and-conquer clustering

First note that when every batch Si has size √(nk), this algorithm takes one pass and O(a√(nk)) memory. Now we will give an approximation guarantee.

Theorem 3.1. The algorithm above outputs a clustering that is an (a′, 2b + 4b′(b + 1))-approximation to the k-means objective.

The a′ approximation of the desired number of centers follows directly from the approximation property of A′ with respect to the number of centers, since A′ is the last algorithm to be run. It remains to show the approximation of the k-means objective. The proof, which appears in the Appendix, involves extending the analysis of [GMMM+03] to the case of the k-means objective. Using the exposition in Dasgupta's lecture notes [Das08] of the proof due to [GMMM+03], our extension is straightforward, and differs in the following ways from the k-medoid analysis.

1. The k-means objective involves squared distance (as opposed to k-medoid, in which the distance is not squared), so the triangle inequality cannot be invoked directly. We replace it with an application of the triangle inequality followed by (a + b)² ≤ 2a² + 2b², everywhere it occurs, introducing several factors of 2.
2. Cluster centers are chosen from R^d for the k-means problem, so in various parts of the proof we save an approximation factor of 2 relative to the k-medoid problem, in which cluster centers must be chosen from the input data.

3.2 Using k-means++ and k-means# in the divide-and-conquer strategy

In the previous subsection, we saw how an (a, b)-approximation algorithm A and an (a′, b′)-approximation algorithm A′ can be used to get a single-pass (a′, 2b + 4b′(b + 1))-approximation streaming algorithm. We now have two randomized algorithms: k-means#, which with probability at least 1/4 is a (3 log k, 64)-approximation algorithm, and k-means++, which is a (1, O(log k))-approximation algorithm (the approximation factor being in expectation). We can now use these two algorithms in the divide-and-conquer strategy to obtain a single-pass streaming algorithm. We use the following as algorithms A and A′ in the divide-and-conquer strategy (Algorithm 3):

A: "Run k-means# on the data 3 log n times independently, and pick the clustering with the smallest cost."
A′: "Run k-means++."

Weighted versus non-weighted. Note that k-means++ and k-means# are approximation algorithms for the non-weighted case (i.e., w(x) = 1 for all points x). On the other hand, in the divide-and-conquer strategy we need the algorithm A′ to work for the weighted case, where the weights are integers. Note that both k-means++ and k-means# can be easily generalized to the weighted case when the weights are integers: both algorithms compute probabilities based on the cost with respect to the current clustering, and this cost can be computed by taking the weights into account. For the analysis, we can treat each point as appearing with multiplicity equal to its integer weight. The memory required remains logarithmic in the input size, including the storage of the weights.

Analysis. With probability at least (1 − (3/4)^{3 log n}) ≥ (1 − 1/n), algorithm A is a (3 log k, 64)-approximation algorithm. Moreover, the space requirement remains logarithmic in the input size. In step (3) of Algorithm 3 we run A on batches of data. Since each batch is of size √(nk), the number of batches is √(n/k), so the probability that A is a (3 log k, 64)-approximation algorithm for all of these batches is at least (1 − 1/n)^{√(n/k)} ≥ 1/2. Conditioned on this event, the divide-and-conquer strategy gives an O(log k)-approximation algorithm. The memory required is O(log(k) · √(nk)) times the logarithm of the input size. Moreover, the algorithm has running time O(dnk log n log k).

3.3 Improved memory-approximation tradeoffs

We saw in the last section how to obtain a single-pass (a′, cbb′)-approximation for k-means using first an (a, b)-approximation on input blocks and then an (a′, b′)-approximation on the union of the output center sets, where c is some global constant. The optimal memory required for this scheme was O(a√(nk)). This immediately implies a tradeoff between the memory requirements (growing like a), the number of centers outputted (which is a′k), and the approximation to the potential (which is cbb′) with respect to the optimal solution using k centers. A more subtle tradeoff is possible by a recursive application of the technique in multiple levels. Indeed, the (a, b)-approximation could be broken up in turn into two levels, and so on. This idea was used in [GMMM+03]. Here we make a more precise account of the tradeoff between the different parameters.

Assume we have subroutines for performing (ai, bi)-approximation for k-means in batch mode, for i = 1, . . . , r (we will choose a1, . . . , ar, b1, . . . , br later). We will hold r buffers B1, . . . , Br as work areas, where the size of buffer Bi is Mi. In the topmost level, we will divide the input into equal blocks of size M1 and run our (a1, b1)-approximation algorithm on each block.
Buffer B1 will be repeatedly reused for this task, and after each application of the approximation algorithm, the outputted set of (at most) ka1 centers will be added to B2. When B2 is filled, we will run the (a2, b2)-approximation algorithm on its contents and add the ka2 outputted centers to B3. This will continue until buffer Br fills, and the (ar, br)-approximation algorithm outputs the final ark centers. Let ti denote the number of times the i-th level algorithm is executed. Clearly we have t_i k a_i = M_{i+1} t_{i+1} for i = 1, . . . , r − 1. For the last stage we have t_r = 1, which means that t_{r−1} = M_r/(k a_{r−1}), t_{r−2} = M_{r−1} M_r/(k² a_{r−2} a_{r−1}), and generally t_i = M_{i+1} · · · M_r/(k^{r−i} a_i · · · a_{r−1}).⁷ But we must also have t_1 = n/M_1, implying n = M_1 · · · M_r/(k^{r−1} a_1 · · · a_{r−1}). In order to minimize the total memory Σ_i M_i under the last constraint, by standard arguments in multivariate analysis we must have M_1 = · · · = M_r, or in other words M_i = (n k^{r−1} a_1 · · · a_{r−1})^{1/r} ≤ n^{1/r} k (a_1 · · · a_{r−1})^{1/r} for all i. The resulting one-pass algorithm will have an approximation guarantee of (a_r, c^{r−1} b_1 · · · b_r) (using a straightforward extension of the result in the previous section) and a memory requirement of at most r n^{1/r} k (a_1 · · · a_{r−1})^{1/r}.

Assume now that we are in the realistic setting in which the available memory is of fixed size M ≥ k. We will choose r (below), and for each i = 1, . . . , r − 1 we choose to run either k-means++ or the repeated k-means# (algorithm A in the previous subsection), i.e., (a_i, b_i) = (1, O(log k)) or (3 log k, O(1)) for each i. For i = r, we choose k-means++, i.e., (a_r, b_r) = (1, O(log k)) (we are interested in outputting exactly k centers as the final solution).
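The buffer-size balancing above can be checked numerically. The following is a minimal sketch (the function names are ours, not from the paper): it computes the balanced buffer size M_i = (n k^{r−1} a_1···a_{r−1})^{1/r} and unwinds the recurrence t_i k a_i = M t_{i+1} to recover the per-level run counts.

```python
import math

def level_buffer_size(n, k, a):
    """Balanced buffer size M_1 = ... = M_r = (n * k^(r-1) * a_1...a_{r-1})^(1/r),
    where a = [a_1, ..., a_{r-1}] are the center blow-ups of the first r-1 levels."""
    r = len(a) + 1
    return (n * k ** (r - 1) * math.prod(a)) ** (1.0 / r)

def level_run_counts(n, k, a, M):
    """t_i = number of times the i-th level algorithm runs, from t_i*k*a_i = M*t_{i+1}."""
    r = len(a) + 1
    t = [0.0] * r
    t[r - 1] = 1.0                      # the last level runs exactly once
    for i in range(r - 2, -1, -1):      # unwind: t_i = M * t_{i+1} / (k * a_i)
        t[i] = M * t[i + 1] / (k * a[i])
    return t

# Example: n = 10^6 points, k = 25, two k-means#-style levels (a_i = 3 log k)
# followed by a final k-means++ level (a_r = 1), so r = 3.
n, k = 10 ** 6, 25
a = [3 * math.log(k)] * 2
M = level_buffer_size(n, k, a)
t = level_run_counts(n, k, a, M)
# Consistency check: the top level must consume the whole stream, t_1 = n / M_1.
assert abs(t[0] - n / M) < 1e-9 * (n / M)
```

The assertion mirrors the constraint t_1 = n/M_1 used in the text to pin down the M_i.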
Let q denote the number of indices i ∈ [r − 1] such that (a_i, b_i) = (3 log k, O(1)). By the above discussion, the memory is used optimally if M = r n^{1/r} k (3 log k)^{q/r}, in which case the final approximation guarantee will be c̃^{r−1}(log k)^{r−q}, for some global c̃ > 0. We concentrate on the case of M growing polynomially in n, say M = n^α for some α < 1. In this case, the memory optimality constraint implies r = 1/α for n large enough (regardless of the choice of q). This implies that the final approximation guarantee is best if q = r − 1; in other words, we choose the repeated k-means# for levels 1, . . . , r − 1 and k-means++ for level r. Summarizing, we get:

Theorem 3.2. If there is access to memory of size M = n^α for some fixed α > 0, then for sufficiently large n the best application of the multi-level scheme described above is obtained by running r = ⌈1/α⌉ = ⌈log n / log M⌉ levels, and choosing the repeated k-means# for all but the last level, in which k-means++ is chosen. The resulting algorithm is a randomized one-pass streaming approximation to k-means, with an approximation ratio of O(c̃^{r−1} log k), for some global c̃ > 0. The running time of the algorithm is O(dnk² log n log k).

⁷We assume all quotients are integers for simplicity of the proof, but note that fractional blocks would arise in practice.

Figure 1: Cost vs. k: (a) mixture-of-Gaussians simulation, (b) Cloud data, (c) Spam data. The plots compare Batch Lloyd's, Online Lloyd's, divide and conquer with k-means# and k-means++, and divide and conquer with k-means++.

We should compare the above multi-level streaming algorithm with the state-of-the-art (in terms of the memory vs. approximation tradeoff) streaming algorithm for the k-median problem. Charikar, O'Callaghan, and Panigrahy [CCP03] give a one-pass streaming algorithm for the k-median problem which gives a constant factor approximation and uses O(k · polylog(n)) memory. The main problem with this algorithm from a practical point of view is that the average processing time per item is large: it is proportional to the amount of memory used, which is poly-logarithmic in n. This might be undesirable in practical scenarios where we need to process a data item quickly when it arrives. In contrast, the average per-item processing time using the divide-and-conquer strategy is constant, and furthermore the algorithm can be pipelined (i.e., data items can be temporarily stored in a memory buffer and quickly processed before the next memory buffer is filled). So, even if [CCP03] can be extended to the k-means setting, streaming algorithms based on the divide-and-conquer strategy would be more interesting from a practical point of view.

4 Experiments

Datasets. In our discussion, n denotes the number of points in the data, d denotes the dimension, and k denotes the number of clusters.
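The divide-and-conquer methods evaluated in this section are built from the D²-weighted seeding procedures of Section 2. A minimal unweighted sketch in Python (the function names are ours; points are 2-D tuples for brevity): with centers_per_round=1 the loop is the k-means++ seeding step, and with centers_per_round=ceil(3 log k) it is k-means#, since each round samples all of its centers from the D² distribution induced by the previous rounds only.

```python
import math
import random

def d2_seed(points, k, centers_per_round=1):
    """D^2-weighted seeding: centers_per_round=1 gives the k-means++ seeding
    step; centers_per_round=ceil(3*log(k)) gives k-means#."""
    # Round 0: pick the initial centers uniformly at random.
    centers = [random.choice(points) for _ in range(centers_per_round)]
    for _ in range(k - 1):
        # Squared distance of each point to its nearest center from previous rounds.
        d2 = [min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
              for px, py in points]
        total = sum(d2)
        for _ in range(centers_per_round):
            if total == 0:                       # every point already coincides with a center
                centers.append(random.choice(points))
            else:
                centers.append(points[weighted_index(d2, total)])
    return centers

def weighted_index(weights, total):
    """Sample an index with probability proportional to weights[i]."""
    r = random.uniform(0, total)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= r:
            return i
    return len(weights) - 1

def cost(points, centers):
    """k-means potential: sum of squared distances to the nearest center."""
    return sum(min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
               for px, py in points)
```

For example, d2_seed(points, k) returns k centers, and d2_seed(points, k, math.ceil(3 * math.log(k))) returns k·⌈3 log k⌉ centers, whose quality can be compared with cost(points, centers).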
Our first evaluation, detailed in Tables 1a)-c) and Figure 1, compares our algorithms on the following data: (1) norm25 is synthetic data generated in the following manner: we choose 25 random vertices from a 15-dimensional hypercube of side length 500, and then add 400 Gaussian random points (with variance 1) around each of these points.⁸ So, for this data, n = 10,000 and d = 15. The optimum cost for k = 25 is 1.5026 × 10⁵. (2) The UCI Cloud dataset consists of cloud cover data [AN07]; here n = 1024 and d = 10. (3) The UCI Spambase dataset is data for an e-mail spam detection task [AN07]; here n = 4601 and d = 58.

To compare against a baseline method known to be used in practice, we used Lloyd's algorithm, commonly referred to as the k-means algorithm. Standard Lloyd's algorithm operates in the batch setting, which is an easier problem than the one-pass streaming setting, so we ran experiments with this algorithm to form a baseline. We also compare to an online version of Lloyd's algorithm; however, its performance is worse than that of the batch version and of our methods, for all problems, so we do not include it in our plots for the real data sets.⁹

⁸Testing clustering algorithms on this simulation distribution was inspired by [AV07].

Table 1: Columns 2-5 report the clustering cost and columns 6-9 the running time in seconds, for k ∈ {5, 10, 15, 20, 25}: a) norm25 dataset, b) Cloud dataset, c) Spambase dataset.

Tables 1a)-c) show the average k-means cost (over 10 random restarts for the randomized algorithms: all but Online Lloyd's) for these algorithms:
(1) BL: Batch Lloyd's, initialized with random centers in the input data and run to convergence.¹⁰
(2) OL: Online Lloyd's.
(3) DC-1: The simple 1-stage divide-and-conquer algorithm of Section 3.2.
(4) DC-2: The simple 1-stage divide-and-conquer Algorithm 3 of Section 3.1, where the sub-algorithms used are A = "run k-means++ 3 · log n times and pick the best clustering," and A′ is k-means++. In our context, k-means++ and k-means# denote only the seeding step, not followed by Lloyd's algorithm.

In all problems, our streaming methods achieve much lower cost than Online Lloyd's for all settings of k, and lower cost than Batch Lloyd's for most settings of k (including the correct k = 25 in norm25). The gains with respect to batch are noteworthy, since the batch problem is less constrained than the one-pass streaming problem. The performance of DC-1 and DC-2 is comparable.

a) Cloud dataset, k = 10:
Memory/#levels   Cost          Time (s)
1024/0           8.74 × 10⁶    5.5
480/1            8.59 × 10⁶    3.6
360/2            8.61 × 10⁶    3.8

b) Subset of the norm25 dataset, n = 2048, k = 25:
Memory/#levels   Cost          Time (s)
2048/0           5.78 × 10⁴    30
1250/1           5.36 × 10⁴    25
1125/2           5.15 × 10⁴    26

c) Spambase dataset, k = 10:
Memory/#levels   Cost          Time (s)
4601/0           1.06 × 10⁸    34
880/1            0.99 × 10⁸    20
600/2            1.03 × 10⁸    19.5

Table 2: Multi-level hierarchy evaluation: a) Cloud dataset, k = 10; b) a subset of the norm25 dataset, n = 2048, k = 25; c) Spambase dataset, k = 10. The memory size decreases as the number of levels of the hierarchy increases. (0 levels means running batch k-means++ on the data.)

Table 2 shows an evaluation of the one-pass multi-level hierarchical algorithm of Section 3.3 on the different datasets, simulating different memory restrictions. Although our worst-case theoretical results imply a clustering cost that is exponential in the number of levels, our results show a far more optimistic outcome, in which adding levels (and limiting memory) actually improves the outcome. We conjecture that our data contains enough information for clustering even on chunks that fit in small buffers, and therefore the results may reflect the benefit of the hierarchical implementation.

Acknowledgements.
We thank Sanjoy Dasgupta for suggesting the study of approximation algorithms for k-means in the streaming setting, for excellent lecture notes, and for helpful discussions.

[9] Despite the poor performance we observed, this algorithm is apparently used in practice; see [Das08].
[10] We measured convergence as a change in cost of less than 1.

References

[ADK09] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan: Adaptive sampling for k-means clustering. APPROX, 2009.
[AL09] Nir Ailon and Edo Liberty: Correlation clustering revisited: The "true" cost of error minimization problems. To appear in ICALP, 2009.
[AMS96] Noga Alon, Yossi Matias, and Mario Szegedy: The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pages 20-29, 1996.
[AV06] David Arthur and Sergei Vassilvitskii: Worst-case and smoothed analyses of the ICP algorithm, with an application to the k-means method. FOCS, 2006.
[AV07] David Arthur and Sergei Vassilvitskii: k-means++: the advantages of careful seeding. SODA, 2007.
[AGKM+04] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit: Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544-562, 2004.
[AN07] A. Asuncion and D.J. Newman: UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, University of California, Irvine, School of Information and Computer Sciences, 2007.
[BBG09] Maria-Florina Balcan, Avrim Blum, and Anupam Gupta: Approximate clustering without the approximation. SODA, 2009.
[BL08] S. Ben-David and U. von Luxburg: Relating clustering stability to properties of cluster boundaries. COLT, 2008.
[CCP03] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy: Better streaming algorithms for clustering problems. STOC, 2003.
[CG99] Moses Charikar and Sudipto Guha: Improved combinatorial algorithms for the facility location and k-median problems. FOCS, 1999.
[CMTS02] M. Charikar, S. Guha, E. Tardos, and D. Shmoys: A constant factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 2002.
[CR08] Kamalika Chaudhuri and Satish Rao: Learning mixtures of product distributions using correlations and independence. COLT, 2008.
[Das08] Sanjoy Dasgupta: Course notes, CSE 291: Topics in unsupervised learning. http://www-cse.ucsd.edu/~dasgupta/291/index.html, University of California, San Diego, Spring 2008.
[Gon85] T. F. Gonzalez: Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, pages 293-306, 1985.
[GMMM+03] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan: Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003.
[Ind99] Piotr Indyk: Sublinear time algorithms for metric space problems. STOC, 1999.
[JV01] K. Jain and V. Vazirani: Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM, 2001.
[KMNP+04] T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu: A local search approximation algorithm for k-means clustering. Computational Geometry: Theory and Applications, 28:89-112, 2004.
[LV92] J. Lin and J. S. Vitter: Approximation algorithms for geometric median problems. Information Processing Letters, 1992.
[McG07] Andrew McGregor: Processing Data Streams. Ph.D. thesis, Computer and Information Science, University of Pennsylvania, 2007.
[M05] S. Muthukrishnan: Data Streams: Algorithms and Applications. NOW Publishers, Inc., Hanover, MA, 2005.
[ORSS06] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy: The effectiveness of Lloyd-type methods for the k-means problem. FOCS, 2006.
[VW02] S. Vempala and G. Wang: A spectral algorithm for learning mixtures of distributions. FOCS, pages 113-123, 2002.