{"title": "Incremental Clustering: The Case for Extra Clusters", "book": "Advances in Neural Information Processing Systems", "page_first": 307, "page_last": 315, "abstract": "The explosion in the amount of data available for analysis often necessitates a transition from batch to incremental clustering methods, which process one element at a time and typically store only a small subset of the data. In this paper, we initiate the formal analysis of incremental clustering methods focusing on the types of cluster structure that they are able to detect. We find that the incremental setting is strictly weaker than the batch model, proving that a fundamental class of cluster structures that can readily be detected in the batch setting is impossible to identify using any incremental method. Furthermore, we show how the limitations of incremental clustering can be overcome by allowing additional clusters.", "full_text": "Incremental Clustering: The Case for Extra Clusters\n\nMargareta Ackerman\nFlorida State University\n600 W College Ave, Tallahassee, FL 32306\nmackerman@fsu.edu\n\nSanjoy Dasgupta\nUC San Diego\n9500 Gilman Dr, La Jolla, CA 92093\ndasgupta@eng.ucsd.edu\n\nAbstract\n\nThe explosion in the amount of data available for analysis often necessitates a transition from batch to incremental clustering methods, which process one element at a time and typically store only a small subset of the data. In this paper, we initiate the formal analysis of incremental clustering methods, focusing on the types of cluster structure that they are able to detect. We find that the incremental setting is strictly weaker than the batch model, proving that a fundamental class of cluster structures that can readily be detected in the batch setting is impossible to identify using any incremental method. 
Furthermore, we show how the limitations of incremental clustering can be overcome by allowing additional clusters.\n\n1 Introduction\n\nClustering is a fundamental form of data analysis that is applied in a wide variety of domains, from astronomy to zoology. With the radical increase in the amount of data collected in recent years, the use of clustering has expanded even further, to applications such as personalization and targeted advertising. Clustering is now a core component of interactive systems that collect information on millions of users on a daily basis. It is becoming impractical to store all relevant information in memory at the same time, often necessitating the transition to incremental methods.\n\nIncremental methods receive data elements one at a time and typically use much less space than is needed to store the complete data set. This presents a particularly interesting challenge for unsupervised learning, which, unlike its supervised counterpart, also suffers from the absence of a unique target truth. Observe that not all data possesses a meaningful clustering, and when an inherent structure exists, it need not be unique (see Figure 1 for an example). As such, different users may be interested in very different partitions. Consequently, different clustering methods detect distinct types of structure, often yielding radically different results on the same data. Until now, differences in the input-output behaviour of clustering methods have only been studied in the batch setting [13, 14, 8, 4, 3, 5, 2, 20]. In this work, we take a first look at the types of cluster structures that can be discovered by incremental clustering methods.\n\nTo qualify the type of cluster structure present in data, a number of notions of clusterability have been proposed (for a detailed discussion, see [1] and [8]). These notions capture the structure of the target clustering: the clustering desired by the user for a specific application. 
As such, notions of clusterability facilitate the analysis of clustering methods by making it possible to formally ascertain whether an algorithm correctly recovers the desired partition.\n\nOne elegant notion of clusterability, introduced by Balcan et al. [8], requires that every element be closer to data in its own cluster than to other points. For simplicity, we will refer to clusterings that adhere to this requirement as nice. It was shown by [8] that such clusterings are readily detected offline by classical batch algorithms. On the other hand, we prove (Theorem 3.8) that no incremental method can discover these partitions. Thus, batch algorithms are significantly stronger than incremental methods in their ability to detect cluster structure.\n\nFigure 1: An example of different cluster structures in the same data. The clustering on the left finds inherent structure in the data by identifying well-separated partitions, while the clustering on the right discovers structure in the data by focusing on the dense region. The correct partitioning depends on the application at hand.\n\nIn an effort to identify types of cluster structure that incremental methods can recover, we turn to stricter notions of clusterability. A notion used by Epter et al. [10] requires that the minimum separation between clusters be larger than the maximum cluster diameter. We call such clusterings perfect, and we present an incremental method that is able to recover them (Theorem 4.3).\n\nYet, this result alone is unsatisfactory. If, indeed, it were necessary to resort to such strict notions of clusterability, then incremental methods would have limited utility. 
Is there some other way to circumvent the limitations of incremental techniques?\n\nIt turns out that incremental methods become a lot more powerful when we slightly alter the clustering problem: if, instead of asking for exactly the target partition, we are satisfied with a refinement, that is, a partition each of whose clusters is contained within some target cluster. Indeed, in many applications, it is reasonable to allow additional clusters.\n\nIncremental methods benefit from additional clusters in several ways. First, we exhibit an algorithm that is able to capture nice k-clusterings if it is allowed to return a refinement with 2^{k-1} clusters (Theorem 5.3), which could be reasonable for small k. We also show that this exponential dependence on k is unavoidable in general (Theorem 5.4). As such, allowing additional clusters enables incremental techniques to overcome their inability to detect nice partitions.\n\nA similar phenomenon is observed in the analysis of the sequential k-means algorithm, one of the most popular methods of incremental clustering. We show that it is unable to detect perfect clusterings (Theorem 4.4), but that if each cluster contains a significant fraction of the data, then it can recover a refinement of (a slight variant of) nice clusterings (Theorem 5.6).\n\nLastly, we demonstrate the power of additional clusters by relaxing the niceness condition, requiring only that clusters have a significant core (defined in Section 5.3). Under this milder requirement, we show that a randomized incremental method is able to discover a refinement of the target partition (Theorem 5.10).\n\nDue to space limitations, many proofs appear in the supplementary material.\n\n2 Definitions\n\nWe consider a space X equipped with a symmetric distance function d : X × X → R+ satisfying d(x, x) = 0. An example is X = R^p with d(x, x′) = ‖x − x′‖2. 
It is assumed that a clustering algorithm can invoke d(·, ·) on any pair x, x′ ∈ X.\n\nA clustering (or, partition) of X is a set of clusters C = {C1, . . . , Ck} such that Ci ∩ Cj = ∅ for all i ≠ j, and X = ∪_{i=1}^k Ci. A k-clustering is a clustering with k clusters. Write x ∼C y if x, y are both in some cluster Cj; and x ≁C y otherwise. This is an equivalence relation.\n\nDefinition 2.1. An incremental clustering algorithm has the following structure:\n\n    for n = 1, . . . , N:\n        See data point xn ∈ X\n        Select model Mn ∈ M\n\nwhere N might be ∞, and M is a collection of clusterings of X. We require the algorithm to have bounded memory, typically a function of the number of clusters. As a result, an incremental algorithm cannot store all data points.\n\nNotice that the ordering of the points is unspecified. In our results, we consider two types of ordering: arbitrary ordering, which is the standard setting in online learning and allows points to be ordered by an adversary, and random ordering, which is standard in statistical learning theory. In exemplar-based clustering, M = X^k: each model is a list of k \u201ccenters\u201d (t1, . . . , tk) that induce a clustering of X, where every x ∈ X is assigned to the cluster Ci for which d(x, ti) is smallest (breaking ties by picking the smallest i). All the clusterings we will consider in this paper will be specified in this manner.\n\nWe also note that the incremental clustering model is closely related to streaming clustering [6, 11], the primary difference being that in the latter framework multiple passes of the data are allowed.\n\n2.1 Examples of incremental clustering algorithms\n\nThe most well-known incremental clustering algorithm is probably sequential k-means, which is meant for data in Euclidean space. It is an incremental variant of Lloyd's algorithm [17, 18]:\n\nAlgorithm 2.2. 
Sequential k-means.\n\n    Set T = (t1, . . . , tk) to the first k data points\n    Initialize the counts n1, n2, . . . , nk to 1\n    Repeat:\n        Acquire the next example, x\n        If ti is the closest center to x:\n            Increment ni\n            Replace ti by ti + (1/ni)(x − ti)\n\nThis method, and many variants of it, have been studied intensively in the literature on self-organizing maps [16]. It attempts to find centers T that optimize the k-means cost function:\n\n    cost(T) = ∑_{data x} min_{t ∈ T} ‖x − t‖^2.\n\nIt is not hard to see that the solution obtained by sequential k-means at any given time can have cost far from optimal; we will see an even stronger lower bound in Theorem 4.4. Nonetheless, we will also see that if additional centers are allowed, this algorithm is able to correctly capture some fundamental types of cluster structure.\n\nAnother family of clustering algorithms with incremental variants are agglomerative procedures [13] like single-linkage [12]. Given n data points in batch mode, these algorithms produce a hierarchical clustering on all n points. But the hierarchy can be truncated at the intermediate k-clustering, yielding a tree with k leaves. Moreover, there is a natural scheme for updating these leaves incrementally:\n\nAlgorithm 2.3. Sequential agglomerative clustering.\n\n    Set T to the first k data points\n    Repeat:\n        Get the next point x and add it to T\n        Select t, t′ ∈ T for which dist(t, t′) is smallest\n        Replace t, t′ by the single center merge(t, t′)\n\nHere the two functions dist and merge can be varied to optimize different clustering criteria, and often require storing additional sufficient statistics, such as counts of individual clusters. For instance, Ward's method of average linkage [19] is geared towards the k-means cost function. 
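As a concrete illustration, the sequential agglomerative template of Algorithm 2.3 can be sketched in Python. The particular choices of dist and merge below (absolute difference on the real line, and keeping the first of the two merged points) are illustrative assumptions for the sketch, not prescribed by the paper:

```python
# A minimal sketch of sequential agglomerative clustering (Algorithm 2.3).
# dist and merge are parameters; the concrete choices in the usage example
# below are assumptions made for illustration.

def sequential_agglomerative(stream, k, dist, merge):
    """Maintain k centers; on each new point, merge the closest pair."""
    stream = iter(stream)
    centers = [next(stream) for _ in range(k)]  # first k points become centers
    for x in stream:
        centers.append(x)
        # find the closest pair among the k+1 current centers
        i, j = min(
            ((a, b) for a in range(len(centers)) for b in range(a + 1, len(centers))),
            key=lambda ab: dist(centers[ab[0]], centers[ab[1]]),
        )
        t, t2 = centers[i], centers[j]
        # remove the pair and replace it by a single merged center
        centers = [c for idx, c in enumerate(centers) if idx not in (i, j)]
        centers.append(merge(t, t2))
    return centers

# Usage on the real line: dist(t, t') = |t - t'|, merge keeps the first point
# (the nearest-neighbour variant discussed below).
centers = sequential_agglomerative(
    [0.0, 0.1, 5.0, 5.2, 9.9, 0.05, 10.0], k=3,
    dist=lambda a, b: abs(a - b), merge=lambda a, b: a,
)
```

Note that each update costs O(k^2) distance evaluations in this naive form, and only the k centers (plus any sufficient statistics) are stored, matching the bounded-memory requirement of Definition 2.1.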
We will consider the variant obtained by setting dist(t, t′) = d(t, t′) and merge(t, t′) to either t or t′:\n\nAlgorithm 2.4. Sequential nearest-neighbour clustering.\n\n    Set T to the first k data points\n    Repeat:\n        Get the next point x and add it to T\n        Let t, t′ be the two closest points in T\n        Replace t, t′ by either of these two points\n\nThe above algorithm was proposed by Ben-David and Reyzin [9]. We will see that it is effective at picking out a large class of cluster structures.\n\n2.2 The target clustering\n\nUnlike supervised learning tasks, which are typically endowed with a unique correct classification, clustering is ambiguous. One approach to disambiguating clustering is identifying an objective function, such as k-means, and then defining the clustering task as finding the partition with minimum cost. Although there are situations to which this approach is well-suited, many clustering applications do not inherently lend themselves to any specific objective function. As such, while objective functions play an essential role in deriving clustering methods, they do not circumvent the ambiguous nature of clustering.\n\nThe term target clustering denotes the partition that a specific user is looking for in a data set. This notion was used by Balcan et al. [8] to study what constraints on cluster structure make them efficiently identifiable in a batch setting. In this paper, we consider families of target clusterings that satisfy different properties, and ask whether incremental algorithms can identify such clusterings.\n\nThe target clustering C is defined on a possibly infinite space X, from which the learner receives a sequence of points. At any time n, the learner has seen n data points and has some clustering that ideally agrees with C on these points. 
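Recall from Section 2 that a finite list of exemplars induces a partition by nearest-center assignment, with ties broken toward the smallest index. A minimal sketch of this induced-clustering map (the distance function on the line in the usage example is an illustrative assumption):

```python
# Sketch of how a list of exemplars (centers) induces a clustering: each point
# joins the cluster of its closest center, ties broken by the smallest index.

def induced_clustering(points, centers, d):
    clusters = [[] for _ in centers]
    for x in points:
        # Python's min keeps the earliest index on ties, which matches the
        # tie-breaking rule stated in Section 2
        i = min(range(len(centers)), key=lambda j: d(x, centers[j]))
        clusters[i].append(x)
    return clusters

# Usage on the real line, with two centers:
clusters = induced_clustering(
    points=[1.0, 2.0, 4.0, 5.0], centers=[1.5, 4.5], d=lambda a, b: abs(a - b),
)
# clusters[0] holds the points nearest the first center, clusters[1] the rest
```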
The methods we consider are exemplar-based: they all specify a list of points T in X that induce a clustering of X (recall the discussion just before Section 2.1). We consider two requirements:\n\n• (Strong) T induces the target clustering C.\n• (Weaker) T induces a refinement of the target clustering C: that is, each cluster induced by T is part of some cluster of C.\n\nIf the learning algorithm is run on a finite data set, then we require these conditions to hold once all points have been seen. In our positive results, we will also consider infinite streams of data, and show that these conditions hold at every time n, taking the target clustering restricted to the points seen so far.\n\n3 A basic limitation of incremental clustering\n\nWe begin by studying limitations of incremental clustering compared with the batch setting.\n\nOne of the most fundamental types of cluster structure is what we shall call nice clusterings for the sake of brevity. Originally introduced by Balcan et al. [8] under the name \u201cstrict separation,\u201d this notion has since been applied in [2], [1], and [7], to name a few.\n\nDefinition 3.1 (Nice clustering). A clustering C of (X, d) is nice if for all x, y, z ∈ X, d(y, x) < d(z, x) whenever x ∼C y and x ≁C z.\n\nSee Figure 2 for an example.\n\nObservation 3.2. If we select one point from every cluster of a nice clustering C, the resulting set induces C. (Moreover, niceness is the minimal property under which this holds.)\n\nA nice k-clustering is not, in general, unique. For example, consider X = {1, 2, 4, 5} on the real line under the usual distance metric; then both {{1}, {2}, {4, 5}} and {{1, 2}, {4}, {5}} are nice 3-clusterings of X. 
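Definition 3.1 is easy to verify by brute force on small instances; the sketch below (clusters represented as lists of points, with d as a parameter) confirms that both 3-clusterings in the example above are nice, while a third partition of the same points is not:

```python
def is_nice(clustering, d):
    """Brute-force check of Definition 3.1:
    d(y, x) < d(z, x) whenever x ~ y (same cluster) and x !~ z (different)."""
    for ci_idx, Ci in enumerate(clustering):
        for x in Ci:
            for y in Ci:                      # x ~ y
                if y == x:
                    continue
                for cj_idx, Cj in enumerate(clustering):
                    if cj_idx == ci_idx:
                        continue
                    for z in Cj:              # x !~ z
                        if not d(y, x) < d(z, x):
                            return False
    return True

d = lambda a, b: abs(a - b)
# Both nice 3-clusterings of X = {1, 2, 4, 5} from the text:
assert is_nice([[1], [2], [4, 5]], d)
assert is_nice([[1, 2], [4], [5]], d)
# ...whereas, e.g., {{1, 2, 4}, {5}} is not nice (d(1, 4) = 3 > d(5, 4) = 1):
assert not is_nice([[1, 2, 4], [5]], d)
```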
Thus we start by considering data with a unique nice k-clustering.\n\nFigure 2: A nice clustering may include clusters with very different diameters, as long as the distance between any two clusters scales as the larger diameter of the two.\n\nSince niceness is a strong requirement, we might expect that it is easy to detect. Indeed, in the batch setting, a unique nice k-clustering can be recovered by single-linkage [8]. However, we show that nice partitions cannot be detected in the incremental setting, even if they are unique.\n\nWe start by formalizing the ordering of the data. An ordering function O takes a finite set X and returns an ordering of the points in this set. An ordered distance space is denoted by (O[X], d).\n\nDefinition 3.3. An incremental clustering algorithm A is nice-detecting if, given a positive integer k and (X, d) that has a unique nice k-clustering C, the procedure A(O[X], d, k) outputs C for any ordering function O.\n\nIn this section, we show (Theorem 3.8) that no deterministic memory-bounded incremental method is nice-detecting, even for points in Euclidean space under the ℓ2 metric.\n\nWe start with the intuition behind the proof. Fix any incremental clustering algorithm and set the number of clusters to 3. We will specify a data set D with a unique nice 3-clustering that this algorithm cannot detect. The data set has two subsets, D1 and D2, that are far away from each other but are otherwise nearly isomorphic. The target 3-clustering is either (D1, together with a 2-clustering of D2) or (D2, together with a 2-clustering of D1).\n\nThe central piece of the construction is the configuration of D1 (and likewise, D2). The first point presented to the learner is xo. This is followed by a clique of points xi that are equidistant from each other and have the same, slightly larger, distance to xo. For instance, we could set distances within the clique d(xi, xj) to 1, and distances d(xi, xo) to 2. 
Finally there is a point x′ that is either exactly like one of the xi's (same distances), or differs from them in just one specific distance d(x′, xj), which is set to 2. In the former case, there is a nice 2-clustering of D1, in which one cluster is xo and the other cluster is everything else. In the latter case, there is no nice 2-clustering, just the 1-clustering consisting of all of D1.\n\nD2 is like D1, but is rigged so that if D1 has a nice 2-clustering, then D2 does not; and vice versa.\n\nThe two possibilities for D1 are almost identical, and it would seem that the only way an algorithm can distinguish between them is by remembering all the points it has seen. A memory-bounded incremental learner does not have this luxury. Formalizing this argument requires some care; we cannot, for instance, assume that the learner is using its memory to store individual points.\n\nIn order to specify D1, we start with a larger collection of points that we call an M-configuration, and that is independent of any algorithm. We then pick two possibilities for D1 (one with a nice 2-clustering and one without) from this collection, based on the specific learner.\n\nDefinition 3.4. In any metric space (X, d), for any integer M > 0, define an M-configuration to be a collection of 2M + 1 points xo, x1, . . . , xM, x′1, . . . , x′M ∈ X such that\n\n• All interpoint distances are in the range [1, 2].\n• d(xo, xi), d(xo, x′i) ∈ (3/2, 2] for all i ≥ 1.\n• d(xi, xj), d(x′i, x′j), d(xi, x′j) ∈ [1, 3/2] for all i ≠ j ≥ 1.\n• d(xi, x′i) > d(xo, xi).\n\nThe significance of this point configuration is as follows.\n\nLemma 3.5. Let xo, x1, . . . , xM, x′1, . . . , x′M be any M-configuration in (X, d). Pick any index 1 ≤ j ≤ M and any subset S ⊂ [M] with |S| > 1. Then the set A = {xo, x′j} ∪ {xi : i ∈ S} has a nice 2-clustering if and only if j ∉ S.\n\nProof. Suppose A has a nice 2-clustering {C1, C2}, where C1 is the cluster that contains xo.\n\nWe first show that C1 is a singleton cluster. If C1 also contains some xℓ, then it must contain all the points {xi : i ∈ S} by niceness, since d(xℓ, xi) ≤ 3/2 < d(xℓ, xo). Since |S| > 1, these points include some xi with i ≠ j. Whereupon C1 must also contain x′j, since d(xi, x′j) ≤ 3/2 < d(xi, xo). But this means C2 is empty.\n\nLikewise, if C1 contains x′j, then it also contains all {xi : i ∈ S, i ≠ j}, since d(xi, x′j) < d(xo, x′j). There is at least one such xi, and we revert to the previous case.\n\nTherefore C1 = {xo} and, as a result, C2 = {xi : i ∈ S} ∪ {x′j}. This 2-clustering is nice if and only if d(xo, x′j) > d(xi, x′j) and d(xo, xi) > d(x′j, xi) for all i ∈ S, which in turn is true if and only if j ∉ S.\n\nBy putting together two M-configurations, we obtain:\n\nTheorem 3.6. Let (X, d) be any metric space that contains two M-configurations separated by a distance of at least 4. Then, there is no deterministic incremental algorithm with ≤ M/2 bits of storage that is guaranteed to recover nice 3-clusterings of data sets drawn from X, even when limited to instances in which such clusterings are unique.\n\nProof. Suppose the deterministic incremental learner has a memory capacity of b bits. We will refer to the memory contents of the learner as its state, σ ∈ {0, 1}^b.\n\nCall the two M-configurations xo, x1, . . . , xM, x′1, . . . , x′M and zo, z1, . . . , zM, z′1, . . . , z′M. 
We\n\nBatch 1:\nBatch 2:\nBatch 3:\nBatch 4:\n\nxo and zo\nb distinct points from x1, . . . , xM\nb distinct points from z1, . . . , zM\nTwo \ufb01nal points x(cid:48)\n\nj1 and z(cid:48)\n\nj2\n\nb\n\n(cid:1) > (M/b)b. If M \u2265 2b, this is > 2b, which\n\nThe number of distinct sets of b points in batch 2 is(cid:0)M\n\nThe learner\u2019s state after seeing batch 2 can be described by a function f : {x1, . . . , xM}b \u2192 {0, 1}b.\nmeans that two different sets of points must lead to the same state, call it \u03c3 \u2208 {0, 1}b. Let the indices\nof these sets be S1, S2 \u2282 [M ] (so |S1| = |S2| = b), and pick any j1 \u2208 S1 \\ S2.\nNext, suppose the learner is in state \u03c3 and is then given batch 3. We can capture its state at the end\nof this batch by a function g : {z1, . . . , zM}b \u2192 {0, 1}b, and once again there must be distinct sets\nT1, T2 \u2282 [M ] that yield the same state \u03c3(cid:48). Pick any j2 \u2208 T1 \\ T2.\nIt follows that the sequences of inputs xo, zo, (xi : i \u2208 S1), (zi : i \u2208 T2), x(cid:48)\nj2 and xo, zo, (xi :\ni \u2208 S2), (zi : i \u2208 T1), x(cid:48)\nj2 produce the same \ufb01nal state and thus the same answer. But in the \ufb01rst\ncase, by Lemma 3.5, the unique nice 3-clustering keeps the x\u2019s together and splits the z\u2019s, whereas\nin the second case, it splits the x\u2019s and keeps the z\u2019s together.\n\n, z(cid:48)\n\n, z(cid:48)\n\nj1\n\nj1\n\nAn M-con\ufb01guration can be realized in Euclidean space:\nLemma 3.7. There is an absolute constant co such that for any dimension p, the Euclidean space\nRp, with L2 norm, contains M-con\ufb01gurations for all M < 2cop.\nThe overall conclusions are the following.\nTheorem 3.8. There is no memory-bounded deterministic nice-detecting incremental clustering\nalgorithm that works in arbitrary metric spaces. 
For data in Rp under the (cid:96)2 metric, there is no\ndeterministic nice-detecting incremental clustering algorithm using less than 2cop\u22121 bits of memory.\n\n6\n\n\f4 A more restricted class of clusterings\n\nThe discovery that nice clusterings cannot be detected using any incremental method, even though\nthey are readily detected in a batch setting, speaks to the substantial limitations of incremental\nalgorithms. We next ask whether there is a well-behaved subclass of nice clusterings that can be\ndetected using incremental methods. Following [10, 2, 5, 1], among others, we consider clusterings\nin which the maximum cluster diameter is smaller than the minimum inter-cluster separation.\nDe\ufb01nition 4.1 (Perfect clustering). A clustering C of (X , d) is perfect if d(x, y) < d(w, z) whenever\nx \u223cC y, w (cid:54)\u223cC z.\nAny perfect clustering is nice. But unlike nice clusterings, perfect clusterings are unique:\nLemma 4.2. For any (X , d) and k, there is at most one perfect k-clustering of (X , d).\nWhenever an algorithm can detect perfect clusterings, we call it perfect-detecting. Formally, an\nincremental clustering algorithm A is perfect-detecting if, given a positive integer k and (X , d) that\nhas a perfect k-clustering, A(O[X ], d, k) outputs that clustering for any ordering function O.\nWe start with an example of a simple perfect-detecting algorithm.\nTheorem 4.3. Sequential nearest-neighbour clustering (Algorithm 2.4) is perfect-detecting.\n\nWe next turn to sequential k-means (Algorithm 2.2), one of the most popular methods for incremen-\ntal clustering. Interestingly, it is unable to detect perfect clusterings.\nIt is not hard to see that a perfect k-clustering is a local optimum of k-means. We will now see an\nexample in which the perfect k-clustering is the global optimum of the k-means cost function, and\nyet sequential k-means fails to detect it.\nTheorem 4.4. 
There is a set of four points in R^3 with a perfect 2-clustering that is also the global optimum of the k-means cost function (for k = 2). However, there is no ordering of these points that will enable this clustering to be detected by sequential k-means.\n\n5 Incremental clustering with extra clusters\n\nReturning to the basic lower bound of Theorem 3.8, it turns out that a slight shift in perspective greatly improves the capabilities of incremental methods. Instead of aiming to exactly discover the target partition, it is sufficient in some applications to merely uncover a refinement of it. Formally, a clustering C of X is a refinement of a clustering C′ of X if x ∼C y implies x ∼C′ y for all x, y ∈ X.\n\nWe start by showing that although incremental algorithms cannot detect nice k-clusterings, they can find a refinement of such a clustering if allowed 2^{k-1} centers. We also show that this is tight.\n\nNext, we explore the utility of additional clusters for sequential k-means. We show that for a random ordering of the data, and with extra centers, this algorithm can recover (a slight variant of) nice clusterings. We also show that the random ordering is necessary for such a result.\n\nFinally, we prove that additional clusters extend the utility of incremental methods beyond nice clusterings. We introduce a weaker constraint on cluster structure, requiring only that each cluster possess a significant \u201ccore\u201d, and we present a scheme that works under this weaker requirement.\n\n5.1 An incremental algorithm can find nice k-clusterings if allowed 2^{k-1} centers\n\nEarlier work [8] has shown that any nice clustering corresponds to a pruning of the tree obtained by single linkage on the points. 
With this insight, we develop an incremental algorithm that maintains 2^{k-1} centers that are guaranteed to induce a refinement of any nice k-clustering.\n\nThe following subroutine takes any finite S ⊂ X and returns at most 2^{k-1} distinct points:\n\nCANDIDATES(S)\n    Run single linkage on S to get a tree\n    Assign each leaf node the corresponding data point\n    Moving bottom-up, assign each internal node the data point in one of its children\n    Return all points at distance < k from the root\n\nLemma 5.1. Suppose S has a nice ℓ-clustering, for ℓ ≤ k. Then the points returned by CANDIDATES(S) include at least one representative from each of these clusters.\n\nHere is an incremental algorithm that uses 2^{k-1} centers to detect a nice k-clustering.\n\nAlgorithm 5.2. Incremental clustering with extra centers.\n\n    T0 = ∅\n    For t = 1, 2, . . .:\n        Receive xt and set Tt = Tt−1 ∪ {xt}\n        If |Tt| > 2^{k-1}: Tt ← CANDIDATES(Tt)\n\nTheorem 5.3. Suppose there is a nice k-clustering C of X. Then for each t, the set Tt has at most 2^{k-1} points, including at least one representative from each Ci for which Ci ∩ {x1, . . . , xt} ≠ ∅.\n\nIt is not possible in general to use fewer centers.\n\nTheorem 5.4. Pick any incremental clustering algorithm that maintains a list of ℓ centers that are guaranteed to be consistent with a target nice k-clustering. Then ℓ ≥ 2^{k-1}.\n\n5.2 Sequential k-means with extra clusters\n\nTheorem 4.4 above shows severe limitations of sequential k-means. The good news is that additional clusters allow this algorithm to find a variant of nice partitionings.\n\nThe following condition imposes structure on the convex hull of the partitions in the target clustering.\n\nDefinition 5.5. A clustering C = {C1, . . . , Ck} is convex-nice if for any i ≠ j, any points x, y in the convex hull of Ci, and any point z in the convex hull of Cj, we have d(y, x) < d(z, x).\n\nTheorem 5.6. Fix a data set (X, d) with a convex-nice clustering C = {C1, . . . , Ck} and let β = min_i |Ci|/|X|. If the points are ordered uniformly at random, then for any ℓ ≥ k, sequential ℓ-means will return a refinement of C with probability at least 1 − k e^{−βℓ}.\n\nThe probability of failure is small when the refinement contains ℓ = Ω((log k)/β) centers. We can also show that this positive result no longer holds when data is adversarially ordered.\n\nTheorem 5.7. Pick any k ≥ 3. Consider any data set X in R (under the usual metric) that has a convex-nice k-clustering C = {C1, . . . , Ck}. Then there exists an ordering of X under which sequential ℓ-means with ℓ ≤ min_i |Ci| centers fails to return a refinement of C.\n\n5.3 A broader class of clusterings\n\nWe conclude by considering a substantial generalization of niceness that can be detected by incremental methods when extra centers are allowed.\n\nDefinition 5.8 (Core). For any clustering C = {C1, . . . , Ck} of (X, d), the core of cluster Ci is the maximal subset C°i ⊂ Ci such that d(x, z) < d(x, y) for all x ∈ Ci, z ∈ C°i, and y ∉ Ci.\n\nIn a nice clustering, the core of any cluster is the entire cluster. We now require only that each core contain a significant fraction of points, and we show that the following simple sampling routine will find a refinement of the target clustering, even if the points are ordered adversarially.\n\nAlgorithm 5.9. Algorithm subsample.\n\n    Set T to the first ℓ elements\n    For t = ℓ + 1, ℓ + 2, . . .:\n        Get a new point xt\n        With probability ℓ/t:\n            Remove an element from T uniformly at random and add xt to T\n\nIt is well-known (see, for instance, [15]) that at any time t, the set T consists of ℓ elements chosen at random without replacement from {x1, . . . , xt}.\n\nTheorem 5.10. Consider any clustering C = {C1, . . . , Ck} of (X, d), with core {C°1, . . . , C°k}. Let β = min_i |C°i|/|X|. Fix any ℓ ≥ k. Then, given any ordering of X, Algorithm 5.9 detects a refinement of C with probability 1 − k e^{−βℓ}.\n\nReferences\n\n[1] M. Ackerman and S. Ben-David. Clusterability: A theoretical study. Proceedings of AISTATS-09, JMLR: W&CP, 5(1-8):53, 2009.\n[2] M. Ackerman, S. Ben-David, S. Branzei, and D. Loker. Weighted clustering. Proc. 26th AAAI Conference on Artificial Intelligence, 2012.\n[3] M. Ackerman, S. Ben-David, and D. Loker. Characterization of linkage-based clustering. COLT, 2010.\n[4] M. Ackerman, S. Ben-David, and D. Loker. Towards property-based classification of clustering paradigms. NIPS, 2010.\n[5] M. Ackerman, S. Ben-David, D. Loker, and S. Sabato. Clustering oligarchies. Proceedings of AISTATS-09, JMLR: W&CP, 31(6674), 2013.\n[6] Charu C. Aggarwal. A survey of stream clustering algorithms, 2013.\n[7] M.-F. Balcan and P. Gupta. Robust hierarchical clustering. In COLT, pages 282–294, 2010.\n[8] M.F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 671–680. ACM, 2008.\n[9] Shalev Ben-David and Lev Reyzin. Data stability in clustering: A closer look. ALT, 2012.\n[10] S. Epter, M. Krishnamoorthy, and M. Zaki. 
Clusterability detection and initial seed selection in large datasets. In The International Conference on Knowledge Discovery in Databases, volume 7, 1999.\n[11] Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 359–366. IEEE, 2000.\n[12] J.A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76(374):388–394, 1981.\n[13] N. Jardine and R. Sibson. Mathematical Taxonomy. London, 1971.\n[14] J. Kleinberg. An impossibility theorem for clustering. Proceedings of International Conferences on Advances in Neural Information Processing Systems, pages 463–470, 2003.\n[15] D.E. Knuth. The Art of Computer Programming: Seminumerical Algorithms, volume 2. 1981.\n[16] T. Kohonen. Self-Organizing Maps. Springer, 2001.\n[17] S.P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.\n[18] J.B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.\n[19] J.H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58:236–244, 1963.\n[20] R.B. Zadeh and S. Ben-David. A uniqueness theorem for clustering. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 639–646. AUAI Press, 2009.", "award": [], "sourceid": 232, "authors": [{"given_name": "Margareta", "family_name": "Ackerman", "institution": "Florida State University"}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": "UC San Diego"}]}