{"title": "Towards Property-Based Classification of Clustering Paradigms", "book": "Advances in Neural Information Processing Systems", "page_first": 10, "page_last": 18, "abstract": "Clustering is a basic data mining task with a wide variety of applications. Not surprisingly, there exist many clustering algorithms. However, clustering is an ill defined problem - given a data set, it is not clear what a \u201ccorrect\u201d clustering for that set is. Indeed, different algorithms may yield dramatically different outputs for the same input sets. Faced with a concrete clustering task, a user needs to choose an appropriate clustering algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Given the crucial effect of the choice of a clustering algorithm on the resulting clustering, this state of affairs is truly regrettable. In this paper we address the major research challenge of developing tools for helping users make more informed decisions when they come to pick a clustering tool for their data. This is, of course, a very ambitious endeavor, and in this paper, we make some first steps towards this goal. We propose to address this problem by distilling abstract properties of the input-output behavior of different clustering paradigms. In this paper, we demonstrate how abstract, intuitive properties of clustering functions can be used to taxonomize a set of popular clustering algorithmic paradigms. On top of addressing deterministic clustering algorithms, we also propose similar properties for randomized algorithms and use them to highlight functional differences between different common implementations of k-means clustering. We also study relationships between the properties, independent of any particular algorithm. 
In particular, we strengthen Kleinberg\u2019s famous impossibility result, while providing a simpler proof.", "full_text": "Towards Property-Based Classification of Clustering Paradigms\n\nMargareta Ackerman, Shai Ben-David, and David Loker\n\nD.R.C. School of Computer Science\n\n{mackerma, shai, dloker}@cs.uwaterloo.ca\n\nUniversity of Waterloo, Canada\n\nAbstract\n\nClustering is a basic data mining task with a wide variety of applications. Not\nsurprisingly, there exist many clustering algorithms. However, clustering is an\nill-defined problem - given a data set, it is not clear what a \u201ccorrect\u201d clustering for\nthat set is. Indeed, different algorithms may yield dramatically different outputs\nfor the same input sets. Faced with a concrete clustering task, a user needs to\nchoose an appropriate clustering algorithm. Currently, such decisions are often\nmade in a very ad hoc, if not completely random, manner. Given the crucial effect\nof the choice of a clustering algorithm on the resulting clustering, this state of\naffairs is truly regrettable. In this paper we address the major research challenge\nof developing tools for helping users make more informed decisions when they\ncome to pick a clustering tool for their data. This is, of course, a very ambitious\nendeavor, and in this paper, we make some first steps towards this goal. We\npropose to address this problem by distilling abstract properties of the input-output\nbehavior of different clustering paradigms.\nIn this paper, we demonstrate how abstract, intuitive properties of clustering\nfunctions can be used to taxonomize a set of popular clustering algorithmic paradigms.\nOn top of addressing deterministic clustering algorithms, we also propose similar\nproperties for randomized algorithms and use them to highlight functional\ndifferences between different common implementations of k-means clustering. 
We also\nstudy relationships between the properties, independent of any particular algo-\nrithm. In particular, we strengthen Kleinberg\u2019s famous impossibility result, while\nproviding a simpler proof.\n\n1\n\nIntroduction\n\nIn spite of the wide use of clustering in many practical applications, currently, there exists no princi-\npled method to guide the selection of a clustering algorithm. Of course, users are aware of the costs\ninvolved in employing different clustering algorithms (software purchasing costs, running times,\nmemory requirements, needs for data preprocessing etc.) but there is very little understanding of\nthe differences in the outcomes that these algorithms may produce. We focus on that aspect - the\ninput-output properties of different clustering algorithms.\nThe choice of an appropriate clustering should, of course, be task dependent. A clustering that\nworks well for one task may be unsuitable for another. Even more than for supervised learning, for\nclustering, the choice of an algorithm must incorporate domain knowledge. While some domain\nknowledge is embedded in the choice of similarity between domain elements (or the embedding of\nthese elements into some Euclidean space), there is still a large variance in the behavior of difference\nclustering paradigms over a \ufb01xed similarity measure.\n\n1\n\n\fFor some clustering tasks, there is a natural clustering objective function that one may wish to op-\ntimize (like k-means for vector quantization coding tasks), but very often the task does not readily\ntranslate into a corresponding objective function. Often users are merely searching for a meaningful\nclustering, without a prior preference for any speci\ufb01c objective function. 
Many (if not most) com-\nmon clustering paradigms do not optimize any clearly de\ufb01ned objective utility, either because no\nsuch objective is de\ufb01ned (like in the case of, say, single linkage clustering) or because optimizing\nthe most relevant objective is computationally infeasible. To overcome computation infeasibility,\nthe algorithms end up carrying out a heuristic whose outcome may be quite different than the actual\nobjective-based optimum (that is the case with the k-means algorithm as well as with spectral clus-\ntering algorithms). What seems to be missing is a clear understanding of the differences in clustering\noutputs in terms of intuitive and usable properties.\nWe propose a different approach to providing guidance to clustering users by identifying signif-\nicant properties of clustering functions that, on one hand distinguish between different clustering\nparadigms, and on the other hand are intended to be relevant to the domain knowledge that a user\nmight have access to. Based on domain expertise users could then choose which properties they\nwant an algorithm to satisfy, and determine which algorithms meet their requirements.\nOur vision is that ultimately, there would be a suf\ufb01ciently rich set of properties that would provide a\ndetailed, property-based, taxonomy of clustering methods, that could, in turn, be used as guidelines\nfor a wide variety of clustering applications. This is a very ambitious enterprize, but that should\nnot deter researchers from addressing it. This paper takes a step towards that goal by using natural\nproperties to examine some popular clustering approaches.\nWe present a taxonomy for common deterministic clustering functions with respect to the proper-\nties that we propose. 
We also show how to extend this framework to the randomized clustering\nalgorithms, and use these properties to distinguish between two k-means heuristics.\nWe also study relationships between the properties, independent of any particular algorithm. In par-\nticular, we strengthen Kleinberg\u2019s impossibility result[8] using a relaxation of one of the properties\nthat he proposed.\n\n1.1 Previous work\n\nOur work follows a theoretical study of clustering that began with Kleinberg\u2019s impossibility result\n[8], in which he proposes three candidate axioms of clustering and shows that no clustering function\ncan simultaneously satisfy these three axioms. Ackerman and Ben-David [1] subsequently showed\nthese axioms to be consistent in the setting of clustering quality measures. [1] also proposes to\nmake a distinction between clustering \u201caxioms\u201d and clustering \u201cproperties\u201d, where the axioms are\nthe features that de\ufb01ne which partitionings are worthy of the name \u201cclustering\u201d, and the properties\nvary between different clustering paradigms and may be used to construct a taxonomy of clustering\nalgorithms. We adopt that approach here.\nThere are previous results that provide some property based characterizations of a speci\ufb01c clus-\ntering algorithm. In 1975, Jardine and Sibson [6] gave a characterization of single linkage. Last\nyear, Bosagh Zadeh and Ben-David [3] characterize single-linkage within Kleinberg\u2019s framework\nof clustering functions using a special invariance property (\u201cpath distance coherence\u201d). Very re-\ncently, Ackerman, Ben-David and Loker provided a characterization of the family of linkage-based\nclustering in terms of a few natural properties [2].\nSome heuristics have been proposed as a means of distinguishing between the output of clustering\nalgorithms on speci\ufb01c data. 
These approaches require running the algorithms, and then selecting\nan algorithm based on the outputs that they produce.\nIn particular, validity criteria can be used\nto evaluate the output of clustering algorithms. These measures can be used to select a clustering\nalgorithm by choosing the one that yields the highest quality clustering [10]. However, the result\nonly applies to the original data, and there are no guarantees on the quality of the output of these\nalgorithms on any other data.\n\n2 Definitions and Formal Framework\n\nClustering is a wide and heterogeneous domain. For most of this paper, we focus on a basic\nsubdomain where the (only) input to the clustering function is a finite set of points endowed with a\nbetween-points distance (or similarity) function, and the output is a partition of that domain.\nA distance function is a symmetric function d : X \u00d7 X \u2192 R+, such that d(x, x) = 0 for all x \u2208 X.\nThe data sets that we consider are pairs (X, d), where X is some finite domain set and d is a distance\nfunction over X. These are the inputs for clustering functions.\nA k-clustering C = {C1, C2, . . . , Ck} of a data set X is a partition of X into k disjoint subsets of\nX (so, \u22c3_i Ci = X). A clustering of X is a k-clustering of X for some 1 \u2264 k \u2264 |X|.\nFor a clustering C, let |C| denote the number of clusters in C and |Ci| denote the number of points\nin a cluster Ci. For x, y \u2208 X and a clustering C of X, we write x \u223cC y if x and y belong to the\nsame cluster in C and x \u2241C y, otherwise.\nWe say that (X, d) and (X\u2032, d\u2032) are isomorphic data sets, denoting it by (X, d) \u223c (X\u2032, d\u2032), if there\nexists a bijection \u03c6 : X \u2192 X\u2032 so that d(x, y) = d\u2032(\u03c6(x), \u03c6(y)) for all x, y \u2208 X.\n
We say that two clusterings (or partitions) C = (c1, . . . , ck) of some domain (X, d) and C\u2032 = (c\u20321, . . . , c\u2032k) of some domain (X\u2032, d\u2032) are isomorphic clusterings, denoted (C, d) \u2245 (C\u2032, d\u2032), if there\nexists a bijection \u03c6 : X \u2192 X\u2032 such that for all x, y \u2208 X, d(x, y) = d\u2032(\u03c6(x), \u03c6(y)) and, on top of\nthat, x \u223cC y if and only if \u03c6(x) \u223cC\u2032 \u03c6(y). Note that this notion depends on both the underlying\ndistance functions and the clusterings.\nWe consider two definitions of a clustering function.\nDefinition 1 (General clustering function). A general clustering function is a function that takes as\ninput a pair (X, d) and outputs a clustering of the domain X.\n\nThe second type consists of clustering functions that require that the number of clusters be provided as part\nof the input.\nDefinition 2 (k-clustering function). A k-clustering function is a function that takes as input a pair\n(X, d) and a parameter 1 \u2264 k \u2264 |X| and outputs a k-clustering of the domain X.\n\n2.1 Properties of Clustering Functions\n\nA key component of our approach is a set of properties of clustering functions that address the input-output\nbehavior of these functions. The properties are formulated for k-clustering functions. 
However,\nall the properties, with the exception of locality (see footnote 1) and refinement-confined, apply also for general\nclustering functions.\nIsomorphism invariance: The following invariance property, proposed in [2] under the name\n\u201crepresentation independence\u201d, seems to be an essential part of our understanding of what clustering is.\nIt requires that the output of a k-clustering function is independent of the labels of the data points.\nA k-clustering function F is isomorphism invariant if whenever (X, d) \u223c (X\u2032, d\u2032), then, for every\nk, F (X, d, k) and F (X\u2032, d\u2032, k) are isomorphic clusterings.\nScale invariance: Scale invariance, proposed by Kleinberg [8], requires that the output of a\nclustering function be invariant to uniform scaling of the data. A k-clustering function F is scale invariant if\nfor any data sets (X, d) and (X, d\u2032), if there exists a real number c > 0 so that for all x, y \u2208 X,\nd(x, y) = c \u00b7 d\u2032(x, y) then for every 1 \u2264 k \u2264 |X|, F (X, d, k) = F (X, d\u2032, k).\nOrder invariance: Order invariance, proposed by Jardine and Sibson [6], describes clustering\nfunctions that are based on the ordering of pairwise distances. A distance function d\u2032 over X is an order\ninvariant modification of d over X if for all x1, x2, x3, x4 \u2208 X, d(x1, x2) < d(x3, x4) if and only\nif d\u2032(x1, x2) < d\u2032(x3, x4). A k-clustering function F is order invariant if whenever a distance\nfunction d\u2032 over X is an order invariant modification of d, F (X, d, k) = F (X, d\u2032, k) for all k.\n\nFootnote 1: Locality can also be reformulated for general clustering functions; however, we do not discuss this in this\nwork.\n\nLocality: Intuitively, a k-clustering function is local if its behavior on a union of clusters depends\nonly on distances between elements of that union, and is independent of the rest of the domain set.\nLocality was proposed in [2]. 
A k-clustering function F is local if for any clustering C output by F\nand every subset of clusters, C\u2032 \u2286 C, F (\u22c3C\u2032, d, |C\u2032|) = C\u2032.\nIn other words, for every domain (X, d) and number of clusters, k, if X\u2032 is the union of k\u2032 clusters\nin F (X, d, k) for some k\u2032 \u2264 k, then, applying F to (X\u2032, d) and asking for a k\u2032-clustering, will yield\nthe same clusters that we started with.\nConsistency: Consistency, proposed by Kleinberg [8], aims to formalize the preference for clusters\nthat are dense and well-separated. This property requires that the output of a k-clustering function\nshould remain unchanged after shrinking within-cluster distances and stretching between-cluster\ndistances.\nGiven a clustering C of some domain (X, d), we say that a distance function d\u2032 over X is (C, d)-consistent\nif d\u2032(x, y) \u2264 d(x, y) whenever x \u223cC y, and d\u2032(x, y) \u2265 d(x, y) whenever x \u2241C y.\nA k-clustering function F is consistent if for every X, d, k, if d\u2032 is (F (X, d, k), d)-consistent then\nF (X, d, k) = F (X, d\u2032, k).\nWhile this property may sound desirable and natural, it turns out that many common clustering\nparadigms fail to satisfy it. 
In a sense, this property may be viewed as the main weakness of Kleinberg\u2019s\nimpossibility result.\nThe following two properties, proposed in [2], are straightforward relaxations of consistency.\nInner and Outer consistency: Outer consistency represents the preference for well separated clusters,\nby requiring that the output of a k-clustering function not change if clusters are moved away\nfrom each other.\nA distance function d\u2032 over X is (C, d)-outer consistent if d\u2032(x, y) = d(x, y) whenever x \u223cC y,\nand d\u2032(x, y) \u2265 d(x, y) whenever x \u2241C y. Outer consistency is defined in the same way as\nconsistency, except that (C, d)-consistent is replaced by (C, d)-outer consistent.\nInner consistency represents the preference for placing points that are close together within the same\ncluster, by requiring that the output of a k-clustering function not change if elements of the same\ncluster are moved closer to each other.\nInner consistency is defined in a similar manner to outer-consistency, except that d\u2032 is (C, d)-inner\nconsistent if d\u2032(x, y) \u2264 d(x, y) whenever x \u223cC y, and d\u2032(x, y) = d(x, y) whenever x \u2241C y.\nClearly, consistency implies both outer-consistency and inner-consistency. Note also that if a\nfunction is both inner-consistent and outer-consistent then it is consistent.\n
k-Richness: The k-richness property requires that we be able to obtain any partition of the domain\nby modifying the distances between elements. This property is based on Kleinberg\u2019s [8] richness\naxiom, requiring that for any sets X1, X2, . . . , Xk, there exists a distance function d over\nX\u2032 = \u22c3_{i=1}^{k} Xi so that F (X\u2032, d) = {X1, X2, . . . , Xk}. A k-clustering function F satisfies\nk-richness if for any sets X1, X2, . . . , Xk, there exists a distance function d over X\u2032 = \u22c3_{i=1}^{k} Xi so\nthat F (X\u2032, d, k) = {X1, X2, . . . , Xk}.\nOuter richness: Outer richness, a natural variation on the k-richness property, was proposed in\n[2] under the name \u201cextended richness\u201d (we have renamed it to contrast this property with \u201cinner\nrichness\u201d, which we propose in Appendix A). Given k sets, a k-clustering function satisfies outer\nrichness if there exists some way of setting the between-set distances, without modifying distances\nwithin the sets, so that F outputs each of these sets as a cluster. This corresponds to the\nintuition that any groups of points, regardless of within distances, can be made into separate clusters.\nA clustering function F is outer-rich if for every set of domains, {(X1, d1), . . . , (Xk, dk)}, there\nexists a distance function d\u0302 over \u22c3_{i=1}^{k} Xi that extends each of the di\u2019s (for i \u2264 k), such that\nF (\u22c3_{i=1}^{k} Xi, d\u0302, k) = {X1, X2, . . . , Xk}.\n\nThreshold-richness: Fundamentally, the goal of clustering is to group points that are close to each\nother, and to separate points that are far apart. 
Axioms of clustering need to represent these\nobjectives, and no set of axioms of clustering can be complete without integrating such requirements.\n\nFunction | outer consistent | inner consistent | local | refinement-confined | order invariant | outer rich | inner rich | k-rich | threshold rich | scale invariant | iso. invariant\nSingle Linkage | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\nAverage Linkage | \u2713 | X | \u2713 | \u2713 | X | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\nComplete Linkage | \u2713 | X | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\nk-median | \u2713 | X | \u2713 | X | X | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\nk-means | \u2713 | X | \u2713 | X | X | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\nMin sum | \u2713 | \u2713 | \u2713 | X | X | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\nRatio cut | ? | ? | X | ? | X | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\nNormalized cut | ? | ? | X | ? | X | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\n(Entries marked ? in the ratio cut and normalized cut rows could not be recovered from the extracted figure.)\n\nFigure 1: A taxonomy of k-clustering functions, illustrating what properties are satisfied by some\ncommon k-clustering functions. The results in the k-means row apply both when the centers are part\nof the data set and when the underlying space is Euclidean and the centers are arbitrary points in the\nspace.\n\nConsistency is the only previous property that aims to formalize these requirements. 
However, consistency\nhas some counterintuitive implications (see Section 3 in [1]), and is not satisfied by many\ncommon clustering functions.\nA k-clustering function F is threshold-rich if for every clustering C of X, there exist real numbers\na < b so that for every distance function d over X where d(x, y) \u2264 a for all x \u223cC y, and d(x, y) \u2265 b\nfor all x \u2241C y, we have that F (X, d, |C|) = C.\nThis property is based on Kleinberg\u2019s [8] \u0393-forcing property, and is equivalent to the requirement\nthat for every partition \u0393, there exists a < b so that (a, b) is \u0393-forcing.\nInner richness: Complementary to outer richness, inner richness requires that there be a way of\nsetting distances within sets, without modifying distances between the sets, so that F outputs each\nset as a cluster. This corresponds to the intuition that between-cluster distances cannot eliminate\nany partition of X. A k-clustering function F satisfies inner richness if for every data set (X, d)\nand partition {X1, X2, . . . , Xk} of X, there exists a d\u0302 where for all a \u2208 Xi, b \u2208 Xj for i \u2260 j,\nd\u0302(a, b) = d(a, b), and F (\u22c3_{i=1}^{k} Xi, d\u0302, k) = {X1, X2, . . . , Xk}.\n\nRefinement-confined (see footnote 2): The following formalization was proposed in [2]. A clustering C of X\nis a refinement of clustering C\u2032 of X if every cluster in C is a subset of some cluster in C\u2032, or,\nequivalently, if every cluster of C\u2032 is a union of clusters of C. A k-clustering function is refinement\nconfined if for every 1 \u2264 k \u2264 k\u2032 \u2264 |X|, F (X, d, k\u2032) is a refinement of F (X, d, k).\n\n3 Property-Based Classification of Common k-Clustering Functions\n\nIn this section we present a taxonomy of common k-clustering functions. 
The taxonomy is pre-\nsented in Figure 1 (de\ufb01nitions of the k-clustering functions are in Appendix C in the supplementary\nmaterial).\nThe taxonomy in Figure 1 illustrates how clustering algorithms differ from one another. For ex-\nample, order-invariance and inner-consistency can be used to distinguish among the three common\nlinkage-based algorithms. Min-sum differs from k-means and k-median in that it satis\ufb01es inner-\nconsistency. Unlike all the other algorithms discussed, the spectral clustering functions are not\nlocal.\nThe proofs of the claims embedded in the table appear in the supplementary material.\n\n2In [2], this property was called \u201chierarchical clustering\u201d.\n\n5\n\n\f3.1 Axioms of clustering\n\nOur taxonomy reveals that some intuitive properties, which may be expected of all k-clustering\nfunctions, are not satis\ufb01ed by some common k-clustering functions. For example, locality is not\nsatis\ufb01ed by the spectral clustering functions ratio-cut and normalized-cut. Also, most functions fail\ninner consistency, and therefore do not satisfy consistency, even though the latter was previously\nproposed as an axiom of k-clustering functions [8].\nOn the other hand, isomorphism invariance, scale invariance, and all richness properties (in the set-\nting where the number of clusters, k, is part of the input), are satis\ufb01ed by all the clustering functions\nconsidered. Isomorphism invariance and scale-invariance make for natural axioms. Threshold rich-\nness is the only one that is both satis\ufb01ed by all k-clustering functions considered and re\ufb02ects the\nmain objective of clustering: to group points that are close together and to separate points that are\nfar apart.\nIt is easy to see that threshold richness implies k-richness. It can be shown that when threshold rich-\nness is combined with scale invariance, it also implies outer-richness and inner-richness. 
Therefore,\nwe propose that scale-invariance, isomorphism-invariance, and threshold richness can be used as\nclustering axioms.\nHowever, we emphasize that these three axioms do not make a complete set of axioms for clustering,\nsince some functions that satisfy all three properties do not make reasonable k-clustering functions;\na function that satis\ufb01es the two invariance properties can also satisfy threshold richness by behaving\nreasonably only on particularly well-clusterable data, while having counter-intuitive behavior on\nother data sets.\n\n4 Properties for Randomized k-Clustering Functions\n\nWe present a formal setting to study and analyze probabilistic k-clustering functions. A probabilistic\nk-clustering function F takes a data set (X, d) and an integer 1 \u2264 k \u2264 |X| and outputs F (X, d, k),\na probability distribution over k-clusterings of X. Let P (F (X, d, k) = C) denote the probability of\nclustering C in the probability distribution F (X, d, k).\n\n4.1 Properties of Probabilistic k-Clustering Functions\n\nWe translate properties of different types into the probabilistic setting.\nInvariance properties: Invariance properties specify when data sets should be clustered in the\nsame way (ex. isomorphism-invariance, scale-invariance, and order-invariance). Such properties are\ntranslated into the probabilistic setting by requiring that when data sets (X, d) and (X(cid:48), d(cid:48)) satisfy\nsome similarity requirements, then F (X, d, k) = F (X(cid:48), d(cid:48), k) for all k.\nConsistency properties: Consistency properties impose conditions that should improve the quality\nof a clustering. Every such property has some notion of a \u201c(C, d)-nice\u201d variant that speci\ufb01es how\nthe underlying distance function can be modi\ufb01ed to better \ufb02esh out clustering C. 
In the probabilistic\nsetting, such properties require that whenever d\u2032 is a (C, d)-nice variant, the k-clustering function is\nat least as likely to output C on d\u2032 as on d, P [F (X, d\u2032, |C|) = C] \u2265 P [F (X, d, |C|) = C].\nRichness properties: Richness properties require that any desired clustering can be obtained under\ncertain constraints. In the probabilistic setting, we require that the same occurs with arbitrarily high\nprobability. For example, the following is the probabilistic version of the k-richness property. The\nother variants of richness are reformulated analogously.\nDefinition 3 (k-Richness). A probabilistic k-clustering function F is k-rich if for any k-clustering C\nof X and any \u03b5 > 0, there exists a distance function d over X so that P (F (X, d, k) = C) \u2265 1 \u2212 \u03b5.\n\nLocality: We now show how to translate locality into the probabilistic setting. We say that a\nclustering of X specifies how to cluster a subset X\u2032 \u2286 X if every cluster that overlaps with X\u2032 is\ncontained within X\u2032. Locality requires that a k-clustering function cluster X\u2032 in the way specified\nby the superset X.\n\nClustering Algorithm | outer consistent | local | threshold rich | scale invariant | iso. invariant | outer rich | k-rich\nOptimal k-means | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\nRandom Centroids Lloyd | X | X | X | \u2713 | \u2713 | \u2713 | X\nFurthest Centroids Lloyd | X | X | \u2713 | \u2713 | \u2713 | \u2713 | \u2713\n\nFigure 2: An analysis of the k-means clustering function and k-means heuristics. 
The two leftmost\nproperties distinguish the k-means clustering function, properties that are satisfied by k-means but\nfail for other reasonable k-clustering functions. The next three are proposed axioms of clustering,\nand the last two properties follow from the axioms.\n\nIn the probabilistic setting, we require that the probability of obtaining a specific clustering of X\u2032 \u2286\nX is determined by the probability of obtaining that clustering as a subset of F (X, d, k), given that\nthe output of F on (X, d) specifies how to cluster X\u2032.\nDefinition 4 (Locality (probabilistic)). A probabilistic k-clustering function F is local if for\nany k-clustering C\u2032 of X\u2032, X\u2032 \u2286 X, and j \u2265 k, where P [\u2203C1, . . . , Ck s.t. \u22c3_{i=1}^{k} Ci = X\u2032 | F (X, d, j) = C] \u2260 0,\n\nP [F (X\u2032, d/X\u2032, |C\u2032|) = C\u2032] = P [C\u2032 \u2286 C | F (X, d, j) = C and C/X\u2032 is a k-clustering] / P [\u2203C1, . . . , Ck s.t. \u22c3_{i=1}^{k} Ci = X\u2032 | F (X, d, j) = C].\n\n5 Properties Distinguishing K-means Heuristics\n\n5.1 k-means and k-means heuristics\n\nOne of the most popular clustering algorithms is the Lloyd method, which aims to find clusterings\nwith low k-means loss. Indeed, the Lloyd method is sometimes referred to as the \u201ck-means\nalgorithm.\u201d We maintain a distinction between the k-means objective function and heuristics, such as\nthe Lloyd method, which aim to find clusterings with low k-means loss. For this section, we assume\nthat the data lie in Euclidean space, as is often the case when the Lloyd method is applied.\nDefinition 5 (Lloyd method). Given a data set (X, d), and a set S of points in Rn, the Lloyd\nalgorithm performs the following steps until two consecutive iterations return the same clustering.\n\n1. Assign each point in X to its closest element of S. That is, find the clustering C of X so\nthat x \u223cC y if and only if argmin_{c\u2208S} \u2016c \u2212 x\u2016 = argmin_{c\u2208S} \u2016c \u2212 y\u2016.\n\n2. Compute the centers of mass of the clusters. Set S = {ci = (1/|Ci|) \u2211_{x\u2208Ci} x | Ci \u2208 C}.\n\n
The Lloyd method is highly sensitive to the choice of initial centers. Perhaps the most common\nmethod for initializing the centers for the Lloyd method is to select k random points from the input\ndata set, proposed by Forgy in 1965 [4]. We refer to this initialization method as Random Centroids.\nWe propose a slight variation on a deterministic initialization method by Katsavounidis, Kuo, and\nZhang [7], who propose selecting centers that are far apart. First let c1 and c2 be the two points\nfurthest away from each other. Then, for all 3 \u2264 i \u2264 k, let ci be the point furthest away from its closest\nexisting center. That is, let ci be the point in X that maximizes min_{1\u2264j\u2264i\u22121} d(cj, ci). We refer to\nthis initialization method as Furthest Centroids.\n\n5.2 Distinguishing heuristics by properties\n\nAn analysis of the k-means clustering function and the two k-means heuristics discussed above\nis shown in Figure 2. The analysis illustrates that the k-means function differs significantly from\nheuristics that aim to find clusterings with low k-means objective loss. The proofs for this analysis\nwere omitted due to a lack of space (they appear in the supplementary material).\nThere are two properties that are satisfied by the k-means clustering function and fail for other\nreasonable k-clustering functions: outer-consistency and locality. Neither is satisfied by the heuristics.\nNote that unlike k-clustering functions that optimize common clustering objective functions,\nheuristics that aim to find clusterings with low loss for these objective functions do not necessarily make\nmeaningful k-clustering functions. 
Therefore, such a heuristic\u2019s failure to satisfy certain properties\ndoes not preclude these properties from being axioms of clustering, but rather illustrates a weakness\nof the heuristic.\nIt is interesting that the Lloyd method with the Furthest Centroids initialization technique satisfies\nour proposed axioms of clustering while Lloyd with Random Centroids fails threshold richness. This\ncorresponds to the finding of He et al. [5] that in practice, Furthest Centroids performs better than\nRandom Centroids.\n\n6 Impossibility Results\n\nIn this final section, we strengthen Kleinberg\u2019s famous impossibility result [8], for general clustering\nfunctions, yielding a simpler proof of the original result.\nKleinberg\u2019s impossibility theorem (Theorem 2.1, [8]) states that no general clustering function can\nsimultaneously satisfy scale-invariance, richness, and consistency. Ackerman and Ben-David [1]\nlater showed that consistency has some counterintuitive consequences. In Section 3, we showed that\nmany natural clustering functions fail inner-consistency (see footnote 3), which implies that there are many general\nclustering functions that fail consistency.\nOn the other hand, many natural algorithms satisfy outer-consistency. We strengthen Kleinberg\u2019s\nimpossibility result by relaxing consistency to outer-consistency.\nTheorem 1. No general clustering function can simultaneously satisfy outer-consistency,\nscale-invariance, and richness.\n\nProof. Let F be any general clustering function that satisfies outer-consistency, scale-invariance and\nrichness.\nLet X be some domain set with three or more elements. By richness, there exist distance functions\nd1 and d2 such that F (X, d1) = {{x} : x \u2208 X} (every domain point is a cluster on its own) and F (X, d2) is\nsome different clustering, C = {C1, . . . , Ck} of X.\nLet r = max{d1(x, y) : x, y \u2208 X} and let c be such that for every x \u2260 y, c \u00b7 d2(x, y) \u2265 r. 
Define $\\hat{d}(x, y) = c \\cdot d_2(x, y)$ for every $x, y \\in X$. Note that $\\hat{d}(x, y) \\geq d_1(x, y)$ for all $x, y \\in X$. Since every cluster of $F(X, d_1)$ is a single point, no within-cluster distance changes, so by\nouter-consistency, $F(X, \\hat{d}) = F(X, d_1)$. However, by scale-invariance, $F(X, \\hat{d}) = F(X, d_2)$. This\nis a contradiction, since $F(X, d_1)$ and $F(X, d_2)$ are different clusterings.\n\nA similar result can be obtained, using a similar proof, with inner-consistency replacing\nouter-consistency. Namely,\nLemma 1. No general clustering function can simultaneously satisfy inner-consistency,\nscale-invariance, and richness.\n\nSince consistency implies both outer-consistency and inner-consistency, Kleinberg\u2019s original result\nfollows from either Theorem 1 or Lemma 1.\nKleinberg\u2019s impossibility result illustrates property trade-offs for general clustering functions. The\ngood news is that these results do not apply when the number of clusters is part of the input, as is\nillustrated in our taxonomy; single linkage satisfies scale-invariance, consistency, and richness.\n\n\u00b3 Note that a clustering function and its corresponding general clustering function satisfy the same set of consistency properties.\n\nReferences\n[1] M. Ackerman and S. Ben-David. Measures of Clustering Quality: A Working Set of Axioms for Clustering. NIPS, 2008.\n\n[2] M. Ackerman, S. Ben-David, and D. Loker. Characterization of Linkage-based Clustering. COLT, 2010.\n\n[3] R. Bosagh Zadeh and S. Ben-David. \u201cA Uniqueness Theorem for Clustering.\u201d The 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2009.\n\n[4] E. Forgy. Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. In WNAR meetings, Univ of Calif Riverside, number 768, 1965.\n\n[5] He, J., Lan, M., Tan, C.-L., Sung, S.-Y., and Low, H.-B. (2004). Initialization of cluster refinement algorithms: A review and comparative study. In Proc. IEEE Int. Joint Conf. 
Neural Networks (pp. 297-302).\n\n[6] N. Jardine and R. Sibson. Mathematical Taxonomy. Wiley, 1971.\n\n[7] I. Katsavounidis, C.-C. J. Kuo, and Z. Zhang. A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1(10):144-146, 1994.\n\n[8] J. Kleinberg. \u201cAn Impossibility Theorem for Clustering.\u201d Advances in Neural Information Processing Systems (NIPS) 15, 2002.\n\n[9] U. von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing, 17(4):395-416, 2007.\n\n[10] L. Vendramin, R.J.G.B. Campello, and E.R. Hruschka. \u201cOn the comparison of relative clustering validity criteria.\u201d Sparks, 2009.\n", "award": [], "sourceid": 1307, "authors": [{"given_name": "Margareta", "family_name": "Ackerman", "institution": null}, {"given_name": "Shai", "family_name": "Ben-David", "institution": null}, {"given_name": "David", "family_name": "Loker", "institution": null}]}