{"title": "Clustering Redemption\u2013Beyond the Impossibility of Kleinberg\u2019s Axioms", "book": "Advances in Neural Information Processing Systems", "page_first": 8517, "page_last": 8526, "abstract": "Kleinberg (2002) stated three axioms that any clustering procedure should satisfy and showed there is no clustering procedure that simultaneously satisfies all three. One of these, called the consistency axiom, requires that when the data is modified in a helpful way, i.e. if points in the same cluster are made more similar and those in different ones made less similar, the algorithm should output the same clustering. To circumvent this impossibility result, research has focused on considering clustering procedures that have a clustering quality measure (or a cost) and showing that a modification of Kleinberg\u2019s axioms that takes cost into account lead to feasible clustering procedures. In this work, we take a different approach, based on the observation that the consistency axiom fails to be satisfied when the \u201ccorrect\u201d number of clusters changes. We modify this axiom by making use of cost functions to determine the correct number of clusters, and require that consistency holds only if the number of clusters remains unchanged. 
We show that single linkage satisfies the modified axioms, and if the input is well-clusterable, some popular procedures such as k-means also satisfy the axioms, taking a step towards explaining the success of these objective functions for guiding the design of algorithms.", "full_text": "Clustering Redemption–Beyond the Impossibility of Kleinberg’s Axioms

Vincent Cohen-Addad*
Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6
vincent.cohen-addad@lip6.fr

Varun Kanade†
University of Oxford
varunk@cs.ox.ac.uk

Frederik Mallmann-Trenn‡
MIT
mallmann@mit.edu

Abstract

Kleinberg [20] stated three axioms that any clustering procedure should satisfy and showed there is no clustering procedure that simultaneously satisfies all three. One of these, called the consistency axiom, requires that when the data is modified in a helpful way, i.e. if points in the same cluster are made more similar and those in different ones made less similar, the algorithm should output the same clustering. To circumvent this impossibility result, research has focused on considering clustering procedures that have a clustering quality measure (or a cost) and showing that a modification of Kleinberg’s axioms that takes cost into account leads to feasible clustering procedures. In this work, we take a different approach, based on the observation that the consistency axiom fails to be satisfied when the “correct” number of clusters changes. We modify this axiom by making use of cost functions to determine the correct number of clusters, and require that consistency holds only if the number of clusters remains unchanged. 
We show that single linkage satisfies the modified axioms, and if the input is well-clusterable, some popular procedures such as k-means also satisfy the axioms, taking a step towards explaining the success of these objective functions for guiding the design of algorithms.

1 Introduction

In a highly influential paper, Kleinberg [20] showed that clustering is impossible in the following sense: there exists no clustering function, i.e. a function that takes a point-set and a pairwise dis-similarity function⁴ defined on them as input and outputs a partition of the point-set, that simultaneously fulfills three simple and “reasonable” axioms: scale invariance, richness and consistency. Scale invariance requires that scaling all the dis-similarities by the same positive number should not change the output partition. Richness requires that for any partition of the point-set, there should be a way to define pairwise dis-similarities such that the clustering function will produce said partition as output. Finally, consistency requires the following: if a clustering function outputs a certain partition of a point-set, given a certain dis-similarity function, then applying this clustering function to a transformed dis-similarity function that makes points within the same part more similar and points in different parts less similar should yield the same partition.

While seemingly very natural in the context of clustering, the last of these axioms, consistency, is somewhat questionable, as has been discussed by researchers over the years (see e.g. 
[30] and references therein).

*This project benefited from French state aid managed by the Agence Nationale de la Recherche under the FOCAL programme, grant reference ANR-18-CE40-0004-01.
†This work was supported in part by the Alan Turing Institute through the EPSRC grant EP/N510129/1.
‡This work was supported in part by NSF Award Numbers CCF-1461559, CCF-0939370, and CCF-1810758.
⁴We use dis-similarity rather than distance, as for the most part we don’t require the point-set and the associated dis-similarity function to form a metric space.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Consider a dataset with a natural clustering consisting of k parts. Kleinberg’s consistency axiom allows a transformation of the dis-similarity function by which one cluster may be subdivided into two subclusters, such that points in the same subcluster are very similar to each other, but sufficiently dis-similar to points from the other subcluster. The transformed instance may require a different partition: for a “good” clustering with k parts, it may be more suitable to define one cluster for each of the two new subclusters and re-arrange the partition of the remaining points. Alternatively, one may ask what the “right” number of clusters in the new instance is. Since the original instance had k clusters and since one of the clusters got subdivided into two subclusters, it may be more natural to ask for a clustering into k+1 clusters for this new instance. Unfortunately, this is not allowed by the consistency axiom: the clustering should remain the same. 
This scenario can indeed be formalized, as shown in Section 4, even in the case where the original clustering into k parts is very well-clusterable. We are not the first to notice the problem with the consistency axiom as defined by Kleinberg, see e.g. [23, 2, 14]. This impossibility result has been contrasted by a large body of research that argues that relaxing the axioms by restating them with respect to cost functions (clustering quality measures) resolves the inconsistency [14]. For example, in the influential blog post [30], it is observed that the outcome of such a transformation can change the “natural” number of clusters.

Perhaps one of the main issues with Kleinberg’s axioms is that they fail to explain why some of the classic clustering objectives, such as the k-means objective function (see Definition 1.1), give rise to very popular algorithms such as k-means++ and Lloyd’s algorithm that are very successful in practice⁵. This suggests that the impossibility result arises from instances that are unrealistic and contrived. A way to overcome this impossibility result is to look beyond the worst-case scenario. Motivated by the thesis that “clustering is difficult only when it does not matter” (see e.g. [19, 13, 17]), one can hope that classic objectives such as k-means would satisfy the axioms when the input is well-clusterable. Unfortunately, we show that k-means fails to satisfy Kleinberg’s consistency axiom even when we restrict attention to very well-clusterable inputs, in fact even for types of instances for which the k-means++ algorithm has been proven to be efficient [21] and for which, as a result, one may expect k-means to be the “right” objective function to optimize. 
(See Section 4 for a formal statement of this.)

1.1 Our contributions

Our work aims at bridging the gap between real-world clustering scenarios and an axiomatic approach to understanding the theoretical foundations of clustering. We see the problem of clustering as a two-step procedure:

1. Determine the “natural” number k of clusters in the dataset;
2. Find the “best” clustering with k clusters.

The question of choosing the “correct” number of clusters is a very relevant one in practice because several of the commonly used clustering algorithms take the number of clusters k as a parameter (cf. [1, 26, 27, 18]) and would yield nonsensical clusters if k were not carefully chosen. Despite this, theoretical work on choosing the number of clusters is quite limited compared to the vast theoretical work analyzing various clustering algorithms. An approach employed quite often is the so-called elbow method, which itself can be defined in different ways. A natural definition is as follows: Consider an objective function (a measure of quality) for clustering into k parts, and define OPT_k to be the cost of the clustering that minimizes this objective function. According to the elbow method, the “natural” number of clusters is defined as the value k* that maximizes the ratio OPT_{k−1}/OPT_k for k ∈ {2,...,n−1}, where n is the number of data points. k* = 1 and k* = n are explicitly ruled out as they would lead to trivial clusterings.

The intuition behind this approach is that the maximum gain in information is obtained precisely when finding k* groups instead of k*−1, and that there is diminishing information gain when allowing more clusters beyond k*. This approach is widely used in practice, e.g. [27], and has led to interesting theoretical models of real-world inputs. As an example, Ostrovsky et al. 
[24] define a “real-world” input with k clusters as an instance for which the k-means objective satisfies OPT_{k−1}/OPT_k > 1+ε for a sufficiently large ε. In turn, such data models have been used in theoretical work to better understand the success of algorithms such as k-means++ [21, 10].

⁵Note that k-means++ and Lloyd’s algorithm aim at minimizing the k-means objective; each step improves the quality of the solution w.r.t. the k-means objective.

Taking inspiration from this approach, we return to Kleinberg’s axioms and amend the consistency axiom to take into account the potential change in the optimal number of clusters. More precisely, we require that the consistency (i.e. the partition of the input point-set) is preserved in the transformed instance only if the “correct” number of clusters in the new instance is the same as that in the original instance.

In order to meaningfully define the “correct” number of clusters, we need to include a cost function, from partitions to the positive reals, as part of the input. We define the “correct” number using the elbow method. We show that the new set of axioms is no longer inconsistent and that some clustering algorithms, such as single linkage, in fact satisfy these axioms. While in the worst case clustering algorithms based on the classic center-based clustering objectives such as k-means, k-median and k-center do not satisfy the axioms, we show that for stable clustering instances these objective functions now satisfy the axioms. We show that the notion of a stable instance captures some interesting scenarios (see full version). Thus our axioms arguably model the process of clustering inputs that are “relevant in practice”, taking a step towards explaining the success of some popular objective functions.

Stable clustering instances. 
We define well-clusterable, or stable, clustering instances using the stability notion introduced by Bilu and Linial [16] in the context of center-based clustering. This notion was later considered in the context of clustering in several other works (see e.g. [9, 13, 15, 12]), and various (provable) algorithms have been designed for solving these types of instances. We consider the α-proximity condition introduced by Awasthi et al. [11], which requires that the optimal clustering satisfies the following: given a point in the i-th optimal cluster, α times its distance to the center of cluster i remains smaller than its distance to the centers of the other clusters. This notion generalizes the notion of stability, as shown by [11].

In the full version we show that this notion arises for large ranges of parameters (for which our proofs hold) in different models, such as the stochastic block model and mixtures of Gaussians, for which these clustering approaches are used in practice. Our results on stable instances require that the cluster sizes are approximately equal; we observe that when using k-means in the context of e.g. Gaussian mixture models, roughly balanced clusters and a separation of centers ensure that minimizing the k-means objective is roughly the same as finding maximum likelihood estimators for the centers.

1.2 Related Work

Ben-David and Ackerman [14], in a prominent work, were the first to challenge Kleinberg’s impossibility result. They focus on clustering quality measures (CQMs), or cost functions, that assign a value to each clustering. They interpret Kleinberg’s axioms in terms of these quality measures and show that this interpretation of the axioms is consistent. 
This is similar to our approach, but their definition of the consistency axiom differs: their notion of consistency asks for the value of a clustering not to increase after a perturbation of the inputs that brings points in the same cluster closer and pulls points across different clusters apart.

As mentioned in the introduction, our consistency axiom requires that the optimal clustering remain the same after the same type of perturbation, provided the optimal number of clusters remains the same. We believe that this stronger requirement is an important one: when using a cost function for evaluating a k-clustering, we hope that the minimizer of the cost function (namely the optimal solution) is the underlying natural clustering. Hence, after a perturbation that does not increase distances between points of the same cluster and does not decrease distances between points in different clusters, we hope that the minimizer has remained the same if the natural number of clusters has remained k. The axiom proposed by Ben-David and Ackerman [14] does not enforce an optimal solution to remain optimal under the perturbation.

Ackerman [2] and Ackerman et al. [4] also contribute a large set of axioms, or properties, that are suitable for clustering objective functions. In this paper we focus on the three original axioms introduced by Kleinberg. Thus, our approach aims at complementing their study of the axioms by replacing their consistency axiom with a stronger one. 
Our approach also differs slightly in the following sense: we aim at defining reasonable axioms that explain why popular objective functions, such as k-means, are good ones (we refer the reader to [5, 6, 3] for further advantages of k-means and similar methods).

[Figure: original instance and perturbed instance (combine?).]

van Laarhoven and Marchiori [28] continue this line of research on quality measures and show that adding reasonable axioms leads to a set of axioms that is not fulfilled by modularity, a fairly popular CQM. Puzicha et al. [25] explore properties of clustering objective functions for the setting where the number of clusters, k, is fixed. They propose a few natural axioms for clustering objective functions, and then focus on objective functions that arise by requiring functions to decompose into additive form.

Meilă [23] views clusterings as nodes of a lattice: there is an edge between two clusterings C and C′ if C′ can be obtained by splitting a cluster of C into two parts. The author gives axioms for comparing clusterings and shows the inconsistency of those axioms. Ackerman et al. [5] consider clustering in the weighted setting, where every point is assigned a real-valued weight, and analyze the influence of the weights on standard clustering algorithms. Ackerman et al. [6] study the robustness of popular clustering methods to the addition of points. See Ackerman [2] for a thorough review of research on clustering properties. There has also been work focused on the single-linkage clustering algorithm and its characterization using a specific set of axioms including Kleinberg’s axioms [31]. This has later been extended to more general families of linkage-based algorithms [3].

Organization of the paper: The remainder of this section introduces basic notions and notation. Section 2 describes and discusses our new axioms. 
Section 3 shows that single linkage satisfies all of them, even in the worst-case scenario, while k-means and k-median satisfy the axioms when we restrict our attention to well-clusterable instances. Section 4 shows various impossibility results: k-means does not satisfy Kleinberg’s axioms even for well-clusterable instances, and in the worst case neither k-means nor k-median satisfies all of our refined axioms. The proofs can be found in the full version.

Preliminaries. Let [n] denote the set {1,...,n}. An input to a clustering procedure is ([n], d), where [n] is the point-set and d : [n]×[n] → R⁺ gives the pairwise distances between points in [n] (we assume d is always symmetric). We do not require ([n], d) to be a metric space, though all of our results continue to hold if this requirement is added. We denote by Π[n] the set of all possible partitions of the set [n]; Π*[n] denotes the set of non-trivial partitions of [n], i.e. excluding the partition consisting of exactly one part and the partition consisting of exactly n parts. For a partition P ∈ Π*[n], we denote by |P| the number of parts. We use OPT (respectively OPT_o) to denote the cost of the optimal solution under the perturbed metric (respectively, the original metric).

Definition 1.1 (k-Means). Let ([n], d) be a metric space, and k a non-negative integer. The k-means problem asks for a subset S of [n], of cardinality at most k, that minimizes cost(S) = Σ_{x∈[n]} min_{c∈S} d(x,c)².

In the k-median problem the distances are not squared, while in the k-center problem the sum is replaced by a maximum. In the following, we will sometimes refer to points of [n] as clients. The clustering of [n] induced by S is the partition of [n] into subsets C = {C_1,...,C_k} such that C_i = {x ∈ [n] | c_i = argmin_{c∈S} d(x,c)} (breaking ties arbitrarily). 
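As a concrete illustration of Definition 1.1 and the induced clustering, the following sketch (hypothetical helper names; brute-force, so only sensible for tiny inputs) computes the k-means cost of a center set, the partition it induces, and OPT_k by exhaustive search:

```python
from itertools import combinations

def kmeans_cost(points, d, centers):
    # cost(S) = sum over x of min_{c in S} d(x, c)^2, as in Definition 1.1
    return sum(min(d(x, c) for c in centers) ** 2 for x in points)

def induced_clustering(points, d, centers):
    # assign every point to its closest center
    # (ties broken arbitrarily, here by the order of `centers`)
    parts = {c: [] for c in centers}
    for x in points:
        parts[min(centers, key=lambda c: d(x, c))].append(x)
    return list(parts.values())

def opt_k(points, d, k):
    # OPT_k via exhaustive search over all center sets S of size k
    return min(kmeans_cost(points, d, S) for S in combinations(points, k))
```

For instance, on the one-dimensional instance [0, 1, 10, 11] with d(x, y) = |x − y|, centers such as {0, 10} induce the partition {0, 1}, {10, 11} at cost 2.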
Similarly, given a partition of [n] into k parts C = {C_1,...,C_k}, we define the centers induced by C as the set {centroid(C_i) | C_i ∈ C}, where we slightly abuse notation by defining the centroid of a set of points X ⊆ [n] as the point y of X that minimizes Σ_{x∈X} d(y,x)² (a.k.a. the medoid). It is a well-known fact that cost(C) is minimized by the centers induced by C. Hence, we will refer to a solution to the k-means problem by a partition of the points into k parts, or by a set of k centers.

2 An Axiomatic Result

Kleinberg [20] introduced an axiomatic framework for clustering. Following Kleinberg, we define a clustering procedure to be a function f that takes a pair ([n], d) of a point-set and an associated distance function, and outputs a partition P of [n]. This definition is purely combinatorial, and in what follows we will modify it slightly to view clustering as an optimization procedure. Kleinberg [20] requires that any clustering procedure satisfy the following three axioms.

Axiom 2.1 (Scale Invariance). For any input ([n], d) and any α > 0, we have f(([n], d)) = f(([n], α·d)), where α·d denotes an α-scaling of the distance function d.

Axiom 2.2 (Richness). For any P ∈ Π[n], there exists a d_P : [n]×[n] → R⁺ such that f(([n], d_P)) = P.

The third of Kleinberg’s axioms requires the notion of a P-consistent transformation. For a partition P ∈ Π[n], a transformation d′ of d is P-consistent if d′(x,y) ≤ d(x,y) whenever x and y are in the same part of P, and d′(x,y) ≥ d(x,y) whenever x and y are in different parts of P.

Axiom 2.3 (Consistency). If f(([n], d)) = P and d′ is a P-consistent transformation of d, then f(([n], d′)) = P.

It is this last axiom that is an unnecessary restriction on clustering procedures. 
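The notion of a P-consistent transformation can be checked mechanically. A minimal sketch (hypothetical names; d and d_new are symmetric distance maps keyed by unordered point pairs) returns whether d_new is P-consistent with respect to d:

```python
def is_p_consistent(points, partition, d, d_new):
    # d_new is P-consistent w.r.t. d when within-part distances may only
    # shrink and across-part distances may only grow (Kleinberg's notion)
    part_of = {x: i for i, part in enumerate(partition) for x in part}
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            pair = frozenset((x, y))
            same_part = part_of[x] == part_of[y]
            if same_part and d_new[pair] > d[pair]:
                return False
            if not same_part and d_new[pair] < d[pair]:
                return False
    return True
```

For example, shrinking a within-cluster distance while growing an across-cluster one is P-consistent; growing a within-cluster distance is not.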
As discussed in the introduction, this restriction comes from the fact that the axiom enforces the number of clusters to remain the same, even after the perturbation of the input. Indeed, the number of clusters may have “changed” as a result of the distance transformation. In general, choosing the correct number of clusters is a fairly non-trivial problem. In order to do so, we assume that there is a cost function associated with any partition P ∈ Π[n]. To avoid trivial cases, we will only allow a clustering algorithm to output a non-trivial partition P ∈ Π*[n]. Let Γ : Π*[n] → R⁺ be a cost function. For any k ∈ {2,...,n−1}, define OPT^Γ_k := min_{P ∈ Π[n], |P|=k} Γ(P).

For example, in the so-called k-median clustering objective, k data-points are chosen to be centers and each point is assigned to its closest center (with arbitrary tie-breaking) to arrive at a partition. The cost is then simply given by adding up the distance of each data-point to its closest center.

We now present our refined consistency axiom. We consider a clustering procedure as a procedure that takes as input ([n], d) as well as a cost function Γ : Π*[n] → R⁺. The clustering procedure chooses the number of parts k* by picking the k that maximizes the ratio OPT^Γ_{k−1}/OPT^Γ_k, and then outputs a partition P consisting of k* parts that achieves the value OPT^Γ_{k*}. We refer to such clustering procedures as clustering procedures with cost function Γ, and we use k*(([n], d), Γ) to denote the optimal value of k* and f(([n], d), Γ) to denote the partition output by the clustering procedure f using the cost function Γ.

Axiom 2.4 (Refined Consistency). 
If f is a clustering procedure with cost function Γ and f(([n], d), Γ) = P, and d′ is a P-consistent transformation of d, then either k*(([n], d), Γ) ≠ k*(([n], d′), Γ) or f(([n], d), Γ) = f(([n], d′), Γ).

The above axiom states that if a P-consistent transformation changes the data in a way that clearly changes the natural cluster structure, then the procedure may output a different partition as the proposed clustering, as long as the number of clusters has changed. However, if, as per the objective function Γ, the “optimal” number of clusters has not changed, then the same partition P should be returned after a P-consistent transformation. We refer to a clustering procedure using a cost function Γ that satisfies Axioms 2.1, 2.2 and 2.4 as admissible. Section 3 establishes that, unlike in Kleinberg’s result, which asks for clustering procedures satisfying Axioms 2.1, 2.2 and 2.3, we obtain a possibility theorem.

Several cost functions commonly used in practice have the effect of encouraging increasingly finer partitions. As a result, the number of parts, e.g. k in k-means, has to be fixed to avoid arriving at the trivial partition where each point is placed in its own cluster. On the other hand, one may imagine cost functions that encourage fewer clusters, e.g. if there is a cost to open a new center, as in facility location problems. Based on these, it is possible to demand a stronger consistency axiom than the one stated in Axiom 2.4. If P = f(([n], d), Γ) and P′ = f(([n], d′), Γ), one may demand that if k*(([n], d), Γ) < k*(([n], d′), Γ), then P′ is a refinement of P; likewise, if k*(([n], d), Γ) > k*(([n], d′), Γ), one may demand that P′ is a coarsening of P. The former should be expected for cost functions encouraging finer partitions and the latter for cost functions encouraging fewer parts. 
Single linkage does have the property that a P-consistent transformation can never decrease k* and that the resulting modified partition P′ is a refinement of P; however, we leave the formal analysis of this claim to the long version of this extended abstract.

3 Admissible Clustering Functions

3.1 Admissibility of Single Linkage

Single linkage is most often defined procedurally, rather than as an optimization problem. It is also commonly used as an algorithm for hierarchical clustering; however, it may equally well be viewed as a partition-based clustering procedure. Formally, for a given k, the optimization problem that results in the single-linkage algorithm is the following: “Find the minimum-weight spanning forest with exactly k connected components (trees).”

As in any clustering procedure, the parameter k is an input to the algorithm. In order to choose the value of k*, we look at the value of k* that maximizes the ratio OPT_k/OPT_{k+1} for k ∈ {1,...,n−1}. Note that this method of choosing k* allows neither k* = n nor k* = 1.

Proposition 3.1. Single linkage clustering is admissible.

Proving scale invariance and richness is trivial. In order to prove refined consistency, we show that the optimal forest in the modified metric d′ with k* parts cannot use any edges that go between the trees in the forest obtained with d. We refer the reader to the appendix for the full proof.

Remark 3.2. In fact, a stronger claim can be made: if k* changes, the new partition output by single linkage on ([n], d′) will be a refinement of the partition output on ([n], d).

3.2 Admissibility of k-Means

We now turn to a more formal definition of our “well-clusterable” instances.

Definition 3.3 (Center proximity [11]). 
We say that a metric space ([n], d) satisfies the α-center proximity condition if the centers {c_1,...,c_k} induced by the optimal clustering {C_1,...,C_k} of ([n], d) with respect to the k-means cost satisfy the following: for all i ≠ j and p ∈ C_i, we have d(p, c_j) ≥ α·d(p, c_i). We further say that an instance is δ-balanced if for all i, j, |C_i| ≤ (1+δ)|C_j|.

Theorem 3.4. For any α > 5.3, δ ≤ 1/2 and any δ-balanced instance satisfying α-center proximity, the k-means objective is an admissible cost function. Moreover, there exists a constant c such that for any δ ≤ 1/2 and any δ-balanced instance satisfying α-center proximity with α ≥ c, the k-median objective is an admissible cost function.

Proof. For simplicity, we assume that α = 6 and δ ≤ 1/2; the general case is similar (the higher α is, the higher δ can be). We will only show the proof for k-means; the proof for k-median is analogous. The proofs of all claims can be found in the appendix.

It is easy to see that the k-means objective function satisfies Axioms 2.1 and 2.2 (see also [2]). Hence, we only need to show that the k-means objective satisfies Axiom 2.4. We will make use of the following lemma, mainly due to [12] and [11].

Lemma 3.5 ([12]). For any points p ∈ C_i and q ∈ C_j (j ≠ i) in the optimal clustering of an α-center proximity instance, we have d(c_i, q) ≥ α(α−1)·d(c_i, p)/(α+1) and d(p, q) ≥ (α−1)·max{d(p, c_i), d(q, c_j)}.

We complement this lemma with the following observation:

Claim 3.6. 
Given p, q′ ∈ C_i and q ∈ C_j, we have that

d(c_i, q′) ≤ ((α+1)/(α−1)²)·d(p, q)  (1)

and

d(p, q′) ≤ (2α/(α−1)²)·d(p, q).  (2)

Consider an adversarial perturbation of the instance as prescribed by Axiom 2.3, namely a C-consistent transformation of d, where C is the optimal k-means clustering of the original instance. Assume towards a contradiction that the optimal k-means clustering for the perturbed instance, Γ = {Γ_1,...,Γ_k} with centers γ* = γ*_1,...,γ*_k, differs from the optimal k-means solution C = {C_1,...,C_k} for the original instance.

We claim that, assuming α > 2+√3, it must be that at least one of the clusters of C contains no center of γ*_1,...,γ*_k. Indeed, if for each C_i there exists a γ*_j that is in C_i, then the optimal clustering remains {C_1,...,C_k} and so Γ = C. This follows from Claim 3.6: after the perturbation, each point of C_i remains closer to points of C_i than to any other point. Therefore, if there is a center of γ* in each C_i, the optimal partitioning of the points remains C.

Thus, we assume that there is at least one cluster of C that has no center of γ*_1,...,γ*_k.

In the following we aim at bounding OPT_k, OPT_{k+1} and OPT_{k−1}, the costs of the optimal solutions using k, k+1 and k−1 centers in the perturbed instance.

We now consider the clusters of C that contain no center of the solution induced by Γ. We also consider the centers {γ_1,...,γ_t} ⊆ γ* induced by Γ that are located in a cluster C_i that also contains another center of γ*.

Given a clustering C′ with centers c′_1, c′_2,...,c′_k, we say a client p is served by c′_i if d(p, c′_i) < d(p, c′_j) for all j > i and d(p, c′_i) ≤ d(p, c′_j) for all j < i. 
For each C_i that contains at least two centers of γ*, let A_i denote the clients served by the centers of γ* located in C_i. We show:

Claim 3.7. There exists a C_i that contains at least two centers of γ* such that |A_i| ≤ c_1(α)·|C_i|, for an explicit constant c_1(α) depending only on α.

In the rest, we further analyze the structure of a cluster C_i satisfying Claim 3.7. Let ∆_i = max_{x∈A_i} min_{γ_j∈γ*} d(x, γ_j)².

Claim 3.8. We have that OPT_{k−1} ≤ OPT_k + c_2(α)·|C_i|·∆_i, for an explicit constant c_2(α).

Claim 3.9. We have that OPT_{k+1} ≤ OPT_k − (1−δ)·c_3(α)·|C_i|·∆_i, for an explicit constant c_3(α) > 0.

(The exact expressions for c_1(α), c_2(α) and c_3(α) are given in the appendix together with the proofs of the claims.)

Claim 3.10. We have that OPT_k/OPT_{k+1} > OPT_{k−1}/OPT_k.

Claim 3.10 shows that if the perturbation creates a clustering Γ different from C, then the natural value of k has changed (namely, OPT_{k−1}/OPT_k is no longer the maximizer over all values of k). Hence, the axiom is satisfied. It is easy to see that for larger δ, a larger value of α allows the proof to be derived.

4 Inadmissibility

In this section we prove two theorems showing the inadmissibility of ubiquitous clustering functions.

Theorem 4.1. k-means, k-median and k-center are not admissible w.r.t. our axioms.

The following theorem shows that k-means and k-median remain inadmissible w.r.t. Kleinberg’s axioms even if c-center proximity is satisfied for any constant c. This is in contrast to Theorem 3.4, which shows that k-means is admissible w.r.t. our axioms if 6-center proximity is satisfied. Given that k-means is of great importance in real-world settings, we believe that this is further evidence that our axioms are more suitable.

Theorem 4.2. k-means and k-median are not admissible w.r.t. 
Kleinberg\u2019s axioms even when c-cluster\nproximity is satis\ufb01ed for any constant c.\n\n4.1 Proof of Theorem 4.1\n\ndp\u00a8,\u00a8q\nu1\nu2\nu3\nu4\nvP L,v\u2030 u\nvP R,v\u2030 u\n\nu1\n0\n\nu2\n\u03b3\u00b4\u03b5\n0\n\nu3\n2\u03b3\u00b43\u03b5\n\u03b3\u00b42\u03b5\n0\n\nu4\n2\u03b3\u00b43\u03b5`1\n\u03b3\u00b42\u03b5`1\n1\n0\n\nuP L\n1\n\u03b3\u00b4\u03b5`1\n2\u03b3\u00b43\u03b5`1\n2\u03b3\u00b43\u03b5`2\n2\n\nuP R\n3\u03b3\u00b45\u03b5\n2\u03b3\u00b44\u03b5\n\u03b3`\u03b5 p\u03b3\u00b42\u03b5q\n\u03b3\n3\u03b3\u00b45\u03b5`1\n\u03b3\n\nFigure 1: Original instance and perturbed instance.\nLet V be the set of points, with |V |\u201c n. Assume\nn is even, \u03b3 \u201c 1.5 and \u03b5 \u201c 1{10. Let L and R\nbe two sets of size pn\u00b44q{2 each. The perturbed\ninstance is obtained by using the red value in brack-\nets. Missing entries are given by symmetry and\ndpu,vq\u201c 0 for v\u201c u.\n\nTo see that k-center, k-median and k-means are not admissible, we will construct a distance function\nd having k\u201c 2 with a unique optimal clustering C. The instance is given by Figure 1 and we refer to\nFigure 2 for an illustration. Note that the distance function ful\ufb01lls the triangle inequality albeit this is\nnot required. The main idea behind the construction is that u2 is, in the original instance, assigned to\nthe cluster center u1. In the perturbed instances, after decreasing the distance between u3 and the\nnodes of R, we have that u3 becomes the new center. As a consequence the node u2 is now closer\nto that cluster than to the other cluster. Hence, the clustering changes. It remains to show that the\noptimal number k\u02da of clusters remains 2 in the perturbed instance. Recall that, by de\ufb01nition, we\n\n7\n\n\fFigure 2: An illustration of the instance given by Figure 1 that\nshows that k-center, k-means and k-median are not admissible\nw.r.t. to our axioms. The distance from u1 to u2 is \u03b3\u00b4 \u03b5. 
The\ndistance from u1 to all of the nodes of L is 1. The distance of a\nnode L to all other nodes of L is 2 etc. The perturbed instances is\nobtained by decreasing the distance between u3 and the nodes of\nR\u2014all other distances remain unchanged. After decreasing those\ndistances, the center shifts to u3 causing u2 to switch clusters and\nhence different clusterings. The red circles denote the optimal\nclusterings with the centers marked red.\n\nexclude the cases k\u02da \u201c 1 and k\u02da \u201c n. We need to check that OPT1{OPT2 \u0105 OPTk{OPTk`1 for all\nkPt2,3,...,n\u00b41u.\nk-center. Note that the optimal solution for k\u201c 1\u2014in both the original instance and the perturbed\ninstance\u2014is to open a center at u2. We get that OPT1 \u201c OPT1\n1 \u201c 2\u03b3\u00b4 4\u03b5. For the case that k \u201c 2\nwe get for the optimal solution in the original instance (perturbed instance, respectively) consists\nof opening centers at u1 and u4 (u1 and u3, respectively). The results in a cost of OPT2 \u201c \u03b3\n(\u03b3\u00b42\u03b5, respectively). Furthermore, note that for any k\u0103 n, OPTk,OPT1\nk \u011b 1. As a result, we have\nOPTk{OPTk`1\u010fOPT2{OPTk`1\u010f \u03b3\u010f 1.5\u0103OPT1{OPT2 for k\u011b 2; the same holds for OPT1.\nk-median and k-means. Consider k-median. Note that for OPT1 the cost is at least pn\u00b44qp2\u03b3\u00b4\n4\u03b5q in the perturbed instance. Furthermore, OPT2 in the perturbed instance is OPT2 \u010f |L|\u00a8 1`\n|R|\u00a8p\u03b3 \u00b4 2\u03b5q` Op1q. Hence OPT1{OPT2 \u00ab 2. We can easily verify that OPTk{OPTk`1 \u010f 1.5 \u0103\nOPT1{OPT2 for all k\u011b 2. The argument for k-means is along the same lines.\n4.2 Proof of Theorem 4.2\n\nFor simplicity we consider an instance that satisfy 6-center proximity. The construction of the original\nand perturbed instance is given by Figure 3. 
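The cost values derived in the proof of Theorem 4.1 above can also be checked numerically. The following sketch (our own; the choice n = 14, i.e. |L| = |R| = 5, is an assumption for illustration) encodes the Figure 1 metric and brute-forces the k-center objective on the original and perturbed instances:

```python
from itertools import combinations

# Figure 1 instance with gamma = 1.5, eps = 0.1; n = 14 is our own choice.
g, e = 1.5, 0.1
L = [f"l{i}" for i in range(5)]
R = [f"r{i}" for i in range(5)]
V = ["u1", "u2", "u3", "u4"] + L + R

def grp(u):
    return u if u.startswith("u") else ("L" if u[0] == "l" else "R")

# Upper triangle of the Figure 1 table, keyed by group.
T = {("u1", "u2"): g - e, ("u1", "u3"): 2*g - 3*e, ("u1", "u4"): 2*g - 3*e + 1,
     ("u1", "L"): 1.0, ("u1", "R"): 3*g - 5*e,
     ("u2", "u3"): g - 2*e, ("u2", "u4"): g - 2*e + 1,
     ("u2", "L"): g - e + 1, ("u2", "R"): 2*g - 4*e,
     ("u3", "u4"): 1.0, ("u3", "L"): 2*g - 3*e + 1, ("u3", "R"): g + e,
     ("u4", "L"): 2*g - 3*e + 2, ("u4", "R"): g,
     ("L", "L"): 2.0, ("L", "R"): 3*g - 5*e + 1, ("R", "R"): g}

def dist(u, v, perturbed=False):
    if u == v:
        return 0.0
    a, b = grp(u), grp(v)
    if perturbed and {a, b} == {"u3", "R"}:
        return g - 2*e                     # the bracketed (red) value
    return T.get((a, b), T.get((b, a)))

def opt_kcenter(k, perturbed=False):
    # Brute force: smallest achievable max-distance over all k-subsets of centers.
    return min(max(min(dist(p, c, perturbed) for c in C) for p in V)
               for C in combinations(V, k))
```

With these parameters, exhaustive search reproduces OPT_1 = 2γ − 4ε (center u2), OPT_2 = γ (centers u1, u4) and, in the perturbed instance, OPT_2 = γ − 2ε (centers u1, u3), so OPT_1/OPT_2 > 1.5 as claimed.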
Our construction for k-means and k-median is in fact the same and satisfies the triangle inequality before and after perturbation. We show that reducing the intra-cluster distances changes the optimal solution, hence violating Kleinberg's axioms.

Figure 3: The two figures on the l.h.s. are an example of an instance that satisfies 6-center proximity where k-means and k-median are not admissible w.r.t. Kleinberg's axioms. The distance from u1 to all nodes in S1 (with |S1| = 5) is 1, and the distance from any node of S1 to all other nodes of S1 is 2. The distance from any node of S1 ∪ {u1} to any node of S2 ∪ {u2} is x, etc. The perturbed instance is generated as follows. First, the intra-distance between all nodes of S1 is reduced from 2 to 0. Second, the set S3 ∪ {u3} is partitioned into equal-sized sets S′3 and S″3. The intra-distance between nodes in both sets is reduced to 0 and the distance between a node of S′3 and a node of S″3 is reduced to y. All other distances remain unchanged. The red circles denote the optimal clusterings with the centers marked red. The table on the r.h.s. shows the original distance metric d. Missing entries are given by symmetry and d(u, v) = 0 for v = u.

    d(·,·)         u1   u2   u3    u ∈ S1   u ∈ S2   u ∈ S3
    u1             0    x    x·y   1        x        x·y
    u2                  0    x·y   x        1        x·y
    u3                       0     x·y      x·y      y
    v ∈ S1, v ≠ u                  2        x        x·y
    v ∈ S2, v ≠ u                           2        x·y
    v ∈ S3, v ≠ u                                    2y

We assume x > √(5/2) and y = 2x. Note that the instance is 0-balanced and satisfies x-center proximity, and also x′-center proximity for every x′ ≤ x, by definition. We require a few definitions. Let u′1 be the red node of S1 and let u′3 be the red node of S′3. The clustering C1 is induced by the centers u1, u2 and u3, simply assigning all other nodes to the closest node among u1, u2, and u3.
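The cost calculations in this proof can be checked by exhaustive search. The following sketch is our own: the choice x = 2, y = 4 and the particular equal split of S3 ∪ {u3} (with u3 in the second half) are assumptions for illustration. It brute-forces the discrete 3-center k-means objective on the Figure 3 instance before and after the perturbation:

```python
from itertools import combinations

x, y = 2.0, 4.0                      # x > sqrt(5/2), y = 2x, as in the text
S1 = [f"s1_{i}" for i in range(5)]
S2 = [f"s2_{i}" for i in range(5)]
S3 = [f"s3_{i}" for i in range(5)]
V = ["u1", "u2", "u3"] + S1 + S2 + S3
Sp = S3[:3]                          # S'_3, one half of the equal split
Spp = ["u3"] + S3[3:]                # S''_3, the half containing u3

def grp(u):
    return u if u in ("u1", "u2", "u3") else u[:2]

# Upper triangle of the Figure 3 table, keyed by group.
T = {("u1", "u2"): x, ("u1", "u3"): x*y, ("u2", "u3"): x*y,
     ("u1", "s1"): 1.0, ("u1", "s2"): x, ("u1", "s3"): x*y,
     ("u2", "s1"): x, ("u2", "s2"): 1.0, ("u2", "s3"): x*y,
     ("u3", "s1"): x*y, ("u3", "s2"): x*y, ("u3", "s3"): y,
     ("s1", "s1"): 2.0, ("s2", "s2"): 2.0, ("s3", "s3"): 2*y,
     ("s1", "s2"): x, ("s1", "s3"): x*y, ("s2", "s3"): x*y}

def dist(u, v, perturbed=False):
    if u == v:
        return 0.0
    if perturbed:
        if grp(u) == grp(v) == "s1":
            return 0.0               # intra-S1 distance drops to 0
        if u in Sp + Spp and v in Sp + Spp:
            return 0.0 if (u in Sp) == (v in Sp) else y
    a, b = grp(u), grp(v)
    return T.get((a, b), T.get((b, a)))

def cost(C, perturbed=False):        # discrete k-means cost of centers C
    return sum(min(dist(p, c, perturbed) for c in C) ** 2 for p in V)

def best(perturbed=False):           # optimal 3 centers by brute force
    return min(combinations(V, 3), key=lambda C: cost(C, perturbed))

def partition(C, perturbed=False):   # clustering induced by centers C
    cl = {c: set() for c in C}
    for p in V:
        cl[min(C, key=lambda c: dist(p, c, perturbed))].add(p)
    return frozenset(frozenset(s) for s in cl.values())
```

With these parameters, exhaustive search over all center triples reproduces the optimal costs 10 + 5y² (original, centers u1, u2, u3) and 1 + 6x² (perturbed) computed below, and confirms that the induced optimal partitions differ.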
Let C1^1 be the clustering induced by the centers u1, u3 and u′3, and let C1^2 be the clustering induced by the centers u′1, u2 and u3. Similarly, let C2 be the clustering induced by the centers u′1, u3 and u′3. Observe that the original metric space satisfies the x-center proximity definition. We will show that the optimal clustering C1 in the original input and the optimal clustering C2 in the perturbed input are different. Hence Kleinberg's axioms are not fulfilled despite x-center proximity. For a clustering C we use cost_o(C) and cost_p(C) to denote the cost before and after the perturbation.

k-means. Consider the original instance. We have that OPT_3^o = cost_o(C1) = |S1| · 1² + |S2| · 1² + |S3| · y² = 10 + 5y². Furthermore, cost_o(C2) ≥ cost_o(C1^1) = 5 + 6x² + 4y². Since x > √(5/2), we have that OPT_3^o < cost_o(C2). Consider the perturbed instance. We have cost_p(C1^2) = 1² + |S2| · 1² + |S′3| · y² = 6 + 3y² and cost_p(C1^2) ≤ cost_p(C1). We have OPT_3 = cost_p(C2) = 1 + (|S2|+1) · x² = 1 + 6x². Hence, the optimal clusterings in the original instance (C1) and the perturbed instance (C2) differ. An analogous reasoning yields the result for k-median.

References

[1] Determining the number of clusters in a data set - Wikipedia page. URL https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set.

[2] M. Ackerman. Towards theoretical foundations of clustering. 2012.

[3] M. Ackerman, S. Ben-David, and D. Loker.
Characterization of linkage-based clustering. In\n\nCOLT, pages 270\u2013281. Citeseer, 2010.\n\n[4] M. Ackerman, S. Ben-David, and D. Loker. Towards property-based classi\ufb01cation of clustering\n\nparadigms. In Advances in Neural Information Processing Systems, pages 10\u201318, 2010.\n\n[5] M. Ackerman, S. Ben-David, S. Br\u00e2nzei, and D. Loker. Weighted clustering. In AAAI, 2012.\n\n[6] M. Ackerman, S. Ben-David, D. Loker, and S. Sabato. Clustering oligarchies. In Arti\ufb01cial\n\nIntelligence and Statistics, pages 66\u201374, 2013.\n\n[7] N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. SIAM\n\nJournal on Computing, 26(6):1733\u20131748, 1997.\n\n[8] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph.\n\nRandom Structures and Algorithms, 13(3-4):457\u2013466, 1998.\n\n[9] H. Angelidakis, K. Makarychev, and Y. Makarychev. Algorithms for stable and perturbation-\nresilient problems. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of\nComputing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 438\u2013451, 2017. doi:\n10.1145/3055399.3055487. URL http://doi.acm.org/10.1145/3055399.3055487.\n\n[10] P. Awasthi and O. Sheffet. Improved spectral-norm bounds for clustering. In A. Gupta, K. Jansen,\nJ. Rolim, and R. Servedio, editors, Approximation, Randomization, and Combinatorial Opti-\nmization. Algorithms and Techniques, pages 37\u201349, Berlin, Heidelberg, 2012. Springer Berlin\nHeidelberg. ISBN 978-3-642-32512-0.\n\n[11] P. Awasthi, A. Blum, and O. Sheffet. Center-based clustering under perturbation stability.\n\nInformation Processing Letters, 112(1-2):49\u201354, 2012.\n\n[12] M. Balcan and Y. Liang. Clustering under perturbation resilience. SIAM J. Comput., 45(1):\n\n102\u2013155, 2016. doi: 10.1137/140981575. URL https://doi.org/10.1137/140981575.\n\n[13] S. Ben-David. Computational feasibility of clustering under clusterability assumptions. 
arXiv\n\npreprint arXiv:1501.00437, 2015.\n\n[14] S. Ben-David and M. Ackerman. Measures of clustering quality: A working set of axioms for\nclustering. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural\nInformation Processing Systems 21, pages 121\u2013128. Curran Associates, Inc., 2009.\n\n[15] S. Ben-David and N. Haghtalab. Clustering in the presence of background noise. In Proceedings\nof the 31st International Conference on International Conference on Machine Learning - Volume\n32, ICML\u201914, pages II\u2013280\u2013II\u2013288. JMLR.org, 2014. URL http://dl.acm.org/citation.\ncfm?id=3044805.3044924.\n\n9\n\n\f[16] Y. Bilu and N. Linial. Are stable instances easy? Comb. Probab. Comput., 21(5):643\u2013660, Sept.\n2012. ISSN 0963-5483. doi: 10.1017/S0963548312000193. URL http://dx.doi.org/10.\n1017/S0963548312000193.\n\n[17] A. Daniely, N. Linial, and M. Saks. Clustering is dif\ufb01cult only when it does not matter. arXiv\n\npreprint arXiv:1205.4891, 2012.\n\n[18] S. Dudoit and J. Fridlyand. A prediction-based resampling method for estimating the number\n\nof clusters in a dataset. Genome biology, 3(7):research0036\u20131, 2002.\n\n[19] A. Dutta, A. Vijayaraghavan, and A. Wang. Clustering stable instances of euclidean k-means.\n\narXiv preprint arXiv:1712.01241, 2017.\n\n[20] J. M. Kleinberg. An impossibility theorem for clustering. In Advances in Neural Information\n\nProcessing Systems, pages 463\u2013470, 2002.\n\n[21] A. Kumar and R. Kannan. Clustering with spectral norm and the k-means algorithm.\n\nIn\nProceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science,\nFOCS \u201910, pages 299\u2013308, Washington, DC, USA, 2010. IEEE Computer Society. ISBN\n978-0-7695-4244-7. doi: 10.1109/FOCS.2010.35. URL http://dx.doi.org/10.1109/\nFOCS.2010.35.\n\n[22] F. McSherry. Spectral partitioning of random graphs. In Foundations of Computer Science,\n\n2001. Proceedings. 
42nd IEEE Symposium on, pages 529–537. IEEE, 2001.

[23] M. Meilă. Comparing clusterings: an axiomatic view. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 577–584. ACM Press, 2005.

[24] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Foundations of Computer Science, 2006. FOCS '06. 47th Annual IEEE Symposium on, pages 165–176. IEEE, 2006.

[25] J. Puzicha, T. Hofmann, and J. M. Buhmann. A theory of proximity based clustering: structure detection by optimization. Pattern Recognition, 33(4):617–634, 2000. ISSN 0031-3203. doi: https://doi.org/10.1016/S0031-3203(99)00076-X. URL http://www.sciencedirect.com/science/article/pii/S003132039900076X.

[26] R. L. Thorndike. Who belongs in the family? Psychometrika, 18(4):267–276, 1953.

[27] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.

[28] T. van Laarhoven and E. Marchiori. Axioms for graph clustering quality functions. Journal of Machine Learning Research, 15:193–215, 2014. URL http://jmlr.org/papers/v15/vanlaarhoven14a.html.

[29] E. W. Weisstein. Tree. From MathWorld, a Wolfram Web Resource. URL http://mathworld.wolfram.com/Tree.html.

[30] A. Williams. Is clustering mathematically impossible? http://alexhwilliams.info/itsneuronalblog/2015/10/01/clustering2/, 2015.

[31] R. B. Zadeh and S. Ben-David. A uniqueness theorem for clustering. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 639–646.
AUAI Press, 2009.