{"title": "Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search", "book": "Advances in Neural Information Processing Systems", "page_first": 3094, "page_last": 3103, "abstract": "Hierarchical clustering is a data analysis method that has been used for decades. Despite its widespread use, the method has an underdeveloped analytical foundation. Having a well understood foundation would both support the currently used methods and help guide future improvements. The goal of this paper is to give an analytic framework to better understand observations seen in practice. This paper considers the dual of a problem framework for hierarchical clustering introduced by Dasgupta. The main result is that one of the most popular algorithms used in practice, average linkage agglomerative clustering, has a small constant approximation ratio for this objective. Furthermore, this paper establishes that using bisecting k-means divisive clustering has a very poor lower bound on its approximation ratio for the same objective. However, we show that there are divisive algorithms that perform well with respect to this objective by giving two constant approximation algorithms. This paper is some of the first work to establish guarantees on widely used hierarchical algorithms for a natural objective function. This objective and analysis give insight into what these popular algorithms are optimizing and when they will perform well.", "full_text": "Approximation Bounds for Hierarchical Clustering:\n\nAverage Linkage, Bisecting K-means, and Local\n\nSearch\n\nBenjamin Moseley\u2217\n\nCarnegie Mellon University\nPittsburgh, PA 15213, USA\n\nmoseleyb@andrew.cmu.edu\n\nJoshua R. 
Wang†
Department of Computer Science, Stanford University
353 Serra Mall, Stanford, CA 94305, USA
joshua.wang@cs.stanford.edu

Abstract

Hierarchical clustering is a data analysis method that has been used for decades. Despite its widespread use, the method has an underdeveloped analytical foundation. Having a well understood foundation would both support the currently used methods and help guide future improvements. The goal of this paper is to give an analytic framework to better understand observations seen in practice. This paper considers the dual of a problem framework for hierarchical clustering introduced by Dasgupta [Das16]. The main result is that one of the most popular algorithms used in practice, average linkage agglomerative clustering, has a small constant approximation ratio for this objective. Furthermore, this paper establishes that using bisecting k-means divisive clustering has a very poor lower bound on its approximation ratio for the same objective. However, we show that there are divisive algorithms that perform well with respect to this objective by giving two constant approximation algorithms. This paper is some of the first work to establish guarantees on widely used hierarchical algorithms for a natural objective function. This objective and analysis give insight into what these popular algorithms are optimizing and when they will perform well.

1 Introduction

Hierarchical clustering is a widely used method to analyze data. See [MC12, KBXS12, HG05] for an overview and pointers to relevant work. In a typical hierarchical clustering problem, one is given a set of n data points and a notion of similarity between the points. The output is a hierarchy of clusters of the input. Specifically, a dendrogram (tree) is constructed where the leaves correspond to the n input data points and the root corresponds to a cluster containing all data points.
Each internal node of the tree corresponds to a cluster of the data points in its subtree. The clusters (internal nodes) become more refined as the nodes are lower in the tree. The goal is to construct the tree so that the clusters deeper in the tree contain points that are relatively more similar.

There are many reasons for the popularity of hierarchical clustering, including that the number of clusters is not predetermined and that the clusters produced induce taxonomies that give meaningful ways to interpret data.

Methods used to perform hierarchical clustering are divided into two classes: agglomerative and divisive. Agglomerative algorithms are a bottom-up approach and are more commonly used than divisive approaches [HTF09]. In an agglomerative method, each of the n input data points starts as a cluster. Then iteratively, pairs of similar clusters are merged according to some appropriate metric of similarity. Perhaps the most popular metric to define similarity is average linkage, where the similarity between two clusters is defined as the average similarity between all pairs of data points in the two clusters. In average linkage agglomerative clustering the two clusters with the highest average similarity are merged at each step. Other metrics are also popular.

* Benjamin Moseley was supported in part by a Google Research Award, a Yahoo Research Award and NSF Grants CCF-1617724, CCF-1733873 and CCF-1725661. This work was partially done while the author was working at Washington University in St. Louis.
† Joshua R. Wang was supported in part by NSF Grant CCF-1524062.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
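Before turning to other metrics, the average-linkage merge rule just described can be sketched in code. This is a minimal illustration under our own assumptions; the similarity matrix `w` and all function names are ours, not from the paper:

```python
# One run of average-linkage agglomerative clustering: repeatedly merge the
# pair of clusters with the highest average inter-cluster similarity.
from itertools import combinations

def avg_similarity(w, A, B):
    """(1/(|A||B|)) * sum of w[a][b] over a in A, b in B."""
    return sum(w[a][b] for a in A for b in B) / (len(A) * len(B))

def average_linkage(w):
    """Return the sequence of merges performed (a flat record of the dendrogram)."""
    clusters = [frozenset([v]) for v in range(len(w))]
    merges = []
    while len(clusters) >= 2:
        # Pick the pair of current clusters with maximum average similarity.
        A, B = max(combinations(clusters, 2), key=lambda p: avg_similarity(w, *p))
        clusters = [C for C in clusters if C not in (A, B)] + [A | B]
        merges.append((set(A), set(B)))
    return merges

# Toy instance: points 0,1 are very similar (weight 9); so are 2,3 (weight 8).
w = [[0, 9, 1, 1],
     [9, 0, 1, 1],
     [1, 1, 0, 8],
     [1, 1, 8, 0]]
merges = average_linkage(w)
```

On this instance the first two merges pair up the similar points before the final merge, matching the intuition that similar points should be joined low in the tree.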
Related examples include: single linkage, where the similarity between two clusters is the maximum similarity between any two single data points in each cluster, and complete linkage, where the similarity between two clusters is the minimum similarity between any two single data points in each cluster.

Divisive algorithms are a top-down approach where initially all data points belong to a single cluster. Splits are recursively performed, dividing a cluster into two clusters that will be further divided. The process continues until each cluster consists of a single data point. In each step of the algorithm, the data points are partitioned such that points in each cluster are more similar than points across clusters. There are several approaches to perform divisive clustering. One example is bisecting k-means, where k-means is used at each step with k = 2. For details on bisecting k-means, see [Jai10].

Motivation: Hierarchical clustering has been used and studied for decades. There has been some work on theoretically quantifying the quality of the solutions produced by algorithms, such as [ABBL12, AB16, ZB09, BA08, Das16]. Much of this work focuses on deriving the structure of solutions created by algorithms or analytically describing desirable properties of a clustering algorithm. Though the area has been well-studied, there is no widely accepted formal problem framework. Hierarchical clustering describes a class of algorithmic methods rather than a problem with an objective function. Studying a formal objective for the problem could lead to the ability to objectively compare different methods; there is a desire for the community to investigate potential objectives. This would further support the use of current methods and guide the development of improvements.

This paper is concerned with investigating objectives for hierarchical clustering.
The main goal and result of this paper is giving a natural objective that results in a theoretical guarantee for the most commonly used hierarchical clustering algorithm, average linkage agglomerative clustering. This guarantee gives support for why the algorithm is popular in practice and the objective gives insight into what the algorithm optimizes. This paper also proves a bad lower bound on bisecting k-means with respect to the same natural objective. This objective can therefore be used as a litmus test for the applicability of particular algorithms. This paper further gives top-down approaches that do have strong theoretical guarantees for the objective.

Problem Formulation: Towards this paper's goal, first a formal problem framework for hierarchical clustering needs to be established. Recently, Dasgupta [Das16] introduced a new problem framework for hierarchical clustering. This work justified their objective by establishing that for several sample problem instances, the resulting solution corresponds to what one might expect out of a desirable solution. This work has spurred considerable interest and there have been several follow-up papers [CC17, Das16, RP16].

In the problem introduced by Dasgupta [Das16] there is a set of n data points as input and for two points i and j there is a weight w_{i,j} denoting their similarity. The higher the weight, the larger the similarity. This is represented as a weighted complete graph G. In the problem the output is a (full) binary tree where the leaves of the tree correspond to the input data points. For each pair of points i and j, let T[i ∨ j] denote the subtree rooted at i and j's least common ancestor. Let leaves(T[i ∨ j]) denote the set of leaves in the tree T[i ∨ j]. The goal is to construct the tree such that the cost cost_G(T) := Σ_{i,j ∈ [n]} w_{i,j} |leaves(T[i ∨ j])| is minimized. Intuitively, this objective enforces that more similar points i and j should have a lower common ancestor in the tree because the weight w_{i,j} is large and having a smaller least common ancestor ensures that |leaves(T[i ∨ j])| is smaller. In particular, more similar points should be separated at lower levels of the hierarchical clustering.

For this objective, several approximation algorithms have been given [CC17, Das16, RP16]. It is known that there is a divisive clustering algorithm with an approximation ratio of O(√(log n)) [CC17]. In particular, the algorithm gives an O(α_n)-approximation where α_n is the approximation ratio of the sparsest cut subroutine [CC17]. Furthermore, assuming the Small-Set Expansion Hypothesis, every algorithm is a ω(1)-approximation [CC17]. The current best known bound on α_n is O(√(log n)) [ARV09]. Unfortunately, this conclusion misses one of our key goals in trying to establish an objective function. While the algorithms and analysis are ingenious, none of the algorithms with theoretical guarantees are from the class of algorithms used in practice. Due to the complexity of the proposed algorithms, it will also be difficult to put them into practice.

Hence the question still looms: are there strong theoretical guarantees for practical algorithms? Is the objective from [Das16] the ideal objective for our goals? Is there a natural objective that admits solutions that are provably close to optimal?

Results: In this paper, we consider an objective function motivated by the objective introduced by Dasgupta in [Das16]. For a given tree T let |non-leaves(T[i ∨ j])| be the total number of leaves that are not in the subtree rooted at the least common ancestor of i and j. The objective in [Das16] focuses on constructing a binary tree T to minimize the cost cost_G(T) := Σ_{i,j ∈ [n]} w_{i,j} |leaves(T[i ∨ j])|. This paper considers the dual problem where T is constructed to maximize the revenue rev_G(T) := Σ_{i,j ∈ [n]} w_{i,j} |non-leaves(T[i ∨ j])| = (n Σ_{i,j ∈ [n]} w_{i,j}) − cost_G(T). It is important to observe that the optimal clustering is the same for both objectives. Due to this, all the examples given in [Das16] motivating their objective by showing desirable structural properties of the optimal solution also apply to the objective considered in this paper. Our objective can be interpreted similarly to that in [Das16]. In particular, similar points i and j should be located lower in the tree so as to maximize |non-leaves(T[i ∨ j])|, the points that get separated at high levels of the hierarchical clustering.

This paper gives a thorough investigation of this new problem framework by analyzing several algorithms for the objective. The main result is establishing that average linkage clustering is a 1/3-approximation. This result gives theoretical justification for the use of average linkage clustering and, additionally, this shows that the objective considered is tractable since it admits Ω(1)-approximations. This suggests that the objective captures a component of what average linkage is optimizing for.

This paper then seeks to understand what other algorithms are good for this objective. In particular, is there a divisive algorithm with strong theoretical guarantees? What can be said about practical divisive algorithms? We establish that bisecting k-means is no better than an O(1/√n) approximation. This establishes that this method is very poor for the objective considered. This suggests that bisecting k-means is optimizing for something different than what average linkage optimizes for.

Given this negative result, we question whether there are divisive algorithms that optimize for our objective.
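To make the pair of objectives concrete, both can be evaluated directly on a toy instance. A minimal sketch under our own assumptions (the nested-tuple tree encoding and every name here are ours, not from the paper):

```python
# Evaluating cost_G(T) and rev_G(T) on a toy instance, and checking the duality
# rev_G(T) = n * (total weight) - cost_G(T) stated in the text.
from itertools import combinations

def leaves(t):
    """Set of leaves of a tree given as a nested pair (internal node) or an int (leaf)."""
    return {t} if isinstance(t, int) else leaves(t[0]) | leaves(t[1])

def lca_leaves(t, i, j):
    """Leaves of the subtree T[i v j] rooted at the least common ancestor of i and j."""
    for child in t:
        if {i, j} <= leaves(child):
            return lca_leaves(child, i, j)
    return leaves(t)

def cost_and_revenue(tree, w):
    """Each unordered pair {i, j} is counted once."""
    n = len(w)
    cost = rev = 0
    for i, j in combinations(range(n), 2):
        size = len(lca_leaves(tree, i, j))
        cost += w[i][j] * size         # |leaves(T[i v j])|
        rev += w[i][j] * (n - size)    # |non-leaves(T[i v j])|
    return cost, rev

# Points 0,1 are very similar (weight 9); so are 2,3 (weight 8); cross pairs weight 1.
w = [[0, 9, 1, 1],
     [9, 0, 1, 1],
     [1, 1, 0, 8],
     [1, 1, 8, 0]]
tree = ((0, 1), (2, 3))                # pair up the similar points first
cost, rev = cost_and_revenue(tree, w)
total = sum(w[i][j] for i, j in combinations(range(4), 2))
assert cost + rev == 4 * total         # the duality identity, n = 4
```

Because the identity holds for every tree, minimizing cost and maximizing revenue pick out the same optimum, as the text notes.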
We answer this question affirmatively by giving a local search strategy that obtains a 1/3-approximation as well as showing that randomly partitioning is a tight 1/3-approximation. The randomized algorithm can be found in the supplementary material.

Other Related Work: Very recently a contemporaneous paper [CKMM17] done independently has been published on arXiv. This paper considers another class of objectives motivated by the work of [Das16]. For their objective, they also derive positive results for average linkage clustering.

2 Preliminaries

In this section, we give preliminaries including a formal definition of the problem considered and basic building blocks for later algorithm analysis.

In the Revenue Hierarchical Clustering Problem there are n input data points given as a set V. There is a weight w_{i,j} ≥ 0 between each pair of points i and j denoting their similarity, represented as a complete graph G. The output of the problem is a rooted tree T where the leaves correspond to the data points and the internal nodes of the tree correspond to clusters of the points in the subtree. We will use the indices 1, 2, . . . , n to denote the leaves of the tree. For two leaves i and j, let T[i ∨ j] denote the subtree rooted at the least common ancestor of i and j and let non-leaves(T[i ∨ j]) denote the set of leaves in T that are not in T[i ∨ j]. The objective is to construct T to maximize the revenue rev_G(T) = Σ_{i ∈ [n]} Σ_{j ≠ i ∈ [n]} w_{i,j} |non-leaves(T[i ∨ j])|.

We make no assumptions on the structure of the optimal tree T; however, one optimal tree is a binary tree, so we may restrict the solution to binary trees without loss of generality. To see this, let leaves(T[i ∨ j]) be the set of leaves in T[i ∨ j] and cost_G(T) := Σ_{i,j} w_{i,j} |leaves(T[i ∨ j])|. The objective considered in [Das16] focuses on minimizing cost_G(T). We note that cost_G(T) + rev_G(T) = n Σ_{i,j} w_{i,j}, so the optimal solution to minimizing cost_G(T) is the same as the optimal solution to maximizing rev_G(T). In [Das16] it was shown that the optimal solution for any input is a binary tree.

As mentioned, there are two common types of algorithms for hierarchical clustering: agglomerative (bottom-up) algorithms and divisive (top-down) algorithms. In an agglomerative algorithm, each vertex v ∈ V begins in a separate cluster, and each iteration of the algorithm chooses two clusters to merge into one. In a divisive algorithm, all vertices v ∈ V begin in a single cluster, and each iteration of the algorithm selects a cluster with more than one vertex and partitions it into two smaller clusters.

In this section, we present some basic techniques which we later use to analyze the effect each iteration has on the revenue. It will be convenient for us to think of the weight function as taking in two vertices instead of an edge, i.e. w : V × V → R≥0. This is without loss of generality, because we can always set the weight of any nonedge to zero (e.g. w_{v,v} = 0 for all v ∈ V).

To bound the performance of an algorithm it suffices to bound rev_G(T) and cost_G(T) since rev_G(T) + cost_G(T) = n Σ_{i,j} w_{i,j}. Further, let T* denote the optimal hierarchical clustering. Then its revenue is at most rev_G(T*) ≤ (n − 2) Σ_{i,j} w_{i,j}. This is because any edge ij can have at most (n − 2) non-leaves for its subtree T[i ∨ j]; i and j are always leaves.

2.1 Analyzing Agglomerative Algorithms

In this section, we discuss a method for bounding the performance of an agglomerative algorithm. When an agglomerative algorithm merges two clusters A, B, this determines the least common ancestor for any pair of nodes i, j where i ∈ A and j ∈ B. Given this, we define the revenue gain due to merging A and B as merge-rev_G(A, B) := (n − |A| − |B|) Σ_{a ∈ A, b ∈ B} w_{a,b}.

Notice that the final revenue rev_G(T) is exactly the sum over iterations of the revenue gains, since each edge is counted exactly once: when its endpoints are merged into a single cluster. Hence, rev_G(T) = Σ_{merges A,B} merge-rev_G(A, B).

We next define the cost of merging A and B. This is the potential revenue lost by merging A and B; revenue that can no longer be gained after A and B are merged, but was initially possible. Define merge-cost_G(A, B) := |B| Σ_{a ∈ A, c ∈ [n]\(A∪B)} w_{a,c} + |A| Σ_{b ∈ B, c ∈ [n]\(A∪B)} w_{b,c}.

The total cost of the tree T, cost_G(T), is exactly the sum over iterations of the cost increases, plus an additional 2 Σ_{i,j} w_{i,j} term that accounts for each edge being counted towards its own endpoints. We can see why this is true if we consider a pair of vertices i, j ∈ [n] in the final hierarchical clustering T. If at some point a cluster containing i is merged with a third cluster before it gets merged with the cluster containing j, then the number of leaves in T[i ∨ j] goes up by the size of the third cluster. This is exactly the quantity captured by our cost increase definition. Aggregated over all pairs i, j this is the following: cost_G(T) = Σ_{i,j ∈ [n]} w_{i,j} |leaves(T[i ∨ j])| = 2 Σ_{i,j ∈ [n]} w_{i,j} + Σ_{merges A,B} merge-cost_G(A, B).

2.2 Analyzing Divisive Algorithms

Similar reasoning can be used for divisive algorithms. The following are revenue gain and cost increase definitions for when a divisive algorithm partitions a cluster into two clusters A, B. Define split-rev_G(A, B) := |B| Σ_{a,a′ ∈ A} w_{a,a′} + |A| Σ_{b,b′ ∈ B} w_{b,b′} and split-cost_G(A, B) := (|A| + |B|) Σ_{a ∈ A, b ∈ B} w_{a,b}.

Consider the revenue gain. For a, a′ ∈ A we are now guaranteed that when the nodes in B are split from A then every node in B will not be a leaf in T[a ∨ a′] (and there is a symmetric term for when both points are in B). On the cost side, the term counts the cost of any pair a ∈ A and b ∈ B that is now separated, since we now know their subtree T[a ∨ b] has exactly the nodes in A ∪ B as leaves.

3 A Theoretical Guarantee for Average Linkage Agglomerative Clustering

In this section, we present the main result, a theoretical guarantee on average linkage clustering. We additionally give a bad example lower bounding the best performance of the algorithm. See [MC12] for details and background on this widely used algorithm. The formal definition of the algorithm is given in the following pseudocode. The main idea is that initially all n input points are in their own cluster and the algorithm recursively merges clusters until there is one cluster. In each step, the algorithm merges the clusters A and B such that the pair maximizes the average similarity of points between the two clusters, (1/(|A||B|)) Σ_{a ∈ A, b ∈ B} w_{a,b}.

Data: Vertices V, weights w : E → R≥0
Initialize clusters C ← ∪_{v ∈ V} {v};
while |C| ≥ 2 do
    Choose A, B ∈ C to maximize w̄(A, B) := (1/(|A||B|)) Σ_{a ∈ A, b ∈ B} w_{a,b};
    Set C ← C ∪ {A ∪ B} \ {A, B};
end
Algorithm 1: Average Linkage

The following theorem establishes that this algorithm is only a small constant factor away from optimal.

Theorem 3.1. Consider a graph G = (V, E) with nonnegative edge weights w : E → R≥0.
Let the hierarchical clustering T* be an optimal solution maximizing rev_G(·) and let T be the hierarchical clustering returned by Algorithm 1. Then, rev_G(T) ≥ (1/3) rev_G(T*).

Proof. Consider an iteration of Algorithm 1. Let the current clusters be in the set C, and suppose the algorithm chooses to merge clusters A and B from C. Let w̄(A, B) = (1/(|A||B|)) Σ_{a ∈ A, b ∈ B} w_{a,b} be the average weight of an edge between points in A and B. When doing so, the algorithm attains a revenue gain of

merge-rev_G(A, B) = (n − |A| − |B|) Σ_{a ∈ A, b ∈ B} w_{a,b} = Σ_{C ∈ C\{A,B}} |C| |A| |B| w̄(A, B),

while at the same time incurring a cost increase of:

merge-cost_G(A, B) = |B| Σ_{a ∈ A, c ∈ [n]\(A∪B)} w_{a,c} + |A| Σ_{b ∈ B, c ∈ [n]\(A∪B)} w_{b,c}
  = |B| Σ_{C ∈ C\{A,B}} Σ_{a ∈ A, c ∈ C} w_{a,c} + |A| Σ_{C ∈ C\{A,B}} Σ_{b ∈ B, c ∈ C} w_{b,c}
  = Σ_{C ∈ C\{A,B}} |A| |B| |C| w̄(A, C) + Σ_{C ∈ C\{A,B}} |A| |B| |C| w̄(B, C)
  ≤ Σ_{C ∈ C\{A,B}} |A| |B| |C| w̄(A, B) + Σ_{C ∈ C\{A,B}} |A| |B| |C| w̄(A, B)
  = 2 · merge-rev_G(A, B),

where the inequality holds because the algorithm merged the pair with maximum average weight, so w̄(A, C) ≤ w̄(A, B) and w̄(B, C) ≤ w̄(A, B).

Intuitively, every time this algorithm loses two units of potential it cements the gain of one unit of potential, which is why it is a 1/3-approximation. Formally:

cost_G(T) = 2 Σ_{i,j} w_{i,j} + Σ_{merges A,B} merge-cost_G(A, B)
  ≤ 2 Σ_{i,j} w_{i,j} + 2 Σ_{merges A,B} merge-rev_G(A, B)
  ≤ 2 Σ_{i,j} w_{i,j} + 2 · rev_G(T).

Now the revenue can be bounded as follows:

rev_G(T) = n Σ_{i,j} w_{i,j} − cost_G(T) ≥ n Σ_{i,j} w_{i,j} − 2 Σ_{i,j} w_{i,j} − 2 · rev_G(T),

so rev_G(T) ≥ ((n − 2)/3) Σ_{i,j} w_{i,j} ≥ (1/3) rev_G(T*), where the last step follows from the fact that it is impossible to have more than n − 2 non-leaves.

Figure 1: Hard graph for Average Linkage (k = 2 case): nodes u and v are joined by an edge of weight 1 + δ, and each is joined by unit-weight edges to half of the remaining nodes.

In the following, we establish that the algorithm is at best a 1/2 approximation. The proof can be found in Section 1 of the supplementary material.

Lemma 3.2. Let ε > 0 be any fixed constant. There exists a graph G = (V, E) with nonnegative edge weights w : E → R≥0, such that if the hierarchical clustering T* is an optimal solution of rev_G(·) and T is the hierarchical clustering returned by Average Linkage, rev_G(T) ≤ (1/2 + ε) rev_G(T*).

4 A Lower Bound on Bisecting k-means

In this section, we consider the divisive algorithm which uses the k-means objective (with k = 2) when choosing how to split clusters. Normally, the k-means objective concerns the distances between points and their cluster centers: min Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x − μ_i||². However, it is known that this can be rewritten as a sum over intra-cluster distances: min Σ_{i=1}^{k} (1/|S_i|) Σ_{x,y ∈ S_i} ||x − y||² [ABC+15]. In other words, when splitting a cluster into two sets A and B, the algorithm minimizes (1/|A|) Σ_{a,a′ ∈ A} ||a − a′||² + (1/|B|) Σ_{b,b′ ∈ B} ||b − b′||². At first glance, this appears to almost capture split-rev_G(A, B); the key difference is that the summation has been scaled down by a factor of |A||B|. Of course, it also involves minimization over squared distances instead of maximization over similarity weights. We show that the divisive algorithm which splits clusters by the natural k-means similarity objective, namely max (1/|A|) Σ_{a,a′ ∈ A} w_{a,a′} + (1/|B|) Σ_{b,b′ ∈ B} w_{b,b′}, is not a good approximation to the optimal hierarchical clustering.

Lemma 4.1. There exists a graph G = (V, E) with nonnegative edge weights w : E → R≥0, such that if the hierarchical clustering T* is a maximizer of rev_G(·) and T is the hierarchical clustering returned by the divisive algorithm which splits clusters by the k-means similarity objective, rev_G(T) ≤ (1/Ω(√n)) rev_G(T*).

Proof. The plan is to exploit the fact that k-means is optimizing an objective function which differs from the actual split revenue by a factor of |A||B|.

We use almost the same graph as in the lower bound against Average Linkage, except that the weight of the edge between u and v is √n. There are still unit weight edges between u and n/2 − 1 other nodes and unit weight edges between v and the remaining n/2 − 1 nodes. See Figure 1 for the structure of this graph. The key claim is that Divisive k-means will begin by separating u and v from all other nodes.

It is easy to see that this split scores a value of (1/2)√n under our alternate k-means objective function. Why does no other split score better? Any other split can either keep u and v together or separate them. If the split keeps the two together along with k other nodes, then it scores at most (1/(k+2))[√n + k] + 1, which is less than (1/2)√n if √n > 6. If the split separates the two, then it scores at most 2, since at best each side can be a tree of weight-one edges and hence has fewer edges than nodes.

Now that we have established our key claim, it is easy to see that Divisive k-means is done scoring on this graph, since it must next cut the edge uv and the other, larger cluster has no edges in it. Hence Divisive k-means will score √n (n − 2) on this graph.

As before, the optimal clustering may merge u with its other neighbors first and v with its other neighbors first, scoring a revenue gain of 2 [(n − 2) + (n − 3) + · · · + (n/2)] = (3/4) n² − O(n). There is an Ω(√n) gap between these revenues, completing the proof.

5 Divisive Local-Search

In this section, we develop a simple local search algorithm and bound its approximation ratio. The local search algorithm takes as input a cluster C and divides it into two clusters A and B to optimize a local objective: the split revenue. In particular, initially A = B = ∅ and each node in C is added to A or B uniformly at random. Local search is then run by moving individual nodes between A and B. In a step, a point i ∈ A (resp. B) is moved to B (resp. A) if the move strictly increases the local objective |B| Σ_{a,a′ ∈ A} w_{a,a′} + |A| Σ_{b,b′ ∈ B} w_{b,b′}. This states that a point is moved to another set if the objective increases.
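A single step of this move rule can be sketched as follows; this is a minimal illustration under our own assumptions, and every name in it is ours rather than the paper's:

```python
# One improving-move check for the divisive local search: moving a node is
# accepted iff it strictly increases |B| * (weight inside A) + |A| * (weight inside B).
from itertools import combinations

def intra(w, S):
    """Sum of w over unordered pairs inside S."""
    return sum(w[a][b] for a, b in combinations(sorted(S), 2))

def split_objective(w, A, B):
    return len(B) * intra(w, A) + len(A) * intra(w, B)

def improving_move(w, A, B, i):
    """True iff moving i (currently in A) over to B strictly increases the objective."""
    return split_objective(w, A - {i}, B | {i}) > split_objective(w, A, B)

# Points 0,1 are very similar (weight 9); so are 2,3 (weight 8); cross pairs weight 1.
w = [[0, 9, 1, 1],
     [9, 0, 1, 1],
     [1, 1, 0, 8],
     [1, 1, 8, 0]]
A, B = {0, 2}, {1, 3}             # a bad random split separating the similar pairs
assert improving_move(w, A, B, 2)  # moving 2 next to 3 raises the objective
```

Local search repeats such checks until no single-node move improves the objective; the analysis below bounds the revenue of any split at which this process stops.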
The\nalgorithm performs these local moves until there is no node that can be moved to improve the\nobjective.\nData: Vertices V , weights w : E \u2192 R\u22650\nInitialize clusters C \u2190 {V };\nwhile some cluster C \u2208 C has more than one vertex do\nLet A, B be a uniformly random 2-partition of C;\n\nj,l\u2208A;j,l(cid:54)=i wj,l + (|A| \u2212 1)(cid:80)\nj\u2208A wi,j >(cid:80)\n\nj,l\u2208B;j,l(cid:54)=i wj,l + (|B| \u2212 1)(cid:80)\n\nRun local search on A, B to maximize |B|(cid:80)\n\na,a(cid:48)\u2208A waa(cid:48) + |A|(cid:80)\n\nb,b(cid:48)\u2208B wbb(cid:48), considering just\n\nmoving a single node;\nSet C \u2190 C \u222a {A, B} \\ {C};\n\nend\n\nAlgorithm 2: Divisive Local-Search\n\n1\n\n1\n\n3 revG(T \u2217).\n\n3 approximation.\n\nij wij, it suf\ufb01ces to show that revG(T ) \u2265\n\nij wij. We do this by considering possible local moves at every step.\n\nIn the following theorem, we show that the algorithm is arbitrarily close to a 1\nTheorem 5.1. Consider a graph G = (V, E) with nonnegative edge weights w : E \u2192 R\u22650. Let the\nhierarchical clustering T \u2217 be the optimal solution of revG(\u00b7) and let T be the hierarchical clustering\nreturned by Algorithm 2. Then, revG(T ) \u2265 (n\u22126)\n(n\u22122)\n\nConsider any step of the algorithm and suppose the algorithm decides to partition a cluster into\na,a(cid:48)\u2208A waa(cid:48) +\nb,b(cid:48)\u2208B wbb(cid:48). Assume without loss of generality that |B| \u2265 |A|, and consider the expected local\nsearch objective OBJ(cid:48) value for moving a random node from B to A. Note that the new local search\nobjective value is at most what the algorithm obtained, i.e. OBJ(cid:48) \u2264 OBJ:\n\nProof. Since we know that revG(T \u2217) \u2264 (n \u2212 2)(cid:80)\n3 (n \u2212 2)(cid:80)\nA, B. 
As stated in the algorithm, its local search objective value is OBJ = |B|(cid:80)\n|A|(cid:80)\n\uf8ee\uf8f0(cid:0)|B|\u22121\n(cid:1)\n(cid:1) (cid:88)\n(cid:0)|B|\n\uf8ee\uf8f0|B| \u2212 2\n(cid:88)\n\uf8ee\uf8f0(1 \u2212 2\n(cid:88)\n(cid:88)\n\n\uf8ee\uf8f0 (cid:88)\n\uf8ee\uf8f0 (cid:88)\n(cid:88)\n= OBJ \u2212 (cid:88)\n\n\uf8f9\uf8fb + (|A| + 1)\n\uf8f9\uf8fb + (|A| + 1)\n\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n\nwab + (|A| + 1)\nwab + (\u2212 2|A|\n\nE[OBJ(cid:48)] = (|B| \u2212 1)\n\n= (|B| \u2212 1)\n\n1\n|B|\n\n1\n|B|\n\n= (|B| \u2212 1)\n\n|B| )\n\nb,b(cid:48)\u2208B\n\nwbb(cid:48)\n\nwab\n\nwab\n\n\uf8f9\uf8fb\n\uf8f9\uf8fb\n\n2\n\n2\n\nwbb(cid:48)\n\nb,b(cid:48)\u2208B\n\na\u2208A,b\u2208B\n\na\u2208A,b\u2208B\n\n|B|\n\nwbb(cid:48)\n\nb,b(cid:48)\u2208B\n\nwaa(cid:48) +\n\na,a(cid:48)\u2208A\n\nwaa(cid:48) +\n\na,a(cid:48)\u2208A\n\na\u2208A,b\u2208B\n\nwaa(cid:48) +\n\na,a(cid:48)\u2208A\n\n\uf8f9\uf8fb\n\n|B| \u2212 1\n|B|\n|B| \u2212 1\n|B|\n\nwaa(cid:48) +\n\na,a(cid:48)\u2208A\n\na\u2208A,b\u2208B\n\n|B| + 1 \u2212 2\n|B| )\n\nwbb(cid:48)\n\nb,b(cid:48)\u2208B\n\n7\n\n\fBut since there are no improving moves we know the following.\n\n0 \u2265 E[OBJ(cid:48)] \u2212 OBJ = \u2212 (cid:88)\n\n(cid:88)\n\nwaa(cid:48) +\n\n|B| \u2212 1\n|B|\n\na,a(cid:48)\u2208A\n\na\u2208A,b\u2208B\n\nwab \u2212 2|A| \u2212 |B| + 2\n\n|B|\n\n(cid:88)\n\nb,b(cid:48)\u2208B\n\nwbb(cid:48)\n\nRearranging terms and multiplying by |B| yields the following.\n\n(cid:88)\n\nwab \u2264 |B| (cid:88)\n\na\u2208A,b\u2208B\n\na,a(cid:48)\u2208A\n\n(|B| \u2212 1)\n\nwaa(cid:48) + (2|A| \u2212 |B| + 2)\n\n(cid:88)\n\nb,b(cid:48)\u2208B\n\nwbb(cid:48)\n\nWe now consider three cases. Either (i) |B| \u2265 |A| + 2, (ii) |B| = |A| + 1, or (iii) |B| = |A|. 
Case (i)\nis straightforward:\n\n(cid:19)\n\n(cid:18) |B| \u2212 1\n\n|A| + |B|\n\nsplit-costG(A, B) \u2264 split-revG(A, B)\nsplit-costG(A, B) \u2264 split-revG(A, B)\n\n1\n2\n\nIn case (ii), we use the fact that (x + 2)(x \u2212 2) \u2264 (x + 1)(x \u2212 1) to simplify:\n\n|A| + |B|\n\n(cid:18) |B| \u2212 1\n(cid:18) |B| \u2212 1\n(cid:19)(cid:18) |B| \u2212 1\n(cid:18)|B| + 1\n(cid:18) |B| \u2212 2\n(cid:18) 1\n\n|B| + 2\n\n1.5\n\n|A| + |B|\n\n|A| + |B|\n\n|A| + |B|\n\n(cid:19)\n(cid:19)\n(cid:19)\n(cid:19)\n(cid:19)\n\n\u2212\n\n2\n\n|A| + |B|\n\n(cid:18)|A| + 1\n(cid:18)|B| + 2\n\n|A|\n\n(cid:19)\n(cid:19)\n\n|B| + 1\n\nsplit-costG(A, B) \u2264\nsplit-costG(A, B) \u2264\nsplit-costG(A, B) \u2264 split-revG(A, B)\nsplit-costG(A, B) \u2264 split-revG(A, B)\nsplit-costG(A, B) \u2264 split-revG(A, B)\n\nsplit-revG(A, B)\n\nsplit-revG(A, B)\n\nCase (iii) proceeds similarly; we now use the fact that (x + 2)(x \u2212 3) \u2264 (x)(x \u2212 1) to simplify:\n\n|A| + |B|\n\n(cid:18) |B| \u2212 1\n(cid:18) |B| \u2212 1\n(cid:18) |B|\n(cid:19)(cid:18) |B| \u2212 1\n(cid:18) |B| \u2212 3\n(cid:18) 1\n\n|B| + 2\n\n|A| + |B|\n\n|A| + |B|\n\n|A| + |B|\n\n3\n\n(cid:19)\n(cid:19)\n(cid:19)\n(cid:19)\n(cid:19)\n\n\u2212\n\n2\n\n|A| + |B|\n\n(cid:18)|A| + 2\n(cid:18)|B| + 2\n\n|A|\n\n(cid:19)\n(cid:19)\n\n|B|\n\nsplit-costG(A, B) \u2264\nsplit-costG(A, B) \u2264\nsplit-costG(A, B) \u2264 split-revG(A, B)\nsplit-costG(A, B) \u2264 split-revG(A, B)\nsplit-costG(A, B) \u2264 split-revG(A, B)\n\nsplit-revG(A, B)\n\nsplit-revG(A, B)\n\n8\n\n\fHence we have shown that for each step of our algorithm, the split revenue is at least ( 1\ntimes the split cost. 
We rewrite this inequality and then sum over all iterations:
\begin{align*}
\mathrm{split\text{-}rev}_G(A,B) &\ge \frac{1}{2}\,\mathrm{split\text{-}cost}_G(A,B) - 3\sum_{a \in A, b \in B} w_{ab} \\
\mathrm{rev}_G(T) &\ge \frac{1}{2}\,\mathrm{cost}_G(T) - 3\sum_{i,j \in [n]} w_{ij} \\
&= \frac{1}{2}\left(n\sum_{i,j \in [n]} w_{ij} - \mathrm{rev}_G(T)\right) - 3\sum_{i,j \in [n]} w_{ij} \\
\frac{3}{2}\,\mathrm{rev}_G(T) &\ge \frac{n-6}{2}\sum_{i,j \in [n]} w_{ij} \\
\mathrm{rev}_G(T) &\ge \frac{n-6}{3}\sum_{i,j \in [n]} w_{ij}
\end{align*}
This is what we wanted to prove.

We note that it is possible to improve the loss in terms of $n$ to $\frac{n-4}{n-2}$ by instead considering the local search objective $(|B|-1)\sum_{a,a' \in A} w_{aa'} + (|A|-1)\sum_{b,b' \in B} w_{bb'}$.

6 Conclusion

One purpose of developing an analytic framework for problems is that it can help clarify and explain our observations from practice. In this case, we have shown that average linkage is a 1/3-approximation to a particular objective function, and the analysis that does so helps explain what average linkage is optimizing. There is much more to explore in this direction. Are there other objective functions which characterize other hierarchical clustering algorithms? For example, what are bisecting k-means, single-linkage, and complete-linkage optimizing for?

An analytic framework can also serve to guide development of new algorithms. How well can this dual objective be approximated? For example, we suspect that average linkage is actually a constant approximation strictly better than 1/3. Could a smarter algorithm break the 1/2 threshold? Perhaps the 1/2 threshold is due to a family of graphs which we do not expect to see in practice. Is there a natural input restriction that would allow for better guarantees?

References

[AB16] Margareta Ackerman and Shai Ben-David.
A characterization of linkage-based hierarchical clustering. Journal of Machine Learning Research, 17:232:1–232:17, 2016.

[ABBL12] Margareta Ackerman, Shai Ben-David, Simina Brânzei, and David Loker. Weighted clustering. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, Toronto, Ontario, Canada, 2012.

[ABC+15] Pranjal Awasthi, Afonso S. Bandeira, Moses Charikar, Ravishankar Krishnaswamy, Soledad Villar, and Rachel Ward. Relax, no need to round: Integrality of clustering formulations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 191–200. ACM, 2015.

[ARV09] Sanjeev Arora, Satish Rao, and Umesh V. Vazirani. Expander flows, geometric embeddings and graph partitioning. J. ACM, 56(2):5:1–5:37, 2009.

[BA08] Shai Ben-David and Margareta Ackerman. Measures of clustering quality: A working set of axioms for clustering. In Advances in Neural Information Processing Systems 21, Vancouver, British Columbia, Canada, pages 121–128, 2008.

[CC17] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, pages 841–854, 2017.

[CKMM17] Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. CoRR, abs/1704.02147, 2017.

[Das16] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, pages 118–127, 2016.

[HG05] Katherine A. Heller and Zoubin Ghahramani.
Bayesian hierarchical clustering. In Proceedings of the Twenty-Second International Conference on Machine Learning, ICML 2005, Bonn, Germany, pages 297–304, 2005.

[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised Learning, pages 485–585. Springer New York, New York, NY, 2009.

[Jai10] Anil K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

[KBXS12] Akshay Krishnamurthy, Sivaraman Balakrishnan, Min Xu, and Aarti Singh. Efficient active algorithms for hierarchical clustering. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, 2012.

[MC12] Fionn Murtagh and Pedro Contreras. Algorithms for hierarchical clustering: an overview. Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery, 2(1):86–97, 2012.

[RP16] Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. In Advances in Neural Information Processing Systems 29, Barcelona, Spain, pages 2316–2324, 2016.

[ZB09] Reza Zadeh and Shai Ben-David. A uniqueness theorem for clustering. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2009, Montreal, QC, Canada, pages 639–646, 2009.