{"title": "Subquadratic High-Dimensional Hierarchical Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 11580, "page_last": 11590, "abstract": "We consider the widely-used average-linkage, single-linkage, and Ward's methods for\n computing hierarchical clusterings of high-dimensional Euclidean inputs.\n It is easy to show that there is no efficient implementation of these algorithms\n in high dimensional Euclidean space since it implicitly requires to solve the closest\n pair problem, a notoriously difficult problem.\n\n However, how fast can these algorithms be implemented if we allow approximation?\n More precisely: these algorithms successively merge the clusters that are at closest\n average (for average-linkage), minimum distance (for single-linkage), or inducing the least sum-of-square error (for Ward's). We ask whether one could obtain a significant running-time improvement if the algorithm can merge $\\gamma$-approximate closest clusters (namely, clusters that are at distance (average, minimum, or sum-of-square error) at most $\\gamma$ times the distance of the closest clusters). 
\n\n We show that one can indeed take advantage of the relaxation and compute the approximate hierarchical clustering tree using $\\widetilde{O}(n)$ $\\gamma$-approximate nearest neighbor queries.\n This leads to an algorithm running in time $\\widetilde{O}(nd) + n^{1+O(1/\\gamma)}$ for $d$-dimensional Euclidean space.\n We then provide experiments showing that these algorithms perform as well as the non-approximate version for classic classification tasks while achieving a significant speed-up.", "full_text": "Subquadratic High-Dimensional Hierarchical\n\nClustering\n\nAmir Abboud\nIBM Research\n\namir.abboud@gmail.com\n\nVincent Cohen-Addad\n\nCNRS & Sorbonne Universit\u00b4e\n\nvcohenad@gmail.com\n\nHussein Houdrouge\n\u00b4Ecole Polytechnique\n\nhussein.houdrouge@polytechnique.edu\n\nAbstract\n\nWe consider the widely-used average-linkage, single-linkage, and Ward\u2019s methods\nfor computing hierarchical clusterings of high-dimensional Euclidean inputs. It is\neasy to show that there is no ef\ufb01cient implementation of these algorithms in high\ndimensional Euclidean space since it implicitly requires to solve the closest pair\nproblem, a notoriously dif\ufb01cult problem.\nHowever, how fast can these algorithms be implemented if we allow approxima-\ntion? More precisely: these algorithms successively merge the clusters that are\nat closest average (for average-linkage), minimum distance (for single-linkage),\nor inducing the least sum-of-square error (for Ward\u2019s). We ask whether one\ncould obtain a signi\ufb01cant running-time improvement if the algorithm can merge\n\u03b3-approximate closest clusters (namely, clusters that are at distance (average, min-\nimum, or sum-of-square error) at most \u03b3 times the distance of the closest clusters).\nWe show that one can indeed take advantage of the relaxation and compute the\n\napproximate hierarchical clustering tree using rOpnq \u03b3-approximate nearest neigh-\nbor queries. 
This leads to an algorithm running in time $\widetilde{O}(nd) + n^{1+O(1/\gamma)}$ for $d$-dimensional Euclidean space. We then provide experiments showing that these algorithms perform as well as the non-approximate versions on classic classification tasks while achieving a significant speed-up.

1 Introduction

Hierarchical Clustering (HC) is a ubiquitous task in data science. Given a data set of $n$ points with some similarity or distance function over them, the goal is to group similar points together into clusters, and then recursively group similar clusters into larger clusters. The clusters produced throughout the procedure can be thought of as a hierarchy or a tree with the data points at the leaves, where each internal node corresponds to the cluster containing the points in its subtree. This tree is often referred to as a "dendrogram" and is an important illustrative aid in many settings. By inspecting the tree at different levels we get partitions of the data points at varying degrees of granularity. Famous applications are in image and text classification [39], community detection [28], finance [40], and in biology [8, 19].
Perhaps the most popular procedures for HC are Single-Linkage, Average-Linkage, and Ward's method. These are so-called agglomerative HC algorithms (as opposed to divisive) since they proceed in a bottom-up fashion: in the beginning, each data point is in its own cluster, and then the

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

most similar clusters are iteratively merged (creating a larger cluster that contains the union of the points from the two smaller clusters) until all points are in the same, final cluster.
The difference between the procedures is in their notion of similarity between clusters, which determines the choice of clusters to be merged.
In Single-Linkage the distance (or dissimilarity) between two clusters is defined as the minimum distance between any two points, one from each cluster. In Average-Linkage we take the average instead of the minimum, and in Ward's method we take the error sum-of-squares (ESS). It is widely accepted that Single-Linkage enjoys implementations that are somewhat simpler and faster than Average-Linkage and Ward's, but the results of the latter two are often more meaningful. This is because Single-Linkage's notion of distance is too sensitive: a meaningless "chain" in the data can sabotage the resulting clustering. Extensive discussions of these procedures can be found in many books (e.g. [21, 28, 37, 1]), surveys (e.g. [31, 32, 9]), and experimental studies (e.g. [34]).
All of these procedures can be performed in nearly quadratic time, and the main question studied by this paper is whether we can reduce the time complexity to subquadratic. The standard quadratic algorithm for Single-Linkage is quite simple and can be described as follows. After computing the $n \times n$ distance matrix of the points, we find a minimum spanning tree (MST). This first stage takes $O(n^2 d)$ time if the points are in $d$-dimensional Euclidean space. In the second stage we perform merging iterations, in which the clusters correspond to connected subgraphs of the MST (initially, each point is its own subgraph). We merge the two subgraphs connected by the smallest MST edge. By the properties of the MST, the edge between two subgraphs (clusters) has weight exactly the minimum distance between them. This second stage can be done with $O(n)$ insertions, deletions, and minimum queries to a data structure, which can be done in near-linear time.
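The two-stage procedure above can be sketched in a few lines of Python (a toy illustration, not an implementation from the paper): Kruskal's MST algorithm with a union-find structure already processes edges in increasing weight order, so the sequence of unions it performs is exactly the single-linkage merge order.

```python
import itertools
import math

def single_linkage_via_mst(points):
    """Toy single-linkage HC via Kruskal's MST algorithm.

    Kruskal processes edges by increasing weight, so the unions it
    performs are exactly the single-linkage merges. Quadratic here;
    the point is only the reduction from HC to MST.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(x):  # union-find root with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []  # (distance, root_a, root_b), in merge order
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges.append((w, ri, rj))
    return merges
```

On the 1-dimensional points 0, 1, 3, 7 this reports merges at distances 1, 2, and 4, which are exactly the MST edges between the successive clusters.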
The algorithms for Average-Linkage and Ward's are more complicated, since the MST edge between two clusters can be arbitrarily smaller than the average distance or the ESS between them, and we must consider all pairwise distances in clusters that quickly become very large. Nonetheless, an $O(n^2 \log n)$ algorithm (following a first stage of computing the distance matrix) has been known for many decades [31].
Can we possibly beat quadratic time? It is often claimed (informally) that $\Omega(n^2)$ is a lower bound because of the first stage: it seems necessary to compute the distance matrix of the points, whose size is already quadratic. More formally, we observe that these procedures are at least as hard as finding the closest pair among the set of points, since the very first pair to be merged is the closest pair. And indeed, under plausible complexity-theoretic assumptions^1, there is an almost-quadratic $n^{2-o(1)}$ lower bound for the closest pair problem in Euclidean space with dimension $d = \omega(\log n)$ [2, 26]. This gives a quadratic conditional lower bound for all three of Single-Linkage, Average-Linkage, and Ward's method.
Achieving subquadratic runtime has been of interest for many decades (as can be deduced from the survey of Murtagh [31]) and it is increasingly desirable in the era of big data. (See also the recent work on quadratic vs. subquadratic complexity of Empirical Risk Minimization problems [5].)
In this work, we focus on worst-case guarantees while allowing for small approximation in the answers: how fast can we perform these procedures if each iteration is allowed to pick an approximately best pair to merge? More precisely, when merging two clusters the algorithm is allowed to do the following.
If the best pair of (available) clusters has (minimum, average, or ESS) distance $d$, then the algorithm can choose any pair of clusters whose distance is between $d$ and $\gamma \cdot d$, where $\gamma \ge 1$ is a small constant.
When approximations are allowed the time complexity of closest pair drops, and so does the conditional lower bound. Even in high dimensions, Locality Sensitive Hashing techniques can find the $\gamma$-approximate nearest neighbors (ANN) in $L_1$-distance with $n^{O(1/\gamma)}$ time per query [3, 4]. This gives a subquadratic $n^{1+O(1/\gamma)}$ algorithm for closest pair^2, but can we achieve the same speed-up for $\gamma$-approximate Average-Linkage? Namely, can we do Average-Linkage as fast as performing $\widetilde{O}(n)$ (approximate) nearest-neighbor queries?

^1 These lower bounds hold under the Strong Exponential Time Hypothesis of Impagliazzo and Paturi [23, 24] regarding the complexity of k-SAT.
^2 On the negative side, we know that a $(1+\varepsilon)$ approximation requires quadratic time [36].

For the simpler $\gamma$-approximate Single-Linkage it is rather easy to see that the answer is yes. This essentially follows from the classical Nearest Neighbor Chain algorithm for HC [31]. Here is a simple way to see why subquadratic is possible in this case: the idea is to replace the expensive first stage of the Single-Linkage algorithm (described above) with an approximate MST computation, which can be done in subquadratic time [7, 22] using ANN queries. Then we continue to perform the second stage of the algorithm with this tree.
Still, it is of great interest to speed up the Average-Linkage and Ward's algorithms since they typically give more meaningful results. This is much harder, and before this work no subquadratic-time algorithm for Average-Linkage or Ward's with provable guarantees was known. Various algorithms and heuristics have been proposed, see e.g.
[38, 20, 27, 33, 25, 41], that beat quadratic time by either making assumptions on the data or by changing the merging criteria altogether. Intuitively, while in Single-Linkage only $O(n)$ distances suffice for the entire computation (the distances in the MST), it is far from clear why this would be true for Average-Linkage and Ward's.

1.1 Our Contribution

Our main results are a $\gamma$-approximate Ward's algorithm and a $\gamma$-approximate Average-Linkage algorithm that run in subquadratic $\widetilde{O}(n^{1+O(1/\gamma)} + nd)$ time, for any $\gamma > 1$, when the points are in $d$-dimensional Euclidean space. Moreover, our algorithms are reductions to $\widetilde{O}(n)$ approximate nearest neighbor queries in dimension $\widetilde{O}(d)$ with $L_2$ distance squared (for Ward's) or $L_1$ distance (for Average-Linkage). Thus, further improvements in ANN algorithms imply faster approximate HC, and, more importantly, one can use optimized ANN libraries to speed up our algorithm in a black-box way. In fact, this is what we do to produce our experimental results. Our theorems are as follows.

Theorem 1.1. Given a set of $n$ points in $\mathbb{R}^d$ and a $\sqrt{\gamma}$-Approximate Nearest Neighbor data structure which supports insertions, deletions, and queries in time $T$, there exists a $\gamma(1+\varepsilon)$-approximation of Ward's method running in time $O(n \cdot T \cdot \varepsilon^{-2} \log(\Delta n) \log n)$, where $\Delta$ is the aspect ratio of the point set.

Theorem 1.2. Given a set of $n$ points in $\mathbb{R}^d$ and a data structure for $\gamma$-Approximate Nearest Neighbor under the $L_1$-norm which supports insertions, deletions, and queries in time $T$, there exists a $\gamma(1+\varepsilon)$-approximation of Average-Linkage running in time $n \cdot T \cdot \varepsilon^{-2} \log^{O(1)}(\Delta n)$, where $\Delta$ is the aspect ratio of the point set.

Our algorithm for approximating Ward's method is very simple: we follow Ward's algorithm and iteratively merge clusters.
To do so efficiently, we maintain the list of centroids of the current clusters and perform approximate nearest neighbor queries on the centroids to find the closest clusters. Of course, this may not be enough, since some clusters may be of very large size compared to others, and this has to be taken into account in order to obtain a $\gamma$-approximation. We thus partition the centroids of the clusters into buckets that represent the approximate sizes of the corresponding clusters, and maintain an approximate nearest neighbor data structure for each bucket. Then, given a cluster $C$, we identify its closest neighbor (in terms of Ward's objective) by performing an approximate nearest neighbor query on the centroid of $C$ in each bucket and returning the best answer.
Our algorithm for Average-Linkage is slightly more involved. It adapts the standard Average-Linkage algorithm, with a careful sampling scheme that picks out representatives for each large cluster, and a strategic policy for when to recompute nearest neighbor information. The later sections of this paper are dedicated to explaining the algorithm. Implementation-wise it is on the same order of complexity as the standard Average-Linkage algorithm (assuming a nearest neighbor data structure is used as a black box), while efficiency-wise it is significantly better, as it goes below quadratic time. The gains increase, in a controlled way, as we increase the tolerance for error.
We focus our empirical analysis on Ward's method. We show that even for a set of parameters inducing very loose approximation guarantees, the hierarchical clustering tree output by our algorithm is as good as the tree produced by Ward's method in terms of classification quality on most of several classic datasets.
On the other hand, we show that even for moderately large datasets, e.g. sets of 20,000 points in 20 dimensions, our algorithm offers a speed-up of 2.5 over the popular scikit-learn implementation of Ward's method.

1.2 Related Works

A related but orthogonal approach to ours was taken by a recent paper [14]. The authors design an agglomerative hierarchical clustering algorithm, also using LSH techniques, that at each step, with constant probability, performs the merge that average-linkage would have done. However, with constant probability, the merge done by their algorithm is arbitrary, and there is no guarantee on the quality of the merge (in terms of the average distance between the merged clusters compared to the closest pair of clusters). We believe that our approach may be more robust since we have a guarantee on the quality of every merge, which is the crux of our algorithms. Moreover, they only consider Average-Linkage but not Ward's method.
Strengthening the theoretical foundations for HC has always been of interest. Recently, an influential paper of Dasgupta [17] pointed to the lack of a well-defined objective function that HC algorithms try to optimize and proposed one such function. Follow-up works showed that Average-Linkage achieves a constant-factor approximation to (the dual of) this function [16, 29], and also proposed new polynomial-time HC algorithms, for both worst-case and beyond-worst-case scenarios, that can achieve better approximation factors [35, 10, 15, 11, 12]. Other theoretical works prove that Average-Linkage can reproduce a "correct" clustering, under some stability assumptions on the data [6]. Our work takes a different approach.
Rather than studying the reasons for the widespread empirical findings of the utility of HC algorithms (mainly Average-Linkage and Ward's), we take it as a given and ask: how fast can we produce results that are as close as possible to the output of Average-Linkage and Ward's? In some sense, the objective function we try to optimize is closeness to whatever Average-Linkage or Ward's produce.

1.3 On our Notion of Approximation

The approximate Average-Linkage notion that we define ($\gamma$-AL) guarantees that at every step, the merged pair is $\gamma$-close to the best one. But can we prove any guarantees on the quality of the final tree? Will it be "close" to the output of (exact) AL? (The same applies to Ward's, but let us focus on AL in this subsection.)
One approach is to look at certain objective functions that measure the quality of a hierarchical clustering tree, such as the ones mentioned above ([16, 29], and [17] for similarity graphs), and compare the guarantees of AL and of our $\gamma$-AL w.r.t. these objective functions. It is likely that one can prove that $\gamma$-AL is guaranteed to give a solution that is no worse than an $O(\gamma)$ factor from the guarantees of (exact) AL w.r.t. these objective functions. However, such a theorem may not have much value because (as shown by Charikar et al. [11]) the guarantees of AL are no better than those of a random recursive partitioning of the dataset. Therefore, such a theorem would only prove that $\gamma$-AL is not-much-worse than random, which dramatically understates the quality of $\gamma$-AL.
In fact, in our experiments with a standard classification task, $\gamma$-Ward's is very close to Ward's and is much better than random (random has a $1/k$ success rate, which is 0.1 or less in the case of Digits, while ours achieves 0.5 to 0.8).
Another approach would be to prove theorems pertaining to an objective function for HC offering the guarantee that, given two trees, if their costs are close then the structures of their HCs are similar. Unfortunately, we are not aware of any such objective function (this is also the case for flat clusterings such as k-median, k-means, etc.). In particular, with the functions of [16, 29], the trees output by AL and by a random recursive partitioning have the same cost, while their structures may be very different.
Besides the empirical evidence, let us mention two more points in support of our algorithms. First, our algorithms are essentially reductions to Approximate Nearest Neighbor (ANN) queries, and ANN queries (using LSH for example) perform very well in practice. In fact, on real-world inputs, the algorithm often identifies the exact nearest neighbor and then performs the same merge as AL. Second, we can provide a theoretical analysis of the following form in support of $\gamma$-AL. It is known that if the input data is an ultrametric, then AL (and also Single-Linkage or Complete-Linkage) recovers the underlying ultrametric tree (see e.g. [16]). Now, assume that the ultrametric is clear, in the sense that if $d(a,b) > d(a,c)$ then $d(a,b) > \gamma \, d(a,c)$ for some constant $\gamma$. In this case, our algorithm will provably recover the ultrametric in $n^{1+O(1/\gamma)}$ time, whereas AL would need $\Omega(n^2)$ time.
Notably, in this setting, obtaining an $O(1)$-approximation w.r.t. the objective functions of [16, 29] does not mean that the solution is close to the ultrametric tree.

2 A $\gamma$-Approximation of Ward's Method

2.1 Preliminaries

Let $P \subset \mathbb{R}^d$ be a set of $n$ points. Up to rescaling distances, we may assume that the minimum distance between any pair of points is 1. Let $\Delta$ denote the aspect ratio of $P$, namely $\Delta = \max_{u,v \in P} \mathrm{dist}(u,v)$. Let $\gamma > 1$ be a fixed parameter. Our goal is to build a $\gamma$-approximation of Ward's hierarchical clustering.
Let $C$ be a cluster; define its error sum-of-squares as
$$\mathrm{ESS}(C) = \sum_{x \in C} (x - \mu(C))^T (x - \mu(C)),$$
where $\mu(C) = \frac{1}{|C|} \sum_{x \in C} x$. We let the error sum-of-squares of a clustering $\mathcal{C} = \{C_1, \ldots, C_\ell\}$ be
$$\mathrm{ESS}(\mathcal{C}) = \sum_{C \in \mathcal{C}} \mathrm{ESS}(C).$$
Ward's algorithm constructs a hierarchy of clusters where each level represents a clustering of the points and where the clusters at a given level $\ell$ are subsets of the clusters of level $\ell + 1$. Ward's algorithm builds this hierarchy in a bottom-up fashion, starting from $n$ clusters (each point is itself a cluster). Then, given the clustering of a given level $\ell$, Ward's algorithm obtains the clustering of the next level by merging the two clusters that yield the clustering of minimal ESS. More formally, consider a clustering $\mathcal{C} = \{C_1, \ldots, C_\ell\}$. To find the clustering of minimum ESS obtained by merging a pair of clusters of $\mathcal{C}$, it is enough to minimize the increase in ESS induced by the merge. Therefore, we want to identify the clusters $C_i, C_j$ that minimize the following quantity:
$$\Delta \mathrm{ESS}(C_i, C_j) = \frac{|C_i| \, |C_j|}{|C_i| + |C_j|} \, \lVert \mu(C_i) - \mu(C_j) \rVert_2^2. \qquad (1)$$
We will also make use of the following fact.
Fact 1.
Given two sets of points $A, B$ with corresponding centroids $\mu(A), \mu(B)$ respectively, the centroid of $A \cup B$ lies on the segment joining $\mu(A)$ to $\mu(B)$, at distance $\frac{|B|}{|A \cup B|} \lVert \mu(A) - \mu(B) \rVert_2$ from $\mu(A)$.
Let $\gamma > 0$ be a parameter and $P$ a set of points in $\mathbb{R}^d$. Let $D$ be a data structure that, for any set $P$ of $n$ points in $\mathbb{R}^d$ where $d = O(\log n)$, supports the following operations: insertion of a point into $P$ in time $O(n^{f(\gamma)})$, for some function $f$; deletion of a point from $P$ in time $O(n^{f(\gamma)})$; and, given a point $p \in P$, returning a point inserted into the data structure at $L_2$-distance at most $\gamma$ times the distance from $p$ to the closest point inserted into the data structure, in time $O(n^{f(\gamma)})$.
There are data structures based on locality sensitive hashing achieving $f(\gamma) = 1 + O(1/\gamma^2)$; see for example [4].

2.1.1 Finding the Nearest Neighbour Cluster

Our algorithm relies on a nearest neighbour data structure for clusters, where the distance between two clusters $A, B$ is given by $\mathrm{ESS}(A \cup B) - \mathrm{ESS}(A) - \mathrm{ESS}(B)$. Given a parameter $\varepsilon > 0$, our nearest neighbour data structure $D(\gamma, \varepsilon)$ for clusters consists of $O(\varepsilon^{-1} \log n)$ nearest neighbour data structures for points with error parameter $\sqrt{\gamma}$, defined as follows. There is a data structure $D_\ell$ for each $\ell \in \{(1+\varepsilon)^i \mid i \in [1, \ldots, \log_{1+\varepsilon} n]\}$. The data structure works as follows.
Insertion(C): a cluster on a point set $C$ is inserted by inserting $\mu(C)$ into the $D_i$ such that $(1+\varepsilon)^{i-1} \le |C| < (1+\varepsilon)^i$.
Query(C): for each $i \in \{(1+\varepsilon)^i \mid i \in [1, \ldots, \log_{1+\varepsilon} n]\}$, perform a nearest neighbor query for $\mu(C)$ in $D_i$ and let $NN_i(C)$ be the result. Output the $NN_i(C)$ that minimizes $\Delta \mathrm{ESS}(C, NN_i(C))$.
The proof of the following lemma is in the appendix.
Lemma 2.1.
For any $\varepsilon > 0$, the above nearest neighbour data structure for clusters with parameters $\gamma, \varepsilon$, $D(\gamma, \varepsilon)$, has the following properties:

• The insertion time is $O(n^{f(\sqrt{\gamma})} \varepsilon^{-1} \log n)$;
• On Query(C), it returns a cluster $C'$ such that $\mathrm{ESS}(C \cup C') - \mathrm{ESS}(C) - \mathrm{ESS}(C') \le (1+\varepsilon)\gamma \min_{B \in D(\varepsilon,\gamma)} \left( \mathrm{ESS}(C \cup B) - \mathrm{ESS}(C) - \mathrm{ESS}(B) \right)$;
• The query time is $O(n^{f(\sqrt{\gamma})} \varepsilon^{-1} \log(n\Delta))$.

2.1.2 The Main Algorithm

We define the value of merging two clusters $A, B$ as $\mathrm{ESS}(A \cup B) - \mathrm{ESS}(A) - \mathrm{ESS}(B)$. Our algorithm starts by considering each point as its own cluster, together with the nearest neighbour cluster data structure described above. Then, the algorithm creates a logarithmic number of rounded merge values that partition the range of possible merge values. Let $I$ be the sequence of all possible merge values in increasing order. Given a set of $n$ points with minimum pairwise distance 1 and maximum pairwise distance $\Delta$, the total number of merge values $\beta$ is $O(\log(n\Delta))$.
The algorithm maintains a clustering and at each step decides which two clusters of the current clustering should be merged. The clusters of the current clustering are called unmerged clusters. The algorithm iterates over all merge values in increasing order while maintaining the following invariant:
Invariant 2.2. When the algorithm reaches merge value $\delta$, for any pair of unmerged clusters $C, C'$ we have $\mathrm{ESS}(C \cup C') - \mathrm{ESS}(C) - \mathrm{ESS}(C') \ge \delta/\gamma$.
We now give a complete description of our algorithm.

1. Let $L$ be the list of unmerged clusters; initially it contains all the points.
2. For each $\nu \in I$:
   (a) ToMerge ← $L$
   (b) While ToMerge is not empty:
       i. Pick a cluster $C$ from ToMerge, and remove it from ToMerge.
       ii. $NN(C)$ ← approximate nearest neighbour cluster of $C$.
       iii.
If $\mathrm{ESS}(C \cup NN(C)) - \mathrm{ESS}(C) - \mathrm{ESS}(NN(C)) \le \nu$:
           A. Merge $C$ and $NN(C)$; let $C'$ be the resulting cluster.
           B. Remove $NN(C)$ from ToMerge and add $C'$ to ToMerge; $\mu(C')$ follows immediately from $\mu(C)$, $\mu(NN(C))$, $|C|$ and $|NN(C)|$ (see Fact 1).
           C. Remove $C$ and $NN(C)$ from $L$ and add $C'$ to $L$.

The running time analysis and proof of correctness of the algorithm are deferred to the appendix.

3 A $\gamma$-Approximation of Average-Linkage

3.1 Preliminaries

For two sets of points $A, B$, we let $\mathrm{avg}(A,B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d(a,b)$. The following simple lemma is proved in the appendix.
Lemma 3.1. Consider three sets of points $A, B, C$. We have $\mathrm{avg}(A,C) = \mathrm{avg}(C,A) \le \mathrm{avg}(A,B) + \mathrm{avg}(B,C)$.

3.2 Overview and Main Data Structures

Our goal is to design a $\gamma$-approximate Average-Linkage algorithm. The input is a set $P$ of $n$ points in $d$-dimensional Euclidean space. The algorithm starts with a clustering where each input point is in its own cluster, and then successively merges pairs of clusters. When two clusters are merged, a new cluster consisting of the union of the two merged clusters is created. The unmerged clusters at a given time of the execution of the algorithm are the clusters that have not been merged so far. More formally, at the start the set of unmerged clusters is the set of all clusters. Then, whenever
The algorithm merges clusters until all the points\nare in one cluster.\nTo be a \u03b3-approximation to Average-Linkage, our algorithm must merge clusters according to the\nfollowing rule: If the minimum average distance between a pair of unmerged clusters is v then the\nalgorithm is not allowed to merge two unmerged clusters with average distance larger than \u03b3 \u00a8 v.\nLet \u03b5 \u0105 0 and \u03b3 \u011b 1 be parameters. We will show how to use a \u03b3-approximate nearest neighbor\ndata structure (on points) to get a \u03b31-approximate Average-Linkage algorithm where \u03b31 \u201c p1` \u03b5q\u00a8 \u03b3.\nWe make use of the following key ingredients.\n\n\u2022 We design a sampling scheme that allows to choose at most poly log n points per cluster\nwhile preserving the average distance up to p1`\u03b5q-factor with probability at least 1\u00b41{n5.\n\u2022 We design a data structure that given a set of clusters, allows to answer approximate nearest\n\u2022 Finally we provide a careful scheme for the merging steps that allows to bound the number\n\nneighbor queries (on clusters) according to the average distance.\n\nof times the nearest neighbor queries for a given cluster have to be performed.\n\n3.3 The Algorithm\n\nWe are now ready to describe our algorithm. Our algorithm starts with all input points in their\nown clusters and performs a nearest neighbor query for each of them. The algorithm maintains a\npartition of the input into clusters that we call the unmerged clusters, identical to average linkage.\nThe algorithm proceeds in steps. Each step consists of merging several pairs of clusters. For each\nstep we associate a value v, which we refer to as the merge value of the step, which is a power\nof p1 ` \u03b5q and we will show the invariant that at the end of the step associated with value v, the\nunmerged clusters are at distance greater than v{pp1 ` \u03b5q2\u03b3q. 
Let $I$ be the set of all merge values. For each cluster $C$, we maintain a sample of its points obtained by applying the sampling procedure (see the supplementary material for details). To avoid recomputing a sample too often, we keep a variable $s(C)$ which records the size of the cluster the last time the sampling procedure was called.

Lazy sampling. Every time two clusters $C_1, C_2$ are merged by the algorithm to create a new cluster, the following operations are performed:

1. If $|C_1 \cup C_2| \ge (1 + \varepsilon^2/(1+\gamma)) \max(s(C_1), s(C_2))$, then the sampling procedure is called on $C_1 \cup C_2$ and an approximate nearest cluster query is performed using the nearest cluster data structure (see the supplementary material). Then, $s(C_1 \cup C_2)$ is set to $|C_1 \cup C_2|$. The resolution parameter for sampling is the value of the current step divided by $n$; namely, if the value of the current step is $v$, we set $\alpha_{C_1 \cup C_2} = v/n$ for the sampling procedure.
2. Otherwise, $s(C_1 \cup C_2)$ is set to $\max(s(C_1), s(C_2))$ and the algorithm uses the sample of $\mathrm{argmax}_{C \in \{C_1, C_2\}} |C|$ as the sample for $C_1 \cup C_2$.

Once the above has been performed, a $\gamma$-approximate nearest cluster query is performed using the sample defined for the cluster resulting from the merge.
Thus, at each step, every cluster has a $\gamma(1 + O(\varepsilon))$-approximate nearest neighbor among the clusters. We denote by $\nu_t(C)$ the approximate nearest neighbor of cluster $C$ at the $t$-th step. This approximate nearest neighbor is computed using our data structure (see the supplementary material). We let $\nu(C) = \nu_{t(C)}(C)$, where $t(C)$ is the step at which $C$ was created.

Pseudocode for our algorithm

1. Let $L$ be the list of unmerged clusters; initially it contains all the points.
2. For each $v \in I$:
   (a) ToMerge ← $L$
   (b) While ToMerge is not empty:
       i. Pick a cluster $C$ from ToMerge, and remove it from ToMerge.
       ii. $NN(C)$ ← approximate nearest neighbour cluster of $C$.
       iii.
If $\mathrm{avg}(C, NN(C)) \le v$:
           A. Merge $C$ and $NN(C)$; let $C'$ be the resulting cluster.
           B. Perform the lazy sampling procedure on $C'$ and insert it into the ANN data structure.
           C. Remove $NN(C)$ from ToMerge and add $C'$ to ToMerge.
           D. Remove $C$ and $NN(C)$ from $L$ and add $C'$ to $L$.

See the supplementary material for the proof of correctness.

4 Experiments

Our experiments focus on Ward's method and its approximation, since it is a simpler algorithm than average-linkage. We implemented our algorithm in C++11 and ran it on a 2.5 GHz 8-core CPU with 7.5 GiB of memory under the Linux operating system. Our algorithm takes a dynamic nearest neighbour data structure as a black box. In our implementation, we use the popular FLANN library [30] and our own implementation of LSH for performing approximate nearest neighbor queries. We compare our algorithm to the scikit-learn implementation of Ward's method [34], a Python library that also uses C++ in the background.
Our algorithm has different parameters for controlling the approximation factor. These parameters have a significant effect on the performance and the precision of the algorithm. The main parameter is $\varepsilon$, which determines the number of data structures to be used (recall that we have one approximate nearest neighbor data structure for each $(1+\varepsilon)^i$, representing the potential cluster sizes) and the sequence of merge values. Moreover, we make use of the FLANN library procedure for finding approximate nearest neighbors using KD-trees. This procedure takes two parameters: the number of trees $t$ and the number of leaves visited $f$. The algorithm builds $t$ randomized KD-trees over the dataset; the number-of-leaves parameter controls how many leaves of the KD-trees are visited before stopping the search and returning a solution. These parameters control the speed and precision of the nearest neighbor search.
For instance, increasing the number of leaves visited yields higher precision at the expense of a higher running time, while decreasing the number of KD-trees improves speed but decreases precision. For LSH, we use the algorithm of Datar et al. [18], which has two main parameters: H, the number of hash functions used, and r, which controls the 'collision' rate (see details in [18]).

To study the effects of these parameters, we ran experiments combining several parameter settings; we report and discuss the main results in Table 1. The data used in these experiments are classic real-world datasets from the UCI repository and the scikit-learn library: Iris contains 150 points in 4 dimensions, Digits 1797 points in 64 dimensions, Boston 506 points in 13 dimensions, Cancer 569 points in 30 dimensions, and Newsgroup 11314 points in 2241 dimensions.

To measure the speed-up achieved by our algorithm, we focus on a set of parameters that yields classification error similar to Ward's on the real-world datasets, and then run our algorithm (with these parameters) on synthetic datasets of increasing sizes. These parameters are precisely ε = 8, number of trees T = 2, and number of visited leaves L = 10. The datasets are generated using the blobs procedure of scikit-learn; they are d-dimensional for d ∈ {10, 20} and consist of between 10 000 and 20 000 points. In both dimensions, we witness a significant speed-up over the scikit-learn implementation of Ward's algorithm. Perhaps surprisingly, the speed-up is already significant for moderate-size datasets. We observe that the running time is similar for LSH and FLANN.

Acknowledgements.
This project benefited from State aid managed by the Agence Nationale de la Recherche under the FOCAL program, grant reference ANR-18-CE40-0004-01.

Algorithm                                 Iris   Cancer  Digits  Boston  Newsgroup
Ward's                                    0.67   0.46    0.82    0.80    0.146
Ward-FLANN (ε = 0.5, T = 16, L = 5)       0.62   0.53    0.79    0.80    < 0.05
Ward-FLANN (ε = 4, T = 16, L = 128)       0.76   0.47    0.56    0.78    < 0.05
Ward-FLANN (ε = 8, T = 2, L = 10)         0.75   0.51    0.47    0.80    < 0.05
Ward-LSH (ε = 10, r = 3, H = n^{1/10})    0.69   0.58    0.58    0.82    < 0.05
Ward-LSH (ε = 10, r = 3, H = n^{1/2})     0.72   0.48    0.73    0.83    0.104
Ward-LSH (ε = 2, r = 3, H = n^{1/2})      0.72   0.57    0.63    0.83    0.113

Table 1: We report the normalized mutual information score of the clustering output by the different algorithms compared to the ground-truth labels for each dataset. We note that 0.05 can be obtained on Newsgroup through a random labelling of the points (up to ±0.02). Hence LSH seems a more robust approach for implementing approx-Ward.

(a) Running time of our algorithm with parameters (ε = 8, T = 2, L = 10) (in red) and of Ward's method, on datasets of sizes ranging from 10 000 points to 20 000 points in R^10. We observe that our algorithm is more than 2.5× faster on datasets of size 20 000.

(b) Running time of our algorithm with parameters (ε = 8, T = 2, L = 10) (in red) and of Ward's method, on datasets of sizes ranging from 10 000 points to 20 000 points in R^20. We observe that our algorithm is more than 2.5× faster on datasets of size 20 000. Interestingly, it seems that the dimension has little influence on both our algorithm and Ward's method.

References

[1] James Abello, Panos M. Pardalos, and Mauricio G. C. Resende. Handbook of Massive Data Sets, volume 4. Springer, 2013.

[2] Josh Alman and Ryan Williams. Probabilistic polynomials and Hamming nearest neighbors. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 136–150, 2015.

[3] Alexandr Andoni, Piotr Indyk, Huy L. Nguyen, and Ilya Razenshteyn. Beyond locality-sensitive hashing. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1018–1028. SIAM, 2014.

[4] Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 793–801. ACM, 2015.

[5] Arturs Backurs, Piotr Indyk, and Ludwig Schmidt. On the fine-grained complexity of empirical risk minimization: Kernel methods and neural networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4311–4321, 2017.

[6] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 671–680. ACM, 2008.

[7] Allan Borodin, Rafail Ostrovsky, and Yuval Rabani. Subquadratic approximation algorithms for clustering problems in high dimensional spaces. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, pages 435–444. ACM, 1999.

[8] Peter Breyne and Marc Zabeau. Genome-wide expression analysis of plant cell cycle modulated genes. Current Opinion in Plant Biology, 4(2):136–142, 2001.

[9] Gunnar Carlsson and Facundo Mémoli.
Characterization, stability and convergence of hierarchical clustering methods. Journal of Machine Learning Research, 11(Apr):1425–1470, 2010.

[10] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 841–854. SIAM, 2017.

[11] Moses Charikar, Vaggos Chatziafratis, and Rad Niazadeh. Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2291–2304. SIAM, 2019.

[12] Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, and Grigory Yaroslavtsev. Hierarchical clustering for Euclidean data. arXiv preprint arXiv:1812.10582, 2018.

[13] Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.

[14] Michael Cochez and Hao Mou. Twister tries: Approximate hierarchical agglomerative clustering for average distance in linear time. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 505–517. ACM, 2015.

[15] Vincent Cohen-Addad, Varun Kanade, and Frederik Mallmann-Trenn. Hierarchical clustering beyond the worst-case. In Advances in Neural Information Processing Systems, pages 6201–6209, 2017.

[16] Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 378–397. SIAM, 2018.

[17] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. arXiv preprint arXiv:1510.05043, 2015.

[18] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni.
Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pages 253–262. ACM, 2004.

[19] Ibai Diez, Paolo Bonifazi, Iñaki Escudero, Beatriz Mateos, Miguel A. Muñoz, Sebastiano Stramaglia, and Jesus M. Cortes. A novel brain partition highlights the modular skeleton shared by structure and function. Scientific Reports, 5:10532, 2015.

[20] Pasi Fränti, Olli Virmajoki, and Ville Hautamäki. Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1875–1881, 2006.

[21] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.

[22] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350, 2012.

[23] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367–375, 2001.

[24] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512–530, 2001.

[25] Yongkweon Jeon, Jaeyoon Yoo, Jongsun Lee, and Sungroh Yoon. NC-Link: A new linkage method for efficient hierarchical clustering of large-scale data. IEEE Access, 5:5594–5608, 2017.

[26] Karthik C. S. and Pasin Manurangsi. On closest pair in Euclidean metric: Monochromatic is as hard as bichromatic. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, pages 17:1–17:16, 2019.

[27] Meelis Kull and Jaak Vilo.
Fast approximate hierarchical clustering using similarity heuristics. BioData Mining, 1(1):9, 2008.

[28] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

[29] Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems, pages 3094–3103, 2017.

[30] Marius Muja and David G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 2014.

[31] Fionn Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354–359, 1983.

[32] Fionn Murtagh. Comments on 'Parallel algorithms for hierarchical clustering and cluster validity'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10):1056–1057, 1992.

[33] M. Otair et al. Approximate k-nearest neighbour based spatial clustering using kd tree. arXiv preprint arXiv:1303.1951, 2013.

[34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[35] Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. In Advances in Neural Information Processing Systems, pages 2316–2324, 2016.

[36] Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1260–1268. ACM, 2018.

[37] Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. Introduction to Information Retrieval, volume 39.
Cambridge University Press, 2008.

[38] Hinrich Schütze and Craig Silverstein. Projections for efficient document clustering. In ACM SIGIR Forum, volume 31, pages 74–81. ACM, 1997.

[39] Michael Steinbach, George Karypis, Vipin Kumar, et al. A comparison of document clustering techniques. In KDD Workshop on Text Mining, volume 400, pages 525–526. Boston, 2000.

[40] Michele Tumminello, Fabrizio Lillo, and Rosario N. Mantegna. Correlation, hierarchies, and networks in financial markets. Journal of Economic Behavior & Organization, 75(1):40–58, 2010.

[41] Pelin Yildirim and Derya Birant. K-linkage: A new agglomerative approach for hierarchical clustering. Advances in Electrical and Computer Engineering, 17(4):77–89, 2017.