{"title": "Ultrametric Fitting by Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 3181, "page_last": 3192, "abstract": "We study the problem of fitting an ultrametric distance to a dissimilarity graph in the context of hierarchical cluster analysis. Standard hierarchical clustering methods are specified procedurally, rather than in terms of the cost function to be optimized. We aim to overcome this limitation by presenting a general optimization framework for ultrametric fitting. Our approach consists of modeling the latter as a constrained optimization problem over the continuous space of ultrametrics. So doing, we can leverage the simple, yet effective, idea of replacing the ultrametric constraint with a min-max operation injected directly into the cost function. The proposed reformulation leads to an unconstrained optimization problem that can be efficiently solved by gradient descent methods. The flexibility of our framework allows us to investigate several cost functions, following the classic paradigm of combining a data fidelity term with a regularization. While we provide no theoretical guarantee to find the global optimum, the numerical results obtained  over a number of synthetic and real datasets demonstrate the good performance of our approach with respect to state-of-the-art agglomerative algorithms. This makes us believe that the proposed framework sheds new light on the way to design a new generation of hierarchical clustering methods. 
Our code is made publicly available at https://github.com/PerretB/ultrametric-fitting.", "full_text": "Ultrametric Fitting by Gradient Descent

Giovanni Chierchia*
Université Paris-Est, LIGM (UMR 8049)
CNRS, ENPC, ESIEE Paris, UPEM
F-93162, Noisy-le-Grand, France
giovanni.chierchia@esiee.fr

Benjamin Perret*
Université Paris-Est, LIGM (UMR 8049)
CNRS, ENPC, ESIEE Paris, UPEM
F-93162, Noisy-le-Grand, France
benjamin.perret@esiee.fr

*Both authors contributed equally.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Ultrametrics provide a natural way to describe a recursive partitioning of data into increasingly finer clusters, also known as hierarchical clustering [1]. Ultrametrics are intuitively represented by dendrograms, i.e., rooted trees whose leaves correspond to data points and whose internal nodes represent the clusters of their descendant leaves. In topology, this corresponds to a metric space in which the usual triangle inequality is strengthened by the ultrametric inequality, so that every triple of points forms an isosceles triangle, with the two equal sides at least as long as the third side. The main question investigated in this article is: \"How well can we construct an ultrametric to fit the given dissimilarity data?\" This is what we refer to as ultrametric fitting.

Ultrametric fitting can be traced back to the early work on numerical taxonomy [2] in the context of phylogenetics [3]. Several well-known algorithms originated in this field, such as single linkage [4], average linkage [5], and the Ward method [6]. Nowadays, there exists a large literature on ultrametric fitting, which can be roughly divided into four categories: agglomerative and divisive greedy heuristics [7-13], integer linear programming [14-16], continuous relaxations [17-20], and probabilistic formulations [21-23]. Our work belongs to the family of continuous relaxations.

The most popular methods for ultrametric fitting probably belong to the family of agglomerative heuristics. They follow a bottom-up approach, in which the given dissimilarity data are sequentially merged through some specific strategy.
But since the latter is specified procedurally, it is usually hard to understand the objective function being optimized. In this regard, several recent works [9, 16, 19, 11, 24] underlined the importance of casting ultrametric fitting as an optimization problem with a well-defined cost function, so as to better understand how the ultrametric is built.

Recently, Dasgupta [9] introduced a cost function for evaluating an ultrametric, and proposed a heuristic to approximate its optimal solution. The factor of this approximation was later improved by several works, based on a linear programming relaxation [16], a semidefinite programming relaxation [19], or a recursive φ-sparsest cut algorithm [24]. Along similar lines, it was shown that average linkage provides a good approximation of the optimal solution to the Dasgupta cost function [25, 26]. Closer to our approach, a differentiable relaxation inspired by the Dasgupta cost function was also proposed [20]. Moreover, a regularization for the Dasgupta cost function was formulated in the context of semi-supervised clustering [23, 27], based on triplet constraints provided by the user.

More generally, the problem of finding the closest ultrametric to dissimilarity data was extensively studied through linear programming relaxations [18] and integer linear programming [15]. A special case of interest arises when the dissimilarities are specified by a planar graph, which is a natural occurrence in image segmentation [28-31]. By exploiting the planarity of the input graph, a tight linear programming relaxation can be derived from the minimum-weight multi-cut problem [14]. There exist many other continuous relaxations of discrete problems in the specific context of image segmentation [32-37], but they typically aim at a flat representation of data, rather than a hierarchical one.

Contribution.
We propose a general optimization framework for ultrametric fitting based on gradient descent. Our approach consists of optimizing a cost function over the continuous space of ultrametrics, where the ultrametricity constraint is implicitly enforced by a min-max operation. We demonstrate the versatility of our approach by investigating several cost functions:

1. the closest-ultrametric fidelity term, which expresses that the fitted ultrametric should be close to the given dissimilarity graph;

2. the cluster-size regularization, which penalizes the presence of small clusters in the upper levels of the associated hierarchical clustering;

3. the triplet regularization for semi-supervised learning, which aims to minimize the intra-class distance and maximize the inter-class distance;

4. the Dasgupta fidelity term, which is a continuous relaxation of the Dasgupta cost function, expressing that the fitted ultrametric should associate large dissimilarities to large clusters.

We devise efficient algorithms with automatic differentiation in mind, and we show that they scale up to millions of vertices on sparse graphs. Finally, we evaluate the proposed cost functions on synthetic and real datasets, and we show that they perform as well as the Ward method and semi-supervised SVM.

2 Ultrametric fitting

Central to this work is the notion of ultrametric, a special kind of metric that is equivalent to hierarchical clustering [38]. Formally, an ultrametric d : V × V → R+ is a metric on a space V in which the triangle inequality is strengthened by the ultrametric inequality, defined as

(∀(x, y, z) ∈ V³)   d(x, y) ≤ max{d(x, z), d(z, y)}.   (1)

The notion of ultrametric can also be defined on a connected (non-complete) graph G = (V, E) with non-negative edge weights w ∈ W, where W denotes the space of functions from E to R+. In this case, the distance is only available between the pairs of vertices in E, and the ultrametric constraint must be defined over the set of cycles C of G as follows:

u ∈ W is an ultrametric² on G   ⇔   (∀C ∈ C, ∀e ∈ C)   u(e) ≤ max_{e′∈C\{e}} u(e′).   (2)

Note that an ultrametric u on G can be extended to all the pairs of vertices in V through the min-max distance on u, which is defined as

(∀(x, y) ∈ V²)   du(x, y) = min_{P∈Pxy} max_{e′∈P} u(e′),   (3)

where Pxy denotes the set of all paths between the vertices x and y of G. This observation allows us to compactly represent ultrametrics as weight functions u ∈ W on sparse graphs G, instead of more costly pairwise distances. Figure 1 shows an example of ultrametric and its possible representations.

²Some authors use different names, such as ultrametric contour map [29] or saliency map [39].

Figure 1: Ultrametric d on {x1, x2, x3, x4} given by the dissimilarity matrix (a), and represented by the dendrogram (b) and the graph (c). The matrix (a) reads:

        x1  x2  x3  x4
    x1 ( 0   r1  r3  r3 )
    x2 ( r1  0   r3  r3 )
    x3 ( r3  r3  0   r2 )
    x4 ( r3  r3  r2  0  )

[Drawings (b) and (c) omitted: the dendrogram merges x1, x2 at altitude r1 (node n1), x3, x4 at r2 (node n2), and all points at r3 (node n3); the graph has edges x1-x2 of weight r1, x3-x4 of weight r2, and x1-x3, x2-x4 of weight r3.] Two elements xi and xj merge at the altitude rk = d(xi, xj) in the dendrogram, and the corresponding node is the lowest common ancestor (l.c.a.) of xi and xj. For example, the l.c.a. of x1 and x2 is the node n1 at altitude r1, hence d(x1, x2) = r1; the l.c.a. of x1 and x3 is the node n3 at altitude r3, hence d(x1, x3) = r3. The graph (c) with edge weights u leads to the ultrametric (a) via the min-max distance du defined in (3).
For example, all the paths from x1 to x3 contain an edge of weight r3, which is maximal, and thus du(x1, x3) = r3.

Notation. The dendrogram associated to an ultrametric u on G is denoted by Tu [38]. It is a rooted tree whose leaves are the elements of V. Each tree node n ∈ Tu is the set composed of all the leaves of the sub-tree rooted in n. The altitude of a node n, denoted by altu(n), is the maximal distance between any two elements of n, i.e., altu(n) = max{u(exy) | x, y ∈ n and exy ∈ E}. The size of a node n, denoted by |n|, is the number of leaves contained in n. For any two leaves x and y, the lowest common ancestor of x and y, denoted lcau(x, y), is the smallest node of Tu containing both x and y.

2.1 Optimization framework

Our goal is to find the ultrametric that \"best\" represents the given edge-weighted graph. We propose to formulate this task as a constrained optimization problem involving an appropriate cost function J : W → R defined on the (continuous) space of distances W, leading to

minimize_{u∈W} J(u; w)   s.t.   u is an ultrametric on G.   (4)

The ultrametricity constraint is highly nonconvex and cannot be efficiently tackled with standard optimization algorithms. We circumvent this issue by replacing the constraint with an operation injected directly into the cost function. The idea is that the ultrametricity constraint can be enforced implicitly through the operation that computes the subdominant ultrametric, defined as the largest ultrametric below the given dissimilarity function. One way to compute the subdominant ultrametric is through the min-max operator ΦG : W → W defined by

(∀w̃ ∈ W, ∀exy ∈ E)   ΦG(w̃)(exy) = min_{P∈Pxy} max_{e′∈P} w̃(e′),   (5)

where Pxy is defined as in (3). Then, Problem (4) can be rewritten as

minimize_{w̃∈W} J(ΦG(w̃); w).   (6)

Since the min-max operator is sub-differentiable (see (15) in Section 3), the above problem can be optimized by gradient descent, as long as J is sub-differentiable. This allows us to devise Algorithm 1. Note that the min-max operator already proved useful in image segmentation to define a structured loss function for end-to-end supervised learning [28, 31]. The goal was however to find a flat segmentation rather than a hierarchical one. To the best of our knowledge, we are the first to use the min-max operator within an optimization framework for ultrametric fitting.

Algorithm 1 Solution to the ultrametric fitting problem defined in (4).
Require: Graph G = (V, E) with edge weights w
1: w̃[0] ← w
2: for t = 0, 1, . . . do
3:   g[t] ← gradient of J(ΦG(·); w) evaluated at w̃[t]
4:   w̃[t+1] ← update of w̃[t] using g[t]
5: return ΦG(w̃[∞])

2.2 Closest ultrametric

A natural goal for ultrametric fitting is to find the closest ultrametric to the given dissimilarity graph. This task fits nicely into Problem (4) by setting the cost function to the sum of squared errors between the sought ultrametric and the edge weights of the given graph, namely

Jclosest(u; w) = Σ_{e∈E} ( u(e) − w(e) )².   (7)

Although the exact minimization of this cost function is an NP-hard problem [40], the proposed optimization framework allows us to compute an approximate solution. Figure 2 shows the ultrametric computed by Algorithm 1 with Jclosest for an illustrative example of hierarchical clustering. A common issue with the closest ultrametric is that small clusters might branch very high in the dendrogram.
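To make the min-max projection ΦG in (5) concrete, here is a tiny brute-force sketch (an assumed toy, not the paper's implementation; the efficient O(N log N) version is Algorithm 2 in Section 3). It computes minimax path weights, i.e., the min-max distance in (3), with a Floyd-Warshall-style recursion, using the graph of Figure 1 with assumed weights r1 = 1, r2 = 2, r3 = 3:

```python
# Brute-force min-max (subdominant ultrametric) distances on a small graph.
INF = float("inf")

def minmax_distances(n, edges):
    """Floyd-Warshall-style recursion for minimax path weights:
    d[i][j] = min over intermediate k of max(d[i][k], d[k][j])."""
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    for i, j, w in edges:
        d[i][j] = d[j][i] = min(d[i][j], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], max(d[i][k], d[k][j]))
    return d

# Toy graph mimicking Figure 1: x1-x2 (r1=1), x3-x4 (r2=2), x1-x3 and x2-x4 (r3=3).
edges = [(0, 1, 1.0), (2, 3, 2.0), (0, 2, 3.0), (1, 3, 3.0)]
d = minmax_distances(4, edges)
print(d[0][1], d[0][2], d[2][3])  # 1.0 3.0 2.0
```

On the edges of the graph, these minimax distances are exactly ΦG(w̃); every path from x1 to x3 must cross an edge of weight 3, so du(x1, x3) = 3 as in the Figure 1 example.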
Such undesirable branching also occurs with average linkage and other agglomerative methods. Ultrametrics of this kind lead to partitions containing very small clusters at large scales, as clearly shown in Figures 2b-2c. We now present two approaches to tackle this issue.

2.3 Cluster-size regularization

To fight against the presence of small clusters at large scales, we need to introduce a mechanism that pushes down the altitude of nodes where such incorrect merging occurs. This can be easily translated in our framework, as the altitude of a node corresponds to the ultrametric distance between its children. Specifically, we penalize the ultrametric distance proportionally to some non-negative coefficients that depend on the corresponding nodes in the dendrogram, yielding

Jsize(u) = Σ_{exy∈E} u(exy) / γu(lcau(x, y)).   (8)

Here above, the γ coefficients play an essential role: they must be small for the nodes that need to be pushed down, and large otherwise. We thus rank the nodes by the size of their smallest child, that is

(∀n ∈ Tu)   γu(n) = min{|c| : c ∈ Childrenu(n)},   (9)

where Childrenu(n) denotes the children of a node n in the dendrogram Tu associated to the ultrametric u on G. Figure 2d shows the ultrametric computed by Algorithm 1 with Jclosest + Jsize. The positive effect of this regularization can be appreciated by observing that small clusters are no longer branched very high in the dendrogram.

2.4 Triplet regularization

Triplet constraints [23, 27] provide an alternative way to penalize small clusters at large scales.
Like in semi-supervised classification, we assume that the labels Lv of some data points v ∈ V are known, and we build a set of triplets according to the classes they belong to:

T = {(ref, pos, neg) ∈ V³ | Lref = Lpos and Lref ≠ Lneg}.   (10)

These triplets provide valuable information on how to build the ultrametric. Intuitively, we need a mechanism that reduces the ultrametric distance within the classes, while increasing the ultrametric distance between different classes. This can be readily expressed in our framework with a regularization acting on the altitude of nodes containing the triplets, leading to

Jtriplet(u) = Σ_{(ref,pos,neg)∈T} max{0, α + du(ref, pos) − du(ref, neg)}.   (11)

Here above, the constant α > 0 represents the minimum prescribed distance between different classes. Figure 2e shows the ultrametric computed by Algorithm 1 with Jclosest + Jtriplet.

Figure 2: Illustrative examples of hierarchical clustering: (a) Graph/Labels, (b) Average link, (c) Closest, (d) Closest+Size, (e) Closest+Triplet, (f) Dasgupta. Top row: ultrametrics fitted to the input graph; only the top-30 non-leaf nodes are shown in the dendrograms (all the others are contracted into leaves). Bottom row: assignments obtained by thresholding the ultrametrics at three clusters. The detrimental effect of having \"small clusters at large scales\" can be observed in (b) and (c).

2.5 Dasgupta cost function

The Dasgupta cost function [9] has recently gained traction in the search for a theoretically grounded framework for hierarchical clustering [16, 11, 41, 24, 27]. However, its minimization is known to be an NP-hard problem [9]. The intuition behind this function is that large clusters should be associated to large dissimilarities.
The idea is then to minimize, for each edge exy ∈ E, the size of the dendrogram node associated to exy divided by the weight of exy, yielding

JDasgupta(u; w) = Σ_{exy∈E} |lcau(x, y)| / w(exy).   (12)

However, we cannot directly use (12) in our optimization framework, as the derivative of |lcau(x, y)| with respect to the underlying ultrametric u is equal to 0 almost everywhere. To solve this issue, we propose a soft-cardinal measure of a node that is differentiable w.r.t. the associated ultrametric u. Let n be a node of the dendrogram Tu, and let {x} ⊆ n be a leaf of the sub-tree rooted in n. We observe that the cardinal of n is equal to the number of vertices y ∈ V such that the ultrametric distance du(x, y) between x and y is strictly lower than the altitude of n, namely

|n| = Σ_{y∈V} H(altu(n) − du(x, y)),   (13)

where H is the Heaviside function. By replacing H with a continuous approximation, such as a sigmoid function, we obtain a soft-cardinal measure of a node of Tu that is differentiable with respect to the ultrametric u. Figure 2f shows the ultrametric computed by Algorithm 1 with JDasgupta.

Note that a differentiable cost function inspired by the Dasgupta cost function was proposed in [20]. This function replaces the node size by a parametric probability measure which is optimized over a fixed tree. This is fundamentally different from our approach, where the proposed measure is a continuous relaxation of the node size, and it is directly optimized over the ultrametric distance.

3 Algorithms

In this section, we present a general approach to efficiently compute the various terms appearing in the cost functions introduced earlier.
All the proposed algorithms rely on some properties of single-linkage (agglomerative) clustering, which is a dual representation of the subdominant ultrametric. We perform a detailed analysis of the algorithm used to compute the subdominant ultrametric. The other algorithms can be found in the supplemental material.

Single-linkage clustering can be computed similarly to a minimum spanning tree (m.s.t.) with Kruskal's algorithm, by sequentially processing the edges of the graph in non-decreasing order, and merging the clusters located at the extremities of the edge when an m.s.t. edge is found. One consequence of this approach is that each node n of the dendrogram representing the single-linkage clustering is canonically associated to an edge of the m.s.t. (see Figure 3a), which is denoted by σ(n).

Figure 3: (a) Single-linkage clustering, (b) Jacobian. Each node of the single-linkage clustering (in blue) of the graph (in grey) is canonically associated (green dashed arrows) to an edge of a minimum spanning tree of the graph (thick edges): this edge is the pass edge between the leaves of the two children of the node. Edges are numbered from 1 to M (number of edges). The i-th column of the Jacobian matrix of the Φ operator is equal to the indicator vector denoting the pass edge holding the maximal value of the min-max path between the two extremities of the i-th edge. The pass edge can be found efficiently in the single-linkage clustering using the l.c.a. operation and the canonical link between the nodes and the m.s.t. edges. For example, the l.c.a. of the vertices 3 and 5, linked by the 4-th edge e35, is the node n4, which is canonically associated to the 6-th edge e24 (σ(n4) = e24): we thus have J6,4 = 1. [The drawings and the example Jacobian matrix are omitted; each column of the matrix contains a single 1, in the row of the corresponding pass edge.]

In the following, we assume that we are working with a sparse graph G = (V, E), where O(|E|) = |V|. The number of vertices (resp. edges) of G is denoted by N (resp. M). For ease of writing, we denote the edge weights as vectors of R^M. The dendrogram corresponding to the single-linkage clustering of the graph G with edge weights w̃ ∈ W is denoted by SL(w̃).

3.1 Subdominant ultrametric

To obtain an efficient and automatically differentiable algorithm for computing the subdominant ultrametric, we observe that the min-max distance between any two vertices x, y ∈ V is given by the weight of the pass edge between x and y. This is the edge holding the maximal value of the min-max path from x to y, and an arbitrary choice is made if several pass edges exist. Moreover, the pass edge between x and y corresponds to the l.c.a. of x and y in the single-linkage clustering of (G, w̃) (see Figure 3a). Equation (5) can thus be rewritten as

(∀w̃ ∈ W, ∀exy ∈ E)   ΦG(w̃)(exy) = w̃(emst_xy)   with   emst_xy = σ(lca_{SL(w̃)}(x, y)).   (14)

The single-linkage clustering can be computed in time O(N log N) with a variant of Kruskal's minimum spanning tree algorithm [4, 42]. Then, a fast algorithm allows us to compute the l.c.a. of two nodes in constant time O(1), thanks to a linear time O(N) preprocessing of the tree [43]. The subdominant ultrametric can thus be computed in time O(N log N) with Algorithm 2. Moreover, the dendrogram associated to the ultrametric returned by Algorithm 2 is the tree computed on line 2. Note that Algorithm 2 can be interpreted as a special max pooling applied to the input tensor w, and can thus be automatically differentiated.
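As a toy counterpart to this construction (an assumed sketch, not the paper's Higra-based code), the pass-edge weight can also be obtained with a plain union-find: sweeping the edges in non-decreasing weight order, the edge whose insertion first connects x and y is their pass edge. This is quadratic overall, unlike the O(N log N) single-linkage + l.c.a. approach, but it makes (14) easy to check on small graphs:

```python
# Toy subdominant-ultrametric operator: for each edge e_xy, return the weight
# of the pass edge between x and y (Kruskal-style union-find sketch).
def pass_edge_weight(n, edges, x, y):
    """Sweep edges in non-decreasing weight order; the edge whose insertion
    first connects x and y is their pass edge."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i, j, w in sorted(edges, key=lambda e: e[2]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
        if find(x) == find(y):
            return w
    raise ValueError("vertices are not connected")

def subdominant_ultrametric(n, edges):
    # Phi_G(w)(e_xy) for every edge e_xy in E (quadratic toy version).
    return [pass_edge_weight(n, edges, x, y) for x, y, _ in edges]

# Toy graph with assumed weights (same shape as Figure 1):
edges = [(0, 1, 1), (2, 3, 2), (0, 2, 3), (1, 3, 3)]
print(subdominant_ultrametric(4, edges))  # [1, 2, 3, 3]
```

Note that the output is already an ultrametric here: every edge weight equals the min-max distance between its extremities, so applying the operator again leaves it unchanged (idempotence of the subdominant projection).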
Indeed, a sub-gradient of the min-max operator Φ at a given edge exy is equal to 1 on the pass edge between x and y and 0 elsewhere. The Jacobian of the min-max operator Φ can then be written as the matrix composed of the indicator column vectors giving the position of the pass edge associated to the extremities of each edge in E:

∂Φ(w̃)/∂w̃ = [ 1_{Φ*G(w̃1)}, . . . , 1_{Φ*G(w̃M)} ],   (15)

where Φ*G(w̃i) denotes the index of the pass edge between the two extremities of the i-th edge, and 1j is the column vector of R^M equal to 1 in position j, and 0 elsewhere (see Figure 3b).

Algorithm 2 Subdominant ultrametric operator defined in (5) with (14).
Require: Graph G = (V, E) with edge weights w
1: u(exy) ← 0 for each exy ∈ E          ▷ O(N)
2: tree ← single-linkage(G, w)          ▷ O(N log N) with [4, 42]
3: preprocess l.c.a. on tree            ▷ O(N) with [43]
4: for each edge exy ∈ E do             ▷ O(N)
5:   pass_node ← lca_tree(x, y)         ▷ O(1) with [43]
6:   pass_edge ← σ(pass_node)           ▷ O(1), see Figure 3a
7:   u(exy) ← w(pass_edge)              ▷ O(1)
8: return u

3.2 Regularization terms

The cluster-size regularization defined in (8) can be implemented through the same strategy used in Algorithm 2, based on the single-linkage clustering and the fast l.c.a. algorithm, leading to a time complexity in O(N log N). See the supplemental material.

Furthermore, thanks to Equation (14), the triplet regularization defined in (11) can be written as

Jtriplet(u) = Σ_{(ref,pos,neg)∈T} max{0, α + u(σ(lcau(ref, pos))) − u(σ(lcau(ref, neg)))}.   (16)

This can be implemented with a time complexity in O(|T| + N log N).
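As a concrete illustration of the hinge term in (16), here is a minimal sketch on assumed toy inputs (the actual implementation reads the distances off the single-linkage tree via l.c.a. queries; here they are given directly as a lookup table):

```python
# Toy triplet regularization (Eqs. 11 and 16): hinge loss on ultrametric
# distances. The lookup d[(a, b)] stands in for u(sigma(lca(a, b))).
def triplet_loss(d, triplets, alpha):
    """Sum over (ref, pos, neg) of max(0, alpha + d(ref,pos) - d(ref,neg))."""
    return sum(max(0.0, alpha + d[(r, p)] - d[(r, n)]) for r, p, n in triplets)

d = {(0, 1): 1.0, (0, 2): 3.0}  # assumed toy ultrametric distances
print(triplet_loss(d, [(0, 1, 2)], alpha=1.0))  # 0.0: margin satisfied
print(triplet_loss(d, [(0, 1, 2)], alpha=3.0))  # 1.0: margin violated by 1
```

The loss is zero as soon as the inter-class distance exceeds the intra-class distance by the margin α, so the gradient only pushes on triplets that still violate the margin.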
See the supplemental material.

3.3 Dasgupta cost function

The soft cardinal of a node n of the tree Tu as defined in (13) raises two issues: the arbitrary choice of the reference vertex x, and the quadratic time complexity Θ(N²) of a naive implementation. One way to get rid of the arbitrary choice of x is to use the two extremities of the edge σ(n) canonically associated to n. To efficiently compute (13), we can notice that, if c1 and c2 are the children of n, then the pass edge between any element x of c1 and any element y of c2 is equal to the edge σ(n) associated to n. This allows us to define card(n) as the relaxation of |n| in (13), by replacing the Heaviside function H with the sigmoid function ℓ, leading to

card(n) = 1/2 Σ_{x∈σ(n)} Σ_{y∈V} ℓ(altu(n) − du(x, y))
        = 1/2 Σ_{x∈σ(n)} ( ℓ(altu(n) − du(x, x)) + Σ_{y∈V\{x}} ℓ(altu(n) − du(x, y)) )
        = 1/2 Σ_{x∈σ(n)} ( ℓ(altu(n)) + Σ_{y∈A(x)} Σ_{z∈cx̂(y)} ℓ(altu(n) − du(x, z)) )
        = 1/2 Σ_{x∈σ(n)} ( ℓ(altu(n)) + Σ_{y∈A(x)} Σ_{z∈cx̂(y)} ℓ(altu(n) − altu(y)) )
        = 1/2 Σ_{x∈σ(n)} ( ℓ(altu(n)) + Σ_{y∈A(x)} |cx̂(y)| ℓ(altu(n) − altu(y)) )
        = 1/2 Σ_{x∈σ(n)} ( ℓ(u(σ(n))) + Σ_{y∈A(x)} |cx̂(y)| ℓ(u(σ(n)) − u(σ(y))) ),   (17)

where A(x) is the set of ancestors of x, and cx̂(y) is the child of y that does not contain x.
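The core of the relaxation can be sanity-checked in a few lines (toy altitudes and distances assumed): replacing the Heaviside step in (13) by a sigmoid ℓ turns the hard count into a differentiable one, with points exactly at the node's altitude counting one half:

```python
import math

def sigmoid(t, scale=1.0):
    """Smooth approximation of the Heaviside step used in Eq. (13)."""
    return 1.0 / (1.0 + math.exp(-t / scale))

def soft_cardinal(alt_n, dists, scale=1.0):
    """Differentiable node size: soft count of the vertices whose ultrametric
    distance to the reference leaf is below the altitude alt_n of the node."""
    return sum(sigmoid(alt_n - d, scale) for d in dists)

# Node at altitude 3; assumed distances from a reference leaf x (d(x, x) = 0):
dists = [0.0, 1.0, 3.0, 5.0]
print(round(soft_cardinal(3.0, dists, scale=0.1), 3))  # 2.5 (hard count: 2)
```

With a sharp sigmoid (small scale) the soft count approaches the hard one, at the price of near-zero gradients; the scale thus trades off fidelity to |n| against how far gradient information propagates.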
The time complexity to evaluate (17) is dominated by the sum over the ancestors of n which, in the worst case, is in the order of O(N), leading to a worst-case time complexity of O(N²). In practice, dendrograms are usually well balanced, and thus the number of ancestors of a node is in the order of O(log N), yielding an empirical complexity in O(N log N).

4 Experiments

In the following, we evaluate the proposed framework on two different setups. The first one aims to assess our continuous relaxation of the closest ultrametric problem with respect to the (almost) exact solution provided by linear programming on planar graphs [14]. The second one aims to compare the performance of our cost functions to classical hierarchical and semi-supervised clustering methods. The implementation of our algorithms, based on the Higra [44] and PyTorch [45] libraries, is available at https://github.com/PerretB/ultrametric-fitting.

Figure 4: Test image, its gradient, and superpixel contour weights with 525, 1526, and 4434 edges.

Figure 5: Validation and computation time: (a) mean square error, (b) computation time, (c) computation time per iteration. Figures (a) and (b): comparison between the CUCP algorithm [14] and the proposed gradient descent approach. For CUCP we tested different numbers of hierarchy levels (5, 10, 20, 40) distributed evenly over the range of the input dissimilarity function. Figure (a) shows the final mean square error (normalized against CUCP 40) w.r.t. the number of edges in the tested graph. Figure (b) shows the run-time w.r.t. the number of edges in the tested graph (CUCP was capped at 200 seconds per instance). Figure (c) shows the computation time of the tested cost functions (one iteration of Algorithm 1) with respect to the number of edges in the graph.

Framework validation. As Problem (6) is non-convex, there is no guarantee that the gradient descent method will find the global optimum. To assess the performance of the proposed framework, we use the algorithm proposed in [14], denoted by CUCP (Closest Ultrametric via Cutting Plane), as a baseline for the closest ultrametric problem defined in (7). Indeed, CUCP can provide an (almost) exact solution to the closest ultrametric problem for planar graphs, based on a reformulation as a set of correlation clustering/multi-cut problems with additional hierarchical constraints. However, CUCP requires defining a priori the set of levels which will compose the final hierarchy.

We generated a set of superpixel adjacency graphs of increasing scale from a high-resolution image (see Figure 4). The weight of the edge linking two superpixels is defined as the mean gradient value, obtained with [46], along the frontier between the two superpixels. The results presented in Figure 5 show that the proposed approach is able to provide solutions close to the optimal ones (Figure 5a) using only a fraction of the time needed by the combinatorial algorithm (Figure 5b), and without any assumption on the input graph. The complete experimental setup is in the supplemental material.

The computation times of some combinations of cost terms are presented in Figure 5c. Note that Algorithm 1 usually achieves convergence in about one hundred iterations (see supplemental material). Closest and Closest+Size can handle graphs with millions of edges.
The Dasgupta relaxation is computationally more demanding, which decreases the limit to a few hundred thousand edges.

Hierarchical clustering. We evaluate the proposed optimization framework on five datasets downloaded from the LIBSVM webpage,³ whose size ranges from 270 to 1500 samples. For each dataset, we build a 5-nearest-neighbor graph, to which we add the edges of a minimum spanning tree to ensure connectivity. Then, we perform hierarchical clustering on this graph, and we threshold the resulting ultrametric at the prescribed number of clusters. We divide our analysis into two sets of comparisons: hierarchical clustering (unsupervised), and semi-supervised clustering. To be consistent among the two types of comparisons, we use the classification accuracy as a measure of performance.

³https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Figure 6: Performance on real datasets: (a) hierarchical clustering (unsupervised), (b) semi-supervised clustering.

Figure 6a compares the performance of three hierarchical clustering methods. The baseline is the \"Ward\" agglomerative method, applied to the pairwise distance matrix of each dataset. Average linkage and closest ultrametric are not reported, as their performance is consistently worse. The \"Dasgupta\" method refers to Algorithm 1 with JDasgupta + λJsize and λ = 1. The \"Closest+Size\" method refers to Algorithm 1 with the cost function Jclosest + λJsize and λ = 10. In both cases, the regularization is only applied to the top-10 dendrogram nodes (see supplemental material). The results show that the proposed approach is competitive with the Ward method (one of the best agglomerative heuristics).
On the datasets Digit1 and Heart, "Dasgupta" performs slightly worse than "Closest+Size": this is partly due to the fact that our relaxation of the Dasgupta cost function is sensitive to data scaling.

Figure 6b compares the performance of two semi-supervised clustering methods and an additional unsupervised method. The first baseline is "Spectral" clustering applied to the Gaussian kernel matrix of each dataset. The second baseline is an "SVM" classifier trained on the fraction of labeled samples, and tested on the remaining unlabeled samples. Between 10% and 40% of training samples are drawn from each dataset using a 10-fold scheme, and the cross-validated performance is reported in terms of mean and standard deviation. The "Closest+Triplet" method refers to Algorithm 1 with Jclosest + λJtriplet, λ = 1 and α = 10. The results show that the triplet regularization performs comparably to the semi-supervised SVM, which in turn performs better than spectral clustering.

5 Conclusion

We have presented a general optimization framework for fitting ultrametrics to sparse edge-weighted graphs in the context of hierarchical clustering. We have demonstrated that our framework can accommodate various cost functions, thanks to efficient algorithms that we have carefully designed with automatic differentiation in mind. Experiments carried out on simulated and real data allowed us to show that the proposed approach provides good approximate solutions to well-studied problems.

The theoretical analysis of our optimization framework is beyond the scope of this paper. Nonetheless, we believe that statistical physics modelling [47] may be a promising direction for future work, based on the observation that ultrametricity is a physical property of spin-glasses [48–53].
Other possible extensions include the end-to-end learning of neural networks for hierarchical clustering, possibly in the context of image segmentation.

6 Acknowledgments

This work was partly supported by the INS2I JCJC project under grant 2019OSCI. We are deeply grateful to Julian Yarkony and Charless Fowlkes for sharing their code, and to Fred Hamprecht for many insightful discussions.

References

[1] F. Murtagh and P. Contreras. Algorithms for hierarchical clustering: an overview. Data Mining and Knowledge Discovery, 2(1):86–97, 2012.

[2] P. H. A. Sneath and R. R. Sokal. Numerical taxonomy. Nature, 193:855–860, 1962.

[3] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.

[4] J. C. Gower and G. J. S. Ross. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 18(1):54–64, 1969.

[5] N. Jardine and R. Sibson. The construction of hierarchic and non-hierarchic classifications. The Computer Journal, 11(2):177–184, 1968.

[6] J. H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

[7] R. C. de Amorim. Feature relevance in Ward's hierarchical clustering using the Lp norm. Journal of Classification, 32(1):46–62, 2015.

[8] M. Ackerman and S. Ben-David. A characterization of linkage-based hierarchical clustering. JMLR, 17(231):1–17, 2016.

[9] S. Dasgupta. A cost function for similarity-based hierarchical clustering. In Proc. STOC, pages 118–127, Cambridge, MA, USA, 2016.

[10] A. Kobren, N. Monath, A. Krishnamurthy, and A.
McCallum. A hierarchical algorithm for extreme clustering. In Proc. ACM SIGKDD, pages 255–264, 2017.

[11] V. Cohen-Addad, V. Kanade, and F. Mallmann-Trenn. Hierarchical clustering beyond the worst-case. In Proc. NeurIPS, pages 6201–6209, 2017.

[12] M. H. Chehreghani. Reliable agglomerative clustering. Preprint arXiv:1901.02063, 2018.

[13] T. Bonald, B. Charpentier, A. Galland, and A. Hollocou. Hierarchical graph clustering using node pair sampling. In KDD Workshop, 2018.

[14] J. E. Yarkony and C. Fowlkes. Planar ultrametrics for image segmentation. In Proc. NeurIPS, pages 64–72, 2015.

[15] M. Di Summa, D. Pritchard, and L. Sanità. Finding the closest ultrametric. DAM, 180(10):70–80, 2015.

[16] A. Roy and S. Pokutta. Hierarchical clustering via spreading metrics. In Proc. NeurIPS, pages 2316–2324, 2016.

[17] G. De Soete. A least squares algorithm for fitting an ultrametric tree to a dissimilarity matrix. PRL, 2(3):133–137, 1984.

[18] N. Ailon and M. Charikar. Fitting tree metrics: Hierarchical clustering and phylogeny. SIAM Journal on Computing, 40(5):1275–1291, 2011.

[19] M. Charikar and V. Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proc. SODA, pages 841–854, 2017.

[20] N. Monath, A. Kobren, and A. McCallum. Gradient-based hierarchical clustering. In NIPS 2017 Workshop on Discrete Structures in Machine Learning, Long Beach, CA, 2017.

[21] J. A. Hartigan. Statistical theory in clustering. Journal of Classification, 2(1):63–76, 1985.

[22] R. Neal. Density modeling and clustering using Dirichlet diffusion trees. In Bayesian Statistics, volume 7, pages 619–629, 2003.

[23] S. Vikram and S. Dasgupta. Interactive Bayesian hierarchical clustering. In Proc. ICML, volume 48, pages 2081–2090, New York, USA, 2016.

[24] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C.
Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proc. SODA, pages 378–397, 2018.

[25] B. Moseley and J. Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Proc. NeurIPS, pages 3094–3103, 2017.

[26] M. Charikar, V. Chatziafratis, and R. Niazadeh. Hierarchical clustering better than average-linkage. In Proc. SODA, pages 2291–2304, 2019.

[27] V. Chatziafratis, R. Niazadeh, and M. Charikar. Hierarchical clustering with structural constraints. In Proc. ICML, volume 80, pages 774–783, Stockholm, Sweden, 2018.

[28] S. C. Turaga, K. L. Briggman, M. Helmstaedter, W. Denk, and H. S. Seung. Maximin affinity learning of image segmentation. In Proc. NeurIPS, pages 1865–1873, 2009.

[29] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE PAMI, 33(5):898–916, 2011.

[30] K. K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. Van Gool. Convolutional oriented boundaries: From image segmentation to high-level tasks. IEEE PAMI, 40(4):819–833, 2018.

[31] J. Funke, F. D. Tschopp, W. Grisaitis, A. Sheridan, C. Singh, S. Saalfeld, and S. C. Turaga. Large scale image segmentation with structured loss based deep learning for connectome reconstruction. IEEE PAMI, 99:1–12, 2018.

[32] H. Ishikawa. Exact optimization for Markov random fields with convex priors. IEEE PAMI, 25(10):1333–1336, October 2003.

[33] T. Pock, T. Schoenemann, G. Graber, H. Bischof, and D. Cremers. A convex formulation of continuous multi-label problems. In Proc. ECCV, pages 792–805, Marseille, France, 2008.

[34] T. Pock, A. Chambolle, D. Cremers, and H. Bischof. A convex relaxation approach for computing minimal partitions. In Proc. CVPR, pages 810–817, Miami, FL, USA, June 2009.

[35] T. Pock, D. Cremers, H. Bischof, and A. Chambolle.
An algorithm for minimizing the Mumford-Shah functional. In Proc. ICCV, pages 1133–1140, September 2009.

[36] T. Möllenhoff, E. Laude, M. Möller, J. Lellmann, and D. Cremers. Sublabel-accurate relaxation of nonconvex energies. In Proc. CVPR, pages 3948–3956, Las Vegas, NV, USA, June 2016.

[37] M. Foare, N. Pustelnik, and L. Condat. Semi-linearized proximal alternating minimization for a discrete Mumford-Shah model. Preprint hal-01782346, 2018.

[38] G. Carlsson and F. Mémoli. Characterization, stability and convergence of hierarchical clustering methods. JMLR, 11:1425–1470, 2010.

[39] L. Najman and M. Schmitt. Geodesic saliency of watershed contours and hierarchical segmentation. IEEE PAMI, 18(12):1163–1173, 1996.

[40] M. Křivánek. The complexity of ultrametric partitions on graphs. IPL, 27(5):265–270, 1988.

[41] A. Roy and S. Pokutta. Hierarchical clustering via spreading metrics. JMLR, 18(88):1–35, 2017.

[42] L. Najman, J. Cousty, and B. Perret. Playing with Kruskal: Algorithms for morphological trees in edge-weighted graphs. In ISMM, volume 7883, pages 135–146, 2013.

[43] M. A. Bender and M. Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet and Alfredo Viola, editors, LATIN 2000: Theoretical Informatics, pages 88–94. Springer Berlin Heidelberg, 2000.

[44] B. Perret, G. Chierchia, J. Cousty, S. J. F. Guimarães, Y. Kenmochi, and L. Najman. Higra: Hierarchical graph analysis. SoftwareX, 10:1–6, 2019.

[45] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[46] P. Dollár and C. L. Zitnick. Fast edge detection using structured forests. IEEE PAMI, 37(8):1558–1570, 2015.

[47] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L.
Zdeborová. Machine learning and the physical sciences. Preprint arXiv:1903.10563, 2019.

[48] B. Grossman. The origin of the ultrametric topology of spin glasses. Journal of Physics A: Mathematical and General, 22(1):L33–L39, January 1989.

[49] L. Leuzzi. Critical behaviour and ultrametricity of Ising spin-glass with long-range interactions. Journal of Physics A: Mathematical and General, 32(8):1417, 1999.

[50] H. G. Katzgraber and A. K. Hartmann. Ultrametricity and clustering of states in spin glasses: A one-dimensional view. Physical Review Letters, 102(3):037207, 2009.

[51] H. G. Katzgraber, T. Jörg, F. Krząkała, and A. K. Hartmann. Ultrametric probe of the spin-glass state in a field. Physical Review B, 86(18):184405, 2012.

[52] R. Baviera and M. A. Virasoro. A method that reveals the multi-level ultrametric tree hidden in p-spin-glass-like systems. Journal of Statistical Mechanics: Theory and Experiment, 2015(12):P12007, 2015.

[53] A. Jagannath. Approximate ultrametricity for random measures and applications to spin glasses. Communications on Pure and Applied Mathematics, 70(4):611–664, 2017.