{"title": "Foundations of Comparison-Based Hierarchical Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 7456, "page_last": 7466, "abstract": "We address the classical problem of hierarchical clustering, but in a framework where one does not have access to a representation of the objects or their pairwise similarities. Instead, we assume that only a set of comparisons between objects is available, that is, statements of the form ``objects i and j are more similar than objects k and l.'' Such a scenario is commonly encountered in crowdsourcing applications. The focus of this work is to develop comparison-based hierarchical clustering algorithms that do not rely on the principles of ordinal embedding. We show that single and complete linkage are inherently comparison-based and we develop variants of average linkage. We provide statistical guarantees for the different methods under a planted hierarchical partition model. We also empirically demonstrate the performance of the proposed approaches on several datasets.", "full_text": "Foundations of Comparison-Based Hierarchical\n\nClustering\n\nDebarghya Ghoshdastidar\u2217\u2020\n\nDepartment of Informatics, TU Munich\n\nghoshdas@in.tum.de\n\nMicha\u00ebl Perrot\u2217\n\nMax Planck Institute for Intelligent Systems\n\nmichael.perrot@tuebingen.mpg.de\n\nUlrike von Luxburg\n\nDepartment of Computer Science, University of T\u00fcbingen\n\nMax Planck Institute for Intelligent Systems\nluxburg@informatik.uni-tuebingen.de\n\nAbstract\n\nWe address the classical problem of hierarchical clustering, but in a framework\nwhere one does not have access to a representation of the objects or their pairwise\nsimilarities. Instead, we assume that only a set of comparisons between objects\nis available, that is, statements of the form \u201cobjects i and j are more similar than\nobjects k and l.\u201d Such a scenario is commonly encountered in crowdsourcing\napplications. 
The focus of this work is to develop comparison-based hierarchical\nclustering algorithms that do not rely on the principles of ordinal embedding. We\nshow that single and complete linkage are inherently comparison-based and we\ndevelop variants of average linkage. We provide statistical guarantees for the\ndifferent methods under a planted hierarchical partition model. We also empirically\ndemonstrate the performance of the proposed approaches on several datasets.\n\n1\n\nIntroduction\n\nThe de\ufb01nition of clustering as the task of grouping similar objects emphasizes the importance of\nassessing similarity scores for the process of clustering. Unfortunately, many applications of data\nanalysis, particularly in crowdsourcing and psychometric problems, do not come with a natural\nrepresentation of the underlying objects or a well-de\ufb01ned similarity function between pairs of objects.\nInstead, one only has access to the results of comparisons of similarities, for instance, quadruplet\ncomparisons of the form \u201csimilarity between xi and xj is larger than similarity between xk and xl.\u201d\nThe importance and robustness of collecting such ordinal information from human subjects and\ncrowds has been widely discussed in the psychometric and crowdsourcing literature (Shepard, 1962;\nYoung, 1987; Borg and Groenen, 2005; Stewart et al., 2005). Subsequently, there has been growing\ninterest in the machine learning and statistics communities to perform data analysis in a comparison-\nbased framework (Agarwal et al., 2007; Van Der Maaten and Weinberger, 2012; Heikinheimo and\nUkkonen, 2013; Zhang et al., 2015; Arias-Castro et al., 2017; Haghiri et al., 2018). 
The traditional\napproach for learning in an ordinal setup involves a two-step procedure: first obtain a Euclidean\nembedding of the objects from available similarity comparisons, and subsequently learn from the\nembedded data using standard machine learning techniques (Borg and Groenen, 2005; Agarwal\net al., 2007; Jamieson and Nowak, 2011; Tamuz et al., 2011; Van Der Maaten and Weinberger,\n2012; Terada and von Luxburg, 2014; Amid and Ukkonen, 2015). As a consequence, the statistical\nperformance of the resulting comparison-based learning algorithms relies both on the goodness of\nthe embedding and the subsequent statistical consistency of learning from the embedded data. While\nthere exist theoretical guarantees on the accuracy of ordinal embedding (Jamieson and Nowak, 2011;\nKleindessner and Luxburg, 2014; Jain et al., 2016; Arias-Castro et al., 2017), it is not known if one\ncan design provably consistent learning algorithms using mutually dependent embedded data points.\n\n\u2217Both authors contributed equally to the paper.\n\u2020This work was done when the author was affiliated with the University of T\u00fcbingen.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAn alternative approach, which has become popular in recent years, is to directly learn from the\nordinal relations. This approach has been used for estimation of data dimension, centroid or density\n(Kleindessner and Luxburg, 2015; Heikinheimo and Ukkonen, 2013; Ukkonen et al., 2015), object\nretrieval and nearest neighbour search (Kazemi et al., 2018; Haghiri et al., 2017), classification and\nregression (Haghiri et al., 2018), clustering (Kleindessner and von Luxburg, 2017a; Ukkonen, 2017),\nas well as hierarchical clustering (Vikram and Dasgupta, 2016; Emamjomeh-Zadeh and Kempe,\n2018). 
The theoretical advantage of a direct learning principle over an indirect embedding-based\napproach is reflected by the fact that some of the above works come with statistical guarantees for\nlearning from ordinal comparisons (Haghiri et al., 2017, 2018; Kazemi et al., 2018).\n\nMotivation. The motivation for the present work arises from the absence of comparison-based\nclustering algorithms that have strong statistical guarantees, or more generally, the limited theory in\nthe context of comparison-based clustering and hierarchical clustering. While theoretical foundations\nof standard hierarchical clustering can be found in the literature (Hartigan, 1981; Chaudhuri et al.,\n2014; Dasgupta, 2016; Moseley and Wang, 2017), corresponding work in the ordinal setup has\nbeen limited (Emamjomeh-Zadeh and Kempe, 2018). A naive approach to derive guarantees for\ncomparison-based clustering would be to combine the analysis of a classic clustering or hierarchical\nclustering algorithm with existing guarantees for ordinal embedding (Arias-Castro et al., 2017).\nUnfortunately, this does not work since the known worst-case error rates for ordinal embedding are\ntoo weak to provide any reasonable guarantee for the resulting comparison-based clustering algorithm.\nThe existing guarantees for ordinal hierarchical clustering hold under a triplet framework, where\neach comparison returns the two most similar among three objects (Emamjomeh-Zadeh and Kempe,\n2018). The results show that the underlying hierarchy can be recovered with few adaptively chosen\ncomparisons, but if the comparisons are provided beforehand, which is the case in crowdsourcing,\nthen the number of required comparisons is rather large. 
The focus of the present work is to develop\nprovable comparison-based hierarchical clustering algorithms that can \ufb01nd an underlying hierarchy\nin a set of objects given either adaptively or non-adaptively chosen sets of comparisons.\n\nContribution 1: Agglomerative algorithms for comparison-based clustering. The only known\nhierarchical clustering algorithm in a comparison-based framework employs a divisive approach\n(Emamjomeh-Zadeh and Kempe, 2018). We observe that it is easy to perform agglomerative\nhierarchical clustering using only comparisons since one can directly reformulate single linkage and\ncomplete linkage clustering algorithms in the quadruplet comparisons framework. However, it is\nwell known that single and complete linkage algorithms typically have poor worst-case guarantees\n(Cohen-Addad et al., 2018). While average linkage clustering has stronger theoretical guarantees\n(Moseley and Wang, 2017; Cohen-Addad et al., 2018), it cannot be used in the comparison-based\nsetup since it relies on an averaging of similarity scores. We propose two variants of average linkage\nclustering that can be applied to the quadruplet comparisons framework. We numerically compare\nthe merits of these new methods with single and complete linkage and embedding based approaches.\n\nContribution 2: Guarantees for true hierarchy recovery. Dasgupta (2016) provided a new\nperspective for hierarchical clustering in terms of optimizing a cost function that depends on the\npairwise similarities between objects. Subsequently, theoretical research has focused on worst-case\nanalysis of different algorithms with respect to this cost function (Roy and Pokutta, 2016; Moseley\nand Wang, 2017; Cohen-Addad et al., 2018). However, such an analysis is complicated in an ordinal\nsetup, where the algorithm is oblivious to the pairwise similarities. 
In this case, one can study a\nstronger notion of guarantee in terms of exact recovery of the true hierarchy (Emamjomeh-Zadeh\nand Kempe, 2018). That work, however, considers a simplistic noise model, where the result of each\ncomparison may be randomly flipped independently of other comparisons (Jain et al., 2016). Such an\nindependent noise can be easily tackled by repeatedly querying the same comparison and using a\nmajority vote. It cannot account for noise in the underlying objects and their associated similarities.\nInstead, we consider a theoretical model that generates random pairwise similarities with a planted\nhierarchical structure (Balakrishnan et al., 2011).\n\ninput: Set of objects X = {x1, . . . , xN}; Cluster-level similarity W : 2^X \u00d7 2^X \u2192 R.\noutput: Binary tree, or dendrogram, representing a hierarchical clustering of X.\nbegin\n  Let B be a collection of N singleton trees C1, . . . , CN with root nodes Ci.root = {xi}.\n  while |B| > 1 do\n    Let C, C' be the pair of trees in B for which W(C.root, C'.root) is maximum.\n    Create C'' with C''.root = {C.root \u222a C'.root}, C''.left = C, and C''.right = C'.\n    Add C'' to the collection B, and remove C, C'.\n  end\n  return The surviving element in B.\nend\n\nAlgorithm 1: Agglomerative Hierarchical Clustering.\n\nThis induces considerable dependence among\nthe quadruplets, and makes the analysis challenging. 
We derive conditions under which different\ncomparison-based agglomerative algorithms can exactly recover the hierarchy with high probability.\n\n2 Background\n\nIn this section we introduce standard hierarchical clustering with known similarities, we describe the\nmodel used for the theoretical analyses, and we formalize the comparison-based framework.\n\n2.1 Agglomerative hierarchical clustering with known similarity scores\nLet X = {xi}N\ni=1 be a set of N objects, which may not have a known feature representation. We\nassume that there exists an underlying symmetric similarity function w : X \u00d7 X \u2192 R. The goal of\nhierarchical clustering is to group the N objects to form a binary tree such that xi and xj are merged\nin the bottom of the tree if their similarity score wij = w(xi, xj) is high, and vice-versa. Here, we\nbrie\ufb02y review popular agglomerative clustering algorithms (Cohen-Addad et al., 2018). They rely\non the similarity score w between objects to de\ufb01ne a similarity function between any two clusters,\nW : 2X \u00d7 2X \u2192 R. Starting from N singleton clusters, each iteration of the algorithm merges the\ntwo most similar clusters. This is described in Algorithm 1, where different choices of W lead to\ndifferent algorithms. 
Given two clusters G and G', popular choices for W(G, G') are\n\nSingle Linkage (SL): W(G, G') = max_{xi\u2208G, xj\u2208G'} wij,\nComplete Linkage (CL): W(G, G') = min_{xi\u2208G, xj\u2208G'} wij,\nAverage Linkage (AL): W(G, G') = \u03a3_{xi\u2208G, xj\u2208G'} wij / (|G||G'|).\n\n2.2 Planted hierarchical model\n\nTheoretically, we study the problem of hierarchical clustering under a noisy hierarchical block matrix\n(Balakrishnan et al., 2011) where, given N objects, the matrix of pairwise similarities can be written\nas M + R, where M = (\u00b5ij)_{1\u2264i,j\u2264N} is a symmetric ideal similarity matrix characterizing the\nplanted hierarchy among the examples and R = (rij)_{1\u2264i,j\u2264N} is a symmetric perturbation matrix\nthat accounts for the noise in the observed similarity scores. In this paper, we assume that the entries\n{rij}_{1\u2264i<j\u2264N} are mutually independent and normally distributed, that is rij \u223c N(0, \u03c3^2), for some\nfixed variance \u03c3^2. The ideal similarity matrix M is constructed in the following way. We assume that\nthe planted hierarchy is a balanced binary tree of height L (see Figure 1), where the 2^L leaf nodes\nG1, . . . , G_{2^L} correspond to \u201cpure clusters\u201d, each of size N0. Thus, the total number of objects in X is\nN = N0 2^L. 
For some constants \u03b4 > 0 and \u00b5, the ideal similarities are defined as follows:\nStep-0: X is divided into two equal sized clusters, and, given xi and xj lying in different clusters,\ntheir ideal similarity is set to \u00b5ij = \u00b5 \u2212 L\u03b4 (dark blue off-diagonal block in Figure 1).\nStep-1: Each of the two groups is further divided into two sub-groups, and, for each pair xi, xj\nseparated due to this sub-group formation, we set \u00b5ij = \u00b5 \u2212 (L \u2212 1)\u03b4.\nStep-2, . . . , L \u2212 1: The above process is repeated L \u2212 1 times, and in step \u2113, the ideal similarity\nacross two newly-formed sub-groups is \u00b5ij = \u00b5 \u2212 (L \u2212 \u2113)\u03b4.\nStep-L: The above steps form 2^L clusters, G1, . . . , G_{2^L}, each of size N0. The ideal similarity between\ntwo objects xi, xj belonging to the same cluster is \u00b5ij = \u00b5 (yellow blocks in Figure 1).\n\nFigure 1: (Left) Illustration of the planted hierarchical model for L = 3 along with specification\nof the distributions for similarities at different levels; (Right) Hierarchical block structure in the\nexpected pairwise similarity matrix, where darker implies smaller similarity.\n\nThis gives rise to similarities of the form wij = \u00b5ij + rij for all i < j. By symmetry of M and R,\nwji = wij. We can equivalently assume that, for all i < j, the similarities are independently drawn\nas wij = wji \u223c N(\u00b5ij, \u03c3^2). Note that the pairwise similarity gets smaller in expectation when two\nobjects are merged higher in the true hierarchy. We consider the problem of exact recovery of the\nabove planted structure, that is, correct identification of all the pure clusters G1, . . . , G_{2^L} and recovery\nof the entire hierarchy among the clusters.\n\n2.3 The comparison-based framework\n\nIn Section 2.1 we assumed that, even without a representation of the objects, we had access to a\nsimilarity function w. 
In the rest of this paper, we consider the ordinal setting, where w is not\navailable, and information about similarities can only be accessed through quadruplet comparisons.\nWe assume that we are given a set Q \u2286 {(i, j, k, l) : xi, xj, xk, xl \u2208 X , wij > wkl}, that is, for\nevery ordered tuple (i, j, k, l) \u2208 Q, we know that xi and xj are more similar than xk and xl. There\nexist a total of O(N^4) quadruplets, but in a practical crowdsourcing application, the available set Q\nmay only be a small subset of all possible quadruplets. Since noise is already inherent in the similarities, we\ndo not model separate noise in the comparisons themselves. We assume Q is obtained in either of the two following ways:\nActive comparisons: In this case, the algorithm can adaptively ask an oracle quadruplet queries of\nthe form wij \u2277 wkl and the outcome will be either wij > wkl or wij < wkl.\nPassive comparisons: In this case, for every tuple (i, j, k, l), we assume that with some sampling\nprobability p \u2208 (0, 1], there is a comparison wij \u2277 wkl and based on the outcome either (i, j, k, l) \u2208\nQ or (k, l, i, j) \u2208 Q. We also assume that the observations of the quadruplets are independent.\n\n3 Comparison-based hierarchical clustering\n\nIn this section, we show that single linkage and complete linkage can be easily implemented in the\ncomparison-based setting, provided that we have access to \u2126(N^2) adaptively selected quadruplets.\nHowever, their statistical guarantees are very weak. This prompts us to study two variants of average\nlinkage. On the one hand, Quadruplets-based Average Linkage (4\u2013AL) uses average linkage-like\nideas to directly estimate the cluster level similarities from the quadruplet comparisons. On the other\nhand, Quadruplets Kernel Average Linkage (4K\u2013AL) uses the quadruplet comparisons to estimate\nthe similarities between the different objects and then uses standard average linkage. 
We show that\nboth of these variants have good statistical performances in the following senses: (i) they can exactly\nrecover the planted hierarchy under mild assumptions on the signal-to-noise ratio \u03b4/\u03c3 and the size of\nthe pure clusters N0 = N/2^L in the model introduced in Section 2.2, (ii) 4K\u2013AL only needs O(N ln N)\nactive comparisons to achieve exact recovery, and (iii) both 4K\u2013AL and 4\u2013AL can achieve exact\nrecovery using only a small subset of passively obtained quadruplets (sampling probability p \u226a 1).\n\n3.1 Single linkage (SL) and complete linkage (CL)\n\nThe single and complete linkage algorithms inherently fall in the comparison-based framework. To\nsee this, first notice that the arg max and arg min functions used in these methods only depend on\nquadruplet comparisons. Although it is not possible to exactly compute the linkage value W(G, G'),\none can retrieve, for each pair of clusters, the pair of objects that achieves the maximum or minimum similarity.\nThen, the knowledge of these optimal object pairs is sufficient since our primary aim is to find the\npair of clusters G, G' that maximizes W(G, G') and this can be easily achieved through quadruplet\ncomparisons between the optimal object pairs of every G, G'. This discussion emphasizes that CL and\nSL fall well in the comparison-based framework when the quadruplets can be adaptively chosen, in\norder to select pairs with minimum or maximum similarities. The next proposition, proved in the\nappendix, bounds the number of active comparisons necessary and sufficient to use SL and CL.\nProposition 1 (Active query complexity of SL and CL). 
The SL and CL algorithms require at\nleast \u2126(N^2) and at most O(N^2 ln N) active quadruplet comparisons.\n\nWe now state a sufficient condition for exact recovery of the planted model for both SL and CL as\nwell as a matching (up to constant) necessary condition for SL. The proof is in the appendix.\nTheorem 1 (Exact recovery of planted hierarchy by SL and CL). Assume that \u03b7 \u2208 (0, 1). If\n\u03b4/\u03c3 \u2265 4 \u221a(ln(N/\u03b7)), then SL and CL exactly recover the planted hierarchy with probability 1 \u2212 \u03b7.\nConversely, for \u03b4/\u03c3 \u2264 (1/4) \u221a(ln(N/2^L)) and large N/2^L, SL fails to recover the hierarchy with probability 1/2.\nTheorem 1 implies that a necessary and sufficient condition for exact recovery by single linkage is\nthat the signal-to-noise ratio grows as \u221a(ln N) with the number of examples. This strong requirement\nraises the question of whether one can achieve exact recovery under weaker assumptions and with\nfewer quadruplets. The subsequent sections provide an affirmative answer to this question.\n\n3.2 Quadruplets kernel average linkage (4K\u2013AL)\n\nAverage linkage is difficult to cast to the ordinal framework due to the averaging of pairwise\nsimilarities, wij, which cannot be computed using only comparisons. A first way to overcome this\nissue is to use the quadruplet comparisons to derive some kind of proxies for the similarities wij.\nThese proxy similarities can then be directly used in the standard formulation of average linkage. To\nderive them we use ideas that are close in spirit to the triplet comparisons-based kernel developed by\nKleindessner and von Luxburg (2017a). Furthermore, we propose two different definitions depending\non whether we use active comparisons (Equation 1) or passive comparisons (Equation 3).\n\nActive case. 
We first consider the active case, where the quadruplet comparisons to be evaluated\ncan be chosen by the algorithm. A pair of distinct items (i0, j0) is chosen uniformly at random, and a\nset of landmark points S is constructed such that every k \u2208 {1, . . . , N} is independently added to S\nwith probability q. The proxy similarity between two distinct objects xi and xj is then defined as\n\nKij = \u03a3_{k\u2208S\\{i,j}} (I(wik > wi0j0) \u2212 I(wik < wi0j0)) (I(wjk > wi0j0) \u2212 I(wjk < wi0j0)).    (1)\n\nThe underlying idea is that two similar objects should behave similarly with respect to any third\nobject, that is if xi and xj are similar then we should have wik \u2248 wjk for any other object xk. Since\nwe cannot directly access the similarities, we instead use comparisons to a reference similarity wi0j0\nto evaluate the closeness between wik and wjk.\nThe next theorem presents exact recovery guarantees for 4K\u2013AL with actively obtained comparisons.\nTheorem 2 (Exact recovery of planted hierarchy by 4K\u2013AL with active comparisons). Let\n\u03b7 \u2208 (0, 1) and \u2206 = \u03b72\n\u03c3 e\u22122L2\u03b42/\u03c32. 
There exists an absolute constant C > 0 such that if\n, then with probability at least\nN0 > 4\n\u2206\n1 \u2212 \u03b7, 4K\u2013AL exactly recovers the planted hierarchy using at most 2qN 2 number of actively chosen\nquadruplet comparisons.\nIn particular, if L = O (1), the above statement implies that even with \u03b4\n\u03c3 constant, 4K\u2013AL exactly\nrecovers the planted hierarchy with probability 1 \u2212 \u03b7 using only O (N ln N ) active comparisons.\n\nN and we set q > max\n\n(cid:16) N\n\n(cid:17)(cid:111)\n\n(cid:16) 2\n\nN \u22064 ln\n\n, 3\nN ln\n\nC 22L\n\n(cid:110)\n\n(cid:17)\n\n\u221a\n\n\u03b7\n\n\u03b4\n\n100\n\n\u03b7\n\n5\n\n\fThe above result shows that, in comparison to SL or CL, the proposed 4K\u2013AL method achieves\n\u03c3 , and can also do so with only O (N ln N ) active\nconsistency for smaller signal-to-noise ratio \u03b4\ncomparisons, which is much smaller than that needed by SL and CL. Our result also aligns with the\n(cid:16)\u221a\nconclusion of Emamjomeh-Zadeh and Kempe (2018), who showed that O (N ln N ) active triplet\ncomparisons suf\ufb01ce to recover hierarchy under a different (data-independent) noise model. It is\nalso worth noting that the condition N0 = \u2126\nis necessary for exact recovery of the planted\nhierarchy since the condition is necessary even in the case of planted \ufb02at clustering (Chen and Xu,\n2016, Figure 1).\nFrom a theoretical perspective, it is suf\ufb01cient to use a single random reference similarity wi0j0.\nHowever, in practice, we observe better performances when considering a set R of multiple reference\npairs. Hence, in the experiments, we use the following extension of the above kernel function:\n\n(cid:17)\n\nN\n\n(wik>wi0j0) \u2212 I\n\n(wik<wi0 j0)\n\n(wjk<wi0j0)\n\n.\n\n(2)\n\n(cid:88)\n\n(cid:88)\n\n(cid:16)I\n\n(i0,j0)\u2208R\n\nk\u2208S\\{i,j}\n\nKij =\n\n(cid:17)(cid:16)I\n(wjk>wi0j0 ) \u2212 I\n\n(cid:17)\n\nPassive case. 
Theorem 2 shows that 4K\u2013AL can exactly recover the planted hierarchy even for a\nconstant signal-to-noise ratio, provided that it can actively choose the quadruplets. It is natural to\nask if the same holds in the passive case, where we do not have the freedom of querying speci\ufb01c\ncomparisons but instead have access to a small pre-computed set of quadruplet comparisons Q. We\naddress this problem using the following variant of the aforementioned quadruplets kernel:\n\nN(cid:88)\n\nN(cid:88)\n\n(cid:0)I(i,r,k,l)\u2208Q \u2212 I(k,l,i,r)\u2208Q(cid:1)(cid:0)I(j,r,k,l)\u2208Q \u2212 I(k,l,j,r)\u2208Q(cid:1)\n\nKij =\n\nfor all i (cid:54)= j. This formulation extends the active kernel in (1) by using all(cid:0)N\n\nk,l=1,k<l\n\nr=1\n\n(3)\n\n(cid:1) pairs of (k, l) as\n\n2\n\n(cid:17)\n\n\u221a\n\n\u2206\n\nC 2L\n\u22062\n\n1\nN ln\n\n\u03b7\n\n(cid:114)\n\n(cid:16) 2\n\n\u03b7\n\n, 2\nN 4 ln\n\n(cid:16) N\n\n(cid:17)(cid:27)\n\n(cid:26)\n2\u03c3 e\u2212L2\u03b42/4\u03c32. There exists an absolute constant C > 0 such that if N0 > 8\n\nreference similarities instead of a single pair (i0, j0). But each term in the sum contributes only when\nwe simultaneously observe the comparisons between (i, r) and (k, l) and between (j, r) and (k, l).\nTheorem 3 presents guarantees for 4K\u2013AL with quadruplets obtained from the passive comparisons\nmodel in Section 2.3.\nTheorem 3 (Exact recovery of planted hierarchy by 4K\u2013AL with passive comparisons). 
Let\n\u03b7 \u2208 (0, 1) and \u2206 = \u03b4\nN\nand we set p > max\n, then with probability at least 1 \u2212 \u03b7, the 4K\u2013AL algorithm exactly recovers the planted hierarchy using at most pN^4 quadruplet comparisons,\nwhich are passively obtained based on the model described in Section 2.3.\nIn particular, if L = O(1), the above statement implies that even with \u03b4/\u03c3 constant, 4K\u2013AL exactly\nrecovers the planted hierarchy with probability 1 \u2212 \u03b7 using O(N^{7/2} ln N) passive comparisons.\nThe derived conditions for exact recovery are similar to Theorem 2 in terms of \u03b4/\u03c3, but passive 4K\u2013AL\nrequires a much larger number of passive comparisons than active 4K\u2013AL. While this may seem\ndisappointing, O(N^{7/2}) passive comparisons might, in fact, be necessary in this case. Indeed,\nEmamjomeh-Zadeh and Kempe (2018, Theorem 2.3) show that in the case of triplets, \u2126(N^3) passive\ntriplet comparisons are necessary to exactly recover a hierarchy in the worst case. The proof can\nbe easily adapted to the quadruplet comparison setting to prove a worst-case complexity of \u2126(N^4)\npassive quadruplets. The above result shows that under the planted model, which is simpler than\nthe worst-case, the query complexity can be improved at least by a factor of \u221aN. Further study is\nrequired to identify a precise necessary condition. We also believe that the passive query complexity\nshould reduce considerably if the signal-to-noise ratio \u03b4/\u03c3 grows with N.\n\n3.3 Quadruplets-based average linkage (4\u2013AL)\n\nIn 4K\u2013AL we derived a proxy for the similarities between objects and then used standard average\nlinkage. In this section we consider a different approach where we use the quadruplet comparisons\nto define a new cluster-level similarity function. 
This method is particularly well suited when it\nis not possible to actively query the comparisons. We assume that we are given a set of passively\nobtained quadruplets Q as in the previous section (4K\u2013AL with passive comparisons). Using this set\nof comparisons, one can estimate the relative similarity between two pairs of clusters. For instance,\nlet G1, G2, G3, G4 be four clusters such that G1, G2 are disjoint and so are G3, G4, and define\n\nW_Q(G1, G2 \u2016 G3, G4) = \u03a3_{xi\u2208G1} \u03a3_{xj\u2208G2} \u03a3_{xk\u2208G3} \u03a3_{xl\u2208G4} (I_{(i,j,k,l)\u2208Q} \u2212 I_{(k,l,i,j)\u2208Q}) / (|G1||G2||G3||G4|).    (4)\n\nThe idea is that clusters G1, G2 are more similar to each other than G3, G4 if their objects are, on\naverage, more similar to each other than the objects of G3 and G4. This formulation suggests that an\nagglomerative clustering should merge G1, G2 before G3, G4 if W_Q(G1, G2 \u2016 G3, G4) > 0. Also,\nnote that W_Q(G1, G2 \u2016 G3, G4) = \u2212W_Q(G3, G4 \u2016 G1, G2) and W_Q(G1, G2 \u2016 G1, G2) = 0, which\nhints that (4) is a preference relation between pairs of clusters. We use the above preference relation\nW_Q to define a new cluster-level similarity function W that can be used in Algorithm 1. Hence,\ngiven two clusters Gp, Gq, p \u2260 q, we define their similarity as\n\nW(Gp, Gq) = \u03a3_{r,s=1, r\u2260s}^{K} W_Q(Gp, Gq \u2016 Gr, Gs) / (K(K \u2212 1)).    (5)\n\nThe idea is that two clusters Gp and Gq are similar to each other if, on average, the pair is often\npreferred over the other possible cluster pairs. The above measure W provides an average linkage\napproach based on quadruplets (4\u2013AL), whose statistical guarantees are presented below.\nTheorem 4 (Exact recovery of planted hierarchy by 4\u2013AL with passive comparisons). 
Let \u03b7 \u2208\n(0, 1) and \u2206 = (\u03b4/(2\u03c3)) e^{\u2212L^2\u03b4^2/(4\u03c3^2)}. Assume the following:\n(i) An initial step partitions X into pure clusters of sizes in the range [m, 2m] for some m \u2264 N0/2.\n(ii) Q is a passively obtained set of quadruplet comparisons, where each tuple (i, j, k, l) is observed\nindependently with probability p > (C/(m\u2206^2)) max{ln N, (1/m) ln(1/\u03b7)} for some constant C > 0.\nThen, with probability 1 \u2212 \u03b7, starting from the given initial partition and using |Q| \u2264 pN^4\npassive comparisons, 4\u2013AL exactly recovers the planted hierarchy.\nIn particular, if L = O(1), the above statement implies that, when \u03b4/\u03c3 is a constant, 4\u2013AL exactly\nrecovers the planted hierarchy with probability 1 \u2212 \u03b7 using O(N^4 ln N / m) passive comparisons.\n\nCompared to 4K\u2013AL (Theorem 3), the guarantee for 4\u2013AL in Theorem 4 additionally requires an\ninitial partitioning of X into small pure clusters of size m. This is reasonable in the context of the\nhierarchical clustering literature since existing consistency results for average linkage also require\nsimilar assumptions (Cohen-Addad et al., 2018, Theorem 5.8). In principle, one may use passive 4K\u2013AL\nto obtain these initial clusters. Theorem 4 shows that if the size of initial clusters is much larger\nthan ln N, then we do not need to observe all the quadruplets. Moreover, if L = O(1) and we have\n\u2126(N0)-sized initial clusters, then the subsequent steps of 4\u2013AL require only O(N^3 ln N) passive\ncomparisons out of the O(N^4) total number of available comparisons. This is fewer quadruplets\nthan 4K\u2013AL, but it is still large for practical purposes. It remains an open question whether better\nsampling rates can be achieved in the passive case. 
From a practical perspective, our experiments\nin Section 4 demonstrate that 4\u2013AL performs better than 4K\u2013AL even when no initial clusters are\nprovided, that is m = 1.\n\n4 Experiments\n\nIn this section we evaluate our approaches on several problems: we empirically verify our theoretical\nfindings, we compare our methods\u00b9 to ordinal embedding based approaches on standard datasets, and\nwe illustrate their behaviour on a comparison-based dataset.\n\n\u00b9The code of our methods is available at https://github.com/mperrot/ComparisonHC.\n\n4.1 Planted hierarchical model\n\nWe first use the planted hierarchical model presented in Section 2.2 to generate data and study the\nperformance of the various methods introduced in Section 3.\n\nFigure 2 (panels: (a) Proportion p = 0.01, (b) Proportion p = 0.1, (c) Proportion p = 1): AARI of the proposed methods (higher is better) on data obtained from the planted\nhierarchical model with \u00b5 = 0.8, \u03c3 = 0.1, L = 3, N0 = 30. In Figures 2a, 2b, and 2c, the methods\nget different proportions p of all the quadruplets. Best viewed in color.\n\nData. Recall that our generative model has several parameters, the within-cluster mean similarity \u00b5,\nthe variance \u03c3^2, the separability constant \u03b4, the depth of the planted partition L and the number of\nexamples in each cluster N0. From the different guarantees presented in Section 3, it is clear that\nthe hardness of the problem depends mainly on the signal-to-noise ratio \u03b4/\u03c3, and the probability p of\nobserving samples for the passive methods. Hence, to study the behaviour of the different methods\nwith respect to these two quantities, we set \u00b5 = 0.8, \u03c3 = 0.1, N0 = 30, and L = 3 and we vary\n\u03b4 \u2208 {0.02, 0.04, . . . , 0.2} and p \u2208 {0.01, 0.02, . . . , 0.1, 1}.\nMethods. We study SL, CL, which always use the same number of active comparisons and thus are\nnot impacted by p. 
We also consider 4K\u2013AL with passive comparisons and its active counterpart,\n4K\u2013AL\u2013act, implemented as described in (2) with q = (ln N)/N and the number of references in R chosen\nso that the number of comparisons observed is the same as for the passive methods. Finally, we study\n4\u2013AL with no initial pure clusters and two variants, 4\u2013AL\u2013I3 and 4\u2013AL\u2013I5, that have access to initial\nclusters of sizes 3 and 5, respectively. These initial clusters were obtained by uniformly sampling\nwithout replacement from the N0 examples contained in each of the 2^L ground-truth clusters.\n\nEvaluation function. As a measure of performance we use the Averaged Adjusted Rand Index\n(AARI) between the ground-truth hierarchy and the hierarchies learned by the different methods.\nThe main idea behind the AARI is to extend the Adjusted Rand Index (Hubert and Arabie, 1985)\nto hierarchies by averaging over the different levels (see the appendix for a formal definition). This\nmeasure takes values in [0, 1] with higher values for more similar hierarchies; AARI(C, C') = 1\nimplies identical hierarchies. We report the mean and the standard deviation of 10 repetitions.\n\nResults. In Figure 2 we present the results for p = 0.01, p = 0.1, and p = 1. We defer the other\nresults to the appendix. Firstly, in line with the theory, SL can hardly recover the planted hierarchy,\neven for large values of \u03b4/\u03c3. CL performs better than SL, which suggests that it might be possible to\nderive better guarantees for it. We observe that 4K\u2013AL, 4K\u2013AL\u2013act, and 4\u2013AL are able to exactly recover\nthe true hierarchy for smaller signal-to-noise ratios and their performance does not degrade much when\nthe number of sampled comparisons is reduced. Finally, as expected, the best method is 4\u2013AL\u2013I5.
It\nuses large initial clusters but recovers the true hierarchy even for very small values of \u03b4/\u03c3.\n\n4.2 Standard clustering datasets\n\nIn this second set of experiments we compare our passive methods, 4K\u2013AL with passive comparisons\nand 4\u2013AL without initial clusters, to two baselines that use ordinal embedding as a first step.\n\nBaselines. We consider t-STE (Van Der Maaten and Weinberger, 2012) and FORTE (Jain et al.,\n2016), followed by a standard average linkage approach using a cosine similarity as the base metric\n(tSTE\u2013AL and FORTE\u2013AL). These two methods are parametrized by the embedding dimension\nd. Since it is often difficult to automatically tune parameters in clustering (because of the lack of\nground truth), we consider several embedding dimensions and report the best results in the main paper.\nIn the appendix, we detail the cosine similarity and report results for other embedding dimensions.\n\nData. We evaluate the different approaches on 3 different datasets commonly used in hierarchical\nclustering: Zoo, Glass and 20news (Heller and Ghahramani, 2005; Vikram and Dasgupta, 2016). To\nfit the comparison-based setting we generate the comparisons using the cosine similarity. Since it\nis not realistic to assume that all the comparisons are available, we use the procedure described in\nSection 2.3 to passively obtain a proportion p \u2208 {0.01, 0.02, . . . , 0.1} of all the quadruplets. Some\nstatistics on the datasets and details on the generation of the comparisons are presented in the appendix.\n\n(a) Zoo (b) Glass (c) 20news\n\nFigure 3: Dasgupta\u2019s score (lower is better) of the different methods on the Zoo, Glass and 20news\ndatasets. The embedding dimension for FORTE\u2013AL and tSTE\u2013AL is set to 2. Best viewed in color.\n\nEvaluation function. Contrary to the planted hierarchical model, we do not have access to a\nground-truth hierarchy and thus we cannot use the AARI measure.
Instead, we use the recently\nproposed Dasgupta\u2019s cost (Dasgupta, 2016), which has been specifically designed to evaluate hierarchical\nclustering methods. The idea of this cost is that similar objects that are merged higher in the hierarchy\nshould be penalized. Hence, a lower cost indicates a better hierarchy. Details are provided in the\nappendix. For all the experiments we report the mean and the standard deviation of 10 repetitions.\n\nResults. We report the results in Figure 3. We note that the proportion of comparisons does not\nhave a large impact as the results are, on average, stable across all regimes. Our methods are either\ncomparable to or better than the embedding-based ones. They do not need to first embed the examples\nand thus do not impose a strong Euclidean structure on the data. The impact of this structure is more\nor less pronounced depending on the dataset. Furthermore, as illustrated in the appendix, a poor\nchoice of embedding dimension can drastically worsen the results of the embedding methods.\n\nComparison-based dataset. In the appendix, we also apply the different methods to a\ncomparison-based dataset called Car (Kleindessner and von Luxburg, 2017b).\n\n5 Conclusion\n\nWe investigated the problem of hierarchical clustering in a comparison-based setting. We showed\nthat the single and complete linkage algorithms (SL and CL) could be used in the setting where\ncomparisons are actively queried, but with poor exact recovery guarantees under a planted hierarchical\nmodel. We also proposed two new approaches based on average linkage (4K\u2013AL and 4\u2013AL) that\ncan be used in the setting of passively obtained comparisons with good guarantees in terms of exact\nrecovery of the planted hierarchy. An active version of 4K\u2013AL achieves exact recovery using only\nO(N ln N) active comparisons.
Empirically, we confirmed our theoretical findings and compared\nour methods to two ordinal embedding based baselines on standard and comparison-based datasets.\n\nThe paper leaves several open problems. From an algorithmic perspective, the key question is whether\none can develop similar provable methods in the triplet setting, where one has access to comparisons\nof the form \u201cx_i is more similar to x_j than to x_k\u201d. An equivalent to passive 4K\u2013AL can be obtained\nusing the triplet kernel of Kleindessner and von Luxburg (2017a), while triplet-based variants of\nactive 4K\u2013AL and 4\u2013AL require careful design. We leave the description of such algorithms and\ntheir theoretical analysis under a planted hierarchy to follow-up work. From a theoretical perspective,\nthe main question is to derive necessary conditions and query complexities for exact recovery of the\nplanted hierarchy and, subsequently, to validate whether the proposed algorithms are indeed optimal.\nAdditionally, it would be interesting to analyse the performance of the proposed methods in terms of\nDasgupta\u2019s score, and in the presence of noisy queries, that is, when some answers are randomly flipped.\n\nAcknowledgments\n\nThis work has been supported by the Institutional Strategy of the University of T\u00fcbingen (Deutsche\nForschungsgemeinschaft, DFG, ZUK 63), by the DFG Cluster of Excellence \u201cMachine Learning\n\u2013 New Perspectives for Science\u201d, EXC 2064/1, project number 390727645, by the BMBF through\nthe Tuebingen AI Center (FKZ: 01IS18039A), and by the Baden-W\u00fcrttemberg Eliteprogramm for\nPostdocs.\n\nReferences\n\nAgarwal, S., Wills, J., Cayton, L., Lanckriet, G., Kriegman, D., and Belongie, S. (2007). Generalized\nnon-metric multidimensional scaling. In International Conference on Artificial Intelligence and\nStatistics, pages 11\u201318.\n\nAmid, E. and Ukkonen, A. (2015). Multiview triplet embedding: Learning attributes in multiple\nmaps.
In International Conference on Machine Learning, pages 1472\u20131480.\n\nArias-Castro, E. et al. (2017). Some theory for ordinal embedding. Bernoulli, 23(3):1663\u20131693.\n\nBalakrishnan, S., Xu, M., Krishnamurthy, A., and Singh, A. (2011). Noise thresholds for spectral\nclustering. In Advances in Neural Information Processing Systems, pages 954\u2013962.\n\nBorg, I. and Groenen, P. (2005). Modern multidimensional scaling: Theory and applications.\nSpringer.\n\nChaudhuri, K., Dasgupta, S., Kpotufe, S., and von Luxburg, U. (2014). Consistent procedures for\ncluster tree estimation and pruning. IEEE Transactions on Information Theory, 60(12):7900\u20137912.\n\nChen, Y. and Xu, J. (2016). Statistical-computational tradeoffs in planted problems and submatrix\nlocalization with a growing number of clusters and submatrices. Journal of Machine Learning\nResearch, 17(27):1\u201357.\n\nCohen-Addad, V., Kanade, V., Mallmann-Trenn, F., and Mathieu, C. (2018). Hierarchical clustering:\nObjective functions and algorithms. In Symposium on Discrete Algorithms, pages 378\u2013397.\n\nDasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. In Symposium on\nTheory of Computing, pages 118\u2013127.\n\nEmamjomeh-Zadeh, E. and Kempe, D. (2018). Adaptive hierarchical clustering using ordinal queries.\nIn Symposium on Discrete Algorithms, pages 415\u2013429.\n\nHaghiri, S., Garreau, D., and von Luxburg, U. (2018). Comparison-based random forests. In\nInternational Conference on Machine Learning, pages 1866\u20131875.\n\nHaghiri, S., Ghoshdastidar, D., and von Luxburg, U. (2017). Comparison-based nearest neighbor\nsearch. In International Conference on Artificial Intelligence and Statistics, pages 851\u2013859.\n\nHartigan, J. A. (1981). Consistency of single linkage for high-density clusters. Journal of the\nAmerican Statistical Association, 76(374):388\u2013394.\n\nHeikinheimo, H. and Ukkonen, A. (2013). The crowd-median algorithm.
In AAAI Conference on\nHuman Computation and Crowdsourcing.\n\nHeller, K. A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In International\nConference on Machine Learning, pages 297\u2013304.\n\nHubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193\u2013218.\n\nJain, L., Jamieson, K. G., and Nowak, R. (2016). Finite sample prediction and recovery bounds for\nordinal embedding. In Advances in Neural Information Processing Systems, pages 2711\u20132719.\n\nJamieson, K. G. and Nowak, R. D. (2011). Low-dimensional embedding using adaptively selected\nordinal data. In Annual Allerton Conference on Communication, Control, and Computing, pages\n1077\u20131084.\n\nKazemi, E., Chen, L., Dasgupta, S., and Karbasi, A. (2018). Comparison based learning from weak\noracles. In International Conference on Artificial Intelligence and Statistics, pages 1849\u20131858.\n\nKleindessner, M. and von Luxburg, U. (2014). Uniqueness of ordinal embedding. In Conference on\nLearning Theory, pages 40\u201367.\n\nKleindessner, M. and von Luxburg, U. (2015). Dimensionality estimation without distances. In\nInternational Conference on Artificial Intelligence and Statistics, pages 471\u2013479.\n\nKleindessner, M. and von Luxburg, U. (2017a). Kernel functions based on triplet similarity\ncomparisons. In Advances in Neural Information Processing Systems, pages 6807\u20136817.\n\nKleindessner, M. and von Luxburg, U. (2017b). Lens depth function and k-relative neighborhood\ngraph: versatile tools for ordinal data analysis. The Journal of Machine Learning Research,\n18(1):1889\u20131940.\n\nMoseley, B. and Wang, J. (2017). Approximation bounds for hierarchical clustering: Average linkage,\nbisecting k-means, and local search. In Advances in Neural Information Processing Systems 30,\npages 3097\u20133106.\n\nRoy, A. and Pokutta, S. (2016). Hierarchical clustering via spreading metrics.
In Advances in Neural\nInformation Processing Systems, pages 2316\u20132324.\n\nShepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown\ndistance function. I. Psychometrika, 27(2):125\u2013140.\n\nStewart, N., Brown, G. D., and Chater, N. (2005). Absolute identification by relative judgment.\nPsychological Review, 112(4):881.\n\nTamuz, O., Liu, C., Belongie, S., Shamir, O., and Kalai, A. T. (2011). Adaptively learning the crowd\nkernel. In International Conference on Machine Learning, pages 673\u2013680.\n\nTerada, Y. and von Luxburg, U. (2014). Local ordinal embedding. In International Conference on\nMachine Learning, pages 847\u2013855.\n\nUkkonen, A. (2017). Crowdsourced correlation clustering with relative distance comparisons. arXiv\npreprint arXiv:1709.08459.\n\nUkkonen, A., Derakhshan, B., and Heikinheimo, H. (2015). Crowdsourced nonparametric density\nestimation using relative distances. In AAAI Conference on Human Computation and Crowdsourcing.\n\nVan Der Maaten, L. and Weinberger, K. (2012). Stochastic triplet embedding. In IEEE International\nWorkshop on Machine Learning for Signal Processing, pages 1\u20136.\n\nVikram, S. and Dasgupta, S. (2016). Interactive Bayesian hierarchical clustering. In International\nConference on Machine Learning, pages 2081\u20132090.\n\nYoung, F. W. (1987). Multidimensional scaling: History, theory, and applications. Lawrence Erlbaum\nAssociates.\n\nZhang, L., Maji, S., and Tomioka, R. (2015). Jointly learning multiple measures of similarities from\ntriplet comparisons.
arXiv preprint arXiv:1503.01521.", "award": [], "sourceid": 4062, "authors": [{"given_name": "Debarghya", "family_name": "Ghoshdastidar", "institution": "Technical University Munich"}, {"given_name": "Micha\u00ebl", "family_name": "Perrot", "institution": "Max Planck Institute for Intelligent Systems"}, {"given_name": "Ulrike", "family_name": "von Luxburg", "institution": "University of T\u00fcbingen"}]}