{"title": "Correlation Clustering with Adaptive Similarity Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 12531, "page_last": 12540, "abstract": "In correlation clustering, we are given $n$ objects together with a binary similarity score between each pair of them.\nThe goal is to partition the objects into clusters so to minimise the disagreements with the scores.\nIn this work we investigate correlation clustering as an active learning problem: each similarity score can be learned by making a query, and the goal is to minimise both the disagreements and the total number of queries.\nOn the one hand, we describe simple active learning algorithms, which provably achieve an almost optimal trade-off while giving cluster recovery guarantees, and we test them on different datasets.\nOn the other hand, we prove information-theoretical bounds on the number of queries necessary to guarantee a prescribed disagreement bound.\nThese results give a rich characterization of the trade-off between queries and clustering error.", "full_text": "Correlation Clustering\n\nwith Adaptive Similarity Queries\n\nMarco Bressan\n\nDepartment of Computer Science\n\nUniversity of Rome Sapienza\n\nAndrea Paudice\n\nDepartment of Computer Science\n\nUniversit\u00e0 degli Studi di Milano & IIT\n\nNicol\u00f2 Cesa-Bianchi\n\nDepartment of Computer Science & DSRC\n\nUniversit\u00e0 degli Studi di Milano\n\nFabio Vitale\n\nDepartment of Computer Science\n\nUniversity of Lille & Inria\n\nAbstract\n\nIn correlation clustering, we are given n objects together with a binary similarity\nscore between each pair of them. The goal is to partition the objects into clusters\nso to minimise the disagreements with the scores. In this work we investigate\ncorrelation clustering as an active learning problem: each similarity score can be\nlearned by making a query, and the goal is to minimise both the disagreements\nand the total number of queries. 
On the one hand, we describe simple active learning algorithms, which provably achieve an almost optimal trade-off while giving cluster recovery guarantees, and we test them on different datasets. On the other hand, we prove information-theoretical bounds on the number of queries necessary to guarantee a prescribed disagreement bound. These results give a rich characterization of the trade-off between queries and clustering error.

1 Introduction

Clustering is a central problem in unsupervised learning. A clustering problem is typically represented by a set of elements together with a notion of similarity (or dissimilarity) between them. When the elements are points in a metric space, dissimilarity can be measured via a distance function. In more general settings, when the elements to be clustered are members of an abstract set V, similarity is defined by an arbitrary symmetric function σ defined on pairs of distinct elements in V. Correlation Clustering (CC) [4] is a well-known special case where σ is a {−1, +1}-valued function establishing whether any two distinct elements of V are similar or not. The objective of CC is to cluster the points in V so to maximize the correlation with σ. More precisely, CC seeks a clustering minimizing the number of errors, where an error is given by any pair of elements having similarity −1 and belonging to the same cluster, or having similarity +1 and belonging to different clusters. Importantly, there are no a priori limitations on the number of clusters or their sizes: all partitions of V, including the trivial ones, are valid. Given V and σ, the error achieved by an optimal clustering is known as the Correlation Clustering index, denoted by OPT. A convenient way of representing σ is through a graph G = (V, E) where {u, v} ∈ E iff σ(u, v) = +1.
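To make the objective concrete, the disagreement count that CC minimizes can be computed directly from a clustering and the similarity scores. Below is a minimal sketch; the helper name `clustering_cost` and the data layout are our own, not from the paper:

```python
from itertools import combinations

def clustering_cost(labels, sigma):
    """Number of disagreements between a clustering and a similarity function.

    labels: dict mapping each node to a cluster id.
    sigma:  dict mapping each pair (u, v) with u < v to +1 or -1.
    """
    cost = 0
    for (u, v), s in sigma.items():
        same = labels[u] == labels[v]
        # error: a -1 pair inside a cluster, or a +1 pair split across clusters
        if (s == -1 and same) or (s == +1 and not same):
            cost += 1
    return cost

# Two cliques {0,1,2} and {3,4} plus one flipped pair (1,3)
nodes = range(5)
sigma = {(u, v): -1 for u, v in combinations(nodes, 2)}
for pair in [(0, 1), (0, 2), (1, 2), (3, 4), (1, 3)]:
    sigma[pair] = +1
labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
print(clustering_cost(labels, sigma))  # -> 1, the single flipped pair
```

Minimizing this quantity over all partitions (with no constraint on the number of clusters) yields OPT; on this toy instance OPT = 1, achieved by the clustering above.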
Note that OPT = 0 is equivalent to a perfectly clusterable graph (i.e., G is the union of disjoint cliques). Since its introduction, CC has attracted a lot of interest in the machine learning community, and has found numerous applications in entity resolution [16], image analysis [18], and social media analysis [25]. Known problems in data integration [14] and biology [5] can be cast into the framework of CC [26].

From a machine learning viewpoint, we are interested in settings when the similarity function σ is not available beforehand, and the algorithm must learn σ by querying for its value on pairs of objects. This setting is motivated by scenarios in which the similarity information is costly to obtain. For example, in entity resolution, disambiguating between two entities may require invoking the user's help. Similarly, deciding if two documents are similar may require a complex computation, and possibly the interaction with human experts. In these active learning settings, the learner's goal is to trade the clustering error against the number of queries. Hence, the fundamental question is: how many queries are needed to achieve a specified clustering error? Or, in other terms, how close can we get to OPT, under a prescribed query budget Q?

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Our Contributions

In this work we characterize the trade-off between the number Q of queries and the clustering error on n points. The table below summarizes our bounds in the context of previous work.
Running time and upper/lower bounds on the expected clustering error are expressed in terms of the number of queries Q, and all our upper bounds assume Q = Ω(n) while our lower bounds assume Q = O(n²).

Running time              | Expected clustering error        | Reference
--------------------------+----------------------------------+--------------------------
Q + LP solver + rounding  | 3(ln n + 1)OPT + O(n^{5/2}/√Q)   | [7]
Q                         | 3OPT + O(n³/Q)                   | Theorem 1 (see also [6])
Exponential               | OPT + O(n^{5/2}/√Q)              | Theorem 7
Exponential (OPT = 0)     | Õ(n³/Q)                          | Theorem 7
Unrestricted (OPT = 0)    | Ω(n²/√Q)                         | Theorem 8
Unrestricted (OPT ≫ 0)    | OPT + Ω(n³/Q)                    | Theorem 9

Our first set of contributions is algorithmic. We take inspiration from an existing greedy algorithm, KwikCluster [2], that has expected error 3OPT but a vacuous O(n²) worst-case bound on the number of queries. We propose a variant of KwikCluster, called ACC, for which we prove several desirable properties. First, ACC achieves expected clustering error 3OPT + O(n³/Q), where Q = Ω(n) is a deterministic bound on the number of queries. In particular, if ACC is run with Q = (n choose 2), then it becomes exactly equivalent to KwikCluster. Second, ACC recovers adversarially perturbed latent clusters. More precisely, if the input contains a cluster C obtained from a clique by adversarially perturbing a fraction ε of its edges (internal to the clique or leaving the clique), then ACC returns a cluster Ĉ such that E[|C ⊕ Ĉ|] = O(ε|C| + n²/Q), where ⊕ denotes symmetric difference. This means that ACC recovers almost completely all perturbed clusters that are large enough to be "seen" with Q queries. We also show, under stronger assumptions, that via independent executions of ACC one can recover exactly all large clusters with high probability. Third, we show a variant of ACC, called ACCESS (for Early Stopping Strategy), that makes significantly fewer queries on some graphs. For example, when OPT = 0 and there are Ω(n³/Q) similar pairs, the expected number of queries made by ACCESS is only the square root of the queries made by ACC. In exchange, ACCESS makes at most Q queries in expectation rather than deterministically.

Our second set of contributions is a nearly complete information-theoretic characterization of the query vs. clustering error trade-off (thus, ignoring computational efficiency). Using VC theory, we prove that for all Q = Ω(n) the strategy of minimizing disagreements on a random subset of pairs achieves, with high probability, clustering error bounded by OPT + O(n^{5/2}/√Q), which reduces to Õ(n³/Q) when OPT = 0. The VC theory approach can be applied to any efficient approximation algorithm, too. The catch is that the approximation algorithm cannot ask the similarity of arbitrary pairs, but only of pairs included in the random sample of edges. The best known approximation factor in this case is 3(ln n + 1) [15], which gives a clustering error bound of 3(ln n + 1)OPT + O(n^{5/2}/√Q) with high probability. This was already observed in [7] albeit in a slightly different context.

We complement our upper bounds by developing two information-theoretic lower bounds; these lower bounds apply to any algorithm issuing Q = O(n²) queries, possibly chosen in an adaptive way. For the general case, we show that any algorithm must suffer an expected clustering error of at least OPT + Ω(n³/Q). In particular, for Q = Θ(n²) any algorithm still suffers an additive error of order n, and for Q = Ω(n) our algorithm ACC is essentially optimal in its additive error term. For the special case OPT = 0, we show a lower bound Ω(n²/√Q).

Finally, we evaluate our algorithms empirically on real-world and synthetic datasets.

2 Related work

Minimizing the correlation clustering error is APX-hard [9], and the best efficient algorithm found so far achieves 2.06 OPT [10]. This almost matches the best possible approximation factor 2 achievable via the natural LP relaxation of the problem [9]. A very simple and elegant algorithm for approximating CC is KwikCluster [2]. At each round, KwikCluster draws a random pivot π_r from V, queries the similarities between π_r and every other node in V, and creates a cluster C containing π_r and all points u such that σ(π_r, u) = +1. The algorithm then recursively invokes itself on V \ C. On any instance of CC, KwikCluster achieves an expected error bounded by 3OPT. However, it is easy to see that KwikCluster makes Θ(n²) queries in the worst case (e.g., if σ is the constant function −1). Our algorithms can be seen as a parsimonious version of KwikCluster whose goal is reducing the number of queries.

The work closest to ours is [6]. Their algorithm runs KwikCluster on a random subset of 1/(2ε) nodes and stores the set Π of resulting pivots. Then, each node v ∈ V \ Π is assigned to the cluster identified by the pivot π ∈ Π with smallest index and such that σ(v, π) = +1. If no such pivot is found, then v becomes a singleton cluster. According to [6, Lemma 4.1], the expected clustering error for this variant is 3OPT + O(εn²), which can be compared to our bound for ACC by setting Q = n/ε. On the other hand our algorithms are much simpler and significantly easier to analyze. This allows us to prove a set of additional properties, such as cluster recovery and instance-dependent query bounds.
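The KwikCluster recursion is short enough to sketch in full. Below is our own minimal rendering, with `sigma` as a query oracle (each call counts as one query); helper names are ours. On a perfectly clusterable instance (OPT = 0) it returns exactly the latent cliques:

```python
import random

def kwik_cluster(nodes, sigma, rng=None):
    """KwikCluster sketch: draw a pivot u.a.r., query it against every
    remaining node, cluster it with all its positive neighbours, recurse."""
    rng = rng or random.Random(0)
    remaining = list(nodes)
    clusters = []
    while remaining:
        pivot = remaining.pop(rng.randrange(len(remaining)))
        cluster = {pivot} | {u for u in remaining if sigma(pivot, u) == +1}
        remaining = [u for u in remaining if u not in cluster]
        clusters.append(sorted(cluster))
    return clusters

# Two latent cliques {0,1,2} and {3,4,5}, no noise (OPT = 0)
sigma = lambda u, v: +1 if (u < 3) == (v < 3) else -1
print(sorted(kwik_cluster(range(6), sigma)))  # -> [[0, 1, 2], [3, 4, 5]]
```

On σ ≡ −1 every round queries all remaining nodes and produces a singleton, which is exactly the Θ(n²) worst case mentioned above.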
It is unclear whether these results are obtainable with the techniques of [6].

Another line of work attempts to circumvent computational hardness by using the more powerful same-cluster queries (SCQ). A same-cluster query tells whether any two given nodes are clustered together according to an optimal clustering or not. In [3] SCQs are used to design an FPTAS for a variant of CC with bounded number of clusters. In [23] SCQs are used to design algorithms for solving CC optimally by giving bounds on Q which depend on OPT. Unlike our setting, both works assume all (n choose 2) similarities are known in advance. The work [21] considers the case in which there is a latent clustering with OPT = 0. The algorithm can issue SCQs, however the oracle is noisy: each query is answered incorrectly with some probability, and the noise is persistent (repeated queries give the same noisy answer). The above setting is closely related to the stochastic block model (SBM), which is a well-studied model for cluster recovery [1, 19, 22]. However, few works investigate SBMs with pairwise queries [12]. Our setting is strictly harder because our oracle has a budget of OPT adversarially incorrect answers.

A different model is edge classification. Here the algorithm is given a graph G with hidden binary labels on the edges. The task is to predict the sign of all edges by querying as few labels as possible [7, 11, 13]. As before, the oracle can have a budget OPT of incorrect answers, or a latent clustering with OPT = 0 is assumed and the oracle's answers are affected by persistent noise. Unlike correlation clustering, in edge classification the algorithm is not constrained to predict in agreement with a partition of the nodes. On the other hand, the algorithm cannot query arbitrary pairs of nodes in V, but only those that form an edge in G.

Preliminaries and notation. We denote by V ≡ {1, . . . , n} the set of input nodes, by E ≡ (V choose 2) the set of all pairs {u, v} of distinct nodes in V, and by σ : E → {−1, +1} the binary similarity function. A clustering C is a partition of V in disjoint clusters Ci : i = 1, . . . , k. Given C and σ, the set Γ_C of mistaken edges contains all pairs {u, v} such that σ(u, v) = −1 and u, v belong to the same cluster of C, and all pairs {u, v} such that σ(u, v) = +1 and u, v belong to different clusters of C. The cost Δ_C of C is |Γ_C|. The correlation clustering index is OPT = min_C Δ_C, where the minimum is over all clusterings C. We often view V, σ as a graph G = (V, E) where {u, v} ∈ E is an edge if and only if σ(u, v) = +1. In this case, for any subset U ⊆ V we let G[U] be the subgraph of G induced by U, and for any v ∈ V we let N_v be the neighbor set of v.

A triangle is any unordered triple T = {u, v, w} ⊆ V. We denote by e = {u, w} a generic triangle edge; we write e ⊂ T and v ∈ T \ e. We say T is a bad triangle if the labels σ(u, v), σ(u, w), σ(v, w) are {+, +, −} (the order is irrelevant). We denote by T the set of all bad triangles in V. It is easy to see that the number of edge-disjoint bad triangles is a lower bound on OPT.

Due to space limitations, here most of our results are stated without proof, or with a concise proof sketch; the full proofs can be found in the supplementary material.

3 The ACC algorithm

We introduce our active learning algorithm ACC (Active Correlation Clustering).

Algorithm 1 ACC with query rate f
Parameters: residual node set Vr, round index r
1: if |Vr| = 0 then RETURN
2: if |Vr| = 1 then output singleton cluster Vr and RETURN
3: if r > ⌈f(|V1| − 1)⌉ then RETURN
4: Draw pivot πr u.a.r. from Vr
5: Cr ← {πr}                                          ▷ Create new cluster and add the pivot to it
6: Draw a random subset Sr of ⌈f(|Vr| − 1)⌉ nodes from Vr \ {πr}
7: for each u ∈ Sr do query σ(πr, u)
8: if ∃ u ∈ Sr such that σ(πr, u) = +1 then           ▷ Check if there is at least a positive edge
9:     Query all remaining pairs (πr, u) for u ∈ Vr \ ({πr} ∪ Sr)
10:    Cr ← Cr ∪ {u : σ(πr, u) = +1}                  ▷ Populate cluster based on queries
11: Output cluster Cr
12: ACC(Vr \ Cr, r + 1)                               ▷ Recursive call on the remaining nodes

ACC has the same recursive structure as KwikCluster. First, it starts with the full instance V1 = V. Then, for each round r = 1, 2, . . . it selects a random pivot πr ∈ Vr, queries the similarities between πr and a subset of Vr, removes πr and possibly other points from Vr, and proceeds on the remaining residual subset Vr+1. However, while KwikCluster queries σ(πr, u) for all u ∈ Vr \ {πr}, ACC queries only ⌈f(nr)⌉ ≤ nr other nodes u (lines 6–7), where nr = |Vr| − 1. Thus, while KwikCluster always finds all positive labels involving the pivot πr, ACC can find them or not, with a probability that depends on f. The function f is called query rate function and dictates the tradeoff between the clustering cost Δ and the number of queries Q, as we prove below. Now, if any of the aforementioned ⌈f(nr)⌉ queries returns a positive label (line 8), then all the labels between πr and the remaining u ∈ Vr are queried and the algorithm operates as KwikCluster until the end of the recursive call; otherwise, the pivot becomes a singleton cluster which is removed from the set of nodes.
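The control flow of Algorithm 1 can be sketched in a few lines. This is our own rendering, not the authors' code: it mirrors the pseudocode but is not query-optimal (for brevity it may re-query pairs already sampled on line 6):

```python
import math
import random

def acc(nodes, sigma, f, rng=None, _r=1, _n1=None):
    """Sketch of ACC. sigma(u, v) returns the +1/-1 label (one query per
    call); f is the query rate function."""
    rng = rng or random.Random(0)
    V = list(nodes)
    _n1 = len(V) if _n1 is None else _n1        # |V1|, for the stopping rule
    if len(V) == 0:
        return []
    if len(V) == 1:
        return [V]                              # line 2: singleton cluster
    if _r > math.ceil(f(_n1 - 1)):
        return [[v] for v in V]                 # line 3: stop, rest are singletons
    pivot = V.pop(rng.randrange(len(V)))        # line 4: pivot drawn u.a.r.
    sample = rng.sample(V, min(math.ceil(f(len(V))), len(V)))   # line 6
    cluster = [pivot]
    if any(sigma(pivot, u) == +1 for u in sample):              # line 8
        cluster += [u for u in V if sigma(pivot, u) == +1]      # lines 9-10
    rest = [u for u in V if u not in cluster]
    return [cluster] + acc(rest, sigma, f, rng, _r + 1, _n1)
```

With f(n) = n the sample covers all of Vr, every positive neighbourhood of the pivot is found, and the procedure coincides with KwikCluster.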
Another important difference is that ACC deterministically stops after at most ⌈f(n)⌉ recursive calls (line 3), declaring all remaining points as singleton clusters. The intuition is that with good probability the clusters not found within ⌈f(n)⌉ rounds are small enough to be safely disregarded. Since the choice of f is delicate, we avoid trivialities by assuming f is positive and smooth enough. Formally:

Definition 1. f : N → R is a query rate function if f(1) = 1, and f(n) ≤ f(n + 1) ≤ (1 + 1/n) f(n) for all n ∈ N. This implies f(n + k)/(n + k) ≤ f(n)/n for all k ≥ 1.

We can now state formally our bounds for ACC.

Theorem 1. For any query rate function f and any labeling σ on n nodes, the expected cost E[Δ_A] of the clustering output by ACC satisfies

    E[Δ_A] ≤ 3 OPT + ((2e − 1)/(2(e − 1))) · n²/f(n) + n/e .

The number of queries made by ACC is deterministically bounded as Q ≤ n⌈f(n)⌉. In the special case f(n) = n for all n ∈ N, ACC reduces to KwikCluster and achieves E[Δ_A] ≤ 3 OPT with Q ≤ n².

Note that Theorem 1 gives an upper bound on the error achievable when using Q queries: since Q = nf(n), the expected error is at most 3 OPT + O(n³/Q). Furthermore, as one expects, if the learner is allowed to ask for all edge signs, then the exact bound of KwikCluster is recovered (note that the first formula in Theorem 1 clearly does not take into account the special case when f(n) = n, which is considered in the last part of the statement).

Proof sketch. Look at a generic round r, and consider a pair of points {u, w} ∈ Vr. The essence is that ACC can misclassify {u, w} in one of two ways. First, if σ(u, w) = −1, ACC can choose as pivot πr a node v such that σ(v, u) = σ(v, w) = +1. In this case, if the condition on line 8 holds, then ACC will cluster v together with u and w, thus mistaking {u, w}. If instead σ(u, w) = +1, then ACC could mistake {u, w} by pivoting on a node v such that σ(v, u) = +1 and σ(v, w) = −1, and clustering together only v and u. Crucially, both cases imply the existence of a bad triangle T = {u, w, v}. We charge each such mistake to exactly one bad triangle T, so that no triangle is charged twice. The expected number of mistakes can then be bounded by 3 OPT using the packing argument of [2] for KwikCluster. Second, if σ(u, w) = +1 then ACC could choose one of them, say u, as pivot πr, and assign it to a singleton cluster. This means the condition on line 8 fails. We can then bound the number of such mistakes as follows. Suppose πr has cn/f(n) positive labels towards Vr for some c ≥ 0. Loosely speaking, we show that the check of line 8 fails with probability e^{−c}, in which case cn/f(n) mistakes are added. In expectation, this gives cne^{−c}/f(n) = O(n/f(n)) mistakes. Over all f(n) ≤ n rounds, this gives an overall O(n²/f(n)). (The actual proof has to take into account that all the quantities involved here are not constants, but random variables.)

3.1 ACC with Early Stopping Strategy

We can refine our algorithm ACC so that, in some cases, it takes advantage of the structure of the input to reduce significantly the expected number of queries. To this end we see the input as a graph G with edges corresponding to positive labels (see above). Suppose then G contains a sufficiently small number O(n²/f(n)) of edges. Since ACC performs up to ⌈f(n)⌉ rounds, it could make Q = Θ(f(n)²) queries. However, with just ⌈f(n)⌉ queries one could detect that G contains O(n²/f(n)) edges, and immediately return the trivial clustering formed by all singletons.
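The detection step behind this idea is plain uniform pair sampling. A sketch of such a test, under our own naming and constants (the rule used by the actual algorithm differs in its details):

```python
import math
import random

def residual_is_sparse(nodes, sigma, f, rng=None):
    """Estimate the number of positive pairs from ceil(f(n)) uniform pair
    queries; report True when the estimate falls below n^2/f(n), in which
    case declaring all remaining nodes singletons adds only O(n^2/f(n))
    expected error."""
    rng = rng or random.Random(0)
    V = list(nodes)
    n = len(V)
    m = math.ceil(f(n))
    hits = sum(sigma(*rng.sample(V, 2)) == +1 for _ in range(m))
    estimate = hits / m * (n * (n - 1) / 2)   # scale sample frequency to all pairs
    return estimate < n * n / f(n)
```

On an empty graph (σ ≡ −1) the test always fires; on the complete graph it does not, since the estimate equals all of (n choose 2).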
The expected error would obviously be at most OPT + O(n²/f(n)), i.e. the same as in Theorem 1. More generally, at each round r with ⌈f(nr)⌉ queries one can check if the residual graph contains at least n²/f(n) edges; if the test fails, declaring all nodes in Vr as singletons gives expected additional error O(n²/f(n)). The resulting algorithm is a variant of ACC that we call ACCESS (ACC with Early Stopping Strategy). The pseudocode can be found in the supplementary material.

First, we show ACCESS gives guarantees virtually identical to ACC (only, with Q in expectation). Formally:

Theorem 2. For any query rate function f and any labeling σ on n nodes, the expected cost E[Δ_A] of the clustering output by ACCESS satisfies

    E[Δ_A] ≤ 3 OPT + 2n²/f(n) + n/e .

Moreover, the expected number of queries performed by ACCESS is E[Q] ≤ n(⌈f(n)⌉ + 4).

Theorem 2 reassures us that ACCESS is no worse than ACC. In fact, if most edges of G belong to relatively large clusters (namely, all but O(n²/f(n)) edges), then we can show ACCESS uses much fewer queries than ACC (in a nutshell, ACCESS quickly finds all large clusters and then quits). The following theorem captures the essence. For simplicity we assume OPT = 0, i.e. G is a disjoint union of cliques.

Theorem 3. Suppose OPT = 0 so G is a union of disjoint cliques. Let C1, . . . , Cℓ be the cliques of G in nondecreasing order of size. Let i′ be the smallest i such that Σ_{j=1..i} |E_{Cj}| = Ω(n²/f(n)), and let h(n) = |C_{i′}|. Then ACCESS makes in expectation E[Q] = O(n² lg(n)/h(n)) queries.

As an example, say f(n) = √n and G contains n^{1/3} cliques of n^{2/3} nodes each. Then for ACC Theorem 1 gives Q ≤ nf(n) = O(n^{3/2}), while for ACCESS Theorem 3 gives E[Q] = O(n^{4/3} lg(n)).

4 Cluster recovery

In the previous section we gave bounds on E[Δ], the expected total cost of the clustering. However, in applications such as community detection and alike, the primary objective is recovering accurately the latent clusters of the graph, the sets of nodes that are "close" to cliques. This is usually referred to as cluster recovery. For this problem, an algorithm that outputs a good approximation Ĉ of every latent cluster C is preferable to an algorithm that minimizes E[Δ] globally. In this section we show that ACC natively outputs clusters that are close to the latent clusters in the graph, thus acting as a cluster recovery tool. We also show that, for a certain type of latent clusters, one can amplify the accuracy of ACC via independent executions and recover all clusters exactly with high probability.

To capture the notion of "latent cluster", we introduce the concept of (1 − ε)-knit set. As usual, we view V, σ as a graph G = (V, E) with e ∈ E iff σ(e) = +1. Let E_C be the edges in the subgraph induced by C ⊆ V and cut(C, C̄) be the edges between C and C̄ = V \ C.

Definition 2. A subset C ⊆ V is (1 − ε)-knit if |E_C| ≥ (1 − ε)(|C| choose 2) and |cut(C, C̄)| ≤ ε(|C| choose 2).

Suppose now we have a cluster Ĉ as "estimate" of C. We quantify the distance between C and Ĉ as the cardinality of their symmetric difference, |Ĉ ⊕ C| = |Ĉ \ C| + |C \ Ĉ|.
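Definition 2 is easy to check exhaustively on small examples. A small sketch (helper names ours) that returns the smallest ε for which a set C is (1 − ε)-knit:

```python
from itertools import combinations

def knit_epsilon(C, V, positive):
    """Smallest eps for which C is (1-eps)-knit: the larger of the fraction
    of missing internal edges and the fraction of cut edges, both measured
    against |C| choose 2. `positive` holds the +1 pairs as frozensets."""
    C, V = set(C), set(V)
    pairs = len(C) * (len(C) - 1) // 2            # |C| choose 2
    internal = sum(frozenset(e) in positive for e in combinations(C, 2))
    cut = sum(frozenset((u, v)) in positive for u in C for v in V - C)
    return max(1 - internal / pairs, cut / pairs)

# A 4-clique with one internal edge removed and one outgoing edge added:
positive = {frozenset(e) for e in [(0,1), (0,2), (0,3), (1,2), (1,3), (3,4)]}
print(knit_epsilon({0, 1, 2, 3}, range(5), positive))  # both fractions are 1/6
```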
The goal is to obtain, for each (1 − ε)-knit set C in the graph, a cluster Ĉ with |Ĉ ⊕ C| = O(ε|C|) for some small ε. We prove ACC does exactly this. Clearly, we must accept that if C is too small, i.e. |C| = o(n/f(n)), then ACC will miss C entirely. But, for |C| = Ω(n/f(n)), we can prove E[|Ĉ ⊕ C|] = O(ε|C|).

We point out that the property of being (1 − ε)-knit is rather weak for an algorithm, like ACC, that is completely oblivious to the global topology of the cluster: all ACC tries to do is to blindly cluster together all the neighbors of the current pivot. In fact, consider a set C formed by two disjoint cliques of equal size. This set would be close to 1/2-knit, and yet ACC would never produce a single cluster Ĉ corresponding to C. Things can only worsen if we consider also the edges in cut(C, C̄), which can lead ACC to assign the nodes of C to several different clusters when pivoting on C̄. Hence it is not obvious that a (1 − ε)-knit set C can be efficiently recovered by ACC.

Note that this task can be seen as an adversarial cluster recovery problem. Initially, we start with a disjoint union of cliques, so that OPT = 0. Then, an adversary flips the signs of some of the edges of the graph. The goal is to retrieve every original clique that has not been perturbed excessively. Note that we put no restriction on how the adversary can flip edges; therefore, this adversarial setting subsumes constrained adversaries. For example, it subsumes the high-probability regime of the stochastic block model [17] where edges are flipped according to some distribution.

We can now state our main cluster recovery bound for ACC.

Theorem 4.
For every C ⊆ V that is (1 − ε)-knit, ACC outputs a cluster Ĉ such that

    E[|C ⊕ Ĉ|] ≤ 3ε|C| + min{ 2n/f(n), (1 − f(n)/n)|C| } + |C| e^{−|C| f(n)/5n} .

The min in the bound captures two different regimes: when f(n) is very close to n, then E[|C ⊕ Ĉ|] = O(ε|C|) independently of the size of C, but when f(n) ≪ n we need |C| = Ω(n/f(n)), i.e., |C| must be large enough to be found by ACC.

4.1 Exact cluster recovery via amplification

For certain latent clusters, one can get recovery guarantees significantly stronger than the ones given natively by ACC (see Theorem 4). We start by introducing strongly (1 − ε)-knit sets (also known as quasi-cliques). Recall that N_v is the neighbor set of v in the graph G induced by the positive labels.

Definition 3. A subset C ⊆ V is strongly (1 − ε)-knit if, for every v ∈ C, we have N_v ⊆ C and |N_v| ≥ (1 − ε)(|C| − 1).

We remark that ACC alone does not give better guarantees on strongly (1 − ε)-knit subsets than on (1 − ε)-knit subsets. Suppose for example that |N_v| = (1 − ε)(|C| − 1) for all v ∈ C. Then C is strongly (1 − ε)-knit, and yet when pivoting on any v ∈ C ACC will inevitably produce a cluster Ĉ with |Ĉ ⊕ C| ≥ ε|C|, since the pivot has edges to fewer than (1 − ε)|C| other nodes of C.

To bypass this limitation, we run ACC several times to amplify the probability that every node in C is found. Recall that V = [n]. Then, we define the id of a cluster Ĉ as the smallest node of Ĉ. The min-tagging rule is the following: when forming Ĉ, use its id to tag all of its nodes. Therefore, if u_Ĉ = min{u ∈ Ĉ} is the id of Ĉ, we will set id(v) = u_Ĉ for every v ∈ Ĉ.
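The min-tagging rule itself is a one-liner; a sketch with our own function name, where a clustering is a list of clusters over integer nodes:

```python
def min_tag(clustering):
    """Tag every node with the id (smallest node) of its cluster."""
    return {v: min(cluster) for cluster in clustering for v in cluster}

print(min_tag([[2, 5, 7], [1, 4], [3]]))  # -> {2: 2, 5: 2, 7: 2, 1: 1, 4: 1, 3: 3}
```

Over several independent runs, each node thus collects one tag per run, which is the raw material the amplification scheme aggregates.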
Consider now the following algorithm, called ACR (Amplified Cluster Recovery). First, ACR performs K independent runs of ACC on input V, using the min-tagging rule on each run. In this way, for each v ∈ V we obtain K tags id1(v), . . . , idK(v), one for each run. Thereafter, for each v ∈ V we select the tag that v has received most often, breaking ties arbitrarily. Finally, nodes with the same tag are clustered together. One can prove that, with high probability, this clustering contains all strongly (1 − ε)-knit sets. In other words, ACR with high probability recovers all such latent clusters exactly. Formally, we prove:

Theorem 5. Let ε ≤ 1/10 and fix p > 0. If ACR is run with K = 48 ln(n/p), then the following holds with probability at least 1 − p: for every strongly (1 − ε)-knit C with |C| > 10 n/f(n), the algorithm outputs a cluster Ĉ such that Ĉ = C.

It is not immediately clear that one can extend this result by relaxing the notion of strongly (1 − ε)-knit set so to allow for edges between C and the rest of the graph. We just notice that, in that case, every node v ∈ C could have a neighbor x_v ∈ V \ C that is smaller than every node of C. In this case, when pivoting on v ACC would tag v with x_v rather than with u_C, disrupting ACR.

5 A fully additive scheme

In this section, we introduce a(n inefficient) fully additive approximation algorithm achieving cost OPT + n²ε in high probability using order of n/ε² queries. When OPT = 0, Q = (n/ε) ln(1/ε) suffices. Our algorithm combines uniform sampling with empirical risk minimization and is analyzed using VC theory.

First, note that CC can be formulated as an agnostic binary classification problem with binary classifiers h_C : E → {−1, +1} associated with each clustering C of V (recall that E denotes the set of all pairs {u, v} of distinct elements u, v ∈ V), and we assume h_C(u, v) = +1 iff u and v belong to the same cluster of C. Let H_n be the set of all such h_C. The risk of a classifier h_C with respect to the uniform distribution over E is P(h_C(e) ≠ σ(e)) where e is drawn u.a.r. from E. It is easy to see that the risk of any classifier h_C is directly related to Δ_C: P(h_C(e) ≠ σ(e)) = Δ_C / (n choose 2). Hence, in particular, OPT = (n choose 2) min_{h ∈ H_n} P(h(e) ≠ σ(e)). Now, it is well known (see, e.g., [24, Theorem 6.8]) that we can minimize the risk to within an additive term of ε using the following procedure: query O(d/ε²) edges drawn u.a.r. from E, where d is the VC dimension of H_n, and find the clustering C such that h_C makes the fewest mistakes on the sample. If there is h* ∈ H_n with zero risk, then O((d/ε) ln(1/ε)) random queries suffice. A trivial upper bound on the VC dimension of H_n is log₂|H_n| = O(n ln n). The next result gives the exact value.

Theorem 6. The VC dimension of the class H_n of all partitions of n elements is n − 1.

Proof. Let d be the VC dimension of H_n. We view an instance of CC as the complete graph K_n with edges labelled by σ. Let T be any spanning tree of K_n.
For any labeling σ, we can find a clustering C of V such that h_C perfectly classifies the edges of T: simply remove the edges with label −1 in T and consider the clusters formed by the resulting connected components. Hence d ≥ n − 1, because any spanning tree has exactly n − 1 edges. On the other hand, any set of n edges must contain at least one cycle. It is easy to see that no clustering C makes h_C consistent with the labeling σ that gives positive labels to all edges in the cycle but one. Hence d < n.

An immediate consequence of the above is the following.

Theorem 7. There exists a randomized algorithm A that, for all 0 < ε < 1, finds a clustering C satisfying Δ_C ≤ OPT + O(n²ε) with high probability while using Q = O(n/ε²) queries. Moreover, if OPT = 0, then Q = O((n/ε) ln(1/ε)) queries are enough to find a clustering C satisfying Δ_C = O(n²ε).

6 Lower bounds

In this section we give two lower bounds on the expected clustering error of any (possibly randomized) algorithm. The first bound holds for OPT = 0 and applies to algorithms using a deterministically bounded number of queries. This bound is based on a construction from [8, Lemma 11] related to kernel-based learning.

Theorem 8. For any ε > 0 such that 1/ε is an even integer, and for every (possibly randomized) learning algorithm asking fewer than 1/(50ε²) queries with probability 1, there exists a labeling σ on n ≥ 16/ε nodes such that OPT = 0 and the expected cost of the algorithm is at least n²ε/8.

Our second bound relaxes the assumption on OPT. It uses essentially the same construction of [6, Lemma 6.1], giving asymptotically the same guarantees. However, the bound of [6] applies only to a very restricted class of algorithms: namely, those where the number q_v of queries involving any specific node v ∈ V is deterministically bounded. This rules out a vast class of algorithms, including KwikCluster, ACC, and ACCESS, where the number of queries involving a node is a function of the random choices of the algorithm. Our lower bound is instead fully general: it holds unconditionally for any randomized algorithm, with no restriction on which or how many pairs of points are queried.

Theorem 9. Choose any function ε = ε(n) such that Ω(1/n) ≤ ε ≤ 1/2 and 1/ε ∈ N. For every (possibly randomized) learning algorithm and any n₀ > 0 there exists a labeling σ on n ≥ n₀ nodes such that the algorithm has expected error E[Δ] ≥ OPT + n²ε/80 whenever its expected number of queries satisfies E[Q] < n/(80ε).

In fact, the bound of Theorem 9 can be put in a more general form: for any constant c ≥ 1, the expected error is at least c · OPT + A(c), where A(c) = Ω(n²ε) is an additive term with constant factors depending on c (see the proof). Thus, our algorithms ACC and ACCESS are essentially optimal in the sense that, for c = 3, they guarantee an optimal additive error up to constant factors.

7 Experiments

We experimentally verify the tradeoff between clustering cost and number of queries of ACC, using six datasets from [21, 20]. Four datasets come from real-world data and two are synthetic; all of them provide a ground-truth partitioning of some set V of nodes. Here we show results for one real-world dataset (cora, with |V| = 1879 and 191 clusters) and one synthetic dataset (skew, with |V| = 900 and 30 clusters). Results for the remaining datasets are similar and can be found in the supplementary material. Since the original datasets have OPT = 0, we derived perturbed versions where OPT > 0 as follows.
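A minimal sketch of this label-flipping perturbation, in Python. The helper name and the dict-based label map are our own illustration (the paper does not specify an implementation); the flipping probability follows the formula given below.

```python
import itertools
import random

def perturb_labels(labels, eta, seed=0):
    """Flip each pairwise label independently with probability
    p = eta * |E| / C(n, 2), where |E| is the number of +1 labels,
    so the expected number of flips is eta * |E|.

    `labels` maps each unordered pair (u, v) with u < v to +1 or -1
    and contains an entry for every pair of distinct nodes."""
    rng = random.Random(seed)
    n_pairs = len(labels)                                  # C(n, 2)
    n_pos = sum(1 for s in labels.values() if s == +1)     # |E|
    p = eta * n_pos / n_pairs
    return {e: (-s if rng.random() < p else s) for e, s in labels.items()}

# Toy instance: 4 nodes, two ground-truth clusters {0, 1} and {2, 3}.
nodes = range(4)
labels = {(u, v): (+1 if (u < 2) == (v < 2) else -1)
          for u, v in itertools.combinations(nodes, 2)}
unchanged = perturb_labels(labels, eta=0)    # p = 0: original dataset
perturbed = perturb_labels(labels, eta=0.5)
```

With eta = 0 the labels are returned unchanged, matching the remark below that p = 0 recovers the original dataset.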
First, for each η ∈ {0, 0.1, 0.5, 1} we let p = η|E|/(n choose 2), where |E| is the number of edges (positive labels) in the dataset (so η is the expected number of flipped edges measured as a multiple of |E|). Then, we flipped the label of each pair of nodes independently with probability p. Obviously, for p = 0 we obtain the original dataset.

For every dataset and its perturbed versions we then proceeded as follows. For α = 0, 0.05, ..., 0.95, 1, we set the query rate function to f(x) = x^α. Then we ran 20 independent executions of ACC and computed the average number of queries μ_Q and average clustering cost μ_Δ. The variance was often negligible, but is reported in the full plots in the supplementary material. The tradeoff between μ_Δ and μ_Q is depicted in Figure 1, where the circular marker highlights the case f(x) = x, i.e. KwikCluster.

(a) skew. (b) cora.
Figure 1: Performance of ACC. [Plots omitted: each panel shows μ_Δ versus μ_Q for η ∈ {0, 0.1, 0.5, 1}.]

The clustering cost clearly drops as the number of queries increases. This drop is particularly marked on cora, where ACC achieves a clustering cost close to that of KwikCluster using an order of magnitude fewer queries. It is also worth noting that, for the case OPT = 0, the measured clustering cost achieved by ACC is 2 to 3 times lower than the theoretical bound of ≈ 3.8 n³/Q given by Theorem 1.

Acknowledgements

The authors gratefully acknowledge partial support by the Google Focused Award "Algorithms and Learning for AI" (ALL4AI). Marco Bressan and Fabio Vitale are also supported in part by the ERC Starting Grant DMAP 680153 and by the "Dipartimenti di Eccellenza 2018-2022" grant awarded to the Department of Computer Science of the Sapienza University of Rome.
Nicol\u00f2 Cesa-Bianchi is\nalso supported by the MIUR PRIN grant Algorithms, Games, and Digital Markets (ALGADIMAR).\n\nReferences\n[1] Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models:\nIn Proc. of IEEE FOCS, pages\n\nFundamental limits and ef\ufb01cient algorithms for recovery.\n670\u2013688, 2015.\n\n[2] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information:\n\nRanking and clustering. J. ACM, 55(5):23:1\u201323:27, 2008.\n\n[3] Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate correlation clustering using\n\nsame-cluster queries. In Proc. of LATIN, pages 14\u201327, 2018.\n\n[4] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56\n\n(1-3):89\u2013113, 2004.\n\n[5] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. Journal\n\nof Computational Biology, 6(3-4):281\u2013297, 1999.\n\n[6] Francesco Bonchi, David Garc\u00eda-Soriano, and Konstantin Kutzkov. Local correlation clustering.\n\nCoRR, abs/1312.5105, 2013.\n\n[7] Nicol\u00f2 Cesa-Bianchi, Claudio Gentile, Fabio Vitale, and Giovanni Zappella. A correlation\nclustering approach to link classi\ufb01cation in signed networks. In Proc. of COLT, pages 34.1\u2013\n34.20, 2012.\n\n[8] Nicol\u00f2 Cesa-Bianchi, Yishay Mansour, and Ohad Shamir. On the complexity of learning with\n\nkernels. In Proc. of COLT, pages 297\u2013325, 2015.\n\n[9] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative\n\ninformation. Journal of Computer and System Sciences, 71(3):360\u2013383, 2005.\n\n[10] Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near\noptimal LP rounding algorithm for correlation clustering on complete and complete k-partite\ngraphs. In Proc. of ACM STOC, pages 219\u2013228, 2015.\n\n[11] Yudong Chen, Ali Jalali, Sujay Sanghavi, and Huan Xu. Clustering partially observed graphs\nvia convex optimization. 
The Journal of Machine Learning Research, 15(1):2213–2238, 2014.

[12] Yuxin Chen, Govinda Kamath, Changho Suh, and David Tse. Community recovery in graphs with locality. In Proc. of ICML, pages 689–698, 2016.

[13] Kai-Yang Chiang, Cho-Jui Hsieh, Nagarajan Natarajan, Inderjit S. Dhillon, and Ambuj Tewari. Prediction and clustering in signed networks: a local to global perspective. The Journal of Machine Learning Research, 15(1):1177–1213, 2014.

[14] William W. Cohen and Jacob Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proc. of ACM KDD, pages 475–480, 2002.

[15] Erik D. Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2-3):172–187, 2006.

[16] Lise Getoor and Ashwin Machanavajjhala. Entity resolution: theory, practice & open challenges. Proc. of the VLDB Endowment, 5(12):2018–2019, 2012.

[17] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[18] Sungwoong Kim, Sebastian Nowozin, Pushmeet Kohli, and Chang D. Yoo. Higher-order correlation clustering for image segmentation. In Proc. of NeurIPS, pages 1530–1538, 2011.

[19] Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proc. of ACM STOC, pages 694–703, 2014.

[20] Arya Mazumdar and Barna Saha. Query complexity of clustering with side information. In Proc. of NeurIPS, pages 4682–4693, 2017.

[21] Arya Mazumdar and Barna Saha. Clustering with noisy queries. In Proc. of NeurIPS, pages 5788–5799, 2017.

[22] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture. Combinatorica, 38(3):665–708, 2018.

[23] Barna Saha and Sanjay Subramanian.
Correlation clustering with same-cluster queries bounded by optimal cost. In Proc. of ESA, pages 81:1–81:17, 2019.

[24] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.

[25] Jiliang Tang, Yi Chang, Charu Aggarwal, and Huan Liu. A survey of signed network mining in social media. ACM Computing Surveys (CSUR), 49(3):42, 2016.

[26] Anthony Wirth. Correlation clustering. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning and Data Mining, pages 227–231. Springer US, 2010.