{"title": "Clustering with Noisy Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 5788, "page_last": 5799, "abstract": "In this paper, we provide a rigorous theoretical study of clustering with noisy queries. Given a set of $n$ elements, our goal is to recover the true clustering by asking minimum number of pairwise queries to an oracle. Oracle can answer queries of the form ``do elements $u$ and $v$ belong to the same cluster?''-the queries can be asked interactively (adaptive queries), or non-adaptively up-front, but its answer can be erroneous with probability $p$. In this paper, we provide the first information theoretic lower bound on the number of queries for clustering with noisy oracle in both situations. We design novel algorithms that closely match this query complexity lower bound, even when the number of clusters is unknown. Moreover, we design computationally efficient algorithms both for the adaptive and non-adaptive settings. The problem captures/generalizes multiple application scenarios. It is directly motivated by the growing body of work that use crowdsourcing for {\\em entity resolution}, a fundamental and challenging data mining task aimed to identify all records in a database referring to the same entity. Here crowd represents the noisy oracle, and the number of queries directly relates to the cost of crowdsourcing. Another application comes from the problem of sign edge prediction in social network, where social interactions can be both positive and negative, and one must identify the sign of all pair-wise interactions by querying a few pairs. Furthermore, clustering with noisy oracle is intimately connected to correlation clustering, leading to improvement therein. 
Finally, it introduces a new direction of study in the popular stochastic block model where one has an incomplete stochastic block model matrix to recover the clusters.", "full_text": "Clustering with Noisy Queries\n\nCollege of Information and Computer Sciences\n\nUniversity of Massachusetts Amherst\n\nArya Mazumdar and Barna Saha\n\nAmherst, MA 01003\n\n{arya,barna}@cs.umass.edu\n\nAbstract\n\nIn this paper, we provide a rigorous theoretical study of clustering with noisy\nqueries. Given a set of n elements, our goal is to recover the true clustering by\nasking minimum number of pairwise queries to an oracle. Oracle can answer\nqueries of the form \u201cdo elements u and v belong to the same cluster?\u201d-the queries\ncan be asked interactively (adaptive queries), or non-adaptively up-front, but its\nanswer can be erroneous with probability p. In this paper, we provide the \ufb01rst\ninformation theoretic lower bound on the number of queries for clustering with\nnoisy oracle in both situations. We design novel algorithms that closely match\nthis query complexity lower bound, even when the number of clusters is unknown.\nMoreover, we design computationally ef\ufb01cient algorithms both for the adaptive\nand non-adaptive settings. The problem captures/generalizes multiple application\nscenarios. It is directly motivated by the growing body of work that use crowd-\nsourcing for entity resolution, a fundamental and challenging data mining task\naimed to identify all records in a database referring to the same entity. Here crowd\nrepresents the noisy oracle, and the number of queries directly relates to the cost\nof crowdsourcing. Another application comes from the problem of sign edge\nprediction in social network, where social interactions can be both positive and\nnegative, and one must identify the sign of all pair-wise interactions by querying\na few pairs. 
Furthermore, clustering with noisy oracle is intimately connected\nto correlation clustering, leading to improvement therein. Finally, it introduces\na new direction of study in the popular stochastic block model where one has an\nincomplete stochastic block model matrix to recover the clusters.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1 Introduction\n\nClustering is one of the most fundamental and popular methods for data classification. In this paper\nwe initiate a rigorous theoretical study of clustering with the help of a noisy oracle, a model that\ncaptures many application scenarios and has drawn significant attention in recent years.\nSuppose we are given a set of n points that need to be clustered into k clusters, where k is unknown\nto us. Suppose there is an oracle that can answer pair-wise queries of the form, \u201cdo u and v belong to\nthe same cluster?\u201d. Repeating the same question to the oracle always returns the same answer, but the\nanswer could be wrong with probability p = 1/2 \u2212 \u03bb, \u03bb > 0 (i.e., slightly better than random answer).\nWe are interested in finding the minimum number of queries needed to recover the true clusters with high\nprobability. Understanding query complexity of the noisy oracle model is a fundamental theoretical\nquestion [25] with many existing works on sorting and selection [7, 8] where queries are erroneous\nwith probability p, and repeating the same question does not change the answer. Here we study the\nbasic clustering problem under this setting which also captures several fundamental applications.\nCrowdsourced Entity Resolution. Entity resolution (ER) is an important data mining task that\ntries to identify all records in a database that refer to the same underlying entity. Starting with the\nseminal work of Fellegi and Sunter [26], numerous algorithms with a variety of techniques have been\ndeveloped for ER [24, 28, 40, 19]. 
Still, due to ambiguity in representation and poor data quality,\naccuracy of automated ER techniques has been unsatisfactory. To remedy this, a recent trend in\nER has been to use humans in the loop. In this setting, humans are asked simple pair-wise queries\nadaptively, \u201cdo u and v represent the same entity?\u201d, and these answers are used to improve the\nfinal accuracy [30, 54, 56, 27, 52, 21, 29, 37, 55, 46]. Proliferation of crowdsourcing platforms like\nAmazon Mechanical Turk (AMT), CrowdFlower etc. allows for easy implementation. However,\ndata collected from non-expert workers on crowdsourcing platforms are inevitably noisy. A simple\nscheme to reduce errors could be to take a majority vote after asking the same question to multiple\nindependent crowd workers. However, often that is not sufficient. Our experiments on several real\ndatasets (see experimentation section for details) with answers collected from AMT [31, 52] show\nmajority voting could even increase the errors. Interestingly, such an observation has been made\nby a recent paper as well [51]. There are more complex querying models [51, 55, 53], and involved\nheuristics [31, 52] to handle errors in this scenario. Let p, 0 < p < 1/2, be the probability of error (footnote 1)\nof a query answer which might also be the aggregated answer after repeating the query several times.\nTherefore, once the answer has been aggregated, it cannot change. In all crowdsourcing works, the\ngoal is to minimize the number of queries to reduce the cost and time of crowdsourcing, and recover\nthe entities (clusters). This is exactly clustering with noisy oracle. While several heuristics have been\ndeveloped [52, 30, 53], here we provide a rigorous theory with near-optimal algorithms and hardness\nbounds.\nAnother recent work that is conceptually close is by Ashtiani et al. [4], where pair-wise queries\nare used for clustering. However, the setting is very different. 
They consider the speci\ufb01c NP-hard\nk-means objective with distance matrix which must be a metric and must satisfy a deterministic\nseparation property.\nSigned Edge Prediction. The edge sign prediction problem can be de\ufb01ned as follows. Suppose\nwe are given a social network with signs on all its edges, but the sign from node u to v, denoted\nby s(u, v) \u2208 {\u00b11} is hidden. The goal is to recover these signs as best as possible using minimal\namount of information. Social interactions or sentiments can be both positive (\u201clike\u201d, \u201ctrust\u201d) and\nnegative (\u201cdislike\u201d, \u201cdistrust\u201d). [41] provides several such examples; e.g., Wikipedia, where one can\nvote for or against the nomination of others to adminship [10], or Epinions and Slashdots where users\ncan express trust or distrust, or can declare others to be friends or foes [9, 39]. Initiated by [11, 34],\nmany techniques and related models using convex optimization, low-rank approximation and learning\ntheoretic approaches have been used for this problem [17, 12, 14]. Recently [16, 14, 48] proposed\nthe following model for edge sign prediction. We can query a pair of nodes (u, v) to test whether\ns(u, v) = +1 indicating u and v belong to the same cluster or s(u, v) = \u22121 indicating they are not.\nHowever, the query fails to return the correct answer with probability 0 < p < 1/2, and we want to\nquery the minimal possible pairs. This is exactly the case of clustering with noisy oracle. Our result\nsigni\ufb01cantly improves, and generalizes over [16, 14, 48].\nCorrelation Clustering. In fact, when all pair-wise queries are given, and the goal is to recover the\nmaximum likelihood (ML) clustering, then our problem is equivalent to noisy correlation clustering\n[6, 44]. Introduced by [6], correlation clustering is an extremely well-studied model of clustering. 
We\nare given a graph G = (V, E) with each edge e \u2208 E labelled either +1 or \u22121; the goal of correlation\nclustering is to either (a) minimize the number of disagreements, that is the number of intra-cluster\n\u22121 edges and inter-cluster +1 edges, or (b) maximize the number of agreements, that is the number\nof intra-cluster +1 edges and inter-cluster \u22121 edges. Correlation clustering is NP-hard, but can be\napproximated well with provable guarantees [6]. In a random noise model, also introduced by [6] and\nstudied further by [44], we start with a ground truth clustering, and then each edge label is flipped\nwith probability p. This is exactly the graph we observe if we make all possible pair-wise queries, and\nthe ML decoding coincides with correlation clustering. The proposed algorithm of [6] can recover in\nthis case all clusters of size \u03c9(\u221a(|V| log |V|)), and if \u201call\u201d the clusters have size \u2126(\u221a|V|), then they\ncan be recovered by [44]. Using our proposed algorithms for clustering with noisy oracle, we can\nalso recover significantly smaller sized clusters provided the number of clusters is not too large. Such a\nresult is possible to obtain using the repeated-peeling technique of [3]. However, our running time is\nsignificantly better. E.g. for k \u2264 n^(1/6), we have a running time of O(n log n), whereas for [3], it is\ndominated by the time to solve a convex optimization over n-vertex graph which is at least O(n^3).\n\nFootnote 1: An approximation of p can often be estimated manually from a small sample of crowd answers.\n\nStochastic Block Model (SBM). The clustering with faulty oracle is intimately connected with the\nplanted partition model, also known as the stochastic block model [36, 23, 22, 2, 1, 32, 18, 49]. 
The\nstochastic block model is an extremely well-studied model of random graphs where two vertices within\nthe same community share an edge with probability p\u2032, and two vertices in different communities\nshare an edge with probability q\u2032. It is often assumed that k, the number of communities, is a constant\n(e.g. k = 2 is known as the planted bisection model and is studied extensively [1, 49, 23]) or a\nslowly growing function of n (e.g. k = o(log n)). There is extensive literature on characterizing the\nthreshold phenomenon in SBM in terms of the gap between p\u2032 and q\u2032 (footnote 2) (e.g. see [2] and therein for\nmany references) for exact and approximate recovery of clusters of nearly equal size. If we allow\nfor different probability of errors for pairs of elements based on whether they belong to the same\ncluster or not, then the resultant faulty oracle model is an intriguing generalization of SBM. Consider\nthe probability of error for a query on (u, v) is 1 \u2212 p\u2032 if u and v belong to the same cluster and q\u2032\notherwise; but now, we can only learn a subset of the entries of an SBM matrix by querying adaptively.\nUnderstanding how the threshold of recovery changes for such an \u201cincomplete\u201d or \u201cspace-efficient\u201d\nSBM will be a fascinating direction to pursue. In fact, our lower bound results extend to asymmetric\nprobability values, while designing efficient algorithms and sharp thresholds are ongoing works. In\n[15], a locality model where measurements can only be obtained for nearby nodes is studied for two\nclusters with non-adaptive querying and allowing repetitions. It would also be interesting to extend\nour work with such locality constraints.\nIn a companion paper, we have studied a related problem where the queries are not noisy and certain\nsimilarity values between each pair of elements are available [47]. 
Most of the results of the two\npapers are available online in a more extensive version [45].\nContributions. Formally the clustering with a noisy oracle is defined as follows.\nProblem (Query-Cluster). Consider a set of points V \u2261 [n] containing k latent clusters Vi,\ni = 1, . . . , k, Vi \u2229 Vj = \u2205, where k and the subsets Vi \u2286 [n] are unknown. There is an oracle\nOp,q : V \u00d7 V \u2192 {\u00b11}, with two error parameters p, q : 0 < p < q < 1. The oracle takes as\ninput a pair of vertices (u, v) \u2208 V \u00d7 V, and if u, v belong to the same cluster then Op,q(u, v) = +1\nwith probability 1 \u2212 p and Op,q(u, v) = \u22121 with probability p. On the other hand, if u, v do not\nbelong to the same cluster then Op,q(u, v) = +1 with probability 1 \u2212 q and Op,q(u, v) = \u22121 with\nprobability q. Such an oracle is called a binary asymmetric channel. A special case would be when\np = 1 \u2212 q = 1/2 \u2212 \u03bb, \u03bb > 0, the binary symmetric channel, where the error rate is the same p for all\npairs. Except for the lower bound, we focus on the symmetric case in this paper. Note that the oracle\nreturns the same answer on repetition. Now, given V, find Q \u2286 V \u00d7 V such that |Q| is minimum,\nand from the oracle answers it is possible to recover Vi, i = 1, 2, ..., k with high probability (footnote 3). Note\nthat the entries of Q can be chosen adaptively based on the answers of previously chosen queries.\nOur contributions are as follows.\n\u2022 Lower Bound (Section 2). We show that \u2126(nk/\u2206(p\u2016q)) is the information theoretic lower bound\non the number of adaptive queries required to obtain the correct clustering with high probability\neven when the clusters are of similar size (see Theorem 1). Here \u2206(p\u2016q) is the Jensen-Shannon\ndivergence between Bernoulli p and q distributions. 
For the symmetric case, that is when p = 1 \u2212 q,\n\u2206(p\u20161 \u2212 p) = (1 \u2212 2p) log((1 \u2212 p)/p). In particular, if p = 1/2 \u2212 \u03bb, our lower bound on query complexity\nis \u2126(nk/\u03bb^2) = \u2126(nk/(1 \u2212 2p)^2). Developing lower bounds in the interactive setting especially with noisy\nanswers appears to be significantly challenging as popular techniques based on Fano-type inequalities\nfor multiple hypothesis testing [13, 42] do not apply, and we believe our technique will be useful in\nother noisy interactive learning settings.\n\u2022 Information-Theoretic Optimal Algorithm (Section 3 and B.1). For the symmetric error case, we\ndesign an algorithm which asks at most O(nk log n/(1 \u2212 2p)^2) queries (Theorem 2) matching the lower bound\nwithin an O(log n) factor, whenever p = 1/2 \u2212 \u03bb.\n\u2022 Computationally Efficient Algorithm (Section 3.2 and B.2). We next design an algorithm that is\ncomputationally efficient and runs in O(nk log n + k^(1+2\u03c9)) time where \u03c9 \u2264 2.373 is the exponent\nof fast matrix multiplication and asks at most O(nk log n + min(nk^2 log n, k^5 log^2 n)) queries\ntreating p as a constant (footnote 4). Note that most prior works in SBM, or works on edge sign detection, only\nconsider the case when k is a constant [2, 32, 18], even just k = 2 [49, 1, 16, 14, 48]. For small\nvalues of k, we get a highly efficient algorithm. We can use this algorithm to recover all clusters of\nsize at least min(k, \u221an) log n for correlation clustering on noisy graph, improving upon the results\nof [6, 44]. As long as k = o(\u221an), this improves upon the running time of O(n^3) in [3].\n\u2022 Nonadaptive Algorithm (Section B.3). When the queries must be done up-front, for k = 2, we\ngive a simple O(n log n) time algorithm that asks O(n log n/(1 \u2212 2p)^4) queries improving upon [48] where\na polynomial time algorithm (at least with a running time of O(n^3)) is shown with number of\nqueries O(n log n/(1/2 \u2212 p)^(log n/log log n)) and over [16, 14] where O(n poly log n) queries are required\nunder certain conditions on the clusters. Our result generalizes to k > 2, and we show interesting\nlower bounds in this setting (Appendix C in the supplementary material). Further, we derive new\nlower bounds showing trade-off between queries and threshold of recovery for incomplete SBM in\nAppendix C.\n\nFootnote 2: Most recent works consider the region of interest as p\u2032 = a log n/n and q\u2032 = b log n/n for some a > b > 0.\nFootnote 3: High probability implies with probability 1 \u2212 o_n(1), where o_n(1) \u2192 0 as n \u2192 \u221e.\nFootnote 4: For exact dependency on p see the corresponding section.\n\n2 Lower bound for the faulty-oracle model\n\nNote that we are not allowed to ask the same question multiple times to get the correct answer. In\nthis case, even for probabilistic recovery, a minimum size bound on cluster size is required. For\nexample, consider the following two different clusterings. C1 : V = \u2294_{i=1}^{k\u22122} Vi \u2294 {v1, v2} \u2294 {v3} and\nC2 : V = \u2294_{i=1}^{k\u22122} Vi \u2294 {v1} \u2294 {v2, v3}. Now if one of these two clusterings is given to us uniformly\nat random, no matter how many queries we do, we will fail to recover the correct clustering with\npositive probability. Therefore, the challenge in proving lower bounds is when clusters all have size\nmore than a minimum threshold, or when they are all nearly balanced. This removes the constraint on\nthe algorithm designer on how many times a cluster can be queried with a vertex and the algorithms\ncan have greater flexibility. Our lower bound holds for a large set of clustering instances. We define\na clustering to be balanced if either of the following two conditions holds: 1) the minimum size of a\ncluster is \u2265 n/(20k), 2) the maximum size of a cluster is \u2264 4n/k. For any balanced clustering, we prove a\nlower bound on the number of queries required.\nOur main lower bound in this section uses the Jensen-Shannon (JS) divergence. The well-known KL\ndivergence is defined between two probability mass functions f and g: D(f\u2016g) = \u2211_i f(i) log(f(i)/g(i)).\nFurther define the JS divergence as: \u2206(f\u2016g) = (1/2)(D(f\u2016g) + D(g\u2016f)). In particular, the KL and\nJS divergences between two Bernoulli random variables with parameters p and q are denoted with\nD(p\u2016q) and \u2206(p\u2016q) respectively.\nTheorem 1 (Query-Cluster Lower Bound). For any balanced clustering instance, if any (randomized)\nalgorithm does not make \u2126(nk/\u2206(p\u2016q)) expected number of queries then the recovery will be\nincorrect with probability at least 0.29 \u2212 O(1/k).\n\nNote that the lower bound is more effective when p and q are close. Moreover our actual lower bound\nis slightly tighter with the expected number of queries required given by \u2126(nk/min{D(q\u2016p), D(p\u2016q)}).\nProof of Theorem 1. We have V to be the n-element set to be clustered: V = \u2294_{i=1}^{k} Vi. To prove\nTheorem 1 we first show that, if the number of queries is small, then there exist \u2126(k) number of\nclusters, that are not being sufficiently queried with. Then we show that, since the size of the clusters\ncannot be too large or too small, there exists a decent number of vertices in these clusters.\nThe main piece of the proof of Theorem 1 is Lemma 1. We provide a sketch of this lemma here; the\nfull proof, which is inspired by a technique of lower bounding regret in multi-arm bandit problems\n(see [5, 38]), is given in Appendix A in the supplementary material.\nLemma 1. Suppose, there are k clusters. 
There exist at least 4k/5 clusters such that for each element\nv from any of these clusters, v will be assigned to a wrong cluster by any randomized algorithm with\nprobability 0.29 \u2212 10/k unless the total number of queries involving v is more than k/(10\u2206(p\u2016q)).\n\nProof-sketch of Lemma 1. Let us assume that the k clusters are already formed, and all elements\nexcept for one element v have already been assigned to a cluster. Note that queries that do not involve\nv play no role in this stage.\nNow the problem reduces to a hypothesis testing problem where the ith hypothesis Hi for i = 1, . . . , k,\ndenotes that the true cluster for v is Vi. We can also add a null-hypothesis H0 that stands for the\nvertex belonging to none of the clusters (hypothetical). Let Pi denote the joint probability distribution\nof our observations (the answers to the queries involving vertex v) when Hi is true, i = 1, . . . , k.\nThat is for any event A we have Pi(A) = Pr(A|Hi).\nSuppose T denotes the total number of queries made by a (possibly randomized) algorithm at this\nstage before assigning a cluster. Let the random variable Ti denote the number of queries involving\ncluster Vi, i = 1, . . . , k. In the second step, we need to identify a set of clusters that are not being\nqueried with enough by the algorithm.\nWe must have \u2211_{i=1}^{k} E0[Ti] = T. Let J1 \u2261 {i \u2208 {1, . . . , k} : E0[Ti] \u2264 10T/k}. That is, J1\ncontains clusters which were involved in less than 10T/k queries before assignment. Let Ei \u2261\n{the algorithm outputs cluster Vi} and J2 = {i \u2208 {1, . . . , k} : P0(Ei) \u2264 10/k}. The set of clusters\nJ = J1 \u2229 J2 has size |J| \u2265 2 \u00b7 9k/10 \u2212 k = 4k/5.\nNow let us assume that we are given an element v \u2208 Vj for some j \u2208 J to cluster (Hj is the true\nhypothesis). The probability of correct clustering is Pj(Ej). 
In the last step, we give an upper bound\non probability of correct assignment for this element.\nWe must have\n\nPj(Ej) = P0(Ej) + Pj(Ej) \u2212 P0(Ej) \u2264 10/k + |P0(Ej) \u2212 Pj(Ej)| \u2264 10/k + \u2016P0 \u2212 Pj\u2016_TV \u2264 10/k + \u221a((1/2) D(P0\u2016Pj)),\n\nwhere \u2016P0 \u2212 Pj\u2016_TV denotes the total variation distance between\ntwo distributions and in the last step we have used the relation between total variation and\ndivergence (Pinsker\u2019s inequality). Since P0 and Pj are the joint distributions of the independent\nrandom variables (answers to queries) that are identical to one of two Bernoulli random variables: Y,\nwhich is Bernoulli(p), or Z, which is Bernoulli(q), it is possible to show D(P0\u2016Pj) \u2264 (10T/k) D(q\u2016p).\nNow plugging this in,\n\nPj(Ej) \u2264 10/k + \u221a((1/2) (10T/k) D(q\u2016p)) \u2264 10/k + \u221a(1/2) = 10/k + 0.707,\n\nif T \u2264 k/(10 D(q\u2016p)). Had we bounded the total variation distance with D(Pj\u2016P0) in the Pinsker\u2019s\ninequality then we would have D(p\u2016q) in the denominator.\nNow we are ready to prove Theorem 1.\n\nProof of Theorem 1. We will show the claim by considering a balanced input. Recall that for a\nbalanced input either the maximum size of a cluster is \u2264 4n/k or the minimum size of a cluster is\n\u2265 n/(20k). We will consider the two cases separately for the proof.\nCase 1: the maximum size of a cluster is \u2264 4n/k.\nSuppose, the total number of queries is T\u2032. That means number of vertices involved in the queries is\n\u2264 2T\u2032. Note that there are k clusters and n elements. 
Let U be the set of vertices that are involved in\nless than 16T\u2032/n queries. Clearly, (n \u2212 |U|) \u00b7 16T\u2032/n \u2264 2T\u2032, or |U| \u2265 7n/8.\nNow we know from Lemma 1 that there exist 4k/5 clusters such that a vertex v from any one of these\nclusters will be assigned to a wrong cluster by any randomized algorithm with probability 1/4 unless\nthe expected number of queries involving this vertex is more than k/(10\u2206(q\u2016p)).\nWe claim that U must have an intersection with at least one of these 4k/5 clusters. If not, then more\nthan 7n/8 vertices must belong to less than k \u2212 4k/5 = k/5 clusters. Or the maximum size of a cluster will\nbe 7n \u00b7 5/(8k) > 4n/k, which is prohibited according to our assumption.\nNow each vertex in the intersection of U and the 4k/5 clusters is going to be assigned to an incorrect\ncluster with positive probability if 16T\u2032/n \u2264 k/(10\u2206(p\u2016q)). Therefore we must have T\u2032 \u2265 nk/(160\u2206(p\u2016q)).\nCase 2: the minimum size of a cluster is \u2265 n/(20k).\nLet U\u2032 be the set of clusters that are involved in at most 16T\u2032/k queries. That means, (k \u2212 |U\u2032|) \u00b7 16T\u2032/k \u2264\n2T\u2032. This implies, |U\u2032| \u2265 7k/8. Now we know from Lemma 1 that there exist 4k/5 clusters (say U\u2217) such\nthat a vertex v from any one of these clusters will be assigned to a wrong cluster by any randomized\nalgorithm with probability 1/4 unless the expected number of queries involving this vertex is more\nthan k/(10\u2206(p\u2016q)). Quite clearly |U\u2217 \u2229 U\u2032| \u2265 7k/8 + 4k/5 \u2212 k = 27k/40.\nConsider a cluster Vi such that i \u2208 U\u2217 \u2229 U\u2032, which is always possible because the intersection is\nnonempty. Vi is involved in at most 16T\u2032/k queries. Let the minimum size of any cluster be t. 
Now,\nat least half of the vertices of Vi must each be involved in at most 32T\u2032/(kt) queries. Now each of these\nvertices must be involved in at least k/(10\u2206(p\u2016q)) queries (see Lemma 1) to avoid being assigned to a\nwrong cluster with positive probability. This means 32T\u2032/(kt) \u2265 k/(10\u2206(p\u2016q)) or T\u2032 = \u2126(nk/\u2206(p\u2016q)), since\nt \u2265 n/(20k).\n\n3 Algorithms\n\nLet V = \u2294_{i=1}^{k} Vi be the true clustering and V = \u2294_{i=1}^{k} V\u0302i be the maximum likelihood (ML) estimate\nof the clustering that can be found when all (n choose 2) queries have been made to the faulty oracle. Our first\nresult obtains a query complexity upper bound within an O(log n) factor of the information theoretic\nlower bound. The algorithm runs in quasi-polynomial time, and we show this is the optimal possible\nassuming the famous planted clique hardness. Next, we show how the ideas can be extended to make\nit computationally efficient. We consider both the adaptive and non-adaptive versions. The missing\nproofs and details are provided in Appendix B in the supplementary document.\n\n3.1 Information-Theoretic Optimal Algorithm\n\nIn particular, we prove the following theorem.\nTheorem 2. There exists an algorithm with query complexity O(nk log n/(1 \u2212 2p)^2) for Query-Cluster that\nreturns the ML estimate with high probability when query answers are incorrect with probability\np < 1/2. Moreover, the algorithm returns all true clusters of V of size at least C log n/(1 \u2212 2p)^2 for a suitable\nconstant C with probability 1 \u2212 o_n(1).\nRemark 1. 
Assuming p = 1/2 \u2212 \u03bb, as \u03bb \u2192 0, \u2206(p\u20161 \u2212 p) = (1 \u2212 2p) ln((1 \u2212 p)/p) = 2\u03bb ln((1/2 + \u03bb)/(1/2 \u2212 \u03bb)) =\n2\u03bb ln(1 + 2\u03bb/(1/2 \u2212 \u03bb)) \u2264 4\u03bb^2/(1/2 \u2212 \u03bb) = O(\u03bb^2) = O((1 \u2212 2p)^2), matching the query complexity lower bound\nwithin an O(log n) factor.\n\nAlgorithm 1. The algorithm that we propose is completely deterministic and has several phases.\nPhase 1: Selecting a small subgraph. Let c = 16/(1 \u2212 2p)^2.\n\n1. Select c log n vertices arbitrarily from V. Let V\u2032 be the set of selected vertices. Create a\nsubgraph G\u2032 = (V\u2032, E\u2032) by querying for every (u, v) \u2208 V\u2032 \u00d7 V\u2032 and assigning a weight of\n\u03c9(u, v) = +1 if the query answer is \u201cyes\u201d and \u03c9(u, v) = \u22121 otherwise.\n\n2. Extract the heaviest weight subgraph S in G\u2032. If |S| \u2265 c log n, move to Phase 2.\n3. Else we have |S| < c log n. Select a new vertex u, add it to V\u2032, and query u with every\nvertex in V\u2032 \\ {u}. Move to step (2).\n\nPhase 2: Creating an Active List of Clusters. Initialize an empty list called active when Phase 2 is\nexecuted for the first time.\n\n1. Add S to the list active.\n2. Update G\u2032 by removing S from V\u2032 and every edge incident on S. For every vertex z \u2208 V\u2032,\nif \u2211_{u \u2208 S} \u03c9(z, u) > 0, include z in S and remove z from G\u2032 with all edges incident to it.\n3. Extract the heaviest weight subgraph S in G\u2032. If |S| \u2265 c log n, move to step (1). Else move\nto Phase 3.\n\nPhase 3: Growing the Active Clusters. We now have a set of clusters in active.\n\n1. 
Select an unassigned vertex v not in V\u2032 (that is previously unexplored), and for every cluster\nC \u2208 active, pick c log n distinct vertices u1, u2, ...., ul in the cluster and query v with them.\nIf the majority of these answers are \u201cyes\u201d, then include v in C.\n\n2. Else we have for every C \u2208 active the majority answer is \u201cno\u201d for v. Include v in V\u2032 and\nquery v with every node in V\u2032 \\ {v} and update E\u2032 accordingly. Extract the heaviest weight\nsubgraph S from G\u2032 and if its size is at least c log n move to Phase 2 step (1). Else move to\nPhase 3 step (1) by selecting another unexplored vertex.\n\nPhase 4: Maximum Likelihood (ML) Estimate.\n\n1. When there is no new vertex to query in Phase 3, extract the maximum likelihood clustering\nof G\u2032 and return them along with the active clusters, where the ML estimation is defined as\n(see Appendix B.1)\n\nmax_{S\u2113, \u2113 = 1, \u00b7\u00b7\u00b7 : V = \u2294_{\u2113=1} S\u2113} \u2211_{\u2113} \u2211_{i, j \u2208 S\u2113, i \u2260 j} \u03c9_{i,j}.    (1)\n\nAnalysis. The high level steps of the analysis are as follows. Suppose all (n choose 2) queries on V \u00d7 V\nhave been made. If the ML estimate of the clustering with these (n choose 2) answers is same as the true\nclustering of V, that is, \u2294_{i=1}^{k} Vi \u2261 \u2294_{i=1}^{k} V\u0302i, then the algorithm for noisy oracle finds the true clustering\nwith high probability.\nLet without loss of generality, |V\u03021| \u2265 ... \u2265 |V\u0302l| \u2265 6c log n > |V\u0302l+1| \u2265 ... \u2265 |V\u0302k|. We will show that\nPhase 1-3 recover V\u03021, V\u03022, ..., V\u0302l with probability at least 1 \u2212 1/n. The remaining clusters are recovered in\nPhase 4.\nA subcluster is a subset of nodes in some cluster. Lemma 2 shows that any set S that is included in\nactive in Phase 2 of the algorithm is a subcluster of V. 
This establishes that all clusters in active at\nany time are subclusters of some original cluster in V.\nLemma 2. Let c\u2032 = 6c = 96/(1 \u2212 2p)^2. Algorithm 1 in Phase 1 and 3 returns a subcluster of V of size at\nleast c log n with high probability if G\u2032 contains a subcluster of V of size at least c\u2032 log n. Moreover,\nit does not return any set of vertices of size at least c log n if G\u2032 does not contain a subcluster of V of\nsize at least c log n.\nLemma 2 is proven in three steps. Step 1 shows that if V\u2032 contains a subcluster of size \u2265 c\u2032 log n then\nS \u2286 Vi for some i \u2208 [1, k] will be returned with high probability when G\u2032 is processed. Step 2 shows\nthat size of S will be at least c log n, and finally step 3 shows that if there is no subcluster of size at\nleast c log n in V\u2032, then no subset of size > c log n will be returned by the algorithm when processing\nG\u2032, because otherwise that S will span more than one cluster, and the weight of a subcluster contained\nin S will be higher than S, giving a contradiction.\nFrom Lemma 2, any S added to active in Phase 2 is a subcluster with high probability, and has size at\nleast c log n. Moreover, whenever G\u2032 contains a subcluster of V of size at least c\u2032 log n, it is retrieved\nby the algorithm and added to active. The next lemma shows that each subcluster added to active is\ncorrectly grown to the true cluster: (1) every vertex added to such a cluster is correct, and (2) no two\nclusters in active can be merged. Therefore, clusters obtained from active are the true clusters.\nLemma 3. The list active contains all the true clusters of V of size \u2265 c\u2032 log n at the end of the\nalgorithm with high probability.\nFinally, once all the clusters in active are grown, we have a fully queried graph in G\u2032 containing the\nsmall clusters which can be retrieved in Phase 4. 
This completes the proof of correctness of the algorithm. With the following lemma, we get Theorem 2.\nLemma 4. The query complexity of the algorithm for faulty oracle is $O\\left(\\frac{nk \\log n}{(1-2p)^2}\\right)$.\nThe running time of this algorithm is dominated by finding the heaviest weight subgraph in G'; each of those calls can be executed in time $O\\left(\\left[\\frac{k \\log n}{(2p-1)^2}\\right]^{O\\left(\\frac{\\log n}{(2p-1)^2}\\right)}\\right)$, which is quasi-polynomial in n. We show that it is unlikely that this running time can be improved by showing a reduction from the famous planted clique problem, for which quasi-polynomial time is the best known (see Appendix B.1).\n\n3.2 Computationally Efficient Algorithm\n\nWe now prove the following theorem. We give the algorithm here, which is completely deterministic, for known k. The extension to unknown k and a detailed proof of correctness are deferred to Appendix B.2.\n\nTheorem 3. There exists a polynomial time algorithm with query complexity $O\\left(\\frac{nk^2}{(2p-1)^4}\\right)$ for Query-Cluster with error probability p, which recovers all clusters of size at least $\\Omega\\left(\\frac{k \\log n}{(2p-1)^4}\\right)$.\n\nAlgorithm 2. Let $N = \\frac{64 k^2 \\log n}{(1-2p)^4}$. We define two thresholds $T(a) = pa + \\frac{6}{1-2p}\\sqrt{N \\log n}$ and $\\theta(a) = 2p(1-p)a + 2\\sqrt{N \\log n}$. The algorithm is as follows.\nPhase 1-2C: Selecting a Small Subgraph. Initially we have an empty graph G' = (V', E'), and all vertices in V are unassigned to any cluster.\n1. Select a set X of new vertices arbitrarily from the unassigned vertices in V \\ V' and add them to V' such that the size of V' is N. If there are not enough vertices left in V \\ V', select all of them. 
Update G' = (V', E') by querying every pair (u, v) such that u \u2208 X and v \u2208 V', assigning a weight of \u03c9(u, v) = +1 if the query answer is \u201cyes\u201d and \u03c9(u, v) = \u22121 otherwise.\n2. Let N+(u) denote all the neighbors of u in G' connected by +1-weighted edges. We now cluster G'. Select every u and v such that u \u2260 v and |N+(u)|, |N+(v)| \u2265 T(|V'|). Then if |N+(u) \\ N+(v)| + |N+(v) \\ N+(u)| \u2264 \u03b8(|V'|) (the left-hand side is the size of the symmetric difference of these neighborhoods), include u and v in the same cluster. Include in active all clusters formed in this step that have size at least $\\frac{64 k \\log n}{(1-2p)^4}$. If there is no such cluster, abort. Remove all vertices in such clusters from V' and any edge incident on them from E'.\n\nPhase 3C: Growing the Active Clusters.\n1. For every unassigned vertex v \u2208 V \\ V', and for every cluster C \u2208 active, pick c log n distinct vertices u1, u2, ..., ul in the cluster and query v with them. If the majority of these answers are \u201cyes\u201d, then include v in C.\n\n2. Output all the clusters in active and move to Phase 1 step (1) to obtain the remaining clusters.\n\nThe running time of the algorithm can be shown to be $O\\left(\\frac{nk \\log n}{(1-2p)^2} + kN^{\\omega}\\right)$, where $\\omega \\leq 2.373$ is the exponent of fast matrix multiplication (footnote 5). Thus for small values of k, we get a highly efficient algorithm. The query complexity of the algorithm is $O\\left(\\frac{nk^2 \\log n}{(2p-1)^4}\\right)$, since each vertex is involved in at most $O\\left(\\frac{k^2 \\log n}{(2p-1)^4}\\right)$ queries within G' and $O\\left(\\frac{k \\log n}{(2p-1)^2}\\right)$ across the active clusters. In fact, in each iteration, the number of queries within G' is $O(N^2)$, and since there could be at most k rounds, the overall query complexity is $O\\left(\\frac{nk \\log n}{(2p-1)^2} + \\min\\left(\\frac{nk^2 \\log n}{(2p-1)^4}, kN^2\\right)\\right)$. 
Moreover, using the algorithm for unknown k verbatim, we can obtain a correlation clustering algorithm for the random noise model that recovers all clusters of size $\\Omega\\left(\\frac{\\min(k, \\sqrt{n}) \\log n}{(2p-1)^4}\\right)$, improving over [6, 44] for $k < \\frac{\\sqrt{n}}{\\log n}$, since our ML estimate on G' is correlation clustering.\n\n3.3 Non-adaptive Algorithm\n\nFinally, for non-adaptive querying, that is, when querying must be done up front, we prove the following. This shows that while for k = 2 nonadaptive algorithms are as powerful as adaptive algorithms, for k \u2265 3 a substantial advantage can be gained by allowing adaptive querying. For details, see Appendix B.3 in the supplementary material.\nTheorem 4. \u2022 For k = 2, there exists an O(n log n) time nonadaptive algorithm that recovers the clusters with high probability with query complexity $O\\left(\\frac{n \\log n}{(1-2p)^4}\\right)$. For k \u2265 3, if R is the ratio of the maximum to the minimum cluster size, then there exists a randomized nonadaptive algorithm that recovers all clusters with high probability with query complexity $O\\left(\\frac{Rnk \\log n}{(1-2p)^2}\\right)$. Moreover, there exists a computationally efficient algorithm for the same with query complexity $O\\left(\\frac{Rnk^2 \\log n}{(1-2p)^4}\\right)$.\n\u2022 For k \u2265 3, if the minimum cluster size is r, then any deterministic nonadaptive algorithm must make $\\Omega\\left(\\frac{n^2}{r}\\right)$ queries to recover the clusters exactly, even when query answers are perfect. This shows that adaptive algorithms are much more powerful than their nonadaptive counterparts.\n\nFootnote 5: Fast matrix multiplication can be avoided by slightly increasing the dependency on k.\n\n4 Experiments\n\nIn this section, we report some experimental results on real and synthetic datasets.\nReal Datasets. We use the following three real datasets where the answers are generated from Amazon Mechanical Turk.\n\u2022 landmarks consists of images of famous landmarks in Paris and Barcelona. 
Since the images are of different sides and photographed at different angles, it is difficult for humans to label them correctly. It consists of 266 nodes, 12 clusters with a total of 35245 edges, out of which 3738 are intra-cluster edges [31].\n\u2022 captcha consists of CAPTCHA images, each showing a four-digit number. It consists of 244 nodes, 69 clusters with a total of 29890 edges, out of which only 386 are intra-cluster edges [52].\n\u2022 gym contains images of gymnastics athletes, where it is very difficult to distinguish the face of the athlete, e.g. when the athlete is upside down on the uneven bars. It consists of 94 nodes, 12 clusters and 4371 edges, out of which 449 are intra-cluster edges [52].\n\nInterestingly, we make the following observations.\n\nRepeating queries vs no repetition. In the landmarks dataset, when a majority vote is taken after asking each pairwise query 10 times, we get a total of 3696 erroneous answers. However, just using the first crowd answer, the number of erroneous answers reduces to 2654. This shows that a simple strategy of repeating each query and taking a majority vote not only does not help to reduce error but can in fact amplify errors due to correlated answers by the crowd members. We observed the same phenomenon in the gym dataset, where 449 answers are incorrect when majority voting is used over five answers for each query, compared to 310 by just using the first crowd user. For captcha, the number of erroneous answers slightly decreases when using majority voting, from 241 to 201.\nSynthetic Datasets. We also did experiments on the following synthetic datasets from [27].\n\u2022 skew and sqrtn contain fictitious hospital patients data, including name, phone number, birth date and address. The errors are generated synthetically with error probability p = 0.2. Each of them has 900 nodes and 404550 edges. 
skew has 8175 intra-cluster edges, whereas sqrtn contains 13050 intra-cluster edges.\n\nFigure 1: Number of Queries vs Accuracy Trade-off\n\nNumber of Queries vs Accuracy. Figure 1 plots the number of queries vs accuracy trade-off of our computationally efficient adaptive algorithm. Among the vertices that are currently clustered, we count the number of induced edges that are classified correctly and then divide it by the total number of edges in the dataset to calculate accuracy. Given that the gap between maximum and minimum cluster size is significant in all real datasets, non-adaptive algorithms do not perform well. Moreover, if we select queries randomly, and look at the queried edges in each cluster, then even to achieve an intra-cluster minimum degree of two in every reasonably sized cluster, we waste a huge number of queries on inter-cluster edges. While we make only 389 queries in gym to get an accuracy of 90%, the total number of random queries is 1957 considering only the clusters of size at least nine. For the landmarks dataset, the number of queries is about 7400 to get an accuracy of 90%, whereas the total number of random queries is 21675 considering the clusters of size at least seven. This can be easily explained by the huge discrepancy in the number of intra- and inter-cluster edges, where random edge querying cannot perform well. Among the edges that were mislabeled by our adaptive algorithm, 70\u201390% are inter-cluster, with very few errors in intra-cluster edges; that is, the clusters returned are often supersets of the original clusters. Similarly, the querying cost is also dominated by the inter-cluster edge queries. For example, out of 4339 queries issued for skew, 3844 are for inter-cluster edges. 
By using some side information such as a similarity matrix, a significant reduction in query complexity may be possible.\n\nAcknowledgements: This work is supported in part by NSF awards CCF 1642658, CCF 1642550, CCF 1464310, CCF 1652303, a Yahoo ACE Award and a Google Faculty Research Award. The authors are thankful to an anonymous reviewer whose comments led to many improvements in the presentation. The authors would also like to thank Sanjay Subramanian for his help with the experiments.\n\nReferences\n\n[1] E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. IEEE Trans. Information Theory, 62(1):471\u2013487, 2016.\n\n[2] E. Abbe and C. Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS, pages 670\u2013688, 2015.\n\n[3] N. Ailon, Y. Chen, and H. Xu. Breaking the small cluster barrier of graph clustering. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pages 995\u20131003, 2013.\n\n[4] H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with same-cluster queries. NIPS, 2016.\n\n[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48\u201377, 2002.\n\n[6] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3):89\u2013113, 2004.\n\n[7] M. Braverman and E. Mossel. Noisy sorting without resampling. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 268\u2013276. Society for Industrial and Applied Mathematics, 2008.\n\n[8] M. Braverman and E. Mossel. Sorting from noisy information. CoRR, abs/0910.1191, 2009.\n\n[9] M. J. Brzozowski, T. Hogg, and G. Szabo. Friends and foes: ideological social networking. 
In Proceedings of the SIGCHI conference on human factors in computing systems, pages 817\u2013820. ACM, 2008.\n\n[10] M. Burke and R. Kraut. Mopping up: modeling wikipedia promotion decisions. In Proceedings of the 2008 ACM conference on Computer supported cooperative work, pages 27\u201336. ACM, 2008.\n\n[11] D. Cartwright and F. Harary. Structural balance: a generalization of Heider\u2019s theory. Psychological review, 63(5):277, 1956.\n\n[12] N. Cesa-Bianchi, C. Gentile, F. Vitale, G. Zappella, et al. A correlation clustering approach to link classification in signed networks. In COLT, pages 34\u20131, 2012.\n\n[13] K. Chaudhuri, F. C. Graham, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. In COLT, pages 35\u20131, 2012.\n\n[14] Y. Chen, A. Jalali, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. Journal of Machine Learning Research, 15(1):2213\u20132238, 2014.\n\n[15] Y. Chen, G. Kamath, C. Suh, and D. Tse. Community recovery in graphs with locality. In Proceedings of The 33rd International Conference on Machine Learning, pages 689\u2013698, 2016.\n\n[16] Y. Chen, S. Sanghavi, and H. Xu. Clustering sparse graphs. In Advances in neural information processing systems, pages 2204\u20132212, 2012.\n\n[17] K.-Y. Chiang, C.-J. Hsieh, N. Natarajan, I. S. Dhillon, and A. Tewari. Prediction and clustering in signed networks: a local to global perspective. Journal of Machine Learning Research, 15(1):1177\u20131213, 2014.\n\n[18] P. Chin, A. Rao, and V. Vu. Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. arXiv preprint arXiv:1501.05021, 2015.\n\n[19] P. Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science and Business Media, 2012.\n\n[20] T. M. Cover and J. A. Thomas. 
Elements of information theory, 2nd Ed. John Wiley & Sons, 2012.\n\n[21] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In WWW, pages 285\u2013294, 2013.\n\n[22] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborov\u00e1. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.\n\n[23] M. E. Dyer and A. M. Frieze. The solution of some random NP-hard problems in polynomial expected time. Journal of Algorithms, 10(4):451\u2013489, 1989.\n\n[24] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1\u201316, 2007.\n\n[25] U. Feige, P. Raghavan, D. Peleg, and E. Upfal. Computing with noisy information. SIAM Journal on Computing, 23(5):1001\u20131018, 1994.\n\n[26] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183\u20131210, 1969.\n\n[27] D. Firmani, B. Saha, and D. Srivastava. Online entity resolution using an oracle. PVLDB, 9(5):384\u2013395, 2016.\n\n[28] L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. PVLDB, 5(12):2018\u20132019, 2012.\n\n[29] A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In EC, pages 167\u2013176, 2011.\n\n[30] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD Conference, pages 601\u2013612, 2014.\n\n[31] A. Gruenheid, B. Nushi, T. Kraska, W. Gatterbauer, and D. Kossmann. Fault-tolerant entity resolution with the crowd. CoRR, abs/1512.00537, 2015.\n\n[32] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming: Extensions. 
IEEE Transactions on Information Theory, 62(10):5918\u20135937, 2016.\n\n[33] T. S. Han and S. Verd\u00fa. Generalizing the Fano inequality. IEEE Transactions on Information Theory, 40(4):1247\u20131251, 1994.\n\n[34] F. Harary et al. On the notion of balance of a signed graph. The Michigan Mathematical Journal, 2(2):143\u2013146, 1953.\n\n[35] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13\u201330, 1963.\n\n[36] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109\u2013137, 1983.\n\n[37] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In NIPS, pages 1953\u20131961, 2011.\n\n[38] R. Kleinberg. Lecture notes in learning, games, and electronic markets, 2007.\n\n[39] C. A. Lampe, E. Johnston, and P. Resnick. Follow the reader: filtering comments on slashdot. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 1253\u20131262. ACM, 2007.\n\n[40] M. D. Larsen and D. B. Rubin. Iterative automated record linkage using mixture models. Journal of the American Statistical Association, 96(453):32\u201341, 2001.\n\n[41] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In Proceedings of the 19th international conference on World wide web, pages 641\u2013650. ACM, 2010.\n\n[42] S. H. Lim, Y. Chen, and H. Xu. Clustering from labels and time-varying graphs. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1188\u20131196. Curran Associates, Inc., 2014.\n\n[43] K. Makarychev, Y. Makarychev, and A. Vijayaraghavan. Correlation clustering with noisy partial information. In Proceedings of The 28th Conference on Learning Theory, pages 1321\u20131342, 2015.\n\n[44] C. 
Mathieu and W. Schudy. Correlation clustering with noisy input. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 712\u2013728, 2010.\n\n[45] A. Mazumdar and B. Saha. Clustering via crowdsourcing. arXiv preprint arXiv:1604.01839, 2016.\n\n[46] A. Mazumdar and B. Saha. A Theoretical Analysis of First Heuristics of Crowdsourced Entity Resolution. The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.\n\n[47] A. Mazumdar and B. Saha. Query complexity of clustering with side information. In Advances in Neural Information Processing Systems (NIPS) 31, 2017.\n\n[48] M. Mitzenmacher and C. E. Tsourakakis. Predicting signed edges with o(n(1+\u0001)logn) queries. CoRR, abs/1609.00750, 2016.\n\n[49] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for the planted bisection model. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pages 69\u201375. ACM, 2015.\n\n[50] Y. Polyanskiy and S. Verd\u00fa. Arimoto channel coding converse and R\u00e9nyi divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1327\u20131333. IEEE, 2010.\n\n[51] D. Prelec, H. S. Seung, and J. McCoy. A solution to the single-question crowd wisdom problem. Nature, 541(7638):532\u2013535, 2017.\n\n[52] V. Verroios and H. Garcia-Molina. Entity resolution with crowd errors. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pages 219\u2013230, 2015.\n\n[53] V. Verroios, H. Garcia-Molina, and Y. Papakonstantinou. Waldo: An adaptive human interface for crowd entity resolution. In SIGMOD, pages 219\u2013230, 2017.\n\n[54] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071\u20131082, 2014.\n\n[55] R. K. Vinayak and B. Hassibi. 
Crowdsourced clustering: Querying edges vs triangles. In Advances in Neural Information Processing Systems, pages 1316\u20131324, 2016.\n\n[56] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(11):1483\u20131494, 2012.\n", "award": [], "sourceid": 2956, "authors": [{"given_name": "Arya", "family_name": "Mazumdar", "institution": "University of Massachusetts Amherst"}, {"given_name": "Barna", "family_name": "Saha", "institution": "University of Massachusetts Amherst"}]}