{"title": "Semisupervised Clustering, AND-Queries and Locally Encodable Source Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 6489, "page_last": 6499, "abstract": "Source coding is the canonical problem of data compression in information theory. In a locally encodable source coding, each compressed bit depends on only few bits of the input. In this paper, we show that a recently popular model of semisupervised clustering is equivalent to locally encodable source coding. In this model, the task is to perform multiclass labeling of unlabeled elements. At the beginning, we can ask in parallel a set of simple queries to an oracle who provides (possibly erroneous) binary answers to the queries. The queries cannot involve more than two (or a fixed constant number $\\Delta$ of) elements. Now the labeling of all the elements (or clustering) must be performed based on the (noisy) query answers. The goal is to recover all the correct labelings while minimizing the number of such queries. The equivalence to locally encodable source codes leads us to find lower bounds on the number of queries required in variety of scenarios. We are also able to show fundamental limitations of pairwise `same cluster' queries - and propose pairwise AND queries, that provably performs better in many situations.", "full_text": "Semisupervised Clustering, AND-Queries and Locally\n\nEncodable Source Coding\n\nArya Mazumdar\n\nSoumyabrata Pal\n\nCollege of Information & Computer Sciences\n\nCollege of Information & Computer Sciences\n\nUniversity of Massachusetts Amherst\n\nUniversity of Massachusetts Amherst\n\nAmherst, MA 01003\narya@cs.umass.edu\n\nAmherst, MA 01003\n\nsoumyabratap@umass.edu\n\nAbstract\n\nSource coding is the canonical problem of data compression in information theory.\nIn a locally encodable source coding, each compressed bit depends on only few bits\nof the input. 
In this paper, we show that a recently popular model of semisupervised clustering is equivalent to locally encodable source coding. In this model, the task is to perform multiclass labeling of unlabeled elements. At the beginning, we can ask in parallel a set of simple queries to an oracle who provides (possibly erroneous) binary answers to the queries. The queries cannot involve more than two (or a fixed constant number \u2206 of) elements. The labeling of all the elements (or clustering) must then be performed based on the (noisy) query answers. The goal is to recover all the correct labelings while minimizing the number of such queries. The equivalence to locally encodable source codes leads us to find lower bounds on the number of queries required in a variety of scenarios. We are also able to show fundamental limitations of pairwise \u2018same cluster\u2019 queries, and propose pairwise AND queries that provably perform better in many situations.\n\n1 Introduction\n\nSuppose we have n elements, and the ith element has a label Xi \u2208 {0, 1, . . . , k\u22121}, \u2200i \u2208 {1, . . . , n}. We consider the task of learning the labels of the elements (or learning the label vector). This can also easily be thought of as a problem of clustering the n elements into k clusters, where there is a ground-truth clustering1. There exist various approaches to this problem in general. In many cases some similarity values between pairs of elements are known (a high similarity value indicates that they are in the same cluster). Given these similarity values (or a weighted complete graph), the task is equivalent to graph clustering; when perfect similarity values are known this is equivalent to finding the connected components of a graph.\nA recent approach to clustering has been via crowdsourcing. 
Suppose there is an oracle (expert labelers, crowd workers) with whom we can make pairwise queries of the form \u201cdo elements u and v belong to the same cluster?\u201d. We will call this the \u2018same cluster\u2019 query (as per [4]). Based on the answers from the oracle, we then try to reconstruct the labeling or clustering. This idea has seen a recent surge of interest, especially in entity resolution research (see, e.g., [33, 30, 8, 20]). Since each query to crowd workers costs time and money, a natural objective is to minimize the number of queries to the oracle and still recover the clusters exactly. Carefully designed adaptive and interactive querying algorithms for clustering have also recently been developed [33, 30, 8, 22, 21].\n\n1The difference between clustering and learning labels is that in the case of clustering it is not necessary to know the value of the label for a cluster. Therefore any unsupervised labeling algorithm will be a clustering algorithm; however, the reverse is not true. In this paper we are mostly concerned with the labeling problem, hence our algorithms (upper bounds) are valid for clustering as well.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn particular, the query complexity of clustering with a k-means objective has recently been studied in [4], and there is significant work on designing optimal crowdsourcing schemes in general (see [12, 13, 28, 34, 15]). Note that a crowd worker may potentially handle more than two elements at a time; however, it is of interest to keep the number of elements involved in a query as small as possible. As an example, recent work in [31] considers triangle queries (involving three elements in a query). Also, crowd workers can compute some simple functions on this small set of inputs, instead of answering a \u2018same cluster\u2019 query. 
But again it is desirable that the answers the workers provide be simple, such as binary answers.\nThe queries to the oracle can be asked adaptively or nonadaptively. For the clustering problem, both the adaptive and the nonadaptive versions have been studied. While both versions have obvious advantages and disadvantages, for crowdsourcing applications it is helpful in most scenarios to have a parallelizable querying scheme, for a faster response rate and real-time analysis. In this paper, we concentrate on the nonadaptive version of the problem, i.e., we perform the labeling algorithm after all the query answers are obtained.\nBudgeted crowdsourcing problems can quite straightforwardly be viewed as a canonical source-coding or source-channel coding problem of information theory (e.g., see the recent paper [14]). A main contribution of our paper is to view this as a locally encodable source coding problem: a data compression problem where each compressed bit depends only on a constant number of input bits. The notion of locally encodable source coding is not well studied even within the information theory community, and the only place where it is mentioned, to the best of our knowledge, is [23], although the focus of that paper is a related notion of smooth encoding. Another related notion, local decoding, seems to be much more well studied [19, 18, 16, 26, 6, 25, 5, 32].\nBy posing the querying problem as such, we can get a number of information theoretic lower bounds on the number of queries required to recover the correct labeling. We also provide nonadaptive schemes that are near optimal. Another of our main contributions is to show that even among queries with binary answers, \u2018same cluster\u2019 queries (or XOR queries) may not be the best possible choice. A smaller number of queries suffices for approximate recovery by using what we call an AND query. 
Among our settings, we also consider the case when the oracle gives incorrect answers with some probability. A simple scheme to reduce errors in this case could be to take a majority vote after asking the same question to multiple different crowd workers. However, often that is not sufficient. Experiments on several real datasets (see [21]) with answers collected from Amazon Mechanical Turk [9, 29] show that majority voting could even increase the errors. Interestingly, such an observation has been made by a recent paper as well [27, Figure 1]. The probability of error of a query answer may also be thought of as the aggregated answer after repeating the query several times. Once the answer has been aggregated, it cannot change, and thus it reduces to the model where repeating the same question multiple times is not allowed. On the other hand, it is usually assumed that the answers to different queries are erroneous independently (see [10]). Therefore we consider the case where repetition of the same query multiple times is not allowed2, while different queries can result in erroneous answers independently.\nIn this case, the best known algorithms need O(n log n) queries to perform the clustering with two clusters [21]. We show that by employing our AND querying method, a (1 \u2212 \u03b4)-proportion of all labels in the label vector will be recovered with only O(n log(1/\u03b4)) queries.\nAlong the way, we also provide new information theoretic results on the fundamental limits of locally encodable source coding. While the related notion of locally decodable source coding [19, 16, 26, 6], as well as smooth compression [23, 26], have been studied, there was no nontrivial result known for locally encodable codes in general. 
Although the focus of this paper is primarily theoretical, we also perform a real crowdsourcing experiment to validate our algorithm.\n\n2 Problem Setup and Information Theoretic View\n\nFor n elements, consider a label vector X \u2208 {0, . . . , k \u2212 1}n, where Xi, the ith entry of X, is the label of the ith element and can take one of k possible values. Suppose P(Xi = j) = pj \u2200j, and the Xi\u2019s are independent. In other words, the prior distribution of the labels is given by the vector p \u2261 (p0, . . . , pk\u22121). For the special case of k = 2, we denote p0 \u2261 1 \u2212 p and p1 \u2261 p. While we emphasize the case of k = 2, our results extend to the case of larger k, as will be mentioned.\n\n2Independent repetition of queries is also theoretically not interesting, as by repeating any query just O(log n) times one can reduce the probability of error to near zero.\n\nA query Q : {0, . . . , k \u2212 1}\u2206 \u2192 {0, 1} is a deterministic function that takes as argument \u2206 labels, \u2206 \u226a n, and outputs a binary answer. While the query answer need not be binary, we restrict ourselves mostly to this case as it is the most practical choice.\nSuppose a total of m queries are made and the query answers are given by Y \u2208 {0, 1}m. The objective is to reconstruct the label vector X from Y such that the number of queries m is minimized.\nWe assume our recovery algorithms have knowledge of p. This prior distribution, or the relative sizes of clusters, is usually easy to estimate by subsampling a small (O(log n)) subset of elements and performing a complete clustering within that set (by, say, all pairwise queries). 
In many prior works, especially in the recovery algorithms of popular statistical models such as the stochastic block model, it is assumed that the relative sizes of the clusters are known (see [1]).\nWe also consider the setting where query answers may be erroneous with some probability. For crowdsourcing applications this is a valid assumption, since many times even expert labelers can make errors. To model this we assume each entry of Y is flipped independently with some probability q. Such an independence assumption has been used many times previously to model errors in crowdsourcing systems (see, e.g., [10]). While this may not be the perfect model, we do not allow a single query to be repeated multiple times in our algorithms (see the Introduction for a justification). For the analysis of our algorithm we just need to assume that the answers to different queries are independent. While we analyze our algorithms under these assumptions for theoretical guarantees, it turns out that even in real crowdsourcing situations our algorithmic results mostly follow the theoretical results, giving further validation to the model.\nFor the k = 2 case, and when q = 0 (perfect oracle), it is easy to see that n queries are sufficient for the task: one simply compares every element with the first element. This does not extend to the case when k > 2: for perfect recovery, and without any prior, one must make on the order of n2 queries in this case. When q > 0 (erroneous oracle), it has been shown that a total of O(\u03b3nk log n) queries is sufficient [21], where \u03b3 is the ratio of the sizes of the largest and smallest clusters.\n\nInformation theoretic view. The problem of learning a label vector X from queries is very similar to the canonical source coding (data compression) problem from information theory. 
In the source coding problem, a (possibly random) vector X is \u2018encoded\u2019 into a usually smaller length binary vector called the compressed vector3 Y \u2208 {0, 1}m. The decoding task is to again obtain X from the compressed vector Y. It is known that if X is distributed according to p, then m \u2248 nH(p) is both necessary and sufficient to recover X with high probability, where H(p) = \u2212\u2211_i pi log pi is the entropy of p.\nWe can cast our problem in this setting naturally, where the entries of Y are just answers to queries made on X. The main difference is that in source coding each Yi may potentially depend on all the entries of X, while in the case of label learning, each Yi may depend on only \u2206 of the Xi\u2019s.\nWe call this locally encodable source coding. This terminology is analogous to the recently developed literature on locally decodable source coding [19, 16]. It is called locally encodable because each compressed bit depends only on \u2206 of the source (input) bits. For locally decodable source coding, each bit of the reconstructed sequence \u02c6X depends on at most a prescribed constant number \u2206 of bits from the compressed sequence. Another closely related notion is that of \u2018smooth compression\u2019, where each source bit contributes to at most \u2206 compressed bits [23]. Indeed, in [23], the notion of locally encodable source coding is also present, where it was called robust compression. We provide new information theoretic lower bounds on the number of queries required to reconstruct X exactly and approximately for our problem.\nFor the case when there are only two labels, the \u2018same cluster\u2019 query is equivalent to a Boolean XOR operation between the labels. There are some inherent limitations to these functions that prohibit the \u2018same cluster\u2019 queries from achieving the best possible number of queries for the \u2018approximate\u2019 recovery version of the labeling problem. 
We use an old result by Massey [17] to establish this limitation. We show that, instead, using an operation like Boolean AND, a much smaller number of queries suffices to recover most of the labels correctly.\n\n3The compressed vector is not necessarily binary, nor is it necessarily of smaller length.\n\nWe also consider the case when the oracle gives faulty answers, i.e., Y is corrupted by some noise (the binary symmetric channel). This setting is analogous to the problem of joint source-channel coding. However, just like before, each encoded bit must depend on at most \u2206 bits. We show that for the approximate recovery problem, AND queries again perform substantially well. In a real crowdsourcing experiment, we have seen that if crowd workers are provided with the same set of pairs and asked both \u2018same cluster\u2019 queries and AND queries, the error rate of the AND queries is lower. The reason is that for a correct \u2018no\u2019 answer in an AND query, a worker just needs to know one of the labels in the pair. For a \u2018same cluster\u2019 query, both the labels must be known to the worker for any correct answer.\nThere are multiple reasons why one would ask for a \u2018combination\u2019 or function of multiple labels from a worker instead of just asking for a label itself (a \u2018label-query\u2019). Note that asking for labels will never let us recover the clusters in fewer than n queries, whereas, as we will see, queries that combine labels will. Also, in the case of erroneous answers, with AND queries or \u2018same cluster\u2019 queries we have the option of not repeating a query and still reducing errors. No such option is available with direct label-queries.\n\nContributions. In summary, our contributions can be listed as follows.\n1. Noiseless queries and exact recovery (Sec. 
3.1): For two clusters, we provide a querying scheme that asks \u03b1n, \u03b1 < 1, nonadaptive pairwise \u2018same cluster\u2019 queries, and recovers all the labels with high probability, for a range of prior probabilities. We also provide a new lower bound that is strictly better than nH(p) for some p.\n2. Noiseless queries and approximate recovery (Sec. 3.2): We provide a new lower bound on the number of queries required to recover a (1 \u2212 \u03b4) fraction of the labels, \u03b4 > 0. We also show that \u2018same cluster\u2019 queries are insufficient, and propose a new querying strategy based on the AND operation that performs substantially better.\n3. Noisy queries and approximate recovery (Sec. 3.3): For this part we assume the query answer to be k-ary (k \u2265 2), where k is the number of clusters. This section contains the main algorithmic result, which uses the AND queries as the main primitive. We show that, even in the presence of noise in the query answers, it is possible to recover a (1 \u2212 \u03b4) proportion of all labels correctly with only O(n log(k/\u03b4)) nonadaptive queries. We validate this theoretical result in a real crowdsourcing experiment in Sec. 4.\n\n3 Main results and Techniques\n\n3.1 Noiseless queries and exact recovery\n\nIn this scenario we assume the query answers from the oracle to be perfect, and we wish to get back all of the original labels exactly without any error. Each query is allowed to take only \u2206 labels as input. When \u2206 = 2, we are allowed to ask only pairwise queries. Let us consider the case when there are only two labels, i.e., k = 2. That means the labels Xi \u2208 {0, 1}, 1 \u2264 i \u2264 n, are i.i.d. Bernoulli(p) random variables. Therefore the number of queries m that are necessary and sufficient to recover all the labels with high probability is approximately nh(p) \u00b1 o(n), where h(x) \u2261 \u2212x log2 x \u2212 (1 \u2212 x) log2(1 \u2212 x) is the binary entropy function. 
However, the sufficiency part here does not take into account that each query can involve only \u2206 labels.\nQuerying scheme: We use the following type of queries. For each query, the labels of \u2206 elements are given to the oracle, and the oracle returns a simple XOR of the labels. Note that for \u2206 = 2, our queries are just \u2018same cluster\u2019 queries.\nTheorem 1. There exists a querying scheme with m = n(h(p) + o(1))/log2(1/\u03b1) queries of the above type, where \u03b1 = (1/2)(1 + (1 \u2212 4p(1 \u2212 p))^\u2206), such that it will be possible to recover all the labels with high probability by a Maximum Likelihood decoder.\n\nProof. Let the number of queries asked be m. Let us define Q to be the random binary query matrix of dimension m \u00d7 n, where each row has exactly \u2206 ones, the other entries being zero. Now for a label vector X we can represent the set of query outputs by Y = QX mod 2. If we use Maximum Likelihood decoding then we will not make an error as long as the query output vector is different for every X that belongs to the typical set4 of X. Let us define a \u2018bad event\u2019 for two different label vectors X1 and X2 to be the event QX1 = QX2, or Q(X1 + X2) = 0 mod 2, because in that case we will not be able to differentiate between those two sequences. Now consider a random ensemble of matrices where in each row \u2206 positions are chosen uniformly at random from the n positions to be 1. 
In this random ensemble, the probability of a \u2018bad event\u2019 for any two fixed typical label vectors X1 and X2 is\n\n( \u2211_{i=0, i even}^{\u2206} (nr(p) choose i)((n \u2212 nr(p)) choose (\u2206 \u2212 i)) / (n choose \u2206) )^m \u2264 ( (1/2)((n choose \u2206) + ((n \u2212 2nr(p)) choose \u2206)) / (n choose \u2206) )^m \u2264 ( (1/2)(1 + (1 \u2212 2r(p))^\u2206) )^m,\n\nwhere r(p) = 2p(1 \u2212 p). This is because X1 + X2 mod 2 has nr(p) = 2np(1 \u2212 p) ones with high probability, since they are typical vectors.\nNow we have to use the \u2018coding theoretic\u2019 idea of expurgation to complete the analysis. From linearity of expectation, the expected number of \u2018bad events\u2019 is\n\n(T choose 2) ( (1/2)(1 + (1 \u2212 2r(p))^\u2206) )^m,\n\nwhere T is the size of the typical set and T \u2264 2^{n(h(p)+o(1))}. If this expected number of \u2018bad events\u2019 is smaller than \u03b5T, then for every \u2018bad event\u2019 we can throw out one label vector and there will be no more bad events. This will imply perfect recovery, as long as\n\n(T choose 2) ( (1/2)(1 + (1 \u2212 2r(p))^\u2206) )^m < \u03b5T.\n\nSubstituting the upper bound for T, we have that perfect recovery is possible as long as m/n > (h(p) + o(1) \u2212 (log 2\u03b5)/n) / log(1/\u03b1). Now if we take \u03b5 to be of the form n^{\u2212\u03b2} for \u03b2 > 0, then asymptotically we will have a vanishing fraction of typical label vectors which will be expurgated, and (log \u03b5)/n \u2192 0. Therefore m = n(h(p) + o(1))/log(1/\u03b1) queries are going to recover all the labels with high probability. Hence there must exist a querying scheme with m = n(h(p) + o(1))/log(1/\u03b1) queries that will work.\n\nThe number of sufficient queries guaranteed by the above theorem is strictly less than n for all 0.0694 \u2264 p < 0.5, even for \u2206 = 2. Note that, with \u2206 = 2, by querying the first element with all others nonadaptively (a total of n \u2212 1 queries), it is possible to deduce the two clusters. In contrast, if one makes just random \u2018same cluster\u2019 queries, then O(n log n) queries are required to recover the clusters with high probability (see, e.g., [2]).\nNow we provide a lower bound on the number of queries required.\nTheorem 2. The minimum number of queries necessary to recover all labels with high probability is at least nh(p) \u00b7 max{1, max_\u03c1 (1 \u2212 \u03c1)/h((1 \u2212 \u03c1)r(p)\u2206/\u03c1)}, where r(p) \u2261 2p(1 \u2212 p).\n\nProof. Every query involves at most \u2206 elements. Therefore the average number of queries an element is part of is \u2206m/n, and so a (1 \u2212 \u03c1) fraction of all the elements (say the set S \u2282 {1, . . . , n}) are each part of fewer than \u2206m/(\u03c1n) queries. Now consider the set {1, . . . , n} \\ S. Consider all typical label vectors C \u2282 {0, 1}n such that their projection on {1, . . . , n} \\ S is a fixed typical sequence. We know that there are 2^{n(1\u2212\u03c1)h(p)} such sequences. Let X0 be one of these sequences. Now, almost all sequences of C must be within a distance of n(1 \u2212 \u03c1)r(p) + o(n) from X0. Let Y0 be the corresponding query outputs when X0 is the input. Now any query output for an input belonging to C must reside in a Hamming ball of radius (1 \u2212 \u03c1)r(p)\u2206m/\u03c1 from Y0. 
Therefore we must have m\u00b7h((1 \u2212 \u03c1)r(p)\u2206/\u03c1) \u2265 n(1 \u2212 \u03c1)h(p).\n\nThis lower bound is better than the naive nh(p) for p < 0.03475 when \u2206 = 2.\n\n4Here the typical set of labels is all label vectors where the number of ones is between np \u2212 n^{2/3} and np + n^{2/3}.\n\nFor \u2206 = 2, the plots of the corresponding upper and lower bounds are shown in Figure 1. The main takeaway from this part is that, by exploiting the prior probabilities (or relative cluster sizes), it is possible to learn the labels with strictly fewer than n queries (and close to the lower bound for p \u2265 0.3), even with pairwise \u2018same cluster\u2019 queries.\n\nFigure 1: Number of pairwise queries for noiseless queries and exact recovery\n\n3.2 Noiseless queries and approximate recovery\n\nAgain let us consider the case when k = 2, i.e., only two possible labels. Let X \u2208 {0, 1}n be the label vector. Suppose we have a querying algorithm that, by using m queries, recovers a label vector \u02c6X.\nDefinition. We call a querying algorithm (1 \u2212 \u03b4)-good if, for any label vector, at least (1 \u2212 \u03b4)n labels are correctly recovered. This means that for any label/recovered-label pair X, \u02c6X, the Hamming distance is at most \u03b4n. For an almost equivalent definition, we can define a distortion function d(X, \u02c6X) = X + \u02c6X mod 2 for any two labels X, \u02c6X \u2208 {0, 1}. We can see that E d(X, \u02c6X) = Pr(X \u2260 \u02c6X), which we want to be bounded by \u03b4.\nUsing standard rate-distortion theory [7], it can be seen that if the queries could involve an arbitrary number of elements, then with m queries it is possible to have a (1 \u2212 \u02dc\u03b4(m/n))-good scheme, where \u02dc\u03b4(\u03b3) \u2261 h^{\u22121}(h(p) \u2212 \u03b3). 
When each query is allowed to take only at most \u2206 inputs, we have the following lower bound for (1 \u2212 \u03b4)-good querying algorithms.\nTheorem 3. In any (1 \u2212 \u03b4)-good querying scheme for approximate recovery with m queries, where each query can take as input \u2206 elements, the following must be satisfied (below h\u2032(x) = dh(x)/dx):\n\n\u03b4 \u2265 \u02dc\u03b4(m/n) + (h(p) \u2212 h(\u02dc\u03b4(m/n))) / (h\u2032(\u02dc\u03b4(m/n))(1 + e^{\u2206 h\u2032(\u02dc\u03b4(m/n))})).\n\nThe proof of this theorem is quite involved, and we have included it in the appendix in the supplementary material.\nOne of the main observations that we make is that the \u2018same cluster\u2019 queries are highly inefficient. This follows from a classical result of Ancheta and Massey [17] on the limitation of linear codes as rate-distortion codes. Recall that the \u2018same cluster\u2019 queries are equivalent to an XOR operation in the binary field, which is a linear operation on GF(2). We rephrase a conjecture by Massey in our terminology.\nConjecture 1 (\u2018same cluster\u2019 query lower bound). For any (1 \u2212 \u03b4)-good scheme with m \u2018same cluster\u2019 queries (\u2206 = 2), the following must be satisfied: \u03b4 \u2265 p(1 \u2212 m/(nh(p))).\nThis conjecture is known to be true at the point p = 0.5 (equal sized clusters). We have plotted these two lower bounds in Figure 2.\n\nFigure 2: Performance of (1 \u2212 \u03b4)-good schemes with noiseless queries; p = 0.5\n\nNow let us provide a querying scheme with \u2206 = 2 that will provably be better than \u2018same cluster\u2019 schemes.\nQuerying scheme: AND queries: We define the AND query Q : {0, 1}2 \u2192 {0, 1} as Q(X, X\u2032) = X \u2227 X\u2032, where X, X\u2032 \u2208 {0, 1}, so that Q(X, X\u2032) = 1 only when both the elements have labels equal to 1. 
For each pairwise query, the oracle will return this AND of the labels.\nTheorem 4. There exists a (1 \u2212 \u03b4)-good querying scheme with m pairwise AND queries such that\n\n\u03b4 = p e^{\u22122m/n} + \u2211_{d=1}^{n} (e^{\u22122m/n} (2m/n)^d / d!) \u2211_{k=1}^{d} (n choose k) (f(k, d)/n^d) (1 \u2212 p)^k p,\n\nwhere f(k, d) = \u2211_{i=0}^{k} (\u22121)^i (k choose i) (k \u2212 i)^d.\n\nProof. Assume p < 0.5 without loss of generality. Consider a random bipartite graph where each \u2018left\u2019 node represents an element labeled according to the label vector X \u2208 {0, 1}n and each \u2018right\u2019 node represents a query. All the query answers are collected in Y \u2208 {0, 1}m. The graph has right-degree exactly equal to 2. For each query the two inputs are selected uniformly at random without replacement.\nRecovery algorithm: For each element we look at the queries that involve it and estimate its label as 1 if any of the query answers is 1, and predict 0 otherwise. If there are no queries that involve the element, we simply output 0 as the label.\nSince the average left-degree is 2m/n, and since all the edges from the right nodes are thrown randomly and independently, we can model the degree of each left-vertex by a Poisson distribution with mean \u03bb = 2m/n. We define element j to be a two-hop-neighbor of i if there is at least one query which involved both the elements i and j. Under our decoding scheme we only have an error when the label of i, Xi, is 1 and the labels of all its two-hop-neighbors are 0. Hence the probability of error for estimating Xi can be written as Pr(Xi \u2260 \u02c6Xi) = \u2211_d Pr(degree(i) = d) Pr(Xi \u2260 \u02c6Xi | degree(i) = d). Now let us estimate Pr(Xi \u2260 \u02c6Xi | degree(i) = d). We further condition on the event that there are k distinct two-hop-neighbors (let us call the number of distinct two-hop-neighbors of i Dist(i)), and hence we have Pr(Xi \u2260 \u02c6Xi | degree(i) = d) = \u2211_{k=1}^{d} Pr(Dist(i) = k) Pr(Xi \u2260 \u02c6Xi | degree(i) = d, Dist(i) = k) = \u2211_{k=1}^{d} (n choose k) (f(k, d)/n^d) p(1 \u2212 p)^k. Now using the Poisson assumption we get the statement of the theorem.\n\nThe performance of this querying scheme is plotted against the number of queries for prior probability p = 0.5 in Figure 2.\nComparison with \u2018same cluster\u2019 queries: We see in Figure 2 that the AND query scheme beats the \u2018same cluster\u2019 query lower bound for a range of query-performance trade-offs in approximate recovery for p = 1/2. For smaller p, this range of values of \u03b4 increases further. If we randomly choose \u2018same cluster\u2019 queries and then resort to maximum likelihood decoding (note that for AND queries we present a simple decoder), then O(n log n) queries are still required even if we allow for a \u03b4 proportion of incorrect labels (follows from [11]). The best performance for \u2018same cluster\u2019 queries in approximate recovery that we know of for small \u03b4 is given by m = n(1 \u2212 \u03b4): neglect n\u03b4 elements and just query the n(1 \u2212 \u03b4) remaining elements with the first element. However, such a scheme can be achieved by AND queries as well in a similar manner. Therefore, there is no point in the query vs. \u03b4 plot that we know of where \u2018same cluster\u2019 query achievability outperforms AND query achievability.\n\n3.3 Noisy queries and approximate recovery\n\nThis section contains our main algorithmic contribution. In contrast to the previous sections, here we consider the general case of k \u2265 2 clusters. Recall that the label vector X \u2208 {0, 1, . . .
, k \u2212 1}n, and the prior probability of each label is given by the probabilities p = (p0, . . . , pk\u22121). Instead of binary output queries, in this part we consider an oracle that can provide one of k different answers. We consider a model of noise in the query answer where the oracle provides the correct answer with probability 1 \u2212 q, and any one of the remaining incorrect answers with probability q/(k \u2212 1). Note that we do not allow the same query to be asked to the oracle multiple times (see Sec. 2 for a justification). We also define a (1 \u2212 \u03b4)-good approximation scheme exactly as before.\nQuerying Scheme: We only perform pairwise queries. For a pair of labels X and X\u2032 we define a query Y = Q(X, X\u2032) \u2208 {0, 1, . . . , k \u2212 1}. For our algorithm we define Q as\n\nQ(X, X\u2032) = i if X = X\u2032 = i, and 0 otherwise.\n\nWe can observe that for k = 2 this query is exactly the same as the binary AND query that we defined in the previous section. In our querying scheme, we make a total of nd/2 queries, for an integer d > 1. We design a d-regular graph G(V, E), where V = {1, . . . , n} is the set of elements that we need to label. We query all the pairs of elements (u, v) \u2208 E.\nUnder this querying scheme, we propose to use Algorithm 1 for the reconstruction of labels.\nTheorem 5. The querying scheme with m = O(n log(k/\u03b4)) queries and Algorithm 1 is (1 \u2212 \u03b4)-good for approximate recovery of labels from noisy queries.\n\nAlgorithm 1 Noisy query approximate recovery with nd/2 queries\nRequire: PRIOR p \u2261 (p0, . . . , pk\u22121)\nRequire: Query Answers Yu,v : (u, v) \u2208 E\nfor i \u2208 [1, 2, . . . , k \u2212 1] do\n  Ci = dq/(k \u2212 1) + d pi (1 \u2212 qk/(k \u2212 1))\nend for\nfor u \u2208 V do\n  for i \u2208 [1, 2, . . . , k \u2212 1] do\n    Nu,i = \u2211_{v=1}^{d} 1{Yu,v = i}\n    if Nu,i \u2265 \u2308Ci\u2309 then\n      Xu \u2190 i\n      Assigned \u2190 True\n      break\n    end if\n  end for\n  if \u00ac Assigned then\n    Xu \u2190 0\n  end if\nend for\n\nWe can come up with a more exact relation between the number of queries m = nd/2, \u03b4, p, q and k. This is deferred to the appendix in the supplementary material.\n\nProof of Theorem 5. The total number of queries is m = nd/2. Now for a particular element u \u2208 V, we look at the values of the d noisy oracle answers {Yu,v}, v = 1, . . . , d. We have E(Nu,i) = dq/(k \u2212 1) + d pi (1 \u2212 qk/(k \u2212 1)) when the true label of u is i \u2260 0. When the true label is something else, E(Nu,i) = dq/(k \u2212 1). There is an obvious gap between these expectations. Clearly, when the true label is i, the probability of error in the assignment of the label of u is given by Pi \u2264 \u2211_{j: j\u2260i, j\u22600} Pr(Nu,j > Cj) + Pr(Nu,i < Ci) \u2264 cke^{\u22122d\u03b5^2} for some constants c and \u03b5 depending on the gap, from the Chernoff bound. And when the true label is 0, the probability of error is P0 \u2264 \u2211_{j: j\u22600} Pr(Nu,j > Cj) \u2264 c\u2032ke^{\u22122d\u03b5\u2032^2}, for some constants c\u2032, \u03b5\u2032. Letting \u03b4 = \u2211_i pi Pi, we can easily observe that d scales as O(log(k/\u03b4)). Hence the total number of queries is nd/2 = O(n log(k/\u03b4)).\nThe only thing that remains to be proved is that the number of incorrect labels is \u03b4n with high probability. Let Zu be the indicator that element u has been incorrectly labeled. Then EZu = \u03b4. The total number of incorrectly labeled elements is Z = \u2211_u Zu. We have EZ = n\u03b4. Now define Zu \u223c Zv if Zu and Zv are dependent. Now \u2206* \u2261 \u2211_{Zu\u223cZv} Pr(Zu|Zv) \u2264 d^2 + d, because the maximum number of nodes dependent with Zu are its 1-hop and 2-hop neighbors. 
Now using Corollary 4.3.5 in [3], it follows that Z ∼ EZ = nδ almost always.

The theoretical performance guarantee of Algorithm 1 (a detailed version of Theorem 5 is in the supplementary material) for k = 2 is shown in Figures 3 and 4. We can observe from Figure 3 that, for a particular q, the fraction of incorrect labels decreases as p becomes higher. We can also observe from Figure 4 that if q = 0.5 then the fraction of incorrect labels is 50%, because all information from the oracle is lost; for other values of q, the fraction of incorrect labels decreases with increasing d.

We point out that 'same cluster' queries are not a good choice here: because of the symmetric nature of the XOR answer, there is no gap between the expected counts (in contrast to the proof of Theorem 5), and it is exactly this gap that our algorithm exploits.

Lastly, we show that Algorithm 1 can work without knowing the prior distribution, given only the relative sizes of the clusters. The ground-truth clusters can be adversarial as long as they maintain these relative sizes.

Theorem 6. Suppose we have n_i, the number of elements with label i, i = 0, 1, . . . , k − 1, as input instead of the priors. By taking a random permutation over the nodes while constructing the d-regular graph, Algorithm 1 will be a (1 − δ)-good approximation scheme with m = O(n log(k/δ)) queries as n → ∞ when we set p_i = n_i/n.

The proof of this theorem is deferred to the appendix in the supplementary material.

Figure 3: Recovery error for a fixed p, d = 100 and varying q. Figure 4: Recovery error for a fixed p, q and varying d. Figure 5: Algorithm 1 on real crowdsourced dataset.

4 Experiments

Though our main contribution is theoretical, we have validated our algorithm on a real dataset created by local crowdsourcing.
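For concreteness, the decoding rule of Algorithm 1 used in these experiments can be sketched in a few lines of Python. This is a minimal sketch: the function and variable names are ours, and `answers[u]` is assumed to collect the d noisy oracle answers involving element u.

```python
import math

def decode_labels(k, d, q, p, answers):
    """Threshold decoder of Algorithm 1 (sketch).

    answers[u] is the list of d noisy oracle answers Y_{u,v}
    for the d queries that element u participates in.
    """
    # Threshold C_i is placed between the two expected counts of
    # answer i from the proof of Theorem 5: dq/(k-1) + d*p_i*(1 - qk/(k-1))
    # when the true label is i, versus dq/(k-1) otherwise.
    C = [0.0] * k
    for i in range(1, k):
        C[i] = d * q / (k - 1) + (d * p[i] / 2) * (1 - q * k / (k - 1))
    labels = []
    for ans in answers:
        label = 0                                # default: label 0
        for i in range(1, k):
            if sum(1 for y in ans if y == i) >= math.ceil(C[i]):
                label = i                        # first i clearing C_i
                break
        labels.append(label)
    return labels
```

For k = 2 this reduces to thresholding the number of positive AND answers per element, which is how the movie labels are predicted in the experiment.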
We first picked a list of 100 'action' movies and 100 'romantic' movies from IMDB (http://www.imdb.com/list/ls076503982/ and http://www.imdb.com/list/ls058479560/). We then created the queries as in the querying scheme of Sec. 3.3 by creating a d-regular graph (where d is even). To create the graph we placed all the movies on a circle according to a random permutation; then for each node we connected d/2 edges on either side to its closest neighbors in the permuted circular list. This random permutation allows us to use the relative sizes of the clusters as priors, as explained in Sec. 3.3. Using d = 10, we have nd/2 = 1000 queries, with each query being the following question: Are both the movies 'action' movies?. We then divided these 1000 queries into 10 surveys (using the SurveyMonkey platform), each survey carrying 100 queries for the user to answer. We used 10 volunteers to fill out the surveys. We instructed them not to check any resources and to answer the questions spontaneously, with a time limit of 10 minutes per survey; the average finish time was 6 minutes. The answers fit the noisy query model, since some of them were wrong: in total, we found 105 erroneous answers among the 1000 queries. For each movie we evaluate the d query answers it is part of, and use different thresholds T for prediction: if there are more than T 'yes' answers among those d answers, we classify the movie as an 'action' movie, and as a 'romantic' movie otherwise. The theoretical threshold for predicting an 'action' movie is T = 2 for oracle error probability q = 0.105, p = 0.5 and d = 10.
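The circular construction of the query graph just described can be sketched as follows (a sketch under our naming; `seed` fixes the random permutation):

```python
import random

def build_query_graph(n, d, seed=0):
    """Build the nd/2 query pairs: place the n elements on a circle in a
    random permutation and join each node to its d/2 nearest neighbors
    on either side, so that the resulting graph is d-regular."""
    assert d % 2 == 0 and 0 < d < n
    perm = list(range(n))
    random.Random(seed).shuffle(perm)            # random circular order
    edges = set()
    for pos in range(n):
        for step in range(1, d // 2 + 1):
            u, v = perm[pos], perm[(pos + step) % n]
            edges.add((min(u, v), max(u, v)))
    return sorted(edges)
```

With n = 200 movies and d = 10 this yields exactly the 1000 query pairs used in the surveys, each movie appearing in exactly 10 of them.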
But we compared other thresholds as well. We then used Algorithm 1 to predict the true label vector from a subset of the queries, taking d̃ edges for each node, where d̃ ≤ d and d̃ is even, i.e., d̃ ∈ {2, 4, 6, 8, 10}. Obviously, for d̃ = 2 the thresholds T = 3, 4 are meaningless, as we then always estimate the movie as 'romantic'; hence the distortion starts from 0.5 in that case. We plotted the error in each case against the number of queries (nd̃/2), together with the theoretical distortion obtained from our results for k = 2 labels, p = 0.5 and q = 0.105. All these results are compiled in Figure 5: the distortion decreases with the number of queries, and the gap between the theoretical and experimental results is small for T = 2. These results validate our theoretical results and our algorithm to a large extent.

We also asked 'same cluster' queries on the same set of 1000 pairs to the participants, and found the number of erroneous responses to be 234 (whereas with AND queries it was 105). This substantiates the claim that AND queries are easier for workers to answer. Since this error rate is quite high, we further compared the performance of 'same cluster' queries with AND queries and our algorithm on a synthetically generated dataset with two hundred elements (Figure 6). For recovery with 'same cluster' queries, we used the popular spectral clustering algorithm with normalized cuts [24].
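A synthetic experiment of this kind can be set up end to end as below. This is a sketch with our own names and parameters mirroring the real experiment (n = 200, d = 10, q = 0.105); the decoder is the k = 2 threshold rule, with the threshold placed midway between the two expected answer counts from the proof of Theorem 5.

```python
import math
import random

def and_query(x, y):
    """Pairwise AND query for k = 2: answer 1 iff both labels are 1."""
    return 1 if x == 1 and y == 1 else 0

def simulate(n=200, d=10, p=0.5, q=0.105, seed=1):
    """Fraction of mislabeled elements when AND queries on a circulant
    d-regular graph pass through a q-noisy oracle and labels are
    recovered by thresholding the count of positive answers."""
    rng = random.Random(seed)
    truth = [1 if rng.random() < p else 0 for _ in range(n)]
    answers = [[] for _ in range(n)]
    for pos in range(n):                         # circulant query graph
        for step in range(1, d // 2 + 1):
            u, v = pos, (pos + step) % n
            y = and_query(truth[u], truth[v])
            if rng.random() < q:                 # oracle flips answer
                y = 1 - y
            answers[u].append(y)
            answers[v].append(y)
    C1 = d * q + (d * p / 2) * (1 - 2 * q)       # threshold for label 1
    est = [1 if sum(a) >= math.ceil(C1) else 0 for a in answers]
    return sum(t != e for t, e in zip(truth, est)) / n
```

Running `simulate()` returns an error fraction well below 0.5 at these parameters, consistent with the separation-of-expectations argument in the proof of Theorem 5.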
The detailed results obtained can be found in Figure 7 in the supplementary material.

Figure 6: Comparison of 'same cluster' queries with AND queries when both achieve 80% accuracy.

Acknowledgements: This research is supported in part by NSF Awards CCF-BSF 1618512, CCF 1642550 and an NSF CAREER Award CCF 1642658. The authors thank Barna Saha for many discussions on the topics of this paper. The authors also thank the volunteers who participated in the crowdsourcing experiments for this paper.

References

[1] E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. IEEE Trans. Information Theory, 62(1):471–487, 2016.

[2] K. Ahn, K. Lee, and C. Suh. Community recovery in hypergraphs. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 657–663. IEEE, 2016.

[3] N. Alon and J. H. Spencer. The probabilistic method. John Wiley & Sons, 2004.

[4] H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with same-cluster queries. In Advances in Neural Information Processing Systems, pages 3216–3224, 2016.

[5] H. Buhrman, P. B. Miltersen, J. Radhakrishnan, and S. Venkatesh. Are bitvectors optimal? SIAM Journal on Computing, 31(6):1723–1744, 2002.

[6] V. B. Chandar. Sparse graph codes for compression, sensing, and secrecy. PhD thesis, Massachusetts Institute of Technology, 2010.

[7] T. M. Cover and J. A. Thomas. Elements of information theory, 2nd Ed. John Wiley & Sons, 2012.

[8] D. Firmani, B. Saha, and D. Srivastava. Online entity resolution using an oracle. PVLDB, 9(5):384–395, 2016.

[9] A. Gruenheid, B. Nushi, T. Kraska, W. Gatterbauer, and D. Kossmann. Fault-tolerant entity resolution with the crowd. CoRR, abs/1512.00537, 2015.

[10] A. Gruenheid, B. Nushi, T. Kraska, W. Gatterbauer, and D. Kossmann. Fault-tolerant entity resolution with the crowd.
arXiv preprint arXiv:1512.00537, 2015.

[11] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming: Extensions. IEEE Transactions on Information Theory, 62(10):5918–5937, 2016.

[12] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pages 1953–1961, 2011.

[13] D. R. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.

[14] F. Lahouti and B. Hassibi. Fundamental limits of budget-fidelity trade-off in label crowdsourcing. In Advances in Neural Information Processing Systems, pages 5059–5067, 2016.

[15] Q. Liu, J. Peng, and A. T. Ihler. Variational inference for crowdsourcing. In Advances in Neural Information Processing Systems, pages 692–700, 2012.

[16] A. Makhdoumi, S.-L. Huang, M. Médard, and Y. Polyanskiy. On locally decodable source coding. In Communications (ICC), 2015 IEEE International Conference on, pages 4394–4399. IEEE, 2015.

[17] J. L. Massey. Joint source and channel coding. Technical report, DTIC Document, 1977.

[18] A. Mazumdar, V. Chandar, and G. W. Wornell. Update-efficiency and local repairability limits for capacity approaching codes. IEEE Journal on Selected Areas in Communications, 32(5):976–988, 2014.

[19] A. Mazumdar, V. Chandar, and G. W. Wornell. Local recovery in data compression for general sources. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 2984–2988. IEEE, 2015.

[20] A. Mazumdar and B. Saha. A theoretical analysis of first heuristics of crowdsourced entity resolution. The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.

[21] A. Mazumdar and B. Saha. Clustering with noisy queries.
In Advances in Neural Information Processing Systems (NIPS) 31, 2017.

[22] A. Mazumdar and B. Saha. Query complexity of clustering with side information. In Advances in Neural Information Processing Systems (NIPS) 31, 2017.

[23] A. Montanari and E. Mossel. Smooth compression, Gallager bound and nonlinear sparse-graph codes. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 2474–2478. IEEE, 2008.

[24] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.

[25] A. Pananjady and T. A. Courtade. Compressing sparse sequences under local decodability constraints. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 2979–2983. IEEE, 2015.

[26] M. Patrascu. Succincter. In Foundations of Computer Science, 2008. FOCS'08. IEEE 49th Annual IEEE Symposium on, pages 305–313. IEEE, 2008.

[27] D. Prelec, H. S. Seung, and J. McCoy. A solution to the single-question crowd wisdom problem. Nature, 541(7638):532–535, 2017.

[28] A. Vempaty, L. R. Varshney, and P. K. Varshney. Reliable crowdsourcing for multi-class labeling using coding theory. IEEE Journal of Selected Topics in Signal Processing, 8(4):667–679, 2014.

[29] V. Verroios and H. Garcia-Molina. Entity resolution with crowd errors. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pages 219–230, 2015.

[30] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071–1082, 2014.

[31] R. K. Vinayak and B. Hassibi. Crowdsourced clustering: Querying edges vs triangles. In Advances in Neural Information Processing Systems, pages 1316–1324, 2016.

[32] E. Viola. Bit-probe lower bounds for succinct data structures.
SIAM Journal on Computing, 41(6):1593–1604, 2012.

[33] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.

[34] D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems, pages 2195–2203, 2012.