{"title": "Iterative Learning for Reliable Crowdsourcing Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 1953, "page_last": 1961, "abstract": "Crowdsourcing systems, in which tasks are electronically distributed to numerous ``information piece-workers'', have emerged as an effective paradigm for human-powered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such rowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give new algorithms for deciding which tasks to assign to which workers and for inferring correct answers from the workers\u2019 answers. We show that our algorithm significantly outperforms majority voting and, in fact, are asymptotically optimal through comparison to an oracle that knows the reliability of every worker.", "full_text": "Iterative Learning for Reliable Crowdsourcing\n\nSystems\n\nDavid R. Karger\nDepartment of Electrical Engineering and Computer Science\n\nSewoong Oh\n\nDevavrat Shah\n\nMassachusetts Institute of Technology\n\nAbstract\n\nCrowdsourcing systems, in which tasks are electronically distributed to numerous\n\u201cinformation piece-workers\u201d, have emerged as an effective paradigm for human-\npowered solving of large scale problems in domains such as image classi\ufb01cation,\ndata entry, optical character recognition, recommendation, and proofreading. 
Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers\u2019 answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker.\n\n1 Introduction\n\nBackground. Crowdsourcing systems have emerged as an effective paradigm for human-powered problem solving and are now in widespread use for large-scale data-processing tasks such as image classification, video annotation, form data entry, optical character recognition, translation, recommendation, and proofreading. Crowdsourcing systems such as Amazon Mechanical Turk provide a market where a \u201ctaskmaster\u201d can submit batches of small tasks to be completed for a small fee by any worker choosing to pick them up. For example, a worker may be able to earn a few cents by indicating which images from a set of 30 are suitable for children (one of the benefits of crowdsourcing is its applicability to such highly subjective questions).\n\nSince typical crowdsourced tasks are tedious and the reward is small, errors are common even among workers who make an effort. At the extreme, some workers are \u201cspammers\u201d, submitting arbitrary answers independent of the question in order to collect their fee. Thus, all crowdsourcers need strategies to ensure the reliability of answers. 
Because the worker crowd is large, anonymous, and transient, it is generally difficult to build up a trust relationship with particular workers.1 It is also difficult to condition payment on correct answers, as the correct answer may never truly be known, and delaying payment can annoy workers and make it harder to recruit them for future tasks. Instead, most crowdsourcers resort to redundancy, giving each task to multiple workers, paying them all irrespective of their answers, and aggregating the results by some method such as majority voting. For such systems there is a natural core optimization problem to be solved. Assuming the taskmaster wishes to achieve a certain reliability in their answers, how can they do so at minimum cost (which is equivalent to asking how they can do so while asking the fewest possible questions)?\n\nSeveral characteristics of crowdsourcing systems make this problem interesting. Workers are neither persistent nor identifiable; each batch of tasks will be solved by a worker who may be completely new and whom you may never see again. Thus one cannot identify and reuse particularly reliable workers. Nonetheless, by comparing one worker\u2019s answer to others\u2019 on the same question, it is possible to draw conclusions about a worker\u2019s reliability, which can be used to weight their answers to other questions in their batch. However, batches must be of manageable size, obeying limits on the number of tasks that can be given to a single worker.\n\ne-mail: {karger,swoh,devavrat}@mit.edu. This work was supported in part by the AFOSR complex networks project, the MURI on network tomography, and the National Science Foundation.\n\n1 For certain high-value tasks, crowdsourcers can use entrance exams to \u201cprequalify\u201d workers and block spammers, but this increases the cost and still provides no guarantee that the prequalified workers will try hard.\n\n
Another interesting aspect of this problem is the choice of task assignments. Unlike many inference problems which make inferences based on a fixed set of signals, our algorithm can choose which signals to measure by deciding which questions to ask which workers.\n\nIn the following, we first define a formal model that captures these aspects of the problem. We will then describe a scheme for deciding which tasks to assign to which workers and introduce a novel iterative algorithm to infer the correct answers from the workers\u2019 responses.\n\nSetup. We model a set of m tasks {ti}i\u2208[m] as each being associated with an unobserved \u2018correct\u2019 answer si \u2208 {\u00b11}. Here and after, we use [N] to denote the set of the first N integers. In the earlier image categorization example, each task corresponds to labeling an image as suitable for children (+1) or not (\u22121). We assign these tasks to n workers from the crowd, which we denote by {wj}j\u2208[n]. When a task is assigned to a worker, we get a possibly inaccurate answer from the worker. We use Aij \u2208 {\u00b11} to denote the answer if task ti is assigned to worker wj. Some workers are more diligent or have more expertise than others, while other workers might be spammers. We choose a simple model to capture this diversity in workers\u2019 reliability: we assume that each worker wj is characterized by a reliability pj \u2208 [0, 1], and that they make errors randomly on each question they answer. Precisely, if task ti is assigned to worker wj, then\n\nAij = si with probability pj, and Aij = \u2212si with probability 1 \u2212 pj,\n\nand Aij = 0 if ti is not assigned to wj. The random variable Aij is independent of any other event given pj. (Throughout this paper, we use boldface characters to denote random variables and random matrices unless it is clear from the context.) 
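As a concrete illustration of this setup, the answer model for Aij can be simulated in a few lines. This is our own sketch, not code from the paper; the function and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_answers(s, p, edges, rng):
    """Sample A_ij for each assignment (i, j): worker j answers s_i with
    probability p_j and -s_i otherwise, independently given p_j."""
    A = {}
    for i, j in edges:
        correct = rng.random() < p[j]
        A[(i, j)] = s[i] if correct else -s[i]
    return A

# Tiny example: 3 tasks, a perfect worker (a 'hammer') and a random one (a 'spammer').
s = np.array([+1, -1, +1])      # unobserved correct answers
p = np.array([1.0, 0.5])        # worker reliabilities
edges = [(i, j) for i in range(3) for j in range(2)]
A = sample_answers(s, p, edges, rng)
```

Under this model the hammer (reliability 1) always matches s, while the spammer's answers carry no information about it.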
The underlying assumption here is that the error probability of a worker does not depend on the particular task and that all the tasks share an equal level of difficulty. Hence, each worker\u2019s performance is consistent across different tasks.\n\nWe further assume that the reliabilities of the workers {pj}j\u2208[n] are independent and identically distributed random variables with a given distribution on [0, 1]. One example is the spammer-hammer model, where each worker is either a \u2018hammer\u2019 with probability q or a \u2018spammer\u2019 with probability 1 \u2212 q. A hammer answers all questions correctly, in which case pj = 1, and a spammer gives random answers, in which case pj = 1/2. Given this random variable pj, we define an important parameter q \u2208 [0, 1], which captures the \u2018average quality\u2019 of the crowd: q \u2261 E[(2pj \u2212 1)^2]. A value of q close to one indicates that a large proportion of the workers are diligent, whereas q close to zero indicates that there are many spammers in the crowd. The definition of q is consistent with the use of q in the spammer-hammer model. We will see later that our bound on the error rate of our inference algorithm holds for any distribution of pj but depends on the distribution only through this parameter q. It is quite realistic to assume the existence of a prior distribution for pj. The model is therefore quite general: in particular, it is met if we simply randomize the order in which we upload our task batches, since this will have the effect of randomizing which workers perform which batches, yielding a distribution that meets our requirements. On the other hand, it is not realistic to assume that we know what the prior is. To execute our inference algorithm for a given number of iterations, we do not require any knowledge of the distribution of the reliability. 
However, q is necessary in order to determine how many times a task should be replicated and how many iterations we need to run to achieve a certain reliability.\n\nUnder this crowdsourcing model, a taskmaster first decides which tasks should be assigned to which workers, and then estimates the correct solutions {si}i\u2208[m] once all the answers {Aij} are submitted. We assume a one-shot scenario in which all questions are asked simultaneously and then an estimation is performed after all the answers are obtained. In particular, we do not allow allocating tasks adaptively based on the answers received thus far. Then, assigning tasks to workers amounts to designing a bipartite graph G({ti}i\u2208[m] \u222a {wj}j\u2208[n], E) with m task and n worker nodes. Each edge (i, j) \u2208 E indicates that task ti was assigned to worker wj.\n\nPrior Work. A naive approach to identifying the correct answer from multiple workers\u2019 responses is to use majority voting. Majority voting simply chooses what the majority of workers agree on. When there are many spammers, majority voting is error-prone since it weights all the workers equally. We will show that majority voting is provably sub-optimal and can be significantly improved upon.\n\nTo infer the answers of the tasks and also the reliability of the workers, Dawid and Skene [1, 2] proposed an algorithm based on expectation maximization (EM) [3]. This approach has also been applied in classification problems where the training data is annotated by low-cost noisy \u2018labelers\u2019 [4, 5]. In [6] and [7], this EM approach has been applied to more complicated probabilistic models for image labeling tasks. However, the performance of these approaches is only evaluated empirically, and there is no analysis that proves performance guarantees. In particular, EM algorithms require an initial starting point which is typically guessed randomly. 
The algorithm is highly sensitive to this initialization, making it difficult to predict the quality of the resulting estimate. The advantage of using low-cost noisy \u2018labelers\u2019 has been studied in the context of supervised learning, where a set of labels on a training set is used to find a good classifier. Given a fixed budget, there is a trade-off between acquiring a larger training dataset or acquiring a smaller dataset but with more labels per data point. Through extensive experiments, Sheng, Provost and Ipeirotis [8] show that repeated labeling can give a considerable advantage.\n\nContributions. In this work, we provide a rigorous treatment of designing a crowdsourcing system with the aim of minimizing the budget needed to complete a set of tasks with a certain reliability. We provide both an asymptotically optimal graph construction (random regular bipartite graph) and an asymptotically optimal algorithm for inference (iterative algorithm) on that graph. As the main result, we show that our algorithm performs as well as the best possible algorithm. The surprise lies in the fact that the optimality of our algorithm is established by comparing it with the best algorithm, one that is free to choose any graph, regular or irregular, and performs optimal estimation based on the information provided by an oracle about the reliability of the workers. Previous approaches focus on developing inference algorithms assuming that a graph is already given. None of the prior work on crowdsourcing provides any systematic treatment of the graph construction. To the best of our knowledge, we are the first to study both aspects of crowdsourcing together and, more importantly, establish optimality.\n\nAnother novel contribution of our work is the analysis technique. The iterative algorithm we introduce operates on real-valued messages whose distribution is a priori difficult to analyze. 
To overcome this challenge, we develop a novel technique of establishing that these messages are sub-Gaussian (see Section 3 for a definition) using recursion, and we compute the parameters in closed form. This allows us to prove a sharp result on the error rate, and this technique could be of independent interest for analyzing a more general class of algorithms.\n\n2 Main result\n\nUnder the crowdsourcing model introduced, we want to design algorithms to assign tasks and estimate the answers. In what follows, we explain how to assign tasks using a random regular graph and introduce a novel iterative algorithm to infer the correct answers. We state the performance guarantees for our algorithm and provide comparisons to majority voting and an oracle estimator.\n\nTask allocation. Assigning tasks amounts to designing a bipartite graph G({ti}i\u2208[m] \u222a {wj}j\u2208[n], E), where each edge corresponds to a task-worker assignment. The taskmaster makes a choice of how many workers to assign to each task (the left degree l) and how many tasks to assign to each worker (the right degree r). Since the total number of edges has to be consistent, the number of workers n directly follows from ml = nr. To generate an (l, r)-regular bipartite graph we use a random graph generation scheme known as the configuration model in the random graph literature [9, 10]. In principle, one could use an arbitrary bipartite graph G for task allocation. However, as we show later in this paper, random regular graphs are sufficient to achieve order-optimal performance.\n\nInference algorithm. We introduce a novel iterative algorithm which operates on real-valued task messages {xi\u2192j}(i,j)\u2208E and worker messages {yj\u2192i}(i,j)\u2208E. The worker messages are initialized as independent Gaussian random variables. At each iteration, the messages are updated according to the update rules given below, where \u2202i is the neighborhood of ti. 
Intuitively, a worker message yj\u2192i represents our belief on how \u2018reliable\u2019 worker j is, so that our final estimate is a sum of the answers weighted by each worker\u2019s reliability: \u02c6si = sign(\u2211_{j\u2208\u2202i} Aij yj\u2192i).\n\nIterative Algorithm\nInput: E, {Aij}(i,j)\u2208E, kmax\nOutput: Estimate \u02c6s({Aij})\n1: For all (i, j) \u2208 E do initialize y(0)j\u2192i with random Zij \u223c N(1, 1);\n2: For k = 1, . . . , kmax do\n   for all (i, j) \u2208 E do x(k)i\u2192j \u2190 \u2211_{j'\u2208\u2202i\\j} Aij' y(k\u22121)j'\u2192i ;\n   for all (i, j) \u2208 E do y(k)j\u2192i \u2190 \u2211_{i'\u2208\u2202j\\i} Ai'j x(k)i'\u2192j ;\n3: For all i \u2208 [m] do xi \u2190 \u2211_{j\u2208\u2202i} Aij y(kmax\u22121)j\u2192i ;\n4: Output estimate vector \u02c6s({Aij}) = [sign(xi)].\n\nWhile our algorithm is inspired by the standard Belief Propagation (BP) algorithm for approximating max-marginals [11, 12], our algorithm is original and overcomes a few critical limitations of standard BP. First, the iterative algorithm does not require any knowledge of the prior distribution of pj, whereas standard BP requires knowledge of this distribution. Second, there is no efficient way to implement standard BP, since we would need to pass sufficient statistics (or messages) which under our general model are distributions over the reals. On the other hand, the iterative algorithm only passes messages that are real numbers regardless of the prior distribution of pj, which makes it easy to implement. Third, the iterative algorithm is provably asymptotically order-optimal. Density evolution, explained in detail in Section 3, is a standard technique to analyze the performance of BP. Although we can write down the density evolution for standard BP, we cannot analyze the densities, analytically or numerically. 
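To make the scheme concrete, here is a small self-contained sketch (our own illustration, not code from the paper) that draws an (l, r)-regular assignment from the configuration model, simulates a spammer-hammer crowd, and runs the message-passing updates in vectorized form. Names such as `config_model` and `iterative_estimate` are ours; the final estimate uses the worker messages from the previous round, as in step 3 of the algorithm box.

```python
import numpy as np

rng = np.random.default_rng(7)

def config_model(m, l, r, rng):
    """(l, r)-regular bipartite graph: match m*l task stubs to n*r worker
    stubs uniformly at random (parallel edges are possible and kept)."""
    n = m * l // r
    assert m * l == n * r
    tasks = np.repeat(np.arange(m), l)       # task endpoint of each edge
    workers = np.repeat(np.arange(n), r)     # worker endpoint of each edge
    rng.shuffle(workers)
    return tasks, workers, n

def iterative_estimate(tasks, workers, A, m, n, k_max, rng):
    """Message passing on edges; each update subtracts the message coming
    back along the same edge (the extrinsic rule of the algorithm)."""
    y = rng.normal(1.0, 1.0, size=A.size)            # y^(0) ~ N(1, 1)
    for _ in range(k_max - 1):
        t = np.bincount(tasks, weights=A * y, minlength=m)
        x = t[tasks] - A * y                         # exclude own edge
        w = np.bincount(workers, weights=A * x, minlength=n)
        y = w[workers] - A * x                       # exclude own edge
    xi = np.bincount(tasks, weights=A * y, minlength=m)
    return np.sign(xi)

m, l, r, k_max = 200, 10, 10, 10
tasks, workers, n = config_model(m, l, r, rng)
s = rng.choice([-1, 1], size=m)                      # true answers
p = np.where(rng.random(n) < 0.7, 1.0, 0.5)          # spammer-hammer, q = 0.7
correct = rng.random(tasks.size) < p[workers]
A = np.where(correct, s[tasks], -s[tasks])           # observed answers

err_iter = np.mean(iterative_estimate(tasks, workers, A, m, n, k_max, rng) != s)
err_mv = np.mean(np.sign(np.bincount(tasks, weights=A, minlength=m)) != s)
```

With a crowd that is 70% hammers both error rates are small; the gap between the two widens as the spammer fraction grows, which is the regime illustrated in Figure 1.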
It is also very simple to write down the density evolution equations for the iterative algorithm, but it is not trivial to analyze the densities in this case either. We develop a novel technique to analyze the densities and prove the optimality of our algorithm.\n\n2.1 Performance guarantee\n\nWe state the main analytical result of this paper: for random (l, r)-regular bipartite graph based task assignments with our iterative inference algorithm, the probability of error decays exponentially in lq, up to a universal constant and for a broad range of the parameters l, r and q. With a reasonable choice of l = r, both scaling like (1/q) log(1/\u03b5), the proposed algorithm is guaranteed to achieve error less than \u03b5 for any \u03b5 \u2208 (0, 1/2). Further, an algorithm-independent lower bound that we establish suggests that such an error dependence on lq is unavoidable. Hence, in terms of the task allocation budget, our algorithm is order-optimal. The precise statements follow next. Let \u00b5 = E[2pj \u2212 1] and recall q = E[(2pj \u2212 1)^2]. To lighten the notation, let \u02c6l \u2261 l \u2212 1 and \u02c6r \u2261 r \u2212 1. Define\n\n\u03c1k^2 \u2261 2q/(\u00b5^2(q^2\u02c6l\u02c6r)^{k\u22121}) + (3 + 1/(q\u02c6r)) \u00b7 (1 \u2212 (1/(q^2\u02c6l\u02c6r))^{k\u22121})/(1 \u2212 1/(q^2\u02c6l\u02c6r)) .\n\nFor q^2\u02c6l\u02c6r > 1, let \u03c1\u221e^2 \u2261 lim_{k\u2192\u221e} \u03c1k^2, such that\n\n\u03c1\u221e^2 = (3 + 1/(q\u02c6r)) q^2\u02c6l\u02c6r/(q^2\u02c6l\u02c6r \u2212 1) .\n\nThen we can show the following bound on the probability of making an error.\nTheorem 2.1. For fixed l > 1 and r > 1, assume that m tasks are assigned to n = ml/r workers according to a random (l, r)-regular graph drawn from the configuration model. 
If the distribution of the worker reliability satisfies \u00b5 \u2261 E[2pj \u2212 1] > 0 and q^2 > 1/(\u02c6l\u02c6r), then for any s \u2208 {\u00b11}^m, the estimates from k iterations of the iterative algorithm achieve\n\nlim sup_{m\u2192\u221e} (1/m) \u2211_{i=1}^{m} P( si \u2260 \u02c6si({Aij}(i,j)\u2208E) ) \u2264 e^{\u2212lq/(2\u03c1k^2)} . (1)\n\nAs we increase k, the above bound converges to a non-trivial limit.\nCorollary 2.2. Under the hypotheses of Theorem 2.1,\n\nlim sup_{k\u2192\u221e} lim sup_{m\u2192\u221e} (1/m) \u2211_{i=1}^{m} P( si \u2260 \u02c6si({Aij}(i,j)\u2208E) ) \u2264 e^{\u2212lq/(2\u03c1\u221e^2)} . (2)\n\nOne implication of this corollary is that, under the mild assumption that r \u2265 l, the probability of error is upper bounded by e^{\u2212(1/8)(lq\u22121)}. Even if we fix the value of q = E[(2pj \u2212 1)^2], different distributions of pj can have different values of \u00b5 in the range [q, \u221aq]. Surprisingly, the asymptotic bound on the error rate does not depend on \u00b5. Instead, as long as q is fixed, \u00b5 only affects how fast the algorithm converges (cf. Lemma 2.3).\nNotice that the bound in (2) is only meaningful when it is less than a half, whence \u02c6l\u02c6rq^2 > 1 and lq > 6 log(2) > 4. While as a taskmaster the case of \u02c6l\u02c6rq^2 < 1 may not be of interest, for the purpose of completeness we comment on the performance of our algorithm in this regime. Specifically, we empirically observe that the error rate increases as the number of iterations k increases. Therefore, it makes sense to use k = 1, in which case the algorithm essentially boils down to the majority rule. We can prove the following error bound. 
The proof is omitted due to a space constraint.\n\nlim sup_{m\u2192\u221e} (1/m) \u2211_{i=1}^{m} P( si \u2260 \u02c6si({Aij}(i,j)\u2208E) ) \u2264 e^{\u2212l\u00b5^2/4} . (3)\n\n2.2 Discussion\n\nHere we make a few comments relating to the execution of the algorithm and the interpretation of the main results. First, the iterative algorithm is efficient, with runtime comparable to that of simple majority voting, which requires O(ml) operations.\nLemma 2.3. Under the hypotheses of Theorem 2.1, the total computational cost sufficient to achieve the bound in Corollary 2.2 up to any constant factor in the exponent is O(ml log(q/\u00b5^2)/ log(q^2\u02c6l\u02c6r)).\nBy definition, we have q \u2264 \u00b5 \u2264 \u221aq. The runtime is worst when \u00b5 = q, which happens under the spammer-hammer model, and it is best when \u00b5 = \u221aq, which happens if pj = (1 + \u221aq)/2 deterministically. There exists a (non-iterative) polynomial time algorithm with runtime independent of q for computing the estimate which achieves (2), but in practice we expect that the number of iterations needed is small enough that the iterative algorithm will outperform this non-iterative algorithm. A detailed proof of Lemma 2.3 is skipped here due to a space constraint.\nSecond, the assumption that \u00b5 > 0 is necessary. If there is no assumption on \u00b5, then we cannot distinguish whether the responses came from tasks with {si}i\u2208[m] and workers with {pj}j\u2208[n] or from tasks with {\u2212si}i\u2208[m] and workers with {1 \u2212 pj}j\u2208[n]; statistically, both give the same output. In the case when we know that \u00b5 < 0, we can use the same algorithm, changing the sign of the final output, and get the same performance guarantee.\nThird, our algorithm does not require any information on the distribution of pj. 
Further, unlike other EM-based algorithms, the iterative algorithm is not sensitive to initialization and, with random initialization, converges to a unique estimate with high probability. This follows from the fact that the algorithm is essentially computing a leading eigenvector of a particular linear operator.\n\n2.3 Relation to singular value decomposition\n\nThe leading singular vectors are often used to capture the important aspects of datasets in matrix form. In our case, the leading left singular vector of A can be used to estimate the correct answers, where A \u2208 {0, \u00b11}^{m\u00d7n} is the m \u00d7 n adjacency matrix of the graph G weighted by the submitted answers. We can compute it using power iteration: for u \u2208 R^m and v \u2208 R^n, starting with a randomly initialized v, power iteration iteratively updates u and v according to\n\nfor all i, ui = \u2211_{j\u2208\u2202i} Aij vj , and for all j, vj = \u2211_{i\u2208\u2202j} Aij ui .\n\nIt is known that the normalized u converges exponentially to the leading left singular vector. This update rule is very similar to that of our iterative algorithm. But there is one difference that is crucial in the analysis: in our algorithm we follow the framework of the celebrated belief propagation algorithm [11, 12] and exclude the incoming message from node j when computing an outgoing message to j. This extrinsic nature of our algorithm and the locally tree-like structure of sparse random graphs [9, 13] allow us to perform asymptotic analysis on the average error rate. In particular, if we use the leading singular vector of A to estimate s, such that \u02c6si = sign(ui), then existing analysis techniques from random matrix theory do not give the strong performance guarantee we have. These techniques typically focus on understanding how the subspace spanned by the top singular vector behaves. 
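The power iteration just described can be sketched as follows. This is a generic illustration with our own naming; note that it does not include the extrinsic message exclusion that distinguishes our iterative algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def leading_left_singular_vector(A, iters=200):
    """Plain power iteration u <- A v, v <- A^T u with normalization;
    for a generic start, u converges to the leading left singular vector."""
    v = rng.normal(size=A.shape[1])
    for _ in range(iters):
        u = A @ v
        u /= np.linalg.norm(u)
        v = A.T @ u
        v /= np.linalg.norm(v)
    return u

# Sanity check on a diagonal matrix: the top singular direction is e_1.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
u = leading_left_singular_vector(A)
```

For this diagonal example, u aligns (up to sign) with the first coordinate axis, the left singular vector of the larger singular value.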
To get a sharp bound, we need to analyze how each entry of the leading singular vector is distributed. We introduce the iterative algorithm in order to precisely characterize how each of the decision variables xi is distributed. Since the iterative algorithm introduced in this paper is quite similar to the power iteration used to compute the leading singular vectors, this suggests that our analysis may shed light on how to analyze the top singular vectors of a sparse random matrix.\n\n2.4 Optimality of our algorithm\n\nAs a taskmaster, the natural core optimization problem of our concern is how to achieve a certain reliability in our answers with minimum cost. Since we pay an equal amount for all the task assignments, the cost is proportional to the total number of edges of the graph G. Here we compute the total budget sufficient to achieve a target error rate using our algorithm and show that this is within a constant factor of the budget necessary to achieve the given target error rate using any graph and the best possible inference algorithm. The order-optimality is established with respect to all algorithms that operate in one shot, i.e., all task assignments are made simultaneously, and an estimation is performed after all the answers are obtained. The proofs of the claims in this section are skipped here due to space limitations.\n\nFormally, consider a scenario where there are m tasks to complete and a target accuracy \u03b5 \u2208 (0, 1/2). To measure accuracy, we use the average probability of error per task, denoted by dm(s, \u02c6s) \u2261 (1/m) \u2211_{i\u2208[m]} P(si \u2260 \u02c6si). We will show that \u03a9((1/q) log(1/\u03b5)) assignments per task are necessary and sufficient to achieve the target error rate: dm(s, \u02c6s) \u2264 \u03b5. To establish this fundamental limit, we use the following minimax bound on the error rate. Consider the case where nature chooses a set of correct answers s \u2208 {\u00b11}^m and a distribution of the worker reliability pj \u223c f. The distribution f is chosen from the set of all distributions on [0, 1] which satisfy Ef[(2pj \u2212 1)^2] = q. We use F(q) to denote this set of distributions. Let G(m, l) denote the set of all bipartite graphs, including irregular graphs, that have m task nodes and ml total edges. Then the minimax error rate achieved by the best possible graph G \u2208 G(m, l) using the best possible inference algorithm is at least\n\ninf_{ALGO, G\u2208G(m,l)} sup_{s, f\u2208F(q)} dm(s, \u02c6sG,ALGO) \u2265 (1/2) e^{\u2212(lq+O(lq^2))} , (4)\n\nwhere \u02c6sG,ALGO denotes the estimate we get using graph G for task allocation and algorithm ALGO for inference. This minimax bound is established by computing the error rate of an oracle estimator, which makes an optimal decision based on the information provided by an oracle who knows how reliable each worker is. Next, we show that the error rate of majority voting decays significantly more slowly: the leading term in the error exponent scales like \u2212lq^2. Let \u02c6sMV be the estimate produced by majority voting. Then, for q \u2208 (0, 1), there exists a numerical constant C1 such that\n\ninf_{G\u2208G(m,l)} sup_{s, f\u2208F(q)} dm(s, \u02c6sMV) = e^{\u2212(C1lq^2+O(lq^4+1))} . (5)\n\nThe lower bound in (4) does not depend on how many tasks are assigned to each worker. However, our main result depends on the value of r. We show that for a broad range of parameters l, r, and q our algorithm achieves optimality. Let \u02c6sIter be the estimate given by random regular graphs and the iterative algorithm. For \u02c6lq \u2265 C2, \u02c6rq \u2265 C3 and C2C3 > 1, Corollary 2.2 gives\n\nlim_{m\u2192\u221e} sup_{s, f\u2208F(q)} dm(s, \u02c6sIter) \u2264 e^{\u2212C4lq} . (6)\n\nThis is also illustrated in Figure 1. 
We ran numerical experiments with 1000 tasks and 1000 workers drawn from the spammer-hammer model, assigned according to random graphs with l = r from the configuration model. For the left figure, we fixed q = 0.3, and for the right figure we fixed l = 25.\n\nFigure 1: The iterative algorithm improves over majority voting and the EM algorithm [8]. (Both panels plot the probability of error PError, against l on the left and against q on the right.)\n\nNow, let \u2206LB be the minimum cost per task necessary to achieve a target accuracy \u03b5 \u2208 (0, 1/2) using any graph and any possible algorithm. Then (4) implies \u2206LB \u223c (1/q) log(1/\u03b5), where x \u223c y indicates that x scales as y. Let \u2206Iter be the minimum cost per task sufficient to achieve a target accuracy \u03b5 using our proposed algorithm. Then from (6) we get \u2206Iter \u223c (1/q) log(1/\u03b5). This establishes the order-optimality of our algorithm. It is indeed surprising that regular graphs are sufficient to achieve this optimality. Further, let \u2206Majority be the minimum cost per task necessary to achieve a target accuracy \u03b5 using majority voting. Then \u2206Majority \u223c (1/q^2) log(1/\u03b5), which is significantly more costly than the optimal scaling (1/q) log(1/\u03b5) of our algorithm.\n\n3 Proof of Theorem 2.1\n\nBy symmetry, we can assume all si\u2019s are +1. If I is a random integer drawn uniformly in [m], then (1/m) \u2211_{i\u2208[m]} P(si \u2260 \u02c6si) \u2264 P(x(k)I \u2264 0), where x(k)i denotes the decision variable for task i after k iterations of the iterative algorithm. Asymptotically, for fixed k, l and r, the local neighborhood of xI converges to a regular tree. To analyze lim_{m\u2192\u221e} P(x(k)I \u2264 0), we use a standard probabilistic analysis technique known as \u2018density evolution\u2019 in coding theory or \u2018recursive distributional equations\u2019 in probabilistic combinatorics [9, 13]. 
Precisely, we use the following equality:\n\nlim_{m\u2192\u221e} P(x(k)I \u2264 0) = P(\u02c6x(k) \u2264 0) , (7)\n\nwhere \u02c6x(k) is defined through the density evolution equations (8) and (9) in the following.\n\nDensity evolution. In the large system limit as m \u2192 \u221e, the (l, r)-regular random graph locally converges in distribution to an (l, r)-regular tree. Therefore, for a randomly chosen edge (i, j), the messages xi\u2192j and yj\u2192i converge in distribution to x and yp defined in the density evolution equations (8). Here and after, we drop the superscript k denoting the iteration number whenever it is clear from the context. We initialize yp with a Gaussian distribution independent of p: y(0)p \u223c N(1, 1). Let d= denote equality in distribution. Then, for k \u2208 {1, 2, . . .},\n\nx(k) d= \u2211_{i\u2208[l\u22121]} zpi,i y(k\u22121)pi,i ,   y(k)p d= \u2211_{j\u2208[r\u22121]} zp,j x(k)j , (8)\n\nwhere the xj\u2019s, pi\u2019s, and yp,i\u2019s are independent copies of x, p, and yp, respectively. Also, the zp,i\u2019s and zp,j\u2019s are independent copies of zp, and p \u2208 [0, 1] is a random variable distributed according to the distribution of the workers\u2019 quality. The zp,j\u2019s and xj\u2019s are independent, and the zpi,i\u2019s and ypi,i\u2019s are conditionally independent given pi. Finally, zp is a random variable which is +1 with probability p and \u22121 with probability 1 \u2212 p. Then, for a randomly chosen I, the decision variable x(k)I converges in distribution to\n\n\u02c6x(k) d= \u2211_{i\u2208[l]} zpi,i y(k\u22121)pi,i . (9)\n\nAnalyzing the density. Our strategy for providing an upper bound on P(\u02c6x(k) \u2264 0) is to show that \u02c6x(k) is sub-Gaussian with appropriate parameters and to use the Chernoff bound. 
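The recursion (8)-(9) can also be tracked numerically by a sampled ("Monte-Carlo") version of density evolution. The sketch below is our own illustration, not the paper's analysis: each worker message is kept paired with the reliability p that generated it, so that the coupling between zpi,i and ypi,i is preserved, and the iteration indexing is simplified (the decision variable is formed from the current worker-message pool).

```python
import numpy as np

rng = np.random.default_rng(3)

def density_evolution_error(l, r, sample_p, k, n=50_000, rng=rng):
    """Monte-Carlo density evolution for the iterative algorithm, assuming
    all correct answers are +1; returns an estimate of P(xhat <= 0)."""
    lh, rh = l - 1, r - 1
    p_pool = sample_p(n)                        # p ~ f, paired with y below
    y_pool = rng.normal(1.0, 1.0, size=n)       # y^(0) ~ N(1,1), indep. of p
    for _ in range(k):
        # task message: sum of l-1 terms z_{p_i} y_{p_i}, drawn from the pool
        idx = rng.integers(0, n, size=(n, lh))
        z = np.where(rng.random((n, lh)) < p_pool[idx], 1.0, -1.0)
        x_pool = (z * y_pool[idx]).sum(axis=1)
        # worker message: fresh p, sum of r-1 terms z_p x_j (same p per row)
        p_pool = sample_p(n)
        z = np.where(rng.random((n, rh)) < p_pool[:, None], 1.0, -1.0)
        y_pool = (z * x_pool[rng.integers(0, n, size=(n, rh))]).sum(axis=1)
    # decision variable as in (9): l terms from the current message pool
    idx = rng.integers(0, n, size=(n, l))
    z = np.where(rng.random((n, l)) < p_pool[idx], 1.0, -1.0)
    xhat = (z * y_pool[idx]).sum(axis=1)
    return float(np.mean(xhat <= 0))

# Spammer-hammer crowd with 70% hammers (q = 0.7), l = r = 10, k = 5 rounds.
spammer_hammer = lambda size: np.where(rng.random(size) < 0.7, 1.0, 0.5)
err = density_evolution_error(10, 10, spammer_hammer, k=5)
```

This kind of sampled recursion is how density evolution is usually evaluated when the message densities have no closed form; the sub-Gaussian argument below replaces it with an analytical bound.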
A random variable $z$ with mean $m$ is said to be sub-Gaussian with parameter $\sigma$ if for all $\lambda \in \mathbb{R}$ the following inequality holds: $\mathbb{E}[e^{\lambda z}] \leq e^{m\lambda + (1/2)\sigma^2\lambda^2}$. Define

$\sigma_k^2 \equiv 2\hat{l}(\hat{l}\hat{r})^{k-1} + \mu^2 \hat{l}^3 \hat{r} (3q\hat{r} + 1)(q\hat{l}\hat{r})^{2k-4} \, \dfrac{1 - (1/q^2\hat{l}\hat{r})^{k-1}}{1 - (1/q^2\hat{l}\hat{r})}$

and $m_k \equiv \mu\hat{l}(q\hat{l}\hat{r})^{k-1}$ for $k \in \mathbb{Z}$. We will first show that $x^{(k)}$ is sub-Gaussian with mean $m_k$ and parameter $\sigma_k^2$ for a regime of $\lambda$ we are interested in. Precisely, we will show that for $|\lambda| \leq 1/(2m_{k-1}\hat{r})$,

$\mathbb{E}[e^{\lambda x^{(k)}}] \leq e^{m_k\lambda + (1/2)\sigma_k^2\lambda^2}$ .   (10)

By definition, due to distributional independence, we have $\mathbb{E}[e^{\lambda \hat{x}^{(k)}}] = \mathbb{E}[e^{\lambda x^{(k)}}]^{(l/\hat{l})}$. Therefore, it follows from (10) that $\hat{x}^{(k)}$ satisfies $\mathbb{E}[e^{\lambda \hat{x}^{(k)}}] \leq e^{(l/\hat{l})m_k\lambda + (l/2\hat{l})\sigma_k^2\lambda^2}$. Applying the Chernoff bound with $\lambda = -m_k/\sigma_k^2$, we get

$\mathbb{P}\big(\hat{x}^{(k)} \leq 0\big) \leq \mathbb{E}\big[e^{\lambda \hat{x}^{(k)}}\big] \leq e^{-l\, m_k^2/(2\hat{l}\sigma_k^2)}$ .   (11)

Since $m_k m_{k-1}/\sigma_k^2 \leq \mu^2\hat{l}^2(q\hat{l}\hat{r})^{2k-3}/(3\mu^2 q\hat{l}^3\hat{r}^2(q\hat{l}\hat{r})^{2k-4}) = 1/(3\hat{r})$, it is easy to check that $|\lambda| \leq 1/(2m_{k-1}\hat{r})$. Substituting (11) in (7) finishes the proof of Theorem 2.1.
Now we are left to prove that $x^{(k)}$ is sub-Gaussian with appropriate parameters.
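Since $m_k$ and $\sigma_k^2$ have closed forms, the right-hand side of (11) can be evaluated numerically. The sketch below takes $\hat{l} = l - 1$ and $\hat{r} = r - 1$ (the hat notation is not redefined in this excerpt; degree-minus-one is the usual density-evolution convention, so treat this as an assumption) and defaults to $\mu = q$, which holds for the spammer-hammer distribution.

```python
import math

def error_bound(l, r, q, k, mu=None):
    """Evaluate the Chernoff bound exp(-l * m_k^2 / (2 * lhat * sigma_k^2))
    of (11), using the closed forms of m_k and sigma_k^2.
    Assumes lhat = l - 1, rhat = r - 1 and, by default, mu = E[2p-1] = q
    (true for the spammer-hammer model)."""
    lh, rh = l - 1, r - 1
    mu = q if mu is None else mu
    m_k = mu * lh * (q * lh * rh) ** (k - 1)
    rho = 1.0 / (q * q * lh * rh)            # common ratio of the geometric sum
    s2_k = (2 * lh * (lh * rh) ** (k - 1)
            + mu ** 2 * lh ** 3 * rh * (3 * q * rh + 1)
              * (q * lh * rh) ** (2 * k - 4)
              * (1 - rho ** (k - 1)) / (1 - rho))
    return math.exp(-l * m_k ** 2 / (2 * lh * s2_k))

# k = 1 recovers the base case m_1 = mu*lhat, sigma_1^2 = 2*lhat,
# so the bound reduces to exp(-l * q^2 / 4) there; it tightens as k grows.
print(error_bound(25, 25, 0.3, 1))
print(error_bound(25, 25, 0.3, 5))
```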
We can write down a recursive formula for the evolution of the moment generating functions of $x$ and $y_p$ as

$\mathbb{E}\big[e^{\lambda x^{(k)}}\big] = \Big(\mathbb{E}_p\big[p\,\mathbb{E}[e^{\lambda y_p^{(k-1)}} \mid p] + \bar{p}\,\mathbb{E}[e^{-\lambda y_p^{(k-1)}} \mid p]\big]\Big)^{\hat{l}}$ ,   (12)

$\mathbb{E}\big[e^{\lambda y_p^{(k)}}\big] = \Big(p\,\mathbb{E}\big[e^{\lambda x^{(k)}}\big] + \bar{p}\,\mathbb{E}\big[e^{-\lambda x^{(k)}}\big]\Big)^{\hat{r}}$ ,   (13)

where $\bar{p} = 1 - p$. We can prove that these are sub-Gaussian using induction. First, for $k = 1$, we show that $x^{(1)}$ is sub-Gaussian with mean $m_1 = \mu\hat{l}$ and parameter $\sigma_1^2 = 2\hat{l}$, where $\mu \equiv \mathbb{E}[2p - 1]$. Since $y_p$ is initialized as Gaussian with unit mean and variance, we have $\mathbb{E}[e^{\lambda y_p^{(0)}}] = e^{\lambda + (1/2)\lambda^2}$ regardless of $p$. Substituting this into (12), we get for any $\lambda$, $\mathbb{E}[e^{\lambda x^{(1)}}] = (\mathbb{E}[p]e^{\lambda} + (1 - \mathbb{E}[p])e^{-\lambda})^{\hat{l}} e^{(1/2)\hat{l}\lambda^2} \leq e^{\hat{l}\mu\lambda + \hat{l}\lambda^2}$, where the inequality follows from the fact that $a e^{z} + (1-a)e^{-z} \leq e^{(2a-1)z + (1/2)z^2}$ for any $z \in \mathbb{R}$ and $a \in [0, 1]$ (cf. [14, Lemma A.1.5]).
Next, assuming $\mathbb{E}[e^{\lambda x^{(k)}}] \leq e^{m_k\lambda + (1/2)\sigma_k^2\lambda^2}$ for $|\lambda| \leq 1/(2m_{k-1}\hat{r})$, we show that $\mathbb{E}[e^{\lambda x^{(k+1)}}] \leq e^{m_{k+1}\lambda + (1/2)\sigma_{k+1}^2\lambda^2}$ for $|\lambda| \leq 1/(2m_k\hat{r})$, and compute appropriate $m_{k+1}$ and $\sigma_{k+1}^2$.
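The elementary inequality invoked for the base case (cf. [14, Lemma A.1.5]) is easy to sanity-check numerically on a grid:

```python
import numpy as np

# Grid check of a*e^z + (1-a)*e^{-z} <= e^{(2a-1)z + z^2/2},
# the inequality used for the k = 1 base case (cf. [14, Lemma A.1.5]).
a = np.linspace(0.0, 1.0, 101)[:, None]    # a in [0, 1]
z = np.linspace(-3.0, 3.0, 601)[None, :]   # z in a bounded window of R
lhs = a * np.exp(z) + (1 - a) * np.exp(-z)
rhs = np.exp((2 * a - 1) * z + 0.5 * z ** 2)
assert np.all(lhs <= rhs * (1 + 1e-12))    # small slack for float round-off
print("inequality holds on the grid")
```

Note the $a = 1/2$ slice is the classical $\cosh(z) \leq e^{z^2/2}$; the general case shifts the exponent's mean to $2a - 1$.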
Substituting the bound $\mathbb{E}[e^{\lambda x^{(k)}}] \leq e^{m_k\lambda + (1/2)\sigma_k^2\lambda^2}$ in (13), we get $\mathbb{E}[e^{\lambda y_p^{(k)}}] \leq (p e^{m_k\lambda} + \bar{p} e^{-m_k\lambda})^{\hat{r}} e^{(1/2)\hat{r}\sigma_k^2\lambda^2}$. Further applying this bound in (12), we get

$\mathbb{E}\big[e^{\lambda x^{(k+1)}}\big] \leq \Big(\mathbb{E}_p\big[p(p e^{m_k\lambda} + \bar{p} e^{-m_k\lambda})^{\hat{r}} + \bar{p}(p e^{-m_k\lambda} + \bar{p} e^{m_k\lambda})^{\hat{r}}\big]\Big)^{\hat{l}} e^{(1/2)\hat{l}\hat{r}\sigma_k^2\lambda^2}$ .   (14)

To bound the first term on the right-hand side, we use the next key lemma.
Lemma 3.1. For any $|z| \leq 1/(2\hat{r})$ and $p \in [0, 1]$ such that $q = \mathbb{E}[(2p-1)^2]$, we have

$\mathbb{E}_p\big[p(p e^{z} + \bar{p} e^{-z})^{\hat{r}} + \bar{p}(\bar{p} e^{z} + p e^{-z})^{\hat{r}}\big] \leq e^{q\hat{r}z + (1/2)(3q\hat{r}^2 + \hat{r})z^2}$ .

For the proof, we refer to the journal version of this paper. Applying this inequality to (14) gives $\mathbb{E}[e^{\lambda x^{(k+1)}}] \leq e^{q\hat{l}\hat{r}m_k\lambda + (1/2)((3q\hat{l}\hat{r}^2 + \hat{l}\hat{r})m_k^2 + \hat{l}\hat{r}\sigma_k^2)\lambda^2}$, for $|\lambda| \leq 1/(2m_k\hat{r})$. In the regime where $q\hat{l}\hat{r} \geq 1$, as per our assumption, $m_k$ is non-decreasing in $k$. At iteration $k$, the above recursion holds for $|\lambda| \leq \min\{1/(2m_1\hat{r}), \ldots, 1/(2m_{k-1}\hat{r})\} = 1/(2m_{k-1}\hat{r})$. Hence, we get a recursion for $m_k$ and $\sigma_k$ such that (10) holds for $|\lambda| \leq 1/(2m_{k-1}\hat{r})$:

$m_{k+1} = q\hat{l}\hat{r}\,m_k$ ,  $\sigma_{k+1}^2 = (3q\hat{l}\hat{r}^2 + \hat{l}\hat{r})m_k^2 + \hat{l}\hat{r}\sigma_k^2$ .

With the initialization $m_1 = \mu\hat{l}$ and $\sigma_1^2 = 2\hat{l}$, we have $m_k = \mu\hat{l}(q\hat{l}\hat{r})^{k-1}$ for $k \in \{1, 2, \ldots\}$ and $\sigma_k^2 = a\sigma_{k-1}^2 + b c^{k-2}$ for $k \in \{2, 3, \ldots\}$, with $a = \hat{l}\hat{r}$, $b = \mu^2\hat{l}^2(3q\hat{l}\hat{r}^2 + \hat{l}\hat{r})$, and $c = (q\hat{l}\hat{r})^2$. After some algebra, it follows that $\sigma_k^2 = \sigma_1^2 a^{k-1} + b c^{k-2}\sum_{\ell=0}^{k-2}(a/c)^{\ell}$.
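As a sanity check, the recursion for $m_k$ and $\sigma_k^2$ can be iterated numerically and compared against the geometric-sum expression just derived. As before, $\hat{l} = l - 1$ and $\hat{r} = r - 1$ are our assumed reading of the hat notation, and $\mu = q$ is the spammer-hammer value.

```python
def check_sigma_recursion(l=25, r=25, q=0.3, mu=0.3, k_max=8):
    """Iterate m_{k+1} = q*lh*rh*m_k and
    s2_{k+1} = (3*q*lh*rh**2 + lh*rh)*m_k**2 + lh*rh*s2_k,
    then compare s2_k against
    sigma_1^2 * a^(k-1) + b * c^(k-2) * sum_{i=0}^{k-2} (a/c)^i.
    Assumes lh = l - 1, rh = r - 1 (an assumption in this excerpt).
    Returns the worst relative mismatch over k = 2..k_max."""
    lh, rh = l - 1, r - 1
    a, c = lh * rh, (q * lh * rh) ** 2
    b = mu ** 2 * lh ** 2 * (3 * q * lh * rh ** 2 + lh * rh)
    m, s2 = mu * lh, 2.0 * lh                # m_1 and sigma_1^2
    worst = 0.0
    for k in range(2, k_max + 1):
        s2 = (3 * q * lh * rh ** 2 + lh * rh) * m ** 2 + lh * rh * s2
        m = q * lh * rh * m
        summed = 2.0 * lh * a ** (k - 1) \
            + b * c ** (k - 2) * sum((a / c) ** i for i in range(k - 1))
        worst = max(worst, abs(s2 - summed) / summed)
    return worst

print(check_sigma_recursion())   # prints a round-off-level number
```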
For \u02c6l\u02c6rq2 (cid:54)= 1, we have a/c (cid:54)= 1,\n\n1ak\u22121 + bck\u22122(cid:80)k\u22122\n\nWith the initialization m1 = \u00b5\u02c6l and \u03c32\n\u03c32\nk = a\u03c32\nsome algebra, it follows that \u03c32\nwhence \u03c32\n\n1ak\u22121 + bck\u22122(1 \u2212 (a/c)k\u22121)/(1 \u2212 a/c). This \ufb01nishes the proof of (10).\n\nk = \u03c32\n\nk = \u03c32\n\n8\n\n\fReferences\n[1] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the\nem algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20\u2013\n28, 1979.\n\n[2] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective\nlabelling of venus images. In Advances in neural information processing systems, pages 1085\u2013\n1092, 1995.\n\n[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via\nthe em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):pp.\n1\u201338, 1977.\n\n[4] R. Jin and Z. Ghahramani. Learning with multiple labels. In Advances in neural information\n\nprocessing systems, pages 921\u2013928, 2003.\n\n[5] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning\n\nfrom crowds. J. Mach. Learn. Res., 99:1297\u20131322, August 2010.\n\n[6] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more:\nIn Advances in Neural\n\nOptimal integration of labels from labelers of unknown expertise.\nInformation Processing Systems, volume 22, pages 2035\u20132043, 2009.\n\n[7] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds.\n\nIn Advances in Neural Information Processing Systems, pages 2424\u20132432, 2010.\n\n[8] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? improving data quality and data\nmining using multiple, noisy labelers. 
In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pages 614–622. ACM, 2008.
[9] T. Richardson and R. Urbanke. Modern Coding Theory. Cambridge University Press, March 2008.
[10] B. Bollobás. Random Graphs. Cambridge University Press, January 2001.
[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, California, 1988.
[12] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations, pages 239–269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[13] M. Mezard and A. Montanari. Information, Physics, and Computation. Oxford University Press, New York, NY, USA, 2009.
[14] N. Alon and J. H. Spencer. The Probabilistic Method. John Wiley, 2008.