{"title": "A Minimax Optimal Algorithm for Crowdsourcing", "book": "Advances in Neural Information Processing Systems", "page_first": 4352, "page_last": 4360, "abstract": "We consider the problem of accurately estimating the reliability of workers based on noisy labels they provide, which is a fundamental question in crowdsourcing. We propose a novel lower bound on the minimax estimation error which applies to any estimation procedure. We further propose Triangular Estimation (TE), an algorithm for estimating the reliability of workers. TE has low complexity, may be implemented in a streaming setting when labels are provided by workers in real time, and does not rely on an iterative procedure. We prove that TE is minimax optimal and matches our lower bound. We conclude by assessing the performance of TE and other state-of-the-art algorithms on both synthetic and real-world data.", "full_text": "A Minimax Optimal Algorithm for Crowdsourcing\n\nThomas Bonald\nTelecom ParisTech\n\nRichard Combes\n\nCentrale-Supelec / L2S\n\nthomas.bonald@telecom-paristech.fr\n\nrichard.combes@supelec.fr\n\nAbstract\n\nWe consider the problem of accurately estimating the reliability of workers based\non noisy labels they provide, which is a fundamental question in crowdsourcing.\nWe propose a novel lower bound on the minimax estimation error which applies\nto any estimation procedure. We further propose Triangular Estimation (TE), an\nalgorithm for estimating the reliability of workers. TE has low complexity, may\nbe implemented in a streaming setting when labels are provided by workers in real\ntime, and does not rely on an iterative procedure. We prove that TE is minimax\noptimal and matches our lower bound. 
We conclude by assessing the performance\nof TE and other state-of-the-art algorithms on both synthetic and real-world data.\n\n1 Introduction\n\nThe performance of many machine learning techniques, and in particular data classification, strongly\ndepends on the quality of the labeled data used in the initial training phase. A common way to label\nnew datasets is through crowdsourcing: many workers are asked to label data, typically texts or\nimages, in exchange for some low payment. Of course, crowdsourcing is prone to errors due to\nthe difficulty of some classification tasks, the low payment per task and the repetitive nature of the\njob. Some workers may even introduce errors on purpose. Thus it is essential to assign the same\nclassification task to several workers and to learn the reliability of each worker through her past\nactivity so as to minimize the overall error rate and to improve the quality of the labeled dataset.\n\nLearning the reliability of each worker is a tough problem because the true label of each task, the\nso-called ground truth, is unknown; it is precisely the objective of crowdsourcing to guess the true\nlabel. Thus the reliability of each worker must be inferred from the comparison of her labels on\nsome set of tasks with those of other workers on the same set of tasks.\n\nIn this paper, we consider binary labels and study the problem of estimating the workers\u2019 reliability\nbased on the answers they provide to tasks. We make two novel contributions to that problem:\n\n(i) We derive a lower bound on the minimax estimation error which applies to any estimator of\nthe workers\u2019 reliability. 
In doing so we identify \"hard\" instances of the problem, and show that the\nminimax error depends on two factors: the reliability of the three most informative workers and the\nmean reliability of all workers.\n\n(ii) We propose TE (Triangular Estimation), a novel algorithm for estimating the reliability of each\nworker based on the correlations between triplets of workers. We analyze the performance of TE and\nprove that it is minimax optimal in the sense that it matches the lower bound we previously derived.\nUnlike most prior work, we provide non-asymptotic performance guarantees which hold even for a\nfinite number of workers and tasks. As our analysis reveals, non-asymptotic performance guarantees\nrequire finer concentration arguments than asymptotic ones.\n\nTE has low complexity in terms of memory space and computation time, does not require storing\nthe whole data set in memory and can be easily applied in a setting in which answers to tasks arrive\nsequentially, i.e., in a streaming setting. Finally, we compare the performance of TE to state-of-the-art\nalgorithms through numerical experiments using both synthetic and real datasets.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Related Work\n\nThe first problems of data classification using independent workers appeared in the medical context,\nwhere each label refers to the state of a patient (e.g., sick or healthy) and the workers are clinicians.\n[Dawid and Skene, 1979] proposed an expectation-maximization (EM) algorithm, admitting\nthat the accuracy of the estimate was unknown. 
Several versions and extensions of this algorithm\nhave since been proposed and tested in various settings [Hui and Walter, 1980, Smyth et al., 1995,\nAlbert and Dodd, 2004, Raykar et al., 2010, Liu et al., 2012].\n\nA number of Bayesian techniques have also been proposed and applied to this problem by\n[Raykar et al., 2010, Welinder and Perona, 2010, Karger et al., 2011, Liu et al., 2012, Karger et al.,\n2014, 2013] and references therein. Of particular interest is the belief-propagation (BP) algorithm\nof [Karger et al., 2011], which is provably order-optimal in terms of the number of workers required\nper task for any given target error rate, in the limit of an in\ufb01nite number of tasks and an in\ufb01nite\npopulation of workers.\n\nAnother family of algorithms is based on the spectral analysis of some matrix representing the\ncorrelations between tasks or workers. [Ghosh et al., 2011] work on the task-task matrix whose\nentries correspond to the number of workers having labeled two tasks in the same manner, while\n[Dalvi et al., 2013] work on the worker-worker matrix whose entries correspond to the number of\ntasks labeled in the same manner by two workers. Both obtain performance guarantees by the\nperturbation analysis of the top eigenvector of the corresponding expected matrix. The BP algorithm\nof Karger, Oh and Shah is in fact closely related to these spectral algorithms: their message-passing\nscheme is very similar to the power-iteration method applied to the task-worker matrix, as observed\nin [Karger et al., 2011].\n\nTwo notable recent contributions are [Chao and Dengyong, 2015] and [Zhang et al., 2014]. The\nformer provides performance guarantees for two versions of EM, and derives lower bounds on the\nattainable prediction error (the probability of estimating labels incorrectly). 
The latter provides\nlower bounds on the estimation error of the workers\u2019 reliability as well as performance guarantees\nfor an improved version of EM relying on spectral methods in the initialization phase. Our lower\nbound cannot be compared to that of [Chao and Dengyong, 2015] because it applies to the workers\u2019\nreliability and not the prediction error; and our lower bound is tighter than that of [Zhang et al.,\n2014]. Our estimator shares some features of the algorithm proposed by [Zhang et al., 2014] to\ninitialize EM, which suggests that the EM phase itself is not essential to attain minimax optimality.\n\nAll these algorithms require the storage of all labels in memory and, to the best of our knowledge,\nthe only known streaming algorithm is the recursive EM algorithm of [Wang et al., 2013], for which\nno performance guarantees are available.\n\nThe remainder of the paper is organized as follows. In section 3 we state the problem and introduce\nour notation. The important question of identifiability is addressed in section 4. In section 5 we\npresent a lower bound on the minimax error rate of any estimator. In section 6 we present TE, discuss\nits complexity and prove that it is minimax optimal. In section 7 we present numerical experiments\non synthetic and real-world data sets and section 8 concludes the paper. Due to space constraints,\nwe only provide proof outlines for our two main results in this document. Complete proofs are\npresented in the supplementary material.\n\n3 Model\n\nConsider n workers, for some integer n \u2265 3. Each task consists in determining the answer to a\nbinary question. The answer to task t, the \u201cground-truth\u201d, is denoted by G(t) \u2208 {+1,\u22121}. We\nassume that the random variables G(1), G(2), ... are i.i.d. and centered, so that there is no bias\ntowards one of the answers.\nEach worker provides an answer with probability \u03b1 \u2208 (0, 1]. 
When worker i \u2208 {1, ..., n} provides\nan answer, this answer is correct with probability (1 + \u03b8i)/2, independently of the other workers, for\nsome parameter \u03b8i \u2208 [\u22121, 1] that we refer to as the reliability of worker i. If \u03b8i > 0 then worker\ni tends to provide correct answers; if \u03b8i < 0 then worker i tends to provide incorrect answers; if\n\u03b8i = 0 then worker i is non-informative. We denote by \u03b8 = (\u03b81, ..., \u03b8n) the reliability vector. Both\n\u03b1 and \u03b8 are unknown.\nLet Xi(t) \u2208 {\u22121, 0, 1} be the output of worker i for task t, where the output 0 corresponds to the\nabsence of an answer. We have:\n\n$$X_i(t) = \\begin{cases} G(t) & \\text{w.p. } \\alpha \\frac{1+\\theta_i}{2}, \\\\ -G(t) & \\text{w.p. } \\alpha \\frac{1-\\theta_i}{2}, \\\\ 0 & \\text{w.p. } 1-\\alpha. \\end{cases} \\qquad (1)$$\n\nSince the workers are independent, the random variables X1(t), ..., Xn(t) are independent given\nG(t), for each task t. We denote by X(t) the corresponding vector. The goal is to estimate the\nground-truth G(t) as accurately as possible by designing an estimator \u011c(t) that minimizes the error\nprobability P(\u011c(t) \u2260 G(t)). The estimator \u011c(t) is adaptive and may be a function of X(1), ..., X(t)\nbut not of the unknown parameters \u03b1, \u03b8.\n\nIt is well-known that, given \u03b8 and \u03b1 = 1, an optimal estimator of G(t) is the weighted majority vote\n[Nitzan and Paroush, 1982, Shapley and Grofman, 1984], namely\n\n$$\\hat{G}(t) = 1\\{W(t) > 0\\} - 1\\{W(t) < 0\\} + Z \\, 1\\{W(t) = 0\\}, \\qquad (2)$$\n\nwhere $W(t) = \\frac{1}{n} \\sum_{i=1}^{n} w_i X_i(t)$, $w_i = \\ln\\left(\\frac{1+\\theta_i}{1-\\theta_i}\\right)$ is the weight of worker i (possibly infinite), and Z\nis a Bernoulli random variable of parameter 1/2 over {+1,\u22121} (for random tie-breaking). We prove\nthis result for any \u03b1 \u2208 (0, 1].\n\nProposition 1 Assuming that \u03b8 is known, the estimator (2) is an optimal estimator of G(t).\n\nProof. 
Finding an optimal estimator of G(t) amounts to finding an optimal statistical test between\nhypotheses {G(t) = +1} and {G(t) = \u22121}, under a symmetry constraint so that type I and type II\nerror probabilities are equal. For any x \u2208 {\u22121, 0, 1}^n, let L+(x) and L\u2212(x) be the probabilities that\nX(t) = x under hypotheses {G(t) = +1} and {G(t) = \u22121}, respectively. We have\n\n$$L_+(x) = H(x) \\prod_{i=1}^{n} (1+\\theta_i)^{1\\{x_i=+1\\}} (1-\\theta_i)^{1\\{x_i=-1\\}},$$\n\n$$L_-(x) = H(x) \\prod_{i=1}^{n} (1+\\theta_i)^{1\\{x_i=-1\\}} (1-\\theta_i)^{1\\{x_i=+1\\}},$$\n\nwhere $\\ell = \\sum_{i=1}^{n} |x_i|$ is the number of answers and $H(x) = \\frac{1}{2^\\ell} \\alpha^\\ell (1-\\alpha)^{n-\\ell}$. We deduce the\nlog-likelihood ratio $\\ln\\left(\\frac{L_+(x)}{L_-(x)}\\right) = \\sum_{i=1}^{n} w_i x_i$. By the Neyman-Pearson theorem, for any level\nof significance, there exist a and b such that the uniformly most powerful test for that level is:\n$1\\{w^T x > a\\} - 1\\{w^T x < a\\} + Z \\, 1\\{w^T x = a\\}$, where Z is a Bernoulli random variable of\nparameter b over {+1,\u22121}. By symmetry, we must have a = 0 and b = 1/2, as announced. \u25a1\n\nThis result shows that estimating the true answer G(t) reduces to estimating the unknown parameters\n\u03b1 and \u03b8, which is the focus of the paper. Note that the problem of estimating \u03b8 is important in itself,\ndue to the presence of \"spammers\" (i.e., workers with low reliability); a good estimator can be used\nby the crowdsourcing platform to incentivize good workers.\n\n4 Identifiability\n\nEstimating \u03b1 and \u03b8 from X(1), ..., X(t) is not possible unless we have identifiability, namely\nthere cannot exist two distinct sets of parameters \u03b1, \u03b8 and \u03b1\u2032, \u03b8\u2032 under which the distribution of\nX(1), ..., X(t) is the same. Let X \u2208 {\u22121, 0, 1}^n be any sample, for some parameters \u03b1 \u2208 (0, 1]\nand \u03b8 \u2208 [\u22121, 1]^n. 
The parameter \u03b1 is clearly identifiable since \u03b1 = P(X1 \u2260 0). The identifiability\nof \u03b8 is less obvious. Assume for instance that \u03b8i = 0 for all i \u2265 3. It follows from (1) that for any\nx \u2208 {\u22121, 0, 1}^n, with H(x) defined as in the proof of Proposition 1:\n\n$$P(X = x) = H(x) \\times \\begin{cases} 1 + \\theta_1 \\theta_2 & \\text{if } x_1 x_2 = 1, \\\\ 1 - \\theta_1 \\theta_2 & \\text{if } x_1 x_2 = -1, \\\\ 1 & \\text{if } x_1 x_2 = 0. \\end{cases}$$\n\nIn particular, two parameters \u03b8, \u03b8\u2032 such that \u03b81\u03b82 = \u03b8\u20321\u03b8\u20322 and \u03b8i = \u03b8\u2032i = 0 for all i \u2265 3 cannot be\ndistinguished. Similarly, by symmetry, two parameters \u03b8, \u03b8\u2032 such that \u03b8\u2032 = \u2212\u03b8 cannot be distinguished.\n\nLet:\n\n$$\\Theta = \\left\\{ \\theta \\in [-1,1]^n : \\sum_{i=1}^{n} 1\\{\\theta_i \\neq 0\\} \\geq 3, \\; \\sum_{i=1}^{n} \\theta_i > 0 \\right\\}.$$\n\nThe first condition states that there are at least 3 informative workers, the second that the average\nreliability is positive.\n\nProposition 2 Any parameter \u03b8 \u2208 \u0398 is identifiable.\n\nProof. Any parameter \u03b8 \u2208 \u0398 can be expressed as a function of the covariance matrix of X (section\n6 below): the absolute value and the sign of \u03b8 follow from (4) and (5), respectively. \u25a1\n\n5 Lower bound on the minimax error\n\nThe estimation of \u03b1 is straightforward and we here focus on the best estimation of \u03b8 one can\nexpect, assuming \u03b1 is known. Specifically, we derive a lower bound on the minimax error of\nany estimator \u03b8\u0302 of \u03b8. Define $\\|\\hat{\\theta} - \\theta\\|_\\infty = \\max_{i=1,\\ldots,n} |\\hat{\\theta}_i - \\theta_i|$ and for all \u03b8 \u2208 [\u22121, 1]^n,\n$A(\\theta) = \\min_k \\max_{i,j \\neq k} \\sqrt{|\\theta_i \\theta_j|}$ and $B(\\theta) = \\sum_{i=1}^{n} \\theta_i$.\nObserve that \u0398 = {\u03b8 \u2208 [\u22121, 1]^n : A(\u03b8) > 0, B(\u03b8) > 0}. This suggests that the estimation of\n\u03b8 becomes hard when either A(\u03b8) or B(\u03b8) is small. Define for any a, b \u2208 (0, 1), \u0398a,b =\n{\u03b8 \u2208 [\u22121, 1]^n : A(\u03b8) \u2265 a, B(\u03b8) \u2265 b}. We have the following lower bound on the minimax error.\nAs the proof reveals, the parameters a and b characterize the difficulty of estimating the absolute\nvalue and the sign of \u03b8, respectively.\n\nTheorem 1 (Minimax error) Consider any estimator \u03b8\u0302(t) of \u03b8.\nFor any $\\epsilon \\in (0, \\min(a, (1-a)/2, 1/4))$ and \u03b4 \u2208 (0, 1/4), we have\n\n$$\\min_{\\theta \\in \\Theta_{a,b}} P\\left( \\|\\hat{\\theta}(t) - \\theta\\|_\\infty \\geq \\epsilon \\right) \\geq \\delta, \\quad \\forall t \\leq \\max(T_1, T_2),$$\n\nwith $T_1 = c_1 \\frac{1-a}{\\alpha^2 a^4 \\epsilon^2} \\ln\\left(\\frac{1}{4\\delta}\\right)$, $T_2 = c_2 \\frac{(1-a)^4 (n-4)}{\\alpha^2 a^2 b^2} \\ln\\left(\\frac{1}{4\\delta}\\right)$ and c1, c2 > 0 two universal constants.\n\nOutline of proof. The proof is based on an information theoretic argument. Denote by P\u03b8 the\ndistribution of X under parameter \u03b8 \u2208 \u0398, and D(.||.) the Kullback-Leibler (KL) divergence. The main\nelement of proof is lemma 1, where we bound D(P\u03b8\u2032||P\u03b8) for two well chosen pairs of parameters.\nThe pair \u03b8, \u03b8\u2032 in statement (i) is hard to distinguish when a is small, hence it is hard to estimate the\nabsolute value of \u03b8. The pair \u03b8, \u03b8\u2032 of statement (ii) is also hard to distinguish when a or b is small,\nwhich shows that it is difficult to estimate the sign of \u03b8. Proving lemma 1 is involved because of\nthe particular form of distribution P\u03b8, and requires careful manipulations of the likelihood ratio. We\nconclude by reduction to a binary hypothesis test between \u03b8 and \u03b8\u2032 using lemma 2.\n\nLemma 1 (i) Let a \u2208 (0, 1), \u03b8 = (1, a, a, 0, ..., 0) and $\\theta' = (1 - 2\\epsilon, \\frac{a}{1-2\\epsilon}, \\frac{a}{1-2\\epsilon}, 0, \\ldots, 0)$.\nThen: $D(P_{\\theta'} \\| P_\\theta) \\leq \\frac{1}{c_1} \\frac{\\alpha^2 a^4 \\epsilon^2}{1-a}$.\n(ii) Let n > 4, define c = b/(n \u2212 4), and \u03b8 = (a, a, \u2212a, \u2212a, c, ..., c), \u03b8\u2032 = (\u2212a, \u2212a, a, a, c, ..., c).\nThen: $D(P_{\\theta'} \\| P_\\theta) \\leq \\frac{1}{c_2} \\frac{\\alpha^2 a^2 b^2}{(n-4)(1-a)^4}$.\n\nLemma 2 [Tsybakov, 2008, Theorem 2.2] Consider any estimator \u03b8\u0302(t).\nFor any \u03b8, \u03b8\u2032 \u2208 \u0398 with $\\|\\theta - \\theta'\\|_\\infty \\geq 2\\epsilon$ we have:\n\n$$\\min\\left( P_\\theta(\\|\\hat{\\theta}(t) - \\theta\\|_\\infty \\geq \\epsilon), \\; P_{\\theta'}(\\|\\hat{\\theta}(t) - \\theta'\\|_\\infty \\geq \\epsilon) \\right) \\geq \\frac{1}{4} \\exp(-t D(P_{\\theta'} \\| P_\\theta)).$$\n\nRelation with prior work. The lower bound derived in [Zhang et al., 2014][Theorem 3] shows\nthat the minimax error of any estimator \u03b8\u0302 must be greater than $O((\\alpha t)^{-1/2})$. Our lower bound is\nstricter, and shows that the minimax error is in fact greater than $O(a^{-2} \\alpha^{-1} t^{-1/2})$. Another lower\nbound was derived in [Chao and Dengyong, 2015][Theorems 3.4 and 3.5], but this concerns the\nprediction error rate, that is P(\u011c \u2260 G), so that it cannot be easily compared to our result.\n\n6 Triangular estimation\n\nWe here present our estimator. The absolute value of the reliability of each worker k is estimated\nthrough the correlation of her answers with those of the most informative pair i, j \u2260 k. We refer to\nthis algorithm as triangular estimation (TE). The sign of the reliability of each worker is estimated\nin a second step. We use the convention that sign(0) = +.\n\nCovariance matrix. Let X \u2208 {\u22121, 0, 1}^n be any sample, for some parameters \u03b1 \u2208 (0, 1] and\n\u03b8 \u2208 \u0398. We shall see that the parameter \u03b8 could be recovered exactly if the covariance matrix of X\nwere perfectly known. For any i \u2260 j, let Cij be the covariance of Xi and Xj given that XiXj \u2260 0\n(that is, both workers i and j provide an answer). 
In view of (1),\n\n$$C_{ij} = \\frac{E(X_i X_j)}{E(|X_i X_j|)} = \\theta_i \\theta_j. \\qquad (3)$$\n\nIn particular, for any distinct indices i, j, k, $C_{ik} C_{jk} = \\theta_i \\theta_j \\theta_k^2 = C_{ij} \\theta_k^2$. We deduce that, for any\nk = 1, ..., n and any pair i, j \u2260 k such that Cij \u2260 0,\n\n$$\\theta_k^2 = \\frac{C_{ik} C_{jk}}{C_{ij}}. \\qquad (4)$$\n\nNote that such a pair exists for each k because \u03b8 \u2208 \u0398. To recover the sign of \u03b8k, we use the fact that\n$\\theta_k \\sum_{i=1}^{n} \\theta_i = \\theta_k^2 + \\sum_{i \\neq k} C_{ik}$. Since \u03b8 \u2208 \u0398, we get\n\n$$\\mathrm{sign}(\\theta_k) = \\mathrm{sign}\\left( \\theta_k^2 + \\sum_{i \\neq k} C_{ik} \\right). \\qquad (5)$$\n\nThe TE algorithm consists in estimating the covariance matrix to recover \u03b8 from the above expressions.\n\nTE algorithm. At any time t, define\n\n$$\\forall i, j = 1, \\ldots, n, \\quad \\hat{C}_{ij} = \\frac{\\sum_{s=1}^{t} X_i(s) X_j(s)}{\\max\\left( \\sum_{s=1}^{t} |X_i(s) X_j(s)|, 1 \\right)}. \\qquad (6)$$\n\nFor all k = 1, ..., n, find the most informative pair $(i_k, j_k) \\in \\arg\\max_{i \\neq j \\neq k} |\\hat{C}_{ij}|$ and let\n\n$$|\\hat{\\theta}_k| = \\begin{cases} \\sqrt{\\left| \\frac{\\hat{C}_{i_k k} \\hat{C}_{j_k k}}{\\hat{C}_{i_k j_k}} \\right|} & \\text{if } |\\hat{C}_{i_k j_k}(t)| > 0, \\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nNext, define $k^* = \\arg\\max_k \\left| \\hat{\\theta}_k^2 + \\sum_{i \\neq k} \\hat{C}_{ik} \\right|$ and let\n\n$$\\mathrm{sign}(\\hat{\\theta}_k) = \\begin{cases} \\mathrm{sign}\\left( \\hat{\\theta}_{k^*}^2 + \\sum_{i \\neq k^*} \\hat{C}_{i k^*} \\right) & \\text{if } k = k^*, \\\\ \\mathrm{sign}\\left( \\hat{\\theta}_{k^*} \\hat{C}_{k k^*} \\right) & \\text{otherwise.} \\end{cases}$$\n\nComplexity. 
First note that the TE algorithm is a streaming algorithm since \u0108ij(t) can be written\n\n$$\\hat{C}_{ij} = \\frac{M_{ij}}{\\max(N_{ij}, 1)} \\quad \\text{with} \\quad M_{ij} = \\sum_{s=1}^{t} X_i(s) X_j(s) \\quad \\text{and} \\quad N_{ij} = \\sum_{s=1}^{t} |X_i(s) X_j(s)|.$$\n\nThus TE requires O(n^2) memory space (to store the matrices M and N) and has a time complexity\nof O(n^2 ln(n)) per task: O(n^2) operations to update \u0108, O(n^2 ln(n)) operations to sort the entries of\n|\u0108(t)|, O(n^2) operations to compute |\u03b8\u0302|, O(n^2) operations to compute the sign of \u03b8\u0302.\n\nMinimax optimality. The following result shows that the proposed estimator is minimax optimal.\nNamely the sample complexity of our estimator matches the lower bound up to an additive logarithmic\nterm ln(n) and a multiplicative constant.\n\nTheorem 2 Let \u03b8 \u2208 \u0398a,b and denote by \u03b8\u0302(t) the estimator defined above. For any $\\epsilon \\in (0, \\min(\\frac{b}{3}, 1))$\nand \u03b4 \u2208 (0, 1), we have\n\n$$P(\\|\\hat{\\theta}(t) - \\theta\\|_\\infty \\geq \\epsilon) \\leq \\delta, \\quad \\forall t \\geq \\max(T'_1, T'_2),$$\n\nwith $T'_1 = c'_1 \\frac{1}{\\alpha^2 a^4 \\epsilon^2} \\ln\\left(\\frac{6n^2}{\\delta}\\right)$, $T'_2 = c'_2 \\frac{n}{\\alpha^2 a^2 b^2} \\ln\\left(\\frac{4n^2}{\\delta}\\right)$, and c\u20321, c\u20322 > 0 two universal constants.\n\nOutline of proof. Define $\\|\\hat{C} - C\\|_\\infty = \\max_{i,j: i \\neq j} |\\hat{C}_{ij} - C_{ij}|$. The TE estimator is a function of\nthe empirical pairwise correlations (\u0108ij)i,j and the sums $\\sum_{j \\neq i} \\hat{C}_{ij}$. The main difficulty is to prove\nlemma 3, a concentration inequality for $\\sum_{j \\neq i} \\hat{C}_{ij}$.\n\nLemma 3 For all i = 1, ..., n and all \u03b5 > 0,\n\n$$P\\left( \\left| \\sum_{j \\neq i} (\\hat{C}_{ij} - C_{ij}) \\right| \\geq \\varepsilon \\right) \\leq 2 \\exp\\left( -\\frac{\\varepsilon^2 \\alpha^2 t}{30 \\max(B(\\theta)^2, n)} \\right) + 2n \\exp\\left( -\\frac{t \\alpha^2}{8(n-1)} \\right).$$\n\nConsider i fixed. We dissociate the set of tasks answered by each worker from the actual answers\nand the truth. 
Let U = (Uj(t))j,t be i.i.d. Bernoulli random variables with E(Uj(t)) = \u03b1 and\nV = (Vj(t))j,t be independent random variables on {\u22121, 1} with E(Vj(t)) = \u03b8j. One may readily\ncheck that (Xj(t))j,t has the same distribution as (G(t)Uj(t)Vj(t))j,t. Hence, in distribution:\n\n$$\\sum_{j \\neq i} \\hat{C}_{ij} = \\sum_{j \\neq i} \\sum_{s=1}^{t} \\frac{U_i(s) U_j(s) V_i(s) V_j(s)}{N_j} \\quad \\text{with} \\quad N_j = \\sum_{s=1}^{t} U_i(s) U_j(s).$$\n\nWe prove lemma 3 by conditioning with respect to U. Denote by PU the conditional probability\nwith respect to U. Define N = min_{j\u2260i} Nij. We prove that for all \u03b5 \u2265 0:\n\n$$P_U\\left( \\sum_{j \\neq i} (\\hat{C}_{ij} - C_{ij}) \\geq \\varepsilon \\right) \\leq e^{-\\varepsilon^2 / \\sigma^2} \\quad \\text{with} \\quad S = \\sum_{s=1}^{t} \\left( \\sum_{j \\neq i} U_i(s) U_j(s) \\theta_j \\right)^2 \\quad \\text{and} \\quad \\sigma^2 = \\frac{(n-1)N + S}{N^2}.$$\n\nThe quantity \u03c3 is an upper bound on the conditional variance of $\\sum_{j \\neq i} \\hat{C}_{ij}$, which we control by\napplying Chernoff\u2019s inequality to both N and S. We get:\n\n$$P(N \\leq \\alpha^2 t / 2) \\leq (n-1) e^{-\\frac{t \\alpha^2}{8}} \\quad \\text{and} \\quad P(S \\geq 2 t \\alpha^2 \\max(B_i(\\theta)^2, n-1)) \\leq e^{-\\frac{t \\alpha^2}{3(n-1)}}.$$\n\nRemoving the conditioning on U yields the result. We conclude the proof of theorem 2 by linking\nthe fluctuations of \u0108 to those of \u03b8\u0302 in lemma 4.\n\nLemma 4 If (a) $\\|\\hat{C} - C\\|_\\infty \\leq \\varepsilon \\leq A^2(\\theta) \\min(\\frac{1}{2}, \\frac{B(\\theta)}{64})$ and (b) $\\max_i |\\sum_{j \\neq i} (\\hat{C}_{ij} - C_{ij})| \\leq \\frac{A(\\theta) B(\\theta)}{8}$,\nthen $\\|\\hat{\\theta} - \\theta\\|_\\infty \\leq \\frac{24 \\varepsilon}{A^2(\\theta)}$.\n\nRelation with prior work. Our upper bound brings improvement over [Zhang et al., 2014] as\nfollows. Two conditions are required for the upper bound of [Zhang et al., 2014][Theorem 4] to\nhold: (i) it is required that maxi |\u03b8i| < 1, and (ii) the number of workers n must grow with both \u03b4\nand t, and in fact must depend on a and b, so that n has to be large if b is smaller than \u221an. 
Our result\ndoes not require condition (i) to hold. Further there are values of a and b such that condition (ii) is\nnever satisfied, for instance n \u2265 5, $a = \\frac{1}{2}$, $b = \\frac{\\sqrt{n-4}}{2}$ and $\\theta = (a, -a, a, -a, \\frac{b}{n-4}, \\ldots, \\frac{b}{n-4}) \\in \\Theta_{a,b}$.\nFor [Zhang et al., 2014][Theorem 4] to hold, n should satisfy $n \\geq c_3 n \\ln(t^2 n / \\delta)$ with c3 a universal\nconstant (see discussion in the supplement) and for t or 1/\u03b4 large enough no such n exists. It is\nnoted that for such values of a and b, our result remains informative. Our result shows that one can\nobtain a minimax optimal algorithm for crowdsourcing which does not involve any EM step.\n\nThe analysis of [Chao and Dengyong, 2015] also requires n to grow with t and imposes conditions on the\nminimal value of b. Specifically the first and the last condition of [Chao and Dengyong, 2015][Theorem\n3.3] require that n \u2265 ln(t) and that $\\sum_i \\theta_i^2 \\geq 6 \\ln(t)$. Using the previous example (even for t = 3),\nthis translates to $b \\geq 2\\sqrt{n-4}$.\nIn fact, the value b = O(\u221an) seems to mark the transition between \"easy\" and \"hard\" instances\nof the crowdsourcing problem. Indeed, when n is large and b is large with respect to \u221an, then the\nmajority vote outputs the truth with high probability by the Central Limit Theorem.\n\n7 Numerical Experiments\n\nSynthetic data. We consider three instances: (i) n = 50, t = 10^3, \u03b1 = 0.25, \u03b8i = a if i \u2264 n/2 and\n0 otherwise; (ii) n = 50, t = 10^4, \u03b1 = 0.25, \u03b8 = (1, a, a, 0, ..., 0); (iii) n = 50, t = 10^4, \u03b1 = 0.25,\na = 0.9, $\\theta = (a, -a, a, -a, \\frac{b}{n-4}, \\ldots, \\frac{b}{n-4})$.\nInstance (i) is an \"easy\" instance where half of the workers are informative, with A(\u03b8) = a and\nB(\u03b8) = na/2. Instance (ii) is a \"hard\" instance, the difficulty being to estimate the absolute value\nof \u03b8 accurately by identifying the 3 informative workers. Instance (iii) is another \"hard\" instance,\nwhere estimating the sign of the components of \u03b8 is difficult. In particular, one must distinguish \u03b8\nfrom $\\theta' = (-a, a, -a, a, \\frac{b}{n-4}, \\ldots, \\frac{b}{n-4})$, otherwise a large error occurs.\n\nBoth \"hard\" instances (ii) and (iii) are inspired by our derivation of the lower bound and constitute\nthe hardest instances in \u0398a,b. For each instance we average the performance of algorithms on 10^3\nindependent runs and apply a random permutation of the components of \u03b8 before each run. We\nconsider the following algorithms: KOS (the BP algorithm of [Karger et al., 2011]), Maj (majority\nvoting), Oracle (weighted majority voting with optimal weights, the optimal estimator of the\nground truth), RoE (first spectral algorithm of [Dalvi et al., 2013]), EoR (second spectral algorithm\nof [Dalvi et al., 2013]), GKM (spectral algorithm of [Ghosh et al., 2011]), S-EMk (EM algorithm\nwith spectral initialization of [Zhang et al., 2014] with k iterations of EM) and TE (our algorithm).\nWe do not present the estimation error of KOS, Maj and Oracle since these algorithms only predict\nthe ground truth but do not estimate \u03b8 directly.\n\nThe results are shown in Tables 1 and 2, where the best results are indicated in bold. The spectral\nalgorithms RoE, EoR and GKM tend to be outperformed by the other algorithms. To perform well,\nGKM needs \u03b81 to be positive and large (see [Ghosh et al., 2011]); whenever \u03b81 \u2264 0 or |\u03b81| is\nsmall, GKM tends to make a sign mistake causing a large error. Also the analysis of RoE and EoR\nassumes that the task-worker graph is a random D-regular graph (so that the worker-worker matrix\nhas a large spectral gap). Here this assumption is violated and the practical performance suffers\nnoticeably, so that this limitation is not only theoretical. 
KOS performs consistently well, and seems\nimmune to sign ambiguity, see instance (iii). Further, while the analysis of KOS also assumes that\nthe task-worker graph is random D-regular, its practical performance does not seem sensitive to that\nassumption. The performance of S-EM is good except when sign estimation is hard (instance (iii),\nb = 1). This seems due to the fact that the initialization of S-EM (see the algorithm description) is\nnot good in this case. Hence the limitation of b being of order \u221an is not only theoretical but practical\nas well. In fact (combining our results and the ideas of [Zhang et al., 2014]), this suggests a new\nalgorithm where one uses EM with TE as the initial value of \u03b8.\n\nFurther, increasing the number of iterations of EM brings significant gains in some cases and should affect the\nuniversal constants in front of the various error bounds (providing theoretical evidence for this seems\nnon-trivial). TE performs consistently well except for (i) a = 0.3 (which we believe is due to the\nfact that t is relatively small in that instance). In particular when sign estimation is hard TE clearly\noutperforms the competing algorithms. This indeed suggests two regimes for sign estimation: b =\nO(1) (hard regime) and b = O(\u221an) (easy regime).\n\nReal-world data. We next consider 6 publicly available data sets (see [Whitehill et al., 2009,\nZhou et al., 2015] and summary information in Table 3), each consisting of labels provided by workers\nand the ground truth. The density is the average fraction of tasks labeled by a worker, i.e., \u03b1 in our model.\nThe worker degree is the average number of tasks labeled by a worker.\n\nFirst, for data sets with more than 2 possible label values, we split the label values into two groups\nand associate them with \u22121 and +1 respectively. The partition of the labels is given in Table 3.\nSecond, we remove any worker who provides fewer than 10 labels. 
Our preliminary numerical experiments\n(not shown here for concision) show that without this, none of the studied algorithms\neven match the majority consistently. Workers with low degree create noise and (to the best of our\nknowledge) any theoretical analysis of crowdsourcing algorithms assumes that the worker degree\nis sufficiently large. The performance of various algorithms is reported in Table 4. No information\nabout the workers\u2019 reliability is available so we only report the prediction error P(\u011c \u2260 G). Further,\none cannot compare algorithms to the Oracle, so that the main goal is to outperform the majority.\n\nApart from \"Bird\" and \"Web\", none of the algorithms seem to be able to significantly outperform\nthe majority and are sometimes noticeably worse. For \"Web\" which has both the largest number of\nlabels and a high worker degree, there is a significant gain over the majority vote, and TE, despite\nits low complexity, slightly outperforms S-EM and is competitive with KOS and GKM which both\nperform best on this dataset.\n\nInstance\n(i) a = 0.3\n(i) a = 0.9\n(ii) a = 0.55\n(ii) a = 0.95\n(iii) b = 1\n(iii) b = \u221an\n\nRoE\n0.200\n0.274\n0.551\n0.528\n0.253\n0.105\n\nEoR GKM S-EM1\n0.100\n0.131\n0.022\n0.265\n0.045\n0.459\n0.034\n0.522\n0.533\n0.222\n0.075\n0.437\n\n0.146\n0.271\n0.479\n0.541\n0.256\n0.085\n\nS-EM10\n\n0.041\n0.022\n0.044\n0.033\n0.389\n0.030\n\nTE\n\n0.134\n0.038\n0.050\n0.039\n0.061\n0.045\n\nTable 1: Synthetic data: estimation error E(||\u03b8\u0302 \u2212 \u03b8||\u221e).\n\nInstance\n(i) a = 0.3\n(i) a = 0.9\n(ii) a = 0.55\n(ii) a = 0.95\n(iii) b = 1\n(iii) b = \u221an\n\nOracle Maj\n0.298\n0.227\n0.004\n0.046\n0.441\n0.284\n0.419\n0.219\n0.472\n0.181\n0.126\n0.315\n\nKOS\n0.228\n0.004\n0.292\n0.220\n0.185\n0.133\n\nRoE\n0.402\n0.217\n0.496\n0.495\n0.443\n0.266\n\nEoR GKM 
S-EM1\n0.251\n0.398\n0.004\n0.218\n0.284\n0.497\n0.219\n0.496\n0.388\n0.455\n0.284\n0.258\n\n0.374\n0.202\n0.495\n0.483\n0.386\n0.207\n\nS-EM10\n\n0.228\n0.004\n0.285\n0.219\n0.404\n0.127\n\nTE\n\n0.250\n0.004\n0.284\n0.219\n0.192\n0.128\n\nTable 2: Synthetic data: prediction error P(\u011c \u2260 G).\n\nData Set | # Tasks | # Workers | # Labels | Density | Worker Degree | Label Domain\nBird | 108 | 39 | 4,212 | 1 | 108 | {0} vs {1}\nDog | 807 | 109 | 8,070 | 0.09 | 74 | {0,2} vs {1,3}\nDuchenne | 159 | 64 | 1,221 | 0.12 | 19 | {0} vs {1}\nRTE | 800 | 164 | 8,000 | 0.06 | 49 | {0} vs {1}\nTemp | 462 | 76 | 4,620 | 0.13 | 61 | {1} vs {2}\nWeb | 2,653 | 177 | 15,539 | 0.03 | 88 | {1,2,3} vs {4,5}\n\nTable 3: Summary of the real-world datasets.\n\nData Set | Maj | KOS | RoE | EoR | GKM | S-EM1 | S-EM10 | TE\nBird | 0.24 | 0.28 | 0.29 | 0.29 | 0.28 | 0.20 | 0.28 | 0.18\nDog | 0.18 | 0.19 | 0.18 | 0.18 | 0.20 | 0.24 | 0.17 | 0.20\nDuchenne | 0.28 | 0.30 | 0.29 | 0.28 | 0.29 | 0.28 | 0.30 | 0.26\nRTE | 0.10 | 0.50 | 0.50 | 0.89 | 0.49 | 0.32 | 0.16 | 0.38\nTemp | 0.06 | 0.43 | 0.24 | 0.10 | 0.43 | 0.06 | 0.06 | 0.08\nWeb | 0.14 | 0.02 | 0.13 | 0.14 | 0.02 | 0.04 | 0.06 | 0.03\n\nTable 4: Real-world data: prediction error P(\u011c \u2260 G).\n\n8 Conclusion\n\nWe have derived a minimax error lower bound for the crowdsourcing problem and have proposed\nTE, a low-complexity algorithm which matches this lower bound. Our results open several questions\nof interest. First, while recent work has shown that one can obtain strong theoretical guarantees by\ncombining one step of EM with a well-chosen initialization, we have shown that, at least in the case\nof binary labels, one can forgo the EM phase altogether and still obtain both minimax optimality\nand good numerical performance. It would be interesting to know if this is still possible when there\nare more than two possible labels, and also if one can do so using a streaming algorithm.\n\nReferences\n\nPaul S Albert and Lori E Dodd. 
A cautionary note on the robustness of latent class models for\nestimating diagnostic error without a gold standard. Biometrics, 60(2):427\u2013435, 2004.\n\nGao Chao and Zhou Dengyong. Minimax optimal convergence rates for estimating ground truth\nfrom crowdsourced labels. Tech Report, http://arxiv.org/abs/1310.5764, 2015.\n\nNilesh Dalvi, Anirban Dasgupta, Ravi Kumar, and Vibhor Rastogi. Aggregating crowdsourced\nbinary ratings. In Proc. of WWW, 2013.\n\nA. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the\nEM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20\u201328,\n1979.\n\nArpita Ghosh, Satyen Kale, and R. Preston McAfee. Who moderates the moderators?: crowdsourcing\nabuse detection in user-generated content. In Proc. of ACM EC, 2011.\n\nSui L Hui and Steven D Walter. Estimating the error rates of diagnostic tests. Biometrics, pages\n167\u2013171, 1980.\n\nDavid R. Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing\nsystems. In Proc. of NIPS, 2011.\n\nDavid R Karger, Sewoong Oh, and Devavrat Shah. Efficient crowdsourcing for multi-class labeling.\nACM SIGMETRICS Performance Evaluation Review, 41(1):81\u201392, 2013.\n\nDavid R Karger, Sewoong Oh, and Devavrat Shah. Budget-optimal task allocation for reliable\ncrowdsourcing systems. Operations Research, 62(1):1\u201324, 2014.\n\nQiang Liu, Jian Peng, and Alex T Ihler. Variational inference for crowdsourcing. In Proc. of NIPS,\n2012.\n\nShmuel Nitzan and Jacob Paroush. Optimal decision rules in uncertain dichotomous choice situations.\nInternational Economic Review, pages 289\u2013297, 1982.\n\nVikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca\nBogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11:\n1297\u20131322, 2010.\n\nLloyd Shapley and Bernard Grofman. 
Optimizing group judgmental accuracy in the presence of\ninterdependencies. Public Choice, 43(3):329\u2013343, 1984.\n\nPadhraic Smyth, Usama Fayyad, Michael Burl, Pietro Perona, and Pierre Baldi. Inferring ground\ntruth from subjective labelling of venus images. In Proc. of NIPS, 1995.\n\nAlexandre B. Tsybakov. Introduction to non-parametric estimation. Springer, 2008.\n\nDong Wang, Tarek Abdelzaher, Lance Kaplan, and Charu C Aggarwal. Recursive fact-finding: A\nstreaming approach to truth estimation in crowdsourcing applications. In Proc. of IEEE ICDCS,\n2013.\n\nPeter Welinder and Pietro Perona. Online crowdsourcing: rating annotators and obtaining cost-effective\nlabels. In Proc. of IEEE CVPR (Workshops), 2010.\n\nJacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote\nshould count more: Optimal integration of labels from labelers of unknown expertise. In Proc. of\nNIPS, 2009.\n\nYuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. Spectral methods meet EM: A\nprovably optimal algorithm for crowdsourcing. In Proc. of NIPS, 2014.\n\nDengyong Zhou, Qiang Liu, John C Platt, Christopher Meek, and Nihar B Shah. Regularized\nminimax conditional entropy for crowdsourcing. Tech Report, http://arxiv.org/pdf/1503.07240,\n2015.", "award": [], "sourceid": 2272, "authors": [{"given_name": "Thomas", "family_name": "Bonald", "institution": "Telecom ParisTech"}, {"given_name": "Richard", "family_name": "Combes", "institution": "Centrale-Supelec"}]}