{"title": "Learning from the Wisdom of Crowds by Minimax Entropy", "book": "Advances in Neural Information Processing Systems", "page_first": 2195, "page_last": 2203, "abstract": "An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem.", "full_text": "Learning from the Wisdom of Crowds by Minimax\n\nEntropy\n\nDengyong Zhou, John C. Platt, Sumit Basu, and Yi Mao\n\nMicrosoft Research\n\n1 Microsoft Way, Redmond, WA 98052\n\n{denzho,jplatt,sumitb,yimao}@microsoft.com\n\nAbstract\n\nAn important way to make large training sets is to gather noisy labels from crowds\nof nonexperts. We propose a minimax entropy principle to improve the quality\nof these labels. Our method assumes that labels are generated by a probability\ndistribution over workers, items, and labels. By maximizing the entropy of this\ndistribution, the method naturally infers item confusability and worker expertise.\nWe infer the ground truth by minimizing the entropy of this distribution, which we\nshow minimizes the Kullback-Leibler (KL) divergence between the probability\ndistribution and the unknown truth. We show that a simple coordinate descent\nscheme can optimize minimax entropy. 
Empirically, our results are substantially\nbetter than previously published methods for the same problem.\n\n1\n\nIntroduction\n\nThere is an increasing interest in using crowdsourcing to collect labels for machine learning [19,\n6, 21, 17, 20, 10, 13, 12]. Currently, many companies provide crowdsourcing services. Amazon\nMechanical Turk (MTurk) [2] and CrowdFlower [4] are perhaps the most well-known ones. An\nadvantage of crowdsourcing is that we can obtain a large number of labels at the low cost of pennies\nper label. However, these workers are not experts, so the labels collected from them are often fairly\nnoisy. A fundamental challenge in crowdsourcing is inferring ground truth from noisy labels by a\ncrowd of nonexperts.\nWhen each item is labeled several times by different workers, a straightforward approach is to use\nthe most common label as the true label. From reported experimental results on real crowdsourcing\ndata [19] and our own experience, majority voting performs signi\ufb01cantly better on average than\nindividual workers. However, majority voting considers each item independently. When many items\nare simultaneously labeled, it is reasonable to assume that the performance of a worker is consistent\nacross different items. This assumption underlies the work of Dawid and Skene [5, 18, 19, 11, 17],\nwhere each worker is associated with a probabilistic confusion matrix that generates her labels. Each\nentry of the matrix indicates the probability that items in one class are labeled as another. Given the\nobserved responses, the true labels for each item and the confusion matrices for each worker can be\njointly estimated by a maximum likelihood method. The optimization can be implemented by the\nexpectation-maximization (EM) algorithm [7].\nDawid and Skene\u2019s method works well in practice. 
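To make this baseline concrete, the Dawid and Skene estimator described above can be sketched in a few lines (our own illustration, not the authors' code; the function name and the -1 convention for missing labels are ours):

```python
import numpy as np

def dawid_skene(z, n_iter=50):
    # z: (m, n) integer labels in {0..c-1}; -1 marks an unlabeled worker-item pair
    m, n = z.shape
    c = z.max() + 1
    # initialize the posterior over true labels with per-item vote fractions
    y = np.zeros((n, c))
    for j in range(n):
        ks = z[:, j][z[:, j] >= 0]
        y[j] = np.bincount(ks, minlength=c) + 1e-9
        y[j] /= y[j].sum()
    for _ in range(n_iter):
        # M-step: per-worker confusion matrices and class priors
        conf = np.full((m, c, c), 1e-9)
        for i in range(m):
            for j in range(n):
                if z[i, j] >= 0:
                    conf[i, :, z[i, j]] += y[j]
        conf /= conf.sum(axis=2, keepdims=True)
        prior = y.mean(axis=0)
        # E-step: posterior over the true label of each item
        logy = np.log(prior)[None, :].repeat(n, axis=0)
        for i in range(m):
            for j in range(n):
                if z[i, j] >= 0:
                    logy[j] += np.log(conf[i, :, z[i, j]])
        y = np.exp(logy - logy.max(axis=1, keepdims=True))
        y /= y.sum(axis=1, keepdims=True)
    return y
```

The E- and M-steps alternate exactly as in the EM treatment of the confusion-matrix model; only the smoothing constants are our additions.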
However, their method only contains a per-worker probabilistic confusion model of generating labels. In this paper, we assume a separate probabilistic distribution for each worker-item pair. We propose a novel minimax entropy principle to jointly estimate the distributions and the ground truth given the observed labels by workers in Section 2. The theoretical justification of minimum entropy is given in Section 2.1. To prevent overfitting, we relax the minimax entropy optimization in Section 3. We describe an easy-to-implement technique to carry out the minimax program in Section 4 and link minimax entropy to a principle of objective measurements in Section 5. Finally, we present superior experimental results on real-world crowdsourcing data in Section 6.\n\n[Figure 1: Left: the m × n table of observed labels zij (workers in rows, items in columns). Right: the m × n table of underlying distributions πij. Highlights on both tables indicate that rows and columns of the distributions are constrained by sums over observations.]\n\n2 Minimax Entropy Principle\n\nWe propose a model illustrated in Figure 1. Each row corresponds to a crowdsourced worker indexed by i (from 1 to m). Each column corresponds to an item to be labeled, indexed by j (from 1 to n). Each item has an unobserved label represented as a vector yjl, which is 1 when item j is in class l (from 1 to c), and 0 otherwise. More generally, we can treat yjl as the probability that item j is in class l. We observe a matrix of labels zij by workers. The label matrix can also be represented as a tensor zijk, which is 1 when worker i labels item j as class k, and 0 otherwise. 
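In this tensor notation, the majority-voting baseline is a single reduction over the label tensor (a sketch; the function name is ours):

```python
import numpy as np

# z: binary tensor of shape (m, n, c); z[i, j, k] = 1 iff worker i
# labeled item j as class k (all zeros if worker i skipped item j).
def majority_vote(z):
    votes = z.sum(axis=0)                          # (n, c): votes per item per class
    return votes / votes.sum(axis=1, keepdims=True)  # soft labels y[j, l]

z = np.zeros((3, 2, 2))
z[0, 0, 0] = z[1, 0, 0] = z[2, 0, 1] = 1  # item 0: two votes class 0, one vote class 1
z[0, 1, 1] = z[1, 1, 1] = z[2, 1, 1] = 1  # item 1: unanimous class 1
```

The normalized vote counts are exactly the "yjl as a probability" reading of the labels described above.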
We assume that zij are drawn from πij, which is the distribution for worker i to generate a label for item j. Again, πij can also be represented as a tensor πijk, which is the probability that worker i labels item j as class k. Our method will estimate yjl from the observed zij.\nWe specify the form of πij through the maximum entropy principle, where the constraints on the maximum entropy combine the best ideas from previous work. Majority voting suggests that we should be constraining the πij per column: the empirical observation of the number of votes per class per item ∑i zijk should match ∑i πijk. Dawid and Skene's method suggests that we should be constraining the πij per row: the empirical confusion matrix per worker ∑j yjl zijk should match ∑j yjl πijk. We thus have the following maximum entropy model for πij given yjl:\n\nmax_π  −∑i ∑j ∑k πijk ln πijk  (1)\ns.t.  ∑i πijk = ∑i zijk, ∀j, k,  (1a)\n∑j yjl πijk = ∑j yjl zijk, ∀i, k, l,  (1b)\n∑k πijk = 1, ∀i, j,  πijk ≥ 0, ∀i, j, k.\n\nWe propose that, to infer yjl, we should choose yjl to minimize the entropy in Equation (1). Intuitively, making πij “peaky” means that zij is the least random given yjl. We make this intuition rigorous in Section 2.1. 
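The two sets of empirical statistics constrained above can be computed directly from the observed tensor (an illustrative sketch; the function names are ours):

```python
import numpy as np

# z: (m, n, c) one-hot worker labels; y: (n, c) current label estimates.
def item_vote_counts(z):
    # column statistics: votes per class per item, sum_i z[i, j, k]
    return z.sum(axis=0)                      # shape (n, c)

def worker_confusion_counts(z, y):
    # row statistics: per-worker confusion counts, entry [i, l, k] = sum_j y[j, l] z[i, j, k]
    return np.einsum('jl,ijk->ilk', y, z)     # shape (m, c, c)
```

Under the maximum entropy model, these are the quantities that ∑i πijk and ∑j yjl πijk must reproduce.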
Thus, the inference for yjl can be expressed by a minimax entropy program:\n\nmin_y max_π  −∑i ∑j ∑k πijk ln πijk  (2)\ns.t.  ∑i πijk = ∑i zijk, ∀j, k,  (2a)\n∑j yjl πijk = ∑j yjl zijk, ∀i, k, l,  (2b)\n∑k πijk = 1, ∀i, j,  πijk ≥ 0, ∀i, j, k,\n∑l yjl = 1, ∀j,  yjl ≥ 0, ∀j, l.\n\n2.1 Justification for Minimum Entropy\n\nNow we justify the principle of choosing yjl by minimizing entropy. Think of yjl as a set of parameters to the worker-item label models πij. The goal in choosing the yjl is to select πij that are as close as possible to the true distributions π*ij.\nTo find a principle to choose the yjl, assume that we have access to the row and column measurements on the true distributions π*ij. That is, assume that we know the true values of the column measurements φjk = ∑i π*ijk and row measurements ϕikl = ∑j yjl π*ijk, for a chosen set of yjl values. Knowing these true row and column measurements, we can apply the maximum entropy principle to generate distributions πij:\n\nmax_π  −∑i ∑j ∑k πijk ln πijk  s.t.  ∑i πijk = φjk, ∀j, k,  ∑j yjl πijk = ϕikl, ∀i, k, l.  (3)\n\nLet DKL(· ‖ ·) denote the KL divergence between two distributions. We can choose yjl to minimize a loss of πij with respect to π*ij given by\n\nℓ(π*, π) = ∑i ∑j DKL(π*ij ‖ πij).  (4)\n\nThe minimum loss can be attained by choosing yjl to minimize the entropy of the maximum entropy distributions πij. 
This can be shown by writing the Lagrangian of program (3):\n\nL = −∑i ∑j ∑k πijk ln πijk + ∑j ∑k τjk ∑i (πijk − π*ijk) + ∑i ∑k ∑l σikl ∑j yjl (πijk − π*ijk) + ∑i ∑j λij (∑k πijk − 1),\n\nwhere the newly introduced variables τjk and σikl are the Lagrange multipliers. For a solution to be optimal, the Karush-Kuhn-Tucker (KKT) conditions must be satisfied [3]. Thus,\n\n∂L/∂πijk = −ln πijk − 1 + λij + ∑l yjl (τjk + σikl) = 0, ∀i, j, k,\n\nwhich can be rearranged as\n\nπijk = exp[∑l yjl (τjk + σikl) + λij − 1], ∀i, j, k.  (5)\n\nFor being a probability measure, the variables πijk have to satisfy\n\n∑k πijk = ∑k exp[∑l yjl (τjk + σikl) + λij − 1] = 1, ∀i, j.  (6)\n\nEliminating λij by jointly considering Equations (5) and (6), we obtain a labeling model in the exponential family:\n\nπijk = exp[∑l yjl (τjk + σikl)] / ∑s exp[∑l yjl (τjs + σisl)], ∀i, j, k.  (7)\n\nPlugging Equation (7) into (4) and performing some algebraic manipulations, we prove\n\nTheorem 2.1 Let πij be the maximum entropy distributions in (3). 
Then,\n\nℓ(π*, π) = ∑i ∑j ∑k (π*ijk ln π*ijk − πijk ln πijk).\n\nThe second term is the only term that depends on yjl. Therefore, we should choose yjl to minimize the entropy of the maximum entropy distributions.\nThe labeling model expressed by Equation (7) has a natural interpretation. For each worker i, the multiplier set {σikl} is a measure of her expertise, while for each item j, the multiplier set {τjk} is a measure of its confusability. A worker correctly labels an item either because she has good expertise or because the item is not that confusing. When the item or worker parameters are shifted by an arbitrary constant, the probability given by Equation (7) does not change. The redundancy of the constraints in (2a) causes the redundancy of the parameters.\n\n3 Constraint Relaxation\n\nIn real crowdsourcing applications, each item is usually labeled only a few times. Moreover, a worker usually only labels a small subset of items rather than all of them. In such cases, it is unreasonable to expect that the constraints in (2a) hold for the true underlying distributions πij. As in the literature of regularized maximum entropy [14, 1, 9], we relax the optimization problem to prevent overfitting:\n\nmin_y max_{π,ξ,ζ}  −∑i ∑j ∑k πijk ln πijk − ∑j ∑k ξ²jk/(2αj) − ∑i ∑k ∑l ζ²ikl/(2βi)  (8)\ns.t.  ∑i (πijk − zijk) = ξjk, ∀j, k,  (8a)\n∑j yjl (πijk − zijk) = ζikl, ∀i, k, l,  (8b)\n∑k πijk = 1, ∀i, j,  πijk ≥ 0, ∀i, j, k,\n∑l yjl = 1, ∀j,  yjl ≥ 0, ∀j, l,\n\nwhere αj and βi are regularization parameters. 
It is obvious that program (8) is reduced to program (2) when the slack variables ξjk and ζikl are set to zero. The two ℓ2-norm based regularization terms in the objective function force the slack variables to be not far away from zero. Other vector or matrix norms, such as the ℓ1-norm and the trace norm, can be applied as well [14, 1, 9]. We choose the ℓ2-norm only for the sake of simplicity in computation.\nThe justification for minimum entropy in Section 2.1 can be extended to the regularized minimax entropy formulation (8) with minor modifications. Instead of knowing the exact marginals, we need to choose πij based on noisy marginals:\n\nφjk = ∑i π*ijk + ξ*jk, ∀j, k,  ϕikl = ∑j yjl π*ijk + ζ*ikl, ∀i, k, l.\n\nWe thus maximize the regularized entropy subject to the relaxed constraints:\n\n∑i πijk + ξjk = φjk, ∀j, k,  ∑j yjl πijk + ζikl = ϕikl, ∀i, k, l.  (9)\n\nLemma 3.1 To be the regularized maximum entropy distributions subject to (9), πij must be represented as in Equation (7). Moreover, we should have ξjk = αjτjk, ζikl = βiσikl.\n\nProof The first part of the result can be verified as before. 
By using the labeling model in Equation (7), the Lagrangian of the regularized maximum entropy program can be written as\n\nL = ∑i ∑j ln ∑s exp[∑l yjl (τjs + σisl)] − ∑j ∑k ξ²jk/(2αj) − ∑i ∑k ∑l ζ²ikl/(2βi) + ∑j ∑k τjk [−∑i π*ijk + (ξjk − ξ*jk)] + ∑i ∑k ∑l σikl [−∑j yjl π*ijk + (ζikl − ζ*ikl)].\n\nFor fixed τjk and σikl, maximizing the Lagrange dual over ξjk and ζikl provides the proof.\n\nBy Lemma 3.1 and some algebraic manipulations, we obtain\n\nTheorem 3.2 Let πij be the regularized maximum entropy distributions subject to (9). Then,\n\nℓ(π*, π) = ∑i ∑j ∑k π*ijk ln π*ijk − ∑i ∑j ∑k πijk ln πijk − ∑j ∑k ξ²jk/αj − ∑i ∑k ∑l ζ²ikl/βi + ∑j ∑k ξ*jk ξjk/αj + ∑i ∑k ∑l ζ*ikl ζikl/βi.  (10)\n\nWe cannot minimize the loss by minimizing the right side of Equation (10) since the random noise is unknown. However, we can consider minimizing an upper bound instead. 
Note that\n\nξ*jk ξjk ≤ (ξ*²jk + ξ²jk)/2, ∀j, k,  ζ*ikl ζikl ≤ (ζ*²ikl + ζ²ikl)/2, ∀i, k, l.  (11)\n\nDenote by Ω(π, ξ, ζ) the objective function of the regularized minimax entropy program (8). Substituting the inequalities in (11) into Equation (10), we have\n\nℓ(π*, π) ≤ Ω(π, ξ, ζ) − Ω(π*, ξ*, ζ*).  (12)\n\nSo minimizing the regularized maximum entropy leads to minimizing an upper bound of the loss.\n\n4 Optimization Algorithm\n\nA typical approach to constrained optimization is to convert the primal problem to its dual form. By Lemma 3.1, the Lagrangian of program (8) can be written as\n\nL = −∑j ln { ∏i exp[∑k ∑l yjl zijk (τjk + σikl)] / [∑s exp ∑l yjl (τjs + σisl)]^(∑k zijk) } + ∑j ∑k αj τ²jk/2 + ∑i ∑k ∑l βi σ²ikl/2.\n\nThe dual problem minimizes L subject to the constraints Δ = {yjl | ∑l yjl = 1, ∀j, yjl ≥ 0, ∀j, l}. It can be solved by coordinate descent with the variables being split into two groups: {yjl} and {τjk, σikl}. It is easy to check that, when the variables in one group are fixed, the optimization problem on the variables in the other group is convex. When the yjl are restricted to be {0, 1}, that is, deterministic labels, the coordinate descent procedure can be simplified. 
Let\n\npjl = ∏i exp[∑k zijk (τjk + σikl)] / [∑s exp(τjs + σisl)]^(∑k zijk).\n\nFor any set of real-valued numbers {μjl | ∑l μjl = 1, ∀j, μjl > 0, ∀j, l}, we have the inequality\n\n∑j ln { ∏i exp[∑k ∑l yjl zijk (τjk + σikl)] / [∑s exp ∑l yjl (τjs + σisl)]^(∑k zijk) }\n= ∑j ln ∑l yjl pjl  (deterministic labels)\n= ∑j ln ∑l μjl (yjl pjl / μjl)\n≥ ∑j ∑l μjl ln (yjl pjl / μjl)  (Jensen's inequality)\n= ∑j ∑l μjl ln(yjl pjl) − ∑j ∑l μjl ln μjl.\n\nPlugging the last line into the Lagrangian L, we obtain an upper bound of L, called F. It can be shown that we must have yjl = μjl at any stationary point of F. Our optimization algorithm is a coordinate descent minimization of this F [15, 7]. We initialize yjl with majority vote in Equation (13). In each iteration step, we first optimize over τjk and σikl in (14a), which can be solved by any convex optimization procedure, and next optimize over yjl using a simple closed form in (14b). The optimization over yjl is the same as applying Bayes' theorem where the result from the last iteration is considered as a prior. 
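The full coordinate descent can be sketched end-to-end (a minimal illustration, not the authors' implementation: plain gradient steps stand in for "any convex optimization procedure" on the inner problem, and the function name, learning rate, and iteration counts are ours):

```python
import numpy as np

def _lse(a, axis):
    # log-sum-exp with keepdims, for numerical stability
    mx = a.max(axis=axis, keepdims=True)
    return mx + np.log(np.exp(a - mx).sum(axis=axis, keepdims=True))

def minimax_entropy_crowds(z, alpha, beta, outer=10, inner=200, lr=0.2):
    # z: (m, n, c) one-hot labels; alpha: (n,), beta: (m,) regularizers
    m, n, c = z.shape
    labeled = z.sum(axis=2)                        # (m, n) 0/1 mask
    votes = z.sum(axis=0)                          # (n, c)
    y = votes / votes.sum(axis=1, keepdims=True)   # majority-vote initialization
    tau = np.zeros((n, c))
    sigma = np.zeros((m, c, c))                    # sigma[i, k, l]
    for _ in range(outer):
        for _ in range(inner):                     # gradient steps for the inner solve
            s = tau[None, :, :, None] + sigma[:, None, :, :]   # (m, n, c, c)
            p = np.exp(s - _lse(s, axis=2))        # softmax over assigned label k
            w = labeled[:, :, None, None] * y[None, :, None, :] * p
            g_tau = w.sum(axis=(0, 3)) - votes + alpha[:, None] * tau
            g_sigma = (w.sum(axis=1) - np.einsum('jl,ijk->ikl', y, z)
                       + beta[:, None, None] * sigma)
            tau -= lr * g_tau
            sigma -= lr * g_sigma
        # closed-form label update: Bayes rule with the previous y as a prior
        s = tau[None, :, :, None] + sigma[:, None, :, :]
        loglik = (np.einsum('ijk,ijkl->jl', z, s)
                  - (labeled[:, :, None]
                     * np.squeeze(_lse(s, axis=2), axis=2)).sum(axis=0))
        y = y * np.exp(loglik - loglik.max(axis=1, keepdims=True))
        y /= y.sum(axis=1, keepdims=True)
    return y, tau, sigma
```

The two gradients are the derivatives of the dual objective with respect to τjk and σikl: model-expected counts minus observed counts plus the regularization term, so setting them to zero recovers the relaxed marginal-matching constraints.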
This algorithm can be shown to produce only deterministic labels.\n\nAlgorithm 1 Minimax Entropy Learning from Crowds\ninput: {zijk} ∈ {0, 1}^(m×n×c), {αj} ∈ R^n_+, {βi} ∈ R^m_+\ninitialization:\n\ny^0_jl = ∑i zijl / ∑k ∑i zijk, ∀j, l  (13)\n\nfor t = 1, 2, . . .\n\n{τ^t_jk, σ^t_ikl} = arg min_{τ,σ} ∑j ∑l y^(t−1)_jl ∑i [ (∑k zijk) log ∑s exp(τjs + σisl) − ∑k zijk (τjk + σikl) ] + ∑j ∑k αj τ²jk/2 + ∑i ∑k ∑l βi σ²ikl/2  (14a)\n\ny^t_jl ∝ y^(t−1)_jl ∏i exp[∑k zijk (τ^t_jk + σ^t_ikl)] / [∑s exp(τ^t_js + σ^t_isl)]^(∑k zijk), ∀j, l  (14b)\n\noutput: {y^t_jl}\n\n5 Measurement Objectivity Principle\n\nThe measurement objectivity principle can be roughly stated as follows: (1) a comparison of labeling confusability between two items should be independent of which particular workers are included for the comparison; (2) symmetrically, a comparison of labeling expertise between two workers should be independent of which particular items are included for the comparison. The first statement is about the objectivity of item confusability. The second statement is about the objectivity of worker expertise. In what follows, we mathematically define the measurement objectivity principle. 
For deterministic labels, we show that the labeling model in Equation (7) can be recovered from the measurement objectivity principle.\nFrom Equation (7), given item j in class l, the probability that worker i labels it as class k is\n\nπijkl = exp(τjk + σikl) / ∑s exp(τjs + σisl).  (15)\n\nAssume that a worker i has labeled two items j and j', both of which are from the same class l. With respect to the given worker i, for each item, we measure the confusability for class k by\n\nρijk = πijkl / πijll,  ρij'k = πij'kl / πij'll.  (16)\n\nFor comparing the item confusabilities, we compute a ratio between them. To maintain the objectivity of confusability, the ratio should not depend on whichever worker is involved in the comparison. Hence, given another worker i', we should have\n\n(πijkl / πijll) / (πij'kl / πij'll) = (πi'jkl / πi'jll) / (πi'j'kl / πi'j'll).  (17)\n\nIt is straightforward to verify that the labeling model in Equation (15) indeed satisfies the objectivity requirement given by Equation (17). We can further show that a labeling model which satisfies Equation (17) has to be expressed by Equation (15). 
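That verification is easy to spot-check numerically (a small sketch with random parameters; the names and shapes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = rng.normal(size=(5, 3))       # item-confusability parameters tau[j, k]
sigma = rng.normal(size=(4, 3, 3))  # worker-expertise parameters sigma[i, k, l]

def pi(i, j, k, l):
    # Equation (15): probability that worker i labels item j (true class l) as k
    s = tau[j] + sigma[i, :, l]
    e = np.exp(s - s.max())
    return (e / e.sum())[k]

def ratio(i, j, jp, k, l):
    # confusability comparison of items j and jp, as seen by worker i
    return (pi(i, j, k, l) / pi(i, j, l, l)) / (pi(i, jp, k, l) / pi(i, jp, l, l))
```

Algebraically the double ratio reduces to exp(τjk − τjl − τj'k + τj'l), which contains no worker parameters, so any two workers give the same value.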
Let us rewrite Equation (17) as\n\nπijkl / πijll = (πij'kl / πij'll) · (πi'jkl / πi'jll) · (πi'j'll / πi'j'kl).\n\nWithout loss of generality, choose i' = 0 and j' = 0 as the fixed references such that\n\nπijkl / πijll = (πi0kl / πi0ll) · (π0jkl / π0jll) · (π00ll / π00kl).  (18)\n\nAssume that the referenced worker 0 chooses a class uniformly at random for the referenced item 0. So we have π00ll = π00kl = 1/c. Equation (18) implies πijkl ∝ πi0kl π0jkl. Reparameterizing with πi0kl = exp(σikl) and π0jkl = exp(τjk) (note that l is dropped since it is determined by j), we have πijkl ∝ exp(τjk + σikl). The labeling model in Equation (15) has been recovered.\n\n[Figure 2: Sample images of four breeds of dogs from the Stanford dogs dataset: (a) Norfolk Terrier, (b) Norwich Terrier, (c) Irish Wolfhound, (d) Scottish Deerhound.]\n\nSymmetrically, we can also start from the objectivity of worker expertise to recover the labeling model in (15). Assume that two workers i and i' have labeled a common item j which is from class l. With respect to the given item j, for each worker, we measure the confusion from class l to k by\n\nρijk = πijkl / πijll,  ρi'jk = πi'jkl / πi'jll.  (19)\n\nFor comparing the worker expertises, we compute a ratio between them. To maintain the objectivity of expertise, the ratio should not depend on whichever item is involved in the comparison. 
Hence, given another item j' in class l, we should have\n\n(πijkl / πijll) / (πi'jkl / πi'jll) = (πij'kl / πij'll) / (πi'j'kl / πi'j'll).  (20)\n\nWe can see that Equation (20) is actually just a rearrangement of Equation (17).\n\n6 Experimental Validation\n\nWe compare our method with majority voting and Dawid & Skene's method [5] on two real crowdsourcing tasks. One is multiclass image labeling, and the other is web search relevance judging.\n\n6.1 Image Labeling\n\nWe chose the images of 4 breeds of dogs from the Stanford dogs dataset [8]: Norfolk Terrier (172), Norwich Terrier (185), Irish Wolfhound (218), and Scottish Deerhound (232) (see Figure 2). The numbers of the images for each breed are in the parentheses. There are 807 images in total. We submitted them to MTurk, and received the labels from 109 MTurk workers. A worker labeled an image at most once, and each image was labeled 10 times. It is difficult to evaluate a worker if she labeled only a few images. We thus only consider the workers who labeled at least 40 images, which yields a label set that contains 7354 labels by 52 workers. Each image has at least 4 labels and around 95% of the images have at least 8 labels. The average accuracy of the workers is 70.60%. The best worker achieved an accuracy of 88.24% but labeled only 68 images. The worker who labeled the most images labeled 345 of them and achieved an accuracy of 68.99%. The average worker confusion matrix between breeds is shown in Table 2. As expected, it consists of two blocks. One block contains Norfolk Terrier and Norwich Terrier, and the other block contains Irish Wolfhound and Scottish Deerhound. 
For our method, the regularization parameters are set as αj = 100/(number of labels for item j), βi = 100/(number of labels by worker i). The performance of various methods on this image labeling task is summarized in Table 1. For this problem, our method is somewhat better than Dawid and Skene's method.\n\n6.2 Web Search Relevance Judging\n\nIn another experiment, we asked workers to rate a set of 2665 query-URL pairs on a relevance rating scale from 1 to 5. The larger the rating, the more relevant the URL. The true labels were derived by using consensus from 9 experts. The noisy labels were provided by 177 nonexpert workers. Each pair was judged by around 6 workers, and each worker judged a subset of the pairs. The average accuracy of workers is 37.05%. Seventeen workers have an accuracy of 0, and they judged at most 7 pairs. The worker who judged the most judged 1225 pairs and achieved an accuracy of 76.73%. For our method, the regularization parameters are set as αj = 200/(number of labels for item j), βi = 200/(number of labels by worker i). The performance of various methods on this relevance judging task is summarized in Table 1. In this case, our method is substantially better.\n\nTable 1: Accuracy of methods (%)\n\nMethod            Dogs    Web\nMinimax Entropy   88.05   84.63\nDawid & Skene     83.98   84.14\nMajority Voting   73.07   82.09\nAverage Worker    70.60   37.05\n\nTable 2: Average worker confusion (%)\n\n          Norfolk  Norwich  Irish   Scottish\nNorfolk    71.04    27.35    1.03    0.58\nNorwich    31.99    66.71    1.13    0.18\nIrish       1.19     0.55   69.35   28.91\nScottish    1.20     0.38   26.77   71.65\n\n7 Related Work\n\nThis paper can be regarded as a natural extension to Dawid and Skene's work [5], discussed in Section 1. Our approach can be reduced to Dawid and Skene's by setting the regularization parameters to be αj = ∞, βi = 0. 
The essential difference between our work and Dawid and Skene\u2019s work is\nthat, in addition to worker expertise, we also take item confusability into account.\nIn computer vision, a minimax entropy method was proposed for estimating the probability density\nof certain visual patterns such as textures [22]. The authors compute empirical marginal distributions\nthrough various features, then construct a density model that can reproduce all empirical marginal\ndistributions. Among all models satisfying the constraints, the one with maximum entropy is pre-\nferred. However, one wants to select the features which are most informative: the constructed model\nshould approximate the underlying density by minimizing a KL divergence. The authors formulate\nthe combined density estimation and feature selection as a minimax entropy problem.\nThe measurement objectivity principle is inspired by the Rasch model [16], used to design and ana-\nlyze psychological and educational measurements. In the Rasch model, given an examinee and a test\nitem, the probability of a correct response is modeled as a logistic function of the difference between\nthe examinee ability and the item dif\ufb01culty. Rasch de\ufb01ned \u201cspeci\ufb01c objectivity\u201d: the comparison of\nany two subjects can be carried out in such a way that no other parameters are involved than those\nof the two subjects. The speci\ufb01c objectivity property of the Rasch model comes from the algebraic\nseparation of examinee and item parameters. If the probability of a correct response is modeled with\nother forms, such as a logistic function of the ratio between the examinee ability and the item dif\ufb01-\nculty [21], objective measurements cannot be achieved. 
The most fundamental difference between the Rasch model and our work is that we must infer ground truth, rather than take it as given.\n\n8 Conclusion\n\nWe have proposed a minimax entropy principle for estimating the true labels from the judgements of a crowd of nonexperts. We have also shown that the labeling model derived from the minimax entropy principle uniquely satisfies an objectivity principle for measuring worker expertise and item confusability. Experimental results on real-world crowdsourcing data demonstrate that the proposed method estimates ground truth more accurately than previously proposed methods. The presented framework can be easily extended. For example, in the web search experiment, the multilevel relevance scale is treated as multiclass. By taking the ordinal property of ratings into account, the accuracy may be further improved. The framework could be extended to real-valued labels. A detailed discussion on those topics is beyond the scope of this paper.\n\nAcknowledgments\n\nWe thank Daniel Hsu, Xi Chen, Chris Burges and Chris Meek for helpful discussions, and Gabriella Kazai for generating the web search dataset.\n\nReferences\n[1] Y. Altun and A. Smola. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of the 19th Annual Conference on Learning Theory, 2006.\n[2] Amazon Mechanical Turk. https://www.mturk.com/mturk.\n[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[4] CrowdFlower. http://crowdflower.com/.\n[5] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, 28(1):20–28, 1979.\n[6] O. Dekel and O. Shamir. Vox populi: Collecting high-quality labels from a crowd. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.\n[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. 
Maximum likelihood from incomplete data via\n\nthe EM algorithm. Journal of the Royal Statistical Society, 39(1):1\u201338, 1977.\n\n[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei.\n\nImageNet: A large-scale hi-\nerarchical image database. In Proceedings of the IEEE Conference on Computer Vision and\nPattern Recognition, pages 248\u2013255, 2009.\n\n[9] M. Dudik, S. J. Phillips, and R. E. Schapire. Maximum entropy density estimation with gener-\nalized regularization and an application to species distribution modeling. Journal of Machine\nLearning Research, 8:1217\u20131260, 2007.\n\n[10] S. Ertekin, H. Hirsh, and C. Rudin. Approximating the wisdom of the crowd. In Proceedings\n\nof the Workshop on Computational Social Science and the Wisdom of Crowds, 2011.\n\n[11] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In\n\nProceedings of the ACM SIGKDD Workshop on Human Computation, pages 64\u201367, 2010.\n\n[12] E. Kamar, S. Hacker, and E. Horvitz. Combining human and machine intelligence in large-\nIn Proceedings of the 11th International Conference on Autonomous\n\nscale crowdsourcing.\nAgents and Multiagent Systems, pages 467\u2013474, 2012.\n\n[13] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In\n\nAdvances in Neural Information Processing Systems 24, pages 1953\u20131961, 2011.\n\n[14] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In\n\nAdvances in Neural Information Processing Systems 14, pages 447\u2013454, 2001.\n\n[15] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justi\ufb01es incremental, sparse,\nIn M. I. Jordan, editor, Learning in Graphical Models, pages 355\u2013368.\n\nand other variants.\nKluwer Academic, Dordrecht, MA, 1998.\n\n[16] G. Rasch. On general laws and the meaning of measurement in psychology. 
In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, volume 4, pages 321–333, Berkeley, CA, 1961.\n[17] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.\n[18] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of Venus images. In Advances in Neural Information Processing Systems, pages 1085–1092, 1995.\n[19] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, 2008.\n[20] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pages 2424–2432, 2010.\n[21] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043, 2009.\n[22] S. C. Zhu, Y. N. Wu, and D. B. Mumford. Minimax entropy principle and its applications to texture modeling. Neural Computation, 9:1627–1660, 1997.\n", "award": [], "sourceid": 4490, "authors": [{"given_name": "Dengyong", "family_name": "Zhou", "institution": null}, {"given_name": "Sumit", "family_name": "Basu", "institution": null}, {"given_name": "Yi", "family_name": "Mao", "institution": null}, {"given_name": "John", "family_name": "Platt", "institution": null}]}