{"title": "Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise", "book": "Advances in Neural Information Processing Systems", "page_first": 2035, "page_last": 2043, "abstract": "Modern machine learning-based approaches to computer vision require very large databases of labeled images. Some contemporary vision systems already require on the order of millions of images for training (e.g., Omron face detector). While the collection of these large databases is becoming a bottleneck, new Internet-based services that allow labelers from around the world to be easily hired and managed provide a promising solution. However, using these services to label large databases brings with it new theoretical and practical challenges: (1) The labelers may have wide ranging levels of expertise which are unknown a priori, and in some cases may be adversarial; (2) images may vary in their level of difficulty; and (3) multiple labels for the same image must be combined to provide an estimate of the actual label of the image. Probabilistic approaches provide a principled way to approach these problems. In this paper we present a probabilistic model and use it to simultaneously infer the label of each image, the expertise of each labeler, and the difficulty of each image. On both simulated and real data, we demonstrate that the model outperforms the commonly used “Majority Vote” heuristic for inferring image labels, and is robust to both adversarial and noisy labelers.", "full_text": "Whose Vote Should Count More:\n\nOptimal Integration of Labels from Labelers of\n\nUnknown Expertise\n\nJacob Whitehill, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier Movellan\n\nMachine Perception Laboratory\n\nUniversity of California, San Diego\n\n{ jake, paul, ting, jbergsma, movellan }@mplab.ucsd.edu\n\nLa Jolla, CA, USA\n\nAbstract\n\nModern machine learning-based approaches to computer vision require very large\ndatabases of hand labeled images. 
Some contemporary vision systems already require on the order of millions of images for training (e.g., Omron face detector [9]). New Internet-based services allow a large number of labelers from around the world to collaborate at very low cost. However, using these services brings interesting theoretical and practical challenges: (1) The labelers may have wide ranging levels of expertise which are unknown a priori, and in some cases may be adversarial; (2) images may vary in their level of difficulty; and (3) multiple labels for the same image must be combined to provide an estimate of the actual label of the image. Probabilistic approaches provide a principled way to approach these problems. In this paper we present a probabilistic model and use it to simultaneously infer the label of each image, the expertise of each labeler, and the difficulty of each image. On both simulated and real data, we demonstrate that the model outperforms the commonly used “Majority Vote” heuristic for inferring image labels, and is robust to both noisy and adversarial labelers.\n\n1 Introduction\n\nIn recent years machine learning-based approaches to computer vision have helped to greatly accelerate progress in the field. However, it is now becoming clear that many practical applications require very large databases of hand labeled images. The labeling of very large datasets is becoming a bottleneck for progress. One approach to address this looming problem is to make use of the vast human resources on the Internet. Indeed, projects like the ESP game [17], the Listen game [16], Soylent Grid [15], and reCAPTCHA [18] have revealed the possibility of harnessing human resources to solve difficult machine learning problems. While these approaches use clever schemes to obtain data from humans for free, a more direct approach is to hire labelers online. 
Recent Web tools such as\nAmazon\u2019s Mechanical Turk [1] provide ideal solutions for high-speed, low cost labeling of massive\ndatabases.\nDue to the distributed and anonymous nature of these tools, interesting theoretical and practical\nchallenges arise. For example, principled methods are needed to combine the labels from multiple\nexperts and to estimate the certainty of the current labels. Which image should be labeled (or\nrelabeled) next must also be decided \u2013 it may be prudent, for example, to collect many labels for\neach image in order to increase one\u2019s con\ufb01dence in that image\u2019s label. However, if an image is easy\nand the labelers of that image are reliable, a few labels may be suf\ufb01cient and valuable resources may\nbe used to label other images. In practice, combining the labels of multiple coders is a challenging\nprocess due to the fact that: (1) The labelers may have wide ranging levels of expertise which are\n\n\funknown a priori, and in some cases may be adversarial; (2) images may also vary in their level of\ndif\ufb01culty, in a manner that may also be unknown a priori.\nProbabilistic methods provide a principled way to approach this problem using standard inference\ntools. We explore one such approach by formulating a probabilistic model of the labeling process,\nwhich we call GLAD (Generative model of Labels, Abilities, and Dif\ufb01culties), and using inference\nmethods to simultaneously infer the expertise of each labeler, the dif\ufb01culty of each image, and the\nmost probable label for each image. On both simulated and real-life data, we demonstrate that the\nmodel outperforms the commonly used \u201cMajority Vote\u201d heuristic for inferring image labels, and is\nrobust to both adversarial and noisy labelers.\n\n2 Modeling the Labeling Process\n\nConsider a database of n images, each of which belongs to one of two possible categories of interest\n(e.g., face/non-face; male/female; smile/non-smile; etc.). 
We wish to determine the class label Zj (0 or 1) of each image j by querying m labelers. The observed labels depend on several causal factors: (1) the difficulty of the image; (2) the expertise of the labeler; and (3) the true label. We model the difficulty of image j using the parameter 1/βj ∈ (0, ∞), where βj is constrained to be positive. Here 1/βj → ∞ means the image is so ambiguous that even the most proficient labeler has only a 50% chance of labeling it correctly; 1/βj → 0 means the image is so easy that even the most obtuse labeler will always label it correctly.\nThe expertise of each labeler i is modeled by the parameter αi ∈ (−∞, +∞). Here αi = +∞ means the labeler always labels images correctly; αi = −∞ means the labeler always labels images incorrectly, i.e., he/she can distinguish between the two classes perfectly but always inverts the label, either maliciously or because of a consistent misunderstanding. In this case (αi < 0), the labeler is said to be adversarial. Finally, αi = 0 means that the labeler cannot discriminate between the two classes – his/her labels carry no information about the true image label Zj. Note that we do not require the labelers to be human – labelers can also be, for instance, automatic classifiers. 
Hence, the proposed approach will provide a principled way of combining labels from any combination of human labelers and previously existing machine-based classifiers.\nThe labels given by labeler i to image j (which we call the given labels) are denoted as Lij and, under the model, are generated as follows:\n\np(Lij = Zj | αi, βj) = 1 / (1 + e^(−αi βj))    (1)\n\nThus, under the model, the log odds of an obtained label being correct are a bilinear function of the expertise of the labeler and the inverse difficulty of the image, i.e.,\n\nlog [ p(Lij = Zj) / (1 − p(Lij = Zj)) ] = αi βj    (2)\n\nMore skilled labelers (higher αi) have a higher probability of labeling correctly. As the difficulty 1/βj of an image increases, the probability of the label being correct moves toward 0.5. Similarly, as the labeler’s expertise decreases (lower αi), the chance of correctness likewise drops to 0.5. Adversarial labelers are simply labelers with negative α.\nFigure 1 shows the causal structure of the model. True image labels Zj, labeler accuracy values αi, and image difficulty values βj are sampled from a known prior distribution. These determine the observed labels according to Equation 1. Given a set of observed labels l = {lij}, the task is to infer simultaneously the most likely values of Z = {Zj} (the true image labels) as well as the labeler accuracies α = {αi} and the image difficulty parameters β = {βj}. In the next section we derive the Maximum Likelihood algorithm for inferring these values.\n\n3 Inference\nThe observed labels are samples from the {Lij} random variables. The unobserved variables are the true image labels Zj, the labeler accuracies αi, and the image difficulty parameters 1/βj. 
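The generative process just formalized can be sketched in code. This is a minimal illustration, not the authors' implementation; the function names, array shapes, and use of NumPy are my own assumptions:

```python
import numpy as np

def p_correct(alpha, beta):
    # Eq. 1 for every (labeler, image) pair:
    # p(L_ij = Z_j) = 1 / (1 + exp(-alpha_i * beta_j)).
    return 1.0 / (1.0 + np.exp(-np.outer(alpha, beta)))

def sample_labels(z, alpha, beta, rng):
    # Emit the true label with probability p_correct, the flipped label otherwise.
    p = p_correct(alpha, beta)
    correct = rng.random(p.shape) < p
    return np.where(correct, z[None, :], 1 - z[None, :])

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=2000)               # true image labels
alpha = rng.normal(1.0, 1.0, size=10)           # labeler abilities
beta = np.exp(rng.normal(1.0, 1.0, size=2000))  # inverse image difficulties
labels = sample_labels(z, alpha, beta, rng)     # given labels, shape (10, 2000)
```

Note that α = 0 yields p = 0.5 exactly (an uninformative labeler) and negative α pushes the probability below 0.5 (an adversarial labeler), matching the parameter interpretations above.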
Our goal is to efficiently search for the most probable values of the unobservable variables Z, α and β given the observed data.\n\nFigure 1: Graphical model of image difficulties, true image labels, observed labels, and labeler accuracies. Only the shaded variables are observed.\n\nHere we use the Expectation-Maximization (EM) approach to obtain maximum likelihood estimates of the parameters of interest (the full derivation is in the Supplementary Materials):\nE step: Let the set of all labels given to image j be denoted lj = {lij′ | j′ = j}. Note that not every labeler must label every single image; the index i in lj ranges only over those labelers who labeled image j. We need to compute the posterior probabilities of all zj ∈ {0, 1} given the α, β values from the last M step and the observed labels:\n\np(zj | l, α, β) = p(zj | lj, α, βj) ∝ p(zj | α, βj) p(lj | zj, α, βj) ∝ p(zj) ∏i p(lij | zj, αi, βj)\n\nwhere we noted that p(zj | α, βj) = p(zj) using the conditional independence assumptions from the graphical model.\nM step: We maximize the standard auxiliary function Q, which is defined as the expectation of the joint log-likelihood of the observed and hidden variables (l, Z) given the parameters (α, β), w.r.t. the posterior probabilities of the Z values computed during the last E step:\n\nQ(α, β) = E[ln p(l, z | α, β)] = E[ln ∏j (p(zj) ∏i p(lij | zj, αi, βj))] = Σj E[ln p(zj)] + Σij E[ln p(lij | zj, αi, βj)]\n\nsince the lij are conditionally independent given z, α, β; the expectation is taken over z given the old parameter values αold, βold as estimated during the last E step. Using gradient ascent, we find values of α and β that locally maximize Q.\n\n3.1 Priors on α, β\n\nThe Q function can be modified straightforwardly to handle a prior over each αi and βj by adding a log-prior term for each of these variables. These priors may be useful, for example, if we know that most labelers are not adversarial. In this case, the prior for α can be made very low for α < 0.\nThe prior probabilities are also useful when the ground-truth Z value of particular images is (somehow) known for certain. By “clamping” the Z values (using the prior) for the images on which the true label is known for sure, the model may be able to better estimate the other parameters. The Z values for such images can be clamped by setting the prior probability p(zj) (used in the E step) for these images to be very high towards one particular class. In our implementation we used Gaussian priors (µ = 1, σ = 1) for α. For β, we need a prior that does not generate negative values. To do so we re-parameterized β = e^(β′) and imposed a Gaussian prior (µ = 1, σ = 1) on β′.\n\n3.2 Computational Complexity\n\nThe computational complexity of the E step is linear in the number of images and the total number of labels. For the M step, the values of Q and ∇Q must be computed repeatedly until convergence.1 Computing each function is linear in the number of images, the number of labelers, and the total number of image labels.\nEmpirically, when using the approach on a database of 1 million images that we recently collected and labeled, we found that the EM procedure converged in about 10 minutes using a single core of a Xeon 2.8 GHz processor. 
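The EM procedure described above can be condensed into a short sketch. This is my own illustrative implementation, not the authors' code: it assumes every labeler labels every image, uses a uniform prior p(zj = 1) = 0.5, omits the α, β priors, and substitutes plain gradient ascent with an arbitrary step size for the conjugate gradient optimizer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glad_em(labels, n_iter=50, lr=0.005, grad_steps=25):
    # labels: (m labelers x n images) array of given labels in {0, 1}.
    m, n = labels.shape
    alpha = np.ones(m)        # labeler abilities
    log_beta = np.zeros(n)    # log inverse-difficulties (keeps beta positive)
    for _ in range(n_iter):
        # E step: posterior q_j = p(z_j = 1 | labels, alpha, beta).
        p = sigmoid(np.outer(alpha, np.exp(log_beta)))   # p(l_ij = z_j)
        log_like1 = np.sum(np.where(labels == 1, np.log(p), np.log(1 - p)), axis=0)
        log_like0 = np.sum(np.where(labels == 0, np.log(p), np.log(1 - p)), axis=0)
        q = sigmoid(log_like1 - log_like0)
        # M step: gradient ascent on the auxiliary function Q(alpha, beta).
        for _ in range(grad_steps):
            beta = np.exp(log_beta)
            p = sigmoid(np.outer(alpha, beta))
            # c_ij: posterior probability that the given label l_ij is correct.
            c = np.where(labels == 1, q[None, :], 1 - q[None, :])
            alpha = alpha + lr * ((c - p) * beta[None, :]).sum(axis=1)
            log_beta = log_beta + lr * ((c - p) * alpha[:, None]).sum(axis=0) * beta
    return q, alpha, np.exp(log_beta)

# Toy run: 10 equally reliable labelers (alpha = 2, beta = 1) labeling 80 images.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=80)
flips = rng.random((10, 80)) >= sigmoid(2.0)
labels = np.where(flips, 1 - z[None, :], z[None, :])
q, alpha_hat, beta_hat = glad_em(labels)
```

Thresholding q at 0.5 then recovers the simulated labels; the same loop structure extends to sparse labels by summing only over the observed (i, j) pairs.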
The algorithm is parallelizable and hence this running time could be\nreduced substantially using multiple cores. Real time inference may also be possible if we maintain\nparameters close to the solution that are updated as new labels become available. This would allow\nusing the algorithm in an active manner to choose in real-time which images should be labeled next\nso as to minimize the uncertainty about the image labels.\n\n4 Simulations\n\nHere we explore the performance of the model using a set of image labels generated by the model\nitself. Since, in this case we know the parameters Z, \u03b1, and \u03b2 that generated the observed labels,\nwe can compare them with corresponding parameters estimated using the EM procedure.\nIn particular, we simulated between 4 and 20 labelers, each labeling 2000 images, whose true labels\nZ were either 0 or 1 with equal probability. The accuracy \u03b1i of each labeler was drawn from a normal\ndistribution with mean 1 and variance 1. The inverse-dif\ufb01culty for each image \u03b2j was generated\nby exponentiating a draw from a normal distribution with mean 1 and variance 1. Given these\nlabeler abilities and image dif\ufb01culties, the observed labels lij were sampled according to Equation\n1 using Z. Finally, the EM inference procedure described above was executed to estimate \u03b1, \u03b2, Z.\nThis procedure was repeated 40 times to smooth out variability between trials. On each trial we\ncomputed the correlation between the parameter estimates \u02c6\u03b1, \u02c6\u03b2 and the true parameter values \u03b1, \u03b2.\nThe results (averaged over all 40 experimental runs) are shown in Figure 2. As expected, as the\nnumber of labelers grows, the parameter estimates converge to the true values.\nWe also computed the proportion of label estimates \u02c6Z that matched the true image labels Z. 
We compared the maximum likelihood estimates of the GLAD model to estimates obtained by taking the majority vote as the predicted label. The predictions of the proposed GLAD model were obtained by thresholding at 0.5 the posterior probability of each image’s label being of class 1, given the accuracy and difficulty parameters returned by EM (see Section 3). Results are shown in Figure 2. GLAD makes fewer errors than the majority vote heuristic. The difference between the two approaches is particularly pronounced when the number of labelers per image is small. On many images, GLAD correctly infers the true image label Z even when that Z value was the minority opinion. In essence, GLAD exploits the fact that some labelers are experts (which it infers automatically), and hence their votes should count more on these images than the votes of less skilled labelers.\nModeling Image Difficulty: To explore the importance of estimating image difficulty we performed a simple simulation: Image labels (0 or 1) were assigned randomly (with equal probability) to 1000 images. Half of the images were “hard” and half were “easy.” Fifty simulated labelers labeled all 1000 images. The proportion of “good” to “bad” labelers was 25:1. The probability of correctness for each combination of image difficulty and labeler quality was given by the table below:\n\n1 The libgsl conjugate gradient descent optimizer we used requires both Q and ∇Q.\n\nFigure 2: Left: The accuracies of the GLAD model versus simple voting for inferring the underlying class labels on simulation data. Right: The ability of GLAD to recover the true alpha and beta parameters on simulation data.\n\nLabeler type | Hard images | Easy images\nGood | 0.95 | 1\nBad | 0.54 | 1\n\nWe measured performance in terms of the proportion of correctly estimated labels. 
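To see concretely how an expert vote can outweigh a majority, here is a sketch of the per-image E-step posterior (my own illustrative code and toy numbers, assuming fixed, known α and β):

```python
import numpy as np

def posterior_z1(votes, alpha, beta_j, prior=0.5):
    # p(z_j = 1 | votes) is proportional to p(z_j) * prod_i p(l_ij | z_j, alpha_i, beta_j),
    # where p(l_ij = z_j) = 1 / (1 + exp(-alpha_i * beta_j))  (Eq. 1).
    votes = np.asarray(votes)
    p = 1.0 / (1.0 + np.exp(-np.asarray(alpha) * beta_j))
    like1 = np.prod(np.where(votes == 1, p, 1.0 - p))  # likelihood if z_j = 1
    like0 = np.prod(np.where(votes == 0, p, 1.0 - p))  # likelihood if z_j = 0
    return prior * like1 / (prior * like1 + (1.0 - prior) * like0)

# One skilled labeler (alpha = 3) voting 1 against two weak labelers (alpha = 0.5) voting 0:
post = posterior_z1([1, 0, 0], [3.0, 0.5, 0.5], beta_j=1.0)  # about 0.88
```

Thresholding this posterior at 0.5 sides with the expert minority, whereas Majority Vote would pick class 0.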
We compared three approaches: (1) our proposed method, GLAD; (2) the method proposed in [5], which models labeler ability but not image difficulty; and (3) Majority Vote. The simulations were repeated 20 times and average performance was calculated for the three methods. The results shown below indicate that modeling image difficulty can result in significant performance improvements.\n\nMethod | Error\nGLAD | 4.5%\nMajority Vote | 11.2%\nDawid & Skene [5] | 8.4%\n\n4.1 Stability of EM under Various Starting Points\n\nEmpirically we found that the EM procedure was fairly insensitive to varying the starting point of the parameter values. In a simulation study of 2000 images and 20 labelers, we randomly selected each αi ∼ U[0, 4] and log(βj) ∼ U[0, 3], and EM was run until convergence. Over the 50 simulation runs, the average percent-correct of the inferred labels was 85.74%, and the standard deviation of the percent-correct over all the trials was only 0.024%.\n\n5 Empirical Study I: Greebles\n\nAs a first test-bed for GLAD using real data obtained from the Mechanical Turk, we posted pictures of 100 “Greebles” [6], which are synthetically generated images that were originally created to study human perceptual expertise. Greebles somewhat resemble human faces and have a “gender”: males have horn-like organs that point up, whereas for females the horns point down. See Figure 3 (left) for examples. Each of the 100 Greeble images was labeled for gender (male/female) by 10 different human coders on the Turk. Four Greebles of each gender (separate from the 100 labeled images) were given as examples of each class. Shown at a resolution of 48x48 pixels, the task required careful inspection of the images in order to label them correctly. 
The ground-truth gender values were all known with certainty (since they are rendered objects) and thus provided a means of measuring the accuracy of inferred image labels.\n\n[Figure 2 plots: “Effect of Number of Labelers on Accuracy” (x: Number of Labelers, y: Proportion of Labels Correct; GLAD vs. Majority vote) and “Effect of Number of Labelers on Parameter Estimates” (x: Number of Labelers, y: Correlation; Beta: Spearman Corr., Alpha: Pearson Corr.).]\n\nFigure 3: Left: Examples of Greebles. The top two are “male” and the bottom two are “female.” Right: Accuracy of the inferred labels of the Greeble images, as a function of the number of labels M obtained for each image, using either GLAD or Majority Vote. Results were averaged over 100 experimental runs.\n\nWe studied the effect of varying the number of labels M obtained from different labelers for each image on the accuracy of the inferred Z. Hence, from the 10 labels total we obtained per Greeble image, we randomly sampled 2 ≤ M ≤ 8 labels over all labelers during each experimental trial. On each trial we compared the accuracy of labels Z as estimated by GLAD (using a threshold of 0.5 on p(Z)) to labels as estimated by the Majority Vote heuristic. For each value of M we averaged performance for each method over 100 trials.\nResults are shown in Figure 3 (right). For all values of M we tested, the accuracy of the labels inferred by GLAD is significantly higher than that of Majority Vote (p < 0.01). This means that, in order to achieve the same level of accuracy, fewer labels are needed. Moreover, the variance in accuracy was lower for GLAD than for Majority Vote for all M that were tested, suggesting that the quality of GLAD’s outputs is more stable than that of the heuristic method. Finally, notice how, for the even values of M, the Majority Vote accuracy decreases. 
This may stem from the lack of an optimal decision rule under Majority Vote when an equal number of labelers say an image is Male as say it is Female. GLAD, since it makes its decisions by also taking ability and difficulty into account, does not suffer from this problem.\n\n6 Empirical Study II: Duchenne Smiles\n\nAs a second experiment, we used the Mechanical Turk to label face images containing smiles as either Duchenne or Non-Duchenne. A Duchenne smile (“enjoyment” smile) is distinguished from a Non-Duchenne (“social” smile) through the activation of the Orbicularis Oculi muscle around the eyes, which the former exhibits and the latter does not (see Figure 4 for examples). Distinguishing the two kinds of smiles has applications in various domains including psychology experiments, human-computer interaction, and marketing research. Reliable coding of Duchenne smiles is a difficult task even for certified experts in the Facial Action Coding System.\nWe obtained Duchenne/Non-Duchenne labels for 160 images from 20 different Mechanical Turk labelers; in total, there were 3572 labels. (Hence, each image was labeled a variable number of times.) For ground truth, these images were also labeled by two certified experts in the Facial Action Coding System. According to the expert labels, 58 out of 160 images contained Duchenne smiles. Using the labels obtained from the Mechanical Turk, we inferred the image labels using either GLAD or the Majority Vote heuristic, and then compared them to ground truth.\n\n[Figure 3 (right) plot: “Inferred Label Accuracy of Greeble Images” (x: Number of labels per image, y: Accuracy (% correct); GLAD vs. Majority Vote).]\n\nDuchenne Smiles / Non-Duchenne Smiles\n\nFigure 4: Examples of Duchenne (left) and Non-Duchenne (right) smiles. 
The distinction lies in\nthe activation of Orbicularis Oculi muscle around the eyes, and is dif\ufb01cult to discriminate even for\nexperts.\n\nFigure 5: Accuracy (percent correct) of inferred Duchenne/Non-Duchenne labels using either\nGLAD or Majority Vote under (left) noisy labelers or (right) adversarial labelers. As the number of\nnoise/adversarial labels increases, the performance of labels inferred using Majority Vote decreases.\nGLAD, in contrast, is robust to these conditions.\n\nResults: Using just the raw labels obtained from the Mechanical Turk, the labels inferred using\nGLAD matched the ground-truth labels on 78.12% of the images, whereas labels inferred using\nMajority Vote were only 71.88% accurate. Hence, GLAD resulted in about a 6% performance gain.\n\nSimulated Noisy and Adversarial Labelers: We also simulated noisy and adversarial labeler con-\nditions. It is to be expected, for example, that in some cases labelers may just try to complete the task\nin a minimum amount of time disregarding accuracy. In other cases labelers may misunderstand the\ninstructions, or may be adversarial, thus producing labels that tend to be opposite to the true labels.\nRobustness to such noisy and adversarial labelers is important, especially as the popularity of Web-\nbased labeling tools increases, and the quality of labelers becomes more diverse. To investigate the\nrobustness of the proposed approaches we generated data from virtual \u201clabelers\u201d whose labels were\ncompletely uninformative, i.e., uniformly random. We also added arti\ufb01cial \u201cadversarial\u201d labelers\nwhose labels tended to be the opposite of the true label for each image.\nThe number of noisy labels was varied from 0 to 5000 (in increments of 500), and the number of\nadversarial labels was varied from 0 to 750 (in increments of 250). For each setting, label inference\naccuracy was computed for both GLAD and the Majority Vote method. 
As shown in Figure 5, the accuracy of GLAD-based label inference is much less affected by labeling noise than that of Majority Vote. When adversarial labels are introduced, GLAD automatically infers that some labelers are purposely giving the opposite label and flips their labels. The Majority Vote heuristic, in contrast, has no mechanism to recover from this condition, and its accuracy falls steeply.\n\n[Figure 5 plots: “Accuracy under Noise” (x: Num of Noisy Labels, y: Accuracy) and “Accuracy under Adversarialness” (x: Num of Adversarial Labels, y: Accuracy); GLAD vs. Majority Vote.]\n\n7 Related Work\n\nTo our knowledge, GLAD is the first model in the literature to simultaneously estimate the true label, item difficulty, and coder expertise in an unsupervised and efficient manner.\nOur work is related to the literature on standardized tests, particularly the Item Response Theory (IRT) community (e.g., Rasch [10], Birnbaum [3]). The GLAD model we propose in this paper can be seen as an unsupervised version of previous IRT models for the case in which the correct answers (i.e., labels) are unknown.\nSnow et al. [14] used a probabilistic model similar to Naive Bayes to show that by averaging multiple naive labelers (≤ 10) one can obtain labels as accurate as those of a few expert labelers. Two key differences between their model and GLAD are that: (1) they assume a significant proportion of images have been pre-labeled with ground truth values, and (2) all the images have equal difficulty. As we show in this paper, modeling image difficulty may be very important in some cases. Sheng et al. [12] examine how to identify which images of an image dataset to label again in order to reduce uncertainty in the posterior probabilities of latent class labels.\nDawid and Skene [5] developed a method to handle polytomous latent class variables. 
In their case\nthe notion of \u201cability\u201d is handled using full confusion matrices for each labeler. Smyth, et al [13]\nused a similar approach to combine labels from multiple experts for items with homogeneous levels\nof dif\ufb01culty. Batchelder and Romney [2] infer test answers and test-takers\u2019 abilities simultaneously,\nbut do not estimate item dif\ufb01culties and do not admit adversarial labelers.\nOther approaches employ a Bayesian model of the labeling process that considers both variability\nin labeler accuracies as well as item dif\ufb01culty (e.g. [8, 7, 11]). However, inference in these models\nis based on MCMC which is likely to suffer from high computational expense, and the need to wait\n(arbitrarily long) for parameters to \u201cburn in\u201d during sampling.\n\n8 Summary and Further Research\n\nAn important bottleneck facing the machine learning community is the need for very large datasets\nwith hand-labeled data. Datasets whose scale was unthinkable a few years ago are becoming com-\nmonplace today. The Internet makes it possible for people around the world to cooperate on the\nlabeling of these datasets. However, this makes it unrealistic for individual researchers to obtain the\nground truth of each label with absolute certainty. Algorithms are needed to automatically estimate\nthe reliability of ad-hoc anonymous labelers, the dif\ufb01culty of the different items in the dataset, and\nthe probability of the true labels given the currently available data.\nWe proposed one such system, GLAD, based on standard probabilistic inference on a model of the\nlabeling process. The approach can handle the millions of parameters (one dif\ufb01culty parameter per\nimage, and one expertise parameter per labeler) needed to process large datasets, at little compu-\ntational cost. The model can be used seamlessly to combine labels from both human labelers and\nautomatic classi\ufb01ers. 
Experiments show that GLAD can recover the true data labels more accurately than the Majority Vote heuristic, and that it is highly robust to both noisy and adversarial labelers.\nActive Sampling: One advantage of probabilistic models is that they lend themselves to implementing active methods (e.g., Infomax [4]) for selecting which images should be re-labeled next. We are currently pursuing the development of control policies for optimally choosing whether to obtain more labels for a particular item – so that the inferred Z label for that item becomes more certain – versus obtaining more labels from a particular labeler – so that his/her accuracy α may be better estimated, and all the images that he/she labeled can have their posterior probability estimates of Z improved.\nA software implementation of GLAD is available at http://mplab.ucsd.edu/∼jake.\n\nReferences\n[1] Amazon. Mechanical Turk. http://www.mturk.com.\n[2] W. H. Batchelder and A. K. Romney. Test theory without an answer key. Psychometrika, 53(1):71–92, 1988.\n[3] A. Birnbaum. Some latent trait models and their use in inferring an examinee’s ability. Statistical theories of mental test scores, 1968.\n[4] N. Butko and J. Movellan. I-POMDP: An infomax model of eye movement. In Proceedings of the International Conference on Development and Learning, 2008.\n[5] A. Dawid and A. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979.\n[6] I. Gauthier and M. Tarr. Becoming a “greeble” expert: Exploring mechanisms for face recognition. Vision Research, 37(12), 1997.\n[7] V. Johnson. On Bayesian analysis of multi-rater ordinal data: An application to automated essay grading. Journal of the American Statistical Association, 91:42–51, 1996.\n[8] G. Karabatsos and W. H. Batchelder. 
Markov chain estimation for test theory without an answer key. Psychometrika, 68(3):373–389, 2003.\n[9] Omron. OKAO Vision brochure, July 2008.\n[10] G. Rasch. Probabilistic Models for Some Intelligence and Attainment Tests. Denmark, 1960.\n[11] S. Rogers, M. Girolami, and T. Polajnar. Semi-parametric analysis of multi-rater data. Statistics and Computing, 2009.\n[12] V. Sheng, F. Provost, and P. Ipeirotis. Get another label? Improving data quality and data mining using multiple noisy labelers. In Knowledge Discovery and Data Mining, 2008.\n[13] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of Venus images. In Advances in Neural Information Processing Systems, 1994.\n[14] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 2008.\n[15] S. Steinbach, V. Rabaud, and S. Belongie. Soylent Grid: it’s made of people! In International Conference on Computer Vision, 2007.\n[16] D. Turnbull, R. Liu, L. Barrington, and G. Lanckriet. A Game-based Approach for Collecting Semantic Annotations of Music. In 8th International Conference on Music Information Retrieval (ISMIR), 2007.\n[17] L. von Ahn and L. Dabbish. Labeling Images with a Computer Game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 319–326. ACM Press, New York, NY, USA, 2004.\n[18] L. von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. 
Science, 321(5895):1465, 2008.\n", "award": [], "sourceid": 100, "authors": [{"given_name": "Jacob", "family_name": "Whitehill", "institution": null}, {"given_name": "Ting-fan", "family_name": "Wu", "institution": null}, {"given_name": "Jacob", "family_name": "Bergsma", "institution": null}, {"given_name": "Javier", "family_name": "Movellan", "institution": null}, {"given_name": "Paul", "family_name": "Ruvolo", "institution": null}]}