{"title": "Bayesian Bias Mitigation for Crowdsourcing", "book": "Advances in Neural Information Processing Systems", "page_first": 1800, "page_last": 1808, "abstract": "Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the {\\it effects} of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the {\\it sources} of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics.", "full_text": "Bayesian Bias Mitigation for Crowdsourcing\n\nFabian L. Wauthier\n\nUniversity of California, Berkeley\n\nflw@cs.berkeley.edu\n\nMichael I. Jordan\n\nUniversity of California, Berkeley\njordan@cs.berkeley.edu\n\nAbstract\n\nBiased labelers are a systemic problem in crowdsourcing, and a comprehensive\ntoolbox for handling their responses is still being developed. A typical crowd-\nsourcing application can be divided into three steps: data collection, data cura-\ntion, and learning. At present these steps are often treated separately. 
We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the sources of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics.

1 Introduction

Crowdsourcing is becoming an increasingly important methodology for collecting labeled data, as demonstrated among others by Amazon Mechanical Turk, reCAPTCHA, Netflix, and the ESP game. Motivated by the promise of a wealth of data that was previously impractical to gather, researchers have focused in particular on Amazon Mechanical Turk as a platform for collecting label data [11, 12]. Unfortunately, the data collected from crowdsourcing services is often very dirty: unhelpful labelers may provide incorrect or biased responses that can have major, uncontrolled effects on learning algorithms. Bias may be caused by personal preference, systematic misunderstanding of the labeling task, lack of interest, or varying levels of competence. Further, as soon as malicious labelers try to exploit incentive schemes in the data collection cycle, yet more forms of bias enter.

The typical crowdsourcing pipeline can be divided into three main steps: 1) Data collection.
The researcher farms the labeling tasks out to a crowdsourcing service for annotation and possibly adds a small set of gold standard labels. 2) Data curation. Since labels from the crowd are contaminated by errors and bias, some filtering is applied to curate the data, possibly using the gold standard provided by the researcher. 3) Learning. The final model is learned from the curated data.

At present these steps are often treated as separate. The data collection process is often viewed as a black box which can only be minimally controlled. Although the potential for active learning to make crowdsourcing much more cost effective and goal driven has been appreciated, research on the topic is still in its infancy [4, 9, 17]. Similarly, data curation is in practice often still performed as a preprocessing step, before feeding the data to a learning algorithm [6, 8, 10, 11, 12, 14]. We believe that the lack of systematic solutions to these problems can make crowdsourcing brittle in situations where labelers are arbitrarily biased or even malicious, such as when tasks are particularly ambiguous/hard or when opinions or ratings are solicited.

Our goal in the current paper is to show how crowdsourcing can be leveraged more effectively by treating the overall pipeline within a Bayesian framework. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC) as a way to achieve this. BBMC makes two main contributions.

The first is a flexible latent feature model that describes each labeler's idiosyncrasies through multiple shared factors and allows us to combine data curation and learning (steps 2 and 3 above) into one inferential computation. Most of the literature accounts for the effects of labeler bias by assuming a single, true latent labeling from which labelers report noisy observations of some kind [2, 3, 4, 6, 8, 9, 10, 11, 15, 16, 17, 18].
This assumption is inappropriate when labels are solicited on subjective or ambiguous tasks (ratings, opinions, and preferences) or when learning must proceed in the face of arbitrarily biased labelers. We believe that an unavoidable and necessary extension of crowdsourcing allows multiple distinct (yet related) "true" labelings to co-exist, but that at any one time we may be interested in learning about only one of these "truths." Our BBMC framework achieves this by modeling the sources of labeler bias through shared random effects.

Next, we want to perform active learning in this model to actively query labelers, thus integrating step 1 with steps 2 and 3. Since our model requires Gibbs sampling for inference, a straightforward application of active learning is infeasible: each active learning step relies on many inferential computations and would trigger a multitude of subordinate Gibbs samplers to be run within one large Gibbs sampler. Our second contribution is a new methodology for solving this problem. The basic idea is to approximate the stationary distribution of a perturbed Markov chain using that of an unperturbed chain. We specialize this idea to active learning in our model and show that the computations are efficient and that the resulting active learning strategy substantially outperforms other active learning schemes.

The paper is organized as follows: We discuss related work in Section 2. In Section 3 we propose the latent feature model for labelers and in Section 4 we discuss the inference procedure that combines data curation and learning. Then we present a general method to approximate the stationary distribution of perturbed Markov chains and apply it to derive an efficient active learning criterion in Section 5.
In Section 6 we present comparative results and we draw conclusions in Section 7.

2 Related Work

Relevant work on active learning in multi-teacher settings has been reported in [4, 9, 17]. Sheng et al. [9] use the multiset of current labels with a random forest label model to score which task to next solicit a repeat label for. The quality of the labeler providing the new label does not enter the selection process. In contrast, Donmez et al. [4] actively choose the labeler to query next using a formulation based on interval estimation, utilizing repeated labelings of tasks. The task to label next is chosen separately from the labeler. In contrast, our BBMC framework can perform meaningful inferences even without repeated labelings of tasks and treats the choices of which labeler to query on which task as a joint choice in a Bayesian framework. Yan et al. [17] account for the effects of labeler bias through a coin-flip observation model that filters a latent label assignment, which in turn is modeled through a logistic regression. As in [4], the labeler is chosen separately from the task by solving two optimization problems. In other work on data collection strategies, Wais et al. [14] require each labeler to first pass a screening test before they are allowed to label any more data. In a similar manner, reputation systems of various forms are used to weed out historically unreliable labelers before collecting data.

Consensus voting among multiple labels is a commonly used data curation method [12, 14]. It works well when low levels of bias or noise are expected but becomes unreliable when labelers vary greatly in quality [9]. Earlier work on learning from variable-quality teachers was revisited by Smyth et al. [10], who looked at estimating the unknown true label for a task from a set of labelers of varying quality without external gold standard signal.
They used an EM strategy to iteratively estimate the true label and the quality of the labelers. The work was extended to a Bayesian formulation by Raykar et al. [8], who assign latent variables to labelers capturing their mislabeling probabilities. Ipeirotis et al. [6] pointed out that a biased labeler who systematically mislabels tasks is still more useful than a labeler who reports labels at random. A method is proposed that separates low quality labelers from high quality, but biased, labelers. Dekel and Shamir [3] propose a two-step process. First, they filter labelers by how far they disagree from an estimated true label and then retrain the model on the cleaned data. They give a generalization analysis for anticipated performance. In a similar vein, Dekel and Shamir [2] show that, under some assumptions, restricting each labeler's influence on a learned model can control the effect of low quality or malicious labelers. Together with [8, 16, 18], [2] and [3] are among the recent lines of research to combine data curation and learning. Work has also focused on using gold standard labels to determine labeler quality. Going beyond simply counting tasks on which labelers disagree with the gold standard, Snow et al. [11] estimate labeler quality in a Bayesian setting by comparing to the gold standard.

Lastly, collaborative filtering has looked extensively at completing sparse matrices of ratings [13]. Given some gold standard labels, collaborative filtering methods could in principle also be used to curate data represented by a sparse label matrix. However, collaborative filtering generally does not combine this inference with the learning of a labeler-specific model for prediction (step 3).
Also, with the exception of [19], active learning has not been studied in the collaborative filtering setting.

3 Modeling Labeler Bias

In this section we specify a Bayesian latent feature model that accounts for labeler bias and allows us to combine data curation and learning into a single inferential calculation. For ease of exposition we will focus on binary classification, but our method can be generalized. Suppose we solicited labels for n tasks from m labelers. In practical settings it is unlikely that a task is labeled by more than 3–10 labelers [14]. Let task descriptions x_i ∈ R^d, i = 1, ..., n, be collected in the matrix X. The label responses are recorded in the matrix Y so that y_{i,l} ∈ {−1, 0, +1} denotes the label given to task i by labeler l. The special label 0 denotes that a task was not labeled. A researcher is interested in learning a model that can be used to predict labels for new tasks. When consensus is lacking among labelers, our desideratum is to predict the labels that the researcher (or some other expert) would have assigned, as opposed to labels from an arbitrary labeler in the crowd. In this situation it makes sense to stratify the labelers in some way. To facilitate this, the researcher r provides gold standard labels in column r of Y for a small subset of the tasks. Loosely speaking, the gold standard allows our model to curate the data by softly combining labels from those labelers whose responses will be useful in predicting r's remaining labels. It is important to note that our model is entirely symmetric in the roles of the researcher and labelers. If instead we were interested in predicting labels for labeler l, we would treat column l as containing the gold standard labels. The researcher r is just another labeler, the only distinction being that we wish to learn a model that predicts r's labels.
To simplify our presentation, we will accordingly refer to labelers in the crowd and the researcher occasionally just as "labelers," indexed by l, and only use the distinguishing index r when necessary. We account for each labeler l's idiosyncrasies by assigning a parameter β_l ∈ R^d to l and modeling labels y_{i,l}, i = 1, ..., n, through a probit model p(y_{i,l}|x_i, β_l) = Φ(y_{i,l} x_i^⊤ β_l), where Φ(·) is the standard normal CDF. This section describes a joint Bayesian prior on the parameters β_l that allows for parameter sharing; two labelers that share parameters have similar responses. In the context of this model, the two-step process of data curation and learning a model that predicts r's labels is reduced to posterior inference on β_r given X and Y. Inference softly integrates labels from relevant labelers, while at the same time allowing us to predict r's remaining labels.

3.1 Latent feature model

Labelers are not independent, so it makes sense to impose structure on the set of β_l's. Specifically, each vector β_l is modeled as the sum of a set of latent factors that are shared across the population. Let z_l be a latent binary vector for labeler l whose component z_{l,b} indicates whether the latent factor γ_b ∈ R^d contributes to β_l. In principle, our model allows for an infinite number of distinct factors (i.e., z_l is infinitely long), as long as only a finite number of those factors is active (i.e., Σ_{b=1}^∞ z_{l,b} < ∞). Let γ = (γ_b)_{b=1}^∞ be the concatenation of the factors γ_b. Given a labeler's vector z_l and the factors γ we define the parameter β_l = Σ_{b=1}^∞ z_{l,b} γ_b.

For multiple labelers we let the infinitely long matrix Z = (z_1, ..., z_m)^⊤ collect the vectors z_l and define the index set of all observed labels L = {(i, l) : y_{i,l} ≠ 0}, so that the likelihood is

    p(Y|X, γ, Z) = ∏_{(i,l)∈L} p(y_{i,l}|x_i, γ, z_l) = ∏_{(i,l)∈L} Φ(y_{i,l} x_i^⊤ β_l).    (1)

To complete the model we need to specify priors for γ and Z. We define the prior distribution of each γ_b to be a zero-mean Gaussian, γ_b ∼ N(0, σ²I), and let Z be governed by an Indian Buffet Process (IBP), Z ∼ IBP(α), parameterized by α [5]. The IBP is a stochastic process on infinite binary matrices consisting of the vectors z_l. A central property of the IBP is that with probability one, a sampled matrix Z contains only a finite number of nonzero entries, thus satisfying our requirement that Σ_{b=1}^∞ z_{l,b} < ∞. In the context of our model this means that when working with finite data, with probability one only a finite set of features is active across all labelers. To simplify notation in subsequent sections, we use this observation and collapse the infinite matrix Z and vector γ to finite dimensional equivalents. From now on, we think of Z as the finite matrix having all zero-columns removed. Similarly, we think of γ as having all blocks γ_b corresponding to zero-columns in the original matrix Z removed. With probability one, the number of columns K(Z) of Z is finite, so we may write β_l = Σ_{b=1}^{K(Z)} z_{l,b} γ_b ≜ Z_l^⊤ γ, with Z_l = z_l ⊗ I the Kronecker product of z_l and I.

4 Inference: Data Curation and Learning

We noted before that our model combines data curation and learning in a single inferential computation. In this section we lay out the details of a Gibbs sampler for achieving this.
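As a concrete reference for the inference that follows, the probit likelihood above and the shared-factor construction β_l = Σ_b z_{l,b} γ_b can be sketched in a few lines of NumPy/SciPy. This is a toy illustration under our own conventions (array shapes, function names), not the authors' code:

```python
import numpy as np
from scipy.stats import norm

def labeler_params(Z, gamma):
    """beta_l = sum_b z_{l,b} * gamma_b for every labeler l.
    Z: (m, K) binary feature matrix; gamma: (K, d) matrix of factors."""
    return Z @ gamma  # (m, d): row l is beta_l

def log_likelihood(X, Y, Z, gamma):
    """Probit log likelihood of Eq. (1). Y[i, l] in {-1, 0, +1}; 0 = unlabeled."""
    beta = labeler_params(Z, gamma)   # (m, d)
    margins = X @ beta.T              # (n, m): entry (i, l) is x_i^T beta_l
    obs = Y != 0                      # the index set L of observed labels
    return norm.logcdf(Y[obs] * margins[obs]).sum()
```

Two labelers with identical rows of Z share all factors and therefore have identical parameters, which is the mechanism by which inference can pool their labels.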
Given a task j which was not labeled by r (and possibly by no other labeler), we need the predictive probability

    p(y_{j,r} = +1|X, Y) = ∫ p(y_{j,r} = +1|x_j, β_r) p(β_r|X, Y) dβ_r.    (2)

To approximate this probability we need to gather samples from the posterior p(β_r|Y, X). Equivalently, since β_r = Z_r^⊤ γ, we need samples from the posterior p(γ, z_r|Y, X). Because latent factors can be shared across multiple labelers, the posterior will softly absorb label information from labelers whose latent factors tend to be similar to those of the researcher r. Thus, Bayesian inference on p(β_r|Y, X) automatically combines data curation and learning by weighting label information through an inferred sharing structure. Importantly, the posterior is informative even when no labeler in the crowd labeled any of the tasks the researcher labeled.

4.1 Gibbs sampling

For Gibbs sampling in the probit model one commonly augments the likelihood in Eq. (1) with intermediate random variables T = {t_{i,l} : y_{i,l} ≠ 0}. The generative model for the label y_{i,l} given x_i, γ and z_l first samples t_{i,l} from a Gaussian N(β_l^⊤ x_i, 1). Conditioned on t_{i,l}, the label is then defined as y_{i,l} = 2·1[t_{i,l} > 0] − 1. Figure 1(a) summarizes the augmented graphical model by letting β denote the collection of β_l variables. We are interested in sampling from p(γ, z_r|Y, X). The Gibbs sampler for this lives in the joint space of T, γ, Z and samples iteratively from the three conditional distributions p(T|X, γ, Z), p(γ|X, Z, T) and p(Z|γ, X, Y).
The different steps are:

Sampling T given X, γ, Z: We independently sample the elements of T given X, γ, Z from a truncated normal as

    (t_{i,l}|X, γ, Z) ∼ N^{y_{i,l}}(t_{i,l}|γ^⊤ Z_l x_i, 1),    (3)

where we use N^{−1}(t|μ, 1) and N^{+1}(t|μ, 1) to indicate the density of the negative- and positive-orthant-truncated normal with mean μ and variance 1, respectively, evaluated at t.

Sampling γ given X, Z, T: Straightforward calculations show that conditional sampling of γ given X, Z, T follows a multivariate Gaussian

    (γ|X, Z, T) ∼ N(γ|μ, Σ),    (4)

where

    Σ^{−1} = I/σ² + Σ_{(i,l)∈L} Z_l x_i x_i^⊤ Z_l^⊤,    μ = Σ · Σ_{(i,l)∈L} Z_l x_i t_{i,l}.    (5)

Figure 1: (a) A graphical model of the augmented latent feature model. Each node corresponds to a collection of random variables in the model. (b) A schematic of our approximation scheme. The top chain indicates an unperturbed Markov chain, the lower a perturbed Markov chain. Rather than sampling from the lower chain directly (dashed arrows), we transform samples from the top chain to approximate samples from the lower (wavy arrows).

Sampling Z given γ, X, Y: Finally, for inference on Z given γ, X, Y we may use techniques outlined in [5]. We are interested in performing active learning in our model, so it is imperative to keep the conditional sampling calculations as compact as possible. One simple way to achieve this is to work with a finite-dimensional approximation to the IBP: We constrain Z to be an m × K matrix, assigning each labeler at most K active latent features. This is not a substantial limitation; in practice the truncated IBP often performs comparably, and for K → ∞ it converges in distribution to the full IBP [5]. Let m_{−l,b} = Σ_{l′≠l} z_{l′,b} be the number of labelers, excluding l, with feature b active. Define β_l(z_{l,b}) = z_{l,b} γ_b + Σ_{b′≠b} z_{l,b′} γ_{b′} as the parameter β_l either specifically including or excluding γ_b. Now if we let z_{−l,b} be column b of Z, excluding the element z_{l,b}, then updated elements of Z can be sampled one by one as

    p(z_{l,b}|z_{−l,b}, γ, X, Y) ∝ p(z_{l,b}|z_{−l,b}) ∏_{i : y_{i,l}≠0} Φ(y_{i,l} x_i^⊤ β_l(z_{l,b})),    (6)

    p(z_{l,b} = 1|z_{−l,b}) = (m_{−l,b} + α/K) / (n + α/K).    (7)

After reaching approximate stationarity, we collect samples (γ^s, Z^s), s = 1, ..., S, from the Gibbs sampler as they are generated. We then compute samples from p(β_r|Y, X) by writing β_r^s = Z_r^{s⊤} γ^s.

5 Active Learning

The previous section outlined how, given a small set of gold standard labels from r, the remaining labels can be predicted via posterior inference on p(β_r|Y, X). In this section we take an active learning approach [1, 7] to incrementally add labels to Y so as to quickly learn about β_r while reducing data acquisition costs. Active learning allows us to guide the data collection process through model inferences, thus integrating the data collection, data curation and learning steps of the crowdsourcing pipeline. We envision a unified system that automatically asks for more labels from those labelers on those tasks that are most useful in inferring β_r. This is in contrast to [9], where labelers cannot be targeted with tasks.
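The three conditional updates of the Section 4 sampler can be combined into a single illustrative sweep. The sketch below is our own dense-loop rendering under assumed shapes (γ stored flat as a (K·d,) vector, Z_l x_i computed as a Kronecker product), with the truncated-IBP prior written exactly as printed in Eq. (7); it is not the authors' implementation:

```python
import numpy as np
from scipy.stats import norm, truncnorm

def gibbs_sweep(X, Y, Z, gamma_flat, sigma2, alpha, rng):
    """One sweep of the augmented Gibbs sampler (illustrative sketch)."""
    n, d = X.shape
    m, K = Z.shape
    obs = [(i, l) for i in range(n) for l in range(m) if Y[i, l] != 0]

    # Step 1: T | X, gamma, Z -- orthant-truncated normals, Eq. (3).
    T = {}
    for i, l in obs:
        mu = np.kron(Z[l], X[i]) @ gamma_flat        # gamma^T Z_l x_i
        lo, hi = (-mu, np.inf) if Y[i, l] > 0 else (-np.inf, -mu)
        T[i, l] = truncnorm.rvs(lo, hi, loc=mu, scale=1.0, random_state=rng)

    # Step 2: gamma | X, Z, T -- the Gaussian of Eqs. (4)-(5).
    prec = np.eye(K * d) / sigma2
    rhs = np.zeros(K * d)
    for i, l in obs:
        v = np.kron(Z[l], X[i])                      # Z_l x_i
        prec += np.outer(v, v)
        rhs += v * T[i, l]
    cov = np.linalg.inv(prec)
    gamma_flat = rng.multivariate_normal(cov @ rhs, cov)

    # Step 3: each z_{l,b} | rest -- Eqs. (6)-(7), truncated IBP prior.
    for l in range(m):
        for b in range(K):
            logp = []
            for val in (0, 1):
                Z[l, b] = val
                beta_l = np.kron(Z[l], np.eye(d)) @ gamma_flat  # Z_l^T gamma
                lik = sum(norm.logcdf(Y[i, l] * X[i] @ beta_l)
                          for i, l2 in obs if l2 == l)
                # Prior as printed in Eq. (7), with n in the denominator.
                p1 = (Z[:, b].sum() - val + alpha / K) / (n + alpha / K)
                logp.append(np.log(p1 if val else 1.0 - p1) + lik)
            prob1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))
            Z[l, b] = int(rng.random() < prob1)
    return Z, gamma_flat
```

A practical implementation would vectorize the loops and cache the rank-one precision updates; the sketch only mirrors the structure of the three conditionals.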
It is also unlike [4], since we can let labelers be arbitrarily unhelpful, and differs from [17], which assumes a single latent truth.

A well-known active learning criterion popularized by Lindley [7] is to next label the task which maximizes the prior-posterior reduction in entropy of an inferential quantity of interest. The original formulation has been generalized beyond entropy to arbitrary utility functionals U(·) of the updated posterior probability [1]. The functional U(·) is a model parameter that can depend on the type of inferences we are interested in. In our particular setup, we wish to infer the parameter β_r to predict labels for the researcher r. Suppose we chose to solicit a label for task i′ from labeler l′, which produced label y_{i′,l′}. The utility of this observation is U(p(β_r|y_{i′,l′})). The average utility of receiving a label on task i′ from labeler l′ is I((i′, l′), p(β_r)) = E(U(p(β_r|y_{i′,l′}))), where the expectation is taken with respect to the predictive label probabilities p(y_{i′,l′}|x_{i′}) = ∫ p(y_{i′,l′}|x_{i′}, β_{l′}) p(β_{l′}) dβ_{l′}. Active learning chooses the pair (i′, l′) which maximizes I((i′, l′), p(β_r)). If we want to choose the next task for the researcher to label, we constrain l′ = r. To query the crowd we let l′ ≠ r. Similarly, we can constrain i′ to any particular value or subset of interest. For the following discussion we let U(p(β_r|y_{i′,l′})) = ||E_{p(β_r)}(β_r) − E_{p(β_r|y_{i′,l′})}(β_r)||₂ be the ℓ₂ norm of the difference in means of β_r.
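For one candidate task, this expected mean-shift score can be approximated from posterior samples. The sketch below uses simple importance reweighting of the samples for the hypothetical posteriors; this is our own stand-in for scoring (the paper's perturbed-chain approximation comes in Section 5.1), shown here for the case l′ = r so that the samples of β_{l′} and β_r coincide:

```python
import numpy as np
from scipy.stats import norm

def expected_utility(beta_samples, x):
    """Approximate I((i', l'), p(beta_r)) for one candidate task x.
    beta_samples: (S, d) posterior draws of beta_r (= beta_l' when l' = r).
    For each hypothetical label y, samples are reweighted by the probit
    likelihood Phi(y x^T beta), giving the updated posterior mean."""
    prior_mean = beta_samples.mean(axis=0)
    score = 0.0
    for y in (-1.0, 1.0):
        w = norm.cdf(y * beta_samples @ x)   # Phi(y x^T beta^s)
        p_y = w.mean()                       # predictive p(y | x)
        post_mean = (w[:, None] * beta_samples).sum(axis=0) / w.sum()
        score += p_y * np.linalg.norm(prior_mean - post_mean)
    return score
```

As expected, a task whose features produce decisive probit margins shifts the posterior mean more, and therefore scores higher, than a nearly uninformative task.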
Picking the task that shifts the posterior mean the most is similar in spirit to the common criterion of maximizing the Kullback-Leibler divergence between the prior and posterior.

5.1 Active learning for MCMC inference

A straightforward application of active learning is impractical using Gibbs sampling, because to score a single task-labeler pair (i′, l′) we would have to run two Gibbs samplers (one for each of the two possible labels) in order to approximate the updated posterior distributions. Suppose we started with k task-labeler pairs that active learning could choose from. Depending on the number of selections we wish to perform, we would have to run k ≲ g ≲ k² Gibbs samplers within the topmost Gibbs sampler of Section 4. Clearly, such a scoring approach is not practical. To solve this problem, we propose a general purpose strategy to approximate the stationary distribution of a perturbed Markov chain using that of an unperturbed Markov chain. The approximation allows efficient active learning in our model that outperforms naïve scoring both in speed and quality.

The main idea can be summarized as follows. Suppose we have two Markov chains, p(β_r^t|β_r^{t−1}) and p̂(β̂_r^t|β̂_r^{t−1}), the latter of which is a slight perturbation of the former. Denote the stationary distributions by p_∞(β_r) and p̂_∞(β̂_r), respectively. If we are given the stationary distribution p_∞(β_r) of the unperturbed chain, then we propose to approximate the perturbed stationary distribution by

    p̂_∞(β̂_r) ≈ ∫ p̂(β̂_r|β_r) p_∞(β_r) dβ_r.    (8)

If p̂(β̂^t|β̂^{t−1}) = p(β̂^t|β̂^{t−1}) the approximation is exact. Our hope is that if the perturbation is small enough the above approximation is good.
To use this practically with MCMC, we first run the unperturbed MCMC chain to approximate stationarity, and then use samples of p_∞(β_r) to compute approximate samples from p̂_∞(β̂_r). Figure 1(b) shows this scheme visually.

To map this idea to our active learning setup we conceptually let the unperturbed chain p(β_r^t|β_r^{t−1}) be the chain on β_r induced by the Gibbs sampler in Section 4. The perturbed chain p̂(β̂_r^t|β̂_r^{t−1}) represents the chain where we have added a new observation y_{i′,l′} to the measured data. If we have S samples β_r^s from p_∞(β_r), then we approximate the perturbed distribution as

    p̂_∞(β̂_r) ≈ (1/S) Σ_{s=1}^S p̂(β̂_r|β_r^s),    (9)

and the active learning score as U(p(β_r|y_{i′,l′})) ≈ U(p̂_∞(β̂_r)). To further specialize this strategy to our model we first rewrite the Gibbs sampler outlined in Section 4. We suppress mentions of X and Y in the subsequent presentation. Instead of first sampling (T|γ^{t−1}, Z) from Eq. (3), and then sampling (γ^t|T, Z) from Eq. (4), we combine them into one larger sampling step (γ^t|γ^{t−1}, Z). Starting from a fixed γ^{t−1} and Z we sample γ^t as

    (γ^t|γ^{t−1}, Z) =^d η_Σ + μ = Σ [ η_{σ^{−2}I} + Σ_{(i,l)∈L} Z_l x_i (η_1 + (t_{i,l}|γ^{t−1}, Z)) ],    (10)

where η_Σ is a zero-mean Gaussian with covariance Σ, η_{σ^{−2}I} a zero-mean Gaussian with covariance σ^{−2}I, and η_1 a standard normal random variable.
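The core approximation in Eq. (8) can be illustrated on a toy two-state Markov chain of our own construction: push the unperturbed stationary distribution through one step of the perturbed transition kernel, and compare against the exact perturbed stationary distribution.

```python
import numpy as np

def stationary(P):
    """Stationary distribution: left eigenvector of P with eigenvalue 1."""
    w, V = np.linalg.eig(P.T)
    v = np.real(V[:, np.argmax(np.real(w))])
    return v / v.sum()

# Unperturbed chain and a slightly perturbed version of it.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
eps = 0.02
P_hat = np.array([[0.9 - eps, 0.1 + eps],
                  [0.2, 0.8]])

pi = stationary(P)
# Eq. (8): apply one perturbed transition to the unperturbed stationary law.
pi_hat_approx = pi @ P_hat
pi_hat_exact = stationary(P_hat)
```

In this example the one-step push-through lands measurably closer to the perturbed stationary distribution than the unperturbed one does, which is exactly the effect the active learning scores rely on.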
If it were feasible, we could also absorb the intermediate sampling of Z into the notation and write down a single induced Markov chain (β_r^t|β_r^{t−1}), as referred to in Eqs. (8) and (9). As this is not possible, we will account for Z separately. We see that the effect of adding a new observation y_{i′,l′} is to perturb the Markov chain in Eq. (10) by adding an element to L. Supposing we added this new observation at time t − 1, let Σ_{(i′,l′)} be defined as Σ but with (i′, l′) added to L. Straightforward calculations using the Sherman-Morrison-Woodbury identity on Σ_{(i′,l′)} give that, conditioned on γ^{t−1}, Z, we can write the first step of the perturbed Gibbs sampler as a function of the unperturbed Gibbs sampler. If we let A_{i′,l′} = Σ Z_{l′} x_{i′} x_{i′}^⊤ Z_{l′}^⊤ / (1 + x_{i′}^⊤ Z_{l′}^⊤ Σ Z_{l′} x_{i′}) for compactness, then we obtain

    (γ_{(i′,l′)}^t|γ^{t−1}, Z) =^d (I − A_{i′,l′})(γ^t|γ^{t−1}, Z) + Σ_{(i′,l′)} Z_{l′} x_{i′} [η_1 + (t_{i′,l′}|γ^{t−1}, Z)].    (11)

Figure 2: Examples of easy and ambiguous labeling tasks. We asked labelers to determine if the triangle is to the left of or above the square.

To approximate the utility U(·) we now appeal to Eq. (9) and estimate the difference in means using recent samples γ^s, Z^s, s = 1, ..., S, from the unperturbed sampler. In terms of Eqs.
(10) and (11), the estimate is

    U(p(β_r|y_{i′,l′})) = ||E_{p(β_r)}(β_r) − E_{p(β_r|y_{i′,l′})}(β_r)||₂    (12)

    ≈ || E( (1/(S−1)) Σ_{s=2}^S Z_r^{s−1 ⊤} [ (γ|γ^{s−1}, Z^{s−1}) − (γ_{(i′,l′)}|γ^{s−1}, Z^{s−1}) ] ) ||₂.    (13)

By simple cancellations and expectations of truncated normal variables we can reduce the above expression to a sample average of elementary calculations. Note that the sample γ^s is a realization of (γ|γ^{s−1}, Z^{s−1}). We have used this to approximate E((γ|γ^{s−1}, Z^{s−1})) ≈ γ^s. Thus, the sum only runs over S − 1 terms. In principle the exact expectation could also be computed. The final utility calculation is straightforward but too long to expand. Finally, we use samples from the Gibbs sampler to approximate p(y_{i′,l′}|x_{i′}) and estimate I((i′, l′), p(β_r)) for querying labeler l′ on task i′.

6 Experimental Results

We evaluated our active learning method on an ambiguous localization task which asked labelers on Amazon Mechanical Turk to determine if a triangle was to the left of or above a rectangle. Examples are shown in Figure 2. Tasks such as these are important for learning computer vision models of perception.
Rotation, translation and scale, as well as aspect ratios, were pseudo-randomly sampled in a way that produced ambiguous tasks. We expected labelers to use centroids, extreme points and object sizes in different ways to solve the tasks, thus leading to structurally biased responses. Additionally, our model will also have to deal with other forms of noise and bias. The gold standard was to compare only the centroids of the two objects. For training we generated 1000 labeling tasks and solicited 3 labels for each task. Tasks were solved by 75 labelers with moderate disagreement. To emphasize our results, we retained only the subset of 523 tasks with disagreement. We provided about 60 gold standard labels to BBMC and then performed inference and active learning on β_r so as to learn a predictive model emulating gold standard labels. We evaluated methods based on the log likelihood and error rate on a held-out test set of 1101 datapoints.¹ All results shown in Table 1 were averaged across 10 random restarts. We considered two scenarios. The first compares our model to other methods when no active learning is performed. This will demonstrate the advantages of the latent feature model presented in Sections 3 and 4. The second scenario compares the performance of our active learning scheme to various other methods. This will highlight the viability of our overall scheme presented in Section 5 that ties data collection together with data curation and learning.

First we show performance without active learning. Here only about 60 gold standard labels and all the labeler data are available for training. The results are shown in the top three rows of Table 1. Our method, "BBMC," outperforms the other two methods by a large margin.
The BBMC scores were computed by running the Gibbs sampler of Section 4 for 2000 burn-in iterations and then forming a predictive model by averaging over the next 20000 iterations. The alternatives are "GOLD," a logistic regression trained only on the gold standard labels, and "CONS," a logistic regression trained on the overall majority consensus. Training on the gold standard alone often overfits, and training on the consensus systematically misleads.

¹The test set was similarly constructed by selecting from 2000 tasks those on which three labelers disagreed.

             Final Loglik      Final Error
GOLD         −3716 ± 1695      0.0547 ± 0.0102
CONS         −421.1 ± 2.6      0.0935 ± 0.0031
BBMC         −219.1 ± 3.1      0.0309 ± 0.0033
GOLD-ACT     −1957 ± 696       0.0290 ± 0.0037
CONS-ACT     −396.1 ± 3.6      0.0906 ± 0.0024
RAND-ACT     −186.0 ± 2.2      0.0292 ± 0.0029
DIS-ACT      −198.3 ± 5.8      0.0392 ± 0.0052
MCMC-ACT     −196.1 ± 6.7      0.0492 ± 0.0050
BBMC-ACT     −160.8 ± 3.9      0.0188 ± 0.0018

Table 1: The top three rows give results without active learning; the bottom six rows give results with active learning.

Next, we evaluate our active learning method. As before, we seed the model with about 60 gold standard labels. We then repeatedly select a new task for which to receive a gold standard label from the researcher; that is, for this experiment we constrained active learning to use l′ = r, though in our framework we could just as easily have queried labelers in the crowd. Following 2000 burn-in steps we performed active learning every 200 iterations for a total of 100 selections. The reported scores were computed by estimating a predictive model from the last 200 iterations. The results are shown in the lower six rows of Table 1. Our model with active learning, "BBMC-ACT," outperforms all alternatives.
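The experimental schedule just described (burn in, then one query after every fixed block of Gibbs iterations) can be sketched generically. All callables here are hypothetical placeholders for the model-specific pieces: `gibbs_step` advances the sampler state, `score_candidates` returns a utility per candidate (task, labeler) pair, and `query` incorporates the newly obtained label.

```python
def run_active_learning(state, gibbs_step, score_candidates, query,
                        burnin=2000, interval=200, n_queries=100):
    """Generic Gibbs sampling loop with periodic active-learning queries.

    A sketch of the schedule only, not the authors' implementation: burn
    in, then after every `interval` iterations score the candidates and
    query the highest-utility one.
    """
    for _ in range(burnin):
        state = gibbs_step(state)
    queried = []
    for _ in range(n_queries):
        for _ in range(interval):
            state = gibbs_step(state)
        scores = score_candidates(state)   # e.g. an approximation of (13)
        best = max(scores, key=scores.get)  # highest-utility (task, labeler)
        state = query(state, best)          # fold in the new gold label
        queried.append(best)
    return state, queried
```

With the defaults above this amounts to 2000 + 100 × 200 = 22000 Gibbs iterations per run, which is why cheap per-candidate scoring matters.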
The \ufb01rst alternative we compared against, \u201cMCMC-ACT,\u201d does active learning with the\nMCMC-based scoring method outlined in Section 5. In line with our utility U (\u00b7) this method scores\na task by running two Gibbs samplers within the overall Gibbs sampler and then approximates\nthe expected mean difference of \u03b2r. Due to time constraints, we could only afford to run each\nsubordinate chain for 10 steps. Even then, this method requires on the order of 10 \u00d7 83500 Gibbs\nsampling iterations for 100 active learning steps. It takes about 11 hours to run the entire chain,\nwhile BBMC only requires 2.5 hours. The MCMC method performs very poorly. This demonstrates\nour point: Since the MCMC method computes a similar quantity as our approximation, it should\nperform similarly given enough iterations in each subchain. However, 10 iterations is not nearly\nenough time for the scoring chains to mix and also quite a small number to compute empirical\naverages, leading to decreased performance. A more realistic alternative to our model is \u201cDIS-ACT,\u201d\nwhich picks one of the tasks with most labeler disagreement to label next. Lastly, the baseline\nalternatives include \u201cGOLD-ACT\u201d and \u201cCONS-ACT\u201d which pick a random task to label and then\nlearn logistic regressions on the gold standard or consensus labels respectively. Those results can\nbe directly compared against \u201cRAND-ACT,\u201d which uses our model and inference procedure but\nsimilarly selects tasks at random. 
In line with our earlier evaluation, we still outperform these two methods even when effectively no active learning is done.

7 Conclusions

We have presented Bayesian Bias Mitigation for Crowdsourcing (BBMC) as a framework to unify the three main steps in the crowdsourcing pipeline: data collection, data curation and learning. Our model captures labeler bias through a flexible latent feature model and conceives of the entire pipeline in terms of probabilistic inference. An important contribution is a general-purpose approximation strategy for Markov chains that allows us to efficiently perform active learning despite relying on Gibbs sampling for inference. Our experiments show that BBMC is fast and greatly outperforms a number of commonly used alternatives.

Acknowledgements

We would like to thank Purnamrita Sarkar for helpful discussions and Dave Golland for assistance in developing the Amazon Mechanical Turk HITs.