{"title": "Ranking annotators for crowdsourced labeling tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 1809, "page_last": 1817, "abstract": "With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a dataset labeled by multiple annotators in a short amount of time. Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Often we have low quality annotators or spammers--annotators who assign labels randomly (e.g., without actually looking at the instance). Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the consensus labels. In this paper we formalize the notion of a spammer and define a score which can be used to rank the annotators---with the spammers having a score close to zero and the good annotators having a high score close to one.", "full_text": "Ranking annotators for crowdsourced labeling tasks\n\nVikas C. Raykar\n\nSiemens Healthcare, Malvern, PA, USA\n\nShipeng Yu\n\nSiemens Healthcare, Malvern, PA, USA\n\nvikas.raykar@siemens.com\n\nshipeng.yu@siemens.com\n\nAbstract\n\nWith the advent of crowdsourcing services it has become quite cheap and reason-\nably effective to get a dataset labeled by multiple annotators in a short amount of\ntime. Various methods have been proposed to estimate the consensus labels by\ncorrecting for the bias of annotators with different kinds of expertise. Often we\nhave low quality annotators or spammers\u2013annotators who assign labels randomly\n(e.g., without actually looking at the instance). 
Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the consensus labels. In this paper we formalize the notion of a spammer and define a score which can be used to rank the annotators, with the spammers having a score close to zero and the good annotators having a score close to one.

1 Spammers in crowdsourced labeling tasks

Annotating an unlabeled dataset is one of the bottlenecks in using supervised learning to build good predictive models. Getting a dataset labeled by experts can be expensive and time consuming. With the advent of crowdsourcing services (Amazon's Mechanical Turk being a prime example) it has become quite easy and inexpensive to acquire labels from a large number of annotators in a short amount of time (see [8], [10], and [11] for some computer vision and natural language processing case studies). One drawback of most crowdsourcing services is that we do not have tight control over the quality of the annotators. The annotators can come from a diverse pool including genuine experts, novices, biased annotators, malicious annotators, and spammers. Hence, in order to get good quality labels, requestors typically get each instance labeled by multiple annotators, and these multiple annotations are then consolidated either using simple majority voting or more sophisticated methods that model and correct for the annotator biases [3, 9, 6, 7, 14] and/or task complexity [2, 13, 12].

In this paper we are interested in ranking annotators based on how spammer-like each annotator is. In our context a spammer is a low quality annotator who assigns random labels (maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or is a bot pretending to be a human annotator).
Spammers can significantly increase the cost of acquiring annotations (since they need to be paid) and at the same time decrease the accuracy of the final consensus labels. A mechanism to detect and eliminate spammers is a desirable feature for any crowdsourcing marketplace. For example, one can give monetary bonuses to good annotators and deny payments to spammers.

The main contribution of this paper is to formalize the notion of a spammer for binary, categorical, and ordinal labeling tasks. More specifically, we define a scalar metric which can be used to rank the annotators, with the spammers having a score close to zero and the good annotators having a score close to one (see Figure 4). We summarize the multiple parameters corresponding to each annotator into a single score indicative of how spammer-like the annotator is. While this spammer score was implicit for binary labels in earlier works [3, 9, 2, 6], the extension to categorical and ordinal labels is novel and is quite different from the accuracy computed from the confusion rate matrix. An attempt to quantify the quality of the workers based on the confusion matrix was recently made by [4], where they transformed the observed labels into posterior soft labels based on the estimated confusion matrix. While we obtain somewhat similar annotator rankings, we differ from this work in that our score is directly defined in terms of the annotator parameters (see § 5 for more details).

The rest of the paper is organized as follows. For ease of exposition we start with binary labels (§ 2) and later extend it to categorical (§ 3) and ordinal labels (§ 4). We first specify the annotator model used, formalize the notion of a spammer, and propose an appropriate score in terms of the annotator model parameters. We do not dwell too much on the estimation of the annotator model parameters.
These parameters can either be estimated directly using a known gold standard 1 or via iterative algorithms that estimate the annotator model parameters without actually knowing the gold standard [3, 9, 2, 6, 7]. In the experimental section (§ 6) we obtain rankings for the annotators using the proposed spammer scores on some publicly available data from different domains.

2 Spammer score for crowdsourced binary labels

Annotator model Let y^j_i ∈ {0, 1} be the label assigned to the ith instance by the jth annotator, and let y_i ∈ {0, 1} be the actual (unobserved) binary label. We model the accuracy of the annotator separately on the positive and the negative examples. If the true label is one, the sensitivity (true positive rate) α^j for the jth annotator is defined as the probability that the annotator labels it as one,

    α^j := Pr[y^j_i = 1 | y_i = 1].

On the other hand, if the true label is zero, the specificity (1 − false positive rate) β^j is defined as the probability that the annotator labels it as zero,

    β^j := Pr[y^j_i = 0 | y_i = 0].

Extensions of this basic model have been proposed to include item level difficulty [2, 13] and also to model the annotator performance based on the feature vector [14]. For simplicity we use the basic model proposed in [7] in our formulation. Based on many instances labeled by multiple annotators, the annotator parameters (α^j, β^j) and the consensus ground truth (y_i) can be estimated iteratively [3, 7] via the Expectation Maximization (EM) algorithm. The EM algorithm iteratively establishes a particular gold standard (initialized via majority voting), measures the performance of the annotators given that gold standard (M-step), and refines the gold standard based on the performance measures (E-step).

Who is a spammer?
Intuitively, a spammer assigns labels randomly, maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or is a bot pretending to be a human annotator. More precisely, an annotator is a spammer if the probability of the observed label y^j_i being one given the true label y_i is independent of the true label, i.e.,

    Pr[y^j_i = 1 | y_i] = Pr[y^j_i = 1].                                    (1)

This means that the annotator is assigning labels randomly by flipping a coin with bias Pr[y^j_i = 1] without actually looking at the data. Equivalently, (1) can be written as

    Pr[y^j_i = 1 | y_i = 1] = Pr[y^j_i = 1 | y_i = 0],  which implies  α^j = 1 − β^j.    (2)

Hence, in the context of the annotator model defined earlier, a perfect spammer is an annotator for whom α^j + β^j − 1 = 0. This corresponds to the diagonal line on the Receiver Operating Characteristic (ROC) plot (see Figure 1(a)) 2. If α^j + β^j − 1 < 0 then the annotator lies below the diagonal line and is a malicious annotator who flips the labels. Note that a malicious annotator has discriminatory power if we can detect them and flip their labels. In fact the methods proposed in [3, 7] can automatically flip the labels for the malicious annotators. Hence we define the spammer score for an annotator as

    S^j = (α^j + β^j − 1)^2.                                               (3)

An annotator is a spammer if S^j is close to zero. Good annotators have S^j > 0 while a perfect annotator has S^j = 1.

1 One commonly used strategy to filter out spammers is to inject into the annotations some items with known labels. This is the strategy used by CrowdFlower (http://crowdflower.com/docs/gold).

2 Also note that (α^j + β^j)/2 is equal to the area shown in the plot and can be considered as a non-parametric approximation to the area under the ROC curve (AUC) based on one observed point.
It is also equal to the Balanced Classification Rate (BCR). So a spammer can also be defined as having BCR or AUC equal to 0.5.

Figure 1: (a) For binary labels an annotator is modeled by his/her sensitivity and specificity. A perfect spammer lies on the diagonal line on the ROC plot. (b) Contours of equal accuracy (4) and (c) equal spammer score (3).

Accuracy This notion of a spammer is quite different from that of the accuracy of an annotator. An annotator with high accuracy is a good annotator, but one with low accuracy is not necessarily a spammer. The accuracy is computed as

    Accuracy^j = Pr[y^j_i = y_i] = Σ_{k=0}^{1} Pr[y^j_i = k | y_i = k] Pr[y_i = k] = α^j p + β^j (1 − p),    (4)

where p := Pr[y_i = 1] is the prevalence of the positive class.
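To make the two quantities concrete, here is a minimal sketch; the annotator parameters below are invented for illustration:

```python
def spammer_score(alpha, beta):
    """Binary spammer score S = (alpha + beta - 1)^2, eq. (3)."""
    return (alpha + beta - 1.0) ** 2

def accuracy(alpha, beta, p):
    """Annotator accuracy alpha*p + beta*(1 - p), eq. (4), at prevalence p."""
    return alpha * p + beta * (1.0 - p)

# Hypothetical annotators: (sensitivity alpha, specificity beta).
annotators = {"good": (0.9, 0.8), "spammer": (0.7, 0.3), "malicious": (0.2, 0.1)}
for name, (a, b) in annotators.items():
    print(name, round(spammer_score(a, b), 3), round(accuracy(a, b, p=0.5), 3))
```

Note that the malicious annotator gets the same score as the good one (its discriminatory power is intact once its labels are flipped), even though its accuracy at p = 0.5 is only 0.15.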
Note that accuracy depends on prevalence. Our proposed spammer score does not depend on prevalence and essentially quantifies the annotator's inherent discriminatory power. Figure 1(b) shows the contours of equal accuracy on the ROC plot. Note that annotators below the diagonal line (malicious annotators) have low accuracy. The malicious annotators are good annotators who flip their labels, and as such are not spammers if we can detect them and then correct for the flipping. In fact the EM algorithms [3, 7] can correctly flip the labels for the malicious annotators and hence they should not be treated as spammers. Figure 1(c) shows the contours of equal score for our proposed score; it can be seen that the malicious annotators have a high score and only annotators along the diagonal have a low score (spammers).

Log-odds Another interpretation of a spammer can be seen from the log odds. Using Bayes' rule the posterior log-odds can be written as

    log( Pr[y_i = 1 | y^j_i] / Pr[y_i = 0 | y^j_i] ) = log( Pr[y^j_i | y_i = 1] / Pr[y^j_i | y_i = 0] ) + log( p / (1 − p) ).

If an annotator is a spammer (i.e., (2) holds) then log( Pr[y_i = 1 | y^j_i] / Pr[y_i = 0 | y^j_i] ) = log( p / (1 − p) ). Essentially the annotator provides no information in updating the posterior log-odds and hence does not contribute to the estimation of the actual true label.

3 Spammer score for categorical labels

Annotator model Suppose there are K ≥ 2 categories. We introduce a multinomial parameter α^j_c = (α^j_c1, . . . , α^j_cK) for each annotator, where

    α^j_ck := Pr[y^j_i = k | y_i = c]   and   Σ_{k=1}^{K} α^j_ck = 1.

The term α^j_ck denotes the probability that annotator j assigns class k to an instance given that the true class is c. When K = 2, α^j_11 and α^j_00 are the sensitivity and specificity, respectively.

Who is a spammer? As earlier, a spammer assigns labels randomly, i.e.,

    Pr[y^j_i = k | y_i] = Pr[y^j_i = k], ∀k.

This is equivalent to Pr[y^j_i = k | y_i = c] = Pr[y^j_i = k | y_i = c′], ∀c, c′, k = 1, . . . , K, which means that knowing the true class label to be c or c′ does not change the probability of the annotator's assigned label. This indicates that annotator j is a spammer if

    α^j_ck = α^j_c′k, ∀c, c′, k = 1, . . . , K.                             (5)

Let A^j be the K × K confusion rate matrix with entries [A^j]_ck = α^j_ck. A spammer would have all the rows of A^j equal; for example,

    A^j = [ 0.50 0.25 0.25 ; 0.50 0.25 0.25 ; 0.50 0.25 0.25 ]

for a three class categorical annotation problem. Essentially A^j is a rank one matrix of the form A^j = e v_j^T, for some column vector v_j ∈ R^K that satisfies v_j^T e = 1, where e is a column vector of ones.

In the binary case we had the natural notion of a spammer as an annotator for whom α^j + β^j − 1 was close to zero. One natural way to summarize (5) is in terms of the distance (Frobenius norm) of the confusion matrix to its closest rank one approximation, i.e.,

    S^j := ||A^j − e v̂_j^T||_F^2,   where   v̂_j = arg min_{v_j} ||A^j − e v_j^T||_F^2   s.t.   v_j^T e = 1.    (6, 7)

Solving (7) yields v̂_j = (1/K) A^{jT} e, which is the mean of the rows of A^j. Then from (6) we have

    S^j = ||(I − (1/K) e e^T) A^j||_F^2 = (1/K) Σ_{c<c′} Σ_k (α^j_ck − α^j_c′k)^2.    (8)

So a spammer is an annotator for whom S^j is close to zero.
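A small numerical check of (6)-(7), as a sketch assuming NumPy; the confusion matrices are made up:

```python
import numpy as np

def categorical_spammer_score(A):
    """Unnormalized categorical spammer score: the squared Frobenius distance
    of the K x K confusion rate matrix A (rows sum to one) to its closest
    rank-one approximation e v^T with v^T e = 1, eqs. (6)-(7)."""
    A = np.asarray(A, dtype=float)
    v_hat = A.mean(axis=0)          # solution of (7): the mean of the rows
    return float(np.sum((A - v_hat) ** 2))

# A spammer has identical rows (a rank-one matrix): score is 0.
A_spam = np.array([[0.50, 0.25, 0.25],
                   [0.50, 0.25, 0.25],
                   [0.50, 0.25, 0.25]])
# A perfect annotator has A = I: score is K - 1 = 2.
print(round(categorical_spammer_score(A_spam), 6))     # 0.0
print(round(categorical_spammer_score(np.eye(3)), 6))  # 2.0
```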
A perfect annotator has S^j = K − 1. We normalize this score to lie between 0 and 1 by dividing by K − 1:

    S^j = (1/(K(K − 1))) Σ_{c<c′} Σ_k (α^j_ck − α^j_c′k)^2.

When K = 2 this is equivalent to the score proposed earlier for binary labels. As earlier, this notion of a spammer is different from the accuracy computed from the confusion rate matrix and the prevalence. The accuracy is computed as Accuracy^j = Pr[y^j_i = y_i] = Σ_{k=1}^{K} Pr[y^j_i = k | y_i = k] Pr[y_i = k] = Σ_{k=1}^{K} α^j_kk Pr[y_i = k].

4 Spammer score for ordinal labels

A commonly used paradigm to annotate instances is an ordinal scale, where an annotator is asked to rate an instance on a certain ordinal scale, say {1, . . . , K}. For example, rating a restaurant on a scale of 1 to 5, or assessing the malignancy of a lesion on a BIRADS scale of 1 to 5 for mammography. This differs from categorical labels, where there is no order among the multiple class labels. An ordinal variable expresses rank and there is an implicit ordering 1 < . . . < K.

Annotator model It is conceptually easier to think of the true label as binary, that is, y_i ∈ {0, 1}. For example, in mammography a lesion is either malignant (1) or benign (0) (which can be confirmed by biopsy), and the BIRADS ordinal scale is a means for the radiologist to quantify the uncertainty based on the digital mammogram. The radiologist assigns a higher value of the label if he/she thinks the true label is closer to one. As earlier we characterize each annotator by sensitivity and specificity, but the main difference is that we now define the sensitivity and specificity for each ordinal label (or threshold) k ∈ {1, . . . , K}.
Let \u03b1j\nk be the sensitivity and speci\ufb01city\nrespectively of the jth annotator corresponding to the threshold k, that is,\n\nk and \u03b2j\n\n\u03b1j\nk = Pr[yj\n1 = 0 and \u03b1j\n1 = 1, \u03b2j\n\ni \u2265 k | yi = 1]\nK+1 = 0, \u03b2j\n\nNote that \u03b1j\nis parameterized by a set of 2(K \u2212 1) parameters [\u03b1j\nempirical ROC curve for the annotator (Figure 2).\n\nand \u03b2j\n\nk = Pr[yj\n\ni < k | yi = 0].\n\nK+1 = 1 from this de\ufb01nition. Hence each annotator\nK]. This corresponds to an\n\n2, . . . , \u03b1j\n\nK, \u03b2j\n\n2, \u03b2j\n\n4\n\n\fk+1 and Pr[yj\n\nWho is a spammer? As earlier we de\ufb01ne an an-\nnotator j to be a spammer if Pr[yj\ni = k|yi = 1] =\nPr[yj\ni = k|yi = 0] \u2200k = 1, . . . , K. Note that from\nthe annotation model we have 3 Pr[yj\ni = k | yi =\nk \u2212 \u03b1j\n1] = \u03b1j\ni = k | yi = 0] =\n\u03b2j\nk+1 \u2212 \u03b2j\nk. This implies that annotator j is a spam-\nmer if \u03b1j\nk \u2212 \u03b1j\nk, \u2200k = 1, . . . , K,\nwhich leads to \u03b1j\n1 = 1, \u2200k. 
This means that for every k, the point (1 − β^j_k, α^j_k) lies on the diagonal line in the ROC plot shown in Figure 2. The area under the empirical ROC curve can be computed as (see Figure 2)

    AUC^j = (1/2) Σ_{k=1}^{K} (α^j_{k+1} + α^j_k)(β^j_{k+1} − β^j_k),

and can be used to define the following spammer score, (2 AUC^j − 1)^2, to rank the different annotators:

    S^j = ( Σ_{k=1}^{K} (α^j_{k+1} + α^j_k)(β^j_{k+1} − β^j_k) − 1 )^2.    (9)

Figure 2: Ordinal labels: An annotator is modeled by sensitivity/specificity for each threshold.

With two levels this expression defaults to the binary case. An annotator is a spammer if S^j is close to zero. Good annotators have S^j > 0 while a perfect annotator has S^j = 1.

5 Previous work

Recently Ipeirotis et al. [4] proposed a score for categorical labels based on the expected cost of the posterior label. In this section we briefly describe their approach and compare it with our proposed score. For each instance labeled by the annotator they first compute the posterior (soft) label Pr[y_i = c | y^j_i] for c = 1, . . . , K, where y^j_i is the label assigned to the ith instance by the jth annotator and y_i is the true unknown label. The posterior label is computed via Bayes' rule as Pr[y_i = c | y^j_i] ∝ Pr[y^j_i | y_i = c] Pr[y_i = c] = Π_k (α^j_ck)^{δ(y^j_i, k)} p_c, where p_c = Pr[y_i = c] is the prevalence of class c. The score for a spammer is based on the intuition that the posterior label vector (Pr[y_i = 1 | y^j_i], . . . , Pr[y_i = K | y^j_i]) for a good annotator will have all the probability mass concentrated on a single class.
For example, for a three class problem (with equal prevalence), a posterior label vector of (1, 0, 0) (certain that the class is one) comes from a good annotator, while (1/3, 1/3, 1/3) (complete uncertainty about the class label) comes from a spammer. Based on this they define the following score for each annotator:

    Score^j = (1/N) Σ_{i=1}^{N} Σ_{c=1}^{K} Σ_{k=1}^{K} cost_ck Pr[y_i = c | y^j_i] Pr[y_i = k | y^j_i],    (10)

where cost_ck is the misclassification cost when an instance of class c is classified as k. Essentially this captures the uncertainty of the posterior label averaged over all the instances. Perfect workers have a score Score^j = 0 while spammers have a high score. An entropic version of this score based on similar ideas has also been recently proposed in [5]. Our proposed spammer score differs from this approach in the following aspects: (1) Implicit in the score defined above (10) is the assumption that an annotator is a spammer when Pr[y_i = c | y^j_i] = Pr[y_i = c], i.e., the estimated posterior labels are simply based on the prevalence and do not depend on the observed labels. By Bayes' rule this is equivalent to Pr[y^j_i | y_i = c] = Pr[y^j_i], which is what we have used to define our spammer score. (2) While both notions of a spammer are equivalent, the approach of [4] first computes the posterior labels based on the observed data, the class prevalence and the annotator parameters, and then computes the expected cost. Our proposed spammer score does not depend on the prevalence of the class. Our score is also directly defined only in terms of the annotator confusion matrix and does not need the observed labels. (3) For the score defined in (10), while perfect annotators have a score of 0, it is not clear what a good baseline for a spammer should be. The authors suggest computing the baseline by assuming that a worker assigns as label the class with maximum prevalence. Our proposed score has a natural scale, with a perfect annotator having a score of 1 and a spammer having a score of 0.

3 This can be seen as follows: Pr[y^j_i = k | y_i = 1] = Pr[(y^j_i ≥ k) AND (y^j_i < k + 1) | y_i = 1] = Pr[y^j_i ≥ k | y_i = 1] + Pr[y^j_i < k + 1 | y_i = 1] − Pr[(y^j_i ≥ k) OR (y^j_i < k + 1) | y_i = 1] = Pr[y^j_i ≥ k | y_i = 1] − Pr[y^j_i ≥ k + 1 | y_i = 1] = α^j_k − α^j_{k+1}. Here we used the fact that Pr[(y^j_i ≥ k) OR (y^j_i < k + 1)] = 1.

Figure 3: (a) The simulation setup consisting of 10 good annotators (annotators 1 to 10), 10 spammers (11 to 20), and 10 malicious annotators (21 to 30). (b) The ranking of annotators obtained using the proposed spammer score. The spammer score ranges from 0 to 1; the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI) are shown, obtained from 100 bootstrap replications. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances annotated by that annotator. (c) and (d) Comparison of the median rank obtained via the spammer score with the rank obtained using (c) accuracy and (d) the method proposed by Ipeirotis et al. [4].
(4) However, one advantage of the approach in [4] is that it can directly incorporate varied misclassification costs.

6 Experiments

Ranking annotators based on the confidence interval As mentioned earlier, the annotator model parameters can be estimated using the iterative EM algorithms [3, 7], and these estimated parameters can then be used to compute the spammer score. The spammer score can then be used to rank the annotators. However, one commonly observed phenomenon when working with crowdsourced data is that many annotators label only a very few instances; as a result, the annotator parameters cannot be reliably estimated for these annotators. In order to factor this uncertainty in the estimation of the model parameters into the ranking, we compute the spammer score over 100 bootstrap replications. Based on these we compute the 95% confidence interval (CI) of the spammer score for each annotator. We rank the annotators based on the lower limit of the 95% CI. The CIs are wider

Table 1: Datasets. N is the number of instances. M is the number of annotators. M* is the mean/median number of annotators per instance.
N \u2217 is the mean/median number of instances labeled by each annotator.\n\nDataset\n\nType\n\nbluebird\n\nbinary\n\ntemp\n\nbinary\n\nwsd\n\ncategorical/3\n\nsentiment\n\ncategorical/3\n\nwosi\nvalence\n\nordinal/[0 10]\nordinal[-100 100]\n\nN M M \u2217\n\nN \u2217\n\nBrief Description\n\n108\n\n462\n\n177\n\n1660\n\n30\n100\n\n39\n\n76\n\n34\n\n33\n\n10\n38\n\n39/39\n\n108/108\n\n10/10\n\n61/16\n\n10/10\n\n52/20\n\n6/6\n\n291/175\n\n10/10\n10/10\n\n30/30\n26/20\n\nbird identi\ufb01cation [12] The annotator had to identify whether there was an Indigo\nBunting or Blue Grosbeak in the image.\nevent annotation [10] Given a dialogue and a pair of verbs annotators need to label\nwhether the event described by the \ufb01rst verb occurs before or after the second.\n\nword sense disambiguation [10] The labeler is given a paragraph of text containing\nthe word \u201dpresident\u201d and asked to label one of the three appropriate senses.\nirish economic sentiment analysis [1] Articles from three Irish online news sources\nwere annotated by volunteer users as positive, negative, or irrelevant.\n\nword similarity [10] Numeric judgements of word similarity.\naffect recognition [10] Each annotator is presented with a short headline and asked\nto rate it on a scale [-100,100] to denote the overall positive or negative valence.\n\nbluebird | 108 instances | 39 annotators \n\nwsd | 177 instances | 34 annotators \n\nwosi | 30 instances | 10 annotators \n\n1\n\n8\n0\n1\n\ne\nr\no\nc\nS\n\n \nr\ne\nm\nm\na\np\nS\n\ne\nr\no\nc\nS\n\n 
\nr\ne\nm\nm\na\np\nS\n\n0.8\n\n0.6\n\n0.4\n\n0.2\n\n0\n\n1\n\n0.8\n\n0.6\n\n0.4\n\n0.2\n\n0\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n81\n0\n1\n\n8\n0\n81\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n8\n0\n1\n\n78\n1\n\n7\n2\n\n0\n3\n\n5\n2\n\n51\n3\n\n2\n1\n\n2\n3\n\n7\n3\n\n8\n3\n\n6\n1\n\n29\n2\n\n9\n2\n\n5\n1\n\n0\n2\n\n95\n1\n\n93\n3\n\n1\n2\n\n3\n2\n\n42\n1\n\n0\n1\n\n47\n2\n\n3\n3\n\n3\n1\n\n6\n3\n\n14\n3\n\n4\n3\n\n8\n2\n\n8\n1\n\n16\n1\n\n6\n2\n\ne\nr\no\nc\nS\n\n \nr\ne\nm\nm\na\np\nS\n\n1\n\n0.8\n\n0.6\n\n0.4\n\n0.2\n\n0\n\n7\n7\n1\n\n7\n5\n1\n\n7\n7\n1\n\n7\n5\n1\n\n0\n8\n\n0\n4\n\n0\n4\n\n0\n2\n\n7\n7\n\n0\n0\n1\n\n0\n2\n\n0\n2\n\n0\n2\n\n7\n1\n1\n\n7\n7\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n7\n7\n\n0\n2\n\n0\n2\n\n0\n6\n\n7\n1\n\n0\n4\n\n7\n1\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n3\n1\n\n1\n3\n\n0\n1\n\n3\n2\n\n9124689\n2\n\n4\n1\n\n5\n1\n\n7\n1\n\n2\n2\n\n25\n3\n\n8\n1\n\n6\n1\n\n9\n1\n\n1\n1\n\n2\n1\n\n0\n2\n\n1\n2\n\n4\n2\n\n5\n2\n\n6\n2\n\n7\n2\n\n8\n2\n\n0\n3\n\n3\n3\n\n473\n3\n\nAnnotator\n\nAnnotator\n\ntemp | 462 instances | 76 annotators \n\nsentiment | 1660 instances | 33 annotators \n\n0\n8\n\n2\n0\n4\n\n0\n3\n\n2\n5\n\n0\n3\n\n0\n6\n\n0\n3\n\n0\n1\n\n0\n2\n\n0\n1\n\n2\n3\n1\n\n0\n1\n\n0\n6\n3\n\n0\n1\n\ne\nr\no\nc\nS\n\n 
\nr\ne\nm\nm\na\np\nS\n\n0.8\n\n0.6\n\n0.4\n\n0.2\n\n0\n\n2\n4\n4\n\n2\n6\n4\n\n2\n5\n4\n\n0\n1\n\n0\n1\n\n9\n9\n0\n1\n\n1\n1\n2\n1\n\n2\n7\n5\n\n7\n3\n4\n\n5\n2\n5\n\n1\n4\n5\n\n5\n7\n1\n\n9\n1\n1\n\n3\n4\n\n7\n7\n\n7\n6\n\n2\n1\n\n6\n4\n3\n\n9\n2\n2\n\n3\n5\n4\n\n8\n2\n4\n\n4\n7\n3\n\n9\n4\n2\n\n4\n8\n2\n\n4\n0\n1\n\n7\n1\n9\n\n5\n7\n\n4\n5\n6\n\n1\n7\n1\n\n8\n3\n2\n\n626\n2\n\n15\n1\n\n43\n1\n\n09\n2\n\n2\n2\n\n1\n3\n\n0\n1\n\n2\n1\n\n88\n1\n\n3\n1\n\n041\n3\n\n9\n2\n\n9\n1\n\n7\n1\n\n7\n2\n\n8\n2\n\n1\n2\n\n5\n1\n\n5\n2\n\n37\n2\n\n3\n3\n\n6\n1\n\n4\n2\n\n2\n3\n\ne\nr\no\nc\nS\n\n \nr\ne\nm\nm\na\np\nS\n\ne\nr\no\nc\nS\n\n \nr\ne\nm\nm\na\np\nS\n\n1\n\n0.8\n\n0.6\n\n0.4\n\n0.2\n\n0\n\n1\n\n0.8\n\n0.6\n\n0.4\n\n0.2\n\n0\n\n0\n3\n\n0\n3\n\n0\n3\n\n0\n3\n\n0\n3\n\n0\n3\n\n0\n3\n\n0\n3\n\n0\n3\n\n0\n3\n\n2 4 1 3 5 6 8 9 7\n\n0\n1\n\nAnnotator\n\nvalence | 100 instances | 38 annotators \n\n0\n4\n\n0\n4\n\n0\n0\n1\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n6\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n4\n\n0\n1\n\n0\n5\n\n0\n1\n\n0\n1\n\n0\n4\n\n0\n1\n\n0\n7\n\n0\n0\n1\n\n0\n8\n\n0\n4\n\n2\n9\n1\n\n0\n9\n1\n\n0\n5\n3\n\n0\n4\n\n2\n3\n\n0\n6\n\n0\n7\n\n0\n2\n\n0\n2\n\n0\n4\n\n0\n2\n\n0\n5\n\n0\n5\n\n0\n5\n\n0\n3\n\n0\n1\n\n0\n1\n\n0\n3\n\n0\n2\n\n0\n1\n\n0\n2\n\n2\n2\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n0\n1\n\n2\n1\n\n9\n2\n\n1\n\n7\n8\n\n2\n1\n\n5\n3\n\n1\n1\n\n5\n1\n\n77\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n4\n\n0\n2\n\n0\n6\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n0\n2\n\n3\n1\n\n8\n1\n\n2\n5\n\n5\n7\n\n3\n3\n\n2\n3\n\n2\n1\n\n4\n7\n\n1\n3\n\n1\n5\n\n1\n4\n\n57\n5\n\n4\n1\n\n0\n7\n\n2\n4\n\n8\n5\n\n5\n6\n\n31\n4\n\n0\n1\n\n7\n4\n\n1\n6\n\n3\n7\n\n5\n2\n\n7\n3\n\n6\n7\n\n7\n6\n\n4\n2\n\n6\n4\n\n4\n5\n\n8\n4\n\n9\n3\n\n6\n5\n\n5\n1\n
\n2\n6\n\n8\n6\n\n4\n4\n\n3\n5\n\n4\n6\n\n09\n4\n\n862\n2\n\n73458\n5\n\n1\n1\n\n6\n1\n\n7\n1\n\n9\n1\n\n0\n2\n\n1\n2\n\n2\n2\n\n3\n2\n\n6\n2\n\n7\n2\n\n9\n2\n\n0\n3\n\n4\n3\n\n5\n3\n\n6\n3\n\n8\n3\n\n5\n4\n\n9\n4\n\n0\n5\n\n9\n5\n\n0\n6\n\n3\n6\n\n6\n6\n\n9\n6\n\n1\n7\n\n2\n7\n\n1\n\n6\n2\n\n0\n1\n\n8\n1\n\n8\n2\n\n55\n1\n\n6\n3\n\n3\n2\n\n28\n1\n\n2\n3\n\n1\n3\n\n8\n3\n\n3\n1\n\n7\n1\n\n7\n2\n\n12\n1\n\n5\n3\n\n4\n2\n\n996\n1\n\n0\n3\n\n3\n3\n\n7\n3\n\n4\n1\n\n943\n2\n\n0\n2\n\n4\n3\n\n2\n2\n\n57\n2\n\n6\n1\n\n1\n2\n\nAnnotator\n\nAnnotator\n\nAnnotator\n\nFigure 4: Annotator Rankings The rankings obtained for the datasets in Table 1. The spammer score ranges\nfrom 0 to 1, the lower the score, the more spammy the annotator. The mean spammer score and the 95%\ncon\ufb01dence intervals (CI) are shown\u2014obtained from 100 bootstrap replications. The annotators are ranked\nbased on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances\nannotated by that annotator. Note that the CIs are wider when the annotator labels only a few instances.\n\nwhen the annotator labels only a few instances. For a crowdsourced labeling task the annotator has\nto be good and also label a reasonable number of instances in order to be reliably identi\ufb01ed.\nSimulated data We \ufb01rst illustrate our proposed spammer score on simulated binary data (with equal\nprevalence for both classes) consisting of 500 instances labeled by 30 annotators of varying sensitiv-\nity and speci\ufb01city (see Figure 3(a) for the simulation setup). Of the 30 annotators we have 10 good\nannotators (annotators 1 to 10 who lie above the diagonal in Figure 3(a)), 10 spammers (annotators\n11 to 20 who lie around the diagonal), and 10 malicious annotators (annotators 21 to 30 who lie be-\nlow the diagonal). Figure 3(b) plots the ranking of annotators obtained using the proposed spammer\nscore with the annotator model parameters estimated via the EM algorithm [3, 7]. 
The spammer score ranges from 0 to 1; the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence interval (CI) obtained via bootstrapping are shown. The annotators are ranked based on the lower limit of the 95% CI. As can be seen, all the spammers (annotators 11 to 20) have a low spammer score and appear at the bottom of the list. The malicious annotators have a higher score than the spammers since we can correct for their flipping: the malicious annotators are good annotators who flip their labels, and as such are not spammers if we detect that they are malicious. Figure 3(c) compares the (median) rank obtained via the spammer score with the (median) rank obtained using accuracy as the score to rank the annotators. While the good annotators are ranked high by both methods, the accuracy score gives a low rank to the malicious annotators; accuracy does not capture the notion of a spammer. Figure 3(d) compares the ranking with the method proposed by Ipeirotis et al.
[4], which gives almost similar rankings as our proposed score.

[Figure 5 panels, bluebird dataset (108 instances, 39 annotators): (a) annotator rank (median) via accuracy and (b) annotator rank (median) via Ipeirotis et al. [4], each plotted against the annotator rank (median) via the spammer score; (c) sensitivity versus 1−specificity for the estimated annotator parameters.]

Figure 5: Comparison of the rank obtained via the spammer score with the rank obtained using (a) accuracy and (b) the method proposed by Ipeirotis et al. [4] for the bluebird binary dataset. (c) The annotator model parameters as estimated by the EM algorithm [3, 7].

[Figure 6 panels, wsd dataset (177 instances, 34 annotators) and sentiment dataset (1660 instances, 33 annotators): annotator rank (median) via accuracy and via Ipeirotis et al. [4], each plotted against the annotator rank (median) via the spammer score.]

Figure 6: Comparison of the median rank obtained via the spammer score with the rank obtained using accuracy and the method proposed by Ipeirotis et al. [4] for the two categorical datasets in Table 1.

Mechanical Turk data We report results on some publicly available linguistic and image annotation data collected using Amazon's Mechanical Turk (AMT) and other sources. Table 1 summarizes the datasets. Figure 4 plots the spammer scores and rankings obtained. The mean and the 95% CI obtained via bootstrapping are also shown. The number at the top of the CI bar shows the number of instances annotated by that annotator. The rankings are based on the lower limit of the 95% CI, which factors the number of instances labeled by the annotator into the ranking. An annotator who labels only a few instances will have a very wide CI; some annotators who label only a few instances may have a high mean spammer score, but the CI will be wide and they will hence be ranked lower. Ideally we would like annotators who have a high score and at the same time label a lot of instances, so that we can reliably identify them. For the sentiment dataset, the authors of [1] shared with us some qualitative observations regarding the annotators, and these somewhat agree with our rankings. For example, the authors made the following comment about Annotator 7: "Quirky annotator - had a lot of debate about what was the meaning of the annotation question. I'd say he changed his labeling strategy at least once during the process". Our proposed score gave a low rank to this annotator.

Comparison with other approaches Figures 5 and 6 compare the proposed ranking with the rank obtained using accuracy and the method proposed by Ipeirotis et al. [4] for some binary and categorical datasets in Table 1. Our proposed ranking is somewhat similar to that obtained by Ipeirotis et al.
[4], but accuracy does not quite capture the notion of a spammer. For example, for the bluebird dataset, accuracy ranks annotator 21 (see Figure 5(a)) at the bottom of the list while the proposed score puts it in the middle of the list. From the estimated model parameters it can be seen that annotator 21 actually flips the labels (below the diagonal in Figure 5(c)) but is otherwise a good annotator.

7 Conclusions

We proposed a score to rank annotators for crowdsourced binary, categorical, and ordinal labeling tasks. The obtained rankings and scores can be used to allocate monetary bonuses to be paid to different annotators and also to eliminate spammers from further labeling tasks. A mechanism to rank annotators should be a desirable feature of any crowdsourcing service. The proposed score should also be useful to specify the prior for Bayesian approaches to consolidate annotations.

References
[1] A. Brew, D. Greene, and P. Cunningham. Using crowdsourcing and active learning to track sentiment in online media. In Proceedings of the 6th Conference on Prestigious Applications of Intelligent Systems (PAIS'10), 2010.
[2] B. Carpenter. Multilevel Bayesian models of categorical data annotation. Technical report, available at http://lingpipe-blog.com/lingpipe-white-papers/, 2008.
[3] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979.
[4] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP'10), pages 64–67, 2010.
[5] V. C. Raykar and S. Yu. An entropic score to rank annotators for crowdsourced labelling tasks. In Proceedings of the Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2011.
[6] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G.
H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 889–896, 2009.
[7] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, April 2010.
[8] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008.
[9] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of Venus images. In Advances in Neural Information Processing Systems 7, pages 1085–1092, 1995.
[10] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast—but is it good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 254–263, 2008.
[11] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the First IEEE Workshop on Internet Vision at CVPR 08, pages 1–8, 2008.
[12] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pages 2424–2432, 2010.
[13] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043, 2009.
[14] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. Dy.
Modeling annotator expertise: Learning when everybody knows a bit of something. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 932–939, 2010.