{"title": "Reputation-based Worker Filtering in Crowdsourcing", "book": "Advances in Neural Information Processing Systems", "page_first": 2492, "page_last": 2500, "abstract": "In this paper, we study the problem of aggregating noisy labels from crowd workers to infer the underlying true labels of binary tasks. Unlike most prior work which has examined this problem under the random worker paradigm, we consider a much broader class of {\\em adversarial} workers with no specific assumptions on their labeling strategy. Our key contribution is the design of a computationally efficient reputation algorithm to identify and filter out these adversarial workers in crowdsourcing systems. Our algorithm uses the concept of optimal semi-matchings in conjunction with worker penalties based on label disagreements, to assign a reputation score for every worker. We provide strong theoretical guarantees for deterministic adversarial strategies as well as the extreme case of {\\em sophisticated} adversaries where we analyze the worst-case behavior of our algorithm. Finally, we show that our reputation algorithm can significantly improve the accuracy of existing label aggregation algorithms in real-world crowdsourcing datasets.", "full_text": "Reputation-based Worker Filtering in Crowdsourcing\n\nSrikanth Jagabathula1 Lakshminarayanan Subramanian2,3 Ashwin Venkataraman2,3\n\n1Department of IOMS, NYU Stern School of Business\n2Department of Computer Science, New York University\n\n3CTED, New York University Abu Dhabi\n\nsjagabat@stern.nyu.edu\n\n{lakshmi,ashwin}@cs.nyu.edu\n\nAbstract\n\nIn this paper, we study the problem of aggregating noisy labels from crowd work-\ners to infer the underlying true labels of binary tasks. Unlike most prior work\nwhich has examined this problem under the random worker paradigm, we consider\na much broader class of adversarial workers with no speci\ufb01c assumptions on their\nlabeling strategy. 
Our key contribution is the design of a computationally ef\ufb01cient\nreputation algorithm to identify and \ufb01lter out these adversarial workers in crowd-\nsourcing systems. Our algorithm uses the concept of optimal semi-matchings\nin conjunction with worker penalties based on label disagreements, to assign a\nreputation score for every worker. We provide strong theoretical guarantees for\ndeterministic adversarial strategies as well as the extreme case of sophisticated\nadversaries where we analyze the worst-case behavior of our algorithm. Finally,\nwe show that our reputation algorithm can signi\ufb01cantly improve the accuracy of\nexisting label aggregation algorithms in real-world crowdsourcing datasets.\n\n1 Introduction\n\nThe growing popularity of online crowdsourcing services (e.g. Amazon Mechanical Turk, Crowd-\nFlower etc.) has made it easy to collect low-cost labels from the crowd to generate training datasets\nfor machine learning applications. However, these applications remain vulnerable to noisy labels\nintroduced either unintentionally by unreliable workers or intentionally by spammers and malicious\nworkers [10, 11]. Recovering the underlying true labels in the face of noisy input in online crowd-\nsourcing environments is challenging due to three key reasons: (a) Workers are often anonymous\nand transient and can provide random or even malicious labels (b) The reliabilities or reputations of\nthe workers are often unknown (c) Each task may receive labels from only a (small) subset of the\nworkers.\nSeveral existing approaches aim to address the above challenges under the following standard setup.\nThere is a set T of binary tasks, each with a true label in {-1, 1}. A set of workers W are asked\nto label the tasks, and the assignment of the tasks to the workers can be represented by a bipartite\ngraph with the workers on one side, tasks on the other side, and an edge connecting each worker\nto the set of tasks she is assigned. 
We term this the worker-task assignment graph. Workers are\nassumed to generate labels according to a probabilistic model - given a task t, a worker w provides\nthe true label with probability pw. Note that every worker is assumed to label each task independent\nof other tasks. The goal then is to infer the underlying true labels of the tasks by aggregating the\nlabels provided by the workers. Prior works based on the above model can be broadly classi\ufb01ed into\ntwo categories: machine-learning based and linear-algebra based. The machine-learning approaches\nare typically based on variants of the EM algorithm [3, 16, 24, 14]. These algorithms perform well\nin most scenarios, but they lack any theoretical guarantees. More recently, linear-algebra based\nalgorithms [9, 6, 2] have been proposed, which provide guarantees on the error in estimating the\ntrue labels of the tasks (under appropriate assumptions), and have also been shown to perform well\non various real-world datasets. While existing work focuses on workers making random errors,\nrecent work and anecdotal evidence have shown that worker labeling strategies that are common in\npractice do not \ufb01t the standard random model [19]. Speci\ufb01c examples include vote pollution attacks\non Digg [18], malicious behavior in social media [22, 12] and low-precision worker populations in\ncrowdsourcing experiments [4].\nIn this paper, we aim to go beyond the standard random model and study the problem of inferring\nthe true labels of tasks under a much broader class of adversarial worker strategies with no speci\ufb01c\nassumptions on their labeling pattern. For instance, deterministic labeling, where the workers always\ngive the same label, cannot be captured by the standard random model. Also, malicious workers can\nemploy arbitrary labeling patterns to degrade the accuracy of the inferred labels. 
Our goal is to\naccurately infer the true labels of the tasks without restricting workers\u2019 strategies.\nMain results. Our main contribution is the design of a reputation algorithm to identify and \ufb01lter out\nadversarial workers in online crowdsourcing systems. Speci\ufb01cally, we propose 2 computationally\nef\ufb01cient algorithms to compute worker reputations using only the labels provided by the workers\n(see Algorithms 1 and 2), which are robust to manipulation by adversaries. We compute worker\nreputations by assigning penalties to a worker for each task she is assigned. The assigned penalty is\nhigher for tasks on which there is \u201ca lot\u201d of disagreement with the other workers. The penalties are\nthen aggregated in a \u201cload-balanced\u201d manner using the concept of optimal semi-matchings [7]. The\nreputation algorithm is designed to be used in conjunction with any of the existing label aggregation\nalgorithms that are designed for the standard random worker model: workers with low reputations1\nare \ufb01ltered out and the aggregation algorithm is used on the remaining labels. As a result, our\nalgorithm can be used to boost the performance of existing label aggregation algorithms.\nWe demonstrate the effectiveness of our algorithm using a combination of strong theoretical guar-\nantees and empirical results on real-world datasets. Our analysis considers three scenarios. First,\nwe consider the standard setting in which workers are not adversarial and provide labels according\nto the random model. In this setting, we show that when the worker-task assignment graph is (l, r)-\nregular, the reputation scores are proportional to the reliabilities of the workers (see Theorem 1), so\nthat only unreliable workers are \ufb01ltered out. As a result, our reputation scores are consistent with\nworker reliabilities in the absence of adversarial workers. 
The analysis becomes signi\ufb01cantly com-\nplicated for more general graphs (a fact observed in prior works; see [2]); hence, we demonstrate\nimprovements using simulations and experiments on real world datasets. Second, we evaluate the\nperformance of our algorithm in the presence of workers who use deterministic labeling strategies\n(always label 1 or -1). For these strategies, when the worker-task assignment graph is (l, r)-regular,\nwe show (see Theorem 2) that the adversarial workers receive lower reputations than their \u201chonest\u201d\ncounterparts, provided honest workers have \u201chigh enough\u201d reliabilities \u2013 the exact bound depends\non the prevalence of tasks with true label 1, the fraction of adversarial workers and the average\nreliability of the honest workers.\nThird, we consider the case of sophisticated adversaries, i.e. worst-case adversarial workers whose\ngoal is to maximize the number of tasks they affect (i.e. cause to get incorrect labels). Under\nthis assumption, we provide bounds on the \u201cdamage\u201d they can do: We prove that irrespective of\nthe label aggregation algorithm (as long as it is agnostic to worker/task identities), there is a non-\ntrivial minimum fraction of tasks whose true label is incorrectly inferred. This bound depends on\nthe graph structure of the honest worker labeling pattern (see Theorem 3 for details). Our result\nis valid across different labeling patterns and a large class of label aggregation algorithms, and\nhence provides fundamental limits on the damage achievable by adversaries. 
Further, we propose a\nlabel aggregation algorithm utilizing the worker reputations computed in Algorithm 2 and prove the\nexistence of an upper bound on the worst-case accuracy in inferring the true labels (see Theorem 4).\nFinally, using several publicly available crowdsourcing datasets (see Section 4), we show that our\nreputation algorithm: (a) can help in enhancing the accuracy of state-of-the-art label aggregation\nalgorithms (b) is able to detect workers in these datasets who exhibit certain \u2018non-random\u2019 strategies.\nAdditional Related Work: In addition to the references cited above, there have been works which\nuse gold standard tasks, i.e. tasks whose true label is already known [17, 5, 11] to correct for worker\nbias. [8] proposed a way of quantifying worker quality by transforming the observed labels into soft\nposterior labels based on the estimated confusion matrix [3]. The authors in [13] propose an empir-\nical Bayesian algorithm to eliminate workers who label randomly without looking at the particular\ntask (called spammers), and estimate the consensus labels from the remaining workers. Both these\nworks use the estimated parameters to de\ufb01ne \u201cgood workers\u201d whereas we compute the reputation\nscores using only the labels provided by the workers. The authors in [20] focus on detecting speci\ufb01c\nkinds of spammers and then replace their labels with new workers. We consider all types of adver-\nsarial workers, not just spammers and don\u2019t assume access to a pool of workers who can be asked\nto label the tasks.\n\n1 As will become evident later, reputations are measures of how adversarial a worker is and are different\nfrom reliabilities of workers.\n\n2 Model and reputation algorithms\nNotation. Consider a set of tasks T having true labels in {-1, 1}. 
Let y_j denote the true label of a\ntask t_j \u2208 T and suppose that the tasks are sampled from a population in which the prevalence of the\npositive tasks is \u03b3 \u2208 [0, 1], so that there is a fraction \u03b3 of tasks with true label 1. A set of workers\nW provide binary labels to the tasks in T. We let G denote the bipartite worker-task assignment\ngraph where an edge between worker w_i and task t_j indicates that w_i has labeled t_j. Further, let\nw_i(t_j) denote the label provided by worker w_i to task t_j, where we set w_i(t_j) = 0 if worker w_i did\nnot label task t_j. For a task t_j, let W_j \u2282 W denote the set of workers who labeled t_j and likewise,\nfor a worker w_i, let T_i denote the set of tasks the worker has labeled. Denote by d_j^+ (resp. d_j^-) the\nnumber of workers labeling task t_j as 1 (resp. -1). Finally, let L \u2208 {-1, 0, 1}^{|W| \u00d7 |T|} denote the\nmatrix representing the labels assigned by the workers to the tasks, i.e. L_ij = w_i(t_j). Given L, the\ngoal is to infer the true labels y_j of the tasks.\nWorker model. We consider the setting in which workers may be honest or adversarial. That is,\nW = H \u222a A with H \u2229 A = \u2205. Honest workers are assumed to provide labels according to a\nprobabilistic model: for task t_j with true label y_j, worker w_i provides label y_j with probability p_i\nand -y_j with probability 1 - p_i. Note that the parameter p_i doesn\u2019t depend on the particular task that\nthe worker is labeling, so an honest worker labels each task independently. It is standard to de\ufb01ne the\nreliability of an honest worker as \u00b5_i = 2p_i - 1, so that we have \u00b5_i \u2208 [-1, 1]. Further, we assume that\nthe honest workers are sampled from a population with average reliability \u00b5 > 0. Adversaries, on\nthe other hand, may adopt any arbitrary (deterministic or randomized) labeling strategy that cannot\nbe described using the above probabilistic process. For instance, the adversary could always label\nall tasks as 1, irrespective of the true label. 
Another example is when the adversary decides her\nlabels based on existing labels cast by other workers (assuming the adversary has access to such\ninformation). Note however, that adversarial workers need not always provide the incorrect labels.\nEssentially, the presence of such workers breaks the assumptions of the model and can adversely\nimpact the performance of aggregation algorithms. Hence, our focus in this paper is on designing\nalgorithms to identify and \ufb01lter out such adversarial workers. Once this is achieved, we can use\nexisting state-of-the-art label aggregation algorithms on the remaining labels to infer the true labels\nof the tasks.\nTo identify these adversarial workers, we propose an algorithm for computing \u201creputation\u201d or \u201ctrust\u201d\nscores for each worker. More concretely, we assign penalties (in a suitable way) to every worker:\nthe higher the penalty, the worse the reputation of the worker. First note that any task that has all 1 labels\n(or all -1 labels) does not provide us with any information about the reliabilities of the workers who\nlabeled the task. Hence, we focus on the tasks that have both 1 and -1 labels and we call this set\nthe con\ufb02ict set T_cs. Further, since we have no \u201cside\u201d information on the identities of workers, any\nreputation score computation must be based solely on the labels provided by the workers.\nWe start with the following basic idea to compute reputation scores: a worker is penalized for every\n\u2018con\ufb02ict\u2019 s/he is involved in (a task in the con\ufb02ict set that the worker has labeled). This idea is\nmotivated by the fact that in an ideal scenario, where all honest workers have a reliability \u00b5_i = 1,\na con\ufb02ict indicates the presence of an adversary and the reputation score aims to capture a measure\nof the number of con\ufb02icts each worker is involved in: the higher the number of con\ufb02icts, the worse\nthe reputation score. 
However, a straightforward aggregation of penalties for each worker may over-\npenalize (honest) workers who label several tasks.\nIn order to overcome the issue of over-penalizing (honest) workers, we propose two techniques:\n(a) soft and (b) hard assignment of penalties. In the soft assignment of penalties (Algorithm 1),\nwe assign a penalty of 1/d_j^+ to all workers who label 1 on task t_j and 1/d_j^- to all workers who\nlabel -1 on task t_j. Then, for each worker, we aggregate the penalties across all assigned tasks by\ntaking the average. The above assignment of penalties implicitly rewards agreements by making\nthe penalty inversely proportional to the number of other workers that agree with a worker. Further,\ntaking the average normalizes for the number of tasks labeled by the worker. Since we expect the\nhonest workers to agree with the majority more often than not, we expect this technique to assign\nlower penalties to honest workers when compared to adversaries. The soft assignment of penalties\ncan be shown to perform quite well in identifying low reliability and adversarial workers (refer\nto Theorems 1 and 2). However, it may still be subject to manipulation by more \u201csophisticated\u201d\nadversaries who can adapt and modify their labeling strategy to target certain tasks and to in\ufb02ate\nthe penalty of speci\ufb01c honest workers. 
In fact for such worst-case adversaries, we can show that\n(Theorem 3) given any honest worker labeling pattern, there exists a lower bound on the number of\ntasks whose true label cannot be inferred correctly, by any label aggregation algorithm.\nTo address the case of these sophisticated adversaries, we propose a hard penalty assignment scheme\n(Algorithm 2) where the key idea is not to distribute the penalty evenly across all workers, but to\nonly choose two workers to penalize per con\ufb02ict task: one \u201crepresentative\u201d worker among those\nwho labeled 1 and another \u201crepresentative\u201d worker among those who labeled -1. While choosing\nsuch workers, the goal is to pick these representative workers in a load-balanced manner to \u201cspread\u201d\nthe penalty across all workers, so that it is not concentrated on one/few workers. The \ufb01nal penalty of\neach worker is the sum of the accrued penalties across all the (con\ufb02ict) tasks assigned to the worker.\nIntuitively, such hard assignment of penalties will penalize workers with higher degrees and many\ncon\ufb02icts (who are potential \u2018worst-case\u2019 adversaries), limiting their impact.\nWe use the concept of optimal semi-matchings [7] on bipartite graphs to distribute penalties in a\nload balanced manner, which we brie\ufb02y discuss here. For a bipartite graph B = (U, V, E), a semi-\nmatching in B is a set of edges M \u2286 E such that each vertex in V is incident to exactly one edge in\nM (note that vertices in U could be incident to multiple edges in M). A semi-matching generalizes\nthe notion of matchings on bipartite graphs. To de\ufb01ne an optimal semi-matching, we introduce a cost\nfunction for a semi-matching - for each u \u2208 U, let deg_M(u) denote the number of edges in M that\nare incident to u and let cost_M(u) be de\ufb01ned as\n\ncost_M(u) = \u2211_{i=1}^{deg_M(u)} i = deg_M(u)(deg_M(u)+1)/2.\n\nAn optimal semi-matching then, is one which minimizes \u2211_{u \u2208 U} cost_M(u). This notion of cost is\nmotivated by the load balancing problem for scheduling tasks on machines (refer to [7] for further\ndetails). Intuitively, an optimal semi-matching fairly matches the V-vertices across the U-vertices\nso that the maximum \u201cload\u201d on any U-vertex is minimized.\n\nAlgorithm 1 SOFT PENALTY\n1: Input: W, T and L\n2: For every task t_j \u2208 T_cs, assign penalty s_ij to each worker w_i \u2208 W_j as follows:\n   s_ij = 1/d_j^+ if L_ij = 1\n   s_ij = 1/d_j^- if L_ij = -1\n3: Output: Penalty of worker w_i\n   pen(w_i) = (\u2211_{t_j \u2208 T_i \u2229 T_cs} s_ij) / |T_i \u2229 T_cs|\n\nAlgorithm 2 HARD PENALTY\n1: Input: W, T and L\n2: Create a bipartite graph B_cs as follows:\n   (i) Each worker w_i \u2208 W is represented by a node on the left\n   (ii) Each task t_j \u2208 T_cs is represented by two nodes on the right, t_j^+ and t_j^-\n   (iii) Add the edge (w_i, t_j^+) if L_ij = 1 or edge (w_i, t_j^-) if L_ij = -1.\n3: Compute an optimal semi-matching OSM on B_cs and let d_i \u2190 degree of worker w_i in OSM\n4: Output: Penalty of worker w_i: pen(w_i) = d_i\n\n3 Theoretical Results\n\nSoft penalty. We focus on (l, r)-regular worker-task assignment graphs in which every worker\nis assigned l tasks and every task is labeled by r workers. The performance of our reputation\nalgorithms depends on the reliabilities of the workers as well as the true labels of the tasks. Hence,\nwe consider the following probabilistic model: for a given (l, r)-regular worker-task assignment\ngraph G, the reliabilities of the workers and the true labels of tasks are sampled independently (from\ndistributions described in Section 2). We then analyze the performance of our algorithms as the\ntask degree r (and hence number of workers |W|) goes to in\ufb01nity. Speci\ufb01cally, we establish the\nfollowing results (the proofs of all theorems are in the supplementary material).\nTheorem 1. Suppose there are no adversarial workers, i.e. A = \u2205 and that the worker-task assign-\nment graph G is (l, r)-regular. 
Then, with high probability as r \u2192 \u221e, for any pair of workers w_i\nand w_{i'}, \u00b5_i < \u00b5_{i'} \u21d2 pen(w_i) > pen(w_{i'}), i.e. higher reliability workers are assigned lower\npenalties by Algorithm 1.\n\nThe probability in the above theorem is according to the model described above. Note that the the-\norem justi\ufb01es our de\ufb01nition of the reputation scores by establishing their consistency with worker\nreliabilities in the absence of adversarial workers. Next, consider the setting in which adversarial\nworkers adopt the following uniform strategy: label 1 on all assigned tasks (the -1 case is symmet-\nric).\nTheorem 2. Suppose that the worker-task assignment graph G is (l, r)-regular. Let the probability\nof an arbitrary worker being honest be q and suppose each adversary adopts the uniform strategy\nin which she labels 1 on all the tasks assigned to her. Denote an arbitrary honest worker as h_i and\nany adversary as a. Then, with high probability as r \u2192 \u221e, we have\n\n1. If \u03b3 = 1/2 and \u00b5_i = 1, then pen(h_i) < pen(a) if and only if q > 1/(1+\u00b5)\n\n2. If \u03b3 = 1/2 and q > 1/(1+\u00b5), then pen(h_i) < pen(a) if and only if\n\n\u00b5_i \u2265 [(2 - q)(1 - q - q^2 \u00b5^2) - q^2 \u00b5^2] / [(2 - q)q + q^2 \u00b5^2]\n\nThe above theorem establishes that when adversaries adopt the uniform strategy, the soft-penalty\nalgorithm assigns lower penalties to honest workers whose reliability exceeds a threshold, as long as\nthe fraction of honest workers is \u201clarge enough\u201d. Although not stated, the result above immediately\nextends (with a modi\ufb01ed lower bound for \u00b5_i) to the case when \u03b3 > 1/2, which corresponds to\nadversaries adopting smart strategies by labeling based on the prevalence of positive tasks.\nSophisticated adversaries. Numerous real-world incidents show evidence of malicious worker be-\nhavior in online systems [18, 22, 12]. Moreover, attacks on the training process of ML models\nhave also been studied [15, 1]. 
Recent work [21] has also shown the impact of powerful adversar-\nial attacks by administrators of crowdtur\ufb01ng (malicious crowdsourcing) sites. Motivated by these\nexamples, we consider sophisticated adversaries:\nDe\ufb01nition 1. Sophisticated adversaries are computationally unbounded and colluding. Further,\nthey have knowledge of the labels provided by the honest workers and their goal is to maximize the\nnumber of tasks whose true label is incorrectly identi\ufb01ed.\n\nWe now raise the following question: In the presence of sophisticated adversaries, does there exist\na fundamental limit on the number of tasks whose true label can be correctly identi\ufb01ed, irrespective\nof the aggregation algorithm employed to aggregate the worker labels?\nIn order to answer the above question precisely, we introduce some notation. Let n = |W| and\nm = |T|. Then, we represent any label aggregation algorithm as a decision rule R : L \u2192 {-1, 1}^m,\nwhich maps the observed labeling matrix L to a set of output labels for each task. Because of the\nabsence of any auxiliary information about the workers or the tasks, the class of decision rules, say\nC, is invariant to permutations of the identities of workers and/or tasks. More precisely, C denotes\nthe class of decision rules that satisfy R(PLQ) = R(L)Q, for any n \u00d7 n permutation matrix P and\nm \u00d7 m permutation matrix Q. We say that a task is affected if the decision rule outputs the incorrect\nlabel for the task. We de\ufb01ne the quality of a decision rule R(\u00b7) as the worst-case number of affected\ntasks over all possible true labelings and adversary strategies with a \ufb01xed honest worker labeling\npattern. Fixing the honest worker labeling pattern allows isolation of the effect of the adversary\nstrategy on the accuracy of the decision rule. 
Considering the worst-case over all possible true\nlabels makes the metric robust to ground-truth assignments, which are typically application speci\ufb01c.\nNext, to formally de\ufb01ne the quality, let B_H denote the honest worker-task assignment graph and\ny = (y_1, y_2, . . . , y_m) denote the vector of true labels for the tasks. Note that since the number of\naffected tasks also depends on the actual honest worker labels, we further assume that all honest\nworkers have reliability \u00b5_i = 1, i.e. they always label correctly. This assumption allows us to\nattribute any mis-identi\ufb01cation of true labels to the presence of adversaries because, otherwise, in\nthe absence of any adversaries, the true labels of all the tasks can be trivially identi\ufb01ed. Finally, let\nS_k denote the strategy space of k adversaries, where each strategy \u03c3 \u2208 S_k speci\ufb01es the k \u00d7 m label\nmatrix provided by the adversaries. Since we do not restrict the adversary strategy in any way, it\nfollows that S_k = {-1, 0, 1}^{k \u00d7 m}. The quality of a decision rule is then de\ufb01ned as\n\nAff(R, B_H, k) = max_{\u03c3 \u2208 S_k, y \u2208 {-1,1}^m} |{t_j \u2208 T : R^{y,\u03c3}_{t_j} \u2260 y_j}|,\n\nwhere R^{y,\u03c3}_t \u2208 {-1, 1} is the label output by the decision rule R for task t when the true label vector\nis y and the adversary strategy is \u03c3. Note that our notation Aff(R, B_H, k) makes the dependence of\nthe quality measure on the honest worker-task assignment graph B_H and the number of adversaries k\nexplicit. We answer the question raised above in the af\ufb01rmative, i.e. there does exist a fundamental\nlimit on identi\ufb01cation. In the theorem below, PreIm(T') is the set of honest workers who label\nat least one task in T'.\nTheorem 3. Suppose that k = |A| and \u00b5_h = 1 for all honest workers h \u2208 H. 
Then, given\nany honest worker-task assignment graph B_H, there exists an adversary strategy \u03c3* \u2208 S_k that is\nindependent of any decision rule R \u2208 C such that\n\nL \u2264 max_{y \u2208 {-1,1}^m} Aff(R, \u03c3*, y) \u2200 R \u2208 C, where\n\nL = (1/2) max_{T' \u2286 T : |PreIm(T')| \u2264 k} |T'|,\n\nand Aff(R, \u03c3*, y) denotes the number of affected tasks under adversary strategy \u03c3*, decision rule\nR, and true label vector y (with the assumption that max over an empty set is zero).\n\nWe describe the main idea of the proof which proceeds in two steps: (i) we provide an explicit\nconstruction of an adversary strategy \u03c3* that depends on B_H and y, and (ii) we show the existence of\nanother true labeling \u0177 such that R outputs exactly the same labels in both scenarios. The adversary\nlabeling strategy we construct uses the idea of indistinguishability, which captures the fact that by\ncarefully choosing their labels, the adversaries can render themselves indistinguishable from honest\nworkers. Speci\ufb01cally, in the simple case when there is only one honest worker, the adversary simply\nlabels the opposite of the honest worker on all assigned tasks, so that each task has two labels of\nopposite parity. It can be argued that since there is no other information, it is impossible for any\ndecision rule R \u2208 C to distinguish the honest worker from the adversary and hence identify the\ntrue label of any task (better than a random guess). We extend this to the general case, where the\nadversary \u201ctargets\u201d at most k honest workers and derives a strategy based on the subgraph of B_H\nrestricted to the targeted workers. The resultant strategy can be shown to result in incorrect labels\nfor at least L tasks for some true labeling of the tasks.\nHard penalty. We now analyze the performance of the hard penalty-based reputation algorithm in\nthe presence of sophisticated adversarial workers. 
For the purposes of the theorem, we consider a\nnatural extension of our reputation algorithm to also perform label aggregation (see Figure 1).\nTheorem 4. Suppose that k = |A| and \u00b5_i = 1 for each honest worker, i.e. an honest worker always\nprovides the correct label. Further, let d_1 \u2265 d_2 \u2265 \u00b7\u00b7\u00b7 \u2265 d_{|H|} denote the degrees of the honest\nworkers in the optimal semi-matching on B_H. For any true labeling of the tasks and under the\npenalty-based label aggregation algorithm (with the convention that d_i = 0 for i > |H|):\n1. There exists an adversary strategy \u03c3* such that the number of tasks affected \u2265 \u2211_{i=1}^{k-1} d_i.\n2. No adversary strategy can affect more than U tasks where\n(a) U = \u2211_{i=1}^{k} d_i, when at most one adversary provides correct labels\n(b) U = \u2211_{i=1}^{2k} d_i, in the general case\n\nA few remarks are in order. First, it can be shown [7] that for optimal semi-matchings, the degree\nsequence is unique and therefore, the bounds in the theorem above are uniquely de\ufb01ned given B_H.\nAlso, the assumption that \u00b5_i = 1 is required for analytical tractability; proving theoretical guaran-\ntees in crowd-sourced settings (even without adversaries) for general graph structures is notoriously\nhard [2]. Note that the result of Theorem 4 provides both a lower and upper bound for the number\nof tasks that can be affected by k adversaries when using the penalty-based aggregation algorithm.\nThe characterization we obtain is reasonably tight for the case when at most 1 adversary provides\ncorrect labels; in this case the gap between the upper and lower bounds is d_k, which can be \u201csmall\u201d\nfor k large enough. However, our characterization is loose in the general case when adversaries can\nlabel arbitrarily; here the gap is \u2211_{i=k}^{2k} d_i which we attribute to our proof technique and conjecture\nthat the upper bound of \u2211_{i=1}^{k} d_i also applies in the more general case.\n\nAlgorithm | Random Low | Random High | Malicious Low | Malicious High | Uniform Low | Uniform High\nMV | 9.9 | 7.9 | 15.6 | 16.8 | 26.0 | 15.0\nEM | -1.9 | 6.3 | -1.6 | -49.4 | -1.2 | -9.1\nKOS | -4.3 | 13.1 | -8.3 | -98.7 | -6.5 | 12.9\nKOS+ | -3.9 | 7.3 | -8.3 | -69.6 | -6.0 | 10.7\nPRECISION | 81.7 | 82.1 | 92.5 | 59.4 | 80.8 | 62.4\nBEST | MV-SOFT | MV-HARD | MV-SOFT | KOS | MV-SOFT | MV-HARD\n\nPENALTY-BASED AGGREGATION\nw_{t+}, w_{t-} \u2190 workers that the nodes t+ and t- of task t are mapped to in OSM in Algorithm 2\nOutput y(t) = 1 if d_{w_{t+}} < d_{w_{t-}},\ny(t) = -1 if d_{w_{t+}} > d_{w_{t-}} and\ny(t) = 0 otherwise\n(here y refers to the label of the task and d_w refers to the degree of worker w in OSM)\n\nFigure 1: Left: Percentage decrease in incorrect tasks on synthetic data (negative indicates increase in incorrect\ntasks). We implemented both SOFT and HARD and report the best outcome. Also reported is the precision\nwhen removing 15 workers with the highest penalties. The columns specify the three types of adversaries\nand High/Low indicates the degree bias of the adversaries. The probability that a worker is honest q was\nset to 0.7 and the prevalence \u03b3 of positive tasks was set to 0.5. The numbers reported are an average over\n100 experimental runs. The last row lists the combination with the best accuracy in each case. Right: The\npenalty-based label aggregation algorithm.\n\n4 Experiments\nIn this section, we evaluate the performance of our reputation algorithms on both synthetic and real\ndatasets. We consider the following popular label aggregation algorithms: (a) simple majority voting\nMV (b) the EM algorithm [3] (c) the BP-based KOS algorithm [9] and (d) KOS+, a normalized\nversion of KOS that is amenable to arbitrary graphs (KOS is designed for random regular graphs),\nand compare their accuracy in inferring the true labels of the tasks, when implemented in conjunc-\ntion with our reputation algorithms. 
We implemented iterative versions of Algorithms 1 (SOFT) and\n2 (HARD), where in each iteration we \ufb01ltered out the worker with the highest penalty and recomputed\npenalties for the remaining workers.\nSynthetic Dataset. We analyzed the performance of our soft penalty-based reputation algorithm on\n(l, r)-regular graphs in Section 3. In many practical scenarios, however, the worker-task assignment\ngraph forms organically where the workers decide which tasks to label. To consider this case, we\nsimulated a setup of 100 workers with a power-law distribution for worker degrees to generate the\nbipartite worker-task assignment graph. We assume that an honest worker always labels correctly\n(the results are qualitatively similar when honest workers make errors with small probability) and\nconsider three notions of adversaries: (a) random - who label each task 1 or -1 with prob. 1/2\n(b) malicious - who always label incorrectly and (c) uniform - who label 1 on all tasks. Also,\nwe consider both cases - one where the adversaries are biased to have high degrees and the other\nwhere they have low degrees. Further, we arbitrarily decided to remove 15% of the workers with the\nhighest penalties and we de\ufb01ne precision as the percentage of workers \ufb01ltered who were adversarial.\nFigure 1 shows the performance improvement of the different benchmarks in the presence of our\nreputation algorithm.\nWe make a few observations. First, we are successful in identifying random adversaries as well as\nlow-degree malicious and uniform adversaries (precision > 80%). This shows that our reputation\nalgorithms also perform well when worker-task assignment graphs are non-regular, complementing\nthe theoretical results (Theorems 1 and 2) for regular graphs. Second, our \ufb01ltering algorithm can\nresult in signi\ufb01cant reduction (up to 26%) in the fraction of incorrect tasks. In fact, in 5 out of 6 cases,\nthe best performing algorithm incorporates our reputation algorithm. 
Note that since 15 workers are filtered out, labels from fewer workers are used to infer the true labels of the tasks. Despite using fewer labels, we improve performance because the high precision of our algorithms results in mostly adversaries being filtered out. Third, we note that when the adversaries are malicious and have high degrees, the removal of 15 workers degrades the performance of the KOS (and KOS+) and EM algorithms. We attribute this to the fact that while KOS and EM are able to automatically invert the malicious labels, we discard these labels, which hurts performance since the adversaries have high degrees. Finally, note that the SOFT (HARD) penalty algorithm tends to perform better when adversaries are biased towards low (high) degrees, and this insight can be used to aid the choice of the reputation algorithm to be employed in different scenarios.\nReal Datasets. Next, we evaluated our algorithm on some standard datasets: (a) TREC (http://sites.google.com/site/treccrowd/home): a collection of topic-document pairs labeled as relevant or non-relevant by workers on AMT. We consider two versions: stage2 and task2. (b) NLP [17]: consists of annotations by AMT workers for different NLP tasks: (1) rte - which provides binary judgments for textual entailment, i.e. whether one sentence can be inferred from another, and (2) temp - which provides binary judgments for temporal ordering of events. (c) bluebird [23]: contains judgments differentiating two kinds of birds in an image. Table 1 reports the best accuracy achieved when up to 10 workers are filtered using our iterative reputation algorithms.\n\nDataset     MV                          EM                          KOS                         KOS+\n            Base  Soft      Hard        Base  Soft      Hard        Base  Soft      Hard        Base  Soft      Hard\nrte         91.9  92.1(8)   92.5(3)     92.7  92.7      93.3(9)     49.7  88.8(9)   91.6(10)    91.3  92.7(8)   92.8(10)\ntemp        93.9  93.9      94.3(5)     94.1  94.1      94.1        56.9  69.2(4)   93.7(3)     93.9  94.3(7)   94.3(1)\nbluebird    75.9  75.9      75.9        89.8  89.8      89.8        72.2  75.9(3)   72.2        72.2  75.9(3)   72.2\nstage2      74.1  74.1      81.4(3)     64.7  65.3(6)   78.9(2)     74.5  74.5      75.2(3)     75.5  76.6(2)   77.2(3)\ntask2       64.3  64.3      68.4(10)    66.8  66.8      67.3(9)     57.4  57.4      66.7(10)    59.3  59.4(4)   67.9(9)\naggregate   80.0  80.0      82.5        81.6  81.7      84.7        62.1  73.2      79.9        78.4  79.8      80.9\n\nTable 1: Percentage accuracy of benchmark algorithms when combined with our reputation algorithms. For each benchmark, the best performing combination is shown in bold. The number in the parentheses represents the number of workers filtered by our reputation algorithm (an absence indicates that no performance improvement was achieved while removing up to 10 workers with the highest penalties).\n\nThe main conclusion we draw is that our reputation algorithms are able to boost the performance of state-of-the-art aggregation algorithms by a significant amount across the datasets: the average improvement is 2.5% for MV and KOS+, 3% for EM, and almost 18% for KOS, when using the hard penalty-based reputation algorithm. Second, we can elevate the performance of simpler algorithms such as KOS and MV to the levels of the more general (and in some cases, complicated) EM algorithm. 
The KOS algorithm, for instance, is not only easier to implement, but also has tight theoretical guarantees when the underlying assignment graph is sparse random regular, and further is robust to different initializations [9]. The modified version KOS+ can be used in graphs where worker degrees are skewed, but we are still able to enhance its accuracy. Our results provide evidence that existing random models are inadequate in capturing the behavior of workers in real-world datasets, necessitating a more general approach. Third, note that the hard penalty-based algorithm outperforms the soft version across the datasets. Since the hard penalty algorithm works well when adversaries have higher degrees (a fact noticed in the simulation results above), this suggests the presence of high-degree adversarial workers in these datasets. Finally, our algorithm was successful in identifying the following types of \u201cadversaries\u201d: (1) uniform workers who always label +1 or -1 (in temp, task2, stage2), (2) malicious workers who mostly label incorrectly (in bluebird, stage2) and (3) random workers who label each task independent of its true label (such workers were identified as \u201cspammers\u201d in [13]). Therefore, our algorithm is able to identify a broad set of adversary strategies in addition to those detected by existing techniques.\n\n5 Conclusions\nThis paper analyzes the problem of inferring true labels of tasks in crowdsourcing systems against a broad class of adversarial labeling strategies. The main contribution is the design of a reputation-based worker filtering algorithm that uses a combination of disagreement-based penalties and optimal semi-matchings to identify adversarial workers. We show that our reputation scores are consistent with the existing notion of worker reliabilities and further can identify adversaries that employ deterministic labeling strategies. 
Empirically, we show that our algorithm can be applied to real crowdsourced datasets to enhance the accuracy of existing label aggregation algorithms. Further, we analyze the scenario of worst-case adversaries and establish lower bounds on the minimum \u201cdamage\u201d achievable by the adversaries.\n\nAcknowledgments\nWe thank the anonymous reviewers for their valuable feedback. Ashwin Venkataraman was supported by the Center for Technology and Economic Development (CTED).\n\nReferences\n[1] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.\n\n[2] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In Proceedings of the 22nd international conference on World Wide Web, pages 285\u2013294. International World Wide Web Conferences Steering Committee, 2013.\n\n[3] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied Statistics, pages 20\u201328, 1979.\n\n[4] G. Demartini, D. E. Difallah, and P. Cudr\u00e9-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st international conference on World Wide Web, pages 469\u2013478. ACM, 2012.\n\n[5] J. S. Downs, M. B. Holbrook, S. Sheng, and L. F. Cranor. Are your participants gaming the system?: screening mechanical turk workers. In Proceedings of the 28th international conference on Human factors in computing systems, pages 2399\u20132402. ACM, 2010.\n\n[6] A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM conference on Electronic commerce, pages 167\u2013176. ACM, 2011.\n\n[7] N. J. Harvey, R. E. Ladner, L. Lov\u00e1sz, and T. Tamir. Semi-matchings for bipartite graphs and load balancing. In Algorithms and data structures, pages 294\u2013306. Springer, 2003.\n\n[8] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD workshop on human computation, pages 64\u201367. ACM, 2010.\n\n[9] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. Neural Information Processing Systems, 2011.\n\n[10] A. Kittur, E. H. Chi, and B. Suh. Crowdsourcing user studies with mechanical turk. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 453\u2013456. ACM, 2008.\n\n[11] J. Le, A. Edmonds, V. Hester, and L. Biewald. Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution.\n\n[12] K. Lee, P. Tamilarasan, and J. Caverlee. Crowdturfers, campaigns, and social media: tracking and revealing crowdsourced manipulation of social media. In 7th international AAAI conference on weblogs and social media (ICWSM), Cambridge, 2013.\n\n[13] V. C. Raykar and S. Yu. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. The Journal of Machine Learning Research, 13:491\u2013518, 2012.\n\n[14] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. The Journal of Machine Learning Research, 99:1297\u20131322, 2010.\n\n[15] B. I. Rubinstein, B. Nelson, L. Huang, A. D. Joseph, S.-h. Lau, S. Rao, N. Taft, and J. Tygar. Antidote: understanding and defending against poisoning of anomaly detectors. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, pages 1\u201314. ACM, 2009.\n\n[16] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of venus images. Advances in neural information processing systems, pages 1085\u20131092, 1995.\n\n[17] R. Snow, B. O\u2019Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast\u2014but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pages 254\u2013263. Association for Computational Linguistics, 2008.\n\n[18] N. Tran, B. Min, J. Li, and L. Subramanian. Sybil-resilient online content voting. In Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pages 15\u201328. USENIX Association, 2009.\n\n[19] J. Vuurens, A. P. de Vries, and C. Eickhoff. How much spam can you take? an analysis of crowdsourcing results to increase accuracy. In Proc. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR11), pages 21\u201326, 2011.\n\n[20] J. B. Vuurens and A. P. de Vries. Obtaining high-quality relevance judgments using crowdsourcing. Internet Computing, IEEE, 16(5):20\u201327, 2012.\n\n[21] G. Wang, T. Wang, H. Zheng, and B. Y. Zhao. Man vs. machine: Practical adversarial detection of malicious crowdsourcing workers. In 23rd USENIX Security Symposium, USENIX Association, CA, 2014.\n\n[22] G. Wang, C. Wilson, X. Zhao, Y. Zhu, M. Mohanlal, H. Zheng, and B. Y. Zhao. Serf and turf: crowdturfing for fun and profit. In Proceedings of the 21st international conference on World Wide Web, pages 679\u2013688. ACM, 2012.\n\n[23] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. Advances in Neural Information Processing Systems, 23:2424\u20132432, 2010.\n\n[24] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Advances in Neural Information Processing Systems, 22(2035-2043):7\u201313, 2009.\n", "award": [], "sourceid": 1310, "authors": [{"given_name": "Srikanth", "family_name": "Jagabathula", "institution": "NYU"}, {"given_name": "Lakshminarayanan", "family_name": "Subramanian", "institution": "NYU"}, {"given_name": "Ashwin", "family_name": "Venkataraman", "institution": "New York University"}]}