{"title": "Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 9, "abstract": "Crowdsourcing has gained immense popularity in machine learning applications for obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but suffers from the problem of low-quality data. To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize workers to answer only the questions that they are sure of and skip the rest. We show that surprisingly, under a mild and natural no-free-lunch requirement, this mechanism is the one and only incentive-compatible payment mechanism possible. We also show that among all possible incentive-compatible mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers. Interestingly, this unique mechanism takes a multiplicative form. The simplicity of the mechanism is an added benefit. In preliminary experiments involving several hundred workers, we observe a significant reduction in the error rates under our unique mechanism for the same or lower monetary expenditure.", "full_text": "Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing\n\nNihar B. Shah\nUniversity of California, Berkeley\nnihar@eecs.berkeley.edu\n\nDengyong Zhou\nMicrosoft Research\ndengyong.zhou@microsoft.com\n\nAbstract\n\nCrowdsourcing has gained immense popularity in machine learning applications for obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but suffers from the problem of low-quality data. To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize workers to answer only the questions that they are sure of and skip the rest. 
We show that surprisingly, under a mild and natural \u201cno-free-lunch\u201d requirement, this mechanism is the one and only incentive-compatible payment mechanism possible. We also show that among all possible incentive-compatible mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers. Interestingly, this unique mechanism takes a \u201cmultiplicative\u201d form. The simplicity of the mechanism is an added benefit. In preliminary experiments involving several hundred workers, we observe a significant reduction in the error rates under our unique mechanism for the same or lower monetary expenditure.\n\n1 Introduction\n\nComplex machine learning tools such as deep learning are gaining increasing popularity and are being applied to a wide variety of problems. These tools, however, require large amounts of labeled data [HDY+12, RYZ+10, DDS+09, CBW+10]. These large labeling tasks are being performed by coordinating crowds of semi-skilled workers through the Internet. This is known as crowdsourcing. Crowdsourcing as a means of collecting labeled training data has now become indispensable to the engineering of intelligent systems.\n\nMost workers in crowdsourcing are not experts. As a consequence, labels obtained from crowdsourcing typically have a significant amount of error [KKKMF11, VdVE11, WLC+10]. Recent efforts have focused on developing statistical techniques to post-process the noisy labels in order to improve their quality (e.g., [RYZ+10, ZLP+15, KOS11, IPSW14]). However, when the inputs to these algorithms are erroneous, it is difficult to guarantee that the processed labels will be reliable enough for subsequent use by machine learning or other applications. 
In order to avoid \u201cgarbage in, garbage out\u201d, we take a complementary approach to this problem: cleaning the data at the time of collection.\n\nWe consider crowdsourcing settings where the workers are paid for their services, such as in the popular crowdsourcing platforms of Amazon Mechanical Turk and others. These commercial platforms have gained substantial popularity due to their support for a diverse range of machine learning labeling tasks, varying from image annotation and text recognition to speech captioning and machine translation. We consider problems that are objective in nature, that is, that have a definite answer. Figure 1a depicts an example of such a question where the worker is shown a set of images, and for each image, the worker is required to identify if the image depicts the Golden Gate Bridge.\n\nFigure 1: Different interfaces in a crowdsourcing setup: (a) the conventional interface, and (b) with an option to skip.\n\nOur approach builds on the simple insight that in typical crowdsourcing setups, workers are paid simply in proportion to the number of tasks they complete. As a result, workers attempt to answer questions that they are not sure of, thereby increasing the error rate of the labels. For the questions that a worker is not sure of, her answers could be very unreliable [WLC+10, KKKMF11, VdVE11, JSV14]. To ensure acquisition of only high-quality labels, we wish to encourage the worker to skip the questions about which she is unsure, for instance, by providing an explicit \u201cI\u2019m not sure\u201d option for every question (see Figure 1b). Our goal is to develop payment mechanisms that encourage the worker to select this option when she is unsure. 
We will term any payment mechanism that incentivizes the worker to do so as \u201cincentive compatible\u201d.\n\nIn addition to incentive compatibility, preventing spammers is another desirable requirement for incentive mechanisms in crowdsourcing. Spammers are workers who answer randomly, without regard to the question being asked, in the hope of earning some free money, and are known to exist in large numbers on crowdsourcing platforms [WLC+10, Boh11, KKKMF11, VdVE11]. It is thus of interest to deter spammers by paying them as little as possible. An intuitive objective, to this end, is to ensure a zero expenditure on spammers who answer randomly. In this paper, however, we impose a strictly and significantly weaker condition, and then show that there is one and only one incentive-compatible mechanism that can satisfy this weak condition. Our requirement, referred to as the \u201cno-free-lunch\u201d axiom, says that if all the questions attempted by the worker are answered incorrectly, then the payment must be zero.\n\nWe propose a payment mechanism for the aforementioned setting (\u201cincentive compatibility\u201d plus \u201cno-free-lunch\u201d), and show that surprisingly, this is the only possible mechanism. We also show that, additionally, our mechanism makes the smallest possible payment to spammers among all possible incentive-compatible mechanisms that may or may not satisfy the no-free-lunch axiom. Our payment mechanism takes a multiplicative form: the worker\u2019s response to each question is given a certain score, and the final payment is the product of these scores. This mechanism has additional appealing features in that it is simple to compute, and is also simple to explain to the workers. 
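The multiplicative form described above can be sketched in code. The specific per-question scores used below (1/T for a correct answer, 1 for a skip, 0 for a wrong answer, applied to a base amount of mu * T^G so that a worker who answers everything correctly earns the full budget mu) are an illustrative assumption consistent with the "double-or-nothing" name for T = 1/2, not a specification quoted from this excerpt:

```python
def multiplicative_payment(evaluations, mu, T):
    """Illustrative multiplicative payment rule (assumed scores).

    evaluations -- per gold-standard-question evaluations:
                   +1 (correct), 0 (skipped), -1 (wrong)
    mu          -- budget: the maximum possible payment
    T           -- confidence threshold, 0 < T < 1
    """
    G = len(evaluations)
    payment = mu * T ** G  # base amount; all-correct scales this back up to mu
    for x in evaluations:
        if x == +1:
            payment *= 1.0 / T   # correct answer: score 1/T
        elif x == -1:
            return 0.0           # any wrong answer zeroes the payment
        # skipped question (x == 0): score 1, payment unchanged
    return payment
```

For T = 1/2 this is literally "double or nothing": each correct answer doubles the running amount and a single wrong answer forfeits everything, while skips leave the amount unchanged.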
Our mechanism is applicable to any type of objective questions, including multiple-choice annotation questions, transcription tasks, etc.\n\nIn order to test whether our mechanism is practical, and to assess the quality of the final labels obtained, we conducted experiments on the Amazon Mechanical Turk crowdsourcing platform. In our preliminary experiments, which involved several hundred workers, we found that the quality of the data improved two-fold under our unique mechanism, with the total monetary expenditure being the same or lower as compared to the conventional baseline.\n\n2 Problem Setting\n\nIn the crowdsourcing setting that we consider, one or more workers perform a task, where a task consists of multiple questions. The questions are objective, by which we mean that each question has precisely one correct answer. Examples of objective questions include multiple-choice classification questions such as in Figure 1, questions on transcribing text from audio or images, etc.\n\nFor any possible answer to any question, we define the worker\u2019s confidence about an answer as the probability, according to her belief, of this answer being correct. In other words, one can assume that the worker has (in her mind) a probability distribution over all possible answers to a question, and the confidence for an answer is the probability of that answer being correct. As a shorthand, we also define the confidence about a question as the confidence for the answer that the worker is most confident about for that question. We assume that the worker\u2019s confidences for different questions are independent. Our goal is that for every question, the worker should be incentivized to:\n\n1. skip if the confidence is below a certain pre-defined threshold, and otherwise\n2. select the answer that she is most confident about.\n\nMore formally, let T \u2208 (0, 1) be a predefined value. 
The goal is to design payment mechanisms that incentivize the worker to skip the questions for which her confidence is lower than T, and attempt those for which her confidence is higher than T.[1] Moreover, for the questions that she attempts to answer, she must be incentivized to select the answer that she believes is most likely to be correct. The threshold T may be chosen based on various factors of the problem at hand, for example, the downstream machine learning algorithms that will use the crowdsourced data, or knowledge of the statistics of worker abilities. In this paper we assume that the threshold T is given to us.\n\nLet N denote the total number of questions in the task. Among these, we assume the existence of some \u201cgold standard\u201d questions, that is, a set of questions whose answers are known to the requester. Let G (1 \u2264 G \u2264 N) denote the number of gold standard questions. The G gold standard questions are assumed to be distributed uniformly at random in the pool of N questions (of course, the worker does not know which G of the N questions form the gold standard). The payment to a worker for a task is computed after receiving her responses to all the questions in the task. The payment is based on the worker\u2019s performance on the gold standard questions. Since the payment is based on known answers, the payments to different workers do not depend on each other, thereby allowing us to consider the presence of only one worker without any loss in generality.\n\nWe will employ the following standard notation. For any positive integer K, the set {1, . . . , K} is denoted by [K]. The indicator function is denoted by 1, i.e., 1{z} = 1 if z is true, and 0 otherwise. The notation R+ denotes the set of all non-negative real numbers.\n\nLet x1, . . . , xG \u2208 {-1, 0, +1} denote the evaluations of the answers that the worker gives to the G gold standard questions. 
Here, \u201c0\u201d denotes that the worker skipped the question, \u201c-1\u201d denotes that the worker attempted to answer the question and that answer was incorrect, and \u201c+1\u201d denotes that the worker attempted to answer the question and that answer was correct. Let f : {-1, 0, +1}^G \u2192 R+ denote the payment function, namely, a function that determines the payment to the worker based on these evaluations x1, . . . , xG. Note that the crowdsourcing platforms of today mandate the payments to be non-negative. We will let \u00b5 (> 0) denote the budget, i.e., the maximum amount that can be paid to any individual worker for this task:\n\nmax_{x1, . . . , xG} f(x1, . . . , xG) = \u00b5.\n\nThe amount \u00b5 is thus the amount of compensation paid to a perfect agent for her work. We will assume this budget condition of \u00b5 throughout the rest of the paper.\n\nWe assume that the worker attempts to maximize her overall expected payment. In what follows, the expression \u2018the worker\u2019s expected payment\u2019 will refer to the expected payment from the worker\u2019s point of view, and the expectation will be taken with respect to the worker\u2019s confidences about her answers and the uniformly random choice of the G gold standard questions among the N questions in the task. For any question i \u2208 [N], let yi = 1 if the worker attempts question i, and set yi = 0 otherwise. Further, for every question i \u2208 [N] such that yi \u2260 0, let pi be the confidence of the worker for the answer she has selected for question i, and for every question i \u2208 [N] such that yi = 0, let pi \u2208 (0, 1) be any arbitrary value. Let E = (\u03b51, . . . , \u03b5G) \u2208 {-1, +1}^G. Then from the worker\u2019s perspective, the expected payment for the selected answers and confidence levels is\n\n(1 / (N choose G)) \u2211_{(j1, . . . , jG) \u2282 {1, . . . , N}} \u2211_{E \u2208 {-1, +1}^G} f(\u03b51 yj1, . . . , \u03b5G yjG) \u220f_{i=1}^{G} (pji)^((1+\u03b5i)/2) (1 - pji)^((1-\u03b5i)/2).\n\nIn the expression above, the outermost summation corresponds to the expectation with respect to the randomness arising from the unknown choice of the gold standard questions. The inner summation corresponds to the expectation with respect to the worker\u2019s beliefs about the correctness of her responses.\n\n[1] In the event that the confidence about a question is exactly equal to T, the worker may be equally incentivized to answer or skip.\n\nWe will call any payment function f an incentive-compatible mechanism if the expected payment of the worker under this payment function is strictly maximized when the worker responds in the manner desired.[2]\n\n3 Main results: Incentive-compatible mechanism and guarantees\n\nIn this section, we present the main results of the paper, namely, the design of incentive-compatible mechanisms with practically useful properties. To this end, we impose the following natural requirement on the payment function f that is motivated by the practical considerations of budget constraints and discouraging spammers and miscreants [Boh11, KKKMF11, VdVE11, WLC+10]. We term this requirement the \u201cno-free-lunch axiom\u201d:\n\nAxiom 1 (No-free-lunch axiom). If all the answers attempted by the worker in the gold standard are wrong, then the payment is zero. More formally, for every set of evaluations (x1, . . . , xG) that satisfies \u2211_{i=1}^{G} 1{xi \u2260 0} = \u2211_{i=1}^{G} 1{xi = -1} > 0, we require the payment to satisfy f(x1, . . . , xG) = 0.\n\nObserve that no-free-lunch is an extremely mild requirement. In fact, it is significantly weaker than imposing a zero payment on workers who answer randomly. For instance, if the questions are of binary-choice format, then randomly choosing among the two options for each question would result in 50% of the answers being correct in expectation, while the no-free-lunch axiom is applicable only when none of them turns out to be correct.
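To make the formal setting concrete, the sketch below computes a worker's expected payment in the manner the expectation in Section 2 prescribes (averaging over the uniformly random size-G gold standard subset and over the worker's beliefs about correctness), and brute-force checks the no-free-lunch axiom. The particular payment rule used here (base mu * T^G, multiplied by 1/T per correct answer, 1 per skip, 0 per wrong answer) is an illustrative assumption, not a rule quoted from this excerpt:

```python
import itertools

def payment(evals, mu=1.0, T=0.5):
    """Assumed multiplicative rule, for illustration only."""
    amount = mu * T ** len(evals)
    for x in evals:
        amount *= {+1: 1.0 / T, 0: 1.0, -1: 0.0}[x]
    return amount

def expected_payment(y, p, G, f=payment):
    """Worker's expected payment: averaged over all size-G gold standard
    subsets of the N questions, and over the worker's beliefs (confidences
    p) about the correctness of her attempted answers.

    y[i] = 1 if question i is attempted, 0 if skipped.
    """
    N = len(y)
    subsets = list(itertools.combinations(range(N), G))
    total = 0.0
    for gold in subsets:
        # Inner expectation over correctness outcomes eps in {-1,+1}^G.
        for eps in itertools.product([-1, +1], repeat=G):
            prob = 1.0
            evals = []
            for e, j in zip(eps, gold):
                prob *= p[j] if e == +1 else 1.0 - p[j]
                evals.append(e * y[j])  # skipped questions evaluate to 0
            total += prob * f(evals)
    return total / len(subsets)

def satisfies_no_free_lunch(f, G):
    """Brute-force check of Axiom 1: whenever at least one question is
    attempted and every attempted answer is wrong, the payment is zero."""
    for x in itertools.product([-1, 0, +1], repeat=G):
        attempted = sum(xi != 0 for xi in x)
        if attempted > 0 and attempted == sum(xi == -1 for xi in x):
            if f(x) != 0:
                return False
    return True

# With threshold T = 0.5, skipping a question with confidence 0.4 (< T)
# yields a strictly higher expected payment than attempting it.
p = [0.9, 0.9, 0.4]
assert expected_payment([1, 1, 0], p, G=2) > expected_payment([1, 1, 1], p, G=2)
assert satisfies_no_free_lunch(payment, G=4)
```

Note how the expectation factorizes under the independence assumption: attempting a question contributes an expected per-question factor of p/T versus 1 for a skip, which is exactly why the threshold-T behavior is optimal under this assumed rule.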