{"title": "Fundamental Limits of Budget-Fidelity Trade-off in Label Crowdsourcing", "book": "Advances in Neural Information Processing Systems", "page_first": 5058, "page_last": 5066, "abstract": "Digital crowdsourcing (CS) is a modern approach to perform certain large projects using small contributions of a large crowd. In CS, a taskmaster typically breaks down the project into small batches of tasks and assigns them to so-called workers with imperfect skill levels. The crowdsourcer then collects and analyzes the results for inference and serving the purpose of the project. In this work, the CS problem, as a human-in-the-loop computation problem, is modeled and analyzed in an information theoretic rate-distortion framework. The purpose is to identify the ultimate fidelity that one can achieve by any form of query from the crowd and any decoding (inference) algorithm with a given budget. The results are established by a joint source channel (de)coding scheme, which represent the query scheme and inference, over parallel noisy channels, which model workers with imperfect skill levels. We also present and analyze a query scheme dubbed k-ary incidence coding and study optimized query pricing in this setting.", "full_text": "Fundamental Limits of Budget-Fidelity Trade-off in\n\nLabel Crowdsourcing\n\nElectrical Engineering Department, California Institute of Technology\n\nFarshad Lahouti\n\nlahouti@caltech.edu\n\nElectrical Engineering Department, California Institute of Technology\n\nBabak Hassibi\n\nhassibi@caltech.edu\n\nAbstract\n\nDigital crowdsourcing (CS) is a modern approach to perform certain large projects\nusing small contributions of a large crowd. In CS, a taskmaster typically breaks\ndown the project into small batches of tasks and assigns them to so-called workers\nwith imperfect skill levels. The crowdsourcer then collects and analyzes the results\nfor inference and serving the purpose of the project. 
In this work, the CS problem,\nas a human-in-the-loop computation problem, is modeled and analyzed in an\ninformation theoretic rate-distortion framework. The purpose is to identify the\nultimate fidelity that one can achieve by any form of query from the crowd and any\ndecoding (inference) algorithm with a given budget. The results are established\nby a joint source channel (de)coding scheme, which represent the query scheme\nand inference, over parallel noisy channels, which model workers with imperfect\nskill levels. We also present and analyze a query scheme dubbed k-ary incidence\ncoding and study optimized query pricing in this setting.\n\n1 Introduction\n\nDigital crowdsourcing (CS) is a modern approach to perform certain large projects using small\ncontributions of a large crowd. Crowdsourcing is usually used when the tasks involved may better suit\nhumans rather than machines or in situations where they require some form of human participation.\nAs such, crowdsourcing is categorized as a form of human-based computation or human-in-the-loop\ncomputation system. This article examines the fundamental performance limits of crowdsourcing\nand sheds light on the design of optimized crowdsourcing systems.\nCrowdsourcing is used in many machine learning projects for labeling of large sets of unlabeled\ndata and Amazon Mechanical Turk (AMT) serves as a popular platform to this end. Crowdsourcing\nis also useful in very subjective matters such as rating of different goods and services, as is now\nwidely popular in different online rating platforms and applications such as Yelp. Another example is\nthe classification of a large number of images as suitable or unsuitable for children. 
In so-called\ncitizen research projects, a large number of sensors, often deployed or operated by humans, contribute\nto a wide array of crowdsensing objectives, e.g., [2] and [3].\nIn crowdsourcing, a taskmaster typically breaks down the project into small batches of tasks, recruits\nso-called workers and assigns them the tasks accordingly. The crowdsourcer then collects and\nanalyzes the results collectively to address the purpose of the project. The workers\u2019 pay is often\nlow or non-existent. In cases such as labeling, the work is typically tedious and hence the workers\nusually handle only a small batch of work in a given project. The workers are often non-specialists\nand as such there may be errors in their completion of assigned tasks. Due to the nature of task\nassignment by the taskmaster, the workers and their skill levels are typically unknown a priori. In case\nof rating systems, such as Yelp, there is no pay for regular reviewers but only non-monetary personal\nincentives; however, there are illegal reviewers who are paid to write fake reviews. In many cases\nof crowdsourcing, the ground truth is not known at all. The transitory or fleeting characteristic of\nworkers, their unknown and imperfect skill levels and their possible motivations for spamming make\nthe design of crowdsourcing projects and the analysis of the obtained results particularly challenging.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: (a) An information theoretic crowdsourcing model; (b) 3IC for N = 2, valid responses; (c)\ninvalid responses.\n\nResearchers have studied the optimized design of crowdsourcing systems within the setting described\nfor enhanced reliability. Most research reported so far is devoted to optimized design and analysis\nof aggregation and inference schemes, possibly using redundant task assignment. 
In AMT-type\ncrowdsourcing, two popular approaches for aggregation are majority voting and the Dawid\nand Skene (DS) algorithm [5]. The former sets the estimate based on what the majority of the crowd\nagrees on, and is provably suboptimal [8]. Majority voting is susceptible to error when there are\nspammers in the crowd, as it weighs the opinion of everybody in the crowd the same. The DS\nalgorithm, within a probabilistic framework, aims at joint estimation of the workers\u2019 skill levels\nand a reliable label based on the data collected from the crowd. The scheme runs as an expectation\nmaximization (EM) formulation in an iterative manner. More recent studies with similar EM\nformulations and a variety of probabilistic models are reported in [9, 10, 12]. In [8], a label inference\nalgorithm for CS is presented that runs iteratively over a bipartite graph. In [1], the CS problem\nis posed as a so-called bandit survey problem for which the trade-offs of cost and reliability in the\ncontext of worker selection are studied. Schemes for identifying workers with low skill levels are studied\nin, e.g., [6]. In [13], an analysis of the DS algorithm is presented together with an improved inference\nalgorithm. Another class of works on crowdsourcing for clustering relies on convex optimization\nformulations of inferring clusters within probabilistic graphical models, e.g., [11] and the references\ntherein.\nIn this work, a crowdsourcing problem is modeled and analyzed in an information theoretic setting.\nThe purpose is to seek ultimate performance bounds, in terms of the CS budget (or equivalently the\nnumber of queries per item) and a CS fidelity, that one can achieve by any form of query from the\nworkers and any inference algorithm. 
Two particular scenarios of interest are the case where the\nworkers\u2019 skill levels are unknown both to the taskmaster and the crowdsourcer, and the case where\nthe skill levels are perfectly estimated during inference by the crowdsourcer. Within the presented\nframework, we also investigate a class of query schemes dubbed k-ary incidence coding and analyze\nits performance. At the end, we comment on an associated query pricing strategy.\n\n2 Modeling Crowdsourcing\n\nIn this Section, we present a communication system model for crowdsourcing. The model, as depicted\nin Figure 1a, then enables the analysis of the fundamental performance limits of crowdsourcing.\n\n2.1 Data Set: Source\nConsider a dataset X = {X1, . . . , XL} composed of L items, e.g., images. In practice, there is\na certain function B(X) \u2208 B(X ) of the items that is of interest in crowdsourcing and is here considered\nas the source. The value of this function is to be determined by the crowd for the given dataset. In the\ncase of crowdsourced clustering, B(Xi) = Bj \u2208 B(X ) = {B1, . . . , BN} indicates the bin or cluster\nto which the item Xi ideally belongs. We have B(X1, . . . , Xn) = B(X n) = (B(X1), . . . , B(Xn)).\nThe number of clusters, |B(X )| = N, may or may not be known a priori.\n\n2.2 Crowd Workers: Channels\n\nThe crowd is modeled by a set of parallel noisy channels in which each channel Ci, i = 1, . . . , W,\nrepresents the ith worker. The channel input is a query that is designed based on the source. The\nchannel output is what a worker perceives or gives as a response to a query. The output may or may not be the\ncorrect answer to the query depending on the skill level of the worker and hence the noisy channel is\nmeant to model possible errors by the worker.\nA suitable model for Ci is a discrete channel model. The channels may be assumed independent,\non the basis that different individuals have different knowledge sets. 
Related probabilistic models\nrepresenting the possible error in completion of a task by a worker are reviewed in [8]. Formally, a\nchannel (worker) is represented by a probability distribution P (v|u), u \u2208 U, v \u2208 V, where U is the\nset of possible responses to a query and V is the set of choices offered to the worker in responding\nto a query. For the example of images suitable for children, in general we may consider a spectrum of\npossible responses to the query, U, including the extremes of totally suitable and totally unsuitable. As\npossible choices offered to the worker to answer the query, V, we may consider the two options of suitable\nand unsuitable. As described below, in this work we consider two channel models representing\npossibly erroneous responses of the workers: an M-ary symmetric channel model (MSC) and a\nspammer-hammer channel model (SHC).\nAn MSC model with parameter $\\epsilon$ is a symmetric discrete memoryless channel without feedback [4],\nwith input u \u2208 U and output v \u2208 V (|U| = |V| = M), that is characterized by the following\ntransition probability\n\n$$P(v|u) = \\begin{cases} \\dfrac{\\epsilon}{M-1}, & v \\neq u \\\\ 1-\\epsilon, & v = u. \\end{cases} \\qquad (1)$$\n\nIf we consider a sequence of channel inputs $u^n = (u_1, \\ldots, u_n)$ and the corresponding output\nsequence $v^n$, we have $P(v^n|u^n) = \\prod_{i=1}^{n} P(v_i|u_i)$, which holds because of the memoryless and no\nfeedback assumptions. In case of clustering and MSC, the probability of misclassifying any input\nfrom a given cluster in another cluster only depends on the worker and not the corresponding clusters.\nIn the spammer-hammer channel model with the probability of being a hammer of q (SHC(q)), a\nspammer randomly and uniformly chooses a valid response to a query, and a hammer perfectly\nanswers the query [8]. 
The corresponding discrete memoryless channel model without feedback,\nwith input u \u2208 U, output v \u2208 V and state C \u2208 {S, H}, P (C = H) = q, is described as follows\n\n$$P(v|u, C) = \\begin{cases} 0, & C = H \\text{ and } v \\neq u \\\\ 1, & C = H \\text{ and } v = u \\\\ \\dfrac{1}{|\\mathcal{V}|}, & C = S \\end{cases} \\qquad (2)$$\n\nwhere C \u2208 {S, H} indicates whether the worker (channel) is a spammer or a hammer. In the case of\nour current interest |U| = |V| = M, and we have $P(v^n|u^n, C^n) = \\prod_{i=1}^{n} P(v_i|u_i, C_i)$.\nIn the sequel, we consider the following two scenarios: when the workers\u2019 skill levels are unknown\n(SL-UK) and when they are perfectly known by the crowdsourcer (SL-CS). In both cases, we assume that\nthe skill levels are not known at the taskmaster (transmitter).\nThe presented framework can also accommodate other more general scenarios of interest. For\nexample, the feedforward link in Figure 1a could be used to model a channel whose state is affected\nby the input, e.g., difficulty of questions. These extensions remain for future studies.\n\n2.3 Query Scheme and Inference: Coding\n\nIn the system model presented in Figure 1a, encoding shows the way the queries are posed. A basic\nquery is one in which the worker is asked for the value of B(X). In the example of crowdsourcing for labeling\nimages that are suitable for children, the query is \"This image suits children; true or false?\" The\ndecoder or the crowdsourcer collects the responses of workers to the queries and attempts to infer the\nright label (cluster) for each of the images, even though the collected responses could in general be\nincomplete or erroneous.\nIn the case of crowdsourcing for labeling a large set of dog images with their breeds, a query may be\nformed by showing two pictures at once and inquiring whether they are from the same breed [11].\nThe queries in fact are posed as showing the elements of a binary incidence matrix, A, whose rows\nand columns correspond to X. 
In this case, A(X1, X2) = 1 indicates that the two are members of\nthe same cluster (breed) and A(X1, X2) = 0 indicates otherwise. The matrix is symmetric and its\ndiagonal is 1. We refer to this query scheme as Binary Incidence Coding. If we show three pictures\nat once and ask the user to classify them (put the pictures in similar or distinct bins), it is as if we\nask about three elements of the same matrix, i.e., A(X1, X2), A(X1, X3) and A(X2, X3) (Ternary\nIncidence Coding). In general, if we show k pictures as a single query, it is equivalent to inquiring\nabout C(k, 2) (choose 2 out of k elements) entries of the matrix (k-ary Incidence Coding or kIC). As\nwe elaborate below, out of the $2^{C(k,2)}$ possibilities, a number of the choices are invalid and this\nprovides an error correction capability for kIC.\nFigures 1b and 1c show the graphical representation of 3IC, and the choices a worker would have in\nclustering with this code. The nodes denote the items and the edges indicate whether they are in the\nsame cluster. In 3IC, if X1 and X2 are in the same cluster as X3, then all three of them are in the\nsame cluster. It is straightforward to see that in 3IC and for N = 2, we only have four valid responses\n(Figure 1b) to a query as opposed to $2^{C(3,2)} = 8$. The first item in Figure 1c is invalid because there\nare only two clusters (N = 2) (in case we do not know the number of clusters or N \u2265 3, then this\nwould remain a valid response). In this setting, the encoded signal u can be one of the four valid\nsymbols in the set U; and similarly what the workers may select v (decoded signal over the channel)\nis from the set V, where U = V. As such, since in kIC the obviously erroneous answers are removed\nfrom the choices a worker can make in responding to a query, one expects an improved overall CS\nperformance, i.e., an error correction capability for kIC. In Section 4, we study the performance of\nthis code in greater detail. 
Note that in clustering with kIC (k \u2265 2) described above, the code would\nidentify clusters up to their specific labelings.\nWhile we presented kIC as a concrete example, there may be many other forms of query or coding\nschemes. Formally, the code is composed of encoder and decoder mappings:\n\n$$\\mathcal{C} : B(\\mathcal{X})^n \\to \\mathcal{U}^{\\sum_{i=1}^{W} m_i}, \\qquad \\mathcal{C}' : \\mathcal{V}^{\\sum_{i=1}^{W} m_i} \\to \\hat{B}(\\mathcal{X})^n, \\qquad (3)$$\n\nwhere n is the block size or number of items in each encoding (we assume n|L), and mi is the number\nof uses of channel Ci or queries that worker i, 1 \u2264 i \u2264 W, handles. In many practical cases of\ninterest, we have \u02c6B(X ) = B(X ) and we may have n = L. The rate of the code is $R = \\sum_{i=1}^{W} m_i / n$\nqueries per item. In this setting, $\\mathcal{C}'(\\mathcal{C}(B(X^n))) = \\hat{B}(X^n)$.\nDepending on the availability of feedback from the decoder to the encoder, the code can adapt for\noptimized performance. The feedback could provide the encoder with the results of prior queries.\nWe here focus on non-adaptive codes in (3) that are static and remain unchanged over the period of\nthe crowdsourcing project. We will elaborate on code design in Section 3.\nDepending on the type of code in use and the objectives of crowdsourcing, one may design different\ndecoding schemes. For instance, in the simple case of directly inquiring workers about the function\nB(Xi), with multiple queries for each item, popular approaches are majority voting and EM style\ndecoding due to Dawid and Skene [5], which attempts to jointly estimate the workers\u2019 skill\nlevels and decode B(X). In the case of clustering with 2IC, an inference scheme based on convex\noptimization is presented in [11].\nThe rate of the code is proportional to the CS budget and we use the rate as a proxy for budget\nthroughout this analysis. 
However, since different types of query have different costs both financially\n(in crowdsourcing platforms) and from the perspective of the time or effort it takes the worker to\nprocess them, one needs to be careful in comparing the results of different coding schemes. We shall\nelaborate on this issue for the case of kIC in Appendix E.\n\n2.4 Distortion and the Design Problem\n\nIn the framework of Figure 1a, we are interested in designing the CS code, i.e., the query and inference\nschemes, such that with a given budget a certain CS fidelity is optimized. We consider the fidelity as\nan average distortion with respect to the source (dataset). For a distance function d(B(x), \u02c6B(x)), for\nwhich $d(B(x^n), \\hat{B}(x^n)) = \\frac{1}{n} \\sum_{i=1}^{n} d(B(x_i), \\hat{B}(x_i))$, the average distortion is\n\n$$D(B(X), \\hat{B}(X)) = \\mathbb{E}\\, d(B(X^n), \\hat{B}(X^n)) = \\sum_{X^n} P(B(X^n))\\, P(\\hat{B}(X^n)|B(X^n))\\, d(B(X^n), \\hat{B}(X^n)), \\qquad (4)$$\n\nwhere $P(B(X^n)) = P(B(X))^n$ for iid B(X). The design problem is therefore one of CS fidelity-query\nbudget optimization (or distortion-rate, D(R), optimization) and may be expressed as follows\n\n$$D^*(R_t) = \\min_{\\mathcal{C},\\, \\mathcal{C}',\\, R \\le R_t} D(B(X), \\hat{B}(X)) \\qquad (5)$$\n\nwhere Rt is a target rate or query budget. The optimization is with respect to the coding and\ndecoding schemes, the type of feedback (if applicable), and query assignment and rate allocation.\nThe optimum solution to the above problem is referred to as the distortion-rate function, D\u2217(Rt) (or\nCS fidelity-query budget function). A basic distance function, for the case where B(X) is discrete,\nis $\\ell_0(B(X), \\hat{B}(X))$, or the Hamming distance. In this case, the average distortion D(B(X), \u02c6B(X))\nreflects the average probability of error. 
As such, the D(R) optimization problem may be rewritten\nas follows\n\n$$D^*(R_t) = \\min_{\\mathcal{C},\\, \\mathcal{C}',\\, R \\le R_t} P(E : \\hat{B}(X) \\neq B(X)). \\qquad (6)$$\n\nIn case of crowdsourcing for clustering, this quantifies the performance in terms of the overall\nprobability of error in clustering. For other crowdsourcing problems, we may consider other distortion\nfunctions. Equivalently, we may consider minimizing the rate subject to a constrained distortion in\ncrowdsourcing. The R(D) problem is expressed as follows\n\n$$R^*(D_t) = \\min_{\\mathcal{C},\\, \\mathcal{C}',\\, D(B(X), \\hat{B}(X)) \\le D_t} R = \\min_{\\mathcal{C},\\, \\mathcal{C}',\\, D(B(X), \\hat{B}(X)) \\le D_t} \\sum_{i=1}^{W} m_i / n \\qquad (7)$$\n\nwhere Dt is a target distortion or average probability of error. The optimum solution to the above\nproblem is referred to as the rate-distortion function, R\u2217(Dt) (CS query budget-fidelity function). In\ncase the taskmaster does not know the skill levels of the workers, different workers, regardless of their\nskill levels, would receive the same number of queries ($m_i = m'$, $\\forall i$), and the code design involves\ndesigning the query and inference schemes.\n\n3 Information Theoretic CS Budget-Fidelity Limits\n\nIn the CS budget-fidelity optimization problem in (5), the code providing the optimized solution\nindeed needs to balance two opposing design criteria to meet the target CS fidelity: on one hand, the\ndesign aims at efficiency of the query and making as few queries as possible; on the\nother hand, the code needs to take into account the imperfection of worker responses and incorporate\nsufficient redundancy. 
In the realm of information theory (coding theory), the former corresponds to source\ncoding (compression) and the latter corresponds to channel coding (error control coding), and coding\nto serve both purposes is joint source channel coding.\nIn this Section, we first present a brief overview of joint source channel coding and related results\nin information theory. Next, we present the CS budget-fidelity function in the two cases of SL-UK and\nSL-CS described in Section 2.2.\n\n3.1 Background\nConsider the communication of a random source Z from a finite alphabet $\\mathcal{Z}$ over a discrete memoryless\nchannel. The source is first processed by an encoder C, whose output is communicated\nover the channel. The channel output is processed by a decoder C', which reconstructs the source as\n$\\hat{Z} \\in \\hat{\\mathcal{Z}}$, and we often have $\\mathcal{Z} = \\hat{\\mathcal{Z}}$.\n\nFrom a rate-distortion theory perspective, we first consider the case where the channel is error free.\nThe source is iid with probability mass function P (Z) and, based on Shannon\u2019s source\ncoding theorem, is characterized by a rate-distortion function,\n\n$$R^*(D_t) = \\min_{\\mathcal{C},\\, \\mathcal{C}' :\\, D(Z, \\hat{Z}) \\le D_t} I(Z, \\hat{Z}), \\qquad (8)$$\n\nwhere I(., .) indicates mutual information between two random variables. The source coding is\ndefined by the following two mappings:\n\n$$\\mathcal{C} : \\mathcal{Z}^n \\to \\{1, \\ldots, 2^{nR}\\}, \\qquad \\mathcal{C}' : \\{1, \\ldots, 2^{nR}\\} \\to \\hat{\\mathcal{Z}}^n \\qquad (9)$$\n\nThe average distortion is defined in (4) and Dt is the target performance. The optimization in source\ncoding with distortion is with respect to the source coding or compression scheme, which is described\nprobabilistically as P ( \u02c6Z|Z) in information theory. 
The proof of the source coding theorem follows\nin two steps: In the first step, we prove that any rate R \u2265 R\u2217(Dt) is achievable in the sense that there\nexists a family (as a function of n) of codes $\\{\\mathcal{C}_n, \\mathcal{C}'_n\\}$ for which as n grows to infinity the resulting\naverage distortion satisfies the desired constraint. In the second step or the converse, we prove that\nany code with rate R < R\u2217(Dt) results in an average distortion that violates the desired constraint.\nThis establishes the described rate-distortion function as the fundamental limit for lossy compression\nof a source with a desired maximum average distortion.\nFrom the perspective of Shannon\u2019s channel coding theorem, we consider the source as an iid uniform\nsource and the channel as a discrete memoryless channel characterized by P (V |U ), where U \u2208 U is\nthe channel input and V \u2208 V is the channel output. The channel coding is defined by the following\ntwo mappings:\n\n$$\\mathcal{C} : \\{1, \\ldots, |\\mathcal{Z}|\\} \\to \\mathcal{U}^n, \\qquad \\mathcal{C}' : \\mathcal{V}^n \\to \\{1, \\ldots, |\\mathcal{Z}|\\} \\qquad (10)$$\n\nThe theorem establishes the capacity of the channel as $C = \\max_{\\mathcal{C}, \\mathcal{C}'} I(Z, \\hat{Z})$ and states that for a\nrate R, there exists a channel code that provides a reliable communication over the noisy channel if\nand only if R \u2264 C. Again the proof follows in two steps: First, we establish achievability, i.e., we\nshow that for any rate R \u2264 C, there exists a family of codes (as a function of length n) for which\nthe average probability of error $P(\\hat{Z} \\neq Z)$ goes to zero as n grows to infinity. Next, we prove the\nconverse, i.e., we show that for any rate R > C, the probability of error is always greater than zero\nand grows exponentially fast to 1/2 as R goes beyond C. 
This establishes the described capacity as\nthe fundamental limit for transmission of an iid uniform source over a discrete memoryless channel.\nFor the problem of our interest, i.e., the transmission of an iid source (not necessarily uniform)\nover a discrete memoryless channel, the joint source channel coding theorem, also known as the source-channel\nseparation theorem, is instrumental. The theorem states that in this setting a code exists that can\nfacilitate reconstruction of the source with distortion D(Z, \u02c6Z) \u2264 Dt if and only if R\u2217(Dt) < C. For\ncompleteness, we reproduce the theorem from [4] below.\n\nTheorem 1 Let Z be a finite alphabet iid source which is encoded as a sequence of n input symbols\n$U^n$ of a discrete memoryless channel with capacity C. The output of the channel $V^n$ is mapped onto\nthe reconstruction alphabet $\\hat{Z}^n = \\mathcal{C}'(V^n)$. Let $D(Z^n, \\hat{Z}^n)$ be the average distortion achieved by this\njoint source and channel coding scheme. Then distortion $D_t$ is achievable if and only if $C > R^*(D_t)$.\nThe proof follows a two step approach similar to that described above and assumes large block length\n(n \u2192 \u221e). The result is important from a communication theoretic perspective as a concatenation\nof a source code, which removes the redundancy and produces an iid uniform output at a rate\nR > R\u2217(Dt), and a channel code, which communicates this reliably over the noisy channel at a rate\nR < C, can achieve the same fundamental limit.\n\n3.2 Basic Information Theoretic Bounds\n\nWe here consider crowdsourcing within the presented framework, and derive basic information\ntheoretic bounds. 
Following Section 2.1, we examine the case where a large dataset X (L \u2192 \u221e) and\na function of interest B(X) with associated probability mass function, P (B(X)), are available.\nWe consider the MSC worker pool model described in Section 2.2, where the skill levels of the workers are\ndrawn from a discrete set $\\mathcal{E} = \\{\\epsilon_1, \\epsilon_2, \\ldots, \\epsilon_{W'}\\}$ with probability $P(\\epsilon)$, $\\epsilon \\in \\mathcal{E}$. The number of workers in\neach skill level class is assumed large. We here study the two scenarios of SL-UK and SL-CS.\n\nAt any given instant, a query is posed to a random worker with a random skill level within the set $\\mathcal{E}$.\nWe assume there is no feedback available from the decoder (non-adaptive coding) and the queries do\nnot influence the channel probabilities (no feedforward). Extensions remain for future work.\nThe following theorem identifies the information theoretic minimum number of queries per item to\nperform at least as well as a target fidelity in case the skill levels are not known (SL-UK). The bound\nis oblivious to the type of code used and serves as an ultimate performance bound.\nTheorem 2 In crowdsourcing for a large dataset of N-ary discrete source B(X) \u223c P (B(X)) with\nHamming distortion, when a large number of unknown workers with skill levels $\\epsilon \\in \\mathcal{E}$, $\\epsilon \\sim P(\\epsilon)$,\nfrom an MSC population participate (SL-UK), the minimum number of queries per item to obtain an\noverall error probability of at most $\\hat{\\epsilon}$ is given by\n\n$$R_{\\min} = \\begin{cases} \\dfrac{H(B(X)) - H_N(\\hat{\\epsilon})}{\\log_2 M - H_M(\\mathbb{E}(\\epsilon))}, & \\hat{\\epsilon} \\le \\min\\{1 - p_{\\max},\\, 1 - \\frac{1}{N}\\} \\\\ 0, & \\text{otherwise,} \\end{cases} \\qquad (11)$$\n\nin which $H_N(\\epsilon) \\triangleq H(1 - \\epsilon, \\epsilon/(N-1), \\ldots, \\epsilon/(N-1))$, and $p_{\\max} = \\max_{B(X) \\in B(\\mathcal{X})} P(B(X))$.\nThe proof is provided in Appendix A. Another interesting scenario is when the crowdsourcer attempts\nto estimate the worker skill levels from the data it has collected as part of the inference. 
In case this\nestimation is done perfectly, the next theorem identifies the corresponding fundamental limit on the\ncrowdsourcing rate. The proof is provided in Appendix B.\nTheorem 3 In crowdsourcing for a large dataset of N-ary discrete source B(X) \u223c P (B(X)) with\nHamming distortion, when a large number of workers with skill levels $\\epsilon \\in \\mathcal{E}$, $\\epsilon \\sim P(\\epsilon)$, known to the\ncrowdsourcer (SL-CS), from an MSC population participate, the minimum number of queries per\nitem to obtain an overall error probability of at most $\\hat{\\epsilon}$ is given by\n\n$$R_{\\min} = \\begin{cases} \\dfrac{H(B(X)) - H_N(\\hat{\\epsilon})}{\\log_2 M - \\mathbb{E}(H_M(\\epsilon))}, & \\hat{\\epsilon} \\le \\min\\{1 - p_{\\max},\\, 1 - \\frac{1}{N}\\} \\\\ 0, & \\text{otherwise.} \\end{cases} \\qquad (12)$$\n\nComparing the results in Theorems 2 and 3, the following interesting observation can be made. In case\nthe worker skill levels are unknown, the CS system provides an overall work quality (capacity) of an\naverage worker; whereas when the skill levels are known at the crowdsourcer, the system provides an\noverall work quality that pertains to the average of the work quality of the workers.\n\n4 k-ary Incidence Coding\n\nIn this Section, we examine the performance of the k-ary incidence coding introduced in Section 2.3.\nThe k-ary incidence code poses a query as a set of k \u2265 2 items and asks the workers to identify\nthose with the same label. In the sequel, we begin with deriving a lower bound on the performance\nof kIC with a spammer-hammer worker pool. We then present numerical results along with the\ninformation theoretic lower bounds presented in the previous Section.\n\n4.1 Performance of kIC with SHC Worker Pool\n\nWe consider kIC for crowdsourcing in the following setting. The items X in the dataset are iid\nwith N = 2. There is no feedback from the decoder to the task manager (encoder), i.e., the code\nis non-adaptive. 
Since the task manager has no knowledge of the workers\u2019 skill levels, it queries\nthe workers at the same fixed rate of R queries per item. To compose a query, the items are drawn\nuniformly at random from the dataset. We assume that the workers are drawn from the SHC(q) model\nelaborated in Section 2.2. The purpose is to obtain a lower bound on the performance assuming an\noracle decoder that can perfectly identify the workers\u2019 skill levels (here a spammer or a hammer)\nand perform an optimal decoding. Specifically, we consider the following:\n\n$$\\min_{\\mathcal{C}',\\, \\mathcal{C} : k\\text{IC}} P(E : \\hat{B}(X) \\neq B(X)) \\qquad (13)$$\n\nwhere minimization is with respect to the choice of a decoder for a given kIC code. We note that the\ncode length is governed by how the decoder operates, and often could be as long as the dataset. As\nevident in (2), in the SHC model, the channel error rate (worker reliability) is explicitly influenced by\nthe code and parameter, k. In the model of Figure 1a, this implies that a certain static feedforward\nexists in this setting. We first present a lemma, which is used later to establish Theorem 4 on kIC\nperformance. 
The proofs are respectively provided in Appendix C and Appendix D.\n\nLemma 1 In crowdsourcing for binary labeling (N = 2) of a uniformly distributed dataset, with\nkIC and a SHC worker pool, the probability of error in labeling of an item by a spammer (C = S) is\ngiven by\n\n$$\\bar{\\epsilon}_S = P(E : \\hat{B}(X) \\neq B(X) | C = S) = \\begin{cases} \\dfrac{1}{k 2^{k-1}} \\displaystyle\\sum_{i=0}^{\\lfloor (k-1)/2 \\rfloor} i \\times \\binom{k}{i}, & k \\text{ odd} \\\\ \\dfrac{1}{k 2^{k-1}} \\left[ \\displaystyle\\sum_{i=0}^{\\lfloor (k-1)/2 \\rfloor} i \\times \\binom{k}{i} + \\dfrac{k}{4} \\binom{k}{k/2} \\right], & k \\text{ even.} \\end{cases}$$\n\nTheorem 4 Assuming crowdsourcing using a non-adaptive kIC over a uniformly distributed dataset\n(k \u2265 2), if the number of queries per item, R, is less than $\\frac{1}{k \\ln(1-q)} \\ln \\frac{\\hat{\\epsilon}}{\\bar{\\epsilon}_S}$, then no decoder can achieve\nan average probability of labeling error less than $\\hat{\\epsilon}$ for any L under the SHC(q) worker model.\n\nTo interpret and use the result in Theorem 4, we consider the following points: (i) The theorem\npresents a necessary condition, i.e., the minimum rate (budget) requirement identified here for kIC\nwith a given fidelity is a lower bound. This is due to the fact that we are considering an oracle CS\ndecoder that can perfectly identify the workers\u2019 skill levels and correctly label the item if the item\nis labeled by at least one hammer out of the R' times it is processed by the workers. (ii) In the current\nsetting, where the taskmaster does not know the workers\u2019 skill levels, each item is included in exactly\nR' \u2208 Z+ k-ary queries. That is due to the nature of the code R'. (iii) As discussed in Appendix\nE, Theorem 4 can also be used to establish an approximate rule of thumb for pricing. 
Specifically,\nconsidering two query schemes $k_1$IC and $k_2$IC, the query price $\\pi$ is to be set as $\\pi(k_1)/\\pi(k_2) \\approx k_1/k_2$.\n\n4.2 Numerical Results\n\nTo obtain an information theoretic benchmark, the next corollary specializes Theorem 3 to the setting\nof interest in this section.\n\nCorollary 1 In crowdsourcing for binary labeling of a uniformly distributed dataset with an SHC(q)\nworker pool known to the crowdsourcer (SL-CS), and with M choices in responding to a query,\nthe minimum rate for any given coding scheme to obtain a probability of error of at most $\\hat{\\epsilon}$ is\n\n$$R_{\\min} = \\begin{cases} \\frac{1 - H_b(\\hat{\\epsilon})}{q \\log_2 M}, & 0 \\leq \\hat{\\epsilon} \\leq 0.5 \\\\ 0, & \\text{otherwise} \\end{cases} \\quad \\text{queries per item.} \\qquad (14)$$\n\nFigure 2 shows the information theoretic limit of Corollary 1 and the bound obtained in Theorem 4.\nFor rates (budgets) greater than the former bound, there exists a code which provides crowdsourcing\nwith the desired fidelity; and for rates below this bound no such code exists. The coding theoretic\nlower bounds for kIC depend on k, q and fidelity, and improve as k and q grow. The kIC bound for\nk = 1 is equivalent to the analysis leading to Lemma 1 of [8].\n\nFigure 2: kIC performance bound and the information theoretic limit\n\nReferences\n[1] I. Abraham, O. Alonso, V. Kandylas, and A. Slivkins. Adaptive crowdsourcing algorithms for the bandit survey problem. In 26th Conference on Learning Theory (COLT), 2013.\n[2] Audubon. History of the Christmas Bird Count, 2015. URL http://birds.audubon.org.\n[3] Caltech. Community seismic network project, 2016. URL http://csn.caltech.edu.\n[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New Jersey, USA, 2006.\n[5] A. P. Dawid and A. M. Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20\u201328, 1979.\n[6] O. 
Dekel and O. Shamir. Vox populi: Collecting high-quality labels from a crowd. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, June 2009.\n[7] A. El Gamal and Y.-H. Kim. Network Information Theory. Cambridge University Press, New York, USA, 2011.\n[8] D. R. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 61(1):1\u201324, 2014.\n[9] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. J. Machine Learn. Res., 99(1):1297\u20131322, 2010.\n[10] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labeling of Venus images. Adv. Neural Inform. Processing Systems, pages 1085\u20131092, 1995.\n[11] R. Korlakai Vinayak, S. Oymak, and B. Hassibi. Graph clustering with missing data: Convex algorithms and analysis. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada, pages 2996\u20133004, 2014.\n[12] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Adv. Neural Inform. Processing Systems, 22(1):2035\u20132043, 2009.\n[13] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada, pages 1260\u20131268, 2014.", "award": [], "sourceid": 2609, "authors": [{"given_name": "Farshad", "family_name": "Lahouti", "institution": "Caltech"}, {"given_name": "Babak", "family_name": "Hassibi", "institution": "Caltech"}]}