{"title": "Eliciting Categorical Data for Optimal Aggregation", "book": "Advances in Neural Information Processing Systems", "page_first": 2450, "page_last": 2458, "abstract": "Models for collecting and aggregating categorical data on crowdsourcing platforms typically fall into two broad categories: those assuming agents are honest and consistent but with heterogeneous error rates, and those assuming agents are strategic and seek to maximize their expected reward. The former often leads to tractable aggregation of elicited data, while the latter usually focuses on optimal elicitation and does not consider aggregation. In this paper, we develop a Bayesian model wherein agents have differing quality of information but also respond to incentives. Our model generalizes both categories and enables the joint exploration of optimal elicitation and aggregation. This model enables our exploration, both analytically and experimentally, of optimal aggregation of categorical data and optimal multiple-choice interface design.", "full_text": "Eliciting Categorical Data for Optimal Aggregation\n\nChien-Ju Ho\n\nCornell University\n\nch624@cornell.edu\n\nRafael Frongillo\n\nCU Boulder\n\nraf@colorado.edu\n\nYiling Chen\n\nHarvard University\n\nyiling@seas.harvard.edu\n\nAbstract\n\nModels for collecting and aggregating categorical data on crowdsourcing platforms typically fall into two broad categories: those assuming agents are honest and consistent but with heterogeneous error rates, and those assuming agents are strategic and seek to maximize their expected reward. The former often leads to tractable aggregation of elicited data, while the latter usually focuses on optimal elicitation and does not consider aggregation. In this paper, we develop a Bayesian model wherein agents have differing quality of information but also respond to incentives. Our model generalizes both categories and enables the joint exploration of optimal elicitation and aggregation. 
This model enables our exploration, both analytically and experimentally, of optimal aggregation of categorical data and optimal multiple-choice interface design.\n\n1 Introduction\n\nWe study the general problem of eliciting and aggregating information for categorical questions. For example, when posing a classification task to crowd workers who may have heterogeneous skills or amounts of information about the underlying true label, the principal wants to elicit workers' private information and aggregate it in a way that maximizes the probability that the aggregated information correctly predicts the underlying true label.\nIdeally, in order to maximize the probability of correctly predicting the ground truth, the principal would want to elicit agents' full information by asking agents for their entire belief in the form of a probability distribution over labels. However, this is not always practical, e.g., agents might not be able to accurately differentiate 92% and 93%. In practice, the principal is often constrained to elicit agents' information via a multiple-choice interface, which discretizes agents' continuous beliefs into finite partitions. An example of such an interface is illustrated in Figure 1. Moreover, regardless of whether full or partial information about agents' beliefs is elicited, aggregating the information into a single belief or answer is often done in an ad hoc fashion (e.g., majority voting for simple multiple-choice questions).\nIn this work, we explore the joint problem of eliciting and aggregating information for categorical data, with a particular focus on how to design the multiple-choice interface, i.e., how to discretize agents' belief space to form discrete choices. The goal is to maximize the probability of correctly predicting the ground truth while incentivizing agents to truthfully report their beliefs. This problem is challenging. 
Changing the interface not only changes which agent beliefs lead to which responses, but also influences how to optimally aggregate these responses into a single label. Note that we focus on the abstract level of interface design: we explore the problem of how to partition agents' belief spaces for optimal aggregation. We do not discuss other behavioral aspects of interface design, such as question framing, layouts, etc.\n\nFigure 1: An example of the task interface. (The prompt reads: What's the texture shown in the image?)\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fWe propose a Bayesian framework, which allows us to achieve our goal in three interleaving steps. First, we restrict our attention to interfaces which admit economically robust payment functions, that is, where agents seeking to maximize their expected payment select the answer that corresponds to their belief. Second, given an interface, we develop a principled way of aggregating information elicited through it, to obtain the maximum a posteriori (MAP) estimator. Third, given the constraints on interfaces (e.g., only binary-choice questions are allowed) and aggregation methods, we can then choose the optimal interface, which leads to the highest prediction accuracy after both elicitation and aggregation. (Note that if there are no constraints, eliciting full information is always optimal.) Using theoretical analysis, simulations, and experiments, we provide answers to several interesting questions. Our main results are summarized as follows:\n\n\u2022 If the principal can elicit agents' entire belief distributions, our framework can achieve optimal aggregation, in the sense that the principal can make predictions as if she has observed the private information of all agents (Section 4.1). 
This resolves the open problem of optimal aggregation for categorical data, which was considered impossible to achieve in [1].\n\n\u2022 For the binary-choice interface design question, we explore the design of optimal interfaces for small and large numbers of agents (Section 4.2). We conduct human-subject experiments on Amazon's Mechanical Turk and demonstrate that our optimal binary-choice interface leads to better prediction accuracy than a natural baseline interface (Section 5.3).\n\n\u2022 Our framework gives a simple, principled way of aggregating data from arbitrary interfaces (Section 5.1). Applied to experimental data from [2] for a particular multiple-choice interface, our aggregation method has better prediction accuracy than their majority voting (Section 5.2).\n\u2022 For general multiple-choice interfaces, we use synthetic experiments to obtain qualitative insights into the optimal interface. Moreover, our simple (heuristic) aggregation method performs nearly optimally, demonstrating the robustness of our framework (Section 5.1).\n\n1.1 Related Work\n\nEliciting private information from strategic agents has been a central question in economics and other related domains. The focus here is often on designing payment rules such that agents are incentivized to truthfully report their information. In this direction, proper scoring rules [3, 4, 5] have long been used for eliciting beliefs about categorical and continuous variables. When the realized value of a random variable will be observed, proper scoring rules have been designed for eliciting either the complete subjective probability distribution of the random variable [3, 4] or some statistical properties of this distribution [6, 7]. When the realized value of a random variable will not be available, a class of peer prediction mechanisms [8, 9] has been designed for truthful elicitation. 
These mechanisms often use proper scoring rules and leverage the stochastic relationship between agents' private information about the random variable in a Bayesian setting. However, work in this direction often takes elicitation as an end goal and does not offer insights on how to aggregate the elicited information.\nAnother theme in the existing literature is the development of statistical inference and probabilistic modeling methods for the purpose of aggregating agents' inputs. Assuming a batch of noisy inputs, the EM algorithm [10] can be adopted to learn the skill levels of agents and obtain estimates of the best answer [11, 12, 13, 14, 15]. Recently, extensions have been made to also consider task assignment and online task assignment in the context of these probabilistic models of agents [16, 17, 18, 19]. Work under this theme often assumes non-strategic agents who have some error rate and are rewarded with a fixed payment that does not depend on their reports.\nThis paper attempts to achieve both truthful elicitation and principled aggregation of information with strategic agents. The closest work to our paper is [1], which has the same general goal and uses a similar Bayesian model of information. That work achieves optimal aggregation by associating the confidence of an agent's prediction with hyperparameters of a conjugate prior distribution. However, this approach leaves optimal aggregation for categorical data as an open question, which we resolve. Moreover, our model allows us to elicit confidence about an answer over a coarsened report space (e.g., a partition of the probability simplex) and to reason about optimal coarsening for the purpose\n\n2\n\n\fof aggregation. In comparison, [2] also elicit quantified confidence on reported labels in their mechanism. 
Their mechanism is designed to incentivize agents to truthfully report the label that they believe to be correct when their confidence in the report is above a threshold, and to skip the question when it is below the threshold. Majority voting is then used to aggregate the reported labels. These thresholds provide a coarsened report space for eliciting confidence, and thus are well modeled by our approach. However, in that work the thresholds are given a priori, and moreover, the elicited confidence is not used in aggregation. These are both gaps which our approach fills; in Section 5, we demonstrate how to derive optimal thresholds, and aggregation policies, which depend critically on the prior distribution and the number of agents.\n\n2 Bayesian Model\n\nIn our model, the principal would like to get information about a categorical question (e.g., predicting who will win the presidential election, or identifying whether there is a cancer cell in a picture of cells) from m agents. Each question has a finite number of possible answers X, |X| = k. The ground truth (correct answer) \u0398 is drawn from a prior distribution p(\u03b8), with realized value \u03b8 \u2208 X. This prior distribution is common knowledge to the principal and the agents. We use \u03b8* to denote the unknown, realized ground truth.\nAgents have heterogeneous levels of knowledge or ability on the question that are unknown to the principal. To model agents' abilities, we assume each agent has observed independent noisy samples related to the ground truth. Hence, each agent's ability can be expressed as the number of noisy samples she has observed. The number of samples observed can be different across agents and is unknown to the principal. Formally, given the ground truth \u03b8*, each noisy sample X, with x \u2208 X, is i.i.d. drawn according to the distribution p(x|\u03b8*). 
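To make the sampling model concrete, here is a small sketch (our own illustration, not code from the paper; the parameter values are hypothetical) that draws a ground truth from the prior, draws i.i.d. noisy samples under the symmetric noise model p(x|θ) = (1 − ε)1{x = θ} + ε/k adopted below, and computes the agent's posterior and posterior predictive distribution (PPD):

```python
import random
from collections import Counter

def posterior(counts, prior, eps):
    """Posterior p(theta | samples) under symmetric noise, given per-label
    sample counts: proportional to p(theta) * alpha^c_theta * beta^(n - c_theta)."""
    k = len(prior)
    alpha, beta = 1 - eps + eps / k, eps / k   # Pr[sample hits truth] / Pr[each miss]
    n = sum(counts)
    w = [prior[t] * alpha**counts[t] * beta**(n - counts[t]) for t in range(k)]
    z = sum(w)
    return [wi / z for wi in w]

def ppd(counts, prior, eps):
    """Posterior predictive distribution: a noisy version of the posterior."""
    k = len(prior)
    post = posterior(counts, prior, eps)
    return [eps / k + (1 - eps) * post[x] for x in range(k)]

# Hypothetical numbers: k = 3 labels, asymmetric prior, noise eps = 0.5, n = 5 samples.
prior, eps, n = [0.6, 0.3, 0.1], 0.5, 5
theta_star = random.choices(range(3), weights=prior)[0]          # draw ground truth
p_x = [(1 - eps) * (x == theta_star) + eps / 3 for x in range(3)]
samples = random.choices(range(3), weights=p_x, k=n)             # agent's noisy samples
counts = [Counter(samples)[t] for t in range(3)]
print(counts, posterior(counts, prior, eps), ppd(counts, prior, eps))
```

Note that the PPD mixes ε/k of uniform noise into the posterior, matching the relation between the two given below.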
In this paper, we focus our discussion on the symmetric noise distribution, defined as\n\np(x|\u03b8) = (1 \u2212 \u03b5)\u00b71{\u03b8 = x} + \u03b5\u00b7(1/k).\n\nThis noise distribution is common knowledge to the principal and the agents. While the symmetric noise distribution may appear restrictive, it is indeed quite general. In Appendix C, we discuss how our model covers many scenarios considered in the literature as special cases.\n\nBeliefs of Agents. If an agent has observed n noisy samples, X1 = x1, . . . , Xn = xn, her belief is determined by a count vector c = {c_\u03b8 : \u03b8 \u2208 X}, where c_\u03b8 = \u03a3_{i=1}^n 1{x_i = \u03b8} is the number of samples equal to \u03b8 that the agent has observed. According to Bayes' rule, her posterior belief on \u0398 is p(\u03b8|x1, . . . , xn), which can be expressed as\n\np(\u03b8|x1, . . . , xn) = (\u03a0_{j=1}^n p(x_j|\u03b8)) p(\u03b8) / p(x1, . . . , xn) = \u03b1^{c_\u03b8} \u03b2^{n \u2212 c_\u03b8} p(\u03b8) / \u03a3_{\u03b8' \u2208 X} \u03b1^{c_\u03b8'} \u03b2^{n \u2212 c_\u03b8'} p(\u03b8'),\n\nwhere \u03b1 = 1 \u2212 \u03b5 + \u03b5/k and \u03b2 = \u03b5/k.\nIn addition to the posterior on \u0398, the agent also has an updated belief, called the posterior predictive distribution (PPD), about an independent sample X given observed samples X1 = x1, . . . , Xn = xn. The PPD can be considered a noisy version of the posterior:\n\np(x|x1, . . . , xn) = \u03b5/k + (1 \u2212 \u03b5) p(\u0398 = x|x1, . . . , xn).\n\nIn fact, in our setting the PPD and posterior are in one-to-one correspondence, so while our theoretical results focus on the PPD, our experiments will consider the posterior without loss of generality.\n\nInterface. An interface defines the space of reports the principal can elicit from agents. The reports elicited via the interface naturally partition agents' beliefs, a k-dimensional probability simplex, into a (potentially infinite) number of cells, which each correspond to a coarsened version of agents' PPD. 
Formally, each interface consists of a report space R and a partition D = {D_r \u2286 \u0394_k}_{r \u2208 R}, where \u0394_k denotes the k-dimensional probability simplex, with each cell D_r corresponding to a report r, and \u222a_{r \u2208 R} D_r = \u0394_k.2 In this paper, we sometimes use only R or D to represent an interface.\n\n1When there is no ambiguity, we use p(x|\u03b8*) to represent p(X = x|\u0398 = \u03b8*), and similar notation for other distributions.\n2Strictly speaking, we will allow cells to overlap on their boundaries; see Section 3 for more discussion.\n\n3\n\n\fIn this paper, we focus on the abstract level of interface design. We explore the problem of how to partition agents' belief spaces for optimal aggregation. We do not discuss other aspects of interface design, such as question framing, layouts, etc. In practice there are often pre-specified constraints on the design of interfaces, e.g., the principal can only ask agents a multiple-choice question with no more than 2 choices. We explore how to optimally design interfaces under given constraints.\n\nObjective. The goal of the principal is to choose an interface corresponding to a partition D, satisfying some constraints, and an aggregation method Agg_D, to maximize the probability of correctly predicting the ground truth. One very important constraint is that there should exist a payment method for which agents are correctly incentivized to report r if their belief is in D_r; see Section 3. We can formulate the goal as the following optimization problem,\n\nmax_{(R,D) \u2208 Interfaces} max_{Agg_D} Pr[Agg_D(R1, . . . , Rm) = \u0398],     (1)\n\nwhere R_i are random variables representing the reports chosen by agents after \u03b8* and the samples are drawn.\n\n3 Our Mechanism\n\nWe assume the principal has access to a single independent noisy sample X drawn from p(x|\u03b8*). The principal can then leverage this sample to elicit and aggregate agents' beliefs by adopting techniques from proper scoring rules [3, 5]. 
This assumption can be satisfied by, for example, allowing the principal to ask for an additional opinion outside of the m agents, or by asking agents multiple questions and only scoring a small random subset for which answers can be obtained separately (often, on the so-called \u201cgold standard set\u201d).\nOur mechanism can be described as follows. The principal chooses an interface with report space R and partition D, and a scoring rule S(r, x) for r \u2208 R and x \u2208 X. The principal then requests a report r_i \u2208 R from each agent i \u2208 {1, . . . , m}, and observes her own sample X = x. She then gives a score of S(r_i, x) to agent i and aggregates the reports via a function Agg_D : R \u00d7 \u00b7\u00b7\u00b7 \u00d7 R \u2192 X. Agents are assumed to be rational and aim to maximize their expected scores. In particular, if an agent i believes X is drawn from some distribution p, she will choose to report r_i \u2208 argmax_{r \u2208 R} E_{X\u223cp}[S(r, X)].\n\nElicitation. To elicit truthful reports from agents, we adopt techniques from proper scoring rules [3, 5]. A scoring rule is strictly proper if reporting one's true belief uniquely maximizes the expected score. For example, a strictly proper score is the logarithmic scoring rule, S(p, x) = log p(x), where p(x) is the agent's belief about the distribution x is drawn from.\nIn our setting, we utilize the requester's additional sample from p(x|\u03b8*) to elicit agents' PPDs p(x|x1, . . . , xn). If the report space is R = \u0394_k, we can simply use any strictly proper scoring rule, such as the logarithmic scoring rule, to elicit truthful reports. If the report space R is finite, we must specify what it means to be truthful. The partition D defined in the interface is a way of codifying this relationship: a scoring rule is truthful with respect to a partition if report r is optimal whenever an agent's belief lies in cell D_r.3\nDefinition 1. 
S(r, x) is truthful with respect to D if for all r \u2208 R and all p \u2208 \u0394_k we have\n\np \u2208 D_r \u27fa \u2200 r' \u2260 r, E_p S(r, X) \u2265 E_p S(r', X).\n\nSeveral natural questions arise from this definition. For which partitions D can we devise such truthful scores? And if we have such a partition, what are all the scores which are truthful for it? As it happens, these questions have been answered in the field of property elicitation [20, 21], with the verdict that there exist truthful scores for D if and only if D forms a power diagram, a type of weighted Voronoi diagram [22].\nThus, when we consider the problem of designing the interface for a crowdsourcing task, if we want to have robust economic incentives, we must confine ourselves to interfaces which induce power\n\n3As mentioned above, strictly speaking, the cells {D_r}_{r \u2208 R} do not form a partition because their boundaries overlap. This is necessary: for any (nontrivial) finite-report mechanism, there exist distributions for which the agent is indifferent between two or more reports. Fortunately, the set of all such distributions has Lebesgue measure 0 in the simplex, so these boundaries do not affect our analysis.\n\n4\n\n\fdiagrams on the set of agent beliefs. In this paper, we focus on two classes of power diagrams: threshold partitions, where membership p \u2208 D_r can be decided by comparisons of the form t1 \u2264 p_\u03b8 \u2264 t2, and shadow partitions, where p \u2208 D_r \u27fa r = argmax_x p(x) \u2212 p*(x) for some reference distribution p*. Threshold partitions cover those from [2], and shadow partitions are inspired by the Shadowing Method from peer prediction [23].\n\nAggregation. The goal of the principal is to aggregate the agents' reports into a single prediction which maximizes the probability of correctly predicting the ground truth.\nMore formally, let us assume that the principal obtains reports r1, . . . , rm from m agents such that the belief p_i of agent i lies in D_i := D_{r_i}. 
In order to maximize the probability of correct predictions, the principal aggregates the reports by calculating the posterior p(\u03b8|D1, . . . , Dm) for all \u03b8 and making the prediction \u03b8\u0302 that maximizes the posterior:\n\n\u03b8\u0302 = argmax_\u03b8 p(\u03b8|D1, . . . , Dm) = argmax_\u03b8 (\u03a0_{i=1}^m p(D_i|\u03b8)) p(\u03b8),\n\nwhere p(D_i|\u03b8) is the probability that the PPD of agent i falls within D_i given the ground truth \u03b8. To calculate p(D|\u03b8), we assume agents' abilities, represented by the number of samples, are drawn from a distribution p(n). We assume p(n) is known to the principal. This assumption can be satisfied if the principal is familiar with the market and has knowledge of agents' skill distribution. Empirically, in our simulations, the optimal interface is robust to the choice of this distribution.\n\np(D|\u03b8) = \u03a3_n ( \u03a3_{x1..xn : p(\u03b8|x1..xn) \u2208 D} p(x1..xn|\u03b8) ) p(n) = \u03a3_n ( (1/Z(n)) \u03a3_{c : p(\u03b8|c) \u2208 D} (n choose c) \u03b1^{c_\u03b8} \u03b2^{n \u2212 c_\u03b8} ) p(n),\n\nwith Z(n) = \u03a3_c (n choose c) \u03b1^{c_1} \u03b2^{n \u2212 c_1} and (n choose c) = n!/(\u03a0_i c_i!), where c_i is the i-th component of c.\n\nInterface Design. Let P(D) be the probability of correctly predicting the ground truth given partition D, assuming the best possible aggregation policy. The expectation is taken over the cells D_i \u2208 D reported by the m agents:\n\nP(D) = \u03a3_{D1,...,Dm} max_\u03b8 p(\u03b8|D1, . . . , Dm) p(D1, . . . , Dm) = \u03a3_{D1,...,Dm} max_\u03b8 (\u03a0_{i=1}^m p(D_i|\u03b8)) p(\u03b8).\n\nThe optimal interface design problem is to find an interface with partition D within the set of feasible interfaces such that, in expectation, P(D) is maximized.\n\n4 Theoretical Analysis\n\nIn this section, we analyze two settings to illustrate what our mechanism can achieve. We first consider the setting in which the principal can elicit full belief distributions from agents. 
We show that our mechanism can obtain optimal aggregation, in the sense that the principal can make predictions as if she has observed all the private signals observed by all agents. In the second setting, we consider the common case of binary signals and binary cells (e.g., binary classification tasks with a two-option interface). We demonstrate how to choose the optimal interface when we aim to collect data from one single agent and when we aim to collect data from a large number of agents.\n\n4.1 Collecting Full Distribution\n\nConsider the setting in which the allowed reports are full distributions over labels. We show that in this setting, the principal can achieve optimal aggregation. Formally, the interface consists of a report space R = \u0394_k \u2282 [0, 1]^k, the k-dimensional probability simplex, corresponding to beliefs about the principal's sample X given the observed samples of an agent. The aggregation is optimal if the principal can obtain the global PPD.\nDefinition 2 ([1]). Let S be the set of all samples observed by agents. Given the prior p(\u03b8) and data S distributed among the agents, the global PPD is given by p(x|S).\n\n5\n\n\fIn general, as noted in [1], computing the global PPD requires access to agents' actual samples, or at least their counts, whereas the principal can at most elicit the PPD. In that work, it is therefore considered impossible for the principal to leverage a single sample to obtain the global PPD for a categorical question, as there does not exist a unique mapping from PPDs to sample counts. While our setting differs from that paper, we intuitively resolve this impossibility by finding a non-trivial unique mapping between the differences of sample counts and PPDs.\nLemma 1. Fix \u03b80 \u2208 X and let diff^i \u2208 Z^{k\u22121} be the vector with entries diff^i_\u03b8 = c^i_\u03b8 \u2212 c^i_{\u03b80}, encoding the differences in the number of samples of \u03b8 and \u03b80 that agent i has observed. There exists a unique mapping between diff^i and the PPD of agent i.\n\nWith Lemma 1 in hand, assuming the principal can obtain the full PPD from each agent, she can now compute the global PPD: she simply converts each agent's PPD into a sample count difference, sums these differences, and finally converts the total differences into the global PPD.\nTheorem 2. Given the PPDs of all agents, the principal can obtain the global PPD.\n\n4.2 Interface Design in Binary Settings\n\nTo gain intuition about optimal interface design, we examine a simple setting with binary signals X = {0, 1} and partitions with only two cells. To simplify the discussion, we also assume all agents have observed exactly n samples. In this setting, each partition can be determined by a single parameter, the threshold p_T; its cells indicate whether the agent believes the probability of the principal's sample X being 0 is larger than p_T or not. Note that we can also write the threshold as T, a number of samples observed to be signal 0. Membership in the two cells then indicates whether or not the agent observes more than T samples with signal 0.\nWe first give the result when there is only one agent.4\nLemma 3. In the binary-signal and two-cell setting, if the number of agents is one, the optimal partition has threshold p_T* = 1/2.\nIf the number of agents is large, we numerically solve for the optimal partition over a wide range of parameters. We find that the optimal partition is to set the threshold such that agents' posterior belief on the ground truth is the same as the prior. This is equivalent to asking agents whether they observed more samples with signal 0 or with signal 1. 
Please see Appendices B and H for more discussion.\nThe above arguments suggest that when the principal plans to collect data from multiple agents for datasets with asymmetric priors (e.g., identifying anomalous images in a big dataset), adopting our interface leads to better aggregation than traditional interfaces do. We evaluate this claim in real-world experiments in Section 5.3.\n\n5 Experiments\n\nTo confirm our theoretical results and test our model, we turn to experiments. In our synthetic experiments, we simply explore what the model tells us about optimal partitions and how they behave as a function of the model, giving us qualitative insights into interface design. We also introduce a heuristic aggregation method, which allows our results to be easily applied in practice. In addition to validating our heuristics numerically, we show that they lead to real improvements over simple majority voting by re-aggregating data from previous work [2]. Finally, we perform our own experiments on a binary signal task and show that the optimal mechanism under the model, coupled with heuristic aggregation, significantly outperforms the baseline.\n\n5.1 Synthetic Experiments\n\nFrom our theoretical results, we expect that in the binary setting, the boundary of the optimal partition should be roughly uniform for small numbers of agents and quickly approach the prior as the number of agents per task increases. In the Appendix, we confirm this numerically. Figure 2 extends this intuition to the 3-signal case, where the optimal reference point p* for a shadow partition closely tracks the prior. Figure 2 also gives insight into the design of threshold partitions, showing\n\n4Our result can be generalized to k signals and one agent. 
See Lemma 4 in Appendix G.\n\n6\n\n\fFigure 2: Optimal interfaces as a function of the model; the prior is shown in each as a red dot. Each triangle represents the probability simplex on three signals (0, 1, 2), and the cells (sets of posteriors) of the partition defined by the interface are delineated by dashed lines. Top: the optimal shadow partition for three agents. Here the reference distribution p* is close to the prior, but often slightly toward uniform, as suggested by the behavior in the binary case (Section 4.2); for larger numbers of agents this point in fact always matches the prior. Bottom: the optimal threshold partition for increasing values of \u03b5. Here, as one would expect, the more uncertainty agents have about the true label, the lower the thresholds should be.\n\nFigure 3: Prediction error according to our model as a function of the prior for (a) the optimal partition with optimal aggregation, (b) the optimal partition with heuristic aggregation, and (c) the naive partition and aggregation. As we see, the heuristics are nearly optimal, and yield significantly lower error than the baseline.\n\nthat the threshold values should decrease as agent uncertainty increases. The Appendix gives other qualitative findings.\nThe optimal partitions and aggregation policies suggested by our framework are often quite complicated. Thus, to be practical, one would like simple partitions and aggregation methods which perform nearly optimally under our framework. 
Here we suggest a heuristic aggregation (HA) method, defined for a fixed number of samples n: for each cell D_r, consider the set of count vectors for which an agent's posterior would lie in D_r, and let c_r be the average count vector in this set. Now when agents report r1, . . . , rm, simply sum the count vectors and choose \u03b8\u0302 = HA(r1, . . . , rm) = argmax_\u03b8 p(\u03b8|c_{r1} + . . . + c_{rm}). Thus, by simply translating the choice of cell D_r into a representative sample count an agent may have observed, we arrive at a weighted-majority-like aggregation method. This simple method performs quite well in simulations, as Figure 3 shows. It also performs well in practice, as we will see in the next two subsections.\n\n5.2 Aggregation Results for Existing Mechanisms\n\nWe evaluate our heuristic aggregation method using the dataset collected from existing mechanisms in previous work [2]. Their dataset was collected by asking workers to answer a multiple-choice question and simultaneously select one of two confidence levels. We compare our heuristic aggregation (HA) with the simple majority voting (Maj) adopted in their paper. For our heuristics, we used the model with n = 4 and \u03b5 = 0.85 in every case here; this was the simplest model for which every cell in every partition contained at least one possible posterior. Our results are fairly robust to the choice of the model subject to this constraint, however, and other models often perform even better. In Figure 4, we demonstrate the aggregation results for one of their tasks (\u201cNational Flags\u201d) in the dataset. Although the improvement is relatively small, it is statistically significant for every setting plotted. 
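The HA method just described fits in a few lines of code. The sketch below (our own illustration; the shadow partition with reference point p* equal to the prior, and the parameter values, are assumptions for the example) enumerates all count vectors summing to n, averages them per cell to get the representatives c_r, and aggregates reports by a MAP prediction on the summed representatives:

```python
from itertools import product

def make_ha(prior, eps, n):
    """Heuristic aggregation (HA): map each cell of a shadow partition
    (reference p* = prior) to its average count vector, then aggregate
    reports by summing representatives and taking the MAP label."""
    k = len(prior)
    alpha, beta = 1 - eps + eps / k, eps / k

    def posterior(c):                 # works for fractional (averaged) counts too
        N = sum(c)
        w = [prior[t] * alpha**c[t] * beta**(N - c[t]) for t in range(k)]
        z = sum(w)
        return [wi / z for wi in w]

    def report(c):                    # shadow cell: argmax_x p(x|c) - p*(x)
        p = posterior(c)
        return max(range(k), key=lambda x: p[x] - prior[x])

    by_cell = {}
    for c in product(range(n + 1), repeat=k):   # all count vectors with sum n
        if sum(c) == n:
            by_cell.setdefault(report(c), []).append(c)
    rep_count = {r: [sum(col) / len(cs) for col in zip(*cs)]
                 for r, cs in by_cell.items()}

    def aggregate(reports):
        total = [sum(rep_count[r][t] for r in reports) for t in range(k)]
        p = posterior(total)
        return max(range(k), key=lambda t: p[t])

    return aggregate

# Hypothetical example: 3 labels, skewed prior, five agents' reports.
agg = make_ha(prior=[0.6, 0.3, 0.1], eps=0.5, n=4)
print(agg([1, 1, 0, 1, 2]))
```

Swapping `report` for a threshold rule gives the analogous construction for threshold partitions.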
Our HA outperformed Maj for all of their datasets and for all values of m.\n\n7\n\n\fFigure 4: The prediction error of aggregating data collected from existing mechanisms in previous work [2].\n\nFigure 5: The prediction error of aggregating data collected from Amazon Mechanical Turk.\n\n5.3 Experiments on Amazon Mechanical Turk\n\nWe conducted experiments on Amazon Mechanical Turk (mturk.com) to evaluate our interface design. Our goal was to examine whether workers respond to different interfaces, and whether the interface and aggregation derived from our framework actually lead to better predictions.\nExperiment setup. In our experiment, workers are asked to label 20 blurred images of textures. We considered an asymmetric prior: 80% of the images were carpet and 20% were granite, and we communicated this to the workers. Upon accepting the task, workers were randomly assigned to one of two treatments: Baseline or ProbBased. Both offered a base payment of 10 cents, but the bonus payments on the 5 randomly chosen \u201cground truth\u201d images differed between the treatments.\nThe Baseline treatment is the most commonly seen interface in crowdsourcing markets. For each image, the worker is asked to choose from {Carpet, Granite}. She can get a bonus of 4 cents for each correct answer in the ground truth set. In the ProbBased interface, the worker was asked whether she thinks the probability of the image being Carpet is {more than 80%, no more than 80%}. From Section 4.2, this threshold is optimal when we aim to aggregate information from a potentially large number of agents. To simplify the discussion, we map the two options to {Carpet, Granite} for the rest of this section. 
For the 5 randomly chosen ground truth images, the worker would get 2 cents for each correct answer on carpet images, and 8 cents for each correct answer on granite images. We tuned the bonus amounts so that the expected bonus for answering all questions correctly is approximately the same in each treatment. One can also easily check that for these bonus amounts, workers maximize their expected bonus by honestly reporting their beliefs.

Results. This experiment was completed by 200 workers, 105 in Baseline and 95 in ProbBased. We first examine whether workers' responses differ across interfaces. In particular, we compare the fraction of workers reporting Granite. As shown in Figure 6 (in Appendix A), our results demonstrate that workers do respond to our interface design and are more likely to choose Granite for all images. The differences are statistically significant (p < 0.01). We then examine whether this interface, combined with our heuristic aggregation, leads to better predictions. We perform majority voting (Maj) for Baseline, and apply our heuristic aggregation (HA) to ProbBased. We choose the simplest model (n = 1) for HA, though the results are robust for higher n. Figure 5 shows that our interface leads to considerably smaller aggregation error for different numbers of randomly selected workers. Performing HA for Baseline and Maj for ProbBased both led to higher aggregation errors, which underscores the importance of matching the aggregation method to the interface.
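The incentive claim above is simple arithmetic to verify with the ProbBased bonuses of 2 cents (carpet) and 8 cents (granite): a worker with belief p that the image is carpet earns 2p in expectation by reporting Carpet and 8(1 − p) by reporting Granite, and 2p > 8(1 − p) exactly when p > 0.8, matching the interface's threshold. A minimal check (our own arithmetic, following the payments stated in the text):

```python
def best_report(p, bonus_carpet=2, bonus_granite=8):
    """Return the report maximizing expected bonus for belief p = P(carpet)."""
    exp_carpet = bonus_carpet * p           # paid only if the image is carpet
    exp_granite = bonus_granite * (1 - p)   # paid only if the image is granite
    return "carpet" if exp_carpet > exp_granite else "granite"
```

So a worker whose belief exceeds the 80% prior strictly prefers Carpet, and one whose belief is at most 80% weakly prefers Granite, which is exactly the honest-reporting behavior the ProbBased interface asks for.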
Moreover, our theoretical and simulation results give new insights into the design of optimal interfaces, some of which we confirm experimentally. While certainly more experiments are needed to fully validate our methods, we believe our general framework has value when designing interfaces and aggregation policies for eliciting categorical information.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. This research was partially supported by NSF grant CCF-1512964, NSF grant CCF-1301976, and ONR grant N00014-15-1-2335.

References

[1] R. M. Frongillo, Y. Chen, and I. Kash. Elicitation for aggregation. In The Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[2] N. B. Shah and D. Zhou. Double or nothing: Multiplicative incentive mechanisms for crowdsourcing. In Neural Information Processing Systems, NIPS '15, 2015.

[3] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

[4] L. J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.

[5] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

[6] N. S. Lambert, D. M. Pennock, and Y. Shoham. Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, EC '08, pages 129–138. ACM, 2008.

[7] R. Frongillo and I. Kash. Vector-valued property elicitation. In Proceedings of the 28th Conference on Learning Theory, pages 1–18, 2015.

[8] N. Miller, P. Resnick, and R. Zeckhauser. Eliciting informative feedback: The peer-prediction method. Management Science, 51(9):1359–1373, 2005.

[9] D. Prelec. A Bayesian truth serum for subjective data.
Science, 306(5695):462–466, 2004.

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39:1–38, 1977.

[11] V. Raykar, S. Yu, L. Zhao, G. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.

[12] S. R. Cholleti, S. A. Goldman, A. Blum, D. G. Politte, and S. Don. Veritas: Combining expert opinions without labeled data. In Proceedings of the 20th IEEE International Conference on Tools with Artificial Intelligence, 2008.

[13] R. Jin and Z. Ghahramani. Learning with multiple labels. In Advances in Neural Information Processing Systems, volume 15, pages 897–904, 2003.

[14] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, volume 22, pages 2035–2043, 2009.

[15] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28:20–28, 1979.

[16] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In The 25th Annual Conference on Neural Information Processing Systems (NIPS), 2011.

[17] D. R. Karger, S. Oh, and D. Shah. Budget-optimal crowdsourcing using low-rank matrix approximations. In Proceedings of the 49th Annual Conference on Communication, Control, and Computing (Allerton), 2011.

[18] J. Zou and D. C. Parkes. Get another worker? Active crowdlearning with sequential arrivals. In Proceedings of the Workshop on Machine Learning in Human Computation and Crowdsourcing, 2012.

[19] C. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In The 30th International Conference on Machine Learning (ICML), 2013.

[20] N.
Lambert and Y. Shoham. Eliciting truthful answers to multiple-choice questions. In Proceedings of the Tenth ACM Conference on Electronic Commerce, EC '09, pages 109–118, 2009.

[21] R. Frongillo and I. Kash. General truthfulness characterizations via convex analysis. In Web and Internet Economics, pages 354–370. Springer, 2014.

[22] F. Aurenhammer. Power diagrams: Properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987.

[23] J. Witkowski and D. Parkes. A robust Bayesian truth serum for small populations. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, AAAI '12, 2012.

[24] V. Sheng, F. Provost, and P. Ipeirotis. Get another label? Improving data quality using multiple, noisy labelers. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2008.

[25] P. Ipeirotis, F. Provost, V. Sheng, and J. Wang. Repeated labeling using multiple noisy labelers. Data Mining and Knowledge Discovery, 2014.