{"title": "Online Prediction with Selfish Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 1300, "page_last": 1310, "abstract": "We consider the problem of binary prediction with expert advice in settings where experts have agency and seek to maximize their credibility. This paper makes three main contributions. First, it defines a model to reason formally about settings with selfish experts, and demonstrates that ``incentive compatible'' (IC) algorithms are closely related to the design of proper scoring rules. Second, we design IC algorithms with good performance guarantees for the absolute loss function. Third, we give a formal separation between the power of online prediction with selfish experts and online prediction with honest experts by proving lower bounds for both IC and non-IC algorithms. In particular, with selfish experts and the absolute loss function, there is no (randomized) algorithm for online prediction---IC or otherwise---with asymptotically vanishing regret.", "full_text": "Online Prediction with Sel\ufb01sh Experts\n\nTim Roughgarden\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA 94305\n\ntim@cs.stanford.edu\n\nOkke Schrijvers\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA 94305\n\nokkes@cs.stanford.edu\n\nAbstract\n\nWe consider the problem of binary prediction with expert advice in settings where\nexperts have agency and seek to maximize their credibility. This paper makes\nthree main contributions. First, it de\ufb01nes a model to reason formally about settings\nwith sel\ufb01sh experts, and demonstrates that \u201cincentive compatible\u201d (IC) algorithms\nare closely related to the design of proper scoring rules. 
Second, we design\nIC algorithms with good performance guarantees for the absolute loss function.\nThird, we give a formal separation between the power of online prediction with\nsel\ufb01sh versus honest experts by proving lower bounds for both IC and non-IC\nalgorithms.\nIn particular, with sel\ufb01sh experts and the absolute loss function,\nthere is no (randomized) algorithm for online prediction\u2014IC or otherwise\u2014with\nasymptotically vanishing regret.\n\n1\n\nIntroduction\n\nIn the months leading up to elections and referendums, a plethora of pollsters try to \ufb01gure out\nhow the electorate is going to vote. Different pollsters use different methodologies, reach different\npeople, and may have sources of random errors, so generally the polls don\u2019t fully agree with each\nother. Aggregators such as Nate Silver\u2019s FiveThirtyEight, and The Upshot by the New York Times\nconsolidate these different reports into a single prediction, and hopefully reduce random errors.1\nFiveThirtyEight in particular has a solid track record for their predictions, and as they are transparent\nabout their methodology we use them as a motivating example. To a \ufb01rst-order approximation, they\noperate as follows: \ufb01rst they take the predictions of all the different pollsters, then they assign a\nweight to each of the pollsters based on past performance (and other factors), and \ufb01nally they use the\nweighted average of the pollsters to run simulations and make their own prediction.2\nBut could the presence of an institution that rates pollsters inadvertently create perverse incentives\nfor pollsters? The FiveThirtyEight pollster ratings are publicly available.3 They can be interpreted\nas a reputation, and a low rating can negatively impact future revenue opportunities for a pollster.\nMoreover, it has been demonstrated in practice that experts do not always report their true beliefs\nabout future events. 
For example, in weather forecasting there is a known \u201cwet bias,\u201d where consumer-\nfacing weather forecasters deliberately overestimate low chances of rain (e.g. a 5% chance of rain is\nreported as a 25% chance of rain) because people don\u2019t like to be surprised by rain [Bickel and Kim,\n2008].\n\n1https://fivethirtyeight.com/, https://www.nytimes.com/section/upshot.\n2This is of course a simpli\ufb01cation. FiveThirtyEight also uses features like the change in a poll over time,\nthe state of the economy, and correlations between states. See https://fivethirtyeight.com/features/\nhow-fivethirtyeight-calculates-pollster-ratings/ for details. Our goal in this paper is not to\naccurately model all of the \ufb01ne details of FiveThirtyEight (which are anyways changing all the time). Rather, it\nis to formulate a general model of prediction with experts that clearly illustrates why incentives matter.\n\n3https://projects.fivethirtyeight.com/pollster-ratings/\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fThese examples motivate the development of models of aggregating predictions that endow agency to\nthe data sources.4 While there are multiple models in which we can investigate this issue, a natural\ncandidate is the problem of prediction with expert advice. By focusing on a standard model, we\nabstract away from the \ufb01ne details of FiveThirtyEight (which are anyways changing all the time),\nwhich allows us to formulate a general model of prediction with experts that clearly illustrates why\nincentives matter. In the classical model [Littlestone and Warmuth, 1994, Freund and Schapire, 1997],\nat each time step, several experts make predictions about an unknown event. An online prediction\nalgorithm aggregates experts\u2019 opinions and makes its own prediction at each time step. 
After this\nprediction, the event at this time step is realized and the algorithm incurs a loss as a function of its\nprediction and the realization. To compare its performance against individual experts, for each expert\nthe algorithm calculates what its loss would have been had it always followed the expert\u2019s prediction.\nWhile the problems introduced in this paper are relevant for general online prediction, to focus on\nthe most interesting issues we concentrate on the case of binary events, and real-valued predictions\nin [0, 1]. For different applications, different notions of loss are appropriate, so we parameterize the\nmodel by a loss function \u2113. Thus our formal model is: at each time step t = 1, 2, ..., T:\n\n1. Each expert i makes a prediction p_i^{(t)} \u2208 [0, 1], representing advocacy for event \u201c1.\u201d\n2. The online algorithm commits to a probability q^{(t)} \u2208 [0, 1] as a prediction for event \u201c1.\u201d\n3. The outcome r^{(t)} \u2208 {0, 1} is realized.\n4. The algorithm incurs expected loss \u2113(q^{(t)}, r^{(t)}), and each expert i is assigned loss \u2113(p_i^{(t)}, r^{(t)}).\n\nThe standard goal in this problem is to design an online prediction algorithm that is guaranteed to\nhave expected loss not much larger than that incurred by the best expert in hindsight. The classical\nsolutions maintain a weight for each expert and make a prediction according to which outcome has\nmore expert weight behind it. An expert\u2019s weight can be interpreted as a measure of its credibility in\nlight of its past performance. The (deterministic) Weighted Majority (WM) algorithm always chooses\nthe outcome with more expert weight. The Randomized Weighted Majority (RWM) algorithm\nrandomizes between the two outcomes with probability proportional to their total expert weights.\nThe most common method of updating experts\u2019 weights is via multiplication by 1 - \u03b7 \u2113(p_i^{(t)}, r^{(t)})\nafter each time step t, where \u03b7 is the learning rate. 
We call this the \u201cstandard\u201d or \u201cclassical\u201d version\nof the WM and RWM algorithms.\nThe classical model instills no agency in the experts. To account for this, in this paper we replace\nStep 1 of the classical model by:\n\n1a. Each expert i formulates a belief b_i^{(t)} \u2208 [0, 1].\n1b. Each expert i reports a prediction p_i^{(t)} \u2208 [0, 1] to the algorithm.\n\nEach expert now has two types of loss at each time step: the reported loss \u2113(p_i^{(t)}, r^{(t)}) with respect\nto the reported prediction and the true loss \u2113(b_i^{(t)}, r^{(t)}) with respect to her true beliefs.5\nWhen experts care about the weight that they are assigned, and with it their reputation and influence\nin the algorithm, different loss functions can lead to different expert behaviors. For example, for\nthe quadratic loss function, in the standard WM and RWM algorithms, experts have no reason to\nmisreport their beliefs (see Proposition 8). This is not the case for other loss functions, such as the\nabsolute loss function.6 The standard algorithm with the absolute loss function incentivizes extremal\nreporting, i.e. an expert reports 1 whenever b_i^{(t)} \u2265 1/2 and 0 otherwise. This follows from a simple\n\n4More generally, one can investigate how the presence of machine learning algorithms affects data-generating\nprocesses, either during learning or deployment. We discuss some of this work in the related work section.\n\n5When we speak of the best expert in hindsight, we are always referring to the true losses. Guarantees with\nrespect to reported losses follow from standard results [Littlestone and Warmuth, 1994, Freund and Schapire,\n1997, Cesa-Bianchi et al., 2007], but are not immediately meaningful.\n\n6The loss function is often tied to the particular application. 
For example, in the current FiveThirtyEight\npollster rankings, the performance of a pollster is primarily measured according to an absolute\nloss function and also whether the candidate with the highest polling numbers ended up winning (see\nhttps://github.com/fivethirtyeight/data/tree/master/pollster-ratings). However, in 2008\nFiveThirtyEight used the notion of \u201cpollster introduced error\u201d or PIE, which is the square root of a difference\nof squares, as the most important feature in calculating the weights, see\nhttps://fivethirtyeight.com/features/pollster-ratings-v31/.\n\nderivation or alternatively from results in the property elicitation literature.7 This shows that for the\nabsolute loss function the standard WM algorithm is not \u201cincentive-compatible\u201d in a sense that we\nformalize in Section 2. There are similar examples for the other commonly studied weight update\nrules and for the RWM algorithm. We might care about truthful reporting for its own sake, but\nadditionally the worry is that non-truthful reports will impede our ability to get good regret guarantees\n(with respect to experts\u2019 true losses).\nWe study several fundamental questions about online prediction with selfish experts:\n\n1. What is the design space of \u201cincentive-compatible\u201d online prediction algorithms, where\nevery expert is incentivized to report her true beliefs?\n\n2. Given a loss function like absolute loss, are there incentive-compatible algorithms with good\nregret guarantees?\n\n3. Is online prediction with selfish experts strictly harder than in the classical model with\nhonest experts?\n\nOur Results. The first contribution of this paper is the development of a model for reasoning\nformally about the design and analysis of weight-based online prediction algorithms when experts are\nselfish (Section 2), and the definition of an \u201cincentive-compatible\u201d (IC) algorithm of this kind. 
Intuitively, an\nIC algorithm is such that each expert wants to report its true belief at each time step. We demonstrate\nthat the design of IC online prediction algorithms is closely related to the design of strictly proper\nscoring rules. Using this, we show that for the quadratic loss function, the standard WM and RWM\nalgorithms are IC, whereas these algorithms are not generally IC for other loss functions.\nOur second contribution is the design of IC prediction algorithms for the absolute loss function with\nnon-trivial performance guarantees. For example, our best result for deterministic algorithms is: the\nWM algorithm, with experts\u2019 weights evolving according to the spherical proper scoring rule (see\nSection 3), is IC and has loss at most 2 + \u221a2 times the loss of the best expert in hindsight (in the limit as\nT \u2192 \u221e). A variant of the RWM algorithm with the Brier scoring rule is IC and has expected loss at\nmost 2.62 times that of the best expert in hindsight (also in the limit, see Section 5).\nOur third and most technical contribution is a formal separation between online prediction with\nselfish experts and the traditional setting with honest experts. Recall that with honest experts, the\nclassical (deterministic) WM algorithm has loss at most twice that of the best expert in hindsight (as\nT \u2192 \u221e) [Littlestone and Warmuth, 1994]. We prove in Section 4 that the worst-case loss of every\n(deterministic) IC algorithm, and every (deterministic) non-IC algorithm satisfying mild technical\nconditions, is bounded away from twice that of the best expert in hindsight (even as T \u2192 \u221e).\n
A consequence of our lower bound is that, with selfish experts, there is no natural (randomized)\nalgorithm for online prediction, IC or otherwise, with asymptotically vanishing regret.\nFinally, in Section 6 we show simulations that indicate that different IC methods show similar regret\nbehavior, and that their regret is substantially better than that of the non-IC standard algorithms,\nsuggesting that the worst-case characterization we prove holds more generally.\n\nRelated Work. We believe that our model of online prediction over time with selfish experts is\nnovel. We next survey the multiple other ways in which online learning and incentive issues have\nbeen blended, and the other efforts to model incentive issues in machine learning.\nThere is a large literature on prediction and decision markets (e.g. Chen and Pennock [2010], Horn\net al. [2014]), which also aim to aggregate information over time from multiple parties and make use\nof proper scoring rules to do it. However, prediction markets provide incentives through payments,\nrather than influence, and lack the feedback mechanism to select among experts. While there are\nstrong mathematical connections between cost function-based prediction markets and regularization-based\nonline learning algorithms in the standard (non-IC) model [Abernethy et al., 2013], there do\nnot appear to be any interesting implications for online prediction with selfish experts.\nThere is also an emerging literature on \u201cincentivizing exploration\u201d in partial feedback models such as\nthe bandit model (e.g. Frazier et al. [2014], Mansour et al. [2016]). 
Here, the incentive issues concern\nthe learning algorithm itself, rather than the experts (or \u201carms\u201d) that it makes use of.\n\n7The absolute loss function is known to elicit the median [Bonin, 1976, Thomson, 1979], and since we have\nbinary realizations, the median is either 0 or 1.\n\nThe question of how an expert should report beliefs has been studied before in the literature on strictly\nproper scoring rules [Brier, 1950, McCarthy, 1956, Savage, 1971, Gneiting and Raftery, 2007], but\nthis literature typically considers the evaluation of a single prediction, rather than low-regret learning.\nBayarri and DeGroot [1989] look at correlated settings where strictly proper scoring rules don\u2019t\nsuffice, though they also do not consider how an aggregator can achieve low regret.\nFinally, there are many works that fall under the broader umbrella of incentives in machine learning.\nRoughly, work in this area can be divided into two genres: incentives during the learning stage, e.g.\n[Cai et al., 2015, Shah and Zhou, 2015, Liu and Chen, 2016, Dekel et al., 2010], or incentives during\nthe deployment stage, e.g. Br\u00fcckner and Scheffer [2011], Hardt et al. [2016]. Finally, Babaioff et al.\n[2010] consider the problem of no-regret learning with selfish experts in an ad auction setting, where\nthe incentives come from the allocations and payments of the auction, rather than from weights as in\nour case.\n\n2 Preliminaries and Model\nStandard Model. At each time step t \u2208 {1, ..., T} we want to predict a binary realization r^{(t)} \u2208\n{0, 1}. To help in the prediction, we have access to n experts that for each time step report a prediction\np_i^{(t)} \u2208 [0, 1] about the realization. The realizations are determined by an oblivious adversary, and\nthe predictions of the experts may or may not be accurate. The goal is to use the predictions of the\nexperts in such a way that the algorithm performs nearly as well as the best expert in hindsight. 
Most\nof the algorithms proposed for this problem fall into the following framework.\nDefinition 1 (Weight-update Online Prediction Algorithm). A weight-update online prediction algorithm\nmaintains a weight w_i^{(t)} for each expert and makes its prediction q^{(t)} based on \u2211_{i=1}^n w_i^{(t)} p_i^{(t)}\nand \u2211_{i=1}^n w_i^{(t)} (1 - p_i^{(t)}). After the algorithm makes its prediction, the realization r^{(t)} is revealed, and\nthe algorithm updates the weights of experts using the rule\n\nw_i^{(t+1)} = f(p_i^{(t)}, r^{(t)}) \u00b7 w_i^{(t)},   (1)\n\nwhere f : [0, 1] \u00d7 {0, 1} \u2192 R_+ is a positive function on its domain.\nThe standard WM algorithm has f(p_i^{(t)}, r^{(t)}) = 1 - \u03b7 \u2113(p_i^{(t)}, r^{(t)}) where \u03b7 \u2208 (0, 1/2) is the learning\nrate, and predicts q^{(t)} = 1 if and only if \u2211_i w_i^{(t)} p_i^{(t)} \u2265 \u2211_i w_i^{(t)} (1 - p_i^{(t)}). Let the total loss of the\nalgorithm be M^{(T)} = \u2211_{t=1}^T \u2113(q^{(t)}, r^{(t)}) and let the total loss of expert i be m_i^{(T)} = \u2211_{t=1}^T \u2113(p_i^{(t)}, r^{(t)}).\nThe WM algorithm has the property that M^{(T)} \u2264 2(1 + \u03b7) m_i^{(T)} + (2 ln n)/\u03b7 for each expert i, and\nRWM (where the algorithm picks 1 with probability proportional to \u2211_i w_i^{(t)} p_i^{(t)}) satisfies\nM^{(T)} \u2264 (1 + \u03b7) m_i^{(T)} + (ln n)/\u03b7 for each expert i [Littlestone and Warmuth, 1994, Freund and Schapire,\n1997]. The notion of \u201cno \u03b1-regret\u201d [Kakade et al., 2009] captures the idea that the per time-step loss\nof an algorithm is \u03b1 times that of the best expert in hindsight, plus a term that goes to 0 as T grows:\nDefinition 2 (\u03b1-regret). An algorithm is said to have no \u03b1-regret if M^{(T)} \u2264 \u03b1 min_i m_i^{(T)} + o(T).\nBy taking \u03b7 = O(1/\u221aT), WM is a no 2-regret algorithm, and RWM is a no 1-regret algorithm.\n\nSelfish Model. We consider a model in which experts have agency about the prediction they report,\nand care about the weight that they are assigned. 
In the selfish model, at time t the expert formulates\na private belief b_i^{(t)} about the realization, but she is free to report any prediction p_i^{(t)} to the algorithm.\nLet Bern(p) be a Bernoulli random variable with parameter p. For any non-negative weight update\nfunction f,\n\nmax_p E_{b_i^{(t)}}[w_i^{(t+1)}] = max_p E_{r \u223c Bern(b_i^{(t)})}[f(p, r) w_i^{(t)}] = w_i^{(t)} \u00b7 (max_p E_{r \u223c Bern(b_i^{(t)})}[f(p, r)]).\n\nSo expert i will report whichever p_i^{(t)} maximizes the expectation of the weight update function.\nPerformance of an algorithm with respect to the reported loss of experts follows from the standard\nanalysis [Littlestone and Warmuth, 1994]. However, the true loss may be worse (in Section 3 we\nshow this for the standard update rule, Section 4 shows it more generally). Unless explicitly stated\notherwise, in the remainder of this paper m_i^{(T)} = \u2211_{t=1}^T \u2113(b_i^{(t)}, r^{(t)}) refers to the true loss of expert i.\nFor now this motivates restricting the weight update rule f to functions where reporting p_i^{(t)} = b_i^{(t)}\nmaximizes the expected weight of experts. We call these weight-update rules Incentive Compatible\n(IC).\nDefinition 3 (Incentive Compatibility). A weight-update function f is incentive compatible (IC) if\nreporting the true belief b_i^{(t)} is always a best response for every expert at every time step. It is strictly\nIC when p_i^{(t)} = b_i^{(t)} is the only best response.\n\nBy a \u201cbest response,\u201d we mean an expected utility-maximizing report, where the expectation is with\nrespect to the expert\u2019s beliefs.\nCollusion. The definition of IC does not rule out the possibility that experts can collude to jointly\nmisreport to improve their weights. We therefore also consider a stronger notion of incentive\ncompatibility for groups with transferable utility.8\nDefinition 4 (IC for Groups with Transferable Utility). 
A weight-update function f is IC for groups\nwith transferable utility (TU-GIC) if for every subset S of players, the total expected weight of the\ngroup \u2211_{i \u2208 S} E_{b_i^{(t)}}[w_i^{(t+1)}] is maximized by each reporting their private belief b_i^{(t)}.\n\nProper Scoring Rules. Incentivizing truthful reporting of beliefs has been studied extensively, and\nthe set of functions that do this is called the set of proper scoring rules. Since we focus on predicting\na binary event, we restrict our attention to this class of functions.\nDefinition 5 (Binary Proper Scoring Rule, [Schervish, 1989]). A function f : [0, 1] \u00d7 {0, 1} \u2192\nR \u222a {\u00b1\u221e} is a binary proper scoring rule if it is finite except possibly on its boundary and\nfor every p \u2208 [0, 1] it holds that p \u2208 argmax_{q \u2208 [0,1]} p \u00b7 f(q, 1) + (1 - p) \u00b7 f(q, 0).\nA function f is a strictly proper scoring rule if p is the only value that maximizes the expectation.\nThe first and perhaps most well-known proper scoring rule is the Brier scoring rule.\nExample 6 (Brier Scoring Rule, [Brier, 1950]). The Brier score is Br(p, r) = 2p_r - (p^2 + (1 - p)^2)\nwhere p_r = pr + (1 - p)(1 - r) is the report for the event that materialized.\nWe will use the Brier scoring rule in Section 5 to construct an incentive-compatible randomized\nalgorithm with good guarantees. The following proposition follows directly from Definitions 3 and 5.\nProposition 7. Weight-update rule f is (strictly) IC if and only if f is a (strictly) proper scoring rule.\n\nSurprisingly, this result remains true even when experts can collude. While the realizations are\nobviously correlated, linearity of expectation causes the sum to be maximized exactly when each\nexpert maximizes their own expected weight.\nProposition 8. 
A weight-update rule f is (strictly) incentive compatible for groups with transferable\nutility if and only if f is a (strictly) proper scoring rule.\n\nThus, for online prediction with selfish experts, we get TU-GIC \u201cfor free.\u201d It is quite uncommon for\nproblems in non-cooperative game theory to admit good TU-GIC solutions. For example, results for\nauctions (either for revenue or welfare) break down once bidders collude, see e.g. [Goldberg and\nHartline, 2005]. In the remainder of the paper we will simply use IC to refer to IC and TU-GIC, as\nstrictly proper scoring rules yield algorithms that satisfy both definitions.\nThus, for IC algorithms we are restricted to considering (bounded) proper scoring rules as weight-update\nrules. Conversely, any bounded proper scoring rule can be used, possibly after an affine transformation\n(which preserves properness). Are there any proper scoring rules that give an online prediction\nalgorithm with a good performance guarantee? The standard algorithm for quadratic losses yields a\nweight-update function that is equivalent to the Brier strictly proper scoring rule, and thus is IC. The\nstandard algorithm with absolute losses is not IC, so in the remainder of this paper we discuss this\nsetting in more detail.\n\n8Note that TU-GIC is a strictly stronger concept than IC and group IC with nontransferable utility (NTU-GIC)\n[Moulin, 1999, Jain and Mahdian, 2007].\n\n3 Deterministic Algorithms for Selfish Experts\n\nThis section studies the question of whether there are good online prediction algorithms with selfish experts\nfor the absolute loss function. We restrict our attention here to deterministic algorithms; Section 5\ngives a randomized algorithm with good guarantees.\nProposition 7 tells us that for selfish experts to have a strict incentive to report truthfully, the weight-update\nrule must be a strictly proper scoring rule. 
This section gives a deterministic algorithm based\non the spherical strictly proper scoring rule that has no (2 + \u221a2)-regret (Theorem 10). Additionally,\nwe consider the question of whether the non-truthful reports from experts in using the standard (non-IC) WM\nalgorithm are harmful. We show that this is the case by proving that, for any constant \u03b1 smaller than 4,\nit is not a no \u03b1-regret algorithm (Proposition 11). This shows that, when experts are selfish,\nthe IC online prediction algorithm with the spherical rule outperforms the standard WM algorithm\n(in the worst case).\n\nOnline Prediction using a Spherical Rule. We next give an algorithm that uses a strictly proper\nscoring rule that is based on the spherical scoring rule.9 Consider the following weight-update\nrule:\n\nf_sp(p_i^{(t)}, r^{(t)}) = 1 - \u03b7 (1 - (1 - |p_i^{(t)} - r^{(t)}|) / \u221a((p_i^{(t)})^2 + (1 - p_i^{(t)})^2)).   (2)\n\nThe following proposition establishes that this is in fact a strictly proper scoring rule. Due to space\nconstraints, all proofs appear in Appendix A of the supplementary material.\nProposition 9. The spherical weight-update rule in (2) is a strictly proper scoring rule.\n\nIn addition to incentivizing truthful reporting, the WM algorithm with the update rule f_sp does not do\nmuch worse than the best expert in hindsight.\nTheorem 10. WM with weight-update rule (2) for \u03b7 = O(1/\u221aT) < 1/2 has no (2 + \u221a2)-regret.\n\nTrue Loss of the Non-IC Standard Rule. It is instructive to compare the guarantee in Theorem 10\nwith the performance of the standard (non-IC) WM algorithm. WM with the standard weight update\nfunction f(p_i^{(t)}, r^{(t)}) = 1 - \u03b7|p_i^{(t)} - r^{(t)}| for \u03b7 \u2208 (0, 1/2) has no 2-regret with respect to the reported\nloss of experts. 
However, this algorithm incentivizes extremal reports (for details see Appendix B in\nthe supplementary material), and in the worst case, this algorithm\u2019s loss can be as bad as 4 times the\ntrue loss of the best expert in hindsight. Theorem 10 shows that a suitable IC algorithm obtains a\nsuperior worst-case guarantee.\n\nProposition 11. The standard WM algorithm with weight-update rule f(p_i^{(t)}, r^{(t)}) = 1 - \u03b7|p_i^{(t)} - r^{(t)}|\nresults in a total worst-case loss no better than M^{(T)} \u2265 4 \u00b7 min_i m_i^{(T)} - o(1).\n\n4 The Cost of Selfish Experts\n\nWe now address the third fundamental question: whether or not online prediction with selfish experts\nis strictly harder than with honest experts. As there exists a deterministic algorithm for honest experts\nwith no 2-regret, showing a separation between honest and selfish experts boils down to proving that\nthere exists a constant \u03b4 > 0 such that the best possible no \u03b1-regret algorithm has \u03b1 = 2 + \u03b4. In this\nsection we show that such a \u03b4 exists, and that it is independent of the learning rate. Hence the lower\nbound also holds for algorithms that, like the classical prediction algorithms, use a time-varying\nlearning rate. Due to space considerations, this section only states the main results; for details\nand proofs we refer to the supplementary materials, where Appendix D gives the results for IC\nalgorithms and Appendix E gives the results for the non-IC algorithms. We extend these results\nto randomized algorithms in Section 5, where we rule out the existence of a (possibly randomized)\nno-regret algorithm for selfish experts.\n\n9In Appendix G in the supplementary materials we give an intuition for why this rule yields better results\nthan other natural candidates, such as the Brier scoring rule.\n\nIC Algorithms. 
To prove the lower bound, we have to be specific about which set of algorithms\nwe consider. To cover algorithms that have a decreasing learning parameter, we first show that any\npositive proper scoring rule can be interpreted as having a learning parameter \u03b7.\nProposition 12. Let f be any strictly proper scoring rule. We can write f as f(p, r) = a + b f'(p, r)\nwith a \u2208 R, b \u2208 R_+ and f' a strictly proper scoring rule with min(f'(0, 1), f'(1, 0)) = 0 and\nmax(f'(0, 0), f'(1, 1)) = 1.\nWe call f' : [0, 1] \u00d7 {0, 1} \u2192 [0, 1] a normalized scoring rule. Using normalized scoring rules, we\ncan define a family of scoring rules with different learning rates \u03b7. Define F as the following family\nof proper scoring rules generated by normalized strictly proper scoring rule f:\n\nF = {f'(p, r) = a (1 + \u03b7(f(p, r) - 1)) : a > 0 and \u03b7 \u2208 (0, 1)}\n\nBy Proposition 12 the union of families generated by normalized strictly proper scoring rules covers\nall strictly proper scoring rules. Using this we can now formulate the class of deterministic algorithms\nthat are incentive compatible.\nDefinition 13 (Deterministic IC Algorithms). Let A_d be the set of deterministic algorithms that\nupdate weights by w_i^{(t+1)} = a(1 + \u03b7(f(p_i^{(t)}, r^{(t)}) - 1)) w_i^{(t)}, for a normalized strictly proper scoring\nrule f and \u03b7 \u2208 (0, 1/2) with \u03b7 possibly decreasing over time. For q = \u2211_{i=1}^n w_i^{(t)} p_i^{(t)} / \u2211_{i=1}^n w_i^{(t)}, an algorithm A in A_d\npicks q^{(t)} = 0 if q < 1/2, q^{(t)} = 1 if q > 1/2 and uses any deterministic tie breaking rule for q = 1/2.\n\nUsing this definition we can now state our main lower bound result for IC algorithms:\nTheorem 14. 
For the absolute loss function, there does not exist a deterministic and incentive-compatible\nalgorithm A \u2208 A_d with no 2-regret.\nOf particular interest are symmetric scoring rules, which occur often in practice, and which have a\nrelevant parameter that drives the lower bound results:\nDefinition 15 (Scoring Rule Gap). The scoring rule gap \u03b3 of family F with generator f is \u03b3 =\nf(1/2) - (1/2)(f(0) + f(1)) = f(1/2) - 1/2.\n\nBy definition, the scoring rule gap for strictly proper scoring rules is strictly positive, and it drives the\nlower bound for symmetric functions:\nLemma 16. Let F be a family of scoring rules generated by a symmetric strictly proper scoring rule\nf, and let \u03b3 be the scoring rule gap of F. In the worst case, WM with any scoring rule f' from F\nwith \u03b7 \u2208 (0, 1/2) can do no better than M^{(T)} \u2265 (2 + 1/\u2308\u03b3^{-1}\u2309) \u00b7 m_i^{(T)}.\nAs a consequence of Lemma 16, we can calculate lower bounds for specific strictly proper scoring\nrules. For example, the spherical rule used in Section 3 is a symmetric strictly proper scoring rule\nwith a gap parameter \u03b3 = \u221a2/2 - 1/2, and hence 1/\u2308\u03b3^{-1}\u2309 = 1/5.\n\nNon-IC Algorithms. What about non-incentive-compatible algorithms? Could it be that, even\nwith experts reporting strategically instead of honestly, there is a deterministic algorithm with loss at\nmost twice that of the best expert in hindsight (or a randomized algorithm with vanishing regret), to\nmatch the classical results for honest experts? Under mild technical conditions, the answer is no. The\nfollowing definition captures how players are incentivized to report differently from their beliefs.\nDefinition 17 (Rationality Function). 
For a weight update function f, let \u03c1_f : [0, 1] \u2192 [0, 1] be the\nfunction from beliefs to predictions, such that reporting \u03c1_f(b) is rational for an expert with belief b.\n\nUnder mild technical conditions on the rationality function, we show our main lower bound for\n(potentially non-IC) algorithms.\nTheorem 18. For a weight update function f with continuous or non-strictly increasing rationality\nfunction \u03c1_f, there is no deterministic no 2-regret algorithm.\n\nNote that Theorem 18 covers the standard algorithm, as well as other common update rules such\nas the Hedge update rule f_Hedge(p_i^{(t)}, r^{(t)}) = e^{-\u03b7|p_i^{(t)} - r^{(t)}|} [Freund and Schapire, 1997], and all IC\nmethods, since they have the identity rationality function (though the bounds in Theorem 14 are stronger).\n\n5 Randomized Algorithms: Upper and Lower Bounds\n\nImpossibility of Vanishing Regret. We now consider randomized online learning algorithms,\nwhich can typically achieve better worst-case guarantees than deterministic algorithms. For example,\nwith honest experts, there are randomized algorithms with no 1-regret. Unfortunately, the lower bounds in\nSection 4 imply that no such result is possible for randomized algorithms with selfish experts (more details in Appendix F).\nCorollary 19. Any incentive compatible randomized weight-update algorithm or non-IC randomized\nalgorithm with continuous or non-strictly increasing rationality function cannot be no 1-regret.\n\nAn IC Randomized Algorithm. While we cannot hope to achieve a no-regret algorithm for online\nprediction with selfish experts, we can do better than the deterministic algorithm from Section 3.\nConsider the following class of randomized algorithms:\nDefinition 20 (\u03b8-randomized weighted majority). Let A_r be the class of algorithms that maintain\nexpert weights as in Definition 1. Let b^{(t)} = \u2211_{i=1}^n (w_i^{(t)} / \u2211_{j=1}^n w_j^{(t)}) \u00b7 p_i^{(t)} be the weighted predictions. 
For\nparameter $\\theta \\in [0, \\tfrac{1}{2}]$ the algorithm chooses 1 with probability\n\n$p^{(t)} = \\begin{cases} 0 & \\text{if } b^{(t)} \\le \\theta \\\\ b^{(t)} & \\text{if } \\theta < b^{(t)} \\le 1 - \\theta \\\\ 1 & \\text{otherwise.} \\end{cases}$\n\nWe call algorithms in $\\mathcal{A}_r$ $\\theta$-RWM algorithms. We\u2019ll use the Brier rule\n$f_{\\mathrm{Br}}(p_i^{(t)}, r^{(t)}) = 1 - \\eta\\left(\\frac{(p_i^{(t)})^2 + (1 - p_i^{(t)})^2 + 1}{2} - (1 - s_i^{(t)})\\right)$ with $s_i^{(t)} = |p_i^{(t)} - r^{(t)}|$.\nTheorem 21. Let $A \\in \\mathcal{A}_r$ be a $\\theta$-RWM algorithm with the Brier weight update rule $f_{\\mathrm{Br}}$ and\n$\\theta = 0.382$ and with $\\eta = O(1/\\sqrt{T}) \\in (0, \\tfrac{1}{2})$. $A$ has no 2.62-regret.\n\n6 Simulations\n\nThe theoretical results presented so far indicate that when faced with selfish experts, one should\nuse an IC weight update rule, and ones with smaller scoring rule gap are better. Two objections to\nthese conclusions are: first, the presented results are worst-case, and may not represent behavior on\na typical input. It is of particular interest to see if on non-worst-case inputs, the non-IC standard\nweight-update rule does better or worse than the IC methods proposed in this paper. Second, there\nis a gap between our upper and lower bounds for IC rules, so it\u2019s interesting to see what numerical\nregret is obtained.\n\nResults.\nIn our first simulation, experts are represented by a simple two-state hidden Markov model\n(HMM) with a \u201cgood\u201d state and a \u201cbad\u201d state. Realization $r^{(t)}$ is given by a fair coin. For $r^{(t)} = 0$\n(otherwise beliefs are reversed), in the good state expert $i$ believes $b_i^{(t)} \\sim \\min\\{\\mathrm{Exp}(1)/5, 1\\}$, in the\nbad state $b_i^{(t)} \\sim U[\\tfrac{1}{2}, 1]$. The probability to exit a state is $\\tfrac{1}{10}$ for both states. This data generating\nprocess models that experts that have information about the event are more accurate than experts\nwho lack the information. 
Figure 1a shows the regret as a function of time for the standard (non-IC)\nalgorithm, and IC scoring rules including one from the Beta family [Buja et al., 2005] with\n$\\alpha = \\beta = \\tfrac{1}{2}$. For the IC methods, experts report $p_i^{(t)} = b_i^{(t)}$; for the standard algorithm, $p_i^{(t)} = 1$ if\n$b_i^{(t)} \\ge \\tfrac{1}{2}$ and $p_i^{(t)} = 0$ otherwise. The y axis is the ratio of the total loss of each of the algorithms to\nthe performance of the best expert at that time. The plot is for 10 experts, $T = 10{,}000$, $\\eta = 10^{-2}$,\nand the randomized$^{10}$ versions of the algorithms, averaged over 30 runs. Varying model parameters\nand the deterministic version show similar results.\nEach of the IC methods does significantly better than the standard weight-update algorithm, and\neven at $T = 200{,}000$ (not shown in the graph), the IC methods have a regret factor of about 1.003,\nwhereas the standard algorithm still has 1.14. This gives credence to the notion that failing to account\nfor incentive issues is problematic beyond the worst-case bounds presented earlier. Moreover, while\nthere is a worst-case lower bound that rules out no-regret, for natural synthetic data, the loss of all the\nIC algorithms approaches that of the best expert in hindsight, while the standard algorithm fails to do\n\n$^{10}$Here we use the regular RWM algorithm, so in the notation of Section 5, we have $\\theta = 0$.\n\nFigure 1: Regret for different data-generating processes. (a) The HMM data-generating process. (b) The greedy lower bound instance.\n\nTable 1: Comparison of lower bound results with simulation. The simulation is run for $T = 10{,}000$,\n$\\eta = 10^{-4}$ and we report the average of 30 runs. For the lower bounds, the first number is the\nlower bound from Lemma 16, i.e. 
$2 + \\frac{1}{\\lceil \\gamma^{-1} \\rceil}$, the second number (in parentheses) is $2 + \\gamma$.\n\n            Beta .1        Beta .5        Beta .7        Beta .9        Brier          Spherical\nGreedy LB   2.3708         2.2983         2.2758         2.2584         2.2507         2.2071\nLB Sim      2.4414         2.3186         2.2847         2.2599         2.2502         2.2070\nLem 16 LB   2.33 (2.441)   2.25 (2.318)   2.25 (2.285)   2.25 (2.260)   2.25           2.2 (2.207)\n\nthis. This seems to indicate that eliciting the truthful beliefs of the experts is more important than the\nexact weight-update rule.\nComparison of LB Instances. We consider both the lower bound instance described in the proof of\nLemma 16, and a greedy version that punishes the algorithm every time $w_0^{(t)}$\nis \u201csufficiently\u201d large.$^{11}$\nFigure 1b shows the regret for different algorithms on the greedy lower bound instance. Table 1 shows\nthat it very closely traces $2 + \\gamma$, as do the numerical results for the lower bound from Lemma 16. In\nfact, for the analysis, we needed to use $\\lceil \\gamma^{-1} \\rceil$ when determining the first phase of the instance. When\nwe use $\\gamma$ instead numerically, the regret seems to trace $2 + \\gamma$ quite closely, rather than the weaker\nproven lower bound of $2 + \\frac{1}{\\lceil \\gamma^{-1} \\rceil}$. Table 1 shows that the analysis of Lemma 16 is essentially tight\n(up to the rounding of $\\gamma$). Closing the gap between the lower and upper bound requires finding a\ndifferent lower bound instance, or a better analysis for the upper bound.\n\n7 Open Problems\n\nThere are a number of interesting questions that this work raises. First of all, our utility model\neffectively causes experts to optimize their weight independently of other experts. Bayarri and\nDeGroot [1989] discuss different objective functions for experts, including optimizing relative weight\namong experts under different informational assumptions. 
These would impose different constraints\nas to which algorithms would lead to truthful reporting, and it would be interesting to see if no-regret\nlearning is possible in this setting.\nIt also remains an open problem to close the gap between the best known upper and lower bounds that\nwe presented in this paper. The simulations showed that the analysis for the lower bound instances is\nalmost tight, so this requires a novel upper bound and/or a different lower bound instance.\nFinally, strictly proper scoring rules are also well-de\ufb01ned beyond binary outcomes. It would be\ninteresting to see what bounds can be proved for predictions over more than two outcomes.\n\nis suf\ufb01ciently large we make e0 (and thus the algorithm) wrong twice: b(t)\n\n11When w(t)\n0\nb(t)\n2 = 1\nhigh enough for the algorithm to follow its advice during both steps.\n\n2 , r(t) = 1, and b(t+1)\n\n1 = 1,\n0 = 1, r(t) = 1. \u201cSuf\ufb01ciently\u201d here means that weight of e0 is\n\n0 = 0, b(t)\n\n= 0, b(t)\n\n1 = 1\n\n2 , b(t)\n\n0\n\n9\n\n\fReferences\nJacob Abernethy, Yiling Chen, and Jennifer Wortman Vaughan. Ef\ufb01cient market making via con-\nvex optimization, and a connection to online learning. ACM Transactions on Economics and\nComputation, 1(2):12, 2013.\n\nMoshe Babaioff, Robert D. Kleinberg, and Aleksandrs Slivkins. Truthful mechanisms with implicit\npayment computation. In Proceedings of the 11th ACM Conference on Electronic Commerce, EC\n\u201910, pages 43\u201352, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-822-3. doi: 10.1145/\n1807342.1807349. URL http://doi.acm.org/10.1145/1807342.1807349.\n\nM. J. Bayarri and M. H. DeGroot. Optimal reporting of predictions. Journal of the American\n\nStatistical Association, 84(405):214\u2013222, 1989. doi: 10.1080/01621459.1989.10478758.\n\nJ Eric Bickel and Seong Dae Kim. Veri\ufb01cation of the weather channel probability of precipitation\n\nforecasts. Monthly Weather Review, 136(12):4867\u20134881, 2008.\n\nJohn P Bonin. 
On the design of managerial incentive structures in a decentralized planning environ-\n\nment. The American Economic Review, 66(4):682\u2013687, 1976.\n\nCraig Boutilier. Eliciting forecasts from self-interested experts: scoring rules for decision makers. In\nProceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-\nVolume 2, pages 737\u2013744. International Foundation for Autonomous Agents and Multiagent\nSystems, 2012.\n\nGlenn W Brier. Veri\ufb01cation of forecasts expressed in terms of probability. Monthly weather review,\n\n78(1):1\u20133, 1950.\n\nMichael Br\u00fcckner and Tobias Scheffer. Stackelberg games for adversarial prediction problems. In\nProceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data\nmining, pages 547\u2013555. ACM, 2011.\n\nAndreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation\n\nand classi\ufb01cation: Structure and applications. 2005.\n\nYang Cai, Constantinos Daskalakis, and Christos H Papadimitriou. Optimum statistical estimation\n\nwith strategic data sources. In COLT, pages 280\u2013296, 2015.\n\nNicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for predic-\n\ntion with expert advice. Machine Learning, 66(2-3):321\u2013352, 2007.\n\nYiling Chen and David M Pennock. Designing markets for prediction. AI Magazine, 31(4):42\u201352,\n\n2010.\n\nOfer Dekel, Felix Fischer, and Ariel D Procaccia. Incentive compatible regression learning. Journal\n\nof Computer and System Sciences, 76(8):759\u2013777, 2010.\n\nPeter Frazier, David Kempe, Jon Kleinberg, and Robert Kleinberg. Incentivizing exploration. In\nProceedings of the \ufb01fteenth ACM conference on Economics and computation, pages 5\u201322. ACM,\n2014.\n\nYoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an\n\napplication to boosting. 
Journal of computer and system sciences, 55(1):119\u2013139, 1997.\n\nTilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation.\n\nJournal of the American Statistical Association, 102(477):359\u2013378, 2007.\n\nAndrew V Goldberg and Jason D Hartline. Collusion-resistant mechanisms for single-parameter\nagents. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms,\npages 620\u2013629. Society for Industrial and Applied Mathematics, 2005.\n\nMoritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classi\ufb01cation.\nIn Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science,\npages 111\u2013122. ACM, 2016.\n\n10\n\n\fChristian Franz Horn, Bjoern Sven Ivens, Michael Ohneberg, and Alexander Brem. Prediction\n\nmarkets\u2013a literature review 2014. The Journal of Prediction Markets, 8(2):89\u2013126, 2014.\n\nKamal Jain and Mohammad Mahdian. Cost sharing. Algorithmic game theory, pages 385\u2013410, 2007.\nVictor Richmond R Jose, Robert F Nau, and Robert L Winkler. Scoring rules, generalized entropy,\n\nand utility maximization. Operations research, 56(5):1146\u20131157, 2008.\n\nSham M Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation\n\nalgorithms. SIAM Journal on Computing, 39(3):1088\u20131106, 2009.\n\nNick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and\n\ncomputation, 108(2):212\u2013261, 1994.\n\nYang Liu and Yiling Chen. A bandit framework for strategic regression. In Advances in Neural\n\nInformation Processing Systems, pages 1813\u20131821, 2016.\n\nYishay Mansour, Aleksandrs Slivkins, Vasilis Syrgkanis, and Zhiwei Steven Wu. Bayesian explo-\n\nration: Incentivizing exploration in bayesian games. arXiv preprint arXiv:1602.07570, 2016.\n\nJohn McCarthy. Measures of the value of information. 
Proceedings of the National Academy of\n\nSciences of the United States of America, 42(9):654, 1956.\n\nEdgar C Merkle and Mark Steyvers. Choosing a strictly proper scoring rule. Decision Analysis, 10\n\n(4):292\u2013304, 2013.\n\nNolan Miller, Paul Resnick, and Richard Zeckhauser. Eliciting informative feedback: The peer-\n\nprediction method. Management Science, 51(9):1359\u20131373, 2005.\n\nHerv\u00e9 Moulin. Incremental cost sharing: Characterization by coalition strategy-proofness. Social\n\nChoice and Welfare, 16(2):279\u2013320, 1999.\n\nTim Roughgarden and Eva Tardos. Introduction to the inef\ufb01ciency of equilibria. Algorithmic Game\n\nTheory, 17:443\u2013459, 2007.\n\nLeonard J Savage. Elicitation of personal probabilities and expectations. Journal of the American\n\nStatistical Association, 66(336):783\u2013801, 1971.\n\nMark J Schervish. A general method for comparing probability assessors. The Annals of Statistics,\n\npages 1856\u20131879, 1989.\n\nNihar Bhadresh Shah and Denny Zhou. Double or nothing: Multiplicative incentive mechanisms for\n\ncrowdsourcing. In Advances in neural information processing systems, pages 1\u20139, 2015.\n\nWilliam Thomson. Eliciting production possibilities from a well-informed manager. Journal of\n\nEconomic Theory, 20(3):360\u2013380, 1979.\n\n11\n\n\f", "award": [], "sourceid": 865, "authors": [{"given_name": "Tim", "family_name": "Roughgarden", "institution": "Stanford University"}, {"given_name": "Okke", "family_name": "Schrijvers", "institution": "Facebook Inc."}]}