{"title": "Learning in Games: Robustness of Fast Convergence", "book": "Advances in Neural Information Processing Systems", "page_first": 4734, "page_last": 4742, "abstract": "We show that learning algorithms satisfying a low approximate regret property experience fast convergence to approximate optimality in a large class of repeated games. Our property, which simply requires that each learner has small regret compared to a (1+eps)-multiplicative approximation to the best action in hindsight, is ubiquitous among learning algorithms; it is satisfied even by the vanilla Hedge forecaster. Our results improve upon recent work of Syrgkanis et al. in a number of ways. We require only that players observe payoffs under other players' realized actions, as opposed to expected payoffs. We further show that convergence occurs with high probability, and show convergence under bandit feedback. Finally, we improve upon the speed of convergence by a factor of n, the number of players. Both the scope of settings and the class of algorithms for which our analysis provides fast convergence are considerably broader than in previous work. Our framework applies to dynamic population games via a low approximate regret property for shifting experts. Here we strengthen the results of Lykouris et al. in two ways: We allow players to select learning algorithms from a larger class, which includes a minor variant of the basic Hedge algorithm, and we increase the maximum churn in players for which approximate optimality is achieved. In the bandit setting we present a new algorithm which provides a \"small loss\"-type bound with improved dependence on the number of actions in utility settings, and is both simple and efficient. This result may be of independent interest.", "full_text": "Learning in Games: Robustness of Fast Convergence\n\nDylan J. 
Foster∗ Zhiyuan Li† Thodoris Lykouris∗ Karthik Sridharan∗ Éva Tardos∗

Abstract

We show that learning algorithms satisfying a low approximate regret property experience fast convergence to approximate optimality in a large class of repeated games. Our property, which simply requires that each learner has small regret compared to a (1 + ε)-multiplicative approximation to the best action in hindsight, is ubiquitous among learning algorithms; it is satisfied even by the vanilla Hedge forecaster. Our results improve upon recent work of Syrgkanis et al. [28] in a number of ways. We require only that players observe payoffs under other players' realized actions, as opposed to expected payoffs. We further show that convergence occurs with high probability, and show convergence under bandit feedback. Finally, we improve upon the speed of convergence by a factor of n, the number of players. Both the scope of settings and the class of algorithms for which our analysis provides fast convergence are considerably broader than in previous work.
Our framework applies to dynamic population games via a low approximate regret property for shifting experts. Here we strengthen the results of Lykouris et al. [19] in two ways: We allow players to select learning algorithms from a larger class, which includes a minor variant of the basic Hedge algorithm, and we increase the maximum churn in players for which approximate optimality is achieved.
In the bandit setting we present a new algorithm which provides a "small loss"-type bound with improved dependence on the number of actions in utility settings, and is both simple and efficient. This result may be of independent interest.

1 Introduction

Consider players repeatedly playing a game, all acting independently to minimize their cost or maximize their utility.
It is natural in this setting for each player to use a learning algorithm that guarantees small regret to decide on their strategy, as the environment is constantly changing due to each player's choice of strategy. It is well known that such decentralized no-regret dynamics are guaranteed to converge to a form of equilibrium for the game. Furthermore, in a large class of games known as smooth games [23] they converge to outcomes with approximately optimal social welfare matching the worst-case efficiency loss of Nash equilibria (the price of anarchy). In smooth cost minimization games the overall cost is λ/(1 − μ) times the minimum cost, while in smooth mechanisms [29] such as auctions it is λ/max(1, μ) times the maximum total utility (where λ and μ are parameters of the smoothness condition). Examples of smooth games and mechanisms include routing games and many forms of auction games (see e.g. [23, 29, 24]).
The speed at which the game outcome converges to this approximately optimal welfare is governed by individual players' regret bounds. There are a large number of simple regret minimization algorithms (Hedge/Multiplicative Weights, Mirror Descent, Follow the Regularized Leader; see e.g. [12]) that

∗Cornell University {djfoster,teddlyk,sridharan,eva}@cs.cornell.edu. Work supported in part under NSF grants CDS&E-MSS 1521544, CCF-1563714, ONR grant N00014-08-1-0031, a Google faculty research award, and an NDSEG fellowship.
†Tsinghua University, lizhiyuan13@mails.tsinghua.edu.cn.
Research performed while the author was visiting Cornell University.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

guarantee that the average regret goes down as O(1/√T) with time T, which is tight in adversarial settings.
Taking advantage of the fact that playing a game against opponents who themselves are also using regret minimization is not a truly adversarial setting, a sequence of papers [9, 22, 28] showed that by using specific learning algorithms, the dependence on T of the convergence rate can be improved to O(1/T) ("fast convergence"). Concretely, Syrgkanis et al. [28] show that all algorithms satisfying the so-called RVU property (Regret by Variation in Utilities), which include Optimistic Mirror Descent [22], converge at an O(1/T) rate with a fixed number of players.
One issue with the works of [9, 22, 28] is that they use expected cost as their feedback model for the players. In each round every player receives the expected cost for each of their available actions, in expectation over the current action distributions of all other players. This clearly represents more information than is realistically available to players in games; at most each player sees the cost of each of their actions given the actions taken by the other players (realized feedback). In fact, even if each player had access to the action distributions of the other players, simply computing this expectation is generally intractable when n, the number of players, is large.
We improve the result of [28] on the convergence to approximate optimality in smooth games in a number of different aspects. To achieve this, we relax the quality of approximation from the bound guaranteed by smoothness. Typical smoothness bounds on the price of anarchy in auctions are small constants, such as a factor of 1.58 or 2 in item auctions.
Increasing the approximation factor by an arbitrarily small constant ε > 0 enables the following results:

• We show that learning algorithms obtaining fast convergence are ubiquitous.
• We improve the speed of convergence by a factor of n, the number of players.
• For all our results, players only need feedback based on realized outcomes, instead of expected outcomes.
• We show that convergence occurs with high probability in most settings.
• We extend the results to show that it is enough for the players to observe realized bandit feedback, only seeing the outcome of the action they play.
• Our results apply to settings where the set of players in the game changes over time [19]. We strengthen previous results by showing that a broader class of algorithms achieve approximate efficiency under significant churn.

We achieve these results using a property we term Low Approximate Regret, which simply states that an online learning algorithm achieves good regret against a multiplicative approximation of the best action in hindsight. This property is satisfied by many known algorithms, including even the vanilla Hedge algorithm, as well as Optimistic Hedge [21, 28] (via a new analysis). The crux of our analysis technique is the simple observation that for many types of data-dependent regret bounds we can fold part of the regret bound into the comparator term, allowing us to explore the trade-off between additive and multiplicative approximation.
In Section 3, we show that Low Approximate Regret implies fast convergence to the social welfare guaranteed by the price of anarchy via the smoothness property. This convergence only requires feedback from the realized actions played by other players, not their action distribution or the expectation over their actions. We further show that this convergence occurs with high probability in most settings.
For games with a large number of players we also improve the speed of convergence. [28] shows that players using Optimistic Hedge in a repeated game with n players converge to the approximately optimal outcome guaranteed by smoothness at a rate of O(n²/T). They also offer an analysis guaranteeing convergence of O(n/T), at the expense of a constant factor decrease in the quality of approximation (e.g., a factor of 4 in atomic congestion games with affine congestion). We achieve the convergence bound of O(n/T) with only an arbitrarily small loss in the approximation.
Algorithms that satisfy the Low Approximate Regret property are ubiquitous and include simple, efficient algorithms such as Hedge and variants. The observation that this broad class of algorithms enjoys fast convergence in realistic settings suggests that fast convergence occurs in practice.
Comparing our work to [28] with regard to feedback, Low Approximate Regret algorithms require only realized feedback, while the analysis of the RVU property in [28] requires expected feedback. To see the contrast, consider the load balancing game introduced in [17] with two players and two bins, where each player selects a bin and observes cost given by the number of players in that bin. Initialized at the uniform distribution, any learning algorithm with expectation feedback (e.g. those in [28]) will stay at the uniform distribution forever, because the expected cost vector distributes cost equally across the two bins. This gives low regret under expected costs, but suppose we were interested in realized costs: The only "black box" way to lift [28] to this case would be to simply evaluate the regret bound above under realized costs, but here players will experience Θ(1/√T) variation because they select bins uniformly at random, ruining the fast convergence.
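The two-player load balancing contrast above can be simulated directly. The sketch below is our own illustration (not code from the paper): it runs Hedge for both players under the two feedback models. Starting from uniform, expectation feedback leaves both distributions at uniform forever, while realized feedback makes them fluctuate from round to round.

```python
import math
import random

def hedge_update(weights, costs, eta):
    """Multiplicative-weights step: scale w_x by exp(-eta * c_x), then renormalize."""
    new = [w * math.exp(-eta * c) for w, c in zip(weights, costs)]
    z = sum(new)
    return [w / z for w in new]

def run(feedback, T=1000, eta=0.1, seed=0):
    rng = random.Random(seed)
    w = [[0.5, 0.5], [0.5, 0.5]]  # both players start uniform over the two bins
    for _ in range(T):
        bins = [0 if rng.random() < wi[0] else 1 for wi in w]
        costs = []
        for i in range(2):
            other = 1 - i
            if feedback == "expected":
                # cost of bin b = 1 + P(other player picks bin b)
                costs.append([1 + w[other][b] for b in range(2)])
            else:
                # realized: cost of bin b = 1 + 1{other player actually picked bin b}
                costs.append([1 + (1 if bins[other] == b else 0) for b in range(2)])
        w = [hedge_update(w[i], costs[i], eta) for i in range(2)]
    return w

print(run("expected")[0])   # stays exactly [0.5, 0.5]
print(run("realized")[0])   # typically drifts away from uniform
```

Under expectation feedback both coordinates of the cost vector are always equal, so the update is a no-op on the distribution; under realized feedback the per-round asymmetry drives the Θ(1/√T)-scale variation discussed above.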
Our analysis sidesteps this issue because players achieve Low Approximate Regret with high probability.
In Section 4 we consider games where players can only observe the cost of the action they played given the actions taken by the other players, and receive no feedback for actions not played (bandit feedback). [22] analyzed zero-sum games with bandit feedback, but assumed that players receive expected cost over the strategies of all other players. In contrast, the Low Approximate Regret property can be satisfied by just observing realizations, even with bandit feedback. We propose a new bandit algorithm based on log-barrier regularization with importance sampling that guarantees fast convergence of O(d log T / ε), where d is the number of actions. Known techniques would either result in a convergence rate of O(d³ log T) (e.g. adaptations of SCRiBLe [21]) or would not extend to utility maximization settings (e.g. GREEN [2]). Our technique is of independent interest since it improves the dependence of approximate regret bounds on the number of experts while applying to both cost minimization and utility maximization settings.
Finally, in Section 5, we consider the dynamic population game setting of [19], where players enter and leave the game over time. [19] showed that regret bounds for shifting experts directly influence the rate at which players can turn over and still guarantee close to optimal solutions on average. We show that a number of learning algorithms have the Low Approximate Regret property in the shifting experts setting, allowing us to extend the fast convergence result to dynamic games. Such learning algorithms include a noisy version of Hedge as well as AdaNormalHedge [18], which was previously studied in the dynamic setting in [19].
Low Approximate Regret allows us to increase the turnover rate from the one in [19], while also widening and simplifying the class of learning algorithms that players can use to guarantee the close to optimal average welfare.

2 Repeated Games and Learning Dynamics

We consider a game G among a set of n players. Each player i has an action space S_i and a cost function cost_i : S_1 × ··· × S_n → [0, 1] that maps an action profile s = (s_1, . . . , s_n) to the cost cost_i(s) that player i experiences.¹ We assume that the action space of each player has cardinality d, i.e. |S_i| = d. We let w = (w_1, . . . , w_n) denote a list of probability distributions over all players' actions, where w_i ∈ Δ(S_i) and w_{i,x} is the probability of action x ∈ S_i.
The game is repeated for T rounds. At each round t each player i picks a probability distribution w^t_i ∈ Δ(S_i) over actions and draws their action s^t_i from this distribution. Depending on the game playing environment under consideration, players will receive different types of feedback after each round. In Sections 3 and 5 we consider feedback where at the end of the round each player i observes the utility they would have received had they played any possible action x ∈ S_i given the actions taken by the other players. More formally, let s^t_{−i} denote the actions of all but the i-th player at round t, let c^t_{i,x} = cost_i(x, s^t_{−i}), and let c^t_i = (c^t_{i,x})_{x∈S_i}. Note that the expected cost of player i at round t (conditioned on the other players' actions) is simply the inner product ⟨w^t_i, c^t_i⟩. We refer to this form of feedback as realized feedback since it only depends on the realized actions s^t_{−i} sampled by the opponents; it does not directly depend on their distributions w^t_{−i}. This should be contrasted with the expectation feedback used by [28, 9, 22], where player i observes E_{s^t_{−i} ∼ w^t_{−i}}[cost_i(x, s^t_{−i})] for each x.
Sections 4 and 5 consider extensions of our repeated game model. In Section 4 we examine partial information ("bandit") feedback, where players observe only the cost of their own realized actions. In Section 5 we consider a setting where the player set is evolving over time. Here we use the dynamic population model of [19], where at each round t each player i is replaced ("turns over") with some probability p. The new player has cost function cost^t_i(·) and action space S^t_i which may change arbitrarily subject to certain constraints. We will formalize this notion later on.

Learning Dynamics  We assume that players select their actions using learning algorithms satisfying a property we call Low Approximate Regret, which simply requires that the cumulative cost of the learner multiplicatively approximates the cost of the best action they could have chosen in hindsight.

¹See Appendix D for analogous definitions for utility maximization games.

We will see in subsequent sections that this property is ubiquitous and leads to fast convergence in a robust range of settings.
Definition 1. (Low Approximate Regret) A learning algorithm for player i satisfies the Low Approximate Regret property for parameter ε > 0 and function A(d, T) if for all action distributions f ∈ Δ(S_i),

    (1 − ε) Σ_{t=1}^{T} ⟨w^t_i, c^t_i⟩ ≤ Σ_{t=1}^{T} ⟨f, c^t_i⟩ + A(d, T)/ε.    (1)

A learning algorithm satisfies Low Approximate Regret against shifting experts if for all sequences f^1, . . . , f^T ∈ Δ(S_i), letting K = |{t ≥ 2 : f^{t−1} ≠ f^t}| be the number of shifts,

    (1 − ε) Σ_{t=1}^{T} ⟨w^t_i, c^t_i⟩ ≤ Σ_{t=1}^{T} ⟨f^t, c^t_i⟩ + (1 + K) · A(d, T)/ε.    (2)

In the bandit feedback setting, we require (1) to hold in expectation over the realized strategies of player i for any f ∈ Δ(S_i) fixed before the game begins.
We use the version of the Low Approximate Regret property with shifting experts when considering players in dynamic population games in Section 5. In this case, the game environment is constantly changing due to churn in the population, and we need the players to have low approximate regret with shifting experts to guarantee high social welfare despite the churn.
We emphasize that all algorithms we are aware of that satisfy Low Approximate Regret can be made to do so for any fixed choice of the approximation factor ε via an appropriate selection of parameters. Many algorithms have an even stronger property: They satisfy (1) or (2) for all ε > 0 simultaneously. We say that such algorithms satisfy the Strong Low Approximate Regret property. This property has favorable consequences in the context of repeated games.
The Low Approximate Regret property differs from previous properties such as RVU in that it only requires that the learner's cost be close to a multiplicative approximation to the cost of the best action in hindsight. Consequently, it is always smaller than the regret. For instance, if we consider only uniform (i.e. not data-dependent) regret bounds, the Hedge algorithm can only achieve O(√(T log d)) exact regret, but can achieve Low Approximate Regret with parameters ε and A(d, T) = O(log d) for any ε > 0.
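For intuition, inequality (1) can be checked numerically for vanilla Hedge. The sketch below is our own illustration (not code from the paper): it runs Hedge with learning rate η = ε on a random cost sequence in [0, 1] and verifies the Low Approximate Regret inequality with A(d, T) = log d.

```python
import math
import random

def hedge(costs, eta):
    """Run Hedge with learning rate eta; return the per-round distributions w^t."""
    d = len(costs[0])
    logw = [0.0] * d
    dists = []
    for c in costs:
        m = max(logw)  # subtract max for numerical stability before exponentiating
        expw = [math.exp(lw - m) for lw in logw]
        z = sum(expw)
        dists.append([e / z for e in expw])
        logw = [lw - eta * cx for lw, cx in zip(logw, c)]
    return dists

rng = random.Random(1)
d, T, eps = 5, 2000, 0.1
costs = [[rng.random() for _ in range(d)] for _ in range(T)]
dists = hedge(costs, eta=eps)

# Cumulative expected cost of the learner, and cost of the best fixed action.
learner = sum(sum(w * c for w, c in zip(wt, ct)) for wt, ct in zip(dists, costs))
best = min(sum(costs[t][x] for t in range(T)) for x in range(d))
# Low Approximate Regret check: (1 - eps) * learner <= best + log(d) / eps
print((1 - eps) * learner <= best + math.log(d) / eps)  # True
```

The inequality follows from the classical Hedge bound Σ_t ⟨w^t, c^t⟩ ≤ (log d + η·L*)/(1 − e^{−η}) with η = ε, since 1 − e^{−ε} ≥ ε(1 − ε); this is an instance of folding part of the regret bound into the comparator term.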
Low Approximate Regret is analogous to the notion of α-regret from [15], with α = (1 + ε).
In Appendix D we show that the Low Approximate Regret property and our subsequent results naturally extend to utility maximization games.

Smooth Games  It is well-known that in a large class of games, termed smooth games by Roughgarden [23], traditional learning dynamics converge to approximately optimal social welfare. In subsequent sections we analyze the convergence of Low Approximate Regret learning dynamics in such smooth games. We will see that Low Approximate Regret (for sufficiently small A(d, T)) coupled with smoothness of the game implies fast convergence of learning dynamics to desirable social welfare under a variety of conditions. Before proving this result we review social welfare and smooth games.
For a given action profile s, the social cost is C(s) = Σ_{i=1}^{n} cost_i(s). To bound the efficiency loss due to the selfish behavior of the players we define

    OPT = min_{s°} Σ_{i=1}^{n} cost_i(s°).

Definition 2. (Smooth game [23]) A cost minimization game is called (λ, μ)-smooth if for all strategy profiles s and s⋆: Σ_i cost_i(s⋆_i, s_{−i}) ≤ λ · C(s⋆) + μ · C(s).
This property is typically applied using a (close to) optimal action profile s⋆ = s°. For this case the property implies that if s is an action profile with very high cost, then some player deviating to her share of the optimal profile s⋆_i will improve her cost.
For smooth games, the price of anarchy is at most λ/(1 − μ), meaning that Nash equilibria of the game, as well as no-regret learning outcomes in the limit, have social cost at most a factor of λ/(1 − μ) above the optimum. Smooth cost minimization games include congestion games such as routing or load balancing.
For example, atomic congestion games with affine cost functions are (5/3, 1/3)-smooth [8], and non-atomic games are (1, 0.25)-smooth [25], implying a price of anarchy of 2.5 and 1.33 respectively. While we focus on cost-minimization games for simplicity of exposition, an analogous definition also applies for utility maximization, including smooth mechanisms [29], which we elaborate on in Appendix D. Smooth mechanisms include most simple auctions. For example, the first price item auction is (1 − 1/e, 1)-smooth and all-pay auctions are (1/2, 1)-smooth, implying a price of anarchy of 1.58 and 2 respectively. All of our results extend to such mechanisms.

3 Learning in Games with Full Information Feedback

We now analyze the efficiency of algorithms with the Low Approximate Regret property in the full information setting. Our first proposition shows that, for smooth games with full information feedback, learners with the Low Approximate Regret property converge to efficient outcomes.
Proposition 1. In any (λ, μ)-smooth game, if all players use Low Approximate Regret algorithms satisfying Eq. (1) with parameters ε and A(d, T), then for the action profiles s^t drawn on round t from the corresponding mixed actions of the players,

    (1/T) Σ_t E[C(s^t)] ≤ (λ/(1 − μ − ε)) · OPT + (n/T) · (1/(1 − μ − ε)) · A(d, T)/ε.

Proof. This proof is a straightforward modification of the usual price of anarchy proof for smooth games. We obtain the claimed bound by writing Σ_t E[C(s^t)] = Σ_i Σ_t E[cost_i(s^t)], using the Low Approximate Regret property with f = s⋆_i for each player i for the optimal solution s⋆, then using the smoothness property for each time t to bound Σ_i cost_i(s⋆_i, s^t_{−i}), and finally rearranging terms.
For ε ≪ (1 − μ), the approximation factor λ/(1 − μ − ε) is very close to the price of anarchy λ/(1 − μ).
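As a quick arithmetic check of the smoothness constants quoted in this section, the price of anarchy bounds can be evaluated directly (our own illustration; `poa_mechanism` expresses the welfare guarantee of a smooth mechanism as a multiplicative gap, i.e. the reciprocal of the λ/max(1, μ) fraction):

```python
import math

def poa_cost(lam, mu):
    """Price of anarchy bound lam / (1 - mu) for a (lam, mu)-smooth cost game."""
    return lam / (1 - mu)

def poa_mechanism(lam, mu):
    """Multiplicative welfare gap max(1, mu) / lam for a (lam, mu)-smooth mechanism."""
    return max(1.0, mu) / lam

print(round(poa_cost(5 / 3, 1 / 3), 2))              # 2.5  (atomic congestion, affine costs)
print(round(poa_cost(1.0, 0.25), 2))                 # 1.33 (non-atomic congestion)
print(round(poa_mechanism(1 - 1 / math.e, 1.0), 2))  # 1.58 (first price item auction)
print(round(poa_mechanism(0.5, 1.0), 2))             # 2.0  (all-pay auction)
```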
This shows that Low Approximate Regret learning dynamics quickly converge to outcomes with social welfare arbitrarily close to the welfare guaranteed for exact Nash equilibria by the price of anarchy. A simple corollary of this proposition is that, when players use learning algorithms that satisfy the Strong Low Approximate Regret property, the bound above can be taken to depend on OPT even though this value is unknown to the players.
Whenever the Low Approximate Regret property is satisfied, a high probability version of the property with similar dependence on ε and A(d, T) is also satisfied. This implies that in addition to quickly converging to efficient outcomes in expectation, Low Approximate Regret learners experience fast convergence with high probability.
Proposition 2. In any (λ, μ)-smooth game, if all players use Low Approximate Regret algorithms satisfying Eq. (1) for parameters ε and A(d, T), then for the action profile s^t drawn on round t from the players' mixed actions and γ = 2ε/(1 + ε), we have that ∀ δ > 0, with probability at least 1 − δ,

    (1/T) Σ_t C(s^t) ≤ (λ/(1 − μ − γ)) · OPT + (n/T) · (1/(1 − μ − γ)) · [4A(d, T)/γ + 12 log(n log₂(T)/δ)/γ].

Examples of Simple Low Approximate Regret Algorithms  Propositions 1 and 2 are informative when applied with algorithms for which A(d, T) is sufficiently small. One would hope that such algorithms are relatively simple and easy to find. We show now that the well-known Hedge algorithm as well as basic variants such as Optimistic Hedge and Hedge with online learning rate tuning satisfy the property with A(d, T) = O(log d), which will lead to fast convergence both in terms of n and T. For these algorithms, and indeed all that we consider in this paper, we can achieve the Low Approximate Regret property for any fixed ε > 0 via an appropriate parameter setting.
In Appendix A.2, we provide full descriptions and proofs for these algorithms.
Example 1. Hedge satisfies the Low Approximate Regret property with A(d, T) = log(d). In particular one can achieve the property for any fixed ε > 0 by using ε as the learning rate.
Example 2. Hedge with online learning rate tuning satisfies the Strong Low Approximate Regret property with A(d, T) = O(log d).
Example 3. Optimistic Hedge satisfies the Low Approximate Regret property with A(d, T) = 8 log(d). As with vanilla Hedge, we can choose the learning rate to achieve the property with any ε.
Example 4. Any algorithm satisfying a "small loss" regret bound of the form √((Learner's cost) · A) or √((Cost of best action) · A) satisfies Strong Low Approximate Regret via the AM-GM inequality, i.e. √((Learner's cost) · A) ≤ inf_{ε>0} [ε · (Learner's cost) + A/ε]. In particular, this implies that the following algorithms have Strong Low Approximate Regret: Canonical small loss and self-confident algorithms, e.g. [11, 4, 30], the algorithm of [7], Variation MW [13], AEG-Path [26], AdaNormalHedge [18], Squint [16], and Optimistic PAC-Bayes [10].

Example 4 shows that the Strong Low Approximate Regret property in fact is ubiquitous, as it is satisfied by any algorithm that provides small loss regret bounds or one of many variants on this type of bound. Moreover, all algorithms that satisfy the Low Approximate Regret property for all fixed ε can be made to satisfy the strong property using the doubling trick.

Main Result for Full Information Games:
Theorem 3.
In any (λ, μ)-smooth game, if all players use Low Approximate Regret algorithms satisfying (1) for parameter ε² and A(d, T) = O(log d), then

    (1/T) Σ_t E[C(s^t)] ≤ (λ/(1 − μ − ε)) · OPT + (n/T) · (1/(1 − μ − ε)) · O(log d)/ε,

and furthermore, ∀ δ > 0, with probability at least 1 − δ,

    (1/T) Σ_t C(s^t) ≤ (λ/(1 − μ − ε)) · OPT + (n/T) · (1/(1 − μ − ε)) · [O(log d)/ε + O(log(n log₂(T)/δ))/ε].

Corollary 4. If all players use Strong Low Approximate Regret algorithms then: 1. The above results hold for all ε > 0 simultaneously. 2. Individual players have regret bounded by O(T^{−1/2}), even in adversarial settings. 3. The players approach a coarse correlated equilibrium asymptotically.

Comparison with Syrgkanis et al. [28].  By relaxing the standard λ/(1 − μ) price of anarchy bound, Theorem 3 substantially broadens the class of algorithms that experience fast convergence to include even the common Hedge algorithm. The main result of [28] shows that learning algorithms that satisfy their RVU property converge to the price of anarchy bound λ/(1 − μ) at rate n² log d/T. They further achieve a worse approximation of λ(1 + μ)/(μ(1 − μ)) at the improved (in terms of n) rate of n log d/T. We converge to an approximation arbitrarily close to λ/(1 − μ) at a rate of n log d/T. Note that in atomic congestion games with affine congestion functions μ = 1/3, so their bound of λ(1 + μ)/(μ(1 − μ)) loses a factor of 4 compared to the price of anarchy.
Strong Low Approximate Regret algorithms such as Hedge with online learning rate tuning simultaneously experience both fast O(n/T) convergence in games and an O(1/√T) bound on individual regret in adversarial settings.
In contrast, [28] only shows O(n/√T) individual regret and O(n³/T) convergence to price of anarchy simultaneously.
Low Approximate Regret algorithms only need realized feedback, whereas [28] require expectation feedback. Having players receive expectation feedback is unrealistic in terms of both information and computation. Indeed, even if the necessary information was available, computing expectations over discrete probability distributions is not tractable unless n is taken to be constant.
Our results imply that Optimistic Hedge enjoys the best of two worlds: It enjoys fast convergence to the exact λ/(1 − μ) price of anarchy using expectation feedback as well as fast convergence to the ε-approximate price of anarchy using realized feedback. Our new analysis of Optimistic Hedge (Appendix A.2.2) sheds light on another desirable property of this algorithm: Its regret is bounded in terms of the net cost incurred by Hedge. Figure 1 summarizes the differences between our results.

                          | Feedback       | POA      | Rate             | Time comp.
RVU property [28]         | Expected costs | exact    | O(n² log d/T)    | d^{O(n)} per round
LAR property (Section 2)  | Realized costs | ε-approx | O(n log d/(εT))  | O(d) per round

Figure 1: Comparison of Low Approximate Regret and RVU properties.

²We can also show that the theorem holds if players satisfy the property for different values of ε, but with a dependence on the worst case value of ε across all players.

4 Bandit Feedback

In many realistic scenarios, the players of a game might not even know what they would have lost or gained if they had deviated from the action they played. We model this lack of information with bandit feedback, in which each player observes a single scalar, cost_i(s^t) = ⟨s^t_i, c^t_i⟩, per round.³ When the game considered is smooth, one can use the Low Approximate Regret property as in the full information setting to show that players quickly converge to efficient outcomes.
Our results here hold with the same generality as in the full information setting: As long as learners satisfy the Low Approximate Regret property (1), an efficiency result analogous to Proposition 1 holds.
Proposition 5. Consider a (λ, μ)-smooth game. If all players use bandit learning algorithms with Low Approximate Regret A(d, T) then

    (1/T) E[Σ_t C(s^t)] ≤ (λ/(1 − μ − ε)) · OPT + (n/T) · (1/(1 − μ − ε)) · A(d, T)/ε.

Bandit Algorithms with Low Approximate Regret  The bandit Low Approximate Regret property requires that (1) holds in expectation against any sequence of adaptive and potentially adversarially chosen costs, but only for an obliviously chosen comparator f.⁴ This is weaker than requiring that an algorithm achieve a true expected regret bound; it is closer to pseudo-regret.
The Exp3Light algorithm [27] satisfies Low Approximate Regret with A(d, T) = d² log T. The SCRiBLe algorithm introduced in [1] (via the analysis in [21]) enjoys the Low Approximate Regret property with A(d, T) = d³ log(dT). The GREEN algorithm [2] achieves the Low Approximate Regret property with A(d, T) = d log(T), but only works with costs and not gains. This prevents it from being used in utility settings such as auctions, as in Appendix D.
We present a new bandit algorithm (Algorithm 3) that achieves Low Approximate Regret with A(d, T) = d log(T/d) and thus matches the performance of GREEN, but works in both cost minimization and utility maximization settings. This method is based on Online Mirror Descent with a logarithmic barrier for the positive orthant, but differs from earlier algorithms based on the logarithmic barrier (e.g. [21]) in that it uses the classical importance-weighted estimator for costs instead of sampling based on the Dikin ellipsoid. It can be implemented in Õ(d) time per round, using line search to find the normalizing constant ν.
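A minimal sketch of one round of this log-barrier update, as we read it: the played arm is penalized by the importance-weighted cost estimate, and a normalizer ν is found by a short line search so the weights remain a probability distribution. This is our own illustration; the bracketing of the search and the handling of the estimate are our assumptions, and details may differ from the version in the appendix.

```python
def lb_update(w, s, cost, eta):
    """One round of the log-barrier bandit update: arm s was played, observing
    cost in [0, 1]; c_hat is the classical importance-weighted cost estimate."""
    c_hat = cost / w[s]

    def total(nu):
        # Sum of updated weights as a function of the normalizer nu.
        t = 0.0
        for j, wj in enumerate(w):
            denom = 1.0 + nu * wj + (eta * c_hat if j == s else 0.0)
            t += wj / denom
        return t

    # Bracket nu so every denominator stays positive: total(lo) > 1 >= total(0),
    # and total is decreasing in nu, so we can bisect.
    lo = max(-(1.0 + (eta * c_hat if j == s else 0.0)) / wj
             for j, wj in enumerate(w)) + 1e-12
    hi = 0.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    nu = (lo + hi) / 2.0
    return [wj / (1.0 + nu * wj + (eta * c_hat if j == s else 0.0))
            for j, wj in enumerate(w)]
```

Note that the update touches all d coordinates but needs only the single observed scalar, and the line search converges geometrically, matching the Õ(d) per-round cost claimed above.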
We provide proofs and further discussion of Algorithm 3 in Appendix B.
Algorithm 3: Initialize w¹ to the uniform distribution. On each round t, perform update:

    w^t_{s^{t−1}} = w^{t−1}_{s^{t−1}} / (1 + η ĉ^{t−1}_{s^{t−1}} + ν w^{t−1}_{s^{t−1}})   and   ∀ j ≠ s^{t−1}:  w^t_j = w^{t−1}_j / (1 + ν w^{t−1}_j),    (3)

where ĉ^{t−1} is the importance-weighted cost estimate and ν ≤ 0 is chosen so that w^t is a valid probability distribution.
Lemma 6. Algorithm 3 with η = ε/(1 + ε) has Low Approximate Regret with A(d, T) = O(d log T).

Comparison to Other Algorithms  In contrast to the full information setting where the most common algorithm, Hedge, achieves Low Approximate Regret with competitive parameters, the most common adversarial bandit algorithm, Exp3, does not seem to satisfy Low Approximate Regret. [3] provide a small loss bound for bandits which would be sufficient for Low Approximate Regret, but their algorithm requires prior knowledge of the loss of the best action (or a bound on it), which is not appropriate in our game setting. Similarly, the small loss bound in [20] is not applicable in our setting as the work assumes an oblivious adversary and so does not apply to the games we consider.

5 Dynamic Population Games

In this section we consider the dynamic population repeated game setting introduced in [19]. Detailed discussion and proofs are deferred to Appendix C. Given a game G as described in Section 2, a dynamic population game with stage game G is a repeated game where at each round t game G is played and every player i is replaced by a new player with a turnover probability p. Concretely, when a player turns over, their strategy set and cost function are changed arbitrarily subject to the rules

³With slight abuse of notation, s^t_i denotes the identity vector associated to the strategy player i used at time t.
⁴This is because we only need to evaluate (1) with the game's optimal solution s⋆
to prove efficiency results.

of the game. This models a repeated game setting where players have to adapt to an adversarially changing environment. We denote the cost function of player i at round t as cost^t_i(·). As in Section 3, we assume that the players receive full information feedback. At the end of each round they observe the entire cost vector c^t_i = cost^t_i(·, s^t_{−i}), but are not aware of the costs of other players in the game.

Learning in Dynamic Population Games and the Price of Anarchy   To guarantee small overall cost using the smoothness analysis from Section 2, players need to exhibit low regret against a shifting benchmark s^{⋆t} of socially optimal strategies achieving OPT^t = min_{s^{⋆t}} Σ_i cost^t_i(s^{⋆t}). Even with a small probability p of change, the sequence of optimal solutions can have too many changes for low regret to be achievable. In spite of this apparent difficulty, [19] prove that at least a λρ/(1 − µ − ε) fraction of the optimal welfare is guaranteed if 1. players are using low adaptive regret algorithms (see [14, 18]) and 2. for the underlying optimization problem there exists a relatively stable sequence of solutions which at each step approximates the optimal solution by a factor of ρ. This holds as long as the turnover probability p is upper bounded by a function of ε (and of certain other properties of the game, such as the stability of the close-to-optimal solution).

We consider dynamic population games where each player uses a learning algorithm satisfying Low Approximate Regret for shifting experts (2).
This shifting version of Low Approximate Regret implies a dynamic game analog of our main efficiency result, Proposition 1.

Algorithms with Low Approximate Regret for Shifting Experts   A simple variant of Hedge that we term Noisy Hedge, which mixes the Hedge update at each round with a small amount of uniform noise, satisfies the Low Approximate Regret property for shifting experts with A(d, T) = O(log(dT)). Moreover, algorithms that satisfy a small-loss version of the adaptive regret property [14] used in [19] satisfy the Strong Low Approximate Regret property.

Proposition 7. Noisy Hedge with learning rate η = ε satisfies the Low Approximate Regret property for shifting experts with A(d, T) = 2 log(dT).

Extending Proposition 1 to the Dynamic Population Game Setting   Let s^{⋆1:T} denote a stable sequence of near-optimal solutions s^{⋆t} with Σ_i cost^t_i(s^{⋆t}) ≤ ρ · OPT^t for all rounds t. As discussed in [19], such stable sequences can come from simple greedy algorithms (where each change in the input of one player affects the output of few other players) or via differentially private algorithms (where each change in the input of one player affects the output of all other players with small probability); in the latter case the sequence is randomized. For a deterministic sequence s^{⋆1:T}_i of player i's actions, we let the random variable K_i denote the number of changes in the sequence. For a randomized sequence s^{⋆1:T}_i, we let K_i be the sum of total variation distances between subsequent pairs s^{⋆t−1}_i and s^{⋆t}_i.
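To make the Noisy Hedge variant described above concrete, one round of it might look like the following sketch. The function name and the particular choice of the uniform-mixing weight `delta` are our assumptions for illustration; the parameter settings under which Proposition 7's guarantee holds are not reproduced here.

```python
import numpy as np

def noisy_hedge_update(w, costs, eta, delta):
    """One round of Noisy Hedge: a standard Hedge (exponential weights)
    step followed by mixing in a small amount of the uniform distribution.

    w     : current distribution over d experts
    costs : cost vector observed this round (full information feedback)
    eta   : learning rate
    delta : uniform-mixing weight (kept small; its precise value in the
            shifting-experts analysis is an assumption here)
    """
    d = len(w)
    w = w * np.exp(-eta * costs)          # multiplicative Hedge step
    w = w / w.sum()                       # renormalize to a distribution
    return (1.0 - delta) * w + delta / d  # mix with uniform noise
```

The uniform mixing floors every weight at delta/d, so an expert whose weight collapsed in an earlier phase can recover quickly after the comparator shifts; this is the role the adaptive-regret machinery plays in [19], obtained here with only O(d) state.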
The stability of a sequence of solutions is determined by E[Σ_i K_i].

Proposition 8 (PoA with Dynamic Population). If all players use Low Approximate Regret algorithms satisfying (2) in a dynamic population game, where the stage game is (λ, µ)-smooth, and K_i is as defined above, then

$$\frac{1}{T}\sum_t \mathbb{E}\!\left[C(s^t)\right] \;\le\; \frac{\lambda\cdot\rho}{1-\mu-\epsilon}\cdot\frac{1}{T}\sum_t \mathbb{E}\!\left[\mathrm{OPT}^t\right] \;+\; \frac{n + \mathbb{E}\!\left[\sum_i K_i\right]}{T}\cdot\frac{1}{1-\mu-\epsilon}\cdot\frac{A(d,T)}{\epsilon}. \tag{4}$$

Here the expectation is taken over the random turnover in the population playing the game, as well as over the random choices of the players on the left-hand side.

To claim a price of anarchy bound, we need to ensure that the additive term in (4) is a small fraction of the optimal cost. The challenge is that high turnover probability reduces stability, increasing E[Σ_i K_i]. By using algorithms with smaller A(d, T), we can allow for higher E[Σ_i K_i] and hence higher turnover probability. Combining Noisy Hedge with Proposition 8 strengthens the results in [19] by both weakening the behavioral assumption on the players, allowing them to use simpler learning algorithms, and allowing a higher turnover probability.

Comparison to Previous Results   [19] use the more complex AdaNormalHedge algorithm of [18], which satisfies the adaptive regret property of [14] but has O(dT) space complexity. In contrast, Noisy Hedge requires only O(d) space. Moreover, a broader class of algorithms satisfies the Low Approximate Regret property, which makes the efficiency guarantees more prescriptive, since this property serves as a behavioral assumption. Finally, the guarantees we provide improve on the turnover probability that can be accommodated, as discussed in Appendix C.1.

Acknowledgements   We thank Vasilis Syrgkanis for sharing his simulation software and the NIPS reviewers for pointing out the GREEN algorithm [2].

References

[1] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin.
Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), 2008.

[2] Chamy Allenberg, Peter Auer, László Györfi, and György Ottucsák. Hannan Consistency in On-Line Learning in Case of Unbounded Losses Under Partial Monitoring, pages 229–243. Springer Berlin Heidelberg, 2006.

[3] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. The Journal of Machine Learning Research, 11:2785–2836, 2010.

[4] Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.

[5] Peter L. Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 335–342, 2008.

[6] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[7] Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.

[8] Giorgos Christodoulou and Elias Koutsoupias. The price of anarchy of finite congestion games. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 67–73, 2005.

[9] Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 92:327–348, 2015.

[10] Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems, pages 3357–3365, 2015.

[11] Yoav Freund and Robert E. Schapire.
A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.

[12] Elad Hazan. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, 2016.

[13] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165–188, 2010.

[14] Elad Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 393–400, 2009.

[15] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. SIAM Journal on Computing, 39:1088–1106, 2009.

[16] Wouter M. Koolen and Tim van Erven. Second-order quantile methods for experts and combinatorial games. In Proceedings of The 28th Conference on Learning Theory (COLT), pages 1155–1175, 2015.

[17] Elias Koutsoupias and Christos Papadimitriou. Worst-case equilibria. Computer Science Review, 3(2):65–69, 2009.

[18] Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Proceedings of The 28th Conference on Learning Theory (COLT), pages 1286–1304, 2015.

[19] Thodoris Lykouris, Vasilis Syrgkanis, and Éva Tardos. Learning and efficiency in games with dynamic population. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 120–129. SIAM, 2016.

[20] Gergely Neu. First-order regret bounds for combinatorial semi-bandits. In Proceedings of the 27th Annual Conference on Learning Theory (COLT), pages 1360–1375, 2015.

[21] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory (COLT), pages 993–1019, 2013.

[22] Alexander Rakhlin and Karthik Sridharan.
Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems (NIPS), pages 3066–3074, 2013.

[23] Tim Roughgarden. Intrinsic robustness of the price of anarchy. Journal of the ACM, 2015.

[24] Tim Roughgarden, Vasilis Syrgkanis, and Éva Tardos. The price of anarchy in auctions. Available at https://arxiv.org/abs/1607.07684, 2016.

[25] Tim Roughgarden and Éva Tardos. How bad is selfish routing? Journal of the ACM, 49:236–259, 2002.

[26] Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1593–1601, 2014.

[27] Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences. PhD thesis, Université Paris-Sud, 2005.

[28] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems (NIPS), pages 2989–2997, 2015.

[29] Vasilis Syrgkanis and Éva Tardos. Composable and efficient mechanisms. In ACM Symposium on Theory of Computing (STOC), pages 211–220, 2013.

[30] Rani Yaroshinsky, Ran El-Yaniv, and Steven S. Seiden. How to better use expert advice. Machine Learning, 55(3):271–309, 2004.