{"title": "Strategizing against No-regret Learners", "book": "Advances in Neural Information Processing Systems", "page_first": 1579, "page_last": 1587, "abstract": "How should a player who repeatedly plays a game against a no-regret learner strategize to maximize his utility? We study this question and show that under some mild assumptions, the player can always guarantee himself a utility of at least what he would get in a Stackelberg equilibrium. When the no-regret learner has only two actions, we show that the player cannot get any higher utility than the Stackelberg equilibrium utility. But when the no-regret learner has more than two actions and plays a mean-based no-regret strategy, we show that the player can get strictly higher than the Stackelberg equilibrium utility. We construct the optimal game-play for the player against a mean-based no-regret learner who has three actions. When the no-regret learner's strategy also guarantees him a no-swap regret, we show that the player cannot get anything higher than a Stackelberg equilibrium utility.", "full_text": "Strategizing against No-regret Learners\n\nYuan Deng\n\nDuke University\n\nericdy@cs.duke.edu\n\nJon Schneider\nGoogle Research\n\njschnei@google.com\n\nAbstract\n\nBalasubramanian Sivan\n\nGoogle Research\n\nbalusivan@google.com\n\nHow should a player who repeatedly plays a game against a no-regret learner\nstrategize to maximize his utility? We study this question and show that under\nsome mild assumptions, the player can always guarantee himself a utility of at least\nwhat he would get in a Stackelberg equilibrium of the game. When the no-regret\nlearner has only two actions, we show that the player cannot get any higher utility\nthan the Stackelberg equilibrium utility. But when the no-regret learner has more\nthan two actions and plays a mean-based no-regret strategy, we show that the\nplayer can get strictly higher than the Stackelberg equilibrium utility. 
We provide a characterization of the optimal game-play for the player against a mean-based no-regret learner as a solution to a control problem. When the no-regret learner's strategy also guarantees him a no-swap regret, we show that the player cannot get anything higher than a Stackelberg equilibrium utility.

1 Introduction

Consider a two-player bimatrix game with a finite number of actions for each player, repeated over T rounds. When playing a repeated game, a widely adopted strategy is to employ a no-regret learning algorithm: a strategy guaranteeing the player that, in hindsight, no single action played throughout the game would have performed significantly better. Knowing that one of the players (the learner) is playing a no-regret learning strategy, what is the optimal gameplay for the other player (the optimizer)? This question is the focus of our work.

If this were a single-shot strategic game where learning is not relevant, a (pure or mixed strategy) Nash equilibrium is a reasonable prediction of the game's outcome. In the T-round game with learning, can the optimizer guarantee himself a per-round utility of at least what he could get in a single-shot game? Is it possible to get significantly more utility than this? Does this utility depend on the specific choice of learning algorithm of the learner? What gameplay should the optimizer adopt to achieve maximal utility? None of these questions are straightforward, and indeed none of these have unconditional answers.

Our results. Central to our results is the idea of the Stackelberg equilibrium of the underlying game. The Stackelberg variant of our game is a single-shot two-stage game where the optimizer is the first player and can publicly commit to a mixed strategy; the learner then best responds to this strategy. The Stackelberg equilibrium is the resulting equilibrium of this game when both players play optimally.
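As a toy illustration of this commitment structure, the Stackelberg value of a small bimatrix game can be approximated by brute-force search over the optimizer's commitments. The 2x2 payoffs below are hypothetical, and ties in the learner's best response are broken by index rather than in the optimizer's favor, so this is a sketch rather than a faithful equilibrium computation:

```python
# Sketch: approximate the Stackelberg value of a small (hypothetical) bimatrix
# game by discretizing the optimizer's mixed-strategy commitment.

u_opt   = [[1.0, -1.0], [0.0, 0.5]]   # optimizer's payoffs u_O(a_i, b_j)
u_learn = [[0.5,  0.0], [0.0, 1.0]]   # learner's payoffs u_L(a_i, b_j)

def best_response(alpha):
    """Learner's best response to the committed mixed strategy alpha."""
    payoffs = [sum(alpha[i] * u_learn[i][j] for i in range(2)) for j in range(2)]
    return max(range(2), key=payoffs.__getitem__)

def stackelberg_value(grid=1000):
    """Max over commitments alpha of u_O(alpha, b), with b a best response."""
    best = float("-inf")
    for k in range(grid + 1):
        alpha = (k / grid, 1 - k / grid)
        j = best_response(alpha)
        best = max(best, sum(alpha[i] * u_opt[i][j] for i in range(2)))
    return best

print(stackelberg_value())  # for these payoffs, committing to (1, 0) yields 1.0
```

For larger games the same computation is usually phrased as one linear program per learner action, but the grid search above suffices to illustrate the commitment advantage.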
Note that the optimizer's utility in the Stackelberg equilibrium is always weakly larger than his utility in any (pure or mixed strategy) Nash equilibrium, and is often strictly larger. Let V be the utility of the optimizer in the Stackelberg equilibrium. With some mild assumptions on the game, we show that the optimizer can always guarantee himself a utility of at least (V − ε)T − o(T) in T rounds for any ε > 0, irrespective of the learning algorithm used by the learner, as long as it has the no-regret guarantee (Theorem 4). This means that if one of the players is a learner, the other player can already profit over the Nash equilibrium regardless of the specifics of the learning algorithm employed or the structure of the game. Further, if any one of the following conditions is true:

1. the game is a constant-sum game,
2. the learner's no-regret algorithm has the stronger guarantee of no-swap regret (see Section 2),
3. the learner has only two possible actions in the game,

the optimizer cannot get a utility higher than V T + o(T) (see Theorem 5, Theorem 6, Theorem 7). If the learner employs a learning algorithm from a natural class of algorithms called mean-based learning algorithms [Braverman et al., 2018] (see Section 2), which includes popular no-regret algorithms like the Multiplicative Weights algorithm, the Follow-the-Perturbed-Leader algorithm, and the EXP3 algorithm, we show that there exist games where the optimizer can guarantee himself a utility V′T − o(T) for some V′ > V (see Theorem 8).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
We note the contrast between the cases of 2 and 3 actions for the learner: in the 2-actions case, even if the learner plays a mean-based strategy, the optimizer cannot get anything more than V T + o(T) (Theorem 7), whereas with 3 actions there are games where he is able to guarantee a linearly higher utility.

Given this possibility of exceeding the Stackelberg utility, our final result is on the nature and structure of the utility-optimal gameplay for the optimizer against a learner that employs a mean-based strategy. First, we give a crisp characterization of the optimizer's asymptotically optimal algorithm as the solution to a control problem (see Section 4.2) in N dimensions, where N is the number of actions for the learner. This characterization is predicated on the fact that just knowing the cumulative historical utilities of each of the learner's actions is essentially enough information to accurately predict the learner's next action in the case of a mean-based learner. These N cumulative utilities thus form an N-dimensional "state" for the learner, which the optimizer can manipulate via their choice of action. We then proceed to make multiple observations that simplify the solution space for this control problem. We leave computing or characterizing the optimal solution to this control problem as a very interesting open question, and we further provide one conjecture of a potential characterization.

Comparison to prior work. The very recent work of Braverman et al. [2018] is the closest to ours. They study the specific 2-player game of an auction between a single seller and a single buyer. The main difference from Braverman et al. [2018] is that they consider a Bayesian setting where the buyer's type is drawn from a distribution, whereas there is no Bayesian element in our setting. Beyond that, the seller's choice of the auction represents his action, and the buyer's bid represents her action.
They show that regardless of the specific algorithm used by the buyer, as long as the buyer plays a no-regret learning algorithm, the seller can always earn at least the optimal revenue of a single-shot auction. Our Theorem 4 is a direct generalization of this result to arbitrary games without any structure. Further, Braverman et al. [2018] show that there exist no-regret strategies for the buyer that guarantee that the seller cannot get anything better than the single-shot optimal revenue. Our Theorems 5, 6 and 7 are both generalizations and refinements of this result, as they pinpoint both the exact learner strategies and the kinds of games that prevent the optimizer from going beyond the Stackelberg utility. Braverman et al. [2018] also show that when the buyer plays a mean-based strategy, the seller can design an auction to guarantee him a revenue beyond the per-round auction revenue. Our control problem can be seen as a rough parallel and generalization of this result.

Other related work. The first notion of regret (without the swap qualification) we use in the paper is also referred to as external regret (see Hannan [1957], Foster and Vohra [1993], Littlestone and Warmuth [1994], Freund and Schapire [1997], Freund and Schapire [1999], Cesa-Bianchi et al. [1997]). The other notion of regret we use is swap regret. There is a slightly weaker notion of regret called internal regret, defined earlier in Foster and Vohra [1998], which allows all occurrences of a given action x to be replaced by another action y. Many no-internal-regret algorithms have been designed (see for example Hart and Mas-Colell [2000], Foster and Vohra [1997, 1998, 1999], Cesa-Bianchi and Lugosi [2003]). The stronger notion of swap regret was introduced in Blum and Mansour [2005], and it allows one to simultaneously swap several pairs of actions.
Blum and Mansour show how to efficiently convert a no-regret algorithm to a no-swap-regret algorithm. One of the reasons behind the importance of internal and swap regret is their close connection to the central notion of correlated equilibrium, introduced by Aumann [1974]. In a general n-player game, a distribution over action profiles of all the players is a correlated equilibrium if every player has zero internal regret. When all players use algorithms with no-internal-regret guarantees, the time-averaged strategies of the players converge to a correlated equilibrium (see Hart and Mas-Colell [2000]). When all players simply use algorithms with no-external-regret guarantees, the time-averaged strategies of the players converge to the weaker notion of coarse correlated equilibrium. When the game is a zero-sum game, the time-averaged strategies of players employing no-external-regret dynamics converge to the Nash equilibrium of the game.

On the topic of optimizing against a no-regret learner, Agrawal et al. [2018] study a setting similar to Braverman et al. [2018] but also consider other types of buyer behavior apart from learning, and show how to robustly optimize against various buyer strategies in an auction.

2 Model and Preliminaries

2.1 Games and equilibria

Throughout this paper, we restrict our attention to simultaneous two-player bimatrix games G. We refer to the first player as the optimizer and the second player as the learner. We denote the set of actions available to the optimizer as A = {a1, a2, . . . , aM} and the set of actions available to the learner as B = {b1, b2, . . . , bN}. If the optimizer chooses action ai and the learner chooses action bj, then the optimizer receives utility uO(ai, bj) and the learner receives utility uL(ai, bj). We normalize the utilities such that |uO(ai, bj)| ≤ 1 and |uL(ai, bj)| ≤ 1.
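The setup above is easy to make concrete. Below is a minimal sketch with hypothetical 2x2 payoff matrices, including the normalization check and the expected utility of a mixed-strategy profile (the bilinear form defined next):

```python
# Sketch of the model: an M x N bimatrix game with payoffs normalized to
# [-1, 1], and expected utilities under mixed strategies (hypothetical numbers).

u_O = [[0.5, -1.0], [1.0, 0.0]]   # optimizer's payoffs u_O(a_i, b_j)
u_L = [[-0.5, 1.0], [0.0, -1.0]]  # learner's payoffs u_L(a_i, b_j)

assert all(abs(x) <= 1 for row in u_O + u_L for x in row)  # normalization

def expected_utility(u, alpha, beta):
    """E[u(a, b)] when a ~ alpha and b ~ beta (the bilinear form)."""
    return sum(alpha[i] * beta[j] * u[i][j]
               for i in range(len(alpha)) for j in range(len(beta)))

alpha, beta = (0.25, 0.75), (0.5, 0.5)
print(expected_utility(u_O, alpha, beta))
```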
We write ∆(A) and ∆(B) to denote the sets of mixed strategies for the optimizer and learner respectively. When the optimizer plays α ∈ ∆(A) and the learner plays β ∈ ∆(B), the optimizer's utility is denoted by

uO(α, β) = Σ_{i=1}^{M} Σ_{j=1}^{N} α_i β_j · uO(ai, bj),

and similarly for the learner's utility.

We say that a strategy b ∈ B is a best-response to a strategy α ∈ ∆(A) if b ∈ argmax_{b′∈B} uL(α, b′). We are now ready to define Stackelberg equilibrium [Von Stackelberg, 2010].

Definition 1. The Stackelberg equilibrium of a game is a pair of strategies α ∈ ∆(A) and b ∈ B that maximizes uO(α, b) under the constraint that b is a best-response to α. We call the value uO(α, b) the Stackelberg value of the game.

A game is zero-sum if uO(ai, bj) + uL(ai, bj) = 0 for all i ∈ [M] and j ∈ [N]; likewise, a game is constant-sum if uO(ai, bj) + uL(ai, bj) = C for some fixed constant C for all i ∈ [M] and j ∈ [N]. Note that for zero-sum or constant-sum games, the Stackelberg equilibrium coincides with the standard notion of Nash equilibrium due to the celebrated minimax theorem [von Neumann, 1928]. Moreover, throughout this paper, we assume that the learner does not have weakly dominated strategies: a strategy b ∈ B is weakly dominated if there exists β ∈ ∆(B \ {b}) such that for all a ∈ A, uL(a, β) ≥ uL(a, b).

We are interested in the setting where the optimizer and the learner repeatedly play the game G for T rounds. We will denote the optimizer's action at time t as at; likewise, we will denote the learner's action at time t as bt. Both the optimizer's and learner's utilities are additive over rounds with no discounting.

The optimizer's strategy can be adaptive (i.e.
at can depend on the previous values of bt) or non-adaptive (in which case it can be expressed as a sequence of mixed strategies (α1, α2, . . . , αT)). Unless otherwise specified, all positive results (results guaranteeing the optimizer some utility) apply for non-adaptive optimizers, and all negative results apply even to adaptive optimizers. As the name suggests, the learner's (adaptive) strategy will be specified by some variant of a low-regret learning algorithm, as described in the next section.

2.2 No-regret learning and mean-based learning

In the classic multi-armed bandit problem with T rounds, the learner selects one of K options (a.k.a. arms) on round t and receives a reward ri,t ∈ [0, 1] if he selects option i. The rewards can be chosen adversarially, and the learner's objective is to maximize his total reward.

Let it be the arm pulled by the learner at round t. The regret of a (possibly randomized) learning algorithm Alg is defined as the difference between the performance of the best arm and that of Alg:

Reg(Alg) = max_i Σ_{t=1}^{T} (r_{i,t} − r_{it,t}).

An algorithm Alg for the multi-armed bandit problem is no-regret if the expected regret is sublinear in T, i.e., E[Reg(Alg)] = o(T). In addition to the bandits setting, in which the learner only learns the reward of the arm he pulls, our results also apply to the experts setting, in which the learner learns the rewards of all arms in every round. Simple no-regret strategies exist in both the bandits and the experts settings.

Among no-regret algorithms, we are interested in two special classes of algorithms. The first is the class of mean-based strategies:

Definition 2 (Mean-based Algorithm). Let σi,t = Σ_{s=1}^{t} ri,s be the cumulative reward for pulling arm i for the first t rounds.
An algorithm is γ-mean-based if whenever σi,t < σj,t − γT, the probability that the algorithm pulls arm i on round t is at most γ. An algorithm is mean-based if it is γ-mean-based for some γ = o(1).

Intuitively, mean-based strategies are strategies that play the arm that historically performs the best. Braverman et al. [2018] show that many no-regret algorithms are mean-based, including commonly used variants of EXP3 (for the bandits setting), the Multiplicative Weights algorithm (for the experts setting) and the Follow-the-Perturbed-Leader algorithm (for the experts setting).

The second class is the class of no-swap-regret algorithms:

Definition 3 (No-Swap-Regret Algorithm). The swap regret Regswap(Alg) of an algorithm Alg is defined as

Regswap(Alg) = max_{π:[K]→[K]} Reg(Alg, π) = max_{π:[K]→[K]} Σ_{t=1}^{T} (r_{π(it),t} − r_{it,t}),

where the maximum is over all functions π mapping options to options. An algorithm is no-swap-regret if the expected swap regret is sublinear in T, i.e. E[Regswap(Alg)] = o(T).

Intuitively, no-swap-regret strategies strengthen the no-regret criterion in the following way: no-regret guarantees the learning algorithm performs as well as the best possible arm overall, but no-swap-regret guarantees the learning algorithm performs as well as the best possible arm over each subset of rounds where the same action is played. Given a no-regret algorithm, a no-swap-regret algorithm can be constructed via a clever reduction (see Blum and Mansour [2005]).

3 Playing against no-regret learners

3.1 Achieving Stackelberg equilibrium utility

To begin with, we show that the optimizer can achieve an average utility per round arbitrarily close to the Stackelberg value against a no-regret learner.

Theorem 4. Let V be the Stackelberg value of the game G.
If the learner is playing a no-regret learning algorithm, then for any ε > 0, the optimizer can guarantee at least (V − ε)T − o(T) utility.

Proof. Let (α, b) be the Stackelberg equilibrium of the game G. Since (α, b) forms a Stackelberg equilibrium, b ∈ argmax_{b′} uL(α, b′). Moreover, by the assumption that the learner does not have a weakly dominated strategy, there does not exist β ∈ ∆(B \ {b}) such that for all a ∈ A, uL(a, β) ≥ uL(a, b). By Farkas' lemma [Farkas, 1902], there must exist α′ ∈ ∆(A) and κ > 0 such that for all b′ ∈ B \ {b}, uL(α′, b) ≥ uL(α′, b′) + κ.

Therefore, for any δ ∈ (0, 1), the optimizer can play the strategy α∗ = (1 − δ)α + δα′, so that b is the unique best response to α∗ and playing any strategy b′ ≠ b induces a utility loss of at least δκ for the learner. As a result, since the learner is playing a no-regret learning algorithm, in expectation there are at most o(T) rounds in which the learner plays some b′ ≠ b. It follows that the optimizer's utility is at least V T − δ(V − uO(α′, b))T − o(T). We conclude the proof by setting ε = δ(V − uO(α′, b)).

Next, we show that in the special class of constant-sum games, the Stackelberg value is the best that the optimizer can hope for when playing against a no-regret learner.

Theorem 5. Let G be a constant-sum game, and let V be the Stackelberg value of this game. If the learner is playing a no-regret algorithm, then the optimizer receives no more than V T + o(T) utility.

Proof. Let a⃗ = (a1, · · · , aT) be the sequence of the optimizer's actions.
Moreover, let α∗ ∈ ∆(A) be the mixed strategy that plays ai ∈ A with probability α∗_i = |{t | at = ai}|/T.

Since the learner is playing a no-regret learning algorithm, the learner's cumulative utility is at least max_{b′∈B} uL(α∗, b′)·T − o(T) = C·T − (min_{b′∈B} uO(α∗, b′)·T + o(T)), where C is the constant sum. This implies that the optimizer's utility is at most

max_{α∗∈∆(A)} min_{b′∈B} uO(α∗, b′)·T + o(T) = V T + o(T),

where the equality follows because the Stackelberg value equals the minimax value by the minimax theorem for constant-sum games.

3.2 No-swap-regret learning

In this section, we show that if the learner is playing a no-swap-regret algorithm, the optimizer can only achieve their Stackelberg utility per round.

Theorem 6. Let V be the Stackelberg value of the game G. If the learner is playing a no-swap-regret algorithm, then the optimizer will receive no more than V T + o(T) utility.

Proof. Let a⃗ = (a1, · · · , aT) be the sequence of the optimizer's actions and let b⃗ = (b1, · · · , bT) be the realization of the sequence of the learner's actions. Moreover, let Pr[b⃗] be the probability that the learner (who is playing some no-swap-regret learning algorithm) plays b⃗ given that the adversary plays a⃗.
Then, the marginal probability that the learner plays bj ∈ B at round t is

Pr[bt = bj] = Σ_{b⃗ : bt = bj} Pr[b⃗].

Let α^{bj} ∈ ∆(A) be the mixed strategy that plays ai ∈ A with probability

α^{bj}_i = Σ_{t : at = ai} Pr[bt = bj] / Σ_t Pr[bt = bj].

Let B̄ = {bj ∈ B : bj ∉ argmax_{b′} uL(α^{bj}, b′)} and consider a mapping π such that π(bj) ∈ argmax_{b′} uL(α^{bj}, b′). Then, the swap regret under π is

Σ_{bj ∈ B} (uL(α^{bj}, π(bj)) − uL(α^{bj}, bj)) · Σ_t Pr[bt = bj]
= Σ_{bj ∈ B̄} (uL(α^{bj}, π(bj)) − uL(α^{bj}, bj)) · Σ_t Pr[bt = bj]
≥ δ · Σ_{bj ∈ B̄} Σ_t Pr[bt = bj],

where δ = min_{bj ∈ B̄} (uL(α^{bj}, π(bj)) − uL(α^{bj}, bj)) > 0. Therefore, since the learner is playing a no-swap-regret algorithm, we have Σ_{bj ∈ B̄} (Σ_t Pr[bt = bj]) = o(T).

Moreover, for bj ∈ B \ B̄, the action bj is a best response to α^{bj}, so the optimizer's utility when the learner plays bj is at most

uO(α^{bj}, bj) · Σ_t Pr[bt = bj] ≤ V · Σ_t Pr[bt = bj].

Thus, the optimizer's utility is at most

Σ_{bj ∈ B} uO(α^{bj}, bj) · Σ_t Pr[bt = bj]
= Σ_{bj ∈ B \ B̄} uO(α^{bj}, bj) · Σ_t Pr[bt = bj] + Σ_{bj ∈ B̄} uO(α^{bj}, bj) · Σ_t Pr[bt = bj]
≤ V · Σ_{bj ∈ B \ B̄} Σ_t Pr[bt = bj] + 1 · Σ_{bj ∈ B̄} Σ_t Pr[bt = bj]
≤ V T + o(T).

Theorem 7. Let G be a game where the learner has N = 2 actions, and let V be the Stackelberg value of this game. If the learner is playing a no-regret algorithm, then the optimizer receives no more than V T + o(T) utility.

Proof. By Theorem 6, it suffices to show that when there are two actions for the learner, a no-regret learning algorithm is in fact a no-swap-regret learning algorithm.

When there are only two actions, there are three possible mappings from B → B other than the identity mapping. Let π1 be the mapping such that π1(b1) = b1 and π1(b2) = b1, π2 be the mapping such that π2(b1) = b2 and π2(b2) = b2, and π3 be the mapping such that π3(b1) = b2 and π3(b2) = b1. Since the learner is playing a no-regret learning algorithm, we have E[Reg(A, π1)] = o(T) and E[Reg(A, π2)] = o(T).
Moreover, notice that

E[Reg(A, π3)] = E[Reg(A, π1)] + E[Reg(A, π2)] = o(T),

which concludes the proof. (The first equality holds because rounds on which the learner plays b1 contribute only to Reg(A, π2), and rounds on which he plays b2 contribute only to Reg(A, π1).)

4 Playing against mean-based learners

From the results of the previous section, it is natural to conjecture that no optimizer can achieve more than the Stackelberg value per round when playing against a no-regret algorithm. After all, this is true for the subclass of no-swap-regret algorithms (Theorem 6), and it is true for simple games: constant-sum games (Theorem 5) and games in which the learner has only two actions (Theorem 7).

In this section we show that this is not the case. Specifically, we show that there exist games G where an optimizer can win strictly more than the Stackelberg value every round when playing against a mean-based learner. We emphasize that the same strategy for the optimizer works against any mean-based learning algorithm the learner uses.

We then proceed to characterize the optimal strategy for a non-adaptive optimizer playing against a mean-based learner as the solution to an optimal control problem in N dimensions (where N is the number of actions of the learner), and make several preliminary observations about the structure an optimal solution to this control problem must possess. Understanding how to efficiently solve this control problem (or whether the optimal solution is even computable) is an intriguing open question.

4.1 Beating the Stackelberg value

We begin by showing that it is possible for the optimizer to get significantly (linear in T) more utility when playing against a mean-based learner.

Theorem 8. There exists a game G with Stackelberg value V where the optimizer can receive utility at least V′T − o(T) against a mean-based learner, for some V′ > V.

Proof. Assume that the learner is using a γ-mean-based algorithm.
Consider the bimatrix game shown in Table 1, in which the optimizer is the row player. (These utilities are bounded in [−2, 2] instead of [−1, 1] for convenience; we can divide through by 2 to get a similar example where utility is bounded in [−1, 1].) We first argue that the Stackelberg value of this game is 0. Notice that if the optimizer plays Bottom with probability more than 0.5, then the learner's best response is to play Mid, resulting in a −2 utility for the optimizer. However, if the optimizer plays Bottom with probability at most 0.5, the expected utility for the optimizer from each column is at most 0. Therefore, in the Stackelberg equilibrium, the optimizer will play Top and Bottom with probability 0.5 each, and the learner will best respond by purely playing Right.

          Left       Mid        Right
Top       (0, √γ)    (−2, −1)   (−2, 0)
Bottom    (0, −1)    (−2, 1)    (2, 0)

Table 1: Example game for beating the Stackelberg value.

However, the optimizer can obtain utility T − o(T) by playing Top for the first T/2 rounds and then playing Bottom for the remaining T/2 rounds. Given the optimizer's strategy, during the first T/2 rounds the cumulative utilities of Left, Mid and Right after round t are t·√γ, −t and 0 respectively, so the learner will play Left with probability at least (1 − 2γ) after the first √γ·T rounds. During the remaining T/2 rounds, the learner will switch to playing Right with probability at least (1 − 2γ) between the ((1 + √γ)/2 + γ)T-th round and the (1 − γ)T-th round: from the former round on, the cumulative utility of Left is at most (√γ/2)T − (√γ/2 + γ)T = −γT; up to the latter round, the cumulative utility of Mid is at most −γT; and the cumulative utility of Right remains 0.

Therefore, the cumulative utility of the optimizer over the first T/2 rounds is at least

(1 − 2γ)·(1/2 − √γ)T · 0 − (√γ + γ)T · 2 = −o(T),

and over the remaining T/2 rounds it is at least

(1 − 2γ)·(1/2 − √γ/2 − 2γ)T · 2 − (√γ/2 + 3γ)T · 2 = T − o(T).

Thus, the optimizer obtains a total utility of T − o(T), which is greater than V T = 0 for the Stackelberg value V = 0 of this game.

4.2 The geometry of mean-based learning

We have just seen that it is possible for the optimizer to get more than the Stackelberg value when playing against a mean-based learner. This raises an obvious next question: how much utility can an optimizer obtain when playing against a mean-based learner? What is the largest α such that an optimizer can always obtain utility αT − o(T) against a mean-based learner?

In this section, we will see how to reduce the problem of constructing the optimal gameplay of a non-adaptive optimizer to solving a control problem in N dimensions.
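Before setting up that control problem, the dynamic behind Theorem 8 can be reproduced in a few lines. The sketch below uses deterministic follow-the-leader as a crude stand-in for a mean-based learner, with γ a small fixed constant rather than o(1); it is an illustration, not the paper's analysis:

```python
# Simulation sketch of the Table 1 game: Top for T/2 rounds, then Bottom,
# against a learner that always plays the historically best action.
import math

gamma = 1e-4
# payoffs[row][col] = (optimizer utility, learner utility)
payoffs = {
    ("Top", "Left"): (0, math.sqrt(gamma)), ("Top", "Mid"): (-2, -1),
    ("Top", "Right"): (-2, 0),
    ("Bottom", "Left"): (0, -1), ("Bottom", "Mid"): (-2, 1),
    ("Bottom", "Right"): (2, 0),
}

T = 100_000
cum = {"Left": 0.0, "Mid": 0.0, "Right": 0.0}  # learner's cumulative rewards
opt_total = 0.0
for t in range(T):
    a = "Top" if t < T // 2 else "Bottom"
    b = max(cum, key=cum.get)                  # follow the leader
    opt_total += payoffs[(a, b)][0]
    for col in cum:                            # full-information updates
        cum[col] += payoffs[(a, col)][1]

print(opt_total / T)  # close to 1, far above the Stackelberg value of 0
```

The learner is herded into Left and then into Right, and the dictionary `cum` of cumulative rewards is exactly the learner "state" that the reduction tracks.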
The primary insight is that a mean-based learner's behavior depends only on their historical cumulative utilities for each of their N actions, and therefore we can characterize the essential "state" of the learner by a tuple of N real numbers that represent the cumulative utilities of the different actions. The optimizer can control the state of the learner by playing different actions, and in different regions of the state space the learner plays specific responses.

More formally, our control problem will involve constructing a path in R^N starting at the origin. For each i ∈ [N], let Si equal the subset of (u1, u2, . . . , uN) ∈ R^N where ui = max(u1, u2, . . . , uN) (this will represent the subset of the state space where the learner plays action bi). Note that these sets Si (up to some intersections of measure 0) partition the entire space R^N.

We represent the optimizer's strategy π as a sequence of tuples (α1, t1), (α2, t2), . . . , (αk, tk), with αi ∈ ∆(A) and ti ∈ [0, 1] satisfying Σ_i ti = 1. Here the tuple (αi, ti) represents the optimizer playing mixed strategy αi for a ti fraction of the total rounds. This strategy evolves the learner's state as follows. The learner originally starts at the state P0 = 0. After the ith tuple (αi, ti), the learner's state evolves according to Pi = Pi−1 + ti·(uL(αi, b1), uL(αi, b2), . . . , uL(αi, bN)) (in fact, the state linearly interpolates between Pi−1 and Pi as the optimizer plays this action). For simplicity, we will assume that positive combinations of vectors of the form (uL(αi, b1), uL(αi, b2), . . . , uL(αi, bN)) can generate the entire state space R^N.

To characterize the optimizer's reward, we must know which set Si the learner's state belongs to. For this reason, we will insist that for each 1 ≤ i ≤ k, there exists a ji such that both Pi−1 and Pi belong to the same region Sji.
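The bookkeeping just described (state evolution under a step (αi, ti), and the region the state occupies) can be sketched directly. The learner payoffs below are hypothetical, with N = 3 learner actions and M = 2 optimizer actions, and argmax ties are broken toward the lowest index:

```python
# Sketch of the control-problem state: each step (alpha, t) moves the state
# linearly, and the active region S_i is the argmax coordinate.

u_L = [[1.0, 0.0, -1.0],   # u_L(a_1, b_j) for j = 1, 2, 3 (hypothetical)
       [-1.0, 0.5, 1.0]]   # u_L(a_2, b_j)

def step(state, alpha, t):
    """Advance the state by playing mixed strategy alpha for a t fraction of rounds."""
    drift = [sum(alpha[i] * u_L[i][j] for i in range(len(alpha)))
             for j in range(len(state))]
    return [s + t * d for s, d in zip(state, drift)]

def region(state):
    """Index i such that the state lies in S_i (the learner plays b_i)."""
    return max(range(len(state)), key=state.__getitem__)

pi = [((1.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]  # play a_1 half the time, then a_2
P = [0.0, 0.0, 0.0]
for alpha, t in pi:
    P_next = step(P, alpha, t)
    print(region(P), region(P_next), P_next)
    P = P_next
```

In this example the second step starts in S1 but ends in S2, which is precisely the situation handled next: such a step must be subdivided at the region boundary before the reward can be read off.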
It is possible to convert any strategy π into a strategy of this form by subdividing a step (α, t) that crosses a region boundary into two steps (α, t′) and (α, t′′) with t = t′ + t′′, so that the first step stops exactly at the region boundary. If there is more than one possible choice for ji (i.e. Pi−1 and Pi lie on the same region boundary), then without loss of generality we let the optimizer choose ji, since the optimizer can always modify the initial path slightly so that Pi−1 and Pi both lie in a unique region.

Once we have done this, the optimizer's average utility per round is given by the expression

U(π) = Σ_{i=1}^{k} ti · uO(αi, bji).

Theorem 9. Let U∗ = sup_π U(π), where the supremum is over all valid strategies π in this control game. Then

1. For any ε > 0, there exists a non-adaptive strategy for the optimizer which guarantees expected utility at least (U∗ − ε)T − o(T) when playing against any mean-based learner.

2. For any ε > 0, there exists no non-adaptive strategy for the optimizer which can guarantee expected utility at least (U∗ + ε)T + o(T) when playing against any mean-based learner.

Understanding how to solve this control problem (even inefficiently, in finite time) is an interesting open problem. In the remainder of this section, we make some general observations which will let us cut down the strategy space of the optimizer even further, and we propose a conjecture on the form of the optimal strategy.

The first observation is that when the learner has N actions, our state space is truly N − 1 dimensional, not N dimensional.
This is because in addition to the learner's actions depending only on the cumulative reward for each action, they in fact depend only on the differences between cumulative rewards for different actions (see Definition 2). This means we can represent the state of the learner as a vector (x_1, x_2, . . . , x_{N−1}) ∈ R^{N−1}, where x_i = u_i − u_N. The sets S_i for 1 ≤ i ≤ N − 1 can be written in terms of the x_i as

S_i = {x | x_i = max(x_1, . . . , x_{N−1}, 0)}

and

S_N = {x | 0 = max(x_1, . . . , x_{N−1}, 0)}.

The next observation is that if the optimizer makes several consecutive steps in the same region S_i, we can combine them into a single step. Specifically, assume P_i, P_{i+1}, and P_{i+2} all belong to some region S_j, where (α_i, t_i) sends P_i to P_{i+1} and (α_{i+1}, t_{i+1}) sends P_{i+1} to P_{i+2}. Then replacing these two steps with

((α_i t_i + α_{i+1} t_{i+1}) / (t_i + t_{i+1}), t_i + t_{i+1})

results in a strategy with the exact same reward U(π). Applying this fact whenever possible, we can restrict our attention to strategies where all P_i (with the possible exception of the final state P_k) lie on the boundary of two or more regions S_i.

Finally, we observe that this control problem is scale-invariant; if

π = ((α_1, t_1), (α_2, t_2), . . . , (α_n, t_n))

is a valid policy that obtains utility U, then

λπ = ((α_1, λt_1), (α_2, λt_2), . . . , (α_n, λt_n))

is another valid policy (with the exception that Σ t_i = λ, not 1) which obtains utility λU (this is true since all the regions S_i are cones with apex at the origin). This means we do not have to restrict to policies with Σ t_i = 1; we can choose a policy of any total time, as long as we normalize the utility by Σ t_i. This generalizes the strategy space, but is useful for the following reason. Consider a sequence of steps π which starts at some point P (not necessarily 0) and ends at P. Then if U is the average utility of this cycle, U* ≥ U (in particular, we can consider any policy which goes from 0 to P and then repeats this cycle many times). Likewise, if we have a sequence of steps π which starts at some point P and ends at λP for some λ > 1 and achieves average utility U, then again U* ≥ U (by considering the policy which proceeds 0 → P → λP → λ²P → . . . ; note that it is essential that λ ≥ 1 to prevent this from converging back to 0 in finite time).

These observations motivate the following conjecture.

Conjecture 10. The value U* is achieved by either:

1. The average utility of a policy starting at the origin and consisting of at most N steps (in distinct regions).
2. The average utility of a path of at most N steps (in distinct regions) which starts at some point P and returns to λP for some λ ≥ 1.

We leave it as an interesting open problem to compute the optimal solution to this control problem.
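As a concrete illustration of the quantities in this section, the following sketch (the payoff matrices are illustrative assumptions of ours, not games from the paper) computes the normalized utility U(π) = Σ_i t_i u_O(α_i, b_{j_i}) / Σ_i t_i of a strategy whose segments each stay inside a single region, and checks the scale-invariance property that λπ attains the same normalized utility as π:

```python
import numpy as np

# Illustrative payoff matrices for a 2-action optimizer vs. 3-action learner
# (assumed numbers for demonstration, not taken from the paper).
# A[a, j] = u_O(a, b_j), the optimizer's utility; B[a, j] = u_L(a, b_j).
A = np.array([[0.0, 1.0, 0.2],
              [1.0, 0.0, 0.2]])
B = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])

def utility(pi):
    """Normalized utility U(pi) = sum_i t_i u_O(alpha_i, b_{j_i}) / sum_i t_i.

    Each segment (alpha_i, t_i) must keep the state inside one region S_{j_i};
    ties at region boundaries are resolved in the optimizer's favor."""
    P = np.zeros(B.shape[1])
    total_u = total_t = 0.0
    for alpha, t in pi:
        alpha = np.asarray(alpha)
        Q = P + t * (alpha @ B)          # endpoint of this segment
        j = int(np.argmax(Q))            # learner's response along the segment
        # both endpoints must lie in S_j (up to boundary ties)
        assert P[j] >= P.max() - 1e-9 and Q[j] >= Q.max() - 1e-9
        total_u += t * float(alpha @ A[:, j])
        total_t += t
        P = Q
    return total_u / total_t

pi = [((1.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]
print(utility(pi))                           # normalized utility of pi
print(utility([(a, 2 * t) for a, t in pi]))  # lambda*pi, same normalized value
```

The assertion enforces the validity condition from the text (P_{i−1} and P_i in the same region S_{j_i}); searching over such piecewise-linear paths for the supremum U* is exactly the open control problem above.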