{"title": "Online EXP3 Learning in Adversarial Bandits with Delayed Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 11349, "page_last": 11358, "abstract": "Consider a player that in each of T rounds chooses one of K arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays \\left\\{ d_{t}\\right\\} that are unknown to the player. After picking arm a_{t} at round t, the player receives the cost of playing this arm d_{t} rounds later. In cases where t+d_{t}>T, this feedback is simply missing. We prove that the EXP3 algorithm (that uses the delayed feedback upon its arrival) achieves a regret of O\\left(\\sqrt{\\ln K\\left(KT+\\sum_{t=1}^{T}d_{t}\\right)}\\right). For the case where \\sum_{t=1}^{T}d_{t} and T are unknown, we propose a novel doubling trick for online learning with delays and prove that this adaptive EXP3 achieves a regret of O\\left(\\sqrt{\\ln K\\left(K^{2}T+\\sum_{t=1}^{T}d_{t}\\right)}\\right). We then consider a two player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough such that players no longer enjoy the \u201cno-regret property\u201d, (e.g., where d_{t}=O\\left(t\\log t\\right)) the ergodic average of the strategy profile still converges to the set of Nash equilibria of the game. 
The result is made possible by choosing an adaptive step size \eta_{t} that is not summable but is square summable, and proving a \u201cweighted regret bound\u201d for this general case.", "full_text": "Online EXP3 Learning in Adversarial Bandits with Delayed Feedback

Ilai Bistritz1, Zhengyuan Zhou23, Xi Chen2, Nicholas Bambos1, Jose Blanchet1
1Stanford University
2New York University, Stern School of Business
3IBM Research
{bistritz,bambos,jose.blanchet}@stanford.edu, {zzhou,xchen3}@stern.nyu.edu

Abstract

Consider a player that in each of T rounds chooses one of K arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays \{d_t\} that are unknown to the player. After picking arm a_t at round t, the player receives the cost of playing this arm d_t rounds later. In cases where t + d_t > T, this feedback is simply missing. We prove that the EXP3 algorithm (that uses the delayed feedback upon its arrival) achieves a regret of O\left(\sqrt{\ln K \left(KT + \sum_{t=1}^{T} d_t\right)}\right). For the case where \sum_{t=1}^{T} d_t and T are unknown, we propose a novel doubling trick for online learning with delays and prove that this adaptive EXP3 achieves a regret of O\left(\sqrt{\ln K \left(K^2 T + \sum_{t=1}^{T} d_t\right)}\right). We then consider a two player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough such that players no longer enjoy the \u201cno-regret property\u201d (e.g., where d_t = O(t \log t)), the ergodic average of the strategy profile still converges to the set of Nash equilibria of the game. 
The result is made possible by choosing an adaptive step size \eta_t that is not summable but is square summable, and proving a \u201cweighted regret bound\u201d for this general case.

1 Introduction

Consider an agent that makes T sequential decisions from a set of K options (i.e., arms), where each decision incurs some cost. The cost sequences are chosen by an adversary that knows the agent's strategy. The agent's goal is to minimize this cost over time. In the full information case, the agent gets to know the cost of all arms after choosing a single arm. A more challenging case is the bandit feedback one, where the agent only observes the cost of the chosen arm. In this paper, we consider the bandit feedback case. The question of what the agent learns about the costs (i.e., full information or bandit) naturally influences the best performance the agent can guarantee. Another fundamental question is when the agent gets to know the cost.
An online learning scenario with no delays means that the agent always knows how beneficial all the past actions were when making the current decision. This is rarely the case in practice, where many decisions have to be made before all the feedback from past choices is received. Determining the feedback in practice is not always straightforward and might involve some computations and estimations. Furthermore, the time it takes to receive the feedback varies between different decisions and times. All of these effects are accentuated when an adversary has control over the feedback mechanism. Following this reasoning, online learning with delayed feedback has attracted considerable attention [1-12].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
The concept of adversarial delays (i.e., arbitrary delay sequences) was first introduced in [13], for the full information case and under the assumption that all feedback is received before round T (which we do not make here). The first goal of this paper is to address the more challenging bandit cost scenario.
When there is no delayed feedback, EXP3 [14-16] is the state-of-the-art algorithm for adversarial online learning with bandit feedback. In EXP3, the agent keeps a weight for each arm, and picks an arm at random with a probability that is proportional to the exponents of the weights. When a cost l^{(i)} is incurred for choosing arm i, which was picked with probability p^{(i)}, the quantity \frac{l^{(i)}}{p^{(i)}} is added to the weight of this arm. The idea is that, on average over the randomness of the decisions, the weights are adjusted with the vector of costs (l^{(1)}, ..., l^{(K)}). With no delays, the expected regret of EXP3 is O\left(\sqrt{T K \ln K}\right). Having a sublinear regret, the average regret per round goes to zero as T \to \infty, which is known as the \u201cno-regret property\u201d [17].
Our first main contribution in this paper is to show that with an arbitrary sequence of delays d_t, EXP3 achieves an expected regret of O\left(\sqrt{\ln K \left(KT + \sum_{t \notin M} d_t\right)} + |M|\right), where M is the set of rounds whose feedback is not received before round T. This expression makes clear which delay sequences will maintain the no-regret property and which will lead to linear regret in T.
An omnipotent adversary represents the embodiment of the agent's worst fears when learning to optimize its decisions in an unknown environment. An algorithm with performance guarantees in this worst case scenario is an appealing choice from a designer's point of view. As such, it is more likely that the opponents that the agent will face are online learning agents like itself, which have limited knowledge and power.
These agents have interests of their own, but in the worst case these interests are in direct conflict with those of our agent. Therefore, zero-sum games are the natural framework to analyze the outcome of an interaction against another agent instead of against an all-powerful adversary. Interestingly enough, it turns out that with delayed feedback, the outcome of playing against another agent can be essentially different from playing against an adversary.
It is well known that when two agents use a no-regret learning algorithm against each other in a zero-sum game, the dynamics will result in a Nash equilibrium (NE) [18]. To be precise, the ergodic average strategy converges to the set of NE strategies, and the ergodic average cost to the value of the zero-sum game. The last iterate does not converge in general to a NE, and even moves away from it [19]. However, the emergence of a NE in a game where such an agent finds itself against another agent using a no-regret algorithm provides yet another piece of strong evidence for the importance of the concept of NE. From a more practical point of view, convergence of the ergodic average to a NE makes no-regret algorithms an appealing way to compute a NE when the game matrix is unknown and only simulating the game is possible. In such a simulation of an unknown game, bandit feedback is a more realistic assumption than full information.
With no delays, the only purpose of the step size of the EXP3 algorithm is to minimize the regret. If the horizon of the game T is unknown, one can use the doubling trick and choose the step sizes accordingly. With delayed feedback, a varying step size plays a much more central role. It is then not surprising that convergence of the ergodic average to the set of NE is maintained if the algorithm still has a sublinear regret (asymptotically zero average regret).
When the delays become larger, for example super-linear delays that grow like O(t \log t), this is no longer true and the regret of EXP3 (or any other algorithm) becomes linear in the horizon T.
Our second main contribution in this paper is to show that even with delays that cause a linear regret, the ergodic average may still converge to the set of NE by using a time-varying step size \eta_t. This means that computing a NE using EXP3 is still possible even in scenarios where EXP3 does not enjoy a sublinear regret (i.e., the no-regret property). Since delays are a prominent feature of almost every computational environment, this is an encouraging finding.

2 EXP3 in Adversarial Bandits under Feedback Delays

Consider a player that at each round t has to pick one out of K arms. Denote the arm the player chooses at round t by a_t. The cost at round t from arm i is l_t^{(i)} \in [0, 1], and let l_t = (l_t^{(1)}, ..., l_t^{(K)}) be the cost vector. These costs are arbitrarily chosen by an adversary that knows the player's strategy

Algorithm 1 EXP3 with delays
Initialization: Let \{\eta_t\} be a positive non-increasing sequence, and set \tilde{L}_1^{(i)} = 0 and p_1^{(i)} = \frac{1}{K} for i = 1, ..., K.
For t = 1, ..., T do
1. Choose an arm a_t at random according to the distribution p_t.
2. Obtain a set of delayed costs l_s^{(a_s)} for all s \in S_t, where a_s is the arm played at round s.
3. Update the weights of arm a_s for all s \in S_t, using

\tilde{L}_t^{(a_s)} = \tilde{L}_{t-1}^{(a_s)} + \eta_s \frac{l_s^{(a_s)}}{p_s^{(a_s)}}.    (3)

4. Update the mixed strategy

p_{t+1}^{(i)} = \frac{e^{-\tilde{L}_t^{(i)}}}{\sum_{j=1}^{K} e^{-\tilde{L}_t^{(j)}}}.    (4)

End

in advance. Hence, we can assume that the adversary chooses \{l_t^{(i)}\} for each i in advance, knowing exactly how the player is going to react.
The player gets to know the cost of playing a_t at round t at the end of round t + d_t - 1 (i.e., after a delay of d_t \geq 1 rounds), so the feedback is available at the beginning of round t + d_t. The set of costs (feedback samples) received at round t is denoted S_t, so s \in S_t means that the cost of a_s from round s is received at round t. Since the game lasts for T rounds, all costs for which t + d_t > T are never received. Of course, the value of d_t does not matter as long as t + d_t > T, and these are just samples that the adversary chose to prevent the player from receiving. We name these costs the missing samples, and denote their set by M.
The player wants to have a learning algorithm that uses the past observations to make good decisions over time. Denote the vector of probabilities of the player for choosing arms at round t by p_t \in \Delta_K, where \Delta_K denotes the K-simplex. This is also known as the mixed strategy of the player. The performance of the player's algorithm, or strategy, is measured using the regret. The expected regret is the total expected cost over a horizon of T rounds, compared to the total cost that would result from playing the best fixed mixed strategy in all rounds:
Definition 1. The expected regret is defined as:

E_a\left\{ R(T) \right\} = E_a\left\{ \sum_{t=1}^{T} l_t^{(a_t)} - \min_i \sum_{t=1}^{T} l_t^{(i)} \right\}    (1)

where E_a is the expectation over the random actions a_1, ..., a_T the agent chooses at each round.

At round t, EXP3 (detailed in Algorithm 1) chooses an arm at random according to the distribution p_t that depends on the history of the game. Define the following filtration

F_t = \sigma\left(\{a_s \,|\, s + d_s \leq t\}\right)    (2)

which is generated from all the actions for which the feedback was received up to round t.
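To make this feedback protocol concrete, the following minimal simulation sketches Algorithm 1 with a fixed step size (this is our illustration, not the authors' code; the cost sequence, the constant delay d_t = 5, and names such as exp3_with_delays and arrivals are assumptions made for the example). Each feedback sample is importance-weighted by the probability stored at play time, as in (3), and samples whose arrival round exceeds T are exactly the missing set M and are never applied.

```python
import math
import random

def exp3_with_delays(costs, delays, eta):
    # Sketch of Algorithm 1 with a fixed step size eta.
    # costs: T x K matrix of adversarial costs in [0, 1].
    # delays: d_t >= 1; feedback from round t arrives at round t + d_t.
    T, K = len(costs), len(costs[0])
    L = [0.0] * K                # cumulative weights ~L_t^(i)
    arrivals = {}                # arrival round -> [(s, a_s, p_s^(a_s))]
    total_cost = 0.0
    for t in range(T):
        m = min(L)               # stabilize the exponentials
        w = [math.exp(-(L[i] - m)) for i in range(K)]
        Z = sum(w)
        p = [x / Z for x in w]                   # eq. (4)
        a = random.choices(range(K), weights=p)[0]
        total_cost += costs[t][a]
        # schedule this round's feedback; if t + d_t > T it is in M
        arrivals.setdefault(t + delays[t], []).append((t, a, p[a]))
        for s, a_s, p_s in arrivals.pop(t, []):  # the set S_t
            L[a_s] += eta * costs[s][a_s] / p_s  # eq. (3)
    return total_cost

random.seed(0)
T, K = 2000, 3
costs = [[0.9, 0.2, 0.9]] * T    # arm 1 is always best
delays = [5] * T                 # constant delay d_t = 5
eta = math.sqrt(math.log(K) / (K * T + sum(delays)))
c = exp3_with_delays(costs, delays, eta)
print(round(c, 1), 'vs', 0.2 * T, 'for the best fixed arm')
```

Despite waiting d_t rounds for every sample, the player should still concentrate on the best arm, in line with the bound of Theorem 1 below.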
Note that the mixed strategy of the player p_t is an F_t-measurable random variable, since p_t is a function of all feedback received up to round t.
Our main result of this section establishes the expected regret bound for EXP3 with delays. Note that Algorithm 1 is nothing but the obvious variant of EXP3 for the case of delayed feedback. Therefore, the importance of the following result is in the novel analysis of how delays, which are a part of every practical system, affect a well-known and widely used algorithm such as EXP3. While waiting for the delayed feedback, the agent is making decisions that incur a larger regret than in the usual no-delay case where all the past feedback has been received. The proof of Theorem 1 bounds this addition to the regret. The proof analyzes the novel notion of weighted regret, given in the following lemma. The goal of this more general result is to be both used here and for the proof of Theorem 3 in the next section.
Lemma 1. Let \{\eta_t\} be a non-increasing step size sequence. Let \{l_t^{(i)}\} be a cost sequence such that l_t^{(i)} \in [0, 1] for every t, i. Let \{d_t\} be a delay sequence such that the cost from round t is received at round t + d_t. Define M to be the set of all samples that are not received before round T. Then using EXP3 (Algorithm 1) guarantees, for every i,

E_a\left\{ \sum_{t=1}^{T} \eta_t l_t^{(a_t)} - \sum_{t=1}^{T} \eta_t l_t^{(i)} \right\} \leq \ln K + \frac{K}{2} \sum_{t=1}^{T} \eta_t^2 + 4 \sum_{t \notin M} \eta_t^2 d_t + \sum_{t \in M} \eta_t.    (5)

Proof. Let e_i be the pure strategy that picks arm i with probability 1.
Then for each i,

E_a\left\{ \sum_{t=1}^{T} \eta_t l_t^{(a_t)} \right\} - \sum_{t=1}^{T} \eta_t l_t^{(i)} = E_a\left\{ \sum_{t=1}^{T} E_a\left\{ \eta_t l_t^{(a_t)} \,|\, F_t \right\} \right\} - \sum_{t=1}^{T} \eta_t l_t^{(i)} = E_a\left\{ \sum_{t=1}^{T} \eta_t \langle l_t, p_t \rangle \right\} - \sum_{t=1}^{T} \eta_t l_t^{(i)}
= E_a\left\{ \sum_{t=1}^{T} \eta_t \langle l_t, p_t - e_i \rangle \right\} = E_a\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \langle l_s, p_s - e_i \rangle \right\} + E_a\left\{ \sum_{t \in M} \eta_t \langle l_t, p_t - e_i \rangle \right\}
\leq_{(a)} E_a\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \langle l_s, p_s - e_i \rangle \right\} + \sum_{t \in M} \eta_t    (6)

where the second-to-last equality splits the rounds into those whose feedback arrives by round T (grouped by their arrival round) and the missing rounds in M, and (a) follows from \langle l_t, p_t - e_i \rangle \leq 1, since 0 \leq l_t^{(i)} \leq 1 for every i.
Define S_{t,s} = \{ r \in S_t \,;\, r < s \}. This is the set of feedback samples arriving at round t that the algorithm uses before s. Define s^- as the step a moment before using the feedback from round s, so p_{s^-} is the mixed strategy at this moment. Define s^+ as the step a moment after using the feedback from round s. This step takes place in round t if s \in S_t. We analyze the first term in (6) by splitting it as follows:

E_a\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \langle l_s, p_s - e_i \rangle \right\} = E_a\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \langle l_s, p_{s^-} - e_i \rangle + \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \langle l_s, p_s - p_{s^-} \rangle \right\}    (7)

where the first part is interpreted as the regret with no delays, and the second as the regret penalty the delays incur. From Lemma 3 we have

E_a\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \langle l_s, p_{s^-} - e_i \rangle \right\} \leq \ln K + \frac{K}{2} \sum_{t=1}^{T} \eta_t^2.    (8)

Next we analyze the delay term. Let \tilde{l}_t = \left( 0, ..., \frac{l_t^{(a_t)}}{p_t^{(a_t)}}, ..., 0 \right). First note that for all i we have

p_{q^-}^{(i)} = \frac{ e^{-\tilde{L}_{q^-}^{(i)}} }{ \sum_{j=1}^{K} e^{-\tilde{L}_{q^-}^{(j)}} } \triangleq h_i\left( \tilde{L}_{q^-} \right)    (9)

and p_{q^+}^{(i)} = h_i\left( \tilde{L}_{q^-} + \eta_q \tilde{l}_q \right), so from Lemma 2, using x = \tilde{L}_{q^-} and \Delta = \eta_q \tilde{l}_q, so that h(x) = p_{q^-}, we obtain

E_a\left\{ \| p_{q^+} - p_{q^-} \|_1 \,|\, F_{q^-} \right\} \leq 2 \eta_q E_a\left\{ \sum_{i=1}^{K} p_{q^-}^{(i)} \tilde{l}_q^{(i)} \,|\, F_{q^-} \right\} =_{(a)} 2 \eta_q \sum_{i=1}^{K} p_{q^-}^{(i)} E_a\left\{ \tilde{l}_q^{(i)} \,|\, F_{q^-} \right\} =_{(b)} 2 \eta_q \sum_{i=1}^{K} p_{q^-}^{(i)} l_q^{(i)} \leq 2 \eta_q \sum_{i=1}^{K} p_{q^-}^{(i)} = 2 \eta_q    (10)

where (a) uses that p_{q^-}^{(i)} is F_{q^-}-measurable (since q < q^-), and (b) uses the fact that \tilde{l}_q^{(i)} is \frac{l_q^{(i)}}{p_q^{(i)}} with probability p_q^{(i)} and zero otherwise. Note that a_q is independent of F_{q^-} since by definition the feedback from a_q was not received until step q^-. Therefore

E_a\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \langle l_s, p_s - p_{s^-} \rangle \right\} = E_a\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \left\langle l_s, \sum_{r=s}^{t-1} \sum_{q \in S_r} \left( p_{q^-} - p_{q^+} \right) + \sum_{q \in S_{t,s}} \left( p_{q^-} - p_{q^+} \right) \right\rangle \right\}
\leq_{(a)} E_a\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \| l_s \|_\infty \left( \left\| \sum_{r=s}^{t-1} \sum_{q \in S_r} \left( p_{q^+} - p_{q^-} \right) \right\|_1 + \left\| \sum_{q \in S_{t,s}} \left( p_{q^+} - p_{q^-} \right) \right\|_1 \right) \right\}
\leq_{(b)} E_a\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \left( \sum_{r=s}^{t-1} \sum_{q \in S_r} \| p_{q^+} - p_{q^-} \|_1 + \sum_{q \in S_{t,s}} \| p_{q^+} - p_{q^-} \|_1 \right) \right\}
\leq_{(c)} 2 E\left\{ \sum_{t=1}^{T} \sum_{s \in S_t} \eta_s \left( \sum_{r=s}^{t-1} \sum_{q \in S_r} \eta_q + \sum_{q \in S_{t,s}} \eta_q \right) \right\} \leq_{(d)} 4 \sum_{t \notin M} \eta_t^2 d_t    (11)

where (a) follows from Hölder's inequality, (b) since |l_t^{(i)}| \leq 1 for every i together with the triangle inequality, (c) from (10), and (d) follows from Lemma 4.
Combining (6), (8) and (11) yields, for all i = 1, ..., K,

E_a\left\{ \sum_{t=1}^{T} \eta_t l_t^{(a_t)} - \sum_{t=1}^{T} \eta_t l_t^{(i)} \right\} \leq \ln K + \frac{K}{2} \sum_{t=1}^{T} \eta_t^2 + 4 \sum_{t \notin M} \eta_t^2 d_t + \sum_{t \in M} \eta_t.    (12)

Theorem 1. Define M to be the set of all samples that are not received before round T. Choose the fixed step size \eta = \sqrt{ \frac{\ln K}{KT + \sum_{t \notin M} d_t} }. Let \{l_t^{(i)}\} be a cost sequence such that l_t^{(i)} \in [0, 1] for every t, i. Let \{d_t\} be a delay sequence such that the cost from round t is received at round t + d_t. Then

E_a\left( R(T) \right) = E\left\{ \sum_{t=1}^{T} l_t^{(a_t)} - \min_i \sum_{t=1}^{T} l_t^{(i)} \right\} \leq O\left( \sqrt{ \ln K \left( KT + \sum_{t \notin M} d_t \right) } + |M| \right).    (13)

Proof of Theorem 1.
To obtain Theorem 1, substitute \eta_t = \eta in (5) of Lemma 1, and divide both sides by \eta:

E_a\left\{ \sum_{t=1}^{T} l_t^{(a_t)} - \min_i \sum_{t=1}^{T} l_t^{(i)} \right\} \leq O\left( \frac{\ln K}{\eta} + \eta \left( KT + \sum_{t \notin M} d_t \right) + |M| \right).    (14)

Then, choosing \eta = \sqrt{ \frac{\ln K}{KT + \sum_{t \notin M} d_t} } yields (13).

It is worthwhile noting that our bound is tighter than O\left( \sqrt{ \ln K \left( KT + \sum_{t=1}^{T} d_t \right) } \right), which does not take M into account, since counting delays that go beyond round T is redundant. For example, if d_t = t^2 then \sum_{t=1}^{T} d_t = O(T^3), while \sum_{t \notin M} d_t only counts the delays of samples that arrive in time. Our subsequent corollary formalizes this intuition.
Corollary 1. Let \eta = \sqrt{ \frac{\ln K}{KT + \sum_{t \notin M} d_t} }. Let \{l_t^{(i)}\} be a cost sequence such that l_t^{(i)} \in [0, 1] for every t, i. Let \{d_t\} be a delay sequence such that the cost from round t is received at round t + d_t. Then

E_a\left( R(T) \right) = E_a\left\{ \sum_{t=1}^{T} l_t^{(a_t)} - \min_i \sum_{t=1}^{T} l_t^{(i)} \right\} \leq O\left( \sqrt{ \ln K \left( KT + \sum_{t=1}^{T} d_t \right) } \right).    (15)

Proof. 
The m = |M| missing samples (received after T) contribute at least \frac{m(m+1)}{2} to the sum of delays \sum_{t=1}^{T} d_t (since the best case is when the feedback of round T is delayed by one and arrives after T, the feedback of round T-1 now has to be delayed by at least 2 to be received after T, and so on m times). Hence

\sqrt{ \ln K \left( KT + \sum_{t=1}^{T} d_t \right) } \geq \sqrt{ \ln K \left( KT + \sum_{t \notin M} d_t + \frac{m(m+1)}{2} \right) } \geq_{(a)} \frac{1}{2} \sqrt{ \ln K \left( KT + \sum_{t \notin M} d_t \right) } + \frac{1}{2} \sqrt{ \ln K \frac{m(m+1)}{2} } \geq O\left( \sqrt{ \ln K \left( KT + \sum_{t \notin M} d_t \right) } + |M| \right)    (16)

where (a) follows from the concavity of f(x) = \sqrt{x}.

The expression in (15) reveals a robustness property of the regret bound of EXP3 under delays. While the first term in the regret, \sqrt{KT \ln K}, has a factor of K, the delay term \sum_{t=1}^{T} d_t does not have a factor of K. Consider bounded delays of the form d_t = K. Then, the order of magnitude of the regret as a function of T and K is O\left(\sqrt{T K \ln K}\right), exactly as that of EXP3 without delays [14]. For comparison, consider the full information case where at each round the cost of all arms is received. Assume that the player uses the exponential weights algorithm, which is the equivalent of EXP3 for the full information case. For the same delay sequence d_t = K, exponential weights achieves a regret bound of O\left(\sqrt{T K \ln K}\right) [13], \sqrt{K} times worse than the O\left(\sqrt{T \ln K}\right) that exponential weights with no delays achieves. The intuition for this result is that EXP3 already \u201cpaid the price\u201d for using K times less feedback than in the full information case. 
Depending on less feedback, EXP3 is inherently more robust to feedback delays.

2.1 Adaptive Algorithm: Doubling Trick with Delays

The step size \eta = \sqrt{ \frac{\ln K}{KT + \sum_{t \notin M} d_t} } used in Algorithm 1 requires knowledge of T and \sum_{t=1}^{T} d_t. With no delays, the standard doubling trick (see [20]) can be used if T is unknown. However, the same doubling trick does not work with delayed feedback. We now present a novel doubling trick for the delayed feedback case, where T and \sum_{t=1}^{T} d_t are unknown. Define m_t as the number of missing feedback samples at round t, starting from the t-th feedback. The idea is to start a new epoch every time \sum_{\tau=1}^{t} m_\tau, which tracks \sum_{\tau=1}^{t} d_\tau, doubles. Define the e-th epoch as

T_e = \left\{ t \,\middle|\, 2^{e-1} \leq \sum_{\tau=1}^{t} m_\tau < 2^e \right\},    (17)

which is a set of consecutive rounds during which the sum of delays is within a given interval. During the e-th epoch, the EXP3 algorithm using our doubling trick uses step size \eta_e = \sqrt{ \frac{\ln K}{2^e} }. Feedback

Algorithm 2 Adaptive EXP3 with delays for unknown T and \sum_{t=1}^{T} d_t
Initialization: Set \tilde{L}_1^{(i)} = 0 and p_1^{(i)} = \frac{1}{K} for i = 1, ..., K. Set the epoch index e = 0 and \eta_0 = 1.
For t = 1, ..., T do
1. Choose an arm a_t at random according to the distribution p_t.
2. Obtain a set of delayed costs l_s^{(a_s)} for all s \in S_t, where a_s is the arm played at round s.
3. Update the number of missing samples so far

m_t = t - \sum_{\tau=1}^{t} |S_\tau|.    (19)

4. If \sum_{\tau=1}^{t} m_\tau \geq 2^e, then update e = e + 1 and initialize \tilde{L}_t^{(i)} = 0 for i = 1, ..., K.
5. 
Update the weights of arm a_s for all s \in S_t such that s \in T_e, using step size \eta_e = \sqrt{ \frac{\ln K}{2^e} }:

\tilde{L}_t^{(a_s)} = \tilde{L}_{t-1}^{(a_s)} + \eta_e \frac{l_s^{(a_s)}}{p_s^{(a_s)}}.    (20)

6. Update the mixed strategy

p_{t+1}^{(i)} = \frac{e^{-\tilde{L}_t^{(i)}}}{\sum_{j=1}^{K} e^{-\tilde{L}_t^{(j)}}}.    (21)

End

samples that originated in a previous epoch are discarded once received. The resulting algorithm is detailed in Algorithm 2.
The next theorem shows that, thanks to our novel doubling trick, Algorithm 2 achieves the same regret guarantee (up to a constant) as in Theorem 1, despite the fact that T and \sum_{t=1}^{T} d_t are unknown. We conjecture that the K^2 factor replacing K can be improved with a more careful analysis. However, this factor has no effect on the order of the regret when the average delay is larger than K^2.
Theorem 2. Let \{l_t^{(i)}\} be a cost sequence such that l_t^{(i)} \in [0, 1] for every t, i. Let \{d_t\} be a delay sequence such that the cost from round t is received at round t + d_t. If the player uses Algorithm 2 then

E_a\left( R(T) \right) = E\left\{ \sum_{t=1}^{T} l_t^{(a_t)} - \min_i \sum_{t=1}^{T} l_t^{(i)} \right\} \leq O\left( \sqrt{ \ln K \left( KT + \sum_{t=1}^{T} \min\{d_t, T - t + 1\} \right) } \right) \leq O\left( \sqrt{ \ln K \left( K^2 T + \sum_{t=1}^{T} d_t \right) } \right).    (18)

Proof. See Appendix.

3 Two Player Zero-Sum Game with Delayed Bandit Feedback

In this section we consider a two player zero-sum game where both players play according to the EXP3 algorithm with feedback delays. It is well known that without delays, an algorithm with sublinear regret such as EXP3, played against itself, will converge to a NE (in the ergodic average sense) [18].
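This classical self-play fact is easy to observe numerically. The following minimal sketch is our illustration, not the paper's experiments: the matching-pennies matrix, the asynchronous constant delays d^r_t = 3 and d^c_t = 7, the step size \eta_t = t^{-2/3}, and names such as run_exp3_zero_sum are all assumptions made for the example. Both players run EXP3 with bandit, delayed feedback, and the weighted ergodic averages of their mixed strategies are tracked:

```python
import math
import random

def softmax_neg(L):
    # p^(i) proportional to exp(-L^(i)), computed stably
    m = min(L)
    w = [math.exp(-(x - m)) for x in L]
    Z = sum(w)
    return [x / Z for x in w]

def run_exp3_zero_sum(U, T, delay_row, delay_col, seed=1):
    # Row pays U[i][j]; column gains it, so its cost is 1 - U[i][j].
    # eta_t = t^(-2/3) is square summable but not summable.
    rng = random.Random(seed)
    K = len(U)
    Lr, Lc = [0.0] * K, [0.0] * K      # the two players' weights
    buf_r, buf_c = {}, {}              # arrival round -> pending updates
    sum_eta = 0.0
    pbar, qbar = [0.0] * K, [0.0] * K  # unnormalized ergodic averages
    for t in range(1, T + 1):
        eta = t ** (-2.0 / 3.0)
        p, q = softmax_neg(Lr), softmax_neg(Lc)
        i = rng.choices(range(K), weights=p)[0]
        j = rng.choices(range(K), weights=q)[0]
        sum_eta += eta
        for k in range(K):
            pbar[k] += eta * p[k]
            qbar[k] += eta * q[k]
        # bandit feedback arrives after each player's own delay
        buf_r.setdefault(t + delay_row, []).append((i, eta * U[i][j] / p[i]))
        buf_c.setdefault(t + delay_col, []).append((j, eta * (1 - U[i][j]) / q[j]))
        for a, inc in buf_r.pop(t, []):
            Lr[a] += inc
        for a, inc in buf_c.pop(t, []):
            Lc[a] += inc
    return [x / sum_eta for x in pbar], [x / sum_eta for x in qbar]

# Matching pennies: the unique NE is uniform play for both, value 1/2.
U = [[1.0, 0.0], [0.0, 1.0]]
pbar, qbar = run_exp3_zero_sum(U, T=20000, delay_row=3, delay_col=7)
print(pbar, qbar)
```

The last iterates (p_t, q_t) keep oscillating, but the weighted ergodic averages should remain close to the uniform NE, which is what Theorem 3 below guarantees under its step-size conditions.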
Our main result in this section, given in Theorem 3, generalizes this statement for the case of arbitrarily (i.e., adversarially) delayed feedback, and reveals that with delays, convergence to a NE can occur even without sublinear regret.

Let U be the cost matrix, such that when the row player plays i and the column player plays j, the first pays a cost of U(i, j) and the second gains a reward of U(i, j) (i.e., a cost of -U(i, j)). We assume without loss of generality that 0 \leq U(i, j) \leq 1 for any i, j. Note that if p_t, q_t \in \Delta_K are mixed strategies, then we use the convention that

U(p_t, j) \triangleq \sum_{i=1}^{K} p_t^{(i)} U(i, j)    (22)

and

U(p_t, q_t) \triangleq \sum_{i=1}^{K} \sum_{j=1}^{K} p_t^{(i)} q_t^{(j)} U(i, j).    (23)

Nash Equilibrium (NE) is a key concept in game theory for predicting the outcome of a game. A NE is a strategy profile (p^*, q^*) such that no player wants to switch strategies given that the other player keeps his strategy. For our result, we need to define the set of all approximate (and exact) NE:
Definition 2. The set of all \varepsilon-NE points is

N_\varepsilon = \left\{ (p^*, q^*) \,\middle|\, U(p^*, q^*) \leq \min_p U(p, q^*) + \varepsilon \,,\, U(p^*, q^*) \geq \max_q U(p^*, q) - \varepsilon \right\}    (24)

and the set of NE points is N_0.
The entity that converges to the set of NE in our case is the ergodic average of (p_t, q_t). For the special case of a constant step size \eta_\tau, the ergodic average of p_t is simply the running average of the sequence p_t.
Definition 3. 
The ergodic average of a sequence of distributions pt is de\ufb01ned as:\n\n(cid:80)t\n(cid:80)t\n\n(cid:44)\n\n\u00afpt\n\n\u03c4 =1 \u03b7\u03c4 p\u03c4\n\u03c4 =1 \u03b7\u03c4\n\n.\n\n(25)\n\n(26)\n\n(27)\n\nWe say that ( \u00afpT , \u00afqT ) converges in L1 to the set of NE if\n\nT\u2192\u221e arg min\nlim\nT )\u2208N0\nwhich also implies that for every \u03b5 > 0\n\nT ,q\u2217\n\n(p\u2217\n\nE {(cid:107)( \u00afpT , \u00afqT ) \u2212 (p\u2217\n\nT , q\u2217\n\nT )(cid:107)1} = 0\n\nT\u2192\u221e arg min\nlim\nT )\u2208N0\n\nT ,q\u2217\n\n(p\u2217\n\nP ((cid:107)( \u00afpT , \u00afqT ) \u2212 (p\u2217\n\nT , q\u2217\n\nT )(cid:107)1 \u2265 \u03b5) = 0.\n\nOur theorem below establishes the convergence of EXP3 versus itself to a NE, even under signi\ufb01cant\ndelays. Note that the convergence of the ergodic mean to the set of NE is in the L1 sense (so also in\nprobability), which is much stronger than convergence of the expected ergodic mean.\nTheorem 3. Let two players play a zero-sum game with a cost matrix U such that 0 \u2264 U (i, j) \u2264 1\nfor each i, j, using EXP3. The step size sequence of both players is {\u03b7t}\u221e\nt=1. Let the delay sequences\nof the row player and the column player be {dr\nt}, respectively. Let the mixed strategies of the\nrow and column players at round t be pt and qt, respectively. If\n\nt} ,{dc\n\nt < \u221e.\n\nt\u2192\u221e\u03b7tdc\nt=1 dc\n\nt < \u221e.\nt \u03b72\n\n2.\n\n1. (cid:80)\u221e\nt=1 \u03b7t = \u221e.\n3. (cid:80)\u221e\nt\u2192\u221e\u03b7tdr\nlim\nt=1 dr\nt \u03b72\n(cid:16)(cid:80)T\nThen, as T \u2192 \u221e:\n(cid:80)T\n(cid:16)(cid:80)T\n(cid:80)T\n\nt=1 \u03b7tpt\nt=1 \u03b7t\n\nt < \u221e and lim\n\nt < \u221e and(cid:80)\u221e\n(cid:17)\n(cid:80)T\n(cid:80)T\n(cid:80)T\n(cid:80)T\n\nt=1 \u03b7tqt\nt=1 \u03b7t\n\nt=1 \u03b7tpt\nt=1 \u03b7t\n\nt=1 \u03b7tqt\nt=1 \u03b7t\nis the value of the game.\n\n2. 
Somewhat surprisingly, the delays do not have to be bounded (in t) for the convergence to a NE to hold. Key examples of applying Theorem 3 are:

• For bounded delays d^r_t ≤ D and d^c_t ≤ D for all t:
  – For a finite horizon T, one can choose η_t = 1/√T for all t.
  – For the infinite-horizon case, one can choose any η_t such that Σ_{t=1}^{∞} η_t² < ∞ and Σ_{t=1}^{∞} η_t = ∞.
• For unbounded sublinear delays such as d^r_t ≤ √t and d^c_t ≤ √t for all t, one can choose η_t = 1/t^{2/3}.
• For unbounded superlinear delays such as d^r_t ≤ t log t and d^c_t ≤ t log t, one can choose η_t = 1/(t (log t)(log log t)).

In general, the feedback of the players does not need to be synchronized, and they may have completely different sequences of delays.

Next we show that the ergodic average of the EXP3 strategies converges to the set of NE even in a delayed-feedback scenario where EXP3 has linear regret, so the "no-regret" property does not hold.

Proposition 1. Let the mixed strategies of the row and column players at round t be p_t and q_t, respectively. There exist delay sequences {d^r_t}_t, {d^c_t}_t and a cost sequence {l^{(1)}_t, …, l^{(K)}_t} such that

E_a{ Σ_{t=1}^{T} l^{(a_t)}_t − min_p Σ_{t=1}^{T} Σ_{i=1}^{K} p^{(i)} l^{(i)}_t } ≥ (1 − 1/K) T/2,    (28)

but the step sizes {η_t} for Algorithm 1 can still be chosen such that the conclusion of Theorem 3 holds ("convergence to NE").
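Step-size prescriptions such as the superlinear-delay example above can be sanity-checked numerically against the conditions of Theorem 3. A rough sketch (the truncated horizon and the starting index, chosen so that log log t ≥ 1, are our choices; this illustrates the conditions, it is not part of any proof):

```python
import math

# Superlinear delays d_t = t log t with step size eta_t = 1/(t (log t)(log log t)).
T = 10**5
sum_eta = 0.0      # partial sum of eta_t: this series diverges, but extremely slowly
sum_d_eta2 = 0.0   # partial sum of d_t * eta_t^2: this series converges
for t in range(16, T + 1):
    eta = 1.0 / (t * math.log(t) * math.log(math.log(t)))
    d = t * math.log(t)
    sum_eta += eta
    sum_d_eta2 += d * eta * eta   # term equals 1/(t log t (log log t)^2), summable
    eta_d = eta * d               # equals 1/log log t, which tends to 0

print(sum_eta, sum_d_eta2, eta_d)
```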
Proof. Let d^r_t = d^c_t = d_t = t and η_t = 1/(t log t) for all t, for which d_t η_t² = 1/(t log² t), so Σ_{t=1}^{∞} d_t η_t² < ∞, Σ_{t=1}^{∞} η_t = ∞ and lim_{t→∞} η_t d_t = 0. Hence, Theorem 3 applies and (p̄_T, q̄_T) converges in L1 to the set of NE of the game. However, the feedback for the last T/2 rounds is never received. Therefore, the mixed strategies p_t and q_t stay constant for all t ≥ T/2. Consider the sequence of costs l^{(i)}_t = 0 for all i and all t ≤ T/2, and l^{(1)}_t = 0, l^{(j)}_t = 1 for all j > 1 and all t > T/2. Since every cost that is ever observed equals zero, both players keep playing the uniform distribution, while the fixed strategy that always plays arm 1 pays nothing; this sequence therefore yields an expected regret of exactly (1 − 1/K) T/2. ∎

4 Conclusions

In this paper, we analyzed the regret of the EXP3 algorithm subject to an arbitrary (i.e., adversarial) sequence {d_t} of feedback delays. We have shown that the expected regret is O(√(ln K (KT + Σ_{t=1}^{T} d_t))). This shows that the EXP3 algorithm is inherently robust to delays, since for d_t ≤ K the order of magnitude of the regret does not change (as a function of T and K) from the famous O(√(T K ln K)). We have also proved that the convergence of the ergodic average to a Nash equilibrium under delays is a more robust property than the no-regret property of EXP3. The ergodic average converges to the set of Nash equilibria even under superlinear delays, for which EXP3 has regret linear in T. This serves as a concrete example where competing against another agent is essentially easier than competing against an omnipotent adversary, even if the other agent is not subject to any delays.

Acknowledgments

This research was supported by the Koret Foundation grant for Smart Cities and Digital Living. Zhengyuan Zhou gratefully acknowledges the IBM Goldstine Fellowship.
Xi Chen is supported by NSF via IIS-1845444.