{"title": "Learning in Games with Lossy Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 5134, "page_last": 5144, "abstract": "We consider a game-theoretical multi-agent learning problem where the feedback information can be lost during the learning process and rewards are given by a broad class of games known as variationally stable games. We propose a simple variant of the classical online gradient descent algorithm, called reweighted online gradient descent (ROGD) and show that in variationally stable games, if each agent adopts ROGD, then almost sure convergence to the set of Nash equilibria is guaranteed, even when the feedback loss is asynchronous and arbitrarily corrrelated among agents. We then extend the framework to deal with unknown feedback loss probabilities by using an estimator (constructed from past data) in its replacement. Finally, we further extend the framework to accomodate both asynchronous loss and stochastic rewards and establish that multi-agent ROGD learning still converges to the set of Nash equilibria in such settings. Together, these results contribute to the broad lanscape of multi-agent online learning by significantly relaxing the feedback information that is required to achieve desirable outcomes.", "full_text": "Learning in Games with Lossy Feedback\n\nZhengyuan Zhou\nStanford University\n\nzyzhou@stanford.edu\n\nPanayotis Mertikopoulos\n\nUniv. Grenoble Alpes, CNRS, Inria, LIG\npanayotis.mertikopoulos@imag.fr\n\nSusan Athey\n\nStanford University\n\nathey@stanford.edu\n\nNicholas Bambos\nStanford University\n\nPeter Glynn\n\nStanford University\n\nYinyu Ye\n\nStanford University\n\nbambos@stanford.edu\n\nglynn@stanford.edu\n\nyinyu-ye@stanford.edu\n\nAbstract\n\nWe consider a game-theoretical multi-agent learning problem where the feedback\ninformation can be lost during the learning process and rewards are given by a\nbroad class of games known as variationally stable games. 
We propose a simple variant of the classical online gradient descent algorithm, called reweighted online gradient descent (ROGD), and show that in variationally stable games, if each agent adopts ROGD, then almost sure convergence to the set of Nash equilibria is guaranteed, even when the feedback loss is asynchronous and arbitrarily correlated among agents. We then extend the framework to deal with unknown feedback loss probabilities by using an estimator (constructed from past data) in their place. Finally, we further extend the framework to accommodate both asynchronous loss and stochastic rewards and establish that multi-agent ROGD learning still converges to the set of Nash equilibria in such settings. Together, these results contribute to the broad landscape of multi-agent online learning by significantly relaxing the feedback information that is required to achieve desirable outcomes.

1 Introduction

In multi-agent online learning [13, 14, 21, 45], a set of agents repeatedly interact with the environment and each other while accumulating rewards in an online manner. The key feature in this problem is that to each agent, the environment consists of all other agents who are simultaneously making such sequential decisions; hence, each agent's reward depends not only on their own action, but also on the joint action of all other agents. A common way to model reward structures in this multi-agent online learning setting is through a repeated but otherwise unknown stage game: the reward of the i-th agent at iteration n is r_i(a_i^n, a_{-i}^n), where a_i^n is agent i's action at stage n and a_{-i}^n is the vector of all other agents' actions at stage n.

Even though the underlying stage game is fixed (i.e.
each r_i(·) is fixed), from each agent's own perspective, its reward is nevertheless a time-varying function when viewed solely as a function of its own action, i.e., r_i^n(·) = r_i(·, a_{-i}^n), and it needs to select an action a_i^n before receiving the reward r_i^n(a_i^n). As such, each agent is exactly engaged in a classical online learning process [9, 23, 42, 43]. In this context, there is a fruitful line of existing literature providing a rich source of online learning algorithms to minimize the standard performance metric known as regret [10], defined as Reg_{i,T}^Alg = max_{a_i ∈ A_i} Σ_{t=1}^T {r_i^t(a_i) − r_i^t(a_i^t)}, with the sequence of actions a_i^t generated by some online learning algorithm Alg. These algorithms, already classical, include "follow-the-regularized-leader" [26], online gradient descent [57], multiplicative/exponential weights [3], online mirror descent [44], and many others. Perhaps the simplest algorithm in the above list is Zinkevich's online gradient descent (OGD) method, where the agent simply takes a gradient step (using their current reward function) to form the action for the next stage, performing a projection if the iterate steps out of the feasible action set. This algorithm, straightforward as it is, provides (asymptotically) optimal regret guarantees (see Section B in the appendix for a brief review) and is arguably one of the most well-studied and widely-used algorithms in online learning theory and applications [24, 39, 57].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Consequently, several natural questions arise in multi-agent online learning: if each agent adopts OGD to minimize regret, what is the resulting evolution of the joint action? Specifically, under what games/assumptions would it lead to a Nash equilibrium (the leading solution concept in multi-agent games)?
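The projected OGD step just described can be sketched as follows. This is a minimal single-agent illustration under assumptions not taken from the paper: a box-shaped action set (so the projection is a clip) and simple quadratic rewards.

```python
import numpy as np

def ogd_step(a, grad, step, lo, hi):
    """One projected online gradient *ascent* step (rewards are maximized):
    move along the current reward gradient, then project back onto the
    feasible box [lo, hi]^d via Euclidean projection (here, a clip)."""
    y = a + step * grad
    return np.clip(y, lo, hi)

# Toy run: time-varying concave rewards r_t(a) = -||a - c_t||^2 / 2,
# whose gradient at a is (c_t - a); the targets c_t drift around 0.5.
rng = np.random.default_rng(0)
a = np.zeros(2)
for t in range(1, 201):
    c_t = 0.5 + 0.1 * rng.standard_normal(2)
    a = ogd_step(a, c_t - a, step=1.0 / np.sqrt(t), lo=0.0, hi=1.0)
print(a)  # hovers near the average target 0.5
```

With the usual 1/√t step-size schedule the iterate tracks the (drifting) per-round optima, which is exactly the behavior behind OGD's regret guarantee.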
These questions fall within the broader inquiry of understanding the joint convergence of no-regret online learning algorithms, an inquiry that lies at the heart of game-theoretical learning for well-grounded reasons. Studying whether the process converges at all provides an answer as to the stability of the joint learning dynamics, while studying whether it converges to a Nash equilibrium provides an answer as to whether the joint learning dynamics lead to the emergence of rationality. On the latter point, if, when agents adopt an online learning algorithm, the joint action converges to a point that is not a Nash equilibrium, each agent would be able to do better by unilaterally deviating from that algorithm. Hence, convergence to a non-Nash point would inherently produce "regret" in equilibrium.

1.1 Related Work

Despite the fact that game-theoretic learning has received significant scrutiny in the literature, the questions raised above are still open, for several reasons.

First, in general, joint convergence of no-regret learning does not hold. In fact, even in simple finite games (where each agent has a finite number of actions), no-regret learning may fail to converge [31]. Even if it does converge, in mixed extensions of finite games (where each agent's action set is a probability simplex over a finite number of actions), the limit can assign positive weight to strictly dominated actions. Consequently, establishing joint convergence to Nash equilibria under no-regret learning algorithms (OGD included) for a broad and meaningful subclass of games has been a challenging ongoing effort.

Second, the extensive existing literature in the field has mainly focused on studying convergence in (mixed extensions of) finite games [8, 34, 45, 46].
Earlier work on game-theoretic learning (see [18] for a comprehensive review) mainly focused on learning in finite games with dynamics that are not necessarily regret-less (e.g. best-response dynamics, fictitious play and the like). Subsequently, the seminal work [14] (see also the many references therein) provided a unified treatment of the joint convergence of various no-regret learning algorithms in mixed extensions of finite games. The primary focus of [14] is convergence to other, coarser equilibrium notions (e.g. correlated equilibria), where a fairly complete characterization is given. On the other hand, as pointed out in [14], convergence to Nash is a much more difficult problem: recent results of [47] have clearly highlighted the gap between (coarse) correlated equilibria obtained by no-regret learning processes and Nash equilibria. More positive results can be obtained in the class of potential games where, in a recent paper, the authors of [15] established the convergence of multiplicative weights and other regularized strategies in potential games with only payoff-based, bandit feedback.

However, moving beyond mixed extensions of finite games, much less is known, and only a moderate amount of literature exists. In the context of games with continuous action spaces, the authors of [37] provide a convergence analysis for a perturbed version of the multiplicative weights algorithm in potential games. In a pure-strategy setting, the network games considered in [29] belong to the much broader class of games known as concave games: each agent's reward function is individually concave.1 Therein, the dynamics investigated may lead to positive regret in the limit. The recent paper [5] studied a two-player continuous zero-sum game, and showed that if both players adopt a no-regret learning algorithm, then the empirical time-average of the joint action converges to Nash equilibria.
However, barring a few recent exceptions, the territory of no-regret learning on concave games is not well understood (let alone in general games with continuous action sets). An exception to this are the recent papers [30, 32], where the authors establish the convergence of mirror descent in monotone games – a result later extended to learning with bandit, zeroth-order feedback in [7, 11].

1 In this vein, mixed extensions of finite games can be called linear games, because each agent's reward is individually linear in its own action. Note also that convex potential games are a subclass of concave games. The following gives the chain of inclusions: finite games ⊂ mixed extensions of finite games ⊂ concave games ⊂ general continuous games.

Third, the convergence mode that is commonly adopted in the existing literature is ergodic convergence (i.e. convergence of the time-average (1/n) Σ_{t=1}^n a^t), rather than convergence of the actual sequence of play (i.e. of a^n). The former is convergence of the empirical frequency of play, while the latter is convergence of actual play: note that the latter implies the former, but not the other way round.
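The gap between the two convergence modes is easy to see numerically. The sketch below is a hypothetical illustration, not a construction from the paper: gradient play with a constant step in the bilinear zero-sum game r_1(x, y) = xy, r_2(x, y) = −xy, whose unique Nash equilibrium is the origin. The actual iterates spiral away from the equilibrium, while their time-average stays close to it.

```python
import numpy as np

# Simultaneous gradient ascent for both players, constant step, no projection.
eta, T = 0.1, 500
z = np.array([1.0, 0.0])          # (x, y), starting away from equilibrium
iterates = []
for _ in range(T):
    x, y = z
    z = np.array([x + eta * y,    # player 1 ascends d/dx (x*y)  =  y
                  y - eta * x])   # player 2 ascends d/dy (-x*y) = -x
    iterates.append(z)

last = iterates[-1]
avg = np.mean(iterates, axis=0)   # ergodic (time-averaged) play
print(np.linalg.norm(last), np.linalg.norm(avg))
# Actual play drifts far from (0, 0); the time-average remains near it.
```

Here the rotation of the iterates around the equilibrium cancels in the average, so the ergodic sequence looks convergent even though the actual sequence of play never settles.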
It is important to point out, however, that convergence of the actual sequence of play is more desirable2, not only because it is theoretically more appealing, but also because it characterizes the joint evolution of the system in no-regret learning (since a^n, rather than (1/n) Σ_{t=1}^n a^t, is the actual action taken).

This issue was highlighted in [31], where it is shown that even though continuous-time follow-the-regularized-leader (another no-regret learning algorithm) converges to Nash equilibrium in linear zero-sum games in the sense of time-averages, actual play orbits Nash equilibria in perpetuity. However, most of the existing work focuses on establishing ergodic convergence (which, granted, is some form of stability), and the tools developed there are far from sufficient to ensure convergence of actual play. Some recent exceptions in specific games do exist: [27, 28] studied nonatomic routing games (a special class of concave games) and established that both online mirror descent and multiplicative weights (yet another two no-regret algorithms) converge to Nash equilibria, while [36] established that multiplicative weights converges to Nash equilibria under certain conditions in atomic routing games (a subclass of finite games). Theoretically, the main difficulty in establishing convergence of the sequence of play lies in cleverly designing (as done in [27, 28, 36, 53]) specific and much sharper Lyapunov functions for the specific learning algorithms at hand.
This difficulty is partially overcome in [11] for strongly monotone games where, assuming perfect gradient observations, an O(1/n) convergence rate is established.

1.2 Our Contributions

In this paper, we study the convergence of OGD to Nash equilibria in general continuous games. To the best of our knowledge, the existing state-of-the-art convergence guarantee is that multi-agent OGD converges to Nash equilibria in monotone games in the sense of ergodic averages, a result due to [35]. Monotone games3 are an important and well-known subclass of concave games (in particular, they include convex potential games as a special case). On a related note, a very recent paper [41] considered pseudo-monotone games (which strictly contain monotone games but also belong to concave games) and devised two distributed algorithms for computing Nash equilibria. However, it is far from clear that the algorithms devised are no-regret when used in an online fashion.

In this paper, we work with a broad class of general continuous games (not even necessarily concave), known as variationally stable games [53, 54], that strictly includes pseudo-monotone games (and hence, in a chain of inclusions, monotone games and convex potential games), as well as nonatomic routing games, linear Cournot games, resource allocation auctions, etc. More importantly, we go a step further and consider an even more general multi-agent online learning problem, where we allow agents to have asynchronous gradient feedback losses (note that so far the game-theoretical learning discussion has been under perfect feedback, where the resulting landscape is already challenging).
More specifically, instead of assuming that every agent receives information at every stage, we allow for cases where only a (random) subset of agents remains informed (specifically, each agent i has a probability p_i of receiving the feedback and probability 1 − p_i of losing it).

Two important features of this model are: 1) the feedback loss can be asynchronous; 2) the feedback loss can be arbitrarily correlated among agents (whether some agent loses its feedback on the current iteration may affect whether others lose theirs). In this asynchronous context, we design a simple variant of OGD, called reweighted online gradient descent (ROGD), where each agent corrects its own marginal bias by dividing the received gradient (if any) by p_i. This is inspired by the classical EXP3 algorithm [4, 12] in the multi-armed bandit literature, which has a similar feature of weighting the observed bandit feedback by the probability of selecting the corresponding arm. We then establish in Theorem 4.3 that, in this asynchronous feedback loss setting, the sequence of play induced by joint ROGD converges almost surely to Nash equilibria in variationally stable games. We achieve this strong theoretical guarantee by designing an energy function that sharply tracks the progress made in the joint evolution of the ROGD update of all agents under feedback loss.

2 Of course, convergence in ergodic average is still valuable in many contexts, particularly when convergence of actual play fails to hold.
3 In short, it means that (∇_{a_1} r_1(·), ∇_{a_2} r_2(·), . . . , ∇_{a_N} r_N(·)) is a monotone operator on the joint action space. Note that if there is only one player, the underlying problem is a convex optimization problem, since the gradient of a convex function is monotone. In this case, a Nash equilibrium is a globally optimal solution.
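The 1/p_i correction behind ROGD rests on the importance-weighting identity E[(I_i/p_i) g_i] = g_i, which holds regardless of how the loss indicators are correlated across agents. A small Monte-Carlo check with hypothetical numbers (the probabilities and gradients below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.7, 0.3])          # marginal feedback probabilities p_i
g = np.array([2.0, -1.0])         # fixed gradients to be estimated

# Perfectly correlated losses: ONE uniform draw decides both agents'
# indicators, so I_1 and I_2 are far from independent, yet each
# indicator still has the right marginal: E[I_i] = p_i.
u = rng.random(200_000)
I = (u[:, None] < p[None, :]).astype(float)

reweighted = I / p * g            # ROGD feedback: g_i / p_i if received, else 0
print(reweighted.mean(axis=0))    # close to g for every agent
```

Each agent only needs its own marginal probability p_i; the joint law of the indicators never enters the correction.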
We mention that a very recent work that also studies multi-agent online learning under imperfect information at the generality of variationally stable games is [53]. In particular, it is shown there that online mirror descent (also no-regret) converges to Nash even in the presence of delays in the gradients that grow super-linearly but sub-quadratically. However, in that context, in addition to studying a different algorithm, [53] focuses on delays and in particular requires all gradients to be received (i.e. no gradient loss is allowed). Consequently, the results in [53] are strictly complementary to ours here: in particular, it is unclear whether online mirror descent would converge to Nash under feedback loss.

Finally, we make two practically useful and theoretically meaningful extensions. First, we extend to the case where the loss probabilities (which are assumed to be known so far) are not known. In this case, we can replace the actual probabilities p_i by an estimate. Our main message (Theorem 5.1) is that, provided the estimator is √n-consistent, convergence to Nash in the last iterate under ROGD is still guaranteed. Note that a simple estimator that satisfies this guarantee is the sample mean estimator (with Laplace smoothing). Second, we extend the multi-agent online learning setup to also accommodate stochastic reward functions, where in each iteration only a random reward function is realized. In such cases, we first note an important equivalence result, both from a modeling standpoint and from a proof standpoint: the setup where agents' reward functions are i.i.d. can be identified with the setup where agents' reward functions are fixed and deterministic but the received feedback gradients are noisy (but unbiased).
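The Laplace-smoothed sample mean mentioned above can be sketched as follows (a sketch, not the paper's exact estimator: the add-one smoothing constants are the standard choice, assumed here for illustration). Smoothing keeps the estimate strictly inside (0, 1), so dividing a received gradient by it is always well defined.

```python
import numpy as np

def smoothed_rate(indicators):
    """Laplace-smoothed sample mean of feedback indicators I^1, ..., I^n:
    (1 + sum I) / (2 + n).  Strictly in (0, 1) even with no data, and its
    error decays at the sqrt(n) rate of the plain sample mean."""
    n = len(indicators)
    return (1 + sum(indicators)) / (2 + n)

rng = np.random.default_rng(2)
p_true = 0.6                                  # hypothetical true probability
I = (rng.random(5000) < p_true).astype(int)
print(smoothed_rate(I[:10]), smoothed_rate(I))  # rough early, then near 0.6
```

With no observations the estimate defaults to 1/2 rather than an unusable 0, which is exactly why the smoothed version is safe to plug into the 1/p_i reweighting.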
We then establish that (in either setup) multi-agent ROGD learning still converges almost surely to the set of Nash equilibria when the noise has bounded support (Theorem E.4 in the appendix), and converges in probability to the set of Nash equilibria if the noise has unbounded support but finite second moment (Theorem E.6 in the appendix); both are in the last iterate. Due to space limitations, this part is placed in the appendix. Together, these results not only make meaningful progress towards the challenging open problem of convergence of no-regret algorithms to Nash in general continuous games under perfect information, but also, more importantly, contribute to the broad landscape of multi-agent online learning under imperfect information.

2 Problem setup

2.1 Rewards for Multi-Agent Online Learning

We consider a general continuous game model for the rewards in multi-agent online learning:

Definition 2.1. A continuous game G is a tuple G = (N, A = ∏_{i=1}^N A_i, {r_i}_{i=1}^N), where N is the set of N agents, A_i is a convex and compact subset of R^{d_i} representing the action space of agent i, and r_i : A → R is agent i's reward function, such that r_i(a) = r_i(a_1, a_2, . . . , a_N) is continuous in a and continuously differentiable in a_i for each i ∈ N, and ∇_{a_i} r_i(a) is Lipschitz continuous in a.

Definition 2.2. Given a continuous game G, a^* ∈ A is called a Nash equilibrium if for each i ∈ N, r_i(a_i^*, a_{-i}^*) ≥ r_i(a_i, a_{-i}^*), ∀a_i ∈ A_i.

For the rest of the paper, we set d = Σ_{i=1}^N d_i, which denotes the dimension of the joint action space: A ⊂ R^d. Variables associated with agent i are denoted by subscripts. If there is no subscript, then it is understood that it is a joint variable of all agents.
When each agent applies4 OGD independently, we obtain multi-agent OGD in Algorithm 1 (we assume Σ_{n=1}^∞ γ_{n+1} = ∞ and Σ_{n=1}^∞ γ_{n+1}^2 < ∞):

4 Note that in particular, the gradient here is a partial gradient with respect to one's own action, and its dimension is equal to d_i, the dimension of agent i's own action space.

Algorithm 1: Multi-Agent OGD Learning
Require: Each agent i picks an arbitrary a_i^0 ∈ R^{d_i}
1: n ← 0, y_i^0 ← a_i^0
2: repeat
3:   for each agent i do
4:     y_i^{n+1} = y_i^n + γ_{n+1} ∇_{a_i} r_i(a^n)
5:     a_i^{n+1} = Proj_{A_i}(y_i^{n+1})
6:   end for
7:   n ← n + 1
8: until end

2.2 Asynchronous Feedback Loss

The setup described in Algorithm 1 is multi-agent learning under perfect feedback information. Here we extend the model to allow for asynchronous feedback loss, where at each time step, only a subset N^{n+1} ⊂ N of the agents receive their gradients from time5 n, while the other agents' gradients are lost. We assume that this subset is drawn i.i.d. from a fixed distribution across time steps.

To facilitate the discussion, let the indicator variable I_i^{n+1} be 1 if agent i's gradient from time n is received (at the beginning of time n + 1) and 0 otherwise. Mathematically:

I_i^{n+1} = 1 if i ∈ N^{n+1}, and I_i^{n+1} = 0 if i ∉ N^{n+1}.    (2.1)

Note that under this model, even though N^{n+1} is drawn i.i.d. across time steps, within the same time step, I_i^{n+1} and I_j^{n+1} can be arbitrarily correlated (whether agent i's feedback is lost can influence whether agent j's feedback is lost). We assume each agent i only knows its own marginal loss probability 1 − p_i, where p_i = E[I_i^{n+1}]. We refer to this setup as asynchronous feedback loss, because different agents can have different feedback loss patterns at any iteration.
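Under perfect feedback, Algorithm 1 can be implemented directly. The two-player quadratic game below is an illustrative assumption, not an example from the paper: its joint gradient field is −Ma for a positive definite M, so it is strongly monotone (hence variationally stable) with unique Nash equilibrium at the origin.

```python
import numpy as np

def proj(y):
    """Euclidean projection onto the box [-1, 1]^2 (each A_i = [-1, 1])."""
    return np.clip(y, -1.0, 1.0)

def grad(i, a):
    """Partial gradient of r_i(a) = -a_i^2 - a_1 * a_2 w.r.t. a_i.
    Jointly, (grad(0, a), grad(1, a)) = -M a with M = [[2, 1], [1, 2]]."""
    return -2.0 * a[i] - a[1 - i]

a = np.array([1.0, -0.8])          # arbitrary initial actions
y = a.copy()                       # lazy (pre-projection) iterates
for n in range(1, 2001):
    gamma = 1.0 / n                # sum gamma = inf, sum gamma^2 < inf
    g = np.array([grad(0, a), grad(1, a)])
    y = y + gamma * g              # simultaneous gradient steps (Alg. 1, line 4)
    a = proj(y)                    # projection onto each action set (line 5)
print(a)                           # approaches the Nash equilibrium (0, 0)
```

Both agents update simultaneously using only their own partial gradients; no agent ever sees the other's reward function.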
Applying multi-agent OGD in the feedback loss model yields:

Algorithm 2: Multi-Agent OGD Learning under Asynchronous Feedback Loss
Require: Each agent i picks an arbitrary A_i^0 ∈ R^{d_i}
1: n ← 0, Y_i^0 ← A_i^0
2: repeat
3:   for each agent i do
4:     Y_i^{n+1} = Y_i^n + γ_{n+1} ∇_{a_i} r_i(A^n) if I_i^{n+1} = 1; Y_i^{n+1} = Y_i^n if I_i^{n+1} = 0
5:     A_i^{n+1} = Proj_{A_i}(Y_i^{n+1})
6:   end for
7:   n ← n + 1
8: until end

Here we have capitalized the iterates Y_i^n because they are now random variables (due to random feedback loss). Specifically, denoting by I the vector of indicator variables for all the agents, we have that both Y_i^n and A_i^n are adapted to A^0, I^1, I^2, . . . , I^n. Note the important but subtle point here: Y_i^n and A_i^n are not adapted to A_i^0, I_i^1, I_i^2, . . . , I_i^n, because each individual's gradient update is coupled with all the other agents' actions, which is a fundamental phenomenon in multi-agent learning.

2.3 Variationally Stable Games

Consequently, special structures must be imposed on the games/reward functions. Here we consider a broad meta-class of games, called variationally stable games [32, 54], that contains many existing well-known classes of games, such as convex potential games, monotone games, pseudo-monotone games [56], coercive games [41], influence network games [52, 55], and non-atomic routing games [40]. See [53] for more details.

5 Here the superscript n + 1 means that the set N^{n+1} is revealed at the beginning of time n + 1.

Definition 2.3. G is a variationally stable game if its set of Nash equilibria A^* is non-empty and satisfies: ⟨∇_a r(a), a − a^*⟩ ≜ Σ_{i=1}^N ⟨∇_{a_i} r_i(a), a_i − a_i^*⟩ ≤ 0, ∀a ∈ A, ∀a^* ∈ A^*, with equality if and only if a ∈ A^*.

Remark 1. Monotone games require ⟨∇_a r(a) − ∇_a r(a′), a − a′⟩ ≤ 0, ∀a, a′ ∈ A. Pseudo-monotone games require: if ⟨∇_a r(a′), a − a′⟩ ≤ 0, then ⟨∇_a r(a), a − a′⟩ ≤ 0, ∀a, a′ ∈ A. That monotone is a special case of pseudo-monotone is obvious. That pseudo-monotone is a special case of variational stability follows by recalling the standard characterization of a Nash equilibrium: a^* satisfies ⟨∇_a r(a^*), a − a^*⟩ ≤ 0, ∀a ∈ A.

In the presence of feedback loss, the convergence of multi-agent OGD given in Algorithm 2 to Nash equilibria cannot be guaranteed in variationally stable games (and there exist cases where it fails; in fact, convergence of OGD is not guaranteed even in convex potential games). In the next section, we present a simple modification of the vanilla multi-agent OGD that will later be shown to converge to Nash equilibria, even in the presence of asynchronous feedback loss.
We shall then see, as a corollary (see Corollary 4.4), that multi-agent OGD converges to the set of Nash equilibria (almost surely) if the feedback loss is synchronous (and, as a further special case, if there is no feedback loss).

3 Algorithm and energy function

In this section, to deal with asynchronous feedback loss, we give a new algorithm, called Reweighted Online Gradient Descent (ROGD), for the multi-agent learning problem.

3.1 Reweighted Online Gradient Descent

The main idea in the modification of OGD lies in each agent individually correcting its marginal bias (which comes from the loss of the gradient feedback) by dividing by its feedback probability. More specifically, when a gradient is lost on the current time step, the agent uses the previous action, as in OGD. Otherwise, when a gradient is indeed available, the gradient is first weighted by 1/p_i before the update is made. This results in reweighted online gradient descent:

Algorithm 3: Multi-Agent ROGD Learning under Asynchronous Feedback Loss
Require: Each agent i picks an arbitrary A_i^0 ∈ R^{d_i}
1: n ← 0, Y_i^0 ← A_i^0
2: repeat
3:   for each agent i do
4:     Y_i^{n+1} = Y_i^n + γ_{n+1} ∇_{a_i} r_i(A^n)/p_i if I_i^{n+1} = 1; Y_i^{n+1} = Y_i^n if I_i^{n+1} = 0
5:     A_i^{n+1} = Proj_{A_i}(Y_i^{n+1})
6:   end for
7:   n ← n + 1
8: until end

We emphasize once again that the gradient loss among agents can be correlated. However, in ROGD, each agent only corrects its own marginal bias and does not concern itself with other agents' feedback loss. To analyze the ROGD algorithm, we next introduce an important theoretical tool.

3.2 Energy Function

We start by defining the energy function, an important tool we use to establish convergence later. Throughout, a^* denotes an arbitrary fixed Nash equilibrium.

Definition 3.1.
Define the energy function L : R^d → R as follows:

L(y) = ‖a^*‖_2^2 − ‖Proj_A(y)‖_2^2 + 2⟨y, Proj_A(y) − a^*⟩.    (3.1)

Lemma 3.2. Let y^n be a sequence in R^d. Then:
1. L(y^n) → 0 implies Proj_A(y^n) → a^*.
2. ‖Proj_A(y) − ŷ‖_2^2 − ‖Proj_A(ŷ) − ŷ‖_2^2 ≤ ‖y − ŷ‖_2^2, for any y, ŷ ∈ R^d.
3. L(y + Δy) − L(y) ≤ 2⟨Δy, Proj_A(y) − a^*⟩ + ‖Δy‖_2^2, for any y, Δy ∈ R^d.

Remark 2. The second statement of the lemma serves as an important intermediate step in proving the third statement, and is established by leveraging the envelope theorem and several important properties of the Euclidean projection. To see that this is not trivial, consider the quantity ‖Proj_A(y) − ŷ‖_2 − ‖Proj_A(ŷ) − ŷ‖_2, which we know by the triangle inequality satisfies:

‖Proj_A(y) − ŷ‖_2 − ‖Proj_A(ŷ) − ŷ‖_2 ≤ ‖Proj_A(y) − Proj_A(ŷ)‖_2 ≤ ‖y − ŷ‖_2,    (3.2)

where the last inequality follows from the fact that the projection is a non-expansive map. However, this inequality is not sufficient for our purposes because, in quantifying the perturbation L(y + Δy) − L(y), we also need the squared term ‖Δy‖_2^2, which is not easily obtainable from Equation (3.2).
In fact, a finer-grained analysis is needed to establish that ‖y − ŷ‖_2^2 is an upper bound on ‖Proj_A(y) − ŷ‖_2^2 − ‖Proj_A(ŷ) − ŷ‖_2^2.

4 Almost sure convergence to Nash equilibria

In this section, we establish that when the agents' rewards come from a variationally stable game, multi-agent ROGD converges to the set of Nash equilibria almost surely under asynchronous feedback loss. For ease of exposition, we break the framework into three steps, each of which centers on one idea and is described in detail in a subsection. All the proof details are given in the appendix. For the first two subsections, let a^* be an arbitrary Nash equilibrium.

4.1 Controlling the Tail Behavior of Expectation

Our first step lies in controlling the tail behavior of the expected value of the sequence {⟨∇_a r(A^n), a^* − A^n⟩}_{n=0}^∞. Note that by variational stability, we have:

∀n, ⟨∇_a r(A^n), a^* − A^n⟩ ≥ 0, a.s.

Consequently, E[⟨∇_a r(A^n), a^* − A^n⟩] ≥ 0, ∀n. By leveraging the energy function, its telescoping sum and an appropriate conditioning, we show in the next lemma that this sequence of expectations must be rather small in the limit and its tail should "mostly" go to 0.

Lemma 4.1. Σ_{n=0}^∞ γ_{n+1} E[⟨∇_a r(A^n), a^* − A^n⟩] < ∞.    (4.1)

Remark 3. Since Σ_{n=0}^∞ γ_{n+1} = ∞, Lemma 4.1 implies that lim inf_{n→∞} E[⟨∇_a r(A^n), a^* − A^n⟩] = 0. Note that the converse is not true: when a subsequence of {E[⟨∇_a r(A^n), a^* − A^n⟩]}_{n=0}^∞ converges to 0, the sum need not be finite.
As a simple example, consider γ_{t+1} = 1/t and

E[⟨∇_a r(A^t), a^* − A^t⟩] = 1/t if t = 2^k, and 1 otherwise.    (4.2)

Then the subsequence on the indices 2^k converges to 0, but the sum Σ_t γ_{t+1} E[⟨∇_a r(A^t), a^* − A^t⟩] still diverges. This means that Equation (4.1) is stronger than subsequence convergence.

4.2 Bounding the Successive Differences

However, Equation (4.1) is still not strong enough to guarantee that lim_{n→∞} E[⟨∇_a r(A^n), a^* − A^n⟩] = 0, let alone lim_{n→∞} ⟨∇_a r(A^n), a^* − A^n⟩ = 0 a.s. This is because the convergent sum given in Equation (4.1) only limits the tail growth somewhat, but not completely. As an example to demonstrate this point, let C_t be the following boolean variable6:

C_t = 1 if t contains the digit 9 in its decimal expansion, and 0 otherwise.    (4.3)

6 By its definition, C_9 = 1 and C_11 = 0.

Now define γ_{t+1} = 1/t and E[⟨∇_a r(A^t), a^* − A^t⟩] = 1/t if C_t = 1, and 1 if C_t = 0. Then it follows from a straightforward calculation that Σ_t γ_{t+1} E[⟨∇_a r(A^t), a^* − A^t⟩] < ∞ (see Problem 1.3.24 in [16]): the terms with C_t = 1 contribute at most Σ_t 1/t^2, while the terms with C_t = 0 form the harmonic series restricted to integers containing no digit 9, which converges (the Kempner series). However, the limit of E[⟨∇_a r(A^t), a^* − A^t⟩] does not exist.

This indicates that to obtain almost sure convergence of ⟨∇_a r(A^n), a^* − A^n⟩ to 0, we need to impose more stringent conditions to ensure sufficient tail decay. One way is to bound the difference between every two successive terms by a decreasing sequence. This ensures that ⟨∇_a r(A^t), a^* − A^t⟩ cannot change too much from iteration to iteration; further, the change between two successive terms becomes smaller and smaller. This result is formalized in the following lemma (the proof is given in the appendix):

Lemma 4.2.
There exists a constant $C > 0$ such that for every $n$,
$$\langle \nabla_a r(A^{n+1}), a^* - A^{n+1} \rangle - \langle \nabla_a r(A^n), a^* - A^n \rangle \le C \alpha_{n+1}, \quad \text{a.s.} \tag{4.4}$$

4.3 Main Convergence Result

We are now finally ready to put all the pieces together and state the main convergence result.

Theorem 4.3. Let the reward functions be given by a variationally stable game. Define the point-to-set distance in the standard way: $\mathrm{dist}(a, \mathcal{A}^*) = \inf_{a^* \in \mathcal{A}^*} \|a - a^*\|_2$. Then for any strictly positive probabilities $\{p_i\}_{i=1}^N$, ROGD converges almost surely to the set of Nash equilibria: $\lim_{n \to \infty} \mathrm{dist}(A^n, \mathcal{A}^*) = 0$ a.s., where $A^n$ is the sequence generated by Algorithm 3.

Remark 4. As a quick outline (the details are in the appendix), pick an arbitrary Nash equilibrium $a^* \in \mathcal{A}^*$. Lemma 4.1 and Lemma 4.2 together ensure that $\lim_{n \to \infty} \langle \nabla_a r(A^n), a^* - A^n \rangle = 0$ a.s. Since $\langle \nabla_a r(a), a^* - a \rangle > 0$ if and only if $a \notin \mathcal{A}^*$ and $\langle \nabla_a r(a), a^* - a \rangle = 0$ if and only if $a \in \mathcal{A}^*$, it then follows by continuity of $\nabla_a r(\cdot)$ that $\lim_{n \to \infty} \langle \nabla_a r(A^n), a^* - A^n \rangle = 0$ a.s. implies $\lim_{n \to \infty} \mathrm{dist}(A^n, \mathcal{A}^*) = 0$ a.s.

Although not mentioned in this theorem, another useful and interesting structural insight is that as $\langle \nabla_a r(A^n), a^* - A^n \rangle$ converges to 0, if it ever becomes 0 at some iteration $n$, then $A^n \in \mathcal{A}^*$, and furthermore, the joint action will stay exactly at that Nash equilibrium forever. Why? There are two cases to consider.

1. First, this Nash equilibrium $a^*$ is an interior point of $\mathcal{A}$. In this case, $\nabla_a r(a^*) = 0$ and hence $\nabla_a r(A^n) = 0$.
Consequently, per the ROGD update rule, whether any agent updates or not does not matter: either the gradient is not received, in which case no gradient update happens; or a gradient is received, but at this Nash equilibrium it is 0, and therefore no agent makes any update.

2. Second, this Nash equilibrium $a^*$ is a boundary point of $\mathcal{A}$. In this case, $\nabla_a r(a^*)$ may not be 0, but it always points outside the feasible action set $\mathcal{A}$. Consequently, even if an agent receives a gradient and hence makes a gradient update, its action will immediately get projected back to the same point. As a result, the joint action $A^n$ will stay exactly at $a^*$ (even though the $Y^n$ variables will still change).

Corollary 4.4. Under the same setup as in Theorem 4.3, if feedback loss is synchronous on average ($p_i = p_j$ for all $i, j$), then multi-agent OGD in Algorithm 2 converges almost surely to $\mathcal{A}^*$ in last iterate. This can be seen by noting that the common probability can be absorbed into the step-size.

Corollary 4.5. Under the same setup as in Theorem 4.3, if there is no feedback loss ($p_i = 1$ for all $i$), multi-agent OGD in Algorithm 2 converges to $\mathcal{A}^*$ in last iterate.

Remark 5. Note that, as stated, our results require a joint learning step-size policy, even if there is no missing gradient feedback; otherwise, the players' individual gradients weighted by individual step-sizes might not be "stable" and can cause divergence (and, of course, the situation only becomes worse in the "lossy" regime). That said, our analysis still holds if each player $i \in \mathcal{N}$ uses an individual step-size policy $\gamma_n^i$ such that $\limsup_{n \to \infty} \gamma_n^i / \gamma_n^j < \infty$ for all pairs $i, j \in \mathcal{N}$, i.e., if the players' updates do not follow different "time-scales" (so to speak). In that case, however, the proofs (and the overall write-up) would become much more cumbersome, so we avoided this extra degree of generality in this paper.

5 Extension: Unknown Loss Probabilities

So far we have assumed the $p_i$'s are known. We close the paper with a brief comment on how to remove this assumption. When each agent $i$ does not know the underlying loss probability $p_i$, ROGD is no longer feasible. To overcome this, we use an estimator $\hat{p}_i$ (obtainable from the past history) in place of the true probability $p_i$. Since the only information an agent has is the past history of received gradients, we require the estimator $\hat{p}_i$ to be adapted to the sequence of indicator functions $I_i^t$: $\hat{p}_i^n = \hat{p}_i^n(I_i^1, \ldots, I_i^n)$. The resulting algorithm will be called EROGD. An estimator $\hat{p}^n$ is called $\sqrt{n}$-consistent if $\mathbb{E}[(\hat{p}^n - p)^2] = O(\frac{1}{n})$, where $p$ is the true parameter (it is called $\sqrt{n}$-consistent because root-mean-squared error is typically used to define the rate of consistency). We have the following result (proof omitted due to space limitations):

Theorem 5.1. Under the same setup as in Theorem 4.3, if $\hat{p}_i^n$ is $\sqrt{n}$-consistent for every $i \in \mathcal{N}$, and if $\sum_{n=1}^{\infty} \gamma_n \frac{1}{\sqrt{n}} < \infty$, then the last iterate of EROGD converges to $\mathcal{A}^*$ in probability.

Remark 6. One simple estimator that is $\sqrt{n}$-consistent is the sample mean with smoothing (where the smoothing is used to prevent the estimator from ever reaching 0): $\hat{p}_i^{n+1} = \frac{\sum_{s=1}^{n} I_i^s + 1}{n+1}$. Further, many step-size sequences satisfy the requirement: examples include $\gamma_n = \frac{1}{n^{\alpha}}$ where $0.5 < \alpha < 1$.

6 Concluding remarks

In this paper, we have provided an algorithmic framework to deal with multi-agent online learning under feedback loss and obtained broad convergence-to-Nash results in fairly general settings. We also believe more exciting work remains. For instance, our formulation is game-theoretical, where participating agents are self-interested. A parallel facet of multi-agent online learning is coordination [1, 2, 17, 20], where participating agents coordinate to achieve a common goal. Understanding how to effectively cooperate under imperfect information will be an interesting future direction. Another direction is to incorporate state into the reward and allow actions to also depend on an underlying state that may transition. Such settings belong broadly to multi-agent online policy learning, where the imperfect-information regime is under-explored. Empirically, we believe recent advances in deep learning and representation learning [19, 33, 48–50] could provide a flexible architecture for learning good policies in the imperfect-information regime, although characterizations of theoretical guarantees may require novel machinery. Finally, it would also be interesting to further extend the results to partial feedback settings. In the presence of a single agent, such problems have been studied in the context of offline policy learning [6, 51] and online bandits (with imperfect information) [22, 25, 38]. However, the multi-agent learning setting is again under-explored; we leave that for future work.

References

[1] A. ANAND, A. GROVER, P.
SINGLA, ET AL., Asap-uct: Abstraction of state-action pairs in uct, in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[2] ——, A novel abstraction framework for online planning, in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, International Foundation for Autonomous Agents and Multiagent Systems, 2015, pp. 1901–1902.

[3] S. ARORA, E. HAZAN, AND S. KALE, The multiplicative weights update method: a meta-algorithm and applications, Theory of Computing, 8 (2012), pp. 121–164.

[4] P. AUER, N. CESA-BIANCHI, Y. FREUND, AND R. E. SCHAPIRE, The nonstochastic multiarmed bandit problem, SIAM Journal on Computing, 32 (2002), pp. 48–77.

[5] M. BALANDAT, W. KRICHENE, C. TOMLIN, AND A. BAYEN, Minimizing regret on reflexive Banach spaces and learning Nash equilibria in continuous zero-sum games, arXiv preprint arXiv:1606.01261, (2016).

[6] D. BERTSIMAS AND N. KALLUS, From predictive to prescriptive analytics, arXiv preprint arXiv:1402.5481, (2014).

[7] S. BERVOETS, M. BRAVO, AND M. FAURE, Learning with minimal information in continuous games. https://arxiv.org/abs/1806.11506, 2018.

[8] D. BLOEMBERGEN, K. TUYLS, D. HENNES, AND M. KAISERS, Evolutionary dynamics of multi-agent learning: a survey, Journal of Artificial Intelligence Research, 53 (2015), pp. 659–697.

[9] A. BLUM, On-line algorithms in machine learning, in Online Algorithms, Springer, 1998, pp. 306–325.

[10] A. BLUM AND Y. MANSOUR, From external to internal regret, Journal of Machine Learning Research, 8 (2007), pp. 1307–1324.

[11] M. BRAVO, D. S. LESLIE, AND P. MERTIKOPOULOS, Bandit learning in concave N-person games, in NIPS '18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.

[12] S. BUBECK, N. CESA-BIANCHI, ET AL., Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends® in Machine Learning, 5 (2012), pp. 1–122.
[13] L. BUSONIU, R. BABUSKA, AND B. DE SCHUTTER, Multi-agent reinforcement learning: An overview, in Innovations in Multi-Agent Systems and Applications-1, Springer, 2010, pp. 183–221.

[14] N. CESA-BIANCHI AND G. LUGOSI, Prediction, Learning, and Games, Cambridge University Press, 2006.

[15] J. COHEN, A. HÉLIOU, AND P. MERTIKOPOULOS, Learning with bandit feedback in potential games, in NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.

[16] P. DE SOUZA AND J. SILVA, Berkeley Problems in Mathematics, Problem Books in Mathematics, Springer New York, 2012.

[17] M. DIMAKOPOULOU AND B. VAN ROY, Coordinated exploration in concurrent reinforcement learning, arXiv preprint arXiv:1802.01282, (2018).

[18] D. FUDENBERG AND D. K. LEVINE, The Theory of Learning in Games, vol. 2 of Economic Learning and Social Evolution, MIT Press, Cambridge, MA, 1998.

[19] I. GOODFELLOW, Y. BENGIO, AND A. COURVILLE, Deep Learning, Adaptive Computation and Machine Learning, MIT Press, 2016.

[20] A. GROVER, M. AL-SHEDIVAT, J. K. GUPTA, Y. BURDA, AND H. EDWARDS, Evaluating generalization in multiagent systems using agent-interaction graphs, in International Conference on Autonomous Agents and Multiagent Systems, 2018.

[21] A. GROVER, M. AL-SHEDIVAT, J. K. GUPTA, Y. BURDA, AND H. EDWARDS, Learning policy representations in multiagent systems, in International Conference on Machine Learning, 2018.

[22] A. GROVER, T. MARKOV, P. ATTIA, N. JIN, N. PERKINS, B. CHEONG, M. CHEN, Z. YANG, S. HARRIS, W. CHUEH, ET AL., Best arm identification in multi-armed bandits with delayed feedback, arXiv preprint arXiv:1803.10937, (2018).

[23] E. HAZAN, Introduction to Online Convex Optimization, Foundations and Trends® in Optimization Series, Now Publishers, 2016.

[24] E. HAZAN, A.
AGARWAL, AND S. KALE, Logarithmic regret algorithms for online convex optimization, Machine Learning, 69 (2007), pp. 169–192.

[25] P. JOULANI, A. GYORGY, AND C. SZEPESVÁRI, Online learning under delayed feedback, in International Conference on Machine Learning, 2013, pp. 1453–1461.

[26] A. KALAI AND S. VEMPALA, Efficient algorithms for online decision problems, Journal of Computer and System Sciences, 71 (2005), pp. 291–307.

[27] S. KRICHENE, W. KRICHENE, R. DONG, AND A. BAYEN, Convergence of heterogeneous distributed learning in stochastic routing games, in Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, IEEE, 2015, pp. 480–487.

[28] K. LAM, W. KRICHENE, AND A. BAYEN, On learning how players learn: estimation of learning dynamics in the routing game, in Cyber-Physical Systems (ICCPS), 2016 ACM/IEEE 7th International Conference on, IEEE, 2016, pp. 1–10.

[29] I. MENACHE AND A. OZDAGLAR, Network games: Theory, models, and dynamics, Synthesis Lectures on Communication Networks, 4 (2011), pp. 1–159.

[30] P. MERTIKOPOULOS, E. V. BELMEGA, R. NEGREL, AND L. SANGUINETTI, Distributed stochastic optimization via matrix exponential learning, IEEE Trans. Signal Process., 65 (2017), pp. 2277–2290.

[31] P. MERTIKOPOULOS, C. PAPADIMITRIOU, AND G. PILIOURAS, Cycles in adversarial regularized learning, in Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2018, pp. 2703–2717.

[32] P. MERTIKOPOULOS AND Z. ZHOU, Learning in games with continuous action sets and unknown payoff functions, Mathematical Programming, (2018), pp. 1–43.

[33] V. MNIH, K. KAVUKCUOGLU, D. SILVER, A. GRAVES, I. ANTONOGLOU, D. WIERSTRA, AND M. RIEDMILLER, Playing Atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602, (2013).

[34] B. MONNOT AND G.
PILIOURAS, Limits and limitations of no-regret learning in games, The Knowledge Engineering Review, 32 (2017).

[35] Y. NESTEROV, Primal-dual subgradient methods for convex problems, Mathematical Programming, 120 (2009), pp. 221–259.

[36] G. PALAIOPANOS, I. PANAGEAS, AND G. PILIOURAS, Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos, in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., Curran Associates, Inc., 2017, pp. 5872–5882.

[37] S. PERKINS, P. MERTIKOPOULOS, AND D. S. LESLIE, Mixed-strategy learning with continuous action sets, IEEE Trans. Autom. Control, 62 (2017), pp. 379–384.

[38] C. PIKE-BURKE, S. AGRAWAL, C. SZEPESVARI, AND S. GRUNEWALDER, Bandits with delayed, aggregated anonymous feedback, in International Conference on Machine Learning, 2018, pp. 4102–4110.

[39] K. QUANRUD AND D. KHASHABI, Online learning with adversarial delays, in Advances in Neural Information Processing Systems, 2015, pp. 1270–1278.

[40] T. ROUGHGARDEN, Selfish routing and the price of anarchy, vol. 174.

[41] F. SALEHISADAGHIANI AND L. PAVEL, Distributed Nash equilibrium seeking via the alternating direction method of multipliers, IFAC-PapersOnLine, 50 (2017), pp. 6166–6171.

[42] S. SHALEV-SHWARTZ, Online learning: Theory, algorithms, and applications, PhD thesis, Hebrew University of Jerusalem, 2007.

[43] S. SHALEV-SHWARTZ ET AL., Online learning and online convex optimization, Foundations and Trends® in Machine Learning, 4 (2012), pp. 107–194.

[44] S. SHALEV-SHWARTZ AND Y. SINGER, Convex repeated games and Fenchel duality, in Advances in Neural Information Processing Systems 19, MIT Press, 2007, pp. 1265–1272.

[45] Y. SHOHAM AND K.
LEYTON-BROWN, Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations, Cambridge University Press, 2008.

[46] Y. VIOSSAT AND A. ZAPECHELNYUK, No-regret dynamics and fictitious play, Journal of Economic Theory, 148 (2013), pp. 825–842.

[47] Y. VIOSSAT AND A. ZAPECHELNYUK, No-regret dynamics and fictitious play, Journal of Economic Theory, 148 (2013), pp. 825–842.

[48] Z. YIN, K.-H. CHANG, AND R. ZHANG, Deepprobe: Information directed sequence understanding and chatbot design via recurrent neural networks, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 2131–2139.

[49] Z. YIN, V. SACHIDANANDA, AND B. PRABHAKAR, The global anchor method for quantifying linguistic shifts and domain adaptation, in Advances in Neural Information Processing Systems, 2018.

[50] Z. YIN AND Y. SHEN, On the dimensionality of word embedding, in Advances in Neural Information Processing Systems, 2018.

[51] Z. ZHOU, S. ATHEY, AND S. WAGER, Offline multi-action policy learning: Generalization and optimization, arXiv preprint arXiv:1810.04778, (2018).

[52] Z. ZHOU, N. BAMBOS, AND P. GLYNN, Dynamics on linear influence network games under stochastic environments, in International Conference on Decision and Game Theory for Security, Springer, 2016, pp. 114–126.

[53] Z. ZHOU, P. MERTIKOPOULOS, N. BAMBOS, P. W. GLYNN, AND C. TOMLIN, Countering feedback delays in multi-agent learning, in NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.

[54] Z. ZHOU, P. MERTIKOPOULOS, A. L. MOUSTAKAS, N. BAMBOS, AND P. GLYNN, Mirror descent learning in continuous games, in Decision and Control (CDC), 2017 IEEE 56th Annual Conference on, IEEE, 2017, pp. 5776–5783.

[55] Z. ZHOU, B. YOLKEN, R. A. MIURA-KO, AND N.
BAMBOS, A game-theoretical formulation of influence networks, in American Control Conference (ACC), 2016, IEEE, 2016, pp. 3802–3807.

[56] M. ZHU AND E. FRAZZOLI, Distributed robust adaptive equilibrium computation for generalized convex games, Automatica, 63 (2016), pp. 82–91.

[57] M. ZINKEVICH, Online convex programming and generalized infinitesimal gradient ascent, in ICML '03: Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 928–936.