{"title": "A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4190, "page_last": 4203, "abstract": "There has been a resurgence of interest in multiagent reinforcement learning (MARL), due partly to the recent success of deep neural networks. The simplest form of MARL is independent reinforcement learning (InRL), where each agent treats all of its experience as part of its (non-stationary) environment. In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe a meta-algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection. The meta-algorithm generalizes previous algorithms such as InRL, iterated best response, double oracle, and fictitious play. Then, we propose a scalable implementation which reduces the memory requirement using decoupled meta-solvers. 
Finally, we demonstrate the generality of the resulting policies in three partially observable settings: gridworld coordination problems, emergent language games, and poker.", "full_text": "A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

Marc Lanctot, Vinicius Zambaldi, Audrūnas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, Thore Graepel (DeepMind; lanctot@, vzambaldi@, audrunas@, angeliki@, karltuyls@, perolat@, davidsilver@, thore@ ...@google.com)

Abstract

To achieve general intelligence, agents must learn how to interact with others in a shared environment: this is the challenge of multiagent reinforcement learning (MARL). The simplest form is independent reinforcement learning (InRL), where each agent treats its experience as part of its (non-stationary) environment. In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe an algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection. The algorithm generalizes previous ones such as InRL, iterated best response, double oracle, and fictitious play. Then, we present a scalable implementation which reduces the memory requirement using decoupled meta-solvers. 
Finally, we demonstrate the generality of the resulting policies in two partially observable settings: gridworld coordination games and poker.

1 Introduction

Deep reinforcement learning combines deep learning [59] with reinforcement learning [94, 64] to compute a policy used to drive decision-making [73, 72]. Traditionally, a single agent interacts with its environment repeatedly, iteratively improving its policy by learning from its observations. Inspired by recent success in deep RL, we are now seeing a renewed interest in multiagent reinforcement learning (MARL) [90, 17, 99]. In MARL, several agents interact and learn in an environment simultaneously, either competitively such as in Go [91] and poker [39, 105, 74], cooperatively such as when learning to communicate [23, 93, 36], or in some mix of the two [60, 95, 35].

The simplest form of MARL is independent RL (InRL), where each learner is oblivious to the other agents and simply treats all the interaction as part of its ("localized") environment. Aside from the problem that these local environments are non-stationary and non-Markovian [57], resulting in a loss of convergence guarantees for many algorithms, the policies found can overfit to the other agents' policies and hence fail to generalize well. There has been relatively little work in the RL community on overfitting to the environment [102, 69], but we argue that this is particularly important in multiagent settings, where one must react dynamically based on the observed behavior of others. Classical techniques collect or approximate extra information such as the joint values [62, 19, 29, 56],

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

use adaptive learning rates [12], adjust the frequencies of updates [48, 81], or dynamically respond to the other agents' actions online [63, 50]. 
However, with the notable exceptions of very recent work [22, 80], these have focused on (repeated) matrix games and/or the fully observable case.

There have been several proposals for treating partial observability in the multiagent setting. When the model is fully known and the setting is strictly adversarial with two players, there are policy iteration methods based on regret minimization that scale very well when using domain-specific abstractions [27, 14, 46, 47], which was a major component of the expert no-limit poker AI Libratus [15]; recently these methods were combined with deep learning to create an expert no-limit poker AI called DeepStack [74]. There is a significant amount of work that deals with the case of decentralized cooperative problems [76, 79], and with the general setting by extending the notion of belief states and Bayesian updating from POMDPs [28]. These models are quite expressive, but the resulting algorithms are fairly complex. In practice, due to intractability, researchers often resort to approximate forms, by sampling or by exploiting structure, to ensure good performance [41, 2, 68].

In this paper, we introduce a new metric for quantifying the correlation effects of policies learned by independent learners, and demonstrate the severity of the overfitting problem. These coordination problems have been well studied in the fully observable cooperative case [70]: we observe similar problems in a partially observed mixed cooperative/competitive setting, and we show that the severity increases as the environment becomes more partially observed. We propose a new algorithm based on economic reasoning [82], which uses (i) deep reinforcement learning to compute best responses to a distribution over policies, and (ii) empirical game-theoretic analysis to compute new meta-strategy distributions. 
As is common in the MARL setting, we assume centralized training for decentralized execution: policies are represented as separate neural networks, and there is no sharing of gradients nor architectures among agents. The basic form uses a centralized payoff table, which is removed in the distributed, decentralized form that requires less space.

2 Background and Related Work

In this section, we start with the basic building blocks necessary to describe the algorithm. We interleave this with the most relevant previous work for our setting. Several components of the general idea have been (re)discovered many times across different research communities, each with slightly different but similar motivations. One aim here is therefore to unify the algorithms and terminology.

A normal-form game is a tuple (Π, U, n), where n is the number of players, Π = (Π_1, ···, Π_n) is the set of policies (or strategies, one for each player i ∈ [[n]], where [[n]] = {1, ···, n}), and U : Π → ℝ^n is a payoff table of utilities for each joint policy played by all players. Extensive-form games extend this formalism to the multistep sequential case (e.g. poker).

Players try to maximize their own expected utility. Each player does this by choosing a policy from Π_i, or by sampling from a mixture (distribution) over them, σ_i ∈ Δ(Π_i). In this multiagent setting, the quality of σ_i depends on the other players' strategies, and so it cannot be found nor assessed independently. Every finite extensive-form game has an equivalent normal-form game [53], but since it is exponentially larger, most algorithms have to be adapted to handle the sequential setting directly.

There are several algorithms for computing strategies. In zero-sum games (where ∀π ∈ Π, 1 · U(π) = 0, i.e. the utilities sum to zero across players), one can use e.g. 
linear programming, fictitious play [13], replicator dynamics [97], or regret minimization [8]. Some of these techniques have been extended to the extensive (sequential) form [39, 25, 54, 107], with an exponential increase in the size of the state space. However, these extensions have almost exclusively treated the two-player case, with some notable exceptions [54, 26]. Fictitious play also converges in potential games, which include cooperative (identical-payoff) games.

The double oracle (DO) algorithm [71] solves a sequence of (two-player, normal-form) subgames induced by subsets Π^t ⊂ Π at time t. A payoff matrix for the subgame G^t includes only those entries corresponding to the strategies in Π^t. At each time step t, an equilibrium σ^{*,t} is obtained for G^t, and to obtain G^{t+1} each player adds a best response π_i^{t+1} ∈ BR(σ_{-i}^{*,t}) from the full space Π_i, so for all i, Π_i^{t+1} = Π_i^t ∪ {π_i^{t+1}}. The algorithm is illustrated in Figure 1. Note that finding an equilibrium in a zero-sum game takes time polynomial in |Π^t|, and is PPAD-complete for general-sum games [89].

Figure 1: The Double Oracle Algorithm. Figure taken from [10] with authors' permission.

Clearly, DO is guaranteed to converge to an equilibrium in two-player games. But, in the worst case, the entire strategy space may have to be enumerated. For example, this is necessary for Rock-Paper-Scissors, whose only equilibrium has full support (1/3, 1/3, 1/3). However, there is evidence that support sizes shrink for many games as a function of episode length, how much hidden information is revealed, and/or the effect that information has on the payoff [65, 86, 10]. 
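The double-oracle loop just described can be sketched on Rock-Paper-Scissors. This is an illustrative sketch rather than the paper's implementation: the restricted-game equilibrium is approximated here by regret matching with exact expected utilities (one of the meta-solver families discussed later in Section 3.1) instead of an exact LP solve, and all function names are ours.

```python
# Minimal double-oracle sketch on Rock-Paper-Scissors (symmetric zero-sum).
# The restricted-game "meta-solver" is regret matching; an LP or any other
# equilibrium solver could be substituted.

RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]  # row player's payoff; the column player receives the negative

def u(payoff, a, sigma, support):
    """Expected payoff of full-game action `a` vs mixture `sigma` on `support`."""
    return sum(s * payoff[a][b] for s, b in zip(sigma, support))

def regret_matching(payoff, support, iters=1000):
    """Average strategy of regret matching in the restricted symmetric game."""
    k = len(support)
    regrets, avg = [0.0] * k, [0.0] * k
    for _ in range(iters):
        pos = [max(r, 0.0) for r in regrets]
        z = sum(pos)
        sigma = [p / z for p in pos] if z > 0 else [1.0 / k] * k
        vals = [u(payoff, support[a], sigma, support) for a in range(k)]
        ev = sum(s * v for s, v in zip(sigma, vals))
        for a in range(k):
            regrets[a] += vals[a] - ev   # instantaneous regret of arm a
            avg[a] += sigma[a]
    return [x / iters for x in avg]

support = [0]  # start from Rock only
for _ in range(5):
    sigma = regret_matching(RPS, support)
    # Oracle step: best response over the FULL action space.
    br = max(range(3), key=lambda a: u(RPS, a, sigma, support))
    if br in support:   # no improving strategy outside the support: stop
        break
    support.append(br)

print(support)                       # support grows Rock -> Paper -> Scissors
print([round(s, 3) for s in sigma])  # approximately uniform, the full-support equilibrium
```

Running this shows the worst case mentioned above: the support grows until it covers the whole strategy space, and the final meta-strategy is (approximately) the uniform equilibrium.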
Extensions of double oracle to extensive-form games have been developed [67, 9, 10], but large state spaces are still problematic due to the curse of dimensionality.

Empirical game-theoretic analysis (EGTA) is the study of meta-strategies obtained through simulation in complex games [100, 101]. An empirical game, much smaller in size than the full game, is constructed by discovering strategies and meta-reasoning about them to navigate the strategy space. This is necessary when it is prohibitively expensive to explicitly enumerate the game's strategies. Expected utilities for each joint strategy are estimated and recorded in an empirical payoff table. The empirical game is analyzed, and the simulation process continues. EGTA has been employed in trading agent competitions (TAC) and automated bidding auctions.

One study used evolutionary dynamics in the space of known expert meta-strategies in poker [83]. Recently, reinforcement learning has been used to validate strategies found via EGTA [104]. In this work, we aim to discover new strategies through learning. However, instead of computing exact best responses, we compute approximate best responses using reinforcement learning. A few epochs of this approach were demonstrated in continuous double auctions using tile coding [87]. This work follows up on that line, running more epochs, using modern function approximators (deep networks) and a scalable implementation, and focusing on finding policies that can generalize across contexts.

A key development in recent years is deep learning [59]. While most work in deep learning has focused on supervised learning, impressive results have recently been shown using deep neural networks for reinforcement learning, e.g. [91, 38, 73, 77]. For instance, Mnih et al. [73] train policies for playing Atari video games and for 3D navigation [72], given only screenshots. Silver et al. 
introduced AlphaGo [91, 92], combining deep RL with Monte Carlo tree search, outperforming human experts. Computing approximate responses is more computationally feasible, and fictitious play can handle approximations [42, 61]. It is also more biologically plausible given natural constraints of bounded rationality. In behavioral game theory [103], the focus is to predict actions taken by humans, and the responses are intentionally constrained to increase predictive ability; a recent work uses a deep learning architecture [34]. The work that most closely resembles ours is level-k thinking [20], where level-k agents respond to level k−1 agents; closer still is cognitive hierarchy [18], in which responses are to distributions over levels {0, 1, ..., k−1}. However, our goals and motivations are very different: we use the setup as a means to produce more general policies, rather than to predict human behavior. Furthermore, we consider the sequential setting rather than normal-form games. Lastly, there have been several studies in the literature on co-evolutionary algorithms, specifically on how learning cycles and overfitting to the current populations can be mitigated [78, 85, 52].

3 Policy-Space Response Oracles

We now present our main conceptual algorithm, policy-space response oracles (PSRO). The algorithm is a natural generalization of double oracle where the meta-game's choices are policies rather than actions. It also generalizes Fictitious Self-Play [39, 40]. Unlike previous work, any meta-solver can be plugged in to compute a new meta-strategy. In practice, parameterized policies (function approximators) are used to generalize across the state space without requiring any domain knowledge. The process is summarized in Algorithm 1. 
The meta-game is represented as an empirical game, starting with a single policy (uniform random) and growing, each epoch, by adding policies ("oracles") that approximate best responses to the meta-strategy of the other players.

Algorithm 1: Policy-Space Response Oracles
input: initial policy sets for all players, Π
Compute expected utilities U^Π for each joint π ∈ Π
Initialize meta-strategies σ_i = UNIFORM(Π_i)
while epoch e in {1, 2, ···} do
    for player i ∈ [[n]] do
        for many episodes do
            Sample π_{-i} ~ σ_{-i}
            Train oracle π'_i over ρ ~ (π'_i, π_{-i})
        Π_i = Π_i ∪ {π'_i}
    Compute missing entries in U^Π from Π
    Compute a meta-strategy σ from U^Π
Output current solution strategy σ_i for player i

Algorithm 2: Deep Cognitive Hierarchies
input: player number i, level k
while not terminated do
    CHECKLOADMS({j | j ∈ [[n]], j ≠ i}, k)
    CHECKLOADORACLES(j ∈ [[n]], k' ≤ k)
    CHECKSAVEMS(σ_{i,k})
    CHECKSAVEORACLE(π_{i,k})
    Sample π_{-i} ~ σ_{-i,k}
    Train oracle π_{i,k} over ρ_1 ~ (π_{i,k}, π_{-i})
    if iteration number mod T_ms = 0 then
        Sample π_i ~ σ_{i,k}
        Compute u_i(ρ_2), where ρ_2 ~ (π_i, π_{-i})
        Update stats for π_i and update σ_{i,k}
Output σ_{i,k} for player i at level k

In (episodic) partially observable multiagent environments, when the other players are fixed the environment becomes Markovian and computing a best response reduces to solving a form of MDP [30]. Thus, any reinforcement learning algorithm can be used. We use deep neural networks due to the recent success in reinforcement learning. 
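The reduction just noted (with the other players fixed, the learner faces an ordinary single-agent problem) can be sketched as follows. The environment classes and the Monte Carlo value estimator below are illustrative stand-ins of our own devising, not the paper's gridworld environments or Reactor oracles.

```python
import random

# Sketch: fixing the opponents turns a multiagent environment into a
# single-agent one. The toy one-shot matrix game is for illustration only.

RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

class MatrixGameEnv:
    """One-shot two-player zero-sum matrix game; both players act simultaneously."""
    def step(self, a1, a2):
        return RPS[a1][a2]   # reward to player 1

class FixedOpponentEnv:
    """Single-agent view: an opponent policy is sampled from a fixed
    meta-strategy sigma at every episode, as in the PSRO oracle step."""
    def __init__(self, env, opponent_policies, sigma):
        self.env, self.pis, self.sigma = env, opponent_policies, sigma
    def step(self, a):
        pi = random.choices(self.pis, weights=self.sigma)[0]
        return self.env.step(a, pi())

# Opponent mixture: 80% Rock, 20% Paper.
env = FixedOpponentEnv(MatrixGameEnv(), [lambda: 0, lambda: 1], [0.8, 0.2])

# "Any reinforcement learning algorithm can be used": here, simple
# Monte Carlo action-value estimates with uniform exploration.
random.seed(0)
n, q = [0] * 3, [0.0] * 3
for _ in range(3000):
    a = random.randrange(3)
    r = env.step(a)
    n[a] += 1
    q[a] += (r - q[a]) / n[a]    # incremental mean

best = max(range(3), key=lambda a: q[a])
print(best)   # Paper (action 1) is the best response to a mostly-Rock mixture
```

The same wrapper idea extends to the episodic, partially observable case: once the opponents' policies are sampled and held fixed, the learner's transition and reward distributions are stationary.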
In each episode, one player is set to oracle (learning) mode to train π'_i, and a fixed policy is sampled from the opponents' meta-strategies (π_{-i} ~ σ_{-i}). At the end of the epoch, the new oracles are added to their policy sets Π_i, and expected utilities for new policy combinations are computed via simulation and added to the empirical tensor U^Π, which takes time exponential in |Π|.

Define Π^T = Π^{T-1} ∪ π' as the policy space including the currently learning oracles, and |σ_i| = |Π_i^T| for all i ∈ [[n]]. Iterated best response is an instance of PSRO with σ_{-i} = (0, 0, ···, 1, 0). Similarly, independent RL and fictitious play are instances of PSRO with σ_{-i} = (0, 0, ···, 0, 1) and σ_{-i} = (1/K, 1/K, ···, 1/K, 0), respectively, where K = |Π_{-i}^{T-1}|. Double oracle is an instance of PSRO with n = 2 and σ^T set to a Nash equilibrium profile of the meta-game (Π^{T-1}, U^{Π^{T-1}}).

An exciting question is what can happen with (non-fixed) meta-solvers outside this known space. Fictitious play is agnostic to the policies it is responding to; hence it can only sharpen the meta-strategy distribution by repeatedly generating the same best responses. On the other hand, responses to equilibrium strategies computed by double oracle will (i) overfit to a specific equilibrium in the n-player or general-sum case, and (ii) be unable to generalize to parts of the space not reached by any equilibrium strategy in the zero-sum case. Both of these are undesirable when computing general policies that should work well in any context. 
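The special-case meta-strategies listed above can be written out explicitly. This is a minimal sketch: the helper name is ours, and we follow the convention from the text that the final slot of the distribution corresponds to the currently-learning oracle.

```python
# Opponent meta-strategies that recover known algorithms as instances of
# PSRO: a distribution over the opponent's K fixed policies plus one
# currently-learning oracle (the trailing slot).

def meta_strategy(kind, K):
    """Return sigma_{-i} of length K + 1 for the named special case."""
    if kind == "iterated_best_response":   # respond to the latest fixed policy
        return [0.0] * (K - 1) + [1.0, 0.0]
    if kind == "independent_rl":           # respond to the learning policy itself
        return [0.0] * K + [1.0]
    if kind == "fictitious_play":          # uniform over all fixed policies
        return [1.0 / K] * K + [0.0]
    # Double oracle (n = 2) would instead place a Nash equilibrium of the
    # empirical game over the K fixed policies; that requires a solver.
    raise ValueError(kind)

for kind in ("iterated_best_response", "independent_rl", "fictitious_play"):
    print(kind, meta_strategy(kind, K=4))
```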
We try to balance these problems of overfitting with a compromise: meta-strategies with full support that force (mix in) γ exploration over policy selection.

3.1 Meta-Strategy Solvers

A meta-strategy solver takes as input the empirical game (Π, U^Π) and produces a meta-strategy σ_i for each player i. We try three different solvers: regret matching, Hedge, and projected replicator dynamics. These meta-solvers accumulate values for each policy ("arm") and an aggregate value based on all players' meta-strategies. We refer to u_i(σ) as player i's expected value given all players' meta-strategies and the current empirical payoff tensor U^Π (computed via multiple tensor dot products). Similarly, denote u_i(π_{i,k}, σ_{-i}) as the expected utility if player i plays their kth ∈ [[K]] ∪ {0} policy and the other players play with their meta-strategy σ_{-i}. Our strategies use an exploration parameter γ, leading to a lower bound of γ/(K+1) on the probability of selecting any π_{i,k}.

The first two meta-solvers (regret matching and Hedge) are straightforward applications of previous algorithms, so we defer the details to Appendix A.¹ Here, we introduce a new solver we call projected replicator dynamics (PRD). From Appendix A, when using the asymmetric replicator dynamics, e.g. 
with two players, where U^Π = (A, B), the changes in probabilities for the kth component (i.e., the policy π_{i,k}) of the meta-strategies (σ_1, σ_2) = (x, y) are:

dx_k/dt = x_k[(Ay)_k − xᵀAy],    dy_k/dt = y_k[(xᵀB)_k − xᵀBy].

To simulate the replicator dynamics in practice, discretized updates are computed using a step size of δ. We add a projection operator P(·) to these equations that guarantees exploration: x ← P(x + δ dx/dt), y ← P(y + δ dy/dt), where P(x) = argmin_{x' ∈ Δ^{K+1}_γ} ||x' − x|| if any x_k < γ/(K+1), or x otherwise, and Δ^{K+1}_γ = {x | x_k ≥ γ/(K+1), Σ_k x_k = 1} is the γ-exploratory simplex of size K+1. This enforces exploratory σ_i(π_{i,k}) ≥ γ/(K+1). The PRD approach can be understood as directing exploration, in contrast to standard replicator dynamics approaches that contain isotropic diffusion or mutation terms (which assume undirected and unbiased evolution); for more details see [98].

¹Appendices are available in the longer technical report version of the paper; see [55].

3.2 Deep Cognitive Hierarchies

Figure 2: Overview of DCH

While the generality of PSRO is clear and appealing, the RL step can take a long time to converge to a good response. In complex environments, much of the basic behavior that was learned in one epoch may need to be relearned when starting again from scratch; also, it may be desirable to run many epochs to get oracle policies that can recursively reason through deeper levels of contingencies. To overcome these problems, we introduce a practical parallel form of PSRO. 
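The PRD update of Section 3.1 can be sketched concretely as follows. This is our illustrative sketch, not the paper's implementation: we use a standard sort-based Euclidean simplex projection, shifted by the lower bound γ/(K+1), and matching-pennies payoffs chosen only for demonstration.

```python
# One discretized projected replicator dynamics (PRD) step for the
# two-player case, with projection onto the gamma-exploratory simplex.

def proj_simplex(v, total):
    """Euclidean projection of v onto {y >= 0, sum(y) = total} (sort-based)."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - total) / j
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def project_exploratory(x, gamma):
    """P(x): nearest point of the gamma-exploratory simplex."""
    k = len(x)                 # k = K + 1 slots, so the floor is gamma/(K+1)
    low = gamma / k
    shifted = [xi - low for xi in x]
    y = proj_simplex(shifted, 1.0 - k * low)
    return [yi + low for yi in y]

def prd_step(x, y, A, B, delta=0.05, gamma=0.2):
    """One discretized update of (sigma_1, sigma_2) = (x, y) under PRD."""
    Ay  = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
    xTB = [sum(x[i] * B[i][j] for i in range(len(x))) for j in range(len(y))]
    vx = sum(xi * ai for xi, ai in zip(x, Ay))    # x^T A y
    vy = sum(yj * bj for yj, bj in zip(y, xTB))   # x^T B y
    dx = [xi * (ai - vx) for xi, ai in zip(x, Ay)]
    dy = [yj * (bj - vy) for yj, bj in zip(y, xTB)]
    x2 = project_exploratory([xi + delta * d for xi, d in zip(x, dx)], gamma)
    y2 = project_exploratory([yj + delta * d for yj, d in zip(y, dy)], gamma)
    return x2, y2

# Matching pennies (zero-sum: B = -A), started away from equilibrium.
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
x, y = [0.9, 0.1], [0.5, 0.5]
for _ in range(200):
    x, y = prd_step(x, y, A, B)
print(x, y)  # both remain valid gamma-exploratory distributions
```

Note that the projection only moves a point when some component has fallen below the floor γ/(K+1); a feasible point is returned unchanged, matching the definition of P(·) above.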
Instead of an unbounded number of epochs, we choose a fixed number of levels in advance. Then, for an n-player game, we start nK processes in parallel (level 0 agents are uniform random): each one trains a single oracle policy π_{i,k} for player i and level k and updates its own meta-strategy σ_{i,k}, saving each to a central disk periodically. Each process also maintains copies of all the other oracle policies π_{j,k'≤k} at the current and lower levels, as well as the meta-strategies at the current level, σ_{-i,k}, which are periodically refreshed from a central disk. We circumvent storing U^Π explicitly by updating the meta-strategies online. We call this a deep cognitive hierarchy (DCH), in reference to Camerer, Ho, & Chong's model augmented with deep RL. Example oracle response dynamics are shown in Figure 2, and the pseudo-code is given in Algorithm 2.

Since each process uses slightly out-of-date copies of the other processes' policies and meta-strategies, DCH approximates PSRO. Specifically, it trades away accuracy of the correspondence to PSRO for practical efficiency and, in particular, scalability. Another benefit of DCH is an asymptotic reduction in total space complexity. In PSRO, for K policies and n players, the space required to store the empirical payoff tensor is K^n. Each process in DCH stores nK policies of fixed size, and n meta-strategies (and other tables) of size bounded by k ≤ K. Therefore the total space required is O(nK · (nK + nK)) = O(n²K²). This is possible due to the use of decoupled meta-solvers, which compute strategies online without requiring a payoff tensor U^Π, and which we describe now.

3.2.1 Decoupled Meta-Strategy Solvers

In the field of online learning, experts algorithms (the "full information" case) receive information about each arm at every round. 
In the bandit ("partial information") case, feedback is given only for the arm that was pulled. Decoupled meta-solvers are essentially sample-based adversarial bandits [16] applied to games. Empirical strategies are known to converge to Nash equilibria in certain classes of games (e.g. zero-sum and potential games) due to the folk theorem [8].

We try three: decoupled regret matching [33], Exp3 (decoupled Hedge) [3], and decoupled PRD. Here again, we use exploratory strategies with weight γ on the uniform strategy mixed in, which is also necessary to ensure that the estimates are unbiased. For decoupled PRD, we maintain running averages of the overall average value and the value of each arm (policy). Unlike in PSRO, in DCH one sample is obtained at a time and the meta-strategy is updated periodically from online estimates.

4 Experiments

In all of our experiments, oracles use Reactor [31] for learning, which has achieved state-of-the-art results in Atari game-playing. Reactor uses Retrace(λ) [75] for off-policy policy evaluation and β-Leave-One-Out policy gradient for policy updates, and supports recurrent network training, which could be important in trying to match online experiences to those observed during training.

The action spaces for each player are identical, but the algorithms do not require this. Our implementation differs slightly from the conceptual descriptions in Section 3; see App. C for details.

First-Person Gridworld Games. Each agent has a local field of view (making the world partially observable): it sees 17 spaces in front, 10 to either side, and 2 spaces behind. Consequently, observations are encoded as 21x20x3 RGB tensors with values 0–255. 
Each agent has a choice of turning left or right, moving forward or backward, stepping left or right, not moving, or casting an endless light beam in its current direction. In addition, the agent has two composed actions of moving forward and turning. Actions are executed simultaneously, and the order of resolution is randomized. Agents start on a random spawn point at the beginning of each episode. If an agent is touched ("tagged") by another agent's light beam twice, then the target agent is immediately teleported to a spawn point. In laser tag, the source agent then receives 1 point of reward for the tag. In another variant, gathering, there is no tagging, but agents can collect apples, for 1 point per apple, which refresh at a fixed rate. In pathfind, there is no tagging nor apples, and both agents get 1 point of reward when both reach their destinations, ending the episode. In every variant, an episode consists of 1000 steps of simulation. Other details, such as specific maps, can be found in Appendix D.

Leduc Poker is a common benchmark in Poker AI, consisting of a six-card deck: two suits with three cards (Jack, Queen, King) each. Each player antes 1 chip to play, and receives one private card. There are two rounds of betting, with a maximum of two raises each, whose values are 2 and 4 chips respectively. After the first round of betting, a single public card is revealed. The input is represented as in [40], which includes one-hot encodings of the private card, public card, and history of actions. Note that we use a more difficult version than in previous work; see Appendix D.1 for details.

4.1 Joint Policy Correlation in Independent Reinforcement Learning

To identify the effect of overfitting in independent reinforcement learners, we introduce joint policy correlation (JPC) matrices. 
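JPC matrices are defined in detail below; as a preview, the headline statistic computed from such a matrix, the average proportional loss R− = (D̄ − Ō)/D̄, can be sketched with a few lines of code. The matrix here is synthetic, for illustration only.

```python
# Sketch of the JPC summary statistic: the average proportional loss
# R- = (Dbar - Obar)/Dbar, where Dbar and Obar are the means of the
# diagonal and off-diagonal entries of a D x D matrix of mean returns.

def jpc_loss(M):
    D = len(M)
    diag = [M[d][d] for d in range(D)]
    off = [M[i][j] for i in range(D) for j in range(D) if i != j]
    d_bar = sum(diag) / len(diag)
    o_bar = sum(off) / len(off)
    return (d_bar - o_bar) / d_bar

# Synthetic example: policies that trained together (diagonal) coordinate
# well; pairs drawn from different training instances earn less.
M = [[30.0, 20.0, 21.0],
     [19.0, 31.0, 20.0],
     [22.0, 18.0, 29.0]]
print(round(jpc_loss(M), 3))  # -> 0.333
```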
To simplify the presentation, we describe here the special case of symmetric two-player games with non-negative rewards; for a general description, see Appendix B.2. Values are obtained by running D instances of the same experiment, differing only in the seed used to initialize the random number generators. Each experiment d ∈ [[D]] (after many training episodes) produces policies (π_1^d, π_2^d). Each entry of the resulting D × D matrix shows the mean return over T = 100 episodes, (1/T) Σ_{t=1}^T (R_1^t + R_2^t), obtained when player 1 uses row policy π_1^{d_i} and player 2 uses column policy π_2^{d_j}. Hence, entries on the diagonal represent returns for policies that learned together (i.e., in the same instance), while off-diagonal entries show returns from policies that trained in separate instances.

Figure 3: Example JPC matrices for InRL on Laser Tag small2 map (left) and small4 (right).

From a JPC matrix, we compute an average proportional loss in reward as R− = (D̄ − Ō)/D̄, where D̄ is the mean value of the diagonals and Ō is the mean value of the off-diagonals. E.g., in Figure 3: D̄ = 30.44, Ō = 20.03, R− = 0.342. Even in a simple domain with almost full observability (small2), an independently-learned policy could expect to lose 34.2% of its reward when playing with another independently-learned policy, even though it was trained under identical circumstances! This clearly demonstrates an important problem with independent learners. In the other variants (gathering and pathfind), we observe no JPC problem, presumably because coordination is not required and the policies are independent. Results are summarized in Table 1. We have also noticed similar effects 
We have also noticed similar effects\nwhen using DQN [73] as the oracle training algorithm; see Appendix B.1 for example videos.\n\n6\n\n01234Player #243210Player #125.127.329.65.329.827.331.727.630.626.214.612.930.37.323.829.930.817.811.015.530.730.923.93.79.261218243001234Player #243210Player #10.53.50.63.919.23.62.64.120.54.718.22.320.06.02.612.218.211.30.88.223.02.918.57.10.648121620\fEnvironment Map\n\nInRL\n\u00afO\n\n\u00afD\n\nLaser Tag\nLaser Tag\nLaser Tag\nGathering\nPath\ufb01nd\n\nsmall2\nsmall3\nsmall4\n\ufb01eld\nmerge\nTable 1: Summary of JPC results in \ufb01rst-person gridworld games.\n\n30.44\n23.06\n20.15\n147.34\n108.73\n\n20.03\n9.06\n5.71\n146.89\n106.32\n\nDCH(Reactor, 2, 10)\n\u00afD\nR\u2212\n0.055\n0.082\n0.150\n0.007\n< 0\n\n26.63\n23.45\n15.90\n138.74\n91.492\n\n28.20\n27.00\n18.72\n139.70\n90.15\n\nR\u2212\n0.342\n0.625\n0.717\n0.003\n0.022\n\n\u00afO\n\nJPC Reduction\n\n28.7 %\n54.3 %\n56.7 %\n\n\u2013\n\u2013\n\nWe see that a (level 10) DCH agent reduces the JPC problem signi\ufb01cantly. On small2, DCH reduces\nthe expected loss down to 5.5%, 28.7% lower than independent learners. The problem gets larger as\nthe map size grows and problem becomes more partially observed, up to a severe 71.7% average loss.\nThe reduction achieved by DCH also grows from 28.7% to 56.7%.\nIs the Meta-Strategy Necessary During Execution? The \ufb01gures above represent the fully-mixed\nstrategy \u03c3i,10. We also analyze JPC for only the highest-level policy \u03c0i,10 in the laser tag levels. The\nvalues are larger here: R\u2212 = 0.147, 0.27, 0.118 for small2-4 respectively, showing the importance of\nthe meta-strategy. However, these are still signi\ufb01cant reductions in JPC: 19.5%, 36.5%, 59.9%.\nHow Many Levels? On small4, we also compute values for level 5 and level 3: R\u2212 = 0.156 and\nR\u2212 = 0.246, corresponding to reductions in JPC of 56.1% and 44%, respectively. 
Level 5 reduces\nJPC by a similar amount as level 10 (56.1% vs 56.7%), while level 3 less so (44% vs. 56.1%.)\n\n4.2 Learning to Safely Exploit and Indirectly Model Opponents in Leduc Poker\n\ni max\u03c3(cid:48)\n\ni\u2208\u03a3i ui(\u03c3(cid:48)\n\nis commonly used in poker AI: NASHCONV(\u03c3) =(cid:80)n\n\nWe now show results for a Leduc poker where strong benchmark algorithms exist, such as counter-\nfactual regret (CFR) minimization [107, 11]. We evaluate our policies using two metrics: the \ufb01rst\nis performance against \ufb01xed players (random, CFR\u2019s average strategy after 500 iterations \u201ccfr500\u201d,\nand a puri\ufb01ed version of \u201ccfr500pure\u201d that chooses the action with highest probability.) The second\ni, \u03c3\u2212i) \u2212 ui(\u03c3), represent-\ning how much can be gained by deviating to their best response (unilaterally), a value that can be\ninterpreted as a distance from a Nash equilibrium (called exploitability in the two-player setting).\nNashConv is easy to compute in small enough games [45]; for CFR\u2019s values see Appendix E.1.\nEffect of Exploration and Meta-Strategy Overview. We now analyze the effect of the various\nmeta-strategies and exploration parameters. In Figure 4, we measure the mean area-under-the-curve\n(MAUC) of the NashConv values for the last (right-most) 32 values in the NashConv graph, and\nexploration rate of \u03b3 = 0.4. Figures for the other values of \u03b3 are in Appendix E, but we found this\nvalue of \u03b3 works best for minimizing NashConv. Also, we found that decoupled replicator dynamics\nworks best, followed by decoupled regret-matching and Exp3. Also, it seems that the higher the level,\nthe lower the resulting NashConv value is, with diminishing improvements. For exploitation, we\nfound that \u03b3 = 0.1 was best, but the meta-solvers seemed to have little effect (see Figure 10.)\nComparison to Neural Fictitious Self-Play. 
We now compare to Neural Fictitious Self-Play (NFSP) [40], an implementation of fictitious play in sequential games using reinforcement learning. Note that NFSP, PSRO, and DCH are all sample-based learning algorithms that use general function approximation, whereas CFR is a tabular method that requires a full game-tree pass per iteration. NashConv graphs are shown for {2,3}-player games in Figure 5, and performance vs. fixed bots in Figure 6.

[Figure 4: (a) Effect of DCH parameters on NashConv in 2-player Leduc poker. Left: decoupled PRD; middle: decoupled RM; right: Exp3. (b) MAUC of the exploitation graph against cfr500.]

[Figure 5: Exploitability for NFSP, DCH, and PSRO. (a) 2 players; (b) 3 players.]

[Figure 6: Evaluation against a fixed set of bots; each data point is an average of the four latest values. (a) Random bots as reference set; (b) 2-player CFR500 bots as reference set; (c) 3-player CFR500 bots as reference set.]

We observe that DCH (and PSRO) converge faster than NFSP at the start of training, possibly due to a better meta-strategy than the uniform random one used in fictitious play. The convergence curves eventually plateau: DCH in the two-player game is most affected, possibly due to the asynchronous nature of its updates, and NFSP converges to a lower exploitability in later episodes. We believe this is due to NFSP's ability to learn a more accurate mixed average strategy at states far down the tree, which is particularly important in poker, whereas DCH and PSRO mix at the top over full policies. On the other hand, we see that PSRO/DCH are able to achieve higher performance against the fixed players.
Presumably, this is because the policies produced by PSRO/DCH are better able to recognize flaws in the weaker opponents' policies, since the oracles are specifically trained for this, and to dynamically adapt to the exploitative response during the episode. So, NFSP is computing a safe equilibrium, while PSRO/DCH may be trading convergence precision for the ability to adapt to the range of play observed during training, in this context computing a robust counter-strategy [44, 24].

5 Conclusion and Future Work

In this paper, we quantify a severe problem with independent reinforcement learners, joint-policy correlation (JPC), that limits the generality of these approaches. We describe a generalized algorithm for multiagent reinforcement learning that subsumes several previous algorithms. In our experiments, we show that PSRO/DCH produces general policies that significantly reduce JPC in partially observable coordination games, and robust counter-strategies that safely exploit opponents in a common competitive imperfect-information game. The generality offered by PSRO/DCH can be seen as a form of "opponent/teammate regularization", an effect that has also been observed recently in practice [66, 5]. We emphasize the game-theoretic foundations of these techniques, which we hope will inspire further investigation into algorithm development for multiagent reinforcement learning.
In future work, we will consider maintaining diversity among oracles via loss penalties based on policy dissimilarity, general response-graph topologies, environments such as emergent language games [58] and RTS games [96, 84], and other architectures for predicting behavior, such as opponent modeling [37] and imagining future states via auxiliary tasks [43].
We would also like to investigate fast online adaptation [1, 21] and the relationship to computational Theory of Mind [106, 4], as well as generalized (transferable) oracles over similar opponent policies using successor features [6].
Acknowledgments. We would like to thank DeepMind and Google for providing an excellent research environment that made this work possible. We would also like to thank the anonymous reviewers and several people for helpful comments: Johannes Heinrich, Guy Lever, Remi Munos, Joel Z. Leibo, Janusz Marecki, Tom Schaul, Noam Brown, Kevin Waugh, Georg Ostrovski, Sriram Srinivasan, Neil Rabinowitz, and Vicky Holgate.

References

[1] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. CoRR, abs/1710.03641, 2017.

[2] Christopher Amato and Frans A. Oliehoek. Scalable planning and learning for multiagent POMDPs. In AAAI 2015, pages 1995–2002, January 2015.

[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331, 1995.

[4] C. L. Baker, R. R. Saxe, and J. B. Tenenbaum. Bayesian theory of mind: Modeling joint belief-desire attribution. In Proceedings of the Thirty-Third Annual Conference of the Cognitive Science Society, pages 2469–2474, 2011.

[5] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch.
Emergent complexity via multi-agent competition. CoRR, abs/1710.03748, 2017.

[6] André Barreto, Will Dabney, Rémi Munos, Jonathan Hunt, Tom Schaul, David Silver, and Hado van Hasselt. Transfer in reinforcement learning with successor features and generalised policy improvement. In Proceedings of the Thirty-First Annual Conference on Neural Information Processing Systems (NIPS 2017), 2017. To appear. Preprint available at http://arxiv.org/abs/1606.05312.

[7] Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research (JAIR), 53:659–697, 2015.

[8] A. Blum and Y. Mansour. Learning, regret minimization, and equilibria. In Algorithmic Game Theory, chapter 4. Cambridge University Press, 2007.

[9] Branislav Bosansky, Viliam Lisy, Jiri Cermak, Roman Vitek, and Michal Pechoucek. Using double-oracle method and serialized alpha-beta search for pruning in simultaneous moves games. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI), 2013.

[10] Branislav Bošanský, Viliam Lisý, Marc Lanctot, Jiří Čermák, and Mark H. M. Winands. Algorithms for computing strategies in two-player simultaneous move games. Artificial Intelligence, 237:1–40, 2016.

[11] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up Limit Hold'em Poker is solved. Science, 347(6218):145–149, January 2015.

[12] Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136:215–250, 2002.

[13] G. W. Brown. Iterative solutions of games by fictitious play. In T. C. Koopmans, editor, Activity Analysis of Production and Allocation, pages 374–376. John Wiley & Sons, Inc., 1951.

[14] Noam Brown, Sam Ganzfried, and Tuomas Sandholm.
Hierarchical abstraction, distributed equilibrium computation, and post-processing, with application to a champion no-limit Texas Hold'em agent. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 7–15. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

[15] Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. CoRR, abs/1705.02955, 2017.

[16] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[17] L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2):156–172, 2008.

[18] Colin F. Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 2004.

[19] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 746–752, 1998.

[20] M. A. Costa-Gomes and V. P. Crawford. Cognition and behavior in two-person guessing games: An experimental study. The American Economic Review, 96(6):1737–1768, 2006.

[21] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[22] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip H. S. Torr, Pushmeet Kohli, and Shimon Whiteson.
Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.

[23] Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.

[24] Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation. ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015. Special issue on selected papers from EC-12.

[25] N. Gatti, F. Panozzo, and M. Restelli. Efficient evolutionary dynamics with extensive-form games. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 335–341, 2013.

[26] Richard Gibson. Regret minimization in non-zero-sum games with applications to building champion multiplayer computer poker agents. CoRR, abs/1305.0034, 2013.

[27] A. Gilpin. Algorithms for Abstracting and Solving Imperfect Information Games. PhD thesis, Carnegie Mellon University, 2009.

[28] P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.

[29] Amy Greenwald and Keith Hall. Correlated Q-learning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pages 242–249, 2003.

[30] Amy Greenwald, Jiacui Li, and Eric Sodomka. Solving for best responses and equilibria in extensive-form games with reinforcement learning methods. In Rohit Parikh on Logic, Language and Society, volume 11 of Outstanding Contributions to Logic, pages 185–226. Springer International Publishing, 2017.

[31] Audrunas Gruslys, Mohammad Gheshlaghi Azar, Marc G. Bellemare, and Remi Munos. The Reactor: A sample-efficient actor-critic architecture. CoRR, abs/1704.04651, 2017.

[32] S. Hart and A.
Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.

[33] Sergiu Hart and Andreu Mas-Colell. A reinforcement procedure leading to correlated equilibrium. In Economics Essays: A Festschrift for Werner Hildenbrand. Springer Berlin Heidelberg, 2001.

[34] Jason S. Hartford, James R. Wright, and Kevin Leyton-Brown. Deep learning for predicting human strategic behavior. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.

[35] Matthew Hausknecht and Peter Stone. Deep reinforcement learning in parameterized action space. In Proceedings of the International Conference on Learning Representations (ICLR), May 2016.

[36] Matthew John Hausknecht. Cooperation and communication in multiagent deep reinforcement learning. PhD thesis, University of Texas at Austin, Austin, USA, 2016.

[37] He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 1804–1813, 2016.

[38] Nicolas Heess, Gregory Wayne, David Silver, Timothy P. Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2944–2952, 2015.

[39] Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious self-play in extensive-form games. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), 2015.

[40] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. CoRR, abs/1603.01121, 2016.

[41] Trong Nghia Hoang and Kian Hsiang Low.
Interactive POMDP lite: Towards practical planning to predict and exploit intentions for interacting with self-interested agents. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, pages 2298–2305. AAAI Press, 2013.

[42] Josef Hofbauer and William H. Sandholm. On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294, 11 2002.

[43] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016.

[44] M. Johanson, M. Zinkevich, and M. Bowling. Computing robust counter-strategies. In Advances in Neural Information Processing Systems 20 (NIPS), pages 1128–1135, 2008. A longer version is available as a University of Alberta Technical Report, TR07-15.

[45] Michael Johanson, Michael Bowling, Kevin Waugh, and Martin Zinkevich. Accelerating best response calculation in large extensive games. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), pages 258–265, 2011.

[46] Michael Johanson, Neil Burch, Richard Valenzano, and Michael Bowling. Evaluating state-space abstractions in extensive-form games. In Proceedings of the Twelfth International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2013.

[47] Michael Bradley Johanson. Robust Strategies and Counter-Strategies: From Superhuman to Optimal Play. PhD thesis, University of Alberta, 2016. http://johanson.ca/publications/theses/2016-johanson-phd-thesis/2016-johanson-phd-thesis.pdf.

[48] Michael Kaisers and Karl Tuyls. Frequency adjusted multi-agent Q-learning. In 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), Toronto, Canada, May 10-14, 2010, Volume 1-3, pages 309–316, 2010.

[49] Diederik Kingma and Jimmy Ba.
Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[50] M. Kleiman-Weiner, M. K. Ho, J. L. Austerweil, M. L. Littman, and J. B. Tenenbaum. Coordinate to cooperate or compete: Abstract goals and joint intentions in social interaction. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, 2016.

[51] D. Koller, N. Megiddo, and B. von Stengel. Fast algorithms for finding randomized strategies in game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing (STOC '94), pages 750–759, 1994.

[52] Kostas Kouvaris, Jeff Clune, Loizos Kounios, Markus Brede, and Richard A. Watson. How evolution learns to generalise: Using the principles of learning theory to understand the evolution of developmental organisation. PLOS Computational Biology, 13(4):1–20, 04 2017.

[53] H. W. Kuhn. Extensive games and the problem of information. Contributions to the Theory of Games, 2:193–216, 1953.

[54] Marc Lanctot. Further developments of extensive-form replicator dynamics using the sequence-form representation. In Proceedings of the Thirteenth International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1257–1264, 2014.

[55] Marc Lanctot, Vinicius Zambaldi, Audrūnas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. CoRR, abs/1711.00832, 2017.

[56] M. Lauer and M. Riedmiller. Reinforcement learning for stochastic cooperative multi-agent systems. In Proceedings of AAMAS '04, New York, 2004.

[57] Guillaume J. Laurent, Laëtitia Matignon, and Nadine Le Fort-Piat. The world of independent learners is not Markovian. International Journal of Knowledge-Based and Intelligent Engineering Systems, 15(1):55–64, March 2011.

[58] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni.
Multi-agent cooperation and the emergence of (natural) language. In Proceedings of the International Conference on Learning Representations (ICLR), April 2017.

[59] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.

[60] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the Sixteenth International Conference on Autonomous Agents and Multiagent Systems, 2017.

[61] David S. Leslie and Edmund J. Collins. Generalised weakened fictitious play. Games and Economic Behavior, 56(2):285–298, 2006.

[62] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163. Morgan Kaufmann, 1994.

[63] Michael L. Littman. Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 322–328, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[64] Michael L. Littman. Reinforcement learning improves behaviour from evaluative feedback. Nature, 521:445–451, 2015.

[65] J. Long, N. R. Sturtevant, M. Buro, and T. Furtak. Understanding the success of perfect information Monte Carlo sampling in game tree search. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 134–140, 2010.

[66] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. CoRR, abs/1706.02275, 2017.

[67] M. Zinkevich, M. Bowling, and N. Burch. A new algorithm for generating equilibria in massive zero-sum games.
In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), 2007.

[68] Janusz Marecki, Tapana Gupta, Pradeep Varakantham, Milind Tambe, and Makoto Yokoo. Not all agents are equal: Scaling up distributed POMDPs for agent networks. In Proceedings of the Seventh International Joint Conference on Autonomous Agents and Multi-agent Systems, 2008.

[69] Vukosi N. Marivate. Improved Empirical Methods in Reinforcement Learning Evaluation. PhD thesis, Rutgers, New Brunswick, New Jersey, 2015.

[70] L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems. The Knowledge Engineering Review, 27(01):1–31, 2012.

[71] H. B. McMahan, G. Gordon, and A. Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.

[72] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1928–1937, 2016.

[73] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

[74] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 358(6362), October 2017.

[75] R.
Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.

[76] Ranjit Nair. Coordinating multiagent teams in uncertain domains using distributed POMDPs. PhD thesis, University of Southern California, Los Angeles, USA, 2004.

[77] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder P. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2863–2871, 2015.

[78] F. A. Oliehoek, E. D. de Jong, and N. Vlassis. The parallel Nash memory for asymmetric games. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 2006.

[79] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems. Springer, 2016. Authors' pre-print.

[80] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.

[81] Liviu Panait, Karl Tuyls, and Sean Luke. Theoretical advantages of lenient learners: An evolutionary game theoretic perspective. Journal of Machine Learning Research, 9:423–457, 2008.

[82] David C. Parkes and Michael P. Wellman. Economic reasoning and artificial intelligence. Science, 349(6245):267–272, 2015.

[83] Marc Ponsen, Karl Tuyls, Michael Kaisers, and Jan Ramon. An evolutionary game theoretic analysis of poker strategies. Entertainment Computing, 2009.

[84] F. Sailer, M. Buro, and M. Lanctot. Adversarial planning through strategy simulation.
In IEEE Symposium on Computational Intelligence and Games (CIG), pages 37–45, 2007.

[85] Spyridon Samothrakis, Simon Lucas, Thomas Philip Runarsson, and David Robles. Coevolving game-playing agents: Measuring performance and intransitivities. IEEE Transactions on Evolutionary Computation, April 2013.

[86] Martin Schmid, Matej Moravcik, and Milan Hladik. Bounding the support size in extensive form games with imperfect information. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[87] L. Julian Schvartzman and Michael P. Wellman. Stronger CDA strategies through empirical game-theoretic analysis and reinforcement learning. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 249–256, 2009.

[88] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In Proceedings of the International Conference on Machine Learning (ICML), 2016.

[89] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2009.

[90] Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7):365–377, 2007.

[91] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search.
Nature, 529:484–489, 2016.

[92] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017.

[93] S. Sukhbaatar, A. Szlam, and R. Fergus. Learning multiagent communication with backpropagation. In 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.

[94] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[95] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4), 2017.

[96] Anderson Tavares, Hector Azpurua, Amanda Santos, and Luiz Chaimowicz. Rock, paper, StarCraft: Strategy selection in real-time strategy games. In The Twelfth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE-16), 2016.

[97] P. D. Taylor and L. B. Jonker. Evolutionarily stable strategies and game dynamics. Mathematical Biosciences, 40:145–156, 1978.

[98] K. Tuyls and R. Westra. Replicator dynamics in discrete and continuous strategy spaces. In Agents, Simulation and Applications, pages 218–243. Taylor and Francis, 2008.

[99] Karl Tuyls and Gerhard Weiss. Multiagent learning: Basics, challenges, and prospects. AI Magazine, 33(3):41–52, 2012.

[100] W. E. Walsh, R. Das, G. Tesauro, and J. O. Kephart. Analyzing complex strategic interactions in multi-agent games. In AAAI-02 Workshop on Game Theoretic and Decision Theoretic Agents, 2002.

[101] Michael P. Wellman. Methods for empirical game-theoretic analysis.
In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2006.

[102] S. Whiteson, B. Tanner, M. E. Taylor, and P. Stone. Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 120–127, 2011.

[103] James R. Wright and Kevin Leyton-Brown. Beyond equilibrium: Predicting human behavior in normal form games. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), pages 901–907, 2010.

[104] Mason Wright. Using reinforcement learning to validate empirical game-theoretic analysis: A continuous double auction study. CoRR, abs/1604.06710, 2016.

[105] Nikolai Yakovenko, Liangliang Cao, Colin Raffel, and James Fan. Poker-CNN: A pattern learning strategy for making draws and bets in poker games using convolutional networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[106] Wako Yoshida, Ray J. Dolan, and Karl J. Friston. Game theory of mind. PLOS Computational Biology, 4(12):1–14, 12 2008.

[107] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information.
In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.