{"title": "Multiagent Evaluation under Incomplete Information", "book": "Advances in Neural Information Processing Systems", "page_first": 12291, "page_last": 12303, "abstract": "This paper investigates the evaluation of learned multiagent strategies in the incomplete information setting, which plays a critical role in ranking and training of agents. Traditionally, researchers have relied on Elo ratings for this purpose, with recent works also using methods based on Nash equilibria. Unfortunately, Elo is unable to handle intransitive agent interactions, and other techniques are restricted to zero-sum, two-player settings or are limited by the fact that the Nash equilibrium is intractable to compute. Recently, a ranking method called $\\alpha$-Rank, relying on a new graph-based game-theoretic solution concept, was shown to tractably apply to general games. However, evaluations based on Elo or $\\alpha$-Rank typically assume noise-free game outcomes, despite the data often being collected from noisy simulations, making this assumption unrealistic in practice. This paper investigates multiagent evaluation in the incomplete information regime, involving general-sum many-player games with noisy outcomes. We derive sample complexity guarantees required to confidently rank agents in this setting. We propose adaptive algorithms for accurate ranking, provide correctness and sample complexity guarantees, then introduce a means of connecting uncertainties in noisy match outcomes to uncertainties in rankings. 
We evaluate the performance of these approaches in several domains, including Bernoulli games, a soccer meta-game, and Kuhn poker.", "full_text": "Multiagent Evaluation under Incomplete Information\n\nMark Rowland1,\u2217 (markrowland@google.com)\nShayegan Omidshafiei2,\u2217 (somidshafiei@google.com)\nKarl Tuyls2 (karltuyls@google.com)\nJulien P\u00e9rolat1 (perolat@google.com)\nMichal Valko2 (valkom@deepmind.com)\nGeorgios Piliouras3 (georgios@sutd.edu.sg)\nR\u00e9mi Munos2 (munos@google.com)\n\n1DeepMind London    2DeepMind Paris    3Singapore University of Technology and Design    \u2217Equal contributors\n\nAbstract\n\nThis paper investigates the evaluation of learned multiagent strategies in the incomplete information setting, which plays a critical role in ranking and training of agents. Traditionally, researchers have relied on Elo ratings for this purpose, with recent works also using methods based on Nash equilibria. Unfortunately, Elo is unable to handle intransitive agent interactions, and other techniques are restricted to zero-sum, two-player settings or are limited by the fact that the Nash equilibrium is intractable to compute. Recently, a ranking method called \u03b1-Rank, relying on a new graph-based game-theoretic solution concept, was shown to tractably apply to general games. However, evaluations based on Elo or \u03b1-Rank typically assume noise-free game outcomes, despite the data often being collected from noisy simulations, making this assumption unrealistic in practice. This paper investigates multiagent evaluation in the incomplete information regime, involving general-sum many-player games with noisy outcomes. We derive sample complexity guarantees required to confidently rank agents in this setting. 
We propose adaptive algorithms for accurate ranking, provide correctness and sample complexity guarantees, then introduce a means of connecting uncertainties in noisy match outcomes to uncertainties in rankings. We evaluate the performance of these approaches in several domains, including Bernoulli games, a soccer meta-game, and Kuhn poker.\n\n1 Introduction\n\nThis paper investigates evaluation of learned multiagent strategies given noisy game outcomes. The Elo rating system is the predominant approach used to evaluate and rank agents that learn through, e.g., reinforcement learning [12, 35, 42, 43]. Unfortunately, the main caveat with Elo is that it cannot handle intransitive relations between interacting agents, and as such its predictive power is too restrictive to be useful in non-transitive situations (a simple example being the game of Rock-Paper-Scissors). Two recent empirical game-theoretic approaches are Nash Averaging [3] and \u03b1-Rank [36]. Empirical Game Theory Analysis (EGTA) can be used to evaluate learning agents that interact in large-scale multiagent systems, as it remains largely an open question as to how such agents can be evaluated in a principled manner [36, 48, 49]. EGTA has been used to investigate this evaluation problem by deploying empirical or meta-games [37, 38, 47, 51\u201354]. Meta-games abstract away the atomic decisions made in the game and instead focus on interactions of high-level agent strategies, enabling the analysis of large-scale games using game-theoretic techniques. Such games are typically constructed from large amounts of data or simulations. An evaluation of the meta-game then gives a means of comparing the strengths of the various agents interacting in the original game (which might, e.g., form an important part of a training pipeline [25, 26, 42]) or of selecting a final agent after training has taken place (see Fig. 1.1a).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1.1: (a) Illustration of converting plausible payoff matrices consistent with an empirical estimate \u02c6M to empirical rankings \u02c6\u03c0. The set of plausible payoff matrices and plausible rankings are shown, respectively, in grey and blue. (b) Ranking uncertainty vs. payoff uncertainty for a soccer meta-game involving 10 agents. Each cluster of bars shows confidence intervals over ranking weights given an observed payoff matrix with a particular uncertainty level; payoff uncertainty here corresponds to the mean confidence interval size of payoff matrix entries. This example illustrates the need for careful consideration of payoff uncertainties when computing agent rankings.\n\nBoth Nash Averaging and \u03b1-Rank assume noise-free (i.e., complete) information, and while \u03b1-Rank applies to general games, Nash Averaging is restricted to 2-player zero-sum settings. Unfortunately, we can seldom expect to observe a noise-free specification of a meta-game in practice, as in large multiagent systems it is unrealistic to expect that the various agents under study will be pitted against all other agents a sufficient number of times to obtain reliable statistics about the meta-payoffs in the empirical game. While there have been prior inquiries into approximation of equilibria (e.g., Nash) using noisy observations [15, 28], few have considered evaluation or ranking of agents in meta-games with incomplete information [40, 53]. Consider, for instance, a meta-game based on various versions of AlphaGo and prior state-of-the-art agents (e.g., Zen) [41, 48]; the game outcomes are noisy, and due to computational budget not all agents might play against each other. 
These issues are compounded when the simulations required to construct the empirical meta-game are inherently expensive.\n\nMotivated by the above issues, this paper contributes to multiagent evaluation under incomplete information. As we are interested in general games that go beyond dyadic interactions, we focus on \u03b1-Rank. Our contributions are as follows: first, we provide sample complexity guarantees describing the number of interactions needed to confidently rank the agents in question; second, we introduce adaptive sampling algorithms for selecting agent interactions for the purposes of accurate evaluation; third, we develop means of propagating uncertainty in payoffs to uncertainty in agent rankings. These contributions enable the principled evaluation of agents in the incomplete information regime.\n\n2 Preliminaries\n\nWe review here preliminaries in game theory and evaluation. See Appendix A for related work.\n\nGames and meta-games. Consider a K-player game, where each player k \u2208 [K] has a finite set Sk of pure strategies. Denote by S = \u220fk Sk the space of pure strategy profiles. For each tuple s = (s1, . . . , sK) \u2208 S of pure strategies, the game specifies a joint probability distribution \u03bd(s) of payoffs to each player. The vector of expected payoffs is denoted M(s) = (M1(s), . . . , MK(s)) \u2208 RK. In empirical game theory, we are often interested in analyzing interactions at a higher meta-level, wherein a strategy profile s corresponds to a tuple of machine learning agents and the matrix M captures their expected payoffs when played against one another in some domain. Given this, the notions of 'agents' and 'strategies' are considered synonymous in this paper.\n\nEvaluation. 
Given payoff matrix M \u2208 (RK)S, a key task is to evaluate the strategies in the game. This is sometimes done in terms of a game-theoretic solution concept (e.g., Nash equilibria), but may also consist of rankings or numerical scores for strategies. We focus particularly on the evolutionary dynamics based \u03b1-Rank method [36], which applies to general many-player games, but also provide supplementary results for the Elo ranking system [12]. There also exist Nash-based evaluation methods, such as Nash Averaging in two-player, zero-sum settings [3, 48], but these are not more generally applicable as the Nash equilibrium is intractable to compute and select [11, 20].\n\nThe exact payoff table M is rarely known; instead, an empirical payoff table \u02c6M is typically constructed from observed agent interactions (i.e., samples from the distributions \u03bd(s)). Based on collected data, practitioners may associate a set of plausible payoff tables with this point estimate, either using a frequentist confidence set, or a Bayesian high posterior density region. Figure 1.1a illustrates the application of a ranking algorithm to a set of plausible payoff matrices, where rankings can then be used for evaluating, training, or prescribing strategies to play. Figure 1.1b visualizes an example demonstrating the sensitivity of computed rankings to estimated payoff uncertainties (with ranking uncertainty computed as discussed in Section 5). This example highlights the importance of propagating payoff uncertainties through to uncertainty in rankings, which can play a critical role, e.g., when allocating training resources to agents based on their respective rankings during learning.\n\n\u03b1-Rank. 
The Elo ranking system (reviewed in Appendix C) is designed to estimate win-loss probabilities in two-player, symmetric, constant-sum games [12]. Yet despite its widespread use for ranking [2, 19, 31, 42], Elo has no predictive power in intransitive games (e.g., Rock-Paper-Scissors) [3]. By contrast, \u03b1-Rank is a ranking algorithm inspired by evolutionary game theory models, and applies to K-player, general-sum games [36]. At a high level, \u03b1-Rank defines an irreducible Markov chain over strategy set S, called the response graph of the game [32]. The ordered masses of this Markov chain's unique invariant distribution \u03c0 yield the strategy profile rankings. The Markov transition matrix, C, is defined in a manner that establishes a link to a solution concept called Markov-Conley chains (MCCs). MCCs are critical for the rankings computed, as they capture agent interactions even under intransitivities and are tractably computed in general games, unlike Nash equilibria [11].\n\nIn more detail, the underlying transition matrix over S is defined by \u03b1-Rank as follows. Let s = (s1, . . . , sK) \u2208 S be a pure strategy profile, and let \u03c3 = (\u03c3k, s\u2212k) be the pure strategy profile which is equal to s, except for player k, which uses strategy \u03c3k \u2208 Sk instead of sk. Denote by \u03b7 the reciprocal of the total number of valid profile transitions from a given strategy profile (i.e., where only a single player deviates in her strategy), so that \u03b7 = (\u2211_{l=1}^{K} (|Sl| \u2212 1))^{\u22121}. Let Cs,\u03c3 denote the transition probability from s to \u03c3, and Cs,s the self-transition probability of s, with each defined as:\n\nCs,\u03c3 = \u03b7 (1 \u2212 exp(\u2212\u03b1(Mk(\u03c3) \u2212 Mk(s)))) / (1 \u2212 exp(\u2212\u03b1m(Mk(\u03c3) \u2212 Mk(s)))) if Mk(\u03c3) \u2260 Mk(s), and Cs,\u03c3 = \u03b7/m otherwise;   Cs,s = 1 \u2212 \u2211_{k \u2208 [K]} \u2211_{\u03c3 : \u03c3k \u2208 Sk\\{sk}} Cs,\u03c3 ,   (1)\n\nwhere if two strategy profiles s and s\u2032 differ in more than one player's strategy, then Cs,s\u2032 = 0. Here \u03b1 \u2265 0 and m \u2208 N are parameters to be specified; the form of this transition probability is informed by particular models in evolutionary dynamics and is explained in detail by Omidshafiei et al. [36], with large values of \u03b1 corresponding to higher selection pressure in the evolutionary model considered. A key remark is that the correspondence of \u03b1-Rank to the MCC solution concept occurs in the limit of infinite \u03b1. In practice, to ensure the irreducibility of C and the existence of a unique invariant distribution \u03c0, \u03b1 is either set to a large but finite value, or a perturbed version of C under the infinite-\u03b1 limit is used. We theoretically and numerically analyze both the finite- and infinite-\u03b1 regimes in this paper, and provide more details on \u03b1-Rank, response graphs, and MCCs in Appendix B.\n\n3 Sample complexity guarantees\n\nThis section provides sample complexity bounds, stating the number of strategy profile observations needed to obtain accurate \u03b1-Rank rankings with high probability. We give two sample complexity results, the first for rankings in the finite-\u03b1 regime, and the second an instance-dependent guarantee on the reconstruction of the transition matrix in the infinite-\u03b1 regime. 
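To make the construction concrete, the transition model of Eq. (1) can be sketched in a few lines of numpy. This is an illustrative re-implementation of the formulas above, not the authors' code: function and variable names are our own, payoffs are assumed given as dense arrays, and the invariant distribution is read off an eigendecomposition.

```python
import itertools
import numpy as np

def alpha_rank_pi(payoffs, alpha=10.0, m=5):
    """Finite-alpha sketch of Eq. (1): build the Markov transition
    matrix C over pure strategy profiles and return its invariant
    distribution pi.

    payoffs: list of K arrays; payoffs[k][s] is the expected payoff to
    player k under joint strategy profile s (a tuple of indices).
    """
    K = len(payoffs)
    sizes = payoffs[0].shape
    profiles = list(itertools.product(*[range(n) for n in sizes]))
    idx = {s: i for i, s in enumerate(profiles)}
    # eta = reciprocal of the number of unilateral deviations from a profile
    eta = 1.0 / sum(n - 1 for n in sizes)
    C = np.zeros((len(profiles), len(profiles)))
    for s in profiles:
        for k in range(K):
            for dev in range(sizes[k]):
                if dev == s[k]:
                    continue
                sigma = s[:k] + (dev,) + s[k + 1:]
                d = payoffs[k][sigma] - payoffs[k][s]
                if d == 0:
                    p = eta / m  # neutral-drift branch of Eq. (1)
                else:
                    p = eta * (1 - np.exp(-alpha * d)) / (1 - np.exp(-alpha * m * d))
                C[idx[s], idx[sigma]] = p
        C[idx[s], idx[s]] = 1 - C[idx[s]].sum()  # self-transition mass
    # invariant distribution: left eigenvector of C for eigenvalue 1
    vals, vecs = np.linalg.eig(C.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi = np.abs(pi) / np.abs(pi).sum()
    return profiles, pi
```

On a toy 2x2 game in which strategy 0 strictly dominates for both players, the mass concentrates on profile (0, 0), matching the intuition that the sole sink of the response graph is ranked first.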
All proofs are in Appendix D.\n\nTheorem 3.1 (Finite-\u03b1). Suppose payoffs are bounded in the interval [\u2212Mmax, Mmax], and define L(\u03b1, Mmax) = 2\u03b1 exp(2\u03b1Mmax) and g(\u03b1, \u03b7, m, Mmax) = \u03b7 (exp(2\u03b1Mmax) \u2212 1)/(exp(2\u03b1mMmax) \u2212 1). Let \u03b5 \u2208 (0, 18 \u00d7 2^{\u2212|S|} \u2211_{n=1}^{|S|\u22121} (|S| choose n) n^{|S|}), \u03b4 \u2208 (0, 1). Let \u02c6M be an empirical payoff table constructed by taking Ns i.i.d. interactions of each strategy profile s \u2208 S. Then the invariant distribution \u02c6\u03c0 derived from the empirical payoff matrix \u02c6M satisfies max_{s \u2208 \u220fk Sk} |\u03c0(s) \u2212 \u02c6\u03c0(s)| \u2264 \u03b5 with probability at least 1 \u2212 \u03b4, if\n\nNs > 648 Mmax^2 log(2|S|K/\u03b4) L(\u03b1, Mmax)^2 (\u2211_{n=1}^{|S|\u22121} (|S| choose n) n^{|S|})^2 / (\u03b5^2 g(\u03b1, \u03b7, m, Mmax)^2)   \u2200s \u2208 S .\n\nThe dependence on \u03b4 and \u03b5 is as expected from typical Chernoff-style bounds, though Markov chain perturbation theory introduces a dependence on the \u03b1-Rank parameters as well, most notably \u03b1.\n\nTheorem 3.2 (Infinite-\u03b1). Suppose all payoffs are bounded in [\u2212Mmax, Mmax], and that \u2200k \u2208 [K] and \u2200s\u2212k \u2208 S\u2212k, we have |Mk(\u03c3, s\u2212k) \u2212 Mk(\u03c4, s\u2212k)| \u2265 \u2206 for all distinct \u03c3, \u03c4 \u2208 Sk, for some \u2206 > 0. Let \u03b4 > 0. Suppose we construct an empirical payoff table ( \u02c6Mk(s) | k \u2208 [K], s \u2208 S) through Ns i.i.d. games for each strategy profile s \u2208 S. 
Then the transition matrix \u02c6C computed from payoff table \u02c6M is exact (and hence all MCCs are exactly recovered) with probability at least 1 \u2212 \u03b4, if\n\nNs > 8 \u2206^{\u22122} Mmax^2 log(2|S|K/\u03b4)   \u2200s \u2208 S .\n\nA consequence of the theorem is that exact infinite-\u03b1 rankings are recovered with probability at least 1 \u2212 \u03b4. We also provide theoretical guarantees for Elo ratings in Appendix C for completeness.\n\n4 Adaptive sampling-based ranking\n\nWhilst instructive, the bounds above have limited utility as the payoff gaps that appear in them are rarely known in practice. We next introduce algorithms that compute accurate rankings with high confidence without knowledge of payoff gaps, focusing on \u03b1-Rank due to its generality.\n\nProblem statement. Fix an error tolerance \u03b4 > 0. We seek an algorithm which specifies (i) a sampling scheme S that selects the next strategy profile s \u2208 S for which a noisy game outcome is observed, and (ii) a criterion C(\u03b4) that stops the procedure and outputs the estimated payoff table used for the infinite-\u03b1 \u03b1-Rank rankings, which is exactly correct with probability at least 1 \u2212 \u03b4. The assumption of infinite-\u03b1 simplifies this task; it is sufficient for the algorithm to determine, for each k \u2208 [K] and pair of strategy profiles (\u03c3, s\u2212k), (\u03c4, s\u2212k), whether Mk(\u03c3, s\u2212k) > Mk(\u03c4, s\u2212k) or Mk(\u03c3, s\u2212k) < Mk(\u03c4, s\u2212k) holds. If all such pairwise comparisons are correctly made with probability at least 1 \u2212 \u03b4, the correct rankings can be computed. 
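To get a feel for the scale of the Theorem 3.2 bound, one can evaluate it directly; a small helper (names are illustrative, and the payoff gap and bound Mmax are assumed known):

```python
import math

def infinite_alpha_sample_size(gap, m_max, num_profiles, num_players, failure_prob):
    """Per-profile sample count sufficient for exact infinite-alpha
    recovery per Theorem 3.2: N_s > 8 * gap^-2 * M_max^2 * log(2|S|K/delta).
    All argument names are ours, not from the paper's code.
    """
    n = 8 * m_max**2 / gap**2 * math.log(2 * num_profiles * num_players / failure_prob)
    return math.ceil(n)
```

For example, a game with |S| = 4 profiles, K = 2 players, payoff gap 0.1, Mmax = 1, and failure probability 0.05 requires a few thousand interactions per profile; doubling the gap cuts the requirement by a factor of four, reflecting the quadratic dependence on the gap.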
Note that we consider only instances\nfor which the third possibility, Mk(\u03c3, s\u2212k) = Mk(\u03c4, s\u2212k), does not hold; in such cases, it is\nwell-known that it is impossible to design an adaptive strategy that always stops in \ufb01nite time [13].\nThis problem can be described as a related collection of pure exploration bandit problems [4]; each\nsuch problem is speci\ufb01ed by a player index k \u2208 [K] and set of two strategy pro\ufb01les {s, (\u03c3k, s\u2212k)}\n(where s \u2208 S, \u03c3k \u2208 Sk) that differ only in player k; the aim is to determine whether player k receives\na greater payoff under strategy pro\ufb01le s or (\u03c3k, s\u2212k). Each individual best-arm identi\ufb01cation problem\ncan be solved to the required con\ufb01dence level by maintaining empirical means and a con\ufb01dence\nbound for the payoffs concerned. Upon termination, an evaluation technique such as \u03b1-Rank can\nthen be run on the resulting response graph to compute the strategy pro\ufb01le (or agent) rankings.\n\n4.1 Algorithm: ResponseGraphUCB\n\nWe introduce a high-level adaptive sampling algorithm, called ResponseGraphUCB, for computing\naccurate rankings in Algorithm 1. Several variants of ResponseGraphUCB are possible, depending\non the choice of sampling scheme S and stopping criterion C(\u03b4), which we detail next.\nSampling scheme S. Algorithm 1 keeps track of a list of pairwise strategy pro\ufb01le comparisons that\n\u03b1-Rank requires, removing pairs of pro\ufb01les for which we have high con\ufb01dence that the empirical table\nis correct (according to C(\u03b4)), and selecting a next strategy pro\ufb01le for simulation. There are several\nways in which strategy pro\ufb01le sampling can be conducted in Algorithm 1. Uniform (U): A strategy\npro\ufb01le is drawn uniformly from all those involved in an unresolved pair. 
Algorithm 1 ResponseGraphUCB(\u03b4, S, C(\u03b4))\n1: Construct list L of pairs of strategy profiles to compare\n2: Initialize tables \u02c6M, N to store empirical means and interaction counts\n3: while L is not empty do\n4:     Select a strategy profile s appearing in an edge in L using sampling scheme S\n5:     Simulate one interaction for s and update \u02c6M, N accordingly\n6:     Check whether any edges are resolved according to C(\u03b4), remove them from L if so\n7: return empirical table \u02c6M\n\n            II\n             0           1\nI   0   0.50, 0.50   0.85, 0.15\n    1   0.15, 0.85   0.50, 0.50\n\n(a) Players I and II payoffs. (b) Reconstructed response graph. (c) Strategy-wise sample counts.\n\nFigure 4.1: ResponseGraphUCB(\u03b4 : 0.1, S: UE, C: UCB) run on a two-player game. (a) The payoff tables for both players. (b) Reconstructed response graph, together with final empirical payoffs and confidence intervals (in blue) and true payoffs (in red). (c) Strategy-wise sample proportions.\n\nUniform-exhaustive (UE): A pair of strategy profiles is selected uniformly from the set of unresolved pairs, and both strategy profiles are queried until the pair is resolved. Valence-weighted (VW): As each query of a profile informs multiple payoffs and has impacts on even greater numbers of pairwise comparisons, there may be value in first querying profiles that may resolve a large number of comparisons. Here we set the probability of sampling s proportional to the squared valence of node s in the graph of unresolved comparisons. Count-weighted (CW): The marginal impact on the width of a confidence interval for a strategy profile with relatively few queries is greater than for one with many queries, motivating preferential sampling of strategy profiles with low query count. 
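A minimal sketch of the Algorithm 1 loop with the uniform (U) sampling scheme and Hoeffding stopping bounds may help fix ideas. The oracle interface and names here are ours, and the confidence-level split is a simple union bound rather than the exact choice analyzed in the paper:

```python
import math
import random

def response_graph_ucb(pairs, sample, delta, max_samples=100_000):
    """Sketch of ResponseGraphUCB with uniform (U) sampling and
    Hoeffding confidence intervals. `sample(s)` returns one noisy
    payoff in [0, 1] for profile s; `pairs` lists the profile pairs
    whose payoff ordering must be resolved.
    """
    sums, counts = {}, {}
    unresolved = list(pairs)

    def ci(s):  # Hoeffding interval; delta split over all comparisons
        mean = sums[s] / counts[s]
        half = math.sqrt(math.log(4 * len(pairs) / delta) / (2 * counts[s]))
        return mean - half, mean + half

    for s in {p for pair in unresolved for p in pair}:  # one initial query each
        sums[s], counts[s] = sample(s), 1
    for _ in range(max_samples):
        if not unresolved:
            break
        s = random.choice(random.choice(unresolved))  # uniform scheme (U)
        sums[s] += sample(s)
        counts[s] += 1
        # a pair is resolved once its confidence intervals are disjoint
        unresolved = [
            (a, b) for a, b in unresolved
            if not (ci(a)[1] < ci(b)[0] or ci(b)[1] < ci(a)[0])
        ]
    return {s: sums[s] / counts[s] for s in sums}, unresolved
```

With a wide payoff gap the pair is resolved after a handful of queries; the stopping rule never compares point estimates directly, only whole intervals, which is what gives the high-probability correctness guarantee.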
Here, we preferentially sample the\nstrategy pro\ufb01le with lowest count among all strategy pro\ufb01les with unresolved comparisons.\nStopping condition C(\u03b4). The stopping criteria we consider are based on con\ufb01dence-bound methods,\nwith the intuition that the algorithm stops only when it has high con\ufb01dence in all pairwise comparisons\nmade. To this end, the algorithm maintains a con\ufb01dence interval for each of the estimates, and judges\na pairwise comparison to be resolved when the two con\ufb01dence intervals concerned become disjoint.\nThere are a variety of con\ufb01dence bounds that can be maintained, depending on the speci\ufb01cs of the\ngame; we consider Hoeffding (UCB) and Clopper-Pearson (CP-UCB) bounds, along with relaxed\nvariants of each (respectively, R-UCB and R-CP-UCB); full descriptions are given in Appendix F.\nWe build intuition by evaluating ResponseGraphUCB(\u03b4 : 0.1, S : UE, C : UCB), i.e., with a 90%\ncon\ufb01dence level, on a two-player game with payoffs shown in Fig. 4.1a; noisy payoffs are simulated\nas detailed in Section 6. The output is given in Fig. 4.1b; the center of this \ufb01gure shows the estimated\nresponse graph, which matches the ground truth in this example. Around the response graph, mean\npayoff estimates and con\ufb01dence bounds are shown for each player-strategy pro\ufb01le combination in\nblue; in each of the surrounding four plots, ResponseGraphUCB aims to establish which of the true\npayoffs (shown as red dots) is greater for the deviating player, with directed edges pointing towards\nestimated higher-payoff deviations. Figure 4.1b reveals that strategy pro\ufb01le (0, 0) is the sole sink of\nthe response graph, thus would be ranked \ufb01rst by \u03b1-Rank. Each pro\ufb01le has been sampled a different\nnumber of times, with running averages of sampling proportions shown in Fig. 4.1c. Exploiting\nknowledge of game symmetry (e.g., as in Fig. 
4.1a) can reduce sample complexity; see Appendix H.3.\n\nWe now show the correctness of ResponseGraphUCB and bound the number of samples required for it to terminate. Our analysis depends on the choice of confidence bounds used in stopping condition C(\u03b4); we describe the correctness proof in a manner agnostic to these details, and give a sample complexity result for the case of Hoeffding confidence bounds. See Appendix E for proofs.\n\nTheorem 4.1. The ResponseGraphUCB algorithm is correct with high probability: Given \u03b4 \u2208 (0, 1), for any particular sampling scheme there is a choice of confidence levels such that ResponseGraphUCB outputs the correct response graph with probability at least 1 \u2212 \u03b4.\n\nTheorem 4.2. The ResponseGraphUCB algorithm, using confidence parameter \u03b4 and Hoeffding confidence bounds, run on an evaluation instance with \u2206 = min_{(sk,s\u2212k),(\u03c3k,s\u2212k)} |Mk(sk, s\u2212k) \u2212 Mk(\u03c3k, s\u2212k)| requires at most O(\u2206^{\u22122} log(1/(\u03b4\u2206))) samples with probability at least 1 \u2212 2\u03b4.\n\n5 Ranking uncertainty propagation\n\nThis section considers the remaining key issue of efficiently computing uncertainty in the ranking weights, given remaining uncertainty in estimated payoffs. We assume known element-wise upper- and lower-confidence bounds U and L on the unknown true payoff table M, e.g., as provided by ResponseGraphUCB. 
The task we seek to solve is, given a particular strategy profile s \u2208 S and these payoff bounds, to output the confidence interval for \u03c0(s), the ranking weight for s under the true payoff table M; i.e., we seek [inf_{L \u2264 \u02c6M \u2264 U} \u03c0_\u02c6M(s), sup_{L \u2264 \u02c6M \u2264 U} \u03c0_\u02c6M(s)], where \u03c0_\u02c6M denotes the output of infinite-\u03b1 \u03b1-Rank under payoffs \u02c6M. This section proposes an efficient means of solving this task.\n\nAt the very highest level, this essentially involves finding plausible response graphs (that is, response graphs that are compatible with a payoff matrix \u02c6M within the confidence bounds L and U) that minimize or maximize the probability \u03c0(s) given to particular strategy profiles s \u2208 S under infinite-\u03b1 \u03b1-Rank. Considering the maximization case, intuitively this may involve directing as many edges adjacent to s towards s as possible, so as to maximize the amount of time the corresponding Markov chain spends at s. It is less clear intuitively what the optimal way to set the directions of edges not adjacent to s should be, and how to enforce consistency with the constraints L \u2264 \u02c6M \u2264 U. In fact, similar problems have been studied before in the PageRank literature for search engine optimization [7, 9, 10, 16, 24], and have been shown to be reducible to constrained dynamic programming problems.\n\nMore formally, the main idea is to convert the problem of obtaining bounds on \u03c0 to a constrained stochastic shortest path (CSSP) policy optimization problem which optimizes the mean return time of the corresponding Markov chain to the strategy profile s. In full generality, such constrained policy optimization problems are known to be NP-hard [10]. 
Here, we show that it is sufficient to optimize an unconstrained version of the \u03b1-Rank CSSP, hence yielding a tractable problem that can be solved with standard SSP optimization routines. Details of the algorithm are provided in Appendix G; here, we provide a high-level overview of its structure, and state the main theoretical result underlying the correctness of the approach.\n\nThe first step is to convert the element-wise confidence bounds L \u2264 \u02c6M \u2264 U into a valid set of constraints on the form of the underlying response graph. Next, a reduction is used to encode the problem as policy optimization in a constrained shortest path problem (CSSP), as in the PageRank literature [10]; we denote the corresponding problem instance by CSSP(S, L, U, s). Whilst solution of CSSPs is in general hard, we note here that it is possible to remove the constraints on the problem, yielding a stochastic shortest path problem that can be solved by standard means.\n\nTheorem 5.1. The unconstrained SSP problem given by removing the action consistency constraints of CSSP(S, L, U, s) has the same optimal value as CSSP(S, L, U, s).\n\nSee Appendix G for the proof. Thus, the general approach for finding worst-case upper and lower bounds on infinite-\u03b1 \u03b1-Rank ranking weights \u03c0(s) for a given strategy profile s \u2208 S is to formulate the unconstrained SSP described above, find the optimal policy (using, e.g., linear programming, policy or value iteration), and then use the inverse relationship between mean return times and stationary distribution probabilities in recurrent Markov chains to obtain the bound on the ranking weight \u03c0(s) as required; full details are given in Appendix G. This approach, when applied to the soccer domain described in the sequel, yields Fig. 
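The inverse relationship invoked above is Kac's formula: in an irreducible chain, \u03c0(s) = 1 / E_s[return time to s], so a bound on the optimal mean return time translates directly into a bound on the ranking weight. A small numerical check of this identity on an arbitrary 3-state chain (the matrix below is ours, not from the paper):

```python
import numpy as np

# An arbitrary irreducible 3-state transition matrix.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# Stationary distribution: solve pi P = pi subject to sum(pi) = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]

# Mean return time to state 0 via first-step analysis: h[i] is the
# expected hitting time of state 0 starting from state i (i = 1, 2),
# solving (I - Q) h = 1 with Q the sub-matrix avoiding state 0.
Q = P[1:, 1:]
h = np.linalg.solve(np.eye(2) - Q, np.ones(2))
mean_return_0 = 1 + P[0, 1:] @ h  # one step, then hit state 0
```

Kac's formula guarantees `pi[0] * mean_return_0 == 1` up to floating-point error, which is exactly the mechanism that lets SSP value bounds be inverted into bounds on \u03c0(s).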
1.1b.\n\n(a) Bernoulli games. (b) Soccer meta-game. (c) Poker meta-game.\n\nFigure 6.1: Samples needed per strategy profile (Ns) for finite-\u03b1 \u03b1-Rank, without adaptive sampling.\n\n6 Experiments\n\nWe consider three domains of increasing complexity, with experimental procedures detailed in Appendix H.1. First, we consider randomly-generated two-player zero-sum Bernoulli games, with the constraint that payoffs Mk(s, \u03c3) \u2208 [0, 1] cannot be too close to 0.5 for all pairs of distinct strategies s, \u03c3 \u2208 S where \u03c3 = (\u03c3k, s\u2212k) (i.e., a single-player deviation from s). This constraint implies that we avoid games that require an exceedingly large number of interactions for the sampler to compute a reasonable estimate of the payoff table. Second, we analyze a Soccer meta-game with the payoffs in Liu et al. [33, Figure 2], wherein agents learn to play soccer in the MuJoCo simulation environment [46] and are evaluated against one another. This corresponds to a two-player symmetric zero-sum game with 10 agents, but with empirical (rather than randomly-generated) payoffs. Finally, we consider a Kuhn poker meta-game with asymmetric payoffs and 3 players with access to 3 agents each, similar to the domain analyzed in [36]; here, only \u03b1-Rank (and not Elo) applies for evaluation due to more than two players being involved. In all domains, noisy outcomes are simulated by drawing the winning player i.i.d. from a Bernoulli(Mk(s)) distribution over payoff tables M.\n\nWe first consider the empirical sample complexity of \u03b1-Rank in the finite-\u03b1 regime. Figure 6.1 visualizes the number of samples needed per strategy profile to obtain rankings given a desired invariant distribution error \u03b5, where max_{s \u2208 \u220fk Sk} |\u03c0(s) \u2212 \u02c6\u03c0(s)| \u2264 \u03b5. 
As noted in Theorem 3.1, the\nsample complexity increases with respect to \u03b1, with the larger soccer and poker domains requiring on\nthe order of 1e3 samples per strategy pro\ufb01le to compute reasonably accurate rankings. These results\nare also intuitive given the evolutionary model underlying \u03b1-Rank, where lower \u03b1 induces lower\nselection pressure, such that strategies perform almost equally well and are, thus, easier to rank.\nAs noted in Section 4, sample complexity and ranking error under adaptive sampling are of particular\ninterest. To evaluate this, we consider variants of ResponseGraphUCB in Fig. 6.2, with particular\nfocus on the UE sampler (S: UE) for visual clarity; complete results for all combinations of S\nand C(\u03b4) are presented in Appendix Section H.2. Consider \ufb01rst the results for the Bernoulli games,\nshown in Fig. 6.2a; the top row plots the number of interactions required by ResponseGraphUCB to\naccurately compute the response graph given a desired error tolerance \u03b4, while the bottom row plots\nthe number of response graph edge errors (i.e., the number of directed edges in the estimated response\ngraph that point in the opposite direction of the ground truth graph). Notably, the CP-UCB con\ufb01dence\nbound is guaranteed to be tighter than the Hoeffding bounds used in standard UCB, thus the former\nrequires fewer interactions to arrive at a reasonable response graph estimate with the same con\ufb01dence\nas the latter; this is particularly evident for the relaxed variants R-CP-UCB, which require roughly\nan order of magnitude fewer samples compared to the other sampling schemes, despite achieving a\nreasonably low response graph error rate.\nConsider next the ResponseGraphUCB results given noisy outcomes for the soccer and poker meta-\ngames, respectively in Figs. 6.2b and 6.2c. Due to the much larger strategy spaces of these games, we\ncap the number of samples available at 1e5. 
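The noise model used throughout these experiments is easy to replicate: each interaction of a profile draws the winner i.i.d. from a Bernoulli distribution parameterized by the true payoff, and the empirical table averages the outcomes. A sketch (the dict-based interface is ours, not the paper's code):

```python
import random

def empirical_payoff(m_true, n_samples, seed=0):
    """Simulate the experiments' noise model: for each profile s, draw
    n_samples i.i.d. Bernoulli(M_k(s)) outcomes and record the win rate.

    m_true: dict mapping strategy profiles to true payoffs in [0, 1].
    """
    rng = random.Random(seed)
    m_hat = {}
    for s, p in m_true.items():
        wins = sum(rng.random() < p for _ in range(n_samples))
        m_hat[s] = wins / n_samples
    return m_hat
```

For a profile with true payoff 0.85, a few thousand simulated interactions put the empirical entry within a few percentage points of the truth, consistent with the per-profile sample counts reported for the soccer and poker domains.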
While the results for poker are qualitatively similar to the Bernoulli games, the soccer results are notably different; in Fig. 6.2b (top), the non-relaxed samplers use the entire budget of 1e5 interactions, which occurs due to the large strategy space cardinality. Specifically, the player-wise strategy size of 10 in the soccer dataset yields a total of 900 two-arm bandit problems to be solved by ResponseGraphUCB. We note also an interesting trend in Fig. 6.2b (bottom) for the three ResponseGraphUCB variants (S: UE, C(δ): UCB), (S: UE, C(δ): R-UCB), and (S: UE, C(δ): CP-UCB). In the low error tolerance (δ) regime, the uniform-exhaustive strategy used by these three variants implies that ResponseGraphUCB spends the majority of its sampling budget observing interactions of an extremely small set of strategy profiles, and thus cannot resolve the remaining response graph edges accurately, resulting in high error. As error tolerance δ increases, while the probability of correct resolution of individual edges decreases by definition, the earlier stopping time implies that ResponseGraphUCB allocates its budget over a larger set of strategies to observe, which subsequently lowers the total number of response graph errors.

Figure 6.2: ResponseGraphUCB performance metrics versus error tolerance δ for all games. First and second rows, respectively, show the number of interactions required and response graph edge errors. (a) Bernoulli games. (b) Soccer meta-game. (c) Poker meta-game.

Figure 6.3: Payoff table Frobenius error and ranking errors for various ResponseGraphUCB confidence levels δ. Number of samples is normalized to [0, 1] on the x-axis. (a) Soccer meta-game. (b) Poker meta-game.

Figure 6.3a visualizes the ranking errors for Elo and infinite-α α-Rank given various ResponseGraphUCB error tolerances δ in the soccer domain.
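The figure of 900 two-arm bandit problems quoted above for the soccer meta-game can be checked by direct enumeration of response-graph edges; a minimal sketch (the function name is ours):

```python
from itertools import product

def num_response_graph_edges(num_players, num_strategies):
    """Count unordered pairs of strategy profiles that differ in exactly
    one player's strategy; each such edge is one two-arm bandit problem
    for ResponseGraphUCB."""
    edges = set()
    strategies = range(num_strategies)
    for profile in product(strategies, repeat=num_players):
        for k in range(num_players):
            for dev in strategies:
                if dev != profile[k]:
                    sigma = profile[:k] + (dev,) + profile[k + 1:]
                    edges.add(frozenset((profile, sigma)))
    return len(edges)

print(num_response_graph_edges(2, 10))  # 900, the soccer meta-game
print(num_response_graph_edges(3, 3))   # 81, the poker meta-game
```

Equivalently, in closed form there are K · C(n, 2) · n^(K−1) edges for K players with n strategies each: 2 · 45 · 10 = 900 for soccer, and 3 · 3 · 9 = 81 for poker.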
Ranking errors are computed using the Kendall partial metric (see Appendix H.4). Intuitively, as the estimated payoff table error decreases due to added samples, so does the ranking error for both algorithms. Figure 6.3b similarly considers the α-Rank ranking error in the poker domain. Ranking errors again decrease gracefully as the number of samples increases. Interestingly, while errors are positively correlated with respect to the error tolerances δ for the poker meta-game, this tolerance parameter seems to have no perceivable effect on the soccer meta-game. Moreover, the poker domain results appear to be of much higher variance than the soccer counterparts.

Figure 6.4: The ground truth distribution of payoff gaps for all response graph edges in the soccer and poker meta-games. We conjecture that the higher ranking variance may be explained by these gaps tending to be more heavily distributed near 0 for poker, making it difficult for ResponseGraphUCB to sufficiently capture the response graph topology given a high error tolerance δ. (a) Soccer meta-game. (b) Poker meta-game.
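The gap statistic underlying Figure 6.4 can be reproduced from any payoff table by enumerating single-player deviations. In this sketch, `payoff(k, profile)` is an assumed accessor of our own naming for player k's payoff M_k at a profile:

```python
from itertools import product
import random

def payoff_gaps(payoff, num_players, num_strategies):
    """Collect gaps |M_k(s) - M_k(sigma)| over every response-graph edge,
    i.e. every unordered pair of profiles differing only in the deviating
    player k's strategy."""
    gaps = []
    for s in product(range(num_strategies), repeat=num_players):
        for k in range(num_players):
            for dev in range(s[k] + 1, num_strategies):  # each edge once
                sigma = s[:k] + (dev,) + s[k + 1:]
                gaps.append(abs(payoff(k, s) - payoff(k, sigma)))
    return gaps

# Example: a randomly drawn two-player game, 10 strategies per player.
rng = random.Random(0)
table = {}
payoff = lambda k, s: table.setdefault((k, s), rng.random())
gaps = payoff_gaps(payoff, 2, 10)
print(len(gaps))  # 900 edges, matching the soccer meta-game's size
```

A histogram of `gaps` is exactly the kind of distribution plotted in Figure 6.4; mass near 0 signals edges that are expensive for a bandit sampler to resolve.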
To explore this further, we consider the distribution of payoff gaps, which play a key role in determining the response graph reconstruction errors. Let Δ(s, σ) = |M_k(s) − M_k(σ)| denote the payoff difference corresponding to the edge of the response graph where player k deviates, causing a transition between strategy profiles s, σ ∈ S. Figure 6.4 plots the ground truth distribution of these gaps for all response graph edges in soccer and poker. We conjecture that the higher ranking variance may be explained by these gaps tending to be more heavily distributed near 0 for poker, making it difficult for ResponseGraphUCB to distinguish the ‘winning’ profile and thereby sufficiently capture the response graph topology.

Overall, these results indicate a need for careful consideration of payoff uncertainties when ranking agents, and quantify the effectiveness of the algorithms proposed for multiagent evaluation under incomplete information. We conclude by remarking that the pairing of bandit algorithms and α-Rank seems a natural means of computing rankings in settings where, e.g., one has a limited budget for adaptively sampling match outcomes. Our use of bandit algorithms also leads to analysis which is flexible enough to deal with K-player general-sum games. However, approaches such as collaborative filtering may also fare well in their own right.
We conduct a preliminary analysis of this in Appendix H.5, specifically for the case of two-player win-loss games, leaving extensive investigation for follow-up work.

7 Conclusions

This paper conducted a rigorous investigation of multiagent evaluation under incomplete information. We focused particularly on α-Rank due to its applicability to general-sum, many-player games. We provided static sample complexity bounds quantifying the number of interactions needed to confidently rank agents, then introduced several sampling algorithms that adaptively allocate samples to the agent match-ups most informative for ranking. We then analyzed the propagation of game outcome uncertainty to the final rankings computed, providing sample complexity guarantees as well as an efficient algorithm for bounding rankings given payoff table uncertainty. Evaluations were conducted on domains ranging from randomly-generated two-player games to many-player meta-games constructed from real datasets. The key insight gained from this analysis is that noise in match outcomes plays a prevalent role in the determination of agent rankings. Given the recent emergence of training pipelines that rely on the evaluation of hundreds of agents pitted against each other in noisy games (e.g., Population-Based Training [25, 26]), we strongly believe that consideration of these uncertainty sources will play an increasingly important role in multiagent learning.

Acknowledgements

We thank Daniel Hennes and Thore Graepel for extensive feedback on an earlier version of this paper, and the anonymous reviewers for their comments and suggestions to improve the paper. Georgios Piliouras acknowledges MOE AcRF Tier 2 Grant 2016-T2-1-170, grant PIE-SGP-AI-2018-01 and NRF 2018 Fellowship NRF-NRFF2018-07.

References

[1] Michele Aghassi and Dimitris Bertsimas. Robust game theory.
Mathematical Programming, 107(1):231–273, 2006.

[2] Broderick Arneson, Ryan B. Hayward, and Philip Henderson. Monte Carlo tree search in Hex. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):251–258, 2010.

[3] David Balduzzi, Karl Tuyls, Julien Perolat, and Thore Graepel. Re-evaluating evaluation. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[4] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832–1852, 2011.

[5] Xiangrui Chao, Gang Kou, Tie Li, and Yi Peng. Jie Ke versus AlphaGo: A ranking approach using decision making method for large-scale data with incomplete information. European Journal of Operational Research, 265(1):239–247, 2018.

[6] Charles Clopper and Egon Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934.

[7] Giacomo Como and Fabio Fagnani. Robustness of large-scale stochastic matrices to localized perturbations. IEEE Transactions on Network Science and Engineering, 2(2):53–64, 2015.

[8] Rémi Coulom. Whole-history rating: A Bayesian rating system for players of time-varying strength. In Computers and Games, 6th International Conference (CG 2008), pages 113–124, 2008.

[9] Balázs Csanád Csáji, Raphaël M. Jungers, and Vincent D. Blondel. PageRank optimization in polynomial time by stochastic shortest path reformulation. In International Conference on Algorithmic Learning Theory, 2010.

[10] Balázs Csanád Csáji, Raphaël M. Jungers, and Vincent D. Blondel. PageRank optimization by edge selection. Discrete Applied Mathematics, 169:73–87, 2014.

[11] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a Nash equilibrium. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pages 71–78, 2006.

[12] Arpad E. Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.

[13] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

[14] Ronald Fagin, Ravi Kumar, Mohammad Mahdian, D. Sivakumar, and Erik Vee. Comparing partial rankings. SIAM Journal on Discrete Mathematics, 20(3):628–648, 2006.

[15] John Fearnley, Martin Gairing, Paul W. Goldberg, and Rahul Savani. Learning equilibria of games via payoff queries. Journal of Machine Learning Research, 16(1):1305–1344, 2015.

[16] Olivier Fercoq, Marianne Akian, Mustapha Bouhtou, and Stéphane Gaubert. Ergodic control and polyhedral approaches to PageRank optimization. IEEE Transactions on Automatic Control, 58(1):134–148, 2013.

[17] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems (NeurIPS), 2012.

[18] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the Conference on Learning Theory (COLT), 2011.

[19] Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, and Rémi Munos. The Reactor: A sample-efficient actor-critic architecture. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[20] John Harsanyi and Reinhard Selten. A General Theory of Equilibrium Selection in Games, volume 1. The MIT Press, 1988.

[21] Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious self-play in extensive-form games. In Proceedings of the International Conference on Machine Learning (ICML), 2015.

[22] Daniel Hennes, Daniel Claes, and Karl Tuyls. Evolutionary advantage of reciprocity in collision avoidance. In AAMAS Workshop on Autonomous Robots and Multirobot Systems, 2013.

[23] Ralf Herbrich, Tom Minka, and Thore Graepel. TrueSkill: A Bayesian skill rating system. In Advances in Neural Information Processing Systems (NIPS), 2007.

[24] Romain Hollanders, Giacomo Como, Jean-Charles Delvenne, and Raphaël M. Jungers. Tight bounds on sparse perturbations of Markov chains. In International Symposium on Mathematical Theory of Networks and Systems, 2014.

[25] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[26] Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.

[27] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the ACM Symposium on Theory of Computing (STOC), 2013.

[28] Patrick R. Jordan, Yevgeniy Vorobeychik, and Michael P. Wellman. Searching for approximate equilibria in empirical games. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2008.

[29] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the International Conference on Machine Learning (ICML), 2012.

[30] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the International Conference on Machine Learning (ICML), 2013.

[31] Matthew Lai. Giraffe: Using deep reinforcement learning to play chess. arXiv preprint arXiv:1509.01549, 2015.

[32] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2017.

[33] Siqi Liu, Guy Lever, Nicolas Heess, Josh Merel, Saran Tunyasuvunakool, and Thore Graepel. Emergent coordination through competition. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[34] H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the International Conference on Machine Learning (ICML), 2003.

[35] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[36] Shayegan Omidshafiei, Christos Papadimitriou, Georgios Piliouras, Karl Tuyls, Mark Rowland, Jean-Baptiste Lespiau, Wojciech M. Czarnecki, Marc Lanctot, Julien Perolat, and Rémi Munos. α-Rank: Multi-agent evaluation by evolution. Scientific Reports, 9, 2019.

[37] Steve Phelps, Simon Parsons, and Peter McBurney. An evolutionary game-theoretic comparison of two double-auction market designs. In Agent-Mediated Electronic Commerce VI (AAMAS 2004 Workshop), 2004.

[38] Steve Phelps, Kai Cai, Peter McBurney, Jinzhong Niu, Simon Parsons, and Elizabeth Sklar. Auctions, evolution, and multi-agent learning. In Adaptive Agents and Multi-Agent Systems III: Adaptation and Multi-Agent Learning (ALAMAS 2005–2007), 2007.

[39] Marc J. V. Ponsen, Karl Tuyls, Michael Kaisers, and Jan Ramon. An evolutionary game-theoretic analysis of poker strategies. Entertainment Computing, 1(1):39–45, 2009.

[40] Achintya Prakash and Michael P. Wellman. Empirical game-theoretic analysis for moving target defense. In Proceedings of the Second ACM Workshop on Moving Target Defense, 2015.

[41] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[42] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[43] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

[44] Samuel Sokota, Caleb Ho, and Bryce Wiedenbeck. Learning deviation payoffs in simulation-based games. In AAAI Conference on Artificial Intelligence, 2019.

[45] Eilon Solan and Nicolas Vieille. Perturbed Markov chains. Journal of Applied Probability, 40(1):107–122, 2003.

[46] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS), 2012.

[47] Karl Tuyls and Simon Parsons. What evolutionary game theory tells us about multiagent learning. Artificial Intelligence, 171(7):406–416, 2007.

[48] Karl Tuyls, Julien Perolat, Marc Lanctot, Joel Z. Leibo, and Thore Graepel. A generalised method for empirical game theoretic analysis. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2018.

[49] Karl Tuyls, Julien Perolat, Marc Lanctot, Rahul Savani, Joel Leibo, Toby Ord, Thore Graepel, and Shane Legg. Symmetric decomposition of asymmetric games. Scientific Reports, 8(1):1015, 2018.

[50] Yevgeniy Vorobeychik. Probabilistic analysis of simulation-based games. ACM Transactions on Modeling and Computer Simulation, 20(3), 2010.

[51] William E. Walsh, Rajarshi Das, Gerald Tesauro, and Jeffrey O. Kephart. Analyzing complex strategic interactions in multi-agent games. In AAAI Workshop on Game Theoretic and Decision Theoretic Agents, 2002.

[52] William E. Walsh, David C. Parkes, and Rajarshi Das. Choosing samples to compute heuristic-strategy Nash equilibrium. In Proceedings of the Fifth Workshop on Agent-Mediated Electronic Commerce, 2003.

[53] Michael P. Wellman. Methods for empirical game-theoretic analysis. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2006.

[54] Michael P. Wellman, Tae Hyung Kim, and Quang Duong. Analyzing incentives for protocol compliance in complex domains: A case study of introduction-based routing. In Proceedings of the 12th Workshop on the Economics of Information Security, 2013.

[55] Bryce Wiedenbeck and Michael P. Wellman. Scaling simulation-based game analysis through deviation-preserving reduction. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2012.

[56] Yichi Zhou, Jialian Li, and Jun Zhu. Identify the Nash equilibrium in static games with random payoffs. In Proceedings of the International Conference on Machine Learning (ICML), 2017.