{"title": "New Criteria and a New Algorithm for Learning in Multi-Agent Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 1089, "page_last": 1096, "abstract": null, "full_text": "New Criteria and a New Algorithm for Learning\n in Multi-Agent Systems\n\n\n\n Rob Powers Yoav Shoham\n Computer Science Department Computer Science Department\n Stanford University Stanford University\n Stanford, CA 94305 Stanford, CA 94305\n powers@cs.stanford.edu shoham@cs.stanford.edu\n\n\n\n\n Abstract\n\n We propose a new set of criteria for learning algorithms in multi-agent\n systems, one that is more stringent and (we argue) better justified than\n previous proposed criteria. Our criteria, which apply most straightfor-\n wardly in repeated games with average rewards, consist of three require-\n ments: (a) against a specified class of opponents (this class is a parameter\n of the criterion) the algorithm yield a payoff that approaches the payoff\n of the best response, (b) against other opponents the algorithm's payoff\n at least approach (and possibly exceed) the security level payoff (or max-\n imin value), and (c) subject to these requirements, the algorithm achieve\n a close to optimal payoff in self-play. We furthermore require that these\n average payoffs be achieved quickly. We then present a novel algorithm,\n and show that it meets these new criteria for a particular parameter class,\n the class of stationary opponents. Finally, we show that the algorithm\n is effective not only in theory, but also empirically. Using a recently\n introduced comprehensive game theoretic test suite, we show that the\n algorithm almost universally outperforms previous learning algorithms.\n\n\n\n1 Introduction\n\nThere is rapidly growing interest in multi-agent systems, and in particular in learning algo-\nrithms for such systems. 
There is a growing body of proposed algorithms, and some arguments about their relative merits and domains of applicability (for example, [14] and [17]). In [15] we survey much of this literature, and argue that it suffers from not having clear objective criteria with which to evaluate each algorithm (this shortcoming is not unique to the relatively small computer science literature on multi-agent learning, and is shared by the much vaster literature on learning in game theory). In [15] we also define five different coherent agendas one could adopt, and identify one of them, the agent-centric one, as particularly relevant from the computer science point of view.

In the agent-centric agenda one asks how an agent can learn optimally in the presence of other independent agents, who may also be learning. To make the discussion precise we will concentrate on algorithms for learning in known, fully observable two-player repeated games, with average rewards. We start with the standard definition of a finite stage game (aka normal form game):

Definition 1 A two-player stage game is a tuple G = (A1, A2, u1, u2), where
 Ai is a finite set of actions available to player i
 ui : A1 × A2 → R is the utility function for player i

Figure 1 shows two well-known games from the literature, to which we'll refer again later.

In a repeated game the stage game is repeated, finitely or infinitely. The agent accumulates rewards at each round; in the finite case the agent's aggregate reward is the average of the stage-game rewards, and in the infinite case it is the limit average (we ignore the subtlety that arises when the limit does not exist, but this case does not present an essential problem).

While the vast majority of the literature on multi-agent learning (surprisingly) does not start with a precise statement of objectives, there are some exceptions, and we discuss them in the next section, including their shortcomings. 
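As a concrete illustration of Definition 1 and of the average-reward aggregate (a sketch, not from the paper; the payoff matrices are those of Figure 1):

```python
# Illustrative sketch (not part of the paper): a two-player stage game as a
# pair of payoff matrices, plus the finite-horizon average-reward aggregate.
# Payoffs are those of Figure 1a (Chicken); row player first in each cell.

CHICKEN = {
    'actions': ['Dare', 'Yield'],
    'u1': [[0, 4], [1, 2]],   # row player's payoffs u1(a1, a2)
    'u2': [[0, 1], [4, 2]],   # column player's payoffs u2(a1, a2)
}

def payoff(game, a1, a2):
    # Stage-game utilities (u1, u2) for an action pair, by action index.
    return game['u1'][a1][a2], game['u2'][a1][a2]

def average_reward(game, history):
    # Finite-case aggregate: the mean of the row player's stage rewards.
    rewards = [payoff(game, a1, a2)[0] for a1, a2 in history]
    return sum(rewards) / len(rewards)

# Alternating Dare/Yield in Chicken averages (4 + 1) / 2 = 2.5 for each player.
history = [(0, 1), (1, 0)] * 50
assert average_reward(CHICKEN, history) == 2.5
```

This alternating history is exactly the non-stationary self-play pattern discussed for Chicken later in the paper.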
In the following section we propose a stronger set of criteria that, we believe, does not suffer from these limitations. We then present an algorithm that provably meets these stronger requirements. However, we believe that all formal requirements, including our own, are merely baseline guarantees, and any proposed algorithm must be subjected to empirical tests. While many previous proposals provide empirical results, we think it is fair to say that our level of empirical validation is unprecedented in the literature. We show the results of tests for all pairwise comparisons of major existing algorithms, using a recently developed game theoretic testbed called GAMUT [13] to systematically sample a very large space of games.

2 Previous criteria for multi-agent learning

To our knowledge, Bowling and Veloso [1] were the first in the AI community to explicitly put forth formal requirements. Specifically they proposed two criteria:

Rationality: If the other players' policies converge to stationary policies then the learning algorithm will converge to a stationary policy that is a best response (in the stage game) to the other players' policies.

Convergence: The learner will necessarily converge to a stationary policy.

Throughout this paper, we define a stationary policy as one that selects an action at each point during the game by drawing from the same distribution, regardless of past history.

Bowling and Veloso considered known repeated games and proposed an algorithm that provably meets their criteria in 2x2 games (games with two players and two actions per player). Later, Conitzer and Sandholm [5] adopted the same criteria, and demonstrated an algorithm meeting the criteria for all repeated games.

At first glance these criteria are reasonable, but a deeper look is less satisfying. 
First, note that the property of convergence cannot be applied unconditionally, since one cannot ensure that a learning procedure converges against all possible opponents without sacrificing rationality. So implicit in that requirement is some limitation on the class of opponents. And indeed both [1] and [5] acknowledge this and choose to concentrate on the case of self-play, that is, on opponents that are identical to the agent in question.

 (a) Chicken               Dare    Yield
      Dare                 0, 0    4, 1
      Yield                1, 4    2, 2

 (b) Prisoner's Dilemma    Cooperate    Defect
      Cooperate            3, 3         0, 4
      Defect               4, 0         1, 1

Figure 1: Example stage games. The payoff for the row player is given first in each cell, with the payoff for the column player following.

We will have more to say about self-play later, but there are other aspects of these criteria that bear discussion. While it is fine to consider opponents playing stationary policies, there are other classes of opponents that might be as relevant or even more relevant; this should be a degree of freedom in the definition of the problem. For instance, one might be interested in the classes of opponents that can be modeled by finite automata with at most k states; these include both stationary and non-stationary strategies.

We find the property of requiring convergence to a stationary strategy particularly hard to justify. Consider the Prisoner's Dilemma game in Figure 1. The Tit-for-Tat algorithm1 achieves an average payoff of 3 in self-play, while the unique Nash equilibrium of the stage game has a payoff of only 1. Similarly, in the game of Chicken, also shown in Figure 1, a strategy that alternates daring while its opponent yields and yielding while its opponent dares achieves a higher expected payoff in self-play than any stationary policy could guarantee. This problem is directly addressed in [2], which makes a counter-proposal for how to consider equilibria in repeated games. 
But there is also a fundamental issue with these two criteria: they can both be thought of as requirements on the play of the agent, rather than on the reward the agent receives.

Our final point regarding these two criteria is that they express properties that hold in the limit, with no requirements whatsoever on the algorithm's performance in any finite period.

This question is not new, however; it has been addressed numerous times in game theory, under the names of universal consistency, no-regret learning, and the Bayes envelope, dating back to [9] (see [6] for an overview of this history). There is a fundamental similarity in approach throughout, and we will take the two criteria proposed in [7] as representative.

Safety: The learning rule must guarantee at least the minimax payoff of the game. (The minimax payoff is the maximum expected value a player can guarantee against any possible opponent.)

Consistency: The learning rule must guarantee that it does at least as well as the best response to the empirical distribution of play when playing against an opponent whose play is governed by independent draws from any fixed distribution.

They then define universal consistency as the requirement that a learning rule do at least as well as the best response to the empirical distribution regardless of the actual strategy the opponent is employing (this implies both safety and consistency) and show that a modification of the fictitious play algorithm [3] achieves this requirement. A limitation common to these game theory approaches is that they were designed for large-population games and therefore ignore the effect of the agent's play on the future play of the opponent. But this can pose problems in smaller games. 
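The fictitious play idea referred to above can be sketched in a few lines (an illustrative sketch of the classical algorithm [3], not the paper's variant): the learner tracks the empirical distribution of the opponent's actions and best-responds to it each round.

```python
# Illustrative fictitious play for the row player of a bimatrix game
# (a sketch of the classical algorithm, not the paper's exact variant).

def best_response(u1, opp_counts):
    # Best response to the empirical distribution of opponent actions.
    total = sum(opp_counts)
    values = [
        sum(u1[a][b] * opp_counts[b] / total for b in range(len(opp_counts)))
        for a in range(len(u1))
    ]
    return max(range(len(values)), key=values.__getitem__)

def fictitious_play(u1, opponent_policy, rounds=100):
    # Play `rounds` steps against an opponent policy; return the average payoff.
    opp_counts = [1] * len(u1[0])   # uniform prior over opponent actions
    total_reward = 0.0
    for _ in range(rounds):
        a = best_response(u1, opp_counts)
        b = opponent_policy()
        total_reward += u1[a][b]
        opp_counts[b] += 1
    return total_reward / rounds

# Prisoner's Dilemma row payoffs (Figure 1b). Against a stationary opponent
# that always cooperates (action 0), Defect (action 1) is the best response.
PD_U1 = [[3, 0], [4, 1]]
avg = fictitious_play(PD_U1, lambda: 0, rounds=50)
assert avg == 4.0   # converges immediately to the best-response payoff
```

Note that against a responsive opponent such as Tit-for-Tat, this same learner would lock into mutual defection, which is exactly the limitation discussed next.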
Consider the game of Prisoner's Dilemma once again. Even if the opponent is playing Tit-for-Tat, the only universally consistent strategy would be to defect at every time step, ruling out the higher payoff achievable by cooperating.

3 A new set of criteria for learning

We will try to take the best of each proposal and create a joint set of criteria with the potential to address some of the limitations mentioned above.

We wish to keep the notion of optimality against a specific set of opponents. But instead of restricting this set in advance, we'll make this a parameter of the properties. Acknowledging that we may encounter opponents outside our target set, we will also incorporate the requirement of safety, which guarantees we achieve at least the security value, also known as the maximin payoff, for the stage game.

 1The Tit-for-Tat algorithm cooperates in the first round and then for each successive round plays the action its opponent played in the previous round.

As a possible motivation for our approach, consider the game of Rock-Paper-Scissors, which despite its simplicity has motivated several international tournaments. While the unique Nash equilibrium policy is to randomize, the winners of the tournaments are those players who can most effectively exploit opponents who deviate, without being exploited in turn.

The question remains of how best to handle self-play. One method would be to require that a proposed algorithm be added to the set of opponents it is required to play a best response to. While this may seem appealing at first glance, it can be a very weak requirement on the actual payoff the agent receives. Since our opponent is no longer independent of our choice of strategy, we can do better than settling for just any mutual best response, and try to maximize the value we achieve as well. 
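The security value can be computed for a finite game by solving a small linear program over mixed strategies; as a lighter-weight illustration (a sketch, not from the paper), restricting to pure strategies gives a simple max-min enumeration that lower-bounds it:

```python
# Illustrative sketch: the pure-strategy maximin value of a game. The true
# security value ranges over mixed strategies (a linear program); the
# pure-strategy version below is a lower bound on it.

def pure_maximin(u1):
    # max over our actions of the min payoff over opponent actions.
    return max(min(row) for row in u1)

# Row payoffs for the games of Figure 1.
CHICKEN_U1 = [[0, 4], [1, 2]]   # Dare, Yield
PD_U1 = [[3, 0], [4, 1]]        # Cooperate, Defect

assert pure_maximin(CHICKEN_U1) == 1   # Yield guarantees at least 1
assert pure_maximin(PD_U1) == 1        # Defect guarantees at least 1
```

In both example games the bound happens to be tight, since no mixture improves the guaranteed payoff; in general the mixed-strategy security value can be strictly higher.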
We therefore propose requiring that the algorithm achieve at least the value of some Nash equilibrium that is Pareto efficient over the set of Nash equilibria.2 Similarly, algorithms exist that satisfy `universal consistency' and, if played by all agents, will converge to a correlated equilibrium [10], but this result provides an even weaker constraint on the actual payoff received than convergence to a Nash equilibrium.

Let k be the number of outcomes for the game and b the maximum possible difference in payoffs across the outcomes. We require that for any choice of ε > 0 and δ > 0 there exist a T0, polynomial in 1/ε, 1/δ, k, and b, such that for any number of rounds t > T0 the algorithm achieves the following payoff guarantees with probability at least 1 - δ.

Targeted Optimality: When the opponent is a member of the selected set of opponents, the average payoff is at least VBR - ε, where VBR is the expected value of the best response in terms of average payoff against the actual opponent.

Compatibility: During self-play, the average payoff is at least VselfPlay - ε, where VselfPlay is defined as the minimum value achieved by the player in any Nash equilibrium that is not Pareto dominated by another Nash equilibrium.

Safety: Against any opponent, the average payoff is at least Vsecurity - ε, with Vsecurity defined as max_{σ1∈Σ1} min_{σ2∈Σ2} EV(σ1, σ2).3

4 An algorithm

While we feel designing algorithms for use against more complex classes of opponents is critical, as a minimal requirement we first show an algorithm that meets the above criteria for the class of stationary opponents that has been the focus of much of the existing work. Our method incorporates modifications of three simple strategies, Fictitious Play [3], Bully [12], and the maximin strategy, in order to create a more powerful hybrid algorithm.

Fictitious Play has been shown to converge in the limit to the best response against a stationary opponent. 
Each round it plays its best response to the most likely stationary opponent given the history of play. Our implementation uses a somewhat more generous best-response calculation so as to achieve our performance requirements during self-play.

 BR_ε(σ) ≡ argmax_{x∈X(σ,ε)} EOV(x, σ),4

 where X(σ, ε) = {y ∈ Σ1 : EV(y, σ) ≥ max_{z∈Σ1} EV(z, σ) - ε}

 2An outcome is Pareto efficient over a set if there is no other outcome in that set with a payoff at least as high for every agent and strictly higher for at least one agent.
 3Throughout the paper, we use EV(σ1, σ2) to indicate the expected payoff to a player for playing strategy σ1 against an opponent playing σ2 and EOV(σ1, σ2) as the expected payoff the opponent achieves. Σ1 and Σ2 are the sets of mixed strategies for the agent and its opponent respectively.
 4Note that BR_0(σ) is a member of the standard set of best response strategies to σ.

We extend the Bully algorithm to consider the full set of mixed strategies and again maximize our opponent's value when multiple strategies yield equal payoff for our agent.

 BullyMixed ≡ argmax_{x∈X} EOV(x, BR_0(x)),

 where X = {y ∈ Σ1 : EV(y, BR_0(y)) = max_{z∈Σ1} EV(z, BR_0(z))}

The maximin strategy is defined as

 Maximin ≡ argmax_{σ1∈Σ1} min_{σ2∈Σ2} EV(σ1, σ2)

We will now show how to combine these strategies into a single method satisfying all three criteria. 
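To make the Bully idea concrete, here is an illustrative sketch restricted to pure strategies (the paper's BullyMixed optimizes over mixed strategies and breaks ties in the opponent's favor): Bully commits to the action that maximizes its own payoff under the assumption that the opponent will best-respond to it.

```python
# Illustrative pure-strategy Bully (a simplification; the paper's BullyMixed
# searches mixed strategies and maximizes the opponent's value among ties).

def bully_pure(u1, u2):
    # Commit to the row maximizing our payoff, assuming the opponent
    # best-responds (maximizes its own payoff) to our committed row.
    best_a, best_v = None, float('-inf')
    for a in range(len(u1)):
        # Opponent's best response to our committed action `a`.
        b = max(range(len(u2[a])), key=lambda j: u2[a][j])
        if u1[a][b] > best_v:
            best_a, best_v = a, u1[a][b]
    return best_a, best_v

# Chicken (Figure 1a): committing to Dare forces a rational opponent to
# Yield, so Bully dares and collects 4.
CHICKEN_U1 = [[0, 4], [1, 2]]
CHICKEN_U2 = [[0, 1], [4, 2]]
assert bully_pure(CHICKEN_U1, CHICKEN_U2) == (0, 4)
```

The sketch shows why Bully does well against adaptive opponents (they learn to yield to it) while faring poorly against other fixed strategies that never adapt.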
In the code shown below, t is the current round, AvgValue_m is the average value achieved by the agent during the last m rounds, VBully is shorthand for EV(BullyMixed, BR_0(BullyMixed)), and d_t1^t2 represents the distribution of opponent actions for the period from round t1 to round t2.

 Set strategy = BullyMixed
 for τ1 time steps
     Play strategy
 for τ2 time steps
     if (strategy == BullyMixed AND AvgValue_H < VBully - ε1)
         With probability p, set strategy = BR_ε2(d_0^t)
     Play strategy
 if ||d_0^τ1 - d_τ1^t|| < ε3
     Set bestStrategy = BR_ε2(d_0^t)
 else if (strategy == BullyMixed AND AvgValue_H > VBully - ε1)
     Set bestStrategy = BullyMixed
 else
     Set bestStrategy = BestResponse
 while not end of game
     if AvgValue_τ0 < Vsecurity - ε0
         Play maximin strategy for τ3 time steps
     else
         Play bestStrategy for τ3 time steps

The algorithm starts out with a coordination/exploration period in which it attempts to determine what class its opponent is in. At the end of this period it chooses one of three strategies for the rest of the game. If it determines its opponent may be stationary it settles on a best response to the history up until that point. Otherwise, if the BullyMixed strategy has been performing well it maintains it. If neither of these conditions holds, it adopts a default strategy, which we have set to be the BestResponse strategy. This strategy changes each round, playing the best response to the maximum likelihood opponent strategy based on the last H rounds of play. 
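The three-phase control flow can be sketched in runnable form (a simplified sketch with hypothetical stub strategies and thresholds standing in for the paper's BullyMixed, best response, maximin, and ε parameters; not the paper's implementation):

```python
# Simplified, runnable sketch of the probe / classify / commit control flow.
# Strategies are callables returning an action; `play_round` maps an action
# to the agent's stage reward. All names and thresholds are illustrative.
import random

def meta_strategy(play_round, bully, best_response, maximin,
                  v_bully, v_security, t1=50, t2=50, horizon=500, eps=0.25):
    rewards, strategy = [], bully
    for _ in range(t1):                       # phase 1: probe with Bully
        rewards.append(play_round(strategy()))
    for _ in range(t2):                       # phase 2: maybe abandon Bully
        recent = sum(rewards[-t1:]) / t1
        if strategy is bully and recent < v_bully - eps and random.random() < 0.5:
            strategy = best_response
        rewards.append(play_round(strategy()))
    best = strategy                           # phase 3: commit, with safety net
    while len(rewards) < horizon:
        recent = sum(rewards[-t1:]) / t1
        active = maximin if recent < v_security - eps else best
        rewards.append(play_round(active()))
    return sum(rewards) / len(rewards)

# Demo with stub strategies: Bully earns 2.0 every round, never triggers the
# switch or the safety net, so the long-run average is exactly 2.0.
reward_of = {0: 2.0, 1: 1.0}
avg = meta_strategy(lambda a: reward_of[a], lambda: 0, lambda: 1, lambda: 1,
                    v_bully=2.0, v_security=0.0)
assert avg == 2.0
```

The probabilistic switch in phase 2 mirrors the `With probability p' step above, and the phase-3 check mirrors the reversion to maximin whenever the running average drops below the security level.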
Once one of these strategies has been selected, the algorithm plays according to it whenever the average value meets or exceeds the security level, reverting to the maximin strategy if the value drops too low.

Theorem 1 Our algorithm satisfies the three properties stated in section 3 for the class of stationary opponents, with a T0 proportional to (b/ε)^3 (1/δ).

This theorem can be proven for all three properties using a combination of basic probability theory and repeated applications of the Hoeffding inequality [11], but the proof itself is prohibitively long for inclusion in this publication.

5 Empirical results

Although satisfying the criteria we put forth is comforting, we feel this is but a first step in making a compelling argument that an approach might be useful in practice. Traditionally, researchers suggesting a new algorithm also include an empirical comparison of the algorithm to previous work. While we think this is a critical component of evaluating an algorithm, most prior work has used tests against just one or two other algorithms on a very narrow set of test environments, which often vary from researcher to researcher. This practice has made it hard to consistently compare the performance of different algorithms.

In order to address this situation, we have begun implementing a collection of existing algorithms. Combining this set of algorithms with a wide variety of repeated games from GAMUT [13], a game theoretic test suite, we have the beginnings of a comprehensive testbed for multi-agent learning algorithms. In the rest of this section, we'll concentrate on the results for our algorithm, but we hope that this testbed can form the foundation for a broad, consistent framework of empirical testing in multi-agent learning going forward. 
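The Hoeffding inequality used in the proof also gives a feel for how such a T0 scales: to estimate an average payoff over a range of size b to within ε with confidence 1 - δ, it suffices to observe T ≥ (b²/(2ε²)) ln(2/δ) rounds. A quick illustration (not from the paper):

```python
# Rounds needed so the empirical average payoff is within eps of its mean
# with probability at least 1 - delta, by the Hoeffding inequality:
#   P(|avg - mean| >= eps) <= 2 * exp(-2 * T * eps**2 / b**2)
import math

def hoeffding_rounds(b, eps, delta):
    # Smallest T with 2 * exp(-2 * T * eps^2 / b^2) <= delta.
    return math.ceil(b * b / (2 * eps * eps) * math.log(2 / delta))

# Payoff range 4 (as in the Figure 1 games), eps = 0.5, delta = 0.05:
T = hoeffding_rounds(4, 0.5, 0.05)
assert T == 119
```

This is only the estimation cost for a single fixed strategy; the polynomial T0 of Theorem 1 must also cover the exploration and classification phases.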
For all of our environments we conducted our tests using a tournament format, where each algorithm plays all other algorithms including itself.

Figure 2: Average value for last 20K rounds (of 200K) across all games in GAMUT. (Bar chart of the value achieved by MetaStrategy, StochFP, StochIGA, WoLF-PHC, BullyMixed, and MiniMax against each opponent.)

Let us first consider the results of a tournament over a full set of games in GAMUT. Figure 2 portrays the average value achieved by each agent (y-axis) averaged over all games, when playing different opponents (x-axis). The set of agents includes our strategy (MetaStrategy), six different adaptive learning approaches (Stochastic Fictitious Play [3,8], Stochastic IGA [16], WoLF-PHC [1], Hyper-Q learning [18], Local Q-learning [19], and JointQ-Max [4], which learns Q-values over the joint action space but assumes its opponent will cooperate to maximize its payoff), and four fixed strategies (BullyMixed, Bully [12], the maximin strategy, and Random, which selects a stationary mixed strategy at random). We have chosen a subset of the most successful algorithms to display on the graph. Against the four stationary opponents, all of the adaptive learners fared equally well, while fixed strategy players achieved poor rewards. In contrast, BullyMixed fared well against the adaptive algorithms. As desired, our new algorithm combined the best of these characteristics to achieve the highest value against all opponents except itself. It fares worse than BullyMixed since it will always yield to BullyMixed, giving away the more advantageous outcome in games like Chicken. However, when comparing how each agent performs in self-play, our algorithm scores quite well, finishing a close second to Hyper-Q learning while the two Bully algorithms finish near last. 
Hyper-Q is able to gain in self-play by occasionally converging to outcomes with high social welfare that our strategy does not consider.

Figure 3: Percent of maximum value for last 20K rounds (of 200K) averaged across all opponents for selected games in GAMUT. The rewards were divided by the maximum reward achieved by any agent to make visual comparisons easier. (Bar chart of MetaStrategy, StochFP, StochIGA, WoLF-PHC, BullyMixed, and MiniMax over a selection of GAMUT games.)

So far we've seen that our new algorithm performs well when playing against a variety of opponents. In Figure 3 we show the reward for each agent, averaged across the set of possible opponents for a selection of games in GAMUT. Once again our algorithm outperforms the existing algorithms in nearly all games. When it fails to achieve the highest reward it often appears to be due to its policy of \"generosity\"; in games where it has multiple actions yielding equal value, it chooses a best response that maximizes its opponent's value.

The ability to study how individual strategies fare in each class of environment reflects an advantage of our more comprehensive testing approach. In future work, this data can be used both to aid in the selection of an appropriate algorithm for a new environment and to pinpoint areas where an algorithm might be improved. 
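The tournament format used throughout this section (every algorithm plays every other, including itself) is straightforward to harness; a minimal round-robin sketch (with a hypothetical agent interface, not the testbed's actual API):

```python
# Minimal round-robin tournament harness (illustrative; agents here are
# simple callables `agent(my_history, opp_history) -> action`, which is a
# hypothetical interface, not the GAMUT testbed's API).
from itertools import product

def run_match(u1, u2, agent_a, agent_b, rounds=100):
    # Average payoff of each agent over one repeated game.
    hist_a, hist_b, tot_a, tot_b = [], [], 0.0, 0.0
    for _ in range(rounds):
        a = agent_a(hist_a, hist_b)
        b = agent_b(hist_b, hist_a)
        tot_a += u1[a][b]
        tot_b += u2[a][b]
        hist_a.append(a)
        hist_b.append(b)
    return tot_a / rounds, tot_b / rounds

def tournament(u1, u2, agents):
    # All pairwise matches, self-play included; mean row-player score per agent.
    scores = {name: [] for name in agents}
    for (na, fa), (nb, fb) in product(agents.items(), repeat=2):
        va, vb = run_match(u1, u2, fa, fb)
        scores[na].append(va)
    return {name: sum(v) / len(v) for name, v in scores.items()}

# Prisoner's Dilemma with two fixed agents: always-defect beats always-cooperate.
PD_U1 = [[3, 0], [4, 1]]
PD_U2 = [[3, 4], [0, 1]]
agents = {'cooperate': lambda h, o: 0, 'defect': lambda h, o: 1}
result = tournament(PD_U1, PD_U2, agents)
assert result['defect'] > result['cooperate']
```

Averaging each agent's score over all opponents, as in Figures 2 and 3, is then a single pass over the score table.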
Note that we use environment here to indicate a combination of both the game and the distribution over opponents.

6 Conclusions and Future Work

Our objective in this work was to put forth a new set of criteria for evaluating the performance of multi-agent learning algorithms as well as to propose a more comprehensive method for empirical testing. In order to motivate this new approach for vetting algorithms, we have presented a novel algorithm that meets our criteria and outperforms existing algorithms in a wide variety of environments. We are continuing to work actively to extend our approach. In particular, we wish to demonstrate the generality of our approach by providing algorithms that calculate best responses to different sets of opponents (conditional strategies, finite automata, etc.). Additionally, the criteria need to be generalized for n-player games, and we hope to combine our method for known games with methods for learning the structure of the game, ultimately devising new algorithms for unknown stochastic games.

Acknowledgements

This work was supported in part by a Benchmark Stanford Graduate Fellowship, DARPA grant F30602-00-2-0598, and NSF grant IIS-0205633.

References

[1] Bowling, M. & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136, pp. 215-250.

[2] Brafman, R. & Tennenholtz, M. (2002). Efficient Learning Equilibrium. In Advances in Neural Information Processing Systems 15.

[3] Brown, G. (1951). Iterative Solution of Games by Fictitious Play. In Activity Analysis of Production and Allocation. New York: John Wiley and Sons.

[4] Claus, C. & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the National Conference on Artificial Intelligence, pp. 746-752.

[5] Conitzer, V. & Sandholm, T. (2003). 
AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents. In Proceedings of the 20th International Conference on Machine Learning, pp. 83-90, Washington, DC.

[6] Foster, D. & Vohra, R. (1999). Regret in the on-line decision problem. Games and Economic Behavior, 29:7-36.

[7] Fudenberg, D. & Levine, D. (1995). Universal consistency and cautious fictitious play. Journal of Economic Dynamics and Control, 19:1065-1089.

[8] Fudenberg, D. & Levine, D. (1998). The theory of learning in games. MIT Press.

[9] Hannan, J. (1957). Approximation to Bayes risk in repeated plays. Contributions to the Theory of Games, 3:97-139.

[10] Hart, S. & Mas-Colell, A. (2000). A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127-1150.

[11] Hoeffding, W. (1956). On the distribution of the number of successes in independent trials. Annals of Mathematical Statistics, 27:713-721.

[12] Littman, M. & Stone, P. (2001). Implicit Negotiation in Repeated Games. In Proceedings of the Eighth International Workshop on Agent Theories, Architectures, and Languages, pp. 393-404.

[13] Nudelman, E., Wortman, J., Leyton-Brown, K., & Shoham, Y. (2004). Run the GAMUT: A Comprehensive Approach to Evaluating Game-Theoretic Algorithms. AAMAS-2004. To appear.

[14] Sen, S. & Weiss, G. (1998). Learning in multiagent systems. In Multiagent systems: A modern introduction to distributed artificial intelligence, chapter 6, pp. 259-298, MIT Press.

[15] Shoham, Y., Powers, R., & Grenager, T. (2003). Multi-Agent Reinforcement Learning: a critical survey. Technical Report.

[16] Singh, S., Kearns, M., & Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In Proceedings of UAI-2000, pp. 541-548, Morgan Kaufman.

[17] Stone, P. & Veloso, M. (2000). Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3).

[18] Tesauro, G. (2004). Extending Q-Learning to General Adaptive Multi-Agent Systems. In Advances in Neural Information Processing Systems 16.

[19] Watkins, C. & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8(3):279-292.
", "award": [], "sourceid": 2680, "authors": [{"given_name": "Rob", "family_name": "Powers", "institution": null}, {"given_name": "Yoav", "family_name": "Shoham", "institution": null}]}