{"title": "Playing is believing: The role of beliefs in multi-agent learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1483, "page_last": 1490, "abstract": "", "full_text": "Playing is believing: The role of beliefs in multi-agent learning\n\nYu-Han Chang\nArtificial Intelligence Laboratory\nMassachusetts Institute of Technology\nCambridge, Massachusetts 02139\nychang@ai.mit.edu\n\nLeslie Pack Kaelbling\nArtificial Intelligence Laboratory\nMassachusetts Institute of Technology\nCambridge, Massachusetts 02139\nlpk@ai.mit.edu\n\nAbstract\n\nWe propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the long run against fair opponents.\n\n1 Introduction\n\nThe topic of learning in multi-agent environments has received increasing attention over the past several years. Game theorists have begun to examine learning models in their study of repeated games, and reinforcement learning researchers have begun to extend their single-agent learning models to the multiple-agent case. As traditional models and methods from these two fields are adapted to tackle the problem of multi-agent learning, the central issue of optimality is worth revisiting. What do we expect a successful learner to do?\n\nMatrix games and Nash equilibrium. From the game theory perspective, the repeated game is a generalization of the traditional one-shot game, or matrix game. The matrix game is defined as a reward matrix R_i for each player, R_i : A_1 × A_2 → ℝ, where A_i is the set of actions available to player i. 
Purely competitive games are called zero-sum games and must satisfy R_1 = −R_2. Each player simultaneously chooses to play a particular action a_i ∈ A_i, or a mixed policy μ_i ∈ PD(A_i), which is a probability distribution over the possible actions, and receives reward based on the joint action taken. Some common examples of single-shot matrix games are shown in Figure 1. The traditional assumption is that each player has no prior knowledge about the other player. As is standard in the game theory literature, it is thus reasonable to assume that the opponent is fully rational and chooses actions that are in its best interest. In return, we must play a best response to the opponent\u2019s choice of action. A best response function for player i, BR_i(μ_{−i}), is defined to be the set of all optimal policies for player i, given that the other players are playing the joint policy μ_{−i}: BR_i(μ_{−i}) = {μ*_i ∈ M_i | R_i(μ*_i, μ_{−i}) ≥ R_i(μ_i, μ_{−i}) ∀μ_i ∈ M_i}, where M_i is the set of all possible policies for agent i.\n\nFigure 1: Some common examples of single-shot matrix games. Each matrix is written row by row, with rows indexed by player 1\u2019s action and columns by player 2\u2019s action.\n(a) Matching pennies: R_1 = [[−1, 1], [1, −1]], R_2 = −R_1.\n(b) Rock-Paper-Scissors: R_1 = [[0, −1, 1], [1, 0, −1], [−1, 1, 0]], R_2 = −R_1.\n(c) Hawk-Dove: R_1 = [[0, 3], [1, 2]], R_2 = [[0, 1], [3, 2]].\n(d) Prisoner\u2019s Dilemma: R_1 = [[2, 0], [3, 1]], R_2 = [[2, 3], [0, 1]].\n\nIf all players are playing best responses to the other players\u2019 strategies, μ_i ∈ BR_i(μ_{−i}) ∀i, then the game is said to be in Nash equilibrium. Once all players are playing by a Nash equilibrium, no single player has an incentive to unilaterally deviate from his equilibrium policy. 
Any game can be solved for its Nash equilibria using quadratic programming, and a player can choose an optimal strategy in this fashion, given prior knowledge of the game structure. The only problem arises when there are multiple Nash equilibria. If the players do not manage to coordinate on one equilibrium joint policy, then they may all end up worse off. The Hawk-Dove game shown in Figure 1(c) is a good example of this problem. The two pure-strategy Nash equilibria occur at (1,2) and (2,1), but if the players do not coordinate, they may end up playing the joint action (1,1) and receive 0 reward.\n\nStochastic games and reinforcement learning. Despite these problems, there is general agreement that Nash equilibrium is an appropriate solution concept for one-shot games. In contrast, for repeated games there is a range of different perspectives. Repeated games generalize one-shot games by assuming that the players repeat the matrix game over many time periods. Researchers in reinforcement learning view repeated games as a special case of stochastic, or Markov, games. Researchers in game theory, on the other hand, view repeated games as an extension of their theory of one-shot matrix games. The resulting frameworks are similar, but with a key difference in their treatment of game history. Reinforcement learning researchers focus their attention on choosing a single stationary policy μ that will maximize the learner\u2019s expected rewards in all future time periods given that we are in time t, max_μ E_μ[Σ_{τ=t}^{T} γ^{τ−t} R_τ(μ)], where T may be finite or infinite, and μ ∈ PD(A). In the infinite time-horizon case, we often include the discount factor 0 < γ < 1.\n\nLittman [1] analyzes this framework for zero-sum games, proving convergence to the Nash equilibrium for his minimax-Q algorithm playing against another minimax-Q agent. 
Claus and Boutilier [2] examine cooperative games where R_1 = R_2, and Hu and Wellman [3] focus on general-sum games. These algorithms share the common goal of finding and playing a Nash equilibrium. Littman [4] and Hall and Greenwald [5] further extend this approach to consider variants of Nash equilibrium for which convergence can be guaranteed. Bowling and Veloso [6] and Nagayuki et al. [7] propose to relax the mutual optimality requirement of Nash equilibrium by considering rational agents, which always learn to play a stationary best response to their opponent\u2019s strategy, even if the opponent is not playing an equilibrium strategy. The motivation is that it allows our agents to act rationally even if the opponent is not acting rationally because of physical or computational limitations. Fictitious play [8] is a similar algorithm from game theory.\n\nTable 1: Summary of multi-agent learning algorithms under our new classification.\n\n| | B_0 | B_1 | B_∞ |\n| H_0 | minimax-Q, Nash-Q | - | Bully |\n| H_1 | - | - | Godfather |\n| H_∞ | Q-learning (Q_0), (WoLF-)PHC, fictitious play | Q_1 | multiplicative-weight* |\n\n* assumes public knowledge of the opponent\u2019s policy at each period\n\nGame theoretic perspective of repeated games. As alluded to in the previous section, game theorists often take a more general view of optimality in repeated games. The key difference is the treatment of the history of actions taken in the game. Recall that in the stochastic game model, we took μ_i ∈ PD(A_i). Here we redefine μ_i : H → PD(A_i), where H = ∪_t H^t and H^t is the set of all possible histories of length t. Histories are observations of joint actions, h^t = (a_i, a_{−i}, h^{t−1}). Player i\u2019s strategy at time t is then expressed as μ_i(h^{t−1}). In essence, we are endowing our agent with memory. 
Moreover, the agent ought to be able to form beliefs about the opponent\u2019s strategy, and these beliefs ought to converge to the opponent\u2019s actual strategy given sufficient learning time. Let β_i : H → PD(A_{−i}) be player i\u2019s belief about the opponent\u2019s strategy. Then a learning path is defined to be a sequence of histories, beliefs, and personal strategies. Now we can define the Nash equilibrium of a repeated game in terms of our personal strategy and our beliefs about the opponent. If our prediction about the opponent\u2019s strategy is accurate, then we can choose an appropriate best-response strategy. If this holds for all players in the game, then we are guaranteed to be in Nash equilibrium.\n\nProposition 1.1. A learning path {(h^t, μ_i(h^{t−1}), β_i(h^{t−1})) | t = 1, 2, ...} converges to a Nash equilibrium iff the following two conditions hold:\n\n• Optimization: ∀t, μ_i(h^{t−1}) ∈ BR_i(β_i(h^{t−1})). We always play a best response to our prediction of the opponent\u2019s strategy.\n\n• Prediction: lim_{t→∞} |β_i(h^{t−1}) − μ_{−i}(h^{t−1})| = 0. Over time, our belief about the opponent\u2019s strategy converges to the opponent\u2019s actual strategy.\n\nHowever, Nachbar and Zame [9] show that this requirement of simultaneous prediction and optimization is impossible to achieve, given certain assumptions about our possible strategies and possible beliefs. We can never design an agent that will learn to both predict the opponent\u2019s future strategy and optimize over those beliefs at the same time. Despite this fact, if we assume some extra knowledge about the opponent, we can design an algorithm that approximates the best-response stationary policy over time against any opponent. In the game theory literature, this concept is often called universal consistency. 
Fudenberg and Levine [8] and Freund and Schapire [10] independently show that a multiplicative-weight algorithm exhibits universal consistency from the game theory and machine learning perspectives. This gives us a strong result, but requires the strong assumption that we know the opponent\u2019s policy at each time period. This is typically not the case.\n\n2 A new classification and a new algorithm\n\nWe propose a general classification that categorizes algorithms by the cross-product of their possible strategies and their possible beliefs about the opponent\u2019s strategy, H × B. An agent\u2019s possible strategies can be classified based upon the amount of history it has in memory, from H_0 to H_∞. Given more memory, the agent can formulate more complex policies, since policies are maps from histories to action distributions. H_0 agents are memoryless and can only play stationary policies. Agents that can recall the actions from the previous time period are classified as H_1 and can execute reactive policies. At the other extreme, H_∞ agents have unbounded memory and can formulate ever more complex strategies as the game is played over time. An agent\u2019s belief classification mirrors the strategy classification in the obvious way. Agents that believe their opponent is memoryless are classified as B_0 players, B_t players believe that the opponent bases its strategy on the previous t periods of play, and so forth. Although not explicitly stated, most existing algorithms make assumptions and thus hold beliefs about the types of possible opponents in the world.\n\nWe can think of each H_s × B_t as a different league of players, with players in each league roughly equal to one another in terms of their capabilities. Clearly some leagues contain less capable players than others. We can thus define a fair opponent as an opponent from an equal or lesser league. 
The idea is that new learning algorithms should ideally be designed to beat any fair opponent.\n\nThe key role of beliefs. Within each league, we assume that players are fully rational in the sense that they can fully use their available histories to construct their future policy. However, an important observation is that the definition of full rationality depends on their beliefs about the opponent. If we believe that our opponent is a memoryless player, then even if we are an H_∞ player, our fully rational strategy is to simply model the opponent\u2019s stationary strategy and play our stationary best response. Thus, our belief capacity and our history capacity are inter-related. Without a rich set of possible beliefs about our opponent, we cannot make good use of our available history. Similarly, and perhaps more obviously, without a rich set of historical observations, we cannot hope to model complex opponents.\n\nDiscussion of current algorithms. Many of the existing algorithms fall within the H_∞ × B_0 league. As discussed in the previous section, the problem with these players is that even though they have full access to the history, their fully rational strategy is stationary due to their limited belief set. A general example of an H_∞ × B_0 player is the policy hill climber (PHC). It maintains a policy and updates the policy based upon its history in an attempt to maximize its rewards. Originally PHC was created for stochastic games, and thus each policy also depends on the current state s. In our repeated games, there is only one state.\n\nFor agent i, Policy Hill Climbing (PHC) proceeds as follows:\n1. Let α and δ be the learning rates. Initialize\n\nQ(s, a) ← 0, μ_i(s, a) ← 1/|A_i| ∀s ∈ S, a ∈ A_i.\n\n2. Repeat,\na. From state s, select action a according to the mixed policy μ_i(s) with some exploration.\nb. 
Observing reward r and next state s′, update\n\nQ(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a′} Q(s′, a′)).\n\nc. Update μ_i(s, a) and constrain it to a legal probability distribution:\n\nμ_i(s, a) ← μ_i(s, a) + δ if a = argmax_{a′} Q(s, a′), and μ_i(s, a) ← μ_i(s, a) − δ/(|A_i| − 1) otherwise.\n\nThe basic idea of PHC is that the Q-values help us to define a gradient upon which we execute hill-climbing. Bowling and Veloso\u2019s WoLF-PHC [6] modifies PHC by adjusting δ depending on whether the agent is \u201cwinning\u201d or \u201closing.\u201d True to their league, PHC players play well against stationary opponents.\n\nAt the opposite end of the spectrum, Littman and Stone [11] propose algorithms in H_0 × B_∞ and H_1 × B_∞ that are leader strategies in the sense that they choose a fixed strategy and hope that their opponent will \u201cfollow\u201d by learning a best response to that fixed strategy. Their \u201cBully\u201d algorithm chooses a fixed memoryless stationary policy, while \u201cGodfather\u201d has memory of the last time period. Opponents included normal Q-learning and Q_1 players, which are similar to Q-learners except that they explicitly learn using one period of memory because they believe that their opponent is also using memory to learn. The interesting result is that \u201cGodfather\u201d is able to achieve non-stationary equilibria against Q_1 in the repeated prisoner\u2019s dilemma game, with rewards for both players that are higher than the stationary Nash equilibrium rewards. This demonstrates the power of having belief models. However, because these algorithms do not have access to more than one period of history, they cannot begin to construct statistical models of the opponent. 
\u201cGodfather\u201d works well because it has a built-in best response to Q_1 learners rather than attempting to learn a best response from experience.\n\nFinally, Hu and Wellman\u2019s Nash-Q and Littman\u2019s minimax-Q are classified as H_0 × B_0 players, because even though they attempt to learn the Nash equilibrium through experience, their play is fixed once this equilibrium has been learned. Furthermore, they assume that the opponent also plays a fixed stationary Nash equilibrium, which they hope is the other half of their own equilibrium strategy. These algorithms are summarized in Table 1.\n\nA new class of players. As discussed above, most existing algorithms do not form beliefs about the opponent beyond B_0. None of these approaches is able to capture the essence of game-playing, which is a world of threats, deceits, and generally out-witting the opponent. We wish to open the door to such possibilities by designing learners that can model the opponent and use that information to achieve better rewards. Ideally we would like to design an algorithm in H_∞ × B_∞ that is able to win or come to an equilibrium against any fair opponent. Since this is impossible [9], we start by proposing an algorithm in the league H_∞ × B_∞ that plays well against a restricted class of opponents. Since many of the current algorithms are best-response players, we choose an opponent class such as PHC, which is a good example of a best-response player in H_∞ × B_0. We will demonstrate that our algorithm indeed beats its PHC opponents and in fact does well against most of the existing fair opponents.\n\nA new algorithm: PHC-Exploiter. Our algorithm is different from most previous work in that we are explicitly modeling the opponent\u2019s learning algorithm and not simply his current policy. In particular, we would like to model players from H_∞ × B_0. 
Since we are in H_∞ × B_∞, it is rational for us to construct such models because we believe that the opponent is learning and adapting to us over time using its history. The idea is that we will \u201cfool\u201d our opponent into thinking that we are stupid by playing a decoy policy for a number of time periods and then switch to a different policy that takes advantage of their best response to our decoy policy. From a learning perspective, the idea is that we adapt much faster than the opponent; in fact, when we switch away from our decoy policy, our adjustment to the new policy is immediate. In contrast, the H_∞ × B_0 opponent adjusts its policy by small increments and is furthermore unable to model our changing behavior. We can repeat this \u201cbluff and bash\u201d cycle ad infinitum, thereby achieving infinite total rewards as t → ∞. The opponent never catches on to us because it believes that we only play stationary policies.\n\nA good example of an H_∞ × B_0 player is PHC. Bowling and Veloso showed that in self-play, a restricted version of WoLF-PHC always reaches a stationary Nash equilibrium in two-player two-action games, and that the general WoLF-PHC seems to do the same in experimental trials. Thus, in the long run, a WoLF-PHC player achieves its stationary Nash equilibrium payoff against any other PHC player. We wish to do better than that by exploiting our knowledge of the PHC opponent\u2019s learning strategy. We can construct a PHC-Exploiter algorithm for agent i that proceeds like PHC in steps 1-2b, and then continues as follows:\n\nc. Observing action a^t_{−i} at time t, update our history h and calculate an estimate of the opponent\u2019s policy:\n\nμ̂^t_{−i}(s, a) = (1/w) Σ_{τ=t−w}^{t} #(h[τ] = a) ∀a,\n\nwhere w is the window of estimation and #(h[τ] = a) = 1 if the opponent\u2019s action at time τ is equal to a, and 0 otherwise. 
We estimate μ̂^{t−w}_{−i}(s) similarly.\nd. Update δ by estimating the learning rate of the PHC opponent:\n\nδ ← |μ̂^t_{−i}(s) − μ̂^{t−w}_{−i}(s)| / w.\n\ne. Update μ_i(s, a). If we are winning, i.e. Σ_{a′} μ_i(s, a′) Q(s, a′) > R_i(μ̂*_i(s), μ̂_{−i}(s)), then update\n\nμ_i(s, a) ← 1 if a = argmax_{a′} Q(s, a′), and μ_i(s, a) ← 0 otherwise;\n\notherwise we are losing, and we update\n\nμ_i(s, a) ← μ_i(s, a) + δ if a = argmax_{a′} Q(s, a′), and μ_i(s, a) ← μ_i(s, a) − δ/(|A_i| − 1) otherwise.\n\nNote that we derive both the opponent\u2019s learning rate δ and the opponent\u2019s policy μ̂_{−i}(s) from estimates using the observable history of actions. If we assume the game matrix is public information, then we can solve for the equilibrium strategy μ̂*_i(s); otherwise we can run WoLF-PHC for some finite number of time periods to obtain an estimate of this equilibrium strategy. The main idea of this algorithm is that we take full advantage of all time periods in which we are winning, that is, when Σ_{a′} μ_i(s, a′) Q(s, a′) > R_i(μ̂*_i(s), μ̂_{−i}(s)).\n\nAnalysis. The PHC-Exploiter algorithm is based upon PHC and thus exhibits the same behavior as PHC in games with a single pure Nash equilibrium. Both agents generally converge to the single pure equilibrium point. The interesting case arises in competitive games where the only equilibria require mixed strategies, as discussed by Singh et al. [12] and Bowling and Veloso [6]. 
Matching pennies, shown in Figure 1(a), is one such game. PHC-Exploiter is able to use its model of the opponent\u2019s learning algorithm to choose better actions.\n\nIn the full knowledge case where we know our opponent\u2019s policy μ_2 and learning rate δ_2 at every time period, we can prove that a PHC-Exploiter learning algorithm will guarantee us unbounded reward in the long run playing games such as matching pennies.\n\nProposition 2.1. In the zero-sum game of matching pennies, where the only Nash equilibrium requires the use of mixed strategies, PHC-Exploiter is able to achieve unbounded rewards as t → ∞ against any PHC opponent given that play follows the cycle C defined by the arrowed segments shown in Figure 2.\n\nPlay proceeds along C_w, C_l, then jumps from (0.5, 0) to (1, 0), follows the line segments to (0.5, 1), then jumps back to (0, 1). Given a point (x, y) = (μ_1(H), μ_2(H)) on the graph in Figure 2, where μ_i(H) is the probability by which player i plays Heads, we know that our expected reward is\n\nR_1(x, y) = −1 × [(x)(y) + (1 − x)(1 − y)] + 1 × [(1 − x)(y) + (x)(1 − y)].\n\n[Figure 2 plots: two panels, each titled \u2018Action distribution of the two agent system\u2019, with x-axis \u2018Player 1 probability choosing Heads\u2019 and y-axis \u2018Player 2 probability choosing Heads\u2019. The left panel labels the segments C_w and C_l; the right panel\u2019s legend distinguishes \u2018agent1 winning\u2019 from \u2018agent1 losing\u2019.]\n\nFigure 2: Theoretical (left), Empirical (right). 
The cyclic play is evident in our empirical results, where we play a PHC-Exploiter player 1 against a PHC player 2.\n\n[Figure 3 plot, titled \u2018Agent 1 total reward over time\u2019: x-axis \u2018time period\u2019 (0 to 100000), y-axis \u2018total reward\u2019 (−4000 to 8000).]\n\nFigure 3: Total rewards for agent 1 increase as we gain reward through each cycle.\n\nWe wish to show that\n\n∫_C R_1(x, y) dt = 2 × (∫_{C_w} R_1(x, y) dt + ∫_{C_l} R_1(x, y) dt) > 0.\n\nWe consider each part separately. In the losing section, we let g(t) = x = t and h(t) = y = 1/2 − t, where 0 ≤ t ≤ 1/2. Then\n\n∫_{C_l} R_1(x, y) dt = ∫_0^{1/2} R_1(g(t), h(t)) dt = −1/12.\n\nSimilarly, we can show that we receive 1/4 reward over C_w. Thus, ∫_C R_1(x, y) dt = 1/3 > 0, and we have shown that we receive a payoff greater than the Nash equilibrium payoff of zero over every cycle. It is easy to see that play will indeed follow the cycle C to a good approximation, depending on the size of δ_2. In the next section, we demonstrate that we can estimate μ_2 and δ_2 sufficiently well from past observations, thus eliminating the full knowledge requirements that were used to ensure the cyclic nature of play above.\n\nExperimental results. We used the PHC-Exploiter algorithm described above to play against several PHC variants in different iterated matrix games, including matching pennies, prisoner\u2019s dilemma, and rock-paper-scissors. Here we give the results for the matching pennies game analyzed above, playing against WoLF-PHC. We used a window of w = 5000 time periods to estimate the opponent\u2019s current policy μ_2 and the opponent\u2019s learning rate δ_2. As shown in Figure 2, the play exhibits the cyclic nature that we predicted. 
The two solid vertical lines indicate periods in which our PHC-Exploiter player is winning, and the dashed, roughly diagonal, lines indicate periods in which it is losing.\n\nIn the analysis given in the previous section, we derived an upper bound for our total rewards over time, which was 1/6 for each time step (a reward of 1/3 per cycle, spread over the four segments of the cycle, each of duration 1/2). Since we have to estimate various parameters in our experimental run, we do not achieve this level of reward. We gain an average reward of 0.08 per time period. Figure 3 plots the total reward for our PHC-Exploiter agent over time. The periods of winning and losing are very clear from this graph. Further experiments tested the effectiveness of PHC-Exploiter against other fair opponents, including itself. Against all the existing fair opponents shown in Table 1, it achieved at least its average equilibrium payoff in the long run. Not surprisingly, it also posted this score when it played against a multiplicative-weight learner.\n\nConclusion and future work. In this paper, we have presented a new classification for multi-agent learning algorithms and suggested an algorithm that seems to dominate existing algorithms from the fair opponent leagues when playing certain games. Ideally, we would like to create an algorithm in the league H_∞ × B_∞ that provably dominates larger classes of fair opponents in any game. Moreover, all of the discussion contained within this paper dealt with the case of iterated matrix games. We would like to extend our framework to more general stochastic games with multiple states and multiple players. Finally, it would be interesting to find practical applications of these multi-agent learning algorithms.\n\nAcknowledgements. This work was supported in part by a Graduate Research Fellowship from the National Science Foundation.\n\nReferences\n\n[1] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. 
In Proceedings of the 11th International Conference on Machine Learning (ICML-94), 1994.\n\n[2] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th Natl. Conf. on Artificial Intelligence, 1998.\n\n[3] Junling Hu and Michael P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the 15th Int. Conf. on Machine Learning (ICML-98), 1998.\n\n[4] Michael L. Littman. Friend-or-foe Q-learning in general-sum games. In Proceedings of the 18th Int. Conf. on Machine Learning (ICML-01), 2001.\n\n[5] Keith Hall and Amy Greenwald. Correlated Q-learning. In DIMACS Workshop on Computational Issues in Game Theory and Mechanism Design, 2001.\n\n[6] Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Under submission.\n\n[7] Yasuo Nagayuki, Shin Ishii, and Kenji Doya. Multi-agent reinforcement learning: An approach based on the other agent\u2019s internal model. In Proceedings of the International Conference on Multi-Agent Systems (ICMAS-00), 2000.\n\n[8] Drew Fudenberg and David K. Levine. Consistency and cautious fictitious play. Journal of Economic Dynamics and Control, 19:1065\u20131089, 1995.\n\n[9] J.H. Nachbar and W.R. Zame. Non-computable strategies and discounted repeated games. Economic Theory, 1996.\n\n[10] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79\u2013103, 1999.\n\n[11] Michael Littman and Peter Stone. Leading best-response strategies in repeated games. In 17th Int. Joint Conf. on Artificial Intelligence (IJCAI-2001) workshop on Economic Agents, Models, and Mechanisms, 2001.\n\n[12] S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. 
In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 2000.", "award": [], "sourceid": 1984, "authors": [{"given_name": "Yu-Han", "family_name": "Chang", "institution": null}, {"given_name": "Leslie", "family_name": "Kaelbling", "institution": null}]}