{"title": "Learning in Zero-Sum Team Markov Games Using Factored Value Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1659, "page_last": 1666, "abstract": "", "full_text": "Learning in Zero-Sum Team Markov Games\n\nUsing Factored Value Functions\n\nMichail G. Lagoudakis\n\nRonald Parr\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nDuke University\n\nDurham, NC 27708\nmgl@cs.duke.edu\n\nDuke University\n\nDurham, NC 27708\nparr@cs.duke.edu\n\nAbstract\n\nWe present a new method for learning good strategies in zero-sum\nMarkov games in which each side is composed of multiple agents col-\nlaborating against an opposing team of agents. Our method requires full\nobservability and communication during learning, but the learned poli-\ncies can be executed in a distributed manner. The value function is rep-\nresented as a factored linear architecture and its structure determines the\nnecessary computational resources and communication bandwidth. This\napproach permits a tradeoff between simple representations with little or\nno communication between agents and complex, computationally inten-\nsive representations with extensive coordination between agents. Thus,\nwe provide a principled means of using approximation to combat the\nexponential blowup in the joint action space of the participants. The ap-\nproach is demonstrated with an example that shows the ef\ufb01ciency gains\nover naive enumeration.\n\n1 Introduction\n\nThe Markov game framework has received increased attention as a rigorous model for\nde\ufb01ning and determining optimal behavior in multiagent systems. The zero-sum case, in\nwhich one side\u2019s gains come at the expense of the other\u2019s, is the simplest and best un-\nderstood case1. Littman [7] demonstrated that reinforcement learning could be applied to\nMarkov games, albeit at the expense of solving one linear program for each state visited\nduring learning. 
This computational (and conceptual) burden is probably one factor behind the relative dearth of ambitious Markov game applications using reinforcement learning.

In recent work [6], we demonstrated that many previous theoretical results justifying the use of value function approximation to tackle large MDPs can be generalized to Markov games. We applied the LSPI reinforcement learning algorithm [5] with function approximation to a two-player soccer game and a router/server flow control problem and obtained very good results. While the theoretical results [6] are general and apply to any reinforcement learning algorithm, we preferred LSPI because its efficient use of data meant that we solved fewer linear programs during learning.

¹The term Markov game in this paper refers to the zero-sum case unless stated otherwise.

Since soccer, routing, and many other natural applications of the Markov game framework tend to involve multiple participants, it would be very useful to generalize recent advances in multiagent cooperative MDPs [2, 4] to Markov games. These methods use a factored value function architecture and determine the optimal action using a cost network [1] and a communication structure derived directly from the structure of the value function. LSPI has been successfully combined with such methods; in empirical experiments, the number of state visits required to achieve good performance scaled linearly with the number of agents despite the exponential growth in the joint action space [4].

In this paper, we integrate these ideas and present an algorithm for learning good strategies for a team of agents that plays against an opposing team. In such games, players within one team collaborate, whereas players on different teams compete.
The key component of this work is a method for efficiently computing the best strategy for a team, given an approximate factored value function that is a linear combination of features defined over the state space and subsets of the joint action space for both sides. This method, integrated within LSPI, yields a computationally efficient learning algorithm.

2 Markov Games

A two-player zero-sum Markov game is defined as a 6-tuple (S, A, O, P, R, γ), where: S = {s1, s2, ..., sn} is a finite set of game states; A = {a1, a2, ..., am} and O = {o1, o2, ..., ol} are finite sets of actions, one for each player; P is a Markovian state transition model: P(s, a, o, s') is the probability that s' will be the next state of the game when the players take actions a and o respectively in state s; R is a reward (or cost) function: R(s, a, o) is the expected one-step reward for taking actions a and o in state s; and γ ∈ (0, 1] is the discount factor for future rewards. We will refer to the first player as the maximizer and the second player as the minimizer.² Note that if either player is permitted only a single action, the Markov game becomes an MDP for the other player.

A policy π for a player in a Markov game is a mapping, π : S → Ω(A), which yields probability distributions over the maximizer's actions for each state in S. Unlike MDPs, the optimal policy for a Markov game may be stochastic, i.e., it may define a mixed strategy for every state. By convention, for any policy π, π(s) denotes the probability distribution over actions in state s and π(s, a) denotes the probability of action a in state s.

The maximizer is interested in maximizing its expected, discounted return in the minimax sense, that is, assuming the worst case of an optimal minimizer.
Since the underlying rewards are zero-sum, it is sufficient to view the minimizer as acting to minimize the maximizer's return. For any policy π, we can define Q^π(s, a, o) as the expected total discounted reward of the maximizer when following policy π after the players take actions a and o for the first step. The corresponding fixed point equation for Q^π is:

Q^π(s, a, o) = R(s, a, o) + γ Σ_{s'∈S} P(s, a, o, s') min_{o'∈O} Σ_{a'∈A} Q^π(s', a', o') π(s', a') .

Given any Q function, the maximizer can choose actions so as to maximize its value:

V(s) = max_{π'(s)∈Ω(A)} min_{o∈O} Σ_{a∈A} Q(s, a, o) π'(s, a) .     (1)

We will refer to the policy π' chosen by Eq. (1) as the minimax policy with respect to Q.

²Because of the duality, we adopt the maximizer's point of view for presentation.

This policy can be determined in any state s by solving the following linear program:

Maximize:    V(s)
Subject to:  ∀ a ∈ A, π'(s, a) ≥ 0
             Σ_{a∈A} π'(s, a) = 1
             ∀ o ∈ O, V(s) ≤ Σ_{a∈A} Q(s, a, o) π'(s, a) .

If Q = Q^π, the minimax policy is an improved policy compared to π. A policy iteration algorithm can be implemented for Markov games in a manner analogous to policy iteration for MDPs by fixing a policy π_i, solving for Q^{π_i}, choosing π_{i+1} as the minimax policy with respect to Q^{π_i}, and iterating. This algorithm converges to the optimal minimax policy π*.

3 Least Squares Policy Iteration (LSPI) for Markov Games

In practice, the state/action space is too large for an explicit representation of the Q function.
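As a concrete illustration (not part of the original presentation), the minimax linear program of Section 2 can be handed to an off-the-shelf LP solver. The sketch below, assuming scipy is available, computes the minimax mixed strategy for one state whose Q-values are given as an |A| × |O| matrix:

```python
import numpy as np
from scipy.optimize import linprog

def minimax_policy(Q):
    """Solve max_pi min_o sum_a Q[a, o] * pi[a] as a linear program.

    Variables: x = [V, pi_1, ..., pi_m]. linprog minimizes, so we
    minimize -V subject to V <= sum_a Q[a, o] pi_a for every o,
    sum_a pi_a = 1, and pi_a >= 0 (V itself is unbounded).
    """
    m, l = Q.shape
    c = np.zeros(m + 1)
    c[0] = -1.0                       # minimize -V, i.e. maximize V
    # V - sum_a Q[a, o] pi_a <= 0, one row per opponent action o
    A_ub = np.hstack([np.ones((l, 1)), -Q.T])
    b_ub = np.zeros(l)
    A_eq = np.zeros((1, m + 1))
    A_eq[0, 1:] = 1.0                 # probabilities sum to one
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]        # game value V(s), mixed strategy pi(s)

# Matching pennies: the minimax strategy is uniform and the value is 0.
value, pi = minimax_policy(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```

The tight inequality rows of the solved program identify the minimizing opponent actions, which is exactly how o' is recovered in the LSPI updates below.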
We consider the standard approach of approximating the Q function as a linear combination of k basis functions φ_j with weights w_j, that is, Q̂(s, a, o) = φ(s, a, o)ᵀw. With this representation, the minimax policy π for the maximizer is determined by

π(s) = argmax_{π(s)∈Ω(A)} min_{o∈O} Σ_{a∈A} π(s, a) φ(s, a, o)ᵀw ,

and can be computed by solving the following linear program:

Maximize:    V(s)
Subject to:  ∀ a ∈ A, π(s, a) ≥ 0
             Σ_{a∈A} π(s, a) = 1
             ∀ o ∈ O, V(s) ≤ Σ_{a∈A} π(s, a) φ(s, a, o)ᵀw .

We chose the LSPI algorithm to learn the weights w of the approximate value function. Least-Squares Policy Iteration (LSPI) [5] is an approximate policy iteration algorithm that learns policies from a corpus of stored samples. LSPI also applies, with minor modifications, to Markov games [6]. In particular, at each iteration, LSPI evaluates the current policy using the stored samples and keeps the learned weights to represent implicitly the improved minimax policy for the next iteration, obtained by solving the linear program above. The modified update equations account for the minimizer's action and for the distribution over next maximizer actions, since the minimax policy is, in general, stochastic. More specifically, at each iteration LSPI maintains a matrix Â and a vector b̂, which are updated as follows:

Â ← Â + φ(s, a, o) ( φ(s, a, o) − γ Σ_{a'∈A} π(s', a') φ(s', a', o') )ᵀ ,
b̂ ← b̂ + φ(s, a, o) r ,

for any sample (s, a, o, r, s'). The policy π(s') for state s' is computed using the linear program above. The action o' is the minimizing opponent action in computing π(s') and can be identified by the tight constraint on V(s'). The weight vector w is computed at the end of each iteration as the solution of Âw = b̂.
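The per-sample update above is just an outer-product accumulation. The following sketch, with hypothetical feature and policy inputs (not from the paper's experiments), shows one update of Â and b̂:

```python
import numpy as np

def lspi_game_update(A_hat, b_hat, phi_sao, phi_next, pi_next, r, gamma):
    """One LSPI update for a Markov-game sample (s, a, o, r, s').

    phi_sao:  feature vector phi(s, a, o) for the observed transition.
    phi_next: matrix whose rows are phi(s', a', o') for every a' in A,
              with o' the minimizing opponent action at s'.
    pi_next:  minimax mixed strategy pi(s', .) over the maximizer's actions.
    """
    # Expected next feature vector under the (stochastic) minimax policy.
    expected_next = pi_next @ phi_next
    A_hat += np.outer(phi_sao, phi_sao - gamma * expected_next)
    b_hat += phi_sao * r
    return A_hat, b_hat

# Tiny illustration with k = 2 features and two maximizer actions.
k = 2
A_hat, b_hat = np.zeros((k, k)), np.zeros(k)
phi_sao = np.array([1.0, 0.0])
phi_next = np.array([[0.0, 1.0], [1.0, 0.0]])   # rows: a' = 0, a' = 1
pi_next = np.array([0.5, 0.5])
A_hat, b_hat = lspi_game_update(A_hat, b_hat, phi_sao, phi_next, pi_next,
                                r=1.0, gamma=0.9)
```

At the end of an iteration, w would be obtained by solving Âw = b̂ (e.g., with np.linalg.solve).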
The key step in generalizing LSPI to team Markov games is finding efficient means of performing these operations despite the exponentially large joint action space.

4 Least Squares Policy Iteration for Team Markov Games

A team Markov game is a Markov game in which a team of N maximizers plays against a team of M minimizers. Maximizer i chooses actions from Ai, so the maximizer team chooses actions ā = (a1, a2, ..., aN) from Ā = A1 × A2 × ... × AN, where ai ∈ Ai. Minimizer i chooses actions from Oi, so the minimizer team chooses actions ō = (o1, o2, ..., oM) from Ō = O1 × O2 × ... × OM, where oi ∈ Oi. Consider now an approximate value function Q̂(s, ā, ō). The minimax policy π for the maximizer team in any given state s can be computed (naively) by solving the following linear program:

Maximize:    V(s)
Subject to:  ∀ ā ∈ Ā, π(s, ā) ≥ 0
             Σ_{ā∈Ā} π(s, ā) = 1
             ∀ ō ∈ Ō, V(s) ≤ Σ_{ā∈Ā} π(s, ā) Q̂(s, ā, ō) .

Since |Ā| is exponential in N and |Ō| is exponential in M, the linear program above has an exponential number of variables and constraints and would be intractable to solve, unless we make certain assumptions about Q̂. We assume a factored approximation [2] of the Q function, given as a linear combination of k localized basis functions. Each basis function can be thought of as an individual player's perception of the environment, so each φ_j need not depend upon every feature of the state or the actions taken by every player in the game.
In particular, we assume that each φ_j depends only on the actions of a small subset of maximizers A_j and minimizers O_j, that is, φ_j = φ_j(s, ā_j, ō_j), where ā_j ∈ Ā_j and ō_j ∈ Ō_j (Ā_j is the joint action space of the players in A_j and Ō_j is the joint action space of the players in O_j). For example, if φ4 depends only on the actions of maximizers {4, 5, 8} and the actions of minimizers {3, 2, 7}, then ā4 ∈ A4 × A5 × A8 and ō4 ∈ O3 × O2 × O7. Under this locality assumption, the approximate (factored) value function is

Q̂(s, ā, ō) = Σ_{j=1}^{k} φ_j(s, ā_j, ō_j) w_j ,

where the assignments to the ā_j's and ō_j's are consistent with ā and ō. Given this form of the value function, the linear program can be simplified significantly. We look first at the constraints on the value of the state:

V(s) ≤ Σ_{ā∈Ā} π(s, ā) Σ_{j=1}^{k} φ_j(s, ā_j, ō_j) w_j
V(s) ≤ Σ_{j=1}^{k} Σ_{ā∈Ā} π(s, ā) φ_j(s, ā_j, ō_j) w_j
V(s) ≤ Σ_{j=1}^{k} Σ_{ā_j∈Ā_j} Σ_{ā'∈Ā\Ā_j} π(s, ā) φ_j(s, ā_j, ō_j) w_j
V(s) ≤ Σ_{j=1}^{k} w_j Σ_{ā_j∈Ā_j} φ_j(s, ā_j, ō_j) Σ_{ā'∈Ā\Ā_j} π(s, ā)
V(s) ≤ Σ_{j=1}^{k} w_j Σ_{ā_j∈Ā_j} φ_j(s, ā_j, ō_j) π_j(s, ā_j) ,

where each π_j(s, ā_j) defines a probability distribution over the joint actions of the players that appear in φ_j. From the last expression, it is clear that we can use the π_j(s, ā_j) as the variables of the linear program.
The number of these variables will typically be much smaller than the number of variables π(s, ā), depending on the sizes of the Ā_j's. However, we must add constraints to ensure that the local probability distributions π_j(s) are consistent with a global distribution over the entire joint action space Ā. The first set of constraints are the standard ones for any probability distribution:

∀ j = 1, ..., k :  Σ_{ā_j∈Ā_j} π_j(s, ā_j) = 1
∀ j = 1, ..., k :  ∀ ā_j ∈ Ā_j, π_j(s, ā_j) ≥ 0 .

For consistency, we must ensure that all marginals over common variables are identical:

∀ 1 ≤ j < h ≤ k :  ∀ ā' ∈ Ā_j ∩ Ā_h,  Σ_{ā'_j ∈ Ā_j\Ā_h} π_j(s, ā_j) = Σ_{ā'_h ∈ Ā_h\Ā_j} π_h(s, ā_h) .

These constraints are sufficient if the running intersection property is satisfied by the π_j(s)'s [3]. If not, it is possible that the resulting π_j(s)'s will not be consistent with any global distribution even though they are locally consistent. However, the running intersection property can be enforced by introducing certain additional local distributions into the set of π_j(s)'s. This can be achieved using a variable elimination procedure.

First, we establish an elimination order for the maximizers and let H1 be the set of all π_j(s)'s and L = ∅. At each step i, some agent i is eliminated; we let E_i be the set of all distributions in H_i that involve the actions of agent i or have empty domain. We then create a new distribution ω_i over the actions of all agents that appear in E_i and place ω_i in L. We then create ω'_i, defined as the distribution over the actions of all agents that appear in ω_i except agent i.
Next, we update H_{i+1} = H_i ∪ {ω'_i} − E_i and repeat until all agents have been eliminated. Note that H_N will necessarily be empty and L will contain at most N new local probability distributions. We can manipulate the elimination order in an attempt to keep the distributions in L small (local); however, their size will be exponential in the induced tree width. As with Bayes nets, the existence and hardness of discovering efficient elimination orderings will depend upon the topology. The set H1 ∪ L of local probability distributions satisfies the running intersection property, so we can proceed with this set instead of the original set of π_j(s)'s and apply the constraints listed above. Even though we are only interested in the π_j(s)'s, the presence of the additional distributions in the linear program ensures that the π_j(s)'s will be globally consistent.

The number of constraints needed for the local probability distributions is much smaller than the original number of constraints. In summary, the new linear program is:

Maximize:    V(s)
Subject to:  ∀ j = 1, ..., k : ∀ ā_j ∈ Ā_j, π_j(s, ā_j) ≥ 0
             ∀ j = 1, ..., k : Σ_{ā_j∈Ā_j} π_j(s, ā_j) = 1
             ∀ 1 ≤ j < h ≤ k : ∀ ā' ∈ Ā_j ∩ Ā_h,
                 Σ_{ā'_j ∈ Ā_j\Ā_h} π_j(s, ā_j) = Σ_{ā'_h ∈ Ā_h\Ā_j} π_h(s, ā_h)
             ∀ ō ∈ Ō, V(s) ≤ Σ_{j=1}^{k} w_j Σ_{ā_j∈Ā_j} φ_j(s, ā_j, ō_j) π_j(s, ā_j) .

At this point we have eliminated the exponential dependency from the number of variables and partially from the number of constraints.
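The consistency constraints simply equate marginals over shared agents. As a small illustration (with hypothetical local distributions, not taken from the paper), the check below verifies that two local distributions sharing one agent agree on that agent's marginal:

```python
def marginal(dist, scope, keep):
    """Marginalize a local distribution (dict: joint action tuple -> prob)
    over every agent in `scope` except those in `keep`."""
    out = {}
    for actions, p in dist.items():
        key = tuple(a for agent, a in zip(scope, actions) if agent in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# pi_j over agents (1, 2) and pi_h over agents (2, 3); each agent picks 0 or 1.
pi_j = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
pi_h = {(0, 0): 0.15, (0, 1): 0.15, (1, 0): 0.35, (1, 1): 0.35}

# The shared agent is 2; its marginal must match in both distributions.
m_j = marginal(pi_j, scope=(1, 2), keep={2})
m_h = marginal(pi_h, scope=(2, 3), keep={2})
consistent = all(abs(m_j[a] - m_h[a]) < 1e-9 for a in m_j)
```

In the linear program, one such equality constraint is emitted for every shared assignment ā' of every overlapping pair (j, h).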
The last set of (exponentially many) constraints can be replaced by a single non-linear constraint:

V(s) ≤ min_{ō∈Ō} Σ_{j=1}^{k} w_j Σ_{ā_j∈Ā_j} φ_j(s, ā_j, ō_j) π_j(s, ā_j) .

We now show how this non-linear constraint can be turned into a number of linear constraints that is not, in general, exponential in M. The main idea is to embed a cost network inside the linear program [2]. In particular, we define an elimination order for the oi's in ō and, for each oi in turn, we push the min operator for just oi as far inside the summation as possible, keeping only terms that have some dependency on oi or no dependency on any of the opponent team actions. We replace this smaller min expression over oi with a new function fi (represented by a set of new variables in the linear program) that depends on the other opponent actions appearing in this min expression. Finally, we introduce a set of linear constraints for the value of fi expressing the fact that fi is the minimum of the eliminated expression in all cases. We repeat this elimination process until all oi's, and therefore all min operators, have been eliminated.

More formally, at step i of the elimination, let Bi be the set of basis functions that have not been eliminated up to that point and Fi be the set of the new functions that have not yet been eliminated. For simplicity, we assume that the elimination order is o1, o2, ..., oM (in practice the elimination order needs to be chosen carefully in advance, since a poor elimination ordering could have serious adverse effects on efficiency). At the very beginning of the elimination process, B1 = {φ1, φ2, ..., φk} and F1 is empty. When eliminating oi at step i, define Ei ⊆ Bi ∪ Fi to be those functions that contain oi in their domain or have no dependency on any opponent action.
We generate a new function f_i(ō̄_i) that depends on all the opponent actions that appear in E_i excluding o_i:

f_i(ō̄_i) = min_{o_i∈O_i} [ Σ_{φ_j∈E_i} w_j Σ_{ā_j∈Ā_j} φ_j(s, ā_j, ō_j) π_j(s, ā_j) + Σ_{f_k∈E_i} f_k(ō̄_k) ] .

We introduce a new variable in the linear program for each possible setting of the domain ō̄_i of the new function f_i(ō̄_i). We also introduce a set of constraints for these variables:

∀ o_i ∈ O_i, ∀ ō̄_i :  f_i(ō̄_i) ≤ Σ_{φ_j∈E_i} w_j Σ_{ā_j∈Ā_j} φ_j(s, ā_j, ō_j) π_j(s, ā_j) + Σ_{f_k∈E_i} f_k(ō̄_k) .

These constraints ensure that the new function is the minimum over the possible choices for o_i. Now, we define B_{i+1} = B_i − E_i and F_{i+1} = F_i − E_i + {f_i}, and we continue with the elimination of action o_{i+1}. Notice that o_i does not appear anywhere in B_{i+1} or F_{i+1}. Notice also that f_M will necessarily have an empty domain and is exactly the value of the state, f_M = V(s). Summarizing, the reduced linear program is:

Maximize:    f_M
Subject to:  ∀ j = 1, ..., k : ∀ ā_j ∈ Ā_j, π_j(s, ā_j) ≥ 0
             ∀ j = 1, ..., k : Σ_{ā_j∈Ā_j} π_j(s, ā_j) = 1
             ∀ 1 ≤ j < h ≤ k : ∀ ā' ∈ Ā_j ∩ Ā_h,
                 Σ_{ā'_j ∈ Ā_j\Ā_h} π_j(s, ā_j) = Σ_{ā'_h ∈ Ā_h\Ā_j} π_h(s, ā_h)
             ∀ i, ∀ o_i ∈ O_i, ∀ ō̄_i :
                 f_i(ō̄_i) ≤ Σ_{φ_j∈E_i} w_j Σ_{ā_j∈Ā_j} φ_j(s, ā_j, ō_j) π_j(s, ā_j) + Σ_{f_k∈E_i} f_k(ō̄_k) .

Notice that the exponential dependency on N and M has been eliminated.
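The push-the-min-inward trick is easy to check numerically on concrete tables. The sketch below (hypothetical cost tables, not the paper's basis functions) eliminates three opponent variables one at a time and confirms that the staged result equals brute-force minimization:

```python
from itertools import product

# g1 depends on (o1, o2), g2 depends on (o2, o3); each o_i takes values 0..2.
vals = range(3)
g1 = {(o1, o2): 2 * o1 - o2 for o1, o2 in product(vals, vals)}
g2 = {(o2, o3): o2 * o3 + 1 for o2, o3 in product(vals, vals)}

# Eliminate o1: only g1 mentions it, so f1 depends on o2 alone.
f1 = {o2: min(g1[(o1, o2)] for o1 in vals) for o2 in vals}
# Eliminate o2: g2 and f1 mention it, so f2 depends on o3 alone.
f2 = {o3: min(g2[(o2, o3)] + f1[o2] for o2 in vals) for o3 in vals}
# Eliminate o3: f3 has empty domain; it is the overall minimum, like f_M = V(s).
f3 = min(f2[o3] for o3 in vals)

# Brute force over the full joint opponent action space, for comparison.
brute = min(g1[(o1, o2)] + g2[(o2, o3)]
            for o1, o2, o3 in product(vals, vals, vals))
```

In the linear program, each min in this sketch becomes a block of ≤ constraints over the corresponding f_i variables, so the solver performs the same computation implicitly.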
The total number of variables and constraints now depends exponentially only on the number of players that appear together as a group in any of the basis functions or the intermediate functions and distributions. It should be emphasized that this reduced linear program solves the same problem as the naive linear program and yields the same solution (albeit in a factored form).

To complete the learning algorithm, the update equations of LSPI must also be modified. For any sample (s, ā, ō, r, s'), the naive form would be

Â ← Â + φ(s, ā, ō) ( φ(s, ā, ō) − γ Σ_{ā'∈Ā} π(s', ā') φ(s', ā', ō') )ᵀ ,
b̂ ← b̂ + φ(s, ā, ō) r .

The action ō' is the minimizing opponent action in computing π(s'). Unfortunately, the number of terms in the summation within the first update equation is exponential in N. However, the vector φ(s, ā, ō) − γ Σ_{ā'∈Ā} π(s', ā') φ(s', ā', ō') can be computed on a component-by-component basis, avoiding this exponential blowup.
In particular, the j-th component is:

φ_j(s, ā_j, ō_j) − γ Σ_{ā'∈Ā} π(s', ā') φ_j(s', ā'_j, ō'_j)
= φ_j(s, ā_j, ō_j) − γ Σ_{ā'_j∈Ā_j} Σ_{ā''_j∈Ā\Ā_j} π(s', ā') φ_j(s', ā'_j, ō'_j)
= φ_j(s, ā_j, ō_j) − γ Σ_{ā'_j∈Ā_j} φ_j(s', ā'_j, ō'_j) Σ_{ā''_j∈Ā\Ā_j} π(s', ā')
= φ_j(s, ā_j, ō_j) − γ Σ_{ā'_j∈Ā_j} φ_j(s', ā'_j, ō'_j) π_j(s', ā'_j) ,

which can be computed easily without exponential enumeration.

A related question is how to find ō', the minimizing opponent joint action in computing π(s'). This can be done after the linear program is solved by going through the f_i's in reverse order (compared to the elimination order) and finding the choice for o_i that imposes a tight constraint on f_i(ō̄_i), conditioned on the minimizing choice for ō̄_i that has been found so far. The only complication is that the linear program has no incentive to maximize f_i(ō̄_i) unless it contributes to maximizing the final value. Thus, a constraint that appears to be tight may not correspond to the actual minimizing choice. The solution is to do a forward pass first (according to the elimination order), marking the f_i(ō̄_i)'s that really come from tight constraints. Then, the backward pass described above will find the true minimizing choices by using only the marked f_i(ō̄_i)'s.

The last question is how to sample an action ā from the global distribution defined by the smaller distributions.
We begin with all actions uninstantiated and go through all the π_j(s)'s. For each j, we marginalize out the instantiated actions (if any) from π_j(s) to obtain the conditional probability, and then we jointly sample the actions that remain in the distribution. We repeat with the next j until all actions are instantiated. Notice that this operation can be performed in a distributed manner; that is, at execution time only agents whose actions appear in the same π_j(s) need to communicate in order to sample actions jointly. This communication structure is derived directly from the structure of the basis functions.

5 An Example

The algorithm has been implemented and is currently being tested on a large flow control problem with multiple routers and servers. Since experimental results are still in progress, we demonstrate the efficiency gained over exponential enumeration with an example. Consider a problem with N = 5 maximizers and M = 4 minimizers. Assume also that each maximizer or minimizer has 5 actions to choose from. The naive solution would require solving a linear program with 3126 variables and 3751 constraints for any representation of the value function. Consider now the following factored value function:

Q̂(s, ā, ō) = φ1(s, a1, a2, o1, o2) w1 + φ2(s, a1, a3, o1, o3) w2 + φ3(s, a2, a4, o3) w3 + φ4(s, a3, a5, o4) w4 + φ5(s, a1, o3, o4) w5 .

These basis functions satisfy the running intersection property (there is no cycle of length longer than 3), so there is no need for additional probability distributions. Using the elimination order {o4, o3, o1, o2} for the cost network, the reduced linear program contains only 121 variables and 215 constraints (we present only the 80 constraints on the value of the state that demonstrate the variable elimination procedure, omitting the common constraints for validity and consistency of the local probability distributions):

Maximize:    f2
Subject to:
∀ o4 ∈ O4, ∀ o3 ∈ O3 :
    f4(o3) ≤ Σ_{(a3,a5)∈A3×A5} w4 φ4(s, a3, a5, o4) π4(s, a3, a5) + Σ_{a1∈A1} w5 φ5(s, a1, o3, o4) π5(s, a1)
∀ o3 ∈ O3, ∀ o1 ∈ O1 :
    f3(o1) ≤ Σ_{(a1,a3)∈A1×A3} w2 φ2(s, a1, a3, o1, o3) π2(s, a1, a3) + Σ_{(a2,a4)∈A2×A4} w3 φ3(s, a2, a4, o3) π3(s, a2, a4) + f4(o3)
∀ o1 ∈ O1, ∀ o2 ∈ O2 :
    f1(o2) ≤ Σ_{(a1,a2)∈A1×A2} w1 φ1(s, a1, a2, o1, o2) π1(s, a1, a2) + f3(o1)
∀ o2 ∈ O2 :  f2 ≤ f1(o2) .

6 Conclusion

We have presented a principled approach to the problem of solving large team Markov games that builds on recent advances in value function approximation for Markov games and multiagent coordination in reinforcement learning for MDPs. Our approach permits a tradeoff between simple architectures with limited representational capability and sparse communication, and complex architectures with rich representations and more elaborate coordination structure. It is our belief that the algorithm presented in this paper can be used successfully in real-world, large-scale domains where the available knowledge about the underlying structure can be exploited to derive powerful and sufficient factored representations.

Acknowledgments

This work was supported by NSF grant 0209088. We would also like to thank Carlos Guestrin for helpful discussions.

References

[1] R. Dechter. Bucket elimination: A unifying framework for reasoning.
Artificial Intelligence, 113(1–2):41–85, 1999.

[2] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. In Proceedings of the 14th Neural Information Processing Systems (NIPS-14), pages 1523–1530, Vancouver, Canada, December 2001.

[3] Carlos Guestrin, Daphne Koller, and Ronald Parr. Solving factored POMDPs with linear value functions. In IJCAI-01 Workshop on Planning under Uncertainty and Incomplete Information, 2001.

[4] Carlos Guestrin, Michail G. Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (ICML-02), pages 227–234, Sydney, Australia, July 2002.

[5] Michail Lagoudakis and Ronald Parr. Model-free least-squares policy iteration. In Proceedings of the 14th Neural Information Processing Systems (NIPS-14), pages 1547–1554, Vancouver, Canada, December 2001.

[6] Michail Lagoudakis and Ronald Parr. Value function approximation in zero-sum Markov games. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI 2002), pages 283–292, Edmonton, Canada, 2002.

[7] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning (ICML-94), pages 157–163, San Francisco, CA, 1994. Morgan Kaufmann.", "award": [], "sourceid": 2228, "authors": [{"given_name": "Michail", "family_name": "Lagoudakis", "institution": null}, {"given_name": "Ronald", "family_name": "Parr", "institution": null}]}