Solving Stochastic Games

Advances in Neural Information Processing Systems, pp. 1186-1194

Liam Mac Dermed
College of Computing
Georgia Tech
801 Atlantic Drive
Atlanta, GA 30332-0280
liam@cc.gatech.edu

Charles Isbell
College of Computing
Georgia Tech
801 Atlantic Drive
Atlanta, GA 30332-0280
isbell@cc.gatech.edu

Abstract

Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap talk to within ε absolute error of the optimal game-theoretic solution, in time polynomial in 1/ε. Our algorithm extends Murray and Gordon's (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning.
Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts.

1 Introduction

In reinforcement learning, Bellman's dynamic programming equation is typically viewed as a method for determining the value function: the maximum achievable utility at each state. Instead, we can view the Bellman equation as a method of determining all possible achievable utilities. In the single-agent case we care only about the maximum utility, but for multiple agents it is rare to be able to simultaneously maximize all agents' utilities. In this paper we seek to find the set of all achievable joint utilities (a vector of utilities, one for each player). This set is known as the feasible-set. Given this goal we can reconstruct a proper multi-agent equivalent to the Bellman equation that operates on feasible-sets for each state instead of values.

Murray and Gordon (2007) presented an algorithm for calculating the exact form of the feasible-set based Bellman equation and proved correctness and convergence; however, their algorithm is not guaranteed to converge in a finite number of iterations. Worse, a particular iteration may not be tractable. These are two separate problems. The first problem is caused by the intolerance of an equilibrium to error, and the second results from a potential need for an unbounded number of points to define the closed convex hull that is each state's feasible-set. We solve the first problem by targeting ε-equilibria instead of exact equilibria, and we solve the second by approximating the hull with a bounded number of points. Importantly, we achieve both solutions while bounding the final error introduced by these approximations.
Taken together, this produces the first multi-agent reinforcement learning algorithm with theoretical guarantees similar to single-agent value iteration.

2 Agenda

We model the world as a fully-observable n-player stochastic game with cheap talk (communication between agents that does not affect rewards). Stochastic games (also called Markov games) are the natural multi-agent extension of Markov decision processes, with actions being joint actions and rewards being a vector of rewards, one to each player. We assume an implicit inclusion of past joint actions as part of state (we actually only rely on log2 n + 1 bits of history, recording whether a defection has occurred and, if so, who defected). We also assume that each player is rational in the game-theoretic sense.

Our goal is to produce a joint policy that is Pareto-optimal (no other viable joint policy gives a player more utility without lowering another player's utility), fair (players agree on the joint policy), and in equilibrium (no player can gain by deviating from the joint policy).¹ This solution concept is the game-theoretic solution.

We present the first approximation algorithm that can efficiently and provably converge to within a given error of game-theoretic solution concepts for all such stochastic games. We factor out the various game-theoretic elements of the problem by taking in three functions which compute, in turn: the equilibrium Feq (such as correlated equilibrium), the threat Fth (such as grim trigger), and the bargaining solution Fbs (such as the Nash bargaining solution). An error parameter ε1 controls the degree of approximation.
The final algorithm takes in a stochastic game, and returns a targeted utility-vector and joint policy such that the policy achieves the targeted utility while guaranteeing that the policy is an ε1/(1 − γ)-equilibrium (where γ is the discount factor) and that there are no exact equilibria that Pareto-dominate the targeted utility.

3 Previous approaches

Many attempts have been made to extend the Bellman equation to domains with multiple agents. Most of these attempts have focused on retaining the idea of a value function as the memoized solution to subproblems in Bellman's dynamic programming approach (Greenwald & Hall, 2003; Littman, 2001; Littman, 2005). This has led to a few successes, particularly in the zero-sum case, where the same guarantees as standard reinforcement learning have been achieved (Littman, 2001). Unfortunately, more general convergence results have not been achieved. Recently a negative result has shown that any value-function based approach cannot solve the general multi-agent scenario (Littman, 2005). Consider a simple game (Figure 1-A):

Figure 1: A) The Breakup Game demonstrates the limitation of traditional value-function based approaches. Circles represent states; outgoing arrows represent deterministic actions. Unspecified rewards are zero. B) The final feasible-set for player 1's state (γ = 0.9).

This game has four states, including two terminal states. In the two middle states play alternates between the two players until one of the players decides to exit the game. In this game the only equilibria are stochastic (e.g., the randomized policy of each player passing and exiting with probability 1/2). In each state only one of the agents takes an action, so an algorithm that depends only on a value function will myopically choose to deterministically take the best action, and never converge to the stochastic equilibrium.
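To make the cyclic structure concrete, the value of any stationary stochastic joint policy in the Breakup Game can be computed by solving a small linear system. The sketch below is illustrative only: the state encoding, the helper name `breakup_value`, and the assumption that exit rewards are received undiscounted in the state where the exit occurs are ours, not the paper's.

```python
import numpy as np

# States: 0 = player 1's turn, 1 = player 2's turn (the two middle states).
# Exiting from state 0 yields joint reward (1, -2); from state 1, (2, -1),
# as in Figure 1-A. Passing yields 0 and hands the turn to the other player.
def breakup_value(p_exit, gamma=0.9,
                  exit_reward=((1.0, -2.0), (2.0, -1.0))):
    """Joint utility of the stationary policy where the mover in state s
    exits with probability p_exit[s]:
        v(s) = p_exit[s] * r_exit(s) + (1 - p_exit[s]) * gamma * v(1 - s)
    """
    r = np.array(exit_reward)
    p = np.array(p_exit)
    # Same 2x2 system matrix for both players; only the rewards differ.
    A = np.array([[1.0, -(1 - p[0]) * gamma],
                  [-(1 - p[1]) * gamma, 1.0]])
    v = np.zeros((2, 2))  # v[state, player]
    for i in range(2):
        v[:, i] = np.linalg.solve(A, [p[0] * r[0, i], p[1] * r[1, i]])
    return v

# Player 1 exits immediately: joint utility (1, -2) from player 1's state.
print(breakup_value((1.0, 0.0))[0])
```

A deterministic (greedy) policy corresponds to p_exit values in {0, 1}, and no such pair is stable: whichever player is slated to exit prefers to pass, which is why a value-function method that always picks a deterministic best action keeps oscillating instead of settling on the stochastic equilibrium.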
This result exposed the inadequacy of value functions to capture cyclic equilibria (where the equilibrium policy may revisit a state).

Several other complaints have been leveled against the motivation behind MAL research following the Bellman heritage. One such complaint is that value-function based algorithms inherently target only stage-game equilibria and not full-game equilibria, potentially ignoring much better solutions (Shoham & Grenager, 2006). Our approach solves this problem and allows a full-game equilibrium to be reached. Another complaint goes even further, challenging the desire to even target equilibria (Shoham et al., 2003). Game theorists have shown us that equilibrium solutions are correct when agents are rational (infinitely intelligent), so the argument against targeting equilibria boils down to either assuming other agents are not infinitely intelligent (which is reasonable) or that finding equilibria is not computationally tractable (which we tackle here). We believe that although MAL is primarily concerned with the case when agents are not fully rational, first assuming agents are rational and subsequently relaxing this assumption will prove to be an effective approach.

Murray and Gordon (2007) presented the first multidimensional extension to the Bellman equation which overcame many of the problems mentioned above.

¹The precise meaning of fair, and the type of equilibrium, is intentionally left unspecified for generality.
In their later technical report (Murray & Gordon, June 2007) they provided an exact solution equivalent to our solution targeting subgame-perfect correlated equilibrium with credible threats, while using the Nash bargaining solution for equilibrium selection. In the same technical report they present an approximation method for their exact algorithm that involved sampling the feasible-set. Their approach was a significant step forward; however, their approximation algorithm has no finite-time convergence guarantees, and can result in unbounded error.

4 Exact feasible-set solution

The key idea needed to extend reinforcement learning into multi-agent domains is to replace the value function, V(s), in Bellman's dynamic program with a feasible-set function: a mapping from state to feasible-set. As a group of n agents follows a joint-policy, each player i receives rewards. The discounted sum of these rewards is that player's utility, ui. The n-dimensional vector u containing these utilities is known as the joint-utility. Thus a joint-policy yields a joint-utility, which is a point in n-dimensional space. If we examine all (including stochastic) joint-policies starting from state s, discard those not in equilibrium, and compute the remaining joint-utilities, we will have a set of n-dimensional points: the feasible-set. This set is closed and convex, and can be thought of as an n-dimensional convex polytope. As this set contains all possible joint-utilities, it will contain the optimal joint-utility for any definition of optimal (the bargaining solution Fbs will select the utility vector it deems optimal). After an optimal joint-utility has been chosen, a joint-policy can be constructed to achieve that joint-utility using the computed feasible-sets (Murray & Gordon, June 2007). Recall that agents care only about the utility they achieve and not the specific policy used.
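Since the bargaining solution Fbs selects from the Pareto-optimal boundary of a feasible-set, a useful primitive is extracting the non-dominated points from a finite candidate set of joint-utilities. A minimal sketch (the helper name `pareto_frontier` is ours, not the paper's):

```python
import numpy as np

def pareto_frontier(points):
    """Keep the points not Pareto-dominated by another point, i.e. no
    other point is weakly better for every player and strictly better
    for at least one player."""
    pts = np.asarray(points, dtype=float)
    keep = [i for i, p in enumerate(pts)
            if not any(np.all(q >= p) and np.any(q > p)
                       for j, q in enumerate(pts) if j != i)]
    return pts[keep]
```

On the vertices of Figure 1-B, {(1, -0.5), (1, -2), (1.8, -0.9)}, the point (1, -2) is dominated by (1, -0.5) and drops out, leaving the two-point frontier a bargaining solution would choose from.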
Thus computing the feasible-set function solves stochastic games, just as computing the value function solves MDPs.

Figure 1-B shows a final feasible-set in the breakup game. The set is a closed convex hull with extreme points (1, −0.5), (1, −2), and (1.8, −0.9). This feasible-set depicts the fact that when starting in player 1's state, any full-game equilibrium will result in a joint-utility that is some weighted average of these three points. For example, the players can achieve (1, −0.5) by having player 1 always pass and player 2 exit with probability 0.55. If player 2 tries to cheat by passing when they are supposed to exit, player 1 will immediately exit in retaliation (recall that history is implicitly included in state).

An exact dynamic programming solution falls out naturally after replacing the value function in Bellman's dynamic program with a feasible-set function; however, the changes in variable dimension complicate the backup. An illustration of the modified backup is shown in Figure 2, where steps A-C solve for the action-feasible-set (Q(s, a)), and steps D-E solve for V(s) given Q(s, a). What is not depicted in Figure 2 is the process of eliminating non-equilibrium policies in steps D-E. We assume an equilibrium filter function Feq is provided to the algorithm, which is applied to eliminate non-equilibrium policies. Details of this process are given in section 5.4. The final dynamic program starts by initializing each feasible-set to be some large over-estimate (a hypercube of the maximum and minimum utilities possible for each player). Each iteration of the backup then contracts the feasible-sets, eliminating unachievable utility-vectors. Eventually the algorithm converges and only achievable joint-utilities remain. The invariant of feasible-sets always overestimating is crucial for guaranteeing correctness, and is a point of great concern below.
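Steps B and C of the backup (Figure 2) can be sketched directly: enumerate combinations of one vertex per successor feasible-set, take the expectation, then discount and translate by the immediate joint reward. The function name and argument layout below are illustrative, not from the paper:

```python
from itertools import product
import numpy as np

def q_backup_points(succ_sets, probs, reward, gamma):
    """Candidate vertices of Q(s, a) for one joint action: the expected
    value of every combination of one vertex per successor state, scaled
    by gamma and translated by the immediate joint reward. (Their convex
    hull is the action feasible-set.)"""
    pts = []
    for combo in product(*succ_sets):
        ev = sum(p * np.asarray(v, float) for p, v in zip(probs, combo))
        pts.append(np.asarray(reward, float) + gamma * ev)
    return pts

# The tie case from Figure 2: a 1/2-1/2 split between successor sets
# {(0,0), (1,-2)} and {(0,0), (2,-1)}, zero immediate reward, gamma = 0.9.
pts = q_backup_points([[(0, 0), (1, -2)], [(0, 0), (2, -1)]],
                      (0.5, 0.5), (0, 0), 0.9)
```

In the paper's implementation the resulting points are then passed to a convex-hull routine (QHull, per section 6), and only the hull is stored; the exponential cost of this enumeration is exactly what the MOLP formulation of section 5.3 avoids.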
A more detailed examination of the exact algorithm, including a formal treatment of the backup, various game-theoretic issues, and convergence proofs, is given in Murray and Gordon's technical report (June 2007). This paper does not focus on the exact solution, instead focusing on creating a tractable generalized version.

5 Making a tractable algorithm

Figure 2: An example of the backup step (one iteration of our modified Bellman equation). The state shown being calculated is an initial rock-paper-scissors game played to decide who goes first in the breakup game from Figure 1. A tie results in a random winner. The backup shown depicts the 2nd iteration of the dynamic program, when feasible-sets are initialized to (0,0) and binding contracts are allowed (Feq = set union). In step A the feasible-sets of the two successor states are shown graphically. For each combination of points from each successor state the expected value is found (in this case 1/2 of the bottom and 1/2 of the top). These points are shown in step B as circles. Next, in step C, the minimum enclosing polygon is found. This feasible region is then scaled by the discount factor and translated by the immediate reward. This is the feasible-set of a particular joint action from our original state. The process is repeated for each joint action in step D. Finally, in step E, the feasible outcomes of all joint actions are fed into Feq to yield the updated feasible-set of our state.

There are a few serious computational bottlenecks in the exact algorithm. The first problem is that the size of the game itself is exponential in the number of agents because joint actions are exponential in the number of players. This problem is unavoidable unless we approximate the game, which is outside the scope of this paper.
The second problem is that although the exact algorithm always converges, it is not guaranteed to converge in finite time (during the equilibrium backup, an arbitrarily small update can lead to a drastically large change in the resulting contracted set). A third big problem is that maintaining an exact representation of a feasible-set becomes unwieldy (the number of faces of the polytope may blow up, for example if the true set is curved).

Two important modifications to the exact algorithm allow us to make the algorithm tractable: approximating the feasible-sets with a bounded number of vertices, and adding a stopping criterion. Our approach is to approximate the feasible-set at the end of each iteration after first calculating it exactly. The degree of approximation is captured by a user-specified parameter ε1. The approximation scheme yields a solution that is an ε1/(1 − γ)-equilibrium of the full game while guaranteeing that there exists no exact equilibrium that Pareto-dominates the solution's utility. This means that despite not being able to calculate the true utilities at each stage game, if other players did know the true utilities they would gain no more than ε1/(1 − γ) by defecting. Moreover, our approximate solution is as good as or better than any true equilibrium. By targeting an ε1/(1 − γ)-equilibrium we do not mean that the backup's equilibrium filter function Feq is an ε-equilibrium (it could be, although making it such would do nothing to alleviate the convergence problem). Instead we apply the standard filter function but stop if no feasible-set has changed by more than ε1.

5.1 Consequences of a stopping criterion

Recall we have added a criterion to stop when all feasible-sets contract by less than ε1 (in terms of Hausdorff distance).
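The stopping test needs the Hausdorff distance between successive estimates of each feasible-set. A minimal sketch over finite vertex sets (this upper-bounds the true polytope-to-polytope distance, so using it only makes the stopping test more conservative; the function name is ours):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets A and B: the
    largest distance from any point of one set to the nearest point
    of the other."""
    A = np.asarray(A, float)
    B = np.asarray(B, float)
    # Pairwise distance matrix: d[i, j] = ||A[i] - B[j]||.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

A backup sweep would then terminate once hausdorff(V_old[s], V_new[s]) < ε1 for every state s.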
This is added to ensure that the algorithm makes ε1 absolute progress each iteration and thus will take no more than O(1/ε1) iterations to converge. After our stopping criterion is triggered, the total error present in any state is no more than ε1/(1 − γ) (i.e., if agents followed a prescribed policy they would find their actual rewards to be no more than ε1/(1 − γ) less than promised). Therefore the feasible-sets must represent at least an ε1/(1 − γ)-equilibrium. In other words, after a backup each feasible-set is in equilibrium (according to the filter function) with respect to the previous iteration's estimation. If that previous estimation is off by at most ε1/(1 − γ), then the most any one player could gain by deviating is ε1/(1 − γ). Because we are only checking for a stopping condition, and not explicitly targeting the ε1/(1 − γ)-equilibrium in the backup, we cannot guarantee that the algorithm will terminate with the best ε1/(1 − γ)-equilibrium. Instead we can guarantee that when we do terminate, our feasible-sets contain all equilibria satisfying our original equilibrium filter and no equilibrium with incentive greater than ε1/(1 − γ) to deviate.

5.2 Bounding the number of vertices

Bounding the number of points defining each feasible-set is crucial for achieving a tractable algorithm. At the end of each iteration we can replace each state's feasible-set (V(s)) with an N-point approximation.
The computational geometry literature is rich with techniques for approximating convex hulls. However, we want to ensure that our feasible-set estimate is always an over-estimate and never an under-estimate; otherwise the equilibrium contraction step may erroneously eliminate valid policies. Also, we need the technique to work in arbitrary dimensions and guarantee a bounded number of vertices for a given error bound. A number of recent algorithms meet these conditions and provide efficient running times and optimal worst-case performance (Lopez & Reisner, 2002; Chan, 2003; Clarkson, 1993).

Despite the nice theoretical performance and error guarantees of these algorithms, they admit a potential problem. The approximation step is controlled by a parameter ε2 (0 < ε2 < ε1) determining the maximum tolerated error induced by the approximation. This error results in an expansion of the feasible-set by at most ε2. On the other hand, by targeting ε1-equilibria we can terminate if the backups fail to make ε1 progress. Unfortunately this ε1 progress is not uniform and may not affect much of the feasible-set. If this is the case, the approximation expansion could potentially expand past the original feasible-set (thus violating our need for progress to be made every iteration; see Figure 3-A). Essentially, our approximation scheme must also ensure that its output is a subset of the previous step's approximation. With this additional constraint in mind we develop the following approximation, inspired by (Chen, 2005):

Figure 3: A) (I) Feasible hull from previous iteration. (II) Feasible hull after equilibrium contraction. The set contracts at least ε1. (III) Feasible hull after a poor approximation scheme. The set expands at most ε2, but might sabotage progress. B) The hull from A-I is approximated using halfspaces from a given regular approximation of a Euclidean ball.
C) Subsequent approximations using the same set of halfspaces will not backtrack.

We take a fixed set of hyperplanes which form a regular approximation of a Euclidean ball, such that each hyperplane's normal forms an angle of at most θ with its neighbors' (e.g., an optimal Delaunay triangulation). We then project these halfspaces onto the polytope we wish to approximate (i.e., retain each hyperplane's normal but reduce its offset until it touches the given polytope). After removing redundant hyperplanes, the resulting polytope is returned as the approximation (Figure 3-B). To ensure a maximum error of ε2 with n players: θ ≤ 2 arccos[(r/(ε2 + r))^(1/n)], where r = Rmax/(1 − γ).

The scheme trivially uses a bounded number of facets (only those from the predetermined set), and hence a bounded number of vertices. Finally, by using a fixed set of approximating hyperplanes, successive approximations will strictly be subsets of each other: no hyperplane will move farther away when the set it is projecting onto shrinks (Figure 3-C). After both the ε1-equilibrium contraction step and the ε2 approximation step, we can guarantee at least ε1 − ε2 progress is made. Although the final error depends only on ε1 and not ε2, the rate of convergence and the speed of each iteration are heavily influenced by ε2. Our experiments (section 6) suggest that the theoretical requirement of ε2 < ε1 is far too conservative.

5.3 Computing expected feasible-sets

Another difficulty occurs during the backup of Q(s, a). Finding the expectation over feasible-sets involves a modified set sum (step B in Figure 2), which naively requires looping over an exponential number of combinations of taking one point from the feasible-set of each successor state.
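Returning briefly to the approximation scheme of section 5.2: for vertex-represented sets, the projection step is one support-function evaluation per fixed normal. A sketch (2-D sets and axis-aligned normals are used purely for illustration; the function name is ours):

```python
import numpy as np

def project_halfspaces(normals, vertices):
    """Project a fixed set of halfspaces onto a polytope: keep each
    normal n_k and shrink its offset until the halfspace touches the
    polytope, i.e. offset_k = max_v <n_k, v>. The intersection of
    {x : <n_k, x> <= offset_k} contains the polytope, so the
    approximation never underestimates."""
    N = np.asarray(normals, float)
    V = np.asarray(vertices, float)
    return (N @ V.T).max(axis=1)
```

Because the normals never change, re-projecting onto a shrunken set can only decrease each offset, which is the no-backtracking property of Figure 3-C.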
We can mitigate the problem by applying the set sum to an initial two sets and folding subsequent sets into the result. This leads to polynomial performance, but of an uncomfortably high degree. Instead we can describe the problem as the following multiobjective linear program (MOLP):

Simultaneously maximize, for each player i from 1 to n:
  Σ_{s'} Σ_{v ∈ V(s')} v_i x_{s'v}
Subject to:
  for every state s': Σ_{v ∈ V(s')} x_{s'v} = P(s' | s, a)

where we maximize over variables x_{s'v} (one for each v ∈ V(s'), for all s'), v is a vertex in the feasible-set V(s'), and v_i is the value of that vertex to player i. This returns only the Pareto frontier. An optimized version of the algorithm described in this paper would only need the frontier, not the full set, as calculating the frontier depends only on the frontier (unless the threat function needs the entire set). For the full feasible-set, 2^n such MOLPs are needed, one for each orthant.

Just as we modified our view of the Bellman equation to find the entire set of achievable policy payoffs, so too can we view linear programming as trying to find the entire set of achievable values of the objective function. When there is a single objective function this is simply a maximum and minimum value. When there is more than one objective function, the solution becomes a multidimensional convex set of achievable vectors. This problem is known as multiobjective linear programming and has been previously studied by a small community of operations researchers under the umbrella subject of multiobjective optimization (Branke et al., 2005). MOLP is formally defined as a technique to find the Pareto frontier of a set of linear objective functions subject to linear inequality constraints.
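One way to see why the frontier is tractable is that expectation commutes with support functions: the best achievable weighted utility over the expected set decomposes across successor states, so a single scalarization weight never requires enumerating vertex combinations. A small sketch under that observation (names are ours; an exact MOLP solver recovers the whole frontier rather than one weight at a time):

```python
import numpy as np

def expected_support(weight, succ_vertex_sets, probs):
    """Support function of the expected feasible-set:
        h(w) = sum_{s'} P(s'|s,a) * max_{v in V(s')} <w, v>.
    Each nonnegative scalarization weight w recovers one Pareto-optimal
    value of the set-sum without looping over combinations."""
    w = np.asarray(weight, float)
    return sum(p * max(float(np.dot(w, v)) for v in V)
               for V, p in zip(succ_vertex_sets, probs))
```

Sweeping weights over the positive orthant traces that orthant's portion of the Pareto frontier, which is what one of the 2^n MOLPs computes exactly.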
The most prominent exact method for MOLP is the Evans-Steuer algorithm (Branke et al., 2005).

5.4 Computing correlated equilibria of sets

Our generalized algorithm requires an equilibrium-filter function Feq. Formally this is a monotonic function Feq : P(R^n) × . . . × P(R^n) → P(R^n) which outputs a closed convex subset of the smallest convex set containing the union of the input sets. Here P denotes the powerset. It is monotonic in that x ⊆ y ⇒ Feq(x) ⊆ Feq(y). The threat function Fth is also passed to Feq. Note that requiring Feq to return a closed convex set disqualifies Nash equilibria and its refinements. Due to the availability of cheap talk, reasonable choices for Feq include correlated equilibria (CE), ε-CE, or a coalition-resistant variant of CE. Filtering non-equilibrium policies takes place when the various action feasible-sets (Q) are merged together, as shown in step E of Figure 2. Constructing Feq is more complicated than computing the equilibria of a stage game, so we describe below how to target CE.

For a normal-form game, the set of correlated equilibria can be determined by taking the intersection of a set of halfspaces (linear inequality constraints) (Greenwald & Hall, 2003). Each variable of these halfspaces represents the probability that a particular joint action is chosen (via a shared random variable), and each halfspace represents a rationality constraint that a player being told to take one action would not want to switch to another action. There are Σ_{i=1}^{n} |Ai|(|Ai| − 1) such rationality constraints (where |Ai| is the number of actions player i can take).

Unlike in a normal-form game, the rewards for following the correlation device or defecting (switching actions) are not directly given in our dynamic program. Instead we have a feasible-set of possible outcomes for each joint action Q(s, a) and a threat function Fth. Recall that when following a policy to achieve a desired payoff, not only must a joint action be given, but also subsequent payoffs for each successor state. Thus the halfspace variables must not only specify probabilities over joint actions but also the subsequent payoffs (a probability distribution over the extreme points of each successor feasible-set). Luckily, a mixture of probability distributions is still a probability distribution, so our final halfspaces now have Σ_a |Q(s, a)| variables (we still have the same number of halfspaces, with the same meaning as before).

At the end of the day we do not want feasible probabilities over successor states; we want the utility-vectors afforded by them. To achieve this without having to explicitly construct the polytope described above (which can be exponential in the number of halfspaces), we can describe the problem as the following MOLP (given Q(s, a) and Fth):

Simultaneously maximize, for each player i from 1 to n:
  Σ_{a,u} u_i x_{au}
Subject to:
  probability constraints Σ x_{au} = 1 and x_{au} ≥ 0, and
  for each player i and actions a1, a2 ∈ Ai (a2 ≠ a1):
  Σ_{a,u | ai=a1} u_i x_{au} ≥ Σ_{a,u | ai=a2} Fth(s, a) x_{au}

where variables x_{au} represent the probability of choosing joint action a and subsequent payoff u ∈ Q(s, a) in state s, and u_i is the utility to player i.

5.5 Proof of correctness

Murray and Gordon (June 2007) proved correctness and convergence for the exact algorithm by proving four properties: 1) Monotonicity (feasible-sets only shrink), 2) Achievability (after convergence, feasible-sets contain only achievable joint-utilities), 3) 
Conservative initialization (initialization is an over-estimate), and 4) Conservative backups (backups don't discard valid joint-utilities). We show that our approximation algorithm maintains these properties.

1) Our feasible-set approximation scheme was carefully constructed so that it does not permit backtracking, maintaining monotonicity (all other steps of the backup are exact). 2) We have broadened the definition of achievability to permit ε1/(1 − γ) error. After all feasible-sets shrink by less than ε1, we could modify the game by giving a bonus reward of less than ε1 to each player in each state (equal to that state's shrinkage). This modified game would then have converged exactly (and thus would have a perfectly achievable feasible-set, as proved by Murray and Gordon). Any joint-policy of the modified game will yield at most ε1/(1 − γ) more than the same joint-policy of our original game; thus all utilities of our original game are off by at most ε1/(1 − γ). 3) Conservative initialization is identical to the exact solution (start with a huge hyperrectangle with sides R^i_max/(1 − γ)). 4) Backups remain conservative, as our approximation scheme never underestimates (as shown in section 5.2) and our equilibrium filter function Feq is required to be monotonic and thus will never underestimate if operating on overestimates (this is why we require monotonicity of Feq). CE over sets as presented in section 5.4 is monotonic.
Thus our algorithm maintains the four crucial properties and terminates with all exact equilibria (as per conservative backups) while containing no equilibrium with error greater than ε1/(1 − γ).

6 Empirical results

We implemented a version of our algorithm targeting exact correlated equilibria using grim-trigger threats (defection is punished to the maximum degree possible by all other players, even at one's own expense). The grim-trigger threat reduces to a two-person zero-sum game where the defector receives their normal reward and all other players receive the opposite reward. Because the other players receive the same reward in this game, they can be viewed as a single entity. Zero-sum two-player stochastic games can be quickly solved using FFQ-Learning (Littman, 2001). Note that grim-trigger threats can be computed separately before the main algorithm is run. When computing the threats for each joint action, we use the GNU Linear Programming Kit (GLPK) to solve the zero-sum stage games. Within the main algorithm itself we use ADBASE (Steuer, 2006) to solve our various MOLPs. Finally we use QHull (Barber et al., 1995) to compute the convex hull of our feasible-sets and to determine the normals of each set's facets. We use these normals to compute the approximation. To improve performance, our implementation does not compute the entire feasible hull, only those points on the Pareto frontier. A final policy will exclusively choose targets from the frontier (using Fbs), as will the computed intermediate equilibria, so we lose nothing by ignoring the rest of the feasible-set (unless the threat function requires other sections of the feasible-set, for instance in the case of credible threats). In other words, when computing the Pareto frontier during the backup, the algorithm relies on no points except those of the previous step's Pareto frontier.
Thus computing\nonly the Pareto frontier at each iteration is not an approximation, but an exact simpli\ufb01cation.\nWe tested our algorithm on a number of problems with known closed form solutions, including the\nbreakup game (Figure 4). We also tested the algorithm on a suite of random games varying across the\nnumber of states, number of players, number of actions, number of successor states (stochasticity of\n\n7\n\n\fthe game), coarseness of approximation, and density of rewards. All rewards were chosen at random\nbetween 1 and -1, and \u03b3 was always set to 0.9.\n\nFigure 4: A visualization of feasible-sets for the terminal state and player 1\u2019s state of the breakup\ngame at various iterations of the dynamic program. By the 50th iteration the sets have converged.\n\nAn important empirical question is what degree of approximation should be adopted. Our testing\n(see Figure 5) suggests that the theoretical requirement of \u00012 < \u00011 is overly conservative. While the\nbound on \u00012 is theoretically proportional to Rmax/(1 \u2212 \u03b3) (the worst case scale of the feasible-set)\na more practical choice for \u00012 would be in scale with the \ufb01nal feasible-sets (as should a choice for\n\u00011).\n\nFigure 5: Statistics from a random game (100 states, 2 players, 2 actions each, with \u00011 = 0.02 )\nrun with different levels of approximation. The numbers shown (120, 36, 12, and 6) represent the\nnumber of predetermined hyperplanes used to approximate each Pareto frontier. A) The better ap-\nproximations only use a fraction of the hyperplanes available to them. B) Wall clock time is directly\nproportional to the size of the feasible-sets. C) Better approximations converge more each iteration\n(the coarser approximations have a longer tail), however due to the additional computational costs\nthe 12 hyperplane approximation converged quickest (in total wall time). 
The 6-, 12-, and 36-hyperplane approximations are insufficient to guarantee convergence (ε2 = 0.7, 0.3, and 0.1, respectively), yet only the 6-hyperplane approximation occasionally failed to converge.

6.1 Limitations

Our approach is overkill when the feasible-sets are one-dimensional line segments (as when the game is zero-sum, or the agents share a reward function), because CE-Q learning will converge to the correct solution without the additional overhead. When there are no cycles in the state-transition graph (or one does not wish to consider cyclic equilibria), traditional game-theoretic approaches suffice. In more general cases, our algorithm brings significant advantages. However, despite scaling linearly with the number of states, the multiobjective linear program for computing the equilibrium hull scales very poorly: the MOLP remains tractable only up to about 15 joint actions (which results in a few hundred variables and a few dozen constraints, depending on feasible-set size). This in turn prevents the algorithm from running with more than four agents.

References

Barber, C. B., Dobkin, D. P., & Huhdanpaa, H. (1995). The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22, 469–483.

Branke, J., Deb, K., Miettinen, K., & Steuer, R. E. (Eds.). (2005). Practical approaches to multi-objective optimization, 7–12 November 2004, vol. 04461 of Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum (IBFI), Schloss Dagstuhl, Germany.

Chan, T. M. (2003).
Faster core-set constructions and data stream algorithms in fixed dimensions. Comput. Geom. Theory Appl. (pp. 152–159).

Chen, L. (2005). New analysis of the sphere covering problems and optimal polytope approximation of convex bodies. J. Approx. Theory, 133, 134–145.

Clarkson, K. L. (1993). Algorithms for polytope covering and approximation, and for approximate closest-point queries.

Greenwald, A., & Hall, K. (2003). Correlated-Q learning. Proceedings of the Twentieth International Conference on Machine Learning (pp. 242–249).

Littman, M. L. (2001). Friend-or-foe Q-learning in general-sum games. Proc. 18th International Conf. on Machine Learning (pp. 322–328). Morgan Kaufmann, San Francisco, CA.

Zinkevich, M., Greenwald, A., & Littman, M. L. (2005). Cyclic equilibria in Markov games. Proceedings of Neural Information Processing Systems. Vancouver, BC, Canada.

Lopez, M. A., & Reisner, S. (2002). Linear time approximation of 3D convex polytopes. Comput. Geom. Theory Appl., 23, 291–301.

Murray, C., & Gordon, G. (2007). Finding correlated equilibria in general sum stochastic games (Technical Report). School of Computer Science, Carnegie Mellon University.

Murray, C., & Gordon, G. J. (2007). Multi-robot negotiation: Approximating the set of subgame perfect equilibria in general-sum stochastic games. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19, 1001–1008. Cambridge, MA: MIT Press.

Shoham, Y., Powers, R., & Grenager, T. (2006). If multi-agent learning is the answer, what is the question? Artificial Intelligence.

Shoham, Y., Powers, R., & Grenager, T. (2003). Multi-agent reinforcement learning: a critical survey (Technical Report).

Steuer, R. E. (2006).
ADBASE: A multiple objective linear programming solver for efficient extreme points and unbounded efficient edges.