{"title": "Cyclic Equilibria in Markov Games", "book": "Advances in Neural Information Processing Systems", "page_first": 1641, "page_last": 1648, "abstract": "", "full_text": "Cyclic Equilibria in Markov Games\n\nMartin Zinkevich and Amy Greenwald\n\nDepartment of Computer Science\n\nBrown University\n\nProvidence, RI 02912\n\nMichael L. Littman\n\nDepartment of Computer Science\nRutgers, The State University of NJ\n\nPiscataway, NJ 08854\u20138019\n\n{maz,amy}@cs.brown.edu\n\nmlittman@cs.rutgers.edu\n\nAbstract\n\nAlthough variants of value iteration have been proposed for \ufb01nding Nash\nor correlated equilibria in general-sum Markov games, these variants\nhave not been shown to be effective in general. In this paper, we demon-\nstrate by construction that existing variants of value iteration cannot \ufb01nd\nstationary equilibrium policies in arbitrary general-sum Markov games.\nInstead, we propose an alternative interpretation of the output of value it-\neration based on a new (non-stationary) equilibrium concept that we call\n\u201ccyclic equilibria.\u201d We prove that value iteration identi\ufb01es cyclic equi-\nlibria in a class of games in which it fails to \ufb01nd stationary equilibria. We\nalso demonstrate empirically that value iteration \ufb01nds cyclic equilibria in\nnearly all examples drawn from a random distribution of Markov games.\n\n1\n\nIntroduction\n\nValue iteration (Bellman, 1957) has proven its worth in a variety of sequential-decision-\nmaking settings, most signi\ufb01cantly single-agent environments (Puterman, 1994), team\ngames, and two-player zero-sum games (Shapley, 1953). In value iteration for Markov\ndecision processes and team Markov games, the value of a state is de\ufb01ned to be the maxi-\nmum over all actions of the value of the combination of the state and action (or Q value).\nIn zero-sum environments, the max operator becomes a minimax over joint actions of the\ntwo players. 
Learning algorithms based on this update have been shown to compute equilibria\nin both model-based scenarios (Brafman & Tennenholtz, 2002) and Q-learning-like\nmodel-free scenarios (Littman & Szepesv\u00e1ri, 1996).\n\nThe theoretical and empirical success of such algorithms has led researchers to apply the\nsame approach in general-sum games, in spite of exceedingly weak guarantees of convergence\n(Hu & Wellman, 1998; Greenwald & Hall, 2003). Here, value-update rules based on\nselect Nash or correlated equilibria have been evaluated empirically and have been shown\nto perform reasonably in some settings. None has been identified that computes equilibria\nin general, however, leaving open the question of whether such an update rule is even\npossible.\n\nOur main negative theoretical result is that an entire class of value-iteration update rules,\nincluding all those mentioned above, can be excluded from consideration for computing\nstationary equilibria in general-sum Markov games. Briefly, existing value-iteration algorithms\ncompute Q values as an intermediate result, then derive policies from these Q\nvalues. We demonstrate a class of games in which Q values, even those corresponding to\nan equilibrium policy, contain insufficient information for reconstructing an equilibrium\npolicy.\n\nFaced with the impossibility of developing algorithms along the lines of traditional value\niteration that find stationary equilibria, we suggest an alternative equilibrium concept:\ncyclic equilibria. A cyclic equilibrium is a kind of non-stationary joint policy that satisfies\nthe standard conditions for equilibria (no incentive to deviate unilaterally). 
However, unlike\nconditional non-stationary policies such as tit-for-tat and finite-state strategies based on the\n\u201cfolk theorem\u201d (Osborne & Rubinstein, 1994), cyclic equilibria cycle rigidly through a set\nof stationary policies.\n\nWe present two positive results concerning cyclic equilibria. First, we consider the class of\ntwo-player two-state two-action games used to show that Q values cannot reconstruct all\nstationary equilibria. Section 4.1 shows that value iteration finds cyclic equilibria for all\ngames in this class. Second, Section 5 describes empirical results on a more general set of\ngames. We find that on a significant fraction of these games, value iteration updates fail to\nconverge to a stationary equilibrium. In contrast, value iteration finds cyclic equilibria for nearly all the games.\n\nThe success of value iteration in finding cyclic equilibria suggests this generalized solution\nconcept could be useful for constructing robust multiagent learning algorithms.\n\n2 An Impossibility Result for Q Values\n\nIn this section, we consider a subclass of Markov games in which transitions are deterministic\nand are controlled by one player at a time. We show that this class includes games\nthat have no deterministic equilibrium policies. For this class of games, we present (proofs\navailable in an extended technical report) two theorems. The first, a negative result, states\nthat the Q values used in existing value-iteration algorithms are insufficient for deriving\nequilibrium policies. The second, presented in Section 4.1, is a positive result that states\nthat value iteration does converge to cyclic equilibrium policies in this class of games.\n\n2.1 Preliminary Definitions\n\nGiven a finite set X, define \u2206(X) to be the set of all probability distributions over X.\n\nDefinition 1 A Markov game \u0393 = [S, N, A, T, R, \u03b3] consists of a set of states S, a set of players\nN = {1, . . . 
, n}, a set of actions for each player in each state {A_{i,s}}_{s\u2208S, i\u2208N} (where we\nrepresent the set of all state-action pairs as A \u2261 \u22c3_{s\u2208S} ({s} \u00d7 \u220f_{i\u2208N} A_{i,s})), a transition\nfunction T : A \u2192 \u2206(S), a reward function R : A \u2192 R^n, and a discount factor \u03b3.\n\nGiven a Markov game \u0393, let A_s = \u220f_{i\u2208N} A_{i,s}. A stationary policy is a set of distributions\n{\u03c0(s) : s \u2208 S}, where for all s \u2208 S, \u03c0(s) \u2208 \u2206(A_s). Given a stationary policy \u03c0, define\nV^{\u03c0,\u0393} : S \u2192 R^n and Q^{\u03c0,\u0393} : A \u2192 R^n to be the unique pair of functions satisfying the\nfollowing system of equations: for all i \u2208 N, for all (s, a) \u2208 A,\n\nV_i^{\u03c0,\u0393}(s) = \u2211_{a\u2208A_s} \u03c0(s)(a) Q_i^{\u03c0,\u0393}(s, a),    (1)\n\nQ_i^{\u03c0,\u0393}(s, a) = R_i(s, a) + \u03b3 \u2211_{s\u2032\u2208S} T(s, a)(s\u2032) V_i^{\u03c0,\u0393}(s\u2032).    (2)\n\nA deterministic Markov game is a Markov game \u0393 where the transition function is deterministic:\nT : A \u2192 S. A turn-taking game is a Markov game \u0393 where in every state, only\none player has a choice of action. Formally, for all s \u2208 S, there exists a player i \u2208 N such\nthat for all other players j \u2208 N \\{i}, |A_{j,s}| = 1.\n\n2.2 A Negative Result for Stationary Equilibria\n\nA NoSDE (pronounced \u201cnasty\u201d) game is a deterministic turn-taking Markov game \u0393 with\ntwo players, two states, no more than two actions for either player in either state, and no\ndeterministic stationary equilibrium policy. That the set of NoSDE games is non-empty is\ndemonstrated by the game depicted in Figure 1. This game has no deterministic stationary\nequilibrium policy: If Player 1 sends, Player 2 prefers to send; but, if Player 2 sends,\nPlayer 1 prefers to keep; and, if Player 1 keeps, Player 2 prefers to keep; but, if Player 2\nkeeps, Player 1 prefers to send. 
No deterministic policy is an equilibrium because one\nplayer will always have an incentive to change policies.\n\n[Figure 1 diagram: states 1 and 2, with each transition labeled by its reward. Player 1: R_1(1, keep, noop) = 1, R_1(1, send, noop) = 0, R_1(2, noop, keep) = 3, R_1(2, noop, send) = 0. Player 2: R_2(1, keep, noop) = 0, R_2(1, send, noop) = 3, R_2(2, noop, keep) = 1, R_2(2, noop, send) = 0.]\n\nFigure 1: An example of a NoSDE game. Here, S = {1, 2}, A_{1,1} = A_{2,2} =\n{keep, send}, A_{1,2} = A_{2,1} = {noop}, T(1, keep, noop) = 1, T(1, send, noop) = 2,\nT(2, noop, keep) = 2, T(2, noop, send) = 1, and \u03b3 = 3/4. In the unique stationary\nequilibrium, Player 1 sends with probability 2/3 and Player 2 sends with probability 5/12.\n\nLemma 1 Every NoSDE game has a unique stationary equilibrium policy.1\n\nIt is well known that, in general Markov games, random policies are sometimes needed to\nachieve an equilibrium. This fact can be demonstrated simply by a game with one state\nwhere the utilities correspond to a bimatrix game with no deterministic equilibria (penny\nmatching, say). Random actions in these games are sometimes linked with strategies that\nuse \u201cfaking\u201d or \u201cbluffing\u201d to avoid being exploited. That NoSDE games exist is surprising,\nin that randomness is needed even though actions are always taken with complete information\nabout the other player\u2019s choice and the state of the game. However, the next result\nis even more startling. Current value-iteration algorithms attempt to find the Q values of a\ngame with the goal of using these values to find a stationary equilibrium of the game. 
The\nmain theorem of this paper states that it is not possible to derive an equilibrium policy from the Q values alone\nfor NoSDE games, and therefore in general Markov games.\n\nTheorem 1 For any NoSDE game \u0393 = [S, N, A, T, R] with a unique equilibrium policy\n\u03c0, there exists another NoSDE game \u0393\u2032 = [S, N, A, T, R\u2032] with its own unique equilibrium\npolicy \u03c0\u2032 such that Q^{\u03c0,\u0393} = Q^{\u03c0\u2032,\u0393\u2032} but \u03c0 \u2260 \u03c0\u2032 and V^{\u03c0,\u0393} \u2260 V^{\u03c0\u2032,\u0393\u2032}.\n\nThis result establishes that computing or learning Q values is insufficient to compute a\nstationary equilibrium of a game.2 In this paper we suggest an alternative, where we still\ndo value iteration in the same way, but we extract a cyclic equilibrium from the sequence\nof values instead of a stationary one.\n\n1 The policy is both a correlated equilibrium and a Nash equilibrium.\n2 Although maintaining Q values and state values and deriving policies from both sets of functions\nmight circumvent this problem, we are not aware of existing value-iteration algorithms or learning\nalgorithms that do so. This observation presents a possible avenue of research not followed in this\npaper.\n\n3 A New Goal: Cyclic Equilibria\n\nA cyclic policy is a finite sequence of stationary policies \u03c0 = {\u03c0_1, . . . , \u03c0_k}. Associated\nwith \u03c0 is a sequence of value functions {V^{\u03c0,\u0393,j}} and Q-value functions {Q^{\u03c0,\u0393,j}} such that\n\nV_i^{\u03c0,\u0393,j}(s) = \u2211_{a\u2208A_s} \u03c0_j(s)(a) Q_i^{\u03c0,\u0393,j}(s, a)    (3)\n\nand\n\nQ_i^{\u03c0,\u0393,j}(s, a) = R_i(s, a) + \u03b3 \u2211_{s\u2032\u2208S} T(s, a)(s\u2032) V_i^{\u03c0,\u0393,inc_k(j)}(s\u2032),    (4)\n\nwhere inc_k(k) = 1 and inc_k(j) = j + 1 for all j < k.\n\nDefinition 2 Given a Markov game \u0393, a cyclic correlated equilibrium is a cyclic policy \u03c0,\nwhere for all j \u2208 {1, . . . 
, k}, for all i \u2208 N, for all s \u2208 S, for all a_i, a\u2032_i \u2208 A_{i,s}:\n\n\u2211_{a_{\u2212i}\u2208A_{\u2212i,s}} \u03c0_j(s)(a_i, a_{\u2212i}) Q_i^{\u03c0,\u0393,j}(s, a_i, a_{\u2212i}) \u2265 \u2211_{a_{\u2212i}\u2208A_{\u2212i,s}} \u03c0_j(s)(a_i, a_{\u2212i}) Q_i^{\u03c0,\u0393,j}(s, a\u2032_i, a_{\u2212i}).    (5)\n\nHere, a_{\u2212i} denotes a joint action for all players except i. A similar definition can be constructed\nfor Nash equilibria by insisting that all policies \u03c0_j(s) are product distributions. In\nDefinition 2, we imagine that action choices are moderated by a referee with a clock that\nindicates the current stage j of the cycle. At each stage, a typical correlated equilibrium\nis executed, meaning that the referee chooses a joint action a from \u03c0_j(s), tells each agent\nits part of that joint action, and no agent can improve its value by eschewing the referee\u2019s\nadvice. If no agent can improve its value by more than \u03b5 at any stage, we say \u03c0 is an\n\u03b5-correlated cyclic equilibrium.\nA stationary correlated equilibrium is a cyclic correlated equilibrium with k = 1. In the\nnext section, we show how value iteration can be used to derive cyclic correlated equilibria.\n\n4 Value Iteration in General-Sum Markov Games\n\nFor a game \u0393, define Q_\u0393 = (R^n)^A to be the set of all state-action (Q) value functions,\nV_\u0393 = (R^n)^S to be the set of all value functions, and \u03a0_\u0393 to be the set of all stationary\npolicies. Traditionally, value iteration can be broken down into estimating a Q value based\nupon a value function, selecting a policy \u03c0 given the Q values, and deriving a value function\nbased upon \u03c0 and the Q value functions. Whereas the first and the last step are fairly\nstraightforward, the step in the middle is quite tricky. 
A pair (\u03c0, Q) \u2208 \u03a0_\u0393 \u00d7 Q_\u0393 agree (see\nEquation 5) if, for all s \u2208 S, i \u2208 N, a_i, a\u2032_i \u2208 A_{i,s}:\n\n\u2211_{a_{\u2212i}\u2208A_{\u2212i,s}} \u03c0(s)(a_i, a_{\u2212i}) Q_i(s, a_i, a_{\u2212i}) \u2265 \u2211_{a_{\u2212i}\u2208A_{\u2212i,s}} \u03c0(s)(a_i, a_{\u2212i}) Q_i(s, a\u2032_i, a_{\u2212i}).    (6)\n\nEssentially, Q and \u03c0 agree if \u03c0 is a best response for each player given payoffs Q. An\nequilibrium-selection rule is a function f : Q_\u0393 \u2192 \u03a0_\u0393 such that for all Q \u2208 Q_\u0393,\n(f(Q), Q) agree. The set of all such rules is F_\u0393. In essence, these rules update values\nassuming an equilibrium policy for a one-stage game with Q(s, a) providing the terminal\nrewards. Examples of equilibrium-selection rules are best-Nash, utilitarian-CE, dictatorial-CE,\nplutocratic-CE, and egalitarian-CE (Greenwald & Hall, 2003). (Utilitarian-CE, which\nwe return to later, selects the correlated equilibrium in which the total of the payoffs is\nmaximized.) Foe-VI and Friend-VI (Littman, 2001) do not fit into our formalism, but it can\nbe proven that in NoSDE games they converge to deterministic policies that are neither\nstationary nor cyclic equilibria. Define d_\u0393 : V_\u0393 \u00d7 V_\u0393 \u2192 R to be a distance metric over\nvalue functions, such that\n\nd_\u0393(V, V\u2032) = max_{s\u2208S, i\u2208N} |V_i(s) \u2212 V\u2032_i(s)|.    (7)\n\nUsing our notation, the value-iteration algorithm for general-sum Markov games can be\ndescribed as follows.\n\nAlgorithm 1: ValueIteration(game \u0393, V^0 \u2208 V_\u0393, f \u2208 F_\u0393, Integer T)\nFor t := 1 to T:\n\n1. \u2200(s, a) \u2208 A, Q^t(s, a) := R(s, a) + \u03b3 \u2211_{s\u2032\u2208S} T(s, a)(s\u2032) V^{t\u22121}(s\u2032).\n2. \u03c0^t := f(Q^t).\n3. \u2200s \u2208 S, V^t(s) := \u2211_{a\u2208A_s} \u03c0^t(s)(a) Q^t(s, a).\n\nReturn {Q^1, . . . , Q^T}, {\u03c0^1, . . . , \u03c0^T}, {V^1, . . . 
, V^T}.\n\nIf a stationary equilibrium is sought, the final policy is returned.\n\nAlgorithm 2: GetStrategy(game \u0393, V^0 \u2208 V_\u0393, f \u2208 F_\u0393, Integer T)\n\n1. Run (Q^1 . . . Q^T, \u03c0^1 . . . \u03c0^T, V^1 . . . V^T) = ValueIteration(\u0393, V^0, f, T).\n2. Return \u03c0^T.\n\nFor cyclic equilibria, we have a variety of options for how many past stationary policies\nwe want to consider for forming a cycle. Our approach searches for a recent value function\nthat matches the final value function (an exact match would imply a true cycle). Ties are\nbroken in favor of the shortest cycle length. Observe that the order of the policies returned\nby value iteration is reversed to form a cyclic equilibrium.\n\nAlgorithm 3: GetCycle(game \u0393, V^0 \u2208 V_\u0393, f \u2208 F_\u0393, Integer T, Integer maxCycle)\n\n1. If maxCycle \u2265 T, maxCycle := T \u2212 1.\n2. Run (Q^1 . . . Q^T, \u03c0^1 . . . \u03c0^T, V^1 . . . V^T) = ValueIteration(\u0393, V^0, f, T).\n3. Define k := argmin_{t\u2208{1,...,maxCycle}} d(V^T, V^{T\u2212t}).\n4. For each t \u2208 {1, . . . , k} set \u03c0_t := \u03c0^{T+1\u2212t}.\n\n4.1 Convergence Conditions\n\nFact 1 If d(V^T, V^{T\u22121}) = \u03b5 in GetStrategy, then GetStrategy returns an \u03b5\u03b3/(1\u2212\u03b3)-correlated\nequilibrium.\n\nFact 2 If GetCycle returns a cyclic policy of length k and d(V^T, V^{T\u2212k}) = \u03b5, then GetCycle\nreturns an \u03b5\u03b3/(1\u2212\u03b3^k)-correlated cyclic equilibrium.\n\nSince, given V^0 and \u0393, the space of value functions is bounded, eventually there will be two\nvalue functions in {V^1, . . . , V^T} that are close according to d_\u0393. Therefore, the two practical\n(and open) questions are (1) how many iterations does it take to find an \u03b5-correlated\ncyclic equilibrium? 
and (2) how large is the cyclic equilibrium that is found?\n\nIn addition to approximate convergence described above, in two-player turn-taking games,\none can prove exact convergence. In fact, all the members of F_\u0393 described above can be\nconstrued as generalizations of utilitarian-CE in turn-taking games, and utilitarian-CE is\nproven to converge.\n\nTheorem 2 Given the utilitarian-CE equilibrium-selection rule f, for every NoSDE game\n\u0393, for every V^0 \u2208 V_\u0393, there exists some finite T such that GetCycle(\u0393, V^0, f, T, \u2308T/2\u2309)\nreturns a cyclic correlated equilibrium.\n\nTheoretically, we can imagine passing infinity as a parameter to value iteration. Doing so\nshows the limitation of value iteration in Markov games.\n\nTheorem 3 Given the utilitarian-CE equilibrium-selection rule f, for any NoSDE game \u0393\nwith unique equilibrium \u03c0, for every V^0 \u2208 V_\u0393, the value-function sequence {V^1, V^2, . . .}\nreturned from ValueIteration(\u0393, V^0, f, \u221e) does not converge to V^\u03c0.\n\nSince all of the other rules specified above (except friend-VI and foe-VI) can be implemented\nwith the utilitarian-CE equilibrium-selection rule, none of these rules will be guaranteed\nto converge, even in such a simple class as turn-taking games!\n\nTheorem 4 Given the game \u0393 in Figure 1 and its stationary equilibrium \u03c0, given V_i^0(s) =\n0 for all i \u2208 N, s \u2208 S, then for any update rule f \u2208 F_\u0393, the value-function sequence\n{V^1, V^2, . . .} returned from ValueIteration(\u0393, V^0, f, \u221e) does not converge to V^\u03c0.\n\n5 Empirical Results\n\nTo complement the formal results of the previous sections, we ran two batteries of tests\non value iteration in randomly generated games. 
We assessed the convergence behavior of\nvalue iteration to stationary and cyclic equilibria.\n\n5.1 Experimental Details\n\nOur game generator took as input the set of players N, the set of states S, and for each\nplayer i and state s, the actions A_{i,s}. To construct a game, for each state-joint action pair\n(s, a) \u2208 A, for each agent i \u2208 N, the generator sets R_i(s, a) to be an integer between\n0 and 99, chosen uniformly at random. Then, it selects T(s, a) to be deterministic, with\nthe resulting state chosen uniformly at random. We used a consistent discount factor of\n\u03b3 = 0.75 to decrease experimental variance.\nThe primary dependent variable in our results was the frequency with which value iteration\nconverged to a stationary Nash equilibrium or a cyclic Nash equilibrium (of length\nat most 100). To determine convergence, we first ran value iteration for 1000 steps. If\nd_\u0393(V^{1000}, V^{999}) \u2264 0.0001, then we considered value iteration to have converged to a\nstationary policy. If for some k \u2264 100\n\nmax_{t\u2208{1,...,k}} d_\u0393(V^{1001\u2212t}, V^{1001\u2212(t+k)}) \u2264 0.0001,    (8)\n\nthen we considered value iteration to have converged to a cycle.3\nTo determine if a game has a deterministic equilibrium, for every deterministic policy \u03c0,\nwe ran policy evaluation (for 1000 iterations) to estimate V^{\u03c0,\u0393} and Q^{\u03c0,\u0393}, and then checked\nif \u03c0 was an \u03b5-correlated equilibrium for \u03b5 = 0.0001.\n\n5.2 Turn-taking Games\n\nIn the first battery of tests, we considered sets of turn-taking games with x states and y\nactions: formally, there were x states {1, . . . , x}. 
In odd-numbered states, Player 1 had y\nactions and Player 2 had one action; in even-numbered states, Player 1 had one action and\nPlayer 2 had y actions. We varied x from 2 to 5 and y from 2 to 10. For each setting of x\nand y, we generated and tested one thousand games.\n\n3 In contrast to the GetCycle algorithm, we are here concerned with finding a cyclic equilibrium\nso we check an entire cycle for convergence.\n\n[Figure 2 plots. Left: Games Without Convergence (out of 1000) versus Actions (2 to 10), with one curve each for 2 States, 3 States, 4 States, and 5 States. Right: Percent Converged Games versus States (2 to 10), with curves for Cyclic uCE, OTComb, OTBest, and uCE.]\n\nFigure 2: (Left) For each combination of states and actions, 1000 deterministic turn-taking\ngames were generated. The graph plots the number of games where value iteration did\nnot converge to a stationary equilibrium. (Right) Frequency of convergence on 100 randomly\ngenerated games with simultaneous actions. Cyclic uCE is the number of times\nutilitarian-CE converged to a cyclic equilibrium. OTComb is the number of games where\nany one of Friend-VI, Foe-VI, utilitarian-NE-VI, and 5 variants of correlated equilibrium-VI:\ndictatorial-CE-VI (First Player), dictatorial-CE-VI (Second Player), utilitarian-CE-VI,\nplutocratic-CE-VI, and egalitarian-VI converged to a stationary equilibrium. OTBest is\nthe maximum number of games where the best fixed choice of the equilibrium-selection\nrule converged. uCE is the number of games in which utilitarian-CE-VI converged to a\nstationary equilibrium.\n\nFigure 2 (left) shows the number of generated games for which value iteration did not\nconverge to a stationary equilibrium. 
We found that nearly half (48%, as many as 5% of\nthe total set) of these non-converged games had no stationary, deterministic equilibria (they\nwere NoSDE games). In the remaining non-converged games, stationary, deterministic equilibria existed but were simply\nnot discovered by value iteration. We also found that value iteration converged to cycles of\nlength 100 or less in 99.99% of the games.\n\n5.3 Simultaneous Games\n\nIn a second set of experiments, we generated two-player Markov games where both agents\nhave at least two actions in every state. We varied the number of states between 2 and 9,\nand had either 2 or 3 actions for every agent in every state.\n\nFigure 2 (right) summarizes results for 3-action games (2-action games were qualitatively\nsimilar, but converged more often). Note that the fraction of random games on which the\nalgorithms converged to stationary equilibria decreases as the number of states increases.\nThis result holds because the larger the game, the larger the chance that value iteration will\nfall into a cycle on some subset of the states. Once again, we see that the cyclic equilibria\nare found much more reliably than stationary equilibria by value-iteration algorithms.\nFor example, utilitarian-CE converges to a cyclic correlated equilibrium about 99% of the\ntime, whereas with 10 states and 3 actions, on 26% of the games none of the techniques\nconverge.\n\n6 Conclusion\n\nIn this paper, we showed that value iteration, the algorithmic core of many multiagent\nplanning and reinforcement-learning algorithms, is not well behaved in Markov games. Among\nother impossibility results, we demonstrated that the Q-value function retains too little\ninformation for constructing optimal policies, even in 2-state, 2-action, deterministic turn-taking\nMarkov games. In fact, there are an infinite number of such games with different\nNash equilibrium value functions that have identical Q-value functions. 
This result holds\nfor proposed variants of value iteration from the literature such as updating via a correlated\nequilibrium or a Nash equilibrium, since, in turn-taking Markov games, both rules reduce\nto updating via the action with the maximum value for the controlling player.\n\nOur results paint a bleak picture for the use of value-iteration-based algorithms for computing\nstationary equilibria. However, in a class of games we call NoSDE games, a\nnatural extension of value iteration converges to a limit cycle, which is in fact a cyclic\n(nonstationary) Nash equilibrium policy. Such cyclic equilibria can also be found reliably\nfor randomly generated games and there is evidence that they appear in some naturally\noccurring problems (Tesauro & Kephart, 1999). One take-away message of our work is\nthat nonstationary policies may hold the key to improving the robustness of computational\napproaches to planning and learning in general-sum games.\n\nAcknowledgements\n\nThis research was supported by NSF Grant #IIS-0325281, NSF Career Grant #IIS-0133689,\nand the Alberta Ingenuity Foundation through the Alberta Ingenuity Centre for\nMachine Learning.\n\nReferences\n\nBellman, R. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.\n\nBrafman, R. I., & Tennenholtz, M. (2002). R-MAX\u2014a general polynomial time algorithm\nfor near-optimal reinforcement learning. Journal of Machine Learning Research, 3,\n213\u2013231.\n\nGreenwald, A., & Hall, K. (2003). Correlated Q-learning. Proceedings of the Twentieth\nInternational Conference on Machine Learning (pp. 242\u2013249).\n\nHu, J., & Wellman, M. (1998). Multiagent reinforcement learning: Theoretical framework\nand an algorithm. Proceedings of the Fifteenth International Conference on Machine\nLearning (pp. 242\u2013250). Morgan Kaufmann.\n\nLittman, M. (2001). Friend-or-foe Q-learning in general-sum games. Proceedings of\nthe Eighteenth International Conference on Machine Learning (pp. 
322\u2013328). Morgan\nKaufmann.\n\nLittman, M. L., & Szepesv\u00e1ri, C. (1996). A generalized reinforcement-learning model:\nConvergence and applications. Proceedings of the Thirteenth International Conference\non Machine Learning (pp. 310\u2013318).\n\nOsborne, M. J., & Rubinstein, A. (1994). A Course in Game Theory. The MIT Press.\n\nPuterman, M. (1994). Markov decision processes: Discrete stochastic dynamic programming.\nWiley-Interscience.\n\nShapley, L. (1953). Stochastic games. Proceedings of the National Academy of Sciences\nof the United States of America, 39, 1095\u20131100.\n\nTesauro, G., & Kephart, J. (1999). Pricing in agent economies using multi-agent Q-learning.\nProceedings of the Fifth European Conference on Symbolic and Quantitative Approaches\nto Reasoning with Uncertainty (pp. 71\u201386).", "award": [], "sourceid": 2834, "authors": [{"given_name": "Martin", "family_name": "Zinkevich", "institution": null}, {"given_name": "Amy", "family_name": "Greenwald", "institution": null}, {"given_name": "Michael", "family_name": "Littman", "institution": null}]}