{"title": "Learning Near-Pareto-Optimal Conventions in Polynomial Time", "book": "Advances in Neural Information Processing Systems", "page_first": 863, "page_last": 870, "abstract": "", "full_text": "Learning Near-Pareto-Optimal Conventions in\n\nPolynomial Time\n\nXiaofeng Wang\nECE Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nxiaofeng@andrew.cmu.edu\n\nTuomas Sandholm\n\nCS Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nsandholm@cs.cmu.edu\n\nAbstract\n\nWe study how to learn to play a Pareto-optimal strict Nash equilibrium\nwhen there exist multiple equilibria and agents may have different pref-\nerences among the equilibria. We focus on repeated coordination games\nof non-identical interest where agents do not know the game structure\nup front and receive noisy payoffs. We design ef\ufb01cient near-optimal al-\ngorithms for both the perfect monitoring and the imperfect monitoring\nsetting(where the agents only observe their own payoffs and the joint\nactions).\n\n1\n\nIntroduction\n\nRecent years have witnessed a rapid development of multiagent learning theory. In partic-\nular, the use of reinforcement learning (RL) and game theory has attracted great attentions.\nHowever, research on multiagent RL (MARL) is still facing some rudimentary problems.\nMost importantly, what is the goal of a MARL algorithm? In a multiagent system, a learn-\ning agent generally cannot achieve its goal independent of other agents, which in turn tend\nto pursue their own goals. This questions the de\ufb01nition of optimality: No silver bullet\nguarantees maximization of each agent\u2019s payoff.\n\nIn the setting of self play (where all agents use the same algorithm), most existing MARL\nalgorithms seek to learn to play a Nash equilibrium. 
A Nash equilibrium is a fixed point of the agents' best-response process: each agent maximizes its payoff given the other's strategy. An equilibrium can be viewed as a convention that the learning agents reach for playing the unknown game. A key difficulty here is that a game usually contains multiple equilibria, and the agents need to coordinate on which one to play. Furthermore, the agents may have different preferences among the equilibria. Most prior work has avoided this problem by focusing on games with a unique equilibrium or games in which the agents have common interests.

In this paper, we advocate Pareto-optimal Nash equilibria as the equilibria that a MARL algorithm should drive agents to. This is a natural goal: a Pareto-optimal equilibrium is one for which no other equilibrium exists where both agents are better off. We further design efficient algorithms with which learning agents achieve this goal in polynomial time.

2 Definitions and background

We study a repeated 2-agent game where the agents do not know the game up front and try to learn how to play based on their experiences in the previous rounds of the game. As usual, we assume that the agents observe each other's actions. We allow for the possibility that the agents receive noisy but bounded payoffs (as is the case in many real-world MARL settings); this complicates the game because the joint action does not determine the agents' payoffs deterministically. Furthermore, the agents may prefer different outcomes of the game. In the next subsection we discuss the (stage) game that is repeated over and over.

2.1 Coordination games (of potentially non-identical interest)

We consider two agents, 1 and 2. The set of actions that agent i can choose from is denoted by Ai. We denote the other agent by -i. The agents choose their individual actions ai ∈ Ai independently and concurrently.
The result of their joint action can be represented in matrix form: the rows correspond to agent 1's actions and the columns to agent 2's actions. Each cell {a1, a2} of the matrix holds the payoffs u1({a1, a2}), u2({a1, a2}). The agents may receive noisy payoffs; in that case, the ui functions are expected payoffs.

A strategy for agent i is a distribution πi over its action set Ai. A pure strategy deterministically chooses one of the agent's individual actions. A Nash equilibrium (NE) is a strategy profile π = {πi, π-i} in which no agent can improve its payoff by unilaterally deviating to a different strategy: ui({πi, π-i}) ≥ ui({π'i, π-i}) for both agents (i = 1, 2) and any strategy π'i. We call an NE a pure-strategy NE if the individual strategies in it are pure; otherwise we call it a mixed-strategy NE. The NE is strict if we can replace "≥" with ">". We focus on the important and widely studied class of games called coordination games:1

Definition 1 [Coordination game] A 2-agent coordination game G is an N × N matrix game with N strict Nash equilibria (called conventions). (It follows that there are no other pure-strategy equilibria.)

A coordination game captures the notion that the agents have the common interest of being coordinated (they both get higher payoffs by playing equilibria than other strategy profiles), but at the same time there are potentially non-identical interests (each agent may prefer different equilibria). The following small game illustrates this:

                OPT OUT    LARGE DEMAND    SMALL DEMAND
OPT OUT         0, 0       0, -0.1         0, -0.1
SMALL DEMAND    -0.1, 0    0.3, 0.5        0.3, 0.3
LARGE DEMAND    -0.1, 0    -0.1, -0.1      0.5, 0.3

Table 1: Two agents negotiate to split a coin. Each one can demand a small share (0.4) or a large share (0.6).
There is a cost for bargaining (0.1). If the agents' demands add up to no more than 1, each one gets its demand. In this game, though the agents favor different conventions, they would rather strike a deal than opt out. The convention where both agents opt out is Pareto-dominated, and the other two conventions are Pareto-optimal.

Definition 2 [Pareto-optimality] A convention {a1, a2} is Pareto-dominated if there exists at least one other convention {a'1, a'2} such that ui({a1, a2}) < ui({a'1, a'2}) and u-i({a1, a2}) ≤ u-i({a'1, a'2}). If the second inequality is strict, the Pareto domination is strict; otherwise it is weak. A convention is Pareto-optimal (PO) if and only if it is not Pareto-dominated.

Footnote 1: The term "coordination game" has sometimes been used to refer to special cases of coordination games, such as identical-interest games, where the agents have the same preferences [2], and minimum-effort games, which have strict Nash equilibria on the diagonal and in which both agents prefer equilibria further toward the top left. Our definition is the most general (except that some have even called games with weak Nash equilibria coordination games).

A Pareto-dominated convention is undesirable because there is another convention that makes both agents better off. Therefore, we advocate that a MARL algorithm should at least cause agents to learn a PO convention.

In the rest of the paper we assume, without loss of generality, that the game is normalized so that all payoffs are strictly positive. We do this so that we can set artificial payoffs of zero (as described later) and be guaranteed that they are lower than any real payoffs.
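Definitions 1 and 2 can be checked mechanically on a payoff matrix. The sketch below (our own illustration and function names, not part of the paper's algorithms) enumerates the strict Nash equilibria of the Table 1 game and then filters out Pareto-dominated ones per Definition 2:

```python
# Table 1 payoffs; rows/cols indexed 0=OPT OUT, 1=SMALL DEMAND, 2=LARGE DEMAND.
u1 = [[0.0, 0.0, 0.0], [-0.1, 0.3, 0.3], [-0.1, 0.5, -0.1]]
u2 = [[0.0, -0.1, -0.1], [0.0, 0.3, 0.5], [0.0, 0.3, -0.1]]

def strict_nash(u1, u2):
    """All cells (r, c) where each agent's action is the unique best response."""
    n = len(u1)
    return [(r, c) for r in range(n) for c in range(n)
            if all(u1[r][c] > u1[r2][c] for r2 in range(n) if r2 != r)
            and all(u2[r][c] > u2[r][c2] for c2 in range(n) if c2 != c)]

def pareto_optimal(convs, u1, u2):
    """Keep conventions not (weakly or strictly) Pareto-dominated (Definition 2)."""
    def dominated(x, y):
        (r, c), (s, d) = x, y
        return ((u1[r][c] < u1[s][d] and u2[r][c] <= u2[s][d]) or
                (u2[r][c] < u2[s][d] and u1[r][c] <= u1[s][d]))
    return [x for x in convs if not any(dominated(x, y) for y in convs if y != x)]
```

Running this on Table 1 yields the three conventions, of which only {SMALL, LARGE} and {LARGE, SMALL} survive the Pareto filter.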
This is merely for ease of exposition; in reality we can set the artificial payoffs to a negative value below any real payoff.

2.2 Learning in game theory: Necessary background

Learning in game theory [6] studies repeated interactions of agents, usually with the goal of having the agents learn to play a Nash equilibrium. There are key differences between learning in game theory and MARL. First, in the former the agents are usually assumed to know the game before play, while in MARL the agents have to learn the game structure in addition to learning how to play. Second, the former has paid little attention to the efficiency of learning, a central issue in MARL. Despite the differences, the theory of learning in games has provided important principles for MARL.

One of the most widely used learning models is fictitious play (FP). Basic FP is not guaranteed to converge in coordination games, while its variant, adaptive play (AP) [17], is. Therefore, we take AP as a building block for our MARL algorithms.

2.2.1 Adaptive play (AP)

The learning process of AP is as follows. The learning agents are assumed to have a memory that keeps a record of the most recent m plays of the game. Let at ∈ A be the joint action played at time t. Fix integers k and m such that 1 ≤ k ≤ m. While t ≤ m, each agent i chooses its actions randomly. Starting from t = m + 1, each agent looks back at the m most recent plays ht = (at-m, at-m+1, ..., at-1) and randomly (without replacement) selects k samples from ht. Let Kt(a-i) be the number of times that an action a-i ∈ A-i appears in the k samples at t.
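One AP decision step, sampling k of the m remembered plays and best-responding to the empirical opponent frequencies, can be sketched as follows (our own function names; `history` holds (own action, opponent action) pairs):

```python
import random

def ap_step(history, payoff, my_actions, k, rng=random):
    """One adaptive-play step for agent i: draw k of the m most recent
    joint actions without replacement, count the opponent's actions,
    and best-respond to the resulting empirical distribution."""
    sample = rng.sample(history, k)
    counts = {}
    for _, a_other in sample:
        counts[a_other] = counts.get(a_other, 0) + 1
    def ep(a_i):  # EP(a_i) = sum over a_-i of u_i({a_i, a_-i}) * K_t(a_-i)/k
        return sum(payoff[(a_i, j)] * n / k for j, n in counts.items())
    best = max(ep(a) for a in my_actions)
    return rng.choice([a for a in my_actions if ep(a) == best])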
Agent i calculates the expected payoff of each of its individual actions ai as EP(ai) = Σ_{a-i ∈ A-i} ui({ai, a-i}) Kt(a-i)/k, and then randomly chooses an action from the set of best responses BRt_i = {ai | ai = argmax_{a'i ∈ Ai} EP(a'i)}.

The learning process of AP can be modeled as a Markov chain. We take the initial history hm = (a1, a2, ..., am) as the initial state of the Markov chain. The definition of the other states is inductive: a successor of state h is any state h' obtained by deleting the left-most element of h and appending a new right-most element. Let h' be a successor of h, and let a' = {a'1, a'2} be the new element (joint action) that was appended to the right of h to get h'. Let p_{h,h'} be the transition probability from h to h'. Now, p_{h,h'} > 0 if and only if, for each agent i, there exists a sample of size k in h to which a'i is i's best response. Because agent i chooses such a sample with probability independent of the time t, the Markov chain is stationary. In this Markov chain model, each state h = (a, ..., a) with a being a convention is an absorbing state. According to Theorem 1 in [17], AP in coordination games converges to such an absorbing state with probability 1 if m ≥ 4k.

2.2.2 Adaptive play with persistent noise

AP does not choose a particular convention. However, Young showed that if there is small constant noise in action selection, AP usually selects a particular convention. Young studied the problem under an independent random tremble model: suppose that instead of always taking a best-response action, with a small probability ϵ the agent chooses a random action. (We write ϵ for this tremble probability to keep it distinct from the accuracy parameter ε introduced later.) This yields an irreducible and aperiodic perturbed process of the original Markov chain (the unperturbed process).
Young showed that with sufficiently small ϵ, the perturbed process converges to a stationary distribution in which the probability of playing the so-called stochastically stable convention(s) is at least 1 - Cϵ, where C is a positive constant (Theorem 4 and its proof in [17]).

The stochastically stable conventions of a game can be identified by considering the mistakes made during state transitions. We say an agent made a mistake if it chose an action that is not a best response to any sample of size k taken from the m most recent steps of history. Call the absorbing states of the unperturbed process convention states in the perturbed process. For each convention state h, we construct an h-tree τh (with each node being a convention state) such that there is a unique directed path from every other convention state to h. Label each directed edge (v, v') in τh with the number of mistakes r_{v,v'} needed to make the transition from convention state v to convention state v'. The resistance of the h-tree is r(τh) = Σ_{(v,v') ∈ τh} r_{v,v'}. The stochastic potential of the convention state h is the least resistance among all possible h-trees τh. Young proved that the stochastically stable states are the states with minimal stochastic potential.

2.3 Reinforcement learning

Reinforcement learning offers an effective way for agents to estimate the expected payoffs associated with individual actions based on previous experience, without knowing the game structure. A simple and well-understood algorithm for single-agent RL is Q-learning [9]. The general form of Q-learning is for learning in a Markov decision process; that is more than we need here. In our single-state setting, we use a simplified form of the algorithm, with Q-value Qi_t(a) recording agent i's estimate at time t of its expected payoff ui(a).
The agent updates its Q-values based on the payoff sample Rt and the observed joint action a:

Qi_{t+1}(a) = Qi_t(a) + α (Rt - Qi_t(a))    (1)

In single-agent RL, if each action is sampled infinitely often and the learning rate α is decreased over time fast enough but not too fast, the Q-values converge to the expected payoffs ui. In our setting, we set α = 1/ηt(a), where ηt(a) is the number of times that action a has been taken.

Most of the early literature on RL concerned asymptotic convergence to the optimum. Extensions of these convergence results to MARL include minimax-Q [11], Nash-Q [8], friend-or-foe-Q [12], and correlated-Q [7]. Recently, significant attention has been paid to efficiency results: near-optimal polynomial-time learning algorithms. Important results include Fiechter's algorithm [5], Kearns and Singh's E3 [10], Brafman and Tennenholtz's R-max [3], and Pivazyan and Shoham's efficient algorithms for learning a near-optimal policy [14]. These algorithms aim at efficiency: accumulating a provably close-to-optimal average payoff in polynomial running time with high probability. The equilibrium-selection problem in MARL has also been explored in the context of team games, a very restricted class of coordination games [4, 16].

In this paper, we develop efficient MARL algorithms for learning a PO convention in an unknown coordination game. We consider both the perfect monitoring setting, where the agents observe each other's payoffs, and the imperfect monitoring setting, where the agents do not observe each other's payoffs (and do not want to tell each other their payoffs). In the latter setting, our agents learn to play PO conventions without learning each other's preferences over conventions. Formally, the objective of our MARL algorithms is:

Efficiency: Let 0 < δ < 1 and ε > 0 be constants.
Then with probability at least 1 - δ, the agents will start to play a joint policy a within a number of steps polynomial in 1/ε, 1/δ, and N, such that there exists no convention a' that satisfies u1(a) + ε < u1(a') and u2(a) + ε < u2(a'). We call such a policy an ε-PO convention.

3 An efficient algorithm for the perfect monitoring setting

In order to play an ε-PO convention, the agents first need to find all the conventions. Existing efficient algorithms employ random sampling to learn the game G before coordinating. However, these approaches fall short of our goal: even when the game structure estimated from samples is within ε of G, its PO conventions might still be ε away from those of G. Here we present a new algorithm that identifies ε-PO conventions efficiently.

Learning the game structure (perfect monitoring setting)
1. Choose ε > 0 and 0.5 > δ > 0. Set w = 1.
2. Compute the number of samples M(ε/w, δ/2^(w-1)) using a Chernoff/Hoeffding bound [14], such that Pr{max_{a,i} |Qi_M(a) - ui(a)| ≤ ε/w} ≥ 1 - δ/2^(w-1).
3. Starting from t = 0, randomly try M actions with uniform distribution and update the Q-values using Equation 1.
4. If (1) GM = (Q1_M, Q2_M) has N conventions, and (2) for every convention {ai, a-i} in GM and every agent i, Qi_M({ai, a-i}) > Qi_M({a'i, a-i}) + 2ε/w for every a'i ≠ ai, then stop; else set w ← w + 1 and go to Step 2.

In Steps 2 and 3, agent i samples the coordination game G sufficiently that the game GM = (Q1_M, Q2_M) formed from the M samples is within ε/w of G with probability at least 1 - δ/2^(w-1). This is feasible because the agent can observe the other's payoffs.
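A concrete choice of M in Step 2 can be obtained from Hoeffding's inequality with a union bound over the 2N² payoff entries; the helpers below are our own sketch (payoffs assumed bounded in [0, r_max]), paired with the Equation 1 update at learning rate α = 1/ηt(a):

```python
import math

def samples_per_joint_action(eps, delta, n_actions, r_max=1.0):
    """Hoeffding bound: visiting each joint action M times makes all
    2*N^2 Q-values simultaneously eps-accurate (union bound) with
    probability at least 1 - delta."""
    entries = 2 * n_actions ** 2
    return math.ceil(r_max ** 2 / (2 * eps ** 2) * math.log(2 * entries / delta))

def q_update(q, counts, a, reward):
    """Equation 1 with alpha = 1/eta_t(a): a running average of payoffs."""
    counts[a] = counts.get(a, 0) + 1
    q[a] = q.get(a, 0.0) + (reward - q.get(a, 0.0)) / counts[a]
```

With alpha = 1/ηt(a) the update is exactly the empirical mean of the observed payoffs for action a, which is what the Hoeffding bound above is stated for.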
In Step 4, if Conditions (1) and (2) are met and GM is within ε/w of G, we know that GM has the same set of conventions as G. So, by definition, any convention not strictly Pareto-dominated in GM is a 2ε-PO convention in G. The loop from Step 2 to Step 4 searches for an ε/w small enough that Conditions (1) and (2) are met. Throughout the learning, the probability that GM stays within ε/w of G after every execution of Step 3 is at least 1 - Σ_w δ/2^(w-1) > 1 - 2δ. This implies that the algorithm will identify all the conventions of G with probability at least 1 - 2δ. The total number of samples drawn is polynomial in (N, 1/δ, 1/ε) according to the Chernoff bound [14].

After learning the game, the agents further learn how to play, that is, they determine which PO convention in GM to choose. A simple solution is to let the two agents randomize their action selection until they arrive at a PO convention in GM. However, this treatment is problematic because each agent may have different preferences over the conventions and thus will not randomly choose an action unless it believes the action is a best response to the other's strategy. In this paper, we consider learning agents that use adaptive play to negotiate which convention to play. In game theory, AP has been suggested as a simple learning model for bargaining [18], in which each agent dynamically adjusts its offer with respect to its belief about the other's strategy. Here we further propose a new algorithm, k-step adaptive play (KSAP), whose expected running time is polynomial in m and k.

Learning how to play (perfect monitoring setting)
1. Let V GM = (Q1_M, Q2_M), and set to zero those entries of V GM that do not correspond to PO conventions.
2. Starting from a random initial state, sample the memory only every k steps.
Specifically, with probability 0.5 sample the most recent k plays; otherwise, randomly draw k samples (without replacement) from the earlier m - k observations.
3. Choose an action against V GM as in adaptive play, except that when there exist multiple best-response actions that correspond to conventions of the game, choose an action that belongs to a convention offering the greatest payoff (breaking remaining ties randomly).
4. Play that action k times.
5. Upon observing that the last k steps consist of the same strict NE, play that NE forever.

In Step 1, the agents construct a virtual game V GM from the game GM = (Q1_M, Q2_M) by setting the payoffs of all actions except the PO conventions to zero. This eliminates all Pareto-dominated conventions of GM. Steps 2 to 5 constitute KSAP. Compared with AP, KSAP lets an agent sample its experience to update its opponent model only every k steps. This makes the expected number of steps to reach an absorbing state polynomial in k. A KSAP agent pays more attention to the most recent k observations and freezes its action once coordinated. This further enhances the performance of the learning algorithm.

Theorem 1 In any unknown 2-agent coordination game with perfect monitoring, if m ≥ 4k, agents that use the above algorithm will learn a 2ε-PO policy with probability at least 1 - 2δ in time poly(N, 1/δ, 1/ε, m, k).

Due to limited space, we present all proofs in a longer version of this paper [15].

4 An efficient algorithm for the imperfect monitoring setting

In this section, we present an efficient MARL algorithm for the imperfect monitoring setting, where the agents do not observe each other's payoffs during learning.
Since the agents can observe joint actions, they could in principle explicitly signal their preferences over conventions to each other through actions. This would reduce the learning problem to the one in the perfect monitoring setting. Here we assume that the agents are unwilling to explicitly signal each other their preferences over conventions, or even part of that information (e.g., their most preferred conventions).2 We study how to achieve optimal coordination without relying on such preference information.

Because each agent cannot observe the other's payoffs and because the payoffs received are noisy, it is difficult for an agent to determine when enough samples have been taken to identify all the conventions. We address this by letting the agents demonstrate to each other their understanding of the game structure (where the conventions are) after sampling.

Learning the game structure (imperfect monitoring setting)
1. Each agent plays its actions in order, with wrap-around, until both agents have just wrapped around.3 The agents name each other's actions 1, 2, ... according to the order of first appearance in play.
2. Given ε and δ, the agents randomly sample the game until every joint action has been visited at least M(ε/w, δ/2^(w-1)) times (with w = 1), updating their Q-values using Equation 1 along the way.
3. Starting at the same time, each agent i goes through the other's N individual actions a-i in order, playing the action ai such that Qi_M({ai, a-i}) > 2ε/w + Qi_M({a'i, a-i}) for every a'i ≠ ai. (If such an action ai does not exist for some a-i, then agent i plays action 1 throughout this demonstration phase.)
4. Each agent determines whether the two agents hold the same view of the N strict Nash equilibria. If not, they set w ← w + 1 and go to Step 2.

After learning the game, the agents start to learn how to play.
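The margin test in Step 3 of the demonstration phase, for a single opponent action, can be sketched as follows (our own simplification and naming; the paper's full rule falls back to action 1 for the entire phase whenever the test fails for any a-i):

```python
def demo_action(q_i, eps_w, my_actions, a_other):
    """The action agent i demonstrates against a_other: its best response
    if that response wins by a 2*(eps/w) margin, else the first action."""
    ranked = sorted(my_actions, key=lambda a: q_i[(a, a_other)], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if q_i[(best, a_other)] > q_i[(runner_up, a_other)] + 2 * eps_w:
        return best
    return my_actions[0]
```

The margin 2ε/w is what makes a demonstrated best response trustworthy: if both Q-functions are within ε/w of the true payoffs, a gap of 2ε/w in the estimates implies a real gap in the underlying game.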
The difficulty is that, without knowing the other's preferences over conventions, the agents cannot explicitly eliminate the Pareto-dominated conventions of GM. A straightforward approach is to let each agent choose its most preferred convention, breaking ties randomly. This, however, requires disclosing the preference information to the other agent, thereby violating our assumption. Moreover, such a treatment limits the negotiation to only two candidate solutions. Thus, even if there exists a better convention in which one agent compromises a little but the other is much better off, it will not be chosen. The intriguing question here is whether the agents can learn to play a PO convention without knowing each other's preferences at all.

Adaptive play with persistent noise in action selection (see Section 2.2.2) causes agents to choose "stochastically stable" conventions most of the time. This provides a potential solution to the above problem. Specifically, over Qi_M, each agent i first constructs a best-response set by including, for each possible action a-i of the other agent, the joint action {a*_i, a-i} where a*_i is i's best response to a-i. Then, agent i forms a virtual Q-function V Qi_M, which equals Qi_M except that the values of the joint actions not in the best-response set are zero. We have proved that in the virtual game (V Q1_M, V Q2_M), strictly Pareto-dominated conventions are not stochastically stable [15]. This implies that using AP with persistent noise, the agents will play 2ε-PO conventions most of the time even without knowing each other's preferences. Therefore, if the agents can stop using noise in action selection at some point (and will thus play a particular convention from then on), there is a high probability that they end up playing a 2ε-PO convention.
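The construction of V Qi_M just described can be sketched as follows (our own helper; Q-values are assumed strictly positive after normalization, so zero marks an eliminated entry):

```python
def virtual_q(q_i, my_actions, other_actions):
    """Build V Q^i: keep q_i on agent i's best responses to each opponent
    action (ties all kept), and set every other joint action to zero."""
    vq = {}
    for a_other in other_actions:
        best = max(q_i[(a, a_other)] for a in my_actions)
        for a in my_actions:
            vq[(a, a_other)] = q_i[(a, a_other)] if q_i[(a, a_other)] == best else 0.0
    return vq
```

Note that each agent can build its own V Qi_M from its own Q-values alone; no preference information crosses between the agents.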
The rest of this section presents our algorithm in more detail.

We first adapt KSAP (see Section 3) to a learning model with persistent noise. After choosing the best-response action suggested by KSAP, each agent checks whether the current state (containing the m most recent joint actions) is a convention state. If it is not, the agent plays KSAP as usual (i.e., k plays of the selected action). If it is, then in each of the following k steps the agent chooses a random action with probability ϵ and plays the best-response action with probability 1 - ϵ. We call this algorithm ϵ-KSAP.

We can model this learning process as a Markov chain whose state space includes all and only the convention states. Let st be the state at time t and s^c_t the first convention state the agents reach after time t. The transition probability is p^ϵ_{h,h'} = Pr{s^c_t = h' | st = h}, and it depends only on h, not on t (for a fixed ϵ). Therefore, the Markov chain is stationary. It is also irreducible and aperiodic because, with ϵ > 0, all actions have positive probability of being chosen in a convention state. Therefore, Theorem 4 in [17] applies, and the chain has a unique stationary distribution concentrated around the stochastically stable conventions of (V Q1, V Q2).

Footnote 2: Agents may prefer to hide such information to avoid giving others an advantage in future interactions.
Footnote 3: In an N × N game this occurs for both agents at the same time, but the technique also works for games with a different number of actions per agent.
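The action-selection core of ϵ-KSAP, i.e. the k plays that follow one KSAP decision, might look like this (a sketch under our own naming):

```python
import random

def eps_ksap_plays(best_action, my_actions, in_convention_state, eps, k, rng=random):
    """The k plays following one epsilon-KSAP decision: outside convention
    states, play the KSAP best response k times; inside one, tremble to a
    uniformly random action with probability eps at each of the k steps."""
    if not in_convention_state:
        return [best_action] * k
    return [rng.choice(my_actions) if rng.random() < eps else best_action
            for _ in range(k)]
```

Restricting the trembles to convention states is what lets the resistance-tree analysis of Section 2.2.2 carry over: transitions between convention states still require mistakes, while convergence into a convention state is driven by unperturbed KSAP.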
These conventions are 2ε-PO (Lemma 5 in [15]) with probability 1 - 2δ. The proof of Lemma 1 in [17] further characterizes the support of the limit distribution: with 0 < ϵ < 1, it is easy to obtain from that proof that the probability of playing 2ε-PO conventions is at least 1 - Cϵ, where C > 0 is a constant.

Our algorithm lets the agents stop taking noisy actions at some point and stick to a particular convention. This amounts to sampling from the stationary distribution of the Markov chain. If the sampling is unbiased, the agents have probability at least 1 - Cϵ of learning a 2ε-PO convention. The issue is how to make the sampling unbiased. We address this by applying a simple and efficient Markov chain Monte Carlo algorithm proposed by Lovász and Winkler [13]. The algorithm first randomly selects a state h and randomly walks along the chain until all states have been visited. During the walk, it generates a function Ah : S \ {h} → S, where S is the set of all convention states. Ah can be represented as a directed graph with a directed edge from each h' to Ah(h'). After the walk, if the agents find that Ah defines an h-tree (see Section 2.2.2), h becomes the convention the agents play forever. Otherwise, the agents take another random sample from S, repeat the random walk, and so on. Lovász and Winkler proved that the algorithm samples the Markov chain exactly and that its expected running time is O(h̄³ log N), where h̄ is the maximum expected time to move from one convention state to another. In our setting, we know that the probability of transiting from one convention state to another is polynomial in ϵ (the probability of making mistakes in convention states), so h̄ is polynomial in 1/ϵ. In addition, recall that our Markov chain is constructed on the convention states rather than on all states.
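The acceptance test of the Lovász-Winkler step, checking whether Ah defines an h-tree, reduces to checking that iterating Ah from every state reaches h (a sketch with our own naming):

```python
def defines_h_tree(a_h, h, states):
    """Accept iff A_h : S\\{h} -> S defines an h-tree: following a_h from
    every other state must reach h, i.e. the functional graph contains no
    cycle that misses h."""
    for s in states:
        seen = set()
        while s != h:
            if s in seen:  # a cycle that never reaches h
                return False
            seen.add(s)
            s = a_h[s]
    return True
```

Because each non-root state has exactly one outgoing edge, "every path reaches h" is equivalent to the directed graph of Ah being a tree rooted at h, which is the h-tree condition of Section 2.2.2.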
The expected time for making a transition in this chain is upper-bounded by the expected convergence time of KSAP, which is polynomial in m and k.

Recall that Lovász and Winkler's algorithm needs to perform uniform random experiments when choosing h and constructing Ah. In our setting, individual agents generate random numbers independently. Without knowing each other's random numbers, the agents cannot commit to a convention together. If one of our learning agents commits to the final action before the other, the other may never commit because it is unable to complete the random walk. It is nontrivial to coordinate a joint commitment time between the agents because they cannot communicate (except via actions). We solve this problem by making the agents use the same random numbers (without requiring communication). We accomplish this via a random hash function technique, an idea common in cryptography [1]. Formally, a random hash function is a mapping from a pre-image space to an image space. Denote a random hash function with image space X by γX. It has two properties: (1) for any input, γX draws an output uniformly at random from X; (2) given the same input, γX gives the same output. Such functions are easy to construct (e.g., standard hash functions like MD5 and SHA can be converted into random hash functions by truncating their output [1]). In our learning setting, the agents share the same observations of previous plays. Therefore, we take the pre-image to be the m most recent joint actions appended with the number of steps played so far. Our learning agents have the same random hash function γX. Whenever an agent should make a call to a random number generator, it
4 This way the agents see the same\nuniform random numbers, and because the agents use the same algorithms, they will reach\ncommitment to the \ufb01nal action at the same step.\nLearning how to play (imperfect monitoring setting)\n\n1. Construct a virtual Q-function V Qi from Qi\nt.\n2. For steps = 1; 2; 4; 8; : : : do5\n3. For j = 1; 2; 3; : : : ; 3N do\n4. h = (cid:13)S (ht; t) (Use random hash function (cid:13)S to choose a convention state h uniformly from S.)\n5. U = fhg\n6. Do until U = S\n\n(a) Play \"-KSAP until a convention state h0 62 U is reached\n(b) y = (cid:13)f1;:::;stepsg(ht0 ; t0)\n(c) Play \"-KSAP until convention states have been visited y times (counting duplicates). Denote the most recent\n\nconvention state by Ah(h0)\n\n(d) U = U [ fh0g\n\nIf Ah de\ufb01nes an h-tree, play h forever\n\n7.\n8. Endfor\n9. Endfor\n\nTheorem 2 In any unknown 2-agent coordination game with imperfect monitoring, for\n0 < \" < 1 and some constant C > 0, if m (cid:21) 4k, using the above algorithm, the\nagents learn a 2(cid:15)-PO deterministic policy with probability at least 1 (cid:0) 2(cid:14) (cid:0) C\" in time\npoly(N; 1\n\n\" ; m; k).\n\n(cid:14) ; 1\n\n(cid:15) ; 1\n\n5 Conclusions and future research\nIn this paper, we studied how to learn to play a Pareto-optimal strict Nash equilibrium when\nthere exist multiple equilibria and agents may have different preferences among the equi-\nlibria. We focused on 2-agent repeated coordination games of non-identical interest where\nthe agents do not know the game structure up front and receive noisy payoffs. We designed\nef\ufb01cient near-optimal algorithms for both the perfect monitoring and the imperfect moni-\ntoring setting (where the agents only observe their own payoffs and the joint actions). In a\nlonger version of the paper [15], we also present the convergence algorithms. 
In future work, we plan to extend these results to n-agent and multistage coordination games.

References
[1] Bellare and Rogaway. Random oracles are practical: A paradigm for designing efficient protocols. In Proceedings of the First ACM Conference on Computer and Communications Security, 93.
[2] Boutilier. Planning, learning and coordination in multi-agent decision processes. In TARK, 96.
[3] Brafman and Tennenholtz. R-max: A general polynomial time algorithm for near-optimal reinforcement learning. In IJCAI, 01.
[4] Claus and Boutilier. The dynamics of reinforcement learning in cooperative multi-agent systems. In AAAI, 98.
[5] Fiechter. Efficient reinforcement learning. In COLT, 94.
[6] Fudenberg and Levine. The theory of learning in games. MIT Press, 98.
[7] Greenwald and Hall. Correlated-Q learning. In AAAI Spring Symposium, 02.
[8] Hu and Wellman. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, 98.
[9] Kaelbling, Littman, and Moore. Reinforcement learning: A survey. JAIR, 96.
[10] Kearns and Singh. Near-optimal reinforcement learning in polynomial time. In ICML, 98.
[11] Littman. Value-function reinforcement learning in Markov games. J. of Cognitive Systems Research, 2:55-66, 00.
[12] Littman. Friend-or-foe Q-learning in general-sum games. In ICML, 01.
[13] Lovász and Winkler. Exact mixing in an unknown Markov chain. Electronic Journal of Combinatorics, 95.
[14] Pivazyan and Shoham. Polynomial-time reinforcement learning of near-optimal policies. In AAAI, 02.
[15] Wang and Sandholm. Learning to play Pareto-optimal equilibria: Convergence and efficiency. www.cs.cmu.edu/~xiaofeng/LearnPOC.ps.
[16] Wang and Sandholm. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In NIPS, 02.
[17] Young. The evolution of conventions. Econometrica, 61:57-84, 93.
[18] Young. An evolutionary model of bargaining.
Journal of Economic Theory, 59, 93.

Footnote 4: Recall that the agents have established the same numbering of actions. This allows them to encode their joint actions for input into γ in the same way.
Footnote 5: The pattern of the for-loops is from the Lovász-Winkler algorithm [13].
", "award": [], "sourceid": 2390, "authors": [{"given_name": "Xiaofeng", "family_name": "Wang", "institution": null}, {"given_name": "Tuomas", "family_name": "Sandholm", "institution": null}]}