{"title": "Efficient Learning Equilibrium", "book": "Advances in Neural Information Processing Systems", "page_first": 1635, "page_last": 1642, "abstract": null, "full_text": "Efficient Learning Equilibrium * \n\nRonen I. Brafman \n\nMoshe Tennenholtz \n\nComputer Science Department \n\nComputer Science Department \n\nBen-Gurion University \n\nBeer-Sheva, Israel \n\nStanford University \nStanford, CA 94305 \n\nemail: brafman@cs.bgu.ac.il \n\ne-mail: moshe@robotics.stanford.edu \n\nAbstract \n\nWe introduce efficient learning equilibrium (ELE), a normative ap(cid:173)\nproach to learning in non cooperative settings. In ELE, the learn(cid:173)\ning algorithms themselves are required to be in equilibrium. In \naddition, the learning algorithms arrive at a desired value after \npolynomial time, and deviations from a prescribed ELE become ir(cid:173)\nrational after polynomial time. We prove the existence of an ELE \nin the perfect monitoring setting, where the desired value is the \nexpected payoff in a Nash equilibrium. We also show that an ELE \ndoes not always exist in the imperfect monitoring case. Yet, it \nexists in the special case of common-interest games. Finally, we \nextend our results to general stochastic games. \n\n1 \n\nIntroduction \n\nReinforcement learning in the context of multi-agent interaction has attracted the \nattention of researchers in cognitive psychology, experimental economics, machine \nlearning, artificial intelligence, and related fields for quite some time [8, 4]. Much \nof this work uses repeated games [3, 5] and stochastic games [10, 9, 7, 1] as models \nof such interactions. \n\nThe literature on learning in games in game theory [5] is mainly concerned with the \nunderstanding of learning procedures that if adopted by the different agents will \nconverge at end to an equilibrium of the corresponding game. 
The game itself may be known; the idea is to show that simple dynamics lead to rational behavior, as prescribed by a Nash equilibrium. The learning algorithms themselves are not required to satisfy any rationality requirement; it is what they converge to, if adopted by all agents, that should be in equilibrium. This is quite different from the classical perspective on learning in Artificial Intelligence, where the main motivation for learning stems from the fact that the model of the environment is unknown. For example, consider a Markov Decision Process (MDP). If the rewards and transition probabilities are known, then one can find an optimal policy using dynamic programming. The major motivation for learning in this context stems from the fact that the model (i.e., rewards and transition probabilities) is initially unknown.\n\n* The second author's permanent address is: Faculty of Industrial Engineering and Management, Technion, Haifa 32000, Israel. This work was supported in part by the Israel Science Foundation under Grant #91/02-1. The first author is partially supported by the Paul Ivanier Center for Robotics and Production Management.\n\nWhen facing uncertainty about the game that is played, game theorists appeal to a Bayesian approach, which is completely different from a learning approach; the typical assumption in that approach is that there exists a probability distribution on the possible games, which is common knowledge. The notion of equilibrium is extended to this context of games with incomplete information, and is treated as the appropriate solution concept. In this context, agents are assumed to be rational agents adopting the corresponding (Bayes-)Nash equilibrium, and learning is not an issue. 
\n\nIn this work we present an approach to learning in games where there is no known distribution on the possible games that may be played - an approach that appears to be much more reflective of the setting studied in machine learning and AI, and in the spirit of work on on-line algorithms in computer science. Adopting the framework of repeated games, we consider a situation where the learning algorithm is a strategy for an agent in a repeated game. This strategy takes an action at each stage based on its previous observations, and initially has no information about the identity of the game being played. Given the above, the following are natural requirements for the learning algorithms provided to the agents:\n\n1. Individual Rationality: The learning algorithms themselves should be in equilibrium. It should be irrational for each agent to deviate from its learning algorithm, as long as the other agents stick to their algorithms, regardless of what the actual game is.\n\n2. Efficiency:\n\n(a) A deviation from the learning algorithm by a single agent (while the others stick to their algorithms) will become irrational (i.e., will lead to a situation where the deviator's payoff is not improved) after polynomially many stages.\n\n(b) If all agents stick to their prescribed learning algorithms, then the expected payoff obtained by each agent within a polynomial number of steps will be (close to) the value it could have obtained in a Nash equilibrium, had the agents known the game from the outset.\n\nA tuple of learning algorithms satisfying the above properties for a given class of games is said to be an Efficient Learning Equilibrium (ELE). Notice that the learning algorithms should satisfy the desired properties for every game in a given class, despite the fact that the actual game played is initially unknown. Such assumptions are typical of work in machine learning. 
What we borrow from the game theory literature is the criterion for rational behavior in multi-agent systems. That is, we take individual rationality to be associated with the notion of equilibrium. We also take the equilibrium of the actual (initially unknown) game to be our benchmark for success; we wish to obtain a corresponding value although we initially do not know which game is played.\n\nIn the remaining sections we formalize the notion of efficient learning equilibrium and present it in a self-contained fashion. We also prove the existence of an ELE (satisfying all of the above desired properties) for a general class of games (repeated games with perfect monitoring), and show that it does not exist for another. Our results on ELE can be generalized to the context of Pareto-ELE (where we wish to obtain maximal social surplus), and to general stochastic games. These will be mentioned only very briefly, due to space limitations. The discussion of these and other issues, as well as proofs of theorems, can be found in the full paper [2].\n\nTechnically speaking, the results we prove rely on a novel combination of the so-called folk theorems in economics and a novel efficient algorithm for the punishment of deviators (in games which are initially unknown).\n\n2 ELE: Definition\n\nIn this section we develop a definition of efficient learning equilibrium. For ease of exposition, our discussion will center on two-player repeated games in which the agents have an identical set of actions A. The generalization to n-player repeated games with different action sets is immediate, but requires a little more notation. The extension to stochastic games is fully discussed in the full paper [2].\n\nA game is a model of multi-agent interaction. In a game, we have a set of players, each of whom performs some action from a given set of actions. 
As a result of the players' combined choices, some outcome is obtained, which is described numerically in the form of a payoff vector, i.e., a vector of values, one for each of the players.\n\nA common description of a (two-player) game is as a matrix. This is called a game in strategic form. The rows of the matrix correspond to player 1's actions and the columns correspond to player 2's actions. The entry in row i and column j of the game matrix contains the rewards obtained by the players if player 1 plays his ith action and player 2 plays his jth action.\n\nIn a repeated game (RG) the players play a given game G repeatedly. We can view a repeated game, with respect to a game G, as consisting of an infinite number of iterations, at each of which the players have to select an action of the game G. After playing each iteration, the players receive the appropriate payoffs, as dictated by that game's matrix, and move to a new iteration.\n\nFor ease of exposition we normalize both players' payoffs in the game G to be non-negative reals between 0 and some positive constant Rmax. We denote this interval (or set) of possible payoffs by P = [0, Rmax].\n\nIn a perfect monitoring setting, the set of possible histories of length t is (A^2 x P^2)^t, and the set of possible histories, H, is the union of the sets of possible histories for all t ≥ 0, where (A^2 x P^2)^0 is the empty history. Namely, the history at time t consists of the history of actions that have been carried out so far, and the corresponding payoffs obtained by the players. Hence, in a perfect monitoring setting, a player can observe the actions selected and the payoffs obtained in the past, but does not know the game matrix to start with. In an imperfect monitoring setup, all that a player can observe following the performance of its action is the payoff it obtained and the action selected by the other player. 
The player cannot observe the other player's payoff. The definition of the possible histories for an agent naturally follows. Finally, in a strict imperfect monitoring setting, the agent cannot observe the other agents' payoffs or their actions.\n\nGiven an RG, a policy for a player is a mapping from H, the set of possible histories, to the set of possible probability distributions over A. Hence, a policy determines the probability of choosing each particular action for each possible history. A learning algorithm can be viewed as an instance of a policy.\n\nWe define the value for player 1 (resp. 2) of a policy profile (π, ρ), where π is a policy for player 1 and ρ is a policy for player 2, using the expected average reward criterion, as follows. Given an RG M and a natural number T, we denote the expected T-step undiscounted average reward of player 1 (resp. 2) when the players follow the policy profile (π, ρ) by U_1(M, π, ρ, T) (resp. U_2(M, π, ρ, T)). We define U_i(M, π, ρ) = liminf_{T → ∞} U_i(M, π, ρ, T) for i = 1, 2.\n\nLet M denote a class of repeated games. A policy profile (π, ρ) is a learning equilibrium w.r.t. M if for every π', ρ', and M ∈ M, we have that U_1(M, π', ρ) ≤ U_1(M, π, ρ) and U_2(M, π, ρ') ≤ U_2(M, π, ρ). In this paper we mainly treat the class M of all repeated games with some fixed action profile (i.e., in which the set of actions available to all agents is fixed). However, in Section 4 we consider the class of common-interest repeated games. We shall stick to the assumption that both agents have a fixed, identical set A of k actions.\n\nOur first requirement, then, is that learning algorithms be treated as strategies. In order to be individually rational, they should be best responses to one another. Our second requirement is that they rapidly obtain a desired value. 
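As an added illustration (not part of the original paper), the T-step average reward U_i(M, π, ρ, T) of a policy profile can be estimated by simulation. The sketch below assumes a two-player matrix game given as a dictionary mapping joint actions to payoff pairs; the function and variable names are ours, not the paper's.

```python
import random

def avg_reward(payoffs, pi, rho, T, trials=200, seed=0):
    """Monte Carlo estimate of the T-step average rewards (U_1, U_2).

    payoffs[(a1, a2)] = (r1, r2); pi and rho map the observed history
    (a list of (a1, a2, r1, r2) tuples) to action weights.
    """
    rng = random.Random(seed)
    actions = sorted({a for a, _ in payoffs})
    totals = [0.0, 0.0]
    for _ in range(trials):
        hist = []
        for _ in range(T):
            a1 = rng.choices(actions, weights=pi(hist))[0]
            a2 = rng.choices(actions, weights=rho(hist))[0]
            r1, r2 = payoffs[(a1, a2)]
            totals[0] += r1
            totals[1] += r2
            hist.append((a1, a2, r1, r2))
    return totals[0] / (trials * T), totals[1] / (trials * T)

# Example: a matching-pennies-like game with both players uniform.
payoffs = {(0, 0): (1, 0), (0, 1): (0, 1), (1, 0): (0, 1), (1, 1): (1, 0)}
uniform = lambda hist: [1, 1]  # history-independent uniform policy
u1, u2 = avg_reward(payoffs, uniform, uniform, T=50)
```

Here each stage's payoffs sum to 1, so the two estimated values always sum to 1, and under uniform play each is close to 1/2.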
The definition of this desired value may be a parameter; the most natural candidate - though not the only one - is the expected payoff in a Nash equilibrium of the game. Another appealing alternative will be discussed later.\n\nFormally, let G be a (one-shot) game, let M be the corresponding repeated game, and let n(G) be a Nash equilibrium of G. Then, denote the expected payoff of agent i in n(G) by NV_i(n(G)).\n\nA policy profile (π, ρ) is an efficient learning equilibrium with respect to the class of games M if for every ε > 0, 0 < δ < 1, there exists some T > 0, where T is polynomial in 1/ε, 1/δ, and k, such that with probability of at least 1 - δ: (1) for every t ≥ T and for every repeated game M ∈ M (and its corresponding one-shot game, G), U_i(M, π, ρ, t) ≥ NV_i(n(G)) - ε for i = 1, 2, for some Nash equilibrium n(G); and (2) if player 1 (resp. 2) deviates from π to π' (resp. from ρ to ρ') in iteration l, then U_1(M, π', ρ, l + t) ≤ U_1(M, π, ρ, l + t) + ε (resp. U_2(M, π, ρ', l + t) ≤ U_2(M, π, ρ, l + t) + ε) for every t ≥ T.\n\nNotice that a deviation is considered irrational if it does not increase the expected payoff by more than ε. This is in the spirit of ε-equilibrium in game theory, and is done mainly for ease of mathematical exposition. One can replace this part of the definition, while getting similar results, with the requirement of \"standard\" equilibrium, where a deviation will not improve the expected payoff, and even with the notion of strict equilibrium, where a deviation will lead to a decreased payoff. This will require, however, that we restrict our attention to games where there exists a Nash equilibrium in which the agents' expected payoffs are higher than their probabilistic maximin values.\n\nThe definition of ELE captures the insight of a normative approach to learning in non-cooperative settings. 
We assume that initially the game is unknown, but the agents have learning algorithms that will rapidly lead to the values the players would have obtained in a Nash equilibrium had they known the game. Moreover, as mentioned earlier, the learning algorithms themselves should be in equilibrium: each agent's behavior should be a best response to the other agents' behaviors, and deviations should be irrational, regardless of what the actual (one-shot) game is.\n\n3 Efficient Learning Equilibrium: Existence\n\nLet M be a repeated game in which G is played at each iteration. Let A = {a_1, ..., a_k} be the set of possible actions for both agents. Finally, let there be an agreed-upon ordering over the actions. The basic idea behind the algorithm is as follows. The agents collaborate in exploring the game. This requires k^2 moves. Next, each agent computes a Nash equilibrium of the game and follows it. If more than one equilibrium exists, then the first one according to the natural lexicographic ordering is used.¹ If one of the agents does not collaborate in the initial exploration phase, the other agent \"punishes\" this agent. We will show that efficient punishment is feasible. Otherwise, the agents have chosen a Nash equilibrium, and it is irrational for them to deviate from this equilibrium unilaterally.\n\nThis idea combines the so-called folk theorems in economics [6] and a technique for learning in zero-sum games introduced in [1]. Folk theorems in economics deal with a technique for obtaining some desired behavior by making a threat of employing a punishing strategy against a deviator from that behavior. When both agents are equipped with corresponding punishing strategies, the desired behavior will be obtained in equilibrium (and the threat will not materialize, as a deviation becomes irrational). 
In our context, however, when an agent deviates in the exploration phase, the game is not fully known, and hence punishment is problematic; moreover, we wish the punishment strategy to be an efficient algorithm (both computationally, and in the time until a punishment materializes and makes deviations irrational). These issues are addressed by an efficient punishment algorithm that guarantees that the other agent will not obtain more than its maximin value after polynomial time, although the game is initially unknown to the punishing agent. The latter is based on the ideas of our R-max algorithm, introduced in [1].\n\nMore precisely, consider the following algorithm, termed the ELE algorithm.\n\nThe ELE algorithm:\n\nPlayer 1 performs action a_i k times in a row, for all i = 1, 2, ..., k. In parallel, player 2 performs the sequence of actions (a_1, ..., a_k) k times.\n\nIf both players behaved according to the above, then a Nash equilibrium of the corresponding (revealed) game is computed, and the players behave according to the corresponding strategies from that point on. If several Nash equilibria exist, one is selected based on a pre-determined lexicographic ordering.\n\n¹ In particular, the agents can choose the equilibrium selected by a fixed shared algorithm.\n\nIf one of the players deviated from the above, we shall call this player the adversary and the other player the agent. Let G be the Rmax-sum game in which the adversary's payoff is identical to his payoff in the original game, and where the agent's payoff is Rmax minus the adversary's payoff. Let M denote the corresponding repeated game. Thus, G is a constant-sum game where the agent's goal is to minimize the adversary's payoff. Notice that some of these payoffs will be unknown (because the adversary did not cooperate in the exploration phase). 
The agent now plays according to the following algorithm:\n\nInitialize: Construct the following model M' of the repeated game M, where the game G is replaced by a game G' in which all the entries in the game matrix are assigned the rewards (Rmax, 0).² In addition, we associate with each joint action a boolean-valued variable taking values in {assumed, known}. This variable is initialized to the value assumed.\n\nRepeat:\n\nCompute and Act: Compute the optimal probabilistic maximin strategy of G' and execute it.\n\nObserve and update: Following each joint action, do as follows. Let a be the action the agent performed and let a' be the adversary's action. If (a, a') is performed for the first time, update the reward associated with (a, a') in G', as observed, and mark it known. Recall: the agent takes its payoff to be complementary to the (observed) adversary's payoff.\n\nWe can show that the policy profile in which both agents use the ELE algorithm is indeed an ELE. Thus:\n\nTheorem 1 Let M be a class of repeated games. Then, there exists an ELE w.r.t. M given perfect monitoring.\n\nThe proof of the above theorem, contained in the full paper, is non-trivial. It rests on the ability of the agent to \"punish\" the adversary quickly, making it irrational for the adversary to deviate from the ELE algorithm.\n\n4 Imperfect monitoring\n\nIn the previous section we discussed the existence of an ELE in the context of the perfect monitoring setup. This result shows that our concepts provide not only a normative, but also a constructive approach to learning in general non-cooperative environments. An interesting question is whether one can go beyond that and show the existence of an ELE in the imperfect monitoring case as well. Unfortunately, when considering the class M of all games, this is not possible.\n\nTheorem 2 There exist classes of games for which an ELE does not exist given imperfect monitoring. 
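As an added illustration (ours, not the paper's), the pieces of the ELE algorithm for Theorem 1 can be sketched in code. For brevity, this sketch searches only for pure Nash equilibria and uses a deterministic maximin as a stand-in for the punishment step; the algorithm described above uses the probabilistic maximin (computable by linear programming) over the optimistic model with (Rmax, 0) entries. All names below are illustrative.

```python
import itertools

def explore_schedule(k):
    """Joint actions in the exploration phase: player 1 repeats each
    action k times while player 2 cycles through all actions, so every
    pair (i, j) is visited once -- k^2 stages in total."""
    return [(i, j) for i in range(k) for j in range(k)]

def first_pure_nash(payoffs, k):
    """First pure Nash equilibrium in lexicographic order, or None.
    payoffs[(i, j)] = (r1, r2)."""
    for i, j in itertools.product(range(k), range(k)):
        r1, r2 = payoffs[(i, j)]
        if all(payoffs[(i2, j)][0] <= r1 for i2 in range(k)) and \
           all(payoffs[(i, j2)][1] <= r2 for j2 in range(k)):
            return (i, j)
    return None

def punish(payoffs, k):
    """Deterministic maximin punishment (a simplification): choose the
    agent action minimizing the adversary's best-response payoff."""
    return min(range(k),
               key=lambda i: max(payoffs[(i, j)][1] for j in range(k)))
```

After the k^2 exploration stages reveal the matrix, both players call the same equilibrium-selection routine, so they coordinate without communication; the punishment routine is invoked only if the observed play deviates from the schedule.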
\n\n² The value 0 given to the adversary does not play an important role here.\n\nProof (sketch): We consider the class of all 2 x 2 games and show that an ELE does not exist for this class under imperfect monitoring. Consider the following games:\n\nG1:\n          a1          a2\na1    (6, 0)      (0, 100)\na2    (5, -100)   (1, 500)\n\nG2:\n          a1          a2\na1    (6, 9)      (0, 1)\na2    (5, 11)     (1, 10)\n\nNotice that the payoffs obtained for a joint action in G1 and G2 are identical for player 1 and are different for player 2.\n\nThe only equilibrium of G1 is where both players play the second action, leading to (1, 500). The only equilibrium of G2 is where both players play the first action, leading to (6, 9). (These equilibria are unique because they are obtained by removal of strictly dominated strategies.)\n\nNow, assume that an ELE exists, and consider the corresponding policies of the players in that equilibrium. Notice that in order to have an ELE, the entry (6, 9) must be visited most of the time if the game is G2, and the entry (1, 500) must be visited most of the time if the game is G1; otherwise, player 1 (resp. player 2) will not obtain a high enough value in G2 (resp. G1), since its other payoffs in G2 (resp. G1) are lower.\n\nGiven the above, it is rational for player 2 to deviate, pretend that the game is always G1, and behave according to what the suggested equilibrium policy tells it to do in that case. Since the game might actually be G1, and player 1 cannot tell the difference, player 2 can induce both players to play the second action most of the time even when the game is G2, increasing its payoff from 9 to 10 and contradicting the existence of an ELE. ∎\n\nThe above result demonstrates that without additional assumptions, one cannot provide an ELE under imperfect monitoring. However, for certain restricted classes of games, we can provide an ELE under imperfect monitoring, as we now show. 
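The equilibrium claims in the proof sketch can be checked mechanically. The following brute-force enumeration (added by us, not from the paper) confirms that the unique pure equilibrium of G1 is the second joint action with payoffs (1, 500), that of G2 is the first joint action with payoffs (6, 9), and that player 1's payoffs are identical in both games.

```python
# Actions are indexed 0 and 1; entries are (player 1 payoff, player 2 payoff).
G1 = {(0, 0): (6, 0),    (0, 1): (0, 100),
      (1, 0): (5, -100), (1, 1): (1, 500)}
G2 = {(0, 0): (6, 9),    (0, 1): (0, 1),
      (1, 0): (5, 11),   (1, 1): (1, 10)}

def pure_nash(g):
    """All pure Nash equilibria: joint actions where neither player
    gains by a unilateral deviation."""
    eq = []
    for (i, j), (r1, r2) in sorted(g.items()):
        if all(g[(i2, j)][0] <= r1 for i2 in (0, 1)) and \
           all(g[(i, j2)][1] <= r2 for j2 in (0, 1)):
            eq.append((i, j))
    return eq

assert pure_nash(G1) == [(1, 1)]  # payoffs (1, 500)
assert pure_nash(G2) == [(0, 0)]  # payoffs (6, 9)
# Player 1 cannot distinguish the games from its own payoffs alone:
assert all(G1[a][0] == G2[a][0] for a in G1)
```

The last assertion is the crux of the counterexample: under imperfect monitoring, player 1's observations are the same in G1 and G2, which is what lets player 2 profitably pretend the game is G1.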
\n\nA game is called a common-interest game if for every joint action, all agents receive the same reward. We can show:\n\nTheorem 3 Let M_ci be the class of common-interest repeated games in which the number of actions each agent has is a. There exists an ELE for M_ci under strict imperfect monitoring.\n\nProof (sketch): The agents use the following algorithm: for m rounds, each agent selects an action uniformly at random. Following this, each agent plays the action that yielded the best reward; if multiple actions led to the best reward, the one that was observed first is selected. Here m is selected so that, with probability at least 1 - δ, every joint action is selected at least once. Using the Chernoff bound, we can choose m that is polynomial in the size of the game (which is a^k, where k is the number of agents) and in 1/δ. ∎\n\nThis result improves previous results in this area, such as the combination of Q-learning and fictitious play used in [3]. Not only does it provably converge in polynomial time, it is also guaranteed, with probability 1 - δ, to converge to the optimal Nash equilibrium of the game rather than to an arbitrary (and possibly non-optimal) Nash equilibrium.\n\n5 Conclusion\n\nWe defined the concept of efficient learning equilibrium - a normative criterion for learning algorithms. We showed that given perfect monitoring a learning algorithm satisfying ELE exists, while this is not the case under imperfect monitoring. In the full paper [2] we discuss related solution concepts, such as Pareto ELE. A Pareto ELE is similar to a (Nash) ELE, except that the requirement of attaining the expected payoffs of a Nash equilibrium is replaced by that of maximizing social surplus. We show that there exists a Pareto ELE for any perfect monitoring setting, and that a Pareto ELE does not always exist in an imperfect monitoring setting. 
In the full paper we also extend our discussion from repeated games to infinite-horizon stochastic games under the average reward criterion. We show that under perfect monitoring, there always exists a Pareto ELE in this setting. Please refer to [2] for additional details and the full proofs.\n\nReferences\n\n[1] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. In IJCAI'01, 2001.\n\n[2] R. I. Brafman and M. Tennenholtz. Efficient learning equilibrium. Technical Report 02-06, Dept. of Computer Science, Ben-Gurion University, 2002.\n\n[3] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multi-agent systems. In Proc. Workshop on Multi-Agent Learning, pages 602-608, 1997.\n\n[4] I. Erev and A. E. Roth. Predicting how people play games: Reinforcement learning in games with unique strategy equilibrium. American Economic Review, 88:848-881, 1998.\n\n[5] D. Fudenberg and D. Levine. The Theory of Learning in Games. MIT Press, 1998.\n\n[6] D. Fudenberg and J. Tirole. Game Theory. MIT Press, 1991.\n\n[7] J. Hu and M. P. Wellman. Multi-agent reinforcement learning: Theoretical framework and an algorithm. In Proc. 15th ICML, 1998.\n\n[8] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of AI Research, 4:237-285, 1996.\n\n[9] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proc. 11th ICML, pages 157-163, 1994.\n\n[10] L. S. Shapley. Stochastic games. In Proc. Nat. Acad. Sci. USA, volume 39, pages 1095-1100, 1953.", "award": [], "sourceid": 2147, "authors": [{"given_name": "Ronen", "family_name": "Brafman", "institution": null}, {"given_name": "Moshe", "family_name": "Tennenholtz", "institution": null}]}