{"title": "Extending Q-Learning to General Adaptive Multi-Agent Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 871, "page_last": 878, "abstract": "", "full_text": "Extending Q-Learning to General Adaptive\n\nMulti-Agent Systems\n\nGerald Tesauro\n\nIBM Thomas J. Watson Research Center\n\n19 Skyline Drive, Hawthorne, NY 10532 USA\n\ntesauro@watson.ibm.com\n\nAbstract\n\nRecent multi-agent extensions of Q-Learning require knowledge of other\nagents\u2019 payoffs and Q-functions, and assume game-theoretic play at\nall times by all other agents. This paper proposes a fundamentally\ndifferent approach, dubbed \u201cHyper-Q\u201d Learning, in which values of\nmixed strategies rather than base actions are learned, and in which other\nagents\u2019 strategies are estimated from observed actions via Bayesian in-\nference. Hyper-Q may be effective against many different types of adap-\ntive agents, even if they are persistently dynamic. Against certain broad\ncategories of adaptation, it is argued that Hyper-Q may converge to ex-\nact optimal time-varying policies.\nIn tests using Rock-Paper-Scissors,\nHyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent\n(IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary\nanalysis of Hyper-Q against itself is also presented.\n\n1 Introduction\n\nThe question of how agents may adapt their strategic behavior while interacting with other\narbitrarily adapting agents is a major challenge in both machine learning and multi-agent\nsystems research. While game theory provides a pricipled calculation of Nash equilibrium\nstrategies, it is limited in practical use due to hidden or imperfect state information, and\ncomputational intractability. Trial-and-error learning could develop good strategies by try-\ning many actions in a number of environmental states, and observing which actions, in\ncombination with actions of other agents, lead to high cumulative reward. 
This is highly effective for a single learner in a stationary environment, where algorithms such as Q-Learning [13] are able to learn optimal policies on-line without a model of the environment. Straight off-the-shelf use of RL algorithms such as Q-learning is problematic, however, because: (a) they learn deterministic policies, whereas mixed strategies are generally needed; (b) the environment is generally non-stationary due to adaptation of other agents.

Several multi-agent extensions of Q-Learning have recently been published. Littman [7] developed a convergent algorithm for two-player zero-sum games. Hu and Wellman [5] present an algorithm for two-player general-sum games, the convergence of which was clarified by Bowling [1]. Littman [8] also developed a convergent many-agent "friend-or-foe" Q-learning algorithm combining cooperative learning with adversarial learning. These all extend the normal Q-function of state-action pairs Q(s, a) to a function of states and joint actions of all agents, Q(s, a), where a now denotes the joint action vector. These algorithms make a number of strong assumptions which facilitate convergence proofs, but which may not be realistic in practice. These include: (1) other agents' payoffs are fully observable; (2) all agents use the same learning algorithm; (3) during learning, other agents' strategies are derivable via game-theoretic analysis of the current Q-functions. In particular, if the other agents employ non-game-theoretic or nonstationary strategies, the learned Q-functions will not accurately represent the expected payoffs obtained by playing against such agents, and the associated greedy policies will not correspond to best-response play against the other agents.

The aim of this paper is to develop more general and practical extensions of Q-learning avoiding the above assumptions.
The multi-agent environment is modeled as a repeated stochastic game in which other agents' actions are observable, but not their payoffs. Other agents are assumed to learn, but the forms of their learning algorithms are unknown, and their strategies may be asymptotically non-stationary. During learning, it is proposed to estimate other agents' current strategies from observation instead of game-theoretic analysis.

The above considerations lead to a new algorithm, presented in Section 2 of the paper, called "Hyper-Q Learning." Its key idea is to learn the value of joint mixed strategies, rather than joint base actions. Section 3 discusses the effects of function approximation, exploration, and other agents' strategy dynamics on Hyper-Q's convergence. Section 4 presents a Bayesian inference method for estimating other agents' strategies, by applying a recency-weighted version of Bayes' rule to the observed action sequence. Section 5 discusses implementation details of Hyper-Q in a simple Rock-Paper-Scissors test domain. Test results are presented against two recent algorithms for learning mixed strategies: Infinitesimal Gradient Ascent (IGA) [10], and Policy Hill Climbing (PHC) [2]. Preliminary results of Hyper-Q vs. itself are also discussed. Concluding remarks are given in Section 6.

2 General Hyper-Q formulation

An agent using normal Q-learning in a finite MDP repeatedly observes a state s, chooses a legal action a, and then observes an immediate reward r and a transition to a new state s'. The Q-learning equation is given by: ΔQ(s, a) = α(t)[r + γ max_b Q(s', b) − Q(s, a)], where γ is a discount parameter, and α(t) is an appropriate learning rate schedule.
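As a concrete point of reference, this tabular update can be sketched in a few lines (a minimal sketch; the toy environment, function name, and constants are illustrative, not from the paper):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) += alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Illustrative use on a toy two-state chain: taking action 1 in state 0
# yields reward 1 and moves to state 1 (all values start at zero).
Q = defaultdict(float)
q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

With all values initialized to zero, a single rewarded step moves Q(0, 1) from 0 to alpha * r = 0.1.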
Given a suitable method of exploring state-action pairs, Q-learning is guaranteed to converge to the optimal value function Q∗, and its associated greedy policy is thus an optimal policy π∗.

The multi-agent generalization of an MDP is called a stochastic game, in which each agent i chooses an action a_i in state s. Payoffs r_i to agent i and state transitions are now functions of joint actions of all agents. An important special class of stochastic games are matrix games, in which |S| = 1 and payoffs are functions only of joint actions. Rather than choosing the best action in a given state, an agent's task in a stochastic game is to choose the best mixed strategy x_i = x_i(s) given the expected mixed strategy x_{−i}(s) of all other agents. Here x_i denotes a set of probabilities summing to 1 for selecting each of the N_i = N_i(s) legal actions in state s. The space of possible mixed strategies is a continuous (N_i − 1)-dimensional unit simplex, and choosing the best mixed strategy is clearly more complex than choosing the best base action.

We now consider extensions of Q-learning to stochastic games. Given that the agent needs to learn a mixed strategy, which may depend on the mixed strategies of other agents, an obvious idea is to have the Q-function evaluate entire mixed strategies, rather than base actions, and to include in the "state" description an observation or estimate of the other agents' current mixed strategy. This forms the basis of the proposed Hyper-Q learning algorithm, which is formulated as follows. For notational simplicity, let x denote the Hyper-Q learner's current mixed strategy, and let y denote an estimated joint mixed strategy of all other agents (hereafter referred to as "opponents"). At time t, the agent generates a base action according to x, and then observes a payoff r, a new state s', and a new estimated opponent strategy y'.
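Concretely, once the strategy simplices are discretized to a finite grid (as is done later in Section 5), one such interaction step can be sketched as follows; the table layout and function names are illustrative, not the paper's code:

```python
from collections import defaultdict

def hyper_q_update(Q, s, y, x, r, s2, y2, x_grid, alpha=0.01, gamma=0.9):
    """One Hyper-Q step over a finite grid of the agent's own mixed strategies:
    Q(s,y,x) += alpha * [r + gamma * max_{x'} Q(s',y',x') - Q(s,y,x)]."""
    best = max(Q[(s2, y2, x2)] for x2 in x_grid)
    Q[(s, y, x)] += alpha * (r + gamma * best - Q[(s, y, x)])

def greedy_strategy(Q, s, y, x_grid):
    """Greedy policy over the grid: x_hat(s,y) = argmax_x Q(s,y,x)."""
    return max(x_grid, key=lambda x2: Q[(s, y, x2)])

# Illustrative use: one matrix-game state, two pure strategies on the grid.
Q = defaultdict(float)
x_grid = [(1.0, 0.0), (0.0, 1.0)]
y = (0.5, 0.5)
hyper_q_update(Q, 's', y, x_grid[0], r=1.0, s2='s', y2=y, x_grid=x_grid)
```

The only structural change from ordinary Q-learning is that the table is keyed by (state, estimated opponent strategy, own strategy) rather than (state, action).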
The Hyper-Q function Q(s, y, x) is then adjusted according to:

ΔQ(s, y, x) = α(t)[r + γ max_{x'} Q(s', y', x') − Q(s, y, x)]     (1)

The greedy policy x̂ associated with any Hyper-Q function is then defined by:

x̂(s, y) = arg max_x Q(s, y, x)     (2)

3 Convergence of Hyper-Q Learning

3.1 Function approximation

Since Hyper-Q is a function of continuous mixed strategies, one would expect it to require some sort of function approximation scheme. Establishing convergence of Q-learning with function approximation is substantially more difficult than for a normal Q-table for a finite MDP, and there are a number of well-known counterexamples. In particular, finite discretization may cause a loss of an MDP's Markov property [9].

Several recent function approximation schemes [11, 12] enable Q-learning to work well in continuous spaces. There is at least one discretization scheme, Finite Difference Reinforcement Learning [9], that provably converges to the optimal value function of the underlying continuous MDP. This paper employs a simple uniform grid discretization of the mixed strategies of the Hyper-Q agent and its opponents. No attempt will be made to prove convergence under this scheme. However, for certain types of opponent dynamics described below, a plausible conjecture is that a Finite-Difference-RL implementation of Hyper-Q will be provably convergent.

3.2 Exploration

Convergence of normal Q-learning requires visiting every state-action pair infinitely often. The clearest way to achieve this in simulation is via exploring starts, in which training consists of many episodes, each starting from a randomly selected state-action pair. For real environments where this may not be feasible, one may utilize off-policy randomized exploration, e.g., ε-greedy policies.
This will ensure that, for all visited states, every action will be tried infinitely often, but does not guarantee that all states will be visited infinitely often (unless the MDP has an ergodicity property). As a result one would not expect the trained Q function to exactly match the ideal optimal Q∗ for the MDP, although the difference in expected payoffs of the respective policies should be vanishingly small.

The above considerations should apply equally to Hyper-Q learning. The use of exploring starts for states, agent and opponent mixed strategies should guarantee sufficient exploration of the state-action space. Without exploring starts, the agent can use ε-greedy exploration to at least obtain sufficient exploration of its own mixed strategy space. If the opponents also do similar exploration, the situation should be equivalent to normal Q-learning, where some stochastic game states might not be visited infinitely often, but the cost in expected payoff should be vanishingly small. If the opponents do not explore, the effect could be a further reduction in the effective state space explored by the Hyper-Q agent (where "effective state" = stochastic game state plus opponent strategy state). Again this should have a negligible effect on the agent's long-run expected payoff relative to the policy that would have been learned with opponent exploration.

3.3 Opponent strategy dynamics

Since opponent strategies can be governed by arbitrarily complicated dynamical rules, it seems unlikely that Hyper-Q learning will converge for arbitrary opponents. Nevertheless, some broad categories can be identified under which convergence should be achievable. One simple example is that of a stationary opponent strategy, i.e., y(s) is a constant.
In this case, the stochastic game obviously reduces to an equivalent MDP with stationary state transitions and stationary payoffs, and with the appropriate conditions on exploration and learning rates, Hyper-Q will clearly converge to the optimal value function.

Another important broad class of dynamics consists of opponent strategies that evolve according to a fixed, history-independent rule depending only on themselves and not on actions of the Hyper-Q player, i.e., y_{t+1} = f(s, y_t). This is a reasonable approximation for many-player games in which any individual has negligible "market impact," or in which a player's influence on another player occurs only through a global summarization function [6]. In such cases the relevant population strategy representation need only express global summarizations of activity (e.g. averages), not details of which player does what. An example is the "Replicator Dynamics" model from evolutionary game theory [14], in which a strategy grows or decays in a population according to its fitness relative to the population average fitness. This leads to a history-independent first-order differential equation ẏ = f(y) for the population average strategy. In such models, the Hyper-Q learner again faces an effective MDP in which the effective state (s, y) undergoes stationary history-independent transitions, so that Hyper-Q should be able to converge.

A final interesting class of dynamics occurs when the opponent can accurately estimate the Hyper-Q strategy x, and then adapts its strategy using a fixed history-independent rule: y_{t+1} = f(s, y_t, x_t). This can occur if players are required to announce their mixed strategies, or if the Hyper-Q player voluntarily announces its strategy.
An example is the Infinitesimal Gradient Ascent (IGA) model [10], in which the agent uses knowledge of the current strategy pair (x, y) to make a small change in its strategy in the direction of the gradient of immediate payoff P(x, y). Once again, this type of model reduces to an MDP with stationary history-independent transitions of effective state depending only on (s, y, x).

Note that the above claims of reduction to an MDP depend on the Hyper-Q learner being able to accurately estimate the opponent mixed strategy y. Otherwise, the Hyper-Q learner would face a POMDP situation, and standard convergence proofs would not apply.

4 Opponent strategy estimation

We now consider estimation of opponent strategies from the history of base actions. One approach to this is model-based, i.e., to consider a class of explicit dynamical models of opponent strategy, and choose the model that best fits the observed data. There are two difficult aspects to this approach: (1) the class of possible dynamical models may need to be extraordinarily large; (2) there is a well-known danger of "infinite regress" of opponent models if A's model of B attempts to take into account B's model of A.

An alternative approach studied here is model-free strategy estimation. This is in keeping with the spirit of Q-learning, which learns state valuations without explicitly modeling the dynamics of the underlying state transitions. One simple method used in the following section is the well-known Exponential Moving Average (EMA) technique. This maintains a moving average ȳ of opponent strategy by updating after each observed action using:

ȳ(t + 1) = (1 − µ)ȳ(t) + µ u_a(t)     (3)

where u_a(t) is a unit vector representation of the base action a.
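A sketch of this update (a minimal implementation of equation 3; the function name and indexing convention are illustrative):

```python
def ema_update(y_bar, action, mu=0.005):
    """Exponential moving average of the opponent's mixed strategy (equation 3):
    y_bar(t+1) = (1 - mu) * y_bar(t) + mu * u_a(t),
    where u_a is the unit vector for the observed base action."""
    return [(1.0 - mu) * p + (mu if i == action else 0.0)
            for i, p in enumerate(y_bar)]

# Illustrative: start from uniform over Rock/Paper/Scissors and observe
# one "Rock" (index 0); the Rock component rises, the others decay.
y = [1/3, 1/3, 1/3]
y = ema_update(y, action=0)
```

Because the unit vector sums to 1, the estimate remains a valid probability distribution after every update.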
EMA assumes only that recent observations are more informative than older observations, and should give accurate estimates when significant strategy changes take place on time scales > O(1/µ).

4.1 Bayesian strategy estimation

A more principled model-free alternative to EMA is now presented. We assume a discrete set of possible values of y (e.g. a uniform grid). A probability for each y given the history of observed actions H, P(y|H), can then be computed using Bayes' rule as follows:

P(y|H) = P(H|y)P(y) / Σ_{y'} P(H|y')P(y')     (4)

where P(y) is the prior probability of state y, and the sum over y' extends over all strategy grid points. The conditional probability of the history given the strategy, P(H|y), can now be decomposed into a product of individual action probabilities Π_{k=0}^{t} P(a(k)|y(t)), assuming conditional independence of the individual actions. If all actions in the history are equally informative regardless of age, we may write P(a(k)|y(t)) = y_{a(k)}(t) for all k. This corresponds to a Naive-Bayes equal weighting of all observed actions. However, it is again reasonable to assume that more recent actions are more informative. The way to implement this in a Bayesian context is with exponent weights w_k that increase with k [4]. Within a normalization factor, we then write:

P(H|y) = Π_{k=0}^{t} [y_{a(k)}]^{w_k}     (5)

A linear schedule w_k = 1 − µ(t − k) for the weights is intuitively obvious; truncation of the history at the most recent 1/µ observations ensures that all weights are positive.

5 Implementation and Results

We now examine the performance of Hyper-Q learning in a simple two-player matrix game, Rock-Paper-Scissors.
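Before turning to the implementation, the recency-weighted Bayesian estimator of equations 4 and 5 can be sketched as follows (the grid representation and helper names are illustrative assumptions, not the paper's code):

```python
def bayes_posterior(grid, history, mu=0.005, prior=None):
    """Recency-weighted Bayesian estimate of the opponent's mixed strategy
    (equations 4 and 5): posterior over a finite grid of candidate strategies,
    with exponent weights w_k = 1 - mu*(t - k) and the history truncated to the
    most recent 1/mu observed actions so that all weights stay positive."""
    t = len(history) - 1
    start = max(0, len(history) - int(1 / mu))
    prior = prior or [1.0] * len(grid)      # uniform prior by default
    post = []
    for y, p in zip(grid, prior):
        for k in range(start, t + 1):
            p *= y[history[k]] ** (1.0 - mu * (t - k))
        post.append(p)
    z = sum(post)                            # normalization of equation 4
    return [p / z for p in post]

# Illustrative: after three observed Rocks (action 0), a Rock-heavy candidate
# strategy should dominate the posterior.
grid = [(0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (1/3, 1/3, 1/3)]
posterior = bayes_posterior(grid, history=[0, 0, 0])
```

Unlike EMA, this yields a full distribution over candidate opponent strategies, which is what the modified update rule of Section 5.1 consumes.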
A uniform grid discretization of size N = 25 is used to represent mixed-strategy component probabilities, giving a simplex grid of size N(N + 1)/2 = 325 for either player's mixed strategy, and thus the entire Hyper-Q table is of size (325)^2 = 105625. All simulations use γ = 0.9, and for simplicity, a constant learning rate α = 0.01.

5.1 Hyper-Q/Bayes formulation

Three different opponent estimation schemes were used with Hyper-Q learning: (1) "Omniscient," i.e. perfect knowledge of the opponent's strategy; (2) EMA, using equation 3 with µ = 0.005; (3) Bayesian, using equations 4 and 5 with µ = 0.005 and a uniform prior. Equations 1 and 2 were modified in the Bayesian case to allow for a distribution of opponent states y, with probabilities P(y|H). The corresponding equations are:

ΔQ(y, x) = α(t)P(y|H)[r + γ max_{x'} Q(y', x') − Q(y, x)]     (6)

x̂ = arg max_x Σ_y P(y|H)Q(y, x)     (7)

A technical note regarding equation 6 is that, to improve tractability of the algorithm, an approximation P(y|H) ≈ P(y'|H') is used, so that the Hyper-Q table updates are performed using the updated distribution P(y'|H').

5.2 Rock-Paper-Scissors results

We first examine Hyper-Q training online against an IGA player. Apart from possible state observability and discretization issues, Hyper-Q should in principle be able to converge against this type of opponent. In order to conform to the original implicit assumptions underlying IGA, the IGA player is allowed to have omniscient knowledge of the Hyper-Q player's mixed strategy at each time step. Policies used by both players are always greedy, apart from resets to uniform random values every 1000 time steps.

Figure 1 shows a smoothed plot of the online Bellman error, and the Hyper-Q player's average reward per time step, as a function of training time.
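The grid size quoted in Section 5 can be checked by direct enumeration (a small sketch; the recursive construction is one plausible implementation, not the paper's code):

```python
def simplex_grid(n_actions=3, n=25):
    """Enumerate all mixed strategies whose components are multiples of
    1/(n-1); for 3 actions this gives n*(n+1)/2 grid points when n = 25."""
    d = n - 1  # number of probability increments of size 1/(n-1) to distribute
    points = []
    def distribute(prefix, remaining, slots):
        if slots == 1:
            points.append(tuple(prefix + [remaining / d]))
            return
        for i in range(remaining + 1):
            distribute(prefix + [i / d], remaining - i, slots - 1)
    distribute([], d, n_actions)
    return points

grid = simplex_grid()
```

Each grid point is a valid mixed strategy (components sum to 1), and the count matches the N(N + 1)/2 = 325 figure for Rock-Paper-Scissors.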
Figure 1: Results of Hyper-Q learning vs. an IGA player in Rock-Paper-Scissors, using three different opponent state estimation methods: "Omniscient," "EMA" and "Bayes" as indicated. Random strategy restarts occur every 1000 time steps. Left plot shows smoothed online Bellman error. Right plot shows average Hyper-Q reward per time step.

Figure 2: Trajectory of the IGA mixed strategy against the Hyper-Q strategy starting from a single exploring start. Dots show Hyper-Q player's cumulative (rescaled) reward.

The figure exhibits good progress toward convergence, as suggested by substantially reduced Bellman error and substantial positive average reward per time step. Among the three estimation methods used, Bayes reached the lowest Bellman error at long time scales. This is probably because it updates many elements in the Hyper-Q table per time step, whereas the other techniques only update a single element.
Bayes also has by far the worst average reward at the start of learning, but asymptotically it clearly outperforms EMA, and comes close to matching the performance obtained with omniscient knowledge of opponent state.

Part of Hyper-Q's advantage comes from exploiting transient behavior starting from a random initial condition. In addition, Hyper-Q also exploits the asymptotic behavior of IGA, as shown in figure 2. This plot shows that the initial transient lasts at most a few thousand time steps. Afterwards, the Hyper-Q policy causes IGA to cycle erratically between two different probabilities for Rock and two different probabilities for Paper, thus preventing IGA from reaching the Nash mixed strategy. The overall profit to Hyper-Q during this cycling is positive on average, as shown by rising cumulative Hyper-Q reward. The observed cycling with positive profitability is reminiscent of an algorithm called PHC-Exploiter [3] in play against a PHC player. An interesting difference is that PHC-Exploiter uses an explicit model of its opponent's behavior, whereas no such model is needed by a Hyper-Q learner.

Figure 3: Results of Hyper-Q vs. PHC in Rock-Paper-Scissors. Left plot shows smoothed online Bellman error. Right plot shows average Hyper-Q reward per time step.

We now examine Hyper-Q vs. a PHC player. PHC is a simple adaptive strategy based only on its own actions and rewards.
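A minimal sketch of such a rule, specialized to a single-state matrix game (the parameter names and the clipping scheme here are illustrative simplifications of PHC [2], not the exact published algorithm):

```python
def phc_step(q, pi, action, reward, alpha=0.1, delta=0.01):
    """One Policy Hill-Climbing step (simplified matrix-game sketch of PHC [2]):
    update the Q-value of the played action, then move the mixed strategy a
    small step delta toward the greedy action of the current Q-table."""
    q[action] += alpha * (reward - q[action])   # no successor state in a matrix game
    best = max(range(len(q)), key=lambda a: q[a])
    step = delta / (len(q) - 1)
    pi = [min(1.0, p + delta) if a == best else max(0.0, p - step)
          for a, p in enumerate(pi)]
    z = sum(pi)                                  # renormalize after clipping
    return q, [p / z for p in pi]

# Illustrative: a rewarded Rock (action 0) pulls the strategy toward Rock.
q, pi = phc_step([0.0, 0.0, 0.0], [1/3, 1/3, 1/3], action=0, reward=1.0)
```

Note that nothing in this update depends on the opponent's observed actions, which is why PHC is exploitable by an opponent that models its drift.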
It maintains a Q-table of values for each of its base actions, and at every time step, it adjusts its mixed strategy by a small step towards the greedy policy of its current Q-function. The PHC strategy is history-dependent, so that reduction to an MDP is not possible for the Hyper-Q learner. Nevertheless Hyper-Q does exhibit substantial reduction in Bellman error, and also significantly exploits PHC in terms of average reward, as shown in figure 3. Given that PHC ignores opponent state, it should be a weak competitive player, and in fact it does much worse in average reward than IGA. It is also interesting to note that Bayesian estimation once again clearly outperforms EMA estimation, and surprisingly, it also outperforms omniscient state knowledge. This is not yet understood and is a focus of ongoing research.

Figure 4: Smoothed online Bellman error for Hyper-Q vs. itself. Left plot uses Omniscient state estimation; right plot uses Bayesian estimation.

Finally, we examine preliminary data for Hyper-Q vs. itself. The average reward plots are uninteresting: as one would expect, each player's average reward is close to zero. The online Bellman error, shown in figure 4, is more interesting. Surprisingly, the plots are less noisy and achieve asymptotic errors as low or lower than against either IGA or PHC. Since Hyper-Q's play is history-dependent, one can't argue for MDP equivalence.
However, it is possible that the players' greedy policies x̂(y) and ŷ(x) simultaneously become stationary, thereby enabling them to optimize against each other. In examining the actual play, it does not converge to the Nash point (1/3, 1/3, 1/3), but it does appear to cycle amongst a small number of grid points with roughly zero average reward over the cycle for both players. Conceivably, Hyper-Q could have converged to a cyclic Nash equilibrium of the repeated game, which would certainly be a nice outcome of self-play learning in a repeated game.

6 Conclusion

Hyper-Q Learning appears to be more versatile and general-purpose than any published multi-agent extension of Q-Learning to date. With grid discretization it scales badly, but with other function approximators it may become practical. Some tantalizing early results were found in Rock-Paper-Scissors tests against some recently published adaptive opponents, and also against itself. Research on this topic is very much a work in progress. Vastly more research is needed to develop a satisfactory theoretical analysis of the approach, an understanding of what kinds of realistic environments it can be expected to do well in, and versions of the algorithm that can be successfully deployed in those environments.

Significant improvements in opponent state estimation should be easy to obtain. More principled methods for setting recency weights should be achievable; for example, [4] proposes a method for training optimal weight values based on observed data. The use of time-series prediction and data mining methods might also result in substantially better estimators.
Model-based estimators are also likely to be advantageous where one has a reasonable basis for modeling the opponents' dynamical behavior.

Acknowledgements: The author thanks Michael Littman for many helpful discussions; Irina Rish for insights into Bayesian state estimation; and Michael Bowling for assistance in implementing the PHC algorithm.

References

[1] M. Bowling. Convergence problems of general-sum multiagent reinforcement learning. In Proceedings of ICML-00, pages 89–94, 2000.

[2] M. Bowling and M. Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136:215–250, 2002.

[3] Y.-H. Chang and L. P. Kaelbling. Playing is believing: the role of beliefs in multi-agent learning. In Proceedings of NIPS-2001. MIT Press, 2002.

[4] S. J. Hong, J. Hosking, and R. Natarajan. Multiplicative adjustment of class probability: educating naive Bayes. Technical Report RC-22393, IBM Research, 2002.

[5] J. Hu and M. P. Wellman. Multiagent reinforcement learning: theoretical framework and an algorithm. In Proceedings of ICML-98, pages 242–250. Morgan Kaufmann, 1998.

[6] M. Kearns and Y. Mansour. Efficient Nash computation in large population games with bounded influence. In Proceedings of UAI-02, pages 259–266, 2002.

[7] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of ICML-94, pages 157–163. Morgan Kaufmann, 1994.

[8] M. L. Littman. Friend-or-Foe Q-learning in general-sum games. In Proceedings of ICML-01. Morgan Kaufmann, 2001.

[9] R. Munos. A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. In Proceedings of IJCAI-97, pages 826–831. Morgan Kaufmann, 1997.

[10] S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of UAI-2000, pages 541–548. Morgan Kaufmann, 2000.

[11] W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of ICML-00, pages 903–910, 2000.

[12] W. T. B. Uther and M. M. Veloso. Tree based discretization for continuous state space reinforcement learning. In Proceedings of AAAI-98, pages 769–774, 1998.

[13] C. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.

[14] J. W. Weibull. Evolutionary Game Theory. The MIT Press, 1995.", "award": [], "sourceid": 2503, "authors": [{"given_name": "Gerald", "family_name": "Tesauro", "institution": null}]}