{"title": "All learning is Local: Multi-agent Learning in Global Reward Games", "book": "Advances in Neural Information Processing Systems", "page_first": 807, "page_last": 814, "abstract": "", "full_text": "All learning is local:\n\nMulti-agent learning in global reward games\n\nYu-Han Chang\nMIT CSAIL\nCambridge, MA 02139\nychang@csail.mit.edu\n\nTracey Ho\nLIDS, MIT\nCambridge, MA 02139\ntrace@mit.edu\n\nLeslie Pack Kaelbling\nMIT CSAIL\nCambridge, MA 02139\nlpk@csail.mit.edu\n\nAbstract\n\nIn large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent\u2019s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy.\n\n1 Introduction\n\nLearning in a single-agent stationary-environment setting can be a hard problem, but relative to the multi-agent learning problem, it is easy. The multi-agent learning problem has been approached using a variety of techniques, from game theory to partially observable Markov decision processes. The solutions are often complex. We take a different approach in this paper, presenting a simplifying abstraction and a reward-filtering technique that allows computationally efficient and robust learning in large multi-agent environments where other methods may fail or become intractable.\n\nIn many multi-agent settings, our learning agent does not have a full view of the world. Other agents may be far away or otherwise obscured. 
At the very least, our learning agent usually does not have a complete representation of the internal states of the other agents. This partial observability creates problems when the agent begins to learn about the world, since it cannot see how the other agents are manipulating the environment and thus it cannot ascertain the true world state. It may be appropriate to model the observable world as a non-stationary Markov Decision Process (MDP). A separate problem arises when we train multiple agents using a global reward signal. This is often the case in cooperative games in which all the agents contribute towards attaining some common goal. Even with full observability, the agents would need to overcome a credit assignment problem, since it may be difficult to ascertain which agents were responsible for creating good reward signals. If we cannot even observe what the other agents are doing, how can we begin to reason about their role in obtaining the current reward?\n\nConsider an agent in an MDP, learning to maximize a reward that is a function of its observable state and/or actions. There are many well-studied learning techniques to do this [Sutton and Barto, 1999]. The effects of non-stationarity, partial observability, and global rewards can be thought of as replacing the true reward signal with an alternate signal that is a non-stationary function of the original reward. Think of the difference between learning with a personal coach and learning in a large class where feedback is given only on collective performance. This causes problems for an agent that is trying to use the collective \u201cglobal\u201d reward signal to learn an optimal policy. 
Ideally the agent can recover the original \u201cpersonal reward signal\u201d and learn using that signal rather than the global reward signal.\n\nWe show that in many naturally arising situations of this kind, an effective approach is for an individual agent to model the observed global reward signal as the sum of its own contribution (which is the personal reward signal on which it should base its learning) and a random Markov process (which is the amount of the observed reward due to other agents or external factors). With such a simple model, we can estimate both of these quantities efficiently using an online Kalman filtering process. Many external sources of reward (which could be regarded as noise) can be modeled as or approximated by a random Markov process, so this technique promises broad applicability. This approach is more robust than trying to learn directly from the global reward, allowing agents to learn and converge faster to an optimal or near-optimal policy.\n\n2 Related Work\n\nThis type of problem has been approached in the past using a variety of techniques. For slowly varying environments, Szita et al. [2002] show that Q-learning will converge as long as the variation per time step is small enough. In our case, we attempt to tackle problems where the variation could be larger. Choi et al. [1999] investigate models in which there are \u201chidden modes\u201d. When the environment switches between modes, all the rewards may be altered. This works if we have fairly detailed domain knowledge about the types of modes we expect to encounter. For variation produced by the actions of other agents in the world, or for truly unobservable environmental changes, this technique would not work as well. Auer et al. [1995] show that in arbitrarily varying environments, we can craft a regret-minimizing strategy for playing repeated games. 
The results are largely theoretical in nature and can yield fairly loose performance bounds, especially in stochastic games. Rather than filtering the rewards as we will do, Ng et al. [1999] show that a potential function can be used to shape the rewards without affecting the learned policy while possibly speeding up convergence. This assumes that learning would converge in the first place, though possibly taking a very long time. Moreover, it requires domain knowledge to craft this shaping function. Wolpert and Tumer [1999] provide a framework called COIN, or collective intelligence, for analyzing distributed reinforcement learning. They stress the importance of choosing utility functions that lead to good policies. Finally, McMahan et al. [2003] discuss learning in the scenario in which the opponent gets to choose the agent\u2019s reward function.\n\nThe innovative aspect of our approach is to consider the reward signal as merely a signal that is correlated with our true learning signal. We propose a model that captures the relationship between the true reward and the noisy rewards in a wide range of problems. Thus, without assuming much additional domain knowledge, we can use filtering methods to recover the underlying true reward signal from the noisy observed global rewards.\n\n3 Mathematical model\n\nThe agent assumes that the world possesses one or more unobservable state variables that affect the global reward signal. These unobservable states may include the presence of other agents or changes in the environment. 
Each agent models the effect of these unobservable state variables on the global reward as an additive noise process b_t that evolves according to b_{t+1} = b_t + z_t, where z_t is a zero-mean Gaussian random variable with variance σ_w. The global reward that it observes if it is in state i at time t is g_t = r(i) + b_t, where r is a vector containing the ideal training rewards r(i) received by the agent at state i. The standard model that describes such a linear system is:\n\ng_t = C x_t + v_t,   v_t ∼ N(0, Σ_2)\nx_t = A x_{t-1} + w_t,   w_t ∼ N(0, Σ_1)\n\nIn our case, we desire estimates of x_t = [r_t^T b_t]^T. We impart our domain knowledge into the model by specifying the estimated variance and covariance of the components of x_t. We set Σ_2 = 0, since we assume no observation noise when we experience rewards; Σ_1(j, j) = 0 for j ≠ |S| + 1, since the rewards are fixed and do not evolve over time; and Σ_1(|S| + 1, |S| + 1) = σ_w, since the noise term evolves with variance σ_w. The system matrix is A = I, and the observation matrix is C = [0 0 ... 1_i ... 0 0 1], where the 1_i occurs in the ith position when our observed state is state i.\n\nKalman filters [Kalman, 1960] are Bayes optimal, minimum mean-squared-error estimators for linear systems with Gaussian noise. The agent applies the following causal Kalman filtering equations at each time step to obtain maximum likelihood estimates for b and the individual rewards r(i) for each state i, given all previous observations. 
First, the estimate x̂ and its covariance matrix P are updated in time based on the linear system model:\n\nx̂′_t = A x̂_{t-1}   (1)\nP′_t = A P_{t-1} A^T + Σ_1   (2)\n\nThen these a priori estimates are updated using the current time period\u2019s observation g_t:\n\nK_t = P′_t C^T (C P′_t C^T + Σ_2)^{-1}   (3)\nx̂_t = x̂′_t + K_t (g_t − C x̂′_t)   (4)\nP_t = (I − K_t C) P′_t   (5)\n\nAs shown, the Kalman filter also gives us the estimation error covariance P_t, from which we know the variance of the estimates for r and b. We can also compute the likelihood of observing g_t given the model and all the previous observations. This will be handy for evaluating the fit of our model, if needed. We could also create more complicated models if our domain knowledge shows that a different model would be more suitable. For example, if we wanted to capture the effect of an upward bias in the evolution of the noise process (perhaps to model the fact that all the agents are learning and achieving higher rewards), we could add another variable u, initialized such that u_0 > 0, modifying x to be x = [r^T b u]^T, and changing our noise term update equation to b_{t+1} = b_t + u_t + w_t. In other cases, we might wish to use non-linear models that would require more sophisticated techniques such as extended Kalman filters.\n\nFor the learning mechanism, we use a simple tabular Q-learning algorithm [Sutton and Barto, 1999], since we wish to focus our attention on the reward signal problem. 
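As a concrete illustration, the update equations (1)-(5) can be sketched in a few lines of numpy. The function name, argument layout, and the scalar-innovation simplification below are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def kalman_step(x_hat, P, g_t, i, Sigma1):
    """One causal Kalman update for the reward-filtering model.

    x_hat  : current estimate of [r(0), ..., r(|S|-1), b]
    P      : covariance of that estimate
    g_t    : observed global reward while in (0-indexed) state i
    Sigma1 : process noise covariance (all zeros except sigma_w for b)
    Assumes A = I and no observation noise (Sigma_2 = 0), as in the text.
    """
    n = x_hat.shape[0]
    # Observation row C: picks out r(i) and adds the noise term b.
    C = np.zeros((1, n))
    C[0, i] = 1.0
    C[0, -1] = 1.0

    # Time update, equations (1)-(2); with A = I only the covariance grows.
    x_prior = x_hat
    P_prior = P + Sigma1

    # Measurement update, equations (3)-(5); the innovation is a scalar
    # here, so the matrix inverse reduces to a division.
    S = (C @ P_prior @ C.T).item()     # C P' C^T + Sigma_2, with Sigma_2 = 0
    K = (P_prior @ C.T) / S            # Kalman gain, shape (n, 1)
    x_new = x_prior + K.ravel() * (g_t - C @ x_prior).item()
    P_new = (np.eye(n) - K @ C) @ P_prior
    return x_new, P_new
```

A single step splits the observed reward between the visited state's reward estimate r̂(i) and the drift estimate b̂ in proportion to their current uncertainties.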
Q-learning keeps a \u201cQ-value\u201d for each state-action pair, and proceeds using the following update rule:\n\nQ_t(s, a) = (1 − α) Q_{t-1}(s, a) + α (r + γ max_{a′} Q_t(s′, a′)),   (6)\n\nwhere 0 < α < 1 is a parameter that controls the learning rate, r is the reward signal used for learning at time t given s and a, 0 < γ ≤ 1 is the discount factor, and s, a, and s′ are the current state, action, and next state of the agent, respectively. Under fairly general conditions, in a stationary MDP, Q-learning converges to the optimal policy, expressed as\n\nπ(s) = argmax_a Q(s, a).\n\nFigure 1: This shows the dynamics of our 5x5 grid world domain. The states correspond to the grid locations, numbered 1,2,3,4,...,24,25. Actions move the agent N, S, E, or W, except in states 6 and 16, where any action takes the agent to state 10 and 18, respectively, shown by the curved arrows in the figure at left. The optimal policy is shown at center, where multiple arrows at one state denote indifference between the possibilities. A policy learned by our filtering agent is shown at right.\n\n4 The filtering learning agent\n\nLike any good student, the filtering learning agent chooses to accept well-deserved praise from its teacher and ignore over-effusive rewards. The good student does not update his behavior at every time step, but only upon observing relevant rewards. The question remains: how does an agent decide upon the relevance of the rewards it sees? We have proposed a model in which undeserved rewards over time are captured by a Markov random process b. Using observations from previous states and actions, an agent can approach this question from two perspectives. 
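The update rule in equation 6 can be sketched directly. The dictionary-backed table, action labels, and parameter values here are illustrative choices of ours, not the paper's implementation:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update, equation (6), for reward maximization."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def greedy_policy(Q, s, actions):
    """pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)   # unseen state-action pairs default to Q = 0
q_update(Q, s=6, a='N', r=20.0, s_next=10, actions=['N', 'S', 'E', 'W'])
# With all-zero initial values, Q[(6, 'N')] is now 0.1 * 20 = 2.0
```

The filtering agent of Section 4 uses exactly this update, but passes the filtered estimate r̂(i) in place of the raw global reward.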
In the first, each time the agent visits a particular state i, it should gain a better sense of the evolution of the random variable b between its last visit and its current visit. It is important to note that rewards are received frequently, thus allowing frequent updating of b. Secondly, given an estimate of b_t upon visiting state i at time t, it has a better idea of the value of b_{t+1} when it visits state i′ at time t + 1, since we assume b_t evolves slowly over time. These are the ideas captured by the causal Kalman filter, which only uses the history of past states and observations to provide estimates of r(i) and b. The agent follows this simple algorithm:\n\n1. From initial state i_0, take some action a, transition to state i, and receive reward signal g_0. Initialize x̂_0(i_0) = g_0 and x̂_0(|S| + 1) = b̂_0 = 0, since b_0 = 0.\n\n2. Perform a Kalman update using equations 1-5 to compute the current vector of estimates x̂, which includes a component that is the reward estimate r̂(i_0), which will simply equal g_0 this time.\n\n3. From the current state i at time t, take another action with some mix of exploration and exploitation; transition to state j, receiving reward signal g_t. If this is the first visit to state i, initialize x̂_t(i) = g_t − b̂_{t-1}.\n\n4. Perform a Kalman update using equations 1-5 to compute the current vector of estimates x̂, which includes a component that is the reward estimate r̂(i).\n\n5. Update the Q-table using r̂(i) in place of r in equation 6; return to Step 3.\n\nThe advantage of the Kalman filter is that it requires a constant amount of memory: at no time does it need a full history of states and observations. Instead, it computes a sufficient statistic during each update, x̂ and P, which consists of the maximum likelihood estimate of r and b, and the covariance matrix of this estimate. Thus, we can run this algorithm online as we learn, and its speed does not deteriorate over time. 
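Putting the filter and the Q-update together, the five-step algorithm above can be sketched end to end. The toy two-state environment, the `env.reset`/`env.step` interface, and all parameter values are hypothetical stand-ins of ours; because this toy adds no external noise, the filter's reward estimates should come out exact:

```python
import random
import numpy as np

class TwoStateEnv:
    """Hypothetical toy world: action 1 switches state, action 0 stays.
    The observed global reward is the true reward of the state the agent
    was just in; no external noise is added, so b stays at zero."""
    R = [0.0, 10.0]
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        r = self.R[self.s]
        self.s = self.s if a == 0 else 1 - self.s
        return self.s, r

def run_filtering_agent(env, n_states, actions, steps=5000,
                        sigma_w=0.1, alpha=0.1, gamma=0.9, eps=0.1):
    n = n_states + 1                        # estimate x = [r(0..|S|-1), b]
    x, P = np.zeros(n), np.eye(n)
    Sigma1 = np.zeros((n, n))
    Sigma1[-1, -1] = sigma_w                # only the noise term b evolves
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    seen = set()
    s = env.reset()
    for _ in range(steps):
        # Step 3: act with an epsilon-greedy mix of exploration/exploitation.
        a = (random.choice(actions) if random.random() < eps
             else max(actions, key=lambda a2: Q[(s, a2)]))
        s2, g = env.step(a)
        if s not in seen:                   # first visit: x_t(i) = g_t - b
            x[s] = g - x[-1]
            seen.add(s)
        # Step 4: Kalman update (equations 1-5); A = I, no observation noise.
        P_prior = P + Sigma1
        C = np.zeros((1, n))
        C[0, s] = 1.0
        C[0, -1] = 1.0
        K = (P_prior @ C.T) / (C @ P_prior @ C.T).item()
        x = x + K.ravel() * (g - C @ x).item()
        P = (np.eye(n) - K @ C) @ P_prior
        # Step 5: Q-learning (equation 6) on the filtered estimate r_hat(s).
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (
            x[s] + gamma * max(Q[(s2, a2)] for a2 in actions))
        s = s2
    return Q, x
```

With a matched model, r̂ is recovered up to a common offset shared with b̂, which leaves the greedy policy unchanged.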
Its speed is most tied to the number of observation states that we choose to use, since the Kalman update (Eqn. 3) needs to perform a matrix inversion of size |S| × |S|. However, since our model assumes the agent only has access to a limited, local observation space within the true global state space, this computation remains feasible.\n\nFigure 2: (Left) As the agent is attempting to learn, the reward signal value (y-axis) changes dramatically over time (x-axis) due to the noise term. While the true range of rewards in this grid world domain only falls between 0 and 20, the noisy reward signal ranges from -10 to 250, as shown in the graph at left. (Center) Given this noisy signal, the filtering agent is still able to learn the true underlying rewards, converging to the correct relative values over time, as shown in the middle graph. (Right) The filtering learning agent (bold line) accrues higher rewards over time than the ordinary Q-learner (thin line), since it is able to converge to an optimal policy whereas the non-filtering Q-learner remains confused.\n\n5 Empirical results\n\nIf the world dynamics exactly match the linear model we provide the Kalman filter, then this method will provably converge to the correct reward value estimates and find the optimal policy under conditions similar to those guaranteeing Q-learning\u2019s eventual convergence. 
However, we would rarely expect the world to fit this grossly simplified model. The interesting question concerns situations in which the actual dynamics are clearly different from our model, and whether our filtering agent will still learn a good policy. This section examines the efficacy of the filtering learning agent in several increasingly difficult domains: (1) a single agent domain in which the linear system describes the world perfectly, (2) a single agent domain where the noise is manually adjusted without following the model, (3) a multi-agent setting in which the noise term is meant to encapsulate the presence of other agents in the environment, and (4) a more complicated multi-agent setting that simulates a mobile ad-hoc networking domain in which mobile agent nodes try to maximize total network performance.\n\nFor ease of exposition, all the domains we use are variants of the popular grid-world domain shown in Figure 1 [Sutton and Barto, 1999]. The agent is able to move North, South, East, or West, and most transitions give the agent zero reward, except all actions from state 6 move the agent directly to state 10 with a reward of 20, and all actions from state 16 move the agent directly to state 18 with a reward of 10. Bumps into the wall cost the agent -1 in reward and move the agent nowhere. We use a discount factor of 0.9.\n\nTo demonstrate the basic feasibility of our filtering method, we first create a domain that follows the linear model of the world given in Section 3 perfectly. That is, in each time step, a single agent receives its true reward plus some noise term that evolves as a Markov random process. To achieve this, we simply add a noise term to the grid world domain given in Figure 1. As shown in Figure 2, an agent acting in this domain will receive a large range of reward values due to the evolving noise term. 
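A minimal sketch of such a noisy domain, assuming a Gaussian random walk for the noise term (the sigma value, horizon, and function name are illustrative, not the paper's exact settings):

```python
import random

def global_reward_trace(true_reward, steps=3000, sigma_w=1.0, seed=0):
    """Observed global reward g_t = r + b_t, where b_t is a random walk:
    b_{t+1} = b_t + z_t with z_t ~ N(0, sigma_w^2)."""
    rng = random.Random(seed)
    b, trace = 0.0, []
    for _ in range(steps):
        trace.append(true_reward + b)   # what the agent actually observes
        b += rng.gauss(0.0, sigma_w)    # the drift term evolves each step
    return trace

trace = global_reward_trace(true_reward=20.0)
# Even though the true reward never changes, the observed signal drifts
# far outside the true reward range, swamping the learning signal.
```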
In the example given, sometimes this value ranges as high as 250 even though the maximum reward in the grid world is 20; the noise term contributes 230 to the reward signal! A standard Q-learning agent does not stand a chance at learning anything useful using this reward signal. However, the filtering agent can recover the true reward signal from this noisy signal and use that to learn. Figure 2 shows that the filtering agent can learn the underlying reward signals, converging to these values relatively quickly. The graph to the right compares the performance of the filtering learner to the normal Q-learner, showing a clear performance advantage.\n\nFigure 3: (Left) Filtering agents are able to distinguish their personal rewards from the global reward noise, and thus are able to learn optimal policies and maximize their average reward over time in a ten-agent grid-world domain. (Right) In contrast, ordinary Q-learning agents do not process the global reward signal and can become confused as the environment changes around them. Graphs show average rewards (y-axis) within 1000-period windows for each of the 10 agents in a typical run of 10000 time periods (x-axis).\n\nThe observant reader may note that the learned rewards do not match the true rewards specified by the grid world. Specifically, they are offset by about -4. Instead of mostly 0 rewards at each state, the agent has concluded that most states produce a reward of -4. Correspondingly, state 6 now produces a reward of about 16 instead of 20. Since Q-learning will still learn the correct optimal policy subject to scaling or translation of the rewards, this is not a problem. 
This oddity is due to the fact that our model has a degree of freedom in the noise term b. Depending on the initial guesses of our algorithm, the estimates for the rewards may be biased. If most of the initial guesses for the rewards underestimated the true reward, then the learned value will be correspondingly lower than the actual true value. In fact, all the learned values will be correspondingly lower by the same amount.\n\nTo further test our filtering technique, we next evaluate its performance in a domain that does not conform to our noise model perfectly, but which is still a single agent system. Instead of an external reward term that evolves according to a Gaussian noise process, we adjust the noise manually, introducing positive and negative swings in the reward signal values at arbitrary times. The results are similar to those in the perfectly modeled domain, showing that the filtering method is fairly robust.\n\nThe most interesting case occurs when the domain noise is actually caused by other agents learning in the environment. This noise will not evolve according to a Gaussian process, but since the filtering method is fairly robust, we might still expect it to work. If there are enough other agents in the world, then the noise they collectively generate may actually tend towards Gaussian noise. Here we focus on smaller cases where there are 6 or 10 agents operating in the environment. We modify the grid world domain to include multiple simultaneously-acting agents, whose actions do not interfere with each other, but whose reward signal now consists of the sum of all the agents\u2019 personal rewards, as given in the basic single agent grid world of Figure 1.\n\nWe again compare the performance of the filtering learner to the ordinary Q-learning algorithm. 
Figure 4: (Left) A snapshot of the 4x4 ad-hoc networking domain. S denotes the sources, R is the receiver, and the dots are the learning agents, which act as relay nodes. Lines denote current connections. Note that nodes may overlap. (Right) Graph shows average rewards (y-axis) in 1000-period windows as filtering (bold line) and ordinary (thin line) agents try to learn good policies for acting as network nodes. The filtering agent is able to learn a better policy, resulting in higher network performance (global reward). Graph shows the average for each type of agent over 10 trial runs of 100000 time periods (x-axis) each.\n\nAs shown in Figure 3, most of the filtering learners quickly converge to the optimal policy. Three of the 10 agents converge to a suboptimal policy that produces slightly lower average rewards. However, this artifact is largely due to our choice of exploration rate, rather than a large error in the estimated reward values. The standard Q-learning algorithm also produces decent results at first. Approximately half of the agents find the optimal policy, while the other half are still exploring and learning. An interesting phenomenon occurs when these other agents finally find the optimal policy and begin receiving higher rewards. Suddenly the performance drops drastically for the agents who had found the optimal policy first. Though seemingly strange, this provides a perfect example of the behavior that motivates this paper. When the other agents learn an optimal policy, they begin affecting the global reward, contributing some positive amount rather than a consistent zero. 
This changes the world dynamics for the agents who had already learned the optimal policy and causes them to \u201cunlearn\u201d their good behavior.\n\nThe unstable dynamics of the Q-learners could be solved if the agents had full observability, and we could learn using the joint actions of all the agents, as in the work of Claus and Boutilier [1998]. However, since our premise is that agents have only a limited view of the world, the Q-learning agents will only exhibit convergence to the optimal policy if they converge to the optimal policy simultaneously. This may take a prohibitively long time, especially as the number of agents grows.\n\nFinally, we apply our filtering method to a more realistic domain. Mobilized ad-hoc networking provides an interesting real-world environment that illustrates the importance of reward filtering due to its high degree of partial observability and a reward signal that depends on the global state. In this domain, there are a number of mobile nodes whose task is to move in such a way as to optimize the connectivity (performance) of the network. Chang et al. [2003] cast this as a reinforcement learning problem. As the nodes move around, connections form between nodes that are within range of one another. These connections allow packets to be transmitted between various sources and receivers scattered among the nodes. The nodes are limited to having only local knowledge of their immediate neighboring grid locations (rather than the numbered state locations as in the original grid world), and thus do not know their absolute location on the grid. They are trained using a global reward signal that is a measure of total network performance, and their actions are limited functions that map their local state to N, S, E, W movements. We also limit their transmission range to a distance of one grid block. For simplicity, the single receiver is stationary and always occupies the grid location (1,1). 
Source nodes move around randomly, and in our example here, there are two sources and eight mobile agent nodes in a 4x4 grid. This setup is shown in Figure 4, and the graph shows a comparison of an ordinary Q-learner and the filtering learner, plotting the increase in global rewards over time as the agents learn to perform their task as intermediate network nodes. The graph plots average performance over 10 runs, showing the benefit of the filtering process.\n\n6 Limitations and extensions\n\nThe Kalman filtering framework seems to work well in these example domains. However, there are some cases where we may need to apply more sophisticated techniques. In all the above work, we have assumed that the reward signal is deterministic: each state-action pair only produces a single reward value. There are some domains in which we\u2019d like to model the reward as being stochastic, such as the multi-armed bandit problem. When the stochasticity of the rewards approximates Gaussian noise, we can use the Kalman framework directly. In the observation equation g_t = C x_t + v_t, v was set to exhibit zero mean and zero variance. However, allowing some variance would give the model an observation noise term that could reflect the stochasticity of the reward signal.\n\nFinally, in most cases the Kalman filtering method provides a very good estimate of r over time. However, since we cannot guarantee an exact estimate of the reward values when the model is not an exact representation of the world, the agent may sometimes make the wrong policy decision. Even so, the error in our derived value function is bounded by ε/(1 − γ) as long as |r(i) − r̂(i)| < ε for all i, where γ is again the discount rate. In the majority of cases, the estimates are good enough to lead the agent to learning a good policy.\n\nConclusion and future work. 
This paper provides the general framework for a new approach to solving large multi-agent problems using a simple model that allows for efficient and robust learning using well-studied tools such as Kalman filtering. As a practical application, we are working on applying these methods to a more realistic version of the mobile ad-hoc networking domain.\n\nReferences\n\n[Auer et al., 1995] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Gambling in a rigged casino: the adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995.\n\n[Chang et al., 2003] Y. Chang, T. Ho, and L. P. Kaelbling. Reinforcement learning in mobilized ad-hoc networks. MIT AI Lab Memo AIM-2003-025, 2003.\n\n[Choi et al., 1999] S. Choi, D. Yeung, and N. Zhang. Hidden-mode Markov decision processes. In IJCAI Workshop on Neural, Symbolic, and Reinforcement Methods for Sequence Learning, 1999.\n\n[Claus and Boutilier, 1998] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th AAAI, 1998.\n\n[Kalman, 1960] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the American Society of Mechanical Engineers, Journal of Basic Engineering, 1960.\n\n[McMahan et al., 2003] H. McMahan, G. Gordon, and A. Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th ICML, 2003.\n\n[Ng et al., 1999] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: theory and application to reward shaping. In Proc. 16th ICML, 1999.\n\n[Sutton and Barto, 1999] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1999.\n\n[Szita et al., 2002] Istvan Szita, Balint Takacs, and Andras Lorincz. ε-MDPs: Learning in varying environments. 
Journal of Machine Learning Research, 2002.\n\n[Wolpert and Tumer, 1999] D. Wolpert and K. Tumer. An introduction to collective intelligence. Tech Report NASA-ARC-IC-99-63, 1999.", "award": [], "sourceid": 2476, "authors": [{"given_name": "Yu-han", "family_name": "Chang", "institution": null}, {"given_name": "Tracey", "family_name": "Ho", "institution": null}, {"given_name": "Leslie", "family_name": "Kaelbling", "institution": null}]}