{"title": "On Local Rewards and Scaling Distributed Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 91, "page_last": 98, "abstract": null, "full_text": "On Local Rewards and Scaling Distributed Reinforcement Learning\nJ. Andrew Bagnell, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213\ndbagnell@ri.cmu.edu\n\nAndrew Y. Ng, Computer Science Department, Stanford University, Stanford, CA 94305\nang@cs.stanford.edu\n\nAbstract\nWe consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the number of agents n. We prove a worst-case lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: they require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n.\n\n1 Introduction\n\nRecently there has been great interest in distributed reinforcement learning problems where a collection of agents with independent action choices attempts to optimize a joint performance metric. Imagine, for instance, a traffic engineering application where each traffic signal may independently decide when to switch colors, and performance is measured by aggregating the throughput at all traffic stops. Problems with such factorizations, where the global reward decomposes into a sum of local rewards, are common and have been studied in the RL literature. 
[10] The most straightforward and common approach to solving these problems is to apply one of the many well-studied single-agent algorithms to the global reward signal. Effectively, this treats the multi-agent problem as a single-agent problem with a very large action space. Peshkin et al. [9] establish that policy gradient learning factorizes into independent policy gradient learning problems for each agent using the global reward signal. Chang et al. [3] use global reward signals to estimate effective local rewards for each agent. Guestrin et al. [5] consider coordinating agent actions using the global reward. We argue from an information-theoretic perspective that such algorithms are fundamentally limited in their scalability. In particular, we show in Section 3 that, as a function of the number of agents n, such algorithms will need to see¹ Ω̃(n) trajectories in the worst case to achieve good performance.\n\n¹Big-Ω̃ notation omits logarithmic terms, similar to how big-O notation drops constant factors.\n\nWe suggest an alternate line of inquiry, pursued as well by other researchers (including notably [10]), of developing algorithms that capitalize on the availability of local reward signals to improve performance. Our results show that such local information can dramatically reduce the number of examples necessary for learning to O(log n). One approach that the results suggest for solving such distributed problems is to estimate model parameters from all local information available, and then to solve the resulting model offline. Although this clearly still carries a high computational burden, it is much preferable to requiring a large amount of real-world experience. Further, useful approximate multiple-agent Markov Decision Process (MDP) solvers that take advantage of local reward structure have been developed. 
[4]\n\n2 Preliminaries\n\nWe consider distributed reinforcement learning problems, modeled as MDPs, in which there are n (cooperative) agents, each of which can directly influence only a small number of its neighbors. More formally, let there be n agents, each with a finite state space S of size |S| states and a finite action space A of size |A|. The joint state space of all the agents is therefore S^n, and the joint action space A^n. If s_t ∈ S^n is the joint state of the agents at time t, we will use s_t^(i) to denote the state of agent i. Similarly, let a_t^(i) denote the action of agent i. For each agent i ∈ {1, . . . , n}, we let neigh(i) ⊆ {1, . . . , n} denote the subset of agents that i's state directly influences. For notational convenience, we assume that if i ∈ neigh(j), then j ∈ neigh(i), and that i ∈ neigh(i). Thus, the agents can be viewed as living on the vertices of a graph, where agents have a direct influence on each other's state only if they are connected by an edge. This is similar to the graphical games formalism of [7], and is also similar to the Dynamic Bayes Net (DBN)-MDP formalisms of [6] and [2]. (Figure 1 depicts a DBN and an agent influence graph.) DBN formalisms allow the more refined notion of directionality in the influence between neighbors. More formally, each agent i is associated with a CPT (conditional probability table) P_i(s_{t+1}^(i) | s_t^(neigh(i)), a_t^(i)), where s_t^(neigh(i)) denotes the state of agent i's neighbors at time t. Given the joint action a_t of the agents, the joint state evolves according to\n\np(s_{t+1} | s_t, a_t) = ∏_{i=1}^n p(s_{t+1}^(i) | s_t^(neigh(i)), a_t^(i)). (1)\n\nFor simplicity, we have assumed that agent i's state is directly influenced by the states of neigh(i) but not their actions; the generalization offers no difficulties. The initial state s_1 is distributed according to some initial-state distribution D. A policy is a map π : S^n → A^n. Writing π out explicitly as a vector-valued function, we have π(s) = (π_1(s), . . .
, π_n(s)), where π_i : S^n → A is the local policy of agent i. For some applications, we may wish to consider only policies in which agent i chooses its local action as a function of only its local state s^(i) (and possibly its neighbors); in this case, π_i can be restricted to depend only on s^(i).\n\nIn the reinforcement learning setting, the dynamics (CPTs) and rewards of the problem are unknown, and a learning algorithm has to take actions in the MDP and use the resulting observations of state transitions and rewards to learn a good policy. Each \"trial\" taken by a reinforcement learning algorithm shall consist of a T-step sequence in the MDP.\n\nEach agent has a local reward function R_i(s^(i), a^(i)), which takes values in the unit interval [0, 1]. The total payoff in the MDP at each step is R(s, a) = (1/n) Σ_{i=1}^n R_i(s^(i), a^(i)). We call this R(s, a) the global reward function, since it reflects the total reward received by the joint set of agents. We will consider the finite-horizon setting, in which the MDP terminates after T steps. Thus, the utility of a policy π in an MDP M is\n\nU(π) = U_M(π) = E_{s_1 ∼ D}[V^π(s_1)] = E[(1/n) Σ_{t=1}^T Σ_{i=1}^n R_i(s_t^(i), a_t^(i))].\n\nFigure 1: (Left) A DBN description of a multi-agent MDP. Each row of (round) nodes in the DBN corresponds to one agent. (Right) A graphical depiction of the influence effects in a multi-agent MDP. A connection between nodes in the graph implies arrows connecting the nodes in the DBN.\n\nOur goal is to characterize the scaling of the sample complexity for various reinforcement learning approaches (i.e., how many trials they require in order to learn a near-optimal policy) for large numbers of agents n. 
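To make the factored model concrete, here is a minimal sketch (not from the paper; all names and the toy two-agent instance are illustrative) of sampling a joint transition from per-agent CPTs as in equation (1), and of computing the global reward as the average of the local rewards:

```python
import random

def step(state, action, neigh, cpt, rng):
    """Sample the joint next state: each agent i transitions independently
    given the states of its neighbors and its own action, as in eq. (1)."""
    next_state = []
    for i in range(len(state)):
        local = tuple(state[j] for j in neigh[i])
        dist = cpt[i][(local, action[i])]  # dict: next local state -> probability
        r, cum = rng.random(), 0.0
        for s_next, p in dist.items():
            cum += p
            if r < cum:
                break
        next_state.append(s_next)
    return tuple(next_state)

def global_reward(state, action, local_rewards):
    """R(s, a) = (1/n) * sum_i R_i(s^(i), a^(i))."""
    n = len(state)
    return sum(local_rewards[i](state[i], action[i]) for i in range(n)) / n

# Toy instance: two agents, each neighboring both agents; deterministic CPTs
# that copy an agent's own action into its next state; reward 1 for action 1.
neigh = [(0, 1), (0, 1)]
cpt = [{((s0, s1), a): {a: 1.0} for s0 in (0, 1) for s1 in (0, 1) for a in (0, 1)}
       for _ in range(2)]
local_rewards = [lambda s, a: float(a == 1)] * 2

rng = random.Random(0)
print(step((0, 0), (1, 0), neigh, cpt, rng))         # (1, 0)
print(global_reward((0, 0), (1, 0), local_rewards))  # 0.5
```

The model-based procedure of Section 4 amounts to estimating the `cpt` and `local_rewards` tables above from observed trajectories and then planning in the estimated model offline.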
Thus, in our bounds below, no serious attempt has been made to make our bounds tight in variables other than n.\n\n3 Global rewards hardness result\n\nBelow we show that if an RL algorithm uses only the global reward signal, then there exists a very simple MDP--one with horizon T = 1, only one state/trivial dynamics, and two actions per agent--on which the learning algorithm will require Ω̃(n) trials to learn a good policy. Thus, such algorithms do not scale well to large numbers of agents. For example, consider learning in the traffic signal problem described in the introduction with n = 100,000 traffic lights. Such an algorithm may then require on the order of 100,000 days of experience (trials) to learn. In contrast, in Section 4, we show that if a reinforcement learning algorithm is given access to the local rewards, it can be possible to learn in such problems with an exponentially smaller O(log n) sample complexity.\n\nTheorem 3.1: Let any 0 < ε < 0.05 be fixed. Let any reinforcement learning algorithm L be given that only uses the global reward signal R(s, a), and does not use the local rewards R_i(s^(i), a^(i)) to learn (other than through their sum). Then there exists an MDP with time horizon T = 1, so that: 1. The MDP is very \"simple\" in that it has only one state (|S| = 1, |S^n| = 1); trivial state transition probabilities (since T = 1); two actions per agent (|A| = 2); and deterministic binary (0/1)-valued local reward functions. 2. In order for L to output a policy π̂ that is near-optimal, satisfying² U(π̂) ≥ max_π U(π) − ε, it is necessary that the number of trials m be at least\n\nm ≥ (0.32n + log(1/4)) / log(n + 1) = Ω̃(n).\n\nProof. For simplicity, we first assume that L is a deterministic learning algorithm, so that in each of the m trials, its choice of action is some deterministic function of the outcomes of the earlier trials. 
Thus, in each of the m trials, L chooses a vector of actions a ∈ A^n, and receives the global reward signal R(s, a) = (1/n) Σ_{i=1}^n R_i(s^(i), a^(i)). In our MDP, each local reward R_i(s^(i), a^(i)) will take values only 0 and 1. Thus, R(s, a) can take only n + 1 different values (namely, 0/n, 1/n, . . . , n/n). Since T = 1, the algorithm receives only one such reward value in each trial. Let r_1, . . . , r_m be the m global reward signals received by L in the m trials. Since L is deterministic, its output policy π̂ will be chosen as some deterministic function of these rewards r_1, . . . , r_m.\n\n²For randomized algorithms we consider instead the expectation of U(π̂) under the algorithm's randomization.\n\nBut the vector (r_1, . . . , r_m) can take on only (n + 1)^m different values (since each r_t can take only n + 1 different values), and thus π̂ itself can also take only at most (n + 1)^m different values. Let Π_m denote this set of possible values for π̂ (|Π_m| ≤ (n + 1)^m). Call each local agent's two actions a_1, a_2. We will generate an MDP with randomly chosen parameters. Specifically, each local reward function R_i(s^(i), a^(i)) is randomly chosen, with equal probability, either to give reward 1 for action a_1 and reward 0 for action a_2, or vice versa. Thus, each local agent has one \"right\" action that gives reward 1, but the algorithm has to learn which of the two actions this is. Further, by choosing the right actions, the optimal policy π* attains U(π*) = 1. Fix any policy π. Then U_M(π) = (1/n) Σ_{i=1}^n R_i(s^(i), π_i(s)) is the mean of n independent Bernoulli(0.5) random variables (since the rewards are chosen randomly), and has expected value 0.5. Thus, by the Hoeffding inequality, P(U_M(π) ≥ 1 − 2ε) ≤ exp(−2(0.5 − 2ε)² n). Thus, taking a union bound over all policies π ∈ Π_m, we have\n\nP(∃ π ∈ Π_m s.t. U_M(π) ≥ 1 − 2ε) ≤ |Π_m| exp(−2(0.5 − 2ε)² n) (2)\n≤ (n + 1)^m exp(−2(0.5 − 2ε)² n) (3)\n\nHere, the probability is over the random MDP M. 
But since L outputs a policy in Π_m, the chance of L outputting a policy π̂ with U_M(π̂) ≥ 1 − 2ε is bounded by the chance that there exists such a policy in Π_m. Thus,\n\nP(U_M(π̂) ≥ 1 − 2ε) ≤ (n + 1)^m exp(−2(0.5 − 2ε)² n). (4)\n\nBy setting the right hand side to 1/4 and solving for m, we see that so long as\n\nm < (0.32n + log(1/4)) / log(n + 1) ≤ (2(0.5 − 2ε)² n + log(1/4)) / log(n + 1), (5)\n\nwe have that P(U_M(π̂) ≥ 1 − 2ε) < 1/4. (The second inequality above follows from taking ε < 0.05, so that 0.5 − 2ε > 0.4 and hence 2(0.5 − 2ε)² > 0.32.) Thus, under this condition, by the standard probabilistic method argument [1], there must be at least one MDP under which L fails to find an ε-optimal policy. For randomized algorithms L, we can define, for each string ω of input random numbers to the algorithm, a deterministic algorithm L_ω. Given m samples as above, the expected performance of algorithm L_ω over the distribution of MDPs satisfies\n\nE_{p(M)}[U_M(L_ω)] ≤ Pr(U_M(L_ω) ≥ 1 − 2ε) · 1 + (1 − Pr(U_M(L_ω) ≥ 1 − 2ε))(1 − 2ε) ≤ 1/4 + (3/4)(1 − 2ε) < 1 − ε.\n\nSince E_{p(M)} E_{p(ω)}[U_M(L_ω)] = E_{p(ω)} E_{p(M)}[U_M(L_ω)] < E_{p(ω)}[1 − ε], it follows again from the probabilistic method that there must be at least one MDP for which L has expected performance less than 1 − ε.\n\n4 Learning with local rewards\n\nAssuming the existence of a good exploration policy, we now show a positive result: if our learning algorithm has access to the local rewards, then it is possible to learn a near-optimal policy after a number of trials that grows only logarithmically in the number of agents n. In this section, we will assume that the neighborhood structure (encoded by neigh(i)) is known, but that the CPT parameters of the dynamics and the reward functions are unknown. We also assume that the size of the largest neighborhood is bounded: max_i |neigh(i)| ≤ B.\n\nDefinition. 
A policy π_explore is a (ρ, ε)-exploration policy if, given any i, any configuration of states s^(neigh(i)) ∈ S^|neigh(i)|, and any action a^(i) ∈ A, on a trial of length T the policy π_explore has at least a probability ρ^B ε of executing action a^(i) while i and its neighbors are in state s^(neigh(i)).\n\nProposition 4.1: Suppose the MDP's initial state distribution is random, so that the state s_1^(i) of each agent i is chosen independently from some distribution D_i. Further, assume that D_i assigns probability at least ρ > 0 to each possible state value s ∈ S. Then the \"random\" policy (that on each time-step chooses each agent's action uniformly at random over A) is a (ρ, 1/|A|)-exploration policy.\n\nProof. For any agent i, the initial state s^(neigh(i)) has at least a ρ^B chance of being any particular vector of values, and the random action policy has a 1/|A| chance of taking any particular action from this state.\n\nIn general, it is a fairly strong assumption to assume that we have an exploration policy. However, this assumption serves to decouple the problem of exploration from the \"sample complexity\" question of how much data we need from the MDP. Specifically, it guarantees that we visit each local configuration sufficiently often to have a reasonable amount of data to estimate each CPT.³\n\nIn the envisioned procedure, we will execute an exploration policy for m trials, and then use the resulting data we collect to obtain the maximum-likelihood estimates for the CPT entries and the rewards. We call the resulting estimates p̂(s_{t+1}^(i) | s_t^(neigh(i)), a_t^(i)) and R̂(s^(i), a^(i)).⁴ The following simple lemma shows that, with a number of trials that grows only logarithmically in n, this procedure will give us good estimates for all CPTs and local rewards.\n\nLemma 4.2: Let any ε_0 > 0, δ > 0 be fixed. Suppose |neigh(i)| ≤ B for all i, and let a (ρ, ε)-exploration policy be executed for m trials. 
Then in order to guarantee that, with probability at least 1 − δ, the CPT and reward estimates are ε_0-accurate:\n\n|p̂(s_{t+1}^(i) | s_t^(neigh(i)), a_t^(i)) − p(s_{t+1}^(i) | s_t^(neigh(i)), a_t^(i))| ≤ ε_0 for all i, s_{t+1}^(i), s_t^(neigh(i)), a_t^(i),\n|R̂(s^(i), a^(i)) − R(s^(i), a^(i))| ≤ ε_0 for all i, s^(i), a^(i), (6)\n\nit suffices that the number of trials be m = O((log n) poly(1/ε_0, 1/δ, |S|, |A|, 1/(ρ^B ε), B, T)).\n\nProof (Sketch). Given c examples to estimate a particular CPT entry (or a reward table entry), the probability that this estimate differs from the true value by more than ε_0 can be controlled by the Hoeffding bound:\n\nP(|p̂(s_{t+1}^(i) | s_t^(neigh(i)), a_t^(i)) − p(s_{t+1}^(i) | s_t^(neigh(i)), a_t^(i))| ≥ ε_0) ≤ 2 exp(−2ε_0² c).\n\nEach CPT has at most |A||S|^{B+1} entries and there are n such tables. There are also n|S||A| possible local reward values. Taking a union bound over them, setting our probability of incorrectly estimating any CPTs or rewards to δ/2, and solving for c gives c ≥ (1/(2ε_0²)) log(4n|A||S|^{B+1}/δ). For each agent i we see each local configuration of states and actions (s^(neigh(i)), a^(i)) with probability ρ^B ε. For m trajectories the expected number\n\n³Further, it is possible to show a stronger version of our result than that stated below, showing that a random action policy can always be used as our exploration policy, to obtain a sample complexity bound with the same logarithmic dependence on n (but significantly worse dependencies on T and B). This result uses ideas from the random trajectory method of [8], with the key observation that local configurations that are not visited reasonably frequently by the random exploration policy will not be visited frequently by any policy, and thus inaccuracies in our estimates of their CPT entries will not significantly affect the result. 
⁴We let p̂(s_{t+1}^(i) | s_t^(neigh(i)), a_t^(i)) be the uniform distribution if (s_t^(neigh(i)), a_t^(i)) was never observed in the training data, and similarly let R̂(s^(i), a^(i)) = 0 if (s^(i), a^(i)) was never observed.\n\nof samples we see for each CPT entry is at least mρ^B ε. Call S_m^{(s^(neigh(i)), a^(i))} the number of samples we have seen of a configuration (s^(neigh(i)), a^(i)) in m trajectories. Note then that:\n\nP(S_m^{(s^(neigh(i)), a^(i))} ≤ c) ≤ P(S_m^{(s^(neigh(i)), a^(i))} − E[S_m^{(s^(neigh(i)), a^(i))}] ≤ c − mρ^B ε),\n\nand another application of Hoeffding's bound ensures that:\n\nP(S_m^{(s^(neigh(i)), a^(i))} − E[S_m^{(s^(neigh(i)), a^(i))}] ≤ c − mρ^B ε) ≤ exp(−2(c − mρ^B ε)² / (mT²)).\n\nApplying again the union bound to ensure that the probability of failure here is δ/2, and solving for m, gives the result.\n\nDefinition. Define the radius of influence r(t) after t steps to be the maximum number of nodes that are within t steps in the neighborhood graph of any single node. Viewed differently, r(t) upper bounds the number of nodes in the t-th timeslice of the DBN (as in Figure 1) which are descendants of any single node in the 1st timeslice. In a DBN as shown in Figure 1, we have r(t) = O(t). If the neighborhood graph is a 2-d lattice in which each node has at most 4 neighbors, then r(t) = O(t²). More generally, we might expect to have r(t) = O(t²) for \"most\" planar neighborhood graphs. Note that, even in the worst case, by our assumption of each node having at most B neighbors, we still have the bound r(t) ≤ B^t, which is a bound independent of the number of agents n.\n\nTheorem 4.3: Let any ε > 0, δ > 0 be fixed. Suppose |neigh(i)| ≤ B for all i, and let a (ρ, ε)-exploration policy be executed for m trials in the MDP M. Let M̂ be the maximum likelihood MDP, estimated from data from these m trials. Let Π be a policy class, and let π̂ = arg max_{π ∈ Π} U_M̂(π) be the best policy in the class, as evaluated on M̂. 
Then to ensure that, with probability 1 − δ, we have that π̂ is near-optimal within Π, i.e., that\n\nU_M(π̂) ≥ max_{π ∈ Π} U_M(π) − ε,\n\nit suffices that the number of trials be m = O((log n) poly(1/ε, 1/δ, |S|, |A|, 1/(ρ^B ε), B, T, r(T))).\n\nProof. Our approach is essentially constructive: we show that for any policy, finite-horizon value-iteration using approximate CPTs and rewards in its backups will correctly estimate the true value function for that policy to within ε/2. For simplicity, we assume that the initial state distribution is known (and thus the same in M and M̂); the generalization offers no difficulties. By Lemma 4.2, with m samples we can know both CPTs and rewards, with the probability required, to within any required ε_0. Note also that for any MDP with the given DBN or neighborhood graph structure (including both M and M̂), the value function for every policy and at each time-step has a property of bounded variation:\n\n|V̂_t(s^(1), . . . , s^(n)) − V̂_t(s^(1), . . . , s^(i−1), s_changed^(i), s^(i+1), . . . , s^(n))| ≤ r(T)T/n.\n\nThis follows since a change in one agent's state can affect at most r(T) agents' states, so the resulting change in utility must be bounded by r(T)T/n. To compute a bound on the error in our estimate of overall utility, we compute a bound on the error induced by a one-step Bellman backup, ||B^π V̂ − B̂^π V̂||_∞. This quantity can be bounded in turn by considering the sequence of partially correct backup operators B̂_0, . . . , B̂_n, where B̂_i is defined as the Bellman operator for policy π using the exact transitions and rewards for agents 1, 2, . . .
, i, and the estimated rewards and transitions for agents i + 1, . . . , n. From this definition it is immediate that the total error is equivalent to the telescoping sum:\n\n||B^π V̂ − B̂^π V̂||_∞ = ||B̂_0 V̂ − B̂_1 V̂ + B̂_1 V̂ − . . . + B̂_{n−1} V̂ − B̂_n V̂||_∞ (7)\n\nThat sum is upper-bounded by the sum of term-by-term errors Σ_{i=0}^{n−1} ||B̂_i V̂ − B̂_{i+1} V̂||_∞. We can show that each of the terms in the sum is less than ε_0 r(T)(T + 1)|S|/n: the Bellman operators B̂_i and B̂_{i+1} differ in the immediate reward contribution of agent i + 1 by at most ε_0/n, and differ in computing the expected future value only through the CPT of agent i + 1, by\n\nΣ_{s_{t+1}} (∏_{j=1}^{i+1} p(s_{t+1}^(j) | s_t, π) ∏_{j=i+2}^n p̂(s_{t+1}^(j) | s_t, π) − ∏_{j=1}^{i} p(s_{t+1}^(j) | s_t, π) ∏_{j=i+1}^n p̂(s_{t+1}^(j) | s_t, π)) V̂_{t+1}(s_{t+1}),\n\nwith p̂(s_{t+1}^(i+1) | s_t, π) within ε_0 of p(s_{t+1}^(i+1) | s_t, π), the difference in the CPTs between B̂_i and B̂_{i+1}. By the bounded variation argument this contribution is then less than ε_0 r(T)T|S|/n. It follows that ||B^π V̂ − B̂^π V̂||_∞ ≤ ε_0 r(T)(T + 1)|S|. We now appeal to finite-horizon bounds on the error induced by Bellman backups [11] to show that ||V^π − V̂^π||_∞ ≤ T ||B^π V̂ − B̂^π V̂||_∞ ≤ T(T + 1) ε_0 r(T)|S|. Taking the expectation of V̂^π with respect to the initial state distribution D, and setting m according to Lemma 4.2 with ε_0 = ε / (2|S| r(T) T(T + 1)), completes the proof.\n\nFigure 2: (Left) Scaling of performance as a function of the number of trajectories seen for global reward and local reward algorithms (200 agents, 20% noise in observed rewards). (Right) Scaling of the number of samples necessary to achieve near optimal reward as a function of the number of agents.\n\nDemonstration\n\nWe first present an experimental domain that hews closely to the theory in Section 3 above, to demonstrate the importance of local rewards. In our simple problem there are n = 400 independent agents who each choose an action in {0, 1}. Each agent has a \"correct\" action that earns it reward R_i = 1 with probability 0.8, and reward 0 with probability 0.2. Equally, if the agent chooses the wrong action, it earns reward R_i = 1 with probability 0.2. We compare two methods on this problem. Our first, global algorithm uses only the global rewards R, uses these to build a model of the local rewards, and finally solves the resulting estimated MDP exactly. The local reward functions are learnt by a least-squares procedure with basis functions for each agent. The second algorithm also learns a local reward function, but does so taking advantage of the local rewards it observes, as opposed to only the global signal. Figure (2) demonstrates the advantages of learning using local reward signals.⁵ On the right in Figure (2), we compute the number of samples required to come within 1/4 of the optimal reward for each algorithm, as a function of the number of agents.\n\nIn our next example, we consider a simple variant of the multi-agent SysAdmin⁶ problem [4]. Again, we consider two algorithms: a global REINFORCE [9] learner, and a REINFORCE algorithm run using only local rewards, even though the local REINFORCE algorithm run in this way is not guaranteed to converge to the globally optimal (cooperative) solution. We note that the local algorithm learns much more quickly than using the global reward (Figure 3). The learning speed we observed for the global algorithm correlates well with the observations in [5] that the number of samples needed scales roughly linearly in the number of agents. The local algorithm continued to require essentially the same number of examples for all sizes used (up to over 100 agents) in our experiments.\n\n⁵A gradient-based model-free approach using the global reward signal was also tried, but its performance was significantly poorer than that of the two algorithms depicted in Figure (2, left).\n\n⁶In SysAdmin there is a network of computers that fail randomly. A computer is more likely to fail if a neighboring computer (arranged in a ring topology) fails. The goal is to reboot machines in such a fashion as to maximize the number of running computers.\n\nFigure 3: REINFORCE applied to the multi-agent SysAdmin problem. Local refers to REINFORCE applied using only neighborhood (local) rewards, while global refers to standard REINFORCE (applied to the global reward signal). (Left) shows averaged reward performance as a function of the number of iterations for 10 agents. (Right) depicts the performance for 20 agents.\n\nReferences\n[1] N. Alon and J. Spencer. The Probabilistic Method. Wiley, 2000.\n[2] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 1999.\n[3] Y. Chang, T. Ho, and L. Kaelbling. All learning is local: Multi-agent learning in global reward games. In Advances in NIPS 14, 2004.\n[4] C. Guestrin, D. Koller, and R. Parr. Multi-agent planning with factored MDPs. 
In NIPS-14, 2002.\n[5] C. Guestrin, M. Lagoudakis, and R. Parr. Coordinated reinforcement learning. In ICML, 2002.\n[6] M. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. In IJCAI 16, 1999.\n[7] M. Kearns, M. Littman, and S. Singh. Graphical models for game theory. In UAI, 2001.\n[8] M. Kearns, Y. Mansour, and A. Ng. Approximate planning in large POMDPs via reusable trajectories. (Extended version of paper in NIPS 12), 1999.\n[9] L. Peshkin, K.-E. Kim, N. Meuleau, and L. Kaelbling. Learning to cooperate via policy search. In UAI 16, 2000.\n[10] J. Schneider, W. Wong, A. Moore, and M. Riedmiller. Distributed value functions. In ICML, 1999.\n[11] R. Williams and L. Baird. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, 1993.\n", "award": [], "sourceid": 2951, "authors": [{"given_name": "Drew", "family_name": "Bagnell", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}