{"title": "Memory-Based Reinforcement Learning: Efficient Computation with Prioritized Sweeping", "book": "Advances in Neural Information Processing Systems", "page_first": 263, "page_last": 270, "abstract": null, "full_text": "Memory-based Reinforcement Learning: Efficient \n\nComputation with Prioritized Sweeping \n\nAndrew W. Moore \n\nawm@ai.mit.edu \n\nNE43-759 MIT AI Lab. \n545 Technology Square \nCambridge MA 02139 \n\nChristopher G. At:iteson \n\ncga@ai.mit.edu \n\nNE43-771 MIT AI Lab. \n545 Technology Square \nCambridge MA 02139 \n\nAbstract \n\nWe present a new algorithm, Prioritized Sweeping, for efficient prediction \nand control of stochastic Markov systems. Incremental learning methods \nsuch as Temporal Differencing and Q-Iearning have fast real time perfor(cid:173)\nmance. Classical methods are slower, but more accurate, because they \nmake full use of the observations. Prioritized Sweeping aims for the best \nof both worlds. It uses all previous experiences both to prioritize impor(cid:173)\ntant dynamic programming sweeps and to guide the exploration of state(cid:173)\nspace. We compare Prioritized Sweeping with other reinforcement learning \nschemes for a number of different stochastic optimal control problems. It \nsuccessfully solves large state-space real time problems with which other \nmethods have difficulty. \n\n1 STOCHASTIC PREDICTION \n\nThe paper introduces a memory-based technique, prioritized 6weeping, which is used \nboth for stochastic prediction and reinforcement learning. A fuller version of this \npaper is in preparation [Moore and Atkeson, 1992]. Consider the 500 state Markov \nsystem depicted in Figure 1. The system has sixteen absorbing states, depicted by \nwhite and black circles. The prediction problem is to estimate, for every state, the \nlong-term probability that it will terminate in a white, rather than black, circle. \nThe data available to the learner is a sequence of observed state transitions. 
Let us consider two existing methods along with prioritized sweeping. \n\n263 \n\n\f264 \n\nMoore and Atkeson \n\nFigure 1: A 500-state Markov system. Each state has a random number (mean 5) of random successors chosen within the local neighborhood. \n\nTemporal Differencing (TD) is an elegant incremental algorithm [Sutton, 1988] which has recently had success with a very large problem [Tesauro, 1991]. \n\nThe classical method proceeds by building a maximum likelihood model of the state transitions. The transition probability q_ij from state i to state j is estimated by \n\nq_ij = (number of observed transitions i \u2192 j) / (number of occasions in state i)    (1) \n\nAfter t + 1 observations the new absorption probability estimates are computed to satisfy, for each terminal state k, the linear system \n\np_ik[t+1] = q_ik + \u03a3_{j \u2208 succs(i) \u2229 NONTERMS} q_ij p_jk[t+1]    (2) \n\nwhere the p_ik[t]'s are the absorption probabilities we are trying to learn, succs(i) is the set of all states which have been observed as immediate successors of i, and NONTERMS is the set of non-terminal states. \n\nThis set of equations is solved after each transition is observed. It is solved using Gauss-Seidel, an iterative method. What initial estimates should be used to start the iteration? An excellent answer is to use the previous absorption probability estimates p_ik[t]. \n\nPrioritized sweeping is designed to combine the advantages of the classical method with the advantages of TD. It is described in the next section, but let us first examine performance on the original 500-state example of Figure 1. Figure 2 shows the result. TD certainly learns: by 100,000 observations it is estimating the terminal-white probability to an RMS accuracy of 0.1. However, the performance of the classical method appears considerably better than TD: the same error of 0.1 is obtained after only 3000 observations. \n\nFigure 3 indicates why temporal differencing may nevertheless often be more useful. 
\nTD requires far less computation per observation, and so can obtain more data in real time. Thus, after 300 seconds, TD has had 250,000 observations and is down to an error of 0.05, whereas even after 300 seconds the classical method has only 1000 observations and a much cruder estimate. \n\nIn the same figures we see the motivation behind prioritized sweeping. Its performance relative to observations is almost as good as the classical method, while its performance relative to real time is even better than TD. \n\nThe graphs in Figures 2 and 3 were based on only one learning experiment each. Ten further experiments, each with a different random 500-state problem, were run. The results are given in Table 1. \n\n                            TD              Classical         Pri. Sweep \nAfter 100,000 observations  0.40 \u00b1 0.077    0.024 \u00b1 0.0063    0.024 \u00b1 0.0061 \nAfter 300 seconds           0.079 \u00b1 0.067   0.23 \u00b1 0.038      0.021 \u00b1 0.0080 \n\nTable 1: RMS prediction error: mean and standard deviation for ten experiments. \n\nFigure 2: RMS prediction error against number of observations (log scale) for the three learning algorithms. \n\nFigure 3: RMS prediction error against real time in seconds (log scale). 
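As a concrete illustration of the classical predictor, the following sketch maintains the counts of equation (1) and re-solves the system (2) by Gauss-Seidel after each observation, warm-started from the previous estimates as described above. It is a minimal sketch; the class and variable names are ours, not the paper's.

```python
from collections import defaultdict

# Classical maximum-likelihood predictor.  Transition counts give the
# estimates of eq. (1); after every observed transition the absorption
# probabilities of eq. (2) are re-solved by Gauss-Seidel sweeps, warm-started
# from the previous estimates (which is why few sweeps are usually needed).

class ClassicalPredictor:
    def __init__(self, terminals):
        self.terminals = set(terminals)                      # absorbing states k
        self.counts = defaultdict(lambda: defaultdict(int))  # i -> j -> count
        self.visits = defaultdict(int)                       # occasions in state i
        self.p = defaultdict(lambda: defaultdict(float))     # p[i][k] estimates

    def observe(self, i, j, max_sweeps=100, tol=1e-12):
        self.counts[i][j] += 1
        self.visits[i] += 1
        for _ in range(max_sweeps):       # Gauss-Seidel over all seen states
            worst = 0.0
            for s in self.visits:
                for k in self.terminals:
                    new = 0.0
                    for succ, c in self.counts[s].items():
                        q = c / self.visits[s]               # eq. (1)
                        if succ == k:
                            new += q                         # direct absorption
                        elif succ not in self.terminals:
                            new += q * self.p[succ][k]       # eq. (2)
                    worst = max(worst, abs(new - self.p[s][k]))
                    self.p[s][k] = new                       # update in place
            if worst < tol:               # warm start => usually settles fast
                break
```

On a chain where state 3 always leads to state 2, the estimates for state 3 simply inherit state 2's absorption probabilities, as equation (2) requires. The cost, as the timing comparison above shows, is that every observation triggers sweeps over the whole seen state set.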
\n\n2 PRIORITIZED SWEEPING \n\nA longer paper [Moore and Atkeson, 1992] will describe the algorithm in detail. Here we summarize the essential insights, and then simply present the algorithm in Figure 4. The closest relation to prioritized sweeping is the search scheduling technique of the A* algorithm [Nilsson, 1971]. Closely related research is being performed by [Peng and Williams, 1992] into a similar algorithm to prioritized sweeping, which they call Dyna-Q-queue. \n\n1. Promote state i_recent (the source of the most recent transition) to the top of the priority queue. \n2. While we are allowed further processing and the priority queue is not empty: \n   2.1 Remove the top state from the priority queue. Call it i. \n   2.2 \u0394_max := 0 \n   2.3 For each k \u2208 TERMS: \n         p_new := q_ik + \u03a3_{j \u2208 succs(i) \u2229 NONTERMS} q_ij p_jk \n         \u0394 := | p_new - p_ik | \n         p_ik := p_new \n         \u0394_max := max(\u0394_max, \u0394) \n   2.4 For each i' \u2208 preds(i): \n         P := q_i'i \u0394_max \n         If i' is not on the queue, or P exceeds the current priority of i', then promote i' to new priority P. \n\nFigure 4: The prioritized sweeping algorithm. This sequence of operations is executed each time a transition is observed. \n\n\u2022 The memory requirements of learning an N_s \u00d7 N_s matrix, where N_s is the number of states, may initially appear prohibitive, especially since we intend to operate with more than 10,000 states. However, we need only allocate memory for the experiences the system actually has, and for a wide class of physical systems there is not enough time in the lifetime of the physical system to run out of memory. \n\n\u2022 We keep a record of all predecessors of each state. When the eventual absorption probabilities of a state are updated, its predecessors are alerted that they may need to change. A priority value is assigned to each predecessor according to how large this change could possibly be, and it is placed in a priority queue. 
\n\n\u2022 After each real-world observation i \u2192 j, the transition probability estimate q_ij is updated along with the probabilities of transition to all other previously observed successors of i. Then state i is promoted to the top of the priority queue so that its absorption probabilities are updated immediately. Next, we continue to process further states from the top of the queue. Each state that is processed may result in the addition or promotion of its predecessors within the queue. This loop continues for a preset number of processing steps or until the queue empties. \n\nIf a real-world observation is interesting, all its predecessors and their earlier ancestors quickly find themselves near the top of the priority queue. On the other hand, if the real-world observation is unsurprising, then the processing immediately proceeds to other, more important areas of state-space which had been under consideration on the previous time step. These other areas may be different from those in which the system currently finds itself. \n\n             15 States   117 States   605 States   4528 States \nDyna-PI+       400        > 500       > 36000      > 500000 \nDyna-OPT       300          900         21000        245000 \nPriSweep       150         1200          6000         59000 \n\nTable 2: Number of observations before 98% of decisions were subsequently optimal. Dyna and Prioritized Sweeping were each allowed to process ten states per real-world observation. \n\n3 LEARNING CONTROL FROM REINFORCEMENT \n\nPrioritized sweeping is also directly applicable to stochastic control problems. Remembering all previous transitions allows an additional advantage for control: exploration can be guided towards areas of state space in which we predict we are ignorant. 
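Before turning to control, the queue-driven prediction update of Figure 4 can be sketched with a binary heap. This is a minimal, illustrative implementation: the class and variable names, the heapq-based queue, and the fixed per-observation processing budget are our choices, not prescribed by the paper.

```python
import heapq
from collections import defaultdict

# Sketch of the Figure 4 loop for the prediction problem.  heapq is a
# min-heap, so priorities are stored negated; stale heap entries (left over
# after a state's priority is raised) are detected and skipped on pop.

class PrioritizedSweeper:
    def __init__(self, terminals, max_updates=10):
        self.terminals = set(terminals)
        self.counts = defaultdict(lambda: defaultdict(int))
        self.visits = defaultdict(int)
        self.preds = defaultdict(set)      # observed predecessors of each state
        self.p = defaultdict(lambda: defaultdict(float))
        self.max_updates = max_updates     # "allowed further processing" budget

    def q(self, i, j):
        return self.counts[i][j] / self.visits[i]

    def observe(self, i, j):
        self.counts[i][j] += 1
        self.visits[i] += 1
        self.preds[j].add(i)
        # Step 1: promote the source of the transition to the top of the queue.
        heap, priority = [(-float("inf"), i)], {i: float("inf")}
        for _ in range(self.max_updates):            # step 2
            if not heap:
                break
            _, s = heapq.heappop(heap)               # step 2.1
            if priority.pop(s, None) is None:
                continue                             # stale entry: skip
            d_max = 0.0                              # step 2.2
            for k in self.terminals:                 # step 2.3: back up p[s][k]
                new = 0.0
                for succ, c in self.counts[s].items():
                    if succ == k:
                        new += c / self.visits[s]
                    elif succ not in self.terminals:
                        new += (c / self.visits[s]) * self.p[succ][k]
                d_max = max(d_max, abs(new - self.p[s][k]))
                self.p[s][k] = new
            for pred in self.preds[s]:               # step 2.4: alert predecessors
                pri = self.q(pred, s) * d_max
                if pri > priority.get(pred, 0.0):
                    priority[pred] = pri
                    heapq.heappush(heap, (-pri, pred))
```

A surprising observation gives its source a large change d_max, which propagates large priorities to predecessors exactly as described above; an unsurprising one produces d_max near zero and the budget is spent elsewhere. (In this simplified sketch any unprocessed queue entries are discarded between observations rather than carried over.)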
This is achieved using the exploration philosophy of [Kaelbling, 1990] and [Sutton, 1990]: optimism in the face of uncertainty. \n\n4 RESULTS \n\nResults of some maze problems of significant size are shown in Table 2. Each state has four actions: one for each direction. Blocked actions do not move. One goal state (the star in subsequent figures) gives 100 units of reward, all others give no reward, and there is a discount factor of 0.99. Trials start in the bottom left corner. The system is reset to the start state whenever the goal state has been visited ten times since the last reset. The reset is outside the learning task: it is not observed as a state transition. Prioritized sweeping is tested against a highly tuned Q-learner [Watkins, 1989] and a highly tuned Dyna [Sutton, 1990]. The optimistic experimentation method (described in the full paper) can be applied to other algorithms, and so the results of optimistic Dyna learning are also included. \n\nThe same mazes were also run as a stochastic problem in which requested actions were randomly corrupted 50% of the time. The gap between Dyna-OPT and Prioritized Sweeping was reduced in these cases. For example, on a stochastic 4528-state maze Dyna-OPT took 310,000 steps and Prioritized Sweeping took 200,000. \n\nWe also have results for a five-state benchmark problem described in [Sato et al., 1988, Barto and Singh, 1990]. Convergence time is reduced by a factor of twenty over the incremental methods. \n\n                      Experiences to converge   Real time to converge \nQ                      never \nDyna-PI+               never \nOptimistic Dyna        55,000                    1500 secs \nPrioritized Sweeping   14,000                     330 secs \n\nTable 3: Performance on the deterministic rod-in-maze task. Both Dynas and prioritized sweeping were allowed 100 backups per experience. \n\nFinally we consider a task with a 3-d state space quantized into 15,000 potential discrete states (not all reachable). 
The task is shown in Figure 5 and involves finding the shortest path for a rod which can be rotated and translated. \n\nQ, Dyna-PI+, Optimistic Dyna and prioritized sweeping were all tested. The results are in Table 3. Q and Dyna-PI+ did not even travel a quarter of the way to the goal, let alone discover an optimal path, within 200,000 experiences. Optimistic Dyna and prioritized sweeping both eventually converged, with the latter requiring a third the experiences and a fifth the real time. \n\nWhen 2000 backups per experience were permitted, instead of 100, both optimistic Dyna and prioritized sweeping required fewer experiences to converge. Optimistic Dyna took 21,000 experiences instead of 55,000, but took 2,900 seconds, almost twice the real time. Prioritized sweeping took 13,500 experiences instead of 14,000, very little improvement, but it used no extra time. This indicates that for prioritized sweeping, 100 backups per observation is sufficient to make almost complete use of its observations, so that all the long-term reward estimates are very close to the estimates which would be globally consistent with the transition probability estimates q_ij. Thus, we conjecture that even full dynamic programming after each experience (which would take days of real time) would do little better. \n\nFigure 5: A three-DOF problem, and the optimal solution path. \n\nFigure 6: Dotted states are all those visited when the Manhattan heuristic was used. \n\nFigure 7: A kd-tree tessellation of the state space of a sparse maze. \n\n5 DISCUSSION \n\nOur investigation shows that Prioritized Sweeping can solve large state-space real-time problems with which other methods have difficulty. 
An important extension allows heuristics to constrain exploration decisions. For example, in finding an optimal path through a maze, many states need not be considered at all. Figure 6 shows the areas explored using a Manhattan heuristic when finding the optimal path from the lower left to the center. For some tasks we may even be satisfied to cease exploration when we have obtained a solution known to be, say, within 50% of the optimal solution. This can be achieved by using a heuristic which lies: it tells us that the best possible reward-to-go is that of a path which is twice the length of the true shortest possible path. \n\nFurthermore, another promising avenue is prioritized sweeping in conjunction with kd-tree tessellations of state space to concentrate the sweeping on the important regions [Moore, 1991]. Other benefits of the memory-based approach, described in [Moore, 1992], allow us to control forgetting in changing environments and automatic scaling of state variables. \n\nAcknowledgements \n\nThanks to Mary Soon Lee, Satinder Singh and Rich Sutton for useful comments on an early draft. Andrew W. Moore is supported by a Postdoctoral Fellowship from SERC/NATO. Support was also provided under Air Force Office of Scientific Research grant AFOSR-89-0500, an Alfred P. Sloan Fellowship, the W. M. Keck Foundation Associate Professorship in Biomedical Engineering, Siemens Corporation, and a National Science Foundation Presidential Young Investigator Award to Christopher G. Atkeson. \n\nReferences \n\n[Barto and Singh, 1990] A. G. Barto and S. P. Singh. On the Computational Economics of Reinforcement Learning. In D. S. Touretzky, editor, Connectionist Models: Proceedings of the 1990 Summer School. Morgan Kaufmann, 1990. \n\n[Kaelbling, 1990] L. P. Kaelbling. Learning in Embedded Systems. Ph.D. Thesis; Technical Report No. 
TR-90-04, Stanford University, Department of Computer Science, June 1990. \n\n[Moore and Atkeson, 1992] A. W. Moore and C. G. Atkeson. Memory-based Reinforcement Learning: Converging with Less Data and Less Real Time. In preparation, 1992. \n\n[Moore, 1991] A. W. Moore. Variable Resolution Dynamic Programming: Efficiently Learning Action Maps in Multivariate Real-valued State-spaces. In L. Birnbaum and G. Collins, editors, Machine Learning: Proceedings of the Eighth International Workshop. Morgan Kaufmann, June 1991. \n\n[Moore, 1992] A. W. Moore. Fast, Robust Adaptive Control by Learning only Forward Models. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann, April 1992. \n\n[Nilsson, 1971] N. J. Nilsson. Problem-solving Methods in Artificial Intelligence. McGraw-Hill, 1971. \n\n[Peng and Williams, 1992] J. Peng and R. J. Williams. Efficient Search Control in Dyna. College of Computer Science, Northeastern University, March 1992. \n\n[Sato et al., 1988] M. Sato, K. Abe, and H. Takeda. Learning Control of Finite Markov Chains with an Explicit Trade-off Between Estimation and Control. IEEE Transactions on Systems, Man, and Cybernetics, 18(5):667-684, 1988. \n\n[Sutton, 1988] R. S. Sutton. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3:9-44, 1988. \n\n[Sutton, 1990] R. S. Sutton. Integrated Architecture for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In Proceedings of the 7th International Conference on Machine Learning. Morgan Kaufmann, June 1990. \n\n[Tesauro, 1991] G. J. Tesauro. Practical Issues in Temporal Difference Learning. RC 17223 (76307), IBM T. J. Watson Research Center, NY, 1991. \n\n[Watkins, 1989] C. J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. Thesis, King's College, University of Cambridge, May 1989. 
\n\n\f", "award": [], "sourceid": 651, "authors": [{"given_name": "Andrew", "family_name": "Moore", "institution": null}, {"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}