{"title": "Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 1038, "page_last": 1044, "abstract": null, "full_text": "Generalization in Reinforcement \n\nLearning: Successful Examples Using \n\nSparse Coarse Coding \n\nRichard S. Sutton \n\nUniversity of Massachusetts \nAmherst, MA 01003 USA \n\nrichOcs.umass.edu \n\nAbstract \n\nOn large problems, reinforcement learning systems must use parame(cid:173)\nterized function approximators such as neural networks in order to gen(cid:173)\neralize between similar situations and actions. In these cases there are \nno strong theoretical results on the accuracy of convergence, and com(cid:173)\nputational results have been mixed. In particular, Boyan and Moore \nreported at last year's meeting a series of negative results in attempting \nto apply dynamic programming together with function approximation \nto simple control problems with continuous state spaces. In this paper, \nwe present positive results for all the control tasks they attempted, and \nfor one that is significantly larger. The most important differences are \nthat we used sparse-coarse-coded function approximators (CMACs) \nwhereas they used mostly global function approximators, and that we \nlearned online whereas they learned offline. Boyan and Moore and \nothers have suggested that the problems they encountered could be \nsolved by using actual outcomes (\"rollouts\"), as in classical Monte \nCarlo methods, and as in the TD().) algorithm when). = 1. However, \nin our experiments this always resulted in substantially poorer perfor(cid:173)\nmance. We conclude that reinforcement learning can work robustly \nin conjunction with function approximators, and that there is little \njustification at present for avoiding the case of general ).. 
\n\n1 Reinforcement Learning and Function Approximation \n\nReinforcement learning is a broad class of optimal control methods based on estimating value functions from experience, simulation, or search (Barto, Bradtke & Singh, 1995; Sutton, 1988; Watkins, 1989). Many of these methods, e.g., dynamic programming and temporal-difference learning, build their estimates in part on the basis of other estimates. This may be worrisome because, in practice, the estimates never become exact; on large problems, parameterized function approximators such as neural networks must be used. Because the estimates are imperfect, and because they in turn are used as the targets for other estimates, it seems possible that the ultimate result might be very poor estimates, or even divergence. Indeed some such methods have been shown to be unstable in theory (Baird, 1995; Gordon, 1995; Tsitsiklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin, 1992; Tesauro, 1992; Zhang & Dietterich, 1995; Crites & Barto, 1996). What are the key requirements of a method or task in order to obtain good performance? The experiments in this paper are part of an effort to narrow the answer to this question. \n\nThe reinforcement learning methods we use are variations of the sarsa algorithm (Rummery & Niranjan, 1994; Singh & Sutton, 1996). This method is the same as the TD(λ) algorithm (Sutton, 1988), except applied to state-action pairs instead of states, and where the predictions are used as the basis for selecting actions. The learning agent estimates action-values, Q^π(s, a), defined as the expected future reward starting in state s, taking action a, and thereafter following policy π. 
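Written out, this definition takes the following standard form (a sketch in conventional notation, assuming the undiscounted, trial-based setting used throughout the paper; here T denotes the final time step of a trial and r_t the reward at step t):

```latex
Q^{\pi}(s, a) \;=\; E_{\pi}\!\left[\, \sum_{t=0}^{T} r_t \;\middle|\; s_0 = s,\; a_0 = a \,\right]
```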
These are estimated for all states and actions, and for the policy currently being followed by the agent. The policy is chosen dependent on the current estimates in such a way that they jointly improve, ideally approaching an optimal policy and the optimal action-values. In our experiments, actions were selected according to what we call the ε-greedy policy. Most of the time, the action selected when in state s was the action for which the estimate Q(s, a) was the largest (with ties broken randomly). However, a small fraction, ε, of the time, the action was instead selected randomly uniformly from the action set (which was always discrete and finite). There are two variations of the sarsa algorithm, one using conventional accumulate traces and one using replace traces (Singh & Sutton, 1996). This and other details of the algorithm we used are given in Figure 1. \n\nTo apply the sarsa algorithm to tasks with a continuous state space, we combined it with a sparse, coarse-coded function approximator known as the CMAC (Albus, 1980; Miller, Glanz & Kraft, 1990; Watkins, 1989; Lin & Kim, 1991; Dean et al., 1992; Tham, 1994). A CMAC uses multiple overlapping tilings of the state space to produce a feature representation for a final linear mapping where all the learning takes place. See Figure 2. The overall effect is much like a network with fixed radial basis functions, except that it is particularly efficient computationally (in other respects one would expect RBF networks and similar methods (see Sutton & Whitehead, 1993) to work just as well). It is important to note that the tilings need not be simple grids. For example, to avoid the \"curse of dimensionality,\" a common trick is to ignore some dimensions in some tilings, i.e., to use hyperplanar slices instead of boxes. A second major trick is \"hashing\": a consistent random collapsing of a large set of tiles into a much smaller set. 
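Concretely, the tilings-plus-hashing scheme can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the function name `cmac_features`, the CRC-based hash, and all parameter values are assumptions.

```python
import zlib

def cmac_features(state, num_tilings=5, bins=8, memory_size=1024):
    """Active CMAC tile indices for a continuous state (each coordinate in [0, 1)).

    Each tiling is a grid displaced by a different fraction of a tile width,
    so nearby states share most of their active tiles.  The (tiling, cell)
    pair is then hashed -- a consistent random collapsing of the large
    virtual tile set into a table of only memory_size entries.
    """
    features = []
    for t in range(num_tilings):
        offset = t / num_tilings                       # per-tiling displacement
        coords = tuple(int(s * bins + offset) for s in state)
        key = repr((t, coords)).encode()               # consistent key per tile
        features.append(zlib.crc32(key) % memory_size)
    return features

# Exactly one tile fires per tiling, so c = num_tilings features are active.
active = cmac_features((0.3, 0.7))
assert len(active) == 5 and all(0 <= f < 1024 for f in active)
```

The approximate action value is then just a sum of c weights indexed by the active tiles, Q(s, a) ≈ Σ_{f∈F} w_a(f), so each learning update touches only c of the memory_size weights per action.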
Through hashing, memory requirements are often reduced by large factors with little loss of performance. This is possible because high resolution is needed in only a small fraction of the state space. Hashing frees us from the curse of dimensionality in the sense that memory requirements need not be exponential in the number of dimensions, but need merely match the real demands of the task. \n\n2 Good Convergence on Control Problems \n\nWe applied the sarsa and CMAC combination to the three continuous-state control problems studied by Boyan and Moore (1995): 2D gridworld, puddle world, and mountain car. Whereas they used a model of the task dynamics and applied dynamic programming backups offline to a fixed set of states, we learned online, without a model, and backed up whatever states were encountered during complete trials. Unlike Boyan \n\n1. Initially: w_a(f) := Q_0/c, e_a(f) := 0, ∀a ∈ Actions, ∀f ∈ CMAC-tiles. \n\n2. Start of Trial: s := random-state(); \n   F := features(s); \n   a := ε-greedy-policy(F). \n\n3. Eligibility Traces: e_b(f) := λ e_b(f), ∀b, ∀f; \n   3a. Accumulate algorithm: e_a(f) := e_a(f) + 1, ∀f ∈ F. \n   3b. Replace algorithm: e_a(f) := 1, e_b(f) := 0, ∀f ∈ F, ∀b ≠ a. \n\n4. Environment Step: \n   Take action a; observe resultant reward, r, and next state, s'. \n\n5. Choose Next Action: \n   F' := features(s'), unless s' is the terminal state, then F' := ∅; \n   a' := ε-greedy-policy(F'). \n\n6. Learn: w_b(f) := w_b(f) + (α/c) [r + Σ_{f'∈F'} w_{a'}(f') − Σ_{f∈F} w_a(f)] e_b(f), ∀b, ∀f. \n\n7. Loop: a := a'; s := s'; F := F'; if s' is the terminal state, go to 2; else go to 3. \n\nFigure 1: The sarsa algorithm for finite-horizon (trial based) tasks. The function ε-greedy-policy(F) returns, with probability ε, a random action or, with probability 1 − ε, computes Σ_{f∈F} w_a(f) for each action a and returns the action for which the sum is largest, resolving any ties randomly. 
The function features(s) returns the set of CMAC tiles corresponding to the state s. The number of tiles returned is the constant c. Q_0, α, and λ are scalar parameters. \n\n[Figure 2: multiple overlapping grid tilings (Tiling #1, Tiling #2, ...) of a two-dimensional state space.] \n\n[Figure 7 plot residue: panels show Root Mean Squared Error, Steps/Trial, Cost/Trial (Puddle World), and Failures per 100,000 steps (Cart and Pole), each versus λ, for accumulate and replace traces.] \n\nFigure 7: Performance versus λ, at best α, for four different tasks. The left panels summarize data from Figure 6. The upper right panel concerns a 21-state Markov chain, the objective being to predict, for each state, the probability of terminating in one terminal state as opposed to the other (Singh & Sutton, 1996). The lower right panel concerns the pole balancing task studied by Barto, Sutton and Anderson (1983). This is previously unpublished data from an earlier study (Sutton, 1984). \n\nReferences \n\nAlbus, J. S. (1981) Brain, Behavior, and Robotics, chapter 6, pages 139-179. Byte Books. \n\nBaird, L. C. (1995) Residual algorithms: Reinforcement learning with function approximation. Proc. ML95. Morgan Kaufmann, San Francisco, CA. \n\nBarto, A. G., Bradtke, S. J., & Singh, S. P. (1995) Real-time learning and control using asynchronous dynamic programming. Artificial Intelligence. \n\nBarto, A. G., Sutton, R. S., & Anderson, C. W. 
(1983) Neuronlike elements that can solve difficult learning control problems. Trans. IEEE SMC, 13, 835-846. \n\nBertsekas, D. P. (1995) A counterexample to temporal differences learning. Neural Computation, 7, 270-279. \n\nBoyan, J. A. & Moore, A. W. (1995) Generalization in reinforcement learning: Safely approximating the value function. NIPS-7. San Mateo, CA: Morgan Kaufmann. \n\nCrites, R. H. & Barto, A. G. (1996) Improving elevator performance using reinforcement learning. NIPS-8. Cambridge, MA: MIT Press. \n\nDayan, P. (1992) The convergence of TD(λ) for general λ. Machine Learning, 8, 341-362. \n\nDean, T., Basye, K. & Shewchuk, J. (1992) Reinforcement learning for planning and control. In S. Minton, Machine Learning Methods for Planning and Scheduling. Morgan Kaufmann. \n\nDeJong, G. & Spong, M. W. (1994) Swinging up the acrobot: An example of intelligent control. In Proceedings of the American Control Conference, pages 1.158-1.161. \n\nGordon, G. (1995) Stable function approximation in dynamic programming. Proc. ML95. \n\nLin, L. J. (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3/4), 293-321. \n\nLin, C.-S. & Kim, H. (1991) CMAC-based adaptive critic self-learning control. IEEE Trans. Neural Networks, 2, 530-533. \n\nMiller, W. T., Glanz, F. H., & Kraft, L. G. (1990) CMAC: An associative neural network alternative to backpropagation. Proc. of the IEEE, 78, 1561-1567. \n\nRummery, G. A. & Niranjan, M. (1994) On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Dept. \n\nSingh, S. P. & Sutton, R. S. (1996) Reinforcement learning with replacing eligibility traces. Machine Learning. \n\nSpong, M. W. & Vidyasagar, M. (1989) Robot Dynamics and Control. New York: Wiley. \n\nSutton, R. S. (1984) Temporal Credit Assignment in Reinforcement Learning. 
PhD thesis, University of Massachusetts, Amherst, MA. \n\nSutton, R. S. (1988) Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44. \n\nSutton, R. S. & Whitehead, S. D. (1993) Online learning with random representations. Proc. ML93, pages 314-321. Morgan Kaufmann. \n\nTham, C. K. (1994) Modular On-Line Function Approximation for Scaling up Reinforcement Learning. PhD thesis, Cambridge Univ., Cambridge, England. \n\nTesauro, G. J. (1992) Practical issues in temporal difference learning. Machine Learning, 8(3/4), 257-277. \n\nTsitsiklis, J. N. & Van Roy, B. (1994) Feature-based methods for large-scale dynamic programming. Technical Report LIDS-P2277, MIT, Cambridge, MA 02139. \n\nWatkins, C. J. C. H. (1989) Learning from Delayed Rewards. PhD thesis, Cambridge Univ. \n\nZhang, W. & Dietterich, T. G. (1995) A reinforcement learning approach to job-shop scheduling. Proc. IJCAI95. \n\nAppendix: Details of the Experiments \n\nIn the puddle world, there were four actions, up, down, right, and left, which moved approximately 0.05 in these directions unless the movement would cause the agent to leave the limits of the space. A random gaussian noise with standard deviation 0.01 was also added to the motion along both dimensions. The costs (negative rewards) on this task were -1 for each time step plus additional penalties if either or both of the two oval \"puddles\" were entered. These penalties were -400 times the distance into the puddle (distance to the nearest edge). The puddles were 0.1 in radius and were located at center points (.1, .75) to (.45, .75) and (.45, .4) to (.45, .8). The initial state of each trial was selected randomly uniformly from the non-goal states. For the run in Figure 3, α = 0.5, λ = 0.9, c = 5, ε = 0.1, and Q_0 = 0. For Figure 6, Q_0 = -20. \n\nDetails of the mountain-car task are given in Singh & Sutton (1996). 
For the run in Figure 4, α = 0.5, λ = 0.9, c = 10, ε = 0, and Q_0 = 0. For Figure 6, c = 5 and Q_0 = -100. \n\nIn the acrobot task, the CMACs used 48 tilings. Each of the four dimensions was divided into 6 intervals. 12 tilings depended in the usual way on all 4 dimensions. 12 other tilings depended only on 3 dimensions (3 tilings for each of the four sets of 3 dimensions). 12 others depended only on two dimensions (2 tilings for each of the 6 sets of two dimensions). And finally 12 tilings depended each on only one dimension (3 tilings for each dimension). This resulted in a total of 12·6^4 + 12·6^3 + 12·6^2 + 12·6 = 18,648 tiles. The equations of motion were: \n\nθ̈1 = −d1^{−1} (d2 θ̈2 + φ1) \n\nθ̈2 = (m2 lc2^2 + I2 − d2^2/d1)^{−1} (τ + (d2/d1) φ1 − m2 l1 lc2 θ̇1^2 sin θ2 − φ2) \n\nd1 = m1 lc1^2 + m2 (l1^2 + lc2^2 + 2 l1 lc2 cos θ2) + I1 + I2 \n\nd2 = m2 (lc2^2 + l1 lc2 cos θ2) + I2 \n\nφ1 = −m2 l1 lc2 θ̇2^2 sin θ2 − 2 m2 l1 lc2 θ̇2 θ̇1 sin θ2 + (m1 lc1 + m2 l1) g cos(θ1 − π/2) + φ2 \n\nφ2 = m2 lc2 g cos(θ1 + θ2 − π/2) \n\nwhere τ ∈ {+1, −1, 0} was the torque applied at the second joint, and Δ = 0.05 was the time increment. Actions were chosen after every four of the state updates given by the above equations, corresponding to 5 Hz. The angular velocities were bounded by θ̇1 ∈ [−4π, 4π] and θ̇2 ∈ [−9π, 9π]. Finally, the remaining constants were m1 = m2 = 1 (masses of the links), l1 = l2 = 1 (lengths of links), lc1 = lc2 = 0.5 (lengths to center of mass of links), I1 = I2 = 1 (moments of inertia of links), and g = 9.8 (gravity). The parameters were α = 0.2, λ = 0.9, c = 48, ε = 0, Q_0 = 0. The starting state on each trial was θ1 = θ2 = 0. \n\n", "award": [], "sourceid": 1109, "authors": [{"given_name": "Richard", "family_name": "Sutton", "institution": null}]}