{"title": "Multidimensional Triangulation and Interpolation for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1005, "page_last": 1011, "abstract": null, "full_text": "Multidimensional Triangulation and Interpolation for Reinforcement Learning \n\nScott Davies \n\nscottd@cs.cmu.edu \n\nDepartment of Computer Science, Carnegie Mellon University \n\n5000 Forbes Ave, Pittsburgh, PA 15213 \n\nAbstract \n\nDynamic Programming, Q-learning and other discrete Markov Decision Process solvers can be applied to continuous d-dimensional state-spaces by quantizing the state space into an array of boxes. This is often problematic above two dimensions: a coarse quantization can lead to poor policies, and fine quantization is too expensive. Possible solutions are variable-resolution discretization, or function approximation by neural nets. A third option, which has been little studied in the reinforcement learning literature, is interpolation on a coarse grid. In this paper we study interpolation techniques that can result in vast improvements in the online behavior of the resulting control systems: multilinear interpolation, and an interpolation algorithm based on an interesting regular triangulation of d-dimensional space. We adapt these interpolators under three reinforcement learning paradigms: (i) offline value iteration with a known model, (ii) Q-learning, and (iii) online value iteration with a previously unknown model learned from data. We describe empirical results, and the resulting implications for practical learning of continuous non-linear dynamic control. 
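The first of the two interpolation schemes named in the abstract, multilinear interpolation over the 2^d corners of a grid cell, can be sketched as follows. This is an illustrative reconstruction under the standard definition, not code from the paper; the function name and the unit-cube normalization are assumptions.

```python
import itertools

def multilinear_interpolate(values, point):
    """Multilinear interpolation over a unit hypercube cell.

    `values` maps each 0/1 corner tuple to a scalar (e.g. a cost-to-go
    estimate stored at that grid point); `point` holds coordinates in
    [0, 1] along each axis, normalized within the cell."""
    d = len(point)
    total = 0.0
    for corner in itertools.product((0, 1), repeat=d):
        # Each corner's weight is a product of per-axis weights:
        # x_i if the corner sits at 1 on axis i, else (1 - x_i).
        weight = 1.0
        for x, c in zip(point, corner):
            weight *= x if c == 1 else 1.0 - x
        total += weight * values[corner]
    return total
```

Note the cost of this scheme: every query touches all 2^d corners, which is the motivation for the cheaper simplex-based alternative studied alongside it.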
\n\n1 GRID-BASED INTERPOLATION TECHNIQUES \nReinforcement learning algorithms generate functions that map states to \"cost-to-go\" values. [The remainder of Sections 1-3.1.1 was lost in extraction; Figure 2 (Acrobot: value iteration with known model) tabulated backups and trial lengths for various grid sizes, but its data was garbled.] \n\nThe interpolated functions require more backups for convergence, but this is amply compensated by dramatic improvement in the policy. Surprisingly, both interpolation methods provide improvements even at extremely high grid resolutions - the noninterpolated grid with 301 datapoints along each axis fared no better than the interpolated grids with only 21 datapoints along each axis(!). \n\n3.1.2 Acrobot Results: value iteration with known model \nWe used the same value iteration algorithm in the acrobot domain. In this case our test trials always began from the same start state, but we ran tests for a larger set of grid sizes (Figure 2). \n\nGrids with different resolutions place grid cell boundaries at different locations, and these boundary locations appear to be important in this problem - the performance varies unpredictably as the grid resolution changes. However, in all cases, interpolation was necessary to arrive at a satisfactory solution; without interpolation, the value iteration often failed to converge at all. With relatively coarse grids it may be that any trajectory to the goal passes through some grid box more than once, which would immediately spell disaster for any algorithm associating a constant value over that entire grid box. \n\nControllers using multilinear interpolation consistently fared better than those employing the simplex-based interpolation; the smoother value function provided by multilinear interpolation seems to help. 
However, value iteration with the simplex-based interpolation was about twice as fast as that with multilinear interpolation. In higher dimensions this speed ratio will increase. \n\n3.2 CASE II: Q-LEARNING \nUnder a second reinforcement learning paradigm, we do not use any model. Rather, we learn a Q-function that directly maps state-action pairs to long-term rewards [Watkins, 1989]. Does interpolation help here too? \nIn this implementation we encourage exploration by optimistically initializing the Q-function to zero everywhere. After travelling a sufficient distance from our last decision point, we perform a single backup by changing the grid point values according to a perceptron-like update rule, and then we greedily select the action for which the interpolated Q-function is highest at the current state. \n\n3.2.1 Hillcar Results: Q-Learning \nWe used Q-Learning with a grid size of 11^2. Figure 3 shows learning curves for three learners using the three different interpolation techniques. \n\nBoth interpolation methods provided a significant improvement in both initial and final online performance. The learner without interpolation achieved a final average performance of about 175 steps to the goal; with multilinear interpolation, 119; with simplex-based interpolation, 122. Note that these are all significant improvements over the corresponding results for offline value iteration with a known model. Inaccuracies in the interpolated functions often cause controllers to enter cycles; because the Q-learning backups are being performed online, however, the Q-learning controller can escape from these control cycles by depressing the Q-values in the vicinities of such cycles. 
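The interpolated Q-learning update described above can be sketched in one dimension: Q-values live at grid points, the Q-value at an arbitrary state is linearly interpolated, and the perceptron-like backup moves each contributing grid point toward the one-step target in proportion to its interpolation weight. The grid size, learning rate, and two-action setup below are assumptions for the sketch, not values from the paper.

```python
N_POINTS, N_ACTIONS = 11, 2
# Optimistic initialization to zero everywhere (rewards are negative).
Q = [[0.0] * N_ACTIONS for _ in range(N_POINTS)]

def weights(state):
    """Surrounding grid indices and linear weights for a state in [0, 1]."""
    scaled = state * (N_POINTS - 1)
    i = min(int(scaled), N_POINTS - 2)
    frac = scaled - i
    return [(i, 1.0 - frac), (i + 1, frac)]

def q_value(state, action):
    """Interpolated Q-value: weighted average of neighboring grid points."""
    return sum(w * Q[i][action] for i, w in weights(state))

def backup(state, action, reward, next_state, alpha=0.5):
    """Perceptron-like backup: push each contributing grid point toward
    the one-step target, scaled by its interpolation weight."""
    target = reward + max(q_value(next_state, a) for a in range(N_ACTIONS))
    error = target - q_value(state, action)
    for i, w in weights(state):
        Q[i][action] += alpha * w * error
```

The greedy controller then picks `argmax` over `q_value(state, a)` at each decision point; because backups happen online, overvalued regions along control cycles are gradually depressed, as noted above.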
\n\n3.2.2 Acrobot Results: Q-Learning \nWe used the same algorithms on the acrobot domain with a grid size of 15^4; results are shown in Figure 3. \n\nFigure 3: Left: Cumulative performance of Q-learning hillcar on an 11^2 grid. (Multilinear interpolation comes out on top; no interpolation on the bottom.) Right: Q-learning acrobot on a 15^4 grid. (The two interpolations come out on top with nearly identical performance.) For each learner, the y-axis shows the sum of rewards for all trials to date. The better the average performance, the shallower the gradient. Gradients are always negative because each state transition before reaching the goal results in a reward of -1. \n\nBoth Q-learners using interpolation improved rapidly, and eventually reached the goal in a relatively small number of steps per trial. The learner using multilinear interpolation eventually achieved an average of 1,529 steps to the goal per trial; the learner using simplex-based interpolation achieved 1,727 steps per trial. On the other hand, the learner not using any interpolation fared much worse, taking an average of more than 27,000 steps per trial. (A controller that chooses actions randomly typically takes about the same number of steps to reach the goal.) \n\nSimplex-based interpolation provided on-line performance very close to that provided by multilinear interpolation, but at roughly half the computational cost. \n\n3.3 CASE III: VALUE ITERATION WITH MODEL LEARNING \nHere, we use a model of the system, but we do not assume that we have one to start with. 
Instead, we learn a model of the system as we interact with it; we assume this model is adequate and calculate a value function via the same algorithms we would use if we knew the true model. This approach may be particularly beneficial for tasks in which data is expensive and computation is cheap. Here, models are learned using very simple grid-based function approximators without interpolation for both the reward and transition functions of the model. The same grid resolution is used for the value function grid and the model approximator. We strongly encourage exploration by initializing the model so that every state is initially assumed to be an absorbing state with zero reward. \n\nWhile making transitions through the state space, we update the model and use prioritized sweeping [Moore and Atkeson, 1993] to concentrate backups on relevant parts of the state space. We also occasionally stop to recalculate the effects of all actions under the updated model and then run value iteration to convergence. As this is fairly time-consuming, it is done rather rarely; we rely on the updates performed by prioritized sweeping to guide the system in the meantime. \n\nFigure 4: Left: Cumulative performance, model-learning on hillcar with an 11^2 grid. Right: Acrobot with a 15^4 grid. In both cases, multilinear interpolation comes out on top, while no interpolation winds up on the bottom. \n\n3.3.1 Hillcar Results: value iteration with learned model \nWe used the algorithm described above with an 11-by-11 grid. 
An average of about two prioritized sweeping backups were performed per transition; the complete recalculations were performed every 1000 steps throughout the first two trials and every 5000 steps thereafter. Figure 4 shows the results for the first 500 trials. \n\nOver the first 500 trials, the learner using simplex-based interpolation didn't fare much better than the learner using no interpolation. However, its performance on trials 1500-2500 (not shown) was close to that of the learner using multilinear interpolation, taking an average of 151 steps to the goal per trial while the learner using multilinear interpolation took 147. The learner using no interpolation did significantly worse than the others in these later trials, taking 175 steps per trial. \n\nThe model-learners' performance improved more quickly than the Q-learners' over the first few trials; on the other hand, their final performance was significantly worse than the Q-learners'. \n\n3.3.2 Acrobot Results: value iteration with learned model \nWe used the same algorithm with a 15^4 grid on the acrobot domain, this time performing the complete recalculations every 10000 steps through the first two trials and every 50000 thereafter. Figure 4 shows the results. In this case, the learner using no interpolation took so much time per trial that the experiment was aborted early; after 100 trials, it was still taking an average of more than 45,000 steps to reach the goal. The learners using interpolation, however, fared much better. The learner using multilinear interpolation converged to a solution taking 938 steps per trial; the learner using simplex-based interpolation averaged about 2450 steps. Again, as the graphs show, these three learners initially improved significantly faster than did the Q-learners using similar grid sizes. 
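The simplex-based interpolator used throughout these experiments is not spelled out in this excerpt; the standard Kuhn-triangulation scheme below is one plausible reconstruction of a regular triangulation of the hypercube, and shows why it costs roughly half as much as multilinear interpolation: each query touches only d+1 vertices rather than 2^d. Function and variable names are assumptions.

```python
def simplex_interpolate(values, point):
    """Interpolate within a unit hypercube cell via the Kuhn triangulation.

    Sorting the coordinates selects one of the d! congruent simplices
    tiling the cube; the result is a convex combination of that
    simplex's d+1 vertices. `values` maps 0/1 corner tuples to scalars;
    `point` holds coordinates in [0, 1]."""
    d = len(point)
    order = sorted(range(d), key=lambda i: -point[i])  # axes by descending coord
    corner = [0] * d
    # Walk from (0,...,0) toward (1,...,1), flipping one axis at a time.
    vertices = [tuple(corner)]
    for i in order:
        corner[i] = 1
        vertices.append(tuple(corner))
    coords = sorted(point, reverse=True)
    # Barycentric weights are differences of the sorted coordinates.
    lam = [1.0 - coords[0]]
    lam += [coords[k] - coords[k + 1] for k in range(d - 1)]
    lam.append(coords[-1])
    return sum(w * values[v] for w, v in zip(lam, vertices))
```

Like the multilinear scheme, this reproduces any linear function exactly, but the interpolated surface has creases along simplex boundaries, which is consistent with the smoother multilinear value function performing better in the value iteration experiments above.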
\n\n4 CONCLUSIONS \nWe have shown how two interpolation schemes - one based on a weighted average of the 2^d points in a square cell, the other on a d-dimensional triangulation - may be used in three reinforcement learning paradigms: optimal policy computation with a known model, Q-learning, and online value iteration while learning a model. In each case our empirical studies demonstrate interpolation resoundingly decreasing the quantization level necessary for a satisfactory solution. Future extensions of this research will explore the use of variable resolution grids and triangulations, multiple low-dimensional interpolations in place of one high-dimensional interpolation in a manner reminiscent of CMAC [Albus, 1981], memory-based approximators, and more intelligent exploration. \n\nThis research was funded in part by a National Science Foundation Graduate Fellowship to Scott Davies, and a Research Initiation Award to Andrew Moore. \n\nReferences \n[Albus, 1981] J. S. Albus. Brains, Behaviour and Robotics. BYTE Books, McGraw-Hill, 1981. \n[Boyan and Moore, 1995] J. A. Boyan and A. W. Moore. Generalization in Reinforcement Learning: Safely Approximating the Value Function. In Neural Information Processing Systems 7, 1995. \n[Crites and Barto, 1996] R. H. Crites and A. G. Barto. Improving Elevator Performance using Reinforcement Learning. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Neural Information Processing Systems 8, 1996. \n[Gordon, 1995] G. Gordon. Stable Function Approximation in Dynamic Programming. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, June 1995. \n[Moore and Atkeson, 1993] A. W. Moore and C. G. Atkeson. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. Machine Learning, 13, 1993. \n[Moore and Atkeson, 1995] A. W. Moore and C. G. Atkeson. 
The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces. Machine Learning, 21, 1995. \n[Moore, 1992] D. W. Moore. Simplicial Mesh Generation with Applications. Ph.D. Thesis, Report no. 92-1322, Cornell University, 1992. \n[Ross, 1983] S. Ross. Introduction to Stochastic Dynamic Programming. Academic Press, New York, 1983. \n[Sutton, 1988] R. S. Sutton. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3:9-44, 1988. \n[Sutton, 1996] R. S. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Neural Information Processing Systems 8, 1996. \n[Tesauro, 1991] G. J. Tesauro. Practical Issues in Temporal Difference Learning. RC 17223 (76307), IBM T. J. Watson Research Center, NY, 1991. \n[Watkins, 1989] C. J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. Thesis, King's College, University of Cambridge, May 1989. \n", "award": [], "sourceid": 1229, "authors": [{"given_name": "Scott", "family_name": "Davies", "institution": null}]}