{"title": "The Parti-Game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 711, "page_last": 718, "abstract": null, "full_text": "The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces\n\nAndrew W. Moore\nSchool of Computer Science\nCarnegie-Mellon University\nPittsburgh, PA 15213\n\nAbstract\n\nParti-game is a new algorithm for learning from delayed rewards in high-dimensional real-valued state-spaces. In high dimensions it is essential that learning does not explore or plan over state space uniformly. Parti-game maintains a decision-tree partitioning of state-space and applies game-theory and computational-geometry techniques to efficiently and reactively concentrate high resolution only on critical areas. Many simulated problems have been tested, ranging from 2-dimensional to 9-dimensional state-spaces, including mazes, path planning, non-linear dynamics, and uncurling snake robots in restricted spaces. In all cases, a good solution is found in less than twenty trials and a few minutes.\n\n1 REINFORCEMENT LEARNING\n\nReinforcement learning [Samuel, 1959, Sutton, 1984, Watkins, 1989, Barto et al., 1991] is a promising method for control systems to program and improve themselves. This paper addresses its biggest stumbling block: the curse of dimensionality [Bellman, 1957], in which costs increase exponentially with the number of state variables.\n\nSome earlier work [Simons et al., 1982, Moore, 1991, Chapman and Kaelbling, 1991, Dayan and Hinton, 1993] has considered recursively partitioning state-space while learning from delayed rewards. 
The new ideas in the parti-game algorithm include (i) a game-theoretic splitting criterion to robustly choose spatial resolution, (ii) real-time incremental maintenance and planning with a database of all previous experiences, and (iii) the use of local greedy controllers for high-level \"funneling\" actions.\n\n2 ASSUMPTIONS\n\nThe parti-game algorithm applies to difficult learning control problems in which:\n\n1. State and action spaces are continuous and multidimensional.\n2. \"Greedy\" and hill-climbing techniques would become stuck, never attaining the goal.\n3. Random exploration would be hopelessly time-consuming.\n4. The system dynamics and control laws can have discontinuities and are unknown: they must be learned.\n\nThe experiments reported later all have properties 1-4. However, the initial algorithm, described and tested here, has the following restrictions:\n\n5. Dynamics are deterministic.\n6. The task is specified by a goal, not an arbitrary reward function.\n7. The goal state is known.\n8. A \"good\" solution is required, not necessarily the optimal path. This notion of goodness can be formalized as \"the optimal path to within a given resolution of state space\".\n9. A local greedy controller is available, which we can ask to move greedily towards any desired state. There is no guarantee that a request to the greedy controller will succeed. For example, in a maze a greedy path to the goal would quickly hit a wall.\n\nFuture developments may include relatively straightforward additions to the algorithm that would remove the need for restrictions 6-9. Restriction 5 is harder to remove.\n\n3 ESSENTIALS OF THE PARTI-GAME ALGORITHM\n\nThe state space is broken into partitions by a kd-tree [Friedman et al., 1977]. The controller can always sense its current (continuous-valued) state, and can cheaply compute which partition it is in. 
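The cheap point-to-partition lookup described above can be sketched as a descent through an axis-aligned kd-tree. This is a minimal illustration, not the paper's implementation; all names (`Node`, `split_dim`, `split_val`, `leaf_id`) are hypothetical.

```python
# Sketch of a kd-tree partition lookup, assuming axis-aligned binary splits.
# All names here are illustrative, not from the paper.

class Node:
    def __init__(self, split_dim=None, split_val=None, low=None, high=None, leaf_id=None):
        self.split_dim = split_dim       # dimension an internal node splits on
        self.split_val = split_val       # threshold along that dimension
        self.low, self.high = low, high  # children: below / at-or-above threshold
        self.leaf_id = leaf_id           # partition id if this node is a leaf

def partition_of(node, x):
    """Descend the tree; cost is O(tree depth), independent of partition count."""
    while node.leaf_id is None:
        node = node.low if x[node.split_dim] < node.split_val else node.high
    return node.leaf_id

# A 2-d example: split x[0] at 0.5, then split the right half on x[1] at 0.5,
# giving three leaf partitions of unequal size (variable resolution).
tree = Node(split_dim=0, split_val=0.5,
            low=Node(leaf_id=0),
            high=Node(split_dim=1, split_val=0.5,
                      low=Node(leaf_id=1), high=Node(leaf_id=2)))
print(partition_of(tree, (0.2, 0.9)))  # -> 0
print(partition_of(tree, (0.8, 0.9)))  # -> 2
```

Because lookup cost grows with tree depth rather than with the number of partitions, sensing the current partition stays cheap even as resolution increases locally.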
The space of actions is also discretized so that in a partition with N neighboring partitions, there are N high-level actions. Each high-level action corresponds to a local greedy controller, aiming for the center of the corresponding neighboring partition.\n\nEach partition keeps records of all the occasions on which the system state has passed through it. Along with each record is a memory of which high-level action was used (i.e. which neighbor was aimed for) and what the outcome was. Figure 1 provides an illustration.\n\nFigure 1: Three trajectories starting in partition 1, using high-level action \"Aim at partition 2\". Partition 1 remembers three outcomes:\n(Part 1, Aim 2 -> Part 2)\n(Part 1, Aim 2 -> Part 1)\n(Part 1, Aim 2 -> Part 3)\n\nGiven this database of (partition, high-level-action, outcome) triplets, and our knowledge of the partition containing the goal state, we can try to compute the best route to the goal. The standard approach would be to model the system as a Markov Decision Task in which we empirically estimate the partition transition probabilities. However, the probabilistic interpretation of coarse-resolution partitions can lead to policies which get stuck. Instead, we use a game-theoretic approach, in which we imagine an adversary. This adversary sees our choice of high-level action, and is allowed to select any of the observed previous outcomes of the action in this partition. Partitions are scored by minimaxing: the adversary plays to delay or prevent us getting to the goal, and we play to get to the goal as quickly as possible. 
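The minimax scoring over the outcome database might be sketched as a value iteration in which, for each action, the adversary picks the worst observed outcome. This is an illustrative reconstruction under stated assumptions (unit cost per partition transition, infinite score marking a losing partition); the function and variable names are hypothetical, not from the paper.

```python
# Sketch of minimax partition scoring (illustrative names).
# outcomes[(p, a)] = set of partitions observed to result from taking
# high-level action a while in partition p.  The adversary picks the worst.

INF = float('inf')

def minimax_scores(partitions, outcomes, goal):
    score = {p: (0 if p == goal else INF) for p in partitions}
    changed = True
    while changed:                                    # iterate to a fixpoint
        changed = False
        for p in partitions:
            if p == goal:
                continue
            best = INF
            for (q, a), outs in outcomes.items():
                if q != p or not outs:
                    continue
                worst = max(score[o] for o in outs)   # adversary's choice
                best = min(best, 1 + worst)           # our choice of action
            if best < score[p]:
                score[p] = best
                changed = True
    return score   # score[p] == INF marks a "losing" partition

# Tiny example: from partition 1, aiming at 2 sometimes leaves us in 1, so the
# adversary can make that action useless; an action with reliable outcomes wins.
outcomes = {(1, 'aim2'): {1, 2}, (1, 'aim3'): {3},
            (2, 'aimG'): {'G'}, (3, 'aimG'): {'G'}}
s = minimax_scores([1, 2, 3, 'G'], outcomes, 'G')
print(s[1])  # -> 2  (via aim3 then aimG; aim2 scores INF under the adversary)
```

Note how the self-loop outcome (Part 1, Aim 2 -> Part 1) makes that action worthless to the minimaxing planner, which is exactly how the game-theoretic view avoids the stuck policies a probabilistic model can produce.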
\n\nWhenever the system's continuous state passes between partitions, the database of state transitions is updated and, if necessary, the minimax scores of all partitions are updated. If real-time constraints do not permit full recomputation, the updates take place incrementally in a manner similar to prioritized sweeping [Moore and Atkeson, 1993].\n\nAs well as being robust to coarseness, the game-theoretic approach also tells us where we should increase the resolution. Whenever we compute that we are in a losing partition, we perform a resolution increase. We first compute the complete set of connected partitions which are also losing partitions. We then find the subset of these partitions which border some non-losing region. We increase the resolution of all these border partitions by splitting them along their longest axes (more intelligent splitting criteria are under investigation).\n\n4 INITIAL EXPERIMENTS\n\nFigure 2 shows a 2-d continuous maze. Figure 3 shows the performance of the robot during the very first trial. It begins with intense exploration to find a route out of the almost entirely enclosed start region. Having eventually reached a sufficiently high resolution, it discovers the gap and proceeds greedily towards the goal, only to be stopped by the goal's barrier region. The next barrier is traversed at a much lower resolution, mainly because the gap is larger.\n\nFigure 2: A 2-d maze problem. The point robot must find a path from start to goal without crossing any of the barrier lines. Remember that initially it does not know where any obstacles are, and must discover them by finding impassable states.\n\nFigure 3: The path taken during the entire first trial. See text for explanation.\n\nFigure 4 shows the second trial, started from a slightly different position. The policy derived from the first trial gets us to the goal without further exploration. The trajectory has unnecessary bends because the controller is discretized according to the current partitioning. If necessary, a local optimizer could be used to refine this trajectory (another method is to increase the resolution along the trajectory [Moore, 1991]).\n\nThe system does not explore unnecessary areas. The barrier in the top left remains at low resolution because the system has had no need to visit there. Figures 5 and 6 show what happens when we now start the system inside this barrier.\n\nFigure 7 shows a 3-d state-space problem. If a standard grid were used, this would need an enormous number of states because the solution requires detailed three-point-turns. Parti-game's total exploration took 18 times as much movement as one run of the final path obtained.\n\nFigure 8 shows a 4-d problem in which a ball rolls around a tray with steep edges. The goal is on the other side of a ridge. The maximum permissible force is low, and so greedy strategies, or globally linear control rules, get stuck in a limit cycle. Parti-game's solution runs to the other end of the tray, to build up enough velocity to make it over the ridge. The exploration-length versus final-path-length ratio is 24.\n\nFigure 9 shows a 9-joint snake-like robot manipulator which must move to a specified configuration on the other side of a barrier. Again, no initial model is given: the controller must learn it as it explores. It takes seven trials before fixing on the solution shown. The exploration-length versus final-path-length ratio is 60.\n\nFigure 4: The second trial. 
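The resolution-increase procedure of Section 3 (find the connected losing region, keep its border, split each border partition along its longest axis) might be sketched as follows. This is a simplified illustration, not the paper's code: `score` is assumed to come from a minimax scoring pass with infinite scores marking losing partitions, and all function and variable names are hypothetical.

```python
# Sketch of the resolution-increase step (illustrative names).

INF = float('inf')

def split_losing_border(current, score, neighbors, bounds):
    # 1. Flood-fill the connected set of losing partitions containing `current`.
    losing, frontier = {current}, [current]
    while frontier:
        p = frontier.pop()
        for q in neighbors[p]:
            if score[q] == INF and q not in losing:
                losing.add(q)
                frontier.append(q)
    # 2. Border subset: losing partitions adjacent to some non-losing partition.
    border = {p for p in losing if any(score[q] < INF for q in neighbors[p])}
    # 3. Split each border partition in half along its longest axis.
    new_boxes = []
    for p in border:
        lo, hi = bounds[p]
        axis = max(range(len(lo)), key=lambda d: hi[d] - lo[d])
        mid = (lo[axis] + hi[axis]) / 2
        left_hi = list(hi); left_hi[axis] = mid
        right_lo = list(lo); right_lo[axis] = mid
        new_boxes.append((p, (lo, tuple(left_hi)), (tuple(right_lo), hi)))
    return new_boxes

# Example: partitions 0 and 1 are losing, 2 is not; only 1 borders the
# non-losing region, so only 1 is split (along its longer axis, dimension 1).
score = {0: INF, 1: INF, 2: 3.0}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
bounds = {0: ((0, 0), (1, 1)), 1: ((1, 0), (2, 2)), 2: ((2, 0), (3, 2))}
splits = split_losing_border(0, score, neighbors, bounds)
print(splits)
```

Splitting only the border of the losing region is what keeps refinement concentrated where it can actually change the minimax outcome, rather than everywhere the system happens to be stuck.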
\n\nFigure 5: Starting inside the top left barrier.\n\nFigure 6: The trial after that.\n\nFigure 7: A problem with a planar rod being guided past obstacles. The state space is three-dimensional: two values specify the position of the rod's center, and the third specifies the rod's angle from the horizontal. The angle is constrained so that the pole's dotted end must always be below the other end. The pole's center may be moved a short distance (up to 1/40 of the diagram width) and its angle may be altered by up to 5 degrees, provided it does not hit a barrier in the process. Parti-game converged to the path shown below after two trials. The partitioning lines on the solution diagram only show a 2-d slice of the full kd-tree. (An accompanying table recorded trials, steps, and partitions.)\n\nFigure 8: A puck sliding over a hilly surface (hills shown by contours below: the surface is bowl shaped, with the lowest points nearest the center, rising steeply at the edges). The state space is four-dimensional: two position and two velocity variables. The controls consist of a force which may be applied in any direction, but with bounded magnitude. Convergence time was two trials. (An accompanying table recorded trials, steps, and partitions; trial 1 took 2609 steps and trial 2 took 115.)\n\nFigure 9: A nine-degree-of-freedom planar robot must move from the shown start configuration to the goal. The solution entails curling, rotating and then uncurling. It may not intersect with any of the barriers, the edge of the workspace, or itself. Convergence occurred after seven trials. 
\n\n(The robot's base is fixed.)\n\nTrials:       1     2     3     4     5     6     7\nSteps:     1090   430   353   330   739   200    52\nPartitions:  41    66    67    69    78    85    85\n\n5 DISCUSSION\n\nPossible extensions include:\n\n\u2022 Splitting criteria that lay down splits between trajectories with spatially distinct outcomes.\n\u2022 Allowing humans to provide hints by permitting user-specified controllers (\"behaviors\") as extra high-level actions.\n\u2022 Coalescing neighboring partitions that mutually agree.\n\nWe finish by noting a promising sign involving a series of snake-robot experiments with different numbers of links (but fixed total length). Intuitively, the problem should get easier with more links, but the curse of dimensionality would mean that (in the absence of prior knowledge) it becomes exponentially harder. This is borne out by the observation that random exploration with the three-link arm will stumble on the goal eventually, whereas the nine-link robot cannot be expected to do so in tractable time. However, Figure 10 indicates that as the dimensionality rises, the amount of exploration (and hence computation) used by parti-game does not rise exponentially. Real-world tasks may often have the same property as the snake example: the complexity of the ultimate task remains roughly constant as the number of degrees of freedom increases. If so, we may have uncovered the Achilles' heel of the curse of dimensionality.\n\nFigure 10: The number of partitions finally created against degrees of freedom for a set of snake-like robots. 
The kd-trees built were all highly non-uniform, typically having maximum-depth nodes of twice the dimensionality. The relation between exploration time and dimensionality (not shown) had a similar shape.\n\nReferences\n\n[Barto et al., 1991] A. G. Barto, S. J. Bradtke, and S. P. Singh. Real-time Learning and Control using Asynchronous Dynamic Programming. Technical Report 91-57, University of Massachusetts at Amherst, August 1991.\n\n[Bellman, 1957] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.\n\n[Chapman and Kaelbling, 1991] D. Chapman and L. P. Kaelbling. Learning from Delayed Reinforcement in a Complex Domain. Technical Report, Teleos Research, 1991.\n\n[Dayan and Hinton, 1993] P. Dayan and G. E. Hinton. Feudal Reinforcement Learning. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993.\n\n[Friedman et al., 1977] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Trans. on Mathematical Software, 3(3):209-226, September 1977.\n\n[Moore and Atkeson, 1993] A. W. Moore and C. G. Atkeson. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. Machine Learning, 13, 1993.\n\n[Moore, 1991] A. W. Moore. Variable Resolution Dynamic Programming: Efficiently Learning Action Maps in Multivariate Real-valued State-spaces. In L. Birnbaum and G. Collins, editors, Machine Learning: Proceedings of the Eighth International Workshop. Morgan Kaufmann, June 1991.\n\n[Samuel, 1959] A. L. Samuel. Some Studies in Machine Learning using the Game of Checkers. IBM Journal on Research and Development, 3, 1959. Reprinted in E. A. Feigenbaum and J. Feldman, editors, Computers and Thought, McGraw-Hill, 1963. 
\n\n[Simons et al., 1982) J. Simons, H. Van Brussel, J. De Schutter, and J. Verhaert. A Self(cid:173)\n\nLearning Automaton with Variable Resolution for High Precision Assembly by Industrial \nRobots. IEEE Trans. on Automatic Control, 27(5):1109-1113, October 1982. \n\n[Singh, 1993] S. Singh. Personal Communication. \n[Sutton, 1984) R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. \n\n,1993. \n\nPhd. thesis, University of Massachusetts, Amherst, 1984. \n\n[Watkins, 1989] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD. Thesis, \n\nKing's College, University of Cambridge, May 1989. \n\n\f", "award": [], "sourceid": 742, "authors": [{"given_name": "Andrew", "family_name": "Moore", "institution": null}]}