{"title": "Robust, Efficient, Globally-Optimized Reinforcement Learning with the Parti-Game Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 961, "page_last": 967, "abstract": null, "full_text": "Robust. Efficient, Globally-Optimized \n\nReinforcement Learning with the \n\nParti-Game Algorithm \n\nMohammad A. AI-Ansari and Ronald J. Williams \n\nCollege of Computer Science,  161  CN \n\nNortheastern University \n\nBoston, MA 02115 \n\nalansar@ccs.neu.edu, rjw@ccs.neu.edu \n\nAbstract \n\nParti-game (Moore 1994a; Moore 1994b; Moore and Atkeson  1995) is a \nreinforcement learning (RL) algorithm that has a lot of promise in over(cid:173)\ncoming the curse of dimensionality that can plague RL algorithms when \napplied to high-dimensional problems.  In  this paper we  introduce mod(cid:173)\nifications to  the algorithm that further improve its performance and ro(cid:173)\nbustness. In addition, while parti-game solutions can be improved locally \nby standard local path-improvement techniques, we introduce an add-on \nalgorithm in  the same spirit  as  parti-game that instead tries  to  improve \nsolutions in a non-local manner. \n\n1  INTRODUCTION \n\nParti-game operates  on  goal  problems  by  dynamically partitioning the  space into hyper(cid:173)\nrectangular cells of varying sizes,  represented using a k-d  tree data structure.  It assumes \nthe existence of a pre-specified local controller that can be commanded to proceed from the \ncurrent state to a given state. The algorithm uses a game-theoretic approach to assign costs \nto cells based on past experiences using a minimax algorithm.  A cell's cost can be either \na finite positive integer or infinity.  The former represents the number of cells that have to \nbe traveled through to  get to  the goal  cell  and the latter represents the belief that there is \nno reliable way of getting from that cell to the goal.  Cells with a cost of infinity are called \nlosing cells while others are called winning ones. \n\nThe algorithm starts out with one cell representing the entire space and another, contained \nwithin it, representing the goal region.  In a typical step, the local controller is commanded \nto proceed to the center of the most promising neighboring cell.  Upon entering a neighbor(cid:173)\ning cell (whether the one aimed at or not), or upon failing to  leave the current cell within \n\n\f962 \n\nM  A.  AI-Ansari and R. J.  Williams \n\no \n\n\u2022 \u2022 !---:---:----:-...J.......:---! \n\n.-\n\ns .... \n\n. \n\n0 \n\ns .... \n\n0 \n\n~ .~ \n\nl  :~ \n\n0 \n\n(I) \n\n(0) \n\n(e) \n\n(d) \n\nFigure  I: In these mazes, the agent is required to stan from the point marked Stan and reach the square goal cell. \n\na timeout period,  the result of this attempt is  added to the database of experiences the al(cid:173)\ngorithm has collected, cell costs  are recomputed based  on  the updated database,  and  the \nprocess repeats.  The costs are computed using a Dijkstra-like, one-pass minimax version \nof dynamic programming. The algorithm terminates upon entering the goal cell. \n\nIf at  any  point the  algorithm  determines  that  it  can  not proceed  because the  agent  is  in \na  losing  cell,  each cell  lying  on  the  boundary between losing  and  winning cells  is  split \nacross the dimension in  which it is largest and all experiences involving cells that are split \nare discarded.  
2  PARTITIONING ONLY LOSING CELLS \n\nThe win-lose boundary mentioned above represents a barrier that the algorithm perceives to be preventing the agent from reaching the goal. The reason for partitioning cells along this boundary is to increase the resolution in the areas that are crucial to reaching the goal, thus creating more regions along the boundary for the agent to try to get through. By partitioning on both sides of the boundary, parti-game guarantees that neighboring cells along the boundary remain close in size. Along with the strategy of aiming towards centers of neighboring cells, this produces pairings of winner-loser cells that form proposed \"corridors\" for the agent to try to go through to penetrate the barrier it perceives. \n\nIn this section we investigate doing away with partitioning on the winning side and partitioning only losing cells. Because partitioning can only be triggered with the agent on the losing side of the win-lose boundary, partitioning only losing cells still gives the agent the same kind of access to the boundary through the newly formed cells. However, it results in a size disparity between winner- and loser-side cells and thus does not produce the winner side of the pairings mentioned above. To produce an effect similar to the pairings of parti-game, we change the aiming strategy of the algorithm. Under the new strategy, when the agent decides to go from the cell it currently occupies to a neighboring one, it aims towards the center point of the common surface between the two cells. While this does not reproduce the exact line of motion of the original aiming strategy, it achieves a very similar objective. \n\nParti-game's success in high-dimensional problems stems from its variable resolution strategy, which partitions finely only in regions where fine partitioning is needed. By limiting partitioning to losing cells only, we hope to increase the resolution in even fewer parts of the state space and thereby make the algorithm even more efficient. \n
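The new aim point is straightforward to compute for axis-aligned hyperrectangular cells. The sketch below, again our illustration with an assumed array-based cell representation, returns the center of the common surface of two abutting cells: in every dimension the midpoint of the overlap interval, which in the abutting dimension degenerates to the shared face coordinate. \n\n
import numpy as np

def aim_point(lo_a, hi_a, lo_b, hi_b):
    # Center of the common surface between two abutting axis-aligned
    # cells, each given as (low corner, high corner) coordinate arrays.
    lo_a, hi_a = np.asarray(lo_a, float), np.asarray(hi_a, float)
    lo_b, hi_b = np.asarray(lo_b, float), np.asarray(hi_b, float)
    lo = np.maximum(lo_a, lo_b)  # overlap interval per dimension
    hi = np.minimum(hi_a, hi_b)  # degenerate where the cells abut
    return (lo + hi) / 2.0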
To compare the performance of parti-game to the modified algorithm, we applied both algorithms to the set of continuous mazes shown in Figure 1. For all maze problems we used a simple local controller that can move directly toward the specified target state. We also applied both algorithms to the non-linear dynamics problem of the ice puck on a hill, depicted in Figure 2, which has been studied extensively in the reinforcement learning literature; here we used a local controller very similar to the one described in Moore and Atkeson (1995). Finally, we applied the algorithms to the nine-degree-of-freedom planar robot introduced in Moore and Atkeson (1995) and shown in Figure 3, using the same local controller described there. Additional results on the Acrobot problem (Sutton and Barto 1998) are not included here for space limitations but can be found in Al-Ansari and Williams (1998). \n\nFigure 2: An ice puck on a hill. The puck can thrust horizontally to the left and to the right with a maximum force of 1 Newton. The state space is two-dimensional, consisting of the horizontal position and velocity. The agent starts at the position marked Start at velocity zero, and its goal is to reach the position marked Goal at velocity zero. Maximum thrust is not adequate to get the puck up the ramp, so it has to learn to move to the left first to build up momentum. \n
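For reference, a plausible simulation step for the puck is sketched below. The paper does not give the equations of motion, so the hill profile and the slope-projected dynamics here are textbook assumptions, not the authors' implementation. \n\n
import numpy as np

G = 9.81  # gravity; a unit puck mass is assumed

def hill_slope(x):
    # Assumed hill gradient, flat to the left of the ramp and rising
    # to its right; purely illustrative, not the paper's profile.
    return 0.0 if x < 0.0 else 2.0 * x / (1.0 + x * x)

def puck_step(x, v, u, dt=0.01):
    # One Euler step of a frictionless puck on a slope, with the
    # horizontal thrust u clipped to the paper's 1 Newton limit.
    u = float(np.clip(u, -1.0, 1.0))
    s = hill_slope(x)
    a = (u - G * s) / (1.0 + s * s)  # slope-projected acceleration
    return x + v * dt, v + a * dt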
Figure 3: A nine-degree-of-freedom, snake-like arm that moves in a plane and is fixed at one tip. The objective is to move the arm from the start configuration to the goal one, which requires curling and uncurling to avoid the barrier and the wall. \n\nWe applied both algorithms to each of these problems, in each case performing as many trials as were needed for the solution to stabilize. The agent was placed back in the start state at the end of each trial. In the puck problem, the agent was also reset to the start state whenever it hit either of the barriers at the bottom and top of the slope. The results are shown in Table 1, which compares the number of trials needed, the number of partitions, the total number of steps taken in the world, and the length of the final trajectory. \n\nThe table shows that the new algorithm indeed resulted in fewer total partitions in all problems. It also improved in all problems in the number of trials required for stabilization. It improved in all but one problem (maze d) in the length of the final trajectory, and there the difference in length is very small. Finally, it resulted in fewer total steps taken in three of the six problems, but the total steps taken increased in the remaining three. \n\nTo see the effect of the modification in detail, we show the results of applying parti-game and the modified algorithm to the maze of Figure 1(a) in Figures 4(a) and 4(b), respectively. We can see how the areas of higher resolution are more localized in Figure 4(b). \n\nFigure 4: The final trial of applying the various algorithms to the maze of Figure 1(a): (a) parti-game, (b) parti-game partitioning only losing cells, and (c) parti-game partitioning only the largest losing cells. \n\n3  BALANCED PARTITIONING \n\nUpon close observation of Figure 4(a), we see that parti-game partitions very finely along the right wall of the maze. This behavior is seen even more clearly in parti-game's solution to the maze of Figure 1(d), which is a simple maze with a single barrier between the start state and the goal. As we see in Table 1, parti-game has a very hard time reaching the goal in this maze. Figure 5 shows the 1194 partitions that parti-game generated in trying to reach the goal. We can see that partitioning along the barrier is very uneven, being extremely fine near the goal and growing coarser as the distance from the goal increases. Putting higher focus on places where the highest gain could be attained if a hole is found can be a desirable feature, but what happens in cases like this one is obviously excessive. \n\nFigure 5: Parti-game needed 1194 partitions to reach the goal in the maze of Figure 1(d). \n\nOne of the factors contributing to this problem of continuing to search at ever-higher resolutions in the part of the barrier nearest the goal is that every version of parti-game searches for solutions using an implicit trade-off between the shortness of a potential solution path and the resolution required to find that path. Only when the resolution becomes so fine that the number of cells through which the agent would have to pass in a potential shortcut exceeds the number of cells to be traversed when traveling around the barrier is the algorithm forced to look elsewhere for the actual opening. \n\nA conceptually appealing way to bias this search is to maintain a more explicitly coarse-to-fine search strategy. One way to do this is to try to keep the smallest cell size the algorithm generates as large as possible. In addition to achieving the balance we are seeking, this tends to lower the total number of partitions and results in shallower tree structures for representing the state space, which, in turn, yields higher efficiency. \n\nTo achieve these goals, we modified the algorithm from the previous section such that whenever partitioning is required, instead of partitioning all losing cells, we partition only those among them that are of maximum size. This has the effect of postponing splits that would lower the minimum cell size for as long as possible. The results of applying the modified algorithm to the test problems are also shown in Table 1. \n
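A sketch of this balanced partitioning step is given below. It follows our reading of the procedure; the use of volume as the measure of cell size and the helper names are assumptions. \n\n
import numpy as np

def boundary_losers(cells, cost, neighbors):
    # Losing cells on the win-lose boundary: infinite-cost cells with
    # at least one finite-cost neighbor.
    return [c for c in cells
            if cost[c] == float('inf')
            and any(cost[n] < float('inf') for n in neighbors(c))]

def split_largest(cells, losers):
    # Split only the largest losing boundary cells (losers is non-empty
    # whenever splitting is triggered), each across its longest
    # dimension, postponing any reduction of the minimum cell size.
    sizes = {c: float(np.prod(cells[c][1] - cells[c][0])) for c in losers}
    biggest = max(sizes.values())
    new_cells = []
    for c in losers:
        if sizes[c] < biggest:
            continue
        lo, hi = cells[c]
        d = int(np.argmax(hi - lo))      # longest dimension
        mid = (lo[d] + hi[d]) / 2.0
        hi1, lo2 = hi.copy(), lo.copy()
        hi1[d] = mid
        lo2[d] = mid
        new_cells.append((lo, hi1))      # lower half
        new_cells.append((lo2, hi))      # upper half
    return new_cells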
Problem | Algorithm | Trials | Partitions | Total Steps | Final Trajectory Length \nmaze a | original parti-game | 3 | 444 | 35131 | 279 \nmaze a | partition losing side | 3 | 239 | 16652 | 256 \nmaze a | partition largest losing | 3 | 27 | 1977 | 270 \nmaze b | original parti-game | 6 | 98 | 5180 | 183 \nmaze b | partition losing side | 5 | 76 | 7187 | 175 \nmaze b | partition largest losing | 6 | 76 | 5635 | 174 \nmaze c | original parti-game | 3 | 176 | 7768 | 416 \nmaze c | partition losing side | 2 | 120 | 10429 | 165 \nmaze c | partition largest losing | 2 | 96 | 6803 | 165 \nmaze d | original parti-game | 2 | 1194 | 553340 | 149 \nmaze d | partition losing side | 2 | 350 | 18639 | 155 \nmaze d | partition largest losing | 2 | 21 | 1469 | 165 \npuck | original parti-game | 6 | 80 | 6764 | 240 \npuck | partition losing side | 2 | 18 | 3237 | 151 \npuck | partition largest losing | 2 | 18 | 3237 | 151 \nnine-joint arm | original parti-game | 25 | 104 | 2970 | 58 \nnine-joint arm | partition losing side | 17 | 61 | 3041 | 56 \nnine-joint arm | partition largest losing | 7 | 37 | 2694 | 112 \n\nTable 1: Results of applying original parti-game, parti-game partitioning only losing cells, and parti-game partitioning the largest losing cells to the six problem domains. Smaller numbers are better. \n\nComparing the results of this version of the algorithm to those of partitioning all losing cells on the win-lose boundary shows that it improves on parti-game's performance even further. It outperforms the previous algorithm in the total number of partitions required in four problems and ties it in the remaining two. It outperforms the previous algorithm in total steps taken in five problems and ties it in one. It improves in the number of trials needed to stabilize in one problem, ties the previous algorithm in four cases, and ties parti-game in the remaining one. In the length of the final trajectory, partitioning the largest losing cells does better in one case, ties partitioning only losing cells in two cases, and does worse in three. This last result is due to the generally larger partition sizes produced by the lower resolution of this algorithm; however, the increase in trajectory length is very small in all but the nine-joint arm problem. \n\nFigure 4(c) shows the result of applying the new algorithm to the maze of Figure 1(a). In contrast to the other two algorithms depicted in the same figure, the new algorithm partitions very uniformly around the barrier. In addition, it requires the fewest partitions and total steps of the three algorithms. Figure 6 shows that the new algorithm vastly outperforms parti-game on the maze of Figure 1(d). Here, too, it partitions very evenly around the barrier and finds the goal very quickly, requiring far fewer steps and partitions. \n\nFigure 6: The result of partitioning the largest cells on the losing side in the maze of Figure 1(d). Only two trials are required to stabilize. The first requires 1304 steps and 21 partitions; the second adds no new partitions and produces a path of only 165 steps. \n\n4  GLOBAL PATH IMPROVEMENT \n\nParti-game does not claim to find optimal solutions. As we see in Figure 4, parti-game and the two modified algorithms settle on the longer of the two possible routes to the goal in this maze. In this section we investigate ways to improve parti-game so that it finds paths of optimal form. It is important to note that we are not seeking paths that are optimal, since that is not possible to achieve with the cell shapes and aiming strategies used here. By a path of optimal form we mean a path that could be continuously deformed into an optimal path. \n
4.1  OTHER GRADIENTS \n\nAs mentioned above, parti-game partitions only when the agent has no winning cells to aim for, and the only cells partitioned are those that lie on the win-lose boundary. The win-lose boundary falls on the gradient between finite- and infinite-cost cells, and it appears when the algorithm knows of no reliable way to get to the goal. Consistently partitioning along this gradient guarantees that the algorithm will eventually find a path to the goal, if one exists. \n\nHowever, gradients across which the difference in cost is finite also exist in a state space partitioned by parti-game (or any of the variants introduced in this paper). Like the win-lose boundary, these gradients are boundaries through which the agent does not believe it can move directly. Although finding an opening in such a boundary is not essential to reaching the goal, these boundaries do represent potential shortcuts that might improve the agent's policy. Any gradient with a difference in cost of two or more is the location of such a potentially useful shortcut. \n\nBecause such gradients appear throughout the space, we need to be selective about which ones to partition along. There are many possible strategies one might consider for incorporating these ideas into parti-game. For example, since parti-game focuses on the highest gradients only, the first thing that comes to mind is to follow in parti-game's footsteps and assign partitioning priorities to cells along gradients based on the differences in values across those gradients. However, since the true cost function typically has discontinuities, the effect of such a strategy would be to continue refining the partitioning indefinitely along such a discontinuity in a vain search for a nonexistent shortcut. \n\n4.2  THE ALGORITHM \n\nA much better idea is to pick cells to partition in a way that achieves balanced partitioning, following the rationale we introduced in section 3. Again, such a strategy results in a uniform coarse-to-fine search for better paths along these other gradients. \n\nThe following discussion could, in principle, apply to any of the three forms of parti-game studied up to this point. Because of the superior behavior of the version in which we partition the largest cells on the losing side, this is the specific version we report on here, and we use the term modified parti-game to refer to it. \n\nWe incorporated partitioning along other gradients as follows. At the end of any trial in which the agent is able to go from the start state to the goal without any unexpected results of its aiming attempts, we partition the largest \"losing cells\" (i.e., higher-cost cells) that fall on any gradient across which costs differ by more than one. Because data about experiences involving cells that are partitioned is discarded, the next time modified parti-game is run, the agent will try to go through the newly formed cells in search of a shortcut. \n
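As a sketch, the candidate cells for this step can be identified as below. This is our illustration: the cost table and neighbors function are assumed to come from the minimax computation of section 1, and the selected cells would then be fed to the same largest-first splitting used in section 3. \n\n
def shortcut_candidates(cells, cost, neighbors):
    # Winning (finite-cost) cells sitting on a gradient of two or more:
    # the higher-cost side of each such gradient marks a potential
    # shortcut worth probing once a trial completes without surprises.
    return [c for c in cells
            if cost[c] != float('inf')
            and any(cost[n] != float('inf') and cost[c] - cost[n] >= 2
                    for n in neighbors(c))]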
This algorithm amounts to simply running modified parti-game until a stable solution is reached. At that point, it introduces new cells along some of the other gradients, and when it is subsequently run, modified parti-game is applied again until stabilization is achieved, and so on. The results of applying this algorithm to the maze of Figure 1(a) are shown in Figure 7. As we can see, the algorithm finds the better solution by increasing the resolution around the relevant part of the barrier above the start state. \n\nFigure 7: The solution found by applying the global improvement algorithm to the maze of Figure 1(a). The solution proceeded exactly like that of the algorithm of section 3 until the solution of Figure 4(c) was reached. After that, eight additional iterations were needed to find the better trajectory, resulting in 22 additional partitions, for a total of 49. \n\nIn the absence of information about the form of the optimal trajectory, there is no natural termination criterion for this algorithm. It is designed to be run continually in search of better solutions. If, however, the form of the optimal solution is known in advance, the extra partitioning could be turned off after such a solution is found. \n\n5  CONCLUSIONS \n\nIn this paper we have presented three successive modifications to parti-game. The combination of the first two appears to improve its robustness and efficiency, sometimes dramatically, and generally yields better solutions. The third provides a novel way of performing non-local search for higher-quality solutions that are closer to optimal. \n\nAcknowledgments \n\nMohammad Al-Ansari acknowledges the continued support of King Saud University, Riyadh, Saudi Arabia, and the Saudi Arabian Cultural Mission to the U.S.A. \n\nReferences \n\nAl-Ansari, M. A. and R. J. Williams (1998). Modifying the parti-game algorithm for increased robustness, higher efficiency and better policies. Technical Report NU-CCS-98-13, College of Computer Science, Northeastern University, Boston, MA. \n\nMoore, A. W. (1994a). Variable resolution reinforcement learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems. Center for Systems Science, Yale University. \n\nMoore, A. W. (1994b). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state spaces. In Advances in Neural Information Processing Systems 6. Morgan Kaufmann. \n\nMoore, A. W. and C. G. Atkeson (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning 21. \n\nSutton, R. S. and A. G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.", "award": [], "sourceid": 1550, "authors": [{"given_name": "Mohammad", "family_name": "Al-Ansari", "institution": null}, {"given_name": "Ronald", "family_name": "Williams", "institution": null}]}