{"title": "Robust Reinforcement Learning in Motion Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 655, "page_last": 662, "abstract": null, "full_text": "Robust Reinforcement Learning in Motion Planning \n\nSatinder P. Singh* \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \nsingh@psyche.mit.edu \n\nAndrew G. Barto, Roderic Grupen, and Christopher Connolly \n\nDepartment of Computer Science \n\nUniversity of Massachusetts \n\nAmherst, MA 01003 \n\nAbstract \n\nWhile exploring to find better solutions, an agent performing on-line reinforcement learning (RL) can perform worse than is acceptable. In some cases, exploration might have unsafe, or even catastrophic, results, often modeled in terms of reaching 'failure' states of the agent's environment. This paper presents a method that uses domain knowledge to reduce the number of failures during exploration. This method formulates the set of actions from which the RL agent composes a control policy to ensure that exploration is conducted in a policy space that excludes most of the unacceptable policies. The resulting action set has a more abstract relationship to the task being solved than is common in many applications of RL. Although the cost of this added safety is that learning may result in a suboptimal solution, we argue that this is an appropriate tradeoff in many problems. We illustrate this method in the domain of motion planning. \n\n*This work was done while the first author was finishing his Ph.D. in computer science at the University of Massachusetts, Amherst. 
\n\n655 \n\n\f656 \n\nSingh, Barto, Grupen, and Connolly \n\nAn agent using reinforcement learning (RL) (Sutton et al., 1991; Barto et al., to appear) to approximate solutions to optimal control problems has to search, or explore, to improve its policy for selecting actions. Although exploration does not directly affect performance in off-line learning with a model of the environment (Moore & Atkeson, 1993), exploration in on-line learning can lead the agent to perform worse than is acceptable. In some cases, exploration might have unsafe, or even catastrophic, results, often modeled in terms of reaching 'failure' states of the agent's environment. To make on-line RL more practical, especially if it involves expensive hardware, task-specific minimal levels of performance should be ensured during learning, a topic not addressed by prior RL research. \n\nAlthough the need for exploration cannot be entirely removed, domain knowledge can sometimes be used to define the set of actions from which the RL agent composes a control policy so that exploration is conducted in a space that excludes most of the unacceptable policies. We illustrate this approach using a simulated dynamic mobile robot in two different environments. \n\n1 Closed-loop policies as actions \n\nRL agents search for optimal policies in a solution space determined in part by the set of actions available to the agent. With a few exceptions (e.g., Mahadevan & Connell, 1990; Singh, 1992), researchers have formulated RL tasks with actions that are primitive in the sense that they are low-level, are available in every state, are executed open-loop, and last a single time step. We propose that this is an arbitrary, and self-imposed, restriction, and that in general the set of actions can have a much more abstract relationship to the problem being solved. 
Specifically, what are considered 'actions' by the RL algorithm can themselves be closed-loop control policies that meet important subgoals of the task being solved. \n\nIn this paper, the following general advantages afforded by using closed-loop policies as actions are demonstrated in the domain of motion planning: \n\n1. It is possible to design actions to meet certain hard constraints so that RL maintains acceptable performance while simultaneously improving performance over time. \n\n2. It is possible to design actions so that the action space for the learning problem has fewer dimensions than the actual dimension of the physical action space. \n\nThe robustness and greatly accelerated learning resulting from the above factors can more than offset the cost of designing the actions. However, care has to be taken in defining the action space to ensure that the resulting policy space contains at least one policy that is close to optimal. \n\n2 RL with Dirichlet and Neumann control policies \n\nThe motion planning problem arises from the need to give an autonomous robot the ability to plan its own motion, i.e., to decide what actions to execute in order to achieve a task specified by initial and desired spatial arrangements of objects. \n\nFirst consider geometric path planning, i.e., the problem of finding safe paths for a robot with no dynamical constraints in an environment with stationary obstacles. A safe path in our context is one that avoids all obstacles and terminates in a desired configuration. Connolly (1992) has developed a method that generates safe paths by solving Laplace's equation in configuration space with boundary conditions determined by obstacle and goal configurations (also see Connolly & Grupen, 1993). 
\nLaplace's equation is the partial differential equation \n\n$\nabla^2 \phi = \sum_{i=1}^{n} \frac{\partial^2 \phi}{\partial x_i^2} = 0$, (1) \n\nwhose solution is a harmonic function, $\phi$, with no interior local minima. In practice, a finite-difference approximation to Equation 1 is solved numerically via Gauss-Seidel relaxation on a mesh over configuration space. Safe paths are generated by gradient descent on the resulting approximate harmonic function. In the general motion planning problem, we are interested in finding control policies that not only keep the robot safe but also extremize some performance criterion, e.g., minimum time, minimum jerk, etc. \n\nTwo types of boundary conditions, called Dirichlet and Neumann boundary conditions, give rise to two different harmonic functions, $\phi_D$ and $\phi_N$, that generate different types of safe paths. Robots following paths generated from $\phi_D$ tend to be repelled perpendicularly from obstacles. In contrast, robots following paths generated from $\phi_N$ tend to skirt obstacles by moving parallel to their boundaries. Although the state space in the motion planning problem for a dynamic robot in a planar environment is $\{x, \dot{x}, y, \dot{y}\}$, harmonic functions are derived in two-dimensional position space. These functions are inexpensive to compute (relative to the expense of solving the optimal control problem) because they are independent of the robot dynamics and criterion of optimal control. 
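The relaxation and gradient-descent steps just described can be sketched as follows. This is a minimal illustration on a coarse 2-D grid, not the authors' implementation; all function names are ours, and only the Dirichlet-style boundary condition (obstacle and outer-wall cells clamped to a high potential, the goal clamped to zero) is shown:

```python
import numpy as np

def dirichlet_harmonic(occupancy, goal, iters=2000):
    """Gauss-Seidel relaxation for Laplace's equation on a 2-D grid.
    Obstacle and outer-boundary cells stay clamped at 1 (high
    potential), the goal cell at 0; each sweep replaces every free
    interior cell by the average of its four neighbours, so phi
    converges to a discrete harmonic function with no interior
    local minima."""
    phi = np.ones_like(occupancy, dtype=float)
    gy, gx = goal
    phi[gy, gx] = 0.0
    free = (occupancy == 0)
    free[gy, gx] = False          # the goal value stays clamped too
    for _ in range(iters):
        for y in range(1, phi.shape[0] - 1):
            for x in range(1, phi.shape[1] - 1):
                if free[y, x]:
                    phi[y, x] = 0.25 * (phi[y - 1, x] + phi[y + 1, x] +
                                        phi[y, x - 1] + phi[y, x + 1])
    return phi

def descend(phi, start, steps=200):
    """Generate a safe path by discrete gradient descent on phi:
    repeatedly step to the lowest-valued neighbour until the goal
    (the unique zero of phi) is reached."""
    path, (y, x) = [start], start
    for _ in range(steps):
        y, x = min([(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)],
                   key=lambda p: phi[p])
        path.append((y, x))
        if phi[y, x] == 0.0:
            break
    return path
```

Because every free cell of a discrete harmonic function equals the average of its neighbours, some neighbour is always strictly smaller, so the descent never stalls at an interior cell and never steps onto the high-potential boundary. A Neumann solution would instead impose a zero normal derivative at obstacle boundaries, which this sketch omits.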
The closed-loop control policies that follow the gradient of the Dirichlet or Neumann solutions, respectively denoted $\pi_D$ and $\pi_N$, are defined approximately as follows: $\pi_D(s) = \nabla\phi_D(\bar{s})$, and $\pi_N(s) = \nabla\phi_N(\bar{s})$, where $\bar{s}$ is the projection of the state $s$ onto position space.1 \n\n1 In practice, the gradients of the harmonic functions act as reference signals to a PD-controller. \n\nInstead of formulating the motion planning problem as an RL task in which a control policy maps states into primitive control actions, consider the formulation in which a policy maps each state $s$ to a mixing parameter $k(s)$ so that the actual action is $[1 - k(s)]\pi_D(s) + k(s)\pi_N(s)$, where $0 \le k(s) \le 1$. Figure 1B shows the structure of this kind of policy. In Appendix B, we present conditions guaranteeing that for a robot with no dynamical constraints, this policy space contains only acceptable policies, i.e., policies that cause the robot to reach the goal configuration without contacting an obstacle. Although this guarantee does not strictly hold when the robot has dynamical constraints, e.g., when there are bounds on acceleration, this formulation still reduces the risk of unacceptable behavior. \n\n3 Simulation Results \n\nIn this paper we present a brief summary of simulation results for the two environments shown in Figures 2A and 3A. See Singh (1993) for details. The first environment consists of two rooms connected by a corridor. The second environment is a horseshoe-shaped corridor. The mobile robot is simulated as a unit mass that can accelerate in any direction. The only dynamical constraint is a bound on the maximum acceleration. \n\nFigure 1: Q-learning Network and Policy Structure. Panel A: 2-layer Connectionist Network Used to Store Q-values. Network inversion was used to find the maximum Q-value (Equation 2) at any state and the associated greedy action. The hidden layer consists of radial-basis units. Panel B: Policy Structure. The agent has to learn a mapping from state $s$ to a mixing coefficient $0 \le k(s) \le 1$ that determines the proportion in which to mix the actions specified by the pure Dirichlet and Neumann policies. \n\nThe learning task is to approximate minimum-time paths from every point inside the environment to the goal region without contacting the boundary wall. A reinforcement learning algorithm called Q-learning (Watkins, 1989) (see Appendix A) was used to learn the mixing function, $k$. Figure 1A shows the 2-layer neural network architecture used to store the Q-values. The robot was trained in a series of trials, each trial starting with the robot placed at a randomly chosen state and ending when the robot enters the goal region. The points marked by stars in Figures 2A and 3A were the starting locations for which statistics were collected to produce learning curves. \n\nFigures 2B, 2C, 3A and 3B show three robot trajectories from two randomly chosen start states; the black-filled circles mark the Dirichlet trajectory (labeled D), the white-filled circles mark the Neumann trajectory (labeled N), and the grey-filled circles mark the trajectory after learning (labeled Q). Trajectories are shown by taking snapshots of the robot at every time step; the velocity of the robot can be judged by the spacing between successive circles on the trajectory. 
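The policy structure of Figure 1B and the Q-learning update of Appendix A can be sketched together as follows. This is a simplified sketch: the paper stores Q-values in a radial-basis network and finds greedy actions by network inversion, whereas here the mixing coefficient is discretized and Q is a lookup table (our assumption), with the two gradient policies passed in as functions:

```python
import numpy as np

# The paper learns a continuous k(s) with an RBF network; a small set
# of candidate mixing coefficients and a tabular Q-function are a
# simplifying assumption here.
K_VALUES = np.linspace(0.0, 1.0, 5)

def mixed_action(s, k, pi_d, pi_n):
    """Actual control: the convex mix [1 - k(s)]*pi_D(s) + k(s)*pi_N(s)."""
    return (1.0 - k) * pi_d(s) + k * pi_n(s)

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step (Watkins, 1989): move Q(s, a) toward
    the payoff plus the discounted best Q-value of the next state."""
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Whatever value of k the learner picks, the resulting action is still a convex combination of two safe gradient policies, which is what keeps exploration inside the acceptable policy space.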
Figure 2D shows the mixing function for zero-velocity states in the two-room environment, while Figure 3C shows the mixing function for zero-velocity states in the horseshoe environment. The darker the region, the higher the proportion of the Neumann policy in the mixture. In the two-room environment, the agent learns to follow the Neumann policy in the left-hand room and to follow the Dirichlet policy in the right-hand room. \n\nFigure 2E shows the average time to reach the goal region as a function of the number of trials in the two-room environment. The solid-line curve shows the performance of the Q-learning algorithm. The horizontal lines show the average time to reach the goal region for the designated pure policies. Figure 3D similarly presents the results for the horseshoe environment. Note that in both cases the RL agent learns a policy that is better than either the pure Dirichlet or the pure Neumann policies. The relative advantage of the learned policy is greater in the two-room environment than in the horseshoe environment. \n\nOn the two-room environment we also compared Q-learning using harmonic functions, as described above, with Q-learning using primitive accelerations of the mobile robot as actions. The results are summarized along three dimensions: a) speed of learning: the latter system took more than 20,000 trials to achieve the same level of performance achieved by the former in 100 trials; b) safety: the simulated robot using the latter system crashed into the walls more than 200 times; and c) asymptotic performance: the final solution found by the latter system was 6% better than the one found by the former. \n\n4 Conclusion \n\nOur simulations show how an RL system is capable of maintaining acceptable performance while simultaneously improving performance over time. 
A secondary motivation for this work was to correct the erroneous impression that the proper, if not the only, way to formulate RL problems is with low-level actions. Experience on large problems formulated in this fashion has contributed to the perception that RL algorithms are hopelessly slow for real-world applications. We suggest that a more appropriate way to apply RL is as a \"component technology\" that uses experience to improve on partial solutions that have already been found through either analytical techniques or the cumulative experience and intuitions of the researchers themselves. The RL framework is more abstract, and hence more flexible, than most current applications of RL would lead one to believe. Future applications of RL should more fully exploit the flexibility of the RL framework. \n\nA Q-learning \n\nOn executing action $a$ in state $s_t$ at time $t$, the following update on the Q-value function is performed: \n\n$Q(s_t, a) \leftarrow (1 - \alpha) Q(s_t, a) + \alpha \left[ R(s_t, a) + \gamma \max_{b} Q(s_{t+1}, b) \right]$, (2) \n\nwhere $R(s_t, a)$ is the payoff, $0 \le \gamma \le 1$ is the discount factor, and $\alpha$ is a learning rate parameter. See Watkins (1989) for further details. \n\nFigure 2: Results for the Two-Room Environment. Panel A: Two-Room Environment. The stars mark the starting locations for which statistics were computed. Panel B: Sample Trajectories from one Starting Location. The black-filled circles labeled D show a pure Dirichlet trajectory, the white-filled circles labeled N show a pure Neumann trajectory, and the grey-filled circles labeled Q show the trajectory after learning. 
The trajectories are shown by taking snapshots at every time step; the velocity of the robot can be judged by the distance between successive points on the trajectory. Panel C: Three Sample Trajectories from a Different Starting Location. Panel D: Mixing Function Learned by the Q-learning Network for Zero-Velocity States. The darker the region, the higher the proportion of the Neumann policy in the resulting mixture. Panel E: Learning Curve. The curve plots the time taken by the robot to reach the goal region, averaged over the locations marked with stars in Panel A, as a function of the number of Q-learning trials. The dashed line shows the average time using the pure Neumann policy; the dotted line for the pure Dirichlet policy; and the solid line for Q-learning. The mixed policy formed by Q-learning rapidly outperforms both pure harmonic function policies. \n\n[Figure 3: Results for the Horseshoe Environment; the panels show the environment with its goal and starting locations, sample D, N, and Q trajectories, the learned mixing function, and the learning curve for Q-learning versus the pure Neumann and Dirichlet policies.] \n\nB \n\nFor a robot with no dynamical constraints, the mixed policy is safe as long as the mixed gradient $[1 - k(s)]\nabla\phi_D(s) + k(s)\nabla\phi_N(s) \neq 0$. The only way it can vanish is if, $\forall i$, \n\n$\frac{[\nabla\phi_D(s)]_i}{[\nabla\phi_N(s)]_i} = -\frac{k(s)}{1 - k(s)}$, (3) \n\nwhere $[\cdot]_i$ is the $i$th component of vector $[\cdot]$. The algorithm can explicitly check for that possibility and prevent it. 
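The explicit check suggested above can be sketched as follows; the small perturbation applied to k when the mixed gradient (nearly) vanishes is our own illustrative choice, not a rule taken from the paper:

```python
import numpy as np

def safe_mix(grad_d, grad_n, k, eps=1e-8):
    """Return the mixed gradient, nudging k away from the (measure-zero)
    value at which the convex combination vanishes, i.e. at which
    Equation 3 holds in every component."""
    v = (1.0 - k) * grad_d + k * grad_n
    if np.linalg.norm(v) < eps:       # (1 - k)*grad_d == -k*grad_n
        k = min(k + 0.01, 1.0)        # illustrative perturbation of k
        v = (1.0 - k) * grad_d + k * grad_n
    return v, k
```

As the text notes, finite-precision arithmetic makes the vanishing case extremely unlikely in practice, so the perturbation branch almost never fires.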
Alternatively, note that due to the finite precision in any practical implementation, it is extremely unlikely that Equation 3 will hold in any state. Also note that $\pi(s)$ for any point $s$ on the boundary will always point away from the boundary because it is the convex sum of two vectors, one of which is normal to the boundary, and the other of which is parallel to the boundary. \n\nAcknowledgements \n\nThis work was supported by a grant ECS-9214866 to Prof. A. G. Barto from the National Science Foundation, and by grants IRI-9116297, IRI-9208920, and CDA-8922572 to Prof. R. Grupen from the National Science Foundation. \n\nReferences \n\nBarto, A. G., Bradtke, S. J., & Singh, S. P. (to appear). Learning to act using real-time dynamic programming. Artificial Intelligence. \n\nConnolly, C. (1992). Applications of harmonic functions to robotics. In The 1992 International Symposium on Intelligent Control. IEEE. \n\nConnolly, C. & Grupen, R. (1993). On the applications of harmonic functions to robotics. Journal of Robotic Systems, 10(7), 931-946. \n\nMahadevan, S. & Connell, J. (1990). Automatic programming of behavior-based robots using reinforcement learning. Technical report, IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY. \n\nMoore, A. W. & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13(1). \n\nSingh, S. P. (1992). Transfer of learning by composing solutions for elemental sequential tasks. Machine Learning, 8(3/4), 323-339. \n\nSingh, S. P. (1993). Learning to Solve Markovian Decision Processes. PhD thesis, Department of Computer Science, University of Massachusetts. Also CMPSCI Technical Report 93-77. \n\nSutton, R. S., Barto, A. G., & Williams, R. J. (1991). Reinforcement learning is direct adaptive optimal control. In Proceedings of the American Control Conference, pages 2143-2146, Boston, MA. 
\n\nWatkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England. \n", "award": [], "sourceid": 843, "authors": [{"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}, {"given_name": "Roderic", "family_name": "Grupen", "institution": null}, {"given_name": "Christopher", "family_name": "Connolly", "institution": null}]}