{"title": "Obstacle Avoidance through Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 523, "page_last": 530, "abstract": null, "full_text": "Obstacle  Avoidance  through  Reinforcement \n\nLearning \n\nTony  J.  Prescott and  John  E.  W.  Maybew \nArtificial Intelligence and Vision Research Unit. \n\nUniversity of Sheffield. S 10 2TN. England. \n\nAbstract \n\nA  method  is  described  for  generating  plan-like.  reflexive.  obstacle \navoidance behaviour in a mobile robot. The experiments reported here \nuse  a  simulated  vehicle  with  a  primitive  range  sensor.  Avoidance \nbehaviour is encoded as a set of continuous functions of the  perceptual \ninput space. These functions are stored  using CMACs and trained by a \nvariant of Barto and Sutton's adaptive critic  algorithm.  As  the  vehicle \nexplores  its  surroundings it adapts  its  responses  to  sensory stimuli  so \nas  to  minimise  the  negative  reinforcement  arising  from  collisions. \nStrategies  for  local  navigation  are  therefore  acquired  in  an  explicitly \ngoal-driven  fashion.  The resulting  trajectories  form  elegant collision(cid:173)\nfree paths through the environment \n\n1  INTRODUCTION \nFollowing  Simon's  (1969)  observation  that  complex  behaviour  may  simply  be  the \nreflection  of a  complex  environment  a  number of researchers  (eg.  Braitenberg  1986. \nAnderson  and  Donath  1988.  Chapman  and  Agre  1987)  have  taken  the  view  that \ninteresting,  plan-like  behaviour  can  emerge  from  the  interplay  of a  set  of pre-wired \nreflexes  with  regularities  in  the  world.  However,  the  temporal  structure  in  an  agent's \ninteraction  with  its  environment can act as  more  than  just a  trigger for  fixed  reactions. \nGiven a suitable learning mechanism it can also be exploited to generate sequences of new \nresponses more  suited  to  the  problem  in  hand.  Hence,  this  paper attempts  to  show  that \nobstacle  avoidance.  a  basic  level  of navigation competence, can  be  developed  through \nlearning a set of conditioned  responses to perceptual stimuli. \n\nIn  the absence of a  teacher a  mobile robot can evaluate its performance only in  terms of \nfinal  outcomes.  A negative reinforcement signal can be generated each  time a collision \noccurs but this  information te]]s  the  robot neither when  nor how.  in  the  train of actions \npreceding the crash, a mistake was made.  In reinforcement learning this credit assignment \n\n523 \n\n\f524 \n\nPrescott and Mayhew \n\nproblem  is  overcome  by  forming  associations  between  sensory  input  patterns  and \npredictions  of  future  outcomes.  This  allows  the  generation  of  internal  \"secondary \nreinforcement\"  signals that can  be used  to select improved responses.  Several  authors \nhave discussed the use of reinforcement learning for navigation, this research is inspired \nprimarily by that of Barto, Sutton and co-workers (1981,  1982,  1983,  1989) and Werbos \n(1990).  The principles underlying reinforcement learning have recently been given a firm \nmathematical  basis  by  Watkins  (1989)  who  has  shown  that  these  algorithms  are \nimplementing  an  on-line,  incremental,  approximation  to  the  dynamic  programming \nmethod for detennining optimal control.  Sutton (1990) has also made use of these ideas \nin fonnulating a novel theory of classical conditioning in animal learning. 
3 LEARNING ALGORITHM

The sprite learns a multi-parameter policy (Π) and an evaluation (V). These functions are stored using the CMAC coarse-coding architecture (Albus 1971), and updated by a reinforcement learning algorithm similar to that described by Watkins (1989). The action functions comprising the policy are acquired as gaussian probability distributions using the method proposed by Williams (1988). The following gives a brief summary of the algorithm used.

Let x_t be the perceptual input pattern at time t and r_t the external reward; then the reinforcement learning error (see Barto et al., 1989) is given by

    \varepsilon_{t+1} = r_{t+1} + \gamma V_t(x_{t+1}) - V_t(x_t)    (1)

where γ is a constant (0 < γ < 1). This error is used to adjust V and Π by gradient descent, i.e.

    V_{t+1}(x) = \Pi_t(x) \text{ is updated as } V_{t+1}(x) = V_t(x) + \alpha\,\varepsilon_{t+1}\,m_t(x)    (2)
    \Pi_{t+1}(x) = \Pi_t(x) + \beta\,\varepsilon_{t+1}\,n_t(x)    (3)

where α and β are learning rates and m_t(x) and n_t(x) are the evaluation and policy eligibility traces for pattern x. The eligibility traces can be thought of as activity in short-term memory that enables learning in the LTM store. The minimum STM requirement is to remember the last input pattern and the exploration gradient Δa_t of the last action taken (explained below), hence

    m_{t+1}(x) = 1 \ \text{and}\ n_{t+1}(x) = \Delta a_t \quad \text{iff } x \text{ is the current pattern,}
    m_{t+1}(x) = n_{t+1}(x) = 0 \quad \text{otherwise.}    (4)

Learning occurs faster, however, if the memory trace of each pattern is allowed to decay slowly over time, with strength of activity being related to recency. Hence, if the rate of decay is given by λ (0 <= λ <= 1) then for patterns other than the current one

    m_{t+1}(x) = \lambda\,m_t(x) \ \text{and}\ n_{t+1}(x) = \lambda\,n_t(x).

Using a decay rate of less than 1.0 the eligibility trace for any input becomes negligible within a short time, so in practice it is only necessary to store a list of the most recent patterns and actions (in our simulations only the last four values are stored).
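In code, the whole update is compact. The sketch below is a minimal tabular stand-in for the full system: a plain dictionary replaces the CMAC store introduced below, the constants are illustrative, and grad_a is the four-vector exploration gradient Δa_t derived below; only the TD error of equation (1) and the λ-decayed four-step trace buffer follow the text directly.

    import collections

    ALPHA, BETA, GAMMA, LAM = 0.2, 0.1, 0.9, 0.5       # illustrative rates

    V = collections.defaultdict(float)                 # evaluation V(x)
    PI = collections.defaultdict(lambda: [0.0] * 4)    # four policy parameters
    trace = collections.deque(maxlen=4)                # last four (x, grad_a) pairs

    def learn_step(x_prev, x_next, reward, grad_a):
        # Equations (1)-(3) applied over the decaying eligibility traces.
        err = reward + GAMMA * V[x_next] - V[x_prev]   # TD error, equation (1)
        trace.appendleft((x_prev, grad_a))             # current pattern: m = 1, n = grad_a
        for age, (x, g) in enumerate(trace):
            m = LAM ** age                             # eligibility decays with recency
            V[x] += ALPHA * err * m                    # equation (2)
            PI[x] = [p + BETA * err * m * gi           # equation (3)
                     for p, gi in zip(PI[x], g)]

Here the patterns x are assumed to be hashable (e.g. tuples of binned ray depths); the deque of length four implements the observation that only the last four patterns and actions need be stored.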
The policy acquired by the learning system has two elements (f and ϑ) corresponding to the desired forward and angular velocities of the vehicle. Each element is specified by a gaussian pdf and is encoded by two adjustable parameters denoting its mean and standard deviation (hence the policy as a whole consists of four continuous functions of the input). In each time-step an action is chosen by selecting randomly from the two distributions associated with the current input pattern.

In order to update the policy the exploratory component of the action must be computed; this consists of a four-vector with two values for each gaussian element. Following Williams we define a standard gaussian density function g with parameters μ and σ and output y such that

    g(y, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y-\mu)^2 / 2\sigma^2}    (5)

The derivatives with respect to the mean and standard deviation¹ are then given by

    \Delta\mu = \frac{y - \mu}{\sigma^2} \quad \text{and} \quad \Delta\sigma = \frac{(y - \mu)^2 - \sigma^2}{\sigma^3}    (6)

The exploration gradient of the action as a whole is therefore the vector

    \Delta a_t = [\Delta\mu_f,\ \Delta\sigma_f,\ \Delta\mu_\vartheta,\ \Delta\sigma_\vartheta].

¹ In practice we use (ln σ) as the second adjustable parameter to ensure that the standard deviation of the gaussian never has a negative value (see Williams 1988 for details).
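A single gaussian policy element can be sketched directly from these formulas, storing ln σ as in the footnote above; the names and the sample parameter values are illustrative.

    import math
    import random

    def sample_action(mu, log_sigma):
        # Draw an action and return its exploration gradient terms.
        sigma = math.exp(log_sigma)
        y = random.gauss(mu, sigma)                     # explore around the mean
        d_mu = (y - mu) / sigma ** 2                    # equation (6)
        d_sigma = ((y - mu) ** 2 - sigma ** 2) / sigma ** 3
        return y, (d_mu, d_sigma * sigma)               # chain rule for d/d(ln sigma)

    # One element each for forward and angular velocity yields the
    # four-vector exploration gradient used in the update above.
    f, g_f = sample_action(5.0, math.log(1.0))
    theta, g_th = sample_action(0.0, math.log(0.3))
    grad_a = g_f + g_th     # [dmu_f, dsigma_f, dmu_theta, dsigma_theta]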
The four policy functions and the evaluation function are each stored using a CMAC table. This technique is a form of coarse-coding whereby the euclidean space in which a function lies is divided into a set of overlapping but offset tilings. Each tiling consists of regular regions of pre-defined size such that all points within each region are mapped to a single stored parameter. The value of the function at any point is given by the average of the parameters stored for the corresponding regions in all of the tilings. In our simulation each sensory dimension is quantised into five discrete bins, resulting in a 5x5x5 tiling; five tilings are overlaid to form each CMAC. If the input space is enlarged (perhaps by adding further sensors) the storage requirements can be reduced by using a hashing function to map all the tiles onto a smaller number of parameters. This is a useful economy when there are large areas of the state space that are visited rarely or not at all.
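A minimal sketch of such a store follows, with five offset tilings over the three log-scaled depths and the hashing trick just described; the input range, the offset scheme, and the table size are assumptions rather than details from the paper.

    NUM_TILINGS, BINS, TABLE_SIZE = 5, 5, 2048

    def tiles(x, lo=0.0, hi=7.0):
        # One active tile index per tiling for a 3-d input x.
        width = (hi - lo) / BINS
        indices = []
        for t in range(NUM_TILINGS):
            offset = t * width / NUM_TILINGS         # each tiling slightly shifted
            coords = tuple(min(BINS - 1, max(0, int((xi - lo + offset) / width)))
                           for xi in x)
            indices.append(hash((t, coords)) % TABLE_SIZE)  # hash tiles to a small table
        return indices

    class CMAC:
        def __init__(self):
            self.w = [0.0] * TABLE_SIZE
        def value(self, x):                          # average over all tilings
            return sum(self.w[i] for i in tiles(x)) / NUM_TILINGS
        def update(self, x, delta):                  # spread a change evenly
            for i in tiles(x):
                self.w[i] += delta / NUM_TILINGS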
4 EXPLORATION

In order for the sprite to learn useful obstacle avoidance behaviour it has to move around and explore its environment. If the sprite were rewarded simply for avoiding collisions, an optimal strategy would be to remain still or to stay within a small, safe, circular orbit. Therefore, to force the sprite to explore its world, a second source of reinforcement is used which is a function of its current forward velocity and encourages it to maintain an optimal speed. To further promote adventurous behaviour the initial policy over the whole state-space is for the sprite to have a positive speed. A system which has a high initial expectation of future rewards will settle less rapidly for a locally optimal solution than one with a low expectation. Therefore the value function is set initially to the maximum reward attainable by the sprite; a sketch of this reward scheme is given below.
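The paper does not give the exact form of the velocity reinforcement, so the shaping term in the sketch below, and its constants, are purely illustrative; only the negative reinforcement for collisions and the optimistic initial value follow the text.

    V_OPT, R_MAX = 10.0, 1.0    # hypothetical optimal speed and reward ceiling

    def reinforcement(collided, forward_velocity):
        if collided:
            return -1.0                                   # collisions are punished
        return R_MAX - abs(forward_velocity - V_OPT) / V_OPT  # reward near-optimal speed

    # Optimistic start: initialise the evaluation function everywhere to the
    # maximum attainable reward so locally optimal habits form less quickly.
    V_INIT = R_MAX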
Improved policies are found by deviating from the currently preferred set of actions. However, there is a trade-off to be made between exploiting the existing policy to maximise the short-term reward and experimenting with untried actions that have potentially negative consequences but may eventually lead to a better policy. This suggests that an annealing process should be applied to the degree of noise in the policy. In fact, the algorithm described above results in an automatic annealing process (Williams 1988) since the variance of each gaussian element decreases as the mean behaviour converges to a local maximum. However, the width of each gaussian can also increase, if the mean is locally sub-optimal, allowing for more exploratory behaviour. The final width of the gaussian depends on whether the local peak in the action function is narrow or flat on top. The behaviour acquired by the system is therefore more than a set of simple reflexes. Rather, for each circumstance, there is a range of acceptable actions which is narrow if the robot is in a tight corner, where its behaviour is severely constrained, but wider in more open spaces.

5 RESULTS

To test the effectiveness of the learning algorithm the performance of the sprite was compared before and after fifty thousand training steps on a number of simple environments. Over 10 independent runs² in the first environment shown in figure one, the average distance travelled between collisions rose from approximately 0.9m (1b) before learning to 47.4m (1c) after training. At the same time the average velocity more than doubled to just below the optimal speed. The requirement of maintaining an optimum speed encourages the sprite to follow trajectories that avoid slowing down, stopping or reversing. However, if the sprite is placed too close to an obstacle to turn away safely, it can perform an n-point-turn manoeuvre requiring it to stop, back off, turn and then move forward. It is thus capable of generating quite complex sequences of actions.

² Each measure was calculated over a sequence of five thousand simulation steps with learning disabled.

[Figure One: Sample Paths from the Obstacle Avoidance Simulation. Panels: a) robot casting three rays; b) trajectories before training; c) after training; d) in a novel environment.]

The trajectories show the robot's movement over two thousand simulation steps before and after training. After a collision the robot reverses slightly then rotates to move off at a random angle 90-180° from its original heading; if this is not possible it is relocated to a random position. Crosses indicate locations where collisions occurred, circles show new starting positions.

Some differences have been found in the sprite's ability to negotiate different environments, with the effectiveness of the avoidance learning system varying for different configurations of obstacles. However, only limited performance loss has been observed in transferring from a learned environment to an unseen one (e.g. figure 1d), which is quickly made up if the sprite is allowed to adapt its strategies to suit the new circumstances. Hence we are encouraged to think that the learning system is capturing some fairly general strategies for obstacle avoidance.

The different kinds of tactical behaviour acquired by the sprite can be illustrated using three-dimensional slices through the two policy functions (desired forward and angular velocities). Figure two shows samples of these functions recorded after fifty thousand training steps in an environment containing two slow-moving rectangular obstacles. Each graph is a function of the three rays cast out by the sprite: the x and y axes show the depths of the left and right rays, and the vertical slices correspond to different depths of the central ray (9, 35 and 74cm). The graphs show clearly several features that we might expect of effective avoidance behaviour. Most notably, there is a transition occurring over the three slices during which the policy changes from one of braking then reversing (graph a) to one of turning sharply (d) whilst maintaining speed or accelerating (e). This transition clearly corresponds to the threshold below which a collision cannot be avoided by swerving but requires backing off instead. There is a considerable degree of left-right symmetry (reflection along the line left-ray = right-ray) in most of the graphs. This agrees with the observation that obstacle avoidance is by and large a symmetric problem. However some asymmetric behaviour is acquired in order to break the deadlock that arises when the sprite is faced with obstacles that are equidistant on both sides.

[Figure Two: Surfaces showing action policies for central-ray depths of 9, 35 and 74 cm. Panels: a) forward velocity, centre 9cm; b) angular velocity, centre 9cm; c) forward velocity, 35cm; d) angular velocity, 35cm; e) forward velocity, 74cm; f) angular velocity, 74cm.]

6 CONCLUSION

We have demonstrated that complex obstacle avoidance behaviour can arise from sequences of learned reactions to immediate perceptual stimuli. The trajectories generated often have the appearance of planned activity since individual actions are only appropriate as part of extended patterns of movement. However, planning only occurs as an implicit part of a learning process that allows experience of rewarding outcomes to be propagated backwards to influence future actions taken in similar contexts. This learning process is effective because it is able to exploit the underlying regularities in the robot's interaction with its world to find behaviours that consistently achieve its goals.

Acknowledgements

This work was supported by the Science and Engineering Research Council.

References

Albus, J.S. (1971) A theory of cerebellar function. Math Biosci 10:25-61.

Anderson, T.L., and Donath, M. (1988a) Synthesis of reflexive behaviour for a mobile robot based upon a stimulus-response paradigm. SPIE Mobile Robots III 1007:198-210.

Anderson, T.L., and Donath, M. (1988b) A computational structure for enforcing reactive behaviour in a mobile robot. SPIE Mobile Robots III 1007:370-382.

Barto, A.G., Sutton, R.S., and Brouwer, P.S. (1981) Associative search network: A reinforcement learning associative memory. Biological Cybernetics 40:201-211.

Barto, A.G., Anderson, C.W., and Sutton, R.S. (1982) Synthesis of nonlinear control surfaces by a layered associative search network. Biological Cybernetics 43:175-185.

Barto, A.G., Sutton, R.S., and Anderson, C.W. (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13:834-846.

Barto, A.G., Sutton, R.S., and Watkins, C.J.H.C. (1989) Learning and sequential decision making. COINS technical report.

Braitenberg, V. (1986) Vehicles: Experiments in synthetic psychology. MIT Press, Cambridge, MA.

Chapman, D., and Agre, P.E. (1987) Pengi: An implementation of a theory of activity. AAAI-87.

Simon, H.A. (1969) The sciences of the artificial. MIT Press, Cambridge, MA.

Sutton, R.S., and Barto, A.G. (1990) Time-derivative models of Pavlovian reinforcement. In Moore, J.W., and Gabriel, M. (eds.) Learning and Computational Neuroscience. MIT Press, Cambridge, MA.

Watkins, C.J.H.C. (1989) Learning from delayed rewards. PhD thesis, King's College, Cambridge University, UK.

Werbos, P.J. (1990) A menu of designs for reinforcement learning over time. In Miller, W.T. III, Sutton, R.S., and Werbos, P.J. (eds.) Neural networks for control. MIT Press, Cambridge, MA.
\n\nWilliams RJ., (1988) Towards a theory of reinforcement-learning connectionist systems. \nTechnical  Report  NV-CCS-88-3,  College  of Computer  Science,  Northeastern \nUniversity, Boston, MA. \n\n\f", "award": [], "sourceid": 452, "authors": [{"given_name": "Tony", "family_name": "Prescott", "institution": null}, {"given_name": "John", "family_name": "Mayhew", "institution": null}]}