{"title": "Learning Spatio-Temporal Planning from a Dynamic Programming Teacher: Feed-Forward Neurocontrol for Moving Obstacle Avoidance", "book": "Advances in Neural Information Processing Systems", "page_first": 342, "page_last": 349, "abstract": null, "full_text": "Learning Spatio-Temporal Planning from \n\na  Dynamic Programming Teacher: \n\nFeed-Forward N eurocontrol for  Moving \n\nObstacle A voidance \n\nGerald Fahner * \n\nDepartment of Neuroinformatics \n\nUniversity of Bonn \n\nRomerstr.  164 \n\nRolf Eckmiller \n\nDepartment of Neuroinformatics \n\nUniversity of Bonn \n\nRomerstr.  164 \n\nW -5300  Bonn  1,  Germany \n\nW-5300 Bonn  1,  Germany \n\nAbstract \n\nWithin a simple test-bed,  application of feed-forward  neurocontrol \nfor short-term planning of robot trajectories in a dynamic environ(cid:173)\nment  is  studied.  The  action  network  is  embedded  in  a  sensory(cid:173)\nmotoric system architecture that contains  a separate world model. \nIt is  continuously  fed  with  short-term  predicted  spatio-temporal \nobstacle  trajectories,  and  receives  robot  state  feedback.  The  ac(cid:173)\ntion  net  allows  for  external  switching  between  alternative  plan(cid:173)\nning  tasks.  It  generates  goal-directed  motor  actions  - subject  to \nthe  robot's  kinematic  and  dynamic  constraints  - such  that  colli(cid:173)\nsions  with  moving obstacles  are  avoided.  Using  supervised  learn(cid:173)\ning,  we  distribute  examples  of the  optimal planner  mapping over \na  structure-level  adapted  parsimonious higher  order  network.  The \ntraining  database  is  generated  by  a  Dynamic  Programming algo(cid:173)\nrithm.  Extensive  simulations reveal,  that  the  local  planner  map(cid:173)\nping is  highly nonlinear, but can be effectively  and sparsely repre(cid:173)\nsented  by  the  chosen  powerful  net model.  Excellent generalization \noccurs for unseen obstacle configurations.  
We also discuss the limitations of feed-forward neurocontrol for growing planning horizons.

*Tel.: (228)-550-364, FAX: (228)-550-425, e-mail: gerald@nero.uni-bonn.de

1 INTRODUCTION

Global planning of goal-directed trajectories subject to cluttered spatio-temporal, state-dependent constraints - as in the kinodynamic path planning problem (Donald, 1989) considered here - is a difficult task, probably best suited for systems with embedded sequential behavior; theoretical insights indicate that the related problem of connectedness is of unbounded order (Minsky, 1969). However, in practical situations there is a lack of globally disposable constraints at planning time, due to partially unmodelled environments. The question then arises to what extent feed-forward neurocontrol may be effective for local planning horizons.

In this paper, we put aside problems of credit assignment and world model identification. We focus on the complexity of representing a local version of the generic kinodynamic path planning problem by a feed-forward net. We investigate the capacity of sparse distributed planner representations to generalize from example plans.

2 ENVIRONMENT AND ROBOT MODELS

2.1 ENVIRONMENT

The world around the robot is a two-dimensional scene, occupied by obstacles all moving parallel to the y-axis, with randomly chosen discretized x-positions, and with a continuous velocity spectrum. The environment's state is given by a list reporting position (x_i, y_i) ∈ (X, Y), X = {0,...,8}, Y = [y-, y+], and velocity (0, v_i); v_i ∈ [v-, v+] of each obstacle i.
The environment dynamics is given by

y_i(t+1) = y_i(t) + v_i ,  x_i(t+1) = x_i(t) .   (1)

Obstacles are inserted at random positions, and with random velocities, into some region distant from the robot's workspace. At each time step, the obstacle positions are updated according to eqn. (1), so that they will cross the robot's workspace after some time.

2.2 ROBOT

We consider a point-like robot of unit mass, which is confined to move within some interval along the x-axis. Its state is denoted by (x_r, ẋ_r) ∈ (X, Ẋ); Ẋ = {-1, 0, 1}. At each time step, a motor command u ∈ Ẍ = {-1, 0, 1} is applied to the robot. The robot dynamics is given by

ẋ_r(t+1) = ẋ_r(t) + u(t)
x_r(t+1) = x_r(t) + ẋ_r(t+1) .   (2)

Notice that the set of admissible motor commands depends on the present robot state. With these settings, the robot faces a fluctuating number of obstacles crossing its baseline, similar to the situation of a pedestrian who wants to cross a busy street (Figure 1).

[Figure 1: Obstacles Crossing the Robot's Workspace]

3 SYSTEM ARCHITECTURE AND FUNCTIONALITY

Adequate modeling of the perception-action cycle is of decisive importance for the design of intelligent reactive systems. We partition the overall system into two modules: an active Perception Module (PM) with built-in capabilities for short-term environment forecasts, and a subsequent Action Module (AM) for motor command generation (Figure 2). Either module may be represented by a 'classical' algorithm, or by a neural net.
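The obstacle and robot update rules of Section 2 amount to two one-line state transitions. A minimal sketch (function names and the assertion-based admissibility check are our choices, not notation from the paper):

```python
# Sketch of the dynamics in eqns. (1) and (2); variable names are our choice.

def step_obstacle(y, v):
    # Eqn. (1): each obstacle drifts along the y-axis at constant velocity.
    return y + v

def step_robot(x, xdot, u):
    # Eqn. (2): unit-mass robot on the x-axis; velocity and command in {-1, 0, 1}.
    assert u in (-1, 0, 1)
    xdot_new = xdot + u
    assert xdot_new in (-1, 0, 1), 'command must keep speed within the limit'
    x_new = x + xdot_new
    return x_new, xdot_new

x, xdot = 4, 0
x, xdot = step_robot(x, xdot, 1)  # accelerate to the right
print(x, xdot)                    # -> 5 1
```

The admissibility assertion makes explicit why the set of motor commands depends on the present robot state: from full speed, further acceleration in the same direction is not allowed.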
PM is fed with a sensory data stream reporting the observed dynamic scene of time-varying obstacle positions.

[Figure 2: Sensory-Motoric System Architecture]

From this, it assembles a spatio-temporal internal representation of near-future obstacle trajectories. At each time step t, it actualizes the incidence function

occupancy(x, k) = +1 if (x = x_i and -s < y_i(t+k) < s) for any obstacle i, and -1 otherwise,

where s is some safety margin accounting for the y-extension of obstacles. The incidence function is defined on a spatio-temporal cone-shaped cell array, based at the actual robot position:

|x - x_r(t)| ≤ k ;  k = 1, ..., HORIZON .   (3)

The opening angle of this cone-shaped region is given by the robot's speed limit (here: one cell per time step). Only those cells that can potentially be reached by the robot within the local prediction-/planning horizon are thus represented by PM (see Figure 3).

[Figure 3: Space-Time Representation with Solution Path Indicated]

The functionality of AM is to map the current PM representation to an appropriate robot motor command, taking into account the present robot state, and paying regard to the currently specified long-term goal. Firstly, we realize the optimal AM by the Dynamic Programming (DP) algorithm (Bellman, 1957). Secondly, we use supervised learning to distribute optimal planning examples over a neural network.
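PM's cone representation can be sketched as follows (a hypothetical re-implementation; the obstacle list format and the default safety margin are our assumptions):

```python
# Sketch of PM's incidence function on the cone of eqn. (3).
# obstacles: list of (x_i, y_i, v_i) tuples; s: safety margin.

def occupancy(obstacles, x, k, s=1.0):
    # +1 if cell (x, t+k) is predicted occupied, -1 otherwise.
    for (xi, yi, vi) in obstacles:
        if x == xi and -s < yi + k * vi < s:
            return 1
    return -1

def cone_cells(x_robot, horizon=3):
    # Eqn. (3): only cells reachable at speed <= 1 cell per time step.
    return [(x, k) for k in range(1, horizon + 1)
            for x in range(x_robot - k, x_robot + k + 1)]

cells = cone_cells(4)
print(len(cells))  # -> 15 (3 + 5 + 7 cells, matching the encoding in Section 6)
```

Note that the cone grows by two cells per prediction step, which is why HORIZON = 3 later yields the 3 + 5 + 7 = 15 incidence bits of the network input.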
4 DYNAMIC PROGRAMMING SOLUTION

Given PM's internal representation at time t, the present robot state, and some specification of the desired long-term goal, DP determines a sequence of motor commands minimizing some cost functional. Here we use

cost{u(t), ..., u(t+HORIZON)} = Σ_{k=0}^{HORIZON} (x_r(t+k) - x_0)² + c u(t+k)² ,   (4)

with x_r(t+k) given by the dynamics eqns. (2) (see solution path in Figure 3). By x_0, we denote the desired robot position or long-term goal. Deviations from this position are punished by higher costs, just as are costly accelerations. Obstacle collisions are excluded by restricting search to admissible cells (x, ẋ, t+k)_admissible in phase-space-time (obeying occupancy(x, t+k) = -1). Training targets for time t are constituted by the optimal present motor actions u_opt(t), for which the minimum is attained in eqn. (4). For cases with degenerate optimal solutions, we consistently break symmetry, in order to obtain a deterministic target mapping.

5 NEURAL ACTION MODEL

For neural motor command generation, we use a single layer of structure-adapted parsimonious Higher Order Neurons (parsiHONs) (Fahner, 1992a, b), computing outputs y_i ∈ [0, 1]; i = 1, 2, 3. Target values for each single neuron are given by y_i^des = 1 if motor action i is the optimal one, and y_i^des = 0 otherwise. As input, each neuron receives a bit-vector x = x_1, ..., x_N ∈ {-1, 1}^N, whose components specify the values of PM's incidence function, the binary encoded robot state, and some task bits encoding the long-term goal. Using batch training, we maximize the log-likelihood criterion for each neuron independently.
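For the small state space and short horizon used here, the minimization of Section 4 can be written as a brute-force finite-horizon search. The sketch below is in the spirit of eqn. (4); the exact indexing of stage costs, the value c = 0.1, and the free-cell callback are our assumptions:

```python
# Brute-force search over phase-space-time for a cost like eqn. (4).
# free(x, k) should return True iff occupancy(x, k) == -1; here passed in.

def plan(x, xdot, k, horizon, x_goal, free, c=0.1):
    # Returns (minimal remaining cost, first optimal command) from (x, xdot).
    if k > horizon:
        return 0.0, None
    best = (float('inf'), None)
    for u in (-1, 0, 1):
        xdot_new = xdot + u
        if xdot_new not in (-1, 0, 1):
            continue  # kinodynamic constraint: speed limit of one cell per step
        x_new = x + xdot_new
        if not free(x_new, k):
            continue  # cell predicted occupied: collision excluded from search
        stage = (x_new - x_goal) ** 2 + c * u * u
        tail, _ = plan(x_new, xdot_new, k + 1, horizon, x_goal, free, c)
        if stage + tail < best[0]:
            best = (stage + tail, u)
    return best

cost, u_opt = plan(4, 0, 1, 3, 8, lambda x, k: True)
print(u_opt)  # -> 1 (accelerate toward the goal x_goal = 8 on an empty cone)
```

With HORIZON = 3 the recursion explores at most 3³ command sequences per state, so memoization (true DP tabulation over phase-space-time, as in the paper) only matters for longer horizons.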
For recall, the motor command is obtained by a winner-takes-all decision: the index of the most active neuron yields the motor action applied.

Generally, atoms for nonlinear interactions within a bipolar-input HON are modelled by input monomials of the form

η_α = Π_{i=1}^{N} x_i^{α_i} ;  α = α_1...α_N ∈ Ω = {0, 1}^N .   (5)

Here, the i-th bit of α is understood as the exponent of x_i. It is well known that the complete set of monomials forms a basis for Boolean function expansions (Karpovski, 1976). Combinatorial growth of the number of terms with increasing input dimension renders allocation of the complete basis impractical in our case. Moreover, an action model employing excessive numbers of basis functions would overfit training data, thus preventing generalization.

We therefore use a structural adaptation algorithm, as discussed in detail in (Fahner, 1992a, b), for automatic identification and inclusion of a sparse set of relevant nonlinearities present in the problem. In effect, this algorithm performs a guided stochastic search exploring the space of nonlinear interactions by means of an intertwined process of weight adaptation and competition between nonlinear terms. The parsiHON model restricts the number of terms used, not their orders: instead of the exponential size set {η_α : α ∈ Ω}, just a small subset {η_β : β ∈ S ⊂ Ω} of terms is used within a parsimonious higher order function expansion

y^est(x) = f [ Σ_{β∈S} w_β η_β(x) ] ;  w_β ∈ ℝ .   (6)

Here, f denotes the usual sigmoid transfer function. parsiHONs with high degrees of sparsity were effectively trained and showed robust generalization on difficult nonlinear classification benchmarks (Fahner, 1992a, b).
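For a given term set S, the forward pass of eqns. (5) and (6) reduces to a few lines. A minimal sketch (the example term set and weights are invented for illustration; we represent each α by the tuple of indices where its bit is 1):

```python
import math

# Forward pass of a parsimonious higher-order neuron, eqns. (5) and (6).
# terms: list of index tuples (the set S); x: bipolar input in {-1, +1}^N.

def monomial(x, alpha):
    # Eqn. (5): product of the input components selected by alpha.
    p = 1
    for i in alpha:
        p *= x[i]
    return p

def parsihon_output(x, terms, weights):
    # Eqn. (6): sigmoid of a sparse weighted sum of monomials.
    s = sum(w * monomial(x, alpha) for alpha, w in zip(terms, weights))
    return 1.0 / (1.0 + math.exp(-s))

x = [1, -1, 1, -1]
terms = [(), (0,), (1, 2), (0, 2, 3)]  # bias, first-, second-, third-order term
weights = [0.5, 1.0, -2.0, 0.3]
y = parsihon_output(x, terms, weights)
print(round(y, 3))  # -> 0.961
```

The structural adaptation described in the text would amount to letting the entries of `terms` compete and be exchanged during training; only the fixed-S recall pass is sketched here.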
6 SIMULATION RESULTS

We performed extensive simulations to evaluate the neural action network's capabilities to generalize from learned optimal planning examples. The planner was trained with respect to two alternative long-term goals: x_0 = 0, or x_0 = 8. Firstly, optimal DP planner actions were assembled over about 6,000 time steps of the simulated environment (fairly crowded with moving obstacles), for both long-term goals. At each time step, optimal motor commands were computed for all 9 x 3 = 27 available robot states. From this bunch of situations we excluded those where no collision-free path existed within the planning horizon considered (HORIZON = 3). A total of 115,000 admissible training situations were left, out of the 6,000 x 27 = 162,000 ones generated. Thus, out of the full spectrum of robot states which were checked every time step, on average just about 19 states were not doomed to collide. These findings corroborate the difficulty of the chosen task.

Many repetitions are present in these accumulated patterns, reflecting the statistics of the simulated environment. We collapsed the original training set by removing repeated patterns, providing the learner with more information per pattern: a working database containing about 20,000 different patterns was left.

Input to the neural action net consisted of a bit-vector of length N = 21, where 3 + 5 + 7 bits encode PM's internal representation (cone size in Figure 3), 6 bits encode the robot's state, and a single task bit reports the desired goal. For training, we delimited single neuron learning to a maximum of 1000 epochs.
In most cases, this was sufficient for successful training set classification for any of the three neurons (y_i < .8 for y_i^des = 0, and y_i > .8 for y_i^des = 1; i = 1, 2, 3). But even if some training patterns were misclassified by individual motor neurons, additional robustness stemming from the winner-takes-all decision rescued fault-free recall of the voting community. To test generalization of the neural action model, we partitioned the database into two parts, one containing training patterns, the other containing new test patterns not present in the training set. Several runs were performed with parsiHONs of sizes between 83 and 110 terms. Results for varying training set sizes are depicted in Figure 4. Test error decreases with increasing training set size, and falls as low as about one percent for about 12,000 training patterns. It continues to decrease for larger training sets. These findings corroborate that the trained architectures exhibit sensible, robust generalization.

[Figure 4: Generalization Behavior - test error versus training set size, for parsiHONs with 83 to 110 terms]

To get some insight into the complexity of the mapping, we counted the number of terms which carry a given order. The resulting distribution has its maximum at order 3, exhibits many terms of orders 4 and higher, and finally decreases to zero for orders exceeding 10 (Figure 5).
This indicates that the planner mapping considered is highly nonlinear.

[Figure 5: Distribution of Orders - relative frequency of term orders, averaged over several networks]

7 DISCUSSION AND CONCLUSIONS

Sparse representation of planner mappings is desirable when representation of complete policy look-up tables becomes impracticable (Bellman's \"curse of dimensionality\"), or when computation of plans becomes expensive or conflicts with real-time requirements. For these reasons, it is urgent to investigate the capacity of neurocontrol for effective distributed representation and for robust generalization of planner mappings.

Here, we focused on a new type of shallow feed-forward action network for the local kinodynamic trajectory planning problem. An advantage of feed-forward nets is their low-latency recall, which is an important requirement for systems acting in rapidly changing environments. However, from theoretical considerations concerning the related problem of connectedness with its inherent serial character (Minsky, 1969), the planning problem under focus is expected to be hard for feed-forward nets. Even for rather local planning horizons, complex and nonlinear planner mappings must be expected. Using a powerful new neuron model that identifies the relevant nonlinearities inherent in the problem, we determined extremely parsimonious architectures for representation of the planner mapping. This indicates that some compact set of important features determines the optimal plan.
The adapted networks exhibited excellent generalization.

We encourage use of feed-forward nets for difficult local planning tasks, if care is taken that the models support effective representation of high-order nonlinearities. For growing planning horizons, it is expected that feed-forward neurocontrol will run into limitations (Werbos, 1992). The simple test-bed presented here would also allow for insertion and testing of other net models and system designs, including recurrent networks.

Acknowledgements

This work was supported by the Federal Ministry of Research and Technology (BMFT project SENROB), grant 01 IN 105 A/D.

References

E. B. Baum, F. Wilczek (1987). Supervised Learning of Probability Distributions by Neural Networks. In D. Anderson (ed.), Neural Information Processing Systems, 52-61. Denver, CO: American Institute of Physics.

R. E. Bellman (1957). Dynamic Programming. Princeton University Press.

B. Donald (1989). Near-Optimal Kinodynamic Planning for Robots With Coupled Dynamic Bounds. Proc. IEEE Int. Conf. on Robotics and Automation.

G. Fahner, N. Goerke, R. Eckmiller (1992). Structural Adaptation of Boolean Higher Order Neurons: Superior Classification with Parsimonious Topologies. Proc. ICANN, Brighton, UK.

G. Fahner, R. Eckmiller. Structural Adaptation of Parsimonious Higher Order Classifiers. Submitted to Neural Networks.

M. G. Karpovski (1976). Finite Orthogonal Series in the Design of Digital Devices. New York: John Wiley & Sons.

M. Minsky, S. A. Papert (1969). Perceptrons. Cambridge: The MIT Press.

P. Werbos (1992). Approximate Dynamic Programming for Real-Time Control and Neural Modeling. In D. White, D. Sofge (eds.), Handbook of Intelligent Control, 493-525. New York: Van Nostrand.
\n\n\f", "award": [], "sourceid": 595, "authors": [{"given_name": "Gerald", "family_name": "Fahner", "institution": null}, {"given_name": "Rolf", "family_name": "Eckmiller", "institution": null}]}