{"title": "Q-Learning with Hidden-Unit Restarting", "book": "Advances in Neural Information Processing Systems", "page_first": 81, "page_last": 88, "abstract": null, "full_text": "Q-Learning with Hidden-Unit Restarting\n\nCharles W. Anderson\n\nDepartment of Computer Science\n\nColorado State University\n\nFort Collins, CO 80523\n\nAbstract\n\nPlatt's resource-allocation network (RAN) (Platt, 1991a, 1991b)\nis modified for a reinforcement-learning paradigm and to \"restart\"\nexisting hidden units rather than adding new units. After restarting,\nunits continue to learn via back-propagation. The resulting\nrestart algorithm is tested in a Q-learning network that learns to\nsolve an inverted pendulum problem. Solutions are found faster on\naverage with the restart algorithm than without it.\n\n1 Introduction\n\nThe goal of supervised learning is the discovery of a compact representation that\ngeneralizes well. Such representations are typically found by incremental, gradient-based\nsearch, such as error back-propagation. However, in the early stages of learning\na control task, we are more concerned with fast learning than a compact\nrepresentation. This implies a local representation with the extreme being the\nmemorization of each experience. An initially local representation is also advantageous\nwhen the learning component is operating in parallel with a conventional, fixed\ncontroller. A learning experience should not generalize widely; the conventional\ncontroller should be preferred for inputs that have not yet been experienced.\n\nPlatt's resource-allocation network (RAN) (Platt, 1991a, 1991b) combines gradient\nsearch and memorization. RAN uses locally tuned (gaussian) units in the hidden\nlayer.  
The weight vector of a gaussian unit is equal to the input vector for which the\nunit produces its maximal response. A new unit is added when the network's error\nmagnitude is large and the new unit's radial domain would not significantly overlap\ndomains of existing units. Platt demonstrated RAN on the supervised learning task\nof predicting values in the Mackey-Glass time series.\n\nWe have integrated Platt's ideas with the reinforcement-learning algorithm called\nQ-learning (Watkins, 1989). One major modification is that the network has a\nfixed number of hidden units, all in a single layer, all of which are trained on every\nstep. Rather than adding units, the least useful hidden unit is selected and its\nweights are set to new values, after which the gradient-based search continues. Thus, the\nunit's search is restarted. The temporal-difference errors control restart events in a\nfashion similar to the way supervised errors control RAN's addition of new units.\n\nThe motivation for starting with all units present is that in a parallel implementation,\nthe computation time for a layer of one unit is roughly the same as that for\na layer with all of the units. All units are trained from the start. Any that fail to\nlearn anything useful are re-allocated when needed.\n\nHere the Q-learning algorithm with restarts is applied to the problem of learning\nto balance a simulated inverted pendulum. In the following sections, the inverted\npendulum problem and Watkins' Q-learning algorithm are described. Then the\ndetails of the restart algorithm are given and results of applying the algorithm to\nthe inverted pendulum problem are summarized.\n\n2 Inverted Pendulum\n\nThe inverted pendulum is a classic example of an inherently unstable system.  
The\nproblem can be used to study the difficult credit assignment problem that arises\nwhen performance feedback is provided only by a failure signal. This problem has\noften been used to test new approaches to learning control (from early work by Widrow\nand Smith, 1964, to recent studies such as Jordan and Jacobs, 1990, and Whitley,\nDominic, Das, and Anderson, 1993). It involves a pendulum hinged to the top\nof a wheeled cart that travels along a track of limited length. The pendulum is\nconstrained to move within the vertical plane. The state is specified by the position\nand velocity of the cart, the angle between the pendulum and vertical, and the\nangular velocity of the pendulum.\n\nThe only information regarding the goal of the task is provided by the failure signal,\nor reinforcement, $r_t$, which signals either the pendulum falling past \u00b112\u00b0 or the cart\nhitting the bounds of the track at \u00b11 m. The state at time $t$ of the pendulum is\npresented to the network as a vector, $x_t$, of the four state variables scaled to be\nbetween 0 and 1.\n\nFor further details of this problem and other reinforcement learning approaches to\nthis problem, see Barto, Sutton, and Anderson (1983) and Anderson (1987).\n\n3 Q-Learning\n\nThe objective of many control problems is to optimize a performance measure over\ntime. For the inverted pendulum problem, we define a reinforcement signal to be -1\nwhen the pendulum angle or the cart position exceeds its bounds, and 0 otherwise.\nThe objective is to maximize the sum of this reinforcement signal over time.\n\nIf we had complete knowledge of state transition probabilities we could apply dynamic\nprogramming to find the sequence of pushes that maximizes the sum of reinforcements.  
Reinforcement learning algorithms have been devised to learn control\nstrategies when such knowledge is not available. In fact, Watkins has shown that one\nform of his Q-learning algorithm converges to the dynamic programming solution\n(Watkins, 1989; Watkins and Dayan, 1992).\nThe essence of Q-learning is the learning and use of a Q function, $Q(x, a)$, that is\na prediction of a weighted sum of future reinforcement given that action $a$ is taken\nwhen the controlled system is in a state represented by $x$. This is analogous to the\nvalue function in dynamic programming. Specifically, the objective of Q-learning is\nto form the following approximation:\n\n$$Q(x_t, a_t) \\approx \\sum_{k=0}^{\\infty} \\gamma^k r_{t+k+1}$$\n\nwhere $0 < \\gamma < 1$ is a discount rate and $r_t$ is the reinforcement received at time $t$.\nWatkins (1989) presents a number of algorithms for adjusting the parameters of Q.\nHere we focus on using error back-propagation to train a neural network to learn the\nQ function. For Q-learning, the following temporal-difference error (Sutton, 1988)\n\n$$e_t = r_{t+1} + \\gamma \\max_{a_{t+1}} [Q(x_{t+1}, a_{t+1})] - Q(x_t, a_t)$$\n\nis derived by using $\\max_{a_{t+1}} [Q(x_{t+1}, a_{t+1})]$ as an approximation to $\\sum_{k=0}^{\\infty} \\gamma^k r_{t+k+2}$. See\n(Barto, Bradtke, and Singh, 1991) for further discussion of the relationships between\nreinforcement learning and dynamic programming.\n\n4 Q-Learning Network\n\nFor the inverted pendulum experiments reported here, a neural network with a\nsingle hidden layer was used to learn the $Q(x, a)$ function. As shown in Figure 1,\nthe network has four inputs for the four state variables of the inverted pendulum,\nand two outputs corresponding to the two possible actions for this problem, similar\nto Lin (1992).  
In addition to the weights shown, $w$ and $v$, the two units in the\noutput layer each have a single weight with a constant input of 0.5.\n\nThe activation function of the hidden units is the approximate gaussian function\nused by Platt. Let $d_j$ be the squared distance between the current input vector, $x$,\nand the weights in hidden unit $j$.\n\n$$d_j = \\sum_{i=1}^{4} (x_i - w_{j,i})^2$$\n\nHere $x_i$ is the $i$th component of $x$ at the current time. The output, $y_j$, of hidden\nunit $j$ is\n\n$$y_j = \\begin{cases} (1 - d_j/\\rho)^2, & \\text{if } d_j < \\rho; \\\\ 0, & \\text{otherwise,} \\end{cases}$$\n\nwhere $\\rho$ controls the radius of the region in which the unit's output is nonzero.\nUnlike in Platt's RAN, $\\rho$ is constant and equal for all units.\n\nFigure 1: Q-Learning Network (inputs $x_1$, $x_2$, $x_3$, $x_4$; outputs $Q(x, -10)$ and $Q(x, +10)$)\n\nThe output units calculate weighted sums of the hidden unit outputs and the\nconstant input. The output values are the current estimates of $Q(x_t, -10)$ and\n$Q(x_t, 10)$, which are predictions of future reinforcement given the current observed\nstate of the inverted pendulum and assuming a particular action will be applied in\nthat state.\n\nThe action applied at each step is selected as the one corresponding to the larger\nof $Q(x_t, -10)$ and $Q(x_t, 10)$. To explore the effects of each action, the action with\nthe lower Q value is applied with a probability that decreases with time:\n\n$$p = \\begin{cases} 1 - 0.5\\lambda^t, & \\text{if } Q(x_t, 10) > Q(x_t, -10); \\\\ 0.5\\lambda^t, & \\text{otherwise,} \\end{cases}$$\n\n$$a_t = \\begin{cases} 10, & \\text{with probability } p; \\\\ -10, & \\text{with probability } 1 - p. \\end{cases}$$\n\nTo update all weights, error back-propagation is applied at each step using the\nfollowing temporal-difference error\n\n$$e_t = \\begin{cases} \\gamma \\max_{a_{t+1}} [Q(x_{t+1}, a_{t+1})] - Q(x_t, a_t), & \\text{if failure does not occur on step } t + 1, \\\\ r_{t+1} - Q(x_t, a_t), & \\text{if failure occurs on step } t + 1. \\end{cases}$$\n\nNote that $r_t = 0$ for all non-failure steps and drops out of the first expression.\n\nWeights are updated by the following equations, assuming Unit $j$ is the output unit\ncorresponding to the action taken, and all variables are for the current time $t$.\n\n$$\\Delta w_{k,i} = \\frac{\\beta_h}{\\rho} e y_k v_{j,k} (x_i - w_{k,i})$$\n\n$$\\Delta v_{j,k} = \\beta e y_k$$\n\nIn all experiments, $\\rho = 2$, $\\lambda = 0.99999$, and $\\gamma = 0.9$. Values of $\\beta$ and $\\beta_h$ are\ndiscussed in Section 6.\n\n5 Restart Algorithm\n\nAfter weights are modified by back-propagation, conditions for a restart are checked.\nIf conditions are met, a unit is restarted, and processing continues with the next\ntime step. Conditions and primary steps of the restart algorithm appear below as\nthe numbered equations.\n\n5.1 When to Restart\n\nSeveral conditions must be met before a restart is performed. First, the magnitude\nof the error, $e_t$, must be larger than usual. To detect this, exponentially-weighted\naverages of the mean, $\\mu$, and variance, $\\sigma^2$, of $e_t$ are maintained and used to calculate\na normalized error, $e'_t$:\n\n$$e'_t = e_t - \\frac{\\mu_t}{1 - \\lambda_e^t},$$\n\n$$\\mu_{t+1} = \\lambda_e \\mu_t + (1 - \\lambda_e) e_t,$$\n\n$$\\sigma^2_{t+1} = \\lambda_e \\sigma^2_t + (1 - \\lambda_e) (e'_t)^2.$$\n\nFor our experiments, $\\lambda_e = 0.99$.\n\nNow we can state the first restart condition. A restart is considered on steps for\nwhich the magnitude of the error is greater than 0.01 and greater than a constant\nfactor of the error's standard deviation, i.e., whenever\n\n$$|e'_t| > 0.01 \\quad \\text{and} \\quad |e'_t| > \\alpha \\sqrt{\\frac{\\sigma^2_t}{1 - \\lambda_e^t}}. \\qquad (1)$$\n\nOf a small number of tested values, $\\alpha = 0.2$ resulted in the best performance.\nBefore choosing a unit to restart for this step, we determine whether or not the\ncurrent input vector is already \"covered\" by a unit.  
Assuming $y_j$ is the output of\nUnit $j$ for the current input vector, the restart procedure is continued only if\n\n$$y_j < 0.5, \\quad \\text{for } j = 1, \\ldots, 20. \\qquad (2)$$\n\n5.2 Which Unit to Restart\n\nAs stated by Mozer and Smolensky (1989), ideally we would choose the least useful\nunit as the one that results in the largest error when removed from the network.\nFor the Q-network, this requires the removal of one unit at a time, making multiple\nattempts to balance the pendulum, and determining which unit when removed\nresults in the shortest balancing times. Rather than following this computationally\nexpensive procedure, we simply took the sum of the magnitudes of a hidden unit's\noutput weights as a measure of its utility. This is one of several utility measures\nsuggested by Mozer and Smolensky and others (e.g., Klopf and Gose, 1969).\n\nAfter a unit is restarted, it may require further learning experience to acquire a\nuseful function in the network. The amount of learning experience is defined as a\nsum of magnitudes of the error $e_t$. The sum of error magnitudes since Unit $j$ was\nrestarted is given by $c_j$. Once this sum surpasses a maximum, $c_{\\max}$, the unit is\nagain eligible for restarting. Thus, Unit $j$ is restarted when\n\n$$U_j = \\min_{j \\in \\{1, \\ldots, 20\\}} (|v_{1,j}| + |v_{2,j}|) \\qquad (3)$$\n\nand\n\n$$c_j > c_{\\max}. \\qquad (4)$$\n\nWithout a detailed search, a value of $c_{\\max} = 10$ was found to result in good performance.\n\n5.3 New Weights for Restarted Unit\n\nSay Unit $j$ is restarted. Its input weights are set equal to the current input vector,\n$x$, the one for which the output of the network was in error. One of the two output\nweights of Unit $j$ is also modified.  
The output weight through which Unit $j$ modifies\nthe output of the unit corresponding to the action actually taken is set equal to the\nerror, $e_t$. The other output weight is not modified.\n\n$$w_{j,i} = x_i, \\quad \\text{for } i = 1, \\ldots, 4, \\qquad (5)$$\n\n$$v_{k,j} = e_t, \\quad \\text{where } k = \\begin{cases} 1, & \\text{if } a_t = -10; \\\\ 2, & \\text{if } a_t = 10. \\end{cases} \\qquad (6)$$\n\n6 Results\n\nThe pendulum is said to be balanced when 90,000 steps (1/2 hour of simulated\ntime) have elapsed without failure. After every failure, the pendulum is reset to the\ncenter of the track with a zero angle (straight up) and zero velocities. Performance is\njudged by the average number of failures before the pendulum is balanced. Averages\nwere taken over 30 runs. Each run consists of choosing initial values for the hidden\nunits' weights from a uniform distribution from 0 to 1, then training the net until\nthe pendulum is balanced for 90,000 steps or a maximum number of 50,000 failures\nis reached.\n\nTo determine the effect of restarting, we compare the performance of the Q-learning\nalgorithm with and without restarts. Back-propagation learning rates are given by\n$\\beta$ for the output units and $\\beta_h$ for the hidden units. $\\beta$ and $\\beta_h$ were optimized for the\nalgorithm without restarts by testing a large number of values. The best values of\nthose tried are $\\beta = 0.05$ and $\\beta_h = 1.0$. These values were used for both algorithms.\nOnly a small number of values for the additional restart parameters were tested, so the\nrestart algorithm is not optimized for this problem.\n\nFigure 2 is a graph of the number of steps between failures versus the number of\nfailures. Each algorithm was initialized with the same hidden unit weights. Without\nrestarts the pendulum is balanced for this run after 6,879 failures. With restarts it\nis balanced after 3,415 failures. 
\n\nFigure 2: Learning Curves of Balancing Time Versus Failures (averaged over bins\nof 100 failures). Vertical axis: Steps Between Failures (logarithmic, 10 to 100,000); horizontal axis: Failures (0 to 6,000); curves: With Restarts and Without Restarts.\n\nThe performances of the algorithms were averaged over 30 runs giving the following\nresults. The restart algorithm balanced the pendulum in all 30 runs, within an\naverage of 3,303 failures. The algorithm without restarts was unsuccessful within\n50,000 failures for two of the 30 runs. Not counting the unsuccessful runs, this\nalgorithm balanced the pendulum within an average of 4,923 failures. Counting\nthe unsuccessful runs as 50,000 failures each, this average is 7,928 failures.\n\nIn studying the timing of restarts, we observe that initially the number of restarts\nis small, due to the high variance of $e_t$ in the early stages of learning. During later\nstages, we see that a single unit might be restarted many times (15 to 20) before it\nbecomes more useful (at least according to our measure) than some other unit.\n\n7 Conclusion\n\nThis first test of an algorithm for restarting hidden units in a reinforcement-learning\nparadigm led to a decrease in learning time for this task. However, much work\nremains in studying the effects of each step of the restart procedure. Many alternatives\nexist, most significantly in the method for determining the utility of hidden\nunits. A significant extension of this algorithm would be to consider units with\nvariable-width domains, as in Platt's RAN algorithm. 
\n\nAcknowledgements\n\nThe work was supported in part by the National Science Foundation through Grant\nIRI-9212191 and by Colorado State University through Faculty Research Grant\n1-38592.\n\nReferences\n\nC. W. Anderson. (1987). Strategy learning with multilayer connectionist representations.\nTechnical Report TR87-509.3, GTE Laboratories, Waltham, MA,\n1987. Corrected version of article that was published in Proceedings of the\nFourth International Workshop on Machine Learning, pp. 103-114, June, 1987.\n\nA. G. Barto, S. J. Bradtke, and S. P. Singh. (1991). Real-time learning and\ncontrol using asynchronous dynamic programming. Technical Report 91-57,\nDepartment of Computer Science, University of Massachusetts, Amherst, MA,\nAug.\n\nA. G. Barto, R. S. Sutton, and C. W. Anderson. (1983). Neuronlike elements that\ncan solve difficult learning control problems. IEEE Transactions on Systems,\nMan, and Cybernetics, 13:835-846. Reprinted in J. A. Anderson and E. Rosenfeld,\nNeurocomputing: Foundations of Research, MIT Press, Cambridge, MA,\n1988.\n\nM. I. Jordan and R. A. Jacobs. (1990). Learning to control an unstable system with\nforward modeling. In D. S. Touretzky, editor, Advances in Neural Information\nProcessing Systems, volume 2, pages 324-331. Morgan Kaufmann, San Mateo,\nCA.\n\nA. H. Klopf and E. Gose. (1969). An evolutionary pattern recognition network.\nIEEE Transactions on Systems, Science, and Cybernetics, 15:247-250.\n\nL.-J. Lin. (1992). Self-improving reactive agents based on reinforcement learning,\nplanning, and teaching. Machine Learning, 8(3/4):293-321.\n\nM. C. Mozer and P. Smolensky. (1989). Skeletonization: A technique for trimming\nthe fat from a network via relevance assessment. In D. S.  
Touretzky, editor,\nAdvances in Neural Information Processing Systems, volume 1, pages 107-115. Morgan\nKaufmann, San Mateo, CA, 1989.\n\nJ. C. Platt. (1991a). Learning by combining memorization and gradient descent.\nIn R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in\nNeural Information Processing Systems 3, pages 714-720. Morgan Kaufmann\nPublishers, San Mateo, CA.\n\nJ. C. Platt. (1991b). A resource-allocating network for function interpolation.\nNeural Computation, 3:213-225.\n\nR. S. Sutton. (1988). Learning to predict by the method of temporal differences.\nMachine Learning, 3:9-44.\n\nC. J. C. H. Watkins. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge\nUniversity Psychology Department.\n\nC. J. C. H. Watkins and P. Dayan. (1992). Q-learning. Machine Learning,\n8(3/4):279-292.\n\nD. Whitley, S. Dominic, R. Das, and C. Anderson. (1993). Genetic reinforcement\nlearning for neurocontrol problems. Machine Learning, to appear.\n\nB. Widrow and F. W. Smith. (1964). Pattern-recognizing control systems. In Proceedings\nof the 1963 Computer and Information Sciences (COINS) Symposium,\npages 288-317, Washington, DC. Spartan.\n", "award": [], "sourceid": 597, "authors": [{"given_name": "Charles", "family_name": "Anderson", "institution": null}]}