{"title": "The Effect of Eligibility Traces on Finding Optimal Memoryless Policies in Partially Observable Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1010, "page_last": 1016, "abstract": null, "full_text": "The effect of eligibility traces on finding optimal memoryless policies in partially observable Markov decision processes \n\nJohn Loch \n\nDepartment of Computer Science \nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nloch@cs.colorado.edu \n\nAbstract \n\nAgents acting in the real world are confronted with the problem of making good decisions with limited knowledge of the environment. Partially observable Markov decision processes (POMDPs) model decision problems in which an agent tries to maximize its reward in the face of limited sensor feedback. Recent work has shown empirically that a reinforcement learning (RL) algorithm called Sarsa(λ) can efficiently find optimal memoryless policies, which map current observations to actions, for POMDP problems (Loch and Singh 1998). The Sarsa(λ) algorithm uses a form of short-term memory called an eligibility trace, which distributes temporally delayed rewards to the observation-action pairs that lead up to the reward. This paper explores the effect of eligibility traces on the ability of the Sarsa(λ) algorithm to find optimal memoryless policies. A variant of Sarsa(λ) called k-step truncated Sarsa(λ) is applied to four test problems taken from the recent work of Littman; Littman, Cassandra and Kaelbling; Parr and Russell; and Chrisman. The empirical results show that eligibility traces can be significantly truncated without affecting the ability of Sarsa(λ) to find optimal memoryless policies for POMDPs. 
\n\n1 Introduction \n\nAgents which operate in the real world, such as mobile robots, must use sensors which at best give only partial information about the state of the environment. Information about the robot's surroundings is necessarily incomplete due to noisy and/or imperfect sensors, occluded objects, and the inability of the robot to know precisely where it is. Such agent-environment systems can be modeled as partially observable Markov decision processes or POMDPs (Sondik, 1978). \n\nA variety of algorithms have been developed for solving POMDPs (Lovejoy, 1991). However, most of these techniques do not scale well to problems involving more than a few dozen states due to the computational complexity of the solution methods (Cassandra, 1994; Littman 1994). Therefore, finding efficient reinforcement learning methods for solving POMDPs is of great practical interest to the Artificial Intelligence and engineering fields. \n\nRecent work has shown empirically that the Sarsa(λ) algorithm can efficiently find the best deterministic memoryless policy for several POMDP problems from the recent literature (Loch and Singh 1998). The empirical results from Loch and Singh (1998) suggest that eligibility traces are necessary for finding the best or optimal memoryless policy. For this reason, a variant of Sarsa(λ) called k-step truncated Sarsa(λ) is formulated to explore the effect of eligibility traces on the ability of Sarsa(λ) to find the best memoryless policy. \n\nThe main contribution of this paper is to show empirically that a variant of Sarsa(λ) using truncated eligibility traces can find the optimal memoryless policy for several POMDP problems from the literature. 
Specifically, we show that the k-step truncated Sarsa(λ) method can find the optimal memoryless policy for the four POMDP problems tested when k ≤ 2. \n\n2 Sarsa(λ) and POMDPs \n\nAn environment is defined by a finite set of states S, the agent can choose from a finite set of actions A, and the agent's sensors provide it observations from a finite set X. On executing action a ∈ A in state s ∈ S, the agent receives expected reward r_s^a and the environment transitions to a state s' ∈ S with probability P_{ss'}^a. The probability of the agent observing x ∈ X given that the state is s is O(x|s). \n\nA straightforward way to extend RL algorithms to POMDPs is to learn Q-value functions of observation-action pairs, i.e. to simply treat the agent's observations as states. Below we describe the standard Sarsa(λ) algorithm applied to POMDPs. At time step t the Q-value function is denoted Q_t, the eligibility trace function is denoted η_t, and the reward received is denoted r_t. On experiencing transition <x_t, a_t, r_t, x_{t+1}>, the following updates are performed in order: \n\nη_t(x_t, a_t) = 1 \n\nη_t(x, a) = γλ η_{t-1}(x, a); for all x ≠ x_t and a ≠ a_t \n\nQ_{t+1}(x, a) = Q_t(x, a) + α δ_t η_t(x, a); for all x and a \n\nwhere δ_t = r_t + γ Q_t(x_{t+1}, a_{t+1}) - Q_t(x_t, a_t) and α is the step-size (learning rate). The eligibility traces are initialized to zero, and in episodic tasks they are reinitialized to zero after every episode. The greedy policy at time step t assigns to each observation x the action a = argmax_b Q_t(x, b). \n\n2.1 Sarsa(λ) Using Truncated Eligibility Traces \n\nSarsa(λ) with truncated eligibility traces uses a parameter k which sets the eligibility trace for an observation-action pair to zero if that observation-action pair was not visited within the last k-1 time steps. 
Thus 1-step truncated Sarsa(λ) is equivalent to Sarsa(0), and 2-step truncated Sarsa(λ) updates the Q-values of the current observation-action pair and the immediately preceding observation-action pair. \n\n3 Empirical Results \n\nThe truncated Sarsa(λ) algorithm was applied in an identical manner to four POMDP problems taken from the recent literature. Complete descriptions of the states, actions, observations, and rewards for each problem are provided in Loch and Singh (1998). Here we describe the aspects of the empirical results common to all four problems. At each step, the agent selected a random action with a probability equal to the exploration rate parameter and selected a greedy action otherwise. An initial exploration rate of 35% was used, decreasing linearly with each action (step) until the 350000th action; from there onward the exploration rate remained fixed at 0%. Q-values were initialized to 0. Both the step-size α and the λ values were held constant in each experiment. A discount factor γ of 0.95 and a λ value of 1.0 were used for all four problems. \n\n3.1 Sutton's Grid World \n\nSutton's grid world (Littman 1994) is an agent-environment system with 46 states, 30 observations, and 4 actions. State transitions and observations are deterministic. \n\nThe 1-step truncated eligibility trace, equivalent to Sarsa(0), found only a policy that could reach the goal from start states within 7 steps of the goal state (Figure 1). The optimal memoryless policy, yielding 416 total steps to the goal state, was found by the 2-step, 4-step, and 8-step truncated eligibility trace methods (Figure 1). \n\nFigure 1: Sutton's Grid World (from Littman, 1994). Total steps to goal performance as a function of the number of learning steps for 1, 2, 4, and 8-step eligibility traces. \n\n3.2 Chrisman's Shuttle Problem \n\nChrisman's shuttle problem is an agent-environment system with 8 states, 5 observations, and 3 actions. State transitions and observations are stochastic. \n\nThe 1-step truncated eligibility trace, equivalent to Sarsa(0), was unable to find a policy which could reach the goal state (Figure 2). The optimal memoryless policy, yielding an average reward per step of 1.02, was found by the 2-step, 4-step, and 8-step truncated eligibility trace methods (Figure 2). \n\nFigure 2: Chrisman's shuttle problem. Average reward per step performance as a function of the number of learning steps for 1, 2, 4, and 8-step eligibility traces. \n\n3.3 Littman, Cassandra, and Kaelbling's 89 State Office World \n\nLittman et al.'s 89 state office world (Littman et al. 1995) is an agent-environment system with 89 states, 17 observations, and 5 actions. State transitions and observations are stochastic. \n\nThe 1-step truncated eligibility trace, equivalent to Sarsa(0), was able to find a policy which could reach the goal state in only 51% of the 251 trials (Figure 3). The 2-step, 4-step, and 8-step truncated eligibility trace methods converged to the best memoryless policy found by Loch & Singh (1998), yielding a 77% success rate in reaching the goal state (Figure 3). 
\n\nFigure 3: Littman et al.'s 89 state office world. Percentage of trials that successfully reach the goal as a function of the number of learning steps for 1, 2, 4, and 8-step eligibility traces. \n\n3.4 Parr & Russell's Grid World \n\nParr and Russell's grid world (Parr and Russell 1995) is an agent-environment system with 11 states, 6 observations, and 4 actions. State transitions are stochastic while observations are deterministic. \n\nThe optimal memoryless policy, yielding an average reward per step of 0.024, was found by both the 1-step and 2-step truncated eligibility trace methods (Figure 4). Policies found by the 4-step and 8-step methods were not optimal. This result can be attributed to the sharp eligibility trace cutoff, as this effect was not observed with smoothly decaying eligibility traces. \n\nFigure 4: Parr & Russell's Grid World. Average reward per step performance as a function of the number of learning steps for 1, 2, 4, and 8-step eligibility traces. \n\n3.5 Discussion \n\nIn all the empirical results presented above, the k-step truncated Sarsa(λ) algorithm was able to find the best or the optimal deterministic memoryless policy when k=2. \n\nThis result is surprising since it was expected that the length of the eligibility trace required to find a good or optimal policy would vary widely depending on problem-specific factors such as landmark (unique observation) spacing and the delay between critical decisions and rewards. 
Several additional POMDP problems were formulated in an attempt to create a POMDP which would require a k value greater than 2 to find the optimal policy. However, for all trial POMDPs tested the optimal memoryless policy could be found with k ≤ 2. \n\n4 Conclusions and Future Work \n\nThe ability of the Sarsa(λ) algorithm and the k-step truncated Sarsa(λ) algorithm to find optimal deterministic memoryless policies for a class of POMDP problems is important for several reasons. For POMDPs with good memoryless policies, the Sarsa(λ) algorithm provides an efficient method for finding the best policy in that space. \n\nIf the performance of the memoryless policy is unsatisfactory, the observation and action spaces of the agent can be modified so as to produce an agent with a good memoryless policy. The designer of the autonomous system or agent can modify the observation space of the agent by either adding sensors or making finer distinctions in the current sensor values. In addition, the designer can add attributes from past observations into the current observation space. The action space can be modified by adding lower-level actions and by adding new actions to the space. Thus one method for designing a capable agent is to iterate between selecting an observation and action space for the agent and using Sarsa(λ) to find the best memoryless policy in that space, repeating until satisfactory performance is achieved. \n\nThis suggests a future line of research into how to automate the process of observation and action space selection so as to achieve an acceptable performance level. Other avenues of research include an exploration into theoretical reasons why Sarsa(λ) and k-step truncated Sarsa(λ) are able to solve POMDPs. 
In addition, further research needs to be conducted as to why short (k ≤ 2) eligibility traces work well over a wide class of POMDPs. \n\nReferences \n\nCassandra, A. (1994). Optimal policies for partially observable Markov decision processes. Technical Report CS-94-14, Brown University, Department of Computer Science, Providence RI. \n\nLittman, M. (1994). The Witness Algorithm: Solving partially observable Markov decision processes. Technical Report CS-94-40, Brown University, Department of Computer Science, Providence RI. \n\nLittman, M., Cassandra, A., & Kaelbling, L. (1995). Learning policies for partially observable environments: Scaling up. In Proceedings of the Twelfth International Conference on Machine Learning, pages 362-370, San Francisco, CA, 1995. Morgan Kaufmann. \n\nLoch, J., & Singh, S. (1998). Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. To appear in Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, 1998. Morgan Kaufmann. (Available from http://www.cs.colorado.edu/~baveja/papers.html) \n\nLovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. In Annals of Operations Research, 28: 47-66. \n\nParr, R. & Russell, S. (1995). Approximating optimal policies for partially observable stochastic domains. In Proceedings of the International Joint Conference on Artificial Intelligence. \n\nSondik, E. J. (1978). The optimal control of partially observable Markov decision processes over the infinite horizon: Discounted costs. In Operations Research, 26(2). \n\nSutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. 
In Proceedings of the Seventh International Conference on Machine Learning, pages 216-224, San Mateo, CA. Morgan Kaufmann. \n\nLittman, M. (1994). Memoryless policies: theoretical limitations and practical results. In From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior, Cambridge, MA. MIT Press. \n", "award": [], "sourceid": 1603, "authors": [{"given_name": "John", "family_name": "Loch", "institution": null}]}