{"title": "Experimental Results on Learning Stochastic Memoryless Policies for Partially Observable Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1080, "abstract": null, "full_text": "Experimental Results on Learning Stochastic Memoryless Policies for Partially Observable Markov Decision Processes \n\nJohn K. Williams \nDepartment of Mathematics \nUniversity of Colorado \nBoulder, CO 80309-0395 \njkwillia@euclid.colorado.edu \n\nSatinder Singh \nAT&T Labs-Research \n180 Park Avenue \nFlorham Park, NJ 07932 \nbaveja@research.att.com \n\nAbstract \n\nPartially Observable Markov Decision Processes (POMDPs) constitute an important class of reinforcement learning problems which present unique theoretical and computational difficulties. In the absence of the Markov property, popular reinforcement learning algorithms such as Q-learning may no longer be effective, and memory-based methods which remove partial observability via state-estimation are notoriously expensive. An alternative approach is to seek a stochastic memoryless policy which, for each observation of the environment, prescribes a probability distribution over the available actions, chosen to maximize the average reward per timestep. A reinforcement learning algorithm which learns a locally optimal stochastic memoryless policy has been proposed by Jaakkola, Singh and Jordan, but not empirically verified. We present a variation of this algorithm, discuss its implementation, and demonstrate its viability using four test problems. 
\n\n1  INTRODUCTION \n\nReinforcement learning techniques have proven quite effective in solving Markov Decision Processes (MDPs), control problems in which the exact state of the environment is available to the learner and the expected result of an action depends only on the present state [10]. Algorithms such as Q-learning learn optimal deterministic policies for MDPs: rules which for every state prescribe an action that maximizes the expected future reward. In many important problems, however, the exact state of the environment is either inherently unknowable or prohibitively expensive to obtain, and only a limited, possibly stochastic observation of the environment is available. Such Partially Observable Markov Decision Processes (POMDPs) [3,6] are often much more difficult than MDPs to solve [4]. Distinct sequences of observations and actions preceding a given observation in a POMDP may lead to different probabilities of occupying the underlying exact states of the MDP. If the efficacy of an action depends on the hidden exact state of the environment, an optimal choice may require knowing the past history as well as the current observation, and the problem is no longer Markov. In light of this difficulty, one approach to solving POMDPs is to explore the environment while building up a memory of past observations, actions and rewards which allows estimation of the current hidden state [1]. Such methods produce deterministic policies, but they are computationally expensive and may not scale well with problem size. Furthermore, policies that require state-estimation using memory may be complicated to implement. 
\n\nMemoryless policies are particularly appropriate for problems in which the state is expensive to obtain or inherently difficult to estimate, and they have the advantage of being extremely simple to act upon. For a POMDP, the optimal memoryless policy is generally a stochastic policy, one which for each observation of the environment prescribes a probability distribution over the available actions. In fact, examples of POMDPs can be constructed for which a stochastic policy is arbitrarily better than the optimal deterministic policy [9]. An algorithm proposed by Jaakkola, Singh and Jordan (JSJ) [2], which we investigate here, learns memoryless stochastic policies for POMDPs. \n\n2  POMDPs AND DIFFERENTIAL-REWARD Q-VALUES \n\nWe assume that the environment has discrete states $\\mathcal{S} = \\{s_1, s_2, \\ldots, s_N\\}$, and the learner chooses actions from a set $\\mathcal{A}$. State transitions depend only on the current state $s$ and the action $a$ taken (the Markov property); they occur with probabilities $P^a(s,s')$ and result in expected rewards $R^a(s,s')$. In a POMDP, the learner cannot sense exactly the state $s$ of the environment, but rather perceives only an observation (or \"message\") from a set $\\mathcal{M} = \\{m_1, m_2, \\ldots, m_M\\}$ according to a conditional probability distribution $P(m|s)$. The learner will in general not know the size of the underlying state space, its transition probabilities, reward function, or the conditional distributions of the messages. \n\nIn MDPs, there always exists a policy which simultaneously maximizes the expected future reward for all states, but this is not the case for POMDPs [9]. An appropriate alternative measure of the merit of a stochastic POMDP policy $\\pi(a|m)$ is the asymptotic average reward per timestep, $R^{\\pi}$, that it achieves. 
In seeking an optimal stochastic policy, the JSJ algorithm makes use of Q-values determined by the infinite-horizon differential reward for each observation-action pair $(m,a)$. In particular, if $r_t$ denotes the reward obtained at time $t$, we may define the differential-reward Q-values by \n\n$$Q^{\\pi}(s,a) = \\sum_{t=1}^{\\infty} E^{\\pi}[r_t - R^{\\pi} \\mid s_1 = s, a_1 = a]; \\qquad Q^{\\pi}(m,a) = E_s[Q^{\\pi}(s,a) \\mid M(s) = m] \\qquad (1)$$ \n\nwhere $M$ is the observation operator. Note that $E[r_t] \\to R^{\\pi}$ as $t \\to \\infty$, so the summand converges to zero. The value functions $V^{\\pi}(s)$ and $V^{\\pi}(m)$ may be defined similarly. \n\n3  POLICY IMPROVEMENT \n\nThe JSJ algorithm consists of a method for evaluating $Q^{\\pi}$ and $V^{\\pi}$ and a mechanism for using them to improve the current policy. Roughly speaking, if $Q^{\\pi}(m,a) > V^{\\pi}(m)$, then action $a$ realized a higher differential reward than the average for observation $m$, and assigning it a slightly greater probability will increase the average reward per timestep, $R^{\\pi}$. We interpret the quantities $\\Delta_m(a) = Q^{\\pi}(m,a) - V^{\\pi}(m)$ as comprising a \"gradient\" of $R^{\\pi}$ in policy space. Their projections onto the probability simplexes may then be written as $\\delta_m = \\Delta_m - \\langle \\Delta_m, \\mathbf{1} \\rangle \\mathbf{1} / |\\mathcal{A}|$, where $\\mathbf{1}$ is the one-vector $(1,1,\\ldots,1)$, $\\langle \\cdot, \\cdot \\rangle$ is the inner product, and $|\\mathcal{A}|$ is the number of actions, or \n\n$$\\delta_m(a) = \\Delta_m(a) - \\frac{1}{|\\mathcal{A}|} \\sum_{a' \\in \\mathcal{A}} \\Delta_m(a') = Q^{\\pi}(m,a) - \\frac{1}{|\\mathcal{A}|} \\sum_{a' \\in \\mathcal{A}} Q^{\\pi}(m,a'). \\qquad (2)$$ \n\nFor sufficiently small $\\epsilon_m$, an improved policy $\\pi'(a|m)$ may be obtained by the increments \n\n$$\\pi'(a|m) = \\pi(a|m) + \\epsilon_m \\delta_m(a). \\qquad (3)$$ \n\nIn practice, we also enforce $\\pi'(a|m) \\geq p_{\\min}$ for all $a$ and $m$ to guarantee continued exploration. The original JSJ algorithm prescribed using $\\Delta_m(a)$ in place of $\\delta_m(a)$ in equation (3), followed by renormalization [2].  
Our method has the advantage that a given value of $\\Delta$ yields the same increment regardless of the current value of the policy, and it ensures that the step is in the correct direction. We also do not require the differential-reward value estimate, $V^{\\pi}$. \n\n4  Q-EVALUATION \n\nAs the POMDP is simulated under a fixed stochastic policy $\\pi$, every occurrence of an observation-action pair $(m,a)$ begins a sequence of rewards which can be used to estimate $Q^{\\pi}(m,a)$. Exploiting the fact that the $Q^{\\pi}(m,a)$ are defined as sums, the JSJ Q-evaluation method recursively averages the estimates from all such sequences using a so-called \"every-visit\" Monte-Carlo method. In order to reduce the bias and variance caused by the dependence of the evaluation sequences, a factor $\\beta$ is used to discount their shared \"tails\". Specifically, at time $t$ the learner makes observation $m_t$, takes action $a_t$, and obtains reward $r_t$. The number of visits $K(m_t,a_t)$ is incremented, the tail discount rate is taken to be $\\gamma(m,a) = 1 - K(m,a)^{-1/4}$, and the following updates are performed (the indicator function $\\chi_t(m,a)$ is 1 if $(m,a) = (m_t,a_t)$ and 0 otherwise). \n\n$$\\beta(m,a) = \\left[1 - \\frac{\\chi_t(m,a)}{K(m,a)}\\right] \\gamma(m,a) \\beta(m,a) + \\frac{\\chi_t(m,a)}{K(m,a)} \\qquad \\text{(tail discount factor)} \\qquad (4)$$ \n\n$$Q(m,a) = \\left[1 - \\frac{\\chi_t(m,a)}{K(m,a)}\\right] Q(m,a) + \\beta(m,a) [r_t - R] \\qquad (5)$$ \n\n$$C(m,a) = \\left[1 - \\frac{\\chi_t(m,a)}{K(m,a)}\\right] C(m,a) + \\beta(m,a) \\qquad \\text{(cumulative discount effect)} \\qquad (6)$$ \n\n$$R = (1 - 1/t) R + (1/t) r_t \\qquad \\text{($R^{\\pi}$-estimate)} \\qquad (7)$$ \n\n$$Q(m,a) = Q(m,a) - C(m,a) [R - R_{old}]; \\quad R_{old} = R \\qquad \\text{($Q^{\\pi}$-estimate correction)} \\qquad (8)$$ \n\nOther schedules for $\\gamma(m,a)$ are possible (see [2]), and the correction provided by (8) need not be performed at every step, but can be delayed until the $Q^{\\pi}$-estimate is needed. 
\n\nThis evaluation method can be used as given for a policy-iteration type algorithm in which independent $T$-step evaluations of $Q^{\\pi}$ are interspersed with policy improvements as prescribed in section 3. However, an online version of the algorithm which performs policy improvement after every step requires that old experience be gradually \"forgotten\" so that the $Q^{\\pi}$-estimate can respond to more recent experience. To achieve this, we multiply the previous estimates of $\\beta$, $Q$, and $C$ at each timestep by a \"decay\" factor $\\alpha$, $0 < \\alpha < 1$, before they are updated via equations (4)-(6), and replace equation (7) by \n\n$$R = \\alpha (1 - 1/t) R + [1 - \\alpha (1 - 1/t)] r_t. \\qquad (9)$$ \n\nAn alternative method, which also works reasonably well, is to multiply $K$ and $t$ by $\\alpha$ at each timestep instead. \n\nFigure 1: (a) Schematic of confounded two-state POMDP, (b) evolution of the $R^{\\pi}$-estimate, and (c) evolution of $\\pi(A)$ (solid) and $\\pi(B)$ (dashed) for $\\epsilon = 0.0002$, $\\alpha = 0.9995$. Panels (b) and (c) are plotted against the number of iterations. \n\n5  EMPIRICAL RESULTS \n\nWe present only results from single runs of our online algorithm, including the modified JSJ policy improvement and Q-evaluation procedures described above. Results from the policy iteration version are qualitatively similar, and statistics computed over multiple runs verify that those shown are representative of the algorithm's behavior.  
To simplify the presentation, we fix a constant learning rate, $\\epsilon$, and decay factor, $\\alpha$, for each problem, and we use $p_{\\min} = 0.02$ throughout. Note, however, that appropriate schedules or online heuristics for decreasing $\\epsilon$ and $p_{\\min}$ while increasing $\\alpha$ would improve performance and are necessary to ensure convergence. Except for the first problem, we choose the initial policy $\\pi$ to be uniform. In the last two problems, values of $\\pi(a|m) < 0.03$ are rounded down to zero, with renormalization, before the learned policy is evaluated. \n\n5.1  CONFOUNDED TWO-STATE PROBLEM \n\nThe two-state MDP diagrammed in Figure 1(a) becomes a POMDP when the two states are confounded into a single observation. The learner may take action A or B, and receives a reward of either +1 or -1; the state transition is deterministic, as indicated in the diagram. Note that either stationary deterministic policy results in $R^{\\pi} = -1$, whereas the optimal stochastic policy assigns each action the probability 1/2, resulting in $R^{\\pi} = 0$. The evolution of the $R^{\\pi}$-estimate and policy, starting from the initial policy $\\pi(A) = 0.1$ and $\\pi(B) = 0.9$, is shown in Figure 1. Clearly the learned policy approaches the optimal stochastic policy $\\pi = (1/2, 1/2)$. \n\n5.2  MATRIX GAME: SCISSORS-PAPER-STONE-GLASS-WATER \n\nScissors-Paper-Stone-Glass-Water (SPSGW), an extension of the well-known Scissors-Paper-Stone, is a symmetric zero-sum matrix game in which the learner selects a row $i$, the opponent selects a column $j$, and the learner's payoff is determined by the matrix entry $M(i,j)$. A game-theoretic solution is a stochastic (or \"mixed\") policy which guarantees the learner an expected payoff of at least zero.  
It can be shown using linear programming that the unique optimal strategy for SPSGW, yielding $R^{\\pi} = 0$, is to play stone and water with probability 1/3, and to play scissors, paper, and glass with probability 1/9 [7]. Any stationary deterministic policy results in $R^{\\pi} = -1$, since the opponent eventually learns to anticipate the learner's choice and exploit it. \n\nFigure 2: (a) Diagram of Scissors-Paper-Stone-Glass-Water, (b) the payoff matrix, (c) evolution of the $R^{\\pi}$-estimate, and (d) evolution of $\\pi$(stone) and $\\pi$(water) (solid) and $\\pi$(scissors), $\\pi$(paper), and $\\pi$(glass) (dashed) for $\\epsilon = 0.00005$, $\\alpha = 0.9995$. Panels (c) and (d) are plotted against the number of iterations. \n\nIn formulating SPSGW as a POMDP, it is necessary to include in the state sufficient information to allow the opponent to exploit any sub-optimal strategy. We thus choose as states the learner's past action frequencies, multiplied at each timestep by the decay factor, $\\alpha$. There is only one observation, and the learner acts by selecting the \"row\" scissors, paper, stone, glass or water, producing a deterministic state transition. The simulated opponent plays the column which maximizes its expected payoff against the estimate of the learner's strategy obtained from the state.  
The learner's reward is then obtained from the appropriate entry of the payoff matrix. The policy $\\pi = (0.1124, 0.1033, 0.3350, 0.1117, 0.3376)$ learned after 50,000 iterations (see Figure 2) is very close to the optimal policy $\\pi^* = (1/9, 1/9, 1/3, 1/9, 1/3)$. \n\n5.3  PARR AND RUSSELL'S GRID WORLD \n\nParr and Russell's grid world [8] consists of 11 states in a 4x3 grid with a single obstacle as shown in Figure 3(a). The learner senses only walls to its immediate east or west and whether it is in the goal state (upper right corner) or penalty state (directly below the goal), resulting in the 6 possible observations (0-3, G and P) indicated in the diagram. The available actions are to move N, E, S, or W, but there is a probability 0.1 of slipping to either side and only 0.8 of moving in the desired direction; a movement into a wall results in bouncing back to the original state. The learner receives a reward of +1 for a transition into the goal state, -1 for a transition into the penalty state, and -0.04 for all other transitions. The goal and penalty states are connected to a cost-free absorbing state; when the learner reaches either of them it is teleported immediately to a new start state chosen with uniform probability. \n\nThe results are shown in Figure 3. A separate $10^6$-step evaluation of the final learned policy resulted in $R^{\\pi} = 0.047$. In contrast, the optimal deterministic policy indicated by arrows in Figure 3(a) yields $R^{\\pi} = 0.024$ [5], while Parr and Russell's memory-based SPOVA-RL algorithm achieved $R^{\\pi} = 0.12$ after learning for 400,000 iterations [8]. \n\n5.4  MULTI-SERVER QUEUE \n\nAt each timestep, an arriving job having type 1, 2, or 3 with probability 1/2, 1/3 or 1/6, respectively, must be assigned to server A, B or C; see Figure 4(a).  
Each server is optimized for a particular job type which it can complete in an expected time of 2.6 timesteps, while the other job types require 50% longer. All jobs in a server's queue are handled in parallel, up to a capacity of 10 for each server; they finish with probability $1/f$ at each timestep, where $f$ is the product of the expected time for the job and the number of jobs in the server's queue. The states for this POMDP are all combinations of waiting jobs and server occupancies of the three job types, but the learner's observation is restricted to the type of the waiting job. The state transition is obtained by removing all jobs which have finished and adding the waiting job to the chosen server if it has space available. The reward is +1 if the job is successfully placed, or 0 if it is dropped. \n\nFigure 3: (a) Parr and Russell's grid world, with observations shown in lower right corners and the optimal deterministic memoryless policy represented by arrows, (b) evolution of the $R^{\\pi}$-estimate, and (c) the resulting learned policy (observations 0-3 across columns, actions N, E, S, W down rows) for $\\epsilon = 0.02$, $\\alpha = 0.9999$. Panel (b) is plotted against the number of iterations. The legible entries of the learned policy $\\pi(a|m)$ in panel (c) are: \nN: 0.91  0.02  0.36  0.52 \nE: ...  0.21  0.60  0.18 \nS: ...  0.34  0.02  0.11 \nW: 0.02  0.43  0.02  0.19 \n\nThe results are shown in Figure 4. A separate $10^6$-step evaluation of the learned policy obtained $R^{\\pi} = 0.95$, corresponding to 95% success in placing jobs. 
In contrast, the optimal deterministic policy, which assigns each job to the server optimized for it, attained only 87% success. Thus the learned policy more than halves the drop rate, from 13% to 5%. \n\n6  CONCLUSION \n\nOur online version of an algorithm proposed by Jaakkola, Singh and Jordan efficiently learns a stochastic memoryless policy which is either provably optimal or at least superior to any deterministic memoryless policy for each of four test problems. Many enhancements are possible, including appropriate learning schedules to improve performance and ensure convergence, estimation of the time between observation-action visits to obtain better discount rates $\\gamma$ and thereby enhance $Q^{\\pi}$-estimate bias and variance reduction (see [2]), and multiple starts or simulated annealing to avoid local minima. In addition, observations could be extended to include some past history when appropriate. \n\nMost POMDP algorithms use memory and attempt to learn an optimal deterministic policy based on belief states. The stochastic memoryless policies learned by the JSJ algorithm may not always be as good, but they are simpler to act upon and can adapt smoothly in non-stationary environments. Moreover, because it searches the space of stochastic policies, the JSJ algorithm has the potential to find the optimal memoryless policy. These considerations, along with the success of our simple implementation, suggest that this algorithm may be a viable candidate for solving real-world POMDPs, including distributed control or network admission and routing problems in which the numbers of states are enormous and complete state information may be difficult to obtain or estimate in a timely manner. 
\n\nFigure 4: (a) Schematic of the multi-server queue, (b) evolution of the $R^{\\pi}$-estimate, and (c) the resulting learned policy (observations 1, 2, 3 across columns, actions A, B, C down rows) for $\\epsilon = 0.005$, $\\alpha = 0.9999$. Panel (a) shows a job arrival of type 1, 2, or 3 routed to Server A with $T_A = (2.6, 3.9, 3.9)$, Server B with $T_B = (3.9, 2.6, 3.9)$, or Server C with $T_C = (3.9, 3.9, 2.6)$. Panel (b) is plotted against the number of iterations. The learned policy $\\pi(a|m)$ in panel (c) is: \nA: 0.73  0.02  0.02 \nB: 0.02  0.96  0.09 \nC: 0.25  0.02  0.89 \n\nAcknowledgements \n\nWe would like to thank Mike Mozer and Tim Brown for helpful discussions. Satinder Singh was funded by NSF grant IIS-9711753. \n\nReferences \n\n[1] Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence. \n\n[2] Jaakkola, T., Singh, S. P., and Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems 7. \n\n[3] Littman, M., Cassandra, A., and Kaelbling, L. (1995). Learning policies for partially observable environments: Scaling up. In Proceedings of the Twelfth International Conference on Machine Learning. \n\n[4] Littman, M. L. (1994). Memoryless policies: Theoretical limitations and practical results. In Proceedings of the Third International Conference on Simulation of Adaptive Behavior: From Animals to Animats. \n\n[5] Loch, J., and Singh, S. P. (1998). Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes.  
In Machine Learning: Proceedings of the Fifteenth International Conference. \n\n[6] Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. In Annals of Operations Research, 28. \n\n[7] Morris, P. (1994). Introduction to Game Theory. Springer-Verlag, New York. \n\n[8] Parr, R. and Russell, S. (1995). Approximating optimal policies for partially observable stochastic domains. In Proceedings of the International Joint Conference on Artificial Intelligence. \n\n[9] Singh, S. P., Jaakkola, T., and Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Machine Learning: Proceedings of the Eleventh International Conference. \n\n[10] Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.", "award": [], "sourceid": 1509, "authors": [{"given_name": "John", "family_name": "Williams", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}