{"title": "The Effect of Eligibility Traces on Finding Optimal Memoryless Policies in Partially Observable Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1010, "page_last": 1016, "abstract": null, "full_text": "The effect of eligibility traces on finding optimal memoryless \n\npolicies in partially observable Markov decision processes \n\nJohn Loch \n\nDepartment of Computer Science \n\nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nloch@cs.colorado.edu \n\nAbstract \n\nAgents acting in the real world are confronted with the problem of \nmaking good decisions with limited knowledge of the environment. \nPartially observable Markov decision processes (POMDPs) model \ndecision problems in which an agent tries to maximize its reward in the \nface of limited sensor feedback. Recent work has shown empirically that \na reinforcement learning (RL) algorithm called Sarsa(A) can efficiently \nfind optimal memoryless policies, which map current observations to \nactions, for POMDP problems (Loch and Singh 1998). The Sarsa(A) \nalgorithm uses a form of short-term memory called an eligibility trace, \nwhich distributes temporally delayed rewards to observation-action \npairs which lead up to the reward. This paper explores the effect of \neligibility traces on the ability of the Sarsa(A) algorithm to find optimal \nmemoryless policies. A variant of Sarsa(A) called k-step truncated \nSarsa(A) is applied to four test problems taken from the recent work of \nLittman, Littman, Cassandra and Kaelbling, Parr and Russell, and \nChrisman. The empirical results show that eligibility traces can be \nsignificantly truncated without affecting the ability of Sarsa(A) to find \noptimal memoryless policies for POMDPs. \n\n1 Introduction \n\nAgents which operate in the real world, such as mobile robots, must use sensors which at \nbest give only partial information about the state of the environment. 
Information about the robot's surroundings is necessarily incomplete due to noisy and/or imperfect sensors, occluded objects, and the inability of the robot to know precisely where it is. Such agent-environment systems can be modeled as partially observable Markov decision processes, or POMDPs (Sondik, 1978). \n\nA variety of algorithms have been developed for solving POMDPs (Lovejoy, 1991). However, most of these techniques do not scale well to problems involving more than a few dozen states, due to the computational complexity of the solution methods (Cassandra, 1994; Littman, 1994). Therefore, finding efficient reinforcement learning methods for solving POMDPs is of great practical interest to the Artificial Intelligence and engineering fields. \n\nRecent work has shown empirically that the Sarsa(λ) algorithm can efficiently find the best deterministic memoryless policy for several POMDP problems from the recent literature (Loch and Singh 1998). The empirical results from Loch and Singh (1998) suggest that eligibility traces are necessary for finding the best or optimal memoryless policy. For this reason, a variant of Sarsa(λ) called k-step truncated Sarsa(λ) is formulated to explore the effect of eligibility traces on the ability of Sarsa(λ) to find the best memoryless policy. \n\nThe main contribution of this paper is to show empirically that a variant of Sarsa(λ) using truncated eligibility traces can find the optimal memoryless policy for several POMDP problems from the literature. Specifically, we show that the k-step truncated Sarsa(λ) method can find the optimal memoryless policy for the four POMDP problems tested when k ≤ 2. \n\n2 Sarsa(λ) 
and POMDPs \n\nAn environment is defined by a finite set of states S, the agent can choose from a finite set of actions A, and the agent's sensors provide it observations from a finite set X. On executing action a ∈ A in state s ∈ S, the agent receives expected reward r_sa and the environment transitions to a state s' ∈ S with probability P^a_ss'. The probability of the agent observing x ∈ X given that the state is s is O(x|s). \n\nA straightforward way to extend RL algorithms to POMDPs is to learn Q-value functions of observation-action pairs, i.e. to simply treat the agent's observations as states. Below we describe the standard Sarsa(λ) algorithm applied to POMDPs. At time step t the Q-value function is denoted Q_t; the eligibility trace function is denoted η_t; and the reward received is denoted r_t. On experiencing transition (x_t, a_t, r_t, x_{t+1}, a_{t+1}), the following updates are performed in order: \n\nη_t(x, a) = γλ η_{t-1}(x, a), for all x ≠ x_t and a ≠ a_t \nη_t(x_t, a_t) = 1 \nQ_{t+1}(x, a) = Q_t(x, a) + α δ_t η_t(x, a), for all x and a \n\nwhere δ_t = r_t + γ Q_t(x_{t+1}, a_{t+1}) - Q_t(x_t, a_t). \n\nFigure 3: Littman et al.'s 89 state office world. Percent successful trials in reaching goal performance as a function of the number of learning steps for 1, 2, 4, and 8-step eligibility traces. \n\n3.4 Parr & Russell's Grid World \n\nParr and Russell's grid world (Parr and Russell 1995) is an agent-environment system with 11 states, 6 observations, and 4 actions. State transitions are stochastic while observations are deterministic. \n\nThe optimal memoryless policy, yielding an average reward per step of 0.024, was found by both the 1-step and 2-step truncated eligibility trace methods (Figure 4). Policies found by the 4-step and 8-step methods were not optimal. This result can be attributed to the sharp eligibility trace cutoff, as this effect was not observed with smoothly decaying eligibility traces. 
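The k-step truncation studied here can be sketched in code. The following is a minimal, hypothetical Python sketch of k-step truncated Sarsa(λ) over observation-action pairs, not the author's original implementation: only the k most recently visited pairs remain eligible, with the trace of a pair visited j steps ago decayed by (γλ)^j. The toy ChainEnv environment and its reset/step interface are illustrative assumptions, not taken from the paper.

```python
import random
from collections import deque, defaultdict

class ChainEnv:
    # Toy 3-observation chain (illustrative, not one of the paper's testbeds):
    # action 1 moves right, action 0 moves left; reward 1 at the right end.
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(self.s + 1, 2) if a == 1 else max(self.s - 1, 0)
        done = self.s == 2
        return self.s, (1.0 if done else 0.0), done

def truncated_sarsa_lambda(env, n_actions, k=2, alpha=0.1, gamma=0.95,
                           lam=0.9, epsilon=0.1, episodes=200):
    # Q-values are kept over observation-action pairs, i.e. observations
    # are treated as if they were states.
    Q = defaultdict(float)

    def pick(obs):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(obs, a)])

    for _ in range(episodes):
        obs = env.reset()
        a = pick(obs)
        recent = deque(maxlen=k)  # trace truncated to the last k steps
        done = False
        while not done:
            obs2, r, done = env.step(a)
            a2 = pick(obs2) if not done else None
            # TD error: delta_t = r_t + gamma Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)
            target = r + (gamma * Q[(obs2, a2)] if not done else 0.0)
            delta = target - Q[(obs, a)]
            # Most recent pair has trace 1; a pair visited j steps ago has
            # trace (gamma * lam) ** j; pairs older than k fall off the deque.
            recent.appendleft((obs, a))
            for j, pair in enumerate(recent):
                Q[pair] += alpha * delta * (gamma * lam) ** j
            obs, a = obs2, a2
    return Q
```

One simplification to note: if the same observation-action pair recurs within the last k steps, this sketch applies two decayed updates (accumulating traces) rather than keeping a single trace value (replacing traces); the two variants coincide whenever no pair repeats within the window.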
\n\n\fEffect of Eligibility Traces on Finding Optimal Memoryless Policies \n\n1015 \n\nI I \n\n.. , \n\nOf \n\nI .os \n1 \nI \nI \n.. , \n\nI\u00b7 \n1 \nI \n,~ \n.. ' \n\n., \n\n.. ' \n\n51 \n\n100 \n\nI so \n......... ., AdIona CIIII 1 ...... \n\nit1l \n\n'sa \n\n2SO \n\n' H \n\n.. ., \n\n+W \n\n$90 \n\nFigure 4: Parr & Russell's Grid World. Average reward per step performance as a \nfunction of the number oflearning steps for 1, 2, 4, and 8-step eligibility traces. \n\n3.5 Discussion \n\nIn all the empirical results presented above, we have shown that the k-step truncated \nSarsa(i-.) algorithm was able to find the best or the optimal deterministic memoryless \npolicy when k=2. \n\nThis result is surprising since it was expected that the length of the eligibility trace \nrequired to find a good or optimal policy would vary widely depending on problem \nspecific factors such as landmark (unique observation) spacing and the delay between \ncritical decisions and rewards. Several additional POMDP problems were formulated in \nan attempt to create a POMDP which would require a k value greater than 2 to find the \noptimal policy. However, for all trial POMDPs tested the optimal memoryless policy \ncould be found with k ~ 2. \n\n4 Conclusions and Future Work \n\nThe ability of the Sarsa(i-.) algorithm and the k-step truncated Sarsa(i-.) algorithm to find \noptimal deterministic memoryless policies for a class of POMDP problems is important \nfor several reasons. For POMDPs with good memoryless policies the Sarsa(i-.) algorithm \nprovides an efficient method for finding the best policy in that space. \n\nIf the performance of the memoryless policy is unsatisfactory, the observation and action \nspaces of the agent can be modified so as to produce an agent with a good memoryless \npolicy. 
The designer of the autonomous system or agent can modify the observation space of the agent by either adding sensors or making finer distinctions in the current sensor values. In addition, the designer can add attributes from past observations into the current observation space. The action space can be modified by adding lower-level actions and by adding new actions to the space. Thus one method for designing a capable agent is to iterate between selecting an observation and action space for the agent, using Sarsa(λ) to find the best memoryless policy in that space, and repeating until satisfactory performance is achieved. \n\nThis suggests a future line of research into how to automate the process of observation and action space selection so as to achieve an acceptable performance level. Other avenues of research include an exploration of the theoretical reasons why Sarsa(λ) and k-step truncated Sarsa(λ) are able to solve POMDPs. In addition, further research needs to be conducted as to why short (k ≤ 2) eligibility traces work well over a wide class of POMDPs. \n\nReferences \n\nCassandra, A. (1994). Optimal policies for partially observable Markov decision processes. Technical Report CS-94-14, Brown University, Department of Computer Science, Providence, RI. \n\nLittman, M. (1994). The Witness Algorithm: Solving partially observable Markov decision processes. Technical Report CS-94-40, Brown University, Department of Computer Science, Providence, RI. \n\nLittman, M., Cassandra, A., & Kaelbling, L. (1995). Learning policies for partially observable environments: Scaling up. In Proceedings of the Twelfth International Conference on Machine Learning, pages 362-370, San Francisco, CA, 1995. Morgan Kaufmann. \n\nLoch, J., & Singh, S. (1998). Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. 
To appear in Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, 1998. Morgan Kaufmann. (Available from http://www.cs.colorado.edu/~baveja/papers.html) \n\nLovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28:47-66. \n\nParr, R., & Russell, S. (1995). Approximating optimal policies for partially observable stochastic domains. In Proceedings of the International Joint Conference on Artificial Intelligence. \n\nSondik, E. J. (1978). The optimal control of partially observable Markov decision processes over the infinite horizon: Discounted costs. Operations Research, 26(2). \n\nSutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216-224, San Mateo, CA. Morgan Kaufmann. \n\nLittman, M. (1994). Memoryless policies: Theoretical limitations and practical results. In From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior, Cambridge, MA. MIT Press. \n", "award": [], "sourceid": 1603, "authors": [{"given_name": "John", "family_name": "Loch", "institution": null}]}