{"title": "Monte Carlo POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1064, "page_last": 1070, "abstract": null, "full_text": "Monte Carlo POMDPs \n\nSebastian Thrun \n\nSchool of Computer Science \nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nAbstract \n\nWe present a Monte Carlo algorithm for learning to act in partially observable \nMarkov decision processes (POMDPs) with real-valued state and action spaces. \nOur approach uses importance sampling for representing beliefs, and Monte Carlo \napproximation for belief propagation. A reinforcement learning algorithm, value \niteration, is employed to learn value functions over belief states. Finally, a sample(cid:173)\nbased version of nearest neighbor is used to generalize across states. \nInitial \nempirical results suggest that our approach works well in practical applications. \n\n1 Introduction \nPOMDPs address the problem of acting optimally in partially observable dynamic environ(cid:173)\nment [6]. In POMDPs, a learner interacts with a stochastic environment whose state is only \npartially observable. Actions change the state of the environment and lead to numerical \npenalties/rewards, which may be observed with an unknown temporal delay. The learner's \ngoal is to devise a policy for action selection that maximizes the reward. Obviously, the \nPOMDP framework embraces a large range of practical problems. \nPast work has predominately studied POMDPs in discrete worlds [1]. Discrete worlds have \nthe advantage that distributions over states (so-called \"belief states\") can be represented \nexactly, using one parameter per state. The optimal value function (for finite planning \nhorizon) has been shown to be convex and piecewise linear [lO, 14], which makes it \npossible to derive exact solutions for discrete POMDPs. 
\nHere we are interested in POMDPs with continuous state and action spaces, paying tribute \nto the fact that a large number of real-world problems are continuous in nature. In general, \nsuch POMDPs are not solvable exactly, and little is known about special cases that can be \nsolved. This paper proposes an approximate approach, the MC-POMDP algorithm, which \ncan accommodate real-valued spaces and models. The central idea is to use Monte Carlo \nsampling for belief representation and propagation. Reinforcement learning in belief space \nis employed to learn value functions, using a sample-based version of nearest neighbor \nfor generalization. Empirical results illustrate that our approach finds to close-to-optimal \nsolutions efficiently. \n\n2 Monte Carlo POMDPs \n2.1 Preliminaries \nPOMDPs address the problem of selection actions in stationary, partially observable, con(cid:173)\ntrollable Markov chains. To establish the basic vocabulary, let us define: \n\n\u2022 State. At any point in time, the world is in a specific state, denoted by x. \n\n\fMonte Carlo POMDPs \n\n1065 \n\n\u2022 Action. The agent can execute actions, denoted a. \n\u2022 Observation. Through its sensors, the agent can observe a (noisy) projection of the \n\nworld's state. We use 0 to denote observations. \n\n\u2022 Reward. Additionally, the agent receives rewards/penalties, denoted R E ~. To \nsimplify the notation, we assume that the reward is part of the observation. More \nspecifically, we will use R( 0) to denote the function that \"extracts\" the reward from \nthe observation. \n\nThroughout this paper, we use the subscript t to refer to a specific point in time (e.g., St \nrefers to the state at time t). \nPOMDPs are characterized by three probability distributions: \n\ntime t = O. \n\n1. The initial distribution, 7r( x) := Pr( xo), specifies the initial distribution of states at \n2. The next state distribution, p(x' I a,x) := Pr(xt = x' I at-I = a,Xt-l = x), \n3. 
The perceptual distribution, v( 0 Ix) := Pr( 0t = 0 I Xt = x), describes the likeli(cid:173)\n\ndescribes the likelihood that action a, when executed at state x, leads to state x'. \n\nhood of observing 0 when the world is in state x. \n\nA history is a sequence of states and observations. For simplicity, we assume that actions \nand observations are alternated. We use dt to denote the history leading up to time t: \n\ndt \n\n{Ot,at-l,Ot-l,at-2, ... ,ao,00} \n\n(1) \n\nThe fundamental problem in POMDPs is to devise a policy for action selection that maxi(cid:173)\nmizes reward. A policy, denoted \n\n(T \n\n: d--+a \n\n(2) \nis a mapping from histories to actions. Assuming that actions are chosen by a policy (T, \neach policy induces an expected cumulative (and possibly discounted by a discount factor \n, :::; 1) reward, defined as \n\n00 \n\nJ<7 = L E [,T R(OT)] \n\nT=O \n\n(3) \n\nHere E[ ] denotes the mathematical expectation. The POMDP problem is, thus, to find a \npolicy (T* that maximizes r, i.e., \n\n(T* = argmax J<7 \n\n<7 \n\n(4) \n\n(6) \n\n(7) \n(8) \n\n(9) \n\n(10) \n\n2.2 Belief States \nTo avoid the difficulty of learning a function with unbounded input (the history can be \narbitrarily long), it is common practice to map histories into belief states, and learn a \nmapping from belief states to actions instead [10]. \nFormally, a belief state (denoted e) is a probability distribution over states conditioned on \npast actions and observations: \n\nPr(xt I dt} = Pr(xt lOt, at-I,\"\" 00) \n\n(5) \nBelief are computed incrementally, using knowledge of the POMDP's defining distributions \n7r, p, and v. 
Initially,

    θ_0 = π   (6)

For t ≥ 0, we obtain

    θ_{t+1} = Pr(x_{t+1} | o_{t+1}, a_t, ..., o_0)   (7)
            = α Pr(o_{t+1} | x_{t+1}, ..., o_0) Pr(x_{t+1} | a_t, ..., o_0)   (8)
            = α Pr(o_{t+1} | x_{t+1}) ∫ Pr(x_{t+1} | a_t, ..., o_0, x_t) Pr(x_t | a_t, ..., o_0) dx_t   (9)
            = α Pr(o_{t+1} | x_{t+1}) ∫ Pr(x_{t+1} | a_t, x_t) θ_t dx_t   (10)

Figure 1: Sampling: (a) likelihood-weighted sampling and (b) importance sampling. At the bottom of each graph, samples are shown that approximate the function f shown at the top. The height of the samples illustrates their importance factors.

Here α denotes a constant normalizer. The derivations of (8) and (10) follow directly from the fact that the environment is a stationary Markov chain, for which future states and observations are conditionally independent from past ones given knowledge of the state. Equation (9) is obtained using the theorem of total probability.

Armed with the notion of belief states, the policy is now a mapping from belief states (instead of histories) to actions:

    σ : θ → a   (11)

The legitimacy of conditioning σ on θ, instead of d, follows directly from the fact that the environment is Markov, which implies that θ is all one needs to know about the past to make optimal decisions.

2.3 Sample Representations

Thus far, we intentionally left open how belief states θ are represented. In prior work, state spaces have been discrete. In discrete worlds, beliefs can be represented by a collection of probabilities (one for each state), hence, beliefs can be represented exactly.
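For the discrete case just described, the incremental update (6)-(10) reduces to finite sums followed by one normalization. A minimal sketch in Python; the array layouts chosen for p and ν are assumptions made for illustration, not part of the paper:

```python
import numpy as np

def discrete_belief_update(belief, a, o, p, v):
    """One step of the belief update (6)-(10) over a finite state space.

    belief : belief[x]   = Pr(x_t = x | d_t)
    p      : p[a][x, x'] = Pr(x_{t+1} = x' | a_t = a, x_t = x)
    v      : v[x, o]     = Pr(o_{t+1} = o | x_{t+1} = x)
    """
    predicted = belief @ p[a]            # the integral in (10) becomes a sum over x_t
    posterior = v[:, o] * predicted      # weight by the observation likelihood
    return posterior / posterior.sum()   # alpha: normalize to a distribution
```

The same propagate-weight-normalize pattern reappears below in sample-based form.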
Here we are interested in real-valued state spaces. In general, probability distributions over real-valued spaces possess infinitely many dimensions, hence cannot be represented on a digital computer.

The key idea is to represent belief states by sets of (weighted) samples drawn from the belief distribution. Figure 1 illustrates two popular schemes for sample-based approximation: likelihood-weighted sampling, in which samples (shown at the bottom of Figure 1a) are drawn directly from the target distribution (labeled f in Figure 1a), and importance sampling, where samples are drawn from some other distribution, such as the curve labeled g in Figure 1b. In the latter case, samples x are annotated by a numerical importance factor

    p(x) = f(x) / g(x)   (12)

to account for the difference in the sampling distribution, g, and the target distribution, f (the height of the bars in Figure 1b illustrates the importance factors). Importance sampling requires that f(x) > 0 implies g(x) > 0, which will be the case throughout this paper. Obviously, both sampling methods generate approximations only. Under mild assumptions, they converge to the target distribution at a rate of 1/√N, with N denoting the sample set size [16].

In the context of POMDPs, the use of sample-based representations gives rise to the following algorithm for approximate belief propagation (c.f., Equation (10)):

Algorithm particle_filter(θ_t, a_t, o_{t+1}):
    θ_{t+1} = ∅
    do N times:
        draw random state x_t from θ_t
        sample x_{t+1} according to p(x_{t+1} | a_t, x_t)
        set importance factor p(x_{t+1}) = ν(o_{t+1} | x_{t+1})
        add ⟨x_{t+1}, p(x_{t+1})⟩ to θ_{t+1}
    normalize all p(x_{t+1}) ∈ θ_{t+1} so that Σ p(x_{t+1}) = 1
    return θ_{t+1}

This algorithm converges to (10) for arbitrary models p, ν, and π, and arbitrary belief distributions θ, defined over discrete, continuous, or mixed continuous-discrete state and action spaces.
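A direct Python transcription of particle_filter; sample_next_state and obs_likelihood are assumed stand-ins for the models p and ν, whose concrete interface the paper does not specify:

```python
import random

def particle_filter(theta_t, a_t, o_next, sample_next_state, obs_likelihood, N=1000):
    """Sample-based belief propagation (Section 2.3).

    theta_t represents the belief as a list of (state, weight) pairs;
    sample_next_state(a, x) draws x' ~ p(x' | a, x);
    obs_likelihood(o, x') evaluates v(o | x').
    """
    states = [x for x, _ in theta_t]
    weights = [w for _, w in theta_t]
    particles = []
    for _ in range(N):
        x_t = random.choices(states, weights=weights)[0]            # draw from theta_t
        x_next = sample_next_state(a_t, x_t)                        # propagate through p
        particles.append((x_next, obs_likelihood(o_next, x_next)))  # importance factor
    total = sum(w for _, w in particles)
    return [(x, w / total) for x, w in particles]                   # normalize weights
```

Note that if every observation likelihood is zero the normalization fails; this degenerate case is not discussed in the paper.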
The particle filter has, with minor modifications, been proposed under names like particle filters [13], condensation algorithm [5], survival of the fittest [8], and, in the context of robotics, Monte Carlo localization [4].

2.4 Projection

In conventional planning, the result of applying an action a_t at a state x_t is a distribution Pr(x_{t+1}, R_{t+1} | a_t, x_t) over states x_{t+1} and rewards R_{t+1} at the next time step. This operation is called projection. In POMDPs, the state x_t is unknown. Instead, one has to compute the result of applying action a_t to a belief state θ_t. The result is a distribution Pr(θ_{t+1}, R_{t+1} | a_t, θ_t) over belief states θ_{t+1} and rewards R_{t+1}. Since belief states themselves are distributions, the result of a projection in POMDPs is, technically, a distribution over distributions.

The projection algorithm is derived as follows. Using total probability, we obtain:

    Pr(θ_{t+1}, R_{t+1} | a_t, θ_t) = Pr(θ_{t+1}, R_{t+1} | a_t, d_t)   (13)
        = ∫ Pr(θ_{t+1}, R_{t+1} | o_{t+1}, a_t, d_t) Pr(o_{t+1} | a_t, d_t) do_{t+1}   (14)
                     (*)                                   (**)

The term (*) has already been derived in the previous section (c.f., Equation (10)), under the observation that the reward R_{t+1} is trivially computed from the observation o_{t+1}. The second term, (**), is obtained by integrating out the unknown variables, x_{t+1} and x_t, and by once again exploiting the Markov property:

    Pr(o_{t+1} | a_t, d_t) = ∫ Pr(o_{t+1} | x_{t+1}) Pr(x_{t+1} | a_t, d_t) dx_{t+1}   (15)
        = ∫ Pr(o_{t+1} | x_{t+1}) ∫ Pr(x_{t+1} | x_t, a_t) Pr(x_t | d_t) dx_t dx_{t+1}   (16)
        = ∫ ν(o_{t+1} | x_{t+1}) ∫ p(x_{t+1} | x_t, a_t) θ_t(x_t) dx_t dx_{t+1}   (17)

This leads to the following approximate algorithm for projecting belief states. In the spirit of this paper, our approach uses Monte Carlo integration instead of exact integration. It represents distributions (and distributions over distributions) by samples drawn from such distributions.
Algorithm particle_projection(θ_t, a_t):
    Θ_t = ∅
    do N times:
        draw random state x_t from θ_t
        sample a next state x_{t+1} according to p(x_{t+1} | a_t, x_t)
        sample an observation o_{t+1} according to ν(o_{t+1} | x_{t+1})
        compute θ_{t+1} = particle_filter(θ_t, a_t, o_{t+1})
        add ⟨θ_{t+1}, R(o_{t+1})⟩ to Θ_t
    return Θ_t

The result of this algorithm, Θ_t, is a sample set of belief states θ_{t+1} and rewards R_{t+1}, drawn from the desired distribution Pr(θ_{t+1}, R_{t+1} | θ_t, a_t). As N → ∞, Θ_t converges with probability 1 to the true posterior [16].

2.5 Learning Value Functions

Following the rich literature on reinforcement learning [7, 15], our approach solves the POMDP problem by value iteration in belief space. More specifically, our approach recursively learns a value function Q over belief states and actions, by backing up values from subsequent belief states:

    Q(θ_t, a_t) ← E[R(o_{t+1}) + γ max_a Q(θ_{t+1}, a)]   (18)

Leaving open (for a moment) how Q is represented, it is easy to see how the algorithm particle_projection can be applied to compute a Monte Carlo approximation of the right-hand side expression: Given a belief state θ_t and an action a_t, particle_projection computes a sample of R(o_{t+1}) and θ_{t+1}, from which the expected value on the right-hand side of (18) can be approximated.

It has been shown [2] that if both sides of (18) are equal, the greedy policy

    σ_Q(θ) = argmax_a Q(θ, a)   (19)

is optimal, i.e., σ* = σ_Q. Furthermore, it has been shown (for the discrete case!) that repetitive application of (18) leads to an optimal value function and, thus, to the optimal policy [17, 3].

Our approach essentially performs model-based reinforcement learning in belief space using approximate sample-based representations. This makes it possible to apply a rich bag of tricks found in the literature on MDPs.
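The Monte Carlo approximation of the backup (18) amounts to averaging over the (θ_{t+1}, reward) samples returned by particle_projection. A sketch under assumed interfaces (Q, the action set, and the projection routine are passed in explicitly; this calling convention is not taken from the paper):

```python
def mc_backup(Q, theta_t, a_t, actions, particle_projection, gamma=0.95):
    """Monte Carlo approximation of the right-hand side of (18):
    average reward plus discounted best successor value over the
    (theta_next, reward) samples returned by particle_projection."""
    samples = particle_projection(theta_t, a_t)
    total = 0.0
    for theta_next, r in samples:
        total += r + gamma * max(Q(theta_next, a) for a in actions)
    return total / len(samples)
```

Repeating this backup over stored belief states, in the order chosen by the exploration scheme, yields the value iteration loop described in the text.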
In our experiments below, we use on-line reinforcement learning with counter-based exploration and experience replay [9] to determine the order in which belief states are updated.

2.6 Nearest Neighbor

We now return to the issue of how to represent Q. Since we are operating in real-valued spaces, some sort of function approximation method is called for. However, recall that Q accepts a probability distribution (a sample set) as an input. This makes most existing function approximators (e.g., neural networks) inapplicable.

In our current implementation, nearest neighbor [11] is applied to represent Q. More specifically, our algorithm maintains a set of sample sets θ (belief states) annotated by an action a and a Q-value Q(θ, a). When a new belief state θ' is encountered, its Q-value is obtained by finding the k nearest neighbors in the database, and linearly averaging their Q-values. If there aren't sufficiently many neighbors (within a pre-specified maximum distance), θ' is added to the database; hence, the database grows over time.

Our approach uses KL divergence (relative entropy) as a distance function.¹ Technically, the KL divergence between two continuous distributions is well-defined. When applied to sample sets, however, it cannot be computed. Hence, when evaluating the distance between two different sample sets, our approach maps them into continuous-valued densities using Gaussian kernels, and uses Monte Carlo sampling to approximate the KL divergence between them. This algorithm is a fairly generic extension of nearest neighbors to function approximation in density space, where densities are represented by samples. Space limitations preclude us from providing further detail (see [11, 12]).

3 Experimental Results

Preliminary results have been obtained in two domains, one synthetic and one using a simulator of an RWI B21 robot.
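Before turning to the results, the KL-based distance of Section 2.6 can be sketched for one-dimensional sample sets; the Gaussian kernel bandwidth h and the Monte Carlo sample count n are free parameters that the paper does not fix:

```python
import math
import random

def kernel_density(samples, h):
    """Map a weighted sample set to a density using Gaussian kernels of bandwidth h."""
    z = 1.0 / math.sqrt(2.0 * math.pi * h * h)
    return lambda x: sum(w * z * math.exp(-(x - s) ** 2 / (2.0 * h * h))
                         for s, w in samples)

def kl_distance(samples_f, samples_g, h=0.5, n=500):
    """Monte Carlo estimate of KL(f || g): draw x from f's samples and
    average log f(x)/g(x) under the two kernel densities."""
    f, g = kernel_density(samples_f, h), kernel_density(samples_g, h)
    states = [x for x, _ in samples_f]
    weights = [w for _, w in samples_f]
    xs = random.choices(states, weights=weights, k=n)
    return sum(math.log(f(x) / g(x)) for x in xs) / n
```

As the paper notes, KL divergence is asymmetric and hence not a metric; it serves here only as a nearest-neighbor distance.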
In the synthetic environment (Figure 2a), the agent starts at the lower left corner. Its objective is to reach "heaven," which is either at the upper left corner or the lower right corner. The opposite location is "hell." The agent does not know the location of heaven, but it can ask a "priest" who is located in the upper right corner. Thus, an optimal solution requires the agent to go first to the priest, and then head to heaven. The state space contains a real-valued (coordinates of the agent) and a discrete (location of heaven) component. Both are unobservable: In addition to not knowing the location of heaven, the agent also cannot sense its (real-valued) coordinates. 5% random motion noise is injected at each move. When an agent hits a boundary, it is penalized, but it is also told which boundary it hit (which makes it possible to infer its coordinates along one axis). However, notice that the initial coordinates of the agent are known.

¹ Strictly speaking, KL divergence is not a distance metric, but this is ignored here.

Figure 2: (a) The environment, schematically. (b) Average performance (reward) as a function of training episodes. The black graph corresponds to the smaller environment (25 steps min), the grey graph to the larger environment (50 steps min). (c) Same results, plotted as a function of number of backups (in thousands).

The optimal solution takes approximately 25 steps; thus, a successful POMDP planner must be capable of looking 25 steps ahead. We will use the term "successful policy" to refer to a policy that always leads to heaven, even if the path is suboptimal.
For a policy to be successful, the agent must have learned to first move to the priest (information gathering), and then proceed to the right target location.

Figures 2b&c show performance results, averaged over 13 experiments. The solid (black) curve in both diagrams plots the average cumulative reward J as a function of the number of training episodes (Figure 2b), and as a function of the number of backups (Figure 2c). A successful policy was consistently found after 17 episodes (or 6,150 backups), in all 13 experiments. In our current implementation, 6,150 backups require approximately 29 minutes on a Pentium PC. In some experiments, a successful policy was identified in 6 episodes (less than 1,500 backups or 7 minutes). After a successful policy is found, further learning gradually optimizes the path. To investigate scaling, we doubled the size of the environment (quadrupling the size of the state space), making the optimal solution 50 steps long. The results are depicted by the grey curves in Figures 2b&c. Here a successful policy is consistently found after 33 episodes (10,250 backups, 58 minutes). In some runs, a successful policy is identified after only 14 episodes.

We also applied MC-POMDPs to a robotic locate-and-retrieve task. Here a robot (Figure 3a) is to find and grasp an object somewhere in its vicinity (at floor or table height). The robot's task is to grasp the object using its gripper. It is rewarded for successfully grasping the object, and penalized for unsuccessful grasps or for moving too far away from the object. The state space is continuous in the x and y coordinates, and discrete in the object's height. The robot uses a mono-camera system for object detection; hence, viewing the object from a single location is insufficient for its 3D localization. Moreover, initially the object might not be in sight of the robot's camera, so that the robot must look around first.
In our simulation, we assume a 30% general detection error (false-positive and false-negative), with additional Gaussian noise if the object is detected correctly. The robot's actions include turns (by a variable angle), translations (by a variable distance), and grasps (at one of two legal heights). Robot control is erroneous with a variance of 20% (in x-y space) and 5% (in rotational space). Typical belief states range from uniformly distributed sample sets (initial belief) to samples narrowly focused on a specific x-y-z location.

Figure 3: Find and fetch task: (a) The mobile robot with gripper and camera, holding the target object (experiments are carried out in simulation!), (b) three successful runs (trajectory projected into 2D), and (c) success rate as a function of number of planning steps.

Figure 3c shows the rate of successful grasps as a function of iterations (actions). While initially the robot fails to grasp the object, after approximately 4,000 iterations its performance surpasses 80%. Here the planning time is on the order of 2 hours. However, the robot fails to reach 100%. This is in part because certain initial configurations make it impossible to succeed (e.g., when the object is too close to the maximum allowed distance), and in part because the robot occasionally misses the object by a few centimeters. Figure 3b depicts three successful example trajectories. In all three, the robot initially searches for the object, then moves towards it and grasps it successfully.

4 Discussion

We have presented a Monte Carlo approach for learning how to act in partially observable Markov decision processes (POMDPs). Our approach represents all belief distributions using samples drawn from these distributions.
Reinforcement learning in belief space is applied to learn optimal policies, using a sample-based version of nearest neighbor for generalization. Backups are performed using Monte Carlo sampling. Initial experimental results demonstrate that our approach is applicable to real-valued domains, and that it yields good performance results in environments that are, by POMDP standards, relatively large.

References

[1] AAAI Fall Symposium on POMDPs, 1998. See http://www.cs.duke.edu/~mlittman/talks/pomdp-symposium.html
[2] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
[3] P. Dayan and T. J. Sejnowski. TD(λ) converges with probability 1. 1993.
[4] D. Fox, W. Burgard, F. Dellaert, and S. Thrun. Monte Carlo localization: Efficient position estimation for mobile robots. AAAI-99.
[5] M. Isard and A. Blake. Condensation: conditional density propagation for visual tracking. International Journal of Computer Vision, 1998.
[6] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Submitted for publication, 1997.
[7] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. JAIR, 4, 1996.
[8] K. Kanazawa, D. Koller, and S. J. Russell. Stochastic simulation algorithms for dynamic probabilistic networks. UAI-95.
[9] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8, 1992.
[10] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. ICML-95.
[11] A. W. Moore, C. G. Atkeson, and S. A. Schaal. Locally weighted learning for control. AI Review, 11, 1997.
[12] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. TR 1999-8, Statistics, Stanford University, 1999.
[13] M. Pitt and N. Shephard. Filtering via simulation: auxiliary particle filter. Journal of the American Statistical Association, 1999.
[14] E. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford, 1971.
[15] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[16] M. A. Tanner. Tools for Statistical Inference. Springer Verlag, 1993.
[17] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.
", "award": [], "sourceid": 1772, "authors": [{"given_name": "Sebastian", "family_name": "Thrun", "institution": null}]}