{"title": "Reinforcement Learning Using Approximate Belief States", "book": "Advances in Neural Information Processing Systems", "page_first": 1036, "page_last": 1042, "abstract": null, "full_text": "Reinforcement Learning Using Approximate Belief States \n\nAndres Rodriguez * \nArtificial Intelligence Center \nSRI International \n333 Ravenswood Avenue, Menlo Park, CA 94025 \nrodriguez@ai.sri.com \n\nRonald Parr, Daphne Koller \nComputer Science Department \nStanford University \nStanford, CA 94305 \n{parr,koller}@cs.stanford.edu \n\nAbstract \n\nThe problem of developing good policies for partially observable Markov decision problems (POMDPs) remains one of the most challenging areas of research in stochastic planning. One line of research in this area involves the use of reinforcement learning with belief states, probability distributions over the underlying model states. This is a promising method for small problems, but its application is limited by the intractability of computing or representing a full belief state for large problems. Recent work shows that, in many settings, we can maintain an approximate belief state which is fairly close to the true belief state. In particular, great success has been shown with approximate belief states that marginalize out correlations between state variables. In this paper, we investigate two methods of full belief state reinforcement learning and one novel method for reinforcement learning using factored approximate belief states. We compare the performance of these algorithms on several well-known problems from the literature. Our results demonstrate the importance of approximate belief state representations for large problems. 
\n\n1 Introduction \n\nThe Markov Decision Process (MDP) framework [2] is a good way of mathematically formalizing a large class of sequential decision problems involving an agent that is interacting with an environment. Generally, an MDP is defined in such a way that the agent has complete knowledge of the underlying state of the environment. While this formulation poses very challenging research problems, it is still a very optimistic modeling assumption that is rarely realized in the real world. Most of the time, an agent must face uncertainty or incompleteness in the information available to it. An extension of this formalism that generalizes MDPs to deal with this uncertainty is given by partially observable Markov decision processes (POMDPs) [1, 11], which are the focus of this paper. \n\nSolving a POMDP means finding an optimal behavior policy π* that maps from the agent's available knowledge of the environment, its belief state, to actions. This is usually done through a function, V, that assigns values to belief states. In the fully observable (MDP) case, a value function can be computed efficiently for reasonably sized domains. The situation is somewhat different for POMDPs, where finding the optimal policy is PSPACE-hard in the number of underlying states [6]. To date, the best known exact algorithms to solve POMDPs are taxed by problems with a few dozen states [5]. \n\n* The work presented in this paper was done while the first author was at Stanford University. \n\nThere are several general approaches to approximating POMDP value functions using reinforcement learning methods, and space does not permit a full review of them. The approach upon which we focus is the use of a belief state as a probability distribution over underlying model states. 
This is in contrast to methods that manipulate augmented state descriptions with finite memory [9, 12] and methods that work directly with observations [8]. \n\nThe main advantage of a probability distribution is that it summarizes all of the information necessary to make optimal decisions [1]. The main disadvantages are that a model is required to compute a belief state, and that the task of representing and updating belief states in large problems is itself very difficult. In this paper, we do not address the problem of obtaining a model; our focus is on the most effective way of using a model. Even with a known model, reinforcement learning techniques can be quite competitive with exact methods for solving POMDPs [10]. Hence, we focus on extending the model-based reinforcement learning approach to larger problems through the use of approximate belief states. There are risks to such an approach: inaccuracies introduced by belief state approximation could give an agent a hopelessly inaccurate perception of its relationship to the environment. \n\nRecent work [4], however, presents an approximate tracking approach, and provides theoretical guarantees that the result of this process cannot stray too far from the exact belief state. In this approach, rather than maintaining an exact belief state, which is infeasible in most realistically large problems, we maintain an approximate belief state, usually from some restricted class of distributions. As the approximate belief state is updated (due to actions and observations), it is continuously projected back down into this restricted class. Specifically, we use decomposed belief states, where certain correlations between state variables are ignored. \n\nIn this paper we present empirical results comparing three approaches to belief state reinforcement learning. 
The most direct approach is the use of a neural network with one input for each element of the full belief state. The second is the SPOVA method [10], which uses a function approximator designed for POMDPs, and the third is the use of a neural network with an approximate belief state as input. We present results for several well-known problems in the POMDP literature, demonstrating that while belief state approximation is ill-suited for some problems, it is an effective means of attacking large problems. \n\n2 Basic Framework and Algorithms \n\nA POMDP is defined as a tuple <S, A, O, T, R, O> of three sets and three functions. S is a set of states, A is a set of actions and O is a set of observations. The transition function T : S × A → Π(S) specifies how the actions affect the state of the world. It can be viewed as T(si, a, sj) = P(sj | a, si), the probability that the agent reaches state sj if it currently is in state si and takes action a. The reward function R : S × A → ℝ determines the immediate reward received by the agent. The observation model O : S × A → Π(O) determines what the agent perceives, depending on the environment state and the action taken. O(s, a, o) = P(o | a, s) is the probability that the agent observes o when it is in state s, having taken the action a. \n\n2.1 POMDP belief states \n\nA belief state, b, is defined as a probability distribution over all states s ∈ S, where b(s) represents the probability that the environment is in state s. After taking action a and observing o, the belief state is updated using Bayes rule: \n\nb'(s') = P(s' | a, o, b) = O(s', a, o) Σ_{si∈S} T(si, a, s') b(si) / Σ_{sj∈S} O(sj, a, o) Σ_{si∈S} T(si, a, sj) b(si) \n\nThe size of an exact belief state is equal to the number of states in the model. 
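The exact update above can be written in a few lines of numpy; the array layout and the toy two-state numbers below are our own illustration, not from the paper:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Exact Bayesian belief-state update for a POMDP.

    b: length-|S| probability vector, b[s] = P(state = s).
    T: T[s, a, s2] = P(s2 | s, a), the transition model.
    O: O[s2, a, o] = P(o | a, s2), the observation model.
    Returns the posterior belief after taking action a and observing o.
    """
    predicted = b @ T[:, a, :]          # predict: P(s' | a, b)
    unnorm = O[:, a, o] * predicted     # correct: weight by observation likelihood
    return unnorm / unnorm.sum()        # normalize (the denominator in the update)

# Tiny two-state, one-action, two-observation model (hypothetical numbers).
T = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])   # shape (|S|, |A|, |S|)
O = np.array([[[0.7, 0.3]], [[0.4, 0.6]]])   # shape (|S|, |A|, |O|)
b1 = belief_update(np.array([0.5, 0.5]), a=0, o=0, T=T, O=O)
```

With these numbers the posterior shifts weight toward state 0, and the cost of one update is quadratic in |S|, which is exactly what becomes prohibitive for large models.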
For large problems, maintaining and manipulating an exact belief state can be problematic even if the transition model has a compact representation [4]. For example, suppose the state space is described via a set of random variables X = {X1, ..., Xn}, where each Xi takes on values in some finite domain Val(Xi); a particular state s defines a value xi ∈ Val(Xi) for each variable Xi. The full belief state representation will be exponential in n. We use the approximation method analyzed by Boyen and Koller [4], where the variables are partitioned into a set of disjoint clusters C1, ..., Ck and belief functions b1, ..., bk are maintained over the variables in each cluster. At each time step, we compute the exact belief state, then compute the individual belief functions by marginalizing out inter-cluster correlations. For some assignment, ci, to the variables in Ci, we obtain bi(ci) = Σ_y P(ci, y), where y ranges over assignments to the variables outside Ci. An approximation of the original, full belief state is then reconstructed as b(s) = Π_{i=1..k} bi(ci). \n\nBy representing the belief state as a product of marginal probabilities, we are projecting the belief state into a reduced space. While a full belief state representation for n state variables would be exponential in n, the size of the decomposed belief state representation is exponential in the size of the largest cluster and additive in the number of clusters. For processes that mix rapidly enough, the errors introduced by approximation will stay bounded over time [4]. As discussed by Boyen and Koller [4], this type of decomposed belief state is particularly suitable for processes that can themselves be factored and represented as a dynamic Bayesian network [3]. In such cases we can avoid ever representing an exponentially sized belief state. However, the approach is fully general, and can be applied in any setting where the state is defined as an assignment of values to some set of state variables. 
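The projection and reconstruction steps can be sketched in numpy, storing the joint belief with one array axis per state variable; the function names and cluster encoding are our own simplification:

```python
import numpy as np

def project_to_clusters(b_joint, clusters):
    """Marginalize a joint belief onto disjoint clusters of variables.

    b_joint: n-dimensional array, one axis per state variable Xi.
    clusters: disjoint lists of sorted variable indices C1, ..., Ck.
    Returns one marginal table bi per cluster.
    """
    n = b_joint.ndim
    return [b_joint.sum(axis=tuple(i for i in range(n) if i not in C))
            for C in clusters]

def reconstruct(marginals, clusters, shape):
    """Approximate joint belief as the product of the cluster marginals."""
    b_hat = np.ones(shape)
    for C, m in zip(clusters, marginals):
        # Broadcast each cluster marginal over the axes of the other variables.
        bcast = [shape[i] if i in C else 1 for i in range(len(shape))]
        b_hat = b_hat * m.reshape(bcast)
    return b_hat

# Three binary variables with clusters {X0} and {X1, X2} (our own toy example).
rng = np.random.default_rng(0)
b = rng.random((2, 2, 2)); b /= b.sum()
marginals = project_to_clusters(b, [[0], [1, 2]])
b_hat = reconstruct(marginals, [[0], [1, 2]], (2, 2, 2))
```

The reconstruction preserves every within-cluster marginal of the original belief and remains a proper distribution; only the inter-cluster correlations are discarded, which is precisely the approximation analyzed in [4].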
\n\n2.2 Value functions and policies for POMDPs \n\nIf one thinks of a POMDP as an MDP defined over belief states, then the well-known fixed point equations for MDPs still hold. Specifically, \n\nV*(b) = max_a [ Σ_{s∈S} b(s) R(s, a) + γ Σ_{o∈O} P(o | a, b) V*(b') ] \n\nwhere γ is the discount factor and b' (defined above) is the next belief state. The optimal policy is determined by the maximizing action for each belief state. In principle, we could use Q-learning or value iteration directly to solve POMDPs. The main difficulty lies in the fact that there are uncountably many belief states, making a tabular representation of the value function impossible. \n\nExact methods for POMDPs use the fact that finite horizon value functions are piecewise-linear and convex [11], ensuring a finite representation. While finite, this representation can grow exponentially with the horizon, making exact approaches impractical in most settings. Function approximation is an attractive alternative to exact methods. We implement function approximation using a set of parameterized Q-functions, where Qa(b, Wa) is the reward-to-go for taking action a in belief state b. A value function is reconstructed from the Q-functions as V(b) = max_a Qa(b, Wa), and the update rule for Wa when a transition from state b to b' under action a with reward R is the gradient-based Q-learning update: \n\nWa ← Wa + α [ R + γ max_{a'} Qa'(b', Wa') − Qa(b, Wa) ] ∇_{Wa} Qa(b, Wa) \n\n2.3 Function approximation architectures \n\nWe consider two types of function approximators. The first is a two-layer feedforward neural network with sigmoidal internal units and a linear outermost layer. We used one network for each Q-function. For full belief state reinforcement learning, we used networks with |S| inputs (one for each component of the belief state) and √|S| hidden nodes. 
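A minimal numpy sketch of such a Q network (sigmoid hidden layer, linear output, roughly the square root of the input count as hidden units) together with one step of the update rule for Wa described above; the names, initializations and constants here are our own illustration, not the paper's implementation:

```python
import numpy as np

class QNetwork:
    """Two-layer net: sigmoidal hidden units, linear output (one net per action)."""
    def __init__(self, n_inputs, rng):
        n_hidden = max(1, round(np.sqrt(n_inputs)))
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        self.w2 = rng.normal(scale=0.1, size=n_hidden)

    def forward(self, b):
        h = 1.0 / (1.0 + np.exp(-self.W1 @ b))   # sigmoid hidden activations
        return float(h @ self.w2), h

def td_update(nets, b, a, reward, b_next, alpha=0.1, gamma=0.95):
    """One Q-learning step: move Q_a(b) toward reward + gamma * max_a' Q_a'(b')."""
    q, h = nets[a].forward(b)
    target = reward + gamma * max(net.forward(b_next)[0] for net in nets)
    delta = target - q
    # Gradients of Q_a w.r.t. the weights (chain rule through the sigmoid layer).
    grad_w2 = h
    grad_W1 = np.outer(nets[a].w2 * h * (1.0 - h), b)
    nets[a].w2 = nets[a].w2 + alpha * delta * grad_w2
    nets[a].W1 = nets[a].W1 + alpha * delta * grad_W1

rng = np.random.default_rng(0)
nets = [QNetwork(4, rng) for _ in range(2)]     # a 4-state belief, 2 actions
b, b2 = np.array([0.25] * 4), np.array([0.1, 0.2, 0.3, 0.4])
q_before = nets[0].forward(b)[0]
td_update(nets, a=0, reward=1.0, b_next=b2, b=b)
q_after = nets[0].forward(b)[0]
```

After a positive-reward transition the updated Q value for the taken action moves toward the bootstrapped target, which is the behavior the update rule prescribes.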
For approximate belief state reinforcement learning, we used networks with one input for each assignment to the variables in each cluster. If we had two clusters, for example, each with 3 binary variables, then our Q networks would each have 2^3 + 2^3 = 16 inputs. We kept the number of hidden nodes for each network as the square root of the number of inputs. \n\nOur second function approximator is SPOVA [10], which is a soft-max function designed to exploit the piecewise-linear structure of POMDP value functions. A SPOVA Q-function maintains a set of weight vectors Wa1, ..., Wal, and is evaluated as: \n\nQa(b, Wa) = ( Σ_l (b · Wal)^k )^(1/k) \n\nIn practice, a small value of k (usually 1.2) is adopted at the start of learning, making the function very smooth. This is increased during learning until SPOVA closely approximates a piecewise-linear convex function of b (usually k = 8). We maintained one SPOVA Q-function for each action and assigned √|S| vectors to each function. This gave O(|A| |S| √|S|) parameters to both SPOVA and the full belief state neural network. \n\n3 Empirical Results \n\nWe present results on several problems from the POMDP literature and present an extension to a known machine repair problem that is designed to highlight the effects of approximate belief states. Our results are presented in the form of performance graphs, where the value of the current policy is obtained by taking a snapshot of the value function and measuring the discounted sum of reward obtained by the resulting policy in simulation. We use \"NN\" to refer to the neural network reinforcement learner trained with the full belief state and the term \"Decomposed NN\" to refer to the neural network trained with an approximate belief state which is decomposed as a product of marginals. We used a simple exploration strategy, starting with a 0.1 probability of acting randomly, which decreased linearly to 0.01. \n\nDue to space limitations, we are not able to describe each model in detail. 
However, we used publicly available model description files from [5].1 Table 1 shows the running times of the different methods. These are generally much lower than what would be required to solve these problems using exact methods. \n\n3.1 Grid Worlds \n\nWe begin by considering two grid worlds, a 4 x 3 world from [10] and a 60-state world from [7]. The 4 x 3 world contains only 11 states and does not have a natural decomposition into state variables, so we compared SPOVA only with the full belief state neural network. \n\n1 See http://www.cs.brown.edu/research/ai/pomdp/index.html. Note that this file format specifies a starting distribution for each problem and our results are reported with respect to this starting distribution. \n\nFigure 1: a) 4 x 3 Grid World, b) 60-state maze \n\nThe experimental results, which are averaged over 25 training runs and 100 simulations per policy snapshot, are presented in Figure 1a. They show that SPOVA learns faster than the neural network, but that the network does eventually catch up. \n\nThe 60-state robot navigation problem [7] was amenable to a decomposed belief state approximation since its underlying state space comes from the product of 15 robot positions and 4 robot orientations. We decomposed the belief state with two clusters, one containing a position state variable and the other containing an orientation state variable. Figure 1b shows results in which SPOVA again dominates. 
The decomposed NN has trouble with this problem because the effects of position and orientation on the value function are not easily decoupled, i.e., the effect of orientation on value is highly state-dependent. This meant that the decomposed NN was forced to learn a much more complicated function of its inputs than the function learned by the network using the full belief state. \n\n3.2 Aircraft Identification \n\nAircraft identification is another problem studied in Cassandra's thesis. It includes sensing actions for identifying incoming aircraft and actions for attacking threatening aircraft. Attacks against friendly aircraft are penalized, as are failures to intercept hostile aircraft. This is a challenging problem because there is tension in deciding between the various sensors. Better sensors tend to make the base more visible to hostile aircraft, while more stealthy sensors are less accurate. The sensors give information about both the aircraft's type and distance from the base. \n\nThe state space of this problem is comprised of three main components: aircraft type - either the aircraft is a friend or it is a foe; distance - how far the aircraft currently is from the base, discretized into an adjustable number, d, of distinct distances; and visibility - a measure of how visible the base is to the approaching aircraft, which is discretized into 5 levels. \n\nWe chose d = 10, giving this problem 104 states. The problem has a natural decomposition into state variables for aircraft type, distance and base visibility. The results for the three algorithms are shown in Figure 2(a). This is the first problem where we start to see an advantage from decomposing the belief state. For the decomposed NN, we used three separate clusters, one for each variable, which meant that the network had only 17 inputs. Not only did the simpler network learn faster, but it learned a better policy overall. 
We believe that this illustrates an important point: even though SPOVA and the full belief state neural network may be more expressive than the decomposed NN, the decomposed NN is able to search the space of functions it can represent much more efficiently due to the reduced number of parameters. \n\nFigure 2: a) Aircraft Identification, b) Machine Maintenance \n\n3.3 Machine Maintenance \n\nOur last problem was the machine maintenance problem from Cassandra's database. The problem assumes that there is a machine with a certain number of components. The quality of the parts produced by the machine is determined by the condition of the components. Each component can be in one of four conditions: good - the component is in good condition; fair - the component has some amount of wear, and would benefit from some maintenance; bad - the part is very worn and could use repairs; and broken - the part is broken and must be replaced. The status of the components is observable only if the machine is completely disassembled. \n\nFigure 2(b) shows performance results for the 4-component version of this problem. At 256 states, it was at the maximum size for which a full belief state approach was manageable. However, the belief state for this problem decomposes naturally into clusters describing the status of each component, creating a decomposed belief state with just four components. 
The graph shows the dominance of this simple decomposition approach. We believe that this problem clearly demonstrates the advantage of belief state decomposition: the decomposed NN learns a function of 16 inputs in a fraction of the time it takes for the full net or SPOVA to learn a lower-quality function of 256 inputs. \n\n3.4 Running Times \n\nTable 1 shows the running times for the different problems presented above. These are generally much less than what would be required to solve these problems exactly. The full NN and SPOVA are roughly comparable, but the decomposed neural network is considerably faster. We did not exploit any problem structure in our approximate belief state computation, so the time spent computing belief states is actually larger for the decomposed NN. The savings come from the reduction in the number of parameters used, which reduced the number of partial derivatives computed. We expect the savings to be significantly more substantial for processes represented in a factored way [3], as the approximate belief state propagation algorithm can also take advantage of this additional structure. \n\nProblem       SPOVA     NN        Decomposed NN \n3x4           19.1 s    13.0 s    - \nHallway       32.8 min  47.1 min  3.2 min \nAircraft ID   38.3 min  49.9 min  4.4 min \nMachine M.    2.5 h     2.6 h     4.7 min \n\nTable 1: Run times (in seconds, minutes or hours) for the different algorithms \n\n4 Concluding Remarks \n\nWe have proposed a new approach to belief state reinforcement learning through the use of approximate belief states. Using well-known examples from the POMDP literature, we have compared approximate belief state reinforcement learning with two other methods that use exact belief states. 
Our results demonstrate that, while approximate belief states may not be ideal for tightly coupled problem features, such as the position and orientation of a robot, they are a natural and effective means of addressing some large problems. Even for the medium-sized problems we showed here, approximate belief state reinforcement learning can outperform full belief state reinforcement learning using fewer trials and much less CPU time. For many problems, exact belief state methods will simply be impractical and approximate belief states will provide a tractable alternative. \n\nAcknowledgements \n\nThis work was supported by the ARO under the MURI program \"Integrated Approach to Intelligent Systems,\" by ONR contract N66001-97-C-8554 under DARPA's HPKB program, and by the generosity of the Powell Foundation and the Sloan Foundation. \n\nReferences \n\n[1] K. J. Astrom. Optimal control of Markov decision processes with incomplete state estimation. J. Math. Anal. Applic., 10:174-205, 1965. \n\n[2] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957. \n\n[3] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 1999. \n\n[4] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. UAI, 1998. \n\n[5] A. Cassandra. Exact and approximate algorithms for partially observable Markov decision problems. PhD thesis, Computer Science Dept., Brown Univ., 1998. \n\n[6] M. Littman. Algorithms for Sequential Decision Making. PhD thesis, Computer Science Dept., Brown Univ., 1996. \n\n[7] M. Littman, A. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Proc. ICML, pages 362-370, 1996. \n\n[8] J. Loch and S. Singh. 
Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. In Proc. ICML. Morgan Kaufmann, 1998. \n\n[9] A. R. McCallum. Overcoming incomplete perception with utile distinction memory. In Proc. ICML, pages 190-196, 1993. \n\n[10] R. Parr and S. Russell. Approximating optimal policies for partially observable stochastic domains. In Proc. IJCAI, 1995. \n\n[11] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071-1088, 1973. \n\n[12] M. Wiering and J. Schmidhuber. HQ-learning: Discovering Markovian subgoals for non-Markovian reinforcement learning. Technical report, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, 1996. \n", "award": [], "sourceid": 1667, "authors": [{"given_name": "Andres", "family_name": "Rodriguez", "institution": null}, {"given_name": "Ronald", "family_name": "Parr", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}