{"title": "An Improved Policy Iteration Algorithm for Partially Observable MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1015, "page_last": 1021, "abstract": "", "full_text": "An Improved Policy Iteratioll Algorithm \n\nfor Partially Observable MDPs \n\nComputer Science Department \n\nUniversity of Massachusetts \n\nEric A. Hansen \n\nAmherst, MA 01003 \nhansen@cs.umass.edu \n\nAbstract \n\nA new policy iteration algorithm for partially observable Markov \ndecision processes is presented that is simpler and more efficient than \nan earlier policy iteration algorithm of Sondik (1971,1978). The key \nsimplification is representation of a policy as a finite-state controller. \nThis representation makes policy evaluation straightforward. The pa(cid:173)\nper's contribution is to show that the dynamic-programming update \nused in the policy improvement step can be interpreted as the trans(cid:173)\nformation of a finite-state controller into an improved finite-state con(cid:173)\ntroller. The new algorithm consistently outperforms value iteration \nas an approach to solving infinite-horizon problems. \n\n1 \n\nIntroduction \n\nA partially observable Markov decision process (POMDP) is a generalization of the \nstandard completely observable Markov decision process that allows imperfect infor(cid:173)\nmation about the state of the system. First studied as a model of decision-making in \noperations research, it has recently been used as a framework for decision-theoretic \nplanning and reinforcement learning with hidden state (Monahan, 1982; Cassandra, \nKaelbling, & Littman, 1994; Jaakkola, Singh, & Jordan, 1995). \nValue iteration and policy iteration algorithms for POMDPs were first developed by \nSondik and rely on a piecewise linear and convex representation of the value function \n(Sondik, 1971; Smallwood & Sondik,1973; Sondik, 1978). 
Sondik's policy iteration algorithm has proved to be impractical, however, because its policy evaluation step is extremely complicated and difficult to implement. As a result, almost all subsequent work on dynamic programming for POMDPs has used value iteration. In this paper, we describe an improved policy iteration algorithm for POMDPs that avoids the difficulties of Sondik's algorithm. We show that these difficulties hinge on the choice of a policy representation and can be avoided by representing a policy as a finite-state controller. This representation makes the policy evaluation step easy to implement and efficient. We show that the policy improvement step can be interpreted in a natural way as the transformation of a finite-state controller into an improved finite-state controller. Although it is not always possible to represent an optimal policy for an infinite-horizon POMDP as a finite-state controller, it is always possible to do so when the optimal value function is piecewise linear and convex. Therefore representation of a policy as a finite-state controller is no more limiting than representation of the value function as piecewise linear and convex. In fact, it is the close relationship between representation of a policy as a finite-state controller and representation of a value function as piecewise linear and convex that the new algorithm successfully exploits. \n\nThe paper is organized as follows. Section 2 briefly reviews the POMDP model and Sondik's policy iteration algorithm. Section 3 describes an improved policy iteration algorithm. Section 4 illustrates the algorithm with a simple example and reports a comparison of its performance to value iteration. The paper concludes with a discussion of the significance of this work. 
\n\n2 Background \n\nConsider a discrete-time POMDP with a finite set of states S, a finite set of actions A, and a finite set of observations Θ. Each time period, the system is in some state i ∈ S, an agent chooses an action a ∈ A for which it receives a reward with expected value r_i^a, the system makes a transition to state j ∈ S with probability p_ij^a, and the agent observes θ ∈ Θ with probability q_jθ^a. We assume the performance objective is to maximize expected total discounted reward over an infinite horizon. \n\nAlthough the state of the system cannot be directly observed, the probability that it is in a given state can be calculated. Let π denote a vector of state probabilities, called an information state, where π_i denotes the probability that the system is in state i. If action a is taken in information state π and θ is observed, the successor information state is determined by revising each state probability using Bayes' theorem: π'_j = Σ_{i∈S} π_i p_ij^a q_jθ^a / Σ_{i,j∈S} π_i p_ij^a q_jθ^a. Geometrically, each information state π is a point in the (|S| - 1)-dimensional unit simplex, denoted Π. \n\nIt is well-known that an information state π is a sufficient statistic that summarizes all information about the history of a POMDP necessary for optimal action selection. Therefore a POMDP can be recast as a completely observable MDP with a continuous state space Π and it can be theoretically solved using dynamic programming. The key to practical implementation of a dynamic-programming algorithm is a piecewise-linear and convex representation of the value function. Smallwood and Sondik (1973) show that the dynamic-programming update for POMDPs preserves the piecewise linearity and convexity of the value function. They also show that an optimal value function for a finite-horizon POMDP is always piecewise linear and convex. 
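The Bayes' theorem revision of the information state can be sketched in a few lines of pure Python. This is a minimal illustration, not code from the paper; the table layout (P[a][i][j] for p_ij^a, Q[a][j][theta] for q_jθ^a, with states, actions, and observations indexed by Python values) is an assumption made for the example.

```python
def belief_update(pi, a, theta, P, Q):
    """Return the successor information state pi' after taking action a in
    information state pi and observing theta, via Bayes' theorem.
    P[a][i][j] and Q[a][j][theta] are assumed transition/observation tables."""
    n = len(pi)
    # Unnormalized successor probabilities: sum_i pi_i * p_ij^a * q_j,theta^a
    unnorm = [sum(pi[i] * P[a][i][j] for i in range(n)) * Q[a][j][theta]
              for j in range(n)]
    total = sum(unnorm)  # probability of observing theta; assumed positive
    return [u / total for u in unnorm]
```

The resulting vector again lies in the unit simplex, so repeated application of this update keeps the agent on the continuous state space Π.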
For infinite-horizon POMDPs, Sondik (1978) shows that an optimal value function is sometimes piecewise linear and convex and can be approximated arbitrarily closely by a piecewise linear and convex function otherwise. \n\nA piecewise linear and convex value function V can be represented by a finite set of |S|-dimensional vectors, Γ = {α^0, α^1, ...}, such that V(π) = max_k Σ_{i∈S} π_i α_i^k. A dynamic-programming update transforms a value function V represented by a set Γ of α-vectors into an improved value function V' represented by a set Γ' of α-vectors. Each possible α-vector in Γ' corresponds to choice of an action, and for each possible observation, choice of a successor vector in Γ. Given the combinatorial number of choices that can be made, the maximum number of vectors in Γ' is |A||Γ|^|Θ|. However most of these potential vectors are not needed to define the updated value function and can be pruned. Thus the dynamic-programming update problem is to find a minimal set of vectors Γ' that represents V', given a set of vectors Γ that represents V. Several algorithms for performing this dynamic-programming update have been developed but describing them is beyond the scope of this paper. Any algorithm for performing the dynamic-programming update can be used in the policy improvement step of policy iteration. The algorithm that is presently the fastest is described by Cassandra, Littman, and Zhang (1997). \n\nFor value iteration, it is sufficient to have a representation of the value function because a policy is defined implicitly by the value function, as follows, \n\nδ(π) = a(arg max_k Σ_{i∈S} π_i α_i^k),    (1) \n\nwhere a(k) denotes the action associated with vector α^k. But for policy iteration, a policy must be represented independently of the value function because the policy evaluation step computes the value function of a given policy. 
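The α-vector representation of V and the implicit policy of equation (1) amount to two small maximizations, sketched below in pure Python. The list `actions`, playing the role of the map a(k) from vector index to action, is bookkeeping assumed for the example.

```python
def value(pi, Gamma):
    """V(pi) = max_k sum_i pi_i * alpha_i^k, with Gamma a list of alpha-vectors."""
    return max(sum(p * a for p, a in zip(pi, alpha)) for alpha in Gamma)

def greedy_action(pi, Gamma, actions):
    """delta(pi) of equation (1): the action a(k) of the maximizing alpha-vector."""
    best_k = max(range(len(Gamma)),
                 key=lambda k: sum(p * a for p, a in zip(pi, Gamma[k])))
    return actions[best_k]
```

Executing a policy this way requires maintaining the information state π at run-time, which is one motivation for the explicit controller representation developed next.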
Sondik's choice of a policy representation is influenced by Blackwell's proof that for a continuous-space infinite-horizon MDP, there is a stationary, deterministic Markov policy that is optimal (Blackwell, 1965). Based on this result, Sondik restricts policy space to stationary and deterministic Markov policies that map the continuum of information space Π into action space A. Because it is important for a policy to have a finite representation, Sondik defines an admissible policy as a mapping from a finite number of polyhedral regions of Π to A. Each region is represented by a set of linear inequalities, where each linear inequality corresponds to a boundary of the region. \n\nThis is Sondik's canonical representation of a policy, but his policy iteration algorithm makes use of two other representations. In the policy evaluation step, he converts a policy from this representation to an equivalent, or approximately equivalent, finite-state controller. Although no method is known for computing the value function of a policy represented as a mapping from Π to A, the value function of a finite-state controller can be computed in a straightforward way. In the policy improvement step, Sondik converts a policy represented implicitly by the updated value function and equation (1) back to his canonical representation. The complexity of translating between these different policy representations - especially in the policy evaluation step - makes Sondik's policy iteration algorithm difficult to implement and explains why it is not used in practice. \n\n3 Algorithm \n\nWe now show that policy iteration for POMDPs can be simplified - both conceptually and computationally - by using a single representation of a policy as a finite-state controller. \n\n3.1 Policy evaluation \n\nAs Sondik recognized, policy evaluation is straightforward when a policy is represented as a finite-state controller. 
An α-vector representation of the value function of a finite-state controller is computed by solving the system of linear equations, \n\nα_i^k = r_i^a(k) + β Σ_{j,θ} p_ij^a(k) q_jθ^a(k) α_j^s(k,θ),    (2) \n\nwhere k is an index of a state of the finite-state controller, a(k) is the action associated with machine state k, and s(k,θ) is the index of the successor machine state if θ is observed. This value function is convex as well as piecewise linear because the expected value of an information state is determined by assuming the controller is started in the machine state that optimizes it. \n\n1. Specify an initial finite-state controller, δ, and select ε for detecting convergence to an ε-optimal policy. \n\n2. Policy evaluation: Calculate a set Γ of α-vectors that represents the value function for δ by solving the system of equations given by equation 2. \n\n3. Policy improvement: Perform a dynamic-programming update and use the new set of vectors Γ' to transform δ into a new finite-state controller, δ', as follows: \n(a) For each vector α in Γ': \ni. If the action and successor links associated with α duplicate those of a machine state of δ, then keep that machine state unchanged in δ'. \nii. Else if α pointwise dominates a vector associated with a machine state of δ, change the action and successor links of that machine state to those used to create α. (If it pointwise dominates the vectors of more than one machine state, they can be combined into a single machine state.) \niii. Otherwise add a machine state to δ' that has the same action and successor links used to create α. \n(b) Prune any machine state for which there is no corresponding vector in Γ', as long as it is not reachable from a machine state to which a vector in Γ' does correspond. \n\n4. Termination test. 
If the Bellman residual is less than or equal to ε(1 - β)/β, exit with an ε-optimal policy. Otherwise set δ to δ' and go to step 2. \n\nFigure 1: Policy iteration algorithm. \n\n3.2 Policy improvement \n\nThe policy improvement step uses the dynamic-programming update to transform a value function V represented by a set Γ of α-vectors into an improved value function V' represented by a set Γ' of α-vectors. We now show that the dynamic-programming update can also be interpreted as the transformation of a finite-state controller δ into an improved finite-state controller δ'. The transformation is made based on a simple comparison of Γ' and Γ. \n\nFirst note that some of the α-vectors in Γ' are duplicates of α-vectors in Γ, that is, their action and successor links match (and their vector values are pointwise equal). Any machine state of δ for which there is a duplicate vector in Γ' is left unchanged. The vectors in Γ' that are not duplicates of vectors in Γ indicate how to change the finite-state controller. If a non-duplicate vector in Γ' pointwise dominates a vector in Γ, the machine state that corresponds to the pointwise dominated vector in Γ is changed so that its action and successor links match those of the dominating vector in Γ'. If a non-duplicate vector in Γ' does not pointwise dominate a vector in Γ, a machine state is added to the finite-state controller with the same action and successor links used to generate the vector. There may be some machine states for which there is no corresponding vector in Γ' and they can be pruned, but only if they are not reachable from a machine state that corresponds to a vector in Γ'. This last point is important because it preserves the integrity of the finite-state controller. \n\nA policy iteration algorithm that uses these simple transformations to change a finite-state controller in the policy improvement step is summarized in Figure 1. 
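The duplicate/dominate/add rules of the policy improvement step can be sketched as follows. This is an illustrative Python rendering, not the paper's implementation: a controller is assumed to be a dict mapping a machine-state index k to a tuple (action, successors, alpha), where successors maps each observation to a successor machine-state index, and each element of Γ' carries the action and successor links used to create it. Pruning of unreachable machine states (step 3b) is omitted from the sketch.

```python
def improve(controller, new_vectors):
    """Transform a finite-state controller using the vectors Gamma' of the
    dynamic-programming update. controller: dict k -> (action, successors, alpha);
    new_vectors: list of (action, successors, alpha) tuples. Modifies in place."""
    next_index = max(controller, default=-1) + 1
    for action, succ, alpha in new_vectors:
        # Rule i: a duplicate (same action and successor links) changes nothing.
        if any(a == action and s == succ for a, s, _ in controller.values()):
            continue
        # Rule ii: redirect any machine state whose vector is pointwise dominated.
        dominated = [k for k, (_, _, old) in controller.items()
                     if all(x >= y for x, y in zip(alpha, old))]
        if dominated:
            for k in dominated:
                controller[k] = (action, succ, alpha)
        else:
            # Rule iii: otherwise add a fresh machine state.
            controller[next_index] = (action, succ, alpha)
            next_index += 1
    return controller
```

Because the transformation only compares vectors and rewires links, its cost is small next to the dynamic-programming update that produces Γ'.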
An algorithm that performs this transformation is easy to implement and runs very efficiently because it simply compares the α-vectors in Γ' to the α-vectors in Γ and modifies the finite-state controller accordingly. The policy evaluation step is invoked to compute the value function of the transformed finite-state controller. (This is only necessary if a machine state has been changed, not if machine states have simply been added.) It is easy to show that the value function of the transformed finite-state controller δ' dominates the value function of the original finite-state controller, δ; we omit the proof, which appears in (Hansen, 1998). \n\nTheorem 1 If a finite-state controller is not optimal, policy improvement transforms it into a finite-state controller with a value function that is as good or better for every information state and better for some information state. \n\n3.3 Convergence \n\nIf a finite-state controller cannot be improved in the policy improvement step (i.e., all the vectors in Γ' are duplicates of vectors in Γ), it must be optimal because the value function satisfies the optimality equation. However policy iteration does not necessarily converge to an optimal finite-state controller after a finite number of iterations because there is not necessarily an optimal finite-state controller. Therefore we use the same stopping condition used by Sondik to detect ε-optimality: a finite-state controller is ε-optimal when the Bellman residual is less than or equal to ε(1 - β)/β, where β denotes the discount factor. Representation of a policy as a finite-state controller makes the following proof straightforward (Hansen, 1998). \n\nTheorem 2 Policy iteration converges to an ε-optimal finite-state controller after a finite number of iterations. 
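Policy evaluation via equation (2) is a linear system in the α-vector entries; below it is sketched by successive approximation rather than a direct solve, which reaches the same fixed point for discount β < 1 because the update is a contraction. The parameter tables r[a][i], P[a][i][j], Q[a][j][θ] and the controller layout are assumptions for illustration.

```python
def evaluate(controller, r, P, Q, beta, n_states, n_obs, tol=1e-10):
    """Compute the alpha-vectors of a finite-state controller (equation 2).
    controller: dict k -> (action a(k), successors tuple s(k, theta))."""
    alpha = {k: [0.0] * n_states for k in controller}
    while True:
        biggest_change = 0.0
        for k, (a, succ) in controller.items():
            for i in range(n_states):
                # alpha_i^k = r_i^a(k) + beta * sum_{j,theta} p q alpha_j^s(k,theta)
                v = r[a][i] + beta * sum(
                    P[a][i][j] * Q[a][j][th] * alpha[succ[th]][j]
                    for j in range(n_states) for th in range(n_obs))
                biggest_change = max(biggest_change, abs(v - alpha[k][i]))
                alpha[k][i] = v
        if biggest_change < tol:
            return alpha
```

For controllers of modest size a direct linear solve of equation (2) would also work and is what makes policy evaluation low-order polynomial; the iterative form is used here only to keep the sketch self-contained.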
\n\n4 Example and performance \n\nWe illustrate the algorithm using the same example used by Sondik: a simple two-state, two-action, two-observation POMDP that models the problem of finding an optimal marketing strategy given imperfect information about consumer preferences (Sondik, 1971, 1978). The two states of the problem represent consumer preference or lack of preference for the manufacturer's brand; let B denote brand preference and ¬B denote lack of brand preference. Although consumer preferences cannot be observed, they can be inferred based on observed purchasing behavior; let P denote purchase of the product and let ¬P denote no purchase. There are two marketing alternatives or actions; the company can market a luxury version of the product (L) or a standard version (S). The luxury version is more expensive to market but can bring greater profit. Marketing the luxury version also increases brand preference. However consumers are more likely to purchase the less expensive, standard product. The transition probabilities, observation probabilities, and reward function for this example are shown in Figure 2. The discount factor is 0.9. \n\n[Figure 2: Parameters for marketing example of Sondik (1971, 1978): transition probabilities, observation probabilities, and expected rewards for each of the two actions, market luxury product (L) and market standard product (S).] \n\nBoth Sondik's policy iteration algorithm and the new policy iteration algorithm converge in three iterations from a starting policy that is equivalent to the finite-state controller shown in Figure 3a. \n\n[Figure 3: (a) shows the initial finite-state controller, (b) uses dashed circles to show the vectors in Γ' generated in the first policy improvement step and (c) shows the transformed finite-state controller, (d) uses dashed circles to show the vectors in Γ' generated in the second policy improvement step and (e) shows the transformed finite-state controller after policy evaluation. The optimality of this finite-state controller is detected on the third iteration, which is not shown. Arcs are labeled with one of two possible observations and machine states are labeled with one of two possible actions and a 2-dimensional vector that contains a value for each of the two possible system states.] \n\nFigure 3 shows how the initial finite-state controller is transformed into an optimal finite-state controller by the new algorithm. In the first iteration, the updated set of vectors Γ' (indicated by dashed circles in Figure 3b) includes two duplicate vectors and one non-duplicate that results in an added machine state. Figure 3c shows the improved finite-state controller after the first iteration. In the second iteration, each of the three vectors in the updated set of vectors Γ' (indicated by dashed circles in Figure 3d) pointwise dominates a vector that corresponds to a current machine state. Thus each of these machine states is changed. Figure 3e shows the improved finite-state controller after the second iteration. The optimality of this finite-state controller is detected in the third iteration. \n\nThis is the only example for which Sondik reports using policy iteration to find an optimal policy. 
For POMDPs with more than two states, Sondik's algorithm is especially difficult to implement. Sondik reports that his algorithm finds a suboptimal policy for an example described in (Smallwood & Sondik, 1973). No further computational experience with his algorithm has been reported. \n\nThe new policy iteration algorithm described in this paper easily finds an optimal finite-state controller for the example described in (Smallwood & Sondik, 1973) and has been used to solve many other POMDPs. In fact, it consistently outperforms value iteration. We compared its performance to the performance of value iteration on a suite of ten POMDPs that represent a range of problem sizes for which exact dynamic-programming updates are currently feasible. (Presently, exact dynamic-programming updates are not feasible for POMDPs with more than about ten or fifteen states, actions, or observations.) Starting from the same point, we measured how soon each algorithm converged to ε-optimality for ε values of 10.0, 1.0, 0.1, and 0.01. Policy iteration was consistently faster than value iteration by a factor that ranged from a low of about 10 times faster to a high of over 120 times faster. On average, its rate of convergence was between 40 and 50 times faster than value iteration for this set of examples. The finite-state controllers it found had as many as several hundred machine states, although optimal finite-state controllers were sometimes found with just a few machine states. \n\n5 Discussion \n\nWe have demonstrated that the dynamic-programming update for POMDPs can be interpreted as the improvement of a finite-state controller. This interpretation can be applied to both value iteration and policy iteration. 
It provides no computational speedup for value iteration, but for policy iteration it results in substantial speedup by making policy evaluation straightforward and easy to implement. This representation also has the advantage that it makes a policy easier to understand and execute than representation as a mapping from regions of information space to actions. In particular, a policy can be executed without maintaining an information state at run-time. \n\nIt is well-known that policy iteration converges to ε-optimality (or optimality) in fewer iterations than value iteration. For completely observable MDPs, this is not a clear advantage because the policy evaluation step is more computationally expensive than the dynamic-programming update. But for POMDPs, policy evaluation has low-order polynomial complexity compared to the worst-case exponential complexity of the dynamic-programming update (Littman et al., 1995). Therefore, policy iteration appears to have a clearer advantage over value iteration for POMDPs. Preliminary testing bears this out and suggests that policy iteration significantly outperforms value iteration as an approach to solving infinite-horizon POMDPs. \n\nAcknowledgements \n\nThanks to Shlomo Zilberstein and especially Michael Littman for helpful discussions. Support for this work was provided in part by the National Science Foundation under grants IRI-9409827 and IRI-9624992. \n\nReferences \n\nBlackwell, D. (1965) Discounted dynamic programming. Ann. Math. Stat. 36:226-235. \nCassandra, A.; Kaelbling, L.P.; & Littman, M.L. (1994) Acting optimally in partially observable stochastic domains. In Proc. 12th National Conf. on AI, 1023-1028. \nCassandra, A.; Littman, M.L.; & Zhang, N.L. (1997) Incremental pruning: A simple, fast, exact algorithm for partially observable Markov decision processes. In Proc. 13th Annual Conf. on Uncertainty in AI. \nHansen, E.A. (1998) 
Finite-Memory Control of Partially Observable Systems. PhD thesis, Department of Computer Science, University of Massachusetts at Amherst. \nJaakkola, T.; Singh, S.P.; & Jordan, M.I. (1995) Reinforcement learning algorithm for partially observable Markov decision problems. In NIPS-7. \nLittman, M.L.; Cassandra, A.R.; & Kaelbling, L.P. (1995) Efficient dynamic-programming updates in partially observable Markov decision processes. Computer Science Technical Report CS-95-19, Brown University. \nMonahan, G.E. (1982) A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science 28:1-16. \nSmallwood, R.D. & Sondik, E.J. (1973) The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21:1071-1088. \nSondik, E.J. (1971) The Optimal Control of Partially Observable Markov Processes. PhD thesis, Department of Electrical Engineering, Stanford University. \nSondik, E.J. (1978) The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research 26:282-304. \n", "award": [], "sourceid": 1447, "authors": [{"given_name": "Eric", "family_name": "Hansen", "institution": null}]}