{"title": "Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 345, "page_last": 352, "abstract": null, "full_text": "Reinforcement Learning Algorithm for \nPartially Observable Markov Decision \n\nProblems \n\nTommi Jaakkola \n\ntommi@psyche.mit.edu \n\nSatinder P. Singh \nsingh@psyche.mit.edu \n\nMichael I. Jordan \njordan@psyche.mit.edu \n\nDepartment of Brain and Cognitive Sciences, BId. E10 \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nIncreasing attention has been paid to reinforcement learning algo(cid:173)\nrithms in recent years, partly due to successes in the theoretical \nanalysis of their behavior in Markov environments. If the Markov \nassumption is removed, however, neither generally the algorithms \nnor the analyses continue to be usable. We propose and analyze \na new learning algorithm to solve a certain class of non-Markov \ndecision problems. Our algorithm applies to problems in which \nthe environment is Markov, but the learner has restricted access \nto state information. The algorithm involves a Monte-Carlo pol(cid:173)\nicy evaluation combined with a policy improvement method that is \nsimilar to that of Markov decision problems and is guaranteed to \nconverge to a local maximum. The algorithm operates in the space \nof stochastic policies, a space which can yield a policy that per(cid:173)\nforms considerably better than any deterministic policy. Although \nthe space of stochastic policies is continuous-even for a discrete \naction space-our algorithm is computationally tractable. \n\n\f346 \n\nTommi Jaakkola, Satinder P. Singh, Michaell. Jordan \n\n1 \n\nINTRODUCTION \n\nReinforcement learning provides a sound framework for credit assignment in un(cid:173)\nknown stochastic dynamic environments. 
For Markov environments a variety of different reinforcement learning algorithms have been devised to predict and control the environment (e.g., the TD(λ) algorithm of Sutton, 1988, and the Q-learning algorithm of Watkins, 1989). Ties to the theory of dynamic programming (DP) and the theory of stochastic approximation have been exploited, providing tools that have allowed these algorithms to be analyzed theoretically (Dayan, 1992; Tsitsiklis, 1994; Jaakkola, Jordan, & Singh, 1994; Watkins & Dayan, 1992). \n\nAlthough current reinforcement learning algorithms are based on the assumption that the learning problem can be cast as a Markov decision problem (MDP), many practical problems resist being treated as an MDP. Unfortunately, if the Markov assumption is removed, examples can be found where current algorithms cease to perform well (Singh, Jaakkola, & Jordan, 1994). Moreover, the theoretical analyses rely heavily on the Markov assumption. \n\nThe non-Markov nature of the environment can arise in many ways. The most direct extension of MDP's is to deprive the learner of perfect information about the state of the environment. Much as in the case of Hidden Markov Models (HMM's), the underlying environment is assumed to be Markov, but the data do not appear to be Markovian to the learner. This extension not only allows for a tractable theoretical analysis, but is also appealing for practical purposes. The decision problems we consider here are of this type. \n\nThe analog of the HMM for control problems is the partially observable Markov decision process (POMDP; see, e.g., Monahan, 1982). Unlike HMM's, however, there is no known computationally tractable procedure for POMDP's. The problem is that once the state estimates have been obtained, DP must be performed in the continuous space of probabilities of state occupancies, and this DP process is computationally infeasible except for small state spaces. 
In this paper we describe an alternative approach for POMDP's that avoids the state estimation problem and works directly in the space of (stochastic) control policies. (See Singh, et al., 1994, for additional material on stochastic policies.) \n\n2 PARTIAL OBSERVABILITY \n\nA Markov decision problem can be generalized to a POMDP by restricting the state information available to the learner. Accordingly, we define the learning problem as follows. There is an underlying MDP with states S = {s_1, s_2, ..., s_N} and transition probabilities P^a_{ss'}, the probability of jumping from state s to state s' when action a is taken in state s. For every state and every action a (random) reward is provided to the learner. In the POMDP setting, the learner is not allowed to observe the state directly but only via messages containing information about the state. At each time step t an observable message m_t is drawn from a finite set of messages according to an unknown probability distribution P(m|s_t).^1 We assume that the learner does not possess any prior information about the underlying MDP beyond the number of messages and actions. The goal for the learner is to come up with a policy, a mapping from messages to actions, that gives the highest expected reward. \n\nAs discussed in Singh et al. (1994), stochastic policies can yield considerably higher expected rewards than deterministic policies in the case of POMDP's. To make this statement precise requires an appropriate technical definition of \"expected reward,\" because in general it is impossible to find a policy, stochastic or not, that maximizes the expected reward for each observable message separately. \n\n^1 For simplicity we assume that this distribution depends only on the current state. The analyses go through also with distributions that depend on past states and actions. \n\n
We take the time-average reward as a measure of performance, that is, the total accrued reward per number of steps taken (Bertsekas, 1987; Schwartz, 1993). This approach requires the assumption that every state of the underlying controllable Markov chain is reachable. \n\nIn this paper we focus on a direct approach to solving the learning problem. Direct approaches are to be compared to indirect approaches, in which the learner first identifies the parameters of the underlying MDP and then uses DP to obtain the policy. As we noted earlier, indirect approaches lead to computationally intractable algorithms. Our approach can be viewed as providing a generalization of the direct approach to MDP's to the case of POMDP's. \n\n3 A MONTE-CARLO POLICY EVALUATION \n\nAdvantages of Monte-Carlo methods for policy evaluation in MDP's have been reviewed recently (Barto and Duff, 1994). Here we present a method for calculating the value of a stochastic policy that has the flavor of a Monte-Carlo algorithm. To motivate such an approach let us first consider a simple case where the average reward is known and generalize the well-defined MDP value function to the POMDP setting. In the Markov case the value function can be written as (cf. Bertsekas, 1987): \n\nV(s) = lim_{N→∞} (1/N) Σ_{t=1}^{N} E{R(s_t, a_t) − R̄ | s_1 = s}   (1) \n\nwhere s_t and a_t refer to the state and the action taken at the tth step, respectively, and R̄ is the average reward. This form generalizes easily to the level of messages by taking an additional expectation: \n\nV(m) = E{V(s) | s → m}   (2) \n\nwhere s → m refers to all the instances where m is observed in s, and E{·|s → m} is a Monte-Carlo expectation. This generalization yields a POMDP value function given by \n\nV(m) = Σ_{s∈m} P(s|m) V(s)   (3) \n\nin which P(s|m) define the limit occupancy probabilities over the underlying states for each message m. 
As is seen in the next section, value functions of this type can be used to refine the currently followed control policy to yield a higher average reward. \n\nLet us now consider how the generalized value functions can be computed based on the observations. We propose a recursive Monte-Carlo algorithm to effectively compute the averages involved in the definition of the value function. In the simple case when the average payoff is known, this algorithm is given by \n\nβ_t(m) = (1 − χ_t(m)/K_t(m)) γ_t β_{t−1}(m) + χ_t(m)/K_t(m)   (4) \nV_t(m) = (1 − χ_t(m)/K_t(m)) V_{t−1}(m) + β_t(m)[R(s_t, a_t) − R̄]   (5) \n\nwhere χ_t(m) is the indicator function for message m, K_t(m) is the number of times m has occurred, and γ_t is a discount factor converging to one in the limit. This algorithm can be viewed as recursive averaging of (discounted) sample sequences of different lengths, each of which has been started at a different occurrence of message m. This can be seen by unfolding the recursion, yielding an explicit expression for V_t(m). To this end, let t_k denote the time step corresponding to the kth occurrence of message m, and for clarity let R̄_t = R(s_t, a_t) − R̄ for every t. Using these, the recursion yields: \n\nV_t(m) = (1/K_t(m)) [ (R̄_{t_1} + Γ_{1,1} R̄_{t_1+1} + ... + Γ_{1,t−t_1} R̄_t) + (R̄_{t_2} + ... + Γ_{2,t−t_2} R̄_t) + ... ]   (6) \n\nwhere we have for simplicity used Γ_{k,τ} to indicate the discounting at the τth step in the kth sequence. Comparing the above expression to equation 1 indicates that the discount factor has to converge to one in the limit, since the averages in V(s) or V(m) involve no discounting. \n\nTo address the question of convergence of this algorithm, let us first assume a constant discounting (that is, γ_t = γ < 1). In this case the algorithm produces at best an approximation to the value function. 
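To make the recursion concrete, the following is a minimal Python sketch of the policy evaluation in equations (4)-(5) for the simple case in which the average reward R̄ is known. The list-of-pairs trajectory format and the fixed discount factor are assumptions of the sketch; the text uses a schedule γ_t → 1 rather than a constant γ.

```python
# Sketch of the recursive Monte-Carlo evaluation of equations (4)-(5).
# 'R_bar' (the true average reward) is assumed known, as in the simple
# case discussed in the text; 'gamma' here is a constant discount factor.
from collections import defaultdict

def evaluate(trajectory, R_bar, gamma=0.9):
    # trajectory: list of (message, reward) pairs from the observed process
    K = defaultdict(int)       # K_t(m): occurrences of message m so far
    beta = defaultdict(float)  # beta_t(m) as in eq. (4)
    V = defaultdict(float)     # V_t(m), the message-level value estimate
    for m, r in trajectory:
        K[m] += 1
        for msg in K:
            # chi_t(msg) = 1 only for the message observed at this step
            chi = 1.0 if msg == m else 0.0
            beta[msg] = (1 - chi / K[msg]) * gamma * beta[msg] + chi / K[msg]
            V[msg] = (1 - chi / K[msg]) * V[msg] + beta[msg] * (r - R_bar)
    return dict(V)
```

Note that every message's value accumulates the discounted reward tail through its β_t(m), which is exactly the "sample sequences started at each occurrence" interpretation of equation (6).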
For large K(m) the convergence rate by which this approximate solution is found can be characterized in terms of the bias and variance. This gives Bias{V(m)} ∝ (1 − Γ̄)^{−1}/K(m) and Var{V(m)} ∝ (1 − Γ̄)^{−2}/K(m), where Γ̄ = E{γ^{t_k − t_{k−1}}} is the expected effective discounting between observations. Now, in order to find the correct value function we need an appropriate way of letting γ_t → 1 in the limit. However, not all such schedules lead to convergent algorithms; setting γ_t = 1 for all t, for example, would not. By making use of the above bounds a feasible schedule guaranteeing a vanishing bias and variance can be found. For instance, since γ ≥ Γ̄ we can choose γ_{K(m)} = 1 − K(m)^{−1/4}. Much faster schedules can be obtained by estimating Γ̄. \n\nLet us now revise the algorithm to take into account the fact that the learner in fact has no prior knowledge of the average reward. In this case the true average reward appearing in the above algorithm needs to be replaced with an incrementally updated estimate R̄_{t−1}. To correct for the effect this changing estimate has on the values, we transform the value function whenever the estimate is updated. This transformation is given by \n\nC_t(m) = (1 − χ_t(m)/K_t(m)) C_{t−1}(m) + β_t(m)   (7) \nV_t(m) ← V_t(m) − C_t(m)(R̄_t − R̄_{t−1})   (8) \n\nwhere R̄_t and R̄_{t−1} here denote successive estimates of the average reward; as a result, the new values are as if they had been computed using the current estimate of the average reward. \n\nTo carry these results to the control setting and assign a figure of merit to stochastic policies, we need a quantity related to the actions for each observed message. As in the case of MDP's, this is readily achieved by replacing m in the algorithm just described by (m, a). In terms of equation 6, for example, this means that the sequences started from m are classified according to the actions taken when m is observed. 
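As a rough illustration of the unknown-average-reward variant, the following Python sketch replaces R̄ by a running estimate and applies the correction terms C_t(m) of equations (7)-(8) whenever the estimate changes. The ordering of the updates within a single time step is an assumption of this sketch, not something the text prescribes.

```python
# Sketch: Monte-Carlo evaluation with an incrementally estimated average
# reward. C_t(m) (eq. 7) tracks the sensitivity of V_t(m) to the estimate,
# and the transformation of eq. (8) retargets old values to the new estimate.
from collections import defaultdict

def evaluate_online(trajectory, gamma=0.9):
    K = defaultdict(int)        # occurrence counts K_t(m)
    beta = defaultdict(float)   # beta_t(m) as in eq. (4)
    C = defaultdict(float)      # correction coefficients C_t(m)
    V = defaultdict(float)      # message values V_t(m)
    R_est, n = 0.0, 0
    for m, r in trajectory:
        R_prev, n = R_est, n + 1
        R_est += (r - R_est) / n            # incremental estimate of R-bar
        K[m] += 1
        for msg in K:
            chi = 1.0 if msg == m else 0.0
            V[msg] -= C[msg] * (R_est - R_prev)   # eq. (8): retarget values
            beta[msg] = (1 - chi / K[msg]) * gamma * beta[msg] + chi / K[msg]
            V[msg] = (1 - chi / K[msg]) * V[msg] + beta[msg] * (r - R_est)
            C[msg] = (1 - chi / K[msg]) * C[msg] + beta[msg]    # eq. (7)
    return dict(V), R_est
```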
The above analysis goes through when m is replaced by (m, a), yielding \"Q-values\" on the level of messages: \n\nQ(m, a) = Σ_{s∈m} P(s|m) Q(s, a)   (9) \n\nIn the next section we show how these values can be used to search efficiently for a better policy. \n\n4 POLICY IMPROVEMENT THEOREM \n\nHere we present a policy improvement theorem that enables the learner to search efficiently for a better policy in the continuous policy space using the \"Q-values\" Q(m, a) described in the previous section. The theorem allows the policy refinement to be done in a way that is similar to policy improvement in an MDP setting. \n\nTheorem 1 Let the current stochastic policy π(a|m) lead to Q-values Q^π(m, a) on the level of messages. For any policy π¹(a|m) define \n\nJ^{π¹}(m) = Σ_a π¹(a|m)[Q^π(m, a) − V^π(m)] \n\nThe change in the average reward resulting from changing the current policy according to π(a|m) → (1 − ε)π(a|m) + ε π¹(a|m) is given by \n\nΔR̄^π = ε Σ_m P^π(m) J^{π¹}(m) + O(ε²) \n\nwhere P^π(m) are the occupancy probabilities for messages associated with the current policy. \n\nThe proof is given in the Appendix. In terms of policy improvement the theorem can be interpreted as follows. Choose the policy π¹(a|m) such that \n\nJ^{π¹}(m) = max_a [Q^π(m, a) − V^π(m)]   (10) \n\nIf now J^{π¹}(m) > 0 for some m, then we can change the current policy towards π¹ and expect an increase in the average reward, as shown by the theorem. The ε factor suggests local changes in the policy space, and the policy can be refined until J^{π¹}(m) = 0 for all m, which constitutes a local maximum for this policy improvement method. Note that the new direction π¹(a|m) in the policy space can be chosen separately for each m. \n\n5 THE ALGORITHM \n\nBased on the theoretical analysis presented above we can construct an algorithm that performs well in a POMDP setting. 
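The improvement step suggested by Theorem 1 can be sketched in Python as follows: pick the greedy direction π¹ of equation (10) and mix a small step ε toward it. The dictionary-based policy representation is an assumption made for illustration.

```python
# Sketch of one policy-improvement step: move the stochastic policy a
# small step epsilon toward the greedy direction pi1 of eq. (10).

def improve(policy, Q, V, epsilon=0.1):
    # policy: {message: {action: probability}}
    # Q: {(message, action): Q(m, a)}, V: {message: V(m)} (estimates)
    new_policy = {}
    for m, probs in policy.items():
        # pi1 puts all its mass on the action maximizing Q(m, a) - V(m)
        best = max(probs, key=lambda a: Q[(m, a)] - V[m])
        # pi <- (1 - eps) pi + eps pi1 remains a valid distribution
        new_policy[m] = {a: (1 - epsilon) * p
                            + epsilon * (1.0 if a == best else 0.0)
                         for a, p in probs.items()}
    return new_policy
```

Per the theorem, this yields an O(ε) increase in the average reward (up to the O(ε²) term) whenever max_a [Q(m, a) − V(m)] > 0 for some m.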
The algorithm is composed of two parts. First, Q(m, a) values, analogous to the Q-values in an MDP, are calculated via a Monte-Carlo approach. This is followed by a policy improvement step, which is achieved by increasing the probability of taking the best action as defined by Q(m, a). The new policy is guaranteed to yield a higher average reward (see Theorem 1) as long as for some m \n\nmax_a [Q(m, a) − V(m)] > 0   (11) \n\nThis condition being false constitutes a local maximum for the algorithm. Examples illustrating that this indeed is a local maximum can be found fairly easily. \n\nIn practice it is not feasible to wait for the Monte-Carlo policy evaluation to converge; instead, one tries to improve the policy before convergence. The policy can be refined concurrently with the Monte-Carlo method according to \n\nπ(a|m_n) → π(a|m_n) + ε[Q_n(m_n, a) − V_n(m_n)]   (12) \n\nwith normalization. Other asynchronous or synchronous on-line updating schemes can also be used. Note that if Q_n(m, a) = Q(m, a), then this change would be statistically equivalent to that of the batch version, with the concomitant guarantees of giving a higher average reward. \n\n6 CONCLUSIONS \n\nIn this paper we have proposed and theoretically analyzed an algorithm that solves a reinforcement learning problem in a POMDP setting, where the learner has restricted access to the state of the environment. As the underlying MDP is not known, the problem appears to the learner to have a non-Markov nature. The average reward was chosen as the figure of merit for the learning problem, and stochastic policies were used to provide higher average rewards than can be achieved with deterministic policies. This extension from MDP's to POMDP's greatly increases the domain of potential applications of reinforcement learning methods. 
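A minimal Python sketch of the concurrent update of equation (12), including the renormalization mentioned in the text, is given below. Clipping negative intermediate values before normalizing is an assumption of this sketch; the text does not specify how validity of the distribution is maintained.

```python
# Sketch of the on-line refinement of eq. (12) for the current message m_n:
# pi(a|m_n) <- pi(a|m_n) + eps [Q_n(m_n, a) - V_n(m_n)], then normalize.

def online_update(probs, Q_m, V_m, eps=0.01):
    # probs: {action: probability} for message m_n
    # Q_m: {action: Q_n(m_n, a)}, V_m: V_n(m_n)
    new = {a: p + eps * (Q_m[a] - V_m) for a, p in probs.items()}
    new = {a: max(p, 0.0) for a, p in new.items()}  # assumed clipping step
    total = sum(new.values())
    return {a: p / total for a, p in new.items()}
```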
\n\nThe simplicity of the algorithm stems partly from a Monte-Carlo approach to obtaining action-dependent values for each message. These new \"Q-values\" were shown to give rise to a simple policy improvement result that enables the learner to gradually improve the policy in the continuous space of probabilistic policies. \n\nThe batch version of the algorithm was shown to converge to a local maximum. We also proposed an on-line version of the algorithm in which the policy is changed concurrently with the calculation of the \"Q-values.\" The policy improvement of the on-line version resembles that of learning automata. \n\nAPPENDIX \n\nLet us denote the policy after the change by π^ε. Assume first that we have access to Q^π(s, a), the Q-values for the underlying MDP, and to P^{π^ε}(s|m), the occupancy probabilities after the policy refinement. Define \n\nJ(m, π', π'', π''') = Σ_a π'(a|m) Σ_s P^{π''}(s|m)[Q^{π'''}(s, a) − V^{π'''}(s)]   (13) \n\nwhere the policies listed on the left-hand side correspond, respectively, to the policies appearing on the right. To show how the average reward depends on this quantity we need to make use of the following facts. The Q-values for the underlying MDP satisfy (Bellman's equation) \n\nQ^π(s, a) = R(s, a) − R̄^π + Σ_{s'} P^a_{ss'} V^π(s')   (14) \n\nIn addition, Σ_a π(a|m) Q^π(s, a) = V^π(s), implying that J(m, π^ε, π^ε, π^ε) = 0 (see eq. 13). These facts allow us to write \n\nJ(m, π^ε, π^ε, π) − J(m, π^ε, π^ε, π^ε) = Σ_a π^ε(a|m) Σ_s P^{π^ε}(s|m)[Q^π(s, a) − V^π(s) − Q^{π^ε}(s, a) + V^{π^ε}(s)]   (15) \n\nBy weighting this result for each message by P^{π^ε}(m) and summing over the messages, the probability weightings for the last two terms become equal and the terms cancel. 
\nThis procedure gives us \n\nR̄^{π^ε} − R̄^π = Σ_m P^{π^ε}(m) J(m, π^ε, π^ε, π)   (16) \n\nThis result does not allow the learner to assess the effect of the policy refinement on the average reward, since the J(·) term contains information not available to the learner. However, by making use of the fact that the policy has been changed only slightly, this problem can be avoided. \n\nAs π^ε is a policy satisfying max_{m,a} |π^ε(a|m) − π(a|m)| ≤ ε, it can be shown that there exists a constant C such that the maximum change in P(s|m), P(s), and P(m) is bounded by Cε. Using these bounds, and indicating the difference between π^ε-dependent and π-dependent quantities by Δ, we get \n\nJ(m, π^ε, π^ε, π) = Σ_a [π(a|m) + Δπ(a|m)] Σ_s [P^π(s|m) + ΔP^π(s|m)][Q^π(s, a) − V^π(s)] \n= Σ_a Δπ(a|m) Σ_s P^π(s|m)[Q^π(s, a) − V^π(s)] + O(ε²) \n= J(m, π^ε, π, π) + O(ε²) \n\nwhere the second equality follows from Σ_a π(a|m)[Q^π(s, a) − V^π(s)] = 0 and the third from the bounds stated earlier. \n\nThe equation characterizing the change in the average reward due to the policy change (eq. 16) can now be rewritten as follows: \n\nR̄^{π^ε} − R̄^π = Σ_m P^{π^ε}(m) J(m, π^ε, π, π) + O(ε²) \n= ε Σ_m P^π(m) Σ_a π¹(a|m)[Q^π(m, a) − V^π(m)] + O(ε²) \n= ε Σ_m P^π(m) J^{π¹}(m) + O(ε²) \n\nwhere the bounds above have been used for P^{π^ε}(m) − P^π(m). This completes the proof. □ \n\nAcknowledgments \n\nThe authors thank Rich Sutton for pointing out errors at early stages of this work. This project was supported in part by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, and by grant N00014-94-1-0777 from the Office of Naval Research. Michael I. Jordan is an NSF Presidential Young Investigator. \n\nReferences \n\nBarto, A., & Duff, M. (1994). Monte-Carlo matrix inversion and reinforcement learning. In Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann. \n\nBertsekas, D. P. (1987). 
Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, NJ: Prentice-Hall. \n\nDayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341-362. \n\nJaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative Dynamic Programming algorithms. Neural Computation, 6, 1185-1201. \n\nMonahan, G. (1982). A survey of partially observable Markov decision processes. Management Science, 28, 1-16. \n\nSingh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning without state estimation in partially observable environments. In Proceedings of the Eleventh Machine Learning Conference. \n\nSutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44. \n\nSchwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth Machine Learning Conference. \n\nTsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185-202. \n\nWatkins, C. J. C. H. (1989). Learning from delayed rewards. PhD Thesis, University of Cambridge, England. \n\nWatkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292. \n", "award": [], "sourceid": 951, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}