{"title": "Experimental Results on Learning Stochastic Memoryless Policies for Partially Observable Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1080, "abstract": null, "full_text": "Experimental Results on Learning Stochastic \nMemoryless Policies for Partially Observable \n\nMarkov Decision Processes \n\nJohn K. Williams \n\nDepartment of Mathematics \n\nUniversity of Colorado \nBoulder, CO 80309-0395 \n\njkwillia@euclid.colorado.edu \n\nSatinder Singh \n\nAT &T Labs-Research \n\n180 Park Avenue \n\nFlorham Park, NJ 07932 \nbaveja@research.att.com \n\nAbstract \n\nPartially Observable Markov Decision Processes (pO \"MOPs) constitute \nan important class of reinforcement learning problems which present \nunique theoretical and computational difficulties. In the absence of the \nMarkov property, popular reinforcement learning algorithms such as \nQ-Iearning may no longer be effective, and memory-based methods \nwhich remove partial observability via state-estimation are notoriously \nexpensive. An alternative approach is to seek a stochastic memoryless \npolicy which for each observation of the environment prescribes a \nprobability distribution over available actions that maximizes the \naverage reward per timestep. A reinforcement learning algorithm \nwhich learns a locally optimal stochastic memoryless policy has been \nproposed by Jaakkola, Singh and Jordan, but not empirically verified. \nWe present a variation of this algorithm, discuss its implementation, \nand demonstrate its viability using four test problems. \n\n1 INTRODUCTION \n\nReinforcement learning techniques have proven quite effective in solving Markov \nDecision Processes (\"MOPs), control problems in which the exact state of the \nenvironment is available to the learner and the expected result of an action depends only \non the present state [10]. 
Algorithms such as Q-learning learn optimal deterministic policies for MDPs: rules which for every state prescribe an action that maximizes the expected future reward. In many important problems, however, the exact state of the environment is either inherently unknowable or prohibitively expensive to obtain, and only a limited, possibly stochastic observation of the environment is available. \n\n\f1074 \n\nJ. K. Williams and S. Singh \n\nSuch Partially Observable Markov Decision Processes (POMDPs) [3, 6] are often much more difficult than MDPs to solve [4]. Distinct sequences of observations and actions preceding a given observation in a POMDP may lead to different probabilities of occupying the underlying exact states of the MDP. If the efficacy of an action depends on the hidden exact state of the environment, an optimal choice may require knowing the past history as well as the current observation, and the problem is no longer Markov. \n\nIn light of this difficulty, one approach to solving POMDPs is to explore the environment while building up a memory of past observations, actions and rewards which allows estimation of the current hidden state [1]. Such methods produce deterministic policies, but they are computationally expensive and may not scale well with problem size. Furthermore, policies that require state-estimation using memory may be complicated to implement. \n\nMemoryless policies are particularly appropriate for problems in which the state is expensive to obtain or inherently difficult to estimate, and they have the advantage of being extremely simple to act upon. For a POMDP, the optimal memoryless policy is generally a stochastic policy: one which for each observation of the environment prescribes a probability distribution over the available actions. In fact, examples of POMDPs can be constructed for which a stochastic policy is arbitrarily better than the optimal deterministic policy [9]. 
An algorithm proposed by Jaakkola, Singh and Jordan (JSJ) [2], which we investigate here, learns memoryless stochastic policies for POMDPs. \n\n2 POMDPs AND DIFFERENTIAL-REWARD Q-VALUES \n\nWe assume that the environment has discrete states S = {s_1, s_2, ..., s_N}, and the learner chooses actions from a set A. State transitions depend only on the current state s and the action a taken (the Markov property); they occur with probabilities P^a(s, s') and result in expected rewards R^a(s, s'). In a POMDP, the learner cannot sense exactly the state s of the environment, but rather perceives only an observation, or \"message\", from a set M = {m_1, m_2, ..., m_M} according to a conditional probability distribution P(m|s). The learner will in general not know the size of the underlying state space, its transition probabilities, reward function, or the conditional distributions of the messages. \n\nIn MDPs, there always exists a policy which simultaneously maximizes the expected future reward for all states, but this is not the case for POMDPs [9]. An appropriate alternative measure of the merit of a stochastic POMDP policy π(a|m) is the asymptotic average reward per timestep, R^π, that it achieves. In seeking an optimal stochastic policy, the JSJ algorithm makes use of Q-values determined by the infinite-horizon differential reward for each observation-action pair (m, a). In particular, if r_t denotes the reward obtained at time t, we may define the differential-reward Q-values by \n\nQ^π(s, a) = Σ_{t=1}^∞ E^π[r_t - R^π | s_1 = s, a_1 = a];  Q^π(m, a) = E_s[Q^π(s, a) | M(s) = m]   (1) \n\nwhere M is the observation operator. Note that E[r_t] -> R^π as t -> ∞, so the summand converges to zero. The value functions V^π(s) and V^π(m) may be defined similarly. \n\n3 POLICY IMPROVEMENT \n\nThe JSJ algorithm consists of a method for evaluating Q^π and V^π and a mechanism for using them to improve the current policy. 
Roughly speaking, if Q^π(m, a) > V^π(m), then action a realized a higher differential reward than the average for observation m, and assigning it a slightly greater probability will increase the average reward per timestep, R^π. We interpret the quantities Δ_m(a) = Q^π(m, a) - V^π(m) as comprising a \"gradient\" of R^π in policy space. \n\n\fAn Algorithm which Learns Stochastic Memoryless Policies for POMDPs \n\n1075 \n\nTheir projections onto the probability simplexes may then be written as δ_m = Δ_m - <1, Δ_m>/|A|, where 1 is the one-vector (1, 1, ..., 1), <., .> is the inner product, and |A| is the number of actions, or \n\nδ_m(a) = Δ_m(a) - (1/|A|) Σ_{a'∈A} Δ_m(a') = Q^π(m, a) - (1/|A|) Σ_{a'∈A} Q^π(m, a').   (2) \n\nFor sufficiently small ε_m, an improved policy π'(a|m) may be obtained by the increments \n\nπ'(a|m) = π(a|m) + ε_m δ_m(a).   (3) \n\nIn practice, we also enforce π'(a|m) ≥ P_min for all a and m to guarantee continued exploration. The original JSJ algorithm prescribed using Δ_m(a) in place of δ_m(a) in equation (3), followed by renormalization [2]. Our method has the advantage that a given value of Δ yields the same increment regardless of the current value of the policy, and it ensures that the step is in the correct direction. We also do not require the differential-reward value estimate, V^π. \n\n4 Q-EVALUATION \n\nAs the POMDP is simulated under a fixed stochastic policy π, every occurrence of an observation-action pair (m, a) begins a sequence of rewards which can be used to estimate Q^π(m, a). Exploiting the fact that the Q^π(m, a) are defined as sums, the JSJ Q-evaluation method recursively averages the estimates from all such sequences using a so-called \"every-visit\" Monte-Carlo method. In order to reduce the bias and variance caused by the dependence of the evaluation sequences, a factor β is used to discount their shared \"tails\". 
Specifically, at time t the learner makes observation m_t, takes action a_t, and obtains reward r_t. The number of visits K(m_t, a_t) is incremented, the tail discount rate is γ(m, a) = 1 - K(m, a)^{-1/4}, and the following updates are performed (the indicator function χ_t(m, a) is 1 if (m, a) = (m_t, a_t) and 0 otherwise): \n\nβ(m, a) = [1 - χ_t(m, a)/K(m, a)] γ(m, a) β(m, a) + χ_t(m, a)/K(m, a)   (tail discount factor)   (4) \n\nQ(m, a) = [1 - χ_t(m, a)/K(m, a)] Q(m, a) + β(m, a) [r_t - R]   (5) \n\nC(m, a) = [1 - χ_t(m, a)/K(m, a)] C(m, a) + β(m, a)   (cumulative discount effect)   (6) \n\nR = (1 - 1/t) R + (1/t) r_t   (R^π-estimate)   (7) \n\nQ(m, a) = Q(m, a) - C(m, a) [R - R_old];  R_old = R   (Q^π-estimate correction)   (8) \n\nOther schedules for γ(m, a) are possible (see [2]), and the correction provided by (8) need not be performed at every step, but can be delayed until the Q^π-estimate is needed. \n\nThis evaluation method can be used as given for a policy-iteration type algorithm in which independent T-step evaluations of Q^π are interspersed with policy improvements as prescribed in section 3. However, an online version of the algorithm which performs policy improvement after every step requires that old experience be gradually \"forgotten\" so that the Q^π-estimate can respond to more recent experience. To achieve this, we multiply the previous estimates of β, Q, and C at each timestep by a \"decay\" factor α, 0 < α < 1, before they are updated via equations (4)-(6), and replace equation (7) by \n\nR = α(1 - 1/t) R + [1 - α(1 - 1/t)] r_t.   (9) \n\nAn alternative method, which also works reasonably well, is to multiply K and t by α at each timestep instead. 
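The online variant can be condensed into a short sketch. This is an illustrative reading of the updates (2)-(9), not the authors' code: the environment interface env_step(action) -> (observation, reward), the dictionary-based learner state, and all function names are assumptions.

```python
import random
from collections import defaultdict

def make_learner(observations, actions, epsilon=0.001, alpha=0.9995, p_min=0.02):
    """Container for the learner's estimates; all names are illustrative."""
    return {
        "pi": {m: {a: 1.0 / len(actions) for a in actions} for m in observations},
        "Q": defaultdict(float), "C": defaultdict(float), "beta": defaultdict(float),
        "K": defaultdict(int), "R": 0.0, "R_old": 0.0, "t": 0,
        "eps": epsilon, "alpha": alpha, "p_min": p_min, "actions": list(actions),
    }

def jsj_step(L, obs, env_step):
    """One step of the online algorithm: act, update the differential-reward
    Q-estimates (eqs 4-6, 8, 9), then nudge pi along the projected gradient
    (eqs 2-3)."""
    acts = L["actions"]
    a = random.choices(acts, weights=[L["pi"][obs][b] for b in acts])[0]
    next_obs, r = env_step(a)
    L["t"] += 1
    L["K"][(obs, a)] += 1
    for key, k in L["K"].items():
        # decay old estimates so they track the slowly changing policy
        L["beta"][key] *= L["alpha"]; L["Q"][key] *= L["alpha"]; L["C"][key] *= L["alpha"]
        chi = 1.0 if key == (obs, a) else 0.0
        gamma = 1.0 - k ** -0.25                                       # tail discount rate
        L["beta"][key] = (1 - chi / k) * gamma * L["beta"][key] + chi / k            # (4)
        L["Q"][key] = (1 - chi / k) * L["Q"][key] + L["beta"][key] * (r - L["R"])    # (5)
        L["C"][key] = (1 - chi / k) * L["C"][key] + L["beta"][key]                   # (6)
    frac = L["alpha"] * (1.0 - 1.0 / L["t"])
    L["R"] = frac * L["R"] + (1.0 - frac) * r        # (9): decayed average reward
    for key in L["K"]:
        L["Q"][key] -= L["C"][key] * (L["R"] - L["R_old"])                           # (8)
    L["R_old"] = L["R"]
    # policy improvement: projected increment (2)-(3), then enforce P_min
    qbar = sum(L["Q"][(obs, b)] for b in acts) / len(acts)
    for b in acts:
        L["pi"][obs][b] = max(L["pi"][obs][b] + L["eps"] * (L["Q"][(obs, b)] - qbar),
                              L["p_min"])
    z = sum(L["pi"][obs].values())
    for b in acts:
        L["pi"][obs][b] /= z
    return next_obs
```

Clipping at P_min followed by renormalization is one simple way to keep the policy a valid distribution while guaranteeing exploration; the paper does not specify the exact enforcement mechanism.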
Figure 1: (a) Schematic of the confounded two-state POMDP, (b) evolution of the R^π-estimate, and (c) evolution of π(A) (solid) and π(B) (dashed) for ε = 0.0002, α = 0.9995. \n\n5 EMPIRICAL RESULTS \n\nWe present only results from single runs of our online algorithm, including the modified JSJ policy improvement and Q-evaluation procedures described above. Results from the policy iteration version are qualitatively similar, and statistics performed on multiple runs verify that those shown are representative of the algorithm's behavior. To simplify the presentation, we fix a constant learning rate, ε, and decay factor, α, for each problem, and we use P_min = 0.02 throughout. Note, however, that appropriate schedules or online heuristics for decreasing ε and P_min while increasing α would improve performance and are necessary to ensure convergence. Except for the first problem, we choose the initial policy π to be uniform. In the last two problems, values of π(a|m) < 0.03 are rounded down to zero, with renormalization, before the learned policy is evaluated. \n\n5.1 CONFOUNDED TWO-STATE PROBLEM \n\nThe two-state MDP diagrammed in Figure 1(a) becomes a POMDP when the two states are confounded into a single observation. The learner may take action A or B, and receives a reward of either +1 or -1; the state transition is deterministic, as indicated in the diagram. Note that either stationary deterministic policy results in R^π = -1, whereas the optimal stochastic policy assigns each action the probability 1/2, resulting in R^π = 0. 
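These two R^π values are easy to check by brute-force simulation. The dynamics below are an assumption consistent with the description (Figure 1(a) itself is not reproduced in this text): each state has one correct action that pays +1 and switches the state, while the other action pays -1 and leaves it unchanged.

```python
import random

def average_reward(prob_A, steps=100000, seed=1):
    """Long-run average reward of the memoryless policy pi(A) = prob_A on an
    assumed confounded two-state dynamics: the state's correct action pays +1
    and switches the state; the wrong action pays -1 and leaves it unchanged."""
    rng = random.Random(seed)
    state, total = 0, 0.0
    for _ in range(steps):
        a = "A" if rng.random() < prob_A else "B"
        good = "A" if state == 0 else "B"
        if a == good:
            total += 1.0
            state = 1 - state  # correct action moves to the other state
        else:
            total -= 1.0       # wrong action: penalty, state unchanged
    return total / steps
```

Under these dynamics a deterministic policy gets trapped repeating the wrong action, so average_reward(1.0) is close to -1, while average_reward(0.5) is close to 0.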
The evolution of the R^π-estimate and policy, starting from the initial policy π(A) = 0.1 and π(B) = 0.9, is shown in Figure 1. Clearly the learned policy approaches the optimal stochastic policy π = (1/2, 1/2). \n\n5.2 MATRIX GAME: SCISSORS-PAPER-STONE-GLASS-WATER \n\nScissors-Paper-Stone-Glass-Water (SPSGW), an extension of the well-known Scissors-Paper-Stone, is a symmetric zero-sum matrix game in which the learner selects a row i, the opponent selects a column j, and the learner's payoff is determined by the matrix entry M(i, j). A game-theoretic solution is a stochastic (or \"mixed\") policy which guarantees the learner an expected payoff of at least zero. It can be shown using linear programming that the unique optimal strategy for SPSGW, yielding R^π = 0, is to play stone and water with probability 1/3, and to play scissors, paper, and glass with probability 1/9 [7]. Any stationary deterministic policy results in R^π = -1, since the opponent eventually learns to anticipate the learner's choice and exploit it. \n\nFigure 2: (a) Diagram of Scissors-Paper-Stone-Glass-Water, (b) the payoff matrix, (c) evolution of the R^π-estimate, and (d) evolution of π(stone) and π(water) (solid) and π(scissors), π(paper), and π(glass) (dashed) for ε = 0.00005, α = 0.9995. 
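The guarantee structure of such games is easy to verify numerically: a mixed strategy's worst-case payoff is its payoff against the opponent's best response. As a self-contained illustration we use the familiar 3x3 Scissors-Paper-Stone matrix rather than the 5x5 SPSGW matrix of Figure 2(b); the function name is ours.

```python
def worst_case_payoff(M, mix):
    """Row player's payoff when the column player best-responds to the
    mixture `mix` in the zero-sum game with payoff matrix M (row's payoffs)."""
    cols = range(len(M[0]))
    # Expected row payoff against each pure column reply; the opponent picks the min.
    return min(sum(mix[i] * M[i][j] for i in range(len(M))) for j in cols)

# Scissors-Paper-Stone: row/column order scissors, paper, stone;
# scissors beat paper, paper beats stone, stone beats scissors.
M = [[0, 1, -1],
     [-1, 0, 1],
     [1, -1, 0]]
```

Here worst_case_payoff(M, [1/3, 1/3, 1/3]) is 0, while any pure strategy such as [1, 0, 0] yields -1, mirroring the SPSGW claim that a mixed policy secures R^π = 0 but every stationary deterministic policy is driven to R^π = -1.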
\n\nIn formulating SPSGW as a POMDP, it is necessary to include in the state sufficient \ninformation to allow the opponent to exploit any sub-optimal strategy. We thus choose \nas states the learner's past action frequencies, multiplied at each timestep by the decay \nfactor, a. There is only one observation, and the learner acts by selecting the \"row\" \nscissors, paper, stone, glass or water, producing a deterministic state transition. The \nsimulated opponent plays the column which maximizes its expected payoff against the \nestimate of the learner's strategy obtained from the state. The learner's reward is then \nobtained from the appropriate entry of the payoff matrix. \nThe policy n = (0.1124,0.1033,0.3350,0.1117,0.3376) learned after 50,000 iterations \n(see Figure 2) is very close to the optimal policy 7i = (119, 119,113,119,1/3). \n\n5.3 PARR AND RUSSELL'S GRID WORLD \nParr and Russell's grid world [S] consists of 11 states in a 4x3 grid with a single obstacle \nas shown in Figure 3(a). The learner senses only walls to its immediate east or west and \nwhether it is in the goal state (upper right comer) or penalty state (directly below the \ngoal), resUlting in the 6 possible observations (0-3, G and P) indicated in the diagram. \nThe available actions are to move N, E, S, or W, but there is a probability 0.1 of slipping \nto either side and only O.S of moving in the deSired direction; a movement into a wall \nresults in bouncing back to the original state. The learner receives a reward of + 1 for a \ntransition into the goal state, -1 for a transition into the penalty state, and -0.04 for all \nother transitions. The goal and penalty states are connected to a cost-free absorbing \nstate; when the learner reaches either of them it is teleported immediately to a new start \nstate chosen with uniform probability. \n\nThe results are shown in Figure 3. A separate 106-step evaluation of the final learned \npolicy resulted in RJr = 0.047. 
In contrast, the optimal deterministic policy indicated by arrows in Figure 3(a) yields R^π = 0.024 [5], while Parr and Russell's memory-based SPOVA-RL algorithm achieved R^π = 0.12 after learning for 400,000 iterations [8]. \n\nFigure 3: (a) Parr and Russell's grid world, with observations shown in lower right corners and the optimal deterministic memoryless policy represented by arrows, (b) evolution of the R^π-estimate, and (c) the resulting learned policy (observations 0-3 across columns, actions N, E, S, W down rows) for ε = 0.02, α = 0.9999. \n\n5.4 MULTI-SERVER QUEUE \n\nAt each timestep, an arriving job having type 1, 2, or 3 with probability 1/2, 1/3 or 1/6, respectively, must be assigned to server A, B or C; see Figure 4(a). Each server is optimized for a particular job type which it can complete in an expected time of 2.6 timesteps, while the other job types require 50% longer. All jobs in a server's queue are handled in parallel, up to a capacity of 10 for each server; they finish with probability 1/f at each timestep, where f is the product of the expected time for the job and the number of jobs in the server's queue. The states for this POMDP are all combinations of waiting jobs and server occupancies of the three job types, but the learner's observation is restricted to the type of the waiting job. The state transition is obtained by removing all jobs which have finished and adding the waiting job to the chosen server if it has space available. 
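The transition rule just described can be sketched directly. The data layout and any detail the text leaves open (for instance, sampling each job's completion independently) are assumptions of this sketch.

```python
import random

# Expected completion times T[server][job_type]: 2.6 on-type, 50% longer (3.9) off-type.
T = {"A": (2.6, 3.9, 3.9), "B": (3.9, 2.6, 3.9), "C": (3.9, 3.9, 2.6)}
CAPACITY = 10

def queue_step(queues, job_type, server, rng):
    """One state transition: finished jobs are removed, then the waiting job
    is added to the chosen server if it has space.  Returns +1.0 on a
    successful placement and 0.0 on a drop."""
    for s, q in queues.items():
        n = len(q)
        if n:
            # each job finishes with probability 1/f, f = expected time * queue length
            q[:] = [j for j in q if rng.random() >= 1.0 / (T[s][j] * n)]
    if len(queues[server]) < CAPACITY:
        queues[server].append(job_type)
        return 1.0
    return 0.0
```

The type of the arriving job can be drawn with the stated probabilities 1/2, 1/3, 1/6 via rng.choices([0, 1, 2], weights=[3, 2, 1])[0].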
The reward is +1 if the job is successfully placed, or 0 if it is dropped. \n\nThe results are shown in Figure 4. A separate 10^6-step evaluation of the learned policy obtained R^π = 0.95, corresponding to 95% success in placing jobs. In contrast, the optimal deterministic policy, which assigns each job to the server optimized for it, attained only 87% success. Thus the learned policy more than halves the drop rate! \n\n6 CONCLUSION \n\nOur online version of an algorithm proposed by Jaakkola, Singh and Jordan efficiently learns a stochastic memoryless policy which is either provably optimal or at least superior to any deterministic memoryless policy for each of four test problems. Many enhancements are possible, including appropriate learning schedules to improve performance and ensure convergence, estimation of the time between observation-action visits to obtain better discount rates γ and thereby enhance Q^π-estimate bias and variance reduction (see [2]), and multiple starts or simulated annealing to avoid local minima. In addition, observations could be extended to include some past history when appropriate. \n\nMost POMDP algorithms use memory and attempt to learn an optimal deterministic policy based on belief states. The stochastic memoryless policies learned by the JSJ algorithm may not always be as good, but they are simpler to act upon and can adapt smoothly in non-stationary environments. Moreover, because it searches the space of stochastic policies, the JSJ algorithm has the potential to find the optimal memoryless policy. These considerations, along with the success of our simple implementation, suggest that this algorithm may be a viable candidate for solving real-world POMDPs, including distributed control or network admission and routing problems in which the numbers of states are enormous and complete state information may be difficult to obtain or estimate in a timely manner. 
A job arrival of type 1, 2, or 3 is routed to Server A (T_A = (2.6, 3.9, 3.9)), Server B (T_B = (3.9, 2.6, 3.9)), or Server C (T_C = (3.9, 3.9, 2.6)). The learned policy is \n\nπ(a|m) = \n[0.73  0.02  0.02] \n[0.02  0.96  0.09] \n[0.25  0.02  0.89] \n\nFigure 4: (a) Schematic of the multi-server queue, (b) evolution of the R^π-estimate, and (c) the resulting learned policy (observations 1, 2, 3 across columns, actions A, B, C down rows) for ε = 0.005, α = 0.9999. \n\nAcknowledgements \n\nWe would like to thank Mike Mozer and Tim Brown for helpful discussions. Satinder Singh was funded by NSF grant IIS-9711753. \n\nReferences \n\n[1] Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence. \n\n[2] Jaakkola, T., Singh, S. P., and Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems 7. \n\n[3] Littman, M., Cassandra, A., and Kaelbling, L. (1995). Learning policies for partially observable environments: Scaling up. In Proceedings of the Twelfth International Conference on Machine Learning. \n\n[4] Littman, M. L. (1994). Memoryless policies: Theoretical limitations and practical results. In Proceedings of the Third International Conference on Simulation of Adaptive Behavior: From Animals to Animats. \n\n[5] Loch, J., and Singh, S. P. (1998). Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. In Machine Learning: Proceedings of the Fifteenth International Conference. \n\n[6] Lovejoy, W. S. (1991). 
A survey of algorithmic methods for partially observable Markov decision processes. In Annals of Operations Research, 28. \n\n[7] Morris, P. (1994). Introduction to Game Theory. Springer-Verlag, New York. \n\n[8] Parr, R., and Russell, S. (1995). Approximating optimal policies for partially observable stochastic domains. In Proceedings of the International Joint Conference on Artificial Intelligence. \n\n[9] Singh, S. P., Jaakkola, T., and Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Machine Learning: Proceedings of the Eleventh International Conference. \n\n[10] Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. \n", "award": [], "sourceid": 1509, "authors": [{"given_name": "John", "family_name": "Williams", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}