{"title": "Learning Factored Representations for Partially Observable Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1050, "page_last": 1056, "abstract": null, "full_text": "Learning Factored Representations for Partially \n\nObservable Markov Decision Processes \n\nBrian Sallans \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto M5S 2Z9 Canada \n\nGatsby Computational Neuroscience Unit* \n\nUniversity College London \nLondon WCIN 3AR U.K. \n\nsallans@cs.toronto.edu \n\nAbstract \n\nThe problem of reinforcement learning in a non-Markov environment is \nexplored using a dynamic Bayesian network, where conditional indepen(cid:173)\ndence assumptions between random variables are compactly represented \nby network parameters. The parameters are learned on-line, and approx(cid:173)\nimations are used to perform inference and to compute the optimal value \nfunction. The relative effects of inference and value function approxi(cid:173)\nmations on the quality of the final policy are investigated, by learning to \nsolve a moderately difficult driving task. The two value function approx(cid:173)\nimations, linear and quadratic, were found to perform similarly, but the \nquadratic model was more sensitive to initialization. Both performed be(cid:173)\nlow the level of human performance on the task. The dynamic Bayesian \nnetwork performed comparably to a model using a localist hidden state \nrepresentation, while requiring exponentially fewer parameters. \n\n1 Introduction \n\nReinforcement learning (RL) addresses the problem of learning to act so as to maximize \na reward signal provided by the environment. Online RL algorithms try to find a policy \nwhich maximizes the expected time-discounted reward. They do this through experience \nby performing sample backups to learn a value function over states or state-action pairs. 
\n\nIf the decision problem is Markov in the observable states, then the optimal value function over state-action pairs yields all of the information required to find the optimal policy for the decision problem. When complete knowledge of the environment is not available, states which are different may look the same; this uncertainty is called perceptual aliasing [1], and causes decision problems to have dynamics which are non-Markov in the perceived state. \n\n* Correspondence address \n\n1.1 Partially observable Markov decision processes \n\nMany interesting decision problems are not Markov in the inputs. A partially observable Markov decision process (POMDP) is a formalism in which it is assumed that a process is Markov, but with respect to some unobserved (i.e. \"hidden\") random variable. The state of the variable at time t, denoted s^t, depends only on the state at the previous time step and on the action performed. The currently-observed evidence is assumed to be independent of previous states and observations given the current state. \n\nThe state of the hidden variable is not known with certainty, so a belief state is maintained instead. At each time step, the beliefs are updated by using Bayes' theorem to combine the belief state at the previous time step (passed through a model of the system dynamics) with newly observed evidence. In the case of discrete time and finite discrete states and actions, a POMDP is typically represented by conditional probability tables (CPTs) specifying emission probabilities for each state, and transition probabilities and expected rewards for states and actions. This corresponds to a hidden Markov model (HMM) with a distinct transition matrix for each action. The hidden state is represented by a single random variable that can take on one of K values. Exact belief updates can be computed using Bayes' rule.
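The exact update just described can be sketched in a few lines (a hedged illustration on a made-up two-state model; `T_a` is the transition CPT for the chosen action and `O[s, o]` the emission CPT, both invented for this example):

```python
import numpy as np

def belief_update(b, T_a, O, o):
    """Bayes update: b'(s') is proportional to O[s', o] * sum_s b(s) * T_a[s, s']."""
    predicted = b @ T_a           # belief passed through the system dynamics
    unnorm = O[:, o] * predicted  # combined with the newly observed evidence
    return unnorm / unnorm.sum()  # renormalize to a distribution

# Illustrative two-state POMDP fragment (numbers are arbitrary).
T_a = np.array([[0.9, 0.1],
                [0.2, 0.8]])      # T_a[s, s'] = P(s' | s, a)
O = np.array([[0.7, 0.3],
              [0.1, 0.9]])        # O[s, o] = P(o | s)
b = np.array([0.5, 0.5])
b_new = belief_update(b, T_a, O, o=0)
```

The per-action transition matrix is exactly what makes this an HMM with one transition CPT per action.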
\n\nThe value function is defined not over the discrete state, but over the real-valued belief state. It has been shown that the value function is piecewise linear and convex [2]. In the worst case, the number of linear pieces grows exponentially with the problem horizon, making exact computation of the optimal value function intractable. \n\nNotice that the localist representation, in which the state is encoded in a single random variable, is exponentially inefficient: encoding n bits of information about the state of the process requires 2^n possible hidden states. This does not bode well for the abilities of models which use this representation to scale up to problems with high-dimensional inputs and complex non-Markov structure. \n\n1.2 Factored representations \n\nA Bayesian network can compactly represent the state of the system in a set of random variables [3]. A two time-slice dynamic Bayesian network (DBN) represents the system at two time steps [4]. The conditional dependencies between random variables from time t to time t+1, and within time step t, are represented by edges in a directed acyclic graph. The conditional probabilities can be stored explicitly, or parameterized by weights on edges in the graph. \n\nIf the network is densely connected then inference is intractable [5]. Approximate inference methods include Markov chain Monte Carlo [6], variational methods [7], and belief state simplification [8]. \n\nIn applying a DBN to a large problem there are three distinct issues to disentangle: how well does a parameterized DBN capture the underlying POMDP; how much is the DBN hurt by approximate inference; and how good must the approximation of the value function be to achieve reasonable performance? We try to tease these issues apart by looking at the performance of a DBN on a problem with a moderately large state space and non-Markov structure.
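The gap between the localist and factored representations can be made concrete by counting transition parameters (a back-of-the-envelope sketch; the function names and the choice of 4 bits and 5 actions are illustrative assumptions):

```python
def localist_transition_params(n_bits, n_actions):
    """Localist: one K x K transition table per action, with K = 2**n_bits joint states."""
    K = 2 ** n_bits
    return n_actions * K * K

def factored_transition_params(n_bits, n_actions):
    """Fully-connected two-slice DBN: one n_bits x n_bits weight matrix per action."""
    return n_actions * n_bits * n_bits

localist = localist_transition_params(4, 5)   # 5 * 16 * 16 = 1280
factored = factored_transition_params(4, 5)   # 5 * 4 * 4 = 80
```

The factored count grows quadratically in the number of bits, while the localist count grows quadratically in the number of joint states, i.e. exponentially in the bits.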
\n\n2 The algorithm \n\nWe use a fully-connected dynamic sigmoid belief network (DSBN) [9], with K units at each time slice (see figure 1). The random variables s_i are binary, and conditional probabilities relating variables at adjacent time steps are encoded in action-specific weights: \n\nP(s_k^{t+1} = 1 | \{s_i^t\}_{i=1}^K, a^t) = \sigma\left( \sum_{i=1}^K w_{ik}^{a^t} s_i^t \right) \quad (1) \n\nwhere w_{ik}^{a^t} is the weight from the ith unit at time step t to the kth unit at time step t+1, assuming action a^t is taken at time t. The nonlinearity is the usual sigmoid function: \sigma(x) = 1/(1 + \exp\{-x\}). Note that a bias can be incorporated into the weights by clamping one of the binary units to 1. \n\nFigure 1: Architecture of the dynamic sigmoid belief network. Circles indicate random variables, where a filled circle is observed and an empty circle is unobserved. Squares are action nodes, and diamonds are rewards. \n\nThe observed variables are assumed to be discrete; the conditional distribution of an output given the hidden state is multinomial and parameterized by output weights. The probability of observing an output with value l is given by: \n\nP(o^t = l | \{s_k^t\}_{k=1}^K) = \exp\left\{ \sum_{k=1}^K u_{kl} s_k^t \right\} / \sum_{m=1}^{|O|} \exp\left\{ \sum_{k=1}^K u_{km} s_k^t \right\} \quad (2) \n\nwhere o^t \in O and u_{kl} denotes the output weight from hidden unit k to output value l. \n\n2.1 Approximate inference \n\nInference in the fully-connected Bayesian network is intractable. Instead we use a variational method with a fully-factored approximating distribution: \n\nP(s^t | s^{t-1}, a^{t-1}, o^t) \approx \hat{P}_{s^t} = \prod_{k=1}^K \mu_k^{s_k^t} (1 - \mu_k)^{1 - s_k^t} \quad (3) \n\nwhere the \mu_k are variational parameters to be optimized. This is the standard mean-field approximation for a sigmoid belief network [10]. The parameters \mu are optimized by iterating the mean-field equations, and converge in a few iterations.
The values of the variational parameters at time t are held fixed while computing the values for step t+1. This is analogous to running only the forward portion of the HMM forward-backward algorithm [11]. \n\nThe parameters of the DSBN are optimized online using stochastic gradient ascent in the log-likelihood: \n\nW^{a^t} \leftarrow W^{a^t} + \alpha_W \, \mu^t \left( \mu^{t+1} - \sigma(W^{a^t \top} \mu^t) \right)^\top, \qquad U \leftarrow U + \alpha_U \, \mu^t \left( \nu - P(\cdot \,|\, \mu^t) \right)^\top \quad (4) \n\nwhere W and U are the transition and emission matrices respectively, \alpha_W and \alpha_U are learning rates, the vector \mu contains the fully-factored approximate belief state, P(\cdot | \mu^t) is the vector of output probabilities of equation (2) evaluated at \mu^t, and \nu is a vector of zeros with a one in the o^t-th place. The notation [\cdot]_k denotes the kth element of a vector (or the kth column of a matrix). \n\n2.2 Approximating the value function \n\nComputing the optimal value function is also intractable. If a factored state-space representation is appropriate, it is natural (if extreme) to assume that the state-action value function can be decomposed in the same way: \n\nQ(\hat{P}_{s^t}, a^t) \approx \sum_{k=1}^K Q_k(\mu_k^t, a^t) \triangleq Q^F(\mu^t, a^t) \quad (5) \n\nThis simplifying assumption is still not enough to make finding the optimal value function tractable. Even if the states were completely independent, each Q_k would still be piecewise-linear and convex, with the number of pieces scaling exponentially with the horizon. We test two approximate value functions, a linear approximation: \n\nQ^F(\mu, a^t) = \sum_{k=1}^K q_{k,a^t} \mu_k = [Q]_{a^t}^\top \cdot \mu \quad (6) \n\nand a quadratic approximation: \n\nQ^F(\mu, a^t) = \sum_{k=1}^K \left( \phi_{k,a^t} \mu_k^2 + q_{k,a^t} \mu_k \right) + b_{a^t} = [\Phi]_{a^t}^\top \cdot (\mu \circ \mu) + [Q]_{a^t}^\top \cdot \mu + [b]_{a^t} \quad (7) \n\nwhere \Phi, Q and b are parameters of the approximations. The notation [\cdot]_i denotes the ith column of a matrix, [\cdot]^\top denotes matrix transpose and \circ denotes element-wise vector multiplication.
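The linear factored approximation, together with the per-component delta-rule step it admits, can be sketched as follows (a minimal illustration; the array shapes and constants are assumptions, and the update shown is the standard one-step Q-learning delta rule applied to this parameterization):

```python
import numpy as np

def q_linear(Q, mu, a):
    """Factored linear value: Q^F(mu, a) = [Q]_a . mu."""
    return Q[a] @ mu

def q_linear_update(Q, mu, a, r, mu_next, alpha=0.01, gamma=0.9):
    """Delta-rule step: move Q^F(mu, a) toward r + gamma * max_a' Q^F(mu', a')."""
    residual = (r + gamma * max(q_linear(Q, mu_next, ap) for ap in range(Q.shape[0]))
                - q_linear(Q, mu, a))
    Q[a] += alpha * mu * residual   # per component: q_{k,a} += alpha * mu_k * residual
    return residual

K, n_actions = 4, 2                      # illustrative sizes
Q = np.zeros((n_actions, K))             # one weight vector per action
mu = np.array([1.0, 0.0, 1.0, 0.0])      # a (deterministic) factored belief state
mu_next = np.array([0.0, 1.0, 0.0, 1.0])
residual = q_linear_update(Q, mu, a=0, r=1.0, mu_next=mu_next)
```

Because the value is linear in mu, the gradient with respect to each weight is just the corresponding belief component, which is what makes the update so cheap.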
\n\nWe update each term of the factored approximation with a modified Q-learning rule [12], which corresponds to a delta-rule where the target for input \mu^t is r^t + \gamma \max_a Q^F(\mu^{t+1}, a): \n\n\phi_{k,a^t} \leftarrow \phi_{k,a^t} + \alpha \mu_k^2 E_B, \qquad q_{k,a^t} \leftarrow q_{k,a^t} + \alpha \mu_k E_B, \qquad b_{a^t} \leftarrow b_{a^t} + \alpha E_B \quad (8) \n\nHere \alpha is a learning rate, \gamma is the temporal discount factor, and E_B is the Bellman residual: \n\nE_B = r^t + \gamma \max_a Q^F(\mu^{t+1}, a) - Q^F(\mu^t, a^t) \quad (9) \n\n3 Experimental results \n\nThe \"New York Driving\" task [13] involves navigating through slower and faster one-way traffic on a multi-lane highway. The speed of the agent is fixed, and it must change lanes to avoid slower cars and move out of the way of faster cars. If the agent remains in front of a faster car, the driver of the fast car will honk its horn, resulting in a reward of -1.0. Instead of colliding with a slower car, the agent can squeeze past in the same lane, resulting in a reward of -10.0. A time step with no horns or lane-squeezes constitutes clear progress, and is rewarded with +0.1. See [13] for a detailed description of this task. \n\nTable 1: Sensory input for the New York driving task \n\nDimension              Size  Values \nHear horn              2     yes, no \nGaze object            3     truck, shoulder, road \nGaze speed             2     looming, receding \nGaze distance          3     far, near, nose \nGaze refined distance  2     far-half, near-half \nGaze colour            6     red, blue, yellow, white, gray, tan \n\nA modified version of the New York Driving task was used to test our algorithm. The task was essentially the same as described in [13], except that the \"gaze side\" and \"gaze direction\" inputs were removed. See table 1 for a list of the modified sensory inputs. \n\nThe performance of a number of algorithms and approximations was measured on the task: a random policy; Q-learning on the sensory inputs; a model with a localist representation (i.e.
the hidden state consisted of a single multinomial random variable) with linear and quadratic approximate value functions; the DSBN with mean-field inference and linear and quadratic approximations; and a human driver. The localist representation used the linear Q-learning approximation of [14], and the corresponding quadratic approximation. The quadratic approximations were trained both from random initialization, and from initialization with the corresponding learned linear models (and random quadratic portion). The non-human algorithms were each trained for 100000 iterations, and in each case a constant learning rate of 0.01 and temporal discount rate of 0.9 were used. The human driver (the author) was trained for 1000 iterations using a simple character-based graphical display, with each iteration lasting 0.5 seconds. \n\nStochastic policies were used for all RL algorithms, with actions chosen from a Boltzmann distribution with temperature T decreasing over time: \n\nP(a^t = a | \mu^t) = \exp\{Q^F(\mu^t, a)/T\} / \sum_{a'} \exp\{Q^F(\mu^t, a')/T\} \quad (10) \n\nThe DSBN had 4 hidden units per time slice, and the localist model used a multinomial with 16 states. The Q-learner had a table representation with 2160 entries. After training, each non-human algorithm was tested for 20 trials of 5000 time steps each. The human was tested for 2000 time steps, and the results were renormalized for comparison with the other methods. The results are shown in figure 2. All results were negative, so lower numbers indicate better performance in the graph. The error bars show one standard deviation across the 20 trials. \n\nThere was little performance difference between the localist representation and the DSBN but, as expected, the DSBN was exponentially more efficient in its hidden-state representation. The linear and quadratic approximations performed comparably, but well below human performance. However, the DSBN with quadratic approximation was more sensitive to initialization.
When initialized with random parameter settings, it failed to find a good policy. However, it did converge to a reasonable policy when the linear portion of the quadratic model was initialized with a previously learned linear model. \n\nThe hidden units in the DSBN encode useful features of the input, such as whether a car was at the \"near\" or \"nose\" position. They also encode some history, such as the current gaze direction. This has advantages over a simple stochastic policy learned via Q-learning: if the Q-learner knows that there is an oncoming car, it can randomly select to look left or right. The DSBN systematically looks to the left, and then to the right, wasting fewer actions. \n\nFigure 2: Results on the New York Driving task for nine algorithms: R=random; Q=Q-learning; LC=linear multinomial; QCR=quadratic multinomial, random init.; QCL=quadratic multinomial, linear init.; LD=linear DSBN; QDR=quadratic DSBN, random init.; QDL=quadratic DSBN, linear init.; H=human. \n\n4 Discussion \n\nThe DSBN performed better than a standard Q-learner, and comparably to a model with a localist representation, despite using approximate inference and exponentially fewer parameters. This is encouraging, since an efficient encoding of the state is a prerequisite for tackling larger decision problems. Less encouraging was the value-function approximation: when compared to human performance, it is clear that all methods are far from optimal, although again the factored approximation of the DSBN did not hurt performance relative to the localist multinomial representation.
The sensitivity to initialization of the quadratic approximation is worrisome, but the success of initializing from a simpler model suggests that staged learning may be appropriate, where simple models are learned and used to initialize more complex models. These findings echo those of [14] in the context of learning a non-factored approximate value function. \n\nThere are a number of related works, both in the fields of reinforcement learning and Bayesian networks. We use the sigmoid belief network mean-field approximation given in [10], and discussed in the context of time-series models (the \"fully factored\" approximation) in [15]. Approximate inference in dynamic Bayesian networks has been discussed in [15] and [8]. The additive factored value function was used in the context of factored MDPs (with no hidden state) in [16], and the linear Q-learning approximation was given in [14]. Approximate inference was combined with more sophisticated value function approximation in [17]. To our knowledge, this is the first attempt to explore the practicality of combining all of these techniques in order to solve a single problem. \n\nThere are several possible extensions. As described above, the representation learned by the DSBN is not tuned to the task at hand. The reinforcement information could be used to guide the learning of the DSBN parameters [18, 13]. Also, if this were done, then the reinforcement signals would provide additional evidence as to what state the POMDP is in, and could be used to aid inference. More sophisticated function approximation could be used [17]. Finally, although this method appears to work in practice, there is no guarantee that the reinforcement learning will converge. We view this work as an encouraging first step, with much further study required.
\n\n5 Conclusions \n\nWe have shown that a dynamic Bayesian network can be used to construct a compact representation useful for solving a decision problem with hidden state. The parameters of the DBN can be learned from experience. Learning occurs despite the use of simple value-function approximations and mean-field inference. Approximations of the value function result in good performance, but are clearly far from optimal. The fully-factored assumptions made for the belief state and the value function do not appear to impact performance, as compared to the non-factored model. The algorithm as presented runs entirely on-line by performing \"forward\" inference only. There is much room for future work, including improving the utility of the factored representation learned, and the quality of approximate inference and the value function approximation. \n\nAcknowledgments \n\nWe thank Geoffrey Hinton, Zoubin Ghahramani and Andy Brown for helpful discussions, the anonymous referees for valuable comments and criticism, and particularly Peter Dayan for helpful discussions and comments on an early draft of this paper. This research was funded by NSERC Canada and the Gatsby Charitable Foundation. \n\nReferences \n\n[1] S.D. Whitehead and D.H. Ballard. Learning to perceive and act by trial and error. Machine Learning, 7, 1991. \n\n[2] E.J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26:282-304, 1978. \n\n[3] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988. \n\n[4] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5, 1989. \n\n[5] Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks.
Artificial Intelligence, 42:393-405, 1990. \n\n[6] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993. \n\n[7] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. An introduction to variational methods for graphical models. Machine Learning, 1999. In press. \n\n[8] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. UAI'98, 1998. \n\n[9] R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56:71-113, 1992. \n\n[10] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996. \n\n[11] Lawrence R. Rabiner and Biing-Hwang Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3:4-16, January 1986. \n\n[12] C.J.C.H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992. \n\n[13] A.K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. Ph.D. thesis, Dept. of Computer Science, University of Rochester, Rochester NY, 1995. \n\n[14] M.L. Littman, A.R. Cassandra, and L.P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Proc. International Conference on Machine Learning, 1995. \n\n[15] Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. Machine Learning, 1997. \n\n[16] D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In Proc. IJCAI'99, 1999. \n\n[17] A. Rodriguez, R. Parr, and D. Koller. Reinforcement learning using approximate belief states. In S. A. Solla, T. K. Leen, and K.-R. M\u00fcller, editors, Advances in Neural Information Processing Systems, volume 12. The MIT Press, Cambridge, 2000. \n\n[18] L. Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach.
In Tenth National Conference on AI, 1992.", "award": [], "sourceid": 1754, "authors": [{"given_name": "Brian", "family_name": "Sallans", "institution": null}]}