Learning Factored Representations for Partially Observable Markov Decision Processes

Part of Advances in Neural Information Processing Systems 12 (NIPS 1999)

Bibtex Metadata Paper


Brian Sallans


The problem of reinforcement learning in a non-Markov environment is explored using a dynamic Bayesian network, where conditional indepen(cid:173) dence assumptions between random variables are compactly represented by network parameters. The parameters are learned on-line, and approx(cid:173) imations are used to perform inference and to compute the optimal value function. The relative effects of inference and value function approxi(cid:173) mations on the quality of the final policy are investigated, by learning to solve a moderately difficult driving task. The two value function approx(cid:173) imations, linear and quadratic, were found to perform similarly, but the quadratic model was more sensitive to initialization. Both performed be(cid:173) low the level of human performance on the task. The dynamic Bayesian network performed comparably to a model using a localist hidden state representation, while requiring exponentially fewer parameters.