Part of Advances in Neural Information Processing Systems 12 (NIPS 1999)
The problem of reinforcement learning in a non-Markov environment is explored using a dynamic Bayesian network, where conditional independence assumptions between random variables are compactly represented by network parameters. The parameters are learned on-line, and approximations are used to perform inference and to compute the optimal value function. The relative effects of inference and value function approximations on the quality of the final policy are investigated, by learning to solve a moderately difficult driving task. The two value function approximations, linear and quadratic, were found to perform similarly, but the quadratic model was more sensitive to initialization. Both performed below the level of human performance on the task. The dynamic Bayesian network performed comparably to a model using a localist hidden state representation, while requiring exponentially fewer parameters.
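The claim about exponentially fewer parameters can be illustrated with a rough count. The sketch below is not the paper's model. It assumes K binary hidden variables, a localist representation that assigns one state to each of the 2^K joint configurations, and a factored transition model in which each variable depends on a fixed number of parents in the previous time slice. The function names and the `parents_per_var` parameter are hypothetical and chosen only for illustration.

```python
def localist_transition_params(num_binary_vars: int) -> int:
    """Free parameters in a full transition table over 2**K joint states (assumed setup)."""
    n_states = 2 ** num_binary_vars
    # Each row is a distribution over n_states outcomes, so it has n_states - 1 free entries.
    return n_states * (n_states - 1)


def factored_transition_params(num_binary_vars: int, parents_per_var: int) -> int:
    """Free parameters when each binary variable has a fixed number of binary parents (assumed setup)."""
    # One Bernoulli parameter per parent configuration, per variable.
    return num_binary_vars * (2 ** parents_per_var)


if __name__ == "__main__":
    for k in (4, 8, 12):
        print(k, localist_transition_params(k), factored_transition_params(k, parents_per_var=2))
```

Under these assumptions the localist count grows as roughly 4^K, while the factored count grows linearly in K for a fixed number of parents, which is the kind of gap the abstract describes.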