Dirk Ormoneit, Peter W. Glynn
Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. In contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem.
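The stability property claimed above can be illustrated with a minimal sketch of kernel-based value estimation on sampled transitions. The code below is not the paper's average-cost algorithm; it is a simplified discounted-cost analogue under assumed toy dynamics (a one-dimensional state, two actions, Gaussian kernel weights). Because each kernel-smoothed backup is a convex combination of sample values, the iteration is a contraction and converges to a unique fixed point, which is the mechanism behind the convergence guarantee.

```python
import numpy as np

# Hypothetical transition sample: for each of two actions a we observe
# tuples (x_i, r_i, x'_i) drawn from the controlled Markov process.
rng = np.random.default_rng(0)
n, gamma, bandwidth = 50, 0.9, 0.3

# Toy dynamics (assumption, for illustration only): the two actions
# drift the state in opposite directions; reward favors states near 0.
states = rng.uniform(-1.0, 1.0, size=(2, n))
drift = np.array([[-0.2], [0.2]])
next_states = np.clip(states + drift + 0.05 * rng.standard_normal((2, n)), -1.0, 1.0)
rewards = -np.abs(states)

def kernel_weights(x, centers, h):
    """Normalized Gaussian kernel weights (Nadaraya-Watson smoothing)."""
    w = np.exp(-0.5 * ((x - centers) / h) ** 2)
    return w / w.sum()

def backup(V_at_next, x):
    """Kernel-smoothed Bellman backup at state x: maximize over actions
    the kernel-weighted average of sampled one-step returns."""
    q = []
    for a in range(2):
        w = kernel_weights(x, states[a], bandwidth)
        q.append(np.dot(w, rewards[a] + gamma * V_at_next[a]))
    return max(q)

# Approximate value iteration on the sampled next-states.  The backup
# operator is a contraction (weights are non-negative and sum to one),
# so the iterates converge to a unique fixed point.
V = [np.zeros(n), np.zeros(n)]
for _ in range(200):
    V = [np.array([backup(V, x) for x in next_states[a]]) for a in range(2)]
```

The key design choice is that the value function is represented only through its values at the sampled points, with the kernel providing generalization between them; this avoids the divergence problems that can arise when temporal-difference updates are combined with parametric approximators.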