Part of Advances in Neural Information Processing Systems 9 (NIPS 1996)
John Tsitsiklis, Benjamin Van Roy
We propose and analyze an algorithm that approximates solutions to the problem of optimal stopping in a discounted irreducible aperiodic Markov chain. The scheme involves the use of linear combinations of fixed basis functions to approximate a Q-function. The weights of the linear combination are incrementally updated through an iterative process similar to Q-learning, involving simulation of the underlying Markov chain. Due to space limitations, we only provide an overview of a proof of convergence (with probability 1) and bounds on the approximation error. This is the first theoretical result that establishes the soundness of a Q-learning-like algorithm when combined with arbitrary linear function approximators to solve a sequential decision problem. Though this paper focuses on the case of finite state spaces, the results extend naturally to continuous and unbounded state spaces, which are addressed in a forthcoming full-length paper.
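To make the scheme concrete, the following is a minimal sketch of a Q-learning-like update for optimal stopping with a linear function approximator, in the spirit of the algorithm described above. Every concrete choice here (the transition matrix, the stopping rewards, the basis functions, the step-size schedule) is an illustrative assumption, not taken from the paper; the update approximates the continuation value Q(x) by phi(x)'r and adjusts the weights r along a simulated trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 5
alpha = 0.9  # discount factor

# An irreducible aperiodic Markov chain: a random row-stochastic matrix
# (hypothetical example chain, not from the paper).
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

# G[x]: reward collected if we choose to stop in state x.
G = rng.random(n_states)

# Fixed basis functions: Phi[x] is the feature vector phi(x) of state x.
K = 3
Phi = rng.random((n_states, K))

def q_approx(r, x):
    """Approximate continuation value: Q(x) ~ phi(x)' r."""
    return Phi[x] @ r

# Incremental weight updates along a single simulated trajectory.
r = np.zeros(K)
x = 0
for t in range(1, 20001):
    x_next = rng.choice(n_states, p=P[x])
    gamma = 1.0 / t  # diminishing step size
    # Temporal-difference-style target: discounted value of the better
    # of stopping or continuing from the next state.
    target = alpha * max(G[x_next], q_approx(r, x_next))
    r += gamma * Phi[x] * (target - q_approx(r, x))
    x = x_next
```

The update pushes phi(x)'r toward alpha * max(G(x'), phi(x')'r), the one-step value of acting optimally at the next state; the paper's contribution is proving that this kind of iteration converges with probability 1 and bounding its approximation error, which this toy sketch does not attempt to verify.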