The Asymptotic Convergence-Rate of Q-learning

Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)


Authors

Csaba Szepesvári

Abstract

In this paper we show that for discounted MDPs with discount factor γ > 1/2 the asymptotic rate of convergence of Q-learning is O(1/t^{R(1-γ)}) if R(1-γ) < 1/2 and O(√(log log t / t)) otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. Here R = p_min/p_max is the ratio of the minimum and maximum state-action occupation frequencies. The results extend to convergent on-line learning provided that p_min > 0, where p_min and p_max now become the minimum and maximum state-action occupation frequencies corresponding to the stationary distribution.
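
To make the setting concrete, the following is a minimal sketch (not the paper's experimental setup) of Q-learning on a toy discounted MDP in which state-action pairs are sampled i.i.d. from a fixed distribution p, the regime described in the abstract. The MDP, the choice γ = 0.8, and the harmonic step-size schedule are illustrative assumptions; the quantity R = p_min/p_max is computed as defined above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
gamma = 0.8  # discount factor > 1/2, matching the regime of the result

# Random transition kernel P[s, a] -> distribution over next states, and rewards
# (illustrative toy MDP, not taken from the paper).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
Rwd = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# Fixed sampling distribution over state-action pairs; R = p_min / p_max.
p = rng.dirichlet(np.ones(n_states * n_actions)).reshape(n_states, n_actions)
R_ratio = p.min() / p.max()
print(f"R = p_min / p_max = {R_ratio:.3f}")

Q = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))  # per-pair update counts for step sizes

for t in range(100_000):
    # Sample a state-action pair from the fixed distribution p.
    idx = rng.choice(n_states * n_actions, p=p.ravel())
    s, a = divmod(idx, n_actions)
    s_next = rng.choice(n_states, p=P[s, a])

    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]  # harmonic step size for the updated pair (an assumption)
    target = Rwd[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

print(Q)
```

Under this kind of sampling, pairs with small p[s, a] are updated less often, which is the intuition behind the dependence of the convergence rate on R = p_min/p_max.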