Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)
Csaba Szepesvári
In this paper we show that for discounted MDPs with discount factor γ > 1/2 the asymptotic rate of convergence of Q-learning is O(1/t^{R(1-γ)}) if R(1-γ) < 1/2 and O(√(log log t / t)) otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. Here R = p_min/p_max is the ratio of the minimum and maximum state-action occupation frequencies. The results extend to convergent on-line learning provided that p_min > 0, where p_min and p_max now become the minimum and maximum state-action occupation frequencies corresponding to the stationary distribution.
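The setting the abstract analyzes can be sketched as follows: Q-learning on a small MDP where, at each step, a state-action pair is drawn from a fixed probability distribution p and updated with a 1/t-style step size. This is a minimal illustration, not the paper's experiments; the toy MDP (transition kernel P, rewards r, discount gamma, sampling distribution p) is an assumed example.

```python
import numpy as np

# Toy MDP and sampling distribution (illustrative assumptions, not from the paper).
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma = 0.9                                   # discount factor > 1/2, as in the abstract
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state dist.
r = rng.random((n_states, n_actions))         # deterministic rewards r(s, a)
p = np.array([[0.4, 0.3], [0.2, 0.1]])        # fixed state-action sampling distribution
R = p.min() / p.max()                         # ratio of min/max occupation frequencies

Q = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))      # per-pair visit counts, for 1/t step sizes

flat_p = p.ravel()
for t in range(200_000):
    # Sample a state-action pair from the fixed distribution p.
    idx = rng.choice(n_states * n_actions, p=flat_p)
    s, a = divmod(idx, n_actions)
    s_next = rng.choice(n_states, p=P[s, a])  # simulate one transition
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]                # step size decaying as 1/(visits of (s, a))
    # Standard Q-learning update toward the sampled Bellman target.
    Q[s, a] += alpha * (r[s, a] + gamma * Q[s_next].max() - Q[s, a])
```

With p as above, R = 0.25, so the predicted asymptotic rate O(1/t^{R(1-γ)}) is very slow; skewed occupation frequencies (small R) directly degrade the convergence rate, which is the point of the result.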