{"title": "The Asymptotic Convergence-Rate of Q-learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1064, "page_last": 1070, "abstract": null, "full_text": "The Asymptotic Convergence-Rate of Q-learning \n\nCs. Szepesvári \n\nResearch Group on Artificial Intelligence, \"Jozsef Attila\" University, \nSzeged, Aradi vrt. tere 1, Hungary, H-6720 \nszepes@math.u-szeged.hu \n\nAbstract \n\nIn this paper we show that for discounted MDPs with discount factor γ > 1/2 the asymptotic rate of convergence of Q-learning is O(1/t^{R(1-γ)}) if R(1-γ) < 1/2 and O(√(log log t / t)) otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. Here R = p_min/p_max is the ratio of the minimum and maximum state-action occupation frequencies. The results extend to convergent on-line learning provided that p_min > 0, where p_min and p_max now become the minimum and maximum state-action occupation frequencies corresponding to the stationary distribution. \n\n1 INTRODUCTION \n\nQ-learning is a popular reinforcement learning (RL) algorithm whose convergence is well demonstrated in the literature (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996). Our aim in this paper is to provide an upper bound for the convergence rate of (lookup-table based) Q-learning algorithms. Although this upper bound is not strict, computer experiments (to be presented elsewhere) and the form of the lemma underlying the proof indicate that the obtained upper bound can be made strict by a slightly more complicated definition for R. Our results extend to learning on aggregated states (see (Singh et al., 1995)) and other related algorithms which admit a certain form of asynchronous stochastic approximation (see (Szepesvári and Littman, 1996)). 
The values 0 ≤ α_t(x,a) ≤ 1 are called the learning rates associated with the state-action pair (x,a) at time t. This value is assumed to be zero if (x,a) ≠ (x_t,a_t), i.e. only the value of the actual state and action is reestimated in each step. If \n\n∑_{t=1}^∞ α_t(x,a) = ∞ (2) \n\nand \n\n∑_{t=1}^∞ α_t²(x,a) < ∞ (3) \n\nthen Q-learning is guaranteed to converge to the unique fixed point Q* of the operator T : ℝ^{X×A} → ℝ^{X×A} defined by \n\n(TQ)(x,a) = R(x,a) + γ ∑_{y∈X} P(x,a,y) max_b Q(y,b) \n\n(convergence proofs can be found in (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996)). Once Q* is identified, the learning agent can act optimally in the underlying MDP simply by choosing the action which maximizes Q*(x,a) when the agent is in state x (Ross, 1970; Puterman, 1994). \n\n3 THE MAIN RESULT \n\nCondition (2) on the learning rate α_t(x,a) requires only that every state-action pair is visited infinitely often, which is a rather mild condition. In this article we adopt the stronger assumption that {(x_t,a_t)}_t is a sequence of independent random variables with a common underlying probability distribution. Although this assumption is not essential, it simplifies the presentation of the proofs greatly. A relaxation will be discussed later. We further assume that the learning rates take the special form \n\nα_t(x,a) = 1/S_t(x,a), if (x,a) = (x_t,a_t); 0, otherwise, \n\nwhere S_t(x,a) is the number of times the state-action pair was visited by the process (x_s,a_s) before time step t plus one, i.e. S_t(x,a) = 1 + #{ (x_s,a_s) = (x,a), 1 ≤ s ≤ t }. This assumption, too, could be relaxed, as will be discussed later. For technical reasons we further assume that the absolute values of the random reinforcement signals r_t admit a common upper bound. 
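To make the setting concrete, the assumptions above (i.i.d. sampling of the state-action pairs from a fixed distribution p(x,a), harmonic learning rates 1/S_t(x,a) with S_t a visit counter) can be turned into a minimal tabular Q-learning sketch. The two-state MDP, its rewards and the sampling distribution below are invented for illustration and are not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
n_states, n_actions, gamma = 2, 2, 0.5
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[x, a, y] = transition probability
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],                    # R[x, a] = expected reward
              [0.5, 2.0]])

# Compute Q* by value iteration so the error |Q_t - Q*| can be monitored.
Q_star = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Q-learning: (x_t, a_t) drawn i.i.d. from a fixed distribution p(x, a),
# learning rate alpha_t(x, a) = 1 / S_t(x, a) with S_t the visit counter.
p = np.array([0.3, 0.2, 0.25, 0.25])         # sampling probabilities over (x, a)
Q = np.zeros((n_states, n_actions))
S = np.zeros((n_states, n_actions))
pairs = rng.choice(n_states * n_actions, size=200_000, p=p)
for idx in pairs:
    x, a = divmod(int(idx), n_actions)
    S[x, a] += 1
    y = rng.choice(n_states, p=P[x, a])      # sample the next state
    Q[x, a] += (R[x, a] + gamma * Q[y].max() - Q[x, a]) / S[x, a]

print(np.abs(Q - Q_star).max())              # final error, small after 200k steps
```

Note that p_min/p_max for this sampling distribution is 0.2/0.3 ≈ 0.67, so by the result of this paper the error should decay only polynomially, at a rate governed by that ratio.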
Our main result is the following: \n\nTHEOREM 3.1 Under the above conditions the following relations hold asymptotically and with probability one: \n\n|Q_t(x,a) - Q*(x,a)| ≤ B / t^{R(1-γ)} (4) \n\nand \n\n|Q_t(x,a) - Q*(x,a)| ≤ B √(log log t / t) (5) \n\nfor some suitable constant B > 0. Here R = p_min/p_max, where p_min = min_{(x,a)} p(x,a) and p_max = max_{(x,a)} p(x,a), and p(x,a) is the sampling probability of (x,a). \n\nNote that if γ ≥ 1 - p_max/(2 p_min) then (4) is the slower, while if γ < 1 - p_max/(2 p_min) then (5) is the slower. The proof will be presented in several steps. \n\nStep 1. Just like in (Littman and Szepesvári, 1996) (see also the extended version (Szepesvári and Littman, 1996)), the main idea is to compare Q_t with the simpler process \n\nQ'_{t+1}(x,a) = (1 - α_t(x,a)) Q'_t(x,a) + α_t(x,a) (r_t + γ max_b Q*(y_t,b)). (6) \n\nNote that the only (but rather essential) difference between the definition of Q_t and that of Q'_t is the appearance of Q* in the defining equation of Q'_t. Firstly, notice that as a consequence of this change the process Q'_t clearly converges to Q*, and this convergence may be investigated along each component (x,a) separately using standard stochastic-approximation techniques (see e.g. (Wasan, 1969; Poljak and Tsypkin, 1973)). \n\nUsing simple devices one can show that the difference process Δ_t(x,a) = |Q_t(x,a) - Q'_t(x,a)| satisfies the following inequality: \n\nΔ_{t+1}(x,a) ≤ (1 - α_t(x,a)) Δ_t(x,a) + γ α_t(x,a) (‖Δ_t‖ + ‖Q'_t - Q*‖). (7) \n\nHere ‖·‖ stands for the maximum norm. That is, the task of showing the convergence rate of Q_t to Q* is reduced to that of showing the convergence rate of Δ_t to zero. \n\nStep 2. We simplify the notation by introducing the abstract process whose update equation is \n\nx_{t+1}(i) = (1 - 1/S_t(i)) x_t(i) + (γ/S_t(i)) (‖x_t‖ + ε_t), if η_t = i; x_t(i), otherwise, (8) \n\nwhere i ∈ {1, 2, ..., n} can be identified with the state-action pairs, x_t with Δ_t, ε_t with ‖Q'_t - Q*‖, etc. We analyze this process in two steps. First we consider processes in which the \"perturbation term\" ε_t is missing. 
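As a quick numerical illustration (not part of the proof), the unperturbed abstract process can be simulated directly: a component i is drawn with probability p_i and shrunk toward γ‖x_t‖ with a harmonic step. The component count, sampling probabilities, and γ below are invented for the sketch. Since the lemma that follows only gives an upper bound on the decay (and the paper notes the bound is not strict), a simulation typically decays at least as fast as t^{-R(1-γ)}:

```python
import numpy as np

rng = np.random.default_rng(1)

# Unperturbed process: component i drawn with probability p_i, then shrunk
# toward gamma * ||x_t|| (max norm) with harmonic step 1 / S_t(i).
n, gamma = 3, 0.6
p = np.array([0.5, 0.3, 0.2])                 # hypothetical sampling probabilities
x = np.ones(n)
S = np.zeros(n)
T = 200_000
norms = np.empty(T)
draws = rng.choice(n, size=T, p=p)            # pre-sample the i.i.d. indices
for t in range(T):
    i = draws[t]
    S[i] += 1
    x[i] = (1 - 1 / S[i]) * x[i] + (gamma / S[i]) * np.abs(x).max()
    norms[t] = np.abs(x).max()

# Empirical decay exponent from a log-log fit over the tail of the run,
# compared with the exponent -R(1-gamma) from the bound, R = p_min / p_max.
ts = np.arange(T // 10, T)
slope = np.polyfit(np.log(ts + 1.0), np.log(norms[ts]), 1)[0]
bound = -(p.min() / p.max()) * (1 - gamma)
print(slope, bound)
```

In runs of this kind the fitted slope is steeper (more negative) than the bound's exponent, consistent with the remark in the introduction that the bound could be tightened.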
For such processes we have the following lemma: \n\nLEMMA 3.2 Assume that η_1, η_2, ..., η_t, ... are independent random variables with a common underlying distribution P(η_t = i) = p_i > 0. Then the process x_t defined by \n\nx_{t+1}(i) = (1 - 1/S_t(i)) x_t(i) + (γ/S_t(i)) ‖x_t‖, if η_t = i; x_t(i), otherwise, (9) \n\nsatisfies \n\n‖x_t‖ = O(1/t^{R(1-γ)}) \n\nwith probability one (w.p.1), where R = min_i p_i / max_i p_i. \n\nProof. (Outline) Let T_0 = 0 and \n\nT_{k+1} = min{ t ≥ T_k | ∀i = 1...n, ∃s ∈ (T_k, t]: η_s = i }, \n\ni.e. T_{k+1} is the smallest time after T_k such that during the time interval [T_k + 1, T_{k+1}] all the components of x_t(·) are \"updated\" in Equation (9) at least once. Then \n\nx_{T_{k+1}+1}(i) ≤ (1 - (1-γ)/S_k) ‖x_{T_k+1}‖, (10) \n\nwhere S_k = max_i S_{t_k(i)}(i). This inequality holds because if t_k(i) is the last time in [T_k + 1, T_{k+1}] when the i-th component is updated, then \n\nx_{T_{k+1}+1}(i) = x_{t_k(i)+1}(i) = (1 - 1/S_{t_k(i)}(i)) x_{t_k(i)}(i) + (γ/S_{t_k(i)}(i)) ‖x_{t_k(i)}‖ \n≤ (1 - 1/S_{t_k(i)}(i)) ‖x_{t_k(i)}‖ + (γ/S_{t_k(i)}(i)) ‖x_{t_k(i)}‖ \n= (1 - (1-γ)/S_{t_k(i)}(i)) ‖x_{t_k(i)}‖ \n≤ (1 - (1-γ)/S_k) ‖x_{T_k+1}‖, \n\nwhere it was exploited that ‖x_t‖ is decreasing. Now, iterating (10) backwards in time yields \n\nx_{T_{k+1}+1}(i) ≤ ‖x_0‖ ∏_{j=0}^{k} (1 - (1-γ)/S_j). \n\nNow, consider the following approximations: T_k ≈ Ck, where C ≥ 1/p_min (C can be computed explicitly from {p_i}); S_k ≈ p_max T_{k+1} ≈ (p_max/p_min)(k+1) ≈ (k+1)/R_0, where R_0 = 1/(C p_max). Then, using large deviations theory, \n\n∏_{j=0}^{k-1} (1 - (1-γ)/S_j) ≤ ∏_{j=0}^{k-1} (1 - R_0(1-γ)/(j+1)) ≈ (1/k)^{R_0(1-γ)} (11) \n\nholds w.p.1. Now, by defining s = T_k + 1, so that s/C ≈ k, we get \n\n‖x_s‖ = O(1/s^{R(1-γ)}), \n\nwhich holds due to the monotonicity of x_t and of 1/k^{R_0(1-γ)}, and because R = p_min/p_max ≈ R_0. □ \n\nStep 3. Assume that γ > 1/2. Fortunately, we know by an extension of the Law of the Iterated Logarithm to stochastic approximation processes that the convergence 
rate of ‖Q'_t - Q*‖ is O(√(log log t / t)) (the uniform boundedness of the random reinforcement signals must be exploited in this step) (Major, 1973). Thus it is sufficient to provide a convergence rate estimate for the perturbed process x_t defined by (8) when ε_t = C √(log log t / t) for some C > 0. We claim that ε_t converges faster than x_t. Define the process \n\nz_{t+1}(i) = (1 - (1-γ)/S_t(i)) z_t(i), if η_t = i; z_t(i), if η_t ≠ i. (12) \n\nThis process clearly lower bounds the perturbed process x_t. Obviously, the convergence rate of z_t is O(1/t^{1-γ}), which is slower than the convergence rate of ε_t provided that γ > 1/2, proving that ε_t must converge faster than x_t. Thus, asymptotically ε_t ≤ (1/γ - 1) ‖x_t‖, and so ‖x_t‖ is decreasing for large enough t. Then, by an argument similar to that used in the derivation of (10), we get \n\nx_{T_{k+1}+1}(i) ≤ (1 - (1-γ)/S_k) ‖x_{T_k+1}‖ + (γ/s_k) ε_{T_k}, (13) \n\nwhere s_k = min_i S_{t_k(i)}(i). By approximation arguments similar to those of Step 2, together with the bound (1/n^η) ∑_{s≤n} s^{η-3/2} √(log log s) ≈ n^{-1/2} √(log log n), 1 > η > 0, which follows from the mean-value theorem for integrals and integration by parts, we get that x_t = O(1/t^{R(1-γ)}). \n\nStep 4. The case when γ ≤ 1/2 can be treated similarly. \n\nStep 5. Putting the pieces together and applying them to Δ_t = |Q_t - Q'_t| yields Theorem 3.1. \n\n4 DISCUSSION AND CONCLUSIONS \n\nThe most restrictive of our conditions is the assumption concerning the sampling of (x_t, a_t). 
However, note that under a fixed learning policy the process (x_t, a_t) is a (non-stationary) Markovian process, and if the learning policy converges in the sense that lim_{t→∞} P(a_t | F_t) = P(a_t | x_t) (here F_t stands for the history of the learning process), then the process (x_t, a_t) becomes eventually stationary Markovian, and the sampling distribution could be replaced by the stationary distribution of the underlying stationary Markovian process. If actions become asymptotically optimal during the course of learning, then the support of this stationary process will exclude the state-action pairs whose action is sub-optimal, i.e. the conditions of Theorem 3.1 will no longer be satisfied. Notice that the proof of convergence of such processes still follows very similar lines to that of the proof presented here (see the forthcoming paper (Singh et al., 1997)), so we expect that the same convergence rates hold and can be proved using nearly identical techniques in this case as well. \n\nA further step would be to find explicit expressions for the constant B of Theorem 3.1. Clearly, B depends heavily on the sampling of (x_t, a_t), as well as on the transition probabilities and rewards of the underlying MDP. Also, the choice of harmonic learning rates is arbitrary. If a general sequence α_t were employed, then the artificial \"time\" τ_t(x,a) = 1 / ∏_{s=0}^{t} (1 - α_s(x,a)) should be used (note that for the harmonic sequence τ_t(x,a) ≈ t). Note that although the developed bounds are asymptotic in their present forms, the proper usage of large deviations theory would enable us to develop non-asymptotic bounds. 
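The \"artificial time\" just mentioned is easy to check numerically: for the harmonic sequence the product telescopes, so τ_t recovers ordinary time, while for a constant learning rate it grows exponentially. The starting index of the sequence below is an illustrative choice, not prescribed by the paper:

```python
import numpy as np

def artificial_time(alphas):
    """Artificial time tau_t = 1 / prod_s (1 - alpha_s) for a learning-rate sequence."""
    return 1.0 / np.prod(1.0 - np.asarray(alphas, dtype=float))

t = 1000
harmonic = [1.0 / s for s in range(2, t + 1)]    # alpha_s = 1/s, s = 2..t
print(artificial_time(harmonic))                 # telescopes to ordinary time t

constant = [0.01] * t                            # a constant rate, by contrast,
print(artificial_time(constant))                 # grows like (1 - c)**(-t)
```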
\n\nOther possible ways to extend the results of this paper may include Q-learning when learning on aggregated states (Singh et al., 1995), Q-learning for alternating/simultaneous Markov games (Littman, 1994; Szepesvári and Littman, 1996), and any other algorithms whose corresponding difference process Δ_t satisfies an inequality similar to (7). \n\nYet another application of the convergence-rate estimate might be the convergence proof of some average reward reinforcement learning algorithms. The idea of those algorithms follows from a kind of Tauberian theorem, i.e. that discounted sums converge to the average value if the discount rate converges to one (see e.g. Lemma 1 of (Mahadevan, 1994; Mahadevan, 1996) or, for a value-iteration scheme relying on this idea, (Hordijk and Tijms, 1975)). Using the methods developed here, the proof of convergence of the corresponding Q-learning algorithms seems quite possible. We would like to note here that related results were obtained by Bertsekas and Tsitsiklis (see e.g. (Bertsekas and Tsitsiklis, 1996)). \n\nFinally, note that as an application of this result we immediately get that the convergence rate of the model-based RL algorithm, where the transition probabilities and rewards are estimated by their respective averages, is clearly better than that of Q-learning. Indeed, simple calculations show that the law of the iterated logarithm holds for the learning process underlying model-based RL. Moreover, the exact expression for the convergence rate depends explicitly on how much computational effort we spend on obtaining the next estimate of the optimal value function: the more effort we spend, the faster the convergence. This bound thus provides a direct way to control the tradeoff between the computational effort and the convergence rate. 
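The claim about model-based RL rests on empirical frequency estimates obeying a law-of-the-iterated-logarithm rate of roughly √(log log t / t), irrespective of the occupation-frequency ratio R that throttles Q-learning. A minimal check, with a transition distribution invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate a (hypothetical) next-state distribution P(y | x, a) by empirical
# frequencies -- the estimator underlying model-based RL.
p_true = np.array([0.3, 0.5, 0.2])
n = 100_000
samples = rng.choice(len(p_true), size=n, p=p_true)
p_hat = np.bincount(samples, minlength=len(p_true)) / n

err = np.abs(p_hat - p_true).max()
print(err)   # decays at roughly the sqrt(log log n / n) rate, for any p_true
```

By contrast, the Q-learning error of Theorem 3.1 decays only as t^{-R(1-γ)} when R(1-γ) < 1/2, which for unbalanced sampling or γ near one is far slower.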
\n\nAcknowledgements \n\nThis research was supported by OTKA Grant No. F20132 and by a grant provided by the Hungarian Educational Ministry under contract no. FKFP 1354/1997. I would like to thank Andras Kramli and Michael L. Littman for numerous helpful and thought-provoking discussions. \n\nReferences \n\nBertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA. \n\nHordijk, A. and Tijms, H. (1975). A modified form of the iterative method of dynamic programming. Annals of Statistics, 3:203-208. \n\nJaakkola, T., Jordan, M., and Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201. \n\nLittman, M. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proc. of the Eleventh International Conference on Machine Learning, pages 157-163, San Francisco, CA. Morgan Kaufmann. \n\nLittman, M. and Szepesvári, C. (1996). A generalized reinforcement learning model: Convergence and applications. In Int. Conf. on Machine Learning. http://iserv.ikLkfki.hu/ asl-publs.html. \n\nMahadevan, S. (1994). To discount or not to discount in reinforcement learning: A case study comparing R learning and Q learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 164-172, San Francisco, CA. Morgan Kaufmann. \n\nMahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1,2,3):124-158. \n\nMajor, P. (1973). A law of the iterated logarithm for the Robbins-Monro method. Studia Scientiarum Mathematicarum Hungarica, 8:95-102. \n\nPoljak, B. and Tsypkin, Y. (1973). Pseudogradient adaptation and training algorithms. Automation and Remote Control, 12:83-94. \n\nPuterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. 
John Wiley & Sons, Inc., New York, NY. \n\nRoss, S. (1970). Applied Probability Models with Optimization Applications. Holden-Day, San Francisco, California. \n\nSingh, S., Jaakkola, T., and Jordan, M. (1995). Reinforcement learning with soft state aggregation. In Proceedings of Neural Information Processing Systems. \n\nSingh, S., Jaakkola, T., Littman, M., and Szepesvári, C. (1997). On the convergence of single-step on-policy reinforcement-learning algorithms. Machine Learning. In preparation. \n\nSzepesvári, C. and Littman, M. (1996). Generalized Markov Decision Processes: Dynamic programming and reinforcement learning algorithms. Machine Learning. In preparation; available as TR CS96-10, Brown Univ. \n\nTsitsiklis, J. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 8(3-4):257-277. \n\nWasan, T. (1969). Stochastic Approximation. Cambridge University Press, London. \n\nWatkins, C. (1990). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge. \n", "award": [], "sourceid": 1383, "authors": [{"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}]}