{"title": "The Asymptotic Convergence-Rate of Q-learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1064, "page_last": 1070, "abstract": null, "full_text": "The Asymptotic  Convergence-Rate of \n\nQ-Iearning \n\nes.  Szepesvari* \n\nResearch Group on Artificial Intelligence,  \"Jozsef Attila\"  University, \n\nSzeged,  Aradi vrt.  tere  1,  Hungary,  H-6720 \n\nszepes@math.u-szeged.hu \n\nAbstract \n\nIn  this  paper  we  show  that  for  discounted  MDPs  with  discount \nfactor,  >  1/2  the  asymptotic  rate  of convergence  of  Q-Iearning \nif  R(1 - ,) <  1/2 and  O( Jlog log tit)  otherwise \nis  O(1/tR (1-1') \nprovided that the state-action pairs are sampled from a fixed prob(cid:173)\nability  distribution.  Here  R  =  Pmin/Pmax  is  the ratio of the  min(cid:173)\nimum  and maximum state-action occupation frequencies.  The re(cid:173)\nsults extend to convergent on-line learning provided that Pmin  > 0, \nwhere  Pmin  and  Pmax  now  become  the  minimum  and  maximum \nstate-action  occupation frequencies  corresponding  to  the  station(cid:173)\nary distribution. \n\n1 \n\nINTRODUCTION \n\nQ-Iearning is a popular reinforcement learning (RL) algorithm whose convergence is \nwell  demonstrated in the literature  (Jaakkola et al.,  1994; Tsitsiklis,  1994;  Littman \nand  Szepesvari,  1996;  Szepesvari and  Littman,  1996).  Our aim in  this paper is  to \nprovide  an upper bound for the convergence rate of (lookup-table based)  Q-Iearning \nalgorithms.  Although, this upper bound is  not strict, computer experiments (to be \npresented elsewhere)  and the form  of the lemma underlying the proof indicate that \nthe obtained upper  bound can be made strict by a  slightly more complicated defi(cid:173)\nnition for  R.  Our results extend to learning on aggregated states (see  (Singh et al., \n1995\u00bb  and  other  related  algorithms  which  admit  a  certain  form  of asynchronous \nstochastic approximation  (see  (Szepesv<iri  and Littman,  1996\u00bb. \n\nPresent address:  Associative Computing, Inc., Budapest, Konkoly Thege M.  u.  29-33, \n\nHUNGARY-1121 \n\n\fThe Asymptotic Convergence-Rate of Q-leaming \n\n1065 \n\n2  Q-LEARNING \n\nWatkins  introduced  the  following  algorithm  to  estimate  the  value  of state-action \npairs in  discounted  Markovian Decision Processes  (MDPs)  (Watkins,  1990): \n\nHere x  E X  and a  E A are states and actions, respectively, X  and A are finite.  It is \nassumed that some random sampling mechanism (e.g.  simulation or interaction with \na  real  Markovian  environment)  generates  random  samples  of form  (Xt, at, Yt, rt), \nwhere  the  probability  of Yt  given  (xt,at)  is  fixed  and  is  denoted  by  P(xt,at,Yt), \nE[rt I Xt, at]  = R(x, a)  is  the  immediate  average  reward  which  is  received  when \nexecuting action a from  state x, Yt  and rt  are assumed to be independent given the \nhistory  of the learning-process,  and also it  is  assumed  that  Var[rt I Xt, at]  < C  for \nsome  C  > O.  The  values  0  ~ at(x,a)  ~ 1  are  called  the  learning  rate  associated \nwith  the  state-action  pair  (x, a)  at  time  t.  This  value  is  assumed  to  be  zero  if \n(x,a)  =J  (xt,at),  i.e.  only the value of the actual state and action is  reestimated in \neach step.  
If

    ∑_{t=1}^{∞} α_t(x,a) = ∞    (2)

and

    ∑_{t=1}^{∞} α_t^2(x,a) < ∞    (3)

then Q-learning is guaranteed to converge to the unique fixed point Q* of the operator T : ℝ^{X×A} → ℝ^{X×A} defined by

    (TQ)(x,a) = R(x,a) + γ ∑_{y∈X} P(x,a,y) max_b Q(y,b)

(convergence proofs can be found in (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996)). Once Q* is identified, the learning agent can act optimally in the underlying MDP simply by choosing the action which maximizes Q*(x,a) when the agent is in state x (Ross, 1970; Puterman, 1994).

3 THE MAIN RESULT

Condition (2) on the learning rates α_t(x,a) requires only that every state-action pair is visited infinitely often, which is a rather mild condition. In this article we make the stronger assumption that {(x_t, a_t)}_t is a sequence of independent random variables with a common underlying probability distribution. Although this assumption is not essential, it simplifies the presentation of the proofs greatly. A relaxation will be discussed later. We further assume that the learning rates take the special form

    α_t(x,a) = 1/S_t(x,a), if (x,a) = (x_t, a_t);    α_t(x,a) = 0, otherwise,

where S_t(x,a) is the number of times the state-action pair was visited by the process (x_s, a_s) before time step t, plus one, i.e. S_t(x,a) = 1 + #{ (x_s, a_s) = (x,a), 1 ≤ s ≤ t }. This assumption, too, could be relaxed, as will be discussed later. For technical reasons we further assume that the absolute values of the random reinforcement signals r_t admit a common upper bound. Our main result is the following:

THEOREM 3.1 Under the above conditions the following relations hold asymptotically and with probability one:

    |Q_t(x,a) - Q*(x,a)| ≤ B / t^{R(1-γ)}    (4)

and

    |Q_t(x,a) - Q*(x,a)| ≤ B √(log log t / t)    (5)

for some suitable constant B > 0. Here R = p_min/p_max, where p_min = min_{(x,a)} p(x,a) and p_max = max_{(x,a)} p(x,a), and p(x,a) is the sampling probability of (x,a).

Note that if γ ≥ 1 - p_max/(2 p_min) then (4) is the slower, while if γ < 1 - p_max/(2 p_min) then (5) is the slower. The proof will be presented in several steps.

Step 1. Just like in (Littman and Szepesvári, 1996) (see also the extended version (Szepesvári and Littman, 1996)), the main idea is to compare Q_t with the simpler process

    Q̂_{t+1}(x,a) = (1 - α_t(x,a)) Q̂_t(x,a) + α_t(x,a) (r_t + γ max_b Q*(y_t, b)).    (6)

Note that the only (but rather essential) difference between the definition of Q_t and that of Q̂_t is the appearance of Q* in the defining equation of Q̂_t. Firstly, notice that as a consequence of this change the process Q̂_t clearly converges to Q*, and this convergence may be investigated along each component (x,a) separately using standard stochastic-approximation techniques (see e.g. (Wasan, 1969; Poljak and Tsypkin, 1973)).

Using simple devices one can show that the difference process Δ_t(x,a) = |Q_t(x,a) - Q̂_t(x,a)| satisfies the following inequality:

    Δ_{t+1}(x,a) ≤ (1 - α_t(x,a)) Δ_t(x,a) + γ α_t(x,a) (‖Δ_t‖ + ‖Q̂_t - Q*‖).    (7)

Here ‖·‖ stands for the maximum norm. That is, the task of showing the convergence rate of Q_t to Q* is reduced to that of showing the convergence rate of Δ_t to zero.
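Before continuing with the proof, note that the trade-off between the two bounds of Theorem 3.1 is easy to evaluate numerically. The short Python sketch below is illustrative only (the uniform example distribution is an arbitrary choice, not taken from the paper): it computes R from a given sampling distribution and reports which of the two rates is the slower, i.e. the binding one.

import numpy as np

def dominant_rate(p_sample, gamma):
    """Return R = p_min/p_max and the slower of the two bounds in Theorem 3.1.

    p_sample -- array of sampling probabilities p(x, a), all positive
    gamma    -- discount factor
    """
    p_min, p_max = p_sample.min(), p_sample.max()
    R = p_min / p_max
    if R * (1.0 - gamma) < 0.5:
        return R, "O(1/t^{R(1-gamma)})"       # bound (4) decays more slowly than (5)
    return R, "O(sqrt(log log t / t))"        # bound (5) is the slower one

# Example: uniform sampling over 8 pairs (R = 1) with gamma = 0.9 gives R(1-gamma) = 0.1 < 1/2,
# so the polynomial bound (4) is the binding one.
print(dominant_rate(np.full((4, 2), 1.0 / 8.0), 0.9))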
Step 2. We simplify the notation by introducing an abstract process whose update equation is

    x_{t+1}(i) = (1 - 1/S_t(i)) x_t(i) + (γ/S_t(i)) (‖x_t‖ + ε_t), if η_t = i;    x_{t+1}(i) = x_t(i), otherwise,    (8)

where i ∈ {1, 2, ..., n} can be identified with the state-action pairs, x_t with Δ_t, ε_t with ‖Q̂_t - Q*‖, etc. We analyze this process in two steps. First we consider processes in which the "perturbation term" ε_t is missing. For such processes we have the following lemma:

LEMMA 3.2 Assume that η_1, η_2, ..., η_t, ... are independent random variables with a common underlying distribution P(η_t = i) = p_i > 0. Then the process x_t defined by

    x_{t+1}(i) = (1 - 1/S_t(i)) x_t(i) + (γ/S_t(i)) ‖x_t‖, if η_t = i;    x_{t+1}(i) = x_t(i), otherwise,    (9)

satisfies ‖x_t‖ = O(1/t^{R(1-γ)}) with probability one (w.p.1), where R = min_i p_i / max_i p_i.

Proof. (Outline) Let T_0 = 0 and

    T_{k+1} = min{ t ≥ T_k | for all i = 1, ..., n there is an s = s(i) in [T_k + 1, t] with η_s = i },

i.e. T_{k+1} is the smallest time after T_k such that during the time interval [T_k + 1, T_{k+1}] all the components of x_t(·) are "updated" in Equation (9) at least once. Then

    ‖x_{T_{k+1}+1}‖ ≤ (1 - (1-γ)/S_k) ‖x_{T_k+1}‖,    (10)

where S_k = max_i S_{t_k(i)}(i). This inequality holds because if t_k(i) is the last time in [T_k + 1, T_{k+1}] when the i-th component is updated, then

    x_{T_{k+1}+1}(i) = x_{t_k(i)+1}(i) = (1 - 1/S_{t_k(i)}(i)) x_{t_k(i)}(i) + (γ/S_{t_k(i)}(i)) ‖x_{t_k(i)}‖
                     ≤ (1 - 1/S_{t_k(i)}(i)) ‖x_{t_k(i)}‖ + (γ/S_{t_k(i)}(i)) ‖x_{t_k(i)}‖
                     = (1 - (1-γ)/S_{t_k(i)}(i)) ‖x_{t_k(i)}‖
                     ≤ (1 - (1-γ)/S_k) ‖x_{T_k+1}‖,

where it was exploited that ‖x_t‖ is decreasing. Now, iterating (10) backwards in time yields

    ‖x_{T_{k+1}+1}‖ ≤ ‖x_0‖ ∏_{j=0}^{k} (1 - (1-γ)/S_j).

Now, consider the following approximations: T_k ≈ Ck, where C ≈ 1/p_min (C can be computed explicitly from {p_i}), and S_k ≈ p_max T_{k+1} ≈ (p_max/p_min)(k+1) ≈ (k+1)/R_0, where R_0 = 1/(C p_max). Then, using large-deviations theory,

    ∏_{j=0}^{k-1} (1 - (1-γ)/S_j) ≤ ∏_{j=0}^{k-1} (1 - R_0(1-γ)/(j+1)) ≈ (1/k)^{R_0(1-γ)}    (11)

holds w.p.1. Now, by defining s = T_k + 1, so that s/C ≈ k, we get

    ‖x_s‖ = O(1/s^{R(1-γ)}),

which holds due to the monotonicity of x_t and of 1/k^{R_0(1-γ)}, and because R = p_min/p_max ≤ R_0. □
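The statement of Lemma 3.2 is easy to probe empirically. The following Python sketch is illustrative only (the particular p and gamma are arbitrary choices, not taken from the paper): it simulates the unperturbed process (9) and compares the empirical decay exponent of ‖x_t‖ with the exponent R(1-γ) of the lemma; since the lemma only provides an upper bound, the empirical exponent may well be more negative than -R(1-γ).

import numpy as np

def simulate_unperturbed(p, gamma, n_steps, rng=None):
    """Simulate the unperturbed process (9) and return ||x_t|| (max norm) over time."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(p)
    x = np.ones(n)                       # x_0(i) = 1 for every component
    S = np.zeros(n)                      # visit counts S_t(i)
    norms = np.empty(n_steps)
    for t in range(n_steps):
        i = rng.choice(n, p=p)           # eta_t = i
        S[i] += 1
        x[i] = (1.0 - 1.0 / S[i]) * x[i] + (gamma / S[i]) * x.max()
        norms[t] = x.max()
    return norms

# Heuristic check: fit the log-log slope of ||x_t|| and compare it with -R(1-gamma).
p = np.array([0.1, 0.3, 0.6])
gamma = 0.8
norms = simulate_unperturbed(p, gamma, 200_000)
t = np.arange(1, len(norms) + 1)
slope = np.polyfit(np.log(t[1000:]), np.log(norms[1000:]), 1)[0]
print("empirical exponent:", slope, "upper-bound exponent:", -(p.min() / p.max()) * (1 - gamma))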
Step 3. Assume that γ > 1/2. Fortunately, we know by an extension of the Law of the Iterated Logarithm to stochastic approximation processes that the convergence rate of ‖Q̂_t - Q*‖ is O(√(log log t / t)) (the uniform boundedness of the random reinforcement signals must be exploited in this step) (Major, 1973). Thus it is sufficient to provide a convergence-rate estimate for the perturbed process x_t, defined by (8), when ε_t = C√(log log t / t) for some C > 0. We claim that ε_t converges to zero faster than x_t does. Define the process

    z_{t+1}(i) = (1 - (1-γ)/S_t(i)) z_t(i), if η_t = i;    z_{t+1}(i) = z_t(i), if η_t ≠ i.    (12)

This process clearly lower-bounds the perturbed process x_t. Obviously, the convergence rate of z_t is O(1/t^{1-γ}), which is slower than the convergence rate of ε_t provided that γ > 1/2, proving that ε_t must indeed be faster than x_t. Thus, asymptotically ε_t ≤ (1/γ - 1) x_t, and so ‖x_t‖ is decreasing for large enough t. Then, by an argument similar to that used in the derivation of (10), we get

    x_{T_{k+1}+1}(i) ≤ (1 - (1-γ)/S_k) ‖x_{T_k+1}‖ + (γ/S_k) ε_{T_k},    (13)

where S_k = min_i S_k(i).

Step 4. By approximation arguments similar to those of Step 2, together with the bound

    (1/t^η) ∑_s s^{η-3/2} √(log log s) = O(t^{-1/2} √(log log t)),    1 > η > 0,

which follows from the mean-value theorem for integrals and integration by parts, we get that x_t = O(1/t^{R(1-γ)}). The case γ ≤ 1/2 can be treated similarly.

Step 5. Putting the pieces together and applying them to Δ_t = |Q̂_t - Q_t| yields Theorem 3.1.

4 DISCUSSION AND CONCLUSIONS

The most restrictive of our conditions is the assumption concerning the sampling of (x_t, a_t). However, note that under a fixed learning policy the process (x_t, a_t) is a (non-stationary) Markovian process, and if the learning policy converges in the sense that lim_{t→∞} P(a_t | F_t) = P(a_t | x_t) (here F_t stands for the history of the learning process) then the process (x_t, a_t) becomes eventually stationary Markovian, and the sampling distribution can be replaced by the stationary distribution of the underlying stationary Markovian process. If actions become asymptotically optimal during the course of learning, then the support of this stationary process will exclude the state-action pairs whose action is sub-optimal, i.e. the conditions of Theorem 3.1 will no longer be satisfied. Notice that the proof of convergence of such processes still follows lines very similar to those of the proof presented here (see the forthcoming paper (Singh et al., 1997)), so we expect that the same convergence rates hold and can be proved using nearly identical techniques in this case as well.

A further step would be to find explicit expressions for the constant B of Theorem 3.1. Clearly, B depends heavily on the sampling of (x_t, a_t), as well as on the transition probabilities and rewards of the underlying MDP. Also, the choice of harmonic learning rates is arbitrary. If a general sequence α_t were employed, then the artificial "time" τ_t(x,a) = 1/∏_{s=0}^{t} (1 - α_s(x,a)) should be used (note that for the harmonic sequence τ_t(x,a) ≈ t; a small numerical sketch of this quantity is given below). Note also that although the developed bounds are asymptotic in their present form, the proper use of large-deviations theory would enable us to develop non-asymptotic bounds.

Other possible ways to extend the results of this paper include Q-learning when learning on aggregated states (Singh et al., 1995), Q-learning for alternating/simultaneous Markov games (Littman, 1994; Szepesvári and Littman, 1996), and any other algorithm whose corresponding difference process Δ_t satisfies an inequality similar to (7).

Yet another application of the convergence-rate estimate might be the convergence proof of some average-reward reinforcement learning algorithms. The idea of those algorithms follows from a kind of Tauberian theorem, i.e. that discounted sums converge to the average value as the discount rate converges to one (see e.g. Lemma 1 of (Mahadevan, 1994; Mahadevan, 1996), or, for a value-iteration scheme relying on this idea, (Hordijk and Tijms, 1975)). Using the methods developed here, the proof of convergence of the corresponding Q-learning algorithms seems quite possible. We would like to note here that related results were obtained by Bertsekas and Tsitsiklis (see e.g. (Bertsekas and Tsitsiklis, 1996)).
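Returning to the remark above on general learning-rate sequences, the artificial time τ_t(x,a) is straightforward to track alongside the updates. The Python sketch below is illustrative only; the near-harmonic rates α_k = 1/(k+1) used in the example are an assumption of the sketch, not a prescription of the paper.

import numpy as np

def artificial_time(alphas):
    """Artificial "time" tau = 1 / prod_s (1 - alpha_s) for a single state-action pair.

    alphas -- the learning rates alpha_s(x, a) actually applied to this pair, in order.
    """
    alphas = np.asarray(alphas, dtype=float)
    return 1.0 / np.cumprod(1.0 - alphas)

# With the near-harmonic rates alpha_k = 1/(k+1) the product telescopes, so the
# artificial time after k updates is exactly k + 1, in line with tau_t(x,a) ~ t.
k = np.arange(1, 10)
print(artificial_time(1.0 / (k + 1)))    # [ 2.  3.  4. ...  10.]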
Finally, note that as an application of this result we immediately get that the convergence rate of the model-based RL algorithm, in which the transition probabilities and rewards are estimated by their respective averages, is clearly better than that of Q-learning. Indeed, simple calculations show that the law of the iterated logarithm holds for the learning process underlying model-based RL. Moreover, the exact expression for the convergence rate depends explicitly on how much computational effort we spend on obtaining the next estimate of the optimal value function: the more effort we spend, the faster the convergence. This bound thus provides a direct way to control the trade-off between computational effort and convergence rate.

Acknowledgements

This research was supported by OTKA Grant No. F20132 and by a grant provided by the Hungarian Educational Ministry under contract no. FKFP 1354/1997. I would like to thank András Krámli and Michael L. Littman for numerous helpful and thought-provoking discussions.

References

Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.

Hordijk, A. and Tijms, H. (1975). A modified form of the iterative method of dynamic programming. Annals of Statistics, 3:203-208.

Jaakkola, T., Jordan, M., and Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201.

Littman, M. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157-163, San Francisco, CA. Morgan Kaufmann.

Littman, M. and Szepesvári, C. (1996). A generalized reinforcement learning model: Convergence and applications. In Proceedings of the International Conference on Machine Learning. http://iserv.iki.kfki.hu/asl-publs.html.

Mahadevan, S. (1994). To discount or not to discount in reinforcement learning: A case study comparing R-learning and Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 164-172, San Francisco, CA. Morgan Kaufmann.

Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1,2,3):124-158.

Major, P. (1973). A law of the iterated logarithm for the Robbins-Monro method. Studia Scientiarum Mathematicarum Hungarica, 8:95-102.

Poljak, B. and Tsypkin, Y. (1973). Pseudogradient adaptation and training algorithms. Automation and Remote Control, 12:83-94.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.

Ross, S. (1970). Applied Probability Models with Optimization Applications. Holden Day, San Francisco, California.

Singh, S., Jaakkola, T., and Jordan, M. (1995). Reinforcement learning with soft state aggregation. In Proceedings of Neural Information Processing Systems.

Singh, S., Jaakkola, T., Littman, M., and Szepesvári, C. (1997). On the convergence of single-step on-policy reinforcement-learning algorithms. Machine Learning. In preparation.
Szepesvári, C. and Littman, M. (1996). Generalized Markov Decision Processes: Dynamic programming and reinforcement learning algorithms. Machine Learning. In preparation; available as TR CS96-10, Brown University.

Tsitsiklis, J. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 8(3-4):257-277.

Wasan, T. (1969). Stochastic Approximation. Cambridge University Press, London.

Watkins, C. (1990). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.
", "award": [], "sourceid": 1383, "authors": [{"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}]}