{"title": "Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 996, "page_last": 1002, "abstract": null, "full_text": "Finite-Sample Convergence Rates  for \nQ-Learning and  Indirect  Algorithms \n\nMichael Kearns and Satinder Singh \n\nAT&T  Labs \n\n180  Park Avenue \n\nFlorham Park, NJ  07932 \n\n{mkearns,baveja }@research.att.com \n\nAbstract \n\nIn  this  paper,  we  address  two  issues  of long-standing  interest  in  the  re(cid:173)\ninforcement  learning  literature.  First,  what  kinds  of performance  guar(cid:173)\nantees  can  be  made  for  Q-learning  after  only  a  finite  number  of actions? \nSecond,  what  quantitative  comparisons  can  be  made  between  Q-learning \nand  model-based  (indirect)  approaches,  which  use  experience  to estimate \nnext-state  distributions  for  off-line  value  iteration? \nWe  first  show  that  both  Q-learning  and  the  indirect  approach  enjoy \nrather  rapid  convergence  to  the  optimal  policy  as  a  function  of the num(cid:173)\nber  of  state  transitions  observed. \nIn  particular,  on  the  order  of  only \n(Nlog(1/c)/c2 )(log(N) + loglog(l/c))  transitions  are  sufficient  for  both \nalgorithms  to  come  within  c  of the  optimal  policy,  in  an  idealized  model \nthat  assumes  the  observed  transitions  are  \"well-mixed\"  throughout  an \nN-state  MDP.  Thus,  the  two  approaches  have  roughly  the  same  sample \ncomplexity.  Perhaps  surprisingly,  this  sample  complexity  is  far less  than \nwhat is required for the model-based approach to actually construct a good \napproximation  to  the  next-state  distribution.  The  result  also  shows  that \nthe  amount  of memory required  by  the model-based  approach  is  closer  to \nN  than  to N 2 \u2022 \nFor  either  approach,  to  remove  the  assumption  that  the  observed  tran(cid:173)\nsitions  are  well-mixed,  we  consider  a  model  in  which  the  transitions  are \ndetermined by a fixed,  arbitrary exploration policy.  Bounds on the number \nof transitions  required  in  order  to  achieve  a  desired  level  of performance \nare  then  related  to  the  stationary  distribution  and  mixing  time  of  this \npolicy. \n\n1 \n\nIntroduction \n\nThere  are  at least  two different  approaches  to learning  in  Markov decision  processes: \nindirect approaches,  which  use  control  experience  (observed  transitions  and  payoffs) \nto estimate a  model, and then  apply dynamic programming to compute policies from \nthe  estimated model;  and  direct approaches such  as  Q-Iearning  [2],  which  use  control \n\n\fConvergence Rates for Q-Leaming and Indirect Algorithms \n\n997 \n\nexperience  to directly learn policies  (through value functions)  without ever  explicitly \nestimating a  model.  Both  are  known  to converge  asymptotically to  the  optimal pol(cid:173)\nicy  [1,  3] .  However,  little  is  known  about  the  performance  of these  two  approaches \nafter only a  finite  amount of experience. \n\nA common argument offered  by  proponents of direct  methods is  that it  may require \nmuch more experience  to learn an  accurate model than to simply learn a good policy. \nThis argument is  predicated on the seemingly reasonable assumption that an indirect \nmethod  must  first  learn  an  accurate  model  in  order  to  compute a  good  policy.  
On the other hand, proponents of indirect methods argue that such methods can do unlimited off-line computation on the estimated model, which may give an advantage over direct methods, at least if the model is accurate. Learning a good model may also be useful across tasks, permitting the computation of good policies for multiple reward functions [4]. To date, these arguments have lacked a formal framework for analysis and verification.\n\n
In this paper, we provide such a framework, and use it to derive the first finite-time convergence rates (sample size bounds) for both Q-learning and the standard indirect algorithm. An important aspect of our analysis is that we separate the quality of the policy generating experience from the quality of the two learning algorithms. In addition to demonstrating that both methods enjoy rather rapid convergence to the optimal policy as a function of the amount of control experience, the convergence rates have a number of specific and perhaps surprising implications for the hypothetical differences between the two approaches outlined above. Some of these implications, as well as the rates of convergence we derive, were briefly mentioned in the abstract; in the interests of brevity, we will not repeat them here, but instead proceed directly to the technical material.\n\n
2  MDP Basics\n\n
Let M be an unknown N-state MDP with A actions. We use P_M^a(ij) to denote the probability of going to state j, given that we are in state i and execute action a, and R_M(i) to denote the reward received for executing a from i (which we assume is fixed and bounded between 0 and 1 without loss of generality). A policy π assigns an action to each state. The value of state i under policy π, V_M^π(i), is the expected discounted sum of rewards received upon starting in state i and executing π forever: V_M^π(i) = E_π[r_1 + γ r_2 + γ^2 r_3 + ...], where r_t is the reward received at time step t under a random walk governed by π from start state i, and 0 ≤ γ < 1 is the discount factor. It is also convenient to define values for state-action pairs (i, a): Q_M^π(i, a) = R_M(i) + γ Σ_j P_M^a(ij) V_M^π(j). The goal of learning is to approximate the optimal policy π* that maximizes the value at every state; the optimal value function is denoted Q_M^*. Given Q_M^*, we can compute the optimal policy as π*(i) = argmax_a {Q_M^*(i, a)}.\n\n
If M is given, value iteration can be used to compute a good approximation to the optimal value function. Setting our initial guess as Q_0(i, a) = 0 for all (i, a), we iterate as follows:\n\n
Q_{ℓ+1}(i, a) = R_M(i) + γ Σ_j P_M^a(ij) V_ℓ(j)    (1)\n\n
where we define V_ℓ(j) = max_b {Q_ℓ(j, b)}. It can be shown that after ℓ iterations, max_{(i,a)} {|Q_ℓ(i, a) - Q_M^*(i, a)|} ≤ γ^ℓ. Given any approximation Q̂ to Q_M^*, we can compute the greedy approximation π̂ to the optimal policy π* as π̂(i) = argmax_a {Q̂(i, a)}.\n\n
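For concreteness, the update in Equation (1) can be written as the following minimal Python sketch (ours, not the authors'; the array layout and the function and parameter names are our own assumptions):

import numpy as np

def value_iteration(P, R, gamma, num_iters):
    # Approximate Q_M^* by iterating Equation (1) on a known (or estimated) model.
    # P: array of shape (N, A, N), with P[i, a, j] = P_M^a(ij).
    # R: array of shape (N,), with R[i] = R_M(i).
    N, A, _ = P.shape
    Q = np.zeros((N, A))                   # Q_0(i, a) = 0
    for _ in range(num_iters):
        V = Q.max(axis=1)                  # V_l(j) = max_b Q_l(j, b)
        Q = R[:, None] + gamma * P.dot(V)  # Equation (1)
    return Q

The greedy policy is then read off as Q.argmax(axis=1), mirroring π̂(i) = argmax_a {Q̂(i, a)}.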
3  The Parallel Sampling Model\n\n
In reinforcement learning, the transition probabilities P_M^a(ij) are not given, and a good policy must be learned on the basis of observed experience (transitions) in M. Classical convergence results for algorithms such as Q-learning [1] implicitly assume that the observed experience is generated by an arbitrary \"exploration policy\" π, and then proceed to prove convergence to the optimal policy if π meets certain minimal conditions - namely, π must try every state-action pair infinitely often, with probability 1. This approach conflates two distinct issues: the quality of the exploration policy π, and the quality of reinforcement learning algorithms using experience generated by π. In contrast, we choose to separate these issues. If the exploration policy never or only very rarely visits some state-action pair, we would like to have this reflected as a factor in our bounds that depends only on π; a separate factor depending only on the learning algorithm will in turn reflect how efficiently a particular learning algorithm uses the experience generated by π. Thus, for a fixed π, all learning algorithms are placed on equal footing, and can be directly compared.\n\n
There are probably various ways in which this separation can be accomplished; we now introduce one that is particularly clean and simple. We would like a model of the ideal exploration policy - one that produces experiences that are \"well-mixed\", in the sense that every state-action pair is tried with equal frequency. Thus, let us define a parallel sampling subroutine PS(M) that behaves as follows: a single call to PS(M) returns, for every state-action pair (i, a), a random next state j distributed according to P_M^a(ij). Thus, every state-action pair is executed simultaneously, and the resulting N × A next states are reported. A single call to PS(M) is therefore really simulating N × A transitions in M, and we must be careful to multiply the number of calls to PS(M) by this factor if we wish to count the total number of transitions witnessed.\n\n
What is PS(M) modeling? It is modeling the idealized exploration policy that manages to visit every state-action pair in succession, without duplication, and without fail. It should be intuitively obvious that such an exploration policy would be optimal, from the viewpoint of gathering experience everywhere as rapidly as possible.\n\n
We shall first provide an analysis, in Section 5, of both direct and indirect reinforcement learning algorithms, in a setting in which the observed experience is generated by calls to PS(M). Of course, in any given MDP M, there may not be any exploration policy that meets the ideal captured by PS(M) - for instance, there may simply be some states that are very difficult for any policy to reach, and thus the experience generated by any policy will certainly not be equally mixed around the entire MDP. (Indeed, a call to PS(M) will typically return a set of transitions that does not even correspond to a trajectory in M.) Furthermore, even if PS(M) could be simulated by some exploration policy, we would like to provide more general results that express the amount of experience required for reinforcement learning algorithms under any exploration policy (where the amount of experience will, of course, depend on properties of the exploration policy).\n\n
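As an illustration of the parallel sampling model, a single call to PS(M) can be pictured as the following minimal sketch (our own, continuing the numpy-based sketch above and assuming direct access to the true model purely for simulation; the names are hypothetical):

def parallel_sample(P, rng):
    # One call to PS(M): for every (i, a), draw one next state j according to P_M^a(i, .).
    # P: array of shape (N, A, N) of true transition probabilities.
    # rng: a numpy Generator, e.g. np.random.default_rng().
    # Returns an (N, A) integer array of sampled next states.
    N, A, _ = P.shape
    next_states = np.empty((N, A), dtype=int)
    for i in range(N):
        for a in range(A):
            next_states[i, a] = rng.choice(N, p=P[i, a])
    return next_states

Each call accounts for N × A transitions of M; no real exploration policy gathers experience this uniformly, which is why the construction only serves as an idealized reference point.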
Thus, in Section 6, we sketch how one can bound the amount of experience required under any π in order to simulate calls to PS(M). (More detail will be provided in a longer version of this paper.) The bound depends on natural properties of π, such as its stationary distribution and mixing time. Combined with the results of Section 5, we get the desired two-factor bounds discussed above: for both the direct and indirect approaches, a bound on the total number of transitions required, consisting of one factor that depends only on the algorithm, and another factor that depends only on the exploration policy.\n\n
4  The Learning Algorithms\n\n
We now explicitly state the two reinforcement learning algorithms we shall analyze and compare. In keeping with the separation between algorithms and exploration policies already discussed, we will phrase these algorithms in the parallel sampling framework, and Section 6 indicates how they generalize to the case of arbitrary exploration policies. We begin with the direct approach.\n\n
Rather than directly studying standard Q-learning, we will here instead examine a variant that is slightly easier to analyze, called phased Q-learning. However, we emphasize that all of our results can be generalized to apply to standard Q-learning (with learning rate α(i, a) = 1/t(i, a), where t(i, a) is the number of trials of (i, a) so far). Basically, rather than updating the value function with every observed transition from (i, a), phased Q-learning estimates the expected value of the next state from (i, a) on the basis of many transitions, and only then makes an update. The memory requirements for phased Q-learning are essentially the same as those for standard Q-learning.\n\n
Direct Algorithm - Phased Q-Learning: As suggested by the name, the algorithm operates in phases. In each phase, the algorithm makes m_D calls to PS(M) (where m_D will be determined by the analysis), thus gathering m_D trials of every state-action pair (i, a). At the ℓ-th phase, the algorithm updates the estimated value function as follows: for every (i, a),\n\n
Q_{ℓ+1}(i, a) = R_M(i) + γ (1/m_D) Σ_{k=1}^{m_D} V_ℓ(j_k^ℓ)    (2)\n\n
where j_1^ℓ, ..., j_{m_D}^ℓ are the m_D next states observed from (i, a) on the m_D calls to PS(M) during the ℓ-th phase. The policy computed by the algorithm is then the greedy policy determined by the final value function. Note that phased Q-learning is quite like standard Q-learning, except that we gather statistics (the summation in Equation (2)) before making an update.\n\n
We now proceed to describe the standard indirect approach.\n\n
Indirect Algorithm: The algorithm first makes m_I calls to PS(M) to obtain m_I next-state samples for each (i, a). It then builds an empirical model of the transition probabilities as follows: P̂_M^a(ij) = #(i →_a j)/m_I, where #(i →_a j) is the number of times state j was reached on the m_I trials of (i, a). The algorithm then does value iteration (as described in Section 2) on the fixed model P̂_M^a(ij) for ℓ_I iterations. Again, the policy computed by the algorithm is the greedy policy dictated by the final value function.\n\n
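To make both algorithms concrete, here is a minimal sketch (our illustration, not the authors' code), reusing the value_iteration and parallel_sample routines sketched earlier; the parameter names and phase counts stand in for the quantities m_D, ℓ_D, m_I, ℓ_I determined by the analysis below:

def phased_q_learning(P, R, gamma, num_phases, m_D, rng):
    # Direct algorithm: phased Q-learning driven by calls to PS(M).
    N, A, _ = P.shape
    Q = np.zeros((N, A))
    for _ in range(num_phases):
        V = Q.max(axis=1)                       # value estimate frozen for this phase
        avg_next_V = np.zeros((N, A))
        for _ in range(m_D):                    # m_D calls to PS(M) per phase
            next_states = parallel_sample(P, rng)
            avg_next_V += V[next_states] / m_D  # running average of V_l(j_k)
        Q = R[:, None] + gamma * avg_next_V     # Equation (2)
    return Q

def indirect_algorithm(P, R, gamma, num_iters, m_I, rng):
    # Indirect algorithm: estimate the model from m_I calls to PS(M), then value-iterate on it.
    N, A, _ = P.shape
    counts = np.zeros((N, A, N))
    for _ in range(m_I):
        next_states = parallel_sample(P, rng)
        for i in range(N):
            for a in range(A):
                counts[i, a, next_states[i, a]] += 1
    P_hat = counts / m_I                        # empirical next-state distributions
    return value_iteration(P_hat, R, gamma, num_iters)

In either case the returned Q is turned into the greedy policy Q.argmax(axis=1). The sparsity observation in Section 5 below is what keeps the memory of the indirect approach closer to N than to N^2, despite the counts array nominally having N × A × N entries.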
Thus, in phased Q-learning, the algorithm runs for some number ℓ_D of phases, and each phase requires m_D calls to PS(M), for a total number of transitions ℓ_D × m_D × N × A. The indirect algorithm first makes m_I calls to PS(M), and then runs ℓ_I iterations of value iteration (which requires no additional data), for a total number of transitions m_I × N × A. The question we now address is: how large must m_D, m_I, ℓ_D, ℓ_I be so that, with probability at least 1 - δ, the resulting policies have expected return within ε of the optimal policy in M? The answers we give yield perhaps surprisingly similar bounds on the total number of transitions required for the two approaches in the parallel sampling model.\n\n
5  Bounds on the Number of Transitions\n\n
We now state our main result.\n\n
Theorem 1  For any MDP M:\n\n
\u2022 For an appropriate choice of the parameters m_I and ℓ_I, the total number of calls to PS(M) required by the indirect algorithm in order to ensure that, with probability at least 1 - δ, the expected return of the resulting policy will be within ε of the optimal policy, is\n\n
O((1/ε^2)(log(N/δ) + log log(1/ε))).    (3)\n\n
\u2022 For an appropriate choice of the parameters m_D and ℓ_D, the total number of calls to PS(M) required by phased Q-learning in order to ensure that, with probability at least 1 - δ, the expected return of the resulting policy will be within ε of the optimal policy, is\n\n
O((log(1/ε)/ε^2)(log(N/δ) + log log(1/ε))).    (4)\n\n
The bound for phased Q-learning is thus only O(log(1/ε)) larger than that for the indirect algorithm. Bounds on the total number of transitions witnessed in either case are obtained by multiplying the given bounds by N × A.\n\n
Before sketching some of the ideas behind the proof of this result, we first discuss some of its implications for the debate on direct versus indirect approaches. First of all, for both approaches, convergence is rather fast: with a total number of transitions only on the order of N log(N) (fixing ε and δ for simplicity), near-optimal policies are obtained. This represents a considerable advance over the classical asymptotic results: instead of saying that an infinite number of visits to every state-action pair are required to converge to the optimal policy, we are claiming that a rather small number of visits are required to get close to the optimal policy. Second, by our analysis, the two approaches have similar complexities, with the number of transitions required differing by only a log(1/ε) factor in favor of the indirect algorithm. Third - and perhaps surprisingly - note that since only O(log(N)) calls are being made to PS(M) (again fixing ε and δ), and since the number of trials per state-action pair is exactly the number of calls to PS(M), the total number of non-zero entries in the model P̂_M^a(ij) built by the indirect approach is in fact only O(log(N)). In other words, P̂_M^a(ij) will be extremely sparse - and thus, a terrible approximation to the
true transition probabilities - yet still good enough to derive a near-optimal policy! Clever representation of P̂_M^a(ij) will thus result in total memory requirements that are only O(N log(N)) rather than O(N^2). Fourth, although we do not have space to provide any details, if instead of a single reward function we are provided with L reward functions (where the L reward functions are given in advance of observing any experience), then for both algorithms, the number of transitions required to compute near-optimal policies for all L reward functions simultaneously is only a factor of O(log(L)) greater than the bounds given above.\n\n
Our own view of the result and its implications is:\n\n
\u2022 Both algorithms enjoy rapid convergence to the optimal policy as a function of the amount of experience.\n\n
\u2022 In general, neither approach enjoys a significant advantage in convergence rate, memory requirements, or handling multiple reward functions. Both are quite efficient on all counts.\n\n
We do not have space to provide a detailed proof of Theorem 1, but instead provide some highlights of the main ideas. The proofs for both the indirect algorithm and phased Q-learning are actually quite similar, and have at their heart two slightly different uniform convergence lemmas. For phased Q-learning, it is possible to show that, for any bound ℓ_D on the number of phases to be executed, and for any τ > 0, we can choose m_D so that\n\n
|(1/m_D) Σ_{k=1}^{m_D} V̂_ℓ(j_k^ℓ) - Σ_j P_M^a(ij) V̂_ℓ(j)| < τ    (5)\n\n
will hold simultaneously for every (i, a) and for every phase ℓ = 1, ..., ℓ_D. In other words, at the end of every phase, the empirical estimate of the expected next-state value for every (i, a) will be close to the true expectation, where here the expectation is with respect to the current estimated value function V̂_ℓ maintained by phased Q-learning.\n\n
For the indirect algorithm, a slightly more subtle uniform convergence argument is required. Here we show that it is possible to choose, for any bound ℓ_I on the number of iterations of value iteration to be executed on the P̂_M^a(ij), and for any τ > 0, a value m_I such that\n\n
|Σ_j P̂_M^a(ij) V_ℓ(j) - Σ_j P_M^a(ij) V_ℓ(j)| < τ    (6)\n\n
for every (i, a) and every phase ℓ = 1, ..., ℓ_I, where the V_ℓ(j) are the value functions resulting from performing true value iteration (that is, on the P_M^a(ij)). Equation (6) essentially says that expectations of the true value functions are quite similar under either the true or estimated model, even though the indirect algorithm never has access to the true value functions.\n\n
In either case, the uniform convergence results allow us to argue that the corresponding algorithms still achieve successive contractions, as in the classical proof of value iteration. For instance, in the case of phased Q-learning, if we define Δ_ℓ = max_{(i,a)} {|Q̂_ℓ(i, a) - Q_ℓ(i, a)|}, where Q̂_ℓ is the value function maintained by phased Q-learning and Q_ℓ is the ℓ-th iterate of exact value iteration, we can derive a recurrence relation for Δ_{ℓ+1} as follows:\n\n
Δ_{ℓ+1} = max_{(i,a)} {|γ (1/m_D) Σ_{k=1}^{m_D} V̂_ℓ(j_k^ℓ) - γ Σ_j P_M^a(ij) V_ℓ(j)|}    (7)\n
        ≤ γ max_{(i,a)} {(Σ_j P_M^a(ij) V̂_ℓ(j) + τ) - Σ_j P_M^a(ij) V_ℓ(j)}    (8)\n
        ≤ γτ + γΔ_ℓ.    (9)\n\n
Here we have made use of Equation (5). Since Δ_0 = 0 (Q̂_0 = Q_0), this recurrence gives Δ_ℓ ≤ τ(γ/(1 - γ)) for any ℓ.
From this it is not hard to show that for any (i, a),\n\n
|Q̂_ℓ(i, a) - Q_M^*(i, a)| ≤ τ(γ/(1 - γ)) + γ^ℓ.    (10)\n\n
From this it can be shown that the regret in expected return suffered by the policy computed by phased Q-learning after ℓ phases is at most (τγ/(1 - γ) + γ^ℓ)(2/(1 - γ)). The proof proceeds by setting this regret smaller than the desired ε, solving for ℓ and τ, and obtaining the resulting bound on m_D. The derivation of bounds for the indirect algorithm is similar.\n\n
6  Handling General Exploration Policies\n\n
As promised, we conclude our technical results by briefly sketching how we can translate the bounds obtained in Section 5 under the idealized parallel sampling model into bounds applicable when any fixed policy π is guiding the exploration. Such bounds must, of course, depend on properties of π. Due to space limitations, we can only outline the main ideas; the formal statements and proofs are deferred to a longer version of the paper.\n\n
Let us assume for simplicity that π (which may be a stochastic policy) defines an ergodic Markov process in the MDP M. Thus, π induces a unique stationary distribution P_{M,π}(i, a) over state-action pairs - intuitively, P_{M,π}(i, a) is the frequency of executing action a from state i during an infinite random walk in M according to π. Furthermore, we can introduce the standard notion of the mixing time of π to its stationary distribution - informally, this is the number T_π of steps required such that the distribution induced on state-action pairs by T_π-step walks according to π will be \"very close\" to P_{M,π}.¹ Finally, let us define p_π = min_{(i,a)} {P_{M,π}(i, a)}.\n\n
Armed with these notions, it is not difficult to show that the number of steps we must take under π in order to simulate, with high probability, a call to the oracle PS(M), is polynomial in the quantity T_π/p_π. The intuition is straightforward: at most every T_π steps, we obtain an \"almost independent\" draw from P_{M,π}(i, a); and with each independent draw, we have at least probability p_π of drawing any particular (i, a) pair. Once we have sampled every (i, a) pair, we have simulated a call to PS(M). The formalization of these intuitions leads to a version of Theorem 1 applicable to any π, in which the bound is multiplied by a factor polynomial in T_π/p_π, as desired.\n\n
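The simulation argument can be pictured with the following minimal sketch (our illustration only; the formal statement is deferred by the authors to a longer version of the paper). It walks in M under π, treats draws spaced T_π steps apart as roughly independent, and stops once every state-action pair has been observed, at which point one call to PS(M) has been simulated:

def simulate_parallel_sample(P, pi, T_pi, rng, start_state=0):
    # Simulate one call to PS(M) by walking under a fixed exploration policy pi.
    # P:    (N, A, N) true transition probabilities (the walk happens inside M).
    # pi:   (N, A) stochastic exploration policy; pi[i, a] = Pr(a | i).
    # T_pi: number of steps treated as enough for an 'almost independent' draw.
    # Returns an (N, A) array of sampled next states and the number of steps taken.
    N, A, _ = P.shape
    samples = np.full((N, A), -1, dtype=int)
    state, steps = start_state, 0
    while (samples < 0).any():
        for _ in range(T_pi):               # mix for T_pi steps before the next draw
            action = rng.choice(A, p=pi[state])
            state = rng.choice(N, p=P[state, action])
            steps += 1
        action = rng.choice(A, p=pi[state])  # record the next (i, a, j) triple observed
        next_state = rng.choice(N, p=P[state, action])
        steps += 1
        if samples[state, action] < 0:
            samples[state, action] = next_state
        state = next_state
    return samples, steps

If π never executes some (i, a), the loop above never terminates; this is exactly the situation the sub-MDP construction discussed next is designed to handle.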
However, a better result is possible. In cases where p_π may be small or even 0 (which would occur when π simply does not ever execute some action from some state), the factor T_π/p_π is large or infinite and our bounds become weak or vacuous. In such cases, it is better to define the sub-MDP M_π(α), which is obtained from M by simply deleting any (i, a) for which P_{M,π}(i, a) < α, where α > 0 is a parameter of our choosing. In M_π(α), p_π ≥ α by construction, and we may now obtain convergence rates to the optimal policy in M_π(α) for both Q-learning and the indirect approach like those given in Theorem 1, multiplied by a factor polynomial in T_π/α. (Technically, we must slightly alter the algorithms to have an initial phase that detects and eliminates small-probability state-action pairs, but this is a minor detail.) By allowing α to become smaller as the amount of experience we receive from π grows, we can obtain an \"anytime\" result, since the sub-MDP M_π(α) approaches the full MDP M as α → 0.\n\n
References\n\n
[1] T. Jaakkola, M. I. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201, 1994.\n\n
[2] C. J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989.\n\n
[3] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.\n\n
[4] S. Mahadevan. Enhancing transfer in reinforcement learning by building stochastic models of robot actions. In Machine Learning: Proceedings of the Ninth International Conference, 1992.\n\n
¹ Formally, the degree of closeness is measured by the distance between the transient and stationary distributions. For brevity here we will simply assume this parameter is set to a very small, constant value.\n", "award": [], "sourceid": 1531, "authors": [{"given_name": "Michael", "family_name": "Kearns", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}