{"title": "Convergence of Optimistic and Incremental Q-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1499, "page_last": 1506, "abstract": "", "full_text": "Convergence of Optimistic and Incremental Q-Learning

Eyal Even-Dar*    Yishay Mansour†

Abstract

We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial values sufficiently large guarantees convergence to an ε-optimal policy. The second is a new algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of the ε-greedy Q-learning.

1 Introduction

One of the challenges of Reinforcement Learning is learning in an unknown environment. The environment is modeled by an MDP, and we can only observe the trajectory of states, actions and rewards generated by the agent wandering in the MDP. There are two basic conceptual approaches to the learning problem. The first is model based, where we first reconstruct a model of the MDP, and then find an optimal policy for the approximate model. Recently, polynomial time algorithms have been developed for this approach, initially in [7] and later extended in [3]. The second approach consists of direct methods, which update their estimated policy after each step. The most popular of the direct methods is Q-learning [13].

Q-learning uses the observed information to approximate the optimal value function, from which one can construct an optimal policy.
There are various proofs that Q-learning converges, in the limit, to the optimal value function, under very mild conditions [1, 11, 12, 8, 6, 2]. In a recent result the convergence rates of Q-learning are computed, and an interesting dependence on the learning rates is exhibited [4].

Q-learning is an off-policy algorithm that can be run on top of any strategy. Although it is an off-policy algorithm, in many cases its estimated value function is used to guide the selection of actions. Being always greedy with respect to the value function may result in poor performance, due to the lack of exploration, and often randomization is used to guarantee proper exploration.

We show the convergence of two deterministic strategies. The first is optimistic Q-learning, which initializes the estimates to large values and then follows a greedy policy. Optimistic Q-learning is widely used in applications and has been recognized as having good convergence in practice [10].

*School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel. evend@cs.tau.ac.il
†School of Computer Science, Tel-Aviv University, Israel. mansour@cs.tau.ac.il

We prove that optimistic Q-learning, with the right setting of initial values, converges to a near optimal policy. This is not the first theoretical result showing that optimism helps in reinforcement learning; however, previous results were concerned with model based methods [7, 3]. We show the convergence of the widely used optimistic Q-learning, thus explaining and supporting the results observed in practice.

Our second result is a new deterministic algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken.
We show that the frequency of sub-optimal actions vanishes, in the limit, and that the strategy defined by incremental Q-learning converges, in the limit, to the optimal policy (rather than a near optimal policy). Another view of incremental Q-learning is as a derandomization of the ε-greedy Q-learning. The ε-greedy Q-learning performs a sub-optimal action every 1/ε times in expectation, while incremental Q-learning performs a sub-optimal action every (Q(s, a(s)) − Q(s, b))/ε times. Furthermore, by taking appropriate values it can be made similar to the Boltzmann machine.

2 The Model

We define a Markov Decision process (MDP) as follows.

Definition 2.1 A Markov Decision process (MDP) M is a 4-tuple (S, A, P, R), where S is a set of states, A is a set of actions, P^M_{ij}(a) is the transition probability from state i to state j when performing action a ∈ A in state i, and R_M(s, a) is the reward received when performing action a in state s.

A strategy for an MDP assigns, at each time t, for each state s a probability for performing action a ∈ A, given a history F_{t−1} = {s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}, r_{t−1}} which includes the states, actions and rewards observed until time t − 1. While executing a strategy π we perform at time t action a_t in state s_t and observe a reward r_t (distributed according to R_M(s, a)), and the next state s_{t+1}, distributed according to P^M_{s_t, s_{t+1}}(a_t). We combine the sequence of rewards into a single value called the return, and our goal is to maximize the return. In this work we focus on the discounted return, which has a parameter γ ∈ (0, 1); the discounted return of policy π is V^π_M = Σ_{t=0}^∞ γ^t r_t, where r_t is the reward observed at time t. We assume that R_M(s, a) is non-negative and bounded by R_max, i.e., ∀s, a: 0 ≤ R_M(s, a) ≤ R_max.
This implies that the discounted return is bounded by V_max = R_max/(1 − γ).

We define a value function for each state s, under policy π, as V^π_M(s) = E[Σ_{i=0}^∞ γ^i r_i], where the expectation is over a run of policy π starting at state s, and a state-action value function Q^π_M(s, a) = E[R_M(s, a)] + γ Σ_{s'} P^M_{s s'}(a) V^π_M(s').

Let π* be an optimal policy which maximizes the return from any start state. This implies that for any policy π and any state s we have V^{π*}_M(s) ≥ V^π_M(s), and π*(s) = argmax_a (E[R_M(s, a)] + γ Σ_{s'} P^M_{s s'}(a) V*(s')). We use V*_M and Q*_M for V^{π*}_M and Q^{π*}_M, respectively. We say that a policy π is ε-optimal if ‖V*_M − V^π_M‖_∞ ≤ ε.

Given a trajectory, let T_{s,a} be the set of times in which we perform action a in state s, T_s = ∪_a T_{s,a} be the times when state s is visited, T_{s,not(a)} = T_s \ T_{s,a} be the set of times where in state s an action a' ≠ a is performed, and T_{not(s)} = ∪_{s' ≠ s} T_{s'} be the set of times in which a state s' ≠ s is visited. Also, [#(s, a, t)] is the number of times action a is performed in state s up to time t, i.e., |T_{s,a} ∩ [1, t]|. Finally, throughout the paper we assume that the MDP is a uni-chain (see [9]), namely that from every state we can reach any other state.

3 Q-Learning

The Q-Learning algorithm [13] estimates the state-action value function (for discounted return) as follows:

Q_{t+1}(s, a) = (1 − α_t(s, a))Q_t(s, a) + α_t(s, a)(r_t(s, a) + γV_t(s')),

where s' is the state reached from state s when performing action a at time t, and V_t(s) = max_a Q_t(s, a). We assume that α_t(s', a') = 0 for t ∉ T_{s',a'}. A learning rate α_t is well-behaved if for every state-action pair (s, a): (1) Σ_{t=1}^∞ α_t(s, a) = ∞ and (2) Σ_{t=1}^∞ α_t^2(s, a) < ∞.
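To make the update rule concrete, here is a minimal tabular sketch in Python. The dictionary layout, the helper names, and the particular choice α_t = 1/[#(s, a, t)] are illustrative assumptions, not part of the paper:

```python
def make_q(states, actions, init=0.0):
    # Tabular state-action value function: Q[s][a]
    return {s: {a: init for a in actions} for s in states}

def q_learning_update(Q, counts, s, a, r, s_next, gamma):
    """One Q-learning step: Q(s,a) <- (1-alpha)Q(s,a) + alpha(r + gamma*V(s')).

    Uses the learning rate alpha_t = 1/[#(s,a,t)], which is well-behaved:
    sum_t alpha_t = infinity and sum_t alpha_t^2 < infinity.
    """
    counts[(s, a)] = counts.get((s, a), 0) + 1
    alpha = 1.0 / counts[(s, a)]
    v_next = max(Q[s_next].values())   # V_t(s') = max_a Q_t(s', a)
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * v_next)
```

For example, the first update of a pair (s, a) uses α = 1 and simply overwrites the initial value with r + γV_t(s').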
If the learning rate is well-behaved and every state-action pair is performed infinitely often, then Q-Learning converges to Q* with probability 1 (see [1, 11, 12, 8, 6]).

The convergence of Q-learning holds using any exploration policy, and only requires that each state-action pair is executed infinitely often. The greedy policy with respect to the Q-values tries to exploit continuously; however, since it does not explore properly, it might result in poor return. At the other extreme, the random policy continuously explores, but its actual return may be very poor. An interesting compromise between the two extremes is the ε-greedy policy, which is widely used in practice [10]. This policy executes the greedy policy with probability 1 − ε and the random policy with probability ε. This balance between exploration and exploitation both guarantees convergence and often yields good performance.

Common to many of the exploration techniques is the use of randomization, which is also a very natural choice. In this work we explore strategies which perform exploration but avoid randomization and use deterministic strategies.

4 Optimistic Q-Learning

Optimistic Q-learning is a simple greedy algorithm with respect to the Q-values, where the initial Q-values are set to large values, larger than their optimal values. We show that optimistic Q-learning converges to an ε-optimal policy if the initial Q-values are set sufficiently large. Let β_T = Π_{i=1}^T (1 − α_i). We set the initial conditions of the Q-values as follows:

∀s, a: Q_0(s, a) = (1/β_T) V_max,

where T = T(ε, δ, S, A, α) will be specified later. Let η_{i,T} = α_i Π_{j=i+1}^T (1 − α_j) = α_i β_T / β_i.
Note that

Q_{t+1}(s, a) = (1 − α_t)Q_t(s, a) + α_t(r_t + γV_t(s')) = β_τ Q_0(s, a) + Σ_{i=1}^τ η_{i,τ} r_i(s, a) + γ Σ_{i=1}^τ η_{i,τ} V_{t_i}(s_i),

where τ = [#(s, a, t)] and s_i is the next state arrived at time t_i, when action a is performed for the i-th time in state s. First we show that as long as τ = [#(s, a, t)] ≤ T, i.e., action a has been performed at most T times in state s, we have Q_t(s, a) ≥ V_max. Later we will use this to show that action a is performed at least T times in state s.

Lemma 4.1 In optimistic Q-learning, for any state s, action a and time t such that τ = [#(s, a, t)] ≤ T, we have Q_t(s, a) ≥ V_max ≥ Q*(s, a).

Lemma 4.1 follows from the following observation:

Q_t(s, a) = β_τ Q_0(s, a) + Σ_{i=1}^τ η_{i,τ} r_i(s, a) + γ Σ_{i=1}^τ η_{i,τ} V_{t_i}(s_i) ≥ β_τ Q_0(s, a) = (β_τ/β_T) V_max ≥ V_max ≥ V*(s).

Now we bound T as a function of the algorithm parameters (i.e., ε, δ, |S|, |A|) and the learning rate. We need to set T large enough to guarantee that, with probability 1 − δ, after any t > T updates using the given learning rate, the deviation from the true value is at most ε. Formally, given a sequence X_t of i.i.d. random variables with zero mean and bounded by V_max, and a learning rate α_t = (1/[#(s, a, t)])^ω, let Z_{t+1} = (1 − α_t)Z_t + α_t X_t. A time T(ε, δ) is an initialization time if Pr[∀t ≥ T : Z_t ≤ ε] ≥ 1 − δ. The following lemma bounds the initialization time as a function of the parameter ω of the learning rate.

Lemma 4.2 The initialization time for X and α is at most T(ε, δ) = c((V_max^2/ε^2)(ln(1/δ) + ln(V_max/ε)))^{1/ω}, for some constant c.

We define a modified process, in which we update using the optimal value function rather than our current estimate. Starting from Q̂_0 = Q_0, for t ≥ 1 we have

Q̂_{t+1}(s, a) = (1 − α_t(s, a))Q̂_t(s, a) + α_t(s, a)(r_t(s, a) + γV*(s')),

where s' is the next state. The following lemma bounds the difference between Q* and Q̂_t.

Lemma 4.3 Consider optimistic Q-learning and let T = T(ε, δ) be the initialization time.
Then, with probability 1 − δ, for any t > T we have Q*(s, a) − Q̂_t(s, a) ≤ ε.

Proof: Let τ = [#(s, a, t)]. By definition we have

Q̂_t(s, a) = β_τ Q_0(s, a) + Σ_{i=1}^τ η_{i,τ} r_i + γ Σ_{i=1}^τ η_{i,τ} V*(s_i).

This implies that

Q*(s, a) − Q̂_t(s, a) = −β_τ Q_0(s, a) + error_r[s, a, t] + error_v[s, a, t],

where error_r[s, a, t] = E[R(s, a)] − Σ_{i=1}^τ η_{i,τ} r_i and error_v[s, a, t] = γ(E[V*(s')|s, a] − Σ_{i=1}^τ η_{i,τ} V*(s_i)). We bound both error_r[s, a, t] and error_v[s, a, t] using Lemma 4.2. Therefore, with probability 1 − δ, we have Q*(s, a) − Q̂_t(s, a) ≤ ε for any t ≥ T. Q.E.D.

Next we bound the difference between our estimate V_t(s) and V*(s).

Lemma 4.4 Consider optimistic Q-learning and let T = T((1 − γ)ε, δ/(|S||A|)) be the initialization time. With probability at least 1 − δ, for any state s and time t, we have V*(s) − V_t(s) ≤ ε.

Proof: By Lemma 4.3 we have that, with probability 1 − δ, for every state s, action a and time t we have Q*(s, a) − Q̂_t(s, a) ≤ (1 − γ)ε. We show by induction on t that V*(s) − V_t(s) ≤ ε for every state s. For t = 0 we have V_0(s) ≥ V_max, and hence the claim holds. For the inductive step, assume the claim holds up to time t; we show that it holds for time t + 1. Let (s, a) be the state-action pair executed at time t + 1. If [#(s, a, t + 1)] ≤ T then by Lemma 4.1, V_t(s) ≥ V_max ≥ V*(s), and the induction claim holds. Otherwise, let a* be the optimal action at state s; then

V*(s) − V_{t+1}(s) ≤ Q*(s, a*) − Q_{t+1}(s, a*)
 = Q*(s, a*) − Q̂_{t+1}(s, a*) + Q̂_{t+1}(s, a*) − Q_{t+1}(s, a*)
 ≤ (1 − γ)ε + γ Σ_{i=1}^τ η_{i,τ}(V*(s_i) − V_{t_i}(s_i)),

where τ = [#(s, a, t)], t_i is the time at which action a is performed in state s for the i-th time, and s_i is the corresponding next state.
Since t_i ≤ t, by the inductive hypothesis we have V*(s_i) − V_{t_i}(s_i) ≤ ε, and therefore

V*(s) − V_{t+1}(s) ≤ (1 − γ)ε + γε = ε. Q.E.D.

Lemma 4.5 Consider optimistic Q-learning and let T = T((1 − γ)ε, δ/(|S||A|)) be the initialization time. With probability at least 1 − δ, any state-action pair (s, a) that is executed infinitely often is ε-optimal, i.e., V*(s) − Q*(s, a) ≤ ε.

Proof: Given a trajectory, let U' be the set of state-action pairs that are executed infinitely often, and let M' be the original MDP M restricted to U'. For M' we can use the classical convergence proofs and claim that V_t(s) converges to V*_{M'}(s) and that Q_t(s, a), for (s, a) ∈ U', converges to Q*_{M'}(s, a), both with probability 1. Since (s, a) ∈ U' is performed infinitely often, Q_t(s, a) converges to V_t(s) = V*_{M'}(s), and therefore Q*_{M'}(s, a) = V*_{M'}(s). By Lemma 4.4, with probability 1 − δ we have V*_M(s) − V_t(s) ≤ ε; therefore V*_M(s) − Q*_M(s, a) ≤ V*_M(s) − Q*_{M'}(s, a) ≤ ε. Q.E.D.

A simple corollary is that if we set ε small enough, e.g., ε < min_{(s,a)} {V*(s) − Q*(s, a) | V*(s) ≠ Q*(s, a)}, then optimistic Q-learning converges to the optimal policy. Another simple corollary is the following theorem.

Theorem 4.6 Consider optimistic Q-learning and let T = T((1 − γ)ε, δ/(|S||A|)) be the initialization time. For any constant Δ, with probability at least 1 − δ there is a time T_Δ > T such that at any time t > T_Δ the strategy defined by optimistic Q-learning is (ε + Δ)/(1 − γ)-optimal.

5 Incremental Q-Learning

In this section we describe a new algorithm that we call incremental Q-learning. The main idea of the algorithm is to achieve a deterministic tradeoff between exploration and exploitation.

Incremental Q-learning is a greedy policy with respect to the estimated Q-values plus a promotion term.
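A minimal sketch of such a greedy-plus-promotion rule in Python; the data layout, the `since` counters, and the default choice ψ(i) = 1/i are illustrative assumptions (the formal definition of the promotion term is given in the text):

```python
def select_action(Q, Delta, s):
    # Greedy with respect to S_t(s, a) = Q_t(s, a) + Delta_t(s, a)
    return max(Q[s], key=lambda a: Q[s][a] + Delta[s][a])

def promote(Delta, since, s, a_taken, psi=lambda i: 1.0 / i):
    """After executing a_taken in state s: zero its promotion term, and promote
    every other action of s by psi(k), where k counts the visits to s since that
    action was last executed. Actions of other states are left unchanged."""
    for a in Delta[s]:
        if a == a_taken:
            Delta[s][a] = 0.0
            since[s][a] = 0
        else:
            since[s][a] += 1
            Delta[s][a] += psi(since[s][a])
```

Even if an action starts with a lower Q-value, its promotion term grows while other actions are repeatedly taken, so it is eventually selected, deterministically.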
The promotion term of a state-action pair (s, a) is promoted each time action a is not executed in state s, and zeroed each time action a is executed. We show that in incremental Q-learning every state-action pair is taken infinitely often, which implies standard convergence of the estimates. We also show that the fraction of time in which sub-optimal actions are executed vanishes in the limit. This implies that the strategy defined by incremental Q-learning converges, in the limit, to the optimal policy. Incremental Q-learning estimates the Q-function as in Q-learning:

Q_{t+1}(s, a) = (1 − α_t(s, a))Q_t(s, a) + α_t(s, a)(r_t(s, a) + γV_t(s')),

where s' is the next state reached when performing action a in state s at time t. The promotion term Δ_t is defined as follows:

Δ_{t+1}(s, a) = 0,                            t ∈ T_{s,a}
Δ_{t+1}(s, a) = Δ_t(s, a) + ψ([#(s, a, t)]),  t ∈ T_{s,not(a)}
Δ_{t+1}(s, a) = Δ_t(s, a),                    t ∈ T_{not(s)}

where ψ(i) is a promotion function, which in our case depends only on the number of times we performed (s, a'), a' ≠ a, since the last time we performed (s, a). We say that a promotion function ψ is well-behaved if: (1) ψ converges to zero, i.e., lim_{i→∞} ψ(i) = 0, and (2) ψ(1) = 1 and ψ(k) > ψ(k + 1) > 0. For example, ψ(i) = 1/i is a well-behaved promotion function.

Incremental Q-learning is a greedy policy with respect to S_t(s, a) = Q_t(s, a) + Δ_t(s, a). First we show that Q_t, in incremental Q-learning, converges to Q*.

Lemma 5.1 Consider incremental Q-learning using a well-behaved learning rate and a well-behaved promotion function. Then Q_t converges to Q* with probability 1.

Proof: Since the learning rate is well-behaved, we only need to show that each state-action pair is performed infinitely often. We show that for each state that is visited infinitely often, all of its actions are performed infinitely often.
Since the MDP is uni-chain, this implies that with probability 1 we reach all states infinitely often, which completes the proof.

Assume that state s is visited infinitely often. Then there has to be a non-empty subset A' of the actions which are performed infinitely often in s. The proof is by contradiction; namely, assume that A' ≠ A. Let t_1 be the last time that an action not in A' is performed in state s. For a ∉ A', the count [#(s, a, t)] is fixed for all t > t_1, so each promotion adds the same positive constant ψ([#(s, a, t_1)]); this implies that Δ_t(s, a) diverges for a ∉ A'. Therefore, eventually we reach a time t_2 > t_1 such that Δ_{t_2}(s, a) > V_max for every a ∉ A'. Since the actions in A' are performed infinitely often, there is a time t_3 > t_2 such that each action a' ∈ A' is performed at least once in [t_2, t_3]. This implies that Δ_{t_3}(s, a) > V_max + Δ_{t_3}(s, a') for any a' ∈ A' and a ∉ A'. Therefore, some action a ∈ A \ A' will be performed after t_1, contradicting our assumption. Q.E.D.

The following lemma shows that the frequency of sub-optimal actions vanishes.

Lemma 5.2 Consider incremental Q-learning using a well-behaved learning rate and a well-behaved promotion function. Let f_t(s, a) = |T_{s,a}|/|T_s| and let (s, a) be any sub-optimal state-action pair. Then lim_{t→∞} f_t(s, a) = 0, with probability 1.

The intuition behind Lemma 5.2 is the following. Let a* be an optimal action in state s and let a be a sub-optimal action. By Lemma 5.1, with probability 1 both

[Figure 1 appears here: learning curves of the two algorithms over the number of steps (×10^3).]

Figure 1: Example of a 50-state MDP, where the discount factor γ is 0.9. The learning rate of both incremental and ε-greedy Q-learning is set to 0.8.
The dashed line represents the ε-greedy Q-learning.

Q_t(s, a*) converges to Q*(s, a*) = V*(s) and Q_t(s, a) converges to Q*(s, a). This implies, intuitively, that Δ_t(s, a) has to be at least V*(s) − Q*(s, a) = h > 0 for (s, a) to be executed. Since the promotion function is well-behaved, the number of time steps required until Δ_t(s, a) grows from 0 to h increases after each time we perform (s, a). Since the inter-time between executions of (s, a) diverges, the frequency f_t(s, a) vanishes. The following corollary gives a quantitative bound.

Corollary 5.3 Consider incremental Q-learning with learning rate α_t(s, a) = 1/[#(s, a, t)] and ψ(k) = 1/e^k. Let (s, a) be a sub-optimal state-action pair. The number of times (s, a) is performed in the first n visits to state s is Θ(ln(n/(V*(s) − Q*(s, a)))), for sufficiently large n.

Furthermore, the return obtained by incremental Q-learning converges to the optimal return.

Corollary 5.4 Consider incremental Q-learning using a well-behaved learning rate and a well-behaved promotion function. For every ε there exists a time T_ε such that for any t > T_ε the strategy π defined by incremental Q-learning is ε-optimal, with probability 1.

6 Experiments

In this section we show some experimental results comparing incremental Q-learning and ε-greedy Q-learning. One can consider incremental Q-learning as a derandomization of ε_t-greedy Q-learning, where the promotion function satisfies ψ_t = ε_t.

The experiment was made on an MDP which includes 50 states and two actions per state. The immediate reward of each state-action pair is randomly chosen in the interval [0, 10]. For each state-action pair (s, a) the next-state transition is random, i.e., for every state s' we have a random variable X^a_{s,s'} ∈ [0, 1] and P^a_{s,s'} = X^a_{s,s'} / Σ_{s''} X^a_{s,s''}.
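The random MDP construction above can be sketched in Python. The function name is ours, and the explicit normalization of the transition probabilities is our reading of the (partly garbled) formula in the original:

```python
import random

def random_mdp(n_states=50, n_actions=2, r_max=10.0, seed=0):
    """Rewards uniform in [0, r_max]; for each (s, a), draw X^a_{s,s'} in [0, 1]
    and normalize so that P^a_{s, .} is a probability distribution over next states."""
    rng = random.Random(seed)
    R = [[rng.uniform(0.0, r_max) for _ in range(n_actions)]
         for _ in range(n_states)]
    P = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            x = [rng.random() for _ in range(n_states)]
            z = sum(x)
            row.append([xi / z for xi in x])  # P^a_{s,s'} = X^a_{s,s'} / sum_{s''} X^a_{s,s''}
        P.append(row)
    return R, P
```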
For the ε_t-greedy Q-learning we have ε_t = 10000/t at time t, while for the incremental Q-learning we have ψ_t = 10000/t. Each result in the experiment is an average of ten different runs. In Figure 1, we observe similar behavior of the two algorithms. This experiment demonstrates the strong experimental connection between these methods. We plan to further investigate the theoretical connection between ε-greedy, the Boltzmann machine and incremental Q-learning.

7 Acknowledgements

This research was supported in part by a grant from the Israel Science Foundation.

References

[1] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

[2] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control, 38(2):447-469, 2000.

[3] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. In IJCAI, 2001.

[4] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. In COLT, 2001.

[5] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. Progress in Statistics, pages 241-266, 1974.

[6] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6, 1994.

[7] M. Kearns and S. Singh. Efficient reinforcement learning: theoretical framework and algorithms. In ICML, 1998.

[8] M. Littman and Cs. Szepesvari. A generalized reinforcement learning model: convergence and applications. In ICML, 1996.

[9] M. L. Puterman. Markov Decision Processes - Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

[10] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.

[11] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185-202, 1994.

[12] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3/4):279-292, 1992.

[13] C. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.
", "award": [], "sourceid": 1944, "authors": [{"given_name": "Eyal", "family_name": "Even-dar", "institution": null}, {"given_name": "Yishay", "family_name": "Mansour", "institution": null}]}