{"title": "Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4704, "page_last": 4713, "abstract": "We study two time-scale linear stochastic approximation algorithms, which can be used to model well-known reinforcement learning algorithms such as GTD, GTD2, and TDC. We present finite-time performance bounds for the case where the learning rate is fixed. The key idea in obtaining these bounds is to use a Lyapunov function motivated by singular perturbation theory for linear differential equations. We use the bound to design an adaptive learning rate scheme which significantly improves the convergence rate over the known optimal polynomial decay rule in our experiments, and can be used to potentially improve the performance of any other schedule where the learning rate is changed at pre-determined time instants.", "full_text": "Finite-Time Performance Bounds and Adaptive\n\nLearning Rate Selection for Two Time-Scale\n\nReinforcement Learning\n\nHarsh Gupta\nECE and CSL\n\nR. Srikant\nECE and CSL\n\nUniversity of Illinois at Urbana-Champaign\n\nUniversity of Illinois at Urbana-Champaign\n\nhgupta10@illinois.edu\n\nrsrikant@illinois.edu\n\nLei Ying\nEECS\n\nUniversity of Michigan, Ann Arbor\n\nleiying@umich.edu\n\nAbstract\n\nWe study two time-scale linear stochastic approximation algorithms, which can\nbe used to model well-known reinforcement learning algorithms such as GTD,\nGTD2, and TDC. We present \ufb01nite-time performance bounds for the case where the\nlearning rate is \ufb01xed. 
The key idea in obtaining these bounds is to use a Lyapunov function motivated by singular perturbation theory for linear differential equations. We use the bound to design an adaptive learning rate scheme which significantly improves the convergence rate over the known optimal polynomial decay rule in our experiments, and can be used to potentially improve the performance of any other schedule where the learning rate is changed at pre-determined time instants.\n\n1 Introduction\n\nA key component of reinforcement learning algorithms is to learn or approximate value functions under a given policy [Sutton, 1988], [Bertsekas and Tsitsiklis, 1996], [Szepesvári, 2010], [Bertsekas, 2011], [Bhatnagar et al., 2012], [Sutton and Barto, 2018]. Many existing algorithms for learning value functions are variants of the temporal-difference (TD) learning algorithms [Sutton, 1988], [Tsitsiklis and Van Roy, 1997], and can be viewed as stochastic approximation algorithms for minimizing the Bellman error (or objectives related to the Bellman error). Characterizing the convergence of these algorithms, such as TD(0), TD(λ), GTD, and nonlinear GTD, has been an important objective of reinforcement learning [Szepesvári, 2010], [Bhatnagar et al., 2009], and [Sutton et al., 2016]. The asymptotic convergence of these algorithms with diminishing step sizes has been established using stochastic approximation theory in many prior works (comprehensive surveys on stochastic approximations can be found in [Benveniste et al., 2012], [Kushner and Yin, 2003], and [Borkar, 2009]).\nThe conditions required for theoretically establishing asymptotic convergence in an algorithm with diminishing step sizes imply that the learning rate becomes very small very quickly. As a result, the algorithm will require a very large number of samples to converge. 
Reinforcement learning algorithms used in practice follow a pre-determined learning rate (step-size) schedule which, in most cases, uses decaying step sizes first and then a fixed step size. This gap between theory and practice has prompted a sequence of works on finite-time performance of temporal difference learning algorithms with either time-varying step sizes or constant step sizes [Dalal et al., 2017a,b, Liu et al., 2018, Lakshminarayanan and Szepesvari, 2018, Bhandari et al., 2018, Srikant and Ying, 2019]. Most of these results are for single time-scale TD algorithms, except [Dalal et al., 2017b] which considers two time-scale algorithms with decaying step sizes. Two time-scale TD algorithms are an important class of reinforcement learning algorithms because they can improve the convergence rate of TD learning or remedy the instability of single time-scale TD in some cases. This paper focuses on two time-scale linear stochastic approximation algorithms with constant step sizes. The model includes TDC, GTD and GTD2 as special cases (see [Sutton et al., 2008], [Sutton et al., 2009] and [Szepesvári, 2010] for more details). We note that, in contemporaneous work, [Xu et al., 2019] have carried out a two-time-scale analysis of linear stochastic approximation with diminishing step sizes.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nBesides the theoretical analysis of finite-time performance of two time-scale reinforcement learning algorithms, another important aspect of reinforcement learning algorithms, which is imperative in practice but has been largely overlooked, is the design of the learning rate schedule, i.e., how to choose proper step-sizes to improve the learning accuracy and reduce the learning time. 
This paper addresses\nthis important question by developing principled heuristics based on the \ufb01nite-time performance\nbounds.\nThe main contributions of this paper are summarized below.\n\u2022 Finite Time Performance Bounds: We study two time-scale linear stochastic approximation\nalgorithms, driven by Markovian samples. We establish \ufb01nite time bounds on the mean-square\nerror with respect to the \ufb01xed point of the corresponding ordinary differential equations (ODEs).\nThe performance bound consists of two parts: a steady-state error and a transient error, where the\nsteady-state error is determined by the step sizes but independent of the number of samples (or\nnumber of iterations), and the transient error depends on both step sizes and the number of samples.\nThe transient error decays geometrically as the number of samples increases. The key differences\nbetween this paper and [Dalal et al., 2017b] include (i) we do not require a sparse projection step\nin the algorithm; and (ii) we assume constant step-sizes which allows us to develop the adaptive\nstep-size selection heuristic mentioned next.\n\n\u2022 Adaptive Learning Rate Selection: Based on the \ufb01nite-time performance bounds, in particular,\nthe steady-state error and the transient error terms in the bounds, we propose an adaptive learning\nrate selection scheme. The intuition is to use a constant learning rate until the transient error is\ndominated by the steady-state error; after that, running the algorithm further with the same learning\nrate is not very useful and therefore, we reduce the learning rate at this time. To apply adaptive\nlearning rate selection in a model-free fashion, we develop data-driven heuristics to determine the\ntime at which the transient error is close to the steady-state error. 
A useful property of our adaptive rate selection scheme is that it can be used with any learning rate schedule which already exists in many machine learning software platforms: one can start with the initial learning rate suggested by such schedules and get improved performance by using our adaptive scheme. Our experiments on Mountain Car and Inverted Pendulum show that our adaptive learning rate selection significantly improves the convergence rates as compared to optimal polynomial decay learning rate strategies (see [Dalal et al., 2017b] and [Konda et al., 2004] for more details on polynomial decay step-size rules).\n\n2 Model, Notation and Assumptions\n\nWe consider the following two time-scale linear stochastic approximation algorithm:\n\n$U_{k+1} = U_k + \epsilon^{\alpha}\left(A_{uu}(X_k)U_k + A_{uv}(X_k)V_k + b_u(X_k)\right)$\n$V_{k+1} = V_k + \epsilon^{\beta}\left(A_{vu}(X_k)U_k + A_{vv}(X_k)V_k + b_v(X_k)\right),$ (1)\n\nwhere $\{X_k\}$ are the samples from a Markov process. We assume $\beta < \alpha$ so that, over $\epsilon^{-\beta}$ iterations, the change in $V$ is $O(1)$ while the change in $U$ is $O(\epsilon^{\alpha-\beta})$. Therefore, $V$ is updated at a faster time scale than $U$.\nIn the context of reinforcement learning, when combined with linear function approximation of the value function, GTD, GTD2, and TDC can be viewed as two time-scale linear stochastic approximation algorithms, and can be described in the same form as (1). 
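Before specializing to reinforcement learning, iteration (1) can be sketched numerically in a few lines. This is a minimal illustration only: the scalar coefficients below are made-up stand-ins for $A_{uu}, A_{uv}, A_{vu}, A_{vv}$, and the samples are taken i.i.d. rather than Markovian for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative scalar stand-ins for the matrices in (1).
A_uu, A_uv, A_vu, A_vv = -1.0, 0.5, 0.3, -2.0

def sample():
    """Noisy observations standing in for A(X_k) and b_v(X_k).

    i.i.d. zero-mean noise, so the steady-state limits of Assumption 1
    are the bar quantities above with b-bar = 0."""
    n = rng.normal(scale=0.1, size=5)
    return A_uu + n[0], A_uv + n[1], A_vu + n[2], A_vv + n[3], n[4]

eps, alpha, beta = 0.1, 1.0, 0.5     # beta < alpha, so V is the fast iterate
U, V = 1.0, 1.0
for k in range(20000):
    auu, auv, avu, avv, noise = sample()
    U += eps**alpha * (auu * U + auv * V)           # slow update, b_u = 0
    V += eps**beta * (avu * U + avv * V + noise)    # fast update, noise plays b_v

print(abs(U), abs(V))   # both iterates settle near the fixed point 0
```

With these (assumed) coefficients the associated ODEs are stable, so both iterates converge to a neighborhood of 0 whose size is set by the constant step sizes, matching the steady-state error term discussed in Section 3.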
For example, GTD2 with linear function approximation is as follows:\n\n$U_{k+1} = U_k + \epsilon^{\alpha}\left(\phi(X_k) - \zeta\phi(X_{k+1})\right)\phi^{\top}(X_k)V_k$\n$V_{k+1} = V_k + \epsilon^{\beta}\left(\delta_k - \phi^{\top}(X_k)V_k\right)\phi(X_k),$\n\nwhere $\zeta$ is the discount factor, $\phi(x)$ is the feature vector of state $x$, $U_k$ is the weight vector such that $\phi^{\top}(x)U_k$ is the approximation of the value function of state $x$ at iteration $k$, $\delta_k = c(X_k) + \zeta\phi^{\top}(X_{k+1})U_k - \phi^{\top}(X_k)U_k$ is the TD error, and $V_k$ is the weight vector that estimates $E[\phi(X_k)\phi^{\top}(X_k)]^{-1}E[\delta_k\phi(X_k)]$.\nWe now summarize the notation we use throughout the paper and the assumptions we make.\n• Assumption 1: $\{X_k\}$ is a Markov chain with state space S. We assume that the following two limits exist:\n\n$\begin{pmatrix} \bar{A}_{uu} & \bar{A}_{uv} \\ \bar{A}_{vu} & \bar{A}_{vv} \end{pmatrix} = \lim_{k\to\infty}\begin{pmatrix} E[A_{uu}(X_k)] & E[A_{uv}(X_k)] \\ E[A_{vu}(X_k)] & E[A_{vv}(X_k)] \end{pmatrix}$\n$\left(\bar{b}_u \;\; \bar{b}_v\right) = \lim_{k\to\infty}\left(E[b_u(X_k)] \;\; E[b_v(X_k)]\right) = 0.$\n\nNote that, without loss of generality, we assume $\bar{b} = 0$, which allows the fixed point of the associated ODEs to be 0. This can be guaranteed by appropriate centering. We define\n\n$B(X_k) = A_{uu}(X_k) - A_{uv}(X_k)\bar{A}_{vv}^{-1}\bar{A}_{vu}$\n$\tilde{B}(X_k) = A_{vu}(X_k) - A_{vv}(X_k)\bar{A}_{vv}^{-1}\bar{A}_{vu}$\n$\bar{B} = \bar{A}_{uu} - \bar{A}_{uv}\bar{A}_{vv}^{-1}\bar{A}_{vu}$\n$\bar{\tilde{B}} = \bar{A}_{vu} - \bar{A}_{vv}\bar{A}_{vv}^{-1}\bar{A}_{vu}.$\n\n• Assumption 2: We assume that $\max\{\|b_u(x)\|, \|b_v(x)\|\} \le b_{\max} < \infty$ for any $x \in S$. We also assume that $\max\{\|B(x)\|, \|\tilde{B}(x)\|, \|A_{uu}(x)\|, \|A_{vu}(x)\|, \|A_{uv}(x)\|, \|A_{vv}(x)\|\} \le 1$ for any $x \in S$. 
Note that these assumptions imply that the steady-state limits of the random matrices/vectors will also satisfy the same inequalities.\n• Assumption 3: We assume $\bar{A}_{vv}$ and $\bar{B}$ are Hurwitz and $\bar{A}_{vv}$ is invertible. Let $P_u$ and $P_v$ be the solutions to the following Lyapunov equations:\n\n$-I = \bar{B}^{\top}P_u + P_u\bar{B}$\n$-I = \bar{A}_{vv}^{\top}P_v + P_v\bar{A}_{vv}.$\n\nSince both $\bar{A}_{vv}$ and $\bar{B}$ are Hurwitz, $P_u$ and $P_v$ are real positive definite matrices.\n• Assumption 4: Define $\tau_\Delta \ge 1$ to be the mixing time of the Markov chain $\{X_k\}$. We assume\n\n$\|E[b_k \mid X_0 = i]\| \le \Delta, \;\forall i, \;\forall k \ge \tau_\Delta$\n$\|\bar{B} - E[B(X_k) \mid X_0 = i]\| \le \Delta, \;\forall i, \;\forall k \ge \tau_\Delta$\n$\|\bar{\tilde{B}} - E[\tilde{B}(X_k) \mid X_0 = i]\| \le \Delta, \;\forall i, \;\forall k \ge \tau_\Delta$\n$\|\bar{A}_{uv} - E[A_{uv}(X_k) \mid X_0 = i]\| \le \Delta, \;\forall i, \;\forall k \ge \tau_\Delta$\n$\|\bar{A}_{vv} - E[A_{vv}(X_k) \mid X_0 = i]\| \le \Delta, \;\forall i, \;\forall k \ge \tau_\Delta.$\n\n• Assumption 5: As in [Srikant and Ying, 2019], we assume that there exists $K \ge 1$ such that $\tau_\Delta \le K\log(\frac{1}{\Delta})$. For convenience, we choose $\Delta = 2\epsilon^{\alpha}\left(1 + \|\bar{A}_{vv}^{-1}\bar{A}_{vu}\| + \epsilon^{\beta-\alpha}\right)$ and drop the subscript from $\tau_\Delta$, i.e., $\tau_\Delta = \tau$. Also, for convenience, we assume that $\epsilon$ is small enough such that $\tilde{\epsilon}\tau \le \frac{1}{4}$, where $\tilde{\epsilon} = \Delta = 2\epsilon^{\alpha}\left(1 + \|\bar{A}_{vv}^{-1}\bar{A}_{vu}\| + \epsilon^{\beta-\alpha}\right)$.\n\nWe further define the following notation:\n• Define matrix\n\n$P = \begin{pmatrix} \frac{\xi_v}{\xi_u+\xi_v}P_u & 0 \\ 0 & \frac{\xi_u}{\xi_u+\xi_v}P_v \end{pmatrix},$ (2)\n\nwhere $\xi_u = 2\|P_u\bar{A}_{uv}\|$ and $\xi_v = 2\left\|P_v\bar{A}_{vv}^{-1}\bar{A}_{vu}\bar{B}\right\|$.\n• Let $\gamma_{\max}$ and $\gamma_{\min}$ denote the largest and smallest eigenvalues of $P_u$ and $P_v$, respectively. So $\gamma_{\max}$ and $\gamma_{\min}$ are also upper and lower bounds on the eigenvalues of P.\n\n3 Finite-Time Performance Bounds\n\nTo establish the finite-time performance guarantees of the two time-scale linear stochastic approximation algorithm (1), we define\n\n$Z_k = V_k + \bar{A}_{vv}^{-1}\bar{A}_{vu}U_k \quad \text{and} \quad \Theta_k = \begin{pmatrix} U_k \\ Z_k \end{pmatrix}.$\n\nThen we consider the following Lyapunov function:\n\n$W(\Theta_k) = \Theta_k^{\top}P\Theta_k,$ (3)\n\nwhere P is a symmetric positive definite matrix defined in (2) (P is positive definite because both $P_u$ and $P_v$ are positive definite matrices). The reason to introduce $Z_k$ will become clear when we introduce the key idea of our analysis based on singular perturbation theory.\nThe following lemma bounds the expected change in the Lyapunov function in one time step.\nLemma 1. 
For any $k \ge \tau$ and $\epsilon$, $\alpha$, and $\beta$ such that $\eta_1\tilde{\epsilon}\tau + \frac{2\tilde{\epsilon}^2}{\epsilon^{\alpha}}\gamma_{\max} \le \frac{\kappa_1}{2}$, the following inequality holds:\n\n$E[W(\Theta_{k+1}) - W(\Theta_k)] \le -\frac{\epsilon^{\alpha}}{\gamma_{\max}}\left(\frac{\kappa_1}{2} - \kappa_2\epsilon^{\alpha-\beta}\right)E[W(\Theta_k)] + \epsilon^{2\beta}\tau\eta_2,$\n\nwhere $\tilde{\epsilon} = 2\epsilon^{\alpha}\left(1 + \|\bar{A}_{vv}^{-1}\bar{A}_{vu}\| + \epsilon^{\beta-\alpha}\right)$, and $\eta_1$, $\eta_2$, $\kappa_1$, and $\kappa_2$ are constants independent of $\epsilon$.\n\nThe proof of Lemma 1 is somewhat involved, and is provided in the supplementary material. The definitions of $\eta_1$, $\eta_2$, $\kappa_1$ and $\kappa_2$ can be found in the supplementary material as well. Here, we provide some intuition behind the result by studying a related ordinary differential equation (ODE). In particular, consider the expected change in the stochastic system divided by the slow time-scale step size $\epsilon^{\alpha}$:\n\n$\frac{E[U_{k+1} - U_k \mid U_{k-\tau}=u, V_{k-\tau}=v, X_{k-\tau}=x]}{\epsilon^{\alpha}} = E\left[\left.A_{uu}(X_k)U_k + A_{uv}(X_k)V_k + b_u(X_k)\,\right|\, U_{k-\tau}=u, V_{k-\tau}=v, X_{k-\tau}=x\right]$\n$\epsilon^{\alpha-\beta}\,\frac{E[V_{k+1} - V_k \mid U_{k-\tau}=u, V_{k-\tau}=v, X_{k-\tau}=x]}{\epsilon^{\alpha}} = E\left[\left.A_{vu}(X_k)U_k + A_{vv}(X_k)V_k + b_v(X_k)\,\right|\, U_{k-\tau}=u, V_{k-\tau}=v, X_{k-\tau}=x\right],$ (4)\n\nwhere the expectation is conditioned sufficiently in the past in terms of the underlying Markov chain (i.e. 
conditioned on the state at time $k - \tau$ instead of $k$) so the expectation is approximately in steady-state.\nApproximating the left-hand side by derivatives and the right-hand side using steady-state expectations, we get the following ODEs:\n\n$\dot{u} = \bar{A}_{uu}u + \bar{A}_{uv}v$ (5)\n$\epsilon^{\alpha-\beta}\dot{v} = \bar{A}_{vu}u + \bar{A}_{vv}v.$ (6)\n\nNote that, in the limit as $\epsilon \to 0$, the second of the above two ODEs becomes an algebraic equation, instead of a differential equation. In the control theory literature, such systems are called singularly-perturbed differential equations; see, for example, [Kokotovic et al., 1999]. In [Khalil, 2002, Chapter 11], the following Lyapunov function has been suggested to study the stability of such singularly perturbed ODEs:\n\n$W(u, v) = d\,u^{\top}P_u u + (1-d)\left(v + \bar{A}_{vv}^{-1}\bar{A}_{vu}u\right)^{\top}P_v\left(v + \bar{A}_{vv}^{-1}\bar{A}_{vu}u\right),$ (7)\n\nfor $d \in [0, 1]$. The function W mentioned earlier in (3) is the same as above for a carefully chosen d. The rationale behind the use of the Lyapunov function (7) is presented in the appendix.\nThe intuition behind the result in Lemma 1 can be understood by studying the dynamics of the above Lyapunov function in the ODE setting. 
To simplify the notation, we define $z = v + \bar{A}_{vv}^{-1}\bar{A}_{vu}u$, so the Lyapunov function can also be written as\n\n$W(u, z) = d\,u^{\top}P_u u + (1-d)z^{\top}P_v z,$ (8)\n\nand adapting the manipulations for nonlinear ODEs in [Khalil, 2002, Chapter 11] to our linear model, we get\n\n$\dot{W} = 2d\,u^{\top}P_u\dot{u} + 2(1-d)z^{\top}P_v\dot{z} \le -\left(\|u\| \;\; \|z\|\right)\tilde{\Psi}\begin{pmatrix}\|u\|\\ \|z\|\end{pmatrix},$ (9)\n\nwhere\n\n$\tilde{\Psi} = \begin{pmatrix} d & -d\gamma_{\max} - (1-d)\gamma_{\max}\sigma_{\min} \\ -d\gamma_{\max} - (1-d)\gamma_{\max}\sigma_{\min} & \frac{1-d}{2\epsilon^{\alpha-\beta}} - (1-d)\gamma_{\max}\sigma_{\min} \end{pmatrix}.$ (10)\n\nNote that $\tilde{\Psi}$ is positive definite when\n\n$d\left(\frac{1-d}{2\epsilon^{\alpha-\beta}} - (1-d)\gamma_{\max}\sigma_{\min}\right) \ge \left(d\gamma_{\max} + (1-d)\gamma_{\max}\sigma_{\min}\right)^2,$ (11)\n\ni.e., when\n\n$\epsilon^{\alpha-\beta} \le \frac{d(1-d)}{2d(1-d)\gamma_{\max}\sigma_{\min} + \left(d\gamma_{\max} + (1-d)\gamma_{\max}\sigma_{\min}\right)^2}.$ (12)\n\nLet $\tilde{\lambda}_{\min}$ denote the smallest eigenvalue of $\tilde{\Psi}$. We have\n\n$\dot{W} \le -\tilde{\lambda}_{\min}\left(\|u\|^2 + \|z\|^2\right)$ (13)\n$\le -\frac{\tilde{\lambda}_{\min}}{\gamma_{\max}}W.$ (14)\n\nIn particular, recall that we obtained the ODEs by dividing by the step-size $\epsilon^{\alpha}$. Therefore, for the discrete equations, we would expect\n\n$E[W(\Theta_{k+1}) - W(\Theta_k)] \lesssim -\epsilon^{\alpha}\frac{\tilde{\lambda}_{\min}}{\gamma_{\max}}E[W(\Theta_k)],$ (15)\n\nwhich resembles the transient term of the upper bound in Lemma 1. 
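This ODE argument is easy to check numerically. The sketch below is a scalar toy example with made-up coefficients (they are not taken from the paper): it solves the scalar Lyapunov equations for $P_u$ and $P_v$, Euler-integrates the ODEs (5)-(6), and verifies that W decreases along the trajectory.

```python
# Scalar sanity check of the singular-perturbation Lyapunov argument.
# All coefficients are illustrative stand-ins, not values from the paper.
A_uu, A_uv, A_vu, A_vv = -1.0, 0.5, 0.3, -2.0
eps_ab = 0.1                        # plays the role of eps^(alpha - beta)

B_bar = A_uu - A_uv * (1.0 / A_vv) * A_vu   # scalar version of B-bar
P_u = -1.0 / (2.0 * B_bar)          # solves -1 = 2 * P_u * B_bar
P_v = -1.0 / (2.0 * A_vv)           # solves -1 = 2 * P_v * A_vv
d = 0.5                             # weight d in (8), chosen arbitrarily

def W(u, v):
    z = v + (1.0 / A_vv) * A_vu * u          # z = v + Avv^{-1} Avu u
    return d * P_u * u * u + (1 - d) * P_v * z * z

u, v, h = 1.0, 1.0, 1e-3
W_vals = [W(u, v)]
for _ in range(5000):               # Euler-integrate the ODEs (5)-(6)
    du = A_uu * u + A_uv * v
    dv = (A_vu * u + A_vv * v) / eps_ab      # fast time scale
    u, v = u + h * du, v + h * dv
    W_vals.append(W(u, v))

print(W_vals[0], W_vals[-1])        # W shrinks monotonically along the path
```

For these coefficients the matrix $\tilde{\Psi}$ in (10) is positive definite, so W decreases at every step, which is exactly the geometric contraction that (15) transfers to the discrete stochastic iteration.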
The exact expression in the discrete, stochastic case is of course different and additionally includes a steady-state term, which is not captured by the ODE analysis above.\nNow, we are ready to state the main theorem.\nTheorem 1. For any $k \ge \tau$ and $\epsilon$, $\alpha$ and $\beta$ such that $\eta_1\tilde{\epsilon}\tau + \frac{2\tilde{\epsilon}^2}{\epsilon^{\alpha}}\gamma_{\max} \le \frac{\kappa_1}{2}$, we have\n\n$E[\|\Theta_k\|^2] \le \frac{\gamma_{\max}}{\gamma_{\min}}\left(1 - \frac{\epsilon^{\alpha}}{\gamma_{\max}}\left(\frac{\kappa_1}{2} - \kappa_2\epsilon^{\alpha-\beta}\right)\right)^{k-\tau}\left(1.5\|\Theta_0\| + 0.5 b_{\max}\right)^2 + \epsilon^{2\beta-\alpha}\,\frac{\gamma_{\max}}{\gamma_{\min}}\,\frac{\eta_2\tau}{\left(\frac{\kappa_1}{2} - \kappa_2\epsilon^{\alpha-\beta}\right)}.$\n\nProof. Applying Lemma 1 recursively, we obtain\n\n$E[W(\Theta_k)] \le u^{k-\tau}E[W(\Theta_\tau)] + v\,\frac{1 - u^{k-\tau}}{1 - u} \le u^{k-\tau}E[W(\Theta_\tau)] + v\,\frac{1}{1 - u},$ (16)\n\nwhere $u = 1 - \frac{\epsilon^{\alpha}}{\gamma_{\max}}\left(\frac{\kappa_1}{2} - \kappa_2\epsilon^{\alpha-\beta}\right)$ and $v = \eta_2\tau\epsilon^{2\beta}$. Also, we have that\n\n$E[\|\Theta_k\|^2] \le \frac{1}{\gamma_{\min}}E[W(\Theta_k)] \le \frac{1}{\gamma_{\min}}u^{k-\tau}E[W(\Theta_\tau)] + \frac{v}{\gamma_{\min}(1 - u)}.$ (17)\n\nFurthermore,\n\n$E[W(\Theta_\tau)] \le \gamma_{\max}E[\|\Theta_\tau\|^2] \le \gamma_{\max}E[(\|\Theta_\tau - \Theta_0\| + \|\Theta_0\|)^2] \le \gamma_{\max}\left((1 + 2\tilde{\epsilon}\tau)\|\Theta_0\| + 2\tilde{\epsilon}\tau b_{\max}\right)^2.$ (18)\n\nThe theorem then holds using the fact that $\tilde{\epsilon}\tau \le \frac{1}{4}$.\n\nTheorem 1 essentially states that the expected error for a two-time scale linear stochastic approximation algorithm comprises two terms: a transient error term which decays geometrically with time and a steady-state error term which is directly proportional to 
$\epsilon^{2\beta-\alpha}$ and the mixing time. This characterization of the finite-time error is useful in understanding the impact of different algorithmic and problem parameters on the rate of convergence, allowing the design of efficient techniques such as the adaptive learning rate rule which we will present in the next section.\n\n4 Adaptive Selection of Learning Rates\n\nEquipped with the theoretical results from the previous section, one interesting question that arises is the following: given a time-scale ratio $\lambda = \frac{\alpha}{\beta}$, can we use the finite-time performance bound to design a rule for adapting the learning rate to optimize performance?\nIn order to simplify the discussion, let $\epsilon^{\beta} = \mu$ and $\epsilon^{\alpha} = \mu^{\lambda}$. Therefore, Theorem 1 can be simplified and written as\n\n$E[\|\Theta_k\|^2] \le K_1\left(1 - \mu^{\lambda}\left(\frac{\kappa_1}{2\gamma_{\max}} - \frac{\kappa_2}{\gamma_{\max}}\mu^{\lambda-1}\right)\right)^k + \frac{\mu^{2-\lambda}K_2}{\left(\frac{\kappa_1}{2} - \kappa_2\mu^{\lambda-1}\right)},$ (19)\n\nwhere $K_1$ and $K_2$ are problem-dependent positive constants. Since we want the system to be stable, we will assume that $\mu$ is small enough such that $\frac{\kappa_1}{2\gamma_{\max}} - \frac{\kappa_2}{\gamma_{\max}}\mu^{\lambda-1} = c > 0$. Plugging this condition in (19), we get\n\n$E[\|\Theta_k\|^2] \le K_1\left(1 - c\mu^{\lambda}\right)^k + \frac{K_2\mu^{2-\lambda}}{\gamma_{\max}c}.$ (20)\n\nIn order to optimize performance for a given number of samples, we would like to choose the learning rate $\mu$ as a function of the time step. In principle, one can assume time-varying learning rates, derive more general mean-squared error expressions (similar to Theorem 1), and then try to optimize over the learning rates to minimize the error for a given number of samples. 
However, this optimization problem is computationally intractable. We note that even if we assume that we are only going to change the learning rate a finite number of times, the resulting optimization problem of finding the times at which such changes are performed and finding the learning rate at these change points is an equally intractable optimization problem. Therefore, we have to devise simpler adaptive learning rate rules.\nTo motivate our learning rate rule, we first consider a time T such that the errors due to the transient and steady-state parts in (20) are equal, i.e.,\n\n$K_1\left(1 - c\mu^{\lambda}\right)^T = \frac{K_2\mu^{2-\lambda}}{\gamma_{\max}c}.$ (21)\n\nFigure 1: The evolution of $\|\Theta_k - \Theta_0\|$.\n\nFrom this time onwards, running the two time-scale stochastic approximation algorithm any further with $\mu$ as the learning rate is not going to significantly improve the mean-squared error. In particular, the mean-squared error beyond this time is upper bounded by twice the steady-state error $\frac{K_2\mu^{2-\lambda}}{\gamma_{\max}c}$. Thus, at time T, it makes sense to reset $\mu$ as $\mu \leftarrow \mu/\xi$, where $\xi > 1$ is a hyperparameter. Roughly speaking, T is the time at which one is close to steady-state for a given learning rate, and therefore, it is the time to reduce the learning rate to get to a new \"steady-state\" with a smaller error.\nThe key difficulty in implementing the above idea is that it is difficult to determine T. For ease of exposition, we considered a system centered around 0 in our analysis (i.e., $\Theta^* = 0$). More generally, the results presented in Theorem 1 and (19) - (20) will have $\Theta_k$ replaced by $\Theta_k - \Theta^*$. In any practical application, $\Theta^*$ will be unknown. 
Thus, we cannot determine $\|\Theta_k - \Theta^*\|$ as a function of k and hence, it is difficult to use this approach.\nOur idea to overcome this difficulty is to estimate whether the algorithm is close to its steady-state by observing $\|\Theta_k - \Theta_0\|$, where $\Theta_0$ is our initial guess for the unknown parameter vector and is thus known to us. Note that $\|\Theta_k - \Theta_0\|$ is zero at $k = 0$ and will increase (with some fluctuations due to randomness) to $\|\Theta^* - \Theta_0\|$ in steady-state (see Figure 1 for an illustration). Roughly speaking, we approximate the curve in this figure by a sequence of straight lines, i.e., perform a piecewise linear approximation, and conclude that the system has reached steady-state when the lines become approximately horizontal. We provide the details next.\nTo derive a test to estimate whether $\|\Theta_k - \Theta_0\|$ has reached steady-state, we first note the following inequality for $k \ge T$ (i.e., after the steady-state time defined in (21)):\n\n$E[\|\Theta_0 - \Theta^*\|] - E[\|\Theta_k - \Theta^*\|] \le E[\|\Theta_k - \Theta_0\|] \le E[\|\Theta_k - \Theta^*\|] + E[\|\Theta_0 - \Theta^*\|]$\n$\Rightarrow d - \sqrt{\frac{2K_2\mu^{2-\lambda}}{\gamma_{\max}c}} \le E[\|\Theta_k - \Theta_0\|] \le d + \sqrt{\frac{2K_2\mu^{2-\lambda}}{\gamma_{\max}c}},$ (22)\n\nwhere the first pair of inequalities follow from the triangle inequality and the second pair of inequalities follow from (20) - (21), Jensen's inequality and letting $d = E[\|\Theta_0 - \Theta^*\|]$. Now, for $k \ge T$, consider the following N points: $\{X_i = i, Y_i = \|\Theta_{k+i} - \Theta_0\|\}_{i=1}^{N}$. 
Since these points are all obtained after \"steady-state\" is reached, if we draw the best-fit line through these points, its slope should be small. More precisely, let $\psi_N$ denote the slope of the best-fit line passing through these N points. Using (22) along with formulas for the slope in linear regression, and after some algebraic manipulations (see Appendix ?? for detailed calculations), one can show that:\n\n$|E[\psi_N]| = O\left(\frac{\mu^{1-\frac{\lambda}{2}}}{N}\right), \quad \mathrm{Var}(\psi_N) = O\left(\frac{1}{N^2}\right).$ (23)\n\nTherefore, if $N \ge \frac{\chi}{\mu^{\lambda/2}}$, then the slope of the best-fit line connecting $\{X_i, Y_i\}$ will be $O\left(\frac{\mu^{1-\frac{\lambda}{2}}}{N}\right)$ with high probability (for a sufficiently large constant $\chi > 0$). On the other hand, when the algorithm is in the transient state, the difference between $\|\Theta_{k+m} - \Theta_0\|$ and $\|\Theta_k - \Theta_0\|$ will be $O(m\mu)$ since $\Theta_k$ changes by $O(\mu)$ from one time slot to the next (see Lemma 3 in Appendix ?? for more details). Using this fact, the slope of the best-fit line through N consecutive points in the transient state can be shown to be $O(\mu)$, similar to (23). Since we choose $N \ge \frac{\chi}{\mu^{\lambda/2}}$, the slope of the best-fit line in steady state, i.e., $O\left(\frac{\mu^{1-\frac{\lambda}{2}}}{N}\right)$, will be lower than the slope of the best-fit line in the transient phase, i.e., $O(\mu)$ (for a sufficiently large $\chi$). We use this fact as a diagnostic test to determine whether or not the algorithm has entered steady-state. 
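The slope diagnostic described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the synthetic saturating curve, and the particular values of $\sigma$, $\xi$, N and $\mu$ below are assumptions chosen only to show the test firing after the curve flattens.

```python
import numpy as np

def best_fit_slope(y):
    """Least-squares slope of the points (0, y[0]), ..., (N-1, y[N-1])."""
    x = np.arange(len(y))
    return np.polyfit(x, y, 1)[0]       # leading coefficient is the slope

def adapt_learning_rate(dist_history, mu, lam, sigma, xi, N):
    """One application of the slope diagnostic.

    dist_history: recent values of ||Theta_k - Theta_ini||.
    Returns the (possibly reduced) learning rate and whether the test fired."""
    if len(dist_history) < N:
        return mu, False
    psi = best_fit_slope(np.asarray(dist_history[-N:]))
    if psi < sigma * mu ** (1 - lam / 2):   # near-horizontal best-fit line
        return mu / xi, True                # steady state reached: decay mu
    return mu, False

# Synthetic ||Theta_k - Theta_0|| curve: rises, then saturates.
k = np.arange(400, dtype=float)
dists = 1.0 - np.exp(-k / 40.0)
mu0 = 0.1
mu_early, fired_early = adapt_learning_rate(list(dists[:120]), mu0, 1.5, 0.001, 1.2, 100)
mu_late, fired_late = adapt_learning_rate(list(dists), mu0, 1.5, 0.001, 1.2, 100)
print(fired_early, fired_late)   # fires only once the curve has flattened
```

On the rising part of the curve the fitted slope is well above the threshold $\sigma\mu^{1-\lambda/2}$, so the learning rate is left alone; once the curve saturates the slope drops below the threshold and $\mu$ is divided by $\xi$, mirroring the reset step in Algorithm 1.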
If the diagnostic test returns true, we update the learning rate (see Algorithm 1).\n\nAlgorithm 1 Adaptive Learning Rate Rule\nHyperparameters: $\rho$, $\sigma$, $\xi$, N\nInitialize $\mu = \rho$, $\psi_N = 2\sigma\mu^{1-\frac{\lambda}{2}}$, $\Theta_0$, $\Theta_{\mathrm{ini}} = \Theta_0$.\nfor i = 1, 2, ... do\n  Do two time-scale algorithm update.\n  Compute $\psi_N = \mathrm{Slope}\left(\{k, \|\Theta_{i-k} - \Theta_{\mathrm{ini}}\|\}_{k=0}^{N-1}\right)$.\n  if $\psi_N < \sigma\mu^{1-\frac{\lambda}{2}}$ then\n    $\mu = \frac{\mu}{\xi}$.\n    $\Theta_{\mathrm{ini}} = \Theta_i$.\n  end if\nend for\n\nWe note that our adaptive learning rate rule will also work for single time-scale reinforcement learning algorithms such as TD(λ) since our expressions for the mean-square error, when specialized to the case of a single time-scale, will recover the result in [Srikant and Ying, 2019] (see [Gupta et al., 2019] for more details). Therefore, an interesting question that arises from (19) is whether one can optimize the rate of convergence with respect to the time-scale ratio λ. Since the RHS in (19) depends on a variety of problem-dependent parameters, it is difficult to optimize it over λ. An interesting direction of further research is to investigate if practical adaptive strategies for λ can be developed in order to improve the rate of convergence further.\n\n5 Experiments\n\nWe implemented our adaptive learning rate schedule on two popular classic control problems in reinforcement learning - Mountain Car and Inverted Pendulum, and compared its performance with the optimal polynomial decay learning rate rule suggested in [Dalal et al., 2017b] (described in the next subsection). See Appendix ?? for more details on the Mountain Car and Inverted Pendulum problems. 
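The experiments below evaluate policies with the two time-scale TDC algorithm. As a hedged reference, the TDC update with linear function approximation (following [Sutton et al., 2009]) can be sketched on a toy two-state chain with one-hot features; the chain, rewards, and step sizes here are made-up stand-ins, not the Mountain Car or Inverted Pendulum setups used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-state Markov chain; one-hot features make the TD fixed point
# equal to the true value function, so convergence is easy to check.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
r = np.array([1.0, -1.0])
zeta = 0.9                                  # discount factor
V_true = np.linalg.solve(np.eye(2) - zeta * P, r)

phi = np.eye(2)                             # one-hot feature vectors
U = np.zeros(2)                             # value-function weights (slow)
Vw = np.zeros(2)                            # auxiliary weights (fast)
a_step, b_step = 0.01, 0.05                 # constant two time-scale steps

s = 0
for _ in range(100_000):
    s_next = rng.choice(2, p=P[s])
    delta = r[s] + zeta * phi[s_next] @ U - phi[s] @ U   # TD error
    # TDC update [Sutton et al., 2009]:
    U += a_step * (delta * phi[s] - zeta * (phi[s] @ Vw) * phi[s_next])
    Vw += b_step * (delta - phi[s] @ Vw) * phi[s]
    s = s_next

print(np.linalg.norm(U - V_true))   # small relative to ||V_true||
```

With constant step sizes the iterate settles into a neighborhood of the true value function whose radius shrinks with the step size, which is what the adaptive rule exploits by decaying the learning rate each time the diagnostic detects steady state.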
We evaluated the following policies using the two time-scale TDC algorithm (see [Sutton et al., 2009] for more details regarding TDC):\n\nFigure 2: Performance of different learning rate rules in classic control problems. (a) Mountain Car. (b) Inverted Pendulum.\n\n• Mountain Car - At each time step, choose a random action ∈ {0, 2}, i.e., accelerate randomly to the left or right.\n• Inverted Pendulum - At each time step, choose a random action in the entire action space, i.e., apply a random torque ∈ [−2.0, 2.0] at the pivot point.\n\nSince the true value of $\Theta^*$ is not known in both the problems we consider, to quantify the performance of the TDC algorithm, we used the error metric known as the norm of the expected TD update (NEU, see [Sutton et al., 2009] for more details). For both problems, we used an O(3) Fourier basis (see [Konidaris et al., 2011] for more details) to approximate the value function and used 0.95 as the discount factor.\n\n5.1 Learning Rate Rules and Tuning\n\n1. The optimal polynomial decay rule suggested in [Dalal et al., 2017b] is the following: at time step k, choose $\epsilon_k^{\alpha} = \frac{1}{(k+1)^{\alpha}}$ and $\epsilon_k^{\beta} = \frac{1}{(k+1)^{\beta}}$, where $\alpha \to 1$ and $\beta \to \frac{2}{3}$. For our experiments, we chose α = 0.99 and β = 0.66. This implies $\lambda = \frac{\alpha}{\beta} = 1.5$. Since the problems we considered require smaller initial step-sizes for convergence, we let $\epsilon_k^{\alpha} = \frac{\rho_0}{(k+1)^{\alpha}}$ and $\epsilon_k^{\beta} = \frac{\rho_0}{(k+1)^{\beta}}$ and did a grid search to determine the best $\rho_0$, i.e., the best initial learning rate. The following values for $\rho_0$ were found to be the best: Mountain Car - $\rho_0$ = 0.2, Inverted Pendulum - $\rho_0$ = 0.2.\n\n2. 
For our proposed adaptive learning rate rule, we fixed ξ = 1.2, N = 200 in both problems since we did not want the decay in the learning rate to be too aggressive and the resource consumption for slope computation to be high. We also set λ = 1.5 as in the polynomial decay case to have a fair comparison. We then fixed ρ and conducted a grid search to find the best σ. Subsequently, we conducted a grid search over ρ. Interestingly, the adaptive learning rate rule was reasonably robust to the value of ρ. We used ρ = 0.05 in Inverted Pendulum and ρ = 0.1 in Mountain Car. Effectively, the only hyperparameter that affected the rule's performance significantly was σ. The following values for σ were found to be the best: Mountain Car - σ = 0.001, Inverted Pendulum - σ = 0.01.\n\n5.2 Results\n\nFor each experiment, one run involved the following: 10,000 episodes, with the number of iterations in each episode being 50 and 200 for Inverted Pendulum and Mountain Car respectively. After every 1,000 episodes, training/learning was paused and the NEU was computed by averaging over 1,000 test episodes. We initialized $\Theta_0 = 0$. For Mountain Car, 50 such runs were conducted and the results were computed by averaging over these runs. For Inverted Pendulum, 100 runs were conducted and the results were computed by averaging over these runs. Note that the learning rate for each adaptive strategy was adapted at the episodic level due to the episodic nature of the problems. The results are reported in Figures 2a and 2b. 
As is clear from the figures, our proposed adaptive learning rate rule significantly outperforms the optimal polynomial decay rule.

6 Conclusion

We have presented finite-time bounds quantifying the performance of two time-scale linear stochastic approximation algorithms. The bounds give insight into how the different time-scale and learning rate parameters affect the rate of convergence. We utilized these insights and designed an adaptive learning rate selection rule. We implemented our rule on popular classical control problems in reinforcement learning and showed that the proposed rule significantly outperforms the optimal polynomial decay strategy suggested in the literature.

Acknowledgements

Research supported by ONR Grant N00014-19-1-2566, NSF Grants CPS ECCS 1739189, NeTS 1718203, CMMI 1562276, ECCS 16-09370, and NSF/USDA Grant AG 2018-67007-28379. Lei Ying's work supported by NSF grants CNS 1618768, ECCS 1609202, IIS 1715385, ECCS 1739344, CNS 1824393 and CNS 1813392.

References

A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.

D. P. Bertsekas. Dynamic programming and optimal control, 3rd edition, volume II. Belmont, MA: Athena Scientific, 2011.

D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena, 1996.

J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.

S. Bhatnagar, H. L. Prasad, and L. A. Prashanth. Stochastic recursive algorithms for optimization: simultaneous perturbation methods, volume 434. Springer, 2012.

S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári.
Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.

V. S. Borkar. Stochastic approximation: a dynamical systems viewpoint. Springer, 2009.

G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analyses for TD(0) with function approximation. arXiv preprint arXiv:1704.01161, 2017a. Also appeared in AAAI 2018.

G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. arXiv preprint arXiv:1703.05376, 2017b. Also appeared in COLT 2018.

H. Gupta, R. Srikant, and L. Ying. Adaptive learning rate selection for temporal difference learning. Real-world Sequential Decision Making Workshop, ICML, 2019.

H. K. Khalil. Nonlinear Systems, volume 3. Prentice Hall, Upper Saddle River, NJ, 2002.

P. Kokotovic, H. K. Khalil, and J. O'Reilly. Singular perturbation methods in control: analysis and design, volume 25. SIAM, 1999.

V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819, 2004.

G. Konidaris, S. Osentoski, and P. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

H. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.

C. Lakshminarayanan and C. Szepesvári. Linear stochastic approximation: How far does constant step-size and iterate averaging go?
In International Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018.

B. Liu, I. Gemp, M. Ghavamzadeh, J. Liu, S. Mahadevan, and M. Petrik. Proximal gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity. Journal of Artificial Intelligence Research, 63:461–494, 2018.

R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. Conference on Learning Theory (COLT), 2019. arXiv preprint arXiv:1902.00923.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.

R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.

R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems, 21(21):1609–1616, 2008.

R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603–2631, 2016.

C. Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 1997.

T. Xu, S. Zou, and Y. Liang.
Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In Advances in Neural Information Processing Systems, pages 10633–10643, 2019.