{"title": "Zap Q-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2235, "page_last": 2244, "abstract": "The Zap Q-learning algorithm introduced in this paper is an improvement of Watkins' original algorithm and recent competitors in several respects. It is a matrix-gain algorithm designed so that its asymptotic variance is optimal. Moreover, an ODE analysis suggests that the transient behavior is a close match to a deterministic Newton-Raphson implementation. This is made possible by a two time-scale update equation for the matrix gain sequence. The analysis suggests that the approach will lead to stable and efficient computation even for non-ideal parameterized settings. Numerical experiments confirm the quick convergence, even in such non-ideal cases.", "full_text": "Zap Q-Learning\n\nAdithya M. Devraj\n\nSean P. Meyn\n\nDepartment of Electrical and Computer Engineering,\n\nUniversity of Florida,\nGainesville, FL 32608.\n\nadithyamdevraj@ufl.edu, meyn@ece.ufl.edu\n\nAbstract\n\nThe Zap Q-learning algorithm introduced in this paper is an improvement of\nWatkins\u2019 original algorithm and recent competitors in several respects.\nIt is a\nmatrix-gain algorithm designed so that its asymptotic variance is optimal. More-\nover, an ODE analysis suggests that the transient behavior is a close match to a\ndeterministic Newton-Raphson implementation. This is made possible by a two\ntime-scale update equation for the matrix gain sequence. The analysis suggests\nthat the approach will lead to stable and ef\ufb01cient computation even for non-ideal\nparameterized settings. Numerical experiments con\ufb01rm the quick convergence,\neven in such non-ideal cases.\n\n1\n\nIntroduction\n\nIt is recognized that algorithms for reinforcement learning such as TD- and Q-learning can be slow\nto converge. 
The poor performance of Watkins\u2019 Q-learning algorithm was first quantified in [25], and since then many papers have appeared with proposed improvements, such as [9, 1].\nAn emphasis in much of the literature is computation of finite-time PAC (probably approximately correct) bounds as a metric for performance. Explicit bounds were obtained in [25] for Watkins\u2019 algorithm, and in [1] for the \u201cspeedy\u201d Q-learning algorithm that was introduced by these authors. A general theory is presented in [18] for stochastic approximation algorithms.\nIn each of the models considered in prior work, the update equation for the parameter estimates can be expressed\n\n\u03b8_{n+1} = \u03b8_n + \u03b1_n [f(\u03b8_n) + \u2206_{n+1}] ,  n \u2265 0 ,    (1)\n\nin which {\u03b1_n} is a positive gain sequence, and {\u2206_n} is a martingale difference sequence. This representation is critical in analysis, but unfortunately is not typical in reinforcement learning applications outside of these versions of Q-learning. For Markovian models, the usual transformation used to obtain a representation similar to (1) results in an error sequence {\u2206_n} that is the sum of a martingale difference sequence and a telescoping sequence [15]. It is the telescoping sequence that prevents easy analysis of Markovian models.\nThis gap in the research literature carries over to the general theory of Markov chains. Examples of concentration bounds for i.i.d. sequences or martingale-difference sequences include the finite-time bounds of Hoeffding and Bennett. Extensions to Markovian models either offer very crude bounds [17], or restrictive assumptions [14, 11]; this remains an active area of research [20].\nIn contrast, asymptotic theory for stochastic approximation (as well as general state space Markov chains) is mature. Large Deviations or Central Limit Theorem (CLT) limits hold under very general assumptions [3, 13, 4]. 
The CLT will be a guide to algorithm design in the present paper. For a typical stochastic approximation algorithm, this takes the following form: denoting {\u03b8\u0303_n := \u03b8_n \u2212 \u03b8^\u2217 : n \u2265 0} to be the error sequence, under general conditions the scaled sequence {\u221an \u03b8\u0303_n : n \u2265 1} converges in distribution to a Gaussian distribution, N(0, \u03a3_\u03b8). Typically, the scaled covariance is also convergent to the limit, which is known as the asymptotic covariance:\n\n\u03a3_\u03b8 = lim_{n\u2192\u221e} n E[\u03b8\u0303_n \u03b8\u0303_n^T] .    (2)\n\nAn asymptotic bound such as (2) may not be satisfying for practitioners of stochastic optimization or reinforcement learning, given the success of finite-n performance bounds in prior research. However, the fact that the asymptotic covariance \u03a3_\u03b8 has a simple representation, and can therefore be easily improved or optimized, makes it a compelling tool to consider. Moreover, as the examples in this paper suggest, the asymptotic covariance is often a good predictor of finite-time performance, since the CLT approximation is accurate for reasonable values of n.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nTwo approaches are known for optimizing the asymptotic covariance. First is the remarkable averaging technique introduced in [21, 22, 24] (also see [12]). Second is Stochastic Newton-Raphson, based on a special choice of matrix gain for the algorithm [13, 23]. The algorithms proposed here use the second approach.\nMatrix gain variants of TD-learning [10, 19, 29, 30] and Q-learning [27] are available in the literature, but none are based on optimizing the asymptotic variance. 
It is a fortunate coincidence that\nLSTD(\u03bb) of [6] achieves this goal [8].\nIn addition to accelerating the convergence rate of the standard Q-learning algorithm, it is hoped\nthat this paper will lead to entirely new algorithms. In particular, there is little theory to support\nQ-learning in non-ideal settings in which the optimal \u201cQ-function\u201d does not lie in the parameterized\nfunction class. Convergence results have been obtained for a class of optimal stopping problems\n[31], and for deterministic models [16]. There is now intense practical interest, despite an incomplete\ntheory. A stronger supporting theory will surely lead to more ef\ufb01cient algorithms.\nContributions A new class of Q-learning algorithms is proposed, called Zap Q-learning, designed\nto more accurately mimic the classical Newton-Raphson algorithm. It is based on a two time-scale\nstochastic approximation algorithm, constructed so that the matrix gain tracks the gain that would\nbe used in a deterministic Newton-Raphson method.\nA full analysis is presented for the special case of a complete parameterization (similar to the setting\nof Watkins\u2019 algorithm [28]). It is found that the associated ODE has a remarkable and simple rep-\nresentation, which implies consistency under suitable assumptions. Extensions to non-ideal param-\neterized settings are also proposed, and numerical experiments show dramatic variance reductions.\nMoreover, results obtained from \ufb01nite-n experiments show close solidarity with asymptotic theory.\nThe remainder of the paper is organized as follows. The new Zap Q-learning algorithm is introduced\nin Section 2, which contains a summary of the theory from extended version of this paper [8].\nNumerical results are surveyed in Section 3, and conclusions are contained in Section 4.\n\n2 Zap Q-Learning\nConsider an MDP model with state space X, action space U, cost function c : X \u00d7 U \u2192 R, and\ndiscount factor \u03b2 \u2208 (0, 1). 
It is assumed that the state and action space are finite: denote \u2113 = |X|, \u2113_u = |U|, and P_u the \u2113 \u00d7 \u2113 conditional transition probability matrix, conditioned on u \u2208 U. The state-action process (X, U) is adapted to a filtration {F_n : n \u2265 0}, and Q1 is assumed throughout:\nQ1: The joint process (X, U) is an irreducible Markov chain, with unique invariant pmf \u03d6.\nThe minimal value function is the unique solution to the discounted-cost optimality equation:\n\nh^\u2217(x) = min_{u\u2208U} Q^\u2217(x, u) := min_{u\u2208U} {c(x, u) + \u03b2 \u2211_{x\u2032\u2208X} P_u(x, x\u2032) h^\u2217(x\u2032)} ,  x \u2208 X.\n\nThe \u201cQ-function\u201d solves a similar fixed point equation:\n\nQ^\u2217(x, u) = c(x, u) + \u03b2 \u2211_{x\u2032\u2208X} P_u(x, x\u2032) Q\u0332^\u2217(x\u2032) ,  x \u2208 X, u \u2208 U,    (3)\n\nin which Q\u0332(x) := min_{u\u2208U} Q(x, u) for any function Q : X \u00d7 U \u2192 R.\nGiven any function \u03c2 : X \u00d7 U \u2192 R, let \ud835\udcac(\u03c2) denote the corresponding solution to the fixed point equation (3), with c replaced by \u03c2: the function q = \ud835\udcac(\u03c2) is the solution to the fixed point equation,\n\nq(x, u) = \u03c2(x, u) + \u03b2 \u2211_{x\u2032} P_u(x, x\u2032) min_{u\u2032} q(x\u2032, u\u2032) ,  x \u2208 X, u \u2208 U.\n\nThe mapping \ud835\udcac is a bijection on the set of real-valued functions on X \u00d7 U. 
It is also piecewise linear, concave and monotone (see [8] for proofs and discussions).\nIt is known that Watkins\u2019 Q-learning algorithm can be regarded as a stochastic approximation method [26, 5] to obtain the solution \u03b8^\u2217 \u2208 R^d to the steady-state mean equations,\n\nE[{c(X_n, U_n) + \u03b2 Q\u0332^{\u03b8^\u2217}(X_{n+1}) \u2212 Q^{\u03b8^\u2217}(X_n, U_n)} \u03b6_n(i)] = 0 ,  1 \u2264 i \u2264 d ,    (4)\n\nwhere {\u03b6_n} are d-dimensional F_n-measurable functions and Q^\u03b8 = \u03b8^T \u03c8 for basis functions {\u03c8_i : 1 \u2264 i \u2264 d}. In Watkins\u2019 algorithm \u03b6_n = \u03c8(X_n, U_n), and the basis functions are indicator functions: \u03c8_k(x, u) = I{x = x^k, u = u^k}, 1 \u2264 k \u2264 d, with d = \u2113 \u00d7 \u2113_u the total number of state-action pairs [26]. In this special case we identify Q^{\u03b8^\u2217} = Q^\u2217, and the parameter \u03b8 is identified with the estimate Q^\u03b8. A stochastic approximation algorithm to solve (4) coincides with Watkins\u2019 algorithm [28]:\n\n\u03b8_{n+1} = \u03b8_n + \u03b1_{n+1} {c(X_n, U_n) + \u03b2 \u03b8\u0332_n(X_{n+1}) \u2212 \u03b8_n(X_n, U_n)} \u03c8(X_n, U_n) .    (5)\n\nOne very general technique that is used to analyze convergence of stochastic approximation algorithms is to consider the associated limiting ODE, which is the continuous-time, deterministic approximation of the original recursion [4, 5]. For (5), denoting the continuous time approximation of {\u03b8_n} to be {q_t}, and under standard assumptions on the gain sequence {\u03b1_n}, the associated ODE is of the form\n\nd/dt q_t(x, u) = \u03d6(x, u) {c(x, u) + \u03b2 \u2211_{x\u2032} P_u(x, x\u2032) min_{u\u2032} q_t(x\u2032, u\u2032) \u2212 q_t(x, u)} .    (6)\n\nUnder Q1, {q_t} converges to Q^\u2217, a key step in the proof of convergence of {\u03b8_n} to the same limit.\nWhile Watkins\u2019 Q-learning (5) is consistent, it is argued in [8] that the asymptotic covariance of this algorithm is typically infinite. This conclusion is complementary to the finite-n analysis of [25]:\nTheorem 2.1. 
Watkins\u2019 Q-learning algorithm with step-size \u03b1_n \u2261 1/n is consistent under Assumption Q1. Suppose that in addition max_{x,u} \u03d6(x, u) \u2264 (1/2)(1 \u2212 \u03b2)^{\u22121}, and the conditional variance of h^\u2217(X_t) is positive:\n\n\u2211_{x,x\u2032,u} \u03d6(x, u) P_u(x, x\u2032) [h^\u2217(x\u2032) \u2212 P_u h^\u2217(x)]^2 > 0 .\n\nThen the asymptotic covariance is infinite: lim_{n\u2192\u221e} n E[\u2016\u03b8_n \u2212 \u03b8^\u2217\u2016^2] = \u221e.\n\nThe assumption max_{x,u} \u03d6(x, u) \u2264 (1/2)(1 \u2212 \u03b2)^{\u22121} is satisfied whenever \u03b2 \u2265 1/2.\nMatrix-gain stochastic approximation algorithms have appeared in previous literature. In particular, matrix gain techniques have been used to speed up the rate of convergence of Q-learning (see [7] and the second example in Section 3). 
The general G-Q(\u03bb) algorithm is described as follows, based on a sequence of d \u00d7 d matrices G = {G_n} and \u03bb \u2208 [0, 1]: For initialization \u03b8_0, \u03b6_0 \u2208 R^d, the sequence of estimates is defined recursively:\n\n\u03b8_{n+1} = \u03b8_n + \u03b1_{n+1} G_{n+1} \u03b6_n d_{n+1}\nd_{n+1} = c(X_n, U_n) + \u03b2 Q\u0332^{\u03b8_n}(X_{n+1}) \u2212 Q^{\u03b8_n}(X_n, U_n)    (7)\n\u03b6_{n+1} = \u03bb\u03b2 \u03b6_n + \u03c8(X_{n+1}, U_{n+1})\n\nThe special case based on stochastic Newton-Raphson is Zap Q(\u03bb)-learning:\n\nAlgorithm 1 Zap Q(\u03bb)-learning\nInput: \u03b8_0 \u2208 R^d, \u03b6_0 = \u03c8(X_0, U_0), \u00c2_0 \u2208 R^{d\u00d7d}, n = 0, T \u2208 Z^+    \u25b7 Initialization\n1: repeat\n2:   \u03c6_n(X_{n+1}) := arg min_u Q^{\u03b8_n}(X_{n+1}, u);\n3:   d_{n+1} := c(X_n, U_n) + \u03b2 Q^{\u03b8_n}(X_{n+1}, \u03c6_n(X_{n+1})) \u2212 Q^{\u03b8_n}(X_n, U_n);    \u25b7 Temporal difference\n4:   A_{n+1} := \u03b6_n [\u03b2 \u03c8(X_{n+1}, \u03c6_n(X_{n+1})) \u2212 \u03c8(X_n, U_n)]^T;\n5:   \u00c2_{n+1} = \u00c2_n + \u03b3_{n+1} [A_{n+1} \u2212 \u00c2_n];    \u25b7 Matrix gain update rule\n6:   \u03b8_{n+1} = \u03b8_n \u2212 \u03b1_{n+1} \u00c2_{n+1}^{\u22121} \u03b6_n d_{n+1};    \u25b7 Zap-Q update rule\n7:   \u03b6_{n+1} := \u03bb\u03b2 \u03b6_n + \u03c8(X_{n+1}, U_{n+1});    \u25b7 Eligibility vector update rule\n8:   n = n + 1\n9: until n \u2265 T\n\nA special case is considered in the analysis here: the basis is chosen as in Watkins\u2019 algorithm, \u03bb = 0, and \u03b1_n \u2261 1/n. 
An equivalent representation for the parameter recursion is thus\n\n\u03b8_{n+1} = \u03b8_n \u2212 \u03b1_{n+1} \u00c2_{n+1}^{\u22121} {\u03a8_n c + A_{n+1} \u03b8_n} ,\n\nin which c and \u03b8_n are treated as d-dimensional vectors rather than functions on X \u00d7 U, and \u03a8_n = \u03c8(X_n, U_n) \u03c8(X_n, U_n)^T.\nPart of the analysis is based on a recursion for the following d-dimensional sequence:\n\n\u0108_n = \u2212\u03a0^{\u22121} \u00c2_n \u03b8_n ,  n \u2265 1 ,\n\nwhere \u03a0 is the d \u00d7 d diagonal matrix with entries \u03d6 (the steady-state distribution of (X, U)). The sequence {\u0108_n} admits a very simple recursion in the special case \u03b3 \u2261 \u03b1:\n\n\u0108_{n+1} = \u0108_n + \u03b1_{n+1} [\u03a0^{\u22121} \u03a8_n c \u2212 \u0108_n] .    (8)\n\nIt follows that \u0108_n converges to c as n \u2192 \u221e, since (8) is essentially a Monte-Carlo average of {\u03a0^{\u22121} \u03a8_n c : n \u2265 0}. Analysis for this case is complicated since \u00c2_n is obtained as a uniform average of {A_n}.\nThe main contributions of this paper concern a two time-scale implementation for which\n\n\u2211_n \u03b3_n = \u221e ,  \u2211_n \u03b3_n^2 < \u221e ,  and  lim_{n\u2192\u221e} \u03b1_n/\u03b3_n = 0 .    (9)\n\nIn our analysis, we restrict to \u03b3_n \u2261 1/n^\u03c1, for some fixed \u03c1 \u2208 (1/2, 1). Through ODE analysis, it is argued that the Zap Q-learning algorithm closely resembles an implementation of Newton-Raphson in this case. This analysis suggests that {\u00c2_n} more closely tracks the mean of {A_n}. Theorem 2.2 summarizes the main results under Q1, and the following additional assumptions:\nQ2: The optimal policy \u03c6^\u2217 is unique.\nQ3: The sequence of policies {\u03c6_n} satisfies\n\n\u2211_{n=1}^{\u221e} \u03b3_n I{\u03c6_{n+1} \u2260 \u03c6_n} < \u221e ,  a.s.\n\nThe assumption Q3 is used to address the discontinuity in the recursion for {\u00c2_n} resulting from the dependence of A_{n+1} on \u03c6_n.\nTheorem 2.2. 
Suppose that Assumptions Q1\u2013Q3 hold, and the gain sequences \u03b1 and \u03b3 satisfy:\n\n\u03b1_n = n^{\u22121} ,  \u03b3_n = n^{\u2212\u03c1} ,  n \u2265 1 ,\n\nfor some fixed \u03c1 \u2208 (1/2, 1). Then,\n(i) The parameter sequence {\u03b8_n} obtained using the Zap-Q algorithm converges to Q^\u2217 a.s.\n(ii) The asymptotic covariance (2) is minimized over all G-Q(0) matrix gain versions of Watkins\u2019 Q-learning algorithm.\n(iii) An ODE approximation holds for the sequence {\u03b8_n, \u0108_n}, by continuous functions (q, \u03c2) satisfying\n\nq_t = \ud835\udcac(\u03c2_t) ,  d/dt \u03c2_t = \u2212\u03c2_t + c .    (10)\n\nThis ODE approximation is exponentially asymptotically stable, with lim_{t\u2192\u221e} q_t = Q^\u2217.\n\nThe ODE result (10) is an important aspect of this work. It says that the sequence {q_t}, a continuous time approximation of the parameter estimates {\u03b8_n} that are obtained using the Zap Q-learning algorithm, evolves as the Q-function of some time-varying cost function \u03c2_t. Furthermore, this time-varying cost function \u03c2_t has dynamics independent of q_t, and converges to c, the cost function defined in the MDP model. Convergence follows from the continuity of the mapping \ud835\udcac:\n\nlim_{n\u2192\u221e} \u03b8_n = lim_{t\u2192\u221e} q_t = lim_{t\u2192\u221e} \ud835\udcac(\u03c2_t) = \ud835\udcac(c) = Q^\u2217 .\n\nThe reader is referred to [8] for complete proofs and technical details.\n\n3 Numerical Results\n\nResults from numerical experiments are surveyed here to illustrate the performance of the Zap Q-learning algorithm.\n\nFinite state-action MDP Consider first a simple path-finding problem. The state space X = {1, . . . , 6} coincides with the six nodes on the undirected graph shown in Fig. 1. 
The action space U = {e_{x,x\u2032}}, x, x\u2032 \u2208 X, consists of all feasible edges along which the agent can travel, including each \u201cself-loop\u201d, u = e_{x,x}. The goal is to reach the state x^\u2217 = 6 and maximize the time spent there. The reader is referred to [8] for details on the cost function and other modeling assumptions.\n\nFigure 1: Graph for MDP\n\nSix variants of Q-learning were tested: Watkins\u2019 algorithm (5), Watkins\u2019 algorithm with Ruppert-Polyak-Juditsky (RPJ) averaging [21, 22, 24], Watkins\u2019 algorithm with a \u201cpolynomial learning rate\u201d \u03b1_n \u2261 n^{\u22120.6} [9], Speedy Q-learning [1], and two versions of Zap Q-learning: \u03b3_n \u2261 \u03b1_n \u2261 n^{\u22121}, and \u03b3_n \u2261 \u03b1_n^{0.85} \u2261 n^{\u22120.85}.\nFig. 2 shows the normalized trace of the asymptotic covariance of Watkins\u2019 algorithm with step-size \u03b1_n = g/n, as a function of g > 0. Based on this observation or on Theorem 2.1, it follows that the asymptotic covariance is not finite for the standard Watkins\u2019 algorithm with \u03b1_n \u2261 1/n. In simulations it was found that the parameter estimates are not close to \u03b8^\u2217 even after many millions of iterations.\nIt was also found that Watkins\u2019 algorithm performed poorly in practice for any scalar gain. For example, more than half of the 10^3 experiments using \u03b2 = 0.8 and g = 70 resulted in values of \u03b8_n(15) exceeding \u03b8^\u2217(15) by 10^4 (with \u03b8^\u2217(15) \u2248 500), even with n = 10^6. The algorithm performed well with the introduction of projection (to ensure that the parameter estimates evolve on a bounded set) in the case \u03b2 = 0.8. With \u03b2 = 0.99, the performance was unacceptable for any scalar gain, even with projection.\nFig. 3 shows normalized histograms of {W_n^i(k) = \u221an (\u03b8_n^i(k) \u2212 \u03b8_n(k)) : 1 \u2264 i \u2264 N} for the projected Watkins Q-learning with gain g = 70, and the Zap algorithm, \u03b3_n \u2261 \u03b1_n^{0.85}. The theoretical predictions were based on the solution to a Lyapunov equation [8]. 
Results for \u03b2 = 0.99 contained in [8] show similar solidarity with asymptotic theory.\n\nFigure 2: Normalized trace of the asymptotic covariance\n\nFigure 3: Asymptotic variance for Watkins\u2019 g = 70 and Zap Q-learning, \u03b3_n \u2261 \u03b1_n^{0.85}; \u03b2 = 0.8. Panels: (a) W_n(18) with n = 10^4, (b) W_n(18) with n = 10^6, (c) W_n(10) with n = 10^4, (d) W_n(10) with n = 10^6.\n\nBellman Error The Bellman error at iteration n is denoted:\n\nB_n(x, u) = \u03b8_n(x, u) \u2212 r(x, u) \u2212 \u03b2 \u2211_{x\u2032\u2208X} P_u(x, x\u2032) max_{u\u2032\u2208U} \u03b8_n(x\u2032, u\u2032) .\n\nThis is identically zero if and only if \u03b8_n = Q^\u2217. Fig. 4 contains plots of the maximal error B\u0304_n = max_{x,u} |B_n(x, u)| for the six algorithms.\nThough all six algorithms perform reasonably well when \u03b2 = 0.8, Zap Q-learning is the only one that achieves near zero Bellman error within n = 10^6 iterations in the case \u03b2 = 0.99. Moreover, the performance of the two time-scale algorithm is clearly superior to the one time-scale algorithm. It is also observed that the Watkins algorithm with an optimized scalar gain (i.e., step-size \u03b1_n \u2261 g^\u2217/n with g^\u2217 chosen so that the asymptotic variance is minimized) has the best performance among scalar-gain algorithms.\n\nFigure 4: Maximum Bellman error {B\u0304_n : n \u2265 0} for the six Q-learning algorithms\n\nFig. 4 shows only the typical behavior; repeated trials were run to investigate the range of possible outcomes. 
Plots of the mean and 2\u03c3 confidence intervals of B\u0304_n are shown in Fig. 5 for \u03b2 = 0.99.\n\nFigure 5: Simulation-based 2\u03c3 confidence intervals for the six Q-learning algorithms for the case \u03b2 = 0.99\n\nFinance model The next example is taken from [27, 7]. The reader is referred to these references for complete details of the problem set-up and the reinforcement learning architecture used in this prior work. The example is of interest because it shows how the Zap Q-learning algorithm can be used with a more general basis, and also how the technique can be extended to optimal stopping time problems.\nThe Markovian state process for the model evolves in X = R^{100}. The \u201ctime to exercise\u201d is modeled as a discrete valued stopping time \u03c4. The associated expected reward is defined as E[\u03b2^\u03c4 r(X_\u03c4)], where \u03b2 \u2208 (0, 1), r(X_n) := X_n(100) = p\u0303_n/p\u0303_{n\u2212100}, and {p\u0303_t : t \u2208 R} is a geometric Brownian motion (derived from an exogenous price-process). 
The objective of finding a policy that maximizes the expected reward is modeled as an optimal stopping time problem.\nThe value function is defined to be the supremum over all stopping times:\n\nh^\u2217(x) = sup_{\u03c4>0} E[\u03b2^\u03c4 r(X_\u03c4) | X_0 = x] .\n\nThis solves the Bellman equation: for each x \u2208 X,\n\nh^\u2217(x) = max(r(x), \u03b2 E[h^\u2217(X_{n+1}) | X_n = x]) .\n\nThe associated Q-function is denoted Q^\u2217(x) := \u03b2 E[h^\u2217(X_{n+1}) | X_n = x], and solves a similar fixed point equation:\n\nQ^\u2217(x) = \u03b2 E[max(r(X_{n+1}), Q^\u2217(X_{n+1})) | X_n = x] .\n\nThe Q(0)-learning algorithm considered in [27] is defined as follows:\n\n\u03b8_{n+1} = \u03b8_n + \u03b1_{n+1} \u03c8(X_n) [\u03b2 max(X_{n+1}(100), Q^{\u03b8_n}(X_{n+1})) \u2212 Q^{\u03b8_n}(X_n)] ,  n \u2265 0 .\n\nIn [7] the authors attempt to improve the performance of the Q(0) algorithm through the use of a sequence of matrix gains, which can be regarded as an instance of the G-Q(0)-learning algorithm defined in (7). For details see this prior work as well as the extended version of this paper [8].\nA gain sequence {G_n} was introduced in [7] to improve performance. Denoting G and A to be the steady state means of {G_n} and {A_n} respectively, the eigenvalues corresponding to the matrix GA are shown on the right hand side of Fig. 6. 
It is observed that the sufficient condition for a finite asymptotic covariance is \u201cjust\u201d satisfied in this algorithm: the maximum eigenvalue of GA is approximately \u03bb \u2248 \u22120.525 < \u22121/2 (see Theorem 2.1 of [8]). It is worth stressing that the finite asymptotic covariance was not a design goal in this prior work. It is only now on revisiting this paper that we find that the sufficient condition \u03bb < \u22121/2 is satisfied.\nThe Zap Q-learning algorithm for this example is defined by the following recursion:\n\n\u03b8_{n+1} = \u03b8_n \u2212 \u03b1_{n+1} \u00c2_{n+1}^{\u22121} \u03c8(X_n) [\u03b2 max(X_{n+1}(100), Q^{\u03b8_n}(X_{n+1})) \u2212 Q^{\u03b8_n}(X_n)] ,\n\u00c2_{n+1} = \u00c2_n + \u03b3_n [A_{n+1} \u2212 \u00c2_n] ,  A_{n+1} = \u03c8(X_n) \u03d5^T(\u03b8_n, X_{n+1}) ,\n\u03d5(\u03b8_n, X_{n+1}) = \u03b2 \u03c8(X_{n+1}) I{Q^{\u03b8_n}(X_{n+1}) \u2265 X_{n+1}(100)} \u2212 \u03c8(X_n) .\n\nHigh performance despite ill-conditioned matrix gain The real parts of the eigenvalues of A are shown on a logarithmic scale on the left-hand side of Fig. 6. These eigenvalues have a wide spread: the ratio of the largest to the smallest real parts of the eigenvalues is of the order 10^4. This presents a challenge in applying any method. In particular, it was found that the performance of any scalar-gain algorithm was extremely poor, even with projection of parameter estimates.\n\nFigure 6: Eigenvalues of A and GA for the finance example\n\nFigure 7: Theoretical and empirical variance for the finance example\n\nIn applying the Zap Q-learning algorithm it was found that the estimates {\u00c2_n} defined in the above recursion are nearly singular. Despite the unfavorable setting for this approach, the performance of the algorithm was better than any alternative that was tested. Fig. 
7 contains normalized histograms of {W_n^i(k) = \u221an (\u03b8_n^i(k) \u2212 \u03b8_n(k)) : 1 \u2264 i \u2264 N} for the Zap-Q algorithm, with \u03b3_n \u2261 \u03b1_n^{0.85} \u2261 n^{\u22120.85}. The variance for finite n is close to the theoretical predictions based on the optimal asymptotic covariance. The histograms were generated for two values of n (n = 2 \u00d7 10^4 and n = 2 \u00d7 10^6), and k = 1, 7. Of the d = 10 possibilities, the histogram for k = 1 had the worst match with theoretical predictions, and k = 7 was the closest. The histograms for the G-Q(0) algorithm contained in [8] showed extremely high variance, and the experimental results did not match theoretical predictions.\n\nFigure 8: Histograms of average reward: G-Q(0) learning and Zap-Q-learning, \u03b3_n \u2261 \u03b1_n^\u03c1 \u2261 n^{\u2212\u03c1}\n\nTable 1: Percentage of outliers observed in N = 1000 runs. 
Each sub-table lists the percentage of runs which resulted in an average reward below a certain value.\n\n(a) Percentage of runs with h^{\u03b8_n}(x) \u2264 0.999\nn | 2e4 | 2e5 | 2e6\nG-Q(0) g = 100 | 82.7 | 77.5 | 68\nG-Q(0) g = 200 | 82.4 | 72.5 | 55.9\nZap-Q \u03c1 = 1.0 | 35.7 | 0 | 0\nZap-Q \u03c1 = 0.8 | 0.17 | 0.03 | 0\nZap-Q \u03c1 = 0.85 | 0.13 | 0.03 | 0\n\n(b) h^{\u03b8_n}(x) \u2264 0.95\nn | 2e4 | 2e5 | 2e6\nG-Q(0) g = 100 | 81.1 | 75.5 | 65.4\nG-Q(0) g = 200 | 80.6 | 70.6 | 53.7\nZap-Q \u03c1 = 1.0 | 0.55 | 0 | 0\nZap-Q \u03c1 = 0.8 | 0 | 0 | 0\nZap-Q \u03c1 = 0.85 | 0 | 0 | 0\n\n(c) h^{\u03b8_n}(x) \u2264 0.5\nn | 2e4 | 2e5 | 2e6\nG-Q(0) g = 100 | 54.5 | 49.7 | 39.5\nG-Q(0) g = 200 | 64.1 | 51.8 | 39\nZap-Q \u03c1 = 1.0 | 0 | 0 | 0\nZap-Q \u03c1 = 0.8 | 0 | 0 | 0\nZap-Q \u03c1 = 0.85 | 0 | 0 | 0\n\nHistograms of the average reward h^{\u03b8_n}(x) obtained from N = 1000 simulations are contained in Fig. 8, for n = 2 \u00d7 10^4, 2 \u00d7 10^5 and 2 \u00d7 10^6, and x(i) = 1, 1 \u2264 i \u2264 100. Omitted in this figure are outliers: values of the reward in the interval [0, 1). Table 1 lists the percentage of outliers for each algorithm and value of n.\nThe asymptotic covariance of the G-Q(0) algorithm was not far from optimal (its trace is about 15 times larger than obtained using Zap Q-learning). However, it is observed that this algorithm suffers from much larger outliers.\n\n4 Conclusions\n\nWatkins\u2019 Q-learning algorithm is elegant, but subject to two common and valid complaints: it can be very slow to converge, and it is not obvious how to extend this approach to obtain a stable algorithm in non-trivial parameterized settings (i.e., without a look-up table representation for the Q-function). This paper addresses both concerns with the new Zap Q(\u03bb) algorithms that are motivated by asymptotic theory of stochastic approximation.\nThe potential complexity introduced by the matrix gain is not of great concern in many cases, because of the dramatic acceleration in the rate of convergence. Moreover, the main contribution of this paper is not a single algorithm but a class of algorithms, wherein the computational complexity can be dealt with separately. For example, in a parameterized setting, the basis functions can be intelligently pruned via random projection [2].\nThere are many avenues for future research. It would be valuable to find an alternative to Assumption Q3 that is readily verified. 
Based on the ODE analysis, it seems likely that the conclusions of Theorem 2.2 hold without this additional assumption. No theory has been presented here for non-ideal parameterized settings. It is conjectured that conditions for stability of Zap Q(\u03bb)-learning will hold under general conditions. Consistency is a more challenging problem.\nIn terms of algorithm design, it is remarkable to see how well the scalar-gain algorithms perform, provided projection is employed and the ratio of largest to smallest real parts of the eigenvalues of A is not too large. It is possible to estimate the optimal scalar gain based on estimates of the matrix A that is central to this paper. How to do so without introducing high complexity is an open question.\nOn the other hand, the performance of RPJ averaging is unpredictable. In many experiments it is found that the asymptotic covariance is a poor indicator of finite-n performance. There are many suggestions in the literature for improving this technique. The results in this paper suggest new approaches that we hope will simultaneously\n\n(i) Reduce complexity and potential numerical instability of matrix inversion,\n(ii) Improve transient performance, and\n(iii) Maintain optimality of the asymptotic covariance.\n\nAcknowledgments: This research was supported by the National Science Foundation under grants EPCN-1609131 and CPS-1259040.\n\nReferences\n[1] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.\n\n[2] K. Barman and V. S. Borkar. A note on linear function approximation using random projections. Systems & Control Letters, 57(9):784\u2013786, 2008.\n\n[3] A. Benveniste, M. M\u00e9tivier, and P. 
Priouret. Adaptive algorithms and stochastic approxima-\ntions, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990.\nTranslated from the French by Stephen S. Wilson.\n\n[4] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book\n\nAgency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.\n\n[5] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation\nand reinforcement learning. SIAM J. Control Optim., 38(2):447\u2013469, 2000. (also presented at\nthe IEEE CDC, December, 1998).\n\n[6] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,\n\n49(2-3):233\u2013246, 2002.\n\n[7] D. Choi and B. Van Roy. A generalized Kalman \ufb01lter for \ufb01xed point approximation and\nef\ufb01cient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applica-\ntions, 16(2):207\u2013239, 2006.\n\n[8] A. M. Devraj and S. P. Meyn. Fastest Convergence for Q-learning. ArXiv e-prints, July 2017.\n\n[9] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning\n\nResearch, 5(Dec):1\u201325, 2003.\n\n[10] A. Givchi and M. Palhang. Quasi Newton temporal difference learning. In Asian Conference\n\non Machine Learning, pages 159\u2013172, 2015.\n\n[11] P. W. Glynn and D. Ormoneit. Hoeffding\u2019s inequality for uniformly ergodic Markov chains.\n\nStatistics and Probability Letters, 56:143\u2013146, 2002.\n\n[12] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approx-\n\nimation. Ann. Appl. Probab., 14(2):796\u2013819, 2004.\n\n[13] H. J. Kushner and G. G. Yin. Stochastic approximation algorithms and applications, volume 35\n\nof Applications of Mathematics (New York). Springer-Verlag, New York, 1997.\n\n[14] R. B. Lund, S. P. Meyn, and R. L. Tweedie. Computable exponential convergence rates for\n\nstochastically ordered Markov processes. Ann. Appl. 
Probab., 6(1):218\u2013237, 1996.\n\n[15] D.-J. Ma, A. M. Makowski, and A. Shwartz. Stochastic approximations for finite-state Markov\nchains. Stochastic Process. Appl., 35(1):27\u201345, 1990.\n\n[16] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin\u2019s minimum principle. In IEEE Conference\non Decision and Control, pages 3598\u20133605, Dec. 2009.\n\n[17] S. P. Meyn and R. L. Tweedie. Computable bounds for convergence rates of Markov chains.\nAnn. Appl. Probab., 4:981\u20131011, 1994.\n\n[18] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms\nfor machine learning. In Advances in Neural Information Processing Systems 24, pages 451\u2013459.\nCurran Associates, Inc., 2011.\n\n[19] Y. Pan, A. M. White, and M. White. Accelerated gradient temporal difference learning. In\nAAAI, pages 2464\u20132470, 2017.\n\n[20] D. Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral\nmethods. Electron. J. Probab., 20:32 pp., 2015.\n\n[21] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika (in\nRussian). Translated in Automat. Remote Control, 51 (1991), pages 98\u2013107, 1990.\n\n[22] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM\nJ. Control Optim., 30(4):838\u2013855, 1992.\n\n[23] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The\nAnnals of Statistics, 13(1):236\u2013245, 1985.\n\n[24] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical\nReport Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial\nEngineering, Ithaca, NY, 1988.\n\n[25] C. Szepesv\u00e1ri. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th\nInternational Conference on Neural Information Processing Systems, pages 1064\u20131070. MIT\nPress, 1997.\n\n[26] J. Tsitsiklis. 
Asynchronous stochastic approximation and Q-learning. Machine Learning,\n16:185\u2013202, 1994.\n\n[27] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory,\napproximation algorithms, and an application to pricing high-dimensional financial derivatives.\nIEEE Trans. Automat. Control, 44(10):1840\u20131851, 1999.\n\n[28] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King\u2019s College, Cambridge,\nCambridge, UK, 1989.\n\n[29] H. Yao, S. Bhatnagar, and C. Szepesv\u00e1ri. LMS-2: Towards an algorithm that is as cheap as\nLMS and almost as efficient as RLS. In Decision and Control, 2009 held jointly with the 2009\n28th Chinese Control Conference. CDC/CCC 2009. Proceedings of the 48th IEEE Conference\non, pages 1181\u20131188. IEEE, 2009.\n\n[30] H. Yao and Z.-Q. Liu. Preconditioned temporal difference learning. In Proceedings of the 25th\ninternational conference on Machine learning, pages 1208\u20131215. ACM, 2008.\n\n[31] H. Yu and D. P. Bertsekas. Q-learning and policy iteration algorithms for stochastic shortest\npath problems. Annals of Operations Research, 208(1):95\u2013132, 2013.\n", "award": [], "sourceid": 1323, "authors": [{"given_name": "Adithya M", "family_name": "Devraj", "institution": "University of Florida"}, {"given_name": "Sean", "family_name": "Meyn", "institution": "University of Florida"}]}