{"title": "Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1204, "page_last": 1212, "abstract": "We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD($\\lambda$), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al (2009a,b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms for any finite Markov decision process and any smooth value function approximator, under usual stochastic approximation conditions. The computational complexity per iteration scales linearly with the number of parameters of the approximator. The algorithms are incremental and are guaranteed to converge to locally optimal solutions.", "full_text": "Convergent Temporal-Difference Learning with\n\nArbitrary Smooth Function Approximation\n\nHamid R. Maei\n\nUniversity of Alberta\nEdmonton, AB, Canada\n\nCsaba Szepesv\u00b4ari\u2217\nUniversity of Alberta\nEdmonton, AB, Canada\n\nShalabh Bhatnagar\n\nIndian Institute of Science\n\nBangalore, India\n\nDoina Precup\n\nMcGill University\n\nMontreal, QC, Canada\n\nDavid Silver\n\nUniversity of Alberta,\nEdmonton, AB, Canada\n\nRichard S. 
Sutton\nUniversity of Alberta,\nEdmonton, AB, Canada\n\nAbstract\n\nWe introduce the \ufb01rst temporal-difference learning algorithms that converge with\nsmooth value function approximators, such as neural networks. Conventional\ntemporal-difference (TD) methods, such as TD(\u03bb), Q-learning and Sarsa have\nbeen used successfully with function approximation in many applications. How-\never, it is well known that off-policy sampling, as well as nonlinear function ap-\nproximation, can cause these algorithms to become unstable (i.e., the parameters\nof the approximator may diverge). Sutton et al. (2009a, 2009b) solved the prob-\nlem of off-policy learning with linear TD algorithms by introducing a new objec-\ntive function, related to the Bellman error, and algorithms that perform stochastic\ngradient-descent on this function. These methods can be viewed as natural gener-\nalizations to previous TD methods, as they converge to the same limit points when\nused with linear function approximation methods. We generalize this work to non-\nlinear function approximation. We present a Bellman error objective function and\ntwo gradient-descent TD algorithms that optimize it. We prove the asymptotic\nalmost-sure convergence of both algorithms, for any \ufb01nite Markov decision pro-\ncess and any smooth value function approximator, to a locally optimal solution.\nThe algorithms are incremental and the computational complexity per time step\nscales linearly with the number of parameters of the approximator. Empirical re-\nsults obtained in the game of Go demonstrate the algorithms\u2019 effectiveness.\n\nIntroduction\n\n1\nWe consider the problem of estimating the value function of a given stationary policy of a Markov\nDecision Process (MDP). 
This problem arises as a subroutine of generalized policy iteration and is generally thought to be an important step in developing algorithms that can learn good control policies in reinforcement learning (e.g., see Sutton & Barto, 1998). One widely used technique for value-function estimation is the TD(λ) algorithm (Sutton, 1988). A key property of the TD(λ) algorithm is that it can be combined with function approximators in order to generalize the observed data to unseen states. This generalization ability is crucial when the state space of the MDP is large or infinite (e.g., TD-Gammon, Tesauro, 1995; elevator dispatching, Crites & Barto, 1997; job-shop scheduling, Zhang & Dietterich, 1997). TD(λ) is known to converge when used with linear function approximators, if states are sampled according to the policy being evaluated, a scenario called on-policy learning (Tsitsiklis & Van Roy, 1997). However, the absence of either of these requirements can cause the parameters of the function approximator to diverge when trained with TD methods (e.g., Baird, 1995; Tsitsiklis & Van Roy, 1997; Boyan & Moore, 1995). The question of whether it is possible to create TD-style algorithms that are guaranteed to converge when used with nonlinear function approximation has remained open until now. Residual gradient algorithms (Baird, 1995) attempt to solve this problem by performing gradient descent on the Bellman error. However, unlike TD, these algorithms usually require two independent samples from each state. Moreover, even if two samples are provided, the solution to which they converge may not be desirable (Sutton et al., 2009b provides an example).

*On leave from MTA SZTAKI, Hungary.

In this paper we define the first TD algorithms that are stable when used with smooth nonlinear function approximators (such as neural networks).
Our starting point is the family of TD-style algo-\nrithms introduced recently by Sutton et al. (2009a, 2009b). Their goal was to address the instability\nof TD learning with linear function approximation, when the policy whose value function is sought\ndiffers from the policy used to generate the samples (a scenario called off-policy learning). These al-\ngorithms were designed to approximately follow the gradient of an objective function whose unique\noptimum is the \ufb01xed point of the original TD(0) algorithm. Here, we extend the ideas underlying\nthis family of algorithms to design TD-like algorithms which converge, under mild assumptions,\nalmost surely, with smooth nonlinear approximators. Under some technical conditions, the limit\npoints of the new algorithms correspond to the limit points of the original (not necessarily conver-\ngent) nonlinear TD algorithm. The algorithms are incremental, and the cost of each update is linear\nin the number of parameters of the function approximator, as in the original TD algorithm.\nOur development relies on three main ideas. First, we extend the objective function of Sutton et\nal. (2009b), in a natural way, to the nonlinear function approximation case. Second, we use the\nweight-duplication trick of Sutton et al. (2009a) to derive a stochastic gradient algorithm. Third,\nin order to implement the parameter update ef\ufb01ciently, we exploit a nice idea due to Pearlmutter\n(1994), allowing one to compute exactly the product of a vector and a Hessian matrix in linear\ntime. 
To overcome potential instability issues, we introduce a projection step in the weight update. The almost sure convergence of the algorithm then follows from standard two-time-scale stochastic approximation arguments.

In the rest of the paper, we first introduce the setting and our notation (Section 2), review previous relevant work (Section 3), introduce the algorithms (Section 4), analyze them (Section 5) and illustrate the algorithms' performance (Section 6).

2 Notation and Background

We consider policy evaluation in finite state and action Markov Decision Processes (MDPs).¹ An MDP is described by a 5-tuple (S, A, P, r, γ), where S is the finite state space, A is the finite action space, P = (P(s'|s, a))_{s,s'∈S, a∈A} are the transition probabilities (P(s'|s, a) ≥ 0, Σ_{s'∈S} P(s'|s, a) = 1, for all s ∈ S, a ∈ A), r = (r(s, a, s'))_{s,s'∈S, a∈A} are the real-valued immediate rewards and γ ∈ (0, 1) is the discount factor. The policy to be evaluated is a mapping π : S × A → [0, 1]. The value function of π, V^π : S → R, maps each state s to a number representing the infinite-horizon expected discounted return obtained if policy π is followed from state s. Formally, let s_0 = s and for t ≥ 0 let a_t ∼ π(s_t, ·), s_{t+1} ∼ P(·|s_t, a_t) and r_{t+1} = r(s_t, a_t, s_{t+1}). Then V^π(s) = E[Σ_{t=0}^∞ γ^t r_{t+1}]. Let R^π : S → R, with R^π(s) = Σ_{s'∈S} Σ_{a∈A} π(s, a)P(s'|s, a)r(s, a, s'), and let P^π : S × S → [0, 1] be defined as P^π(s, s') = Σ_{a∈A} π(s, a)P(s'|s, a). Assuming a canonical ordering on the elements of S, we can treat V^π and R^π as vectors in R^{|S|}, and P^π as a matrix in R^{|S|×|S|}.
It is well-known that V^π satisfies the so-called Bellman equation:

V^π = R^π + γ P^π V^π.

Defining the operator T^π : R^{|S|} → R^{|S|} as T^π V = R^π + γP^π V, the Bellman equation can be written compactly as V^π = T^π V^π. To simplify the notation, from now on we will drop the superscript π everywhere, since the policy to be evaluated will be kept fixed.

Assume that the policy to be evaluated is followed and that it gives rise to the trajectory (s_0, a_0, r_1, s_1, a_1, r_2, s_2, . . .). The problem is to estimate V, given a finite prefix of this trajectory. More generally, we may assume that we are given an infinite sequence of 3-tuples, (s_k, r_k, s'_k), that satisfies the following:

Assumption A1. (s_k)_{k≥0} is an S-valued stationary Markov process, s_k ∼ d(·), r_k = R(s_k) and s'_k ∼ P(s_k, ·).

¹Under appropriate technical conditions, our results can be generalized to MDPs with infinite state spaces, but we do not address this here.

We call (s_k, r_k, s'_k) the kth transition. Since we assume stationarity, we will sometimes drop the index k and use (s, r, s') to denote a random transition. Here d(·) denotes the probability distribution over initial states for a transition; let D ∈ R^{|S|×|S|} be the corresponding diagonal matrix. The problem is still to estimate V given a finite number of transitions.

When the state space is large (or infinite), a function approximation method can be used to facilitate the generalization of observed transitions to unvisited or rarely visited states. In this paper we focus on methods that are smoothly parameterized with a finite-dimensional parameter vector θ ∈ R^n.
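For a small MDP the Bellman equation can be solved directly, since it is linear: V = (I − γP^π)^{−1} R^π. A minimal numpy sketch, with a made-up three-state chain (the transition matrix and rewards below are illustrative values, not from the paper):

```python
import numpy as np

# Hypothetical 3-state chain: P_pi[s, s'] and R_pi[s] are illustration values.
gamma = 0.9
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0]])
R_pi = np.array([1.0, 0.0, 0.0])

# Solve V = R_pi + gamma * P_pi @ V  <=>  (I - gamma * P_pi) V = R_pi.
# I - gamma * P_pi is invertible because gamma < 1.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# V satisfies the Bellman equation up to numerical precision.
assert np.allclose(V, R_pi + gamma * P_pi @ V)
```

This exact solve is only feasible for small |S|; the function-approximation methods discussed next are for the large or infinite case.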
We denote by V_θ(s) the value of state s ∈ S returned by the function approximator with parameters θ. The goal of policy evaluation becomes to find θ such that V_θ ≈ V.

3 TD Algorithms with function approximation

The classical TD(0) algorithm with function approximation (Sutton, 1988; Sutton & Barto, 1998) starts with an arbitrary value of the parameters, θ_0. Upon observing the kth transition, it computes the scalar-valued temporal-difference error,

δ_k = r_k + γ V_{θ_k}(s'_k) − V_{θ_k}(s_k),

which is then used to update the parameter vector as follows:

θ_{k+1} ← θ_k + α_k δ_k ∇V_{θ_k}(s_k).   (1)

Here α_k is a deterministic positive step-size parameter, which is typically small, or (for the purpose of convergence analysis) is assumed to satisfy the Robbins-Monro conditions: Σ_{k=0}^∞ α_k = ∞, Σ_{k=0}^∞ α_k² < ∞. We denote by ∇V_θ(s) ∈ R^n the gradient of V w.r.t. θ at s.

When the TD algorithm converges, it must converge to a parameter value where, in expectation, the parameters do not change:

E[δ ∇V_θ(s)] = 0,   (2)

where s, δ are random and share the common distribution underlying (s_k, δ_k); in particular, (s, r, s') are drawn as in Assumption A1 and δ = r + γV_θ(s') − V_θ(s).

However, it is well known that TD(0) may not converge; the stability of the algorithm is affected both by the actual function approximator V_θ and by the way in which transitions are sampled. Sutton et al. (2009a, 2009b) tackled this problem in the case of linear function approximation, in which V_θ(s) = θ^⊤φ(s), where φ : S → R^n, but where transitions may be sampled in an off-policy manner.
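The TD(0) update (1) can be sketched as follows; the linear features, random-walk chain, and step size here are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.02
n_states, n = 5, 3

# Hypothetical fixed features phi(s); with the linear model V_theta(s) = theta . phi(s),
# the gradient grad_theta V_theta(s) is just phi(s).
Phi = rng.normal(size=(n_states, n))
theta = np.zeros(n)

def step(s):
    # Illustrative ring chain: move left/right uniformly; reward 1 only from the last state.
    s_next = (s + rng.choice([-1, 1])) % n_states
    r = 1.0 if s == n_states - 1 else 0.0
    return r, s_next

s = 0
for _ in range(5000):
    r, s_next = step(s)
    delta = r + gamma * Phi[s_next] @ theta - Phi[s] @ theta  # TD error
    theta = theta + alpha * delta * Phi[s]                    # update (1)
    s = s_next
```

Because sampling here is on-policy and the approximator is linear, this run stays stable; the paper's point is that off-policy sampling or nonlinear V_θ can break exactly this iteration.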
From now on we use the shorthand notation φ = φ(s), φ' = φ(s').

Sutton et al. (2009b) rely on an error function, called the mean-square projected Bellman error (MSPBE)², which has the same unique optimum as Equation (2). This function, which we denote J, projects the Bellman error measure, T V_θ − V_θ, onto the linear space M = {V_θ | θ ∈ R^n} with respect to the metric ‖·‖_D. Hence, ΠV = arg min_{V'∈M} ‖V' − V‖²_D, where ‖V‖_D is the weighted quadratic norm defined by ‖V‖²_D = Σ_{s∈S} d(s)V(s)², and the scalar TD(0) error for a given transition (s, r, s') is δ = r + γθ^⊤φ' − θ^⊤φ. More precisely:

J(θ) = ‖Π(T V_θ − V_θ)‖²_D = ‖Π T V_θ − V_θ‖²_D = E[δφ]^⊤ E[φφ^⊤]^{−1} E[δφ].   (3)

The negative gradient of the MSPBE objective function is:

−(1/2)∇J(θ) = E[(φ − γφ')φ^⊤ w] = E[δφ] − γ E[φ' φ^⊤] w,   (4)

where w = E[φφ^⊤]^{−1} E[δφ]. Note that δ depends on θ, hence w depends on θ. In order to develop an efficient (O(n)) stochastic gradient algorithm, Sutton et al. (2009a) use a weight-duplication trick. They introduce a new set of weights, w_k, whose purpose is to estimate w for a fixed value of the θ parameter. These weights are updated on a "fast" timescale, as follows:

w_{k+1} = w_k + β_k (δ_k − φ_k^⊤ w_k) φ_k.   (5)

The parameter vector θ_k is updated on a "slower" timescale. Two update rules can be obtained, based on two slightly different calculations:

θ_{k+1} = θ_k + α_k (φ_k − γφ'_k)(φ_k^⊤ w_k)   (an algorithm called GTD2),   (6)
θ_{k+1} = θ_k + α_k δ_k φ_k − α_k γ φ'_k (φ_k^⊤ w_k)   (an algorithm called TDC).   (7)

²This error function was also described in (Antos et al., 2008), although the algorithmic issue of how to minimize it is not pursued there. Algorithmic issues in a batch setting are considered by Farahmand et al. (2009), who also study regularization.

4 Nonlinear Temporal Difference Learning

Our goal is to generalize this approach to the case in which V_θ is a smooth, nonlinear function approximator. The first step is to find a good objective function on which to do gradient descent. In the linear case, MSPBE was chosen as a projection of the Bellman error on a natural hyperplane: the subspace to which V_θ is restricted. However, in the nonlinear case, the value function is no longer restricted to a plane, but can move on a nonlinear surface. More precisely, assuming that V_θ is a differentiable function of θ, M = {V_θ ∈ R^{|S|} | θ ∈ R^n} becomes a differentiable submanifold of R^{|S|}. Projecting onto a nonlinear manifold is not computationally feasible; to get around this problem, we will assume that the parameter vector θ changes very little in one step (given that learning rates are usually small); in this case, the surface is locally close to linear, and we can project onto the tangent plane at the given point. We now detail this approach and show that this is indeed a good objective function.

The tangent plane PM_θ of M at θ is the hyperplane of R^{|S|} that (i) passes through V_θ and (ii) is orthogonal to the normal of M at θ.
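Before the nonlinear construction, the linear-case two-timescale updates (5)-(7) can be sketched as follows (the chain, features, and step sizes are made-up illustration values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, alpha, beta = 0.9, 0.005, 0.05
n_states, n = 5, 3
Phi = rng.normal(size=(n_states, n))  # hypothetical feature matrix

theta = np.zeros(n)  # slow-timescale weights
w = np.zeros(n)      # fast-timescale weights estimating E[phi phi^T]^{-1} E[delta phi]

def transition(s):
    # Illustrative on-policy ring chain with a single rewarding state.
    s_next = (s + rng.choice([-1, 1])) % n_states
    r = 1.0 if s == n_states - 1 else 0.0
    return r, s_next

s = 0
for _ in range(10000):
    r, s_next = transition(s)
    phi, phi_next = Phi[s], Phi[s_next]
    delta = r + gamma * phi_next @ theta - phi @ theta
    w = w + beta * (delta - phi @ w) * phi                         # fast update, Eq. (5)
    # TDC update, Eq. (7); the GTD2 variant, Eq. (6), would instead be:
    #   theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
    s = s_next
```

The two step sizes implement the two timescales: β is larger than α, so w tracks its target for the current, slowly moving θ.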
The tangent space TM_θ is the translation of PM_θ to the origin. Note that TM_θ = {Φ_θ a | a ∈ R^n}, where Φ_θ ∈ R^{|S|×n} is defined by (Φ_θ)_{s,i} = (∂/∂θ_i) V_θ(s). Let Π_θ be the projection that projects vectors of (R^{|S|}, ‖·‖_D) to TM_θ. If Φ_θ^⊤ D Φ_θ is non-singular, then Π_θ can be written as:

Π_θ = Φ_θ (Φ_θ^⊤ D Φ_θ)^{−1} Φ_θ^⊤ D.   (8)

The objective function that we will optimize is:

J(θ) = ‖Π_θ (T V_θ − V_θ)‖²_D.   (9)

This is a natural generalization of the objective function defined by (3), as the plane on which we project is parallel to the tangent plane at θ. More precisely, let Υ_θ be the projection to PM_θ and let Π_θ be the projection to TM_θ. Because the two hyperplanes are parallel, for any V ∈ R^{|S|}, Υ_θ V − V_θ = Π_θ (V − V_θ). In other words, projecting onto the tangent space gives exactly the same distance as projecting onto the tangent plane, while being mathematically more convenient. Figure 1 illustrates this objective function visually.

Figure 1: The MSPBE objective for nonlinear function approximation at two points in the value function space. The figure shows a point, V_θ, at which J(θ) is not 0, and a point, V_θ*, where J(θ*) = 0; thus Υ_θ* T V_θ* = V_θ*, so this is a TD(0) solution.

We now show that J(θ) can be re-written in the same way as done in (Sutton et al., 2009b).

Lemma 1.
Assume V_θ(s_0) is continuously differentiable as a function of θ, for any s_0 ∈ S s.t. d(s_0) > 0. Let (s, δ) be jointly distributed random variables as in Section 3 and assume that E[∇V_θ(s)∇V_θ(s)^⊤] is nonsingular. Then

J(θ) = E[δ ∇V_θ(s)]^⊤ E[∇V_θ(s)∇V_θ(s)^⊤]^{−1} E[δ ∇V_θ(s)].   (10)

Proof. The identity is obtained similarly to Sutton et al. (2009b), except that here Π_θ is expressed by (8). Details are omitted for brevity.

Note that the assumption that E[∇V_θ(s)∇V_θ(s)^⊤] is non-singular is akin to the assumption that the feature vectors are independent in the linear function approximation case. We make this assumption here for convenience; it can be lifted, but the proofs become more involved.

Corollary 1. Under the conditions of Lemma 1, J(θ) = 0 if and only if V_θ satisfies (2).

This is an important corollary, because it shows that the global optima of the proposed objective function will not modify the set of solutions that the usual TD(0) algorithm would find (if it would indeed converge). We now proceed to compute the gradient of this objective.

Theorem 1. Assume that (i) V_θ(s_0) is twice continuously differentiable in θ for any s_0 ∈ S s.t. d(s_0) > 0 and (ii) W(·), defined by W(θ̂) = E[∇V_θ̂ ∇V_θ̂^⊤], is non-singular in a small neighborhood of θ. Let (s, δ) be jointly distributed random variables as in Section 3. Let φ ≡ ∇V_θ(s), φ' ≡ ∇V_θ(s') and

h(θ, u) = −E[(δ − φ^⊤ u) ∇²V_θ(s) u],   (11)

where u ∈ R^n.
Then, with h(θ, w) = −E[(δ − φ^⊤ w) ∇²V_θ(s) w] as in (11),

−(1/2)∇J(θ) = −E[(γφ' − φ)φ^⊤ w] + h(θ, w) = E[δφ] − γ E[φ' φ^⊤ w] + h(θ, w),   (12)

where w = E[φφ^⊤]^{−1} E[δφ].

The main difference between Equation (12) and Equation (4), which shows the gradient for the linear case, is the appearance of the term h(θ, w), which involves second-order derivatives of V_θ (which are zero when V_θ is linear in θ).

Proof. The conditions of Lemma 1 are satisfied, so (10) holds. Denote ∂_i = ∂/∂θ_i. From its definition and the assumptions, W(u) is a symmetric, positive definite matrix, so (d/du)(W^{−1})|_{u=θ} = −W^{−1}(θ) ((d/du)W|_{u=θ}) W^{−1}(θ), where we use the assumption that (d/du)W exists at θ and W^{−1} exists in a small neighborhood of θ. From this identity, we have:

−(1/2)[∇J(θ)]_i = −(∂_i E[δφ])^⊤ E[φφ^⊤]^{−1} E[δφ] − (1/2) E[δφ]^⊤ ∂_i(E[φφ^⊤]^{−1}) E[δφ]
= −(∂_i E[δφ])^⊤ E[φφ^⊤]^{−1} E[δφ] + (1/2) E[δφ]^⊤ E[φφ^⊤]^{−1} (∂_i E[φφ^⊤]) E[φφ^⊤]^{−1} E[δφ]
= −E[∂_i(δφ)]^⊤ (E[φφ^⊤]^{−1} E[δφ]) + (1/2) (E[φφ^⊤]^{−1} E[δφ])^⊤ E[∂_i(φφ^⊤)] (E[φφ^⊤]^{−1} E[δφ]).

The interchange between the gradient and expectation is possible here because of assumptions (i) and (ii) and the fact that S is finite.
Now consider the identity

(1/2) x^⊤ ∂_i(φφ^⊤) x = (φ^⊤ x) (∂_i φ)^⊤ x,

which holds for any vector x ∈ R^n. Hence, using the definition of w,

−(1/2)[∇J(θ)]_i = −E[∂_i(δφ)]^⊤ w + (1/2) w^⊤ E[∂_i(φφ^⊤)] w
= −E[(∂_i δ)(φ^⊤ w)] − E[δ (∂_i φ)^⊤ w] + E[(φ^⊤ w)(∂_i φ)^⊤ w].

Using ∇δ = γφ' − φ and ∇(φ^⊤) = ∇²V_θ(s), we get

−(1/2)∇J(θ) = −E[(γφ' − φ)(φ^⊤ w)] − E[(δ − φ^⊤ w) ∇²V_θ(s) w].

Finally, observe that

E[(φ − γφ')(φ^⊤ w)] = E[(φ − γφ')φ^⊤] (E[φφ^⊤]^{−1} E[δφ]) = E[δφ] − γ E[φ' φ^⊤] (E[φφ^⊤]^{−1} E[δφ]) = E[δφ] − γ E[φ' φ^⊤ w],

which concludes the proof.

Theorem 1 suggests straightforward generalizations of GTD2 and TDC (cf. Equations (6) and (7)) to the nonlinear case.
Weight w_k is updated as before on a "faster" timescale:

w_{k+1} = w_k + β_k (δ_k − φ_k^⊤ w_k) φ_k.   (13)

The parameter vector θ_k is updated on a "slower" timescale, either according to

θ_{k+1} = Γ(θ_k + α_k [(φ_k − γφ'_k)(φ_k^⊤ w_k) − h_k]),   (non-linear GTD2)   (14)

where

h_k = (δ_k − φ_k^⊤ w_k) ∇²V_{θ_k}(s_k) w_k.   (16)

Besides h_k, the only new ingredient compared to the linear case is Γ : R^n → R^n, a mapping that projects its argument into an appropriately chosen compact set C with a smooth boundary. The purpose of this projection is to prevent the parameters from diverging in the initial phase of the algorithm, which could happen due to the presence of the nonlinearities in the algorithm. Projection is a common technique for stabilizing the transient behavior of stochastic approximation algorithms (see, e.g., Kushner & Yin, 2003). In practice, if one selects C large enough so that it contains the set of possible solutions U = {θ | E[δ ∇V_θ(s)] = 0} (by using known bounds on the size of the rewards and on the derivative of the value function), it is very likely that no projections will take place at all during the execution of the algorithm. We expect this to happen frequently in practice: the main reason for the projection is to facilitate the convergence analysis.

Let us now analyze the computational complexity per update. Assume that V_θ(s) and its gradient can each be computed in O(n) time, the usual case for approximators of interest (e.g., neural networks). Equation (16) also requires computing the product of the Hessian of V_θ(s) and w. Pearlmutter (1994) showed that this can be computed exactly in O(n) time. The key is to note that ∇²V_{θ_k}(s_k) w_k = ∇(∇V_{θ_k}(s_k)^⊤ w_k), because w_k does not depend on θ_k.
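Pearlmutter's method computes ∇²V_θ(s) w exactly via the identity above; the sketch below, lacking automatic differentiation, approximates the same Hessian-vector product with a central finite difference of the gradient, which likewise needs only two O(n) gradient evaluations and never forms the n × n Hessian (the quadratic value function here is a made-up test case, not the paper's approximator):

```python
import numpy as np

# Hypothetical smooth approximator: V_theta(s) = (theta . phi)^2, with known gradient.
phi = np.array([1.0, -2.0, 0.5])

def grad_V(theta):
    # Gradient of (theta . phi)^2 w.r.t. theta.
    return 2.0 * (theta @ phi) * phi

def hessian_vector(theta, w, eps=1e-5):
    # H w  ~  (grad_V(theta + eps*w) - grad_V(theta - eps*w)) / (2*eps):
    # two gradient evaluations, each O(n); the Hessian is never materialized.
    return (grad_V(theta + eps * w) - grad_V(theta - eps * w)) / (2.0 * eps)

theta = np.array([0.3, 0.1, -0.2])
w = np.array([1.0, 2.0, 3.0])
hv = hessian_vector(theta, w)

# For this V the exact Hessian is 2 * phi phi^T, so H w = 2 (phi . w) phi.
assert np.allclose(hv, 2.0 * (phi @ w) * phi, atol=1e-6)
```

With an autodiff framework, the same quantity comes out exactly as the gradient of the scalar ∇V_θ^⊤ w, which is Pearlmutter's construction.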
The scalar term ∇V_{θ_k}(s_k)^⊤ w_k can be computed in O(n) and its gradient, which is a vector, can also be computed in O(n). Hence, the computation time per update for the proposed algorithms is linear in the number of parameters of the function approximator (just like in TD(0)). The second update rule, non-linear TDC, replaces (14) with

θ_{k+1} = Γ(θ_k + α_k [δ_k φ_k − γφ'_k (φ_k^⊤ w_k) − h_k]).   (non-linear TDC)   (15)

5 Convergence Analysis

Given the compact set C ⊂ R^n, let C(C) be the space of C → R^n continuous functions. Given the projection Γ onto C, let the operator Γ̂ : C(C) → C(R^n) be

Γ̂v(θ) = lim_{0<ε→0} [Γ(θ + ε v(θ)) − θ] / ε.

By assumption, Γ(θ) = arg min_{θ'∈C} ‖θ' − θ‖ and the boundary of C is smooth, so Γ̂ is well defined. In particular, Γ̂v(θ) = v(θ) when θ ∈ C°; otherwise, if θ ∈ ∂C, Γ̂v(θ) is the projection of v(θ) to the tangent space of ∂C at θ. Consider the following ODE:

θ̇ = Γ̂(−(1/2)∇J)(θ),   θ(0) ∈ C.   (17)

Let K be the set of all asymptotically stable equilibria of (17). By the definitions, K ⊂ C. Furthermore, U ∩ C ⊂ K.

The next theorem shows that under some technical conditions, the iterates produced by nonlinear GTD2 converge to K with probability one.

Theorem 2 (Convergence of nonlinear GTD2). Let (s_k, r_k, s'_k)_{k≥0} be a sequence of transitions that satisfies A1. Consider the nonlinear GTD2 updates (13), (14), with positive step-size sequences that satisfy Σ_{k=0}^∞ α_k = Σ_{k=0}^∞ β_k = ∞, Σ_{k=0}^∞ α_k² < ∞, Σ_{k=0}^∞ β_k² < ∞ and α_k/β_k → 0 as k → ∞. Assume that for any θ ∈ C and s_0 ∈ S s.t. 
d(s_0) > 0, V_θ(s_0) is three times continuously differentiable. Further assume that for each θ ∈ C, E[φ_θ φ_θ^⊤] is nonsingular. Then θ_k → K, with probability one, as k → ∞.

Proof. Let (s, r, s') be a random transition. Let φ_θ = ∇V_θ(s), φ'_θ = ∇V_θ(s'), φ_k = ∇V_{θ_k}(s_k), and φ'_k = ∇V_{θ_k}(s'_k). We begin by rewriting the updates (13)-(14) as follows:

w_{k+1} = w_k + β_k (f(θ_k, w_k) + M_{k+1}),   (18)
θ_{k+1} = Γ(θ_k + α_k (g(θ_k, w_k) + N_{k+1})),   (19)

where

f(θ_k, w_k) = E[δ_k φ_k | θ_k] − E[φ_k φ_k^⊤ | θ_k] w_k,   M_{k+1} = (δ_k − φ_k^⊤ w_k) φ_k − f(θ_k, w_k),
g(θ_k, w_k) = E[(φ_k − γφ'_k)(φ_k^⊤ w_k) − h_k | θ_k, w_k],   N_{k+1} = ((φ_k − γφ'_k)(φ_k^⊤ w_k) − h_k) − g(θ_k, w_k).

We need to verify that there exists a compact set B ⊂ R^{2n} such that (a) the functions f(θ, w), g(θ, w) are Lipschitz continuous over B; (b) (M_k, G_k), (N_k, G_k), k ≥ 1 are martingale difference sequences, where G_k = σ(r_i, θ_i, w_i, s_i, i ≤ k; s'_i, i < k), k ≥ 1 are increasing sigma fields; (c) {(w_k(θ), θ)}, with w_k(θ) obtained as δ_k(θ) = r_k + γV_θ(s'_k) − V_θ(s_k), φ_k(θ) = ∇V_θ(s_k),

w_{k+1}(θ) = w_k(θ) + β_k (δ_k(θ) − φ_k(θ)^⊤ w_k(θ)) φ_k(θ),

almost surely stays in B for any choice of (w_0(θ), θ) ∈ B; and (d) {(w, θ_k)} almost surely stays in B for any choice of (w, θ_0) ∈ B.
From these and the conditions on the step-sizes, using standard arguments (cf. Theorem 2 of Sutton et al. (2009b)), it follows that θ_k converges almost surely to the set of asymptotically stable equilibria of θ̇ = Γ̂F(θ), θ(0) ∈ C, where F(θ) = g(θ, w_θ). Here, for θ ∈ C fixed, w_θ is the (unique) equilibrium point of

ẇ = E[δ_θ φ_θ] − E[φ_θ φ_θ^⊤] w,   (20)

where δ_θ = r + γV_θ(s') − V_θ(s). Clearly, w_θ = E[φ_θ φ_θ^⊤]^{−1} E[δ_θ φ_θ], which exists by assumption. Then by Theorem 1 it follows that F(θ) = −(1/2)∇J(θ). Hence, the statement will follow once (a)-(d) are verified.

Note that (a) is satisfied because V_θ is three times continuously differentiable. For (b), we need to verify that for any k ≥ 0, E[M_{k+1} | G_k] = 0 and E[N_{k+1} | G_k] = 0, which in fact follow from the definitions. Condition (c) follows since, by a standard argument (e.g., Borkar & Meyn, 2000), w_k(θ) converges to w_θ, which by assumption stays bounded if θ comes from a bounded set. For condition (d), note that {θ_k} is uniformly bounded since for any k ≥ 0, θ_k ∈ C, and by assumption C is a compact set.

Theorem 3 (Convergence of nonlinear TDC). Under the same conditions as in Theorem 2, the iterates computed via (13), (15) satisfy θ_k → K, with probability one, as k → ∞.

The proof follows in a similar manner as that of Theorem 2 and is omitted for brevity.

6 Empirical results

To illustrate the convergence properties of the algorithms, we applied them to the "spiral" counterexample of Tsitsiklis & Van Roy (1997), originally used to show the divergence of TD(0) with nonlinear function approximation. The Markov chain with 3 states is shown in the left panel of Figure 2. The reward is always zero and the discount factor is γ = 0.9.
The value function has a single parameter, θ, and takes the nonlinear spiral form

V_θ(s) = (a(s) cos(λ̂θ) − b(s) sin(λ̂θ)) e^{εθ}.

The true value function is V = (0, 0, 0)^⊤, which is achieved as θ → −∞. Here we used V_0 = (100, −70, −30)^⊤, a = V_0, b = (23.094, −98.15, 75.056)^⊤, λ̂ = 0.866 and ε = 0.05. Note that this is a degenerate example, in which our theorems do not apply, because the optimal parameter values are infinite. Hence, we run our algorithms without a projection step. We also use constant learning rates, in order to facilitate gradient descent through an error surface which is essentially flat. For TDC we used α = 0.5, β = 0.05, and for GTD2, α = 0.8 and β = 0.1. For TD(0) we used α = 2 × 10^{−3} (as argued by Tsitsiklis & Van Roy (1997), tuning the step-size does not help with the divergence problem). All step sizes are then normalized by V'_θ^⊤ D V'_θ, where V'_θ = (d/dθ)V_θ. The graph shows the performance measure, √J, as a function of the number of updates (we used expected updates for all algorithms). GTD2 and TDC converge to the correct solution, while TD(0) diverges. We note that convergence happens despite the fact that this example is outside the scope of the theory.

To assess the performance of the new algorithms on a large-scale problem, we used them to learn an evaluation function in 9x9 computer Go. We used a version of RLGO (Silver, 2009), in which a logistic function is fit to evaluate the probability of winning from a given position. Positions were described using 969,894 binary features corresponding to all possible shapes in every 3x3, 2x2, and 1x1 region of the board. Using weight sharing to take advantage of symmetries, the million features were reduced to a parameter vector of n = 63,303 components.
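The spiral construction above is easy to instantiate numerically; the sketch below just plugs in the stated constants and checks that θ = 0 recovers V_0 = a and that V_θ decays toward (0, 0, 0)^⊤ as θ → −∞:

```python
import numpy as np

a = np.array([100.0, -70.0, -30.0])    # a = V_0
b = np.array([23.094, -98.15, 75.056])
lam, eps = 0.866, 0.05                  # lambda-hat and epsilon from the example

def V(theta):
    # Spiral value function: oscillates with theta while the envelope e^(eps*theta)
    # shrinks to 0 as theta -> -inf (the true value function).
    return (a * np.cos(lam * theta) - b * np.sin(lam * theta)) * np.exp(eps * theta)

assert np.allclose(V(0.0), a)                              # theta = 0 gives V_0
assert np.linalg.norm(V(-200.0)) < np.linalg.norm(V(0.0))  # decay toward (0, 0, 0)
```

This only instantiates the value-function family; reproducing the divergence of TD(0) on it would additionally require the 3-state chain's expected updates.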
Experience was generated by self-play, with actions chosen uniformly randomly among the legal moves. All rewards were zero, except upon winning the game, when the reward was 1. We applied four algorithms to this problem: TD(0), the proposed algorithms (GTD2 and TDC) and residual gradient (RG). In the experiments, RG was run with only one sample³.

Figure 2: Empirical evaluation results. Left panel: √J as a function of time step on the example MDP of Tsitsiklis & Van Roy (1997). Right panel: RMSE over a range of learning rates α in 9x9 computer Go.

In each run, θ was initialized to random values uniformly distributed in [−0.1, 0.1]; for GTD2 and TDC, the second parameter vector, w, was initialized to 0. Training then proceeded for 5000 complete games, after which θ was frozen. This problem is too large to compute the objective function J. Instead, to assess the quality of the solutions obtained, we estimated the average prediction error of each algorithm. More precisely, we generated 2500 test games; for every state occurring in a game, we computed the squared error between its predicted value and the actual return that was obtained in that game. We then computed the root of the mean-squared error, averaged over all time steps. The right panel in Figure 2 plots this measure over a range of values of the learning rate α. The results are averages over 50 independent runs. For TDC and GTD2 we used several values of the β parameter, which generate the different curves. As was noted in previous empirical work, TD provides slightly better estimates than the RG algorithm. TDC's performance is very similar to TD, for a wide range of parameter values. GTD2 is slightly worse.
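The evaluation protocol just described, squared error between each visited state's predicted value and the return actually obtained in that test game, followed by the root of the mean over all time steps, can be sketched as follows. The episode representation and the `predict` argument are illustrative assumptions, not RLGO code.

```python
import numpy as np

def rmse_over_games(games, predict):
    """games: list of (states, final_reward) pairs for undiscounted
    win/loss episodes; predict(s) is the learned value estimate.
    Since all intermediate rewards are zero, the return from every
    state in an episode is simply the episode's final reward."""
    sq_errors = []
    for states, final_reward in games:
        for s in states:
            sq_errors.append((predict(s) - final_reward) ** 2)
    return np.sqrt(np.mean(sq_errors))

# Tiny illustrative check: a constant predictor of 0.5 on
# win (reward 1) / loss (reward 0) games has RMSE 0.5.
games = [([0, 1, 2], 1.0), ([3, 4], 0.0)]
print(rmse_over_games(games, lambda s: 0.5))   # prints 0.5
```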
These results are very similar in flavor to those obtained in Sutton et al. (2009b) using the same domain, but with linear function approximation.
7 Conclusions and future work
In this paper, we solved a long-standing open problem in reinforcement learning, by establishing a family of temporal-difference learning algorithms that converge with arbitrary differentiable function approximators (including neural networks). The algorithms perform gradient descent on a natural objective function, the projected Bellman error. The local optima of this function coincide with solutions that could be obtained by TD(0). Of course, TD(0) need not converge with nonlinear function approximation. Our algorithms are on-line, incremental, and their computational cost per update is linear in the number of parameters. Our theoretical results guarantee convergence to a local optimum, under standard technical assumptions. Local optimality is the best one can hope for, since nonlinear function approximation creates non-convex optimization problems. The early empirical results obtained for computer Go are very promising. However, more practical experience with these algorithms is needed. We are currently working on extensions of these algorithms using eligibility traces, and on using them for solving control problems.
Acknowledgments
This research was supported in part by NSERC, iCore, AICML and AIF. We thank the three anonymous reviewers for their useful comments on previous drafts of this paper.

³Unlike TD, RG would require two independent transition samples from a given state. This requires knowledge of the model of the environment, which is not always available. In the experiments, only one transition sample was used, following Baird's original recommendation.

References
Antos, A., Szepesvári, Cs. & Munos, R. (2008).
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning 71: 89–129.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.
Borkar, V. S. & Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2): 447–469.
Boyan, J. A. & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, pp. 369–376. MIT Press.
Crites, R. H. & Barto, A. G. (1995). Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems 8, pp. 1017–1023. MIT Press.
Farahmand, A. M., Ghavamzadeh, M., Szepesvári, Cs. & Mannor, S. (2009). Regularized policy iteration. In Advances in Neural Information Processing Systems 21, pp. 441–448.
Kushner, H. J. & Yin, G. G. (2003). Stochastic Approximation Algorithms and Applications. Second edition, Springer-Verlag.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation 6(1): 147–160.
Silver, D. (2009). Reinforcement Learning and Simulation-Based Search in Computer Go. Ph.D. thesis, University of Alberta.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning 3: 9–44.
Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Szepesvári, Cs. & Maei, H. R. (2009a). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21, pp. 1609–1616. MIT Press.
Sutton, R. S., Maei, H.
R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, Cs. & Wiewiora, E. (2009b). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, pp. 993–1000. Omnipress.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning 8: 257–277.
Tsitsiklis, J. N. & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42: 674–690.
Zhang, W. & Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1114–1120. AAAI Press.