{"title": "Temporal Difference Learning in Continuous Time and Space", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1079, "abstract": "", "full_text": "Temporal Difference Learning in \n\nContinuous Time and Space \n\nKenji Doya \n\ndoya~hip.atr.co.jp \n\nATR Human Information Processing Research Laboratories \n2-2 Hikaridai, Seika.-cho, Soraku-gun, Kyoto 619-02, Japan \n\nAbstract \n\nA continuous-time, continuous-state version of the temporal differ(cid:173)\nence (TD) algorithm is derived in order to facilitate the application \nof reinforcement learning to real-world control tasks and neurobi(cid:173)\nological modeling. An optimal nonlinear feedback control law was \nalso derived using the derivatives of the value function. The per(cid:173)\nformance of the algorithms was tested in a task of swinging up a \npendulum with limited torque. Both the \"critic\" that specifies the \npaths to the upright position and the \"actor\" that works as a non(cid:173)\nlinear feedback controller were successfully implemented by radial \nbasis function (RBF) networks. \n\n1 \n\nINTRODUCTION \n\nThe temporal-difference (TD) algorithm (Sutton, 1988) for delayed reinforcement \nlearning has been applied to a variety of tasks, such as robot navigation, board \ngames, and biological modeling (Houk et al., 1994). Elucidation of the relationship \nbetween TD learning and dynamic programming (DP) has provided good theoretical \ninsights (Barto et al., 1995). However, conventional TD algorithms were based on \ndiscrete-time, discrete-state formulations. In applying these algorithms to control \nproblems, time, space and action had to be appropriately discretized using a priori \nknowledge or by trial and error. Furthermore, when a TD algorithm is used for \nneurobiological modeling, discrete-time operation is often very unnatural. \nThere have been several attempts to extend TD-like algorithms to continuous cases. \nBradtke et al. 
(1994) showed convergence results for DP-based algorithms for a discrete-time, continuous-state linear system with a quadratic cost. Bradtke and Duff (1995) derived TD-like algorithms for continuous-time, discrete-state systems (semi-Markov decision problems). Baird (1993) proposed the \"advantage updating\" algorithm by modifying Q-learning so that it works with arbitrarily small time steps. \n\nIn this paper, we derive a TD learning algorithm for continuous-time, continuous-state, nonlinear control problems. The correspondence of the continuous-time version to the conventional discrete-time version is also shown. The performance of the algorithm was tested in a nonlinear control task of swinging up a pendulum with limited torque. \n\n2 CONTINUOUS-TIME TD LEARNING \n\nWe consider a continuous-time dynamical system (plant) \n\ndx(t)/dt = f(x(t), u(t)), (1) \n\nwhere x ∈ X ⊂ R^n is the state and u ∈ U ⊂ R^m is the control input (action). We denote the immediate reinforcement (evaluation) for the state and the action as \n\nr(t) = r(x(t), u(t)). (2) \n\nOur goal is to find a feedback control law (policy) \n\nu(t) = μ(x(t)) (3) \n\nthat maximizes the expected reinforcement for a certain period in the future. To be specific, for a given control law μ, we define the \"value\" of the state x(t) as \n\nV^μ(x(t)) = ∫_t^∞ (1/τ) e^{−(s−t)/τ} r(x(s), u(s)) ds, (4) \n\nwhere x(s) and u(s) (t < s < ∞) follow the system dynamics (1) and the control law (3). Our problem now is to find an optimal control law μ* that maximizes V^μ(x) for any state x ∈ X. Note that τ is the time scale of \"imminence-weighting\" and the scaling factor 1/τ is used for normalization, i.e., ∫_t^∞ (1/τ) e^{−(s−t)/τ} ds = 1. \n\n2.1 TD ERROR \n\nThe basic idea in TD learning is to predict future reinforcement in an on-line manner. We first derive a local consistency condition for the value function V^μ(x). 
By differentiating (4) with respect to t, we have \n\nτ (d/dt) V^μ(x(t)) = V^μ(x(t)) − r(t). (5) \n\nLet P(t) be the prediction of the value function V^μ(x(t)) from x(t) (output of the \"critic\"). If the prediction is perfect, it should satisfy τ dP(t)/dt = P(t) − r(t). If this is not satisfied, the prediction should be adjusted to decrease the inconsistency \n\nr̂(t) = r(t) − P(t) + τ dP(t)/dt. (6) \n\nThis is a continuous version of the temporal difference error. \n\n2.2 EULER DIFFERENTIATION: TD(0) \n\nThe relationship between the above continuous-time TD error and the discrete-time TD error (Sutton, 1988) \n\nr̂(t) = r(t) + γP(t) − P(t − Δt) (7) \n\ncan be easily seen by a backward Euler approximation of dP(t)/dt. By substituting dP(t)/dt = (P(t) − P(t − Δt))/Δt into (6), we have \n\nr̂(t) = r(t) + (τ/Δt) [(1 − Δt/τ) P(t) − P(t − Δt)]. \n\nThis coincides with (7) if we take the \"discount factor\" γ = 1 − Δt/τ ≈ e^{−Δt/τ}, except for the scaling factor τ/Δt. \nNow let us consider the case when the prediction of the value function is given by \n\nP(t) = Σ_i v_i b_i(x(t)), (8) \n\nwhere b_i(·) are basis functions (e.g., sigmoid, Gaussian, etc.) and v_i are the weights. The gradient descent on the squared TD error is given by \n\nΔv_i ∝ −∂r̂²(t)/∂v_i ∝ −r̂(t) [(1 − Δt/τ) ∂P(t)/∂v_i − ∂P(t − Δt)/∂v_i]. \n\nIn order to \"back up\" the information about the future reinforcement to correct the prediction in the past, we should modify P(t − Δt) rather than P(t) in the above formula. This results in the learning rule \n\nΔv_i ∝ r̂(t) ∂P(t − Δt)/∂v_i = r̂(t) b_i(x(t − Δt)). (9) \n\nThis is equivalent to the TD(0) algorithm that uses the \"eligibility trace\" from the previous time step. \n\n2.3 SMOOTH DIFFERENTIATION: TD(λ) \n\nThe Euler approximation of a time derivative is susceptible to noise (e.g., when we use stochastic control for exploration). 
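As a concrete illustration (not part of the original paper), the Euler-discretized rule (9) can be sketched in Python; the Gaussian basis functions, learning rate, and one-dimensional state used here are illustrative assumptions:

```python
import numpy as np

def gaussian_basis(x, centers, width):
    # b_i(x): fixed Gaussian basis functions over a 1-D state space
    return np.exp(-((x - centers) ** 2) / (2.0 * width ** 2))

def td0_step(v, x_prev, x_now, r_now, centers, width, tau, dt, eta):
    # Euler TD error: r_hat = r + (tau/dt) * [(1 - dt/tau) P(t) - P(t - dt)]
    b_prev = gaussian_basis(x_prev, centers, width)
    b_now = gaussian_basis(x_now, centers, width)
    p_prev = float(v @ b_prev)
    p_now = float(v @ b_now)
    r_hat = r_now + (tau / dt) * ((1.0 - dt / tau) * p_now - p_prev)
    # rule (9): update the weights through the basis activity at t - dt
    v_new = v + eta * r_hat * b_prev
    return v_new, r_hat
```

The update credits the *previous* feature vector, matching the eligibility-trace reading of rule (9).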
Alternatively, we can use a \"smooth\" differentiation algorithm that uses a weighted average of the past input, such as \n\ndP(t)/dt ≈ (P(t) − P̄(t))/τ_c, where τ_c (d/dt) P̄(t) = P(t) − P̄(t) \n\nand τ_c is the time constant of the differentiation. The corresponding gradient descent algorithm is \n\nΔv_i ∝ −∂r̂²(t)/∂v_i ∝ r̂(t) ∂P̄(t)/∂v_i = r̂(t) b̄_i(t), (10) \n\nwhere b̄_i is the eligibility trace for the weight: \n\nτ_c (d/dt) b̄_i(t) = b_i(x(t)) − b̄_i(t). (11) \n\nNote that this is equivalent to the TD(λ) algorithm (Sutton, 1988) with λ = 1 − Δt/τ_c if we discretize the above equation with time step Δt. \n\n3 OPTIMAL CONTROL BY VALUE GRADIENT \n\n3.1 HJB EQUATION \n\nThe value function V* for an optimal control μ* is defined as \n\nV*(x(t)) = max_{u[t,∞)} [∫_t^∞ (1/τ) e^{−(s−t)/τ} r(x(s), u(s)) ds]. (12) \n\nAccording to the principle of dynamic programming (Bryson and Ho, 1975), we consider optimization in two phases, [t, t + Δt] and [t + Δt, ∞), resulting in the expression \n\nV*(x(t)) = max_{u[t,t+Δt)} [∫_t^{t+Δt} (1/τ) e^{−(s−t)/τ} r(x(s), u(s)) ds + e^{−Δt/τ} V*(x(t + Δt))]. \n\nBy Taylor expanding the value at t + Δt as \n\nV*(x(t + Δt)) = V*(x(t)) + (∂V*/∂x) f(x(t), u(t)) Δt + O(Δt²) \n\nand then taking Δt to zero, we have a differential constraint for the optimal value function \n\nV*(x(t)) = max_{u(t)∈U} [r(x(t), u(t)) + τ (∂V*/∂x) f(x(t), u(t))]. (13) \n\nThis is a variant of the Hamilton-Jacobi-Bellman equation (Bryson and Ho, 1975) for a discounted case. \n\n3.2 OPTIMAL NONLINEAR FEEDBACK CONTROL \n\nWhen the reinforcement r(x, u) is convex with respect to the control u, and the vector field f(x, u) is linear with respect to u, the optimization problem in (13) has a unique solution. The condition for the optimal control is \n\n∂r(x, u)/∂u + τ (∂V*/∂x) ∂f(x, u)/∂u = 0. 
\n\n(14) \n\nNow we consider the case when the cost for control is given by a convex potential function G_j(·) for each control input: \n\nr(x, u) = r_x(x) − Σ_j G_j(u_j), \n\nwhere the reinforcement for the state, r_x(x), is still unknown. We also assume that the input gain of the system \n\nb_j(x) = ∂f(x, u)/∂u_j \n\nis available. In this case, the optimal condition (14) for u_j is given by \n\n−G'_j(u_j) + τ (∂V*/∂x) b_j(x) = 0. \n\nNoting that the derivative G'_j(·) is a monotonic function since G_j(·) is convex, we have the optimal feedback control law \n\nu_j = (G'_j)^{−1}(τ (∂V*/∂x) b_j(x)). (15) \n\nIn particular, when the amplitude of control is bounded as |u_j| < u_j^max, we can enforce this constraint using a control cost \n\nG_j(u_j) = (c_j/u_j^max) ∫_0^{u_j} g^{−1}(s/u_j^max) ds, (16) \n\nwhere g^{−1}(·) is an inverse sigmoid function that diverges at ±1 (Hopfield, 1984). In this case, the optimal feedback control law is given by \n\nu_j = u_j^max g((u_j^max/c_j) τ (∂V*/∂x) b_j(x)). (17) \n\nIn the limit of c_j → 0, this results in the \"bang-bang\" control law \n\nu_j = u_j^max sign[(∂V*/∂x) b_j(x)]. (18) \n\nFigure 1: A pendulum with limited torque. The dynamics is given by ml² d²θ/dt² = −μ dθ/dt + mgl sin θ + T. Parameters were m = l = 1, g = 9.8, and μ = 0.01. \n\nFigure 2: Left: The learning curves for (a) optimal control and (c) actor-critic. t_up: time during which |θ| < 90°. Right: (b) The predicted value function P after 100 trials of optimal control. (d) The output of the controller after 100 trials with actor-critic learning. The thick gray line shows the trajectory of the pendulum. th: θ (degrees), om: dθ/dt (degrees/sec). 
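As an illustrative sketch (not code from the paper), the bounded feedback law (17) with the sigmoid g(x) = (2/π) tan^{−1}((π/2) x) can be written in a few lines; the value gradient and input gain are assumed to be supplied by the critic and a system model:

```python
import numpy as np

def g(x):
    # sigmoid g(x) = (2/pi) * arctan((pi/2) x), saturating at +-1 (Hopfield, 1984)
    return (2.0 / np.pi) * np.arctan((np.pi / 2.0) * x)

def bounded_optimal_control(dV_dx, b_x, u_max, c, tau):
    # law (17): u = u_max * g((u_max / c) * tau * dV/dx . b(x))
    return u_max * g((u_max / c) * tau * float(np.dot(dV_dx, b_x)))
```

As c shrinks, the sigmoid saturates and the output approaches the bang-bang law (18).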
\n\n4 ACTOR-CRITIC \n\nWhen the information about the control cost, the input gain of the system, or the gradient of the value function is not available, we cannot use the above optimal control law. However, the TD error (6) can be used as \"internal reinforcement\" for training a stochastic controller, or an \"actor\" (Barto et al., 1983). \nIn the simulation below, we combined our TD algorithm for the critic with a reinforcement learning algorithm for real-valued output (Gullapalli, 1990). The output of the controller was given by \n\nu_j(t) = u_j^max g(Σ_i w_ji b_i(x(t)) + σ n_j(t)), (19) \n\nwhere n_j(t) is normalized Gaussian noise and w_ji is a weight. The size of this perturbation was changed based on the predicted performance by σ = σ_0 exp(−P(t)). The connection weights were changed by \n\nΔw_ji ∝ r̂(t) n_j(t) b_i(x(t)). (20) \n\n5 SIMULATION \n\nThe performance of the above continuous-time TD algorithm was tested on a task of swinging up a pendulum with limited torque (Figure 1). Control of this one-degree-of-freedom system is trivial near the upright equilibrium. However, bringing the pendulum near the upright position is not trivial if we set the maximal torque T^max smaller than mgl. The controller has to swing the pendulum several times to build up enough momentum to bring it upright. Furthermore, the controller has to decelerate the pendulum early enough to avoid falling over. \nWe used a radial basis function (RBF) network to approximate the value function for the state of the pendulum x = (θ, dθ/dt). We prepared a fixed set of 12 × 12 Gaussian basis functions. This is a natural extension of the \"boxes\" approach previously used to control inverted pendulums (Barto et al., 1983). The immediate reinforcement was given by the height of the tip of the pendulum, i.e., r_x = cos θ. 
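The plant of Figure 1 and the 12 × 12 RBF feature map described above can be sketched as follows; the integration step, state ranges, and basis widths are assumptions not specified in the paper:

```python
import numpy as np

M, L, G, MU = 1.0, 1.0, 9.8, 0.01   # m = l = 1, g = 9.8, mu = 0.01 (Figure 1)

def pendulum_step(theta, omega, torque, dt=0.02):
    # Euler step of m l^2 theta'' = -mu theta' + m g l sin(theta) + T
    # (theta = 0 is the upright position)
    alpha = (-MU * omega + M * G * L * np.sin(theta) + torque) / (M * L ** 2)
    return theta + dt * omega, omega + dt * alpha

def reward(theta):
    # immediate reinforcement: height of the pendulum tip, r_x = cos(theta)
    return np.cos(theta)

# 12 x 12 grid of Gaussian centers over (theta, omega); ranges/widths are illustrative
_th, _om = np.meshgrid(np.linspace(-np.pi, np.pi, 12),
                       np.linspace(-10.0, 10.0, 12))
CENTERS = np.stack([_th.ravel(), _om.ravel()], axis=1)

def rbf_features(theta, omega, widths=(0.6, 2.0)):
    d = (np.array([theta, omega]) - CENTERS) / np.array(widths)
    return np.exp(-0.5 * np.sum(d * d, axis=1))
```

The 144-dimensional feature vector plays the role of the b_i(x) in (8) and (19).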
\n\n5.1 OPTIMAL CONTROL \n\nFirst, we used the optimal control law (17) with the predicted value function P instead of V*. We added noise to the control command to enhance exploration. The torque was given by \n\nT = T^max g((T^max/c) τ (∂P(x)/∂x) b + σ n(t)), \n\nwhere g(x) = (2/π) tan^{−1}((π/2) x) (Hopfield, 1984). Note that the input gain b = (0, 1/ml²)^T was constant. Parameters were T^max = 5, c = 0.1, σ_0 = 0.01, τ = 1.0, and τ_c = 0.1. \nEach run was started from a random θ and was continued for 20 seconds. Within ten trials, the value function P became accurate enough to be able to swing up and hold the pendulum (Figure 2a). An example of the predicted value function P after 100 trials is shown in Figure 2b. The paths toward the upright position, which were implicitly determined by the dynamical properties of the system, can be seen as the ridges of the value function. We also had successful results when the reinforcement was given only near the goal: r_x = 1 if |θ| < 30°, −1 otherwise. \n\n5.2 ACTOR-CRITIC \n\nNext, we tested the actor-critic learning scheme described above. The controller was also implemented by an RBF network with the same 12 × 12 basis functions as the critic network. It took about one hundred trials to achieve reliable performance (Figure 2c). Figure 2d shows an example of the output of the controller after 100 trials. We can see nearly linear feedback in the neighborhood of the upright position and a nonlinear torque field away from the equilibrium. \n\n6 CONCLUSION \n\nWe derived a continuous-time, continuous-state version of the TD algorithm and showed its applicability to a nonlinear control task. 
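As a hedged summary sketch (again, not code from the paper), one combined actor-critic update per time step, using the Euler form of the TD error together with rules (9), (19), and (20), might look like this; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # sigmoid g(x) = (2/pi) * arctan((pi/2) x)
    return (2.0 / np.pi) * np.arctan((np.pi / 2.0) * x)

def actor_critic_step(v, w, b_prev, b_now, r_now,
                      tau=1.0, dt=0.02, u_max=5.0, sigma0=0.01,
                      eta_v=0.1, eta_w=0.1):
    # Euler TD error (6); critic update (9); actor output (19); actor update (20)
    p_prev, p_now = float(v @ b_prev), float(v @ b_now)
    r_hat = r_now + (tau / dt) * ((1.0 - dt / tau) * p_now - p_prev)
    sigma = sigma0 * np.exp(-p_now)        # perturbation scaled by predicted performance
    n = float(rng.standard_normal())
    u = u_max * g(float(w @ b_now) + sigma * n)
    v = v + eta_v * r_hat * b_prev         # critic: rule (9)
    w = w + eta_w * r_hat * n * b_now      # actor: rule (20)
    return v, w, u, r_hat
```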
One advantage of the continuous formulation is that we can derive an explicit form of the optimal control law, as in (17), using derivative information, whereas a one-ply search for the best action is usually required in discrete formulations. \n\nReferences \n\nBaird III, L. C. (1993). Advantage updating. Technical Report WL-TR-93-1146, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-7301, USA. \nBarto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81-138. \nBarto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834-846. \nBradtke, S. J. and Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 393-400. MIT Press, Cambridge, MA. \nBradtke, S. J., Ydstie, B. E., and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. CMPSCI Technical Report 94-49, University of Massachusetts, Amherst, MA. \nBryson, Jr., A. E. and Ho, Y.-C. (1975). Applied Optimal Control. Hemisphere Publishing, New York, 2nd edition. \nGullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3:671-692. \nHopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81:3088-3092. \nHouk, J. C., Adams, J. L., and Barto, A. G. (1994). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Houk, J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing in the Basal Ganglia, pages 249-270. 
MIT Press, Cambridge, MA. \nSutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44. \n", "award": [], "sourceid": 1169, "authors": [{"given_name": "Kenji", "family_name": "Doya", "institution": null}]}