{"title": "Reinforcement Learning Methods for Continuous-Time Markov Decision Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 393, "page_last": 400, "abstract": null, "full_text": "Reinforcement Learning Methods for \nContinuous-Time Markov Decision \n\nProblems \n\nSteven J. Bradtke \n\nComputer Science Department \n\nUniversity of Massachusetts \n\nAmherst, MA 01003 \n\nbradtkeGcs.umass.edu \n\nMichael O. Duff \n\nComputer Science Department \n\nUniversity of Massachusetts \n\nAmherst, MA 01003 \nduffGcs.umass.edu \n\nAbstract \n\nSemi-Markov Decision Problems are continuous time generaliza(cid:173)\ntions of discrete time Markov Decision Problems. A number of \nreinforcement learning algorithms have been developed recently \nfor the solution of Markov Decision Problems, based on the ideas \nof asynchronous dynamic programming and stochastic approxima(cid:173)\ntion. Among these are TD(,x), Q-Iearning, and Real-time Dynamic \nProgramming. After reviewing semi-Markov Decision Problems \nand Bellman's optimality equation in that context, we propose al(cid:173)\ngorithms similar to those named above, adapted to the solution of \nsemi-Markov Decision Problems. We demonstrate these algorithms \nby applying them to the problem of determining the optimal con(cid:173)\ntrol for a simple queueing system. We conclude with a discussion \nof circumstances under which these algorithms may be usefully ap(cid:173)\nplied. \n\n1 \n\nIntroduction \n\nA number of reinforcement learning algorithms based on the ideas of asynchronous \ndynamic programming and stochastic approximation have been developed recently \nfor the solution of Markov Decision Problems. Among these are Sutton's TD(,x) \n[10], Watkins' Q-Iearning [12], and Real-time Dynamic Programming (RTDP) [1, \n\n\f394 \n\nSteven Bradtke, Michael O. Duff \n\n3]. 
These learning algorithms are widely used, but their domain of application has been limited to processes modeled by discrete-time Markov Decision Problems (MDPs).

This paper derives analogous algorithms for semi-Markov Decision Problems (SMDPs), extending the domain of applicability to continuous time. This effort was originally motivated by the desire to apply reinforcement learning methods to problems of adaptive control of queueing systems, and to the problem of adaptive routing in computer networks in particular. We apply the new algorithms to the well-known problem of routing to two heterogeneous servers [7]. We conclude with a discussion of circumstances under which these algorithms may be usefully applied.

2 Semi-Markov Decision Problems

A semi-Markov process is a continuous-time dynamic system consisting of a countable state set, X, and a finite action set, A. Suppose that the system is originally observed to be in state x ∈ X, and that action a ∈ A is applied. A semi-Markov process [9] then evolves as follows:

• The next state, y, is chosen according to the transition probabilities P_{xy}(a).
• A reward accrues at rate ρ(x, a) until the next transition occurs.
• Conditional on the event that the next state is y, the time until the transition from x to y occurs has probability distribution F_{xy}(·|a).

One form of the SMDP objective is to find a policy that minimizes the expected infinite-horizon discounted cost, the \"value\" of each state:

E { ∫_0^∞ e^{-βt} ρ(x(t), a(t)) dt },

where x(t) and a(t) denote, respectively, the state and action at time t. For a fixed policy π, the value of a given state x must satisfy

V_π(x) = Σ_{y∈X} P_{xy}(π(x)) ∫_0^∞ ∫_0^t e^{-βs} ρ(x, π(x)) ds dF_{xy}(t|π(x)) + Σ_{y∈X} P_{xy}(π(x)) ∫_0^∞ e^{-βt} V_π(y) dF_{xy}(t|π(x)).   (1)

Defining

R(x, y, a) = ∫_0^∞ ∫_0^t e^{-βs} ρ(x, a) ds dF_{xy}(t|a),

the expected reward that will be received on transition from state x to state y on action a, and

γ(x, y, a) = ∫_0^∞ e^{-βt} dF_{xy}(t|a),

the expected discount factor to be applied to the value of state y on transition from state x on action a, it is clear that equation (1) is nearly identical to the value-function equation for discrete-time Markov reward processes,

V_π(x) = R(x, π(x)) + γ Σ_{y∈X} P_{xy}(π(x)) V_π(y),   (2)

where R(x, a) = Σ_{y∈X} P_{xy}(a) R(x, y, a). If transition times are identically one for an SMDP, then a standard discrete-time MDP results.

Similarly, while the value function associated with an optimal policy for an MDP satisfies the Bellman optimality equation

V*(x) = max_{a∈A} { R(x, a) + γ Σ_{y∈X} P_{xy}(a) V*(y) },   (3)

the optimal value function for an SMDP satisfies the following version of the Bellman optimality equation:

V*(x) = max_{a∈A} { Σ_{y∈X} P_{xy}(a) ∫_0^∞ ∫_0^t e^{-βs} ρ(x, a) ds dF_{xy}(t|a) + Σ_{y∈X} P_{xy}(a) ∫_0^∞ e^{-βt} V*(y) dF_{xy}(t|a) }.   (4)

3 Temporal Difference learning for SMDPs

Sutton's TD(0) [10] is a stochastic approximation method for finding solutions to the system of equations (2). Having observed a transition from state x to state y with sample reward r(x, y, π(x)), TD(0) updates the value-function estimate V^{(k)}(x) in the direction of the sample value r(x, y, π(x)) + γ V^{(k)}(y). The TD(0) update rule for MDPs is

V^{(k+1)}(x) = V^{(k)}(x) + α_k [ r(x, y, π(x)) + γ V^{(k)}(y) - V^{(k)}(x) ],   (5)

where α_k is the learning rate.
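Concretely, update (5) is a one-line tabular backup. The following minimal Python sketch is ours, not from the paper; the function name is hypothetical:

```python
def td0_update(V, x, y, r, gamma, alpha):
    # One tabular TD(0) backup for a discrete-time MDP:
    # V(x) <- V(x) + alpha * (r + gamma * V(y) - V(x))
    V[x] += alpha * (r + gamma * V[y] - V[x])
    return V

# Example: a single observed transition x=0 -> y=1 with sample reward 1.0
V = {0: 0.0, 1: 0.0}
td0_update(V, 0, 1, r=1.0, gamma=0.9, alpha=0.5)
```

Applied repeatedly along a sampled trajectory with an appropriately decreasing step size, these backups drive the estimate toward V_π.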
The sequence of value-function estimates generated by the TD(0) procedure will converge to the true solution, V_π, with probability one [5, 8, 11] under the appropriate conditions on the α_k and on the definition of the MDP.

The TD(0) learning rule for SMDPs, intended to solve the system of equations (1) given a sequence of sampled state transitions, is

V^{(k+1)}(x) = V^{(k)}(x) + α_k [ ((1 - e^{-βτ})/β) r(x, y, π(x)) + e^{-βτ} V^{(k)}(y) - V^{(k)}(x) ],   (6)

where the sampled transition time from state x to state y was τ time units, ((1 - e^{-βτ})/β) r(x, y, π(x)) is the sample reward received over those τ time units, and e^{-βτ} is the sample discount on the value of the next state given a transition time of τ time units. The TD(λ) learning rule for SMDPs is straightforward to define from here.

4 Q-learning for SMDPs

Denardo [6] and Watkins [12] define Q_π, the Q-function corresponding to the policy π, as

Q_π(x, a) = R(x, a) + γ Σ_{y∈X} P_{xy}(a) V_π(y).   (7)

Notice that a can be any action; it is not necessarily the action π(x) that would be chosen by policy π. The function Q* corresponds to the optimal policy. Q_π(x, a) represents the total discounted return that can be expected if action a is taken from state x and policy π is followed thereafter. Equation (7) can be rewritten as

Q_π(x, a) = R(x, a) + γ Σ_{y∈X} P_{xy}(a) Q_π(y, π(y)),   (8)

and Q* satisfies the Bellman-style optimality equation

Q*(x, a) = R(x, a) + γ Σ_{y∈X} P_{xy}(a) max_{a'∈A} Q*(y, a').   (9)

Q-learning, first described by Watkins [12], uses stochastic approximation to iteratively refine an estimate of the function Q*. The Q-learning rule is very similar to TD(0).
Upon a sampled transition from state x to state y following selection of action a, with sampled reward r(x, y, a), the Q-function estimate is updated according to

Q^{(k+1)}(x, a) = Q^{(k)}(x, a) + α_k [ r(x, y, a) + γ max_{a'∈A} Q^{(k)}(y, a') - Q^{(k)}(x, a) ].   (10)

Q-functions may also be defined for SMDPs. The optimal Q-function for an SMDP satisfies the equation

Q*(x, a) = Σ_{y∈X} P_{xy}(a) ∫_0^∞ ∫_0^t e^{-βs} ρ(x, a) ds dF_{xy}(t|a) + Σ_{y∈X} P_{xy}(a) ∫_0^∞ e^{-βt} max_{a'∈A} Q*(y, a') dF_{xy}(t|a).   (11)

This leads to the following Q-learning rule for SMDPs:

Q^{(k+1)}(x, a) = Q^{(k)}(x, a) + α_k [ ((1 - e^{-βτ})/β) r(x, y, a) + e^{-βτ} max_{a'∈A} Q^{(k)}(y, a') - Q^{(k)}(x, a) ].   (12)

5 RTDP and Adaptive RTDP for SMDPs

The TD(0) and Q-learning algorithms are model-free, and rely upon stochastic approximation for asymptotic convergence to the desired function (V_π and Q*, respectively). Convergence is typically rather slow. Real-Time Dynamic Programming (RTDP) and Adaptive RTDP [1, 3] use a system model to speed convergence. RTDP assumes that a system model is known a priori; Adaptive RTDP builds a model as it interacts with the system. As discussed by Barto et al. [1], these asynchronous DP algorithms can have computational advantages over traditional DP algorithms even when a system model is given.

Inspecting equation (4), we see that the model needed by RTDP in the SMDP domain consists of three parts:

1. the state transition probabilities P_{xy}(a),
2. the expected reward R(x, y, a) on transition from state x to state y using action a, and
3. the expected discount factor γ(x, y, a) to be applied to the value of the next state on transition from state x to state y using action a.
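These three quantities can be maintained as running averages over observed transitions. The following Python sketch is a hypothetical illustration (class, method, and attribute names are ours, not the paper's):

```python
import math
from collections import defaultdict

class SMDPModelEstimate:
    # Running estimates of the three model quantities listed above:
    # transition probabilities P, expected rewards R, and expected
    # discount factors gamma. (A hypothetical sketch; all names ours.)
    def __init__(self, beta):
        self.beta = beta
        self.n_xa = defaultdict(int)    # visit counts for (x, a)
        self.n_xay = defaultdict(int)   # transition counts for (x, a, y)
        self.R = defaultdict(float)     # running mean of sampled discounted reward
        self.g = defaultdict(float)     # running mean of sampled discount e^(-beta*tau)

    def update(self, x, a, y, reward_rate, tau):
        self.n_xa[(x, a)] += 1
        self.n_xay[(x, a, y)] += 1
        n = self.n_xay[(x, a, y)]
        # reward accumulated at rate reward_rate, discounted over tau time units
        sample_r = reward_rate * (1.0 - math.exp(-self.beta * tau)) / self.beta
        sample_g = math.exp(-self.beta * tau)
        self.R[(x, a, y)] += (sample_r - self.R[(x, a, y)]) / n
        self.g[(x, a, y)] += (sample_g - self.g[(x, a, y)]) / n

    def P(self, x, a, y):
        return self.n_xay[(x, a, y)] / self.n_xa[(x, a)]
```

Each sampled transition contributes its empirical frequency to the estimate of P, its discounted sample reward to the estimate of R, and its sample discount e^{-βτ} to the estimate of γ.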
If the process dynamics are governed by a continuous-time Markov chain, then the model needed by RTDP can be derived analytically through uniformization [2]. In general, however, the model can be very difficult to derive analytically. In these cases Adaptive RTDP can be used to incrementally build a system model through direct interaction with the system. One version of the Adaptive RTDP algorithm for SMDPs is described in Figure 1.

1  Set k = 0, and set x_0 to some start state.
2  Initialize P̂, R̂, and γ̂.
3  repeat forever {
4      For all actions a, compute
           Q^{(k)}(x_k, a) = Σ_{y∈X} P̂_{x_k y}(a) [ R̂(x_k, y, a) + γ̂(x_k, y, a) V^{(k)}(y) ]
5      Perform the update V^{(k+1)}(x_k) = min_{a∈A} Q^{(k)}(x_k, a)
6      Select an action, a_k.
7      Perform a_k and observe the transition to x_{k+1} after τ time units. Update P̂. Use the sample reward ((1 - e^{-βτ})/β) r(x_k, x_{k+1}, a_k) and the sample discount factor e^{-βτ} to update R̂ and γ̂.
8      k = k + 1
9  }

Figure 1: Adaptive RTDP for SMDPs. P̂, R̂, and γ̂ are the estimates of P, R, and γ maintained by Adaptive RTDP.

Notice that the action-selection procedure (line 6) is left unspecified. Unlike RTDP, Adaptive RTDP cannot always choose the greedy action, because it has only an estimate of the system model on which to base its decisions, and that estimate could initially be quite inaccurate. Adaptive RTDP needs to explore, choosing actions that do not currently appear to be optimal, in order to ensure that the estimated model converges to the true model over time.

6 Experiment: Routing to two heterogeneous servers

Consider the queueing system shown in Figure 2. Arrivals are assumed to be Poisson with rate λ. Upon arrival, a customer must be routed to one of the two queues, whose servers have service times that are exponentially distributed with parameters μ_1 and μ_2, respectively.
The goal is to compute a policy that minimizes the objective function

E { ∫_0^∞ e^{-βt} [c_1 n_1(t) + c_2 n_2(t)] dt },

where c_1 and c_2 are scalar cost factors, and n_1(t) and n_2(t) denote the number of customers in the respective queues at time t. The pair (n_1(t), n_2(t)) is the state of the system at time t; the state space for this problem is countably infinite. There are two actions available at every state: if an arrival occurs, route it to queue 1 or route it to queue 2.

Figure 2: Routing to two queueing systems.

It is known for this problem (and many like it [7]) that the optimal policy is a threshold policy; that is, the set of states S_1 for which it is optimal to route to the first queue is characterized by a monotonically nondecreasing threshold function F via S_1 = {(n_1, n_2) | n_1 ≤ F(n_2)}. For the case where c_1 = c_2 = 1 and μ_1 = μ_2, the policy is simply to join the shorter queue, and the threshold function is a line slicing diagonally through the state space.

We applied the SMDP version of Q-learning to this problem in an attempt to find the optimal policy for some subset of the state space. The system parameters were set to λ = μ_1 = μ_2 = 1, β = 0.1, and c_1 = c_2 = 1. We used a feedforward neural network trained by backpropagation as a function approximator.

Q-learning must take exploratory actions in order to adequately sample all of the available state transitions. At each decision time k, we selected the action a_k to be applied to state x_k via the Boltzmann distribution

P(a_k = a) = e^{-Q(x_k, a)/T_k} / Σ_{a'∈A} e^{-Q(x_k, a')/T_k},

where T_k is the \"computational temperature.\" The temperature is initialized to a relatively high value, resulting in a uniform distribution over prospective actions. T_k is gradually lowered as computation proceeds, raising the probability of selecting actions with lower (and for this application, better) Q-values.
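A Boltzmann selection rule of this cost-minimizing form might be sketched as follows (a hypothetical illustration; the function name is ours):

```python
import math
import random

def boltzmann_action(q_values, temperature, rng=random):
    # Sample an action with probability proportional to exp(-Q/T).
    # Q-values here are discounted costs, so lower is better; as T -> 0
    # the rule becomes greedy (cost-minimizing). A sketch; names ours.
    q_min = min(q_values.values())
    # subtract the minimum before exponentiating for numerical stability
    weights = {a: math.exp(-(q - q_min) / temperature)
               for a, q in q_values.items()}
    u = rng.random() * sum(weights.values())
    acc = 0.0
    for a, w in weights.items():
        acc += w
        if u <= acc:
            return a
    return a  # guard against floating-point round-off
```

At high temperature the weights are nearly equal, so the choice is close to uniform; at low temperature nearly all the probability mass sits on the lowest-cost action.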
In the limit, the action that is greedy with respect to the Q-function estimate is selected. The temperature and the learning rate α_k are decreased over time using a \"search then converge\" method [4].

Figure 3 shows the results obtained by Q-learning for this problem. Each square denotes a state visited, with n_1(t) running along the x-axis and n_2(t) along the y-axis. The color of each square represents the probability of choosing action 1 (route arrivals to queue 1). Black represents probability 1; white represents probability 0. An optimal policy would be black above the diagonal, white below the diagonal, and could have arbitrary colors along the diagonal.

Figure 3: Results of the Q-learning experiment. Panel A shows the policy after 50,000 total updates, Panel B after 100,000 total updates, and Panel C after 150,000 total updates.
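For readers who want to experiment, the following self-contained sketch applies the SMDP Q-learning rule (12) to this routing problem. It substitutes a lazily grown tabular Q-function for the paper's neural network (the countably infinite state space only ever instantiates visited states) and ε-greedy exploration for the Boltzmann scheme; those substitutions, and all names, are our assumptions:

```python
import math
import random

def smdp_q_routing(lam=1.0, mu1=1.0, mu2=1.0, beta=0.1, c1=1.0, c2=1.0,
                   steps=3000, alpha=0.1, eps=0.1, seed=0):
    # SMDP Q-learning, rule (12), for routing to two exponential servers.
    # Decision epochs are arrival instants; the action routes the arriving
    # customer. Between arrivals, the sampled discounted cost and sampled
    # discount factor are accumulated exactly over each interval of
    # constant occupancy. A sketch under the assumptions stated above.
    rng = random.Random(seed)
    Q = {}

    def q(s, a):
        return Q.get((s, a), 0.0)

    n1 = n2 = 0
    for _ in range(steps):
        s = (n1, n2)
        if rng.random() < eps:
            a = rng.choice((1, 2))
        else:
            a = 1 if q(s, 1) <= q(s, 2) else 2
        if a == 1:
            n1 += 1
        else:
            n2 += 1
        cost, discount = 0.0, 1.0
        while True:  # simulate until the next arrival (next decision epoch)
            rates = (lam, mu1 if n1 > 0 else 0.0, mu2 if n2 > 0 else 0.0)
            total = sum(rates)
            delta = rng.expovariate(total)       # time to next event
            rate = c1 * n1 + c2 * n2             # current cost rate
            cost += discount * rate * (1.0 - math.exp(-beta * delta)) / beta
            discount *= math.exp(-beta * delta)
            u = rng.random() * total
            if u < rates[0]:
                break                            # arrival: decide again
            elif u < rates[0] + rates[1]:
                n1 -= 1                          # departure from queue 1
            else:
                n2 -= 1                          # departure from queue 2
        s2 = (n1, n2)
        target = cost + discount * min(q(s2, 1), q(s2, 2))
        Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
    return Q
```

Extracting the greedy action at each visited state from the returned Q-values gives a routing policy that, with enough updates, should approach the threshold structure described above for frequently visited states.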
One unsatisfactory feature of the algorithm's performance is that convergence is rather slow, though the schedules governing the decrease of the Boltzmann temperature T_k and learning rate α_k involve design parameters whose tuning may result in faster convergence. If it is known that the optimal policies are of threshold type, or that some other structural property holds, then it may be of considerable practical utility to make use of this fact by constraining the value functions in some way, or perhaps by representing them as a combination of appropriate basis vectors that implicitly realize or enforce the given structural property.

7 Discussion

In this paper we have proposed extending the applicability of well-known reinforcement learning methods developed for discrete-time MDPs to the continuous-time domain. We derived semi-Markov versions of TD(0), Q-learning, RTDP, and Adaptive RTDP in a straightforward way from their discrete-time analogues. While we have not given any convergence proofs for these new algorithms, such proofs should not be difficult to obtain if we limit ourselves to problems with finite state spaces. (Proof of convergence for these new algorithms is complicated by the fact that, in general, the state spaces involved are infinite; convergence proofs for traditional reinforcement learning methods assume the state space is finite.) Ongoing work is directed toward applying these techniques to more complicated systems, examining distributed control issues, and investigating methods for incorporating prior knowledge (such as structured function approximators).

Acknowledgements

Thanks to Professor Andrew Barto, Bob Crites, and the members of the Adaptive Networks Laboratory. This work was supported by the National Science Foundation under Grant ECS-9214866 to Professor Barto.

References

[1] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence. Accepted.

[2] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, Englewood Cliffs, NJ, 1987.

[3] S. J. Bradtke. Incremental Dynamic Programming for On-line Adaptive Optimal Control. PhD thesis, University of Massachusetts, 1994.

[4] C. Darken, J. Chang, and J. Moody. Learning rate schedules for faster stochastic gradient search. In Neural Networks for Signal Processing 2 - Proceedings of the 1992 IEEE Workshop. IEEE Press, 1992.

[5] P. Dayan and T. J. Sejnowski. TD(λ): Convergence with probability 1. Machine Learning, 1994.

[6] E. V. Denardo. Contraction mappings in the theory underlying dynamic programming. SIAM Review, 9(2):165-177, April 1967.

[7] B. Hajek. Optimal control of two interacting service stations. IEEE Transactions on Automatic Control, 29:491-499, 1984.

[8] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 1994.

[9] S. M. Ross. Applied Probability Models with Optimization Applications. Holden-Day, San Francisco, 1970.

[10] R. S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3:9-44, 1988.

[11] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Technical Report LIDS-P-2172, Laboratory for Information and Decision Systems, MIT, Cambridge, MA, 1993.

[12] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.
", "award": [], "sourceid": 889, "authors": [{"given_name": "Steven", "family_name": "Bradtke", "institution": null}, {"given_name": "Michael", "family_name": "Duff", "institution": null}]}