{"title": "Transition Point Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 639, "page_last": 646, "abstract": null, "full_text": "Transition Point Dynamic Programming \n\nKenneth M. Buckland'\" \n\nDept. of Electrical Engineering \nUniversity of British Columbia \n\nPeter D. Lawrence \n\nDept. of Electrical Engineering \nUniversity of British Columbia \n\nVancouver, B.C, Canada V6T 1Z4 \n\nVancouver, B.C, Canada V6T 1Z4 \n\nbuckland@pmc-sierra.bc.ca \n\npeterl@ee.ubc.ca \n\nAbstract \n\nTransition point dynamic programming (TPDP) is a memory(cid:173)\nbased, reinforcement learning, direct dynamic programming ap(cid:173)\nproach to adaptive optimal control that can reduce the learning \ntime and memory usage required for the control of continuous \nstochastic dynamic systems. TPDP does so by determining an \nideal set of transition points (TPs) which specify only the control \naction changes necessary for optimal control. TPDP converges to \nan ideal TP set by using a variation of Q-Iearning to assess the mer(cid:173)\nits of adding, swapping and removing TPs from states throughout \nthe state space. When applied to a race track problem, TPDP \nlearned the optimal control policy much sooner than conventional \nQ-Iearning, and was able to do so using less memory. \n\n1 \n\nINTRODUCTION \n\nDynamic programming (DP) approaches can be utilized to determine optimal con(cid:173)\ntrol policies for continuous stochastic dynamic systems when the state spaces of \nthose systems have been quantized with a resolution suitable for control (Barto et \nal., 1991). DP controllers, in lheir simplest form, are memory-based controllers \nthat operate by repeatedly updating cost values associated with every state in the \ndiscretized state space (Barto et al., 1991). In a slate space of any size the required \nquantization can lead to an excessive memory requirement, and a related increase \nin learning time (Moore, 1991). 
This is the \"curse of dimensionality\". \n\n\u00b7Nowat: PMC-Sierra Inc., 8501 Commerce Court, Burnaby, B.C., Canada V5A 4N3. \n\n639 \n\n\f640 \n\nBuckland and Lawrence \n\nQ-Iearning (Watkins, 1989, Watkins et al., 1992) is a direct form of DP that avoids \nexplicit system modeling - thereby reducing the memory required for DP control. \nFurther reductions are possible if Q-Ieal'l1ing is modified so that its DP cost values \n(Q-values) are associated only with states where control action changes need to be \nspecified. Transition point dynamic programming (TPDP), the control approach \ndescribed in this paper, is designed to take advantage of this DP memory reduction \npossibility by determining the states where control action changes must be specified \nfor optimal control, and what those optimal changes are. \n\n2 GENERAL DESCRIPTION OF TPDP \n\n2.1 TAKING ADVANTAGE OF INERTIA \n\nTPDP is suited to the control of continuous stochastic dynamic systems that have \ninertia. In such systems \"uniform regions\" are likely to exist in the state space \nwhere all of the (discretized) states have the same optimal control action (or the \nsame set of optimal actions l ). Considering one such uniform region, if the optimal \naction for that region is specified at the \"boundary states\" of the region and then \nmaintained throughout the region until it is left and another uniform region is \nentered (where another set of boundary states specify the next action), none of the \n\"dormant states\" in the middle of the region need to specify any actions themselves. \nThus dormant states do not have to be represented in memory. This is the basic \npremise of TPDP. \n\nThe association of optimal actions with boundary states is done by \"transition \npoints\" (TPs) at those states. Boundary states include all of the states that can \nbe reached from outside a uniform region when that region is entered as a result of \nstochastic state transitions. 
The boundary states of any one uniform region form a hyper-surface of variable thickness which may or may not be closed. The TPs at boundary states must be represented in memory, but if they are small in number compared to the dormant states the memory savings can be significant. \n\n2.2 ILLUSTRATING THE TPDP CONCEPT \n\nFigure 1 illustrates the TPDP concept when movement control of a \"car\" on a one dimensional track is desired. The car, with some initial positive velocity to the right, must pass Position A and return to the left. The TPs in Figure 1 (represented by boxes) are located at boundary states. The shaded regions indicate all of the states that the system can possibly move through given the actions specified at the boundary states and the stochastic response of the car. Shaded states without TPs are therefore dormant states. Uniform regions consist of adjacent boundary states where the same action is specified, as well as the shaded region through which that action is maintained before another boundary is encountered. Boundary states that do not seem to be on the main state transition routes (the one identified in Figure 1 for example) ensure that any stochastic deviations from those routes are realigned. Unshaded states are \"external states\" the system does not reach. \n\n¹The simplifying assumption that there is only one optimal action in each uniform region will be made throughout this paper. TPDP operates the same regardless. 
\n\n[Figure 1 shows the car's state space with velocity plotted against position; boxes mark the transition points (TPs), and a uniform region, a boundary state and Position A are labelled.] \n\nFigure 1: Application of TPDP to a One Dimensional Movement Control Task \n\n2.3 MINIMAL TP OPTIMAL CONTROL \n\nThe main benefit of the TPDP approach is that, where uniform regions exist, they can be represented by a relatively small number of DP elements (TPs) - depending on the shape of the boundaries and the size of the uniform regions they encompass. This reduction in memory usage results in an accompanying reduction in the learning time required to learn optimal control policies (Chapman et al., 1991). \n\nTPDP operates by learning optimal points of transition in the control action specification, where those points can be accurately located in highly resolved state spaces. To do this TPDP must determine which states are boundary states that should have TPs, and what actions those TPs should specify. In other words, TPDP must find the right TPs for the right states. When it has done so, \"minimal TP optimal control\" has been achieved. That is, optimal control with a minimal set of TPs. \n\n3 ACHIEVING MINIMAL TP OPTIMAL CONTROL \n\n3.1 MODIFYING A SET OF TPs \n\nGiven an arbitrary initial set of TPs, TPDP must modify that set so that it is transformed into a minimal TP optimal control set. Modifications can include the \"addition\" and \"removal\" of TPs throughout the state space, and the \"swapping\" of one TP for another (each specifying a different action) at the same state. These modifications are performed one at a time in arbitrary order, and can continue indefinitely. TPDP operates so that each TP modification results in an incremental movement towards minimal TP optimal control (Buckland, 1994). 
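The dormant-state premise of Section 2.1 (actions are stored only at boundary states, and the last-specified action is simply held everywhere else) can be sketched as a lookup-or-hold control loop. This is illustrative only; the function name `tp_control`, the dictionary representation of the TP set, and the action names are assumptions, not the paper's implementation.

```python
def tp_control(tps, trajectory, initial_action):
    """Sketch of TP-based control: a TP state overrides the current action;
    dormant states (absent from the TP set) simply hold the last action."""
    action = initial_action
    actions_taken = []
    for state in trajectory:
        action = tps.get(state, action)  # change action only at TP states
        actions_taken.append(action)
    return actions_taken
```

For example, a single TP at state 2 specifying a braking action would leave states 0 and 1 dormant: the initial action is held until the TP is reached, so only one state needs to be represented in memory.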
\n\n3.2 Q-LEARNING \n\nTPDP makes use of Q-learning (Watkins, 1989, Watkins et al., 1992) to modify the TP set. Normally Q-learning is used to determine the optimal control policy μ for a stochastic dynamic system subjected to immediate costs c(i, u) when action u is applied in each state i (Barto et al., 1991). Q-learning makes use of \"Q-values\" Q(i, u), which indicate the expected total infinite-horizon discounted cost if action u is applied in state i, and actions defined by the existing policy μ are applied in all future states. Q-values are learned by using the following updating equation: \n\nQ_{t+1}(s_t, u_t) = (1 - α_t)Q_t(s_t, u_t) + α_t[c(s_t, u_t) + γV_t(s_{t+1})]   (1) \n\nWhere α_t is the update rate, γ is the discount factor, and s_t and u_t are respectively the state at time step t and the action taken at that time step (all other Q-values remain the same at time step t). The evaluation function value V_t(i) is set to the lowest Q-value over the actions U(i) possible in each state i: \n\nV_t(i) = min_{u ∈ U(i)} Q_t(i, u)   (2) \n\nIf Equations 1 and 2 are employed during exploratory movement of the system, it has been proven that convergence to optimal Q-values Q*(i, u) and optimal evaluation function values V_{μ*}(i) will result (given that the proper constraints are followed, Watkins, 1989, Watkins et al., 1992, Jaakkola et al., 1994). From these values the optimal action in each state can be determined (the action that fulfills Equation 2). \n\n3.3 ASSESSING TPs WITH Q-LEARNING \n\nTPDP uses Q-learning to determine how an existing set of TPs should be modified to achieve minimal TP optimal control. Q-values can be associated with TPs, and the Q-values of two TPs at the same \"TP state\", each specifying different actions, can be compared to determine which should be maintained at that state - that is, which has the lower Q-value. This is how TPs are swapped (Buckland, 1994). 
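As a concrete illustration, Equations 1 and 2 can be sketched in a few lines of Python over a tabular Q-value store. The function name `q_update` and the dictionary keyed by (state, action) pairs are assumptions for illustration; the paper does not prescribe an implementation.

```python
from collections import defaultdict

def q_update(Q, s, u, cost, s_next, actions_next, alpha=0.1, gamma=0.95):
    """One Q-learning step. Equation 2: V(s') is the minimum Q-value over the
    actions available in s'. Equation 1: blend the old Q-value with the
    immediate cost plus the discounted evaluation of the successor state."""
    v_next = min(Q[(s_next, a)] for a in actions_next)                     # Equation 2
    Q[(s, u)] = (1 - alpha) * Q[(s, u)] + alpha * (cost + gamma * v_next)  # Equation 1
    return Q[(s, u)]
```

Because costs (rather than rewards) are accumulated, the minimum in Equation 2 picks out the cheapest action, and repeated application during exploration drives Q toward the optimal values.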
\n\nStates which do not have TPs, \"non-TP states\", have no Q-values from which evaluation function values V_t(i) can be determined (using Equation 2). As a result, to learn TP Q-values, Equation 1 must be modified to facilitate Q-value updating when the system makes d state transitions from one TP state through a number of non-TP states to another TP state: \n\nQ_{t+d}(s_t, u_t) = (1 - α_t)Q_t(s_t, u_t) + α_t[(Σ_{n=0}^{d-1} γ^n c(s_{t+n}, u_t)) + γ^d V_t(s_{t+d})]   (3) \n\nWhen d = 1, Equation 3 takes the form of Equation 1. When d > 1, the intervening non-TP states are effectively ignored and treated as inherent parts of the stochastic dynamic behavior of the system (Buckland, 1994). \n\nIf Equation 3 is used to determine the costs incurred when no action is specified at a state (when the action specified at some previous state is maintained), an \"R-value\" R(i) is the result. R-values can be used to expediently add and remove TPs from each state. If the Q-value of a TP is less than the R-value of the state it is associated with, then it is worthwhile having that TP at that state; otherwise it is not (Buckland, 1994). \n\n3.4 CONVERGENCE TO MINIMAL TP OPTIMAL CONTROL \n\nIt has been proven that a random sequence of TP additions, swaps and removals attempted at states throughout the state space will result in convergence to minimal TP optimal control (Buckland, 1994). This proof depends mainly on all TP modifications \"locking-in\" any potential cost reductions which are discovered as the result of learning exploration. \n\nThe problem with this proof of convergence, and the theoretical form of TPDP described up to this point, is that each modification to the existing set of TPs (each addition, swap and removal) requires the determination of Q-values and R-values which are negligibly close to being exact. 
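The d-step update of Equation 3 can be sketched under the same illustrative tabular representation. The name `q_update_dstep` and the argument layout are assumptions; passing the list of the d immediate costs incurred while the action was held makes the discounted sum explicit.

```python
from collections import defaultdict

def q_update_dstep(Q, s, u, costs, s_next, actions_next, alpha=0.1, gamma=0.95):
    """Equation 3: action u was held for d = len(costs) transitions from TP
    state s through non-TP states before reaching TP state s_next. The d
    immediate costs are discounted and summed, and V(s_next) is discounted
    by gamma**d. With d = 1 this reduces to Equation 1."""
    d = len(costs)
    discounted = sum(gamma**n * c for n, c in enumerate(costs))  # sum_{n=0}^{d-1} gamma^n c
    v_next = min(Q[(s_next, a)] for a in actions_next)
    Q[(s, u)] = (1 - alpha) * Q[(s, u)] + alpha * (discounted + gamma**d * v_next)
    return Q[(s, u)]
```

Calling the same routine while no TP action is specified (the previous action is simply maintained) would accumulate the R-value R(i) described in the text; comparing Q(i, u) against R(i) then decides whether a TP earns its place at state i.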
This means that a complete session of Q-learning must occur for every TP modification.² The result is excessive learning times - a problem circumvented by the practical form of TPDP described next. \n\n4 PRACTICAL TPDP \n\n4.1 CONCURRENT TP ASSESSMENT \n\nTo solve the problem of the protracted learning time required by the theoretical form of TPDP, many TP modifications can be assessed concurrently. That is, Q-learning can be employed not just to determine the Q-values and R-values for a single TP modification, but instead to learn these values for a number of concurrent modifications. Further, the modification attempts, and the learning of the values required for them, need not be initiated simultaneously. The determination of each value can be made part of the Q-learning process whenever new modifications are randomly attempted. This approach is called \"Practical TPDP\". Practical TPDP consists of a continually running Q-learning process (based on Equations 2 and 3), where the Q-values and R-values of a constantly changing set of TPs are learned. \n\n4.2 USING WEIGHTS FOR CONCURRENT TP ASSESSMENT \n\nThe main difficulty that arises when TPs are assessed concurrently is that of determining when an assessment is complete. That is, when the Q-values and R-values associated with each TP have been learned well enough for a TP modification to be made based on them. The technique employed to address this problem is to associate a \"weight\" w(i, u) with each TP that indicates the general merit of that TP. The basic idea of weights is to facilitate the random addition of trial TPs to a TP \"assessment group\" with a low initial weight w_initial. The Q-values and R-values of the TPs in the assessment group are learned in an ongoing Q-learning process, and the weights of the TPs are adjusted heuristically using those values. 
\nOf those TPs at any state i whose weights w(i, u) have been increased above w_thr (w_initial < w_thr < w_max), the one with the lowest Q-value Q(i, u) is swapped into the \"policy TP\" role for that state. The heuristic weight adjustment rules are: \n\n1. New, trial TPs are given an initial weight of w_initial (0 < w_initial < w_thr). \n\n2. Each time the Q-value of a TP is updated, the weight w(i, u) of that TP is incremented if Q(i, u) < R(i) and decremented otherwise. \n\n3. Each TP weight w(i, u) is limited to a maximum value of w_max. This prevents any one weight from becoming so large that it cannot readily be reduced again. \n\n4. If a TP weight w(i, u) is decremented to 0 the TP is removed. \n\n²The TPDP proof allows for more than one TP swap to be assessed simultaneously, but this does little to reduce the overall problem being described (Buckland, 1994). \n\nAn algorithm for Practical TPDP implementation is described in Buckland (1994). \n\n[Figure 2 shows learning curves over epochs 0 to 2500 (vertical axis 0 to 100), with Practical TPDP reaching good performance much sooner than conventional Q-learning.] \n\nFigure 2: Performance of Practical TPDP on a Race Track Problem \n\n4.3 PERFORMANCE OF PRACTICAL TPDP \n\nPractical TPDP was applied to a continuous version of a control task described by Barto et al. (1991) - that of controlling the acceleration of a car down a race track (specifically the track shown in Figures 3 and 4) when that car randomly experiences control action non-responsiveness. As shown in Figure 2 (each epoch in this Figure consisted of 20 training trials and 500 testing trials), Practical TPDP learned the optimal control policy much sooner than conventional Q-learning, and it was able to do so when limited to only 15% of the possible number of TPs (Buckland, 1994). 
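The weight adjustment rules of Section 4.2 can be sketched as follows. The concrete values of w_initial, w_thr and w_max are illustrative (the paper only requires 0 < w_initial < w_thr < w_max), and the function and container names are assumptions.

```python
W_INITIAL, W_THR, W_MAX = 1, 5, 10  # illustrative; only 0 < W_INITIAL < W_THR < W_MAX is required

def adjust_weight(weights, i, u, q_value, r_value):
    """Rules 1-4: after each Q-value update, increment the TP weight if the
    TP beats the state's R-value and decrement it otherwise; cap it at W_MAX
    so it can be reduced again; remove the TP if the weight reaches zero."""
    w = weights.get((i, u), W_INITIAL)   # rule 1: trial TPs start at W_INITIAL
    w += 1 if q_value < r_value else -1  # rule 2
    if w <= 0:                           # rule 4
        weights.pop((i, u), None)
    else:
        weights[(i, u)] = min(w, W_MAX)  # rule 3
```

A TP whose weight climbs above W_THR becomes a candidate for the policy TP role at its state, while a consistently unhelpful TP is decremented out of the assessment group entirely.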
\nThe possible number of TPs is the full set of Q-values required by conventional Q-learning (one for each possible state and action combination). \n\nThe main advantage of Practical TPDP is that it facilitates rapid learning of preliminary control policies. Figure 3 shows typical routes followed by the car early in the learning process. \n\n[Figures 3 and 4 show the race track with starting and finishing positions marked and typical car routes overlaid.] \n\nFigure 3: Typical Race Track Routes After 300 Epochs \n\nFigure 4: Typical Race Track Routes After 1300 Epochs \n\nWith the addition of relatively few TPs, the policy of accelerating wildly down the track, smashing into the wall and continuing on to the finishing positions was learned. Further learning centered around this preliminary policy led to the optimal policy of sweeping around the left turn. Figure 4 shows typical routes followed by the car during this shift in the learned policy - a shift indicated by a slight drop in the learning curve shown in Figure 2 (around 1300 epochs). After this shift, learning progressed rapidly until roughly optimal policies were consistently followed. \n\nA problem which occurs in Practical TPDP is that of the addition of superfluous TPs after the optimal policy has basically been learned. The reasons this occurs are described in Buckland (1994), as well as a number of solutions to the problem. \n\n5 CONCLUSION \n\nThe practical form of TPDP performs very well when compared to conventional Q-learning. When applied to a race track problem it was able to learn optimal policies more quickly while using less memory. Like Q-learning, TPDP has all the 
advantages and disadvantages that result from it being a direct control approach that develops no explicit system model (Watkins, 1989, Buckland, 1994). \n\nIn order to take advantage of the sparse memory usage that occurs in TPDP, TPs are best represented by ACAMs (associative content addressable memories, Atkeson, 1989). A localized neural network design which operates as an ACAM and which facilitates Practical TPDP control is described in Buckland et al. (1993) and Buckland (1994). \n\nThe main idea of TPDP is to \"try this for a while and see what happens\". This is a potentially powerful approach, and the use of TPs associated with abstracted control actions could be found to have substantial utility in hierarchical control systems. \n\nAcknowledgements \n\nThanks to John Ip for his help on this work. This work was supported by an NSERC Postgraduate Scholarship, and NSERC Operating Grant A4922. \n\nReferences \n\nAtkeson, C. G. (1989), \"Learning arm kinematics and dynamics\", Annual Review of Neuroscience, vol. 12, 1989, pp. 157-183. \n\nBarto, A. G., S. J. Bradtke and S. P. Singh (1991), \"Real-time learning and control using asynchronous dynamic programming\", COINS Technical Report 91-57, University of Massachusetts, Aug. 1991. \n\nBuckland, K. M. and P. D. Lawrence (1993), \"A connectionist approach to direct dynamic programming control\", Proc. of the IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, Victoria, 1993, vol. 1, pp. 284-287. \n\nBuckland, K. M. (1994), Optimal Control of Dynamic Systems Through the Reinforcement Learning of Transition Points, Ph.D. Thesis, Dept. of Electrical Engineering, University of British Columbia, 1994. \n\nChapman, D. and L. P. Kaelbling (1991), \"Input generalization in delayed reinforcement-learning: an algorithm and performance comparisons\", Proc. of the 12th Int. Joint Conf. on Artificial Intelligence, Sydney, Aug. 1991, pp. 726-731. \n\nJaakkola, T., M. I. Jordan and S. P. 
Singh (1994), \"Stocha'ltic convergence of iter(cid:173)\native DP algorithms\", A dvances in N eM'al Information Processing Systems 6, eds.: \nJ. D. Cowen, G. Tesauro and J. Alspector, San Francisco, CA: Morgan Kaufmann \nPublishers, 1994. \nMoore, A. W. (1991), \"Variable resolution dynamic programming: efficiently learn(cid:173)\ning action maps in multivariate real-valued state-spaces\", Machine Learning: Proc. \nof the 8th Int. Workshop, San Mateo, CA: Morgan Kaufmann Publishers, 1991. \n\nWatkins, C. J. C. H. (1989), Learning from Delayed Rewards, Ph.D. Thesis, Cam(cid:173)\nbridge University, Cambridge, England, 1989. \n\nWatkins, C. J. C. H. and P. Dayan (1992), \"Q-Iearning\", Machine Learning, vol. 8, \n1992, pp. 279-292. \n\n\f", "award": [], "sourceid": 848, "authors": [{"given_name": "Kenneth", "family_name": "Buckland", "institution": null}, {"given_name": "Peter", "family_name": "Lawrence", "institution": null}]}