{"title": "On-Line Estimation of the Optimal Value Function: HJB- Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 319, "page_last": 326, "abstract": null, "full_text": "On-Line Estimation of the Optimal Value \n\nFunction: HJB-Estimators \n\nJames K. Peterson \n\nDepartment of Mathematical Sciences \n\nMartin Hall Box 341907 \n\nClemson University \n\nClemson, SC 29634-1907 \n\nemail: petersonOmath. clemson. edu \n\nAbstract \n\nIn this paper, we discuss on-line estimation strategies that model \nthe optimal value function of a typical optimal control problem. \nWe present a general strategy that uses local corridor solutions \nobtained via dynamic programming to provide local optimal con(cid:173)\ntrol sequence training data for a neural architecture model of the \noptimal value function. \n\nION-LINE ESTIMATORS \n\nIn this paper, the problems of adaptive control using neural architectures are ex(cid:173)\nplored in the setting of general on-line estimators. 'Ve will try to pay close attention \nto the underlying mathematical structure that arises in the on-line estimation pro(cid:173)\ncess. \n\nThe complete effect of a control action Uk at a given time step t/.; is clouded by \nthe fact that the state history depends on the control actions taken after time \nstep tk' So the effect of a control action over all future time must be monitored. \nHence, choice of control must inevitably involve knowledge of the future history \nof the state trajectory. In other words, the optimal control sequence can not be \ndetermined until after the fact. Of course, standard optimal control theory supplies \nan optimal control sequence to this problem for a variety of performance criteria. 
\nRoughly, there are two approaches of interest: solving the two-point boundary value \n\n319 \n\n\f320 \n\nPeterson \n\nproblem arising from the solution of Pontryagin 's maximum or minimum principle or \nsolving the Hamilton-J acobi-Bellman (HJB) partial differential equation. However, \nthe computational burdens associated with these schemes may be too high for real(cid:173)\ntime use. \nIs it possible to essentially use on-line estimation to build a solution \nto either of these two classical techniques at a lower cost? In other words, if TJ \nsamples are taken of the system from some initial point under some initial sequence \nof control actions, can this time series be use to obtain information about the true \noptimal sequence of controls that should be used in the next TJ time steps? \nWe will focus here on algorithm designs for on-line estimation of the optimal con(cid:173)\ntrol law that are implement able in a control step time of 20 milliseconds or less. \nvVe will use local learning methods such as CMAC (Cerebellar Model Articulated \nControllers) architectures (Albus, 1 and W. Miller, 7), and estimators for character(cid:173)\nizations of the optimal value function via solutions of the Hamilton-Jacobi-Bellman \nequation, (adaptive critic type methods), (Barto, 2; Werbos, 12). \n\n2 CLASSICAL CONTROL STRATEGIES \n\nIn order to discuss on-line estimation schemes based on the Hamilton- Jacobi(cid:173)\nBellman equation, we now introduce a common sample problem: \n\nJ(x, u, t) \n\nmm \nuEU \n\nwhere \n\nSubject to: \n\nJ(x, u, t) \n\ndist(y(tf), r) + t \n\ni t! 
\n\nL(y(s), u(s), s) ds \n\ny'(s) \ny(t) \ny(s) \nu(s) \n\nf(y(s), u(s), s), t::; s ::; tf \nx \n\nE y ( s) ~ RN , t::; s ::; t f \nE U ( s) ~ RM, t::; s ::; t f \n\n(1) \n\n(2) \n\n(3) \n(4) \n(5) \n(6) \n\nHere y and u are the state vector and control vector of the system, respectively; U is \nthe space of functions that the control must be chosen from during the minimization \nprocess and ( 4) - ( 6) give the initialization and constraint conditions that the \nstate and control must satisfy. The set r represents a target constraint set and \ndist(y(tf), r) indicates the distance from the final state y(tf) to the constraint set \nr. The optimal value of this problem for t.he initial state x and time t will be \ndenoted by J(x, t) where \n\nJ(x, t) \n\nminJ(x,u,t). \nu \n\n\fOn-Line Estimation of the Optimal Value Function: HJB-Estimators \n\n321 \n\nIt is well known that the optimal value function J(x, t) satisfies a generalized partial \ndifferential equation known as the Hamilton-J acobi-Bellman (HJB) equation. \n\naJ(x, t) \n\nat \n\nJ(x,t,) \n\n) aJ(x, t) ( \n\n. {( \n\nm~n L x, u, t + ax \ndist(x, f) \n\nI x, u, t \n\n)} \n\nIn the case that J is indeed differentiable with respect to both the state and time \narguments, this equation is interpreted in the usual way. However, there are many \nproblems where the optimal value function is not differentiable, even though it \nis bounded and continuous. In these cases, the optimal value function J can be \ninterpreted as a viscosity solution of the HJB equation and the partial derivatives \nof J are replaced by the sub and superdifferentials of J (Crandall, 5). In general, \nonce the HJB equation is solved, the optimal control from state x and time t is then \ngiven by the minimum condition \n\nU E argm~n L(x,u,t)+ \n\n. 
{ \n\naJ(x,t) ( \n\nax \n\nI x,u,t \n\n)} \n\nIf the underlying state and time space are discretized using a state mesh of resolution \nr and a time mesh of resolution s, the HJB equation can be rewritten into the form \nof the standard Bellman Principle of Optimality (BPO): \n\nwhere X(Xi, u) indicates the new state achieved by using control u over time interval \n[tj,tj+d from initial state Xi. \nIn practice, this equat.ion is solved by successive \niterations of the form: \n\nwhere T denotes the iteration cycle and the process is started by initializing \nJ~~ (Xi, tj) in a suitable manner. Generally, the iterations continue until the values \nJ;tl(Xi,tj) and J;tl(Xi,tj) differ by negligible amounts. This iterative process is \nusually referred to as dynamic programming (DP). Once this iterative process con(cid:173)\nverges, let Jr~(Xi,tj) = limT->ooJ:~, and consider linl(r,s)->(O,O) Jrs(xi,tj), where \n(xi, tj) indicates that the discrete grid points depend on the resolution (r, s). In \nmany situations, this limit gives the viscosity solution J(x, t) to the HJB equation. \nNow consider the problem of finding J(x,O). The Pontrya.gin minimum principle \ngives first order necessary conditions that the optimal state x and costate p variables \nmust satisfy. Letting fl(x, u, p, t) = L(x, u, t) + pT I(x, u, t) and defining \n\n\f322 \n\nPeterson \n\nH(x,p, t) \n\nmin H(x, u, p, t), \nu \n\n(7) \n\nthe optimal state and costate then must satisfy the following two-point boundary \nvalue problem (TPBVP): \n\n'(t) - oH(x,p,t) \nx \n, \nx(O) = x, \n\nop \n\n-\n\np'(t) = _ aH~;p,t) \np(tj) = 0 \n\n(8) \n\nand the optimal control is obtained from ( 7) once the optimal state and costate \nare determined. Note that ( 7) can not necessarily be solved for the control u in \nterms of x and p, i.e. a feedback law may not be possible. If the TPBVP can \nnot be solved, then we set J(x,O) = 00. 
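The successive BPO iterations above can be made concrete with a small sketch. This is a minimal illustration only, assuming a hypothetical scalar plant f(x, u) = u, a quadratic running cost, target set \Gamma = {0}, and arbitrary mesh sizes (none of these come from the paper); for a finite horizon, a single backward sweep in time produces the converged values of the BPO recursion.

```python
import numpy as np

# Hypothetical scalar problem (illustrative stand-ins, not the paper's system).
def f(x, u):          # plant dynamics y' = f(y, u)
    return u

def L(x, u):          # running cost L(y, u)
    return x**2 + 0.1 * u**2

def dist(x):          # terminal penalty dist(y(t_f), Gamma) with Gamma = {0}
    return abs(x)

xs = np.linspace(-1.0, 1.0, 41)   # state mesh, resolution r
us = np.linspace(-1.0, 1.0, 21)   # discretized control set
dt = 0.05                         # time mesh, resolution s
n_steps = 20

# J[j, i] approximates J_rs(x_i, t_j); impose the terminal condition at t_f.
J = np.zeros((n_steps + 1, xs.size))
J[-1] = dist(xs)

for j in range(n_steps - 1, -1, -1):      # backward sweep = converged BPO iteration
    for i, x in enumerate(xs):
        # One-step lookahead over all controls; np.interp assigns the off-grid
        # new state X(x_i, u) a value on the coarse grid (edge values clamp).
        costs = [L(x, u) * dt + np.interp(x + f(x, u) * dt, xs, J[j + 1])
                 for u in us]
        J[j][i] = min(costs)
```

From the origin the sketch correctly reports zero cost-to-go, since u = 0 holds the state on the target set at no running cost.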
In conclusion, in this problem we are led inevitably to an optimal value function that can be poorly behaved; hence, we can easily imagine that at many (x, t), \partial J / \partial x is not available and hence J will not satisfy the HJB equation in the usual sense. So if we estimate J directly using some form of on-line estimation, how can we hope to back out the control law if \partial J / \partial x is not available? \n\n3 HJB ESTIMATORS \n\nA potential on-line estimation technique can be based on approximations of the optimal value function. Since the optimal value function should satisfy the HJB equation, these methods will be grouped under the broad classification HJB estimators. \n\nAssume that there is a given initial state x_0 with start time 0. Consider a local patch, or local corridor, of the state space around the initial state x_0, denoted by \Omega(x_0). The exact size of \Omega(x_0) will depend on the nature of the state dynamics and the starting state. If \Omega(x_0) is then discretized using a coarse grid of resolution r and the time domain is discretized using resolution s, an approximate dynamic programming problem can be formulated and solved using the BPO equations. Since the new states obtained via integration of the plant dynamics will in general not land on coarse grid lines, some sort of interpolation must be used to assign the integrated new state an appropriate coarse grid value. This can be done using the coarse encoding implied by the grid resolution r of \Omega(x_0). In addition, multiple grid resolutions may be used, with coarse and fine grid approximations interacting with one another as in multigrid schemes (Briggs, 3). The optimal value function so obtained will be denoted by J_{rs}(z_i, t_j) for any discrete grid point z_i \in \Omega(x_0) and time point t_j. 
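The coarse encoding of off-grid integrated states mentioned above can be sketched as follows. The one-dimensional state, the corridor radius and resolution, and the helper names `corridor_grid` and `coarse_weights` are all illustrative assumptions, not constructs from the paper; the idea is simply to spread an integrated state over its bracketing coarse grid nodes with linear interpolation weights.

```python
import numpy as np

def corridor_grid(x0, radius=0.5, r=0.1):
    # Hypothetical local corridor Omega(x0): a coarse mesh of resolution r
    # centered on the starting state x0.
    return np.arange(x0 - radius, x0 + radius + 1e-9, r)

def coarse_weights(z, grid):
    # Assign an off-grid integrated state z to its two bracketing grid
    # nodes with linear interpolation weights (a simple coarse encoding).
    k = int(np.clip(np.searchsorted(grid, z) - 1, 0, grid.size - 2))
    w = (z - grid[k]) / (grid[k + 1] - grid[k])
    return (k, 1.0 - w), (k + 1, w)

grid = corridor_grid(0.0)
(i0, w0), (i1, w1) = coarse_weights(0.13, grid)
# the grid value of z is then w0 * J[i0] + w1 * J[i1]
```

A CMAC's overlapping receptive fields give a similar many-node encoding; this two-node version is the minimal case.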
This approximate solution also supplies an estimate of the optimal control sequence (u^*)_j^{\eta - 1} = (u^*)_j^{\eta - 1}(z_i, t_j). Some papers on approximate dynamic programming are (Peterson, 8; Sutton, 10; Luus, 6). It is also possible to obtain estimates of the optimal control sequences, states and costates using an \eta step look-ahead and the Pontryagin minimum principle. The associated two-point boundary value problem is solved and the controls computed via u_i^* \in argmin_u H(x_i^*, u, p_i^*, t_i), where (x^*) and (p^*) are the calculated optimal state and costate sequences, respectively. This approach is developed in (Peterson, 9) and implemented for vibration suppression in a large space structure by (Carlson, Rothermel and Lee, 4). \n\nFor any z_i \in \Omega(x_0), let (u)_j^{\eta - 1} = (u)_j^{\eta - 1}(z_i, t_j) be a control sequence used from initial state z_i and time point t_j. Thus u_{ij} is the control used on time interval [t_j, t_{j+1}] from start point z_i. Define z_{ij}^{j+1} = z(z_i, u_{ij}, t_j), the state obtained by integrating the plant dynamics one time step using control u_{ij} and initial state z_i. Then u_{i,j+1} is the control used on time interval [t_{j+1}, t_{j+2}] from start point z_{ij}^{j+1}, and the new state is z_{ij}^{j+2} = z(z_{ij}^{j+1}, u_{i,j+1}, t_{j+1}); in general, u_{i,j+k} is the control used on time interval [t_{j+k}, t_{j+k+1}] from start point z_{ij}^{j+k}, and the new state is z_{ij}^{j+k+1} = z(z_{ij}^{j+k}, u_{i,j+k}, t_{j+k}), where z_{ij}^{j} = z_i. \n\nLet's now assume that optimal control information u_{ij} (we will dispense with the superscript * labeling for expositional cleanness) is available at each of the discrete grid points (z_i, t_j) \in \Omega(x_0). Let \Phi_{rs}(z_i, t_j) denote the value of a neural architecture (CMAC, feedforward, associative, etc.) which is trained as follows using this optimal information for 0 <= k <= \eta - j - 1 (the equation below holds for the converged value of the network's parameters, and the actual dependence of the network on those parameters is notationally suppressed): \n\n\Phi_{rs}(z_{ij}^{j+k}, t_{j+k}) = \xi \Phi_{rs}(z_{ij}^{j+k+1}, t_{j+k+1}) + \zeta R(z_{ij}^{j+k}, u_{i,j+k}),  (9) \n\nwhere 0 < \xi, \zeta <= 1 and we define a typical reinforcement function R by \n\nR(z_{ij}^{j+k}, u_{i,j+k}) = L(z_{ij}^{j+k}, u_{i,j+k}, t_{j+k})(t_{j+k+1} - t_{j+k}),  if 0 <= k < \eta - j - 1,  (10) \nR(z_{ij}^{j+k}, u_{i,j+k}) = L(z_{ij}^{j+k}, u_{i,j+k}, t_{j+k})(t_{j+k+1} - t_{j+k}) + dist(z_{ij}^{j+k+1}, \Gamma),  if k = \eta - j - 1.  (11) \n\nFor notational convenience, we will now drop the notational dependence on the time grid points and simply refer to the reinforcement by R(z_{ij}^{j+k}, u_{i,j+k}). Then applying (9) repeatedly, for any 0 <= p <= \eta - j, \n\n\Phi_{rs}(z_{ij}^{j}, t_j) = \xi^p \Phi_{rs}(z_{ij}^{j+p}, t_{j+p}) + \zeta \sum_{k=0}^{p-1} \xi^k R(z_{ij}^{j+k}, u_{i,j+k}).  (12) \n\nThus, the function W_{rs} can be defined by \n\nW_{rs}(z_i, t_j, \xi, \zeta) = \xi^{\eta - j} \Phi_{rs}(z_{ij}^{\eta}, t_{\eta}) + \zeta \sum_{k=0}^{\eta - j - 1} \xi^k R(z_{ij}^{j+k}, u_{i,j+k}), \n\nwhere the term u_{ij}^{\eta} will be interpreted as u_{i,\eta-1}. It follows then, since u_{ij} is optimal, that the function \Phi_{rs}(z_i, t_j) = W_{rs}(z_i, t_j, 1, 1) estimates the optimal value J_{rs}(z_i, t_j) itself. (See Q-Learning (Watkins, 11).) \n\nAn alternate approach that does not model J indirectly, as is done above, is to train a neural model \Phi_{rs}(z_i, t_j) directly on the data J_{rs}(z_i, t_j) that is computed in each local corridor calculation. In either case, the above observations lead to the following algorithm: \n\nInitialization: \n\nHere, the iteration count is \tau = 0. For a given starting state x_0 and local look-ahead of \eta time steps, form the local corridor \Omega(x_0) and solve the associated approximate BPO equation for J_{rs}(z_i, t_j). 
Compute the associated optimal control sequences for each (z_i, t_j) pair, (u^*)_j^{\eta - 1} = (u^*)_j^{\eta - 1}(z_i, t_j). Initialize the neural architecture for the optimal value estimate using \Phi_{rs}^{0}(z_i, t_j) = J_{rs}(z_i, t_j). \n\nEstimate of New Optimal Control Sequence: \n\nFor the next \eta time steps, an estimate must be made of the next optimal control action in each time interval [t_{\eta+k}, t_{\eta+k+1}]. The initial state is any z_i in \Omega(x_\eta) (x_\eta is one such choice) and the initial time is t_\eta. For the time interval [t_\eta, t_{\eta+1}], if the model \Phi_{rs}^{0}(z_i, t_j) is differentiable, the new control can be estimated by \n\nu_{\eta+1} \in argmin_u { L(z_\eta, u, t_\eta)(t_{\eta+1} - t_\eta) + (\partial \Phi_{rs} / \partial x)(z_\eta, t_\eta) f(z_\eta, u, t_\eta)(t_{\eta+1} - t_\eta) }. \n\nFor ease of notation, let z_{\eta+1} denote the new state obtained using the control u_{\eta+1} on the interval [t_\eta, t_{\eta+1}]; then choose the next control via the same minimum condition. Clearly, if z_{\eta+k} denotes the new state obtained using the control u_{\eta+k-1} on the interval [t_{\eta+k-1}, t_{\eta+k}], the next control is chosen to satisfy \n\nu_{\eta+k+1} \in argmin_u { L(z_{\eta+k}, u, t_{\eta+k})(t_{\eta+k+1} - t_{\eta+k}) + (\partial \Phi_{rs} / \partial x)(z_{\eta+k}, t_{\eta+k}) f(z_{\eta+k}, u, t_{\eta+k})(t_{\eta+k+1} - t_{\eta+k}) }. \n\nAlternately, if the neural architecture is not differentiable (that is, \partial \Phi_{rs} / \partial x is not available), the new control action can be computed via \n\nu_{\eta+k+1} \in argmin_u { L(z_{\eta+k}, u, t_{\eta+k})(t_{\eta+k+1} - t_{\eta+k}) + \Phi_{rs}(z(z_{\eta+k}, u, t_{\eta+k}), t_{\eta+k+1}) }. \n\nUpdate of the Neural Estimator: \n\nThe new starting point for the dynamics is now x_\eta and there is a new associated local corridor \Omega(x_\eta). The neural estimator is then updated using either the HJB or the BPO equations over the local corridor \Omega(x_\eta). Using the BPO equations, for all z_i \in \Omega(x_\eta) the updates are: \n\n\Phi_{rs}(z_i, t_{\eta+j}) = \Phi_{rs}(z(z_i, \hat{u}_{i,j}, t_{\eta+j}), t_{\eta+j+1}) + L(z_i, \hat{u}_{i,j}, t_{\eta+j})(t_{\eta+j+1} - t_{\eta+j}), \n\nwhere (\hat{u})_j^{\eta - 1} indicates the optimal control estimates obtained in the previous algorithm step. 
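The derivative-free control choice used when \partial \Phi / \partial x is unavailable can be sketched as a one-step lookahead through the trained model. The plant f, cost L, discretized control set, and the stand-in model `phi_next` below are illustrative assumptions rather than the paper's architecture; any trained estimator \Phi_{rs}(., t_{\eta+k+1}) could play the role of `phi_next`.

```python
import numpy as np

# Hypothetical setup (illustrative stand-ins, matching the sample problem's form).
def f(x, u):          # plant dynamics
    return u

def L(x, u):          # running cost
    return x**2 + 0.1 * u**2

def greedy_control(phi_next, x, us, dt):
    # Derivative-free selection: evaluate the one-step cost plus the model's
    # estimate of the cost-to-go at the integrated new state, then take the
    # minimizing control from the discretized control set.
    costs = [L(x, u) * dt + phi_next(x + f(x, u) * dt) for u in us]
    return float(us[int(np.argmin(costs))])

us = np.linspace(-1.0, 1.0, 21)
u = greedy_control(lambda z: z**2, x=0.5, us=us, dt=0.05)
```

This mirrors the minimum condition of the algorithm without ever differentiating the estimator, which is what makes non-smooth architectures such as CMACs usable here.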
Finally, using the HJB equation, for all z_i \in \Omega(x_\eta) the updates are: \n\n\Phi_{rs}(z_i, t_{\eta+j}) = \Phi_{rs}(z_i, t_{\eta+j+1}) + min_u { L(z_i, u, t_{\eta+j})(t_{\eta+j+1} - t_{\eta+j}) + (\partial \Phi_{rs} / \partial x)(z_i, t_{\eta+j}) f(z_i, u, t_{\eta+j})(t_{\eta+j+1} - t_{\eta+j}) }. \n\nComparison to BPO optimal control sequence: \n\nNow solve the associated approximate BPO equation for each z_i in the local corridor \Omega(x_\eta) for J_{rs}(z_i, t_{\eta+j}). Compute the new approximate optimal control sequences for each (z_i, t_{\eta+j}) pair, (u^*)_{\eta+j}^{2\eta-1} = (u^*)_{\eta+j}^{2\eta-1}(z_i, t_{\eta+j}), and compare them to the estimated sequences (\hat{u})_{\eta+j}^{2\eta-1}. If the discrepancy is out of tolerance (this is a design decision), initialize the neural architecture for the optimal value estimate using \Phi_{rs}^{0}(z_i, t_{\eta+j}) = J_{rs}(z_i, t_{\eta+j}). If the discrepancy is acceptable, terminate the BPO approximation calculations for M future iterations and use the neural architectures alone for on-line estimation. \n\nThe determination of the stability and convergence properties of any on-line approximation procedure of this sort is intimately connected with the optimal value function which solves the generalized HJB equation. We conjecture that the following limit converges to a viscosity solution of the HJB equation for the given optimal control problem: \n\nlim_{(r,s) -> (0,0)} \Phi_{rs}(x, t) = J(x, t). \n\nFurther, there are stability questions, and there are interesting issues relating to the use of multiple state resolutions r_1 and r_2 and the corresponding different approximations to J, leading to the use of multigrid-like methods on the HJB equation (see, for example, Briggs, 3). Also note that there is an advantage to using CMAC architectures for the approximation of the optimal value function J; since J need not be smooth, the CMAC's lack of differentiability with respect to its inputs is not a problem and in fact is a virtue. \n\nAcknowledgements \n\nWe acknowledge the partial support of NASA grant NAG 3-1311 from the Lewis Research Center. \n\nReferences \n\n1. 
Albus, J. 1975. "A New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC)." J. Dynamic Systems, Measurement and Control, 220 - 227. \n\n2. Barto, A., R. Sutton, C. Anderson. 1983. "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems." IEEE Trans. Systems, Man and Cybernetics, Vol. SMC-13, No. 5, September/October, 834 - 846. \n\n3. Briggs, W. 1987. A Multigrid Tutorial, SIAM, Philadelphia, PA. \n\n4. Carlson, R., C. Lee and K. Rothermel. 1992. "Real Time Neural Control of an Active Structure", Artificial Neural Networks in Engineering 2, 623 - 628. \n\n5. Crandall, M. and P. Lions. 1983. "Viscosity Solutions of Hamilton-Jacobi Equations." Trans. American Math. Soc., Vol. 277, No. 1, 1 - 42. \n\n6. Luus, R. 1990. "Optimal Control by Dynamic Programming Using Systematic Reduction of Grid Size", Int. J. Control, Vol. 51, No. 5, 995 - 1013. \n\n7. Miller, W. 1987. "Sensor-Based Control of Robotic Manipulators Using a General Learning Algorithm." IEEE J. Robot. Automat., Vol. RA-3, No. 2, 157 - 165. \n\n8. Peterson, J. 1992. "Neural Network Approaches to Estimating Directional Cost Information and Path Planning in Analog Valued Obstacle Fields", HEURISTICS: The Journal of Knowledge Engineering, Special Issue on Artificial Neural Networks, Vol. 5, No. 2, Summer, 50 - 61. \n\n9. Peterson, J. 1992. "On-Line Estimation of Optimal Control Sequences: Pontryagin Estimators", Artificial Neural Networks in Engineering 2, ed. Dagli et al., 579 - 584. \n\n10. Sutton, R. 1991. "Planning by Incremental Dynamic Programming", Proceedings of the Ninth International Workshop on Machine Learning, 353 - 357. \n\n11. Watkins, C. 1989. Learning From Delayed Rewards, Ph.D. Dissertation, King's College. \n\n12. Werbos, P. 1990. "A Menu of Designs for Reinforcement Learning Over Time". In Neural Networks for Control, Ed. Miller, W., R. Sutton and P. Werbos, 67 - 96. ", "award": [], "sourceid": 648, "authors": [{"given_name": "James", "family_name": "Peterson", "institution": null}]}