{"title": "On-Line Estimation of the Optimal Value Function: HJB-Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 319, "page_last": 326, "abstract": null, "full_text": "On-Line Estimation of the Optimal Value Function: HJB-Estimators

James K. Peterson
Department of Mathematical Sciences
Martin Hall Box 341907
Clemson University
Clemson, SC 29634-1907
email: peterson@math.clemson.edu

Abstract

In this paper, we discuss on-line estimation strategies that model the optimal value function of a typical optimal control problem. We present a general strategy that uses local corridor solutions obtained via dynamic programming to provide local optimal control sequence training data for a neural architecture model of the optimal value function.

1 ON-LINE ESTIMATORS

In this paper, the problems of adaptive control using neural architectures are explored in the setting of general on-line estimators. We will try to pay close attention to the underlying mathematical structure that arises in the on-line estimation process.

The complete effect of a control action u_k at a given time step t_k is clouded by the fact that the state history depends on the control actions taken after time step t_k. So the effect of a control action over all future time must be monitored. Hence, the choice of control must inevitably involve knowledge of the future history of the state trajectory. In other words, the optimal control sequence can not be determined until after the fact. Of course, standard optimal control theory supplies an optimal control sequence to this problem for a variety of performance criteria.
Roughly, there are two approaches of interest: solving the two-point boundary value problem arising from Pontryagin's maximum (or minimum) principle, or solving the Hamilton-Jacobi-Bellman (HJB) partial differential equation. However, the computational burdens associated with these schemes may be too high for real-time use. Is it possible to essentially use on-line estimation to build a solution to either of these two classical techniques at a lower cost? In other words, if η samples are taken of the system from some initial point under some initial sequence of control actions, can this time series be used to obtain information about the true optimal sequence of controls that should be used in the next η time steps?

We will focus here on algorithm designs for on-line estimation of the optimal control law that are implementable in a control step time of 20 milliseconds or less. We will use local learning methods such as CMAC (Cerebellar Model Articulated Controller) architectures (Albus, 1; W. Miller, 7), and estimators for characterizations of the optimal value function via solutions of the Hamilton-Jacobi-Bellman equation (adaptive critic type methods) (Barto, 2; Werbos, 12).

2 CLASSICAL CONTROL STRATEGIES

In order to discuss on-line estimation schemes based on the Hamilton-Jacobi-Bellman equation, we now introduce a common sample problem:

    min_{u ∈ U} J(x, u, t)                                                  (1)

where

    J(x, u, t) = dist(y(t_f), Γ) + ∫_t^{t_f} L(y(s), u(s), s) ds            (2)

Subject to:

    y'(s) = f(y(s), u(s), s),   t ≤ s ≤ t_f                                 (3)
    y(t)  = x                                                               (4)
    y(s) ∈ Y(s) ⊆ R^N,   t ≤ s ≤ t_f                                        (5)
    u(s) ∈ U(s) ⊆ R^M,   t ≤ s ≤ t_f                                        (6)

Here y and u are the state vector and control vector of the system, respectively; U is the space of functions from which the control must be chosen during the minimization process, and (4) - (6) give the initialization and constraint conditions that the state and control must satisfy. The set Γ represents a target constraint set and dist(y(t_f), Γ) indicates the distance from the final state y(t_f) to the constraint set Γ. The optimal value of this problem for the initial state x and time t will be denoted by J(x, t), where

    J(x, t) = min_u J(x, u, t).

It is well known that the optimal value function J(x, t) satisfies a generalized partial differential equation known as the Hamilton-Jacobi-Bellman (HJB) equation:

    -∂J(x, t)/∂t = min_u { L(x, u, t) + (∂J(x, t)/∂x) f(x, u, t) }
    J(x, t_f)    = dist(x, Γ)

In the case that J is indeed differentiable with respect to both the state and time arguments, this equation is interpreted in the usual way. However, there are many problems where the optimal value function is not differentiable, even though it is bounded and continuous. In these cases, the optimal value function J can be interpreted as a viscosity solution of the HJB equation, and the partial derivatives of J are replaced by the sub- and superdifferentials of J (Crandall, 5). In general, once the HJB equation is solved, the optimal control from state x and time t is then given by the minimum condition

    u ∈ argmin_u { L(x, u, t) + (∂J(x, t)/∂x) f(x, u, t) }

If the underlying state and time space are discretized using a state mesh of resolution r and a time mesh of resolution s, the HJB equation can be rewritten into the form of the standard Bellman Principle of Optimality (BPO):

    J_rs(x_i, t_j) = min_u { L(x_i, u, t_j) s + J_rs(X(x_i, u), t_{j+1}) }

where X(x_i, u) indicates the new state achieved by using control u over the time interval [t_j, t_{j+1}] from initial state x_i. In practice, this equation is solved by successive iterations of the form:

    J_rs^{T+1}(x_i, t_j) = min_u { L(x_i, u, t_j) s + J_rs^T(X(x_i, u), t_{j+1}) }

where T denotes the iteration cycle and the process is started by initializing J_rs^0(x_i, t_j) in a suitable manner. Generally, the iterations continue until the values J_rs^{T+1}(x_i, t_j) and J_rs^T(x_i, t_j) differ by negligible amounts. This iterative process is usually referred to as dynamic programming (DP). Once this iterative process converges, let J_rs(x_i, t_j) = lim_{T→∞} J_rs^T(x_i, t_j), and consider lim_{(r,s)→(0,0)} J_rs(x_i, t_j), where (x_i, t_j) indicates that the discrete grid points depend on the resolution (r, s). In many situations, this limit gives the viscosity solution J(x, t) to the HJB equation.

Now consider the problem of finding J(x, 0). The Pontryagin minimum principle gives first order necessary conditions that the optimal state x and costate p variables must satisfy. Letting H(x, u, p, t) = L(x, u, t) + p^T f(x, u, t) and defining

    H(x, p, t) = min_u H(x, u, p, t),                                       (7)

the optimal state and costate then must satisfy the following two-point boundary value problem (TPBVP):

    x'(t) = ∂H(x, p, t)/∂p,      p'(t) = -∂H(x, p, t)/∂x
    x(0)  = x,                   p(t_f) = 0                                 (8)

and the optimal control is obtained from (7) once the optimal state and costate are determined. Note that (7) can not necessarily be solved for the control u in terms of x and p, i.e. a feedback law may not be possible. If the TPBVP can not be solved, then we set J(x, 0) = ∞.
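The discretized BPO recursion above amounts to a backward sweep over a grid, minimizing over a finite set of candidate controls at each point. The following minimal sketch illustrates this on a hypothetical scalar plant; the grid sizes, dynamics, and cost functions are invented for the example and are not taken from the paper. Off-grid successor states are assigned grid values by linear interpolation.

```python
import numpy as np

def dp_backward(xs, ts, us, f, L, terminal, s):
    """One backward DP sweep of the BPO recursion on a 1-D state grid.

    J[i, j] approximates J_rs(x_i, t_j); successor states that fall off
    the grid are assigned values by linear interpolation.
    """
    J = np.zeros((len(xs), len(ts)))
    J[:, -1] = [terminal(x) for x in xs]        # J(x, t_f) = dist(x, Gamma)
    for j in range(len(ts) - 2, -1, -1):        # backward in time
        for i, x in enumerate(xs):
            costs = []
            for u in us:                        # brute-force min over controls
                x_new = x + s * f(x, u, ts[j])  # X(x_i, u): one Euler step
                costs.append(s * L(x, u, ts[j])
                             + np.interp(x_new, xs, J[:, j + 1]))
            J[i, j] = min(costs)
    return J

# Toy scalar plant y' = u with quadratic running cost and |x| terminal cost
# (all names and dynamics here are illustrative, not from the paper).
xs = np.linspace(-1.0, 1.0, 21)
ts = np.linspace(0.0, 1.0, 11)
us = np.linspace(-1.0, 1.0, 9)
J = dp_backward(xs, ts, us, f=lambda x, u, t: u,
                L=lambda x, u, t: x * x + u * u, terminal=abs, s=0.1)
```

For a finite-horizon problem a single backward sweep suffices; the successive-approximation form with iteration index T in the text corresponds to repeating such sweeps until the table stops changing.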
In conclusion, in this problem we are led inevitably to an optimal value function that can be poorly behaved; hence, we can easily imagine that at many (x, t), ∂J/∂x is not available and hence J will not satisfy the HJB equation in the usual sense. So if we estimate J directly using some form of on-line estimation, how can we hope to back out the control law if ∂J/∂x is not available?

3 HJB ESTIMATORS

A potential on-line estimation technique can be based on approximations of the optimal value function. Since the optimal value function should satisfy the HJB equation, these methods will be grouped under the broad classification HJB estimators.

Assume that there is a given initial state x_0 with start time 0. Consider a local patch, or local corridor, of the state space around the initial state x_0, denoted by Ω(x_0). The exact size of Ω(x_0) will depend on the nature of the state dynamics and the starting state. If Ω(x_0) is then discretized using a coarse grid of resolution r and the time domain is discretized using resolution s, an approximate dynamic programming problem can be formulated and solved using the BPO equations. Since the new states obtained via integration of the plant dynamics will in general not land on coarse grid lines, some sort of interpolation must be used to assign the integrated new state an appropriate coarse grid value. This can be done using the coarse encoding implied by the grid resolution r of Ω(x_0). In addition, multiple grid resolutions may be used, with coarse and fine grid approximations interacting with one another as in multigrid schemes (Briggs, 3). The optimal value function so obtained will be denoted by J_rs(z_i, t_j) for any discrete grid point z_i ∈ Ω(x_0) and time point t_j.
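The paper's overall strategy is to train a local-learning architecture such as a CMAC on (state, cost-to-go) pairs produced by corridor dynamic programming. The sketch below is a minimal tile-coding approximator in that spirit; the class name, parameters, and training data are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class CMAC:
    """Minimal tile-coding (CMAC-style) approximator for a value estimate.

    Several staggered grids cover the state box; each input activates one
    cell per grid, and the prediction is the sum of the active weights.
    Illustrative sketch only, not the architecture used in the paper.
    """
    def __init__(self, lows, highs, n_tilings=8, n_bins=10, lr=0.5):
        self.lows = np.asarray(lows, dtype=float)
        self.highs = np.asarray(highs, dtype=float)
        self.n_tilings, self.n_bins, self.lr = n_tilings, n_bins, lr
        self.w = {}                                   # sparse weight table

    def _active_cells(self, z):
        frac = (np.asarray(z, dtype=float) - self.lows) / (self.highs - self.lows)
        cells = []
        for t in range(self.n_tilings):
            off = t / (self.n_tilings * self.n_bins)  # stagger each tiling
            idx = np.floor(np.clip(frac + off, 0.0, 1.0 - 1e-12) * self.n_bins)
            cells.append((t,) + tuple(int(k) for k in idx))
        return cells

    def predict(self, z):
        return sum(self.w.get(c, 0.0) for c in self._active_cells(z))

    def learn(self, z, target):
        delta = self.lr * (target - self.predict(z)) / self.n_tilings
        for c in self._active_cells(z):
            self.w[c] = self.w.get(c, 0.0) + delta

# Fit the approximator to a few hypothetical (state, cost-to-go) pairs,
# as might be produced by DP on a local corridor.
model = CMAC(lows=[0.0], highs=[1.0])
for _ in range(100):
    for z, j in [([0.2], 1.5), ([0.8], 0.5)]:
        model.learn(z, j)
```

Because each input activates several overlapping cells, nearby states share weights, which gives the local generalization that makes CMAC-type architectures attractive for the coarse encoding discussed here.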
This approximate solution also supplies an estimate of the optimal control sequence (u*)_j^{η-1} = (u*)_j^{η-1}(z_i, t_j). Some papers on approximate dynamic programming are (Peterson, 8; Sutton, 10; Luus, 6). It is also possible to obtain estimates of the optimal control sequences, states and costates using an η step lookahead and the Pontryagin minimum principle. The associated two point boundary value problem is solved and the controls computed via u_i ∈ argmin_u H(x_i*, u, p_i*, t_i), where (x*) and (p*) are the calculated optimal state and costate sequences, respectively. This approach is developed in (Peterson, 9) and implemented for vibration suppression in a large space structure by (Carlson, Rothermel and Lee, 4).

For any z_i ∈ Ω(x_0), let (u)_j^{η-1} = (u)_j^{η-1}(z_i, t_j) be a control sequence used from initial state z_i and time point t_j. Thus u_ij is the control used on time interval [t_j, t_{j+1}] from start point z_i. Define z_ij^{j+1} = Z(z_i, u_ij, t_j), the state obtained by integrating the plant dynamics one time step using control u_ij and initial state z_i. Then u_{i,j+1} is the control used on time interval [t_{j+1}, t_{j+2}] from start point z_ij^{j+1}, and the new state is z_ij^{j+2} = Z(z_ij^{j+1}, u_{i,j+1}, t_{j+1}); in general, u_{i,j+k} is the control used on time interval [t_{j+k}, t_{j+k+1}] from start point z_ij^{j+k}, and the new state is z_ij^{j+k+1} = Z(z_ij^{j+k}, u_{i,j+k}, t_{j+k}), where z_ij^j = z_i.

Let's now assume that optimal control information u_ij (we will dispense with the superscript * labeling for expositional cleanness) is available at each of the discrete grid points (z_i, t_j) ∈ Ω(x_0).
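The bookkeeping z_ij^{j+k+1} = Z(z_ij^{j+k}, u_{i,j+k}, t_{j+k}) is simply a rollout of a control sequence through the plant dynamics. A minimal sketch, assuming a scalar state and a single Euler integration step per control interval (the function names and step size are illustrative):

```python
def rollout(z_start, controls, t_start, dt, f):
    """Propagate z^{j+k+1} = Z(z^{j+k}, u_{j+k}, t_{j+k}), one Euler step
    per control; Z, f, and the fixed step size dt are assumptions here."""
    traj, t = [float(z_start)], t_start
    for u in controls:
        traj.append(traj[-1] + dt * f(traj[-1], u, t))  # one integration step
        t += dt
    return traj

# Scalar integrator y' = u: three controls yield four trajectory points.
traj = rollout(0.0, [1.0, 1.0, -1.0], t_start=0.0, dt=0.1,
               f=lambda y, u, t: u)
```

Evaluating the running cost along such a trajectory and adding the stored value at the final grid point is what produces the training targets for the value-function model.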
Let Ĵ_rs(z_i, t_j) = J_rs(z_i, t_j).

Estimate of New Optimal Control Sequence:

For the next η time steps, an estimate must be made of the next optimal control action in time interval [t_{η+k}, t_{η+k+1}]. The initial state is any z_i in Ω(x_η) (x_η is one such choice) and the initial time is t_η. For the time interval [t_η, t_{η+1}], if the model