{"title": "Adaptive Choice of Grid and Time in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1036, "page_last": 1042, "abstract": "", "full_text": "Adaptive choice of grid and time \n\n\u2022 \nIn \n\nreinforcement learning \n\nStephan Pareigis \n\nstp@numerik.uni-kiel.de \n\nLehrstuhl Praktische Mathematik \nChristian-Albrechts-U ni versitiit Kiel \n\nKiel, Germany \n\nAbstract \n\nWe propose local error estimates together with algorithms for adap(cid:173)\ntive a-posteriori grid and time refinement in reinforcement learn(cid:173)\ning. We consider a deterministic system with continuous state and \ntime with infinite horizon discounted cost functional. For grid re(cid:173)\nfinement we follow the procedure of numerical methods for the \nBellman-equation. For time refinement we propose a new criterion, \nbased on consistency estimates of discrete solutions of the Bellman(cid:173)\nequation. We demonstrate, that an optimal ratio of time to space \ndiscretization is crucial for optimal learning rates and accuracy of \nthe approximate optimal value function. \n\n1 \n\nIntroduction \n\nReinforcement learning can be performed for fully continuous problems by discretiz(cid:173)\ning state space and time, and then performing a discrete algorithm like Q-Iearning \nor RTDP (e.g. [5]). Consistency problems arise if the discretization needs to be \nrefined, e.g. for more accuracy, application of multi-grid iteration or better starting \nvalues for the iteration of the approximate optimal value function. In [7] it was \nshown, that for diffusion dominated problems, a state to time discretization ratio \nk/ h of Ch'r, I > 0 has to hold, to achieve consistency (i.e. k = o(h)). It can be \nshown, that for deterministic problems, this ratio must only be k / h = C, C a con(cid:173)\nstant, to get consistent approximations of the optimal value function. 
The choice of the constant C is crucial for fast learning rates, optimal use of computer memory resources, and accuracy of the approximation. \n\nWe suggest a procedure involving local a-posteriori error estimation for grid refinement, similar to the one used in numerical schemes for the Bellman equation (see [4]). For the adaptive time discretization we use a combination of step size control for ordinary differential equations and calculations for the rates of convergence of fully discrete solutions of the Bellman equation (see [3]). We explain how both methods can be combined and applied to Q-learning. A simple numerical example shows the effects of a suboptimal state space to time discretization ratio, and provides an insight into the problems of coupling both schemes. \n\n2 Error estimation for adaptive choice of grid \n\nWe want to approximate the optimal value function V: Ω → ℝ in a state space Ω ⊂ ℝ^d of the following problem: Minimize \n\nJ(x, u(·)) := ∫_0^∞ e^{-ρτ} g(y_{x,u(·)}(τ), u(τ)) dτ,  u(·): ℝ⁺ → A measurable,  (1) \n\nwhere g: Ω × A → ℝ⁺ is the cost function, and y_{x,u(·)}(·) is the solution of the differential equation \n\nẏ(t) = f(y(t), u(t)),  y(0) = x.  (2) \n\nAs a trial space for the approximation of the optimal value function (or Q-function) we use locally linear elements on simplices S_i, i = 1, ..., N_s, which form a triangulation of the state space, N_s the number of simplices. The vertices shall be called x_i, i = 1, ..., N, N the dimension of the trial space¹. This approach has been used in numerical schemes for the Bellman equation ([2], [4]). We will first assume that the grid is fixed and has a discretization parameter \n\nk = max_i diam{S_i}. 
\n\nIn contrast to the numerical case, where the updates are performed in the vertices of the triangulation, in reinforcement learning only observed information is available. We will assume that in one time step of size h > 0, we obtain the following information: \n\n• the current state y_n ∈ Ω, \n• an action a_n ∈ A, \n• the subsequent state y_{n+1} := y_{y_n,a_n}(h), \n• the local cost r_n = r(y_n, a_n) = ∫_0^h e^{-ρτ} g(y_{y_n,a_n}(τ), a_n) dτ. \n\nThe state y_n, in which an update is to be made, may be any state in Ω. A shall be finite, and a_n locally constant. \nThe new value of the fully discrete Q-function Q_h^k(y_n, a_n) shall be set to the update value \n\nP_h(y_n, a_n, V_h^k) := r_n + e^{-ρh} V_h^k(y_{n+1}),  (3) \n\nwhere V_h^k(y_{n+1}) = min_a Q_h^k(y_{n+1}, a). We will update Q_h^k in the vertices {x_i}_{i=1}^N of the triangulation in one of the following two ways: \n\n¹ When an adaptive grid is used, then N_s and N depend on the refinement. \n\nKaczmarz-update. Let λᵀ = (λ_1, ..., λ_N) be the vector of barycentric coordinates, such that \n\ny_n = Σ_{i=1}^N λ_i x_i,  0 ≤ λ_i ≤ 1, for all i = 1, ..., N. \n\nThen update according to (4). \n\nKronecker-update. Let S ∋ y_n and x be the vertex of S closest to y_n (if there is a draw, then the update can be performed in all winners). Then update Q_h^k only in x according to (5). \n\nEach method has some assets and drawbacks. In our computer simulations the Kaczmarz-update seemed to be more stable than the Kronecker-update (see [6]). However, examples may be constructed where a (Hölder-) continuous bounded optimal value function V is to be approximated, and the Kaczmarz-update produces an approximation with arbitrarily large sup-norm (place a vertex x of the triangulation in a point where the derivative of V is infinite, and use as update states the vertex x in turn with an arbitrarily close state). \nKronecker-update will provide a bounded approximation if V is bounded. 
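The two update rules can be sketched in code as follows. This is a minimal Python sketch: since the explicit form of (4) is not reproduced here, the Kaczmarz step below, which distributes the temporal-difference error over the simplex vertices with barycentric weights, is one plausible reading and an assumption, as are all function and variable names.

```python
import math

def update_value(r_n, rho, h, v_next):
    # Update value of eq. (3): P_h(y_n, a_n, V_h^k) = r_n + e^{-rho h} V_h^k(y_{n+1}).
    return r_n + math.exp(-rho * h) * v_next

def kaczmarz_update(Q, a, lam, verts, target):
    # Distribute the temporal-difference error over the simplex vertices,
    # weighted by the barycentric coordinates lam (assumed form of eq. (4);
    # the unit step size is also an assumption).
    q_yn = sum(l * Q[i][a] for l, i in zip(lam, verts))  # linear interpolation at y_n
    err = target - q_yn
    for l, i in zip(lam, verts):
        Q[i][a] += l * err

def kronecker_update(Q, a, lam, verts, target):
    # Update only the vertex closest to y_n, i.e. the one with the largest
    # barycentric coordinate (eq. (5)).
    j = max(range(len(lam)), key=lambda m: lam[m])
    Q[verts[j]][a] = target
```

Under this reading, one Kaczmarz step moves the interpolated value at y_n toward the target by a factor of Σ_i λ_i² per update, which is consistent with the stability observed for that rule.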
Let V_h^k be the fully discrete optimal value function \n\nV_h^k(x_i) = min_a { r(x_i, a) + e^{-ρh} V_h^k(y_{x_i,a}(h)) },  i = 1, ..., N. \n\nThen it can be shown that an approximation performed by Kronecker-update will eventually be caught in an ε-neighborhood of V_h^k (with respect to the sup-norm), if the data points y_0, y_1, y_2, ... are sufficiently dense. Under regularity conditions on V, ε may be bounded by² \n\n(6) \n\nAs a criterion for grid refinement we choose a form of a local a-posteriori error estimate as defined in [4]. Let v_h^k(x) = min_a Q_h^k(x, a) be the current iterate of the optimal value function. Let a_x ∈ U be the minimizing control a_x = argmin_a Q_h^k(x, a). Then we define \n\ne_h(x) := |P_h(x, a_x, v_h^k) - v_h^k(x)|.  (7) \n\nIf V_h^k is in the ε-neighborhood of v_h^k, then it can be shown that (for every x ∈ Ω and simplex S_x with x ∈ S_x, a_x as above) \n\n0 ≤ e_h(x) ≤ sup_{z∈S_x} P_h(z, a_z, V_h^k) - inf_{z∈S_x} P_h(z, a_z, V_h^k). \n\nIf V_h^k is Lipschitz continuous, then an estimate using only Gronwall's inequality bounds the right side, and therefore e_h(x), by Ck, where C depends on the Lipschitz constants of v_h^k and the cost g. \n\n² With respect to the results in [3] we assume that also ε ≤ C(h + k/√h) can be shown. \n\nThe value e_j := max_{x∈S_j} e_h(x) defines a function which is locally constant on every simplex. We use e_j, j = 1, ..., N_s, as an indicator function for grid refinement. The (global) tolerance value tol_k for e_j shall be set to \n\ntol_k = C · (Σ_{i=1}^{N_s} e_i)/N_s, \n\nwhere we have chosen 1 ≤ C ≤ 2. We approximate the function e on the simplices in the following way, starting in some y_n ∈ S_j: \n1. apply a control a ∈ U constantly on [T, T + h] \n2. receive value r_n and subsequent state y_{n+1} \n3. calculate the update value P_h(y_n, a, v_h^k) \n4. 
if |P_h(y_n, a, v_h^k) - v_h^k(y_n)| > e_j, then set e_j := |P_h(y_n, a, v_h^k) - v_h^k(y_n)|. \n\nIt is advisable to make grid refinements in one sweep. We also store (in contrast to the algorithm described above) several past values of e_j in every simplex, to be able to distinguish between a large e_j due to few visits in that simplex and a large e_j due to the space discretization error. For grid refinement we use a method described in [1]. \n\n3 A local criterion for time refinement \n\nWhy not take the smallest possible sampling rate? There are two arguments for adaptive time discretization. First, a bigger time step h naturally improves (decreases) the contraction rate of the iteration, which is e^{-ρh}. The new information is conveyed from a point further away (in the future) for big h, without the need to store intermediate states along the trajectory. It is therefore reasonable to start with a big h and refine where needed. \n\nThe second argument is that the grid and time discretization k and h stand in a certain relation. In [3] the estimate \n\n|V(x) - V_h^k(x)| ≤ C(h + k/√h)  for all x ∈ Ω, C a constant, \n\nis proven (or similar estimates, depending on the regularity of V). For obvious reasons, it is desirable to start with a coarse grid (storage, speed), i.e. k large. Having a too small h in this case will make the approximation error large. Here also, it is reasonable to start with a big h and refine where needed. \nWhat can serve as a refinement criterion for the time step h? In numerical schemes for ordinary differential equations, adaptive step size control is performed by estimating the local truncation error of the Taylor series by inserting intermediate points. In reinforcement learning, however, suppose the system has a large truncation error (i.e. it is difficult to control) in a certain region using large h and locally constant control functions. 
If the optimal value function is nearly constant in this region, we will not have to refine h. The criterion must be that at an intermediate point, e.g. at time h/2, the optimal value function assumes a value considerably smaller (better) than at time h. However, if this better value is due to error in the state discretization, then do not refine the time step. \n\nWe define a function H on the simplices of the triangulation. H(S) > 0 holds the time step which will be used when in simplex S. Starting at a state y_n ∈ Ω, y_n ∈ S_n, at time T > 0, with the current iterate of the Q-function Q_h^k (v_h^k respectively), the following is performed: \n\n1. apply a control a ∈ U constantly on [T, T + h] \n2. take a sample at the intermediate state z = y_{y_n,a}(h/2) \n3. if H(S_n) < C·√(diam{S_n}), then end; else: \n4. compute v_h^k(z) = min_b Q_h^k(z, b) \n5. compute P_{h/2}(y_n, a, v_h^k) = r_{h/2}(y_n, a) + e^{-ρh/2} v_h^k(z) \n6. compute P_h(y_n, a, v_h^k) = r_h(y_n, a) + e^{-ρh} v_h^k(y_{n+1}) \n7. if P_{h/2}(y_n, a, v_h^k) ≤ P_h(y_n, a, v_h^k) - tol, update H(S_n) = H(S_n)/2 \n\nThe value C is currently set to \n\nC = C(y_n, a) = (2/ρ) |r_{h/2}(y_n, a) - r_h(y_n, a)|, \n\nwhereby a local value of M_f L_g h² is approximated, M_f(x) = max_a |f(x, a)|, L_g an approximation of |∇g(x, a)| (if g is sufficiently regular). \ntol depends on the local value of v_h^k and is set to \n\ntol(x) = 0.1 · v_h^k(x). \n\nHow can a Q-function Q_H^k(x, a), with state-dependent time and space discretization, be approximated and stored? We have stored the time discretization function H locally constant on every simplex. This implies (if H is not constant on Ω) that there will be vertices x_j such that adjacent triangles hold different values of H. The Q-function, which is stored in the vertices, then has different choices of H(x_j).
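The intermediate-point test above can be sketched as follows. This is a Python sketch under assumptions: `sample(y, a, h)`, returning the pair (local cost r_h(y, a), subsequent state), and the value-function callable `V` are hypothetical interfaces, and the termination test of step 3 is omitted.

```python
import math

def refine_time_step(H, S_n, y_n, a, rho, sample, V, tol):
    # Steps 4-7 of the time-refinement criterion: compare the update value
    # obtained with a full step h against the one at the intermediate time h/2,
    # and halve H(S_n) if the intermediate value is better by more than tol.
    h = H[S_n]
    r_h, y_next = sample(y_n, a, h)        # full step of length h
    r_half, z = sample(y_n, a, h / 2.0)    # sample at the intermediate state
    p_half = r_half + math.exp(-rho * h / 2.0) * V(z)   # step 5
    p_full = r_h + math.exp(-rho * h) * V(y_next)       # step 6
    if p_half <= p_full - tol:             # step 7
        H[S_n] = h / 2.0
    return H[S_n]
```

Because H is stored per simplex, repeated calls refine the time step only in those simplices where the intermediate-point value is consistently better than the full-step value.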
\nWe solved this problem by updating a function Q_H(x_j, a) with the Kaczmarz-update and the update value P_{H(y_n)}(y_n, a, v_h^k), y_n lying in a simplex adjacent to x_j, regardless of the different H-values in x_j. Q_H(x_j, a) therefore has an ambiguous semantics: it is the value if a is applied for 'some time', and optimal from there on. 'Some time' depends here on the value of H in the current simplex. It can be shown that |Q_{H(x_j)/2}(x_j, a) - Q_{H(x_j)}(x_j, a)| is less than the space discretization error. \n\n4 A simple numerical example \n\nWe demonstrate the effects of suboptimal values for space and time discretization with the following problem. Let the system equation be \n\nẏ = f(y, u) := ( u 1 ; -1 u ) (y - v),  y ∈ Ω = [0,1] × [0,1],  v = (.375, .375)ᵀ,  u ∈ [-c, c].  (8) \n\nThe system is reflected at the boundary. The stationary point of the uncontrolled system is v. The eigenvalues of the system matrix are u + i and u - i. \nThe goal of the optimal control shall be to steer the solution along a given trajectory in state space (see figure 1), minimizing the integral over the distance from the current state to the given trajectory. The reinforcement or cost function is therefore chosen to be \n\ng(y) = dist(L, y),  (9) \n\nwhere L denotes the set of points in the given trajectory. The cost functional takes the form (10). \n\nFigure 1: The left picture depicts the L-form of the given trajectory. The stationary point of the system is at (.375, .375) (depicted as a big dot). The optimal value function computed by numerical schemes on a fine fixed grid is depicted with too large time discretization (middle) and small time discretization (right) (rotated by about 100 degrees for better viewing). The waves in the middle picture show the effect of too large time steps in regions where g varies considerably. 
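The dynamics of system (8) can be sketched in a few lines. This is an illustrative Python sketch: the explicit Euler scheme and the step size dt are assumptions for illustration, not the integrator used in the experiments.

```python
def step(y, u, v=(0.375, 0.375), dt=0.01):
    # One explicit Euler step of system (8): ydot = [[u, 1], [-1, u]] (y - v),
    # followed by reflection at the boundary of the unit square.
    dy0 = u * (y[0] - v[0]) + (y[1] - v[1])
    dy1 = -(y[0] - v[0]) + u * (y[1] - v[1])
    y = [y[0] + dt * dy0, y[1] + dt * dy1]
    # reflect at the boundary of [0,1] x [0,1]
    return [(-c if c < 0.0 else 2.0 - c if c > 1.0 else c) for c in y]
```

For u = 0 the uncontrolled flow rotates around the stationary point v, and for u < 0 it spirals into v, matching the eigenvalues u ± i of the system matrix.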
\n\nIn the learning problem, the adaptive grid mechanism tries to resolve the waves \n(figure 1, middle picture) which come from the large time discretization. This is \ndepicted in figure 2. We used only three different time step sizes (h = 0.1, 0.05 and \n0.025) and started globally with the coarsest step size 0.1. \n\nFigure 2: The adaptive grid mechanism refines correctly. However, in the left picture, \nunnecessary effort is spended in resolving regions, in which the time step should be refined \nurgently. The right picture shows the result, if adaptive time is also used. Regions outside \nthe L-form are refined in the early stages of learning while h was still large. An additional \ncoarsening should be considered in future work. We used a high rate of random jumps in \nthe process and locally a certainty equivalence controller to produce these pictures. \n\n\f1042 \n\nS. Pareigis \n\n5 Discussion of the methods and conclusions \n\nWe described a time and space adaptive method for reinforcement learning with \ndiscounted cost functional. The ultimate goal would be, to find a self tuning algo(cid:173)\nrithm which locally adjusted the time and space discretization automatically to the \noptimal ratio. The methods worked fine in the problems we investigated, e.g. non(cid:173)\nlinearities in the system showed no problems. Nevertheless, the results depended \non the choice of the tolerance values C, tol and tolk' We used only three time dis(cid:173)\ncretization steps to prevent adjacent triangles holding time discretization values too \nfar apart. The smallest state space resolution in the example is therefore too fine \nfor the finest time resolution. A solution can be, to eventually use controls that are \nof higher order (in terms of approximation of control functions) than constant (e.g. \nlinear, polynomial, or locally constant on subintervals of the finest time interval). \nThis corresponds to locally open loop controls. 
\n\nThe optimality of the discretization ratio time/space could not be proven. Some \ndiscontinuous value functions 9 gave problems, and we had problems handling stiff \nsystems, too. \nThe learning period was considerably shorter (about factor 100 depending on the \nrequested accuracy and initial data) in the adaptive cases as opposed to fixed grid \nand time with the same accuracy. \n\nFrom our experience, it is difficult in numerical analysis to combine adaptive time \nand space discretization methods. To our knowledge this concept has not yet been \napplied to the Bellman-equation. Theoretical work is still to be done. We are aware, \nthat triangulation of the state space yields difficulties in implementation in high \ndimensions. In future work we will be using rectangular grids. We will also make \nsome comparisons with other algorithms like Parti-game ([5]). To us, a challenge is \nseen in handling discontinuous systems and cost functions as they appear in models \nwith dry friction for example, as well as algebro-differential systems as they appear \nin robotics. \n\nReferences \n\n[1] E. Bansch. Local mesh refinement in 2 and 3 dimensions. IMPACT Comput. \n\nSci. Engrg. 3, Vol. 3:181-191, 1991. \n\n[2] M. Falcone. A numerical approach to the infinite horizon problem of determin(cid:173)\n\nistic control theory. Appl Math Optim 15:1-13, 1987. \n\n[3] R. Gonzalez and M. Tidball. On the rates of convergence of fully discrete \nINRIA, Rapports de Recherche, No \n\nsolutions of Hamilton-Jacobi equations. \n1376, Programme 5, 1991. \n\n[4] L. Griine. An adaptive grid scheme for the discrete Hamilton-Jacobi-Bellman \n\nequation. Numerische Mathematik, Vol. 75, No. 3:319-337, 1997. \n\n[5] A. W. Moore and C. G. Atkeson. The parti-game algorithm for variable resolu(cid:173)\ntion reinforcement learning in multidimensional state-spaces. Machine Learning, \nVolume 21, 1995. \n\n[6] S. Pareigis. 
Lernen der Lösung der Bellman-Gleichung durch Beobachtung von kontinuierlichen Prozessen (Learning the solution of the Bellman equation by observing continuous processes). PhD thesis, Universität Kiel, 1996. \n\n[7] S. Pareigis. Multi-grid methods for reinforcement learning in controlled diffusion processes. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 9. The MIT Press, Cambridge, 1997.", "award": [], "sourceid": 1465, "authors": [{"given_name": "Stephan", "family_name": "Pareigis", "institution": null}]}