{"title": "Improving Policies without Measuring Merits", "book": "Advances in Neural Information Processing Systems", "page_first": 1059, "page_last": 1065, "abstract": null, "full_text": "Improving Policies without Measuring \n\nMerits \n\nPeter Dayan! \n\nCBCL \n\nE25-201, MIT \n\nCambridge, MA 02139 \ndayan~ai.mit.edu \n\nSatinder P Singh \n\nHarlequin, Inc \n\n1 Cambridge Center \nCambridge, MA 02142 \nsingh~harlequin.com \n\nAbstract \n\nPerforming policy iteration in dynamic programming should only \nrequire knowledge of relative rather than absolute measures of the \nutility of actions (Werbos, 1991) - what Baird (1993) calls the ad(cid:173)\nvantages of actions at states. Nevertheless, most existing methods \nin dynamic programming (including Baird's) compute some form of \nabsolute utility function . For smooth problems, advantages satisfy \ntwo differential consistency conditions (including the requirement \nthat they be free of curl), and we show that enforcing these can lead \nto appropriate policy improvement solely in terms of advantages. \n\n1 \n\nIntrod uction \n\nIn deciding how to change a policy at a state, an agent only needs to know the \ndifferences (called advantages) between the total return based on taking each action \na for one step and then following the policy forever after, and the total return \nbased on always following the policy (the conventional value of the state under the \npolicy). The advantages are like differentials - they do not depend on the local levels \nof the total return. Indeed, Werbos (1991) defined Dual Heuristic Programming \n(DHP), using these facts, learning the derivatives of these total returns with respect \nto the state. For instance, in a conventional undiscounted maze problem with a \n\nlWe are grateful to Larry Saul, Tommi Jaakkola and Mike Jordan for comments, and \nAndy Barto for pointing out the connection to Werbos' DHP. 
This work was supported by NSERC, MIT, and grants to Professor Michael I. Jordan from ATR Human Information Processing Research and Siemens Corporation.

penalty for each move, the advantages for the actions might typically be -1, 0 or 1, whereas the values vary between 0 and the maximum distance to the goal. Advantages should therefore be easier to represent than absolute value functions in a generalising system such as a neural network and, possibly, easier to learn. Although the advantages are differential, existing methods for learning them, notably Baird (1993), require the agent simultaneously to learn the total return from each state. The underlying trouble is that advantages do not appear to satisfy any form of a Bellman equation. Whereas it is clear that the value of a state should be closely related to the values of its neighbours, it is not obvious that the advantage of action a at a state should be equally closely related to its advantages nearby.

In this paper, we show that under some circumstances it is possible to use a solely advantage-based scheme for policy iteration, using the spatial derivatives of the value function rather than the value function itself. Advantages satisfy a particular consistency condition, and, given a model of the dynamics and reward structure of the environment, an agent can use this condition to acquire the spatial derivatives of the value function directly. It turns out that the condition alone may not impose enough constraints to specify these derivatives (this is a consequence of the problem described above) - however, the value function is like a potential function for these derivatives, and this allows extra constraints to be imposed.
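The maze intuition above can be made concrete in a small sketch. The corridor below is an invented example (length, cost structure, and sign conventions are assumptions, not taken from the paper): values scale with the distance to the goal, while advantages stay in a fixed small range whatever the maze size.

```python
# Minimal sketch (invented example): values vs. advantages in a 1-D
# corridor with a cost of 1 per move and a goal at position 0.

N = 20  # corridor length (an arbitrary choice)

# Undiscounted cost-to-go: number of moves needed to reach the goal.
V = {x: float(x) for x in range(N + 1)}

def advantage(x, a):
    """Advantage of moving by a (-1 toward goal, +1 away) at state x."""
    nxt = min(max(x + a, 0), N)
    # one-step cost plus cost-to-go of the successor, relative to V(x)
    return -(1.0 + V[nxt] - V[x])

# Values grow with the distance to the goal ...
assert max(V.values()) == N
# ... while advantages stay in a fixed small range, independent of N.
advs = {advantage(x, a) for x in range(1, N) for a in (-1, +1)}
print(sorted(advs))  # [-2.0, 0.0]
```

Doubling N doubles the range of V but leaves the set of advantages unchanged, which is why a generalising approximator should find advantages easier to represent.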
2 Continuous DP, Advantages and Curl

Consider the problem of controlling a deterministic system to minimise V*(x_0) = min_{u(t)} ∫_0^∞ r(y(t), u(t)) dt, where y(t) ∈ R^n is the state at time t, u(t) ∈ R^m is the control, y(0) = x_0, and ẏ(t) = f(y(t), u(t)). This is a simplified form of a classic variational problem, since r and f do not depend on time t explicitly but only through y(t), and there are no stopping-time or terminal conditions on y(t) (see Peterson, 1993; Atkeson, 1994, for recent methods for solving such problems). This means that the optimal u(t) can be written as a function of y(t) and that V*(x_0) is a function of x_0 and not t. We do not treat the cases in which the infinite integrals do not converge, and we will also assume adequate continuity and differentiability.

The solution by advantages: This problem can be solved by writing down the Hamilton-Jacobi-Bellman (HJB) equation (see Dreyfus, 1965), which V*(x) satisfies:

0 = min_u [r(x, u) + f(x, u) · ∇_x V*(x)]    (1)

This is the continuous space/time analogue of the conventional Bellman equation (Bellman, 1957) for discrete, non-discounted, deterministic decision problems, which says that for the optimal value function V*, 0 = min_a [r(x, a) + V*(f(x, a)) - V*(x)], where starting the process at state x and using action a incurs a cost r(x, a) and leaves the process in state f(x, a). This, and its obvious stochastic extension to Markov decision processes, lie at the heart of temporal difference methods for reinforcement learning (Sutton, 1988; Barto, Sutton & Watkins, 1989; Watkins, 1989). Equation 1 describes what the optimal value function must satisfy. Discrete dynamic programming also comes with a method called value iteration, which starts with any function V_0(x), improves it sequentially, and converges to the optimum.
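The discrete value-iteration method just mentioned can be sketched on a toy deterministic cost problem (the states, costs, and transitions below are invented for illustration):

```python
# Value iteration on a tiny deterministic shortest-path problem:
# V_{k+1}(x) = min_a [r(x, a) + V_k(f(x, a))].
states = [0, 1, 2, 3]          # state 3 is absorbing with zero cost
actions = [-1, +1]             # move left or right

def f(x, a):                   # deterministic transition function
    return min(max(x + a, 0), 3)

def r(x, a):                   # unit cost per move until absorption
    return 0.0 if x == 3 else 1.0

V = {x: 0.0 for x in states}   # start from an arbitrary V_0
for _ in range(50):            # sequential improvement
    V = {x: min(r(x, a) + V[f(x, a)] for a in actions) for x in states}

print([V[x] for x in states])  # converges to the cost-to-go [3.0, 2.0, 1.0, 0.0]
```

After a few sweeps V stops changing: it then satisfies the discrete Bellman equation above with equality.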
The alternative method, policy iteration (Howard, 1960), operates in the space of policies, ie functions w(x). Starting with w(x), the method requires evaluating everywhere the value function V^w(x) = ∫_0^∞ r(y(t), w(y(t))) dt, where y(0) = x and ẏ(t) = f(y(t), w(y(t))). It turns out that V^w satisfies a close relative of equation 1:

0 = r(x, w(x)) + f(x, w(x)) · ∇_x V^w(x)    (2)

In policy iteration, w(x) is improved by choosing the minimising action:

w'(x) = argmin_u [r(x, u) + f(x, u) · ∇_x V^w(x)]    (3)

as the new action. For discrete Markov decision problems, the equivalent of this process of policy improvement is guaranteed to improve upon w.

In the discrete case, and for an analogue of value iteration, Baird (1993) defined the optimal advantage function A*(x, a) = [Q*(x, a) - max_b Q*(x, b)]/δt, where δt is effectively a characteristic time for the process, which was taken to be 1 above, and the optimal Q function (Watkins, 1989) is Q*(x, a) = r(x, a) + V*(f(x, a)), where V*(y) = max_b Q*(y, b). It turns out (Baird, 1993) that in the discrete case one can cast the whole of policy iteration in terms of advantages. In the continuous case, we define advantages directly as

A^w(x, u) = r(x, u) + f(x, u) · ∇_x V^w(x)    (4)

This equation indicates how the spatial derivatives of V^w determine the advantages. Note that the consistency condition in equation 2 can be written as A^w(x, w(x)) = 0. Policy iteration can proceed using

w'(x) = argmin_u A^w(x, u)    (5)

Doing without V^w: We can now state more precisely the intent of this paper: a) the consistency condition in equation 2 provides constraints on the spatial derivatives ∇_x V^w(x), at least given a model of r and f; b) equation 4 indicates how these spatial derivatives can be used to determine the advantages, again using a model; and c) equation 5 shows that the advantages tout court can be used to improve the policy.
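Steps a)-c) can be sketched in code for a one-dimensional linear-quadratic problem with cost αx² + βu² and dynamics ẋ = -(ax + u), where the policy is u = wx and the value gradient is h(x, k) = kx. The constants, step size, and iteration counts below are invented for illustration; this is a sketch of the scheme, not the paper's simulation:

```python
# Advantage-based policy iteration for a scalar linear-quadratic problem
# (invented constants): fit the value gradient h(x, k) = k x from the
# consistency condition, then improve the policy via the advantages.
import math
import random

alpha, beta, a = 1.0, 1.0, 0.5       # assumed cost and dynamics parameters

random.seed(0)
w = 0.0                              # initial policy u = w x
for _ in range(30):                  # policy-iteration sweeps
    # a) policy evaluation: gradient descent on the squared inconsistency
    #    [ (alpha + beta w^2) x^2 - k (a + w) x^2 ]^2 at sampled states x
    k, eps = 0.0, 0.05
    for _ in range(2000):
        x = random.uniform(0.5, 1.5)
        resid = (alpha + beta * w**2) * x**2 - k * (a + w) * x**2
        k += eps * 2.0 * (a + w) * x**2 * resid   # descent step on resid^2
    # b), c) policy improvement: A(x, v) = alpha x^2 + beta v^2 - (a x + v) k x
    #    is minimised over v at v = k x / (2 beta), so the new gain is:
    w = k / (2.0 * beta)

w_star = -a + math.sqrt(a**2 + alpha / beta)   # known optimal gain
print(w, w_star)                               # both close to 0.618
```

The gradient k is learned only from the consistency condition, and the policy is improved only through the advantages; the absolute value function is never formed.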
Therefore, one apparently should have no need to know V^w(x), but just its spatial derivatives, in order to do policy iteration.

Didactic Example - LQR: To make the discussion more concrete, consider the case of a one-dimensional linear quadratic regulator (LQR). The task is to minimise V*(x_0) = ∫_0^∞ [αx(t)² + βu(t)²] dt by choosing u(t), where α, β > 0, ẋ(t) = -[ax(t) + u(t)] and x(0) = x_0. It is well known (eg Athans & Falb, 1966) that the solution to this problem is V*(x) = k*x²/2, where k* = (α + β(u*)²)/(a + u*) and u(t) = u*x(t), with u* = -a + √(a² + α/β). Knowing the form of the problem, we consider policies w that make u(t) = wx(t) and require h(x, k) ≡ ∇_x V^w(x) = kx, where the correct value of k is (α + βw²)/(a + w). The consistency condition in equation 2, evaluated at state x, implies that 0 = (α + βw²)x² - h(x, k)(a + w)x. Doing online gradient descent in the square inconsistency at samples x_n gives k_{n+1} = k_n - ε ∂[(α + βw²)x_n² - k_n x_n(a + w)x_n]²/∂k_n, which will reduce the square inconsistency for small enough ε unless x_n = 0. As required, the square inconsistency can only be zero for all values of x if k = (α + βw²)/(a + w). The advantage of performing action v (note this is not vx) at state x is, from equation 4, A^w(x, v) = αx² + βv² - (ax + v)(α + βw²)x/(a + w), which, minimising over v (equation 5), gives u(x) = w'x where w' = (α + βw²)/(2β(a + w)), which is the Newton-Raphson iteration for solving the quadratic equation that determines the optimal policy. In this case, without ever explicitly forming V^w(x), we have been able to learn an optimal policy. This was based, at least conceptually, on samples x_n from the interaction of the agent with the world.

The curl condition: The astute reader will have noticed a problem.
The consistency condition in equation 2 constrains the spatial derivatives ∇_x V^w in only one direction at every point - along the route f(x, w(x)) taken according to the policy there. However, in evaluating actions by evaluating their advantages, we need to know ∇_x V^w in all the directions accessible through f(x, u) at state x. The quadratic regulation task was only solved because we employed a function approximator (which was linear in this case: h(x, k) = kx). For the case of LQR, the restriction that h be linear allowed information about f(x', w(x')) · ∇_{x'} V^w(x') at distant states x', and for the policy actions w(x') there, to determine f(x, u) · ∇_x V^w(x) at state x, but for non-policy actions u. If we had tried to represent h(x, k) using a more flexible approximator, such as radial basis functions, it might not have worked. In general, if we do not know the form of ∇_x V^w(x), we cannot rely on the function approximator to generalise correctly.

There is one piece of information that we have yet to use - the function h(x, k) ≡ ∇_x V^w(x) (with parameters k, and in general non-linear) is the gradient of something - it represents a conservative vector field. Therefore its curl should vanish (∇_x × h(x, k) = 0). Two ways to try to satisfy this are to represent h as a suitably weighted combination of functions that satisfy this condition, or to use the square of the curl as an additional error during the process of setting the parameters k. Even in the case of the LQR, but in more than one dimension, it turns out to be essential to use the curl condition. For the multi-dimensional case we know that V^w(x) = x^T K^w x/2 for some symmetric matrix K^w, but enforcing zero curl is the only way to enforce this symmetry.
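The link between the curl condition and the symmetry of K^w can be sketched numerically for a linear field h(x) = Mx in two dimensions (the matrix entries below are invented): the scalar curl ∂h_2/∂x_1 - ∂h_2/∂x_2 reduces to M_21 - M_12, which vanishes exactly when M is symmetric, ie exactly when h is the gradient of x^T M x/2.

```python
# Sketch (invented numbers): the curl of a linear 2-D field h(x) = M x
# is the constant M[1,0] - M[0,1], zero iff M is symmetric.
import numpy as np

def curl(M):
    """Scalar curl of the linear field h(x) = M x (constant over x)."""
    return M[1, 0] - M[0, 1]

M = np.array([[2.0, 0.5],
              [1.5, 3.0]])
print(curl(M))                  # 1.0: this h is not a gradient field

# Penalising curl^2 while staying close to M drives the estimate toward
# the symmetric part of M, which is curl-free:
M_sym = 0.5 * (M + M.T)
print(curl(M_sym))              # 0.0
```

This is the second of the two options above: using the squared curl as an extra error term simply pushes the off-diagonal parameters toward each other.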
The curl condition says that knowing how some component of ∇_x V^w(x) changes in some direction (eg ∂[∇_x V^w(x)]_2/∂x_1) does provide information about how some other component changes in a different direction (eg ∂[∇_x V^w(x)]_1/∂x_2). This information is only useful up to constants of integration, and smoothness conditions will be necessary to apply it.

3 Simulations

We tested the method of approximating h^w(x) = ∇_x V^w(x) as a linearly weighted combination of local conservative vector fields h^w(x) = Σ_i c_i ∇_x