{"title": "Batch Value Function Approximation via Support Vectors", "book": "Advances in Neural Information Processing Systems", "page_first": 1491, "page_last": 1498, "abstract": "", "full_text": "Batch Value Function Approximation via\n\nSupport Vectors\n\nThomas G Dietterich\n\nXin W\"ang\n\nDepartment of Computet Science\n\nDepartment of Computer Science\n\nOregon State University\n\nCorvallis, OR, 97331\nwangxi@cs. orst. edu\n\nOregon State University\n\nCorvallis, OR, 97331\n\ntgd@cs.orst.edu\n\nAbstract\n\nWe present three ways of combining linear programming with the\nkernel trick to find value function approximations for reinforcement\nlearning. One formulation is based on SVM regression; the second\nis based on the Bellman equation; and the third seeks only to ensure\nthat good moves have an advantage over bad moves. All formu(cid:173)\nlations attempt to minimize the number of support vectors while\nfitting the data. Experiments in a difficult, synthetic maze problem\nshow that all three formulations give excellent performance, but the\nadvantage formulation is much easier to train. Unlike policy gradi(cid:173)\nent methods, the kernel methods described here can easily 'adjust\nthe complexity of the function approximator to fit the complexity\nof the value function.\n\n1\n\nIntroduction\n\nVirtually all existing work on value function approximation and policy-gradient\nmethods starts with a parameterized formula for the value function or policy and\nthen seeks to find the best policy that can be represented in that parameterized form.\nThis can give rise to very difficult search problems for which the Bellman equation\nis of little or no use. In this paper, we take a different approach: rather than fixing\nthe form of the function approximator and searching for a representable policy, we\ninstead identify a good policy and then search for a function approximator that\ncan represent it. 
Our approach exploits the ability of mathematical programming to represent a variety of constraints, including those that derive from supervised learning, from advantage learning (Baird, 1993), and from the Bellman equation. By combining the kernel trick with mathematical programming, we obtain a function approximator that seeks to find the smallest number of support vectors sufficient to represent the desired policy. This side-steps the difficult problem of searching for a good policy among those policies representable by a fixed function approximator. Our method applies to any episodic MDP, but it works best in domains, such as resource-constrained scheduling and other combinatorial optimization problems, that are discrete and deterministic.\n\n2 Preliminaries\n\nThere are two distinct reasons for studying value function approximation methods. The primary reason is to be able to generalize from some set of training experiences to produce a policy that can be applied in new states that were not visited during training. For example, in Tesauro's (1995) work on backgammon, even after training on 200,000 games, the TD-Gammon system needed to be able to generalize to new board positions that it had not previously visited. Similarly, in Zhang's (1995) work on space shuttle scheduling, each individual scheduling problem visits only a finite number of states, but the goal is to learn from a series of \"training\" problems and generalize to new states that arise in \"test\" problems. Similar MDPs have been studied by Moll, Barto, Perkins & Sutton (1999).\n\nThe second reason to study function approximation is to support learning in continuous state spaces. Consider a robot with sensors that return continuous values. Even during training, it is unlikely that the same vector of sensor readings will ever be experienced more than once. 
Hence, generalization is critical during the learning process as well as after learning.\n\nThe methods described in this paper address only the first of these reasons. Specifically, we study the problem of generalizing from a partial policy to construct a complete policy for a Markov Decision Problem (MDP). Formally, consider a discrete time MDP M with probability transition function P(s'|s, a) (probability that state s' will result from executing action a in state s) and expected reward function R(s'|s, a) (expected reward received from executing action a in state s and entering state s'). We will assume that, as in backgammon and space shuttle scheduling, P(s'|s, a) and R(s'|s, a) are known and available to the agent, but that the state space is so large that it prevents methods such as value iteration or policy iteration from being applied. Let L be a set of \"training\" states for which we have an approximation Ṽ(s) to the optimal value function V*(s), s ∈ L. In some cases, we will also assume the availability of a policy π̃ consistent with Ṽ(s). The goal is to construct a parameterized approximation V̂(s; θ) that can be applied to all states in M to yield a good policy π̂ via one-step lookahead search. In the experiments reported below, the set L contains states that lie along trajectories from a small set of \"training\" starting states S0 to terminal states. A successful learning method will be able to generalize to give a good policy for new starting states not in S0. This was the situation that arose in space shuttle scheduling, where the set L contained states that were visited while solving \"training\" problems and the learned value function was applied to solve \"test\" problems.\n\nTo represent states for function approximation, let X(s) denote a vector of features describing the state s. 
Let K(X1, X2) be a kernel function (generalized inner product) of the two feature vectors X1 and X2. In our experiments, we have employed the Gaussian kernel K(X1, X2; σ) = exp(−||X1 − X2||² / σ²) with parameter σ.\n\n3 Three LP Formulations of Function Approximation\n\nWe now introduce three linear programming formulations of the function approximation problem. We first express each of these formulations in terms of a generic fitted function approximator V̂. Then, we implement V̂(s) as the dot product of a weight vector W with the feature vector X(s): V̂(s) = W · X(s). Finally, we apply the \"kernel trick\" by first rewriting W as a weighted sum of the training points sj ∈ L, W = Σj αj X(sj) (αj ≥ 0), and then replacing all dot products between data points by invocations of the kernel function K. We assume L contains all states along the best paths from S0 to terminal states and also all states that can be reached from these paths in one step and that have been visited during exploration (so that Ṽ is known). In all three formulations we have employed linear objective functions, but quadratic objectives like those employed in standard support vector machines could be used instead. All slack variables in these formulations are constrained to be non-negative.\n\nFormulation 1: Supervised Learning. The first formulation treats the value function approximation problem as a supervised learning problem and applies the standard ε-insensitive loss function (Vapnik, 2000) to fit the function approximator.\n\nminimize Σs [u(s) + v(s)]\nsubject to V̂(s) + u(s) ≥ Ṽ(s) − ε;  V̂(s) − v(s) ≤ Ṽ(s) + ε,  ∀s ∈ L\n\nIn this formulation, u(s) and v(s) are slack variables that are non-zero only if V̂(s) has an absolute deviation from Ṽ(s) of more than ε. The objective function seeks to minimize these absolute deviation errors. 
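To make this concrete, here is a minimal sketch of the ε-insensitive fit as a linear program over a linear approximator V̂(s) = W · X(s), solved with scipy.optimize.linprog rather than a commercial solver; the three training states, their feature vectors, the target values, and ε below are invented for illustration.\n\n```python\n# Sketch of Formulation 1 (epsilon-insensitive fit, before the weight-norm\n# penalty is added). Decision vector: [W (free sign), u (slacks), v (slacks)].\nimport numpy as np\nfrom scipy.optimize import linprog\n\nX = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # X(s) for each s in L\nV = np.array([1.0, 2.0, 3.5])                        # target values V(s)\neps = 0.1\nn, d = X.shape\n\n# Objective: minimize sum(u) + sum(v); W carries no cost in this basic form.\nc = np.concatenate([np.zeros(d), np.ones(n), np.ones(n)])\n\n# W.X(s) + u(s) >= V(s) - eps   ->  -X W - I u <= -(V - eps)\n# W.X(s) - v(s) <= V(s) + eps   ->   X W - I v <=   V + eps\nA_ub = np.vstack([\n    np.hstack([-X, -np.eye(n), np.zeros((n, n))]),\n    np.hstack([ X, np.zeros((n, n)), -np.eye(n)]),\n])\nb_ub = np.concatenate([-(V - eps), V + eps])\n\nbounds = [(None, None)] * d + [(0, None)] * (2 * n)\nres = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)\n\nW = res.x[:d]\nprint(res.status, np.round(X @ W, 3))  # status 0 means an optimum was found\n```\n\nThe optimal objective is exactly the total deviation beyond the ε-band, so slacks stay at zero wherever the fit lands inside the band.\n\n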
A key idea of support vector methods is to combine this objective function with a penalty on the norm of the weight vector. We can write this as\n\nminimize ||W||1 + C Σs [u(s) + v(s)]\nsubject to W · X(s) + u(s) ≥ Ṽ(s) − ε;  W · X(s) − v(s) ≤ Ṽ(s) + ε,  ∀s ∈ L\n\nThe parameter C expresses the tradeoff between fitting the data (by driving the slack variables to zero) and minimizing the norm of the weight vector. We have chosen to minimize the 1-norm of the weight vector (||W||1 = Σi |Wi|), because this is easy to implement via linear programming. Of course, if the squared Euclidean norm of W is preferred, then quadratic programming methods could be applied to minimize this.\n\nNext, we introduce the assumption that W can be written as a weighted sum of the data points themselves. Substituting this into the constraint equations, we obtain\n\nminimize Σj αj + C Σs [u(s) + v(s)]\nsubject to Σj αj X(sj) · X(s) + u(s) ≥ Ṽ(s) − ε,  ∀s ∈ L\n  Σj αj X(sj) · X(s) − v(s) ≤ Ṽ(s) + ε,  ∀s ∈ L\n\nFinally, we can apply the kernel trick by replacing each dot product by a call to a kernel function:\n\nminimize Σj αj + C Σs [u(s) + v(s)]\nsubject to Σj αj K(X(sj), X(s)) + u(s) ≥ Ṽ(s) − ε,  ∀s ∈ L\n  Σj αj K(X(sj), X(s)) − v(s) ≤ Ṽ(s) + ε,  ∀s ∈ L\n\nFormulation 2: Bellman Learning. The second formulation introduces constraints from the Bellman equation V(s) = maxa Σs' P(s'|s, a)[R(s'|s, a) + V(s')]. The standard approach to solving MDPs via linear programming is the following. For each state s and action a,\n\nminimize Σs,a u(s, a)\nsubject to V(s) = u(s, a) + Σs' P(s'|s, a)[R(s'|s, a) + V(s')]\n\nThe idea is that for the optimal action a* in state s, the slack variable u(s, a*) can be driven to zero, while for non-optimal actions a−, the slack u(s, a−) will remain non-zero. 
Hence, the minimization of the slack variables implements the maximization operation of the Bellman equation.\n\nWe attempted to apply this formulation with function approximation, but the errors introduced by the approximation make the linear program infeasible, because V̂(s) must sometimes be less than the backed-up value Σs' P(s'|s, a)[R(s'|s, a) + V̂(s')]. This led us to the following formulation, in which we exploit the approximate value function Ṽ to provide \"advice\" to the LP optimizer about which constraints should be tight and which ones should be loose. Consider a state s in L. We can group the actions available in s into three groups: (a) the \"optimal\" action a* = π̃(s) chosen by the approximate policy π̃, (b) other actions that are tied for optimum (denoted by a0), and (c) actions that are sub-optimal (denoted by a−). We have three different constraint equations, one for each type of action:\n\nminimize Σs [u(s, a*) + v(s, a*)] + Σs,a0 y(s, a0) + Σs,a− z(s, a−)\nsubject to V̂(s) + u(s, a*) − v(s, a*) = Σs' P(s'|s, a*)[R(s'|s, a*) + V̂(s')]\n  V̂(s) + y(s, a0) ≥ Σs' P(s'|s, a0)[R(s'|s, a0) + V̂(s')]\n  V̂(s) + z(s, a−) ≥ Σs' P(s'|s, a−)[R(s'|s, a−) + V̂(s')] + ε\n\nThe first constraint requires V̂(s) to be approximately equal to the backed-up value of the chosen optimal action a*. The second constraint requires V̂(s) to be at least as large as the backed-up value of any alternative optimal actions a0. If V̂(s) is too small, it will be penalized, because the slack variable y(s, a0) will be non-zero. But there is no penalty if V̂(s) is too large. The main effect of this constraint is to drive the value of V̂(s') downward as necessary to satisfy the first constraint on a*. Finally, the third constraint requires that V̂(s) be at least ε larger than the backed-up value of all inferior actions a−. 
If these constraints can be satisfied with all slack variables u, v, y, and z set to zero, then V̂ satisfies the Bellman equation. After applying the kernel trick and introducing the regularization objective, we obtain the following Bellman formulation:\n\nminimize Σj αj + C Σs,a [u(s, a*) + v(s, a*) + y(s, a0) + z(s, a−)]\nsubject to Σj αj [K(X(sj), X(s)) − Σs' P(s'|s, a*) K(X(sj), X(s'))] + u(s, a*) − v(s, a*) = Σs' P(s'|s, a*) R(s'|s, a*)\n  Σj αj [K(X(sj), X(s)) − Σs' P(s'|s, a0) K(X(sj), X(s'))] + y(s, a0) ≥ Σs' P(s'|s, a0) R(s'|s, a0)\n  Σj αj [K(X(sj), X(s)) − Σs' P(s'|s, a−) K(X(sj), X(s'))] + z(s, a−) ≥ Σs' P(s'|s, a−) R(s'|s, a−) + ε\n\nFormulation 3: Advantage Learning. The third formulation focuses on the minimal constraints that must be satisfied to ensure that the greedy policy computed from V̂ will be identical to the greedy policy computed from Ṽ (cf. Utgoff & Saxena, 1987). Specifically, we require that the backed-up value of the optimal action a* be greater than the backed-up values of all other actions a.\n\nminimize Σs,a*,a u(s, a*, a)\nsubject to Σs' P(s'|s, a*)[R(s'|s, a*) + V̂(s')] + u(s, a*, a) ≥ Σs' P(s'|s, a)[R(s'|s, a) + V̂(s')] + ε\n\nThere is one constraint and one slack variable u(s, a*, a) for every action executable in state s except for the chosen optimal action a* = π̃(s). The backed-up value of a* must have an advantage of at least ε over any other action a, even other actions that, according to Ṽ, are just as good as a*. 
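As a concrete illustration, the following sketch instantiates this advantage LP for a tiny invented deterministic MDP, so each backed-up value reduces to R(s, a) + V(s') for a single successor s'. The decision variables are the successor values plus one slack per non-chosen action; scipy.optimize.linprog stands in for CPLEX, and all transitions, rewards, and ε are made up for the example.\n\n```python\n# Sketch of Formulation 3 on a deterministic toy MDP with two constrained\n# states; from each, the policy action a* and one alternative action a lead\n# to known successors with known rewards (all values here are invented).\nimport numpy as np\nfrom scipy.optimize import linprog\n\nsucc_star = [1, 3]; r_star = [0.0, 1.0]   # a* successor / reward per state\nsucc_alt  = [2, 2]; r_alt  = [0.0, 0.5]   # alternative action per state\nn_states, n_slack, eps = 4, 2, 0.1\n\n# Decision vector: [V(0..3), u(0), u(1)]; minimize the sum of slacks.\nc = np.concatenate([np.zeros(n_states), np.ones(n_slack)])\n\n# r_star + V(succ_star) + u >= r_alt + V(succ_alt) + eps, rewritten as\n# V(succ_alt) - V(succ_star) - u <= r_star - r_alt - eps.\nA_ub, b_ub = [], []\nfor s in range(n_slack):\n    row = np.zeros(n_states + n_slack)\n    row[succ_alt[s]] += 1.0\n    row[succ_star[s]] -= 1.0\n    row[n_states + s] = -1.0\n    A_ub.append(row)\n    b_ub.append(r_star[s] - r_alt[s] - eps)\n\nbounds = [(None, None)] * n_states + [(0, None)] * n_slack\nres = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)\nV = res.x[:n_states]\nprint(res.status, np.round(V, 3))\n```\n\nWhen the optimum drives every slack to zero, the greedy policy computed from the recovered values necessarily picks a* in both constrained states, which is all the formulation demands.\n\n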
After applying the kernel trick and incorporating the complexity penalty, this becomes\n\nminimize Σj αj + C Σs,a*,a u(s, a*, a)\nsubject to Σj αj Σs' [P(s'|s, a*) − P(s'|s, a)] K(X(sj), X(s')) + u(s, a*, a) ≥ Σs' P(s'|s, a) R(s'|s, a) − Σs' P(s'|s, a*) R(s'|s, a*) + ε\n\nOf course, each of these formulations can easily be modified to incorporate a discount factor for discounted cumulative reward.\n\n4 Experimental Results\n\nTo compare these three formulations, we generated a set of 10 random maze problems as follows. In a 100 by 100 maze, the agent starts in a randomly-chosen square in the left column, (0, y). Three actions are available in every state, east, northeast, and southeast, which deterministically move the agent one square in the indicated direction. The maze is filled with 3000 rewards (each of value −5) generated randomly from a mixture of a uniform distribution (with probability 0.20) and five 2-D Gaussians (each with probability 0.16) centered at (80,20), (80,60), (40,20), (40,80), and (20,50) with variance 10 in each dimension. Multiple rewards generated for a single state are accumulated. In addition, in column 99, terminal rewards are generated according to a distribution that varies from −5 to +15 with minima at (99,0), (99,40), and (99,80) and maxima at (99,20) and (99,60).\n\nFigure 1 shows one of the generated mazes. These maze problems are surprisingly hard because, unlike \"traditional\" mazes, they contain no walls. 
In traditional mazes, the walls tend to guide the agent to the goal states by reducing what would be a 2-D random walk to a random walk of lower dimension (e.g., 1-D along narrow halls).\n\n[Figure 1: Example randomly-generated maze. Agent enters at left edge and exits at right edge. Point markers distinguish rewards of −5, −10, −15, and −20.]\n\nWe applied the three LP formulations in an incremental-batch method as shown in Table 1. The LPs were solved using the CPLEX package from ILOG. The V̂ giving the best performance on the starting states in S0 over the 20 iterations was saved and evaluated over all 100 possible starting states to obtain a measure of generalization. The values of C and σ were determined by evaluating generalization on a holdout set of 3 start states: (0,30), (0,50), and (0,70). Experimentation showed that C = 100,000 worked well for all three methods. We tuned σ² separately for each problem using values of 5, 10, 20, 40, 60, 80, 120, and 160; larger values were preferred in case of ties, since they give better generalization. The results are summarized in Figure 2.\n\nThe figure shows that the three methods give essentially identical performance, and that after 3 examples, all three methods have a regret per start state of about 2 units, which is less than the cost of a single −5 penalty. 
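The reward layout of Section 4 can be sketched as follows. The grid size, reward count, mixture weights, and Gaussian centers follow the text; the RNG seed and the sinusoidal shape of the terminal-reward profile (which the text specifies only by its range, minima, and maxima) are assumptions.\n\n```python\n# Sketch of the maze generator: 3000 rewards of -5 drawn from a 0.20 uniform /\n# 5 x 0.16 Gaussian mixture, accumulated per cell; the terminal-reward curve\n# in column 99 is an assumed cosine hitting -5 at y = 0, 40, 80 and +15 at\n# y = 20, 60.\nimport numpy as np\n\nrng = np.random.default_rng(0)  # seed chosen arbitrarily\nSIZE, N_REWARDS = 100, 3000\ncenters = [(80, 20), (80, 60), (40, 20), (40, 80), (20, 50)]\n\ngrid = np.zeros((SIZE, SIZE))\nfor _ in range(N_REWARDS):\n    if rng.random() < 0.20:                        # uniform component\n        x, y = rng.integers(0, SIZE, size=2)\n    else:                                          # one of five 2-D Gaussians\n        cx, cy = centers[rng.integers(0, len(centers))]\n        x, y = rng.normal([cx, cy], np.sqrt(10)).round().astype(int)\n    if 0 <= x < SIZE and 0 <= y < SIZE:\n        grid[x, y] += -5.0                         # multiple rewards accumulate\n\n# Terminal rewards in column 99, varying from -5 to +15 (assumed shape).\nys = np.arange(SIZE)\nterminal = 5.0 - 10.0 * np.cos(2 * np.pi * ys / 40.0)\n```\n\nGaussian samples that round to coordinates outside the grid are simply dropped here; the paper does not say how out-of-range draws were handled, so this is one plausible choice.\n\n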
However, the three formulations differ in their ease of training and in the information they require. Table 2 compares training performance in terms of (a) the CPU time required for training, (b) the number of support vectors constructed, (c) the number of states in which V̂ prefers a tied-optimal action over the action chosen by π̃, (d) the number of states in which V̂ prefers an inferior action, and (e) the number of iterations performed after the best-performing iteration on the training set. A high score on this last measure indicates that the learning algorithm is not converging well, even though it may momentarily attain a good fit to the data. By virtually every measure, the advantage formulation scores better. It requires much less CPU time to train, finds substantially fewer support vectors, finds function approximators that give better fit to the data, and tends to converge better. In addition, the advantage and Bellman formulations do not require the value of Ṽ, but only π̃. This makes them suitable for learning to imitate a human-supplied policy.\n\nTable 1: Incremental Batch Reinforcement Learning\n\nRepeat 20 times:\n  For each start state s0 ∈ S0 do\n    Generate 16 ε-greedy trajectories using V̂\n    Record all transitions and rewards to build MDP model M̂\n  Solve M̂ via value iteration to obtain Ṽ and π̃\n  L = ∅\n  For each start state s0 ∈ S0 do\n    Generate trajectory according to π̃\n    Add to L all states visited along this trajectory\n  Apply LP method to L, Ṽ, and π̃ to find new V̂\n  Perform Monte Carlo rollouts using greedy policy for V̂ to evaluate each possible start state\n  Report total value of all start states.\n\nTable 2: Measures of the quality of the training process (average over 10 MDPs)\n\n|S0| = 1:\n       CPU    #SV   #tie   #bad  #iter\nSup    37.5   22.4   29.5   0.7   5.6\nBel    30.4   40.9   18.8   0.9   5.9\nAdv    11.7   19.4   17.2   0.2   1.6\n\n|S0| = 2:\n       CPU    #SV   #tie   #bad  #iter\nSup   190.7   49.8   54.3   1.9   7.3\nBel    92.7   47.9   51.1   0.4   8.2\nAdv    38.4   29.1   39.6   1.4   2.0\n\n|S0| = 3:\n       CPU    #SV   #tie   #bad  #iter\nSup   433.2   70.5  105.5   3.0  10.5\nBel   208.0   82.4   62.0   2.2   3.3\nAdv    74.5   46.7   58.6   0.6   4.0\n\n|S0| = 4:\n       CPU    #SV   #tie   #bad  #iter\nSup   789.1   90.5  117.2   3.3   9.6\nBel   379.1   75.2  145.7   1.8   7.3\nAdv   122.4   51.9   74.0   3.2   2.8\n\n5 Conclusions\n\nThis paper has presented three formulations of batch value function approximation by exploiting the power of linear programming to express a variety of constraints and borrowing the kernel trick from support vector machines. All three formulations were able to learn and generalize well on difficult synthetic maze problems. The advantage formulation is easier and more reliable to train, probably because it places fewer constraints on the value function approximation. Hence, we are now applying the advantage formulation to combinatorial optimization problems in scheduling and protein structure determination.\n\nAcknowledgments\n\nThe authors gratefully acknowledge the support of AFOSR under contract F49620-98-1-0375, and the NSF under grants IRI-9626584, IIS-0083292, ITR-5710001197, and EIA-9818414. We thank Valentina Zubek and Adam Ashenfelter for their careful reading of the paper.\n\nFigure 2: Comparison of the total regret (optimal total reward − attained total reward) summed over all 100 starting states for the three formulations as a function of the number of start states in S0. The three error bars represent the performance of the supervised, Bellman, and advantage formulations (left-to-right). The bars plot the 25th, 50th, and 75th percentiles computed over 10 randomly generated mazes. Average optimal total reward on these problems is 1306. 
The random policy receives a total reward of −14,475.\n\nReferences\n\nBaird, L. C. (1993). Advantage updating. Tech. rep. 93-1146, Wright-Patterson AFB.\n\nMoll, R., Barto, A. G., Perkins, T. J., & Sutton, R. S. (1999). Learning instance-independent value functions to enhance local search. In NIPS-11, 1017-1023.\n\nTesauro, G. (1995). Temporal difference learning and TD-Gammon. CACM, 38(3), 58-68.\n\nUtgoff, P. E., & Saxena, S. (1987). Learning a preference predicate. In ICML-87, 115-121.\n\nVapnik, V. (2000). The Nature of Statistical Learning Theory, 2nd Ed. Springer.\n\nZhang, W., & Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In IJCAI-95, 1114-1120.\n", "award": [], "sourceid": 2116, "authors": [{"given_name": "Thomas", "family_name": "Dietterich", "institution": null}, {"given_name": "Xin", "family_name": "Wang", "institution": null}]}