{"title": "Stabilizing Value Function Approximation with the BFBP Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1587, "page_last": 1594, "abstract": "", "full_text": "Stabilizing Value Function\n\nwith the\n\nXin Wang\n\nThomas G Dietterich\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nOregon State University\n\nCorvallis, OR, 97331\nwangxi@cs. orst. edu\n\nOregon State University\n\nCorvallis, OR, 97331\n\ntgd@cs. orst. edu\n\nAbstract\n\nWe address the problem of non-convergence of online reinforcement\nlearning algorithms (e.g., Q learning and SARSA(A)) by adopt(cid:173)\ning an incremental-batch approach that separates the exploration\nprocess from the function fitting process. Our BFBP (Batch Fit\nto Best Paths) algorithm alternates between an exploration phase\n(during which trajectories are generated to try to find fragments\nof the optimal policy) and a function fitting phase (during which\na function approximator is fit to the best known paths from start\nstates to terminal states). An advantage of this approach is that\nbatch value-function fitting is a global process, which allows it to\naddress the tradeoffs in function approximation that cannot be\nhandled by local, online algorithms. This approach was pioneered\nby Boyan and Moore with their GROWSUPPORT and ROUT al(cid:173)\ngorithms. We show how to improve upon their work by applying\na better exploration process and by enriching the function fitting\nprocedure to incorporate Bellman error and advantage error mea(cid:173)\nsures into the objective function. The results show improved per(cid:173)\nformance on several benchmark problems.\n\n1\n\nIntroduction\n\nFunction approximation is essential for applying value-function-based reinforcement\nlearning (RL) algorithms to solve large Markov decision problems (MDPs). 
However, online RL algorithms such as SARSA(λ) have been shown experimentally to have difficulty converging when applied with function approximators. Theoretical analysis has not been able to prove convergence, even in the case of linear function approximators. (See Gordon (2001), however, for a non-divergence result.) The heart of the problem is that the approximate values of different states (e.g., s1 and s2) are coupled through the parameters of the function approximator. The optimal policy at state s1 may require increasing a parameter, while the optimal policy at state s2 may require decreasing it. As a result, algorithms based on local parameter updates tend to oscillate or even to diverge.\n\nTo avoid this problem, a more global approach is called for: an approach that can consider s1 and s2 simultaneously and find a solution that works well in both states. One approach is to formulate the reinforcement learning problem as a global search through a space of parameterized policies, as in the policy gradient algorithms (Williams, 1992; Sutton, McAllester, Singh, & Mansour, 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2000). This avoids the oscillation problem, but the resulting algorithms are slow and only converge to local optima.\n\nWe pursue an alternative approach that formulates the function approximation problem as a global supervised learning problem. This approach, pioneered by Boyan and Moore in their GROWSUPPORT (1995) and ROUT (1996) algorithms, separates the reinforcement learning problem into two subproblems: the exploration problem (finding a good partial value function) and the representation problem (representing and generalizing that value function). These algorithms alternate between two phases.
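The incremental-batch alternation just described can be sketched in a few lines of Python. This is an illustrative skeleton only, assuming placeholder callables `explore` and `fit`; the names are ours, not code from GROWSUPPORT, ROUT, or BFBP:

```python
# Hedged sketch of the incremental-batch approach: alternate an exploration
# phase (which proposes states whose values are trusted) with a global batch
# fit of the value-function approximator. `explore` and `fit` are
# placeholder callables supplied by the caller.

def incremental_batch_rl(explore, fit, n_iterations):
    """Alternate exploration and batch value-function fitting."""
    support = []        # (state, value) pairs with trusted values
    value_fn = None     # current fitted approximator (None before first fit)
    for _ in range(n_iterations):
        support.extend(explore(value_fn))  # exploration phase
        value_fn = fit(support)            # global batch fitting phase
    return value_fn
```

The point of the structure is that `fit` sees the whole support set at once, so the coupling between states is resolved globally rather than by local online updates.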
During the exploration phase, a support set of points is constructed whose optimal values are known within some tolerance. In the function fitting phase, a function approximator V̂ is fit to the support set.\n\nIn this paper, we describe two ways of improving upon GROWSUPPORT and ROUT. First, we replace the support set with the set of states that lie along the best paths found during exploration. Second, we employ a combined error function that includes terms for the supervised error, the Bellman error, and the advantage error (Baird, 1995) in the function fitting process. The resulting BFBP (Batch Fit to Best Paths) method gives significantly better performance on resource-constrained scheduling problems as well as on the mountain car toy benchmark problem.\n\n2 GrowSupport, ROUT, and BFBP\n\nConsider a deterministic, episodic MDP. Let s′ = a(s) denote the state s′ that results from performing a in s and r(s, a) denote the one-step reward. Both GROWSUPPORT and ROUT build a support set S = {(s_i, V(s_i))} of states whose optimal values V(s) are known with reasonable accuracy. Both algorithms initialize S with a set of terminal states (with V(s) = 0). In each iteration, a function approximator V̂ is fit to S to minimize Σ_i [V̂(s_i) − V(s_i)]². Then, an exploration process attempts to identify new points to include in S.\n\nIn GROWSUPPORT, a sample of points X is initially drawn from the state space. In each iteration, after fitting V̂, GROWSUPPORT computes a new estimate V(s) for each state s ∈ X according to V(s) = max_a r(s, a) + V(a(s)), where V(a(s)) is computed by executing the greedy policy with respect to V̂ starting in a(s). If V̂(a(s)) is within ε of V(a(s)) for all actions a, then (s, V(s)) is added to S.\n\nROUT employs a different procedure suitable for stochastic MDPs. Let P(s′|s, a) be the probability that action a in state s results in state s′ and R(s′|s, a) be the expected one-step reward.
During the exploration phase, ROUT generates a trajectory from the start state to a terminal state and then searches for a state s along that trajectory such that (i) V̂(s) is not a good approximation to the backed-up value V(s) = max_a Σ_{s′} P(s′|s, a)[R(s′|s, a) + V̂(s′)], and (ii) for every state s′ along a set of rollout trajectories starting at s, V̂(s′) is within ε of the backed-up value max_a Σ_{s″} P(s″|s′, a)[R(s″|s′, a) + V̂(s″)]. If such a state is found, then (s, V(s)) is added to S.\n\nBoth GROWSUPPORT and ROUT rely on the function approximator to generalize well at the boundaries of the support set. A new state s can only be added to S if V̂ has generalized to all of s's successor states. If this occurs consistently, then eventually the support set will expand to include all of the starting states of the MDP, at which point a satisfactory policy has been found. However, if this \"boundary generalization\" does not occur, then no new points will be added to S, and both GROWSUPPORT and ROUT terminate without a solution. Unfortunately, most regression methods have high bias and variance near the boundaries of their training data, so failures of boundary generalization are common.\n\nThese observations led us to develop the BFBP algorithm. In BFBP, the exploration process maintains a data structure S that stores the best known path from the start state to a terminal state and a \"tree\" of one-step departures from this best path (i.e., states that can be reached by executing an action in some state on the best path). At each state s_i ∈ S, the data structure stores the action a_i executed in that state (to reach the next state in the path), the one-step reward r_i, and the estimated value V(s_i). S also stores each action a_ that causes a departure from the best path, along with the resulting state s_, reward r_, and estimated value V(s_). We will denote by B the subset of S that constitutes the best path.
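The best-path bookkeeping just described might be sketched as follows. This is our own illustration of the data structure S and its revision step, not code from the paper; the class, field, and function names are hypothetical:

```python
from dataclasses import dataclass, field

# Hedged sketch of BFBP's exploration structure: the best known path B,
# where each node also records inferior one-step departures from it.

@dataclass
class PathNode:
    state: object
    action: object   # best known action taken at this state
    reward: float    # one-step reward for that action
    value: float     # estimated value V(state) along the best path
    departures: list = field(default_factory=list)  # (a_, r_, V) of inferior actions

def revise_best_path(path, i, new_action, new_reward, new_value):
    """If a one-step departure at path[i] beats the current continuation,
    splice it in and propagate the improved value back to ancestors."""
    node = path[i]
    if new_reward + new_value <= node.value:
        return False                     # no improvement; keep the old path
    # The old best action becomes an inferior departure.
    node.departures.append((node.action, node.reward, node.value))
    node.action, node.reward = new_action, new_reward
    node.value = new_reward + new_value
    # Propagate the improved value backwards along B.
    for j in range(i - 1, -1, -1):
        path[j].value = path[j].reward + path[j + 1].value
    # Descendants of the old best action are dropped; the caller would
    # append the states of the new greedy continuation here.
    del path[i + 1:]
    return True
```

The values are kept consistent with the paper's invariant that V(s_i) is the sum of the immediate rewards along B from step i onward.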
The estimated values V are computed as follows. For states s_i ∈ B, V(s_i) is computed by summing the immediate rewards r_j for all steps j ≥ i along B. For the one-step departure states s_, V(s_) is computed from an exploration trial in which the greedy policy was followed starting in state s_.\n\nInitially, S is empty, so a random trajectory is generated from the start state s_0 to a terminal state, and it becomes the initial best known path. In subsequent iterations, a state s_i ∈ B is chosen at random, and an action a′ ≠ a_i is chosen and executed to produce state s′ and reward r′. Then the greedy policy (with respect to the current V̂) is executed until a terminal state is reached. The rewards along this new path are summed to produce V(s′). If V(s′) + r′ > V(s_i), then the best path is revised as follows. The new best action in state s_i becomes a′ with estimated value V(s′) + r′. This improved value is then propagated backwards to update the V estimates for all ancestor states in B. The old best action a_i in state s_i becomes an inferior action a_ with result state s_. Finally, all descendants of s_ along the old best path are deleted. This method of investigating one-step departures from the best path is inspired by Harvey and Ginsberg's (1995) limited discrepancy search (LDS) algorithm. In each exploration phase, K one-step departure paths are explored.\n\nAfter the exploration phase, the value function approximation V̂ is recomputed with the goal of minimizing a combined error function:\n\nJ(V̂) = λ_s Σ_{s∈S} (V̂(s) − V(s))² + λ_b Σ_{s∈B} (V̂(s) − [r(s, a*) + V̂(a*(s))])² + λ_a Σ_{s∈B} Σ_{a_} ([r(s, a_) + V̂(a_(s))] − [r(s, a*) + V̂(a*(s))])_+.\n\nThe three terms of this objective function are referred to as the supervised, Bellman, and advantage terms. Their relative importance is controlled by the coefficients λ_s, λ_b, and λ_a. The supervised term is the usual squared error between the V(s) values stored in S and the fitted values V̂(s).
The Bellman term is the squared error between the fitted value and the backed-up value of the next state on the best path. And the advantage term penalizes any case where the backed-up value of an inferior action a_ is larger than the backed-up value of the best action a*. The notation (u)_+ = u if u ≥ 0 and 0 otherwise.\n\nTheorem 1 Let M be a deterministic MDP such that (a) there are only a finite number of starting states, (b) there is only a finite set of actions executable in each state, and (c) all policies reach a terminal state. Then BFBP applied to M converges.\n\nProof: The LDS exploration process is monotonic, since the data structure S is only updated if a new best path is found. The conditions of the theorem imply that there are only a finite number of possible paths that can be explored from the starting states to the terminal states. Hence, the data structure S will eventually converge. Consequently, the value function V̂ fit to S will also converge. Q.E.D.\n\nThe theorem requires that the MDP contain no cycles. There are cycles in our job-shop scheduling problems, but we eliminate them by remembering all states visited along the current trajectory and barring any action that would return to a previously visited state. Note also that the theorem applies to MDPs with continuous state spaces provided the action space and the start states are finite.\n\nUnfortunately, BFBP does not necessarily converge to an optimal policy. This is because LDS exploration can get stuck in a local optimum such that all one-step departures using the V̂-greedy policy produce trajectories that do not improve over the current best path. Hence, although BFBP resembles policy iteration, it does not have the same optimality guarantees, because policy iteration evaluates the current greedy policy in all states in the state space.\n\nTheoretically, we could prove convergence to the optimal policy under modified conditions.
If we replace LDS exploration with ε-greedy exploration, then exploration will converge to the optimal paths with probability 1. When trained on those paths, if the function approximator fits a sufficiently accurate V̂, then BFBP will converge optimally. In our experiments, however, we have found that ε-greedy gives no improvement over LDS, whereas LDS exploration provides more complete coverage of one-step departures from the current best path, and these are used in J(V̂).\n\n3 Experimental Evaluation\n\nWe have studied five domains: Grid World and Puddle World (Boyan & Moore, 1995), Mountain Car (Sutton, 1996), and resource-constrained scheduling problems ART-1 and ART-2 (Zhang & Dietterich, 1995). For the first three domains, following Boyan and Moore, we compare BFBP with GROWSUPPORT. For the final domain, it is difficult to draw a sample of states X from the state space to initialize GROWSUPPORT. Hence, we compare against ROUT instead. As mentioned above, we detected and removed cycles from the scheduling domain (since ROUT requires this). We retained the cycles in the first three problems. On Mountain Car, we also applied SARSA(λ) with the CMAC function approximator developed by Sutton (1996).\n\nWe experimented with two function approximators: regression trees (RT) and locally-weighted linear regression (LWLR). Our regression trees employ linear separating planes at the internal nodes and linear surfaces at the leaf nodes. The trees are grown top-down in the usual fashion. To determine the splitting plane at a node, we choose a state s_i at random from S, choose one of its inferior children s_, and construct the plane that is the perpendicular bisector of these two points. The splitting plane is evaluated by fitting the resulting child nodes to the data (as leaf nodes) and computing the value of J(V̂).
A number C of parent-child pairs (s_i, s_) are generated and evaluated, and the best one is retained as the splitting plane. This process is then repeated recursively until a node contains fewer than M data points. The linear surfaces at the leaves are trained by gradient descent to minimize J(V̂). The gradient descent terminates after 100 steps, or earlier if J becomes very small. In our experiments, we tried all combinations of the following parameters and report the best results: (a) 11 learning rates (from 0.00001 to 0.1), (b) M = 1, 10, 20, 50, 100, 1000, (c) C = 5, 10, 20, 50, 100, and (d) K = 50, 100, 150, 200.\n\nFor locally-weighted linear regression, we replicated the methods of Boyan and Moore. To compute V̂(s), a linear regression is performed using all points s_i ∈ S weighted by their distance to s according to the kernel exp(−‖s_i − s‖²/σ²). We experimented with all combinations of the following parameters and report the best results: (a) 29 values (from 0.01 to 1000.0) of the tolerance ε that controls the addition of new points to S, and (b) 39 values (from 0.01 to 1000.0) for σ.\n\nTable 1: Comparison of results on three toy domains.\n\nProblem Domain | Algorithm | Optimal Policy? | Best Policy Length\nGrid World | GROWSUPPORT | Yes | 39\nGrid World | BFBP | Yes | 39\nPuddle World | GROWSUPPORT | Yes | 39\nPuddle World | BFBP | Yes | 39\nMountain Car | SARSA(λ) | No | 103\nMountain Car | GROWSUPPORT | No | 93\nMountain Car | BFBP | Yes | 88\n\nTable 2: Results of ROUT and BFBP on scheduling problem ART-1-TRN00.\n\nPerformance | ROUT (RT) | ROUT (LWLR) | BFBP (RT)\nBest policy explored | 1.75 | 1.55 | 1.50\nBest final learned policy | 1.8625 | 1.8125 | 1.55\n\nWe execute ROUT and GROWSUPPORT to termination.
We execute BFBP for 100 iterations, but it converges much earlier: 36 iterations for the grid world, 3 for puddle world, 10 for mountain car, and 5 for the job-shop scheduling problems.\n\nTable 1 compares the results of the algorithms on the toy domains, with parameters for each method tuned to give the best results and with λ_s = 1 and λ_b = λ_a = 0. In all cases, BFBP matches or beats the other methods. In Mountain Car, in particular, we were pleased that BFBP discovered the optimal policy very quickly.\n\nTable 2 compares the results of ROUT and BFBP on job-shop scheduling problem TRN00 from problem set ART-1 (again with λ_s = 1 and λ_b = λ_a = 0). For ROUT, results with both LWLR and RT are shown. LWLR gives better results for ROUT. We conjecture that this is because ROUT needs a value function approximator that is conservative near the boundary of the training data, whereas BFBP does not. We report both the best policy found during the iterations and the final policy at convergence. Figure 1 plots the results for ROUT (LWLR) against BFBP (RT) for eight additional scheduling problems from ART-1. The figure of merit is RDF, which is a normalized measure of schedule length (small values are preferred). BFBP's learned policy outperforms ROUT's in every case.\n\nThe experiments above all employed only the supervised term in the error function J. These experiments demonstrate that LDS exploration gives better training sets than the support-set methods of GROWSUPPORT and ROUT. Now we turn to the question of whether the Bellman and advantage terms can provide improved results. For the grid world and puddle world tasks, the supervised term already gives optimal performance, so we focus on the mountain car and job-shop scheduling problems.\n\nTable 3 summarizes the results for BFBP on the mountain car problem. All parameter settings, except for the last, succeed in finding the optimal policy.
To get\n\n[Figure 1: Scatter plot of RDF for ROUT (LWLR) versus BFBP (RT) on the ART-1 scheduling problems, showing the best policy explored (+) and the best final learned policy (x) against the line y = x.]