{"title": "Approximate Policy Iteration with a Policy Language Bias", "book": "Advances in Neural Information Processing Systems", "page_first": 847, "page_last": 854, "abstract": "", "full_text": "Approximate Policy Iteration\nwith a Policy Language Bias\n\nAlan Fern and SungWook Yoon and Robert Givan\n\nElectrical and Computer Engineering, Purdue University, W. Lafayette, IN 47907\n\nAbstract\n\nWe explore approximate policy iteration, replacing the usual cost-\nfunction learning step with a learning step in policy space. We give\npolicy-language biases that enable solution of very large relational\nMarkov decision processes (MDPs) that no previous technique can solve.\nIn particular, we induce high-quality domain-speci\ufb01c planners for clas-\nsical planning domains (both deterministic and stochastic variants) by\nsolving such domains as extremely large MDPs.\n\n1 Introduction\nDynamic-programming approaches to \ufb01nding optimal control policies in Markov decision\nprocesses (MDPs) [4, 14] using explicit (\ufb02at) state space representations break down when\nthe state space becomes extremely large. More recent work extends these algorithms to\nuse propositional [6, 11, 7, 12] as well as relational [8] state-space representations. These\nextensions have not yet shown the capacity to solve large classical planning problems such\nas the benchmark problems used in planning competitions [2]. These methods typically\ncalculate a sequence of cost functions. For familiar STRIPS planning domains (among\nothers), useful cost functions can be dif\ufb01cult or impossible to represent compactly.\nThe above techniques guarantee a certain accuracy at each stage. Here, we focus on in-\nductive techniques that make no such guarantees. 
Existing inductive forms of approximate policy iteration (API) select compactly represented, approximate cost functions at each iteration of dynamic programming [5], again suffering when such representation is difficult. We know of no previous work that applies any form of API to benchmark problems from classical planning.1 Perhaps one reason is the complexity of typical cost functions for these problems, for which it is often more natural to specify a policy space. Recent work on inductive policy selection in relational planning domains [17, 19, 28] has shown that useful policies can be learned using a policy-space bias, described by a generic knowledge representation language. Here, we incorporate that work into a practical approach to API for STRIPS planning domains.
We replace the use of cost-function approximations as policy representations in API2 with direct, compact state-action mappings, and use a standard relational learner to learn these mappings. We inherit from familiar API methods a (sampled) policy-evaluation phase using simulation of the current policy, or rollout [25], and an inductive policy-selection phase inducing an approximate next policy from sampled current-policy values.

1Recent work in relational reinforcement learning has been applied to STRIPS problems with much simpler goals than typical benchmark planning domains, and is discussed below in Section 5.
2In concurrent work, [18] pursued a similar approach to API in attribute-value domains.

We evaluate our API approach in several STRIPS planning domains, showing iterative policy improvement. Our technique solves entire planning domains, finding a policy that can be applied to any problem in the domain, rather than solving just a single problem instance from the domain. We view each planning domain as a single large MDP where each “state” specifies both the current world and the goal. 
The API method thus learns control knowledge (a “policy”) for the given planning domain.
Our API technique naturally leverages heuristic functions (cost-function estimates), if available—this allows us to benefit from recent advances in domain-independent heuristics for classical planning, as discussed below. Even when greedy heuristic search solves essentially none of the domain instances, our API technique successfully bootstraps from the heuristic guidance. We also demonstrate that our technique is able to iteratively improve policies that correspond to previously published hand-coded control knowledge (for TL-Plan [3]) and policies learned by Yoon et al. [28]. Our technique gives a new way of using heuristics in planning domains, complementing traditional heuristic search strategies.
2 Approximate Policy Iteration
We first review API for a general, action-simulator–based MDP representation, and later, in Section 3, detail a particular representation of planning domains as relational MDPs and the corresponding policy-space learning bias.
Problem Setup. We follow and adapt [16] and [5]. We represent an MDP using a generative model ⟨S, A, T, C, I⟩, where S is a finite set of states, A is a finite set of actions, and T is a randomized “action-simulation” algorithm that, given state s and action a, returns a next state t. The component C is an action-cost function that maps S × A to real numbers, and I is a randomized “initial-state algorithm” with no inputs that returns a state in S. 
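This generative representation can be sketched as a small container type. The field names, the dataclass packaging, and the two-state chain below are our own illustrative choices, not part of the paper:

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable

# Hypothetical container for a generative model <S, A, T, C, I> as described
# above: T simulates one transition, C gives action costs, I samples a start
# state. The two-state chain below is purely illustrative.
@dataclass
class GenerativeMDP:
    actions: Sequence[Action]            # the finite action set A
    T: Callable[[State, Action], State]  # randomized action simulator
    C: Callable[[State, Action], float]  # action-cost function
    I: Callable[[], State]               # initial-state sampler

def simulate(s, a):
    # action 'go' reaches the absorbing state 1 with probability 0.9
    return 1 if (a == 'go' and random.random() < 0.9) else s

mdp = GenerativeMDP(actions=['go', 'stay'],
                    T=simulate,
                    C=lambda s, a: 0.0 if s == 1 else 1.0,
                    I=lambda: 0)
```

Any object exposing these four components can drive the simulation-based procedures that follow.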
We sometimes treat I and T(s, a) as random variables.
For MDP M = ⟨S, A, T, C, I⟩, a policy π is a (possibly stochastic) mapping from S to A. The cost function J^π_M(s) and the Q-cost function Q^π_M(s, a) are the unique solutions to

J^π_M(s) = E[Q^π_M(s, π(s))],   Q^π_M(s, a) = C(s, a) + αE[J^π_M(T(s, a))],

representing the expected, cumulative, discounted cost of following policy π in M starting from state s, and where 0 ≤ α < 1 is the discount factor. In this work, we seek to heuristically minimize E[J^π_M(I)], due to the complexity of the problems we consider.
Given a current policy π, we can define a new improved policy PI[π](s) = argmin_{a∈A} Q^π_M(s, a). The cost function of PI[π] is guaranteed to be no worse than that of π at each state and to improve at some state for non-optimal π. Exact policy iteration iterates policy improvement (PI) from any initial policy to reach an optimal fixed point. Policy improvement is divided into two steps: computing J^π_M (policy evaluation) and then computing Q^π_M and selecting the minimizing action (policy selection).
Approximate Policy Iteration. API, as described in [5], heuristically approximates policy iteration in large state spaces by using an approximate policy-improvement operator trained with Monte-Carlo simulation. 
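Before moving to the approximate operator, the exact evaluation and improvement equations above can be exercised on a tiny flat MDP whose transition probabilities are enumerated. This is only an illustrative sketch under our own toy encoding; the relational MDPs targeted in this paper are far too large for tabular treatment:

```python
# Exact policy iteration on a tiny flat MDP with known transition
# probabilities, matching the equations above (discount alpha). The chain
# MDP and all names here are illustrative, not from the paper.

def policy_evaluation(P, C, pi, alpha, iters=500):
    # iterate J(s) <- C(s, pi(s)) + alpha * E[J(T(s, pi(s)))] to a fixed point
    J = {s: 0.0 for s in P}
    for _ in range(iters):
        J = {s: C[s][pi[s]] + alpha * sum(p * J[t] for t, p in P[s][pi[s]].items())
             for s in P}
    return J

def policy_improvement(P, C, J, alpha):
    # PI[pi](s) = argmin_a C(s, a) + alpha * E[J(T(s, a))]
    def Q(s, a):
        return C[s][a] + alpha * sum(p * J[t] for t, p in P[s][a].items())
    return {s: min(P[s], key=lambda a: Q(s, a)) for s in P}

# state 1 is a zero-cost (goal) state; 'go' reaches it from state 0 w.p. 0.9
P = {0: {'go': {1: 0.9, 0: 0.1}, 'stay': {0: 1.0}},
     1: {'go': {1: 1.0}, 'stay': {1: 1.0}}}
C = {0: {'go': 1.0, 'stay': 1.0}, 1: {'go': 0.0, 'stay': 0.0}}
pi = {0: 'stay', 1: 'stay'}
for _ in range(3):  # iterate policy improvement to a fixed point
    pi = policy_improvement(P, C, policy_evaluation(P, C, pi, 0.9), 0.9)
assert pi[0] == 'go'  # the improved policy heads for the goal state
```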
The approximate operator performs policy evaluation by simulation—evaluating a policy π at a state s by drawing some number of sample trajectories of π starting at s—and performs policy selection by constructing a training set of samples of either the J or Q cost functions from a “small” but “representative” set of states and then using this training set to induce a new “approximately improved” policy.
The use of API assumes that states and perhaps actions are represented in factored form (typically, a feature vector) that facilitates generalizing properties of the training data to the entire state and action spaces. Due to API’s inductive nature, there are typically no guarantees of policy improvement—nevertheless, API often “converges” usefully, e.g. [24, 26].
We start API by providing it with an initial policy π0 and a real-valued heuristic function H, where H(s) is interpreted as an estimate of the cost of state s (presumably with respect to the optimal policy). We note that H or π0 may be trivial, i.e. always returning a constant or a random action, respectively. For API to be effective, however, it is important that π0 and H combine to provide guidance toward improvement. For example, in goal-based planning domains either π0 should occasionally reach a goal or H should provide non-trivial goal-distance information. In our experiments we consider scenarios that use different types of initial policies and heuristics to bootstrap API.
Given π0, H, and an MDP M = ⟨S, {a1, . . . , am}, T, C, I⟩, API produces a policy sequence by iterating steps of approximate policy improvement—note that π0 is used in only the initial iteration but the heuristic is always used. Approximate policy improvement computes an (approximate) improvement π′ of a policy π by attempting to approximate the output of exact policy improvement, i.e. 
π′(s) = argmin_{a∈A} Q^π_M(s, a). There are two steps: estimating Q-costs for all actions at a representative set of states, and using the resulting data set to learn an approximation of π′. Figure 1 gives pseudo-code for our variant of API.
Step 1: Q-Cost Estimation via Rollout. (see [25]) Given π, we construct a training set D, describing an improved policy π′, consisting of tuples ⟨s, π(s), Q̂(s, a1), . . . , Q̂(s, am)⟩. For each sampled state s and action a, the term Q̂(s, a) refers to Q^π_M(s, a) as estimated by drawing “sampling width” trajectories of length “horizon” from s and computing the average discounted trajectory cost over the sampled trajectories, where the cost of a trajectory includes the value of the heuristic function at the horizon state. To get a “representative set” of states, we include each state s visited by π′ (as indicated by the Q̂ estimates) within “horizon” steps from one of “training set size” states drawn from the initial distribution.3
Step 2: Learn Policy. Select π′ with the goal of minimizing the cumulative Q̂-cost of π′ over D (approximating the same minimization over S in exact policy iteration). Traditional API uses a cost-function-space learning bias in this selection—in Section 3 we detail the policy-space learning bias used by our technique. By labeling each training state with the associated Q-costs for each action, rather than simply with the best action, we enable the learner to make more informed trade-offs. We note that the inclusion of π(s) in each training example enables the learner to normalize the data, if desired—e.g. 
our learner (see Section 3) uses a bias that focuses on states where large improvement appears possible.
3 API for Relational Planning
In order to use our API framework, we represent classical planning domains (not just single instances) as relationally factored MDPs. We then describe our compact relational policy language and the associated learner for use in Step 2 of our API framework.
Planning Domains as MDPs. We say that an MDP ⟨S, A, T, C, I⟩ is relational when S and A are defined by giving the finite sets of objects O, predicates P, and action types Y. A fact is a predicate applied to the appropriate number of objects. A state in S is a set of facts (taken to be “true” in the state), and S is the set of all such states. An action is an action type applied to the appropriate number of objects, and the action space A is the set of all actions.
A classical planning domain is specified by providing a set of world predicates, action types, and an action simulator. We simultaneously solve all problem instances of such a planning domain4 by constructing a relational MDP as described below.
Let O be a fixed set of objects and Y be the set of action types from the planning domain. Together, O and Y define the MDP action space. Each MDP state is a single problem
3It is important that states are sampled from π′ rather than π, to match the training distribution to the implied “test set” distribution.
4As an example, the blocks world is a classical planning domain, where a problem instance is an initial block configuration and a set of goal conditions. 
Classical planners attempt to find solutions to specific problem instances of a domain.

API (n, w, h, H, π0)
// training set size n, sampling width w, horizon h,
// initial policy π0, cost estimator (heuristic function) H
π ← π0;
loop
    D ← Draw-Training-Set(n, w, h, H, π);
    π ← Learn-Decision-List(D);
until satisfied with π; // e.g. until change is small
Return π;

Draw-Training-Set (n, w, h, H, π)
// training set size n, sampling width w, horizon h,
// cost estimator H, current policy π
D ← ∅; E ← set of n states sampled from I;
for each state s0 ∈ E // draw a trajectory of sample states from s0
    s ← s0;
    for i = 1 to h
        Q^π(s) ← Policy-Rollout(π, s, w, h, H);
        a ← action minimizing Q^π(s, a);
        D ← {⟨s, π(s), Q^π(s)⟩} ∪ D;
        s ← state sampled from T(s, a);
Return D;

Policy-Rollout (π, s, w, h, H)
// Computes estimate of Q^π(s)
// policy π, state s, sampling width w, horizon h, cost estimator H
Initialize Q^π(s), a vector indexed by the actions in A, to zeroes;
for each action a in A
    for sample = 1 to w
        Q^π(s, a) ← Q^π(s, a) + C(s, a);
        s′ ← a state sampled from T(s, a);
        for step = 1 to h
            Q^π(s, a) ← Q^π(s, a) + C(s′, π(s′));
            s′ ← a state sampled from T(s′, π(s′));
        Q^π(s, a) ← Q^π(s, a) + H(s′);
    Q^π(s, a) ← Q^π(s, a) / w;
Return Q^π(s);

Figure 1: Pseudo-code for our API algorithm. The MDP ⟨S, A, T, C, I⟩ is assumed globally known. The general approach is inherited from [5], and is restated here for clarity. Key differences are the use of Learn-Decision-List [28], as discussed in Section 3, and the choice of action a in Draw-Training-Set (see Footnote 3).

instance (i.e. an initial state and a goal) from the planning domain by specifying both the current world and the goal. 
We achieve this by letting P be the set of world predicates from the classical domain together with a new set of goal predicates, one for each world predicate. Goal predicates are named by prepending a ‘g’ to the corresponding world predicate. Thus, the MDP states are sets of world and goal facts involving some or all objects in O. The objective is to reach MDP states where the goal facts are a subset of the world facts (goal states). The state {on-table(a), on(a, b), clear(b), gclear(b)} is thus a goal state in a blocks-world MDP, but would not be a goal state without clear(b). We represent this objective by defining C to assign zero cost to actions taken in goal states and a positive cost to actions in all other states. In addition, we take T to be the action simulator from the planning domain (e.g. as defined by STRIPS rules), modified to treat goal states as terminal and to preserve without change all goal predicates. With this cost function, a low-cost policy must arrive at goal states as “quickly” as possible. Finally, the initial-state distribution I can be any program that generates legal problem instances (MDP states) of the planning domain—e.g. one might use a problem generator from a planning competition.
While here we assume an accurate model of T is known, a more general reinforcement-learning context would require learning an approximate T, trading off exploitation of this model with exploration to improve it.
Taxonomic Decision List Policies. We adapt the API method of Section 2 by using, for Step 2, the policy-space language bias and learning method of our previous work on learning policies in relational domains from small problem solutions [28], briefly reviewed here.
In relational domains, useful rules often take the form “apply action type a to any object in set C”, e.g. “unload any object that is at its destination”. 
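The goal test and cost function of the relational-MDP construction above reduce to set operations on facts; a minimal sketch, under an assumed tuple encoding of facts:

```python
# Illustrative encoding of the construction above: a state is a set of fact
# tuples (predicate, object, ...), goal predicates carry a leading 'g', a
# goal state has all its goal facts among its world facts, and C charges
# unit cost outside goal states. The naming convention is a simplification.

def is_goal(state):
    world = {f for f in state if not f[0].startswith('g')}
    goals = {(f[0][1:],) + f[1:] for f in state if f[0].startswith('g')}
    return goals <= world

def cost(state, action):
    return 0.0 if is_goal(state) else 1.0

s = {('on-table', 'a'), ('on', 'a', 'b'), ('clear', 'b'), ('gclear', 'b')}
assert is_goal(s)                        # gclear(b) is satisfied by clear(b)
assert cost(s, 'putdown') == 0.0
assert not is_goal(s - {('clear', 'b')})  # goal fact no longer satisfied
```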
In [19], decision lists of such rules were used as a language bias for learning policies. We use such lists, and represent the needed sets of objects using class expressions C written in taxonomic syntax [20], defined by
C ::= C0 | anything | ¬C | (R C) | C ∩ C,   with   R ::= R0 | R⁻¹ | R ∩ R | R*.
Here, C0 is any one-argument relation and R0 any binary relation from the predicates in P. One-argument relations denote the set of objects that they are true of, and (R C) denotes the image of the objects in class C under the binary relation R; for the (natural) semantics of the other constructs shown, please refer to [28]. Given a state s and a concept C expressed in taxonomic syntax, it is straightforward to compute, in time polynomial in the sizes of s and C, the set of domain objects that are represented by C in s.
Restricting our attention to one-argument action types5, we write a policy as ⟨C1:a1, C2:a2, . . . , Cn:an⟩, where the Ci are taxonomic-syntax concepts and the ai are action types. See Yoon et al. [28] for examples and details.
Our learner builds a decision list of size-bounded rules by starting with the empty list and greedily selecting a new rule to add, continuing until the list “covers” all of the training data. This procedure is described in Yoon et al. [28], where a heuristically guided beam search is used to greedily select the next rule to add. The only difference between the learner in [28] and the one used here is the heuristic function, which incorporates Q-cost information (unlike [28]). Given training example ⟨s, π(s), Q̂(s, a1), . . . , Q̂(s, am)⟩ in D, we define the Q-advantage of taking action a instead of π(s) in state s by Δ(s, a) = Q̂(s, π(s)) − Q̂(s, a). 
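A minimal evaluator for such class expressions and decision-list policies might look like the following. The nested-tuple encoding, the direction taken for the image construct, and the tie-breaking rule are our own illustrative choices; see [20, 28] for the full semantics, including R*:

```python
# Sketch of a taxonomic class-expression evaluator. A state maps each
# one-argument predicate to a set of objects and each binary predicate to a
# set of (x, y) pairs.

def rel(state, r):
    if r[0] == 'rel':                      # primitive binary relation R0
        return set(state.get(r[1], set()))
    if r[0] == 'inv':                      # inverse relation
        return {(y, x) for x, y in rel(state, r[1])}
    if r[0] == 'and':                      # intersection of relations
        return rel(state, r[1]) & rel(state, r[2])
    raise ValueError(r)

def conc(state, objs, c):
    if c == 'anything':
        return set(objs)
    if c[0] == 'prim':                     # one-argument relation C0
        return set(state.get(c[1], set()))
    if c[0] == 'not':                      # complement within the universe
        return set(objs) - conc(state, objs, c[1])
    if c[0] == 'img':                      # (R C): objects R-related to some member of C
        targets = conc(state, objs, c[2])
        return {x for x, y in rel(state, c[1]) if y in targets}
    if c[0] == 'and':                      # intersection of concepts
        return conc(state, objs, c[1]) & conc(state, objs, c[2])
    raise ValueError(c)

def apply_policy(state, objs, rules):
    # decision list <C1:a1, ..., Cn:an>: the first rule whose class is
    # nonempty fires; the object choice here is an arbitrary fixed tie-break
    for c, act in rules:
        hit = conc(state, objs, c)
        if hit:
            return (act, sorted(hit)[0])
    return None

# blocks-world fragment: pick up any clear block that sits on another block
objs = {'a', 'b', 'c'}
state = {'clear': {'a'}, 'on': {('a', 'b'), ('b', 'c')}}
rule = (('and', ('prim', 'clear'), ('img', ('rel', 'on'), 'anything')), 'pickup')
assert apply_policy(state, objs, [rule]) == ('pickup', 'a')
```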
We take the heuristic value of a concept-action rule to be the number of training examples where the rule “fires” plus the cumulative Q-advantage that the rule achieves on those training examples.6 Using Q-advantage rather than Q-cost focuses the learner toward instances where large improvement over the previous policy is possible.
4 Relational Planning Experiments
Our experiments support three claims. 1) Using only the guidance of an (often weak) domain-independent heuristic, API learns effective policies for entire classical planning domains. 2) Each learned policy is a domain-specific planner that is fast and empirically compares well to the state-of-the-art domain-independent planner FF [13]. 3) API can improve on previously published control knowledge and on that learned by previous systems.
Domains. We consider two deterministic domains with standard definitions and three stochastic domains from Yoon et al. [28]—these are: BW(n), the n-block blocks world; LW(l,t,p), the l-location, t-truck, p-package logistics world; SBW(n), a stochastic variant of BW(n); SLW(l,c,t,p), the stochastic logistics world with c cars and t trucks; and SPW(n), a version of SBW(n) with a paint action. We draw problem instances from each domain by generating pairs of random initial states and goal conditions. The goal conditions specify block configurations involving all blocks in blocks worlds, and destinations for all packages in logistics worlds.7
Throughout, we use the domain-independent FF heuristic [13].8 Each experiment specifies a planning domain and an initial policy and then iterates API9 until “no more progress” is made. 
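The rule-scoring heuristic just described (coverage plus cumulative Q-advantage) can be sketched as follows. The example format and the covers predicate are placeholders standing in for the taxonomic machinery, not the actual implementation:

```python
# Score a candidate rule by its coverage plus the cumulative Q-advantage
# achieved on the covered examples, as described above.

def q_advantage(example, a):
    s, pi_s, qhat = example          # training tuple: state, pi(s), Qhat(s, .)
    return qhat[pi_s] - qhat[a]      # cost saved by a relative to pi(s)

def rule_score(rule, data, covers):
    concept, action = rule
    covered = [ex for ex in data if covers(concept, ex[0])]
    return len(covered) + sum(q_advantage(ex, action) for ex in covered)

# two toy examples; the candidate rule proposes 'go' wherever it covers
data = [('s1', 'stay', {'stay': 5.0, 'go': 3.0}),
        ('s2', 'stay', {'stay': 2.0, 'go': 2.0})]
score = rule_score(('C', 'go'), data, covers=lambda concept, s: True)
assert score == 4.0   # coverage 2 + advantages (5-3) + (2-2)
```

Note that the zero-advantage example still contributes through the coverage term, matching Footnote 6.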
We evaluate each policy on 1000 random problem instances, recording the success ratio SR (fraction of problems solved within the horizon) and the normalized average solution length AL/H (average plan length in successful trials divided by horizon), omitting AL/H for very low SR. Initial-policy performance is plotted at iteration zero.
Bootstrapping from the Heuristic. We consider the domain-independent initial policy10 FF-Greedy, which acts using the FF heuristic with one-step look-ahead. Figures 2a and 2b show SR and AL/H after each API iteration for BW(10) and BW(15). FF-Greedy is poor in both domains. There is an initial period of no (apparent) progress, followed by rapid improvement to nearly perfect SR. Examination of the learned BW(15) policies shows that early iterations find important concepts and later iterations find a policy that achieves a small SR; at that point, rapid improvement ensues. Figure 2c shows the SR and AL/H for LW(4,4,12). FF-Greedy performs very well here; nevertheless, API yields compact declarative policies of the same quality as FF-Greedy. We replicated these experiments in the stochastic variants of these domains, with similar results (not shown for space reasons).

Figure 2: Bootstrapping API with a domain-independent heuristic. (Panels a–c plot SR and AL/H against API iteration for BW(10), BW(15), and LW(4,4,12).)

Initial Hand-Coded Policies. TL-Plan [3] uses human-coded domain-specific control knowledge to solve classical planning problems. Here we use initial policies for API that correspond to the domain-specific control knowledge appearing in [3].11 For the blocks world, TL-Plan provides three sets of control knowledge of increasing quality; we use the best and second-best sets to get the policies TL-BW-a and TL-BW-b, respectively. For logistics there is only one set of knowledge given, yielding the policy TL-LW.
Figures 3a–3c show the SR and AL/H for API when starting with TL-BW-a and TL-BW-b in BW(10) and TL-LW in LW(4,4,12). In each case, API improves the human-coded policies. Starting with TL-BW-a and TL-LW, which have perfect SR, API uncovers policies that maintain SR but improve AL/H by approximately 6.3% and 13%, respectively. Starting with TL-BW-b, which has an SR of only 30%, API quickly uncovers policies with perfect SR. There is a dramatic difference in the quality of FF-Greedy (iteration 0 of Figure 2a), TL-BW-a, and TL-BW-b in BW(10); yet, for each initial policy, API finds policies of roughly identical quality, requiring more iterations for lower-quality initial policies.

Figure 3: Using TL-Plan control knowledge as initial policies. (Panels plot SR and AL/H against iteration for TL-BW-a and TL-BW-b in BW(10) and for TL-LW.)

Initial Machine-Learned Policies. In Yoon et al. [28], policies were learned from solutions to randomly drawn small problems for the three stochastic domains we test here, among others. A significant range of policy qualities results, due to the random draw. Here, we use API starting with some below-average policies from that work.12 Figures 4a–4c show results for SPW(10), SLW(4,3,3,4), and SBW(10). For each domain, API is shown to improve the SR for two arbitrarily selected, below-average, learned starting policies to nearly 1.0. API successfully exploits the previous, noisy learning to robustly obtain a good policy.

Figure 4: Using previously learned initial policies. (Panels plot the SR of two starting policies against iteration in SPW(10), SLW(4,3,3,4), and SBW(10).)

5Multiple-argument actions can be simulated at some cost with multiple single-argument actions.
6If the coverage term is not included, then covering a zero-Q-advantage example is the same as not covering it. But zero Q-advantage can be good (e.g. the previous policy is optimal in that state).
7PSTRIPS domain definitions are at http://www.ece.purdue.edu/~givan/nips03-domains.html.
8Space precludes a description of this complex and well-studied planning heuristic here.
9We use discount factor 1 and select large enough horizons to accurately rank most policies: 4×n for BW(n) and SBW(n), 6×n for SPW(n), 12×p for LW(l,t,p) and SLW(l,c,t,p). Training set size is 100 trajectories, and sampling width is always 1, which worked well even for stochastic domains. A sampling width of 1 corresponds to a preference to draw a small number of trajectories from each of a variety of problems rather than a larger number from each of relatively fewer training problems; in either case, the learner must be robust to the noise resulting from stochastic effects.
10What is considered “domain independent” here is the means of constructing the policy.
11We cannot exactly capture the TL-Plan knowledge in our policy language. Instead, we write policies that capture the knowledge but prune away some “bad” actions that TL-Plan might consider.

Table 1: FF vs. learned policies.

Comparing learned policies to FF. A learned policy corresponds to a domain-specific planner for the target planning domain. 
Here we show that these policies are competitive with FF, a state-of-the-art AI planner, with respect to planning time and success ratio. We selected a blocks-world policy and a logistics-world policy corresponding to the learned policies (beyond iteration 0) in Figures 2a and 2c with the best SR, breaking ties with AL. We applied FF and the appropriate selected policy to each of 1000 new test problems from each of the domains shown in Table 1. Planning cutoff times were set at 600, 300, and 100 seconds for BW(30), BW(20), and all other domains, respectively. Table 1 records the percent of problems solved within the time cutoff (SR), the average length of successful trials (AL), and the average time for successful trials (Time) for both FF and our two selected policies.

                 FF (in C)            API (Scheme)
Domain          SR    AL    Time      SR    AL   Time
BW(10)          1     33    0.1s      0.99  25   1.5s
BW(15)          0.96  58    2.7s      0.99  39   2.5s
BW(20)          0.75  62    27.7s     0.98  55   3.7s
BW(30)          0.14  103   166.0s    0.99  86   2.8s
LW(4,4,12)      1     42    0.0s      1     43   2.7s
LW(5,14,20)     1     73    0.4s      1     74   3.6s

In blocks worlds with more than 10 blocks, the API policy improves on FF in every category, and scales much better to 20 and 30 blocks. Using the same heuristic information (in a different way), API uncovers policies that significantly outperform FF. FF’s heuristic is well suited to logistics worlds, eliminating search for these problems. Our method performs equivalently, but for the slow prototype Scheme implementation.
5 Related Work
Typically, previous “learning for planning” systems [22] learn from small-problem solutions to improve the efficiency and/or quality of planning. Two primary approaches are to learn control knowledge for search-based planners, e.g. [23, 27, 10, 15, 1], and, more closely related, to learn stand-alone control policies [17, 19, 28].
The former work is severely limited by the utility problem (see [21]), i.e., being “swamped” by low-utility rules. Critically, our policy-language bias confronts this issue by preferring simpler policies. 
Regarding the latter, our work is novel in using API to iteratively improve policies, and leads to a more robust learner, as shown above. In addition, we leverage a domain-independent planning heuristic to avoid the need for access to small problems. Our learning approach is also not tied to having a base planner.
The most closely related work is relational reinforcement learning (RRL) [9], a form of online API that learns relational cost-function approximations. Q-cost functions are learned in the form of relational decision trees (Q-trees) and are used to learn corresponding policies (P-trees). The RRL results clearly demonstrate the difficulty of learning cost-function approximations in relational domains. Compared to P-trees, Q-trees tend to generalize poorly and to be much larger. RRL has not yet demonstrated scalability to problems as complex as those considered here—previous RRL blocks-world experiments include relatively simple goals13, which lead to cost functions that are much less complex than the ones here. However, unlike RRL, our API assumes an unconstrained simulator and (for the FF heuristic) a world model, which must be provided or learned by additional techniques.

12For these stochastic domains we provide the heuristic (designed for deterministic domains) with a deterministic STRIPS domain approximation (using the most likely outcome of each action).
13The most complex blocks-world goal for RRL was to achieve on(A, B) in an n-block environment. We consider blocks-world goals that involve all n blocks.

References
[1] Ricardo Aler, Daniel Borrajo, and Pedro Isasi. Using genetic programming to learn and improve control knowledge. AIJ, 141(1-2):29–56, 2002.
[2] Fahiem Bacchus. The AIPS ’00 planning competition. AI Magazine, 22(3):57–62, 2001.
[3] Fahiem Bacchus and Froduald Kabanza. Using temporal logics to express search control knowledge for planning. AIJ, 16:123–191, 2000.
[4] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
[5] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[6] Craig Boutilier and Richard Dearden. Approximating value trees in structured dynamic programming. In Lorenza Saitta, editor, ICML, 1996.
[7] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Stochastic dynamic programming with factored representations. AIJ, 121(1-2):49–107, 2000.
[8] Craig Boutilier, Raymond Reiter, and Bob Price. Symbolic dynamic programming for first-order MDPs. In IJCAI, 2001.
[9] S. Dzeroski, L. DeRaedt, and K. Driessens. Relational reinforcement learning. MLJ, 43:7–52, 2001.
[10] Tara A. Estlin and Raymond J. Mooney. Multi-strategy learning of search control for partial-order planning. In AAAI, 1996.
[11] Robert Givan, Thomas Dean, and Matt Greig. Equivalence notions and model minimization in Markov decision processes. AIJ, 147(1-2):163–223, 2003.
[12] Carlos Guestrin, Daphne Koller, and Ronald Parr. Max-norm projections for factored MDPs. In IJCAI, pages 673–680, 2001.
[13] Jorg Hoffmann and Bernhard Nebel. The FF planning system: Fast plan generation through heuristic search. JAIR, 14:263–302, 2001.
[14] R. Howard. Dynamic Programming and Markov Decision Processes. MIT Press, 1960.
[15] Yi-Cheng Huang, Bart Selman, and Henry Kautz. Learning declarative control rules for constraint-based planning. In ICML, pages 415–422, 2000.
[16] Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. MLJ, 49(2-3):193–208, 2002.
[17] Roni Khardon. Learning action strategies for planning domains. AIJ, 113(1-2):125–148, 1999.
[18] M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. In ICML, 2003.
[19] Mario Martin and Hector Geffner. Learning generalized policies in planning domains using concept languages. In KRR, 2000.
[20] D. McAllester and R. Givan. Taxonomic syntax for first-order inference. JACM, 40:246–283, 1993.
[21] S. Minton. Quantitative results on the utility of explanation-based learning. In AAAI, 1988.
[22] S. Minton, editor. Machine Learning Methods for Planning. Morgan Kaufmann, 1993.
[23] S. Minton, J. Carbonell, C. A. Knoblock, D. R. Kuokka, O. Etzioni, and Y. Gil. Explanation-based learning: A problem solving perspective. AIJ, 40:63–118, 1989.
[24] G. Tesauro. Practical issues in temporal difference learning. MLJ, 8:257–277, 1992.
[25] G. Tesauro and G. Galperin. On-line policy improvement using Monte-Carlo search. In NIPS, 1996.
[26] J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale DP. MLJ, 22:59–94, 1996.
[27] M. Veloso, J. Carbonell, A. Perez, D. Borrajo, E. Fink, and J. Blythe. Integrating planning and learning: The PRODIGY architecture. Journal of Experimental and Theoretical AI, 7(1), 1995.
[28] S. Yoon, A. Fern, and R. Givan. Inductive policy selection for first-order MDPs. In UAI, 2002.
", "award": [], "sourceid": 2456, "authors": [{"given_name": "Alan", "family_name": "Fern", "institution": null}, {"given_name": "Sungwook", "family_name": "Yoon", "institution": null}, {"given_name": "Robert", "family_name": "Givan", "institution": null}]}