{"title": "Envelope-based Planning in Relational MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 783, "page_last": 790, "abstract": "", "full_text": "Envelope-based Planning in Relational MDPs\n\nNatalia H. Gardiol\nMIT AI Lab\nCambridge, MA 02139\nnhg@ai.mit.edu\n\nLeslie Pack Kaelbling\nMIT AI Lab\nCambridge, MA 02139\nlpk@ai.mit.edu\n\nAbstract\n\nA mobile robot acting in the world is faced with a large amount of sensory data and uncertainty in its action outcomes. Indeed, almost all interesting sequential decision-making domains involve large state spaces and large, stochastic action sets. We investigate a way to act intelligently as quickly as possible in domains where finding a complete policy would take a hopelessly long time. This approach, Relational Envelope-based Planning (REBP), tackles large, noisy problems along two axes. First, describing a domain as a relational MDP (instead of as an atomic or propositionally-factored MDP) allows problem structure and dynamics to be captured compactly with a small set of probabilistic, relational rules. Second, an envelope-based approach to planning lets an agent begin acting quickly within a restricted part of the full state space and judiciously expand its envelope as resources permit.\n\n1 Introduction\n\nQuickly generating usable plans when the world abounds with uncertainty is an important and difficult enterprise. Consider the classic blocks world domain: the number of ways to make a stack of a certain height grows exponentially with the number of blocks on the table; and if the outcomes of actions are uncertain, the task becomes even more daunting. 
We want planning techniques that can deal with large state spaces and large, stochastic action sets, since most compelling, realistic domains have these characteristics. In this paper we propose a method for planning in very large domains by using expressive rules to restrict attention to high-utility subsets of the state space.\n\nMuch of the work in traditional planning techniques centers on propositional, deterministic domains. See Weld's survey [12] for an overview of the extensive work in this area. Efforts to extend classical planning approaches to stochastic domains mainly comprise techniques that work with fully-ground state spaces [13, 2]. Conversely, efforts to move beyond propositional STRIPS-based planning have focused mainly on deterministic domains [6, 10].\n\nBut the world is not deterministic: for an agent to act robustly, it must handle uncertain dynamics as well as large state and action spaces. Markov decision theory provides techniques for dealing with uncertain outcomes in atomic-state contexts, and much work has been done in leveraging structured representations to solve very large MDPs and some POMDPs [9, 3, 7]. While these techniques have moved MDP methods from atomic-state representations to factored ones, they still operate in fully-ground state spaces.\n\nIn order to describe large stochastic domains compactly, we need relational structures that can represent uncertainty in the dynamics. Relational representations allow the structure of the domain to be expressed in terms of object properties rather than object identities, and thus yield a much more compact representation of a domain than the equivalent propositional version can. Efficient solutions for probabilistic, first-order MDPs are difficult to come by, however. 
Boutilier et al. [3] find policies for first-order MDPs by solving for the value function of a first-order domain: the approach manipulates logical expressions that stand for sets of underlying states, but keeping the value-function representation manageable requires complex theorem-proving. Other approaches in relational MDPs represent the value function as a decision tree [5] or as a sum of local subfunctions [8]. Another recent body of work avoids learning the value function and learns policies directly from example policies [14]. These approaches all compute full policies over complete state and action spaces, however, and so are of a different spirit than the work presented here.\n\nThe underlying message is nevertheless clear: the more an agent can compute logically and the less it attends to particular domain objects, the more general its solutions will be. Since fully-ground representations grow too big to be useful and purely logical representations are as yet unwieldy, we propose a middle path: we agree to ground things out, but in a principled, restricted way. We represent world dynamics by a compact set of relational rules, and we extend the envelope method of Dean et al. [4] to use these structured dynamics. We quickly come up with an initial trajectory (an envelope of states) to the goal and then refine the policy by gradually incorporating nearby states into the envelope. This approach avoids the wild growth of purely propositional techniques by restricting attention to a useful subset of states. Our approach strikes a balance along two axes: between fully ground and purely logical representations, and between straight-line plans and full MDP policies.\n\n2 Planning with an Envelope in Relational Domains\n\nThe envelope method was initially designed for planning in atomic-state MDPs. 
Goals of achievement are encoded as reward functions, and planning becomes finding a policy that maximizes a long-term measure of reward. Extending the approach to a relational setting lets us cast the problem of planning in stochastic, relational domains in terms of finding a policy for a restricted Markovian state space.\n\n2.1 Encoding Markovian dynamics with rules\n\nThe first step in extending the envelope method to relational domains is to encode the world dynamics relationally. We use a compact set of rules, as in Figure 1. Each rule, or operator, is denoted by an action symbol and a parameterized argument list. Its behavior is defined by a precondition and a set of outcomes, together called the rule schema. Each precondition and outcome is a conjunction of domain predicates. A rule applies in a state if its precondition can be matched against some subset of the state's ground predicates. Each outcome then describes the set of possible resulting ground states. Given this structured representation of action dynamics, we define a relational MDP as a tuple ⟨P, Z, O, T, R⟩:\n\nStates: The set of states is defined by a finite set P of relational predicates, representing the properties and relations that can hold among the finite set of domain objects, O. Each RMDP state is a ground interpretation of the domain predicates over the domain objects.\n\nActions: The set of ground actions depends on the set of rules Z and the objects in the world. For example, move(A, B) can be bound to the table arrangement in Figure 2(a) by binding A to block 1 and B to block 4 to yield the ground action move(1, 4).\n\nTransition Dynamics: For each action, the distribution over next states is given compactly by the distribution over outcomes encoded in the schema. For example, executing move(1, 4) yields a 0.3 chance of landing in a state where block 1 falls on the table, and a 0.7 chance of landing in a state where block 1 is correctly put on block 4. The rule outcomes themselves usually specify only a subset of the domain predicates, effectively describing a set of possible ground states. We assume a static frame: state predicates not directly changed by the rule are assumed to remain the same.\n\nRewards: A state is deterministically mapped to a scalar reward according to the function R(s).\n\nmove(A, B)\npre: (clear(B, t), hold(nil), height(B, H), incr(H, H′), clear(A, t), on(A, C), broke(f))\neff:\n[ 0.70 ] (on(A, B), height(A, H), clear(A, t), clear(B, f), hold(nil), clear(C, t))\n[ 0.30 ] (on(A, table), clear(A, t), height(A, H), hold(nil), clear(C, t), broke(t))\n\nfix()\npre: (broke(t))\neff:\n[ 0.97 ] (broke(f))\n[ 0.03 ] (broke(t))\n\nstackon(B)\npre: (clear(B, t), hold(A), height(B, H), incr(H, H′), broke(f))\neff:\n[ 0.97 ] (on(A, B), height(A, H), clear(A, t), clear(B, f), hold(nil))\n[ 0.03 ] (on(A, table), clear(A, t), height(A, H′), hold(nil), broke(t))\n\nstackon(table)\npre: (clear(table, t), hold(A), broke(f))\neff:\n[ 1.00 ] (on(A, table), height(A, 0), clear(A, t), hold(nil))\n\npickup(A)\npre: (clear(A, t), hold(nil), on(A, B), broke(f))\neff:\n[ 1.00 ] (hold(A), clear(A, f), on(A, nil), clear(B, t), height(A, -1))\n\nFigure 1: The set of relational rules, Z, for blocks-world dynamics.2 Each rule schema contains the action name, precondition, and a set of effects.\n\n2.2 Initial trajectory planning\n\nThe next step is finding an initial path. In a relational setting, when the underlying MDP space implied by the full instantiation of the representation is potentially huge, a good initial envelope is crucial. 
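As a concrete illustration of the rule schemas above, here is a minimal sketch of how such a probabilistic rule might be represented (the class layout and predicate encoding are our assumptions, not the paper's code; following the paper's convention, each predicate's final argument is treated as its value, so a state maps a key such as ("broke",) to "t" or "f"):

```python
import random

# A minimal sketch of a probabilistic rule schema (our assumption, not the
# paper's implementation). A ground state maps a predicate key, such as
# ("broke",) or ("on", "b1"), to the value in its final argument slot.
class Rule:
    def __init__(self, name, pre, outcomes):
        self.name = name          # action symbol, e.g. "fix()"
        self.pre = pre            # precondition: {key: required_value}
        self.outcomes = outcomes  # [(probability, {key: new_value}), ...]

    def applies(self, state):
        # The rule applies if every precondition literal holds in the state.
        return all(state.get(k) == v for k, v in self.pre.items())

    def sample_next(self, state, rng=random):
        # Draw one outcome by its probability, then apply it under the
        # static-frame assumption: unmentioned predicates keep their values.
        r, acc = rng.random(), 0.0
        for prob, effect in self.outcomes:
            acc += prob
            if r < acc:
                return {**state, **effect}
        return {**state, **self.outcomes[-1][1]}

# The ground fix() rule from Figure 1: a 0.97 chance of repairing the hand.
fix = Rule("fix()",
           pre={("broke",): "t"},
           outcomes=[(0.97, {("broke",): "f"}), (0.03, {("broke",): "t"})])

state = {("broke",): "t", ("hold",): "nil"}
assert fix.applies(state)
nxt = fix.sample_next(state)
assert nxt[("hold",)] == "nil"   # frame: hold(nil) carried over unchanged
```

The dictionary merge in sample_next realizes the static frame directly: keys absent from the sampled effect simply retain their previous values.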
It determines the quality of the early envelope policies and sets the stage for more elaborate policies later on.\n\nFor planning in traditional STRIPS domains, the Graphplan algorithm is known to be effective [1]. Graphplan finds the shortest straight-line plan by iteratively growing a forward-chaining structure called a plangraph and testing for the presence of goal conditions at each step. Blum and Langford [2] describe a probabilistic extension called TGraphplan (TGP) that works by returning a plan's probability of success rather than just a boolean flag. TGP can fairly quickly find straight-line plans from start to goal that satisfy a minimum probability. Given TGP's success in probabilistic STRIPS domains, a straightforward idea is to use the trajectory found by TGP to populate our initial envelope.\n\nNevertheless, this should give us pause: we have just said that our relational MDP describes a large underlying MDP. TGP and other Graphplan descendants work by grounding out the rules and chaining them forward to construct the plangraph. Large numbers of actions cause severe problems for Graphplan-based planners [11], since the branching factor quickly chokes the forward-chaining plangraph construction. 
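In sketch form, this forward-chaining construction might look as follows. This is a deliberately simplified, proposition-level illustration of probability propagation, not Blum and Langford's actual algorithm; rules here are plain (precondition, outcomes) pairs over predicate-value literals:

```python
# A simplified sketch of TGP-style forward chaining (our illustration).
# A ground rule is a (precondition, outcomes) pair; the precondition maps
# literal keys to values, and outcomes is a list of (prob, effect) pairs.
def grow_plangraph(init, ground_rules, goal, min_prob=0.5, max_depth=10):
    # Each layer tracks, per literal, an optimistic bound on the
    # probability of having achieved it by that depth.
    layer = {kv: 1.0 for kv in init.items()}
    for depth in range(max_depth + 1):
        if all(layer.get(g, 0.0) >= min_prob for g in goal.items()):
            return depth               # goal satisfiable within the bound
        nxt = dict(layer)
        for pre, outcomes in ground_rules:
            # Optimistic precondition probability: min over its literals.
            p_pre = min((layer.get(kv, 0.0) for kv in pre.items()),
                        default=1.0)
            if p_pre <= 0.0:
                continue               # rule not yet reachable
            for p_out, effect in outcomes:
                for kv in effect.items():
                    nxt[kv] = max(nxt.get(kv, 0.0), p_pre * p_out)
        if nxt == layer:
            return None                # fixpoint without reaching the goal
        layer = nxt
    return None

# The fix() rule: chaining it once achieves broke(f) with probability 0.97.
fix_rule = ({("broke",): "t"}, [(0.97, {("broke",): "f"}),
                                (0.03, {("broke",): "t"})])
depth = grow_plangraph({("broke",): "t"}, [fix_rule],
                       {("broke",): "f"}, min_prob=0.9)
assert depth == 1
```

Even in this toy form, the inner loop over all ground rules at every layer makes the cost of a large action set visible.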
So how do we cope?\n\nFigure 2: (a) Given this world configuration, the move action produces three types of effects. (b) 12 different groundings for the argument variables, but not all produce different groundings for the derived variables. (c) A plangraph fragment with a particular instance of move chained forward.\n\n2.3 Equivalence-class sampling: reducing the planning action space\n\nSTRIPS rules require every variable in the rule schema to appear in the argument list, so move(A, B) becomes move(A, B, H, H′, C). The meaning of the operator shifts from "move A onto B" to "move A at height H′ onto B at height H from C". Not only is this awkward, but specifying all the variables in the argument list yields an exponential number of ground actions as the number of domain objects grows. In contrast, the operators we defined above have argument lists containing only those variables that are free parameters. That is, when the operator move(A, B) takes two arguments, A and B, it means that the other variables (such as C, the block under A) are derivable from the relations in the rule schema. Guided by this observation, one can generalize among bindings that produce equivalent effects on the derivable properties.\n\nConsider executing the move(A, B) rule in the world configuration in Figure 2. This creates 12 fully-ground actions. However, examining the bindings reveals only three types of action effects: one group of actions moves a block from one block onto another; a second group moves a block from the table onto a block of height zero; and a third group moves a block off the table onto a block of height one. Except for the identities of the argument blocks A and B, the actions in each class produce equivalent groundings for the properties of the related domain objects. 
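This grouping can be sketched directly. The helper below is our reconstruction (names and binding tuples are illustrative, chosen to be consistent with a configuration like Figure 2(a): blocks 1, 4, and 5 on the table, block 3 on block 2); each ground binding of move(A, B) induces values for the derived variables H, H′, and C:

```python
import random

# A sketch of equivalence-class sampling (our reconstruction, not the
# paper's code). Each binding is (A, B, H, H', C); the class key uses only
# the derived values, ignoring the identities of A and B.
def sample_representatives(bindings, class_key, rng=random):
    classes = {}
    for b in bindings:
        classes.setdefault(class_key(b), []).append(b)
    # One sampled action stands in for every action in its class.
    return [rng.choice(group) for group in classes.values()]

# Illustrative groundings of move(A, B): blocks 1, 4, 5 start on the
# table (height 0); block 3 sits on block 2 (height 1).
bindings = [
    (1, 4, 0, 1, "table"), (1, 5, 0, 1, "table"), (4, 1, 0, 1, "table"),
    (4, 5, 0, 1, "table"), (5, 1, 0, 1, "table"), (5, 4, 0, 1, "table"),
    (1, 3, 1, 2, "table"), (4, 3, 1, 2, "table"), (5, 3, 1, 2, "table"),
    (3, 1, 0, 1, 2), (3, 4, 0, 1, 2), (3, 5, 0, 1, 2),
]

# Key on the heights and on whether the source C is the table or a block.
key = lambda b: (b[2], b[3], b[4] == "table")
reps = sample_representatives(bindings, key)
assert len(reps) == 3    # 12 ground actions collapse to 3 effect classes
```

The choice of class key is the crux: it must capture exactly the derived properties that the rule outcomes can touch.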
Rather than using all the actions, then, the plangraph can be constructed by chaining forward only a sampled action from each class. We call this equivalence-class sampling; the sampled action is representative of the effects of any action from that class. Sampling reduces the branching factor at each step in the plangraph, so significantly larger domains can be handled.\n\n3 From a Planning Problem to a Policy\n\nNow we describe the approach in detail. We define a planning problem as containing:\n\nRules: These are the relational operators that describe the action effects. In our system, they are designed by hand and the probabilities are specified by the programmer.\n\nInitial World State: The set of ground predicates that describes the starting state. REBP does not make the closed-world assumption, so all predicates and objects required in the planning task must appear in the initial state.\n\nGoal Condition: A conjunction of relational predicates. The goal may contain variables; it does not need to be fully ground.\n\nFigure 3: An initial envelope corresponding to the plangraph segment of Figure 2(c), followed by fringe sampling and envelope expansion.\n\nRewards: A list of conjunctions mapping matching states to a scalar reward value. If a state in the current MDP does not match a reward condition, the default value is 0. Additionally, there must be a penalty associated with falling out of the envelope. 
This penalty is an estimate of the cost of having to recover from falling out (such as having to replan back to the envelope).\n\nGiven a planning problem, there are three main components to REBP: finding an initial plan, converting the plan into an MDP, and envelope manipulation. A running example to illustrate the approach will be the tiny task of making a two-block stack in a domain with two blocks. Figure 3 illustrates output produced by a run of the algorithm.\n\n3.1 Finding an initial plan\n\nThe process for making the initial trajectory essentially follows the TGP algorithm described by Blum and Langford [2]. The TGP algorithm starts with the initial world state as the first layer in the graph, a minimum probability cutoff for the plan, and a maximum plan depth. We use the equivalence-class sampling technique discussed above to prune actions from the plangraph. Figure 2(c) shows one step of a plangraph construction.\n\n3.2 Turning the initial plan into an MDP\n\nThe TGP algorithm produces a sequence of actions. The next step is to turn the sequence of action effects into a well-defined envelope MDP; that is, we must compute the set of states and the transitions. Usually, the sequence of action effects alone leaves many state predicates unspecified. Currently, we assume a static frame, which implies that the value of a predicate remains the same unless it is known to have explicitly changed.\n\nThe set of RMDP states is computed iteratively: first, the envelope is initialized with the initial world state; then, the next state in the envelope is found by applying the plan action to the previous state and "filling in" any missing predicates with their previous values; when the state containing the goal condition is reached, the set of states is complete. 
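The iteration just described can be sketched as follows (our illustration; the state and effect encodings are assumptions). A dictionary update realizes the "filling in" step, since keys absent from an action's effect keep their previous values:

```python
# A sketch of rolling an action-effect sequence out into envelope states
# (our illustration, not the paper's code). States and effects map predicate
# keys, e.g. ("on", "b1"), to the value in the predicate's final argument.
def roll_out_envelope(init_state, effect_sequence):
    states = [dict(init_state)]
    for effect in effect_sequence:
        # The static frame: predicates missing from the effect are filled
        # in with their values from the previous state.
        states.append({**states[-1], **effect})
    return states

# Two-block example: the intended (0.7) outcome of move(b1, b2).
init = {("on", "b1"): "table", ("clear", "b1"): "t", ("height", "b1"): 0,
        ("on", "b2"): "table", ("clear", "b2"): "t", ("height", "b2"): 0,
        ("hold",): "nil", ("broke",): "f"}
effect = {("on", "b1"): "b2", ("height", "b1"): 1, ("clear", "b2"): "f"}

envelope = roll_out_envelope(init, [effect])
assert envelope[1][("on", "b1")] == "b2"      # changed by the effect
assert envelope[1][("broke",)] == "f"         # carried over by the frame
```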
To compute the set of actions, REBP loops through the list of operators and accumulates all the ground actions whose preconditions bind to any state in the envelope. Transitions that initiate in an envelope state but do not land in an envelope state are redirected to OUT. The leftmost MDP in Figure 3 shows the initial envelope corresponding to the one-step plan of Figure 2(c).\n\n3.3 Envelope expansion\n\nEnvelope expansion, or deliberation, involves adding to the subset of world states under consideration. The decision of when and how long to deliberate must compare the expected utility of further thinking against the cost of doing so. Dean et al. [4] discuss this complex issue in depth. As a first step, we considered the simple precursor deliberation model, in which deliberation occurs for some number r of rounds and is completed before execution takes place.\n\nA round of deliberation involves sampling from the current policy to estimate which fringe states (states one step outside of the envelope) are likely. In each round, REBP draws d · M samples (drawing an exploratory action with probability ε) and keeps counts of which fringe states are reached. The f · M most likely fringe states are added to the envelope, where M is the number of states in the current envelope and d and f are scalars. After expansion, we recompute the set of actions and compute a new policy.\n\nFigure 3 shows a sequence of fringe sampling and envelope expansion. We see the incorporation of the fringe state in which the hand breaks as a result of move. With the new envelope, the policy is re-computed to include the fix action. 
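One deliberation round might be sketched like this (our illustration with hypothetical helper names; the step function, policy table, and restart scheme are assumptions, since the paper does not give this level of detail):

```python
import random
from collections import Counter

# A sketch of one round of fringe sampling (our reconstruction). Simulate
# d*M one-step samples under the current policy with epsilon-greedy
# exploration, count the fringe states reached, and admit the f*M most
# frequent ones into the envelope.
def expand_envelope(envelope, policy, actions, step, d, f, eps, rng):
    M = len(envelope)
    counts = Counter()
    for _ in range(int(d * M)):
        state = rng.choice(envelope)       # sample a starting envelope state
        act = rng.choice(actions) if rng.random() < eps else policy[state]
        nxt = step(state, act, rng)        # sample one stochastic outcome
        if nxt not in envelope:
            counts[nxt] += 1               # a fringe state: one step outside
    top = [s for s, _ in counts.most_common(max(1, int(f * M)))]
    return envelope + top                  # caller then recomputes the policy

# Toy dynamics: move from s0 succeeds (s1) with probability 0.7; otherwise
# the hand breaks and we land in the fringe state s_broke.
def step(state, act, rng):
    if state == "s0" and act == "move":
        return "s1" if rng.random() < 0.7 else "s_broke"
    return state

grown = expand_envelope(["s0", "s1"], {"s0": "move", "s1": "move"},
                        ["move"], step, d=10, f=0.5, eps=0.2,
                        rng=random.Random(0))
assert set(["s0", "s1"]).issubset(grown)
assert len(grown) <= 3                     # at most max(1, f*M) new states
```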
This is a conditional plan that a straight-line planner could not find.\n\n4 Experimental Domain\n\nTo illustrate the behavior of REBP, we show preliminary results in a stochastic blocks world. While simple, blocks world is a reasonably interesting first domain because, with enough blocks, it exposes the weaknesses of purely propositional approaches. Its regular dynamics, on the other hand, lend themselves to relational descriptions. This domain demonstrates the type of scaling that can be achieved with the REBP approach.\n\nThe task at hand is to build a stack containing all the blocks on the table. In this domain, blocks are stacked on one another, with the top block in a stack being clear. Each block has a color and is at some height in the stack. There is a gripper that may or may not be broken. The pickup(A) action is deterministic and puts a clear block into the empty hand; a block in the hand is no longer clear, and its height and on-ness are no longer defined. The fix() action takes a broken hand and fixes it with some probability. The stackon() action comes in two flavors: first, stackon(B) takes a block from the hand and puts it on block B, though with a small probability the block is dropped onto the table instead; second, stackon(table) always puts the block from the hand onto the table. The move(A, B) and stackon(B) actions also have some chance of breaking the hand. If the hand is broken, it must be fixed before any further actions can apply. 
The domain is formalized as follows:3\n\nP : on(Block, Block), clear(Block, TorF), color(Block, Color), height(Block, Num), hold(Block), clear(table, TorF), broke(TorF).\nZ, T : The rules are shown in Figure 1.\nO : A set of n differently colored (red, green, blue) blocks.\nR(s) : If ∃A height(A, n − 1), then 1; if broke(t), then −2; if OUT, then −1.\n\n3The predicates behave like functions in the sense that the nth argument represents the value of the relation for the first n − 1 arguments. Thus, we say clear(block5, f) instead of ¬clear(block5).\n\n5 Empirical Results\n\nWe compared the quality of the policies generated by the following algorithms: REBP; envelope expansion starting from an empty initial plan (i.e., the initial envelope containing only the initial world state); and policy iteration on the fully ground MDP.4\n\n4Starting with the initial state, the set of states is generated by exhaustively applying our operators until no more new states are found; this yields the true set of reachable states.\n\nIn all cases, the policy was computed by simple policy iteration with a discount of 0.9 and a stopping threshold of 0.1. In the case of REBP, the number of deliberation rounds r was 10, d was 10, f was 0.3, and ε was 0.2. In the case of the deliberation-only envelope, r was increased to 35. The runs were averaged over at least 7 trials in each case.\n\nWe show numerical results for domains with 5 and 6 blocks. The size of the full MDP in each case is, respectively, 768 and 5,228 states, with 351 and 733 ground actions. A domain of 7 blocks results in an MDP of over 37,000 states with 1,191 actions, a combined state and action space too overwhelming for the full MDP solution. The REBP agent, on the other hand, is able to find plans for making stacks in domains of more than 12 blocks, which corresponds to an MDP of about 88,000 states and 3,000 ground actions.\n\nFigure 4: Results for the block-stacking tasks. The top plots show policy value against computation time for REBP and the full MDP. The bottom plots show policy value against number of states for REBP and deliberation only (empty initial plan).\n\nThe plots in Figure 4 show intuitive results. The top row shows the value of the policy against execution time (as measured by a monitoring package), showing that the REBP algorithm produces good-quality plans quickly. For REBP, we start measuring the value of the policy at the point when initial trajectory finding ends and deliberation begins; for the full MDP solution, we measure the value of the policy at the end of each round of policy iteration. The full MDP takes a long time to find a policy, but eventually converges. Without the equivalence-class sampling, plangraph construction takes on the order of a couple of hours; with it, it takes a couple of minutes. The bottom row shows the value of the policy against the number of states in the envelope so far, and shows that a good initial envelope is key to behaving well with fewer states.\n\n6 Discussion and Conclusions\n\nUsing the relational envelope method, we can take real advantage of relational generalization to produce good initial plans efficiently, and use envelope-growing techniques to improve the robustness of our plans incrementally as time permits. REBP is a planning system that tries to dynamically reformulate an apparently intractable problem into a small, easily handled problem at run time.\n\nHowever, there is plenty remaining to be done. The first thing needed is a more rigorous analysis of the equivalence-class sampling. 
Currently, the action sampling is a purely local decision made at each step of the plangraph. This works in the current setup because object identities do not matter and properties not mentioned in the operator outcomes are never part of the goal condition. If, on the other hand, the goal were to make a stack of height n − 1 with a green block on top, it could be problematic to construct the plangraph without considering block color in the sampled actions. We are currently investigating what conditions are necessary for making general guarantees about the sampling approach.\n\nFurthermore, the current envelope-extension method is relatively undirected; it might be possible to diagnose more effectively which fringe states would be most profitable to add. In addition, techniques such as those used by Dean et al. [4] could be employed to decide when to stop envelope growth, and to manage the eventual interleaving of envelope growth and execution. Currently the states in the envelope are essentially atomic; it ought to be possible to exploit the factored nature of relational representations to allow abstraction in the MDP model, with aggregate "states" in the MDP actually representing sets of states in the underlying world.\n\nIn summary, the REBP method provides a way to restrict attention to a small, useful subset of a large MDP space. It produces an initial plan quickly by taking advantage of generalization among action effects, and as a result behaves smarter in a large space much sooner than it could by waiting for a full solution.\n\nAcknowledgements\n\nThis work was supported by an NSF Graduate Research Fellowship, by the Office of Naval Research contract #N00014-00-1-0298, and by NASA award #NCC2-1237.\n\nReferences\n\n[1] Avrim L. Blum and Merrick L. Furst. Fast planning through planning graph analysis. Artificial Intelligence, 90:281–300, 1997.\n\n[2] Avrim L. Blum and John C. Langford. 
Probabilistic planning in the graphplan framework. In 5th European Conference on Planning, 1999.\n\n[3] Craig Boutilier, Raymond Reiter, and Bob Price. Symbolic dynamic programming for first-order MDPs. In IJCAI, 2001.\n\n[4] Thomas Dean, Leslie Pack Kaelbling, Jak Kirman, and Ann Nicholson. Planning under time constraints in stochastic domains. Artificial Intelligence, 76, 1995.\n\n[5] Kurt Driessens, Jan Ramon, and Hendrik Blockeel. Speeding up relational reinforcement learning through the use of an incremental first order decision tree learner. In European Conference on Machine Learning, 2001.\n\n[6] B. Cenk Gazen and Craig A. Knoblock. Combining the expressivity of UCPOP with the efficiency of Graphplan. In Proc. European Conference on Planning (ECP-97), 1997.\n\n[7] H. Geffner and B. Bonet. High-level planning and control with incomplete information using POMDPs. In Fall AAAI Symposium on Cognitive Robotics, 1998.\n\n[8] C. Guestrin, D. Koller, C. Gearhart, and N. Kanodia. Generalizing plans to new environments in relational MDPs. In International Joint Conference on Artificial Intelligence, 2003.\n\n[9] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: Stochastic planning using decision diagrams. In Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999.\n\n[10] J. Koehler, B. Nebel, J. Hoffmann, and Y. Dimopoulos. Extending planning graphs to an ADL subset. In Proc. European Conference on Planning (ECP-97), 1997.\n\n[11] B. Nebel, J. Koehler, and Y. Dimopoulos. Ignoring irrelevant facts and operators in plan generation. In Proc. European Conference on Planning (ECP-97), 1997.\n\n[12] Daniel S. Weld. Recent advances in AI planning. AI Magazine, 20(2):93–123, 1999.\n\n[13] Daniel S. Weld, Corin R. Anderson, and David E. Smith. Extending Graphplan to handle uncertainty and sensing actions. 
In Proceedings of AAAI ’98, 1998.\n\n[14] SungWook Yoon, Alan Fern, and Robert Givan. Inductive policy selection for first-order MDPs. In 18th International Conference on Uncertainty in Artificial Intelligence, 2002.\n", "award": [], "sourceid": 2424, "authors": [{"given_name": "Natalia", "family_name": "Gardiol", "institution": null}, {"given_name": "Leslie", "family_name": "Kaelbling", "institution": null}]}