{"title": "Multi-time Models for Temporally Abstract Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 1050, "page_last": 1056, "abstract": null, "full_text": "Multi-time Models for Temporally Abstract Planning\n\nDoina Precup, Richard S. Sutton\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\n{dprecup,rich}@cs.umass.edu\n\nAbstract\n\nPlanning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current model-based reinforcement learning is based on one-step models that cannot represent common-sense higher-level actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models [Sutton, 1995] and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the multi-time model, and establish its suitability for planning and learning by virtue of its relationship to the Bellman equations. This paper summarizes the theoretical framework of multi-time models and illustrates their potential advantages in a grid world planning task.\n\nThe need for hierarchical and abstract planning is a fundamental problem in AI (see, e.g., Sacerdoti, 1977; Laird et al., 1986; Korf, 1985; Kaelbling, 1993; Dayan & Hinton, 1993). Model-based reinforcement learning offers a possible solution to the problem of integrating planning with real-time learning and decision-making (Peng & Williams, 1993; Moore & Atkeson, 1993; Sutton & Barto, 1998). However, current model-based reinforcement learning is based on one-step models that cannot represent common-sense, higher-level actions.
Modeling such actions requires the ability to handle different, interrelated levels of temporal abstraction.\n\nA new approach to modeling at multiple time scales was introduced by Sutton (1995) based on prior work by Singh (1992), Dayan (1993), and Sutton and Pinette (1985). This approach enables models of the environment at different temporal scales to be intermixed, producing temporally abstract models. However, that work was concerned only with predicting the environment. This paper summarizes an extension of the approach to include actions and control of the environment [Precup & Sutton, 1997]. In particular, we generalize the usual notion of a primitive, one-step action to an abstract action, an arbitrary, closed-loop policy. Whereas prior work modeled the behavior of the agent-environment system under a single, given policy, here we learn different models for a set of different policies. For each possible way of behaving, the agent learns a separate model of what will happen. Then, in planning, it can choose between these overall policies as well as between primitive actions.\n\nTo illustrate the kind of advance we are trying to make, consider the example shown in Figure 1. This is a standard grid world in which the primitive actions are to move from one grid cell to a neighboring cell. Imagine the learning agent is repeatedly given new tasks in the form of new goal locations to travel to as rapidly as possible. If the agent plans at the level of primitive actions, then its plans will be many actions long and take a relatively long time to compute. Planning could be much faster if abstract actions could be used to plan for moving from room to room rather than from cell to cell. For each room, the agent learns two models for two abstract actions, one for traveling efficiently to each adjacent room.
We do not address in this paper the question of how such abstract actions could be discovered without help; instead we focus on the mathematical theory of abstract actions. In particular, we define a very general semantics for them, a property that seems to be required in order for them to be used in the general kind of planning typically used with Markov decision processes. At the end of this paper we illustrate the theory in this example problem, showing how room-to-room abstract actions can substantially speed planning.\n\nFigure 1: Example Task. The natural abstract actions are to move from room to room. There are 4 unreliable primitive actions (up, down, left, and right, which fail 33% of the time) and 8 abstract actions (to each room's 2 hallways).\n\n1 Reinforcement Learning (MDP) Framework\n\nIn reinforcement learning, a learning agent interacts with an environment at some discrete, lowest-level time scale t = 0, 1, 2, ... On each time step, the agent perceives the state of the environment, s_t, and on that basis chooses a primitive action, a_t. In response to each primitive action, a_t, the environment produces one step later a numerical reward, r_{t+1}, and a next state, s_{t+1}. The agent's objective is to learn a policy, a mapping from states to probabilities of taking each action, that maximizes the expected discounted future reward from each state s:\n\nv^π(s) = E_π{ Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s },\n\nwhere γ ∈ [0, 1) is a discount-rate parameter, and E_π{} denotes an expectation implicitly conditional on the policy π being followed. The quantity v^π(s) is called the value of state s under policy π, and v^π is called the value function for policy π. The value under the optimal policy is denoted:\n\nv^*(s) = max_π v^π(s).\n\nPlanning in reinforcement learning refers to the use of models of the effects of actions to compute value functions, particularly v^*.\n\nWe assume that the states are discrete and form a finite set, s_t ∈ {1, 2, ..., m}. This is viewed as a temporary theoretical convenience; it is not a limitation of the ideas we present. This assumption allows us to alternatively denote the value functions, v^π and v^*, as column vectors, each having m components that contain the values of the m states. In general, for any m-vector, x, we will use the notation x(s) to refer to its sth component.\n\nThe model of an action, a, whether primitive or abstract, has two components. One is an m x m matrix, P_a, predicting the state that will result from executing the action in each state. The other is a vector, g_a, predicting the cumulative reward that will be received along the way. In the case of a primitive action, P_a is the matrix of 1-step transition probabilities of the environment, times γ:\n\nP_a(s) = γ E{ s_{t+1} | s_t = s, a_t = a },  ∀s,\n\nwhere P_a(s) denotes the sth column of P_a (these are the predictions corresponding to state s) and s_{t+1} denotes the unit basis m-vector corresponding to the state at time t+1. The reward prediction, g_a, for a primitive action contains the expected immediate rewards:\n\ng_a(s) = E{ r_{t+1} | s_t = s, a_t = a },  ∀s.\n\nFor any stochastic policy, π, we can similarly define its 1-step model, g_π, P_π, as:\n\ng_π(s) = Σ_a π(s, a) g_a(s)  and  P_π(s) = Σ_a π(s, a) P_a(s),  ∀s,  (1)\n\nwhere π(s, a) denotes the probability of taking action a in state s.\n\n2 Suitability for Planning\n\nIn conventional planning, one-step models are used to compute value functions via the Bellman equations for prediction and control. In vector notation, the prediction and control Bellman equations are\n\nv^π = g_π + P_π v^π  and  v^* = max_a { g_a + P_a v^* },  (2)\n\nrespectively, where the max function is applied component-wise in the control equation. In planning, these equalities are turned into updates, e.g., v_{k+1} ← g_π + P_π v_k, which converge to the value functions. Thus, the Bellman equations are usually used to define and compute value functions given models of actions.
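As a minimal concrete sketch of turning the Bellman equalities into updates (the 3-state chain, its two actions, and its rewards below are invented for illustration and are not from the paper), the control update v_{k+1} ← max_a { g_a + P_a v_k } can be run directly on one-step models, with the discount factor folded into P_a as in the convention above:

```python
import numpy as np

gamma = 0.9
# Transition matrices for two hypothetical actions over states {0, 1, 2};
# rows index the current state, columns the next state.
T = {
    "left":  np.array([[1.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]),
    "right": np.array([[0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [0.0, 0.0, 1.0]]),
}
P = {a: gamma * T[a] for a in T}           # P_a = gamma * (1-step probabilities)
g = {"left": np.zeros(3),                   # expected immediate rewards g_a(s)
     "right": np.array([0.0, 1.0, 0.0])}    # reward 1 for stepping from state 1 to 2

v = np.zeros(3)
for _ in range(200):                        # v_{k+1} <- max_a { g_a + P_a v_k }
    v = np.max([g[a] + P[a] @ v for a in P], axis=0)
```

Because each P_a has spectral radius at most γ = 0.9, the update is a contraction and v converges to the optimal value function of this toy chain.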
Following Sutton (1995), here we reverse the roles: we take the value functions as given and use the Bellman equations to define and compute models of new, abstract actions.\n\nIn particular, a model can be used in planning only if it is stable and consistent with the Bellman equations. It is useful to define special terms for consistency with each Bellman equation. Let g, P denote an arbitrary model (an m-vector and an m x m matrix). Then this model is said to be valid for policy π [Sutton, 1995] if and only if lim_{k→∞} P^k = 0 and\n\nv^π = g + P v^π.  (3)\n\nAny valid model can be used to compute v^π via the iteration algorithm v_{k+1} ← g + P v_k. This is a direct sense in which the validity of a model implies that it is suitable for planning. We introduce here a parallel definition that expresses consistency with the control Bellman equation. The model g, P is said to be non-overpromising (NOP) if and only if P has only positive elements, lim_{k→∞} P^k = 0, and\n\nv^* ≥ g + P v^*,  (4)\n\nwhere the ≥ relation holds component-wise. If a NOP model is added inside the max operator in the control Bellman equation (2), this condition ensures that the true value, v^*, will not be exceeded for any state. Thus, any model that does not promise more than is achievable (is not overpromising) can serve as an option for planning purposes. The one-step models of primitive actions are obviously NOP, due to (2). It is similarly straightforward to show that the one-step model of any policy is also NOP.\n\nFor some purposes, it is more convenient to write a model g, P as a single (m+1) x (m+1) matrix whose first row is (1, 0, ..., 0), whose first column is (1, g)^T, and whose lower-right m x m block is P:\n\nM = [ 1  0 ; g  P ].\n\nWe say that the model M has been put in homogeneous coordinates. The vectors corresponding to the value functions can also be put into homogeneous coordinates, by adding an initial element that is always 1.
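The homogeneous-coordinate form can be sketched in a few lines (a minimal illustration using an arbitrary random model, not the paper's code): applying M to a value vector in homogeneous coordinates performs exactly the backup g + P v while carrying the leading 1 along.

```python
import numpy as np

m = 3
rng = np.random.default_rng(0)
P = 0.9 * rng.dirichlet(np.ones(m), size=m)  # gamma times a stochastic matrix
g = rng.random(m)                            # an arbitrary reward prediction

# Homogeneous coordinates: first row (1, 0), first column (1, g), block P.
M = np.zeros((m + 1, m + 1))
M[0, 0] = 1.0
M[1:, 0] = g
M[1:, 1:] = P

v = rng.random(m)
hv = np.concatenate(([1.0], v))              # value vector with leading 1
backup = M @ hv                              # equals (1, g + P v)
assert np.allclose(backup[1:], g + P @ v)
```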
\n\nUsing this notation, new models can be combined using two basic operations: composition and averaging. Two models M_1 and M_2 can be composed by matrix multiplication, yielding a new model M = M_1 M_2. A set of models M_i can be averaged, weighted by a set of diagonal matrices D_i, such that Σ_i D_i = I, to yield a new model M = Σ_i D_i M_i. Sutton (1995) showed that the set of models that are valid for a policy π is closed under composition and averaging. This enables models acting at different time scales to be mixed together, and the resulting model can still be used to compute v^π. We have proven that the set of NOP models is also closed under composition and averaging [Precup & Sutton, 1997]. These operations permit a richer variety of combinations for NOP models than they do for valid models, because the NOP models that are combined need not correspond to a particular policy.\n\n3 Multi-time models\n\nThe validity and NOP-ness of a model do not imply each other [Precup & Sutton, 1997]. Nevertheless, we believe a good model should be both valid and NOP. We would like to describe a class of models that, in some sense, includes all the \"interesting\" models that are valid and non-overpromising, and which is expressive enough to include common-sense notions of abstract action. These goals have led us to the notion of a multi-time model.\n\nThe simplest example of a multi-step model, called the n-step model for policy π, predicts the n-step truncated return and the state n steps into the future (times γ^n). If different n-step models of the same policy are averaged, the result is called a mixture model. Mixtures are valid and non-overpromising due to the closure properties established in the previous section. One kind of mixture suggested in [Sutton, 1995] allows an exponential decay of the weights over time, controlled by a parameter β.
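The composition and averaging operations defined above can be checked numerically with a short sketch (the two random models are illustrative, with the discount folded into P as before; not the paper's code). Composing homogeneous models by matrix multiplication corresponds to executing one model and then the other, and averaging with diagonal weight matrices mixes models state by state:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4

def random_model():
    """An arbitrary illustrative model (g, P) in homogeneous coordinates."""
    P = 0.9 * rng.dirichlet(np.ones(m), size=m)
    g = rng.random(m)
    M = np.zeros((m + 1, m + 1))
    M[0, 0] = 1.0
    M[1:, 0] = g
    M[1:, 1:] = P
    return g, P, M

g1, P1, M1 = random_model()
g2, P2, M2 = random_model()

# Composition: model 1 followed by model 2 is the matrix product M1 @ M2.
Mc = M1 @ M2
assert np.allclose(Mc[1:, 0], g1 + P1 @ g2)   # reward of the composed model
assert np.allclose(Mc[1:, 1:], P1 @ P2)       # state prediction of the composition

# Averaging: diagonal weight matrices with D1 + D2 = I, applied state by state.
d = rng.random(m)
D1 = np.diag(np.concatenate(([0.5], d)))
D2 = np.eye(m + 1) - D1
Ma = D1 @ M1 + D2 @ M2
assert np.allclose(Ma[1:, 0], d * g1 + (1 - d) * g2)
```

Both results retain the homogeneous form (first row (1, 0)), so the combined models can again be applied to value vectors or combined further.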
\n\nFigure 2: Two hypothetical Markov environments\n\nAre mixture models expressive enough for capturing the properties of the environment? In order to get some intuition about the expressive power that a model should have, let us consider the example in Figure 2. If we are interested only in whether state G is attained, then the two environments presented should be characterized by significantly different models. However, n-step models, or any linear mixture of n-step models, cannot achieve this goal. In order to remedy this problem, models should average differently over all the different trajectories that are possible through the state space. A full β-model [Sutton, 1995] can distinguish between these two situations. A β-model is a more general form of mixture model, in which a different β parameter is associated with each state. For a state i, β_i can be viewed as the probability that the trajectory through the state space ends in state i. Although β-models seem to have more expressive power, they cannot describe n-step models. We would like to have a more general form of model that unifies both classes. This goal is achieved by accurate multi-time models.\n\nMulti-time models are defined with respect to a policy. Just as the one-step model for a policy is defined by (1), we define g, P to be an accurate multi-time model if and only if\n\nP(s) = E_π{ Σ_{t=1}^∞ w_t γ^t s_t | s_0 = s },\n\ng(s) = E_π{ Σ_{t=1}^∞ w_t (r_1 + γ r_2 + ... + γ^{t-1} r_t) | s_0 = s },\n\nfor some π, for all s, and for some sequence of random weights, w_1, w_2, ..., such that w_t ≥ 0 and Σ_{t=1}^∞ w_t = 1. The weights are random variables chosen according to a distribution that depends only on the states visited at or before time t. The weight w_t is a measure of the importance given to the t-th state of the trajectory. In particular, if w_t = 0, then state t has no weight associated with it.
If w_t = 1 - Σ_{i=1}^{t-1} w_i, all the remaining weight along the trajectory is given to state t. The effect is that state s_t is the \"outcome\" state for the trajectory.\n\nThe random weights along each trajectory make this a very general form of model. The only necessary constraint is that the weights depend only on previously visited states. In particular, we can choose weighting sequences that generate the types of multi-step models described in [Sutton, 1995]. If the weighting variables are such that w_n = 1 and w_t = 0 for all t ≠ n, we obtain n-step models. A weighting sequence of the form w_t = β_t ∏_{i=0}^{t-1} (1 - β_i), ∀t, where β_i is the parameter associated with the state visited on time step i, describes a full β-model.\n\nThe main result for multi-time models is that they satisfy the two criteria defined in the previous section. Any accurate multi-time model is also NOP and valid for π. The proofs of these results are too long to include here.\n\n4 Illustrative Example\n\nIn order to illustrate the way in which multi-time models can be used in practice, let us return to the grid world example (Figure 1). The cells of the grid correspond to the states of the environment. From any state the agent can perform one of four primitive actions: up, down, left, or right. With probability 2/3, the actions cause the agent to move one cell in the corresponding direction (unless this would take the agent into a wall, in which case it stays in the same state). With probability 1/3, the agent instead moves in one of the other three directions (unless this takes it into a wall, of course). There is no penalty for bumping into walls.\n\nIn each room, we also defined two abstract actions, for going to each of the adjacent hallways.
Each abstract action has a set of input states (the states in the room) and two outcome states: the target hallway, which corresponds to a successful outcome, and the state adjacent to the other hallway, which corresponds to failure (the agent has wandered out of the room). Each abstract action is given by its complete model g_π, P_π, where π is the optimal policy for getting into the target hallway, and the weighting variables w along any trajectory have the value 1 for the outcome states and 0 everywhere else.\n\nFigure 3: Value iteration using primitive and abstract actions (iterations #1 through #6).\n\nThe goal state can have an arbitrary position in any of the rooms, but for this illustration let us suppose that the goal is two steps down from the right hallway. The value of the goal state is 1, there are no rewards along the way, and the discounting factor is γ = 0.9. We performed planning according to the standard value iteration method:\n\nv_{k+1} ← max_a { g_a + P_a v_k },\n\nwhere v_0(s) = 0 for all the states except the goal state (which starts at 1). In one experiment, a ranged only over the primitive actions; in the other, it ranged over the set including both the primitive and the abstract actions.\n\nWhen using only primitive actions, the values are propagated one step away on each iteration. After six iterations, for instance, only the states that are at most six steps away from the goal will be attributed non-zero values. The models of abstract actions produce a significant speed-up in the propagation of values at each step.
Figure 3 shows the value function after each iteration, using both primitive and abstract actions for planning. The area of the circle drawn in each state is proportional to the value attributed to the state. The first three iterations are identical to the case in which only primitive actions are used. However, once the values are propagated to the first hallway, all the states in the rooms adjacent to that hallway receive values as well. For the states in the room containing the goal, these values correspond to performing the abstract action of getting into the right hallway, and then following the optimal primitive actions to get to the goal. At this point, a path to the goal is known from each state in the right half of the environment, even if the path is not optimal for all states. After six iterations, an optimal policy is known for all the states in the environment.\n\nThe models of the abstract actions do not need to be given a priori; they can be learned from experience. In fact, the abstract models that were used in this experiment were learned during a 1,000,000-step random walk in the environment. The starting point for learning was represented by the outcome states of each abstract action, along with the hypothetical utilities U associated with these states. We used Q-learning [Watkins, 1989] to learn the optimal state-action value function Q_{U,β} associated with each abstract action. The greedy policy with respect to Q_{U,β} is the policy associated with the abstract action. At the same time, we used the β-model learning algorithm presented in [Sutton, 1995] to compute the model corresponding to the policy. The learning algorithm is completely online and incremental, and its complexity is comparable to that of regular 1-step TD learning.
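The speed-up described above can be reproduced in miniature. The following sketch is not the paper's grid world: it uses a 10-state corridor with a deterministic `right` action (an assumption made for brevity), and composes nine one-step models into a single multi-step "go to the end" model. Planning with the extra model delivers a nonzero value to the far end of the corridor in one sweep instead of nine:

```python
import numpy as np

gamma = 0.9
n = 10                                  # states 0..9, goal at state 9
T = np.zeros((n, n))
for s in range(n - 1):
    T[s, s + 1] = 1.0                   # deterministic 'right' (an assumption)
T[n - 1, n - 1] = 1.0                   # the goal state is absorbing

P_prim = gamma * T                      # one-step model, discount folded in
g_prim = np.zeros(n)
g_prim[n - 2] = 1.0                     # reward 1 on the step into the goal

# Abstract 'go to the end' model: the composition of nine one-step models,
# i.e. the 9-step model of the go-right policy.
P_abs = np.linalg.matrix_power(P_prim, n - 1)
g_abs = sum(np.linalg.matrix_power(P_prim, k) @ g_prim for k in range(n - 1))

def iterations_until_state0_valued(models):
    """Value-iteration sweeps until state 0 receives a nonzero value."""
    v = np.zeros(n)
    for i in range(1, 100):
        v = np.max([g + P @ v for (g, P) in models], axis=0)
        if v[0] > 0.0:
            return i
    return None

prim_only = iterations_until_state0_valued([(g_prim, P_prim)])
with_abs = iterations_until_state0_valued([(g_prim, P_prim), (g_abs, P_abs)])
```

Here `prim_only` comes out to 9 sweeps while `with_abs` comes out to 1, mirroring the jump in Figure 3 once the values reach a hallway.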
\n\nModels of abstract actions can be built while an agent is acting in the environment, without any additional effort. Such models can then be used in the planning process as if they represented primitive actions, ensuring more efficient learning and planning, especially if the goal is changing over time.\n\nAcknowledgments\n\nThe authors thank Amy McGovern and Andy Fagg for helpful discussions and comments contributing to this paper. This research was supported in part by NSF grant ECS-9511805 to Andrew G. Barto and Richard S. Sutton, and by AFOSR grant AFOSR-F49620-96-1-0254 to Andrew G. Barto and Richard S. Sutton. Doina Precup also acknowledges the support of the Fulbright foundation.\n\nReferences\n\nDayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5, 613-624.\nDayan, P. & Hinton, G. E. (1993). Feudal reinforcement learning. In Advances in Neural Information Processing Systems, volume 5, (pp. 271-278). San Mateo, CA: Morgan Kaufmann.\nKaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning ICML'93, (pp. 167-173). San Mateo, CA: Morgan Kaufmann.\nKorf, R. E. (1985). Learning to Solve Problems by Searching for Macro-Operators. London: Pitman Publishing Ltd.\nLaird, J. E., Rosenbloom, P. S., & Newell, A. (1986). Chunking in SOAR: The anatomy of a general learning mechanism. Machine Learning, 1, 11-46.\nMoore, A. W. & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 103-130.\nPeng, J. & Williams, J. (1993). Efficient learning and planning within the Dyna framework. Adaptive Behavior, 4, 323-334.\nPrecup, D. & Sutton, R. S. (1997). Multi-time models for reinforcement learning. In ICML'97 Workshop: The Role of Models in Reinforcement Learning.
\nSacerdoti, E. D. (1977). A Structure for Plans and Behavior. North-Holland, NY: Elsevier.\nSingh, S. P. (1992). Scaling reinforcement learning by learning variable temporal resolution models. In Proceedings of the Ninth International Conference on Machine Learning ICML'92, (pp. 202-207). San Mateo, CA: Morgan Kaufmann.\nSutton, R. S. (1995). TD models: Modeling the world as a mixture of time scales. In Proceedings of the Twelfth International Conference on Machine Learning ICML'95, (pp. 531-539). San Mateo, CA: Morgan Kaufmann.\nSutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.\nSutton, R. S. & Pinette, B. (1985). The learning of world models by connectionist networks. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society, (pp. 54-64).\nWatkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University.\n", "award": [], "sourceid": 1362, "authors": [{"given_name": "Doina", "family_name": "Precup", "institution": null}, {"given_name": "Richard", "family_name": "Sutton", "institution": null}]}