{"title": "Reinforcement Learning with Hierarchies of Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 1043, "page_last": 1049, "abstract": null, "full_text": "Reinforcement Learning with Hierarchies of Machines*\n\nRonald Parr and Stuart Russell\n\nComputer Science Division, UC Berkeley, CA 94720\n\n{parr,russell}@cs.berkeley.edu\n\nAbstract\n\nWe present a new approach to reinforcement learning in which the policies considered by the learning process are constrained by hierarchies of partially specified machines. This allows for the use of prior knowledge to reduce the search space and provides a framework in which knowledge can be transferred across problems and in which component solutions can be recombined to solve larger and more complicated problems. Our approach can be seen as providing a link between reinforcement learning and \"behavior-based\" or \"teleo-reactive\" approaches to control. We present provably convergent algorithms for problem-solving and learning with hierarchical machines and demonstrate their effectiveness on a problem with several thousand states.\n\n1 Introduction\n\nOptimal decision making in virtually all spheres of human activity is rendered intractable by the complexity of the task environment. Generally speaking, the only way around intractability has been to provide a hierarchical organization for complex activities. Although it can yield suboptimal policies, top-down hierarchical control often reduces the complexity of decision making from exponential to linear in the size of the problem. For example, hierarchical task network (HTN) planners can generate solutions containing tens of thousands of steps [5], whereas \"flat\" planners can manage only tens of steps. 
\nHTN planners are successful because they use a plan library that describes the decomposition of high-level activities into lower-level activities. This paper describes an approach to learning and decision making in uncertain environments (Markov decision processes) that uses a roughly analogous form of prior knowledge. We use hierarchical abstract machines (HAMs), which impose constraints on the policies considered by our learning algorithms. HAMs consist of nondeterministic finite state machines whose transitions may invoke lower-level machines. Nondeterminism is represented by choice states where the optimal action is yet to be decided or learned. The language allows a variety of prior constraints to be expressed, ranging from no constraint all the way to a fully specified solution. One useful intermediate point is the specification of just the general organization of behavior into a layered hierarchy, leaving it up to the learning algorithm to discover exactly which lower-level activities should be invoked by higher levels at each point.\n\n*This research was supported in part by ARO under the MURI program \"Integrated Approach to Intelligent Systems,\" grant number DAAH04-96-1-0341.\n\nFigure 1: (a) An MDP with ~3600 states. The initial state is in the top left. (b) Close-up showing a typical obstacle. (c) Nondeterministic finite-state controller for negotiating obstacles.\n\nThe paper begins with a brief review of Markov decision processes (MDPs) and a description of hierarchical abstract machines. We then present, in abbreviated form, the following results: 1) Given any HAM and any MDP, there exists a new MDP such that the optimal policy in the new MDP is optimal in the original MDP among those policies that satisfy the constraints specified by the HAM. 
This means that even with complex machine specifications we can still apply standard decision-making and learning methods. 2) An algorithm exists that determines this optimal policy, given an MDP and a HAM. 3) On an illustrative problem with 3600 states, this algorithm yields dramatic performance improvements over standard algorithms applied to the original MDP. 4) A reinforcement learning algorithm exists that converges to the optimal policy, subject to the HAM constraints, with no need to construct explicitly a new MDP. 5) On the sample problem, this algorithm learns dramatically faster than standard RL algorithms. We conclude with a discussion of related approaches and ongoing work.\n\n2 Markov Decision Processes\n\nWe assume the reader is familiar with the basic concepts of MDPs. To review, an MDP is a 4-tuple, (S, A, T, R), where S is a set of states, A is a set of actions, T is a transition model mapping S x A x S into probabilities in [0, 1], and R is a reward function mapping S x A x S into real-valued rewards. Algorithms for solving MDPs can return a policy π that maps from S to A, a real-valued value function V on states, or a real-valued Q-function on state-action pairs. In this paper, we focus on infinite-horizon MDPs with a discount factor β. The aim is to find an optimal policy π* (or, equivalently, V* or Q*) that maximizes the expected discounted total reward of the agent.\n\nThroughout the paper, we will use as an example the MDP shown in Figure 1(a). Here A contains four primitive actions (up, down, left, right). The transition model, T, specifies that each action succeeds 80% of the time, while 20% of the time the agent moves in an unintended perpendicular direction. The agent begins in a start state in the upper left corner. A reward of 5.0 is given for reaching the goal state and the discount factor β is 0.999. 
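As a concrete illustration, the 4-tuple machinery of this section can be sketched with value iteration on a toy grid using the same 80%/20% slip dynamics. The 3x3 size, the absorbing goal, the 10% split per perpendicular direction, and the sweep count are our own assumptions for the sketch, not the paper's 3600-state domain:

```python
# Minimal value-iteration sketch for an MDP (S, A, T, R): a small grid,
# four actions, 80% intended move, 20% perpendicular slip, discount 0.999.
N = 3
GOAL = (2, 2)
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
PERP = {'up': ('left', 'right'), 'down': ('left', 'right'),
        'left': ('up', 'down'), 'right': ('up', 'down')}
BETA = 0.999

def step(s, a):
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s  # bumping a wall: stay put

def outcomes(s, a):
    # 80% intended move; we assume the 20% slip splits 10%/10% between
    # the two perpendicular directions
    dist = {}
    for a2, p in ((a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)):
        t = step(s, a2)
        dist[t] = dist.get(t, 0.0) + p
    return dist

def value_iteration(sweeps=200):
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    for _ in range(sweeps):
        for s in list(V):
            if s == GOAL:
                continue  # absorbing goal state keeps value 0
            V[s] = max(sum(p * ((5.0 if t == GOAL else 0.0) + BETA * V[t])
                           for t, p in outcomes(s, a).items())
                       for a in ACTIONS)
    return V
```

Because the goal is absorbing and pays 5.0 only once, every state value in this sketch is bounded above by 5.0.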
\n\n3 Hierarchical abstract machines\n\nA HAM is a program which, when executed by an agent in an environment, constrains the actions that the agent can take in each state. For example, a very simple machine might dictate, \"repeatedly choose right or down,\" which would eliminate from consideration all policies that go up or left. HAMs extend this simple idea of constraining policies by providing a hierarchical means of expressing constraints at varying levels of detail and specificity. Machines for HAMs are defined by a set of states, a transition function, and a start function that determines the initial state of the machine. Machine states are of four types: Action states execute an action in the environment. Call states execute another machine as a subroutine. Choice states nondeterministically select a next machine state. Stop states halt execution of the machine and return control to the previous call state. The transition function determines the next machine state after an action or call state as a stochastic function of the current machine state and some features of the resulting environment state. Machines will typically use a partial description of the environment to determine the next state. Although machines can function in partially observable domains, for the purposes of this paper we make the standard assumption that the agent has access to a complete description as well.\n\nA HAM is defined by an initial machine in which execution begins and the closure of all machines reachable from the initial machine. Figure 1(c) shows a simplified version of one element of the HAM we used for the MDP in Figure 1. This element is used for traversing a hallway while negotiating obstacles of the kind shown in Figure 1(b). It runs until the end of the hallway or an intersection is reached. 
When it encounters an obstacle, a choice point is created to choose between two possible next machine states. One calls the backoff machine to back away from the obstacle and then move forward until the next one. The other calls the follow-wall machine to try to get around the obstacle. The follow-wall machine is very simple and will be tricked by obstacles that are concave in the direction of intended movement; the backoff machine, on the other hand, can move around any obstacle in this world but could waste time backing away from some obstacles unnecessarily and should be used sparingly.\n\nOur complete \"navigation HAM\" involves a three-level hierarchy, somewhat reminiscent of a Brooks-style architecture but with hard-wired decisions replaced by choice states. The top level of the hierarchy is basically just a choice state for choosing a hallway navigation direction from the four coordinate directions. This machine has control initially and regains control at intersections or corners. The second level of the hierarchy contains four machines for moving along hallways, one for each direction. Each machine at this level has a choice state with four basic strategies for handling obstacles. Two back away from obstacles and two attempt to follow walls to get around obstacles. The third level of the hierarchy implements these strategies using the primitive actions.\n\nThe transition function for this HAM assumes that an agent executing the HAM has access to a short-range, low-directed sonar that detects obstacles in any of the four axis-parallel adjacent squares and a long-range, high-directed sonar that detects larger objects such as the intersections and the ends of hallways. The HAM uses these partial state descriptions to identify feasible choices. For example, the machine to traverse a hallway northwards would not be called from the start state because the high-directed sonar would detect a wall to the north. 
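The four machine-state types and the call/return discipline described in this section can be sketched as a small interpreter. The machine encoding and the trivial two-machine example below are our own illustrative assumptions, not the paper's navigation HAM:

```python
# Hedged sketch of HAM execution: machines are dicts of named states, each a
# tuple (kind, payload). 'action' emits a primitive action, 'call' pushes a
# sub-machine, 'choice' asks a choice function to pick the next state, and
# 'stop' pops back to the caller. Transitions here are deterministic for
# simplicity; the paper allows them to be stochastic functions of the
# environment.

def run_ham(machines, root, choose, max_steps=20):
    # stack holds (machine_name, current_state_name) frames
    stack = [(root, machines[root]['start'])]
    trace = []
    while stack and len(trace) < max_steps:
        m, s = stack.pop()
        kind, payload = machines[m]['states'][s]
        if kind == 'stop':
            continue  # return control to the frame left by the call state
        if kind == 'choice':
            nxt = choose(m, s, payload)    # payload: allowed next states
        elif kind == 'action':
            act, nxt = payload
            trace.append(act)              # execute primitive action
        elif kind == 'call':
            sub, nxt = payload
            stack.append((m, nxt))         # resume here after the sub-machine
            stack.append((sub, machines[sub]['start']))
            continue
        stack.append((m, nxt))
    return trace

# Toy HAM: a hall machine repeatedly chooses to call a move machine or stop.
machines = {
    'hall': {'start': 'pick',
             'states': {'pick': ('choice', ['go', 'done']),
                        'go': ('call', ('move', 'pick')),
                        'done': ('stop', None)}},
    'move': {'start': 'fwd',
             'states': {'fwd': ('action', ('forward', 'end')),
                        'end': ('stop', None)}},
}
```

Running it with a scripted choice function that picks 'go' twice and then 'done' yields two primitive 'forward' actions.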
\nOur navigation HAM represents an abstract plan to move about the environment by repeatedly selecting a direction and pursuing this direction until an intersection is reached. Each machine for navigating in the chosen direction represents an abstract plan for moving in a particular direction while avoiding obstacles. The next section defines how a HAM interacts with a specific MDP and how to find an optimal policy that respects the HAM constraints.\n\n4 Defining and solving the HAM-induced MDP\n\nA policy for a model, M, that is HAM-consistent with HAM H is a scheme for making choices whenever an agent executing H in M enters a choice state. To find the optimal HAM-consistent policy we apply H to M to yield an induced MDP, HoM. A somewhat simplified description of the construction of HoM is as follows: 1) The set of states in HoM is the cross-product of the states of H with the states of M. 2) For each state in HoM where the machine component is an action state, the model and machine transition functions are combined. 3) For each state where the machine component is a choice state, actions that change only the machine component of the state are introduced. 4) The reward is taken from M for primitive actions, otherwise it is zero.\n\nFigure 2: Experimental results showing policy value (at the initial state) as a function of runtime on the domain shown in Figure 1. (a) Policy iteration with and without the HAM. (b) Q-learning with and without the HAM (averaged over 10 runs).\n\nWith this construction, we have the following (proof omitted):\n\nLemma 1 For any Markov decision process M and any¹ 
HAM H, the induced process HoM is a Markov decision process.\n\nLemma 2 If π is an optimal policy for HoM, then the primitive actions specified by π constitute the optimal policy for M that is HAM-consistent with H.\n\nOf course, HoM may be quite large. Fortunately, there are two things that will make the problem much easier in most cases. The first is that not all pairs of HAM states and environment states will be possible, i.e., reachable from an initial state. The second is that the actual complexity of the induced MDP is determined by the number of choice points, i.e., states of HoM in which the HAM component is a choice state. This leads to the following:\n\nTheorem 1 For any MDP, M, and HAM, H, let C be the set of choice points in HoM. There exists a decision process, reduce(HoM), with states C such that the optimal policy for reduce(HoM) corresponds to the optimal policy for M that is HAM-consistent with H.\n\nProof sketch We begin by applying Lemma 1 and then observing that in states of HoM where the HAM component is not a choice state, only one action is permitted. These states can be removed to produce an equivalent semi-Markov decision process (SMDP). (SMDPs are a generalization of Markov decision processes that permit different discount rates for different transitions.) The optimal policy for this SMDP will be the same as the optimal policy for HoM and, by Lemma 2, this will be the optimal policy for M that is HAM-consistent with H. □\n\nThis theorem formally establishes the mechanism by which the constraints embodied in a HAM can be used to simplify an MDP. As an example of the power of this theorem, and to demonstrate that this transformation can be done efficiently, we applied our navigation HAM to the problem described in the previous section. Figure 2(a) shows the results of applying policy iteration to the original model and to the transformed model. 
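The reachability and choice-point observations behind Theorem 1 can be illustrated with a small sketch. This is our own simplification, not the paper's implementation: it pairs machine states with environment states and collects the reachable pairs whose machine component is a choice state, assuming deterministic toy transitions to keep the example small.

```python
# Illustrative sketch of the HoM state space: enumerate reachable
# (machine state, environment state) pairs and record the choice points.
def induced_states(machine, env, m0, s0):
    # machine: m -> (kind, list of successor machine states)
    # env: s -> list of successor environment states (for action states)
    frontier, seen, choice_points = [(m0, s0)], set(), set()
    while frontier:
        m, s = frontier.pop()
        if (m, s) in seen:
            continue
        seen.add((m, s))
        kind, succs = machine[m]
        if kind == 'choice':
            choice_points.add((m, s))
        for m2 in succs:
            # only action states move the environment component
            nexts = env[s] if kind == 'action' else [s]
            for s2 in nexts:
                frontier.append((m2, s2))
    return seen, choice_points

# Toy machine with one choice state and two actions, over a two-state world.
machine = {'c': ('choice', ['a1', 'a2']),
           'a1': ('action', ['c']),
           'a2': ('action', ['c'])}
env = {0: [1], 1: [0]}
```

On this toy example, six pairs are reachable but only two of them are choice points, mirroring the observation that reduce(HoM) can be much smaller than HoM.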
Even when we add in the cost of transformation (which, with our rather underoptimized code, takes 866 seconds), the HAM method produces a good policy in less than a quarter of the time required to find the optimal policy in the original model. The actual solution time is 185 seconds versus 4544 seconds.\n\n¹To preserve the Markov property, we require that if a machine has more than one possible caller in the hierarchy, each appearance is treated as a distinct machine. This is equivalent to requiring that the call graph for the HAM is a tree. It follows from this that circular calling sequences are also forbidden.\n\nAn important property of the HAM approach is that model transformation produces an MDP that is an accurate model of the application of the HAM to the original MDP. Unlike typical approximation methods for MDPs, the HAM method can give strict performance guarantees. The solution to the transformed model reduce(HoM) is the optimal solution from within a well-defined class of policies and the value assigned to this solution is the true expected value of applying the concrete HAM policy to the original MDP.\n\n5 Reinforcement learning with HAMs\n\nHAMs can be of even greater advantage in a reinforcement learning context, where the effort required to obtain a solution typically scales very badly with the size of the problem. HAM constraints can focus exploration of the state space, reducing the \"blind search\" phase that reinforcement learning agents must endure while learning about a new environment. Learning will also be faster for the same reason policy iteration is faster in the HAM-induced model; the agent is effectively operating in a reduced state space. 
\nWe now introduce a variation of Q-learning called HAMQ-learning that learns directly in the reduced state space without performing the model transformation described in the previous section. This is significant because the environment model is not usually known a priori in reinforcement learning contexts.\n\nA HAMQ-learning agent keeps track of the following quantities: t, the current environment state; n, the current machine state; s_c and m_c, the environment state and machine state at the previous choice point; a, the choice made at the previous choice point; and r_c and β_c, the total accumulated reward and discount since the previous choice point. It also maintains an extended Q-table, Q([s, m], a), which is indexed by an environment-state/machine-state pair and by an action taken at a choice point.\n\nFor every environment transition from state s to state t with observed reward r and discount β, the HAMQ-learning agent updates: r_c ← r_c + β_c r and β_c ← β β_c. For each transition to a choice point, the agent does\n\nQ([s_c, m_c], a) ← Q([s_c, m_c], a) + α [r_c + β_c V([t, n]) − Q([s_c, m_c], a)],\n\nand then r_c ← 0, β_c ← 1.\n\nTheorem 2 For any finite-state MDP, M, and any HAM, H, HAMQ-learning will converge to the optimal choice for every choice point in reduce(HoM) with probability 1.\n\nProof sketch We note that the expected reinforcement signal in HAMQ-learning is the same as the expected reinforcement signal that would be received if the agent were acting directly in the transformed model of Theorem 1 above. Thus, Theorem 1 of [11] can be applied to prove the convergence of the HAMQ-learning agent, provided that we enforce suitable constraints on the exploration strategy and the update parameter decay rate. □\n\nWe ran some experiments to measure the performance of HAMQ-learning on our sample problem. 
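The accumulate-and-update bookkeeping of HAMQ-learning can be sketched in a few lines. The table layout, the fixed learning rate (the paper decays it over time), and the toy trace are our own illustrative assumptions:

```python
from collections import defaultdict

class HAMQ:
    # Sketch of HAMQ-learning bookkeeping: accumulate discounted reward
    # between choice points, then apply one Q-update at the next choice point.
    def __init__(self, alpha=0.5, choices=('a1', 'a2')):
        self.Q = defaultdict(float)        # keyed by ((s, m), choice)
        self.alpha, self.choices = alpha, choices
        self.prev = None                   # ((s_c, m_c), a) at last choice point
        self.r_c, self.beta_c = 0.0, 1.0

    def observe(self, r, beta):
        # every environment transition: r_c += beta_c * r; beta_c *= beta
        self.r_c += self.beta_c * r
        self.beta_c *= beta

    def at_choice_point(self, s, m, choose):
        if self.prev is not None:
            # V([t, n]) estimated as the max over choices at the new point
            v = max(self.Q[((s, m), c)] for c in self.choices)
            key = self.prev
            self.Q[key] += self.alpha * (self.r_c + self.beta_c * v - self.Q[key])
        self.r_c, self.beta_c = 0.0, 1.0   # reset accumulators
        a = choose(s, m)
        self.prev = ((s, m), a)
        return a
```

With a single rewarded transition (r = 5.0, β = 0.999) between two choice points and α = 0.5, the first choice's Q-value moves from 0 to 2.5.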
Exploration was achieved by selecting actions according to the Boltzmann distribution with a temperature parameter for each state. We also used an inverse decay for the update parameter α. Figure 2(b) compares the learning curves for Q-learning and HAMQ-learning. HAMQ-learning appears to learn much faster: Q-learning required 9,000,000 iterations to reach the level achieved by HAMQ-learning after 270,000 iterations. Even after 20,000,000 iterations, Q-learning did not do as well as HAMQ-learning.²\n\n²Speedup techniques such as eligibility traces could be applied to get better Q-learning results; such methods apply equally well to HAMQ-learning.\n\n6 Related work\n\nState aggregation (see, e.g., [18] and [7]) clusters \"similar\" states together and assigns them the same value, effectively reducing the state space. This is orthogonal to our approach and could be combined with HAMs. However, aggregation should be used with caution as it treats distinct states as a single state and can violate the Markov property, leading to the loss of performance guarantees and oscillation or divergence in reinforcement learning. Moreover, state aggregation may be hard to apply effectively in many cases.\n\nDean and Lin [8] and Bertsekas and Tsitsiklis [2] showed that some MDPs are loosely coupled and hence amenable to divide-and-conquer algorithms. A machine-like language was used in [13] to partition an MDP into decoupled subproblems. In problems that are amenable to decoupling, these approaches could be used in combination with HAMs.\n\nDayan and Hinton [6] have proposed feudal RL, which specifies an explicit subgoal structure, with fixed values for each subgoal achieved, in order to achieve a hierarchical decomposition of the state space. Dietterich extends and generalizes this approach in [9]. 
Singh has investigated a number of approaches to subgoal-based decomposition in reinforcement learning (e.g., [17] and [16]). Subgoals seem natural in some domains, but they may require a significant amount of outside knowledge about the domain, and establishing the relationship between the value of subgoals with respect to the overall problem can be difficult.\n\nBradtke and Duff [3] proposed an RL algorithm for SMDPs. Sutton [19] proposes temporal abstractions, which concatenate sequences of state transitions together to permit reasoning about temporally extended events, and which can thereby form a behavioral hierarchy as in [14] and [15]. Lin's somewhat informal scheme [12] also allows agents to treat entire policies as single actions. These approaches can be encompassed within our framework by encoding the events or behaviors as machines.\n\nThe design of hierarchically organized, \"layered\" controllers was popularized by Brooks [4]. His designs use a somewhat different means of passing control, but our analysis and theorems apply equally well to his machine description language. The \"teleo-reactive\" agent designs of Benson and Nilsson [1] are even closer to our HAM language. Both of these approaches assume that the agent is completely specified, albeit self-modifiable. The idea of partial behavior descriptions can be traced at least to Hsu's partial programs [10], which were used with a deterministic logical planner.\n\n7 Conclusions and future work\n\nWe have presented HAMs as a principled means of constraining the set of policies that are considered for a Markov decision process and we have demonstrated the efficacy of this approach in a simple example for both policy iteration and reinforcement learning. Our results show very significant speedup for decision making and learning, but of course, this reflects the provision of knowledge in the form of the HAM. 
The HAM language provides a very general method of transferring knowledge to an agent and we have only scratched the surface of what can be done with this approach.\n\nWe believe that, if desired, subgoal information can be incorporated into the HAM structure, unifying subgoal-based approaches with the HAM approach. Moreover, the HAM structure provides a natural decomposition of the HAM-induced model, making it amenable to the divide-and-conquer approaches of [8] and [2].\n\nThere are opportunities for generalization across all levels of the HAM paradigm. Value function approximation can be used for the HAM-induced model and inductive learning methods can be used to produce HAMs or to generalize their effects upon different regions of the state space. Gradient-following methods also can be used to adjust the transition probabilities of a stochastic HAM.\n\nHAMs also lend themselves naturally to partially observable domains. They can be applied directly when the choice points induced by the HAM are states where no confusion about the true state of the environment is possible. The application of HAMs to more general partially observable domains is more complicated and is a topic of ongoing research. We also believe that the HAM approach can be extended to cover the average-reward optimality criterion.\n\nWe expect that successful pursuit of these lines of research will provide a formal basis for understanding and unifying several seemingly disparate approaches to control, including behavior-based methods. It should also enable the use of the MDP framework in real-world applications of much greater complexity than hitherto attacked, much as HTN planning has extended the reach of classical planning methods.\n\nReferences\n\n[1] S. Benson and N. Nilsson. Reacting, planning and learning in an autonomous agent. In K. Furukawa, D. Michie, and S. 
Muggleton, editors, Machine Intelligence 14. Oxford University Press, Oxford, 1995.\n\n[2] D. C. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, New Jersey, 1989.\n\n[3] S. J. Bradtke and M. O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems 7: Proc. of the 1994 Conference, Denver, Colorado, December 1995. MIT Press.\n\n[4] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2, 1986.\n\n[5] K. W. Currie and A. Tate. O-Plan: the Open Planning Architecture. Artificial Intelligence, 52(1), November 1991.\n\n[6] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Stephen Jose Hanson, Jack D. Cowan, and C. Lee Giles, editors, Neural Information Processing Systems 5, San Mateo, California, 1993. Morgan Kaufmann.\n\n[7] T. Dean, R. Givan, and S. Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proc. of the Thirteenth Conference on Uncertainty in Artificial Intelligence, Providence, Rhode Island, August 1997. Morgan Kaufmann.\n\n[8] T. Dean and S.-H. Lin. Decomposition techniques for planning in stochastic domains. In Proc. of the Fourteenth Int. Joint Conference on Artificial Intelligence, Montreal, Canada, August 1995. Morgan Kaufmann.\n\n[9] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Technical report, Department of Computer Science, Oregon State University, Corvallis, Oregon, 1997.\n\n[10] Y.-J. Hsu. Synthesizing efficient agents from partial programs. In Methodologies for Intelligent Systems: 6th Int. Symposium, ISMIS '91, Proc., Charlotte, North Carolina, October 1991. Springer-Verlag.\n\n[11] T. Jaakkola, M. I. Jordan, and S. P. Singh. 
On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1994.\n\n[12] L.-J. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Computer Science Department, Carnegie-Mellon University, Pittsburgh, Pennsylvania, 1993.\n\n[13] Shieu-Hong Lin. Exploiting Structure for Planning and Control. PhD thesis, Computer Science Department, Brown University, Providence, Rhode Island, 1997.\n\n[14] A. McGovern, R. S. Sutton, and A. H. Fagg. Roles of macro-actions in accelerating reinforcement learning. In 1997 Grace Hopper Celebration of Women in Computing, 1997.\n\n[15] D. Precup and R. S. Sutton. Multi-time models for temporally abstract planning. In this volume.\n\n[16] S. P. Singh. Scaling reinforcement learning algorithms by learning variable temporal resolution models. In Proceedings of the Ninth International Conference on Machine Learning, Aberdeen, July 1992. Morgan Kaufmann.\n\n[17] S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3), May 1992.\n\n[18] S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Neural Information Processing Systems 7, Cambridge, Massachusetts, 1995. MIT Press.\n\n[19] R. S. Sutton. Temporal abstraction in reinforcement learning. In Proc. of the Twelfth Int. Conference on Machine Learning, Tahoe City, CA, July 1995. Morgan Kaufmann.\n", "award": [], "sourceid": 1384, "authors": [{"given_name": "Ronald", "family_name": "Parr", "institution": null}, {"given_name": "Stuart", "family_name": "Russell", "institution": null}]}