{"title": "State Abstraction in MAXQ Hierarchical Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 994, "page_last": 1000, "abstract": null, "full_text": "State Abstraction in MAXQ Hierarchical Reinforcement Learning \n\nThomas G. Dietterich \n\nDepartment of Computer Science \n\nOregon State University \n\nCorvallis, Oregon 97331-3202 \n\ntgd@cs.orst.edu \n\nAbstract \n\nMany researchers have explored methods for hierarchical reinforcement \nlearning (RL) with temporal abstractions, in which abstract \nactions are defined that can perform many primitive actions before \nterminating. However, little is known about learning with state abstractions, \nin which aspects of the state space are ignored. In previous \nwork, we developed the MAXQ method for hierarchical RL. In \nthis paper, we define five conditions under which state abstraction \ncan be combined with the MAXQ value function decomposition. \nWe prove that the MAXQ-Q learning algorithm converges under \nthese conditions and show experimentally that state abstraction is \nimportant for the successful application of MAXQ-Q learning. \n\n1 Introduction \n\nMost work on hierarchical reinforcement learning has focused on temporal abstraction. \nFor example, in the Options framework [1,2], the programmer defines a set of \nmacro actions (\"options\") and provides a policy for each. Learning algorithms (such \nas semi-Markov Q learning) can then treat these temporally abstract actions as if \nthey were primitives and learn a policy for selecting among them. Closely related \nis the HAM framework, in which the programmer constructs a hierarchy of finite-state \ncontrollers [3]. Each controller can include non-deterministic states (where the \nprogrammer was not sure what action to perform). The HAMQ learning algorithm \ncan then be applied to learn a policy for making choices in the non-deterministic \nstates. 
In both of these approaches, and in other studies of hierarchical RL (e.g., \n[4, 5, 6]), each option or finite state controller must have access to the entire state \nspace. The one exception to this, the Feudal-Q method of Dayan and Hinton [7], \nintroduced state abstractions in an unsafe way, such that the resulting learning \nproblem was only partially observable. Hence, they could not provide any formal \nresults for the convergence or performance of their method. \n\nEven a brief consideration of human-level intelligence shows that such methods cannot \nscale. When deciding how to walk from the bedroom to the kitchen, we do not \nneed to think about the location of our car. Without state abstractions, any RL \nmethod that learns value functions must learn a separate value for each state of the \nworld. Some argue that this can be solved by clever value function approximation \nmethods, and there is some merit in this view. In this paper, however, we explore \na different approach in which we identify aspects of the MDP that permit state abstractions \nto be safely incorporated in a hierarchical reinforcement learning method \nwithout introducing function approximations. This permits us to obtain the first \nproof of the convergence of hierarchical RL to an optimal policy in the presence of \nstate abstraction. \n\nWe introduce these state abstractions within the MAXQ framework [8], but the \nbasic ideas are general. In our previous work with MAXQ, we briefly discussed state \nabstractions, and we employed them in our experiments. However, we could not \nprove that our algorithm (MAXQ-Q) converged with state abstractions, and we did \nnot have a usable characterization of the situations in which state abstraction could \nbe safely employed. 
This paper solves these problems and in addition compares the \neffectiveness of MAXQ-Q learning with and without state abstractions. The results \nshow that state abstraction is very important, and in most cases essential, to the \neffective application of MAXQ-Q learning. \n\n2 The MAXQ Framework \n\nLet M be a Markov decision problem with states S, actions A, reward function \nR(s'|s, a) and probability transition function P(s'|s, a). Our results apply in both \nthe finite-horizon undiscounted case and the infinite-horizon discounted case. Let \n{M_0, ..., M_n} be a set of subtasks of M, where each subtask M_i is defined by a \ntermination predicate T_i and a set of actions A_i (which may be other subtasks or \nprimitive actions from A). The \"goal\" of subtask M_i is to move the environment into \na state such that T_i is satisfied. (This can be refined using a local reward function \nto express preferences among the different states satisfying T_i [8], but we omit this \nrefinement in this paper.) The subtasks of M must form a DAG with a single \"root\" \nnode: no subtask may invoke itself directly or indirectly. A hierarchical policy is \na set of policies π = {π_0, ..., π_n}, one for each subtask. A hierarchical policy \nis executed using standard procedure-call-and-return semantics, starting with the \nroot task M_0 and unfolding recursively until primitive actions are executed. When \nthe policy for M_i is invoked in state s, let P(s', N | s, i) be the probability that it \nterminates in state s' after executing N primitive actions. A hierarchical policy is \nrecursively optimal if each policy π_i is optimal given the policies of its descendants \nin the DAG. 
\nLet V(i, s) be the value function for subtask i in state s (i.e., the value of following \nsome policy starting in s until we reach a state s' satisfying T_i(s')). Similarly, let \nQ(i, s, j) be the Q value for subtask i of executing child action j in state s and \nthen executing the current policy until termination. The MAXQ value function \ndecomposition is based on the observation that each subtask M_i can be viewed as a \nSemi-Markov Decision problem in which the reward for performing action j in state \ns is equal to V(j, s), the value function for subtask j in state s. To see this, consider \nthe sequence of rewards r_t that will be received when we execute child action j and \nthen continue with subsequent actions according to hierarchical policy π: \n\n$$Q(i, s, j) = E\\{ r_t + \\gamma r_{t+1} + \\gamma^2 r_{t+2} + \\cdots \\mid s_t = s, \\pi \\}$$ \n\nThe macro action j will execute for some number of steps N and then return. Hence, \nwe can partition this sum into two terms: \n\n$$Q(i, s, j) = E\\left\\{ \\sum_{u=0}^{N-1} \\gamma^u r_{t+u} + \\sum_{u=N}^{\\infty} \\gamma^u r_{t+u} \\,\\middle|\\, s_t = s, \\pi \\right\\}$$ \n\nThe first term is the discounted sum of rewards until subtask j terminates, V(j, s). \nThe second term is the cost of finishing subtask i after j is executed (discounted \nto the time when j is initiated). We call this second term the completion function, \nand denote it C(i, s, j). We can then write the Bellman equation as \n\n$$Q(i, s, j) = \\sum_{s', N} P(s', N \\mid s, j) \\cdot \\left[ V(j, s) + \\gamma^N \\max_{j'} Q(i, s', j') \\right] = V(j, s) + C(i, s, j)$$ \n\nTo terminate this recursion, define V(a, s) for a primitive action a to be the expected \nreward of performing action a in state s. \n\nThe MAXQ-Q learning algorithm is a simple variation of Q learning in which at \nsubtask M_i, state s, we choose a child action j and invoke its (current) policy. When \nit returns, we observe the resulting state s' and the number of elapsed time steps \nN and update C(i, s, j) according to \n\n$$C(i, s, j) := (1 - \\alpha_t) C(i, s, j) + \\alpha_t \\cdot \\gamma^N \\left[ \\max_{a'} V(a', s') + C(i, s', a') \\right]$$ \n\nTo prove convergence, we require that the exploration policy executed during learning \nbe an ordered GLIE policy. An ordered policy is a policy that breaks Q-value \nties among actions by preferring the action that comes first in some fixed ordering. \nA GLIE policy [9] is a policy that (a) executes each action infinitely often in every \nstate that is visited infinitely often and (b) converges with probability 1 to a greedy \npolicy. The ordering condition is required to ensure that the recursively optimal \npolicy is unique. Without this condition, there are potentially many different recursively \noptimal policies with different values, depending on how ties are broken \nwithin subtasks, subsubtasks, and so on. \n\nTheorem 1 Let M = (S, A, P, R) be either an episodic MDP for which all deterministic \npolicies are proper or a discounted infinite horizon MDP with discount \nfactor γ. Let H be a DAG defined over subtasks {M_0, ..., M_k}. Let α_t(i) > 0 be a \nsequence of constants for each subtask M_i such that \n\n$$\\lim_{T \\to \\infty} \\sum_{t=1}^{T} \\alpha_t(i) = \\infty \\quad \\text{and} \\quad \\lim_{T \\to \\infty} \\sum_{t=1}^{T} \\alpha_t^2(i) < \\infty \\qquad (1)$$ \n\nLet π_x(i, s) be an ordered GLIE policy at each subtask M_i and state s and assume \nthat |V_t(i, s)| and |C_t(i, s, a)| are bounded for all t, i, s, and a. Then with probability \n1, algorithm MAXQ-Q converges to the unique recursively optimal policy for M \nconsistent with H and π_x. \n\nProof: (sketch) The proof is based on Proposition 4.5 from Bertsekas and Tsitsiklis \n[10] and follows the standard stochastic approximation argument due to [11] \ngeneralized to the case of non-stationary noise. There are two key points in the \nproof. First, define P_t(s', N | s, j) to be the probability transition function that describes \nthe behavior of executing the current policy for subtask j at time t. By an inductive \nargument, we show that this probability transition function converges (w.p. 
1) to \nthe probability transition function of the recursively optimal policy for j. Second, \nwe show how to convert the usual weighted max norm contraction for Q into a \nweighted max norm contraction for C. This is straightforward, and completes the \nproof. \n\nWhat is notable about MAXQ-Q is that it can learn the value functions of all \nsubtasks simultaneously: it does not need to wait for the value function for subtask \nj to converge before beginning to learn the value function for its parent task i. This \ngives a completely online learning algorithm with wide applicability. \n\nFigure 1: Left: The Taxi Domain (taxi at row 3 column 0). Right: Task Graph. \n\n3 Conditions for Safe State Abstraction \n\nTo motivate state abstraction, consider the simple Taxi Task shown in Figure 1. \nThere are four special locations in this world, marked as R(ed), B(lue), G(reen), \nand Y(ellow). In each episode, the taxi starts in a randomly-chosen square. There \nis a passenger at one of the four locations (chosen randomly), and that passenger \nwishes to be transported to one of the four locations (also chosen randomly). The \ntaxi must go to the passenger's location (the \"source\"), pick up the passenger, go \nto the destination location (the \"destination\"), and put down the passenger there. \nThe episode ends when the passenger is deposited at the destination location. \n\nThere are six primitive actions in this domain: (a) four navigation actions that \nmove the taxi one square North, South, East, or West, (b) a Pickup action, and (c) \na Putdown action. Each action is deterministic. There is a reward of -1 for each \naction and an additional reward of +20 for successfully delivering the passenger. \nThere is a reward of -10 if the taxi attempts to execute the Putdown or Pickup \nactions illegally. 
If a navigation action would cause the taxi to hit a wall, the action \nis a no-op, and there is only the usual reward of -1. \n\nThis task has a hierarchical structure (see Fig. 1) in which there are two main \nsub-tasks: Get the passenger (Get) and Deliver the passenger (Put). Each of these \nsubtasks in turn involves the subtask of navigating to one of the four locations \n(Navigate(t), where t is bound to the desired target location) and then performing \na Pickup or Putdown action. This task illustrates the need to support both temporal \nabstraction and state abstraction. The temporal abstraction is obvious: for \nexample, Get is a temporally extended action that can take different numbers of \nsteps to complete depending on the distance to the target. The top level policy (get \npassenger; deliver passenger) can be expressed very simply with these abstractions. \n\nThe need for state abstraction is perhaps less obvious. Consider the Get subtask. \nWhile this subtask is being solved, the destination of the passenger is completely \nirrelevant: it cannot affect any of the navigation or pickup decisions. Perhaps more \nimportantly, when navigating to a target location (either the source or destination \nlocation of the passenger), only the taxi's location and the identity of the target location \nare important. The fact that in some cases the taxi is carrying the passenger and \nin other cases it is not is irrelevant. \n\nWe now introduce the five conditions for state abstraction. We will assume that the \nstate s of the MDP is represented as a vector of state variables. A state abstraction \ncan be defined for each combination of subtask M_i and child action j by identifying \na subset X of the state variables that are relevant and defining the value function \nand the policy using only these relevant variables. Such value functions and policies \nare said to be abstract. \n\nThe first two conditions involve eliminating irrelevant variables within a subtask of \nthe MAXQ decomposition. \n\nCondition 1: Subtask Irrelevance. Let M_i be a subtask of MDP M. A set \nof state variables Y is irrelevant to subtask i if the state variables of M can be \npartitioned into two sets X and Y such that for any stationary abstract hierarchical \npolicy π executed by the descendants of M_i, the following two properties hold: (a) \nthe state transition probability distribution P^π(s', N | s, j) for each child action j of \nM_i can be factored into the product of two distributions: \n\n$$P^\\pi(x', y', N \\mid x, y, j) = P^\\pi(x', N \\mid x, j) \\cdot P^\\pi(y' \\mid x, y, j), \\qquad (2)$$ \n\nwhere x and x' give values for the variables in X, and y and y' give values for the \nvariables in Y; and (b) for any pair of states s_1 = (x, y_1) and s_2 = (x, y_2) and any \nchild action j, V^π(j, s_1) = V^π(j, s_2). \n\nIn the Taxi problem, the source and destination of the passenger are irrelevant to \nthe Navigate(t) subtask: only the target t and the current taxi position are relevant. \nThe advantages of this form of abstraction are similar to those obtained by Boutilier, \nDearden and Goldszmidt [12] in which belief network models of actions are exploited \nto simplify value iteration in stochastic planning. \n\nCondition 2: Leaf Irrelevance. A set of state variables Y is irrelevant for a \nprimitive action a if for any pair of states s_1 and s_2 that differ only in their values \nfor the variables in Y, \n\n$$\\sum_{s_1'} P(s_1' \\mid s_1, a) R(s_1' \\mid s_1, a) = \\sum_{s_2'} P(s_2' \\mid s_2, a) R(s_2' \\mid s_2, a).$$ \n\nThis condition is satisfied by the primitive actions North, South, East, and West in \nthe taxi task, where all state variables are irrelevant because R is constant. \n\nThe next two conditions involve \"funnel\" actions: macro actions that move the \nenvironment from some large number of possible states to a small number of resulting \nstates. 
The completion function of such subtasks can be represented using \na number of values proportional to the number of resulting states. \n\nCondition 3: Result Distribution Irrelevance (Undiscounted case). A set \nof state variables Y_j is irrelevant for the result distribution of action j if, for all \nabstract policies π executed by M_j and its descendants in the MAXQ hierarchy, the \nfollowing holds: for all pairs of states s_1 and s_2 that differ only in their values for \nthe state variables in Y_j, \n\n$$\\forall s' \\quad P^\\pi(s' \\mid s_1, j) = P^\\pi(s' \\mid s_2, j).$$ \n\nConsider, for example, the Get subroutine under an optimal policy for the taxi \ntask. Regardless of the taxi's position in state s, the taxi will be at the passenger's \nstarting location when Get finishes executing (i.e., because the taxi will have just \ncompleted picking up the passenger). Hence, the taxi's initial position is irrelevant \nto its resulting position. (Note that this is only true in the undiscounted setting: \nwith discounting, the result distributions are not the same because the number of \nsteps N required for Get to finish depends very much on the starting location of the \ntaxi. Hence this form of state abstraction is rarely useful for cumulative discounted \nreward.) \n\nCondition 4: Termination. Let M_j be a child task of M_i with the property \nthat whenever M_j terminates, it causes M_i to terminate too. Then the completion \ncost C(i, s, j) = 0 and does not need to be represented. This is a particular kind of \nfunnel action: it funnels all states into terminal states for M_i. \n\nFor example, in the Taxi task, in all states where the taxi is holding the passenger, \nthe Put subroutine will succeed and result in a terminal state for Root. This is \nbecause the termination predicate for Put (i.e., that the passenger is at his or her \ndestination location) implies the termination condition for Root (which is the same). 
\nThis means that C(Root, s, Put) is uniformly zero, for all states s where Put is not \nterminated. \n\nCondition 5: Shielding. Consider subtask M_i and let s be a state such that \nfor all paths from the root of the DAG down to M_i, there exists a subtask that is \nterminated. Then no C values need to be represented for subtask M_i in state s, \nbecause it can never be executed in s. \n\nIn the Taxi task, a simple example of this arises in the Put task, which is terminated \nin all states where the passenger is not in the taxi. This means that we do not need \nto represent C(Root, s, Put) in these states. The result is that, when combined \nwith the Termination condition above, we do not need to explicitly represent the \ncompletion function for Put at all! \n\nBy applying these abstraction conditions to the Taxi task, the value function can \nbe represented using 632 values, far fewer than the 3,000 values required \nby flat Q learning. Without state abstractions, MAXQ requires 14,000 values! \n\nTheorem 2 (Convergence with State Abstraction) Let H be a MAXQ task \ngraph that incorporates the five kinds of state abstractions defined above. Let π_x be \nan ordered GLIE exploration policy that is abstract. Then under the same conditions \nas Theorem 1, MAXQ-Q converges with probability 1 to the unique recursively \noptimal policy π_r^* defined by π_x and H. \n\nProof: (sketch) Consider a subtask M_i with relevant variables X and two arbitrary \nstates (x, y_1) and (x, y_2). We first show that under the five abstraction \nconditions, the value function of π_r^* can be represented using C(i, x, j) (i.e., ignoring \nthe y values). To learn the values of C(i, x, j) = Σ_{x',N} P(x', N | x, j) V(i, x'), a \nQ-learning algorithm needs samples of x' and N drawn according to P(x', N | x, j). 
\nThe second part of the proof involves showing that regardless of whether we execute \nj in state (x, y_1) or in (x, y_2), the resulting x' and N will have the same distribution, \nand hence, give the correct expectations. Analogous arguments apply for leaf \nirrelevance and V(a, x). The termination and shielding cases are easy. \n\n4 Experimental Results \n\nWe implemented MAXQ-Q for a noisy version of the Taxi domain and for Kaelbling's \nHDG navigation task [5] using Boltzmann exploration. Figure 2 shows the \nperformance of flat Q and MAXQ-Q with and without state abstractions on these \ntasks. Learning rates and Boltzmann cooling rates were separately tuned to optimize \nthe performance of each method. The results show that without state abstractions, \nMAXQ-Q learning is slower to converge than flat Q learning, but that with \nstate abstraction, it is much faster. \n\nFigure 2: Comparison of MAXQ-Q with and without state abstraction to flat Q learning \non a noisy taxi domain (left) and Kaelbling's HDG task (right). The horizontal axis gives \nthe number of primitive actions executed by each method. The vertical axis plots the \naverage of 100 separate runs. \n\n5 Conclusion \n\nThis paper has shown that by understanding the reasons that state variables are \nirrelevant, we can obtain a simple proof of the convergence of MAXQ-Q learning \nunder state abstraction. 
This is much more fruitful than previous efforts based \nonly on weak notions of state aggregation [10], and it suggests that future research \nshould focus on identifying other conditions that permit safe state abstraction. \n\nReferences \n\n[1] D. Precup and R. S. Sutton, \"Multi-time models for temporally abstract planning,\" in NIPS-10, The MIT Press, 1998. \n\n[2] R. S. Sutton, D. Precup, and S. Singh, \"Between MDPs and Semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales,\" tech. rep., Univ. Mass., Dept. Comp. Inf. Sci., Amherst, MA, 1998. \n\n[3] R. Parr and S. Russell, \"Reinforcement learning with hierarchies of machines,\" in NIPS-10, The MIT Press, 1998. \n\n[4] S. P. Singh, \"Transfer of learning by composing solutions of elemental sequential tasks,\" Machine Learning, vol. 8, p. 323, 1992. \n\n[5] L. P. Kaelbling, \"Hierarchical reinforcement learning: Preliminary results,\" in Proceedings ICML-10, pp. 167-173, Morgan Kaufmann, 1993. \n\n[6] M. Hauskrecht, N. Meuleau, C. Boutilier, L. Kaelbling, and T. Dean, \"Hierarchical solution of Markov decision processes using macro-actions,\" tech. rep., Brown Univ., Dept. Comp. Sci., Providence, RI, 1998. \n\n[7] P. Dayan and G. Hinton, \"Feudal reinforcement learning,\" in NIPS-5, pp. 271-278, San Francisco, CA: Morgan Kaufmann, 1993. \n\n[8] T. G. Dietterich, \"The MAXQ method for hierarchical reinforcement learning,\" in ICML-15, Morgan Kaufmann, 1998. \n\n[9] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, \"Convergence results for single-step on-policy reinforcement-learning algorithms,\" tech. rep., Univ. Col., Dept. Comp. Sci., Boulder, CO, 1998. \n\n[10] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996. \n\n[11] T. Jaakkola, M. I. Jordan, and S. P. Singh, \"On the convergence of stochastic iterative dynamic programming algorithms,\" Neur. Comp., vol. 6, no. 6, pp. 1185-1201, 1994. \n\n[12] C. Boutilier, R. Dearden, and M. Goldszmidt, \"Exploiting structure in policy construction,\" in Proceedings IJCAI-95, pp. 1104-1111, 1995. \n", "award": [], "sourceid": 1770, "authors": [{"given_name": "Thomas", "family_name": "Dietterich", "institution": null}]}