{"title": "Improved Switching among Temporally Abstract Actions", "book": "Advances in Neural Information Processing Systems", "page_first": 1066, "page_last": 1072, "abstract": null, "full_text": "Improved Switching \n\namong Temporally Abstract Actions \n\nRichard S. Sutton Satinder Singh \n\nDoina Precup Balaraman Ravindran \n\nAT&T Labs \n\nFlorham Park, NJ 07932 \n\n{ sutton,baveja}@research.att.com \n\nUniversity of Massachusetts \nAmherst, MA 01003-4610 \n\n{ dprecup,ravi}@cs.umass.edu \n\nAbstract \n\nIn robotics and other control applications it is commonplace to have a pre(cid:173)\nexisting set of controllers for solving subtasks, perhaps hand-crafted or \npreviously learned or planned, and still face a difficult problem of how to \nchoose and switch among the controllers to solve an overall task as well as \npossible. In this paper we present a framework based on Markov decision \nprocesses and semi-Markov decision processes for phrasing this problem, \na basic theorem regarding the improvement in performance that can be ob(cid:173)\ntained by switching flexibly between given controllers, and example appli(cid:173)\ncations of the theorem. In particular, we show how an agent can plan with \nthese high-level controllers and then use the results of such planning to find \nan even better plan, by modifying the existing controllers, with negligible \nadditional cost and no re-planning. In one of our examples, the complexity \nof the problem is reduced from 24 billion state-action pairs to less than a \nmillion state-controller pairs. \n\nIn many applications, solutions to parts of a task are known, either because they were hand(cid:173)\ncrafted by people or because they were previously learned or planned. For example, in \nrobotics applications, there may exist controllers for moving joints to positions, picking up \nobjects, controlling eye movements, or navigating along hallways. 
More generally, an intelligent system may have available to it several temporally extended courses of action to choose from. In such cases, a key challenge is to take full advantage of the existing temporally extended actions, to choose or switch among them effectively, and to plan at their level rather than at the level of individual actions. \n\nRecently, several researchers have begun to address these challenges within the framework of reinforcement learning and Markov decision processes (e.g., Singh, 1992; Kaelbling, 1993; Dayan & Hinton, 1993; Thrun & Schwartz, 1995; Sutton, 1995; Dietterich, 1998; Parr & Russell, 1998; McGovern, Sutton & Fagg, 1997). Common to much of this recent work is the modeling of a temporally extended action as a policy (controller) and a condition for terminating, which we together refer to as an option (Sutton, Precup & Singh, 1998). In this paper we consider the problem of effectively combining given options into one overall policy, generalizing prior work by Kaelbling (1993). Sections 1-3 introduce the framework; our new results are in Sections 4 and 5. \n\n1 Reinforcement Learning (MDP) Framework \n\nIn a Markov decision process (MDP), an agent interacts with an environment at some discrete, lowest-level time scale t = 0, 1, 2, ... On each time step, the agent perceives the state of the environment, s_t ∈ S, and on that basis chooses a primitive action, a_t ∈ A_{s_t}. In response to each action, a_t, the environment produces one step later a numerical reward, r_{t+1}, and a next state, s_{t+1}. The one-step model of the environment consists of the one-step state-transition probabilities and the one-step expected rewards, \n\np^a_{ss'} = Pr{s_{t+1} = s' | s_t = s, a_t = a}   and   r^a_s = E{r_{t+1} | s_t = s, a_t = a}, \n\nfor all s, s' ∈ S and a ∈ A_s. 
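As a concrete illustration of the one-step model, the following sketch stores p^a_{ss'} and r^a_s as tabular arrays and performs one Bellman backup for a fixed policy. The 3-state, 2-action MDP and its random numbers are made up for illustration; this is not code from the paper.

```python
import numpy as np

# Tabular one-step model of a small hypothetical MDP.
# P[a, s, s2] holds the transition probabilities p^a_{ss'},
# R[a, s]     holds the expected rewards r^a_s.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize: each (a, s) row sums to 1
R = rng.random((n_actions, n_states))

# One Bellman backup of V under a fixed policy pi(s, a):
gamma = 0.9
pi = np.full((n_states, n_actions), 0.5)   # uniform random policy
V = np.zeros(n_states)
EV = np.einsum('ast,t->as', P, V)          # sum_{s'} p^a_{ss'} V(s')
V_new = np.einsum('sa,as->s', pi, R + gamma * EV)
```

Repeating this backup to convergence is exactly iterative policy evaluation for V^π.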
The agent's objective is to learn an optimal Markov policy, a mapping from states to probabilities of taking each available primitive action, π : S × A → [0, 1], that maximizes the expected discounted future reward from each state s: \n\nV^π(s) = E{r_{t+1} + γ r_{t+2} + ··· | s_t = s, π} = Σ_{a∈A_s} π(s,a) [r^a_s + γ Σ_{s'} p^a_{ss'} V^π(s')], \n\nwhere π(s,a) is the probability with which the policy π chooses action a ∈ A_s in state s, and γ ∈ [0, 1] is a discount-rate parameter. V^π(s) is called the value of state s under policy π, and V^π is called the state-value function for π. The optimal state-value function gives the value of a state under an optimal policy: V*(s) = max_π V^π(s) = max_{a∈A_s} [r^a_s + γ Σ_{s'} p^a_{ss'} V*(s')]. Given V*, an optimal policy is easily formed by choosing in each state s any action that achieves the maximum in this equation. A parallel set of value functions, denoted Q^π and Q*, and Bellman equations can be defined for state-action pairs, rather than for states. Planning in reinforcement learning refers to the use of models of the environment to compute value functions and thereby to optimize or improve policies. \n\n2 Options \n\nWe use the term options for our generalization of primitive actions to include temporally extended courses of action. Let h_{tT} = s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, ..., r_T, s_T be the history sequence from time t ≤ T to time T, and let Ω denote the set of all possible histories in the given MDP. Options consist of three components: an initiation set I ⊆ S, a policy π : Ω × A → [0, 1], and a termination condition β : Ω → [0, 1]. An option o = (I, π, β) can be taken in state s if and only if s ∈ I. If o is taken in state s_t, the next action a_t is selected according to π(s_t, ·). 
The environment then makes a transition to s_{t+1}, where o terminates with probability β(h_{t,t+1}), or else continues, determining a_{t+1} according to π(h_{t,t+1}, ·), and transitioning to state s_{t+2}, where o terminates with probability β(h_{t,t+2}), etc. We call the general options defined above semi-Markov because π and β depend on the history sequence; in Markov options π and β depend only on the current state. Semi-Markov options allow \"timeouts\", i.e., termination after some period of time has elapsed, and other extensions which cannot be handled by Markov options. \n\nThe initiation set and termination condition of an option together limit the states over which the option's policy must be defined. For example, a hand-crafted policy π for a mobile robot to dock with its battery charger might be defined only for states I in which the battery charger is within sight. The termination condition β would be defined to be 1 outside of I and when the robot is successfully docked. \n\nWe can now define policies over options. Let the set of options available in state s be denoted O_s; the set of all options is denoted O = ∪_{s∈S} O_s. When initiated in a state s_t, the Markov policy over options μ : S × O → [0, 1] selects an option o ∈ O_{s_t} according to the probability distribution μ(s_t, ·). The option o is then taken in s_t, determining actions until it terminates in s_{t+k}, at which point a new option is selected, according to μ(s_{t+k}, ·), and so on. In this way a policy over options, μ, determines a (non-stationary) policy over actions, or flat policy, π = f(μ). We define the value of a state s under a general flat policy π as the expected return if the policy is started in s: \n\nV^π(s) ≝ E{r_{t+1} + γ r_{t+2} + ··· | ℰ(π, s, t)}, \n\nwhere ℰ(π, s, t) denotes the event of π being initiated in s at time t. 
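The option construct above can be sketched as a small data structure together with its execution loop. This is illustrative Python only: `env_step`, the discount rate, and the little chain environment in the usage example are assumptions, not the paper's code.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set        # I: states where the option may be taken
    policy: Callable           # pi(history) -> action; semi-Markov in general
    termination: Callable      # beta(history) -> probability of terminating

def run_option(env_step, option, s, rng, gamma=0.9):
    """Execute an option from state s until it terminates (sketch).
    env_step(s, a) -> (reward, next_state) is an assumed environment interface.
    Returns (discounted return, final state, number of elapsed steps)."""
    assert s in option.initiation_set
    history, ret, k = [s], 0.0, 0
    while True:
        a = option.policy(history)
        r, s = env_step(s, a)
        history += [a, r, s]       # histories grow as s0, a0, r1, s1, ...
        ret += (gamma ** k) * r
        k += 1
        if rng.random() < option.termination(history):
            return ret, s, k

# Hypothetical usage: walk right along a 4-state chain until reaching state 3.
walk_right = Option(initiation_set={0, 1, 2},
                    policy=lambda h: 'right',
                    termination=lambda h: 1.0 if h[-1] == 3 else 0.0)
ret, s_final, k = run_option(lambda s, a: (-1.0, s + 1), walk_right, 0,
                             random.Random(0))
```

A Markov option is the special case in which `policy` and `termination` look only at `h[-1]`, the current state.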
The value of a state under a general policy (i.e., a policy over options) μ can then be defined as the value of the state under the corresponding flat policy: V^μ(s) ≝ V^{f(μ)}(s). An analogous definition can be used for the option-value function, Q^μ(s, o). For semi-Markov options it is useful to define Q^μ(h, o) as the expected discounted future reward after having followed option o through history h. \n\n3 SMDP Planning \n\nOptions are closely related to the actions in a special kind of decision problem known as a semi-Markov decision process, or SMDP (Puterman, 1994; see also Singh, 1992; Bradtke & Duff, 1995; Mahadevan et al., 1997; Parr & Russell, 1998). In fact, any MDP with a fixed set of options is an SMDP. Accordingly, the theory of SMDPs provides an important basis for a theory of options. In this section, we review the standard SMDP framework for planning, which will provide the basis for our extension. \n\nPlanning with options requires a model of their consequences. The form of this model is given by prior work with SMDPs. The reward part of the model of o for state s ∈ S is the total reward received along the way: \n\nr^o_s = E{r_{t+1} + γ r_{t+2} + ··· + γ^{k-1} r_{t+k} | ℰ(o, s, t)}, \n\nwhere ℰ(o, s, t) denotes the event of o being initiated in state s at time t. The state-prediction part of the model is \n\np^o_{ss'} = Σ_{k=1}^∞ p(s', k) γ^k, \n\nfor all s' ∈ S, where p(s', k) is the probability that the option terminates in s' after k steps. We call this kind of model a multi-time model because it describes the outcome of an option not at a single time but at potentially many different times, appropriately combined. \n\nUsing multi-time models we can write Bellman equations for general policies and options. For any general Markov policy μ, its value functions satisfy the equations: \n\nV^μ(s) = Σ_{o∈O_s} μ(s, o) [r^o_s + Σ_{s'} p^o_{ss'} V^μ(s')]   and   Q^μ(s, o) = r^o_s + Σ_{s'} p^o_{ss'} V^μ(s'). 
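These Bellman equations can be solved by simple fixed-point iteration once the option models are in hand. The sketch below uses made-up random models `r_o` and `p_o` (all names and sizes are illustrative); because the discount is folded into p^o_{ss'}, the rows of `p_o` sum to less than 1, which makes the iteration a contraction.

```python
import numpy as np

# Hypothetical multi-time option models for a small SMDP.
# r_o[o, s]     ~ r^o_s:     expected discounted reward while o runs from s
# p_o[o, s, s2] ~ p^o_{ss'}: discount-weighted terminal-state probabilities
n_states, n_options = 4, 2
rng = np.random.default_rng(1)
r_o = rng.random((n_options, n_states))
p_o = rng.random((n_options, n_states, n_states))
p_o *= 0.9 / p_o.sum(axis=2, keepdims=True)   # rows sum to 0.9 < 1 (discounting)

mu = np.full((n_states, n_options), 1.0 / n_options)  # uniform policy over options

# Iterate the Bellman equations to a fixed point:
#   Q^mu(s, o) = r^o_s + sum_{s'} p^o_{ss'} V^mu(s')
#   V^mu(s)    = sum_o mu(s, o) Q^mu(s, o)
V = np.zeros(n_states)
for _ in range(500):
    Q = r_o + np.einsum('ost,t->os', p_o, V)
    V = np.einsum('so,os->s', mu, Q)
```

After enough sweeps, `V` and `Q` satisfy both equations simultaneously, i.e., they are V^μ and Q^μ for these models.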
Let us denote a restricted set of options by O and the set of all policies selecting only from options in O by Π(O). Then the optimal value function given that we can select only from O is V*_O(s) = max_{o∈O_s} [r^o_s + Σ_{s'} p^o_{ss'} V*_O(s')]. A corresponding optimal policy, denoted μ*_O, is any policy that achieves V*_O, i.e., for which V^{μ*_O}(s) = V*_O(s) in all states s ∈ S. If V*_O and the models of the options are known, then μ*_O can be formed by choosing in any proportion among the maximizing options in the equation above for V*_O. \n\nIt is straightforward to extend MDP planning methods to SMDPs. For example, synchronous value iteration with options initializes an approximate value function V_0(s) arbitrarily and then updates it by: \n\nV_{k+1}(s) ← max_{o∈O_s} [r^o_s + Σ_{s'∈S} p^o_{ss'} V_k(s')],   ∀s ∈ S. \n\nNote that this algorithm reduces to conventional value iteration in the special case in which O = A. Standard results from SMDP theory guarantee that such processes converge for general semi-Markov options: lim_{k→∞} V_k(s) = V*_O(s) for all s ∈ S and for all O. The policies found using temporally abstract options are approximate in the sense that they achieve only V*_O, which is typically less than the maximum possible, V*. \n\n4 Interrupting Options \n\nWe are now ready to present the main new insight and result of this paper. SMDP methods apply to options, but only when they are treated as opaque indivisible units. Once an option has been selected, such methods require that its policy be followed until the option terminates. More interesting and potentially more powerful methods are possible by looking inside options and by altering their internal structure (e.g., Sutton, Precup & Singh, 1998). 
In particular, suppose we have determined the option-value function Q^μ(s, o) for some policy μ and for all state-option pairs (s, o) that could be encountered while following μ. This function tells us how well we do while following μ, committing irrevocably to each option, but it can also be used to re-evaluate our commitment on each step. Suppose at time t we are in the midst of executing option o. If o is Markov in s_t, then we can compare the value of continuing with o, which is Q^μ(s_t, o), to the value of interrupting o and selecting a new option according to μ, which is V^μ(s_t) = Σ_{o'} μ(s_t, o') Q^μ(s_t, o'). If the latter is more highly valued, then why not interrupt o and allow the switch? This new way of behaving is indeed better, as shown below. \n\nWe can characterize the new way of behaving as following a policy μ' that is the same as the original one, but over new options, i.e., μ'(s, o') = μ(s, o), for all s ∈ S. Each new option o' is the same as the corresponding old option o except that it terminates whenever switching seems better than continuing according to Q^μ. We call such a μ' an interrupted policy of μ. We will now state a general theorem, which extends the case described above, in that options may be semi-Markov (instead of Markov) and interruption is optional at each state where it could be done. The latter extension lifts the requirement that Q^μ be completely known, since the interruption can be restricted to states for which this information is available. \n\nTheorem 1 (Interruption) For any MDP, any set of options O, and any Markov policy μ : S × O → [0, 1], define a new set of options, O', with a one-to-one mapping between the two option sets as follows: for every o = (I, π, β) ∈ O we define a corresponding o' = (I, π, β') ∈ O', where β' = β except that for any history h in which Q^μ(h, o) < V^μ(s), where s is the final state of h, we may choose to set β'(h) = 1. 
Any histories whose termination conditions are changed in this way are called interrupted histories. Let μ' be the policy over O' corresponding to μ: μ'(s, o') = μ(s, o), where o is the option in O corresponding to o', for all s ∈ S. Then \n\n1. V^{μ'}(s) ≥ V^μ(s) for all s ∈ S. \n2. If from state s ∈ S there is a non-zero probability of encountering an interrupted history upon initiating μ' in s, then V^{μ'}(s) > V^μ(s). \n\nProof: The idea is to show that, for an arbitrary start state s, executing the option given by the termination-improved policy μ' and then following policy μ thereafter is no worse than always following policy μ. In other words, we show that the following inequality holds: \n\nΣ_{o'} μ'(s, o') [r^{o'}_s + Σ_{s'} p^{o'}_{ss'} V^μ(s')] ≥ V^μ(s) = Σ_o μ(s, o) [r^o_s + Σ_{s'} p^o_{ss'} V^μ(s')].   (1) \n\nIf this is true, then we can use it to expand the left-hand side, repeatedly replacing every occurrence of V^μ(x) on the left by the corresponding Σ_{o'} μ'(x, o') [r^{o'}_x + Σ_{x'} p^{o'}_{xx'} V^μ(x')]. In the limit, the left-hand side becomes V^{μ'}, proving that V^{μ'} ≥ V^μ. Since μ'(s, o') = μ(s, o) ∀s ∈ S, we need to show that \n\nr^{o'}_s + Σ_{s'} p^{o'}_{ss'} V^μ(s') ≥ r^o_s + Σ_{s'} p^o_{ss'} V^μ(s').   (2) \n\nLet Γ denote the set of all interrupted histories: Γ = {h ∈ Ω : β(h) ≠ β'(h)}. Then, the left-hand side of (2) can be re-written as \n\nE{r + γ^k V^μ(s') | ℰ(o', s), h_{ss'} ∉ Γ} + E{r + γ^k V^μ(s') | ℰ(o', s), h_{ss'} ∈ Γ}, \n\nwhere s', r, and k are the next state, cumulative reward, and number of elapsed steps following option o from s (h_{ss'} is the history from s to s'). Trajectories that end because of encountering a history h_{ss'} ∉ Γ never encounter a history in Γ, and therefore also occur with the same probability and expected reward upon executing option o in state s. 
Therefore, we can re-write the right-hand side of (2) as E{r + γ^k V^μ(s') | ℰ(o', s), h_{ss'} ∉ Γ} + E{β(s')[r + γ^k V^μ(s')] + (1 − β(s'))[r + γ^k Q^μ(h_{ss'}, o)] | ℰ(o', s), h_{ss'} ∈ Γ}. This proves (1) because for all h_{ss'} ∈ Γ, Q^μ(h_{ss'}, o) ≤ V^μ(s'). Note that strict inequality holds in (2) if Q^μ(h_{ss'}, o) < V^μ(s') for at least one history h_{ss'} ∈ Γ that ends a trajectory generated by o' with non-zero probability.¹ ◊ \n\nAs one application of this result, consider the case in which μ is an optimal policy for a given set of Markov options O. The interruption theorem gives us a way of improving over μ*_O with just the cost of checking (on each time step) if a better option exists, which is negligible compared to the combinatorial process of computing Q*_O or V*_O. Kaelbling (1993) and Dietterich (1998) demonstrated a similar performance improvement by interrupting temporally extended actions in a different setting. \n\n5 Illustration \n\nFigure 1 shows a simple example of the gain that can be obtained by interrupting options. The task is to navigate from a start location to a goal location within a continuous two-dimensional state space. The actions are movements of length 0.01 in any direction from the current state. Rather than work with these low-level actions, infinite in number, we introduce seven landmark locations in the space. For each landmark we define a controller that takes us to the landmark in a direct path. Each controller is only applicable within a limited range of states, in this case within a certain distance of the corresponding landmark. Each controller then defines an option: the circular region around the controller's landmark is the option's initiation set, the controller itself is the policy, and the arrival at the target landmark is the termination condition. We denote the set of seven landmark options by O. 
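The corner-cutting behavior in this illustration comes from the interruption rule of Section 4, which can be sketched generically as a single decision step. The function below is illustrative only; tabular arrays `Q` and `mu` are assumptions, not the paper's representation.

```python
import numpy as np

def interrupted_step(s, current_o, Q, mu, rng):
    """One decision step of the interrupted policy mu' (sketch).
    Q[s, o]: option values Q^mu(s, o); mu[s, o]: policy over options.
    The running option is cut off whenever switching looks better."""
    V = float(mu[s] @ Q[s])                # V^mu(s) = sum_o mu(s, o) Q^mu(s, o)
    if current_o is None or Q[s, current_o] < V:
        # interrupt and draw a fresh option from mu(s, .)
        return int(rng.choice(len(mu[s]), p=mu[s]))
    return current_o                       # keep following the current option
```

Keeping the running option only while Q^μ(s, o) ≥ V^μ(s) is exactly the β' modification of Theorem 1, specialized to Markov options.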
Any action within 0.01 of the goal location transitions to the terminal state, γ = 1, and the reward is −1 on all transitions, which makes this a minimum-time task. \n\nOne of the landmarks coincides with the goal, so it is possible to reach the goal while picking only from O. The optimal policy within Π(O) runs from landmark to landmark, as shown by the thin line in Figure 1. This is the optimal solution to the SMDP defined by O and is indeed the best that one can do while picking only from these options. But of course one can do better if the options are not followed all the way to each landmark. The trajectory shown by the thick line in Figure 1 cuts the corners and is shorter. This is the interrupted policy with respect to the SMDP-optimal policy. The interrupted policy takes 474 steps from start to goal which, while not as good as the optimal policy (425 steps), is much better than the SMDP-optimal policy, which takes 600 steps. The state-value functions, V^{μ*_O} and V^{μ'}, for the two policies are also shown in Figure 1. \n\nFigure 2 presents a more complex, mission planning task. A mission is a flight from base to observe as many of a given set of sites as possible and to return to base without running out of fuel. The local weather at each site flips from cloudy to clear according to independent \n\n¹We note that the same proof would also apply for switching to other options (not selected by μ) if they improved over continuing with o. That result would be more general and closer to conventional policy improvement. We prefer the result given here because it emphasizes its primary application. 
Figure 1: Using interruption to improve navigation with landmark-directed controllers. The task (left) is to navigate from S to G in minimum time using options based on controllers that each run to one of seven landmarks (the black dots). The circles show the region around each landmark within which the controllers operate. The thin line shows the optimal behavior that uses only these controllers run to termination (the SMDP solution, 600 steps), and the thick line shows the corresponding interrupted behavior (474 steps), which cuts the corners. The right panels show the state-value functions for the SMDP-optimal and interrupted policies. \n\nPoisson processes. If the sky at a given site is cloudy when the plane gets there, no observation is made and the reward is 0. If the sky is clear, the plane gets a reward, according to the importance of the site. The positions, rewards, and mean time between two weather changes for each site are given in Figure 2. The plane has a limited amount of fuel, and it consumes one unit of fuel during each time tick. If the fuel runs out before reaching the base, the plane crashes and receives a reward of −100. \n\nThe primitive actions are tiny movements in any direction (there is no inertia). The state of the system is described by several variables: the current position of the plane, the fuel level, the sites that have been observed so far, and the current weather at each of the remaining sites. The state-action space has approximately 24.3 billion elements (assuming 100 discretization levels of the continuous variables) and is intractable by normal dynamic programming methods. 
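Planning at the level of options, by contrast, is just the SMDP value iteration of Section 3 over option models. A minimal sketch follows; the one-option model in the toy check is made up for illustration and has nothing to do with the mission-planning models.

```python
import numpy as np

def smdp_value_iteration(r_o, p_o, n_iters=1000):
    """Synchronous value iteration with options (sketch).
    r_o[o, s]: option-model rewards r^o_s.
    p_o[o, s, s2]: discount-weighted transition model p^o_{ss'}.
    Returns an approximation of the optimal value function V*_O."""
    V = np.zeros(r_o.shape[1])
    for _ in range(n_iters):
        # V(s) <- max_o [ r^o_s + sum_{s'} p^o_{ss'} V(s') ]
        V = np.max(r_o + np.einsum('ost,t->os', p_o, V), axis=0)
    return V

# Toy check: a single option that always lands in state 1 with reward 1 and
# discount weight 0.9 gives V = 1 / (1 - 0.9) = 10 in both states.
V = smdp_value_iteration(np.array([[1.0, 1.0]]),
                         np.array([[[0.0, 0.9], [0.0, 0.9]]]))
```

With O = A (each option a one-step primitive action), this reduces to conventional value iteration.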
We introduced options that can take the plane to each of the sites (including the base), from any position in the state space. The resulting SMDP has only 874,800 elements, and it is feasible to exactly determine V*_O(s) for all states s. From this solution and the model of the options, we can determine Q*_O(s, o) = r^o_s + Σ_{s'} p^o_{ss'} V*_O(s') for any option o and any state s in the whole space. \n\nWe performed asynchronous value iteration using the options in order to compute the optimal option-value function, and then used the interruption approach based on the values computed. The policies obtained by both approaches were compared to the results of a static planner, which exhaustively searches for the best tour assuming the weather does not change, and then re-plans whenever the weather does change. The graph in Figure 2 shows the reward obtained by each of these methods, averaged over 100 independent simulated missions. The policy obtained by interruption performs significantly better than the SMDP policy, which in turn is significantly better than the static planner.² \n\n6 Closing \n\nThis paper has developed a natural, even obvious, observation: that one can do better by continually re-evaluating one's commitment to courses of action than one can by committing irrevocably to them. Our contribution has been to formulate this observation precisely enough to prove it and to demonstrate it empirically. Our final example suggests that this technique can be used in applications far too large to be solved at the level of primitive actions. Note that this was achieved using exact methods, without function approximators to represent the value function. With function approximators and other reinforcement learning techniques, it should be possible to address problems that are substantially larger still. \n\n²In preliminary experiments, we also used interruption on a crudely learned estimate of Q*_O. 
The performance of the interrupted solution was very close to the result reported here. \n\nFigure 2: The mission planning task and the performance of policies constructed by SMDP methods, interruption of the SMDP policy, and an optimal static re-planner that does not take into account possible changes in weather conditions. \n\nAcknowledgments \n\nThe authors gratefully acknowledge the substantial help they have received from many colleagues, including especially Amy McGovern, Andrew Barto, Ron Parr, Tom Dietterich, Andrew Fagg, Leo Zelevinsky and Manfred Huber. We also thank Paul Cohen, Robbie Moll, Mance Harmon, Sascha Engelbrecht, and Ted Perkins for helpful reactions and constructive criticism. This work was supported by NSF grant ECS-9511805 and grant AFOSR-F49620-96-1-0254, both to Andrew Barto and Richard Sutton. Satinder Singh was supported by NSF grant IIS-9711753. \n\nReferences \n\nBradtke, S. J. & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In NIPS 7 (393-400). MIT Press. \n\nDayan, P. & Hinton, G. E. (1993). Feudal reinforcement learning. In NIPS 5 (271-278). MIT Press. \n\nDietterich, T. G. (1998). The MAXQ method for hierarchical reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann. \n\nKaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning (167-173). Morgan Kaufmann. 
Mahadevan, S., Marchalleck, N., Das, T. K. & Gosavi, A. (1997). Self-improving factory simulation using continuous-time average-reward reinforcement learning. In Proceedings of the Fourteenth International Conference on Machine Learning (202-210). Morgan Kaufmann. \n\nMcGovern, A., Sutton, R. S. & Fagg, A. H. (1997). Roles of macro-actions in accelerating reinforcement learning. In Grace Hopper Celebration of Women in Computing (13-17). \n\nParr, R. & Russell, S. (1998). Reinforcement learning with hierarchies of machines. In NIPS 10. MIT Press. \n\nPuterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley. \n\nSingh, S. P. (1992). Reinforcement learning with a hierarchy of abstract models. In Proceedings of the Tenth National Conference on Artificial Intelligence (202-207). MIT/AAAI Press. \n\nSutton, R. S. (1995). TD models: Modeling the world as a mixture of time scales. In Proceedings of the Twelfth International Conference on Machine Learning (531-539). Morgan Kaufmann. \n\nSutton, R. S., Precup, D. & Singh, S. (1998). Intra-option learning about temporally abstract actions. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann. \n\nSutton, R. S., Precup, D. & Singh, S. (1998). Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales. TR 98-74, Department of Computer Science, University of Massachusetts, Amherst. \n\nThrun, S. & Schwartz, A. (1995). Finding structure in reinforcement learning. In NIPS 7 (385-392). MIT Press. \n", "award": [], "sourceid": 1607, "authors": [{"given_name": "Richard", "family_name": "Sutton", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "Doina", "family_name": "Precup", "institution": null}, {"given_name": "Balaraman", "family_name": "Ravindran", "institution": null}]}