{"title": "Learning Abstract Options", "book": "Advances in Neural Information Processing Systems", "page_first": 10424, "page_last": 10434, "abstract": "Building systems that autonomously create temporal abstractions from data is a key challenge in scaling learning and planning in reinforcement learning. One popular approach for addressing this challenge is the options framework (Sutton et al., 1999). However, only recently in (Bacon et al., 2017) was a policy gradient theorem derived for online learning of general purpose options in an end to end fashion. In this work, we extend previous work on this topic that only focuses on learning a two-level hierarchy including options and primitive actions to enable learning simultaneously at multiple resolutions in time. We achieve this by considering an arbitrarily deep hierarchy of options where high level temporally extended options are composed of lower level options with finer resolutions in time. We extend results from (Bacon et al., 2017) and derive policy gradient theorems for a deep hierarchy of options. Our proposed hierarchical option-critic architecture is capable of learning internal policies, termination conditions, and hierarchical compositions over options without the need for any intrinsic rewards or subgoals. Our empirical results in both discrete and continuous environments demonstrate the efficiency of our framework.", "full_text": "Learning Abstract Options\n\nMatthew Riemer, Miao Liu, and Gerald Tesauro\n\nIBM Research\n\nT.J. Watson Research Center, Yorktown Heights, NY\n{mdriemer, miao.liu1, gtesauro}@us.ibm.com\n\nAbstract\n\nBuilding systems that autonomously create temporal abstractions from data is a\nkey challenge in scaling learning and planning in reinforcement learning. 
One\npopular approach for addressing this challenge is the options framework [29].\nHowever, only recently in [1] was a policy gradient theorem derived for online\nlearning of general purpose options in an end to end fashion. In this work, we\nextend previous work on this topic that only focuses on learning a two-level\nhierarchy including options and primitive actions to enable learning simultaneously\nat multiple resolutions in time. We achieve this by considering an arbitrarily deep\nhierarchy of options where high level temporally extended options are composed\nof lower level options with finer resolutions in time. We extend results from\n[1] and derive policy gradient theorems for a deep hierarchy of options. Our\nproposed hierarchical option-critic architecture is capable of learning internal\npolicies, termination conditions, and hierarchical compositions over options without\nthe need for any intrinsic rewards or subgoals. Our empirical results in both discrete\nand continuous environments demonstrate the efficiency of our framework.\n\n1 Introduction\n\nIn reinforcement learning (RL), options [29, 21] provide a general framework for defining temporally\nabstract courses of action for learning and planning. Discovering these temporal abstractions\nautonomously has been the subject of extensive research [16, 28, 17, 27, 26] with approaches that can be\nused in continuous state and/or action spaces only recently becoming feasible [9, 20, 15, 14, 10, 31, 3].\nMost existing work has focused on finding subgoals (i.e. useful states for the agent to reach) and\nthen learning policies to achieve them. However, these approaches do not scale well because of their\ncombinatorial nature. Recent work on option-critic learning blurs the line between option discovery\nand option learning by providing policy gradient theorems for optimizing a two-level hierarchy of\noptions and primitive actions [1]. 
These approaches have achieved success when applied to Q-learning\non Atari games, but also in continuous action spaces [7] and with asynchronous parallelization [6].\nIn this paper, we extend option-critic to a novel hierarchical option-critic framework, presenting\ngeneralized policy gradient theorems that can be applied to an arbitrarily deep hierarchy of options.\n\nFigure 1: State trajectories over a three-level hierarchy of options. Open circles represent SMDP\ndecision points while filled circles are primitive steps within an option. The low level options are\ntemporally extended over primitive actions, and high level options are even further extended.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWork on learning with temporal abstraction is motivated by two key potential benefits over learning\nwith primitive actions: long term credit assignment and exploration. Learning only at the primitive\naction level or even with low levels of abstraction slows down learning, because agents must learn\nlonger sequences of actions to achieve the desired behavior. This frustrates the process of learning\nin environments with sparse rewards. In contrast, agents that learn a high level decomposition of\nsub-tasks are able to explore the environment more effectively by exploring in the abstract action\nspace rather than the primitive action space. While the recently proposed deliberation cost [6] can be\nused as a margin that effectively controls how temporally extended options are, the standard two-level\nversion of the option-critic framework is still ill-equipped to learn complex tasks that require sub-task\ndecomposition at multiple quite different temporal resolutions of abstraction. In Figure 1 we depict\nhow we overcome this obstacle to learn a deep hierarchy of options. 
The standard two-level option\nhierarchy constructs a Semi-Markov Decision Process (SMDP), where new options are chosen when\ntemporally extended sequences of primitive actions are terminated. In our framework, we consider\nnot just options and primitive actions, but also an arbitrarily deep hierarchy of lower level and higher\nlevel options. Higher level options represent a further temporally extended SMDP than the low level\noptions below as they only have an opportunity to terminate when all lower level options terminate.\nWe will start by reviewing related research and by describing the seminal work we build upon in\nthis paper that first derived policy gradient theorems for learning with primitive actions [30] and\noptions [1]. We will then describe the core ideas of our approach, presenting hierarchical intra-option\npolicy and termination gradient theorems. We leverage this new type of policy gradient\nlearning to construct a hierarchical option-critic architecture, which generalizes the option-critic\narchitecture to an arbitrarily deep hierarchy of options. Finally, we demonstrate the empirical benefit\nof this architecture over standard option-critic when applied to RL benchmarks. To the best of our\nknowledge, this is the first general purpose end-to-end approach for learning a deep hierarchy of\noptions beyond two levels in RL settings, scaling to very large domains at comparable efficiency.\n\n2 Related Work\n\nOur work is related to recent literature on learning to compose skills in RL. As an example, Sahni\net al. [23] leverage a logic for combining pre-learned skills by learning an embedding to represent\nthe combination of a skill and state. Unfortunately, their system relies on a pre-specified sub-task\ndecomposition into skills. In [25], the authors propose to ground all goals in a natural language\ndescription space. Created descriptions can then be high level and express a sequence of goals. 
While\nthese are interesting directions for further exploration, we will focus on a more general setting without\nprovided natural language goal descriptions or sub-task decomposition information.\n\nOur work is also related to methods that learn to decompose the problem over long time horizons.\nA prominent paradigm for this is Feudal Reinforcement Learning [4], which learns using manager\nand worker models. Theoretically, this can be extended to a deep hierarchy of managers and their\nmanagers as done in the original work for a hand designed decomposition of the state space. Much\nmore recently, Vezhnevets et al. [32] showed the ability to successfully train a Feudal model end\nto end with deep neural networks for the Atari games. However, this has only been achieved for\na two-level hierarchy (i.e. one manager and one worker). We can think of Feudal approaches as\nlearning to decompose the problem with respect to the state space, while the options framework learns\na temporal decomposition of the problem. Recent work [11] also breaks down the problem over a\ntemporal hierarchy, but like [32] is based on learning a latent goal representation that modulates the\npolicy behavior as opposed to options. Conceptually, options stress choosing among skill abstractions\nand Feudal approaches stress the achievement of certain kinds of states. Humans tend to use both\nof these kinds of reasoning when appropriate and we conjecture that a hybrid approach will likely\nwin out in the end. Unfortunately, in the space available we feel that we cannot come to definitive\nconclusions about the precise nature of the differences and potential synergies of these approaches.\n\nThe concept of learning a hierarchy of options is not new. It is an obviously desirable extension of\noptions envisioned in the original papers. However, actually learning a deep hierarchy of options end\nto end has received surprisingly little attention to date. 
Compositional planning where options select\nother options was first considered in [26]. The authors provided a generalization of value iteration to\noption models for multiple subgoals, leveraging explicit subgoals for options. Recently, Fox et al.\n[5] successfully trained a hierarchy of options end to end for imitation learning. Their approach\nleverages an EM based algorithm for recursive discovery of additional levels of the option hierarchy.\nUnfortunately, their approach is only applicable to imitation learning and not general purpose RL. We\nare the first to propose theorems along with a practical algorithm and architecture to train arbitrarily\ndeep hierarchies of options end to end using policy gradients, maximizing the expected return.\n\n3 Problem Setting and Notation\n\nA Markov Decision Process (MDP) consists of a set of states $S$, a set of actions $A$, a transition\nfunction $P : S \\times A \\to (S \\to [0,1])$ and a reward function $r : S \\times A \\to \\mathbb{R}$. We follow [1] and develop\nour ideas assuming discrete state and action sets, while our results extend to continuous spaces using\nusual measure-theoretic assumptions as we demonstrate in our experiments. A policy is a probability\ndistribution over actions conditioned on states, $\\pi : S \\to A \\to [0,1]$. The value function of a policy\n$\\pi$ is defined as the expected return $V_\\pi(s) = E_\\pi[\\sum_{t=0}^{\\infty} \\gamma^t r_{t+1} | s_0 = s]$ with an action-value function of\n$Q_\\pi(s,a) = E_\\pi[\\sum_{t=0}^{\\infty} \\gamma^t r_{t+1} | s_0 = s, a_0 = a]$ where $\\gamma \\in [0,1)$ is the discount factor.\n\nPolicy gradient methods [30, 8] address the problem of finding a good policy by performing\nstochastic gradient descent to optimize a performance objective over a given family of parametrized\nstochastic policies, $\\pi_\\theta$. 
The policy gradient theorem [30] provides an expression for the gradient of the\ndiscounted reward objective with respect to $\\theta$. The objective is defined with respect to a designated\nstart state (or distribution) $s_0$: $\\rho(\\theta, s_0) = E_{\\pi_\\theta}[\\sum_{t=0}^{\\infty} \\gamma^t r_{t+1} | s_0]$. The policy gradient theorem shows that:\n$\\frac{\\partial \\rho(\\theta, s_0)}{\\partial \\theta} = \\sum_s \\mu_{\\pi_\\theta}(s|s_0) \\sum_a \\frac{\\partial \\pi_\\theta(a|s)}{\\partial \\theta} Q_{\\pi_\\theta}(s,a)$, where $\\mu_{\\pi_\\theta}(s|s_0) = \\sum_{t=0}^{\\infty} \\gamma^t P(s_t = s | s_0)$ is a discounted\nweighting of the states along the trajectories starting from $s_0$.\n\nThe options framework [29, 21] formalizes the idea of temporally extended actions. A Markovian\noption $o \\in \\Omega$ is a triple $(I_o, \\pi_o, \\beta_o)$ in which $I_o \\subseteq S$ is an initiation set, $\\pi_o$ is an intra-option policy, and\n$\\beta_o : S \\to [0,1]$ is a termination function. Like most option discovery algorithms, we assume that all\noptions are available everywhere. MDPs with options become SMDPs [22] with a corresponding\noptimal value function over options $V^*_\\Omega(s)$ and option-value function $Q^*_\\Omega(s,o)$ [29, 21].\n\nThe option-critic architecture [1] leverages a call-and-return option execution model, in which an\nagent picks option $o$ according to its policy over options $\\pi_\\Omega(o|s)$, then follows the intra-option policy\n$\\pi(a|s,o)$ until termination (as dictated by $\\beta(s,o)$), at which point this procedure is repeated. Let\n$\\pi_\\theta(a|s,o)$ denote the intra-option policy of option $o$ parametrized by $\\theta$ and $\\beta_\\phi(s,o)$ the termination\nfunction of $o$ parametrized by $\\phi$. 
Like policy gradient methods, the option-critic architecture\noptimizes directly for the discounted return expected over trajectories starting at a designated state $s_0$\nand option $o_0$: $\\rho(\\Omega, \\theta, \\phi, s_0, o_0) = E_{\\Omega, \\pi_\\theta, \\beta_\\phi}[\\sum_{t=0}^{\\infty} \\gamma^t r_{t+1} | s_0, o_0]$. 
The option-value function is then:\n\n$$Q_\\Omega(s,o) = \\sum_a \\pi_\\theta(a|s,o) Q_U(s,o,a), \\tag{1}$$\n\nwhere $Q_U : S \\times \\Omega \\times A \\to \\mathbb{R}$ is the value of executing an action in the context of a state-option pair:\n\n$$Q_U(s,o,a) = r(s,a) + \\gamma \\sum_{s'} P(s'|s,a) U(s',o). \\tag{2}$$\n\nThe $(s,o)$ pairs lead to an augmented state space [12]. The option-critic architecture instead leverages\nthe function $U : \\Omega \\times S \\to \\mathbb{R}$ which is called the option-value function upon arrival [29]. The value of\nexecuting $o$ upon entering state $s'$ is given by:\n\n$$U(s',o) = (1 - \\beta_\\phi(s',o)) Q_\\Omega(s',o) + \\beta_\\phi(s',o) V_\\Omega(s'). \\tag{3}$$\n\n$Q_U$ and $U$ both depend on $\\theta$ and $\\phi$, but this dependence is omitted from the notation for clarity. The intra-option\npolicy gradient theorem results from taking the derivative of the expected discounted return with\nrespect to the intra-option policy parameters $\\theta$ and defines the update rule for the intra-option policy:\n\n$$\\frac{\\partial Q_\\Omega(s_0,o_0)}{\\partial \\theta} = \\sum_{s,o} \\mu_\\Omega(s,o|s_0,o_0) \\sum_a \\frac{\\partial \\pi_\\theta(a|s,o)}{\\partial \\theta} Q_U(s,o,a), \\tag{4}$$\n\nwhere $\\mu_\\Omega(s,o|s_0,o_0)$ is a discounted weighting of state-option pairs along trajectories starting from\n$(s_0,o_0)$: $\\mu_\\Omega(s,o|s_0,o_0) = \\sum_{t=0}^{\\infty} \\gamma^t P(s_t = s, o_t = o | s_0, o_0)$. 
The termination gradient theorem results\nfrom taking the derivative of the expected discounted return with respect to the termination policy\nparameters $\\phi$ and defines the update rule for the termination policy for the initial condition $(s_1,o_0)$:\n\n$$\\frac{\\partial Q_\\Omega(s,o)}{\\partial \\phi} = \\sum_a \\pi_\\theta(a|s,o) \\sum_{s'} \\gamma P(s'|s,a) \\frac{\\partial U(s',o)}{\\partial \\phi}, \\tag{5}$$\n\n$$\\frac{\\partial U(s_1,o_0)}{\\partial \\phi} = -\\sum_{s',o} \\mu_\\Omega(s',o|s_1,o_0) \\frac{\\partial \\beta_\\phi(s',o)}{\\partial \\phi} A_\\Omega(s',o), \\tag{6}$$\n\nwhere $\\mu_\\Omega$ is a discounted weighting of $(s,o)$ from $(s_1,o_0)$: $\\mu_\\Omega(s,o|s_1,o_0) = \\sum_{t=0}^{\\infty} \\gamma^t P(s_{t+1} = s, o_t = o | s_1, o_0)$.\n$A_\\Omega$ is the advantage function over options: $A_\\Omega(s',o) = Q_\\Omega(s',o) - V_\\Omega(s')$.\n\n4 Learning Options with Arbitrary Levels of Abstraction\n\nNotation: As it makes our equations much clearer and more condensed we adopt the notation\n$x^{i:i+j} = x^i, ..., x^{i+j}$. This implies that $x^{i:i+j}$ denotes a list of variables in the range of $i$ through $i+j$.\n\nThe hierarchical options framework that we introduce in this work considers an agent that learns\nusing an $N$ level hierarchy of policies, termination functions, and value functions. Our goal is to\nextend the ideas of the option-critic architecture in such a way that our framework simplifies to policy\ngradient based learning when $N = 1$ and option-critic learning when $N = 2$. At each hierarchical\nlevel above the lowest primitive action level policy, we consider an available set of options $\\Omega^{1:N-1}$\nthat is a subset of the total set of available options $\\Omega$. This way we keep our view of the possible\navailable options at each level very broad. 
On one extreme, each hierarchical level may get its own\nunique set of options and on the other extreme each hierarchical level may share the same set of\noptions. We present a diagram of our proposed architecture in Figure 2.\n\nWe denote $\\pi^1_{\\theta^1}(o^1|s)$ as the policy over the most abstract options in the hierarchy $o^1 \\in \\Omega^1$\ngiven the state $s$. For example, $\\pi^1 = \\pi_\\Omega$ from our discussion of the option-critic architecture. Once $o^1$ is chosen with\npolicy $\\pi^1$, then we go to policy $\\pi^2_{\\theta^2}(o^2|s,o^1)$, which is the next highest level policy, to select\n$o^2 \\in \\Omega^2$ conditioning it on both the current state $s$ and the selected highest level option $o^1$.\nThis process continues on in the same fashion stepping down to policies at lower levels of abstraction\nconditioned on the augmented state space considering all selected higher level options until we reach policy\n$\\pi^N_{\\theta^N}(a|s,o^{1:N-1})$. $\\pi^N$ is the lowest level policy\nand it finally selects over the primitive action space conditioned on all of the selected options.\n\nEach level of the option hierarchy has a complementary termination function\n$\\beta^1_{\\phi^1}(s,o^1), ..., \\beta^{N-1}_{\\phi^{N-1}}(s,o^{1:N-1})$ that governs the termination pattern of the selected option at\nthat level. We adopt a bottom up termination strategy where high level options only have an\nopportunity to terminate when all of the lower level options have terminated first. For example,\n$o^{N-2}_t$ cannot terminate until $o^{N-1}_t$ terminates at which point we can assess $\\beta^{N-2}_{\\phi^{N-2}}(s,o^{1:N-2})$ to see\nwhether $o^{N-2}_t$ terminates. If it did terminate, this would allow $o^{N-3}_t$ the opportunity to assess if it\nshould terminate and so on. 
This condition ensures that higher level options will be more temporally\nextended than their lower level option building blocks, which is a key motivation of this work.\n\nThe final key component of our system is the value function over the augmented state space. 
To\nenable comprehensive reasoning about the policies at each level of the option hierarchy, we need to\nmaintain value functions that consider the state and every possible combination of active options and\nactions $V_\\Omega(s), Q_\\Omega(s,o^1), ..., Q_\\Omega(s,o^{1:N-1},a)$. These value functions collectively serve as the critic in\nour analogy to the actor-critic and option-critic training paradigms.\n\nFigure 2: A diagram describing our proposed hierarchical option-critic architecture.\nDotted lines represent processes within the agent while solid lines represent processes within the environment.\nOption selection is top down through the hierarchy and option termination is bottom up (represented with red dotted lines).\n\n4.1 Generalizing the Option Value Function to N Hierarchical Levels\n\nLike policy gradient methods and option-critic, the hierarchical options framework optimizes directly\nfor the discounted return expected over all trajectories starting at a state $s_0$ with active options $o^{1:N-1}_0$:\n\n$$\\rho(\\Omega^{1:N-1}, \\theta^{1:N}, \\phi^{1:N-1}, s_0, o^{1:N-1}_0) = E_{\\Omega^{1:N-1}, \\pi^{1:N}_{\\theta}, \\beta^{1:N-1}_{\\phi}}\\Big[\\sum_{t=0}^{\\infty} \\gamma^t r_{t+1} \\Big| s_0, o^{1:N-1}_0\\Big] \\tag{7}$$\n\nThis return depends on the policies and termination functions at each level of abstraction. 
We now\nconsider the option value function for reasoning about an option $o^\\ell$ at level $1 \\leq \\ell \\leq N$\nbased on the augmented state space $(s,o^{1:\\ell-1})$:\n\n$$Q_\\Omega(s,o^{1:\\ell-1}) = \\sum_{o^\\ell} \\pi^\\ell_{\\theta^\\ell}(o^\\ell|s,o^{1:\\ell-1}) Q_U(s,o^{1:\\ell}) \\tag{8}$$\n\nNote that in order to simplify our notation we write $o^\\ell$ as referring to both abstract and primitive\nactions. As a result, $o^N$ is equivalent to $a$, leveraging the primitive action space $A$. Extending the\nmeaning of $Q_U$ from [1], we define the corresponding value of executing an option in the presence of\nthe currently active higher level options by integrating out the lower level options:\n\n$$Q_U(s,o^{1:\\ell}) = \\sum_{o^N} ... \\sum_{o^{\\ell+1}} \\prod_{j=\\ell+1}^{N} \\pi^j(o^j|s,o^{1:j-1}) \\Big[ r(s,o^N) + \\gamma \\sum_{s'} P(s'|s,o^N) U(s',o^{1:\\ell-1}) \\Big]. \\tag{9}$$\n\nThe hierarchical option value function upon arrival $U$ with augmented state $(s,o^{1:\\ell-1})$ is defined as:\n\n$$U(s',o^{1:\\ell-1}) = \\underbrace{(1 - \\beta^{N-1}_{\\phi^{N-1}}(s',o^{1:N-1})) Q_\\Omega(s',o^{1:\\ell-1})}_{\\text{none terminate } (N \\geq 1)} + \\underbrace{V_\\Omega(s') \\prod_{j=N-1}^{1} \\beta^j_{\\phi^j}(s',o^{1:j})}_{\\text{all options terminate } (N \\geq 2)} + \\underbrace{Q_\\Omega(s',o^{1:\\ell-1}) \\sum_{q=N-1}^{\\ell} (1 - \\beta^{q-1}_{\\phi^{q-1}}(s',o^{1:q-1})) \\prod_{z=N-1}^{q} \\beta^z_{\\phi^z}(s',o^{1:z})}_{\\text{only lower level options terminate } (N \\geq 3)} + \\underbrace{\\sum_{i=1}^{\\ell-2} (1 - \\beta^i_{\\phi^i}(s',o^{1:i})) Q_\\Omega(s',o^{1:i}) \\prod_{k=i+1}^{N-1} \\beta^k_{\\phi^k}(s',o^{1:k})}_{\\text{some relevant higher level options terminate } (N \\geq 3)}. \\tag{10}$$\n\n
We explain the derivation of this equation\u00b9 in Appendix 1.1. Finally, before we can extend\nthe policy gradient theorem, we must establish the Markov chain along which we can measure\nperformance for options with $N$ levels of abstraction. This is derived in Appendix 1.2.\n\n4.2 Generalizing the Intra-option Policy Gradient Theorem\n\nWe can think of actor-critic architectures, generalizing to the option-critic architecture as well, as\npairing a critic with each actor network so that the critic has additional information about the value\nof the actor\u2019s actions that can be used to improve the actor\u2019s learning. However, this is derived\nby taking gradients with respect to the parameters of the policy while optimizing for the expected\ndiscounted return. The discounted return is approximated by a critic (i.e. value function) with the\nsame augmented state-space as the policy being optimized for. As examples, an actor-critic policy\n$\\pi(a|s)$ is optimized by taking the derivative of $V_\\pi(s)$ with respect to its parameters [30] and an\noption-critic policy $\\pi(a|s,o)$ is optimized by taking the derivative of $Q_\\Omega(s,o)$ with respect to\nits parameters [1]. The intra-option policy gradient theorem [1] is an important contribution, outlining how\nto optimize for a policy that is also associated with a termination function. As the policy over options\nin that work never terminates, it does not need a special training methodology and the option-critic\narchitecture allows the practitioner to pick their own method of learning the policy over options while\nusing Q-learning as an example in their experiments. We do the same for our highest level policy\n$\\pi^1$ that also never terminates. 
For all other policies $\\pi^{2:N}$ we perform a generalization of actor-critic\nlearning by providing a critic at each level and guiding gradients using the appropriate critic.\n\n\u00b9 Note that when no options terminate, as in the first term in equation (10), the lowest level option does not\nterminate and thus no higher level options have the opportunity to terminate.\n\nWe now seek to generalize the intra-option policy gradient theorem, deriving the update rule for a\npolicy at an arbitrary level of abstraction $\\pi^\\ell$ by taking the gradient with respect to $\\theta^\\ell$ using the value\nfunction with the same augmented state space $Q_\\Omega(s,o^{1:\\ell-1})$. Substituting from equation (8) we find:\n\n$$\\frac{\\partial Q_\\Omega(s,o^{1:\\ell-1})}{\\partial \\theta^\\ell} = \\frac{\\partial}{\\partial \\theta^\\ell} \\sum_{o^\\ell} \\pi^\\ell_{\\theta^\\ell}(o^\\ell|s,o^{1:\\ell-1}) Q_U(s,o^{1:\\ell}). \\tag{11}$$\n\nTheorem 1 (Hierarchical Intra-option Policy Gradient Theorem). Given an $N$ level hierarchical set of\nMarkov options with stochastic intra-option policies differentiable in their parameters $\\theta^\\ell$ governing\neach policy $\\pi^\\ell$, the gradient of the expected discounted return with respect to $\\theta^\\ell$ and initial conditions\n$(s_0,o^{1:N-1}_0)$ is:\n\n$$\\sum_{s,o^{1:\\ell-1}} \\mu_\\Omega(s,o^{1:\\ell-1}|s_0,o^{1:N-1}_0) \\sum_{o^\\ell} \\frac{\\partial \\pi^\\ell_{\\theta^\\ell}(o^\\ell|s,o^{1:\\ell-1})}{\\partial \\theta^\\ell} Q_U(s,o^{1:\\ell}),$$\n\nwhere $\\mu_\\Omega$ is a discounted weighting of augmented state tuples along trajectories starting from\n$(s_0,o^{1:N-1}_0)$: $\\mu_\\Omega(s,o^{1:\\ell-1}|s_0,o^{1:N-1}_0) = \\sum_{t=0}^{\\infty} \\gamma^t P(s_t = s, o^{1:\\ell-1}_t = o^{1:\\ell-1} | s_0, o^{1:N-1}_0)$. 
A proof is in\nAppendix 1.3.\n\n4.3 Generalizing the Termination Gradient Theorem\n\nWe now turn our attention to computing gradients for the termination functions $\\beta^\\ell$ at each level,\nassumed to be stochastic and differentiable with respect to the associated parameters $\\phi^\\ell$.\n\n$$\\frac{\\partial Q_\\Omega(s,o^{1:\\ell})}{\\partial \\phi^\\ell} = \\sum_{o^N} ... \\sum_{o^{\\ell+1}} \\prod_{j=\\ell+1}^{N} \\pi^j(o^j|s,o^{1:j-1}) \\gamma \\sum_{s'} P(s'|s,o^N) \\frac{\\partial U(s',o^{1:\\ell})}{\\partial \\phi^\\ell} \\tag{12}$$\n\nHence, the key quantity is the gradient of $U$. This is a natural consequence of call-and-return\nexecution, where termination function quality can only be evaluated upon entering the next state.\n\nTheorem 2 (Hierarchical Termination Gradient Theorem). Given an $N$ level hierarchical set of\nMarkov options with stochastic termination functions differentiable in their parameters $\\phi^\\ell$ governing\neach function $\\beta^\\ell$, the gradient of the expected discounted return with respect to $\\phi^\\ell$ and initial\nconditions $(s_1,o^{1:N-1}_0)$ is:\n\n$$-\\sum_{s,o^{1:\\ell}} \\mu_\\Omega(s,o^{1:\\ell}|s_1,o^{1:N-1}_0) \\prod_{i=\\ell+1}^{N-1} \\beta^i_{\\phi^i}(s,o^{1:i}) \\frac{\\partial \\beta^\\ell_{\\phi^\\ell}(s,o^{1:\\ell})}{\\partial \\phi^\\ell} A_\\Omega(s,o^{1:\\ell}),$$\n\nwhere $\\mu_\\Omega$ is a discounted weighting of augmented state tuples along trajectories starting\nfrom 
$(s_1,o^{1:N-1}_0)$: $\\mu_\\Omega(s,o^{1:\\ell}|s_1,o^{1:N-1}_0) = \\sum_{t=0}^{\\infty} \\gamma^t P(s_t = s, o^{1:\\ell}_t = o^{1:\\ell} | s_1, o^{1:N-1}_0)$. $A_\\Omega$ is the\ngeneralized advantage function over a hierarchical set of options\n$A_\\Omega(s',o^{1:\\ell}) = Q_\\Omega(s',o^{1:\\ell}) - V_\\Omega(s') \\Big[\\prod_{j=\\ell-1}^{1} \\beta^j_{\\phi^j}(s',o^{1:j})\\Big] - \\sum_{i=1}^{\\ell-1} (1 - \\beta^i_{\\phi^i}(s',o^{1:i})) Q_\\Omega(s',o^{1:i}) \\Big[\\prod_{k=i+1}^{\\ell-1} \\beta^k_{\\phi^k}(s',o^{1:k})\\Big]$. $A_\\Omega$ compares\nthe advantage of not terminating the current option with a probability weighted expectation based on\nthe likelihood that higher level options also terminate. In [1] this expression was simple as there was\nnot a hierarchy of higher level termination functions to consider. A proof is in Appendix 1.4.\n\nIt is interesting to see the emergence of an advantage function as a natural consequence of the\nderivation. As in [1] where this kind of relationship also appears, the advantage function gives the\ntheorem an intuitive interpretation. When the option choice is sub-optimal at level $\\ell$ with respect\nto the expected value of terminating option $\\ell$, the advantage function is negative and increases the\nodds of terminating that option. A new concept, not paralleled in the option-critic derivation, is\nthe inclusion of a $\\prod_{i=\\ell+1}^{N-1} \\beta^i_{\\phi^i}(s,o^{1:i})$ multiplicative factor. This can be interpreted as discounting\ngradients by the likelihood of this termination function being assessed as $\\beta^\\ell$ is only used if all lower\nlevel options terminate. This is a natural consequence of multi-level call and return execution.\n\n5 Experiments\n\nWe would now like to empirically validate the efficacy of our proposed hierarchical option-critic\n(HOC) model. We achieve this by exploring benchmarks in the tabular and non-linear function\napproximation settings. In each case we implement an agent that is restricted to primitive actions (i.e.\n$N = 1$), an agent that leverages the option-critic (OC) architecture (i.e. $N = 2$), and an agent with\nthe HOC architecture at level of abstraction $N = 3$. 
We will demonstrate that complex RL problems\nmay be more easily learned using more than two levels of abstraction and that the HOC architecture can\nsuccessfully facilitate this level of learning using data from scratch.\n\nFor our tabular architectures, we followed the protocol from [1] and chose to parametrize\nthe intra-option policies with softmax distributions and the terminations with sigmoid functions.\nThe policy over options was learned using intra-option Q-learning. We also implemented primitive actor-critic\n(AC) using a softmax policy. For the non-linear function approximation setting, we\ntrained our agents using A3C [19]. Our primitive action agents conduct A3C training\nusing a convolutional network when there is image input followed by an LSTM\nto contextualize the state. This way we ensure that benefits seen from options are orthogonal to\nthose seen from these common neural network building blocks. We follow [6] to extend A3C to\nthe Asynchronous Advantage Option-Critic (A2OC) and Asynchronous Advantage Hierarchical\nOption-Critic architectures (A2HOC). We include detailed algorithm descriptions for all of our\nexperiments in Appendix 2. We also conducted hyperparameter optimization that is summarized\nalong with detail on experimental protocol in Appendix 2. In all of our experiments, we made sure\nthat the two-level OC architecture had access to more total options than the three level alternative and\nthat the three level architecture did not include any additional hyperparameters. This ensures that\nempirical gains are the result of increasingly abstract options.\n\nFigure 3: Learning performance as a function of the abstraction level for a nonstationary four rooms domain\nwhere the goal location changes every episode.\n\n5.1 Tabular Learning Challenge Problems\n\nExploring four rooms: We first consider a navigation\ntask in the four-rooms domain [29]. 
Our goal\nis to evaluate the ability of a set of options learned fully autonomously to learn an efficient exploration\npolicy within the environment. The initial state and the goal state are drawn uniformly from all open\nnon-wall cells every episode. This setting is highly non-stationary, since the goal changes every episode.\nPrimitive movements can fail with probability 1/3, in which case the agent transitions randomly to one of\nthe empty adjacent cells. The reward is +1 at the goal and 0 otherwise. In Figure 3 we report the average\nnumber of steps taken in the last 100 episodes every 100 episodes, reporting the average of 50 runs with\ndifferent random seeds for each algorithm. We can clearly see that reasoning with higher levels of\nabstraction is critical to achieving a good exploration policy and that reasoning with three levels of\nabstraction results in more sample-efficient learning than\nreasoning with two levels of abstraction. For this experiment we explore four levels of abstraction as\nwell, but unfortunately there seem to be diminishing returns at least for this tabular setting.\n\nDiscrete stochastic decision process: Next, we consider a hierarchical RL challenge problem as\nexplored in [10] with a stochastic decision process where the reward depends on the history of visited\nstates in addition to the current state. There are 6 possible states and the agent always starts at $s_2$.\nThe agent moves left deterministically when it chooses the left action, but the right action only succeeds\nhalf of the time, resulting in a left move otherwise. The terminal state is $s_1$ and the agent receives a\nreward of 1 when it first visits $s_6$ and then $s_1$. The reward for going to $s_1$ without visiting $s_6$ is 0.01.\nIn Figure 4 we report the average reward over the last 100 episodes every 100 episodes, considering\n10 runs with different random seeds for each algorithm. 
Reasoning with higher levels of abstraction is again critical to performing well at this task with reasonable sample efficiency. Both OC learning and HOC learning converge to a high-quality solution surpassing the performance obtained in [10]. However, it seems that learning converges faster with three levels of abstraction than it does with just two.\n\nFigure 4: The diagram, from [10], details the stochastic decision process challenge problem. The chart compares learning performance across abstract reasoning levels.\n\n5.2 Deep Function Approximation Problems\n\nFigure 5: Building navigation learning performance across abstract reasoning levels.\n\nArchitecture    Clipped Reward\nA3C             8.43 \u00b1 2.29\nA2OC            10.56 \u00b1 0.49\nA2HOC           13.12 \u00b1 1.46\n\nMultistory building navigation: For an intuitive look at higher level reasoning, we consider the four rooms problem in a partially observed setting with an 11x17 grid at each level of a seven-level building. The agent has a receptive field size of 3 in both directions, so observations for the agent are 9-dimensional feature vectors with 0 in empty spots, 1 where there is a wall, 0.25 if there are stairs, or 0.5 if there is a goal location. The stairwells in the north east corner of the floor lead upstairs to the south west corner of the next floor up. Stairs in the south west corner of the floor lead down to the north east corner of the floor below. Agents start in a random location in the basement (which has no south west stairwell) and must navigate to the roof (which has no north east stairwell) to find the goal in a random location. The reward is +10 for finding the goal and -0.1 for hitting a wall. This task could seemingly benefit from abstraction such as a composition of sub-policies to get to the stairs at each intermediate level. We report the rolling mean and standard deviation of the reward.\n
In Figure 5 we see a qualitative difference between the policy learned with three levels of abstraction, which has high variance but fairly often finds the goal location, and those learned with less abstraction. A2OC and A3C are hovering around zero reward, which is equivalent to just learning a policy that does not run into walls.\n\nTable 1: Average clipped reward per episode over 5 runs on 21 Atari games.\n\nLearning many Atari games with one model: We finally consider application of the HOC to the Atari games [2]. Evaluation protocols for the Atari games are famously inconsistent [13], so to ensure fair comparisons we implement apples-to-apples versions of our baseline architectures deployed with the same code-base and environment settings. We put our models to the test and consider a very challenging setting [24] where a single agent attempts to learn many Atari games at the same time. Our agents attempt to learn 21 Atari games simultaneously, matching the largest previous multi-task setting on Atari [24]. Our tasks are hand picked to fall into three categories of related games, each with 7 games represented. The first category is games that include maze style navigation (e.g. MsPacman), the second category is mostly fully observable shooter games (e.g. SpaceInvaders), and the final category is partially observable shooter games (e.g. BattleZone). We train each agent by always sampling the game with the fewest training frames after each episode, ensuring the games are sampled very evenly throughout training. We also clip rewards to allow for equal learning rates across tasks [18]. We train each game for 10 million frames (210 million total) and report statistics on the clipped reward achieved by each agent when evaluating the policy without learning for another 3 million frames on each game across 5 separate training runs.\n
As our main metric, we report the summary of how each multi-task agent maximizes its reward in Table 1. While all agents struggle in this difficult setting, HOC is better able to exploit commonalities across games using fewer parameters and policies.\n\nAnalysis of Learned Options: An advantage of the multi-task setting is that it allows for a degree of quantitative interpretability regarding when and how options are used. We report characteristics of the agents with median performance during the evaluation period. A2OC with 16 options uses 5 options the bulk of the time, with the rest of the time largely split among another 6 options (Figure 6). The average number of time steps between switching options has a fairly narrow range across games, falling between 3.4 (Solaris) and 5.5 (MsPacman). In contrast, A2HOC with three options at each branch of the hierarchy learns to switch options at a rich range of temporal resolutions depending on the game. The high level options vary between an average of 3.2 (BeamRider) and 9.7 (Tutankham) steps before switching. Meanwhile, the low level options vary between an average of 1.5 (Alien) and 7.8 (Tutankham) steps before switching. In Appendix 2.4 we provide additional details about the average duration before switching options for each game. In Figure 7 we can see that the most used options for HOC are distributed fairly evenly across a number of games, while OC tends to specialize its options on a smaller number of games. In fact, the average share of usage dominated by a single game for the top 7 most used options is 40.9% for OC and only 14.7% for HOC.\n
Additionally, we can see that a hierarchy of options imposes structure in the space of options. For example, when o1 = 1 or o1 = 2 the low level options tend to focus on different situations within the same games.\n\nFigure 6: Option usage (left) and specialization across Atari games for the top 9 most used options (right) of a 16-option Option-Critic architecture trained in the many-task learning setting.\n\nFigure 7: Option usage (left) and specialization across Atari games (right) of a Hierarchical Option-Critic architecture with N = 3 and 3 options at each layer trained in the many-task learning setting.\n\n6 Conclusion\n\nIn this work we propose the first policy gradient theorems to optimize an arbitrarily deep hierarchy of options to maximize the expected discounted return. Moreover, we have proposed a particular hierarchical option-critic architecture that is the first general purpose reinforcement learning architecture to successfully learn options from data with more than two abstraction levels. We have conducted extensive empirical evaluation in the tabular and deep non-linear function approximation settings. In all cases we found that, for significantly complex problems, reasoning with more than two levels of abstraction can be beneficial for learning. While the performance of the hierarchical option-critic architecture is impressive, we envision our proposed policy gradient theorems eventually transcending it in overall impact. Although the architectures we explore in this paper have a fixed structure and fixed depth of abstraction for simplicity, the underlying theorems can also guide learning for much more dynamic architectures that we hope to explore in future work.\n\nAcknowledgements\n\nThe authors thank Murray Campbell, Xiaoxiao Guo, Ignacio Cases, and Tim Klinger for fruitful discussions that helped shape this work.\n\nReferences\n\n[1] Pierre-Luc Bacon, Jean Harb, and Doina Precup.\n
The option-critic architecture. 2017.\n\n[2] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253\u2013279, June 2013.\n\n[3] Christian Daniel, Herke Van Hoof, Jan Peters, and Gerhard Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104(2-3):337\u2013357, 2016.\n\n[4] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271\u2013278, 1993.\n\n[5] Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.\n\n[6] Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017.\n\n[7] Martin Klissarov, Pierre-Luc Bacon, Jean Harb, and Doina Precup. Learnings options end-to-end for continuous action tasks. arXiv preprint arXiv:1712.00004, 2017.\n\n[8] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008\u20131014, 2000.\n\n[9] George Konidaris, Scott Kuindersma, Roderic A Grupen, and Andrew G Barto. Autonomous skill acquisition on a mobile manipulator. In AAAI, 2011.\n\n[10] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675\u20133683, 2016.\n\n[11] Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948, 2017.\n\n[12] Kfir Y Levy and Nahum Shimkin. Unified inter and intra options learning using policy gradient methods. In European Workshop on Reinforcement Learning, pages 153\u2013164.\n
Springer, 2011.\n\n[13] Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009, 2017.\n\n[14] Daniel J Mankowitz, Timothy A Mann, and Shie Mannor. Adaptive skills adaptive partitions (ASAP). In Advances in Neural Information Processing Systems, pages 1588\u20131596, 2016.\n\n[15] Timothy A Mann, Shie Mannor, and Doina Precup. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research, 53:375\u2013438, 2015.\n\n[16] Amy McGovern and Andrew G Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 361\u2013368. Morgan Kaufmann Publishers Inc., 2001.\n\n[17] Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut\u2014dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pages 295\u2013306. Springer, 2002.\n\n[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\n[19] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928\u20131937, 2016.\n\n[20] Scott D Niekum. Semantically grounded learning from unstructured demonstrations. University of Massachusetts Amherst, 2013.\n\n[21] Doina Precup. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst, 2000.\n\n[22] Martin L Puterman.\n
Markov decision processes: Discrete dynamic stochastic programming, 92\u201393, 1994.\n\n[23] Himanshu Sahni, Saurabh Kumar, Farhan Tejani, and Charles Isbell. Learning to compose skills. arXiv preprint arXiv:1711.11289, 2017.\n\n[24] S. Sharma, A. Jha, P. Hegde, and B. Ravindran. Learning to multi-task by active sampling. arXiv preprint arXiv:1702.06053, 2017.\n\n[25] Tianmin Shu, Caiming Xiong, and Richard Socher. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning. arXiv preprint arXiv:1712.07294, 2017.\n\n[26] David Silver and Kamil Ciosek. Compositional planning using optimal option models. arXiv preprint arXiv:1206.6473, 2012.\n\n[27] \u00d6zg\u00fcr \u015eim\u015fek and Andrew G Barto. Skill characterization based on betweenness. In Advances in neural information processing systems, pages 1497\u20131504, 2009.\n\n[28] Martin Stolle and Doina Precup. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pages 212\u2013223. Springer, 2002.\n\n[29] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181\u2013211, 1999.\n\n[30] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057\u20131063, 2000.\n\n[31] Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, et al. Strategic attentive writer for learning macro-actions. In Advances in neural information processing systems, pages 3486\u20133494, 2016.\n\n[32] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu.\n
Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.\n", "award": [], "sourceid": 6687, "authors": [{"given_name": "Matthew", "family_name": "Riemer", "institution": "IBM Research AI"}, {"given_name": "Miao", "family_name": "Liu", "institution": "IBM"}, {"given_name": "Gerald", "family_name": "Tesauro", "institution": "IBM TJ Watson Research Center"}]}