{"title": "Language as an Abstraction for Hierarchical Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9419, "page_last": 9431, "abstract": "Solving complex, temporally-extended tasks is a long-standing problem in reinforcement learning (RL). We hypothesize that one critical element of solving such problems is the notion of compositionality. With the ability to learn sub-skills that can be composed to solve longer tasks, i.e. hierarchical RL, we can acquire temporally-extended behaviors. However, acquiring effective yet general abstractions for hierarchical RL is remarkably challenging. In this paper, we propose to use language as the abstraction, as it provides unique compositional structure, enabling fast learning and combinatorial generalization, while retaining tremendous flexibility, making it suitable for a variety of problems. Our approach learns an instruction-following low-level policy and a high-level policy that can reuse abstractions across tasks, in essence, permitting agents to reason using structured language. To study compositional task learning, we introduce an open-source object interaction environment built using the MuJoCo physics engine and the CLEVR engine. We find that, using our approach, agents can learn to solve diverse, temporally-extended tasks such as object sorting and multi-object rearrangement, including from raw pixel observations. Our analysis finds that the compositional nature of language is critical for learning and systematically generalizing sub-skills in comparison to non-compositional abstractions that use the same supervision.", "full_text": "Language as an Abstraction\n\nfor Hierarchical Deep Reinforcement Learning\n\nYiding Jiang*, Shixiang Gu, Kevin Murphy, Chelsea Finn\n\n{ydjiang,shanegu,kpmurphy,chelseaf}@google.com\n\nGoogle Research\n\nAbstract\n\nSolving complex, temporally-extended tasks is a long-standing problem in reinforcement learning (RL). 
We hypothesize that one critical element of solving such problems is the notion of compositionality. With the ability to learn concepts and sub-skills that can be composed to solve longer tasks, i.e. hierarchical RL, we can acquire temporally-extended behaviors. However, acquiring effective yet general abstractions for hierarchical RL is remarkably challenging. In this paper, we propose to use language as the abstraction, as it provides unique compositional structure, enabling fast learning and combinatorial generalization, while retaining tremendous flexibility, making it suitable for a variety of problems. Our approach learns an instruction-following low-level policy and a high-level policy that can reuse abstractions across tasks, in essence, permitting agents to reason using structured language. To study compositional task learning, we introduce an open-source object interaction environment built using the MuJoCo physics engine and the CLEVR engine. We find that, using our approach, agents can learn to solve diverse, temporally-extended tasks such as object sorting and multi-object rearrangement, including from raw pixel observations. Our analysis reveals that the compositional nature of language is critical for learning diverse sub-skills and systematically generalizing to new sub-skills in comparison to non-compositional abstractions that use the same supervision.2\n\n1 Introduction\n\nDeep reinforcement learning offers a promising framework for enabling agents to autonomously acquire complex skills, and has demonstrated impressive performance on continuous control problems [35, 56] and games such as Atari [41] and Go [59]. However, the ability to learn a variety of compositional, long-horizon skills while generalizing to novel concepts remains an open challenge. Long-horizon tasks demand sophisticated exploration strategies and structured reasoning, while generalization requires suitable representations. 
In this work, we consider the question: how can we leverage the compositional structure of language to enable agents to perform long-horizon tasks and systematically generalize to new goals?\n\nTo do so, we build upon the framework of hierarchical reinforcement learning (HRL), which offers a potential solution for learning long-horizon tasks by training a hierarchy of policies. However, the abstraction between these policies is critical for good performance. Hard-coded abstractions often lack modeling flexibility and are task-specific [63, 33, 26, 47], while learned abstractions often find degenerate solutions without careful tuning [5, 24]. One possible solution is to have the higher-level policy generate a sub-goal state and have the low-level policy try to reach that goal state [42, 36]. However, using goal states still lacks some degree of flexibility (e.g. in comparison to goal regions or attributes), is challenging to scale to visual observations naively, and does not generalize systematically to new goals.\n\n*Work done as a part of the Google AI Residency program\n2Code and videos of the environment and experiments are at https://sites.google.com/view/hal-demo\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n(a) Goal is g0: \"There is a red ball; are there any matte cyan spheres right of it?\". Currently Ψ(st, g0) = 0\n\n(b) Agent performs actions, interacting with the environment to try to satisfy the goal.\n\n(c) Resulting state st+1 does not satisfy g0, so relabel the goal to g′: \"There is a green sphere; are there any rubber cyan balls behind it?\" so that Ψ(st+1, g′) = 1\n\nFigure 1: The environment and some instructions that we consider in this work, along with an illustration of hindsight instruction relabeling (HIR), which we use to enable the agent to learn from many different language goals at once (details in Section 4.2). 
In contrast to these prior approaches, language is a flexible representation for transferring a variety of ideas and intentions with minimal assumptions about the problem setting; its compositional nature makes it a powerful abstraction for representing combinatorial concepts and for transferring knowledge [22].\n\nIn this work, we propose to use language as the interface between high- and low-level policies in hierarchical RL. With a low-level policy that follows language instructions (Figure 1), the high-level policy can produce actions in the space of language, yielding a number of appealing benefits. First, the low-level policy can be re-used for different high-level objectives without retraining. Second, the high-level policies are human-interpretable as the actions correspond to language instructions, making it easier to recognize and diagnose failures. Third, language abstractions can be viewed as a strict generalization of goal states, as an instruction can represent a region of states that satisfy some abstract criteria, rather than the entirety of an individual goal state. Finally, studies have also suggested that humans use language as an abstraction for reasoning and planning [21, 49]. In fact, the majority of the knowledge and skills we acquire throughout our lives is learned through language.\n\nWhile language is an appealing choice as the abstraction for hierarchical RL, training a low-level policy to follow language instructions is highly non-trivial [20, 6] as it involves learning from binary rewards that indicate completion of the instruction. 
To address this problem, we generalize prior work on goal relabeling to the space of language instructions, which describe regions of state space rather than a single state, allowing the agent to learn from many language instructions at once.\n\nTo empirically study the role of language abstractions for long-horizon tasks, we introduce a new environment inspired by the CLEVR engine [28] that consists of procedurally-generated scenes of objects that are paired with programmatically-generated language descriptions. The low-level policy's objective is to manipulate the objects in the scene such that a description or statement is satisfied by the arrangement of objects in the scene. We find that our approach is able to learn a variety of vision-based long-horizon manipulation tasks such as object reconfiguration and sorting, while outperforming state-of-the-art RL and hierarchical RL approaches. Further, our experimental analysis finds that HRL with non-compositional abstractions struggles to learn the tasks, even when the non-compositional abstraction is derived from language instructions themselves, demonstrating the critical role of compositionality in learning. Lastly, we find that our instruction-following agent is able to generalize to instructions that are systematically different from those seen during training.\n\nIn summary, the main contribution of our work is three-fold:\n\n1. a framework for using language abstractions in HRL, with which we find that the structure and flexibility of language enables agents to solve challenging long-horizon control problems\n\n2. an open-source continuous control environment for studying compositional, long-horizon tasks, integrated with language instructions inspired by the CLEVR engine [28]\n\n3. 
empirical analysis that studies the role of compositionality in learning long-horizon tasks and achieving systematic generalization\n\n2 Related Work\n\nDesigning, discovering and learning meaningful and effective abstractions of MDPs has been studied extensively in hierarchical reinforcement learning (HRL) [15, 46, 63, 16, 5]. Classically, the work on HRL has focused on learning only the high-level policy given a set of hand-engineered low-level policies [60, 38, 9], or, more generally, option policies with flexible termination conditions [63, 52].\n\nRecent HRL works have begun to tackle more difficult control domains with both large state spaces and long planning horizons [26, 34, 65, 17, 42, 43]. These works can typically be categorized into two approaches. The first aims to learn effective low-level policies end-to-end directly from final task rewards with minimal human engineering, such as through the option-critic architecture [5, 24] or multi-task or meta learning [18, 58]. While appealing in theory, this end-to-end approach relies solely on final task rewards and has been shown to scale poorly to complex domains [5, 42], unless distributions of tasks are carefully designed [18]. The second approach instead augments the low-level learning with auxiliary rewards that can bring better inductive bias. These rewards include mutual information-based diversity rewards [13, 17], hand-crafted rewards based on domain knowledge [33, 26, 34, 65], and goal-oriented rewards [15, 55, 4, 69, 42, 43]. Goal-oriented rewards have been shown to balance sufficient inductive bias for effective learning with minimal domain-specific engineering, and achieve performance gains on a range of domains [69, 42, 43]. Our work is a generalization of these lines of work, representing goal regions using language instructions, rather than individual goal states. 
Here, a region refers to a set of states (possibly disjoint and far apart from each other) that satisfy a more abstract criterion (e.g. \"red ball in front of blue cube\" can be satisfied by infinitely many states that are drastically different from each other in pixel space), rather than a simple ε-ball around a single goal state, which exists only to make the goal a reachable set of non-zero measure. Further, our experiments demonstrate significant empirical gains over these prior approaches.\n\nSince our low-level policy training is related to goal-conditioned HRL, we can benefit from algorithmic advances in multi-goal reinforcement learning [29, 55, 4, 50]. In particular, we extend the recently popularized goal relabeling strategy [29, 4] to instructions, allowing us to relabel based on achieving a language statement that describes a region of state space, rather than relabeling based on reaching an individual state.\n\nLastly, there are a number of prior works that study how language can guide or improve reinforcement learning [37, 19, 30, 6, 20, 12, 10]. While prior work has made use of language-based sub-goal policies in hierarchical RL [57, 14], the instruction representations used lack the diversity needed to benefit from the compositionality of language over one-hot goal representations. In concurrent work, Wu et al. [70] show that language can help with learning difficult tasks where more naive goal representations lead to poor performance, even with hindsight goal relabeling. While we are also interested in using language to improve learning of challenging tasks, we focus on the use of language in the context of hierarchical RL, demonstrating that language can be further used to compose complex objectives for the agent. Andreas et al. 
[3] leverage language descriptions to rapidly adapt to unseen environments through structured policy search in the space of language; each environment is described by one sentence. In contrast, we show that a high-level policy can effectively leverage the combinatorially many sub-policies induced by language by generating a sequence of instructions for the low-level agent. Further, we use language not only for adaptation but also for learning the low-level control primitives, without the need for imitation learning from an expert. Another line of work focuses on RL for textual adventure games where the state is represented as language descriptions and the actions are either the textual actions available at each state [25] or all possible actions [45] (even though not every action is applicable to all states). In general, these works look at text-based games with discrete 1-bit actions, while we consider continuous actions in physics-based environments. One may view the latter as a high-level policy with oracular low-level policies that are specific to each state; the discrete nature of these games entails limited complexity of interactions with the environment.\n\n3 Preliminaries\n\nStandard reinforcement learning. The typical RL problem considers a Markov decision process (MDP) defined by the tuple (S, A, T, R, γ), where S is the state space, A is the action space, the unknown transition probability T : S × A × S → [0, ∞) represents the probability density of reaching st+1 ∈ S from st ∈ S by taking the action at ∈ A, γ ∈ [0, 1) is the discount factor, and the bounded real-valued function R : S × A → [rmin, rmax] represents the reward of each transition. We further denote ρπ(st) and ρπ(st, at) as the state marginal and the state-action marginal of the trajectory induced by policy π(at|st). 
The objective of reinforcement learning is to find a policy π(at|st) such that the expected discounted future reward Σt E(st,at)∼ρπ [γ^t R(st, at)] is maximized.\n\nGoal conditioned reinforcement learning. In goal-conditioned RL, we work with an Augmented Markov Decision Process, which is defined by the tuple (S, G, A, T, R, γ). Most elements represent the same quantities as in a standard MDP. The additional tuple element G is the space of all possible goals, and the reward function R : S × A × G → [rmin, rmax] represents the reward of each transition under a given goal. Similarly, the policy π(at|st, g) is now conditioned on g. Finally, pg(g) represents a distribution over G. The objective of goal-conditioned reinforcement learning is to find a policy π(at|st, g) such that the expected discounted future reward Σt Eg∼pg,(st,at)∼ρπ [γ^t R(st, at, g)] is maximized. While this objective can be expressed with a standard MDP by augmenting the state vector with a goal vector, the policy does not change the goal; the explicit distinction between goal and state facilitates discussion later.\n\nQ-learning. Q-learning is a large class of off-policy reinforcement learning algorithms that focuses on learning the Q-function, Q*(st, at), which represents the expected total discounted reward that can be obtained after taking action at in state st, assuming the agent acts optimally thereafter. 
It can be recursively defined as:\n\nQ*(st, at) = Est+1 [R(st, at) + γ maxa∈A Q*(st+1, a)]   (1)\n\nThe learned optimal policy can be recovered through π*(at|st) = δ(at = arg maxa∈A Q*(st, a)). In high-dimensional spaces, the Q-function is usually represented with function approximators and fit using transition tuples, (st, at, st+1, rt), which are stored in a replay buffer [41].\n\nHindsight experience replay (HER). HER [4] is a data augmentation technique for off-policy goal-conditioned reinforcement learning. For simplicity, assume that the goal is specified directly in the state space. A trajectory can be transformed into a sequence of goal-augmented transition tuples (st, at, sg, st+1, rt). We can relabel each tuple's sg with st+1 or other future states visited in the trajectory and adjust rt to be the appropriate value. This makes the otherwise sparse reward signal much denser. This technique can also be seen as generating an implicit curriculum of increasing difficulty for the agent as it learns to interact with the environment more effectively.\n\n4 Hierarchical Reinforcement Learning with Language Abstractions\n\nIn this section, we present our framework for training a 2-layer hierarchical policy with compositional language as the abstraction between the high-level policy and the low-level policy. We open the exposition by formalizing the problem of solving temporally-extended tasks with language, including our assumptions regarding the availability of supervision. We then discuss how we can efficiently train the low-level policy πl(a|st, g), conditioned on language instructions g, in Section 4.2, and how a high-level policy, πh(g|st), can be trained using such a low-level policy in Section 4.3. 
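As a concrete illustration of the bootstrapped target in Equation (1), the sketch below computes r + γ max_a Q(s', a) for a batch of transitions; the array shapes and function name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def q_learning_targets(q_next, rewards, gamma=0.99):
    """Bootstrapped targets for Eq. (1): r + gamma * max_a Q(s', a).

    q_next:  array of shape (batch, |A|), Q-values of the next states
    rewards: array of shape (batch,)
    """
    return rewards + gamma * q_next.max(axis=1)

# Toy check: two transitions with known next-state Q-values.
q_next = np.array([[1.0, 3.0], [0.5, 0.2]])
rewards = np.array([0.0, 1.0])
targets = q_learning_targets(q_next, rewards, gamma=0.5)
# targets == [0.0 + 0.5*3.0, 1.0 + 0.5*0.5] == [1.5, 1.25]
```

In a DQN-style learner, these targets would be regressed against Q(st, at) on mini-batches sampled from the replay buffer.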
We refer to this framework as Hierarchical Abstraction with Language (HAL, Figure 2, Appendix C.1).\n\n4.1 Problem statement\n\nFigure 2: HAL: The high-level policy πh produces language instructions g for the low-level policy πl.\n\nWe are interested in learning temporally-extended tasks by leveraging the compositionality of language. Thus, in addition to the standard reinforcement learning assumptions laid out in Section 3, we also need some form of grounded language supervision in the environment E during training. To this end, we assume access to a conditional density ω(g|s) that maps an observation s to a distribution over language statements g ∈ G that describe s. This distribution can take the form of a supervised image captioning model, a human supervisor, or a functional program that is executed on st, similar to CLEVR. Further, we define Ω(st) to be the support of ω(g|st). Moreover, we assume access to a function Ψ that maps a state and an instruction to a single Boolean value indicating whether the instruction is satisfied by the state, i.e. Ψ : S × G → {0, 1}. Once again, Ψ can be a VQA model, a human supervisor, or a program. Note that any goal specified in the state space can be easily expressed by a Boolean function of this form by checking whether two states are close to each other up to some threshold parameter. Ψ can effectively act as the reward for the low-level policy.\n\nAn example of the high-level tasks is arranging objects in the scene according to a specific spatial relationship. 
This can mean putting the objects in a specific arrangement or ordering them according to their colors (Figure 3) by pushing the objects around (Figure 1).\n\n(a) Object arrangement (b) Object ordering (c) Object sorting (d) Color ordering (e) Shape ordering (f) Color & shape ordering\n\nFigure 3: Sample goal states for the high-level tasks in the standard (a-c) and diverse (d-f) environments. The high-level policy only receives reward if all constraints are satisfied. The global location of the objects may vary.\n\nDetails of these high-level tasks are described in Section 5. These tasks are complex but can be naturally decomposed into smaller sub-tasks, giving rise to a naturally defined hierarchy, making them an ideal testbed for HRL algorithms. Problems of a similar nature include organizing a cluttered table top or stacking LEGO blocks to construct structures such as a castle or a bridge. We train the low-level policy πl(a|s, g) to solve an augmented MDP as described in Section 3. For simplicity, we assume that ω is uniform over its support Ω(st). The low-level policy receives supervision from Ω and Ψ by completing instructions. The high-level policy πh(g|s) is trained to solve a standard MDP whose state space is the same S as the low-level policy's, and whose action space is G. In this case, the high-level policy's supervision comes from the reward function of the environment, which may be highly sparse.\n\nWe separately train the high-level policy and the low-level policy, so the low-level policy is agnostic to the high-level policy. Since the policies share the same G, the low-level policy can be reused for different high-level policies (Appendix C.3). 
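The language supervision assumed in Section 4.1 amounts to two callables, Ψ and Ω. The sketch below is purely illustrative: a toy spatial predicate stands in for the CLEVR-style program, and all names and the state encoding are hypothetical:

```python
from typing import Dict, List, Tuple

# Toy state: object name -> (x, y) position on the table (illustrative only).
State = Dict[str, Tuple[float, float]]

def psi(state: State, instruction: str) -> bool:
    """Toy stand-in for the predicate Psi(s, g): one hard-coded relation."""
    if instruction == "red ball right of blue cube":
        return state["red_ball"][0] > state["blue_cube"][0]
    return False

def omega(state: State) -> List[str]:
    """Toy stand-in for Omega(s): all candidate instructions satisfied by s."""
    candidates = ["red ball right of blue cube"]
    return [g for g in candidates if psi(state, g)]

s = {"red_ball": (2.0, 0.0), "blue_cube": (1.0, 0.0)}
# psi(s, "red ball right of blue cube") -> True, so omega(s) contains it.
```

In the actual environment, Ψ and Ω are implemented by the CLEVR-style language engine rather than hand-coded predicates.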
Jointly fine-tuning the low-level policy with a specific high-level policy is certainly a possible direction for future work (Appendix C.1).\n\n4.2 Training a language-conditioned low-level policy\n\nTo train a goal-conditioned low-level policy, we need to define a suitable reward function for training such a policy and a mechanism for sampling language instructions. A straightforward way to represent the reward for the low-level policy would be R(st, at, st+1, g) = Ψ(st+1, g) or, to ensure that at is inducing the reward:\n\nR(st, at, st+1, g) = 0 if Ψ(st+1, g) = 0, and Ψ(st+1, g) ⊕ Ψ(st, g) if Ψ(st+1, g) = 1\n\nthat is, the agent is rewarded only when its action makes a previously unsatisfied instruction satisfied. However, optimizing with this reward directly is difficult because the reward signal is only non-zero when the goal is achieved. Unlike prior work (e.g. HER [4]), which uses a state vector or a task-relevant part of the state vector as the goal, it is difficult to define meaningful distance metrics in the space of language statements [8, 53, 61], and, consequently, difficult to make the reward signal smooth by assigning partial credit (unlike, e.g., the ℓp norm of the difference between two states). To overcome these difficulties, we propose a trajectory relabeling technique for language instructions: instead of relabeling the trajectory with states reached, we relabel states in the trajectory τ with elements of Ω(st) as the goal instruction, using a relabeling strategy S. We refer to this procedure as hindsight instruction relabeling (HIR). The details of S are in Algorithm 4 in Appendix C.4. Pseudocode for the method can be found in Algorithm 2 in Appendix C.2 and an illustration of the process can be found in Figure 1.\n\nThe proposed relabeling scheme, HIR, is reminiscent of HER [4]. In HER, the goal is often the state or a simple function of the state, such as a masked representation. 
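A minimal sketch of hindsight instruction relabeling under these definitions, with a naive uniform choice standing in for the relabeling strategy S (the actual strategy is Algorithm 4 in Appendix C.4); Ω and Ψ are passed in as callables as above:

```python
import random
from typing import Callable, List, Sequence

def hindsight_instruction_relabel(
    states: Sequence,                      # s_0 ... s_T of one trajectory
    actions: Sequence,                     # a_0 ... a_{T-1}
    omega: Callable[[object], List[str]],  # Omega(s): instructions satisfied by s
    psi: Callable[[object, str], bool],    # Psi(s, g)
):
    """Relabel each transition with an instruction achieved in hindsight."""
    relabeled = []
    for t in range(len(actions)):
        achieved = omega(states[t + 1])
        if not achieved:
            continue
        g = random.choice(achieved)        # naive stand-in for the strategy S
        # Reward is 1 only when g becomes newly satisfied (Psi flips 0 -> 1).
        r = float(psi(states[t + 1], g) and not psi(states[t], g))
        relabeled.append((states[t], actions[t], g, states[t + 1], r))
    return relabeled

# Toy check: states are sets of currently satisfied instructions.
states = [set(), {"red ball right of cube"}, {"red ball right of cube"}]
tuples = hindsight_instruction_relabel(
    states, actions=[0, 1],
    omega=lambda s: sorted(s), psi=lambda s, g: g in s)
# Two relabeled tuples; reward 1.0 at t=0 (newly satisfied), 0.0 at t=1.
```

The relabeled tuples would then be pushed into the replay buffer alongside the original (mostly zero-reward) transitions.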
However, with high-dimensional observation spaces such as images, there is excessive information in the state that is irrelevant to the goal, while task-relevant information is not readily accessible. While one can use HER with generative models of images [54, 51, 44], the resulting representation of the state may not effectively capture the relevant aspects of the desired task. Language can be viewed as an alternative, highly-compressed representation of the state that explicitly exposes the task structure, e.g. the relations between objects. Thus, we can readily apply HIR to settings with image observations.\n\n4.3 Acting in language with the high-level policy\n\nWe aim to learn a high-level policy for long-horizon tasks that can explore and act in the space of language by providing instructions g to the low-level policy πl(a|st, g). The use of language abstractions through g allows the high-level policy to explore in a structured way, with actions that are semantically meaningful and span multiple low-level actions.\n\nIn principle, the high-level policy, πh(g|s), can be trained with any reinforcement learning algorithm, given a suitable way to generate sentences for the goals. However, generating coherent sequences of discrete tokens is difficult, particularly when combined with existing reinforcement learning algorithms. We explore how we might incorporate a language model into the high-level policy in Appendix A, which shows promising preliminary results but also significant challenges. Fortunately, while the size of the instruction space G scales exponentially with the size of the vocabulary, the elements of G are naturally structured and redundant – many elements correspond to effectively the same underlying instruction with different synonyms or grammar. While the low-level policy 
While the low-level policy\nunderstands all the different instructions, in many cases, the high-level policy only needs to generate\ninstruction from a much smaller subset of G to direct the low-level policy. We denote such subsets of\nG as I.\n\nIf I is relatively small, the problem can be recast as a discrete-action RL problem, where one action\nchoice corresponds to an instruction, and can be solved with algorithms such as DQN [41]. We adopt\nthis simple approach in this work. As the instruction often represents a sequence of low-level actions,\nwe take T \u2032 actions with the low-level policy for every high-level instruction. T \u2032 can be a \ufb01xed number\nof steps, or computed dynamically by a terminal policy learned by the low-level policy like the option\nframework. We found that simply using a \ufb01xed T \u2032 was suf\ufb01cient in our experiments.\n\n5 The Environment and Implementation\n\nEnvironment. To empirically study how compositional languages can aid in long-horizon reasoning\nand generalization, we need an environment that will test the agent\u2019s ability to do so. While prior\nworks have studied the use of language in navigation [1], instruction following in a grid-world [10],\nand compositionality in question-answering, we aim to develop a physical simulation environment\nwhere the agent must interact with and change the environment in order to accomplish long-horizon,\ncompositional tasks. These criteria are particularly appealing for robotic applications, and, to the best\nof our knowledge, none of the existing benchmarks simultaneously ful\ufb01lls all of them. To this end,\nwe developed a new environment using the MuJoCo physics engine [66] and the CLEVR language\nengine, that tests an agents ability to manipulate and rearrange objects of various shapes and colors.\nTo succeed in this environment, the agent must be able to handle varying number of objects with\ndiverse visual and physical properties. 
Two versions of the environment, of varying complexity, are illustrated in Figures 3 and 1, and further details are in Appendix B.1.\n\nHigh-level tasks. We evaluate our framework on 6 challenging temporally-extended tasks across two environments, all illustrated in Figure 3: (a) object arrangement: manipulate objects such that 10 pair-wise constraints are satisfied, (b) object ordering: order objects by color, (c) object sorting: arrange 4 objects around a central object, and, in a more diverse environment, (d) color ordering: order objects by color irrespective of shape, (e) shape ordering: order objects by shape irrespective of color, and (f) color & shape ordering: order objects by both shape and color. In all cases, the agent receives a binary reward only if all constraints are satisfied. Consequently, obtaining meaningful signal in these tasks is extremely challenging, as only a very small number of action sequences will yield non-zero signal. For more details, see Appendix B.2.\n\nAction and observation parameterization. The state-based observation is s ∈ R10, representing the location of each object, and |A| = 40, which corresponds to picking an object and pushing it in one of eight directions. The image-based observation is s ∈ R64×64×3, the rendering of the scene, and |A| = 800, which corresponds to picking a location in a 10 × 10 grid and pushing in one of eight directions. For more details, see Appendix B.1.\n\nPolicy parameterization. The low-level policy encodes the instruction with a GRU and feeds the result, along with the state, into a neural network that predicts the Q-value of each action. The high-level policy is also a neural network Q-function. Both use Double DQN [68] for training. 
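To make the two-level interaction from Section 4.3 concrete, the control loop (a discrete-action high-level Q-function choosing instructions from I, with the low-level policy executed for a fixed number of steps per instruction) can be sketched as below; all callables are illustrative stand-ins, not the actual networks:

```python
from typing import Callable, List

T_PRIME = 5  # low-level steps executed per high-level instruction

def hal_episode(
    env_step: Callable,       # (state, action) -> next_state
    high_q: Callable,         # (state) -> list of Q-values, one per instruction in I
    low_policy: Callable,     # (state, instruction) -> low-level action
    instructions: List[str],  # the reduced instruction set I
    s0,
    high_steps: int = 4,
):
    """Roll out the two-level policy: greedy argmax over I, then T' low-level steps."""
    s, executed = s0, []
    for _ in range(high_steps):
        q_values = high_q(s)
        best = max(range(len(instructions)), key=lambda i: q_values[i])
        g = instructions[best]
        executed.append(g)
        for _ in range(T_PRIME):
            s = env_step(s, low_policy(s, g))
    return s, executed

# Toy check: an integer "state" where every low-level action adds 1.
s, executed = hal_episode(
    env_step=lambda s, a: s + a,
    high_q=lambda s: [0.0, 1.0],   # always prefers the second instruction
    low_policy=lambda s, g: 1,
    instructions=["left", "right"],
    s0=0,
    high_steps=2,
)
# s == 10 (2 high-level steps x 5 low-level steps); executed == ["right", "right"]
```

During training, exploration (e.g. ε-greedy at both levels) would replace the pure argmax shown here.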
The high-level policy uses sets of 80 and 240 instructions as the action space in the standard and diverse environments respectively, sets that sufficiently cover the relationships between objects. We roll out the low-level policy for T′ = 5 steps for every high-level instruction. For details, see Appendix B.3.\n\n6 Experiments\n\nTo evaluate our framework, and to study the role of compositionality in RL in general, we design the experiments to answer the following questions: (1) As a representation, how does language compare to alternative representations, such as those that are not explicitly compositional? (2) How well does the framework scale with the diversity of instructions and the dimensionality of the state (e.g. vision-based observations)? (3) Can the policy generalize in systematic ways by leveraging the structure of language? (4) Overall, how does our approach compare to state-of-the-art hierarchical RL approaches, and to learning flat, homogeneous policies?\n\nTo answer these questions, in Section 6.1 we first evaluate and analyze the training of effective low-level policies, which are critical for effective learning of long-horizon tasks. Then, in Section 6.2, we evaluate the full HAL method on challenging temporally-extended tasks. Finally, we apply our method to the Crafting environment from Andreas et al. [2] to showcase the generality of our framework. For details on the experimental set-up and analysis, see Appendix D and E.\n\n6.1 Low-level Policy\n\nRole of compositionality and relabeling. We start by evaluating the fidelity of the low-level instruction-following policy, in isolation, with a variety of representations for the instruction. For these experiments, we use state-based observations. We start with a set of 600 instructions, which we paraphrase and substitute synonyms into to obtain more than 10,000 total instructions, which allows us to answer the first part of (2). 
We evaluate the performance of all low-level policies by the average number of instructions they can successfully achieve per episode (100 steps), measured over 100 episodes. To answer (1), and to evaluate the importance of compositionality, we compare against:\n\n• a one-hot encoded representation of instructions where each instruction has its own row in a real-valued embedding matrix, using the same instruction relabeling (see Appendix D.1)\n\n• a non-compositional latent variable representation with identical information content. We train a sequence auto-encoder on sentences, which achieves zero reconstruction error and is hence a lossless non-compositional representation of the instructions (see Appendix D.2)\n\n• a bag-of-words (BOW) representation of instructions (see Appendix D.3)\n\nIn the first comparison, we observe that while the one-hot encoded representation works on par with or better than language in the regime where the number of instructions is small, its performance quickly deteriorates as the number of instructions increases (Fig. 4, middle). On the other hand, the language representation of instructions can leverage the structure shared by different instructions and does not suffer from an increasing number of instructions (Fig. 4, right, blue); in fact, an improvement in performance is observed. This suggests, perhaps unsurprisingly, that one-hot representations and state-based relabeling scale poorly to large numbers of instructions, even when the underlying number of instructions does not change, while, with instruction relabeling (HIR), the policy acquires better, more successful representations as the number of instructions increases.\n\nIn the second comparison, we observe that the agent is unable to make meaningful progress with this representation despite receiving an identical amount of supervision as with language. This indicates that the compositionality of language is critical for effective learning. 
We also find that relabeling is critical for good performance: without it (no HIR), the reward signal is significantly sparser.

Finally, in the comparison to the bag-of-words representation (BOW), we observe that, while the BOW agent's return increases faster than the language agent's early in training (likely due to the difficulty of optimizing the recurrent neural network in the language agent), the language agent achieves significantly better final performance: the BOW agent's performance plateaus at around 8 instructions per episode. This is expected, as BOW does not consider the sequential nature of an instruction, which is important for executing it correctly.

Vision-based observations. To answer the second part of (2), we extend our framework to pixel observations. The agent reaches the same performance as the state-based model, albeit requiring a longer convergence time with the same hyper-parameters. By contrast, the one-hot representation reaches much worse relative performance with the same amount of experience (Fig. 4, right).

Visual generalization. One of the most appealing aspects of language is the promise of combinatorial generalization [7], which allows for extrapolation rather than simple interpolation over the training set. To evaluate this (i.e. (3)), we design training and test instruction sets that are systematically distinct.
We evaluate the agent's ability to perform such generalization by splitting the set of 600 instructions as follows: (i) standard: a random 70/30 split of the instruction set; (ii) systematic: the training set consists only of instructions that do not contain the word red in the first half of the instruction, and the test set contains only those that do.

Figure 4: Results for low-level policies in terms of goals accomplished per episode over training steps for HIR. Left: HIR with different numbers of instructions, and results with the non-compositional representation and with no relabeling. Middle: Results for the one-hot encoded representation with an increasing number of instructions. Since one-hot encoding cannot leverage the compositionality of language, it suffers significantly as instruction sets grow, while HIR on sentences in fact learns even faster as the instruction set grows. Right: Performance of the image-based low-level policy compared against one-hot and non-compositional instruction representations.

We emphasize that the agent has never seen the word red in the first part of a sentence during training; in other words, the task is zero-shot, as the training set and the test set are disjoint (i.e. the distributions do not share support). From a purely statistical learning theory perspective, the agent should not do better than chance on such a test set. Remarkably, we observe that the agent generalizes better with language than with the non-compositional representation (Table 1).
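The split procedure and the generalization gap from Table 1 can be sketched as follows (a hypothetical helper, with "first half" taken as the first half of the instruction's tokens; the paper's exact procedure is in its appendix):

```python
import random

def split_instructions(instructions, mode, seed=0):
    """Split an instruction set either randomly ("standard") or by
    holding out all instructions with "red" in their first half
    ("systematic"), making train and test distributions disjoint."""
    if mode == "standard":  # random 70/30 split
        rng = random.Random(seed)
        shuffled = list(instructions)
        rng.shuffle(shuffled)
        k = int(0.7 * len(shuffled))
        return shuffled[:k], shuffled[k:]
    # systematic: "red" in the first half of the sentence -> test set
    train, test = [], []
    for s in instructions:
        words = s.split()
        first_half = words[: len(words) // 2]
        (test if "red" in first_half else train).append(s)
    return train, test

def generalization_gap(train_perf, test_perf):
    # "Gap" as defined in Table 1: one minus test/train performance.
    return 1.0 - test_perf / train_perf

instrs = ["red ball behind cube", "cube behind red ball",
          "move red sphere left", "push cube toward red ball"]
train, test = split_instructions(instrs, "systematic")
# No training instruction has "red" in its first half:
assert all("red" not in s.split()[: len(s.split()) // 2] for s in train)
```

Under this definition, the language agent's systematic gap in Table 1 is 1 − 8.13/20.09 ≈ 0.596, matching the reported value.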
This suggests that the agent recognizes the compositional structure of the language and achieves systematic generalization through such understanding.

                     Standard train   Standard test   Gap     Systematic train   Systematic test   Gap
Language             21.50 ± 2.28     21.49 ± 2.53    0.001   20.09 ± 2.46        8.13 ± 2.34      0.596
Non-Compositional     6.26 ± 1.18      5.78 ± 1.44    0.077    7.54 ± 1.14        0.76 ± 0.69      0.899
Random                0.17 ± 0.20      0.21 ± 0.17      -      0.11 ± 0.19        0.18 ± 0.22        -

Table 1: Final performance of the low-level policy on different training and test instruction distributions (20 episodes). Language outperforms the non-compositional representation in both absolute performance and relative generalization gap in every setting. Gap equals one minus the ratio between mean test performance and mean train performance, and can be interpreted as the generalization gap. With the language representation, the generalization gap increases by approximately 59.5 percentage points from standard to zero-shot generalization, while for the non-compositional representation it increases by 82.2 percentage points.

Figure 5: Results for the high-level policy on tasks (a-c). Blue curves for HAL include the steps for training the low-level policy (a single low-level policy is used for all 3 tasks). In all settings, HAL demonstrates faster learning than DDQN.
Means and standard deviations of 3 random seeds are plotted.

6.2 High-level policy

Having analyzed the low-level policy, we next evaluate the full HAL algorithm. To answer (4), we compare our framework in the state space against a non-hierarchical DDQN baseline and two representative hierarchical reinforcement learning frameworks, HIRO [42] and Option-Critic (OC) [5], on the proposed high-level tasks with sparse rewards (Sec. 5). We observe that neither HRL baseline is able to learn a reasonable policy, while DDQN is able to solve only 2 of the 3 tasks. HAL solves all 3 tasks consistently, with much lower variance and better asymptotic performance (Fig. 5). We then show that our framework successfully transfers to high-dimensional observations (i.e. images) in all 3 tasks without loss of performance, whereas even the non-hierarchical DDQN fails to make progress (Fig. 6, left). Finally, we apply the method to 3 additional diverse tasks (Fig. 6, middle). In these settings, we observed that the high-level policy has difficulty learning from pixels alone, likely due to the visual diversity and the simplified high-level policy parameterization. As such, the high-level policy in the diverse setting receives state observations, while the low-level policy uses raw-pixel observations. For more details, please refer to Appendix B.3.

Figure 6: Left: Results for vision-based hierarchical RL. In all settings, HAL demonstrates faster and more stable learning, while the baseline DDQN cannot learn a non-trivial policy. In this case, the vision-based low-level policy needs a longer training time (~5 × 10^6 steps), so we start the x-axis there. Means and standard deviations of 3 seeds are plotted (near-zero variance for DDQN). Middle: Results of HRL on the proposed 3 diverse tasks (d-f). In this case, the low-level policy used is trained on image observations for ~4 × 10^6 steps.
3 random seeds are plotted, and training has not converged. Right: HAL vs. policy sketches on the Crafting environment. HAL is significantly more sample-efficient since it is off-policy and uses relabeling across all modules.

6.3 Crafting Environment

To show the generality of the proposed framework, we apply our method to the Crafting environment introduced by Andreas et al. [2] (Fig. 6, right). We apply HAL to this environment by training a separate policy network for each module, since there are fewer than 20 modules in the environment. These low-level policies receive binary rewards (i.e. one-bit supervision, analogous to completing an instruction) and are trained jointly with HIR. A high-level policy picks which module to execute for a fixed 5 steps, and is trained with regular DDQN. Note that our method uses a different form of supervision than policy sketches, since we provide only a binary reward for the low-level policy; such supervision can sometimes be easier to specify than an entire sketch.

7 Discussion

We demonstrate that language abstractions can serve as an efficient, flexible, and human-interpretable representation for solving a variety of long-horizon control problems in an HRL framework. Through relabeling and the inherent compositionality of language, we show that low-level, language-conditioned policies can be trained efficiently without engineered reward shaping and with large numbers of instructions, while exhibiting strong generalization. Our framework, HAL, can thereby leverage these policies to solve a range of difficult sparse-reward manipulation tasks with greater success and sample efficiency than training without language abstractions.

While our method demonstrates promising results, one limitation is that the current method relies on instructions provided by a language supervisor, which has access to the instructions that describe a scene.
The language supervisor can, in principle, be replaced with an image-captioning model and a question-answering model, so that the method can be deployed on real image observations for robotic control tasks; this is an exciting direction for future work. Another limitation is that the instruction set used is specific to our problem domain, providing a substantial amount of pre-defined structure to the agent. It remains an open question how to enable an agent to follow a much more diverse instruction set that is not specific to any particular domain, or how to learn compositional abstractions without the supervision of language. Our experiments suggest that either would likely yield an HRL method that requires minimal domain-specific supervision while retaining significant empirical gains over existing domain-agnostic methods, indicating a promising direction for future research. Overall, we believe this work represents a step towards RL agents that can effectively reason using compositional language to perform complex tasks, and we hope that our empirical analysis will inspire more research in compositionality at the intersection of language and reinforcement learning.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable feedback. We would also like to thank Jacob Andreas, Justin Fu, Sergio Guadarrama, Ofir Nachum, Vikash Kumar, Allan Zhou, Archit Sharma, and other colleagues at Google Research for helpful discussion and feedback on drafts of this work.

References

[1] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2018.

[2] Jacob Andreas, Dan Klein, and Sergey Levine.
Modular multitask reinforcement learning with\n\npolicy sketches, 2016.\n\n[3] Jacob Andreas, Dan Klein, and Sergey Levine. Learning with latent language, 2017.\n\n[4] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder,\nBob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience\nreplay. In Advances in Neural Information Processing Systems, pages 5048\u20135058, 2017.\n\n[5] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages\n\n1726\u20131734, 2017.\n\n[6] Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Pushmeet Kohli, and Edward\n\nGrefenstette. Learning to understand goal speci\ufb01cations by modelling reward, 2018.\n\n[7] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius\nZambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan\nFaulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint\narXiv:1806.01261, 2018.\n\n[8] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluation the role of bleu in\nmachine translation research. In 11th Conference of the European Chapter of the Association\nfor Computational Linguistics, 2006.\n\n[9] Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated rein-\nforcement learning. In Advances in neural information processing systems, pages 1281\u20131288,\n2005.\n\n[10] Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Sa-\nharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: First steps towards grounded language\nlearning with a human in the loop. In International Conference on Learning Representations,\n2019. URL https://openreview.net/forum?id=rJeXCo0cYX.\n\n[11] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 
On the properties of neural machine translation: Encoder-decoder approaches, 2014.

[12] John D Co-Reyes, Abhishek Gupta, Suvansh Sanjeev, Nick Altieri, John DeNero, Pieter Abbeel, and Sergey Levine. Meta-learning language-guided policy learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HkgSEnA5KQ.

[13] Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012.

[14] Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Neural modular control for embodied question answering. arXiv preprint arXiv:1810.11181, 2018.

[15] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.

[16] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

[17] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.

[18] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. International Conference on Learning Representations (ICLR), 2018.

[19] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pages 3314–3325, 2018.

[20] Justin Fu, Anoop Korattikara, Sergey Levine, and Sergio Guadarrama. From language to goals: Inverse reinforcement learning for vision-based instruction following. In International Conference on Learning Representations, 2019.
URL https://openreview.net/forum?\nid=r1lq1hRqYQ.\n\n[21] Lila Gleitman and Anna Papafragou. Language and thought. Cambridge handbook of thinking\n\nand reasoning, pages 633\u2013661, 2005.\n\n[22] H Paul Grice. Logic and conversation. 1975, pages 41\u201358, 1975.\n\n[23] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-\npolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint\narXiv:1801.01290, 2018.\n\n[24] Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an\n\noption: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017.\n\n[25] Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep\nreinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636,\n2015.\n\n[26] Nicolas Heess, Greg Wayne, Yuval Tassa, Timothy Lillicrap, Martin Riedmiller, and David Sil-\nver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182,\n2016.\n\n[27] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax.\n\narXiv preprint arXiv:1611.01144, 2016.\n\n[28] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick,\nand Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary\nvisual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 2901\u20132910, 2017.\n\n[29] Leslie Pack Kaelbling. Hierarchical learning in stochastic domains: Preliminary results. In\nProceedings of the tenth international conference on machine learning, volume 951, pages\n167\u2013173, 1993.\n\n[30] Russell Kaplan, Christopher Sauer, and Alexander Sosa. Beating atari with natural language\n\nguided reinforcement learning, 2017.\n\n[31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
arXiv preprint arXiv:1412.6980, 2014.

[32] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[33] George Konidaris and Andrew G Barto. Building portable options: Skill transfer in reinforcement learning. In IJCAI, volume 7, pages 895–900, 2007.

[34] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.

[35] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

[36] Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical reinforcement learning with hindsight. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryzECoAcY7.

[37] Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. A survey of reinforcement learning informed by natural language. arXiv preprint arXiv:1906.03926, 2019.

[38] Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the twenty-first international conference on Machine learning, page 71. ACM, 2004.

[39] Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035, 2017.

[40] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks.
arXiv preprint arXiv:1802.05957, 2018.\n\n[41] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G\nBellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.\nHuman-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\n[42] O\ufb01r Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-ef\ufb01cient hierarchical\n\nreinforcement learning, 2018.\n\n[43] O\ufb01r Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-optimal representation\nlearning for hierarchical reinforcement learning. International Conference on Learning Repre-\nsentations (ICLR), 2019.\n\n[44] Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine.\nVisual reinforcement learning with imagined goals. CoRR, abs/1807.04742, 2018. URL\nhttp://arxiv.org/abs/1807.04742.\n\n[45] Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. Language understanding for text-\n\nbased games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.\n\n[46] Ronald Parr and Stuart J Russell. Reinforcement learning with hierarchies of machines. In\n\nAdvances in neural information processing systems, pages 1043\u20131049, 1998.\n\n[47] Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. Deeploco: Dynamic lo-\ncomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics\n(Proc. SIGGRAPH 2017), 36(4), 2017.\n\n[48] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film:\nVisual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on\nArti\ufb01cial Intelligence, 2018.\n\n[49] Steven T Piantadosi, Joshua B Tenenbaum, and Noah D Goodman. Bootstrapping in a language\nof thought: A formal model of numerical concept learning. Cognition, 123(2):199\u2013217, 2012.\n\n[50] Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. 
Temporal difference models: Model-free deep RL for model-based control, 2018.

[51] Vitchyr H. Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-fit: State-covering self-supervised reinforcement learning. CoRR, abs/1903.03698, 2019. URL http://arxiv.org/abs/1903.03698.

[52] Doina Precup. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst, 2000.

[53] Ehud Reiter. A structured review of the validity of BLEU. Computational Linguistics, 44(3):393–401, 2018.

[54] Himanshu Sahni, Toby Buckley, Pieter Abbeel, and Ilya Kuzovkin. Visual hindsight experience replay. CoRR, abs/1901.11529, 2019. URL http://arxiv.org/abs/1901.11529.

[55] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.

[56] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[57] Tianmin Shu, Caiming Xiong, and Richard Socher. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning. arXiv preprint arXiv:1712.07294, 2017.

[58] Olivier Sigaud and Freek Stulp. Policy search in continuous action domains: an overview. arXiv preprint arXiv:1803.04706, 2018.

[59] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

[60] Martin Stolle and Doina Precup. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pages 212–223. Springer, 2002.

[61] Elior Sulem, Omri Abend, and Ari Rappoport.
Bleu is not suitable for the evaluation of text\n\nsimpli\ufb01cation. arXiv preprint arXiv:1810.05995, 2018.\n\n[62] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural\n\nnetworks, 2014.\n\n[63] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A\nframework for temporal abstraction in reinforcement learning. Arti\ufb01cial intelligence, 112(1-2):\n181\u2013211, 1999.\n\n[64] Arash Tavakoli, Fabio Pardo, and Petar Kormushev. Action branching architectures for deep\n\nreinforcement learning. In Thirty-Second AAAI Conference on Arti\ufb01cial Intelligence, 2018.\n\n[65] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. A deep\n\nhierarchical approach to lifelong learning in minecraft. In AAAI, volume 3, page 6, 2017.\n\n[66] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based\ncontrol. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on,\npages 5026\u20135033. IEEE, 2012.\n\n[67] A\u00e4ron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation\n\nlearning. CoRR, abs/1711.00937, 2017. URL http://arxiv.org/abs/1711.00937.\n\n[68] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double\n\nq-learning, 2015.\n\n[69] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg,\nDavid Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning.\narXiv preprint arXiv:1703.01161, 2017.\n\n[70] Yuhuai Wu, Harris Chan, Jamie Kiros, Sanja Fidler, and Jimmy Ba. ACTRCE: Augmenting expe-\nrience via teacher\u2019s advice, 2019. 
URL https://openreview.net/forum?id=HyM8V2A9Km.