{"title": "Hierarchical Reinforcement Learning for Zero-shot Generalization with Subtask Dependencies", "book": "Advances in Neural Information Processing Systems", "page_first": 7156, "page_last": 7166, "abstract": "We introduce a new RL problem where the agent is required to generalize to a previously-unseen environment characterized by a subtask graph which describes a set of subtasks and their dependencies. Unlike existing hierarchical multitask RL approaches that explicitly describe what the agent should do at a high level, our problem only describes properties of subtasks and relationships among them, which requires the agent to perform complex reasoning to find the optimal subtask to execute. To solve this problem, we propose a neural subtask graph solver (NSGS) which encodes the subtask graph using a recursive neural network embedding. To overcome the difficulty of training, we propose a novel non-parametric gradient-based policy, graph reward propagation, to pre-train our NSGS agent and further finetune it through actor-critic method. The experimental results on two 2D visual domains show that our agent can perform complex reasoning to find a near-optimal way of executing the subtask graph and generalize well to the unseen subtask graphs. 
In addition, we compare our agent with a Monte-Carlo tree search (MCTS) method, showing that our method is much more efficient than MCTS and that the performance of NSGS can be further improved by combining it with MCTS.", "full_text": "Hierarchical Reinforcement Learning for Zero-shot Generalization with Subtask Dependencies

Sungryull Sohn, University of Michigan, srsohn@umich.edu
Junhyuk Oh*, University of Michigan, junhyuk@google.com
Honglak Lee, Google Brain and University of Michigan, honglak@google.com

Abstract

We introduce a new RL problem where the agent is required to generalize to a previously-unseen environment characterized by a subtask graph, which describes a set of subtasks and their dependencies. Unlike existing hierarchical multitask RL approaches that explicitly describe what the agent should do at a high level, our problem only describes properties of subtasks and relationships among them, which requires the agent to perform complex reasoning to find the optimal subtask to execute. To solve this problem, we propose a neural subtask graph solver (NSGS) which encodes the subtask graph using a recursive neural network embedding. To overcome the difficulty of training, we propose a novel non-parametric gradient-based policy, graph reward propagation, to pre-train our NSGS agent, and further finetune it with an actor-critic method. The experimental results on two 2D visual domains show that our agent can perform complex reasoning to find a near-optimal way of executing the subtask graph and generalize well to unseen subtask graphs. 
In addition, we compare our agent with a Monte-Carlo tree search (MCTS) method, showing that our method is much more efficient than MCTS and that the performance of NSGS can be further improved by combining it with MCTS.

1 Introduction
Developing the ability to execute many different tasks depending on given task descriptions, and to generalize over unseen task descriptions, is an important problem for building scalable reinforcement learning (RL) agents. Recently, there have been a few attempts to define and solve different forms of task descriptions such as natural language [1, 2] or formal language [3, 4]. However, most of the prior works have focused on task descriptions which explicitly specify what the agent should do at a high level, which may not be readily available in real-world applications.
To further motivate the problem, consider a scenario in which an agent needs to generalize to a complex novel task by performing a composition of subtasks, where the task description and the dependencies among subtasks may change depending on the situation. For example, a human user could ask a physical household robot to make a meal in an hour. A meal may be served with different combinations of dishes, each of which incurs a different cost (e.g., time) and yields a different reward (e.g., user satisfaction) depending on the user's preferences. In addition, there can be complex dependencies between subtasks. For example, bread should be sliced before it is toasted, and an omelette and an egg sandwich cannot both be made if there is only one egg left. Due to such complex dependencies, as well as differing rewards and costs, it is often cumbersome for human users to manually provide the optimal sequence of subtasks (e.g., "fry an egg and toast bread"). 
Instead, the agent should learn to act in the environment by figuring out the optimal sequence of subtasks that gives the maximum reward within a time budget, just from the properties and dependencies of the subtasks.
The goal of this paper is to formulate and solve such a problem, which we call subtask graph execution, where the agent should execute the given subtask graph in an optimal way as illustrated in Figure 1.

*Now at DeepMind.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Example task and our agent's trajectory. The agent is required to execute subtasks in the optimal order to maximize the reward within a time limit. The subtask graph describes subtasks with the corresponding rewards (e.g., subtask L gives 1.0 reward) and dependencies between subtasks through AND and OR nodes. For instance, the agent should first get the firewood (D) OR coal (G) to light a furnace (J). In this example, our agent learned to execute subtask F and its preconditions (shown in red) as soon as possible, since it is a precondition of many subtasks even though it gives a negative reward. After that, the agent mines minerals that require a stone pickaxe and crafts items (shown in blue) to achieve a high reward.

A subtask graph consists of subtasks, corresponding rewards, and dependencies among subtasks in a logical expression form, which subsumes many existing forms (e.g., sequential instructions [1]). This allows us to define many complex tasks in a principled way and train the agent to find the optimal way of executing such tasks. 
Moreover, we aim to solve the problem without explicit search or simulation, so that our method can be more easily applied to practical real-world scenarios, where real-time performance (i.e., fast decision-making) is required and building a simulation model is extremely challenging.
To solve the problem, we propose a new deep RL architecture, called neural subtask graph solver (NSGS), which encodes a subtask graph using a recursive-reverse-recursive neural network (R3NN) [5] to consider the long-term effect of each subtask. Still, finding the optimal sequence of subtasks by reflecting the long-term dependencies between subtasks and the context of the observation is computationally intractable, and we found that it is extremely challenging to learn a good policy when training from scratch. To address this difficulty, we propose to pre-train the NSGS to approximate our novel non-parametric policy, called the graph reward propagation policy. The key idea of the graph reward propagation policy is to construct a differentiable representation of the subtask graph such that taking a gradient over the reward propagates reward information between related subtasks, which is used to find a reasonably good subtask to execute. After pre-training, our NSGS architecture is finetuned using an actor-critic method.
The experimental results on 2D visual domains with diverse subtask graphs show that our agent implicitly performs complex reasoning by taking into account long-term subtask dependencies as well as the cost of executing each subtask inferred from the observation, and that it can successfully generalize to unseen and larger subtask graphs. 
Finally, we show that our method is computationally much more efficient than the Monte-Carlo tree search (MCTS) algorithm, and that the performance of our NSGS agent can be further improved by combining it with MCTS, achieving near-optimal performance.
Our contributions can be summarized as follows: (1) We propose a new challenging RL problem and domain with a richer and more general form of graph-based task descriptions compared to recent work on multitask RL. (2) We propose a deep RL architecture that can execute arbitrary unseen subtask graphs and observations. (3) We demonstrate that our method outperforms a state-of-the-art search-based method (MCTS), which implies that our method can efficiently approximate the solution of an intractable search problem without performing any search. (4) We further show that our method can also be used to augment MCTS, which significantly improves the performance of MCTS with a much smaller number of simulations.

2 Related Work
Programmable Agent The idea of learning to execute a given program using RL was introduced by programmable hierarchies of abstract machines (PHAMs) [6-8]. PHAMs specify a partial policy using a set of hierarchical finite state machines, and the agent learns to execute the partial program. A different way of specifying a partial policy was explored in the deep RL framework [4]. Other approaches used a program as a form of task description rather than a partial policy in the context of multitask RL [1, 3]. Our work also aims to build a programmable agent in that we train the agent to execute a given task. However, most of the prior work assumes that the program specifies what to do, and the agent just needs to learn how to do it. 
In contrast, our work explores a new form of program, called the subtask graph (see Figure 1), which describes properties of subtasks and dependencies between them; the agent is required to figure out what to do as well as how to do it.
Hierarchical Reinforcement Learning Many hierarchical RL approaches have been proposed to solve complex decision problems via multiple levels of temporal abstraction [9-13]. Our work builds upon this prior work in that a high-level controller focuses on finding the optimal subtask, while a low-level controller focuses on executing the given subtask. In this work, we focus on how to train the high-level controller to generalize to novel, complex dependencies between subtasks.
Classical Search-Based Planning One of the most closely related problems is the planning problem considered in hierarchical task network (HTN) approaches [14-18], in that HTNs also aim to find the optimal way to execute tasks given subtask dependencies. However, they aim to execute a single goal task, whereas the goal of our problem is to maximize the cumulative reward in an RL context. Thus, the agent in our problem not only needs to consider dependencies among subtasks but also needs to infer the cost from the observation and deal with the stochasticity of the environment. These additional challenges make it difficult to apply such classical planning methods to our problem.
Motion Planning Another problem related to our subtask graph execution problem is the motion planning (MP) problem [19-23]. The MP problem is often mapped onto a graph and reduced to a graph search problem. 
However, different from our problem, the MP approaches aim to find an optimal path to the goal in the graph while avoiding obstacles, similar to HTN approaches.

3 Problem Definition
3.1 Preliminary: Multitask Reinforcement Learning and Zero-Shot Generalization
We consider an agent presented with a task drawn from some distribution, as in [4, 24]. We model each task as a Markov Decision Process (MDP). Let $G \in \mathcal{G}$ be a task parameter available to the agent, drawn from a distribution $P(G)$, where $G$ defines the task and $\mathcal{G}$ is the set of all possible task parameters. The goal is to maximize the expected reward over the whole distribution of MDPs: $\int P(G) J(\pi, G)\, dG$, where $J(\pi, G) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^t r_t\right]$ is the expected return of the policy $\pi$ given a task defined by $G$, $\gamma$ is a discount factor, $\pi: \mathcal{S} \times \mathcal{G} \rightarrow \mathcal{A}$ is the multitask policy that we aim to learn, and $r_t$ is the reward at time step $t$. We consider zero-shot generalization, where only a subset of tasks $\mathcal{G}_{train} \subset \mathcal{G}$ is available to the agent during training, and the agent is required to generalize over a set of unseen tasks $\mathcal{G}_{test} \subset \mathcal{G}$ for evaluation, where $\mathcal{G}_{test} \cap \mathcal{G}_{train} = \emptyset$.
3.2 Subtask Graph Execution Problem
The subtask graph execution problem is a multitask RL problem with a specific form of task parameter $G$ called a subtask graph. Figure 1 illustrates an example subtask graph and environment. The task is to execute the given $N$ subtasks in an optimal order to maximize reward within a time budget, where there are complex dependencies between subtasks defined by the subtask graph. 
We assume that the agent has learned a set of options $\mathcal{O}$ [11, 25, 9] that perform subtasks by executing one or more primitive actions.
Subtask Graph and Environment We define the terminology as follows:
• Precondition: the precondition of a subtask is a logical expression over other subtasks in sum-of-products (SoP) form, where multiple AND terms are combined with an OR term (e.g., the precondition of subtask J in Figure 1 is OR(AND(D), AND(G))).
• Eligibility vector: $e_t = [e_t^1, \ldots, e_t^N]$, where $e_t^i = 1$ if subtask $i$ is eligible at time $t$ (i.e., the precondition of subtask $i$ is satisfied and it has never been executed by the agent), and 0 otherwise.
• Completion vector: $x_t = [x_t^1, \ldots, x_t^N]$, where $x_t^i = 1$ if subtask $i$ has been executed by the agent while it was eligible, and 0 otherwise.
• Subtask reward vector: $r = [r^1, \ldots, r^N]$ specifies the reward for executing each subtask.
• Reward: $r_t = r^i$ if the agent executes subtask $i$ while it is eligible, and $r_t = 0$ otherwise.
• Time budget: $\text{step}_t \in \mathbb{R}$ is the number of remaining time steps until episode termination.
• Observation: $\text{obs}_t \in \mathbb{R}^{H \times W \times C}$ is a visual observation at time $t$, as illustrated in Figure 1.
To summarize, a subtask graph $G$ defines $N$ subtasks with the corresponding rewards $r$ and preconditions. The state input at time $t$ is $s_t = \{\text{obs}_t, x_t, e_t, \text{step}_t\}$. The goal is to find a policy $\pi: s_t, G \mapsto o_t$ which maps the given context of the environment to an option ($o_t \in \mathcal{O}$).

Figure 2: Neural subtask graph solver architecture. The task module encodes the subtask graph through a bottom-up and top-down process and outputs the reward score $p_t^{reward}$. The observation module encodes the observation using a CNN and outputs the cost score $p_t^{cost}$. The final policy is a softmax policy over the sum of the two scores.

Challenges Our problem is challenging due to the following aspects:
• Generalization: only a subset of subtask graphs ($\mathcal{G}_{train}$) is available during training, but the agent is required to execute previously unseen and larger subtask graphs ($\mathcal{G}_{test}$).
• Complex reasoning: the agent needs to infer the long-term effect of executing individual subtasks in terms of reward and cost (e.g., time) and find the optimal sequence of subtasks to execute without any explicit supervision or simulation-based search. We note that it may not be easy even for humans to find the solution without explicit search, due to the exponentially large solution space.
• Stochasticity: the outcome of subtask execution is stochastic in our setting (for example, some objects move randomly). Therefore, the agent needs to consider the expected outcome when deciding which subtask to execute.

4 Method
Our neural subtask graph solver (NSGS) is a neural network that consists of a task module and an observation module, as shown in Figure 2. The task module encodes the precondition of each subtask via a bottom-up process and propagates information about future subtasks and rewards to preceding subtasks (i.e., preconditions) via a top-down process. The observation module learns the correspondence between a subtask and its target object, and the relation between the locations of objects in the observation and the time cost. However, due to the complex-reasoning challenge described in Section 3.2, learning to execute the subtask graph only from the reward is extremely challenging. To facilitate learning, we propose the graph reward propagation policy (GRProp), a non-parametric policy that propagates reward information between related subtasks to model their dependencies. 
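To make the eligibility computation in Section 3.2 concrete, here is a minimal sketch in plain Python; the list-of-AND-terms encoding of the SoP preconditions and the function name are our own illustrative choices, not the paper's implementation.

```python
def eligibility(preconds, x, done):
    # Compute the eligibility vector e_t from the completion vector x (0/1)
    # and SoP preconditions. preconds[i] is a list of AND terms; each AND
    # term is a list of (subtask_index, polarity) pairs, where polarity=False
    # encodes a NOT edge. An empty list means "no precondition".
    e = []
    for i, terms in enumerate(preconds):
        satisfied = (not terms) or any(
            all((x[j] == 1) == pol for j, pol in term) for term in terms
        )
        # A subtask is eligible only if it has never been executed (Section 3.2).
        e.append(1 if satisfied and not done[i] else 0)
    return e

# Subtask J in Figure 1 has precondition OR(AND(D), AND(G)); the indices
# below (D=0, G=1, J=2) are hypothetical.
preconds = [[], [], [[(0, True)], [(1, True)]]]
# D is completed, so J becomes eligible; D itself is no longer eligible.
print(eligibility(preconds, [1, 0, 0], [True, False, False]))  # -> [0, 1, 1]
```

The same encoding extends directly to NOT edges: a term like `[(0, False)]` is satisfied only while subtask 0 remains unexecuted.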
Since GRProp acts as a good initial policy, we train the NSGS to approximate the GRProp policy through policy distillation [26, 27], and finetune it with an actor-critic method with generalized advantage estimation (GAE) [28] to maximize the reward. Section 4.1 describes the NSGS architecture, and Section 4.2 describes how to construct the GRProp policy.

4.1 Neural Subtask Graph Solver
Task Module Given a subtask graph $G$, the remaining time steps $\text{step}_t \in \mathbb{R}$, an eligibility vector $e_t$, and a completion vector $x_t$, we compute a context embedding using a recursive-reverse-recursive neural network (R3NN) [5] as follows:
$$\phi^i_{bot,o} = b_{\theta_o}\left(x^i_t,\, e^i_t,\, \text{step}_t,\, \sum_{j \in Child_i} \phi^j_{bot,a}\right), \qquad \phi^j_{bot,a} = b_{\theta_a}\left(\sum_{k \in Child_j} \left[\phi^k_{bot,o},\, w^{j,k}_+\right]\right), \qquad (1)$$
$$\phi^i_{top,o} = t_{\theta_o}\left(\phi^i_{bot,o},\, r^i,\, \sum_{j \in Par_i} \left[\phi^j_{top,a},\, w^{i,j}_+\right]\right), \qquad \phi^j_{top,a} = t_{\theta_a}\left(\phi^j_{bot,a},\, \sum_{k \in Par_j} \phi^k_{top,o}\right), \qquad (2)$$
where $[\cdot]$ is a concatenation operator; $b_\theta, t_\theta$ are the bottom-up and top-down encoding functions; $\phi^i_{bot,a}, \phi^i_{top,a}$ are the bottom-up and top-down embeddings of the $i$-th AND node, respectively; and $\phi^i_{bot,o}, \phi^i_{top,o}$ are the bottom-up and top-down embeddings of the $i$-th OR node, respectively (see Appendix for details). The $w^{i,j}_+$, $Child_i$, and $Par_i$ specify the connections in the subtask graph $G$. Specifically, $w^{i,j}_+ = 1$ if the $j$-th OR node and the $i$-th AND node are connected without a NOT operation, $-1$ if they are connected through a NOT operation, and $0$ if they are not connected; $Child_i$ and $Par_i$ denote the sets of the $i$-th node's children and parents, respectively. The embeddings are transformed into reward scores via $p^{reward}_t = \Phi_{top}^\top v$, where $\Phi_{top} = [\phi^1_{top,o}, \ldots, \phi^N_{top,o}] \in \mathbb{R}^{E \times N}$, $E$ is the dimension of the top-down embedding of an OR node, and $v \in \mathbb{R}^E$ is a weight vector for reward scoring.
Observation Module The observation module encodes the input observation $\text{obs}_t$ using a convolutional neural network (CNN) and outputs a cost score:
$$p^{cost}_t = \text{CNN}(\text{obs}_t, \text{step}_t), \qquad (3)$$
where $\text{step}_t$ is the number of remaining time steps. An ideal observation module would learn to estimate a high score for a subtask whose target object is close to the agent, because executing it requires less cost (i.e., time); likewise, if the expected number of steps required to execute a subtask is larger than the number of remaining steps, an ideal agent would assign a low score. The NSGS policy is a softmax policy over the sum of the reward and cost scores:
$$\pi(o_t | s_t, G) = \text{Softmax}\left(p^{reward}_t + p^{cost}_t\right). \qquad (4)$$

4.2 Graph Reward Propagation Policy: Pre-training the Neural Subtask Graph Solver
Intuitively, the graph reward propagation policy is designed to put high probabilities on subtasks that are likely to maximize the sum of the modified and smoothed reward $\widetilde{U}_t$ at time $t$, which will be defined in Eq. 9. Let $x_t$ be a completion vector and $r$ a subtask reward vector (see Section 3 for definitions). Then, the sum of rewards up to time step $t$ is given as:
$$U_t = r^\top x_t. \qquad (5)$$
We first modify the reward formulation so that it gives half of the subtask reward for satisfying the precondition and the rest for executing the subtask, to encourage the agent to satisfy the preconditions of subtasks with large rewards:
$$\widehat{U}_t = r^\top (x_t + e_t)/2. \qquad (6)$$
Let $y^j_{AND}$ denote the output of the $j$-th AND node. 
The eligibility vector $e_t$ can be computed from the subtask graph $G$ and $x_t$ as follows:
$$e^i_t = \mathop{\mathrm{OR}}_{j \in Child_i}\left(y^j_{AND}\right), \qquad y^j_{AND} = \mathop{\mathrm{AND}}_{k \in Child_j}\left(\widehat{x}^{j,k}_t\right), \qquad \widehat{x}^{j,k}_t = x^k_t w^{j,k} + (1 - x^k_t)(1 - w^{j,k}), \qquad (7)$$
where $w^{j,k} = 0$ if there is a NOT connection between the $j$-th node and the $k$-th node, and $w^{j,k} = 1$ otherwise. Intuitively, $\widehat{x}^{j,k}_t = 1$ when the $k$-th node does not violate the precondition of the $j$-th node. Note that $\widehat{U}_t$ is not differentiable with respect to $x_t$ because $\mathrm{AND}(\cdot)$ and $\mathrm{OR}(\cdot)$ are not differentiable. To derive our graph reward propagation policy, we propose to substitute the $\mathrm{AND}(\cdot)$ and $\mathrm{OR}(\cdot)$ functions with "smoothed" functions $\widetilde{\mathrm{AND}}$ and $\widetilde{\mathrm{OR}}$:
$$\widetilde{e}^i_t = \mathop{\widetilde{\mathrm{OR}}}_{j \in Child_i}\left(\widetilde{y}^j_{AND}\right), \qquad \widetilde{y}^j_{AND} = \mathop{\widetilde{\mathrm{AND}}}_{k \in Child_j}\left(\widehat{x}^{j,k}_t\right), \qquad (8)$$
where $\widetilde{\mathrm{AND}}$ and $\widetilde{\mathrm{OR}}$ are implemented as scaled sigmoid and tanh functions, as illustrated in Figure 3 (see Appendix for details). With the smoothed operations, the sum of the smoothed and modified reward is given as:
$$\widetilde{U}_t = r^\top (x_t + \widetilde{e}_t)/2. \qquad (9)$$

Figure 3: Visualization of the OR, $\widetilde{\mathrm{OR}}$, AND, and $\widetilde{\mathrm{AND}}$ operations with three inputs (a, b, c). The smoothed functions are defined to handle an arbitrary number of operands (see Appendix).

Finally, the graph reward propagation policy is a softmax policy over the gradient of $\widetilde{U}_t$ with respect to $x_t$:
$$\pi(o_t | x_t, G) = \text{Softmax}\left(\nabla_{x_t} \widetilde{U}_t\right) = \text{Softmax}\left(\frac{1}{2} r^\top + \frac{1}{2} r^\top \nabla_{x_t} \widetilde{e}_t\right). \qquad (10)$$

4.3 Policy Optimization
The NSGS is first trained through policy distillation, by minimizing the KL divergence between the NSGS and teacher (GRProp) policies:
$$\nabla_\theta \mathcal{L}_1 = \mathbb{E}_{G \sim \mathcal{G}_{train}}\left[\mathbb{E}_{s \sim \pi^G_\theta}\left[\nabla_\theta D_{KL}\left(\pi^G_T \,\|\, \pi^G_\theta\right)\right]\right], \qquad (11)$$
where $\theta$ is the parameter of NSGS, $\pi^G_\theta$ denotes the NSGS policy given subtask graph $G$, $\pi^G_T$ denotes the teacher (GRProp) policy given $G$, $D_{KL}$ is the KL divergence, and $\mathcal{G}_{train}$ is the training set of subtask graphs. After policy distillation, we finetune the NSGS agent in an end-to-end manner using an actor-critic method with GAE [28]:
$$\nabla_\theta \mathcal{L}_2 = \mathbb{E}_{G \sim \mathcal{G}_{train}}\left[\mathbb{E}_{s \sim \pi^G_\theta}\left[-\nabla_\theta \log \pi^G_\theta \sum_{l=0}^{\infty}\left(\prod_{n=0}^{l-1} (\gamma\lambda)^{k_n}\right)\delta_{t+l}\right]\right], \qquad (12)$$
$$\delta_t = r_t + \gamma^{k_t} V^\pi_{\theta'}(s_{t+1}, G) - V^\pi_{\theta'}(s_t, G), \qquad (13)$$
where $k_t$ is the duration of option $o_t$, $\gamma$ is a discount factor, $\lambda \in [0, 1]$ is a weight for balancing between bias and variance of the advantage estimation, and $V^\pi_{\theta'}$ is the critic network parameterized by $\theta'$. During training, we update the critic network to minimize $\mathbb{E}\left[(R_t - V^\pi_{\theta'}(s_t, G))^2\right]$, where $R_t$ is the discounted cumulative reward at time $t$. The complete procedure for training our NSGS agent is summarized in Algorithm 1. We used $\eta_d$ = 1e-4, $\eta_c$ = 3e-6 for distillation and $\eta_{ac}$ = 1e-6, $\eta_c$ = 3e-7 for fine-tuning in the experiments.

Algorithm 1 Policy optimization
1: for iteration $n$ do
2:   Sample $G \sim \mathcal{G}_{train}$
3:   $D = \{(s_t, o_t, r_t, R_t, \text{step}_t), \ldots\} \sim \pi^G_\theta$   ▷ do rollout
4:   $\theta' \leftarrow \theta' + \eta_c \sum_D \left(\nabla_{\theta'} V^\pi_{\theta'}(s_t, G)\right)\left(R_t - V^\pi_{\theta'}(s_t, G)\right)$   ▷ update critic
5:   if distillation then
6:     $\theta \leftarrow \theta + \eta_d \sum_D \nabla_\theta D_{KL}\left(\pi^G_T \,\|\, \pi^G_\theta\right)$   ▷ update policy
7:   else if fine-tuning then
8:     Compute $\delta_t$ from Eq. 13 for all $t$
9:     $\theta \leftarrow \theta + \eta_{ac} \sum_D \nabla_\theta \log \pi^G_\theta \sum_{l=0}^{\infty}\left(\prod_{n=0}^{l-1} (\gamma\lambda)^{k_n}\right)\delta_{t+l}$   ▷ update policy

5 Experiment
In the experiments, we investigated the following research questions: 1) Does GRProp outperform other heuristic baselines (e.g., the greedy policy)? 2) Can NSGS deal with complex subtask dependencies, delayed rewards, and the stochasticity of the environment? 3) Can NSGS generalize to unseen subtask graphs? 4) How does NSGS perform compared to MCTS? 5) Can NSGS be used to improve MCTS?

5.1 Environment
We evaluated the performance of our agents on two domains, Mining and Playground, which are developed based on MazeBase [29]. 
We used a pre-trained subtask executor for each domain. The episode length (time budget) was randomly set for each episode within a range such that the GRProp agent executes 60%-80% of the subtasks on average. Subtasks in the higher layers of the subtask graph are designed to give larger rewards (see Appendix for details).
Mining domain is inspired by Minecraft (see Figures 1 and 5). The agent may pick up raw materials in the world and use them to craft different items at different craft stations. There are two forms of preconditions: 1) an item may be an ingredient for building other items (e.g., stick and stone are ingredients of the stone pickaxe), and 2) some tools are required to pick up certain objects (e.g., the agent needs a stone pickaxe to mine iron ore). The agent can use an item multiple times after picking it up once. The set of subtasks and preconditions is hand-coded based on the crafting recipes in Minecraft and used as a template to generate 640 random subtask graphs. We used 200 for training and 440 for testing.
Playground is a more flexible and challenging domain (see Figure 6). The subtask graphs in Playground were randomly generated; hence, a precondition can be any logical expression and the reward may be delayed. Some of the objects move randomly, which makes the environment stochastic. The agent was trained on small subtask graphs and evaluated on much larger subtask graphs (see Table 1). The set of subtasks is $\mathcal{O} = \mathcal{A}_{int} \times \mathcal{X}$, where $\mathcal{A}_{int}$ is a set of primitive actions to interact with objects and $\mathcal{X}$ is the set of all types of interactive objects in the domain. We randomly generated 500 graphs for training and 2,000 graphs for testing. Note that the task in the Playground domain subsumes many other hierarchical RL domains such as Taxi [30], Minecraft [1], and XWORLD [2]. 
In addition, we added the following components to the subtask graphs to make the task more challenging:
• Distractor subtask: a subtask with only NOT connections to its parent nodes in the subtask graph. Executing this subtask may give an immediate reward, but it may make other subtasks ineligible.
• Delayed reward: the agent receives no reward from subtasks in the lower layers, but it should execute some of them to make higher-level subtasks eligible (see Appendix for the fully-delayed reward case).

5.2 Agents
We evaluated the following policies:
• Random policy executes any eligible subtask.
• Greedy policy executes the eligible subtask with the largest reward.
• Optimal policy is computed by exhaustive search over eligible subtasks.
• GRProp (Ours) is the graph reward propagation policy.
• NSGS (Ours) is distilled from the GRProp policy and finetuned with actor-critic.
• Independent is an LSTM-based baseline trained on each subtask graph independently, similar to the Independent model in [4]. It takes the same set of inputs as NSGS except for the subtask graph.
To the best of our knowledge, existing work on hierarchical RL cannot directly address our problem with a subtask graph input. Instead, we evaluated an instance of a hierarchical RL method (the Independent agent) in the adaptation setting, as discussed in Section 5.3.

Table 1: Generalization performance on unseen and larger subtask graphs. (Playground) The subtask graphs in D1 have the same graph structure as the training set, but the graphs are unseen. The subtask graphs in D2, D3, and D4 have (unseen) larger graph structures. (Mining) The subtask graphs in Eval are unseen during training. NSGS outperforms the other compared agents on every task and domain.

Subtask graph setting      Playground                Mining
Task                   D1     D2     D3     D4       Eval
Depth                   4      4      5      6       4-10
Subtasks               13     16     15     16      10-26

Zero-shot performance
NSGS (Ours)           .820   .785   .715   .527     8.19
GRProp (Ours)         .721   .682   .623   .424     6.16
Greedy                .164   .178   .144   .228     3.39
Random                  0      0      0      0      2.79

Adaptation performance
NSGS (Ours)           .828   .797   .733   .552     8.58
Independent           .346   .296   .193   .188     3.89

5.3 Quantitative Result
Training Performance The learning curves of NSGS and the performance of the other agents are shown in Figure 4. Our GRProp policy significantly outperforms the Greedy policy. This implies that the proposed idea of back-propagating the reward gradient captures long-term dependencies among subtasks to some extent. 
We also found that NSGS further improves the performance through fine-tuning with the actor-critic method. We hypothesize that NSGS learned to estimate the expected costs of executing subtasks from the observations and to consider them along with the subtask graphs.
Generalization Performance We considered two different types of generalization: a zero-shot setting, where the agent must immediately achieve good performance on unseen subtask graphs without learning, and an adaptation setting, where the agent can learn about the task through interaction with the environment. Note that the Independent agent was evaluated in the adaptation setting only, since it does not take the subtask graph as input and thus has no ability to generalize. In particular, we tested agents on larger subtask graphs by varying the number of layers of the subtask graphs from four to six with a larger

Figure 4: Learning curves on the Mining and Playground domains. NSGS is distilled from GRProp on 77K and 256K episodes, respectively, and finetuned after that.

Figure 5: Example trajectories of Greedy, GRProp, and NSGS agents given 75 steps on the Mining domain. We used different colors to indicate that the agent has different types of pickaxes: red (no pickaxe), blue (stone pickaxe), and green (iron pickaxe). The Greedy agent prefers subtasks C, D, F, and G to H and L since C, D, F, and G give positive immediate rewards, whereas the NSGS and GRProp agents find a short path to make a stone pickaxe, focusing on subtasks with higher long-term reward. 
Compared to GRProp, the NSGS agent finds a shorter path to make an iron pickaxe and succeeds in executing more subtasks.

Figure 6: Example trajectories of the Greedy, GRProp, and NSGS agents given 45 steps on the Playground domain. The subtask graph includes a NOT operation and distractors (subtasks D, E, and H). We removed stochasticity in the environment for a controlled experiment. The Greedy agent executes the distractors since they give positive immediate rewards, which makes it impossible to execute subtask K, which gives the largest reward. The GRProp and NSGS agents avoid the distractors and successfully execute subtask K by satisfying its preconditions. After executing subtask K, the NSGS agent finds a shorter path to execute the remaining subtasks than the GRProp agent and gets a larger reward.

Figure 7: Performance of MCTS+NSGS, MCTS+GRProp, and MCTS per the number of simulated steps on (Left) Eval of the Mining domain and (Right) D2 of the Playground domain (see Table 1).

[Figure 5 embedded trajectories: NSGS A-E-D-L-B-H-J-K-N-M-O-R-T-Q-S (reward 14.2); GRProp A-E-B-H-K-J-I-L-N-P-M-O-R (reward 7.9); Greedy A-B-F-C-G-D-E-L-H-K-N-P-J-M-I (reward 6.3). Figure 6 embedded trajectories: NSGS B-C-G-I-K-F-J-A-H (reward 4.14); GRProp B-C-G-I-K-A-J (reward 3.4); Greedy D-E-A-B-G-I-J-C (reward 2.15).]

number of subtasks on the Playground domain. Table 1 summarizes the results in terms of the normalized reward R̄ = (R − Rmin)/(Rmax − Rmin), where Rmin and Rmax correspond to the average reward of the Random and Optimal policies, respectively. Due to the large number of subtasks (>16) in the Mining domain, evaluating the Optimal policy was intractable; instead, we report the un-normalized mean reward. Though the performance degrades as the subtask graph becomes larger, as expected, NSGS generalizes well to larger subtask graphs and consistently outperforms all the other agents on the Playground and Mining domains in the zero-shot setting. In the adaptation setting, NSGS performs slightly better than in the zero-shot setting by fine-tuning on the subtask graphs in the evaluation set. The Independent agent learned a policy comparable to Greedy but performs much worse than NSGS.

5.4 Qualitative Result
Figure 5 visualizes trajectories of the agents on the Mining domain. The Greedy policy mostly focuses on subtasks with immediate rewards (e.g., get string, make bow) that are sub-optimal in the long run. In contrast, the NSGS and GRProp agents focus on executing subtask H (make stone pickaxe) in order to collect materials much faster in the long run. Compared to GRProp, NSGS also takes the observation into account and avoids subtasks with high cost (e.g., get coal).
Figure 6 visualizes trajectories on the Playground domain. In this graph, there are distractors (e.g., D, E, and H) and the reward is delayed. In the beginning, Greedy chooses to execute the distractors, since they give positive rewards while subtasks A, B, and C do not. However, GRProp observes non-zero gradients for subtasks A, B, and C that are propagated from their parent nodes. Thus, even though the reward is delayed, GRProp can figure out which subtasks to execute.
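The gradient-based reasoning just described can be made concrete with a toy example (not the paper's actual formulation; the three-subtask graph and the arithmetic-mean soft-AND below are purely illustrative): writing the expected reward as a smoothed, differentiable function of a subtask-completion vector, the gradient with respect to each subtask scores how much completing it would raise the total reward, so a subtask with zero immediate reward but a rewarding parent still receives a large gradient.

```python
import numpy as np

# Toy subtask graph (hypothetical, for illustration only):
#   A, B: no immediate reward, both are preconditions of K
#   D:    distractor with immediate reward 1
#   K:    reward 10, requires A AND B
# The AND node is smoothed (here: arithmetic mean) so the reward
# surface is differentiable even before any subtask is completed.

def soft_and(values):
    # Smoothed AND: a differentiable relaxation used only for this sketch.
    return np.mean(values)

def expected_reward(x):
    # x = soft completion indicators for (A, B, D), each in [0, 1].
    x_a, x_b, x_d = x
    return 1.0 * x_d + 10.0 * soft_and([x_a, x_b])

def reward_gradient(x, eps=1e-5):
    # Central-difference gradient of the smoothed reward per subtask.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (expected_reward(x + e) - expected_reward(x - e)) / (2 * eps)
    return g

x = np.zeros(3)         # nothing completed yet
g = reward_gradient(x)  # gradient w.r.t. (A, B, D)
print(dict(zip("ABD", g.round(2))))
```

Here the distractor D receives gradient 1.0 (only its own immediate reward), while A and B each receive 5.0 propagated down from K's delayed reward of 10; this is the kind of signal that lets a gradient-based policy prefer A or B over the distractor.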
NSGS learns to understand long-term dependencies from GRProp, and finds shorter paths by also considering the observation.

5.5 Combining NSGS with Monte-Carlo Tree Search
We further investigated how well our NSGS agent performs compared to conventional search-based methods and how it can be combined with them to further improve performance. We implemented the following methods (see Appendix for details):
• MCTS: An MCTS algorithm with the UCB [31] criterion for choosing actions.
• MCTS+NSGS: An MCTS algorithm combined with our NSGS agent. The NSGS policy was used as a rollout policy to explore reasonably good states during tree search, similar to AlphaGo [32].
• MCTS+GRProp: An MCTS algorithm combined with our GRProp agent, similar to MCTS+NSGS.
The results are shown in Figure 7. Our NSGS performs as well as the MCTS method with approximately 32K simulations on Playground and 11K simulations on Mining, while GRProp performs as well as MCTS with approximately 11K simulations on Playground and 1K simulations on Mining. This indicates that our NSGS agent implicitly performs long-term reasoning that is not easily achievable even by a sophisticated MCTS, even though NSGS does not use any simulation and has never seen such subtask graphs during training. More interestingly, MCTS+NSGS and MCTS+GRProp significantly outperform MCTS, and MCTS+NSGS achieves approximately 0.97 normalized reward with 33K simulations on the Playground domain. We found that the Optimal policy, which corresponds to a normalized reward of 1.0, uses approximately 648M simulations on the Playground domain. Thus, MCTS+NSGS performs almost as well as the Optimal policy with only 0.005% of the simulations.
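The rollout-policy combination described above can be sketched in a few dozen lines. The sketch below is a generic UCB-based MCTS over a hypothetical toy MDP (the `ToyMDP`, the `greedy_policy` stand-in, and all names are illustrative assumptions, not the paper's environment or the NSGS network); the only idea it keeps from MCTS+NSGS is replacing uniform-random rollouts with a learned policy.

```python
import math
import random

# Minimal UCB-based MCTS that uses a learned policy as its rollout
# policy (the MCTS+NSGS idea, in the spirit of AlphaGo). Toy setup only.

class ToyMDP:
    """States 0..4. Action 1 moves toward a delayed reward of 1.0 at
    state 4; action 0 grabs a small 'distractor' reward of 0.1 and
    ends the episode (sentinel state -1)."""
    GOAL = 4

    def actions(self, s):
        return [0, 1]

    def done(self, s):
        return s == self.GOAL or s == -1

    def step(self, s, a):
        if a == 0:
            return -1, 0.1, True
        ns = s + 1
        return ns, (1.0 if ns == self.GOAL else 0.0), ns == self.GOAL

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # sum of backed-up returns

def ucb_action(node, c=1.4):
    # UCB: mean child value plus an exploration bonus.
    def score(a):
        child = node.children[a]
        if child.visits == 0:
            return float("inf")
        exploit = child.value / child.visits
        explore = c * math.sqrt(math.log(node.visits + 1) / child.visits)
        return exploit + explore
    return max(node.children, key=score)

def rollout(env, state, policy, gamma=0.9, horizon=10):
    # Estimate the return from `state` by following the learned policy
    # instead of uniform-random actions.
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if env.done(state):
            break
        state, r, _ = env.step(state, policy(state))
        total += discount * r
        discount *= gamma
    return total

def mcts(env, root_state, policy, n_sim=300, gamma=0.9):
    root = Node(root_state)
    for _ in range(n_sim):
        node, path = root, [root]
        # Selection: descend with UCB while the node is fully expanded.
        while node.children and len(node.children) == len(env.actions(node.state)):
            node = node.children[ucb_action(node)]
            path.append(node)
        if env.done(node.state):
            value = 0.0
        else:
            # Expansion: try one untried action, then evaluate by rollout.
            a = random.choice([x for x in env.actions(node.state)
                               if x not in node.children])
            ns, r, _ = env.step(node.state, a)
            child = Node(ns)
            node.children[a] = child
            path.append(child)
            value = r + gamma * rollout(env, ns, policy)
        # Backup: propagate the estimated return along the visited path.
        for n in path:
            n.visits += 1
            n.value += value
    return ucb_action(root, c=0.0)   # greedy action at the root

random.seed(0)
greedy_policy = lambda s: 1          # stand-in for a trained NSGS-like policy
print(mcts(ToyMDP(), 0, greedy_policy))  # → 1, the delayed-reward action
```

In this toy MDP the distractor action gives an immediate 0.1 while the chain of action 1 leads to a delayed 1.0, loosely mirroring the distractor-versus-delayed-reward situations in Figures 5 and 6; guided by the policy rollouts, the search settles on the delayed-reward action.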
This result implies that NSGS can also be used to improve simulation-based planning methods by effectively reducing the search space.

6 Conclusion
We introduced the subtask graph execution problem, an effective and principled framework for describing complex tasks. To address the difficulty of dealing with complex subtask dependencies, we proposed a graph reward propagation policy derived from a differentiable form of the subtask graph, which plays an important role in pre-training our neural subtask graph solver architecture. The empirical results showed that our agent can deal with long-term dependencies between subtasks and generalize well to unseen subtask graphs. In addition, we showed that our agent can be used to effectively reduce the search space of MCTS so that the agent can find a near-optimal solution with a small number of simulations. In this paper, we assumed that the subtask graph (e.g., subtask dependencies and rewards) is given to the agent. However, it will be very interesting future work to investigate how to extend our approach to more challenging scenarios where the subtask graph is unknown (or partially known) and thus needs to be estimated through experience.

Acknowledgments

This work was supported mainly by the ICT R&D program of MSIP/IITP (2016-0-00563: Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion) and partially by the DARPA Explainable AI (XAI) program #313498 and a Sloan Research Fellowship.

References
[1] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. ICML, 2017.
[2] Haonan Yu, Haichao Zhang, and Wei Xu. A deep compositional framework for human-like language acquisition in virtual environment. arXiv:1703.09831, 2017.
[3] Misha Denil, Sergio Gómez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas. Programmable agents.
arXiv:1706.06383, 2017.
[4] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. ICML, 2017.
[5] Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. arXiv:1611.01855, 2016.
[6] Ronald Parr and Stuart J. Russell. Reinforcement learning with hierarchies of machines. NIPS, 1997.
[7] David Andre and Stuart J. Russell. Programmable reinforcement learning agents. NIPS, 2000.
[8] David Andre and Stuart J. Russell. State abstraction for programmable reinforcement learning agents. AAAI/IAAI, 2002.
[9] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.
[10] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. JAIR, 2000.
[11] Doina Precup. Temporal abstraction in reinforcement learning. PhD thesis, 2000.
[12] Mohammad Ghavamzadeh and Sridhar Mahadevan. Hierarchical policy gradient algorithms. ICML, 2003.
[13] George Konidaris and Andrew G. Barto. Building portable options: Skill transfer in reinforcement learning. IJCAI, 2007.
[14] Earl D Sacerdoti. The nonlinear nature of plans. Technical report, Stanford Research Institute, Menlo Park, CA, 1975.
[15] Kutluhan Erol. Hierarchical task network planning: formalization, analysis, and implementation. PhD thesis, 1996.
[16] Kutluhan Erol, James A Hendler, and Dana S Nau. UMCP: A sound and complete procedure for hierarchical task-network planning. AIPS, 1994.
[17] Dana Nau, Yue Cao, Amnon Lotem, and Hector Munoz-Avila. SHOP: Simple hierarchical ordered planner. IJCAI, 1999.
[18] Luis Castillo, Juan Fdez-Olivares, Óscar García-Pérez, and Francisco Palao. Temporal enhancements of an HTN planner.
CAEPIA, 2005.
[19] Takao Asano, Tetsuo Asano, Leonidas Guibas, John Hershberger, and Hiroshi Imai. Visibility-polygon search and Euclidean shortest paths. FOCS, 1985.
[20] John Canny. A Voronoi method for the piano-movers problem. ICRA, 1985.
[21] John Canny. A new algebraic method for robot motion planning and real geometry. FOCS, 1987.
[22] Bernard Faverjon and Pierre Tournassoud. A local based approach for path planning of manipulators with a high number of degrees of freedom. ICRA, 1987.
[23] J Mark Keil and Jorg-R Sack. Minimum decompositions of polygonal objects. Machine Intelligence and Pattern Recognition, 1985.
[24] Bruno Da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. arXiv:1206.6398, 2012.
[25] Martin Stolle and Doina Precup. Learning options in reinforcement learning. ISARA, 2002.
[26] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv:1511.06295, 2015.
[27] Emilio Parisotto, Jimmy Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv, 2015.
[28] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438, 2015.
[29] Sainbayar Sukhbaatar, Arthur Szlam, Gabriel Synnaeve, Soumith Chintala, and Rob Fergus. MazeBase: A sandbox for learning from games. arXiv:1511.07401, 2015.
[30] Mitchell Keith Bloch. Hierarchical reinforcement learning in the taxicab domain. Technical report, Center for Cognitive Architecture, University of Michigan, 2009.
[31] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.
Machine Learning, 2002.
[32] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.