{"title": "Learning from Trajectories via Subgoal Discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 8411, "page_last": 8421, "abstract": "Learning to solve complex goal-oriented tasks with sparse terminal-only rewards often requires an enormous number of samples. In such cases, using a set of expert trajectories could help to learn faster. However, Imitation Learning (IL) via supervised pre-training with these trajectories may not perform as well and generally requires additional finetuning with expert-in-the-loop. In this paper, we propose an approach which uses the expert trajectories and learns to decompose the complex main task into smaller sub-goals. We learn a function which partitions the state-space into sub-goals, which can then be used to design an extrinsic reward function. We follow a strategy where the agent first learns from the trajectories using IL and then switches to Reinforcement Learning (RL) using the identified sub-goals, to alleviate the errors in the IL step. To deal with states which are under-represented by the trajectory set, we also learn a function to modulate the sub-goal predictions. We show that our method is able to solve complex goal-oriented tasks, which other RL, IL or their combinations in literature are not able to solve.", "full_text": "Learning from Trajectories via Subgoal Discovery\n\nSujoy Paul1\n\nsupaul@ece.ucr.edu\n\nJeroen van Baar2\njeroen@merl.com\n\nAmit K. Roy-Chowdhury1\n\namitrc@ece.ucr.edu\n\n1University of California-Riverside\n\n2Mitsubishi Electric Research Laboratories (MERL)\n\nAbstract\n\nLearning to solve complex goal-oriented tasks with sparse terminal-only rewards\noften requires an enormous number of samples. In such cases, using a set of\nexpert trajectories could help to learn faster. 
However, Imitation Learning (IL) via supervised pre-training with these trajectories may not perform as well and generally requires additional finetuning with an expert-in-the-loop. In this paper, we propose an approach which uses the expert trajectories and learns to decompose the complex main task into smaller sub-goals. We learn a function which partitions the state-space into sub-goals, which can then be used to design an extrinsic reward function. We follow a strategy where the agent first learns from the trajectories using IL and then switches to Reinforcement Learning (RL) using the identified sub-goals, to alleviate the errors in the IL step. To deal with states which are under-represented by the trajectory set, we also learn a function to modulate the sub-goal predictions. We show that our method is able to solve complex goal-oriented tasks which other RL, IL, or their combinations in the literature are not able to solve.\n\n1 Introduction\n\nReinforcement Learning (RL) aims to take sequential actions, by interacting with an environment, so as to maximize a certain pre-specified reward function designed for the purpose of solving a task. RL using Deep Neural Networks (DNNs) has shown tremendous success in several tasks such as playing games [1, 2] and solving complex robotics tasks [3, 4]. However, with sparse rewards, these algorithms often require a huge number of interactions with the environment, which is costly in real-world applications such as self-driving cars [5] and manipulation using real robots [3]. Manually designed dense reward functions could mitigate such issues; however, in general, it is difficult to design detailed reward functions for complex real-world tasks.\nImitation Learning (IL) using trajectories generated by an expert can potentially be used to learn the policies faster [6]. 
But the performance of IL algorithms [7] depends not only on the performance of the expert providing the trajectories, but also on the state-space distribution represented by the trajectories, especially in the case of high-dimensional states. In order to avoid such dependencies on the expert, some methods proposed in the literature [8, 9] take the path of combining RL and IL. However, these methods assume access to the expert value function, which may be impractical in real-world scenarios.\nIn this paper, we follow a strategy which starts with IL and then switches to RL. In the IL step, our framework performs supervised pre-training, which aims at learning a policy that best describes the expert trajectories. However, due to the limited availability of expert trajectories, the policy trained with IL will have errors, which can then be alleviated using RL. Similar approaches are taken in [9] and [10], where the authors show that supervised pre-training does help to speed up learning. However, note that the reward function in RL is still sparse, making it difficult to learn. With this in mind, we pose the following question: can we make more efficient use of the expert trajectories, instead of just supervised pre-training?\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nGiven a set of trajectories, humans can quickly identify waypoints which need to be completed in order to achieve the goal. We tend to break down the entire complex task into sub-goals and try to achieve them in the best order possible. Prior human knowledge helps to achieve tasks much faster [11, 12] than using only the trajectories for learning. The human psychology of divide-and-conquer has been crucial in several applications, and it serves as a motivation behind our algorithm, which learns to partition the state-space into sub-goals using expert trajectories. 
The learned sub-goals provide a discrete reward signal, unlike value-based continuous rewards [13, 14], which can be erroneous, especially with a limited number of trajectories in long-horizon tasks. As the expert trajectory set may not contain all the states that the agent may visit during exploration in the RL step, we augment the sub-goal predictor via one-class classification to deal with such under-represented states. We perform experiments on three goal-oriented tasks on MuJoCo [15] with sparse terminal-only reward, which state-of-the-art RL, IL, or their combinations are not able to solve.\n\n2 Related Works\n\nOur work is closely related to learning from demonstrations or expert trajectories, as well as discovering sub-goals in complex tasks. We first discuss works on imitation learning using expert trajectories or reward-to-go. We also discuss the methods which aim to discover sub-goals in an online manner during the RL stage from past experience.\nImitation Learning. Imitation Learning [16, 17, 18, 19, 20] uses a set of expert trajectories or demonstrations to guide the policy learning process. A naive approach to use such trajectories is to train a policy in a supervised learning manner. However, such a policy would probably produce errors which grow quadratically with increasing steps. This can be alleviated using Behavioral Cloning (BC) algorithms [7, 21, 22], which query expert actions at states visited by the agent, after the initial supervised learning phase. However, such query actions may be costly or difficult to obtain in many applications. Trajectories are also used by [23] to guide the policy search, with the main goal of optimizing the return of the policy rather than mimicking the expert. Recently, some works [8, 24, 14] aim to combine IL with RL by assuming access to the expert's reward-to-go at the states visited by the RL agent. 
The authors of [9] take a moderately different approach where they switch from IL to RL, and show that randomizing the switch point can help to learn faster. The authors in [25] use demonstration trajectories to perform skill segmentation in an Inverse Reinforcement Learning (IRL) framework. The authors in [26] also perform expert trajectory segmentation, but do not show results on learning the task, which is our main goal. SWIRL [27] makes certain assumptions on the expert trajectories to learn the reward function, and the method depends on the discriminability of the state features, which we, on the other hand, learn end-to-end.\nLearning with Options. Discovering and learning options has been studied in the literature [28, 29, 30] and can be used to speed up the policy learning process. [31] developed a framework for planning based on options in a hierarchical manner, such that low-level options can be used to build higher-level options. [32] propose to learn a set of options, or skills, by augmenting the state space with a latent categorical skill vector. A separate network is then trained to learn a policy over options. The Option-Critic architecture [33] developed a gradient-based framework to learn the options along with learning the policy. This framework is extended in [34] to handle a hierarchy of options. [35] proposed a framework where goals are generated using Generative Adversarial Networks (GANs) in a curriculum learning manner with increasingly difficult goals. Researchers have shown that an important way of identifying sub-goals in several tasks is identifying bottleneck regions. Diverse Density [36], Relative Novelty [37], Graph Partitioning [38], and clustering [39] can be used to identify such sub-goals. 
However, unlike our method, these algorithms do not use a set of expert trajectories, and would thus still find it difficult to identify useful sub-goals for complex tasks.\n\n3 Methodology\n\nWe first provide a formal definition of the problem we are addressing in this paper, followed by a brief overall methodology, and then present a detailed description of our framework.\n\nFigure 1: (a) This shows an overview of our proposed framework to train the policy network along with the sub-goal based reward function with out-of-set augmentation. (b) An example state-partition with two independent trajectories in black and red. Note that the terminal state is shown as a separate state partition because we assume it to be indicated by the environment and not learned.\n\nProblem Definition. Consider a standard RL setting where an agent interacts with an environment which can be modeled by a Markov Decision Process (MDP) M = (S, A, P, r, γ, P0), where S is the set of states, A is the set of actions, r is a scalar reward function, γ ∈ [0, 1] is the discount factor and P0 is the initial state distribution. Our goal is to learn a policy πθ(a|s), with a ∈ A, which optimizes the expected discounted reward E_τ[Σ_{t=0}^{∞} γ^t r(st, at)], where τ = (. . . , st, at, rt, . . . ) and s0 ∼ P0, at ∼ πθ(a|st) and st+1 ∼ P(st+1|st, at).\nWith sparse rewards, optimizing the expected discounted reward using RL may be difficult. In such cases, it may be beneficial to use a set of state-action trajectories D = {{(s_ti, a*_ti)}_{t=1}^{n_i}}_{i=1}^{n_d} generated by an expert to guide the learning process. nd is the number of trajectories in the dataset and ni is the length of the ith trajectory. We propose a methodology to efficiently use D by discovering sub-goals from these trajectories and using them to develop an extrinsic reward function.\nOverall Methodology. 
Several complex, goal-oriented, real-world tasks can often be broken down into sub-goals with some natural ordering. Providing positive rewards after completing these sub-goals can help to learn much faster compared to sparse, terminal-only rewards. In this paper, we advocate that such sub-goals can be learned directly from a set of expert demonstration trajectories, rather than manually designing them.\nA pictorial description of our method is presented in Fig. 1a. We use the set D to first train a policy by applying supervised learning. This serves as a good initial point for policy search using RL. However, with sparse rewards, the search can still be difficult, and the network may forget the parameters learned in the first step if it does not receive sufficiently useful rewards. To avoid this, we use D to learn a function πφ(g|s), which, given a state, predicts sub-goals. We use this function to obtain a new reward function, which intuitively informs the RL agent whenever it moves from one sub-goal to another. We also learn a utility function uψ(s) to modulate the sub-goal predictions over the states which are not well-represented in the set D. We approximate the functions πθ, πφ, and uψ using neural networks. We next describe our meaning of sub-goals, followed by an algorithm to learn them.\n\n3.1 Sub-goal Definition\nDefinition 1. Consider that the state-space S is partitioned into sets of states {S1, S2, . . . , Sng}, s.t. S = ∪_{i=1}^{ng} Si and ∩_{i=1}^{ng} Si = ∅, where ng is the number of sub-goals specified by the user. For each (s, a, s′), we say that the particular action takes the agent from one sub-goal to another iff s ∈ Si, s′ ∈ Sj for some i, j ∈ G = {1, 2, . . . , ng} and i ≠ j.\nWe assume that there is an ordering in which groups of states appear in the trajectories, as shown in Fig. 1b. 
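To make Definition 1 concrete, here is a minimal, hedged sketch; the `partition` mapping below is a hypothetical stand-in for any state-to-sub-goal assignment (in the paper this mapping is learned), and the 1-D states are purely illustrative:

```python
# Hedged sketch of Definition 1. `partition` is a hypothetical stand-in for
# any state -> sub-goal index mapping (in the paper this mapping is learned).

def crosses_subgoal(partition, s, s_next):
    """True iff s is in S_i and s' is in S_j with i != j (Definition 1)."""
    return partition(s) != partition(s_next)

# Toy 1-D example: states below 0.5 belong to sub-goal 1, the rest to sub-goal 2.
toy_partition = lambda s: 1 if s < 0.5 else 2

assert crosses_subgoal(toy_partition, 0.2, 0.7)      # moved from S_1 to S_2
assert not crosses_subgoal(toy_partition, 0.1, 0.3)  # stayed inside S_1
```

Such a crossing event is what the extrinsic reward described later responds to.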
However, the states within these groups may appear in any random order in the trajectories. These groups of states are not defined a priori, and our algorithm aims at estimating these partitions. Note that such orderings are natural in several real-world applications where a certain sub-goal can only be reached after completing one or more previous sub-goals. We show (empirically in the supplementary) that our assumption is soft rather than strict, i.e., the degree by which the trajectories deviate from the assumption determines the granularity of the discovered sub-goals. We may consider that states in the trajectories of D appear in increasing order of sub-goal indices, i.e., achieving sub-goal j is harder than achieving sub-goal i (i < j). This gives us a natural way of defining an extrinsic reward function, which helps towards faster policy search. Also, all the trajectories in D should start from the initial state distribution and end at the terminal states.\n\n3.2 Learning Sub-Goal Prediction\nWe use D to partition the state-space into ng sub-goals, with ng being a hyperparameter. We learn a neural network to approximate πφ(g|s), which, given a state s ∈ S, predicts a probability mass function (p.m.f.) over the possible sub-goal partitions g ∈ G. 
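The labeling scheme developed in this subsection (equipartition initialization followed by an order-enforcing relabeling via dynamic time warping) can be sketched roughly as follows. This is an illustrative, 0-indexed simplification with a plain dynamic-programming DTW, not the paper's implementation; all names are hypothetical:

```python
# Illustrative sketch of the two labeling ingredients in this subsection:
# (i) equipartition initialization of sub-goal labels along a trajectory
#     (cf. Eqn. 2, 0-indexed here), and
# (ii) an order-enforcing relabeling that DTW-aligns the n_g standard basis
#      vectors to the predicted per-state p.m.f.s under an l1 cost.

def equipartition_labels(n_states, n_goals):
    """Initial labels: split the trajectory into n_goals equal chunks."""
    return [min(t * n_goals // n_states, n_goals - 1) for t in range(n_states)]

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def dtw_relabel(pmfs, n_goals):
    """DTW-align e_1..e_ng to the p.m.f. sequence; the warping path yields a
    monotone (order-respecting) sub-goal label for every state."""
    basis = [[1.0 if j == g else 0.0 for j in range(n_goals)]
             for g in range(n_goals)]
    T = len(pmfs)
    INF = float("inf")
    cost = [[INF] * (T + 1) for _ in range(n_goals + 1)]
    cost[0][0] = 0.0
    for g in range(1, n_goals + 1):
        for t in range(1, T + 1):
            c = l1(basis[g - 1], pmfs[t - 1])
            cost[g][t] = c + min(cost[g - 1][t - 1], cost[g - 1][t], cost[g][t - 1])
    labels = [0] * T
    g, t = n_goals, T
    while t > 0:  # backtrack: record which basis vector each state matched
        labels[t - 1] = g - 1
        _, g, t = min((cost[g - 1][t - 1], g - 1, t - 1),
                      (cost[g][t - 1], g, t - 1),
                      (cost[g - 1][t], g - 1, t))
    return labels
```

In the iterative scheme, one would train πφ on the current labels, predict fresh p.m.f.s, relabel with `dtw_relabel`, and repeat until the labels stop changing.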
The order in which the sub-goals occur in the trajectories, i.e., S1 < S2 < · · · < Sng, acts as a supervisory signal, which can be derived from our assumption mentioned above.\nWe propose an iterative framework to learn πφ(g|s) using these ordering constraints. In the first step, we learn a mapping from states to sub-goals using equipartition labels among the sub-goals. Then we infer the labels of the states in the trajectories and correct them by imposing ordering constraints. We use the new labels to again train the network and follow the same procedure until convergence. These two steps are as follows.\nLearning Step. In this step we consider that we have a set of tuples (s, g), which we use to learn the function πφ; this can be posed as a multi-class classification problem with ng categories. We optimize the following cross-entropy loss function,\n\nπφ* = arg min_{πφ} (1/N) Σ_{i=1}^{nd} Σ_{t=1}^{ni} Σ_{k=1}^{ng} −1{gti = k} log πφ(g = k|sti)   (1)\n\nwhere 1 is the indicator function and N is the number of states in the dataset D. To begin with, we do not have any labels g, and thus we consider an equipartition of all the sub-goals in G along each trajectory. That is, given a trajectory of states {s1i, s2i, . . . , snii} for some i ∈ {1, 2, . . . , nd}, the initial sub-goals are\n\ngti = j, ∀ ⌊(j − 1)ni/ng⌋ < t ≤ ⌊j·ni/ng⌋, j ∈ G   (2)\n\nUsing this initial labeling scheme, similar states across trajectories may have different labels, but the network is expected to converge at the Maximum Likelihood Estimate (MLE) of the entire dataset. We also optimize CASL [40] for stable learning, as the initial labels can be erroneous. In the next iteration of the learning step, we use the inferred sub-goal labels, which we obtain as follows.\nInference Step. 
Although the equipartition labels in Eqn. 2 may map similar states across different trajectories to dissimilar sub-goals, the learned network modeling πφ maps similar states to the same sub-goal. However, Eqn. 1, and thus the predictions of πφ, do not account for the natural temporal ordering of the sub-goals. Even with architectures such as Recurrent Neural Networks (RNNs), it may be better to impose such temporal order constraints explicitly rather than relying on the network to learn them. We inject such order constraints using Dynamic Time Warping (DTW).\nFormally, for the ith trajectory in D, we obtain the following set: {(sti, πφ(g|sti))}_{t=1}^{ni}, where πφ is a vector representing the p.m.f. over the sub-goals G. However, as the predictions do not consider temporal ordering, the constraint that sub-goal j occurs after sub-goal i, for i < j, is not preserved. To impose such constraints, we use DTW between the two sequences {e1, e2, . . . , eng}, which are the standard basis vectors in the ng-dimensional Euclidean space, and {πφ(g|s1i), πφ(g|s2i), . . . , πφ(g|snii)}. We use the l1-norm of the difference between two vectors as the similarity measure in DTW. In this process, we obtain a sub-goal assignment for each state in the trajectories, which becomes the new label for training in the learning step.\nWe then invoke the learning step using the new labels (instead of Eqn. 2), followed by the inference step to obtain the next sub-goal labels. We continue this process until the number of sub-goal labels changed between iterations is less than a certain threshold. This method is presented in Algorithm 1, where the superscript k represents the iteration number in the learning-inference alternation.\nReward Using Sub-Goals. 
The ordering of the sub-goals, as discussed before, provides a natural way of designing a reward function as follows:\n\nr′(s, a, s′) = γ · arg max_{j∈G} πφ(g = j|s′) − arg max_{k∈G} πφ(g = k|s)   (3)\n\nwhere the agent in state s takes action a and reaches state s′. The augmented reward function then becomes r + r′. Considering that we have a function of the form Φφ(s) = arg max_{j∈G} πφ(g = j|s), and without loss of generality that G = {0, 1, . . . , ng − 1}, so that for the initial state Φφ(s0) = 0, it follows from [13] that every optimal policy in M′ = (S, A, P, r + r′, γ, P0) will also be optimal in M. However, the new reward function may help to learn the task faster.\nOut-of-Set Augmentation. In several applications, it might be the case that the trajectories only cover a small subset of the state space, while the agent, during the RL step, may visit states outside of the states in D. The sub-goals estimated at these out-of-set states may be erroneous. To alleviate this problem, we impose on the potential function Φφ(s) the assertion that the sub-goal predictor is confident only for states which are well-represented in D, and not elsewhere. We learn a neural network to model a utility function uψ : S → R, which, given a state, predicts the degree to which it is seen in the dataset D. To do this, we build upon Deep One-Class Classification [41], which performs well on the task of anomaly detection. The idea is derived from Support Vector Data Description (SVDD) [42], which aims to find the smallest hypersphere enclosing the given data points with minimum error. Data points outside the sphere are then deemed anomalous. 
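As a rough illustration of this one-class idea (a hedged sketch with hypothetical names, simplified from the objective optimized next): the training signal drives embeddings of in-set states toward a fixed center c, so that squared distance to c can later score how well-represented a state is.

```python
# Hedged sketch of a deep-SVDD-style objective: mean squared distance of the
# embedded states to a fixed center, plus a weight-decay term. `embeddings`
# stands in for f_psi(s) over the dataset; all names are illustrative.

def svdd_loss(embeddings, center, weight_sq_norm, lam):
    """(1/N) * sum_i ||f_psi(s_i) - c||^2 + lam * ||psi||^2."""
    n = len(embeddings)
    dist = sum(sum((a - b) ** 2 for a, b in zip(e, center)) for e in embeddings)
    return dist / n + lam * weight_sq_norm

# Two unit-norm embeddings around the origin give a mean squared distance of 1.
assert svdd_loss([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], 0.0, 0.0) == 1.0
```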
We learn the parameters of uψ by optimizing the following function:\n\nψ* = arg min_ψ (1/N) Σ_{i=1}^{nd} Σ_{t=1}^{ni} ||fψ(sti) − c||^2_2 + λ||ψ||^2_2,\n\nwhere c ∈ R^m is a vector determined a priori [41] and f is modeled by a neural network with parameters ψ, s.t. fψ(s) ∈ R^m. The second part is the l2 regularization loss with all the parameters of the network lumped into ψ. The utility function uψ can be expressed as follows:\n\nuψ(s) = ||fψ(s) − c||^2_2   (4)\n\nA lower value of uψ(s) indicates that the state has been seen in D. We modify the potential function Φφ(s), and thus the extrinsic reward function, to incorporate the utility score as follows:\n\nΦφ,ψ(s) = 1{uψ(s) ≤ δ} · arg max_{j∈G} πφ(g = j|s),\nr′(s, a, s′) = γΦφ,ψ(s′) − Φφ,ψ(s),   (5)\n\nwhere Φφ,ψ denotes the modified potential function. It may be noted that, as the extrinsic reward function is still a potential-based function [13], the optimality conditions between the MDPs M and M′ still hold, as discussed previously.\n\nAlgorithm 1 Learning Sub-Goal Prediction\nInput: Expert trajectory set D\nOutput: Sub-goal predictor πφ(g|s)\nk ← 0\nObtain g^k for each s ∈ D using Eqn. 2\nrepeat\n  Optimize Eqn. 1 to obtain πφ^k\n  Predict the p.m.f. over G for each s ∈ D using πφ^k\n  Obtain new sub-goals g^{k+1} using the p.m.f. in DTW\n  done = True if |g^k − g^{k+1}| < ε, else False\n  k ← k + 1\nuntil done is True\n\nSupervised Pre-Training. We first pre-train the policy network using the trajectories D (details in the supplementary). The performance of the pre-trained policy network is generally quite poor and is upper bounded by the performance of the expert from which the trajectories are drawn. 
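The utility-gated shaping of Eqn. 5 above can be sketched minimally as follows; `subgoal_pmf`, `f_psi`, `center` and `delta` are hypothetical stand-ins for πφ(g|·), the one-class network, the vector c, and the utility threshold, not the paper's implementation:

```python
# Hedged sketch of the utility-gated potential and shaped reward of Eqn. 5.
# All function names and parameters are illustrative stand-ins.

def utility(f_psi, center, s):
    """u_psi(s) = ||f_psi(s) - c||^2: low for states well-represented in D."""
    return sum((a - b) ** 2 for a, b in zip(f_psi(s), center))

def gated_potential(subgoal_pmf, f_psi, center, delta, s):
    """Phi(s) = 1{u_psi(s) <= delta} * argmax_j pi_phi(g = j | s)."""
    if utility(f_psi, center, s) > delta:
        return 0  # out-of-set state: contributes no sub-goal potential
    pmf = subgoal_pmf(s)
    return max(range(len(pmf)), key=pmf.__getitem__)

def shaped_reward(subgoal_pmf, f_psi, center, delta, gamma, s, s_next):
    """r'(s, a, s') = gamma * Phi(s') - Phi(s); potential-based shaping [13]."""
    phi = lambda x: gated_potential(subgoal_pmf, f_psi, center, delta, x)
    return gamma * phi(s_next) - phi(s)

# Toy example: 1-D states; sub-goal 0 below 0.5, sub-goal 1 above.
pmf = lambda s: [0.9, 0.1] if s < 0.5 else [0.1, 0.9]
f = lambda s: [s]  # toy embedding with center at the origin
assert shaped_reward(pmf, f, [0.0], 1.0, 1.0, 0.2, 0.7) == 1  # crossed 0 -> 1
assert gated_potential(pmf, f, [0.0], 1.0, 2.0) == 0          # far state gated out
```

Because the shaping is a difference of potentials, adding it to the sparse environment reward leaves the set of optimal policies unchanged, as noted above.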
We then employ RL, starting from the pre-trained policy, to learn from the sub-goal based reward function. Unlike standard imitation learning algorithms, e.g., DAgger, which finetune the pre-trained policy with the expert in the loop, our algorithm only uses the initial set of expert trajectories and does not invoke the expert otherwise.\n\n4 Experiments\n\nIn this section, we perform an experimental evaluation of the proposed method of learning from trajectories and compare it with other state-of-the-art methods. We also perform ablations of the different modules of our framework.\n\nFigure 2: This figure presents the three environments used in this paper: (a) Ball-in-Maze Game (BiMGame), (b) Ant locomotion in an open environment with an end goal (AntTarget), (c) Ant locomotion in a maze with an end goal (AntMaze).\n\nFigure 3: This figure shows the comparison of our proposed method with the baselines on (a) BiMGame, (b) AntTarget and (c) AntMaze. Some lines may not be visible as they overlap. For tasks (a) and (c) our method clearly outperforms the others. For task (b), although the value reward initially performs better, our method eventually achieves the same performance. For a fair comparison, we do not use the out-of-set augmentation to generate this plot.\n\nTasks. We perform experiments on three challenging environments, as shown in Fig. 2. The first is the Ball-in-Maze Game (BiMGame) introduced in [43], where the task is to move a ball from the outermost to the innermost ring using a set of five discrete actions: clockwise and anti-clockwise rotation by 1° along the two principal dimensions of the board, and “no-op”, where the current orientation of the board is maintained. The states are images of size 84 × 84. The second environment is AntTarget, which involves the Ant [44]. 
The task is to reach the center of a circle of radius 5m, with the Ant initialized on a 45° arc of the circle. The state and action spaces are continuous with 41 and 8 dimensions respectively. The third environment, AntMaze, uses the same Ant, but in the U-shaped maze used in [35]. The Ant is initialized at one end of the maze, with the goal being the other end, indicated in red in Fig. 2c. Details about the network architectures we use for πθ, πφ and fψ(s) can be found in the supplementary material.\nReward. For all tasks, we use a sparse terminal-only reward, i.e., +1 only after reaching the goal state and 0 otherwise. Standard RL methods such as A3C [45] are not able to solve these tasks with such sparse rewards.\nTrajectory Generation. We generate trajectories from A3C [45] policies trained with dense rewards, which we do not use in any other experiments. We also generate sub-optimal trajectories for BiMGame and AntMaze. To do so for BiMGame, we use the simulator via Model Predictive Control (MPC) as in [46] (details in the supplementary). For AntMaze, we generate sub-optimal trajectories from an A3C policy stopped much before convergence. We generate around 400 trajectories for BiMGame and AntMaze, and 250 for AntTarget. As we generate two separate sets of trajectories for BiMGame and AntMaze, we use the sub-optimal set for all experiments, unless otherwise mentioned.\nBaselines. We primarily compare our method with RL methods which utilize trajectory or expert information: AggreVaTeD [8] and value-based reward shaping [13], equivalent to K = ∞ in THOR [14]. For these methods, we use D to fit a value function to the sparse terminal-only reward of the original MDP M and use it as the expert value function. We also compare with standard A3C, but pre-trained using D. It may be noted that we pre-train all the methods using the trajectory set to have a fair comparison. 
We report results as the mean cumulative reward ± σ over 3 independent runs.\nComparison with Baselines. First, we compare our method with the other baselines in Fig. 3. Note that, as out-of-set augmentation using uψ can be applied to other methods which learn from trajectories, such as value-based reward shaping, we present the comparison with the baselines without using uψ, i.e., with Eqn. 3. Later, we perform an ablation study with and without uψ. As may be observed, none of the baselines shows any sign of learning for these tasks, except for ValueReward, which performs comparably with the proposed method on AntTarget only. Our method, on the other hand, is able to learn and solve the tasks consistently over multiple runs.\n\nFigure 4: This plot presents the learning curves associated with different numbers of learned sub-goals for (a) BiMGame, (b) AntTarget and (c) AntMaze. For BiMGame and AntTarget, the number of sub-goals hardly matters. However, due to the inherently longer length of the AntMaze task, a lower number of sub-goals such as ng = 5 performs much worse than higher ng.\n\nFigure 5: This plot presents the comparison of our proposed method with and without using the one-class classification method for out-of-set augmentation, on (a) BiMGame and (b) AntTarget.\n\nThe expert cumulative rewards are also drawn as straight lines in the plots, and imitation learning methods like DAgger [7] can only reach that mark. Our method is able to surpass the expert for all the tasks. 
In fact, for AntMaze, even with a rather sub-optimal expert (an average cumulative reward of only 0.0002), our algorithm achieves about 0.012 cumulative reward at 100 million steps.\nThe poor performance of ValueReward and AggreVaTeD can be attributed to the imperfect value function learned with a limited number of trajectories. Specifically, with an increase in trajectory length, the variations in cumulative reward at the initial set of states are quite high. This introduces a considerable amount of error in the estimated value function at the initial states, which in turn traps the agent in some local optimum when such value functions are used to guide the learning process.\nVariations in Sub-Goals. The number of sub-goals ng is specified by the user, based on domain knowledge. For example, in BiMGame, the task has four bottlenecks, which are states that must be visited to complete the task, and they can be considered as sub-goals. We perform experiments with different numbers of sub-goals and present the plots in Fig. 4. It may be observed that for BiMGame and AntTarget, our method performs well over a large range of sub-goal counts. On the other hand, for AntMaze, as the length of the task is much longer than AntTarget (12m vs 5m), ng ≥ 10 learns much faster than ng = 5, as a higher number of sub-goals provides more frequent rewards. Note that the variation in learning speed with the number of sub-goals also depends on the number of expert trajectories. If the pre-training is good, then less frequent sub-goals might work fine, whereas if we have a small number of expert trajectories, the RL agent may need more frequent rewards (see the supplementary material for more experiments).\nEffect of Out-of-Set Augmentation. The set D may not cover the entire state-space. To deal with this situation, we developed the extrinsic reward function in Eqn. 5 using uψ. To evaluate its effectiveness, we execute our algorithm using Eqn. 3 and Eqn. 
5, and show the results in Fig. 5, with the legends indicating without and with uψ respectively. For BiMGame, we used the optimal A3C trajectories for this evaluation. This is because using MPC trajectories with Eqn. 3 can still solve the task with similar reward plots, since MPC trajectories visit many more states due to their short-term planning. The (optimal) A3C trajectories, on the other hand, rarely visit some states, due to their long-term planning. In this case, using Eqn. 3 actually traps the agent in a local optimum (in the outermost ring), whereas using uψ as in Eqn. 5 learns to solve the task consistently (Fig. 5a).\nFor AntTarget in Fig. 5b, using uψ performs better than not using uψ (and also surpasses value-based reward shaping). This is because the trajectories only span a small sector of the circle (Fig. 7b), while the Ant is allowed to visit states outside of it in the RL step. Thus, uψ avoids incorrect sub-goal assignments to states not well-represented in D and helps the overall learning.\n\nFigure 6: This plot presents a comparison of our proposed method for two different types of expert trajectories. The corresponding expert rewards are also plotted as horizontal lines.\n\nFigure 7: This figure presents the learned sub-goals for the three tasks, which are color coded. 
Note that for (b) and (c), multiple sub-goals are assigned the same color, but they can be distinguished by their spatial locations.

[Figure 6 panels: (a) BiMGame, (b) AntMaze. Figure 7 panels: (a) BiMGame, ng = 4; (b) AntTarget, ng = 10; (c) AntMaze, ng = 15.]

Effect of Sub-Optimal Expert. In general, the optimality of the expert may affect performance. A comparison of our algorithm with optimal vs. sub-optimal expert trajectories is shown in Fig. 6. As may be observed, the learning curves for both tasks are better with the optimal expert trajectories. However, in spite of using sub-optimal experts, our method is able to surpass and perform much better than those experts. We also see that our method performs better than even the optimal expert used in AntMaze (as it is only optimal w.r.t. some cost function).

Visualization. We visualize the sub-goals discovered by our algorithm on the x-y plane in Fig. 7. As can be seen for BiMGame, with 4 sub-goals, our method is able to discover the bottleneck regions of the board as distinct sub-goals. For AntTarget and AntMaze, the path to the goal is more or less equally divided into sub-goals. This shows that our method of sub-goal discovery can work for environments both with and without bottleneck regions (see supplementary for more visualizations).

5 Discussions

The experimental analysis presented in the previous section contains the following key observations:
• Our method for sub-goal discovery works both for tasks with inherent bottlenecks (e.g.
AntTarget and AntMaze), but with temporal orderings between groups of states in the expert trajectories, which is the case for many applications.
• Experiments show that our assumption on the temporal ordering of groups of states in expert trajectories is soft, and determines the granularity of the discovered sub-goals (see supplementary).
• Discrete rewards based on sub-goals perform much better than value-function-based continuous rewards. Moreover, value functions learned from long and limited numbers of trajectories may be erroneous, whereas segmenting the trajectories based on temporal ordering may still work well.
• As the expert trajectories may not cover all the state-space regions the agent visits during exploration in the RL step, augmenting the sub-goal based reward function with out-of-set augmentation performs better than not using it.

6 Conclusion

In this paper, we presented a framework to utilize demonstration trajectories in an efficient manner by discovering sub-goals, which are waypoints that need to be completed in order to achieve a certain complex goal-oriented task. We use these sub-goals to augment the reward function of the task, without affecting the optimality of the learned policy. Experiments on three complex tasks show that, unlike state-of-the-art RL, IL, or methods which combine them, our method is able to solve the tasks consistently. We also show that our method is able to perform much better than the sub-optimal experts used to obtain the expert trajectories, and at least as well as the optimal experts. Our future work will concentrate on extending our method to repetitive non-goal-oriented tasks.

Acknowledgement. This work was partially supported by US NSF grant 1724341 and Mitsubishi Electric Research Labs.

[Figure 6 legends: (a) BiMGame: Proposed w. A3C trajectories, Proposed w. MPC trajectories, A3C Expert, MPC Expert; (b) AntMaze: Ours w. A3C trajectories, Ours w. Sub-Optimal A3C trajectories, A3C Expert, Sub-Optimal A3C Expert.]

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[2] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[3] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, 2016.

[4] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, pages 1329–1338, 2016.

[5] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[6] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[7] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, pages 627–635, 2011.

[8] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction.
In ICML, pages 3309–3318, 2017.

[9] Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and reinforcement. UAI, 2018.

[10] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, pages 7559–7566. IEEE, 2018.

[11] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In ICML, pages 166–175, 2017.

[12] Rachit Dubey, Pulkit Agrawal, Deepak Pathak, Thomas L Griffiths, and Alexei A Efros. Investigating human priors for playing video games. arXiv preprint arXiv:1802.10217, 2018.

[13] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.

[14] Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.

[15] Emanuel Todorov. Convex and analytically-invertible dynamics with contacts and constraints: Theory and implementation in MuJoCo. In ICRA, pages 6054–6061, 2014.

[16] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.

[17] David Silver, James Bagnell, and Anthony Stentz. High performance outdoor navigation from overhead data using imitation learning. RSS, 2008.

[18] Sonia Chernova and Manuela Veloso. Interactive policy learning through confidence-based autonomy. JAIR, 34:1–25, 2009.

[19] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.
RSS, 2017.

[20] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In AAAI, 2018.

[21] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.

[22] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. IJCAI, 2018.

[23] Sergey Levine and Vladlen Koltun. Guided policy search. In ICML, pages 1–9, 2013.

[24] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume III, and John Langford. Learning to search better than your teacher. ICML, 2015.

[25] Pravesh Ranchod, Benjamin Rosman, and George Konidaris. Nonparametric Bayesian reward segmentation for skill discovery using inverse reinforcement learning. In IROS, pages 471–477. IEEE, 2015.

[26] Adithyavairavan Murali, Animesh Garg, Sanjay Krishnan, Florian T Pokorny, Pieter Abbeel, Trevor Darrell, and Ken Goldberg. TSC-DL: Unsupervised trajectory segmentation of multi-modal surgical demonstrations with deep learning. In ICRA, pages 4150–4157, 2016.

[27] Sanjay Krishnan, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller, Florian T Pokorny, and Ken Goldberg. SWIRL: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards. IJRR, 38(2-3):126–145, 2019.

[28] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[29] Doina Precup. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst, 2000.

[30] Martin Stolle and Doina Precup. Learning options in reinforcement learning. In SARA, pages 212–223.
Springer, 2002.

[31] David Silver and Kamil Ciosek. Compositional planning using optimal option models. arXiv preprint arXiv:1206.6473, 2012.

[32] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. ICLR, 2017.

[33] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.

[34] Matthew Riemer, Miao Liu, and Gerald Tesauro. Learning abstract options. NIPS, 2018.

[35] David Held, Xinyang Geng, Carlos Florensa, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. ICML, 2017.

[36] Amy McGovern and Andrew G Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. ICML, 2001.

[37] Özgür Şimşek and Andrew G Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In ICML, page 95. ACM, 2004.

[38] Özgür Şimşek, Alicia P Wolfe, and Andrew G Barto. Identifying useful subgoals in reinforcement learning by local graph partitioning. In ICML, pages 816–823. ACM, 2005.

[39] Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in reinforcement learning via clustering. In ICML, page 71. ACM, 2004.

[40] Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. W-TALC: Weakly-supervised temporal activity localization and classification. In ECCV, pages 563–579, 2018.

[41] Lukas Ruff, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Robert Vandermeulen, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In ICML, pages 4390–4399, 2018.

[42] David MJ Tax and Robert PW Duin. Support vector data description.
Machine Learning, 54(1):45–66, 2004.

[43] Jeroen van Baar, Alan Sullivan, Radu Cordorel, Devesh Jha, Diego Romeres, and Daniel Nikovski. Sim-to-real transfer learning using robustified controllers in robotic tasks involving complex dynamics. ICRA, 2018.

[44] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. ICLR, 2015.

[45] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.

[46] Sujoy Paul and Jeroen van Baar. Trajectory-based learning for ball-in-maze games. arXiv preprint arXiv:1811.11441, 2018.