{"title": "Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 1235, "page_last": 1245, "abstract": "Imitation learning has traditionally been applied to learn a single task from demonstrations thereof. The requirement of structured and isolated demonstrations limits the scalability of imitation learning approaches as they are difficult to apply to real-world scenarios, where robots have to be able to execute a multitude of tasks. In this paper, we propose a multi-modal imitation learning framework that is able to segment and imitate skills from unlabelled and unstructured demonstrations by learning skill segmentation and imitation learning jointly. The extensive simulation results indicate that our method can efficiently separate the demonstrations into individual skills and learn to imitate them using a single multi-modal policy.", "full_text": "Multi-Modal Imitation Learning from Unstructured\nDemonstrations using Generative Adversarial Nets\n\nKarol Hausman\u2217\u2020, Yevgen Chebotar\u2217\u2020\u2021, Stefan Schaal\u2020\u2021, Gaurav Sukhatme\u2020, Joseph J. Lim\u2020\n\n\u2020University of Southern California, Los Angeles, CA, USA\n\n\u2021Max-Planck-Institute for Intelligent Systems, T\u00fcbingen, Germany\n{hausman, ychebota, sschaal, gaurav, limjj}@usc.edu\n\nAbstract\n\nImitation learning has traditionally been applied to learn a single task from demon-\nstrations thereof. 
The requirement of structured and isolated demonstrations\nlimits the scalability of imitation learning approaches as they are dif\ufb01cult to\napply to real-world scenarios, where robots have to be able to execute a mul-\ntitude of tasks.\nIn this paper, we propose a multi-modal imitation learning\nframework that is able to segment and imitate skills from unlabelled and un-\nstructured demonstrations by learning skill segmentation and imitation learning\njointly. The extensive simulation results indicate that our method can ef\ufb01ciently\nseparate the demonstrations into individual skills and learn to imitate them us-\ning a single multi-modal policy. The video of our experiments is available at\nhttp://sites.google.com/view/nips17intentiongan.\n\n1\n\nIntroduction\n\nOne of the key factors to enable deployment of robots in unstructured real-world environments is\ntheir ability to learn from data. In recent years, there have been multiple examples of robot learning\nframeworks that present promising results. These include: reinforcement learning [31] - where a\nrobot learns a skill based on its interaction with the environment and imitation learning [2, 5] - where\na robot is presented with a demonstration of a skill that it should imitate. In this work, we focus on\nthe latter learning setup.\nTraditionally, imitation learning has focused on using isolated demonstrations of a particular skill [29].\nThe demonstration is usually provided in the form of kinesthetic teaching, which requires the user to\nspend suf\ufb01cient time to provide the right training data. This constrained setup for imitation learning\nis dif\ufb01cult to scale to real world scenarios, where robots have to be able to execute a combination\nof different skills. 
To learn these skills, the robots would require a large number of robot-tailored\ndemonstrations, since at least one isolated demonstration has to be provided for every individual skill.\nIn order to improve the scalability of imitation learning, we propose a framework that can learn to\nimitate skills from a set of unstructured and unlabeled demonstrations of various tasks.\nAs a motivating example, consider a highly unstructured data source, e.g. a video of a person cooking\na meal. A complex activity, such as cooking, involves a set of simpler skills such as grasping,\nreaching, cutting, pouring, etc. In order to learn from such data, three components are required: i) the\nability to map the image stream to state-action pairs that can be executed by a robot, ii) the ability to\nsegment the data into simple skills, and iii) the ability to imitate each of the segmented skills. In this\nwork, we tackle the latter two components, leaving the \ufb01rst one for future work. We believe that the\ncapability proposed here of learning from unstructured, unlabeled demonstrations is an important\nstep towards scalable robot learning systems.\n\n\u2217Equal contribution\n\n\fIn this paper, we present a novel imitation learning method that learns a multi-modal stochastic\npolicy, which is able to imitate a number of automatically segmented tasks using a set of unstructured\nand unlabeled demonstrations. Our results indicate that the presented technique can separate the\ndemonstrations into sensible individual skills and imitate these skills using a learned multi-modal\npolicy. We show applications of the presented method to the tasks of skill segmentation, hierarchical\nreinforcement learning and multi-modal policy learning.\n\n2 Related Work\n\nImitation learning is concerned with learning skills from demonstrations. Approaches that are suitable\nfor this setting can be split into two categories: i) behavioral cloning [27], and ii) inverse reinforcement\nlearning (IRL) [24]. 
While behavioral cloning aims at replicating the demonstrations exactly, it suffers from covariate shift [28]. IRL alleviates this problem by learning a reward function that explains the behavior shown in the demonstrations. The majority of IRL works [16, 35, 1, 12, 20] introduce algorithms that can imitate a single skill from demonstrations thereof, but they do not readily generalize to learning a multi-task policy from a set of unstructured demonstrations of various tasks.
More recently, there has been work that tackles a problem similar to the one presented in this paper, where the authors consider a setting with a large set of tasks with many instantiations [10]. In their work, the authors assume a way of communicating a new task through a single demonstration. We follow the idea of segmenting and learning different skills jointly so that learning one skill can accelerate learning to imitate the next skill. In our case, however, the goal is to separate the mix of expert demonstrations into single skills and learn a policy that can imitate all of them, which eliminates the need for new demonstrations at test time.
The method presented here belongs to the field of multi-task inverse reinforcement learning. Examples from this field include [9] and [4]. In [9], the authors present a Bayesian approach to the problem, while the method in [4] is based on an EM approach that clusters observed demonstrations. Both of these methods show promising results on relatively low-dimensional problems, whereas our approach scales well to higher-dimensional domains due to the representational power of neural networks.
There has also been a separate line of work on learning from demonstration that is then iteratively improved through reinforcement learning [17, 6, 23]. 
In contrast, we do not assume access to the expert reward function, which is required to perform reinforcement learning in the later stages of the above algorithms.
There has been much work on the problem of skill segmentation and option discovery for hierarchical tasks. Examples include [25, 19, 14, 33, 13]. In this work, we consider the possibility of discovering different skills that can all start from the same initial state, as opposed to hierarchical reinforcement learning, where the goal is to segment a task into a set of consecutive subtasks. We demonstrate, however, that our method may be used to discover the hierarchical structure of a task similarly to the hierarchical reinforcement learning approaches. In [13], the authors explore similar ideas to discover useful skills. In this work, we apply some of these ideas to the imitation learning setup as opposed to the reinforcement learning scenario.
Generative Adversarial Networks (GANs) [15] have enjoyed success in various domains including image generation [8], image-image translation [34, 18] and video prediction [22]. More recently, there has been work connecting GANs and other reinforcement learning and IRL methods [26, 11, 16]. In this work, we expand on some of the ideas presented in these works and provide a novel framework that exploits this connection.
The works that are most closely related to this paper are [16], [7] and [21]. In [7], the authors show a method that is able to learn disentangled representations and apply it to the problem of image generation. In this work, we provide an alternative derivation of our method that extends their work and applies it to multi-modal policies. In [16], the authors present an imitation learning GAN approach that serves as a basis for the development of our method. 
We provide an extensive evaluation of the approach presented here compared to the work in [16], which shows that our method, as opposed to [16], can handle unstructured demonstrations of different skills. A concurrent work [21] introduces a method similar to ours and applies it to detecting driving styles from unlabelled human data.

3 Preliminaries

Let M = (S, A, P, R, p_0, γ, T) be a finite-horizon Markov Decision Process (MDP), where S and A are state and action spaces, P : S × A × S → R_+ is a state-transition probability function or system dynamics, R : S × A → R a reward function, p_0 : S → R_+ an initial state distribution, γ a reward discount factor, and T a horizon. Let τ = (s_0, a_0, . . . , s_T, a_T) be a trajectory of states and actions and R(τ) = Σ_{t=0}^{T} γ^t R(s_t, a_t) the trajectory reward. The goal of reinforcement learning methods is to find the parameters θ of a policy π_θ(a|s) that maximizes the expected discounted reward over trajectories induced by the policy: E_{π_θ}[R(τ)], where s_0 ∼ p_0, s_{t+1} ∼ P(s_{t+1}|s_t, a_t) and a_t ∼ π_θ(a_t|s_t).
In an imitation learning scenario, the reward function is unknown. However, we are given a set of demonstrated trajectories, which presumably originate from some optimal expert policy distribution π_{E1} that optimizes an unknown reward function R_{E1}. Thus, by trying to estimate the reward function R_{E1} and optimizing the policy π_θ with respect to it, we can recover the expert policy. This approach is known as inverse reinforcement learning (IRL) [1]. In order to model a variety of behaviors, it is beneficial to find a policy with the highest possible entropy that optimizes R_{E1}. 
We will refer to this approach as maximum-entropy IRL [35] with the optimization objective

min_R max_{π_θ} ( H(π_θ) + E_{π_θ}[R(s, a)] ) − E_{π_{E1}}[R(s, a)],   (1)

where H(π_θ) is the entropy of the policy π_θ.
Ho and Ermon [16] showed that it is possible to redefine the maximum-entropy IRL problem with multiple demonstrations sampled from a single expert policy π_{E1} as an optimization of GANs [15]. In this framework, the policy π_θ(a|s) plays the role of a generator, whose goal is to make it difficult for a discriminator network D_w(s, a) (parameterized by w) to differentiate between imitated samples from π_θ (labeled 0) and demonstrated samples from π_{E1} (labeled 1). Accordingly, the joint optimization goal can be defined as

max_θ min_w  E_{(s,a)∼π_θ}[log(D_w(s, a))] + E_{(s,a)∼π_{E1}}[log(1 − D_w(s, a))] + λ_H H(π_θ).   (2)

The discriminator and the generator policy are both represented as neural networks and optimized by repeatedly performing alternating gradient updates. The discriminator is trained on the mixed set of expert and generator samples and outputs the probability that a particular sample originated from the generator or the expert policy. This serves as a reward signal for the generator policy, which tries to maximize the probability of the discriminator confusing it with an expert policy. The generator can be trained using the trust region policy optimization (TRPO) algorithm [30] with the cost function log(D_w(s, a)). At each iteration, TRPO takes the following gradient step:

E_{(s,a)∼π_θ}[∇_θ log π_θ(a|s) log(D_w(s, a))] + λ_H ∇_θ H(π_θ),   (3)

which corresponds to minimizing the objective in Eq. 
(2) with respect to the policy π_θ.

4 Multi-modal Imitation Learning

The traditional imitation learning scenario described in Sec. 3 considers the problem of learning to imitate one skill from demonstrations. The demonstrations represent samples from a single expert policy π_{E1}. In this work, we focus on an imitation learning setup where we learn from unstructured and unlabelled demonstrations of various tasks. In this case, the demonstrations come from a set of expert policies π_{E1}, π_{E2}, . . . , π_{Ek}, where k can be unknown, that optimize different reward functions/tasks. We will refer to this set of unstructured expert policies as a mixture of policies π_E. We aim to segment the demonstrations of these policies into separate tasks and learn a multi-modal policy that will be able to imitate all of the segmented tasks.
In order to be able to learn multi-modal policy distributions, we augment the policy input with a latent intention i distributed by a categorical or uniform distribution p(i), similar to [7]. The goal of the intention variable is to select a specific mode of the policy, which corresponds to one of the skills presented in the demonstrations. The resulting policy can be expressed as:

π(a|s, i) = p(i|s, a) π(a|s) / p(i).   (4)

We augment the trajectory to include the latent intention as τ_i = (s_0, a_0, i_0, . . . , s_T, a_T, i_T). The resulting reward of the trajectory with the latent intention is R(τ_i) = Σ_{t=0}^{T} γ^t R(s_t, a_t, i_t). R(s, a, i) is a reward function that depends on the latent intention i, as we have multiple demonstrations that optimize different reward functions for different tasks. The expected discounted reward is equal to E_{π_θ}[R(τ_i)] = ∫ R(τ_i) π_θ(τ_i) dτ_i, where π_θ(τ_i) = p_0(s_0) Π_{t=0}^{T−1} P(s_{t+1}|s_t, a_t) π_θ(a_t|s_t, i_t) p(i_t).
Here, we show an extension of the derivation presented in [16] (Eqs. (1, 2)) for a policy π(a|s, i) augmented with the latent intention variable i, which uses demonstrations from a set of expert policies π_E, rather than a single expert policy π_{E1}. We are aiming at maximum-entropy policies that can be determined from the latent intention variable i. Accordingly, we transform the original IRL problem to reflect this goal:

min_R max_π ( H(π(a|s)) − H(π(a|s, i)) + E_π[R(s, a, i)] ) − E_{π_E}[R(s, a, i)],   (5)

where π(a|s) = Σ_i π(a|s, i) p(i), which results in the policy averaged over intentions (since p(i) is constant). This goal reflects our objective: we aim to obtain a multi-modal policy that has a high entropy without any given intention, but collapses to a particular task when the intention is specified. Analogously to the solution for a single expert policy, this optimization objective results in the optimization goal of the generative adversarial imitation learning network, with the exception that the state-action pairs (s, a) are sampled from a set of expert policies π_E:

max_θ min_w  E_{i∼p(i),(s,a)∼π_θ}[log(D_w(s, a))] + E_{(s,a)∼π_E}[log(1 − D_w(s, a))] + λ_H H(π_θ(a|s)) − λ_I H(π_θ(a|s, i)),   (6)

where λ_I, λ_H correspond to the weighting parameters on the respective objectives. 
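To make the roles of the individual terms concrete, the following minimal numpy sketch assembles the per-sample generator reward and the discriminator's part of the objective from the outputs of the learned networks D_w and p(i|s, a), which are represented here only by their outputs. The weighting value λ_I = 0.1 and the clipping are illustrative assumptions, not values from the paper, and the policy-entropy bonus is left to the policy optimizer.

```python
import numpy as np

def generator_reward(d_prob, intention_logp, lambda_i=0.1):
    """Per-sample generator reward: log(D_w(s, a)) plus the weighted
    latent intention term log(p(i | s, a)).
    d_prob: discriminator outputs D_w(s, a) on generator samples, in (0, 1);
    intention_logp: log p(i | s, a) from the intention-prediction network;
    lambda_i: weight of the latent intention cost (illustrative value)."""
    d_prob = np.clip(d_prob, 1e-8, 1.0)  # numerical safety, our addition
    return np.log(d_prob) + lambda_i * intention_logp

def discriminator_objective(d_gen, d_expert):
    """Quantity the discriminator minimizes: E_gen[log D] + E_expert[log(1 - D)],
    pushing D toward 0 on generator samples (label 0) and toward 1 on
    expert samples (label 1)."""
    d_gen = np.clip(d_gen, 1e-8, 1.0)
    d_expert = np.clip(d_expert, 0.0, 1.0 - 1e-8)
    return np.mean(np.log(d_gen)) + np.mean(np.log(1.0 - d_expert))
```

With a maximally confused discriminator, i.e. D = 0.5 on every sample, both terms of the discriminator objective reduce to log 0.5, which corresponds to the flat-reward failure mode discussed in Section 5.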
The resulting entropy term H(π_θ(a|s, i)) can be expressed as:

H(π_θ(a|s, i)) = E_{i∼p(i),(s,a)∼π_θ}[−log(π_θ(a|s, i))]
             = −E_{i∼p(i),(s,a)∼π_θ}[log(p(i|s, a) π_θ(a|s) / p(i))]
             = −E_{i∼p(i),(s,a)∼π_θ}[log(p(i|s, a))] − E_{i∼p(i),(s,a)∼π_θ}[log(π_θ(a|s))] + E_{i∼p(i)}[log(p(i))]
             = −E_{i∼p(i),(s,a)∼π_θ}[log(p(i|s, a))] + H(π_θ(a|s)) − H(i),   (7)

which results in the final objective:

max_θ min_w  E_{i∼p(i),(s,a)∼π_θ}[log(D_w(s, a))] + E_{(s,a)∼π_E}[log(1 − D_w(s, a))] + (λ_H − λ_I) H(π_θ(a|s)) + λ_I E_{i∼p(i),(s,a)∼π_θ}[log(p(i|s, a))] + λ_I H(i),   (8)

where H(i) is a constant that does not influence the optimization. This results in the same optimization objective as for the single expert policy (see Eq. (2)) with an additional term λ_I E_{i∼p(i),(s,a)∼π_θ}[log(p(i|s, a))] responsible for rewarding state-action pairs that make the latent intention inference easier. We refer to this cost as the latent intention cost and represent p(i|s, a) with a neural network. The final reward function for the generator is:

E_{i∼p(i),(s,a)∼π_θ}[log(D_w(s, a))] + λ_I E_{i∼p(i),(s,a)∼π_θ}[log(p(i|s, a))] + λ_{H'} H(π_θ(a|s)).   (9)

4.1 Relation to InfoGAN

In this section, we provide an alternative derivation of the optimization goal in Eq. (8) by extending the InfoGAN approach presented in [7]. Following [7], we introduce the latent variable c as a means to capture the semantic features of the data distribution. 
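As a concrete illustration of how such a latent code can be represented, the sketch below samples either a categorical (one-hot) or a continuous uniform code and conditions the policy on it. The paper uses a softmax layer for categorical codes and a uniform distribution for continuous ones; concatenating the code to the state observation is our assumption about the conditioning mechanism, since the paper does not prescribe how the code enters the policy network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_intention(kind="categorical", num_categories=3, rng=rng):
    """Sample a latent intention/code: a one-hot vector over skill
    categories, or a uniform scalar on [-1, 1] for the continuous case."""
    if kind == "categorical":
        code = np.zeros(num_categories)
        code[rng.integers(num_categories)] = 1.0
        return code
    return rng.uniform(-1.0, 1.0, size=1)

def policy_input(state, intention):
    # Condition the policy on the intention by concatenating the code to
    # the state observation (an illustrative choice, not from the paper).
    return np.concatenate([state, intention])
```

Holding the code fixed over an episode selects one mode of the multi-modal policy; resampling it switches skills, as in the Gripper-pusher experiment.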
In this case, however, the latent variables are used in the imitation learning scenario, rather than the traditional GAN setup, which prevents us from using additional noise variables (z in the InfoGAN approach) that serve as the noise samples from which the data are generated.
Similarly to [7], to prevent collapsing to a single mode, the policy optimization objective is augmented with the mutual information I(c; G(π_θ^c, c)) between the latent variable and the state-action pair generator G dependent on the policy distribution π_θ^c. This encourages the policy to produce behaviors that are interpretable from the latent code, and, given a larger number of possible latent code values, leads to an increase in the diversity of policy behaviors. The corresponding generator goal can be expressed as:

E_{c∼p(c),(s,a)∼π_θ^c}[log(D_w(s, a))] + λ_I I(c; G(π_θ^c, c)) + λ_H H(π_θ^c).   (10)

In order to compute I(c; G(π_θ^c, c)), we follow the derivation from [7] that introduces a lower bound:

I(c; G(π_θ^c, c)) = H(c) − H(c|G(π_θ^c, c))
                 = E_{(s,a)∼G(π_θ^c, c)}[E_{c'∼P(c|s,a)}[log(P(c'|s, a))]] + H(c)
                 = E_{(s,a)∼G(π_θ^c, c)}[D_KL(P(·|s, a) ∥ Q(·|s, a)) + E_{c'∼P(c|s,a)}[log(Q(c'|s, a))]] + H(c)
                 ≥ E_{(s,a)∼G(π_θ^c, c)}[E_{c'∼P(c|s,a)}[log(Q(c'|s, a))]] + H(c)
                 = E_{c∼P(c),(s,a)∼G(π_θ^c, c)}[log(Q(c|s, a))] + H(c).   (11)

By maximizing this lower bound, we maximize I(c; G(π_θ^c, c)). The auxiliary distribution Q(c|s, a) can be parametrized by a neural network. The resulting optimization goal is

max_θ min_w  E_{c∼p(c),(s,a)∼π_θ^c}[log(D_w(s, a))] + E_{(s,a)∼π_E}[log(1 − D_w(s, a))] + λ_I E_{c∼P(c),(s,a)∼G(π_θ^c, c)}[log(Q(c|s, a))] + λ_H H(π_θ^c),   (12)

which results in the generator reward function:

E_{c∼p(c),(s,a)∼π_θ^c}[log(D_w(s, a))] + λ_I E_{c∼P(c),(s,a)∼G(π_θ^c, c)}[log(Q(c|s, a))] + λ_H H(π_θ^c).   (13)

This corresponds to the same objective that was derived in Section 4. The auxiliary distribution over the latent variables Q(c|s, a) is analogous to the intention distribution p(i|s, a).

5 Implementation

In this section, we discuss implementation details that can alleviate instability of the training procedure of our model. The first indicator that the training has become unstable is a high classification accuracy of the discriminator. In this case, it is difficult for the generator to produce a meaningful policy, as the reward signal from the discriminator is flat and the TRPO gradient of the generator vanishes. In an extreme case, the discriminator assigns all the generator samples to the same class and it is impossible for TRPO to provide a useful gradient, as all generator samples receive the same reward. Previous work suggests several ways to avoid this behavior. These include leveraging the Wasserstein distance metric to improve the convergence behavior [3] and adding instance noise to the inputs of the discriminator to avoid degenerate generative distributions [32]. We find that adding Gaussian noise helped us the most to control the performance of the discriminator and to produce a smooth reward signal for the generator policy. 
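A minimal sketch of such Gaussian instance noise on the discriminator inputs follows; the linear annealing schedule and the initial scale sigma0 are our assumptions for illustration, as the paper anneals the noise following [32] without spelling out the exact schedule.

```python
import numpy as np

def add_instance_noise(batch, iteration, total_iters, sigma0=0.5, rng=None):
    """Add zero-mean Gaussian noise to a batch of discriminator inputs
    (state-action pairs), with the standard deviation annealed linearly
    from sigma0 down to 0 over training. Schedule and sigma0 are
    illustrative assumptions, not values from the paper."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = sigma0 * max(0.0, 1.0 - iteration / total_iters)
    return batch + sigma * rng.standard_normal(batch.shape)
```

Early in training the noise blurs the two sample distributions so the discriminator cannot saturate; by the final iterations the inputs pass through unchanged.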
During our experiments, we anneal the noise similarly to [32], as the generator policy improves towards the end of the training.
An important indicator that the generator policy distribution has collapsed to a uni-modal policy is a high or increasing loss of the intention-prediction network p(i|s, a). This means that the prediction of the latent variable i is difficult and, consequently, the policy behavior cannot be categorized into separate skills. Hence, the policy executes the same skill for different values of the latent variable. To prevent this, one can increase the weight of the latent intention cost λ_I in the generator loss or add more instance noise to the discriminator, which makes its reward signal relatively weaker.
In this work, we employ both categorical and continuous latent variables to represent the latent intention. The advantage of using a continuous variable is that we do not have to specify the number of possible values in advance, as with the categorical variable, and it leaves more room for interpolation between different skills. We use a softmax layer to represent categorical latent variables, and a uniform distribution for continuous latent variables, as proposed in [7].

Figure 1: Left: Walker-2D running forwards, running backwards, jumping. Right: Humanoid running forwards, running backwards, balancing.

Figure 2: Left: Reacher with 2 targets: random initial state, reaching one target, reaching another target. Right: Gripper-pusher: random initial state, grasping policy, pushing (when grasped) policy.

6 Experiments

Our experiments aim to answer the following questions: (1) Can we segment unstructured and unlabelled demonstrations into skills and learn a multi-modal policy that imitates them? (2) What is the influence of the introduced intention-prediction cost on the resulting policies? 
(3) Can we\nautonomously discover the number of skills presented in the demonstrations, and even accomplish\nthem in different ways? (4) Does the presented method scale to high-dimensional policies? (5)\nCan we use the proposed method for learning hierarchical policies? We evaluate our method on\na series of challenging simulated robotics tasks described below. We would like to emphasize\nthat the demonstrations consist of shuf\ufb02ed state-action pairs such that no temporal information\nor segmentation is used during learning. The performance of our method can be seen in our\nsupplementary video2.\n\n6.1 Task setup\n\nReacher The Reacher environment is depicted in Fig. 2 (left). The actuator is a 2-DoF arm attached\nat the center of the scene. There are several targets placed at random positions throughout the\nenvironment. The goal of the task is, given a data set of reaching motions to random targets, to\ndiscover the dependency of the target selection on the intention and learn a policy that is capable of\nreaching different targets based on the speci\ufb01ed intention input. We evaluate the performance of our\nframework on environments with 1, 2 and 4 targets.\nWalker-2D The Walker-2D (Fig. 1 left) is a 6-DoF bipedal robot consisting of two legs and feet\nattached to a common base. The goal of this task is to learn a policy that can switch between three\ndifferent behaviors dependent on the discovered intentions: running forward, running backward and\njumping. We use TRPO to train single expert policies and create a combined data set of all three\nbehaviors that is used to train a multi-modal policy using our imitation framework.\nHumanoid Humanoid (Fig. 1 right) is a high-dimensional robot with 17 degrees of freedom. 
Similar to Walker-2D, the goal of the task is to discover three different policies: running forward, running backward and balancing, from the combined expert demonstrations of all of them.
Gripper-pusher This task involves controlling a 4-DoF arm with an actuated gripper to push a sliding block to a specified goal area (Fig. 2 right). We provide separate expert demonstrations of grasping the object, and of pushing it towards the goal starting from the object already being inside the hand. The initial positions of the arm, block and the goal area are randomly sampled at the beginning of each episode. The goal of our framework is to discover both intentions and the hierarchical structure of the task from a combined set of demonstrations.

6.2 Multi-Target Imitation Learning

Our goal here is to analyze the ability of our method to segment and imitate policies that perform the same task for different targets. To this end, we first evaluate the influence of the latent intention cost on the Reacher task with 2 and 4 targets. For both experiments, we use either a categorical intention distribution with the number of categories equal to the number of targets or a continuous, uniformly-distributed intention variable, which means that the network has to discover the number of intentions autonomously.

2http://sites.google.com/view/nips17intentiongan

Figure 3: Results of the imitation GAN with (top row) and without (bottom row) the latent intention cost. Left: Reacher with 2 targets (crosses): final positions of the reacher (circles) for categorical (1) and continuous (2) latent intention variable. Right: Reacher with 4 targets (crosses): final positions of the reacher (circles) for categorical (3) and continuous (4) latent intention variable.

Figure 4: Left: Rewards of different Reacher policies for 2 targets for different intention values over the training iterations with (1) and without (2) the latent intention cost. Right: Two examples of a heatmap for 1 target Reacher using two latent intentions each.

Fig. 3 (top) shows the results of the reaching tasks using the latent intention cost for 2 and 4 targets with different latent intention distributions. For the continuous latent variable, we show a span of different intentions between -1 and 1 in 0.2 intervals. The colors indicate the intention "value". In the categorical distribution case, we are able to learn a multi-modal policy that can reach all the targets dependent on the given latent intention (Fig. 3-1 and Fig. 3-3 top). The continuous latent intention is able to discover two modes in the case of two targets (Fig. 3-2 top) but it collapses to only two modes in the four-target case (Fig. 3-4 top), as this is a significantly more difficult task.
As a baseline, we present the results of the Reacher task achieved by the standard GAN imitation learning presented in [16] without the latent intention cost. The obtained results are presented in Fig. 3 (bottom). Since the network is not encouraged to discover different skills through the intention learning cost, it collapses to a single target for 2 targets with both the continuous and discrete latent intention variables. In the case of 4 targets, the network collapses to 2 modes, which can be explained by the fact that even without the latent intention cost the imitation network tries to imitate most of the presented demonstrations. Since the demonstration set is very diverse in this case, the network learned two modes without the explicit instruction (latent intention cost) to do so.
To demonstrate the development of different intentions, in Fig. 4 (left) we present the Reacher rewards over training iterations for different intention variables. When the latent intention cost is included (Fig. 
4-1), the separation of different skills for different intentions starts to emerge around the 1000th iteration and leads to a multi-modal policy that, given the intention value, consistently reaches the target associated with that intention. In the case of the standard imitation learning GAN setup (Fig. 4-2), the network learns how to imitate reaching only one of the targets for both intention values.
In order to analyze the ability to discover different ways to accomplish the same task, we use our framework with the categorical latent intention in the Reacher environment with a single target. Since we only have a single set of expert trajectories that reach the goal in one consistent manner, we subsample the expert state-action pairs to ease the intention learning process for the generator. Fig. 4 (right) shows two examples of a heatmap of the visited end-effector states accumulated for two different values of the intention variable. In both cases, the task is executed correctly, the robot reaches the target, but it does so using different trajectories. These trajectories naturally emerged through the latent intention cost, as it encourages different behaviors for different latent intentions. It is worth noting that the presented behavior can also be replicated for multiple targets if the number of categories in the categorical distribution of the latent intention exceeds the number of targets.

Figure 5: Top: Rewards of Walker-2D policies for different intention values over the training iterations with (left) and without (right) the latent intention cost. Bottom: Rewards of Humanoid policies for different intention values over the training iterations with (left) and without (right) the latent intention cost.

6.3 Multi-Task Imitation Learning

We also seek to further understand whether our model extends to segmenting and imitating policies that perform different tasks. 
In particular, we evaluate whether our framework is able to learn a multi-modal policy on the Walker-2D task. We mix three different policies – running backwards, running forwards, and jumping – into one expert policy π_E and try to recover all of them through our method. The results are depicted in Fig. 5 (top). The additional latent intention cost results in a policy that is able to autonomously segment and mimic all three behaviors and achieve a performance similar to the expert policies (Fig. 5 top-left). Different intention variable values correspond to different expert policies: 0 - running forwards, 1 - jumping, and 2 - running backwards. The imitation learning GAN method is shown as a baseline in Fig. 5 (top-right). The results show that the policy collapses to a single mode, where all different intention variable values correspond to the jumping behavior, ignoring the demonstrations of the other two skills.
To test whether our multi-modal imitation learning framework scales to high-dimensional tasks, we evaluate it in the Humanoid environment. The expert policy is constructed using three expert policies: running backwards, running forwards, and balancing while standing upright. Fig. 5 (bottom) shows the rewards obtained for different values of the intention variable. Similarly to Walker-2D, the latent intention cost enables the neural network to segment the tasks and learn a multi-modal imitation policy. In this case, however, due to the high dimensionality of the task, the resulting policy is able to mimic the running forwards and balancing policies almost as well as the experts, but it achieves a suboptimal performance on the running backwards task (Fig. 5 bottom-left). 

Figure 6: Time-lapse of the learned Gripper-pusher policy. The intention variable is changed manually in the fifth screenshot, once the grasping policy has grasped the block.

The imitation learning GAN baseline collapses to a uni-modal policy that maps all the intention values to a balancing behavior (Fig. 5 bottom-right).
Finally, we evaluate the ability of our method to discover options in hierarchical IRL tasks. To this end, we collect expert demonstrations in the Gripper-pusher environment that consist of grasping, and of pushing once the object is grasped. The goal of this task is to check whether our method is able to segment the mix of expert policies into separate grasping and pushing-when-grasped skills. Since the two sub-tasks start from different initial conditions, we cannot present the results in the same form as for the previous tasks. Instead, we present a time-lapse of the learned multi-modal policy (see Fig. 6) that demonstrates the ability to change the intention during execution. The categorical intention variable is manually changed after the block is grasped. The intention change results in switching to a pushing policy that brings the block into the goal region. We present this setup as an example of extracting different options from the expert policies that can further be used in a hierarchical reinforcement learning task to learn the best switching strategy.

7 Conclusions

We present a novel imitation learning method that learns a multi-modal stochastic policy, which is able to imitate a number of automatically segmented tasks using a set of unstructured and unlabeled demonstrations. The presented approach learns the notion of intention and is able to perform different tasks based on the policy's intention input. We evaluated our method on a set of simulation scenarios where we show that it is able to segment the demonstrations into different tasks and to learn a multi-modal policy that imitates all of the segmented skills.
We also compared our method to a baseline approach that performs imitation learning without explicitly separating the tasks.
In future work, we plan to focus on autonomous discovery of the number of tasks in a given pool of demonstrations, as well as on evaluating this method on real robots. We also plan to learn an additional hierarchical policy over the discovered intentions as an extension of this work.

Acknowledgements

This research was supported in part by National Science Foundation grants IIS-1205249, IIS-1017134, EECS-0926052, the Office of Naval Research, the Okawa Foundation, and the Max-Planck-Society. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.

References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, 2004.

[2] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[3] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.

[4] Monica Babes, Vukosi Marivate, Kaushik Subramanian, and Michael L. Littman. Apprenticeship learning about multiple intentions. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 897–904, 2011.

[5] Aude Billard, Sylvain Calinon, Ruediger Dillmann, and Stefan Schaal. Robot programming by demonstration. In Springer Handbook of Robotics, pages 1371–1394. Springer, 2008.

[6] Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, and Sergey Levine. Path integral guided policy search. arXiv preprint arXiv:1610.00529, 2016.

[7] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel.
InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets, 2016.

[8] Emily L. Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.

[9] Christos Dimitrakakis and Constantin A. Rothkopf. Bayesian multitask inverse reinforcement learning. In European Workshop on Reinforcement Learning, pages 273–284. Springer, 2011.

[10] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017.

[11] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.

[12] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, 2016.

[13] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.

[14] Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.

[15] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, NIPS, pages 2672–2680, 2014.

[16] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. CoRR, abs/1606.03476, 2016.

[17] Mrinal Kalakrishnan, Ludovic Righetti, Peter Pastor, and Stefan Schaal.
Learning force control policies for compliant manipulation. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 4639–4644. IEEE, 2011.

[18] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1857–1865, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[19] Oliver Kroemer, Christian Daniel, Gerhard Neumann, Herke Van Hoof, and Jan Peters. Towards learning hierarchical skills for multi-phase manipulation tasks. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 1503–1510. IEEE, 2015.

[20] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.

[21] Yunzhu Li, Jiaming Song, and Stefano Ermon. Inferring the latent structure of human decision-making from raw visual inputs. CoRR, abs/1703.08840, 2017.

[22] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.

[23] Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.

[24] Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.

[25] Scott Niekum, Sachin Chitta, Andrew G. Barto, Bhaskara Marthi, and Sarah Osentoski. Incremental semantically grounded learning from demonstration.
In Robotics: Science and Systems, volume 9, 2013.

[26] David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.

[27] Dean A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.

[28] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In AISTATS, volume 3, pages 3–5, 2010.

[29] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.

[30] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 1889–1897. JMLR.org, 2015.

[31] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, 1998.

[32] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. CoRR, abs/1610.04490, 2016.

[33] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.

[34] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

[35] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Dieter Fox and Carla P. Gomes, editors, AAAI, pages 1433–1438.
AAAI Press, 2008.