{"title": "Meta-Inverse Reinforcement Learning with Probabilistic Context Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 11772, "page_last": 11783, "abstract": "Reinforcement learning demands a reward function, which is often difficult to provide or design in real world applications. While inverse reinforcement learning (IRL) holds promise for automatically learning reward functions from demonstrations, several major challenges remain. First, existing IRL methods learn reward functions from scratch, requiring large numbers of demonstrations to correctly infer the reward for each task the agent may need to perform. Second, and more subtly, existing methods typically assume demonstrations for one, isolated behavior or task, while in practice, it is significantly more natural and scalable to provide datasets of heterogeneous behaviors. To this end, we propose a deep latent variable model that is capable of learning rewards from unstructured, multi-task demonstration data, and critically, use this experience to infer robust rewards for new, structurally-similar tasks from a single demonstration. Our experiments on multiple continuous control tasks demonstrate the effectiveness of our approach compared to state-of-the-art imitation and inverse reinforcement learning methods.", "full_text": "Meta-Inverse Reinforcement Learning with\n\nProbabilistic Context Variables\n\nLantao Yu\u2217, Tianhe Yu\u2217, Chelsea Finn, Stefano Ermon\nDepartment of Computer Science, Stanford University\n\nStanford, CA 94305\n\n{lantaoyu,tianheyu,cbfinn,ermon}@cs.stanford.edu\n\nAbstract\n\nProviding a suitable reward function to reinforcement learning can be dif\ufb01cult in\nmany real world applications. While inverse reinforcement learning (IRL) holds\npromise for automatically learning reward functions from demonstrations, several\nmajor challenges remain. 
First, existing IRL methods learn reward functions from scratch, requiring large numbers of demonstrations to correctly infer the reward for each task the agent may need to perform. Second, existing methods typically assume homogeneous demonstrations for a single behavior or task, while in practice, it might be easier to collect datasets of heterogeneous but related behaviors. To this end, we propose a deep latent variable model that is capable of learning rewards from demonstrations of distinct but related tasks in an unsupervised way. Critically, our model can infer rewards for new, structurally-similar tasks from a single demonstration. Our experiments on multiple continuous control tasks demonstrate the effectiveness of our approach compared to state-of-the-art imitation and inverse reinforcement learning methods.

1 Introduction

While reinforcement learning (RL) has been successfully applied to a range of decision-making and control tasks in the real world, it relies on a key assumption: having access to a well-defined reward function that measures progress towards the completion of the task. Although it can be straightforward to provide a high-level description of success conditions for a task, existing RL algorithms usually require a more informative signal to expedite exploration and learn complex behaviors in a reasonable time. While reward functions can be hand-specified, reward engineering can require significant human effort. Moreover, for many real-world tasks, it can be challenging to manually design reward functions that actually benefit RL training, and reward mis-specification can hamper autonomous learning [2].

Learning from demonstrations [31] sidesteps the reward specification problem by instead learning directly from expert demonstrations, which can be obtained through teleoperation [39] or from human experts [38].
Demonstrations can often be easier to provide than rewards, as humans can complete many real-world tasks quite efficiently. Two major methodologies of learning from demonstrations include imitation learning and inverse reinforcement learning. Imitation learning is simple and often exhibits good performance [39, 16]. However, it lacks the ability to transfer learned policies to new settings where the task specification remains the same but the underlying environment dynamics change. As the reward function is often considered the most succinct, robust and transferable representation of a task [1, 11], the problem of inferring reward functions from expert demonstrations, i.e. inverse RL (IRL) [23], is important to consider.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

While appealing, IRL still typically relies on large amounts of high-quality expert data, and it can be prohibitively expensive to collect demonstrations that cover all kinds of variations in the wild (e.g. opening all kinds of doors or navigating to all possible target positions). As a result, these methods are data-inefficient, particularly when learning rewards for individual tasks in isolation, starting from scratch. On the other hand, meta-learning [32, 4], also known as learning to learn, seeks to exploit the structural similarity among a distribution of tasks and optimizes for rapid adaptation to unknown settings with a limited amount of data. As the reward function is able to succinctly capture the structure of a reinforcement learning task, e.g. the goal to achieve, it is promising to develop methods that can quickly infer the structure of a new task, i.e. its reward, and train a policy to adapt to it. Xu et al.
[36] and Gleave and Habryka [12] have proposed approaches that combine IRL and gradient-based meta-learning [9], which provide promising results on deriving generalizable reward functions. However, they have been limited to tabular MDPs [36] or settings with provided task distributions [12], which are challenging to gather in real-world applications.

The primary contribution of this paper is a new framework, termed Probabilistic Embeddings for Meta-Inverse Reinforcement Learning (PEMIRL), which enables meta-learning of rewards from unstructured multi-task demonstrations. In particular, PEMIRL combines and integrates ideas from context-based meta-learning [5, 26], deep latent variable generative models [17], and maximum entropy inverse RL [42, 41], into a unified graphical model (see Figure 4 in Appendix D) that bridges the gap between few-shot reward inference and learning from unstructured, heterogeneous demonstrations. PEMIRL can learn robust reward functions that generalize to new tasks with a single demonstration on complex domains with continuous state-action spaces, while meta-training on a set of unstructured demonstrations without specified task groupings or labeling for each demonstration. Our experimental results on various continuous control tasks including Point-Maze, Ant, Sweeper, and Sawyer Pusher demonstrate the effectiveness and scalability of our method.

2 Preliminaries

Markov Decision Process (MDP). A discrete-time finite-horizon MDP is defined by a tuple $(T, \mathcal{S}, \mathcal{A}, P, r, \eta)$, where $T$ is the time horizon; $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ describes the (stochastic) transition process between states; $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a bounded reward function; $\eta \in \mathcal{P}(\mathcal{S})$ specifies the initial state distribution, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over the state space $\mathcal{S}$. We use $\tau$ to denote a trajectory, i.e.
a sequence of state-action pairs for one episode. We also use $\rho_\pi(s_t)$ and $\rho_\pi(s_t, a_t)$ to denote the state and state-action marginal distributions encountered when executing a policy $\pi(a_t|s_t)$.

Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL). The maximum entropy reinforcement learning (MaxEnt RL) objective is defined as:

$$\max_\pi \; \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right] \quad (1)$$

which augments the reward function with a causal entropy regularization term $\mathcal{H}(\pi) = \mathbb{E}_\pi[-\log \pi(a|s)]$. Here $\alpha$ is an optional parameter to control the relative importance of reward and entropy. For notational simplicity, without loss of generality, in the following we will assume $\alpha = 1$. Given some expert policy $\pi_E$ obtained by the above MaxEnt RL procedure, the MaxEnt IRL framework [42] aims to find a reward function that rationalizes the expert behaviors, which can be interpreted as solving the following maximum likelihood estimation (MLE) problem:

$$\arg\min_\theta D_{\mathrm{KL}}(p_{\pi_E}(\tau) \,\|\, p_\theta(\tau)) = \arg\max_\theta \mathbb{E}_{p_{\pi_E}(\tau)}[\log p_\theta(\tau)] = \arg\max_\theta \mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^{T} r_\theta(s_t, a_t)\right] - \log Z_\theta \quad (2)$$

$$\text{where} \quad p_\theta(\tau) \propto \eta(s_1) \prod_{t=1}^{T} P(s_{t+1}|s_t, a_t) \, \exp\left(\sum_{t=1}^{T} r_\theta(s_t, a_t)\right)$$

Here, $\theta$ is the parameter of the reward function and $Z_\theta$ is the partition function, i.e. $\int p_\theta(\tau) \mathrm{d}\tau$, an integral over all possible trajectories consistent with the environment dynamics. $Z_\theta$ is intractable to compute when state-action spaces are large or continuous, or environment dynamics are unknown.

Finn et al. [7] and Fu et al.
[11] proposed the adversarial IRL (AIRL) framework as an efficient sampling-based approximation to MaxEnt IRL, which resembles Generative Adversarial Networks [13]. Specifically, in AIRL, there is a discriminator $D_\theta$ (a binary classifier) parametrized by $\theta$ and an adaptive sampler $\pi_\omega$ (a policy) parametrized by $\omega$. The discriminator takes a particular form: $D_\theta(s, a) = \frac{\exp(f_\theta(s, a))}{\exp(f_\theta(s, a)) + \pi_\omega(a|s)}$, where $f_\theta(s, a)$ is the learned reward function and $\pi_\omega(a|s)$ is pre-computed as an input to the discriminator. The discriminator is trained to distinguish between the trajectories sampled from the expert and the adaptive sampler, while the adaptive sampler $\pi_\omega(a|s)$ is trained to maximize $\mathbb{E}_{\rho_{\pi_\omega}}[\log D_\theta(s, a) - \log(1 - D_\theta(s, a))]$, which is equivalent to maximizing the following entropy-regularized policy objective (with $f_\theta(s, a)$ serving as the reward function):

$$\mathbb{E}_{\pi_\omega}\left[\sum_{t=1}^{T} \log D_\theta(s_t, a_t) - \log(1 - D_\theta(s_t, a_t))\right] = \mathbb{E}_{\pi_\omega}\left[\sum_{t=1}^{T} f_\theta(s_t, a_t) - \log \pi_\omega(a_t|s_t)\right] \quad (3)$$

Under certain conditions, it can be shown that the learned reward function will recover the ground-truth reward up to a constant (Theorem C.1 in [11]).

3 Probabilistic Embeddings for Meta-Inverse Reinforcement Learning

3.1 Problem Statement

Before defining our meta-inverse reinforcement learning (Meta-IRL) problem, we first define the concept of an optimal context-conditional policy.

We start by generalizing the notion of an MDP with a probabilistic context variable denoted as $m \in \mathcal{M}$, where $\mathcal{M}$ is the (discrete or continuous) value space of $m$. For example, in a navigation task, the context variables could represent different goal positions in the environment. Now, each component
Now, each component\nof the MDP has an additional dependency on the context variable m. For example, by slightly\noverloading the notation, the reward function is now de\ufb01ned as r : S \u00d7 A \u00d7 M \u2192 R. For simplicity,\nthe state space, action space, initial state distribution and transition dynamics are often assumed to be\nindependent of m [5, 9], which we will follow in this work. Intuitively, different m\u2019s correspond to\ndifferent tasks with shared structures.\nGiven above de\ufb01nitions, the context-conditional trajectory distribution induced by a context-\nconditional policy \u03c0 : S \u00d7 M \u2192 P(A) can be written as:\n\np\u03c0(\u03c4 = {s1:T , a1:T}|m) = \u03b7(s1)\n\n\u03c0(at|st, m)P (st+1|st, at)\n\n(4)\n\nT(cid:89)\n\nt=1\n\n(cid:34) T(cid:88)\n\nt=1\n\nLet p(m) denote the prior distribution of the latent context variable (which is a part of the problem\nde\ufb01nition). With the conditional distribution de\ufb01ned above, the optimal entropy-regularized context-\nconditional policy is de\ufb01ned as:\n\n\u03c0\u2217 = arg max\n\n\u03c0\n\nEm\u223cp(m), (s1:T ,a1:T )\u223cp\u03c0(\u00b7|m)\n\nr(st, at, m) \u2212 log \u03c0(at|st, m)\n\n(5)\n\n(cid:35)\n\np\u03c0E (\u03c4 ) = (cid:82)\n\nNow,\nlet us introduce the problem of Meta-IRL from heterogeneous multi-task demonstra-\ntion data. Suppose there is some ground-truth reward function r(s, a, m) and a correspond-\ning expert policy \u03c0E(at|st, m) obtained by solving the optimization problem de\ufb01ned in Equa-\ntion (5). Given a set of demonstrations i.i.d. 
sampled from the induced marginal distribution $p_{\pi_E}(\tau) = \int_{\mathcal{M}} p(m) p_{\pi_E}(\tau|m) \mathrm{d}m$, the goal is to meta-learn an inference model $q(m|\tau)$ and a reward function $f(s, a, m)$, such that given some new demonstration $\tau_E$ generated by sampling $m' \sim p(m)$, $\tau_E \sim p_{\pi_E}(\tau|m')$, with $\hat{m}$ being inferred as $\hat{m} \sim q(m|\tau_E)$, the learned reward function $f(s, a, \hat{m})$ and the ground-truth reward $r(s, a, m')$ will induce the same set of optimal policies [24]. Critically, we assume no knowledge of the prior task distribution $p(m)$, the latent context variable $m$ associated with each demonstration, nor the transition dynamics $P(s_{t+1}|s_t, a_t)$ during meta-training. Note that the entire supervision comes from the provided unstructured demonstrations, which means we also do not assume further interactions with the experts as in Ross et al. [28].

3.2 Meta-IRL with Mutual Information Regularization over Context Variables

Under the framework of MaxEnt IRL, we first parametrize the context variable inference model $q_\psi(m|\tau)$ and the reward function $f_\theta(s, a, m)$ (where the input $m$ is inferred by $q_\psi$). The induced $\theta$-parametrized trajectory distribution is given by:

$$p_\theta(\tau = \{s_{1:T}, a_{1:T}\}|m) = \frac{1}{Z(\theta)} \left[\eta(s_1) \prod_{t=1}^{T} P(s_{t+1}|s_t, a_t)\right] \exp\left(\sum_{t=1}^{T} f_\theta(s_t, a_t, m)\right) \quad (6)$$

where $Z(\theta)$ is the partition function, i.e., an integral over all possible trajectories. Without further constraints over $m$, directly applying AIRL to learning the reward function (by augmenting each component of AIRL with an additional context variable $m$ inferred by $q_\psi$) could simply ignore $m$, which is similar to the case of InfoGAIL [21]. Therefore, some connection between the reward function and the latent context variable $m$ needs to be established.
With MaxEnt IRL, a parametrized reward function will induce a trajectory distribution. From the perspective of information theory, the mutual information between the context variable $m$ and the trajectories sampled from the reward-induced distribution provides an ideal measure of such a connection.

Formally, the mutual information between two random variables $m$ and $\tau$ under the joint distribution $p_\theta(m, \tau) = p(m) p_\theta(\tau|m)$ is given by:

$$I_{p_\theta}(m; \tau) = \mathbb{E}_{m \sim p(m), \tau \sim p_\theta(\tau|m)}[\log p_\theta(m|\tau) - \log p(m)] \quad (7)$$

where $p_\theta(\tau|m)$ is the conditional distribution (Equation (6)), and $p_\theta(m|\tau)$ is the corresponding posterior distribution.

As we do not have access to the prior distribution $p(m)$ and posterior distribution $p_\theta(m|\tau)$, directly optimizing the mutual information in Equation (7) is intractable. Fortunately, we can leverage $q_\psi(m|\tau)$ as a variational approximation to $p_\theta(m|\tau)$ to reason about the uncertainty over tasks, as well as conduct approximate sampling from $p(m)$ (we will elaborate on this later in Section 3.3). Formally, let $p_{\pi_E}(\tau)$ denote the expert trajectory distribution; we have the following desiderata:

Desideratum 1. Matching conditional distributions: $\mathbb{E}_{p(m)}[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m) \,\|\, p_\theta(\tau|m))] = 0$

Desideratum 2. Matching posterior distributions: $\mathbb{E}_{p_\theta(\tau)}[D_{\mathrm{KL}}(p_\theta(m|\tau) \,\|\, q_\psi(m|\tau))] = 0$

The first desideratum encourages the $\theta$-induced conditional trajectory distribution to match the empirical distribution implicitly defined by the expert demonstrations, which is equivalent to the MLE objective in the MaxEnt IRL framework.
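Replacing the intractable posterior with a variational one yields the standard Barber-Agakov bound: for any conditional $q(m|\tau)$, $\mathbb{E}_{p(m,\tau)}[\log q(m|\tau) - \log p(m)] \le I(m; \tau)$, with equality exactly when $q$ equals the true posterior. A minimal numpy sketch on a toy discrete joint (all numbers here are made up for illustration; in our setting $m$ and $\tau$ are continuous and high-dimensional) verifies both properties:

```python
import numpy as np

# Toy joint distribution p(m, tau) over a discrete context m (rows)
# and a discrete "trajectory" tau (columns). Purely illustrative.
joint = np.array([[0.30, 0.10, 0.10],
                  [0.05, 0.15, 0.30]])
p_m = joint.sum(axis=1)          # prior p(m)
p_tau = joint.sum(axis=0)        # marginal p(tau)
posterior = joint / p_tau        # exact p(m | tau); columns sum to 1

def true_mi(joint, p_m, p_tau):
    # I(m; tau) = sum_{m,tau} p(m,tau) log [p(m,tau) / (p(m) p(tau))]
    ratio = joint / np.outer(p_m, p_tau)
    return float((joint * np.log(ratio)).sum())

def variational_bound(joint, p_m, q):
    # E_{p(m,tau)}[log q(m|tau) - log p(m)]  <=  I(m; tau)
    return float((joint * (np.log(q) - np.log(p_m)[:, None])).sum())

mi = true_mi(joint, p_m, p_tau)

# A deliberately imperfect variational posterior q_psi(m|tau):
q_bad = np.array([[0.6, 0.5, 0.4],
                  [0.4, 0.5, 0.6]])
bound_bad = variational_bound(joint, p_m, q_bad)      # strictly below mi
bound_exact = variational_bound(joint, p_m, posterior)  # equals mi
```

The gap between the bound and the true mutual information is exactly $\mathbb{E}_{p(\tau)}[D_{\mathrm{KL}}(p(m|\tau) \,\|\, q(m|\tau))]$, which is the quantity Desideratum 2 drives to zero.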
Note that they also share the same marginal distribution over the context variable $p(m)$, which implies that matching the conditionals in Desideratum 1 will also encourage the joint distributions, the conditional distributions $p_{\pi_E}(m|\tau)$ and $p_\theta(m|\tau)$, and the marginal distributions over $\tau$ to be matched. The second desideratum encourages the variational posterior $q_\psi(m|\tau)$ to be a good approximation to $p_\theta(m|\tau)$, such that $q_\psi(m|\tau)$ can correctly infer the latent context variable given a new expert demonstration sampled from a new task.

With the mutual information (Equation (7)) being the objective, and Desiderata 1 and 2 being the constraints, the meta-inverse reinforcement learning with probabilistic context variables problem can be interpreted as a constrained optimization problem, whose Lagrangian dual function is given by:

$$\min_{\theta, \psi} \; -I_{p_\theta}(m; \tau) + \alpha \cdot \mathbb{E}_{p(m)}[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m) \,\|\, p_\theta(\tau|m))] + \beta \cdot \mathbb{E}_{p_\theta(\tau)}[D_{\mathrm{KL}}(p_\theta(m|\tau) \,\|\, q_\psi(m|\tau))] \quad (8)$$

With the Lagrange multipliers taking specific values ($\alpha = 1$, $\beta = 1$) [40], the above Lagrangian dual function can be rewritten as:

$$\min_{\theta, \psi} \; \mathbb{E}_{p(m)}[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m) \,\|\, p_\theta(\tau|m))] + \mathbb{E}_{p_\theta(m, \tau)}\left[\log \frac{p(m)}{p_\theta(m|\tau)} + \log \frac{p_\theta(m|\tau)}{q_\psi(m|\tau)}\right] \quad (9)$$

$$\equiv \max_{\theta, \psi} \; -\mathbb{E}_{p(m)}[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m) \,\|\, p_\theta(\tau|m))] + \mathbb{E}_{m \sim p(m), \tau \sim p_\theta(\tau|m)}[\log q_\psi(m|\tau)]$$

$$= \max_{\theta, \psi} \; -\mathbb{E}_{p(m)}[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m) \,\|\, p_\theta(\tau|m))] + \mathcal{L}_{\mathrm{info}}(\theta, \psi) \quad (10)$$

Here the negative entropy term $-H_{p(m)} = \mathbb{E}_{p_\theta(m, \tau)}[\log p(m)] = \mathbb{E}_{p(m)}[\log p(m)]$ is omitted (in Eq.
(9)) as it can be treated as a constant in the optimization procedure of the parameters $\theta$ and $\psi$.

3.3 Achieving Tractability with Sampling-Based Gradient Estimation

Note that Equation (10) cannot be evaluated directly, as the first term requires estimating the KL divergence between the empirical expert distribution and the energy-based trajectory distribution $p_\theta(\tau|m)$ (induced by the $\theta$-parametrized reward function), and the second term requires sampling from it. For the purpose of optimizing the first term in Equation (10), as introduced in Section 2, we can employ the adversarial reward learning framework [11] to construct an efficient sampling-based approximation to the maximum likelihood objective. Note that, different from the original AIRL framework, the adaptive sampler $\pi_\omega(a|s, m)$ is now additionally conditioned on the context variable $m$. Furthermore, we here introduce the following lemma, which will be helpful for deriving the optimization of the second term in Equation (10).

Lemma 1. In context-variable-augmented adversarial IRL (with the adaptive sampler being $\pi_\omega(a|s, m)$ and the discriminator being $D_\theta(s, a, m) = \frac{\exp(f_\theta(s, a, m))}{\exp(f_\theta(s, a, m)) + \pi_\omega(a|s, m)}$), under deterministic dynamics, when training the adaptive sampler $\pi_\omega$ with reward signal $(\log D_\theta - \log(1 - D_\theta))$ to optimality, the trajectory distribution induced by $\pi_\omega^*$ corresponds to the maximum entropy trajectory distribution with $f_\theta(s, a, m)$ serving as the reward function:

$$p_{\pi_\omega^*}(\tau|m) = \frac{1}{Z_\theta} \left[\eta(s_1) \prod_{t=1}^{T} P(s_{t+1}|s_t, a_t)\right] \exp\left(\sum_{t=1}^{T} f_\theta(s_t, a_t, m)\right) = p_\theta(\tau|m)$$

Proof.
See Appendix A.

Now we are ready to introduce how to approximately optimize the second term of the objective in Equation (10) w.r.t. $\theta$ and $\psi$. First, we observe that the gradient of $\mathcal{L}_{\mathrm{info}}(\theta, \psi)$ w.r.t. $\psi$ is given by:

$$\frac{\partial}{\partial \psi} \mathcal{L}_{\mathrm{info}}(\theta, \psi) = \mathbb{E}_{m \sim p(m), \tau \sim p_\theta(\tau|m)}\left[\frac{1}{q_\psi(m|\tau)} \frac{\partial q_\psi(m|\tau)}{\partial \psi}\right] \quad (11)$$

Thus, to construct an estimate of the gradient in Equation (11), we need to obtain samples from the $\theta$-induced trajectory distribution $p_\theta(\tau|m)$. With Lemma 1, we know that when the adaptive sampler $\pi_\omega$ in AIRL is trained to optimality, we can use $\pi_\omega^*$ to construct samples, as the trajectory distribution $p_{\pi_\omega^*}(\tau|m)$ matches the desired distribution $p_\theta(\tau|m)$.

Also note that the expectation in Equation (11) is taken over the prior task distribution $p(m)$ as well. In cases where we have access to the ground-truth prior distribution, we can directly sample $m$ from it and use $p_{\pi_\omega^*}(\tau|m)$ to construct a gradient estimate. For the most general case, where we do not have access to $p(m)$ but instead have expert demonstrations sampled from $p_{\pi_E}(\tau)$, we use the following generative process:

$$\tau \sim p_{\pi_E}(\tau), \quad m \sim q_\psi(m|\tau) \quad (12)$$

to synthesize latent context variables, which approximates the prior task distribution when $\theta$ and $\psi$ are trained to optimality.

To optimize $\mathcal{L}_{\mathrm{info}}(\theta, \psi)$ w.r.t. $\theta$, which is an important step of updating the reward function parameters such that they encode the information of the latent context variable, different from the optimization of Equation (11), we cannot directly replace $p_\theta(\tau|m)$ with $p_{\pi_\omega}(\tau|m)$. The reason is that we can only use the approximation of $p_\theta$ to do inference (i.e. computing the value of an expectation).
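To make the $\psi$-update concrete, the following sketch instantiates the generative process of Equation (12) and the sample-based gradient of Equation (11) for a hypothetical softmax inference model over a discrete context space. The trajectory featurization, the "expert" data, and the rollout step are all illustrative stand-ins, not the architecture used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a trajectory tau is summarized by a feature
# vector phi(tau), and q_psi(m|tau) is a softmax over K discrete
# contexts with parameters W (standing in for psi).
K, D = 3, 4
W = rng.normal(size=(K, D))

def q_psi(phi):
    logits = W @ phi
    z = np.exp(logits - logits.max())
    return z / z.sum()                       # q_psi(. | tau)

def grad_log_q(m, phi):
    # For a softmax model, (1/q) dq/dW = d log q / dW
    #                     = (one_hot(m) - q) outer phi.
    q = q_psi(phi)
    return np.outer(np.eye(K)[m] - q, phi)

# Generative process of Eq. (12): draw an expert trajectory, then
# synthesize its context from the inference model.
expert_features = rng.normal(size=(16, D))   # stand-in for phi(tau_E)
grad_est = np.zeros_like(W)
for phi in expert_features:
    m = rng.choice(K, p=q_psi(phi))          # m ~ q_psi(m | tau_E)
    # The full algorithm would roll out pi_omega(.|s, m) here and
    # featurize that rollout; we reuse phi as a stand-in trajectory.
    grad_est += grad_log_q(m, phi)           # sample term of Eq. (11)
grad_est /= len(expert_features)
```

Since $\frac{1}{q_\psi} \frac{\partial q_\psi}{\partial \psi} = \frac{\partial}{\partial \psi} \log q_\psi$, the Monte Carlo average above is a direct estimate of Equation (11) once the stand-in trajectories are replaced by rollouts from $\pi_\omega^*$.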
When we want to optimize an expectation ($\mathcal{L}_{\mathrm{info}}(\theta, \psi)$) w.r.t. $\theta$ and the expectation is taken over $p_\theta$ itself, we cannot instead replace $p_\theta$ with $\pi_\omega$ to do the sampling for estimating the expectation. In the following, we discuss how to estimate the gradient of $\mathcal{L}_{\mathrm{info}}(\theta, \psi)$ w.r.t. $\theta$ with empirical samples from $\pi_\omega$.

Lemma 2. The gradient of $\mathcal{L}_{\mathrm{info}}(\theta, \psi)$ w.r.t. $\theta$ can be estimated with:

$$\mathbb{E}_{m \sim p(m), \, \tau \sim p_{\pi_\omega^*}(\tau|m)}\left[\log q_\psi(m|\tau) \left( \frac{\partial}{\partial \theta} \sum_{t=1}^{T} f_\theta(s_t, a_t, m) - \mathbb{E}_{\tau' \sim p_{\pi_\omega^*}(\tau|m)}\left[\frac{\partial}{\partial \theta} \sum_{t=1}^{T} f_\theta(s'_t, a'_t, m)\right] \right)\right]$$

When $\omega$ is trained to optimality, the estimation is unbiased.

Proof. See Appendix B.

With Lemma 2, as before, we can use the generative process in Equation (12) to sample $m$, and use the conditional trajectory distribution $p_{\pi_\omega^*}(\tau|m)$ to sample trajectories for estimating $\frac{\partial}{\partial \theta} \mathcal{L}_{\mathrm{info}}(\theta, \psi)$.

The overall training objective of PEMIRL is:

$$\min_\omega \max_{\theta, \psi} \; \mathbb{E}_{\tau_E \sim p_{\pi_E}(\tau), \, m \sim q_\psi(m|\tau_E), \, (s,a) \sim \rho_{\pi_\omega}(s,a|m)}[\log(1 - D_\theta(s, a, m))] + \mathbb{E}_{\tau_E \sim p_{\pi_E}(\tau), \, m \sim q_\psi(m|\tau_E)}[\log D_\theta(s, a, m)] + \mathcal{L}_{\mathrm{info}}(\theta, \psi) \quad (13)$$

where $D_\theta(s, a, m) = \exp(f_\theta(s, a, m)) / (\exp(f_\theta(s, a, m)) + \pi_\omega(a|s, m))$.

We summarize the meta-training procedure in Algorithm 1 and the meta-test procedure in Appendix C.

Algorithm 1 PEMIRL Meta-Training
Input: Expert trajectories $\mathcal{D}_E = \{\tau_E^j\}$; initial parameters of $f_\theta$, $\pi_\omega$, $q_\psi$.
repeat
    Sample two batches of unlabeled demonstrations: $\tau_E, \tau'_E \sim \mathcal{D}_E$
    Infer a batch of latent context variables from the
sampled demonstrations: $m \sim q_\psi(m|\tau_E)$
    Sample trajectories $\mathcal{D}$ from $\pi_\omega(\tau|m)$, with the latent context variable fixed during each rollout and included in $\mathcal{D}$.
    Update $\psi$ to increase $\mathcal{L}_{\mathrm{info}}(\theta, \psi)$ with the gradients in Equation (11), with samples from $\mathcal{D}$.
    Update $\theta$ to increase $\mathcal{L}_{\mathrm{info}}(\theta, \psi)$ with the gradient estimate in Lemma 2, with samples from $\mathcal{D}$.
    Update $\theta$ to decrease the binary classification loss:
        $\mathbb{E}_{(s,a,m) \sim \mathcal{D}}[\nabla_\theta \log D_\theta(s, a, m)] + \mathbb{E}_{\tau'_E \sim \mathcal{D}_E, m \sim q_\psi(m|\tau'_E)}[\nabla_\theta \log(1 - D_\theta(s, a, m))]$
    Update $\omega$ with TRPO to increase the following objective: $\mathbb{E}_{(s,a,m) \sim \mathcal{D}}[\log D_\theta(s, a, m)]$
until convergence
Output: Learned inference model $q_\psi(m|\tau)$, reward function $f_\theta(s, a, m)$, and policy $\pi_\omega(a|s, m)$.

4 Related Work

Inverse reinforcement learning (IRL), first introduced by Ng and Russell [23], is the problem of learning reward functions directly from expert demonstrations. Prior work tackling IRL includes margin-based methods [1, 27] and maximum entropy (MaxEnt) methods [42]. Margin-based methods suffer from being an underdefined problem, while MaxEnt requires the algorithm to solve the forward RL problem in the inner loop, making it challenging to use in non-tabular settings. Recent works have scaled MaxEnt IRL to large function approximators, such as neural networks, by only partially solving the forward problem in the inner loop, developing an adversarial framework for IRL [7, 8, 11, 25]. Other imitation learning approaches [16, 21, 14, 18] are also based on the adversarial framework, but they do not recover a reward function.
We build upon the ideas in these single-task IRL works. Instead of considering the problem of learning reward functions for a single task, we aim at the problem of inferring a reward that is disentangled from the environment dynamics and can quickly adapt to new tasks from a single demonstration by leveraging prior data.

We base our work on the problem of meta-learning. Prior work has proposed memory-based methods [5, 30, 22, 26] and methods that learn an optimizer and/or a parameter initialization [3, 20, 9]. We adopt a memory-based meta-learning method similar to [26], which uses a deep latent variable generative model [17] to infer different tasks from demonstrations. While prior multi-task and meta-RL methods [15, 26, 29] have investigated the effectiveness of applying latent variable generative models to learning task embeddings, we focus on the IRL problem instead. Meta-IRL [36, 12] incorporates meta-learning and IRL, showing fast adaptation of the reward functions to unseen tasks. Unlike these approaches, our method is not restricted to discrete tabular settings and does not require access to grouped demonstrations sampled from a task distribution. Meanwhile, one-shot imitation learning [6, 10, 38, 37] demonstrates impressive results on learning new tasks using a single demonstration; yet, these methods also require paired demonstrations from each task and hence need prior knowledge of the task distribution. More importantly, one-shot imitation learning approaches only recover a policy, and cannot use additional trials to continue to improve, which is possible when a reward function is inferred instead. Several prior approaches for multi-task imitation learning [21, 14, 34] propose to use unstructured demonstrations without knowing the task distribution, but they neither study quick generalization to new tasks nor provide a reward function.
Our work is thus driven by the goal of extending meta-IRL to address challenging high-dimensional control tasks with the help of an unstructured demonstration dataset.

Figure 1: Experimental domains (left to right): Point-Maze, Ant, Sweeper, and Sawyer Pusher.

5 Experiments

In this section, we seek to investigate the following two questions: (1) Can PEMIRL learn a policy with competitive few-shot generalization abilities compared to one-shot imitation learning methods using only unstructured demonstrations? (2) Can PEMIRL efficiently infer robust reward functions of new continuous control tasks where one-shot imitation learning fails to generalize, enabling an agent to continue to improve with more trials?

We evaluate our method on four simulated domains using the MuJoCo physics engine [35]. To our knowledge, there is no prior work on designing meta-IRL or one-shot imitation learning methods for complex domains with high-dimensional continuous state-action spaces and unstructured demonstrations. Hence, we also designed the following variants of existing state-of-the-art (one-shot) imitation learning and IRL methods so that they can be used as fair comparisons to our method:

• AIRL: The original AIRL algorithm without incorporating latent context variables, trained across all demonstrations.

• Meta-Imitation Learning with Latent Context Variables (Meta-IL): As in [26], we use the inference model $q_\psi(m|\tau)$ to infer the context of a new task from a single demonstrated trajectory, denoted as $\hat{m}$, and then train the conditional imitation policy $\pi_\omega(a|s, \hat{m})$ using the same demonstration.
This approach also resembles [6].

• Meta-InfoGAIL: Similar to the method above, except that an additional discriminator $D(s, a)$ is introduced to distinguish between expert and sample trajectories, and trained along with the conditional policy using the InfoGAIL [21] objective.

We use trust region policy optimization (TRPO) [33] as our policy optimization algorithm across all methods. We collect demonstrations by training experts with TRPO using the ground-truth reward. However, the ground-truth reward is not available to the imitation learning and IRL algorithms. We provide full hyperparameters, architecture information, data efficiency, and experimental setup details in Appendix F. We also include ablation studies on the sensitivity to the latent dimension, the importance of the mutual information objective, and the performance on stochastic environments in Appendix E. Full video results are on the supplementary website2 and our code is open-sourced on GitHub3.

5.1 Policy Performance on Test Tasks

We first answer our first question by showing that our method is able to learn a policy that can adapt to test tasks from a single demonstration, on four continuous control domains. Point Maze Navigation: In this domain, a pointmass needs to navigate around a barrier to reach the goal. Different tasks correspond to different goal positions, and the reward function measures the distance between the pointmass and the goal position. Ant: Similar to [9], this locomotion task requires fast adaptation to the walking directions of the ant, where the ant needs to learn to move backward or forward depending on the demonstration. Sweeper: A robot arm needs to sweep an object to a particular goal position. Fast adaptation in this domain corresponds to different goal locations in the plane. Sawyer Pusher: A simulated Sawyer robot is required to push a mug to a variety of goal positions and generalize to unseen goals.
We illustrate the set-up for these experimental domains in Figure 1.

2 Video results can be found at: https://sites.google.com/view/pemirl
3 Our implementation of PEMIRL can be found at: https://github.com/ermongroup/MetaIRL

Table 1: One-shot policy generalization to test tasks on the four experimental domains, comparing Expert, Random, AIRL [11], Meta-IL, Meta-InfoGAIL, and PEMIRL (ours). Average return and standard deviations are reported over 5 runs; the Expert reference attains 968.80 ± 27.11 on Ant, −50.86 ± 4.75 on Sweeper, −23.36 ± 2.54 on Sawyer Pusher, and −5.21 ± 0.93 on Point Maze.

Figure 2: Visualizations of learned reward functions for point-maze navigation. The red star represents the target position and the white circle represents the initial position of the agent (both differ across iterations). The black horizontal line represents the barrier that cannot be crossed. To show the generalization ability, the expert demonstrations used to infer the target position are sampled from new target positions that were not seen in the meta-training set.

Figure 3: From top to bottom, we show the disabled ant running forward and backward, respectively.

We summarize the results in Table 1.
PEMIRL achieves imitation performance comparable to Meta-IL and Meta-InfoGAIL, while AIRL, which does not incorporate latent context variables, is incapable of handling multi-task scenarios.

5.2 Reward Adaptation to Challenging Situations

After demonstrating that the policy learned by our method achieves competitive "one-shot" generalization, we now answer the second question by showing that PEMIRL learns a robust reward that can adapt to new and more challenging settings where the imitation learning methods and the original AIRL fail. Specifically, after providing the demonstration of an unseen task to the agent, we change the underlying environment dynamics but keep the same task goal. In order to succeed in the task under the new dynamics, the agent must correctly infer the underlying goal of the task instead of simply mimicking the demonstration. We show the effectiveness of our reward generalization by training a new policy with TRPO using the learned reward functions on the new task.

|                       | Method        | Point-Maze-Shift | Disabled-Ant   |
| Policy Generalization | Meta-IL       | −28.61 ± 3.71    | −27.86 ± 10.31 |
|                       | Meta-InfoGAIL | −29.40 ± 3.05    | −51.08 ± 4.81  |
|                       | PEMIRL        | −28.93 ± 3.59    | −46.77 ± 5.54  |
| Reward Adaptation     | AIRL          | −29.07 ± 4.12    | −76.21 ± 10.35 |
|                       | Meta-InfoGAIL | −29.72 ± 3.11    | −38.73 ± 6.41  |
|                       | PEMIRL (ours) | −9.04 ± 1.09     | 152.62 ± 11.75 |
|                       | Expert        | −5.37 ± 0.86     | 331.17 ± 17.82 |

Table 2: Results on direct policy generalization and reward adaptation to challenging situations. Policy generalization examines whether the policies learned by the meta-imitation methods are able to generalize to new tasks with new dynamics, while reward adaptation tests whether the learned reward can lead to efficient RL training in the same setting.
The RL agent trained with PEMIRL's learned rewards outperforms the other methods in these challenging settings.

Point-Maze Navigation with a Shifted Barrier. Following the setup of Fu et al. [11], at meta-test time, after showing a demonstration moving towards a new target position, we change the position of the barrier from left to right. As the agent must adapt by reaching the target with a different path from what was demonstrated during meta-training, it cannot succeed without correctly inferring the true goal (the target position in the maze) and learning from trial and error. As a result, all direct policy generalization approaches fail, as the policies still direct the pointmass to the right side of the maze. As shown in Figure 2, PEMIRL learns disentangled reward functions that successfully infer the underlying goal of the new task without much reward shaping. Such reward functions enable the RL agent to bypass the shifted barrier and reach the true goal position. The RL agent trained with the reward learned by AIRL also fails to bypass the barrier and navigate to the target position: since it neither incorporates latent context variables nor treats the demonstrations as multi-modal, AIRL learns an "average" reward and policy across different tasks. We also use the output of the Meta-InfoGAIL discriminator as a reward signal and evaluate its adaptation performance. The agent trained with this reward fails to complete the task, since Meta-InfoGAIL does not explicitly optimize for reward learning and its discriminator output converges to an uninformative uniform distribution.

Disabled Ant Walking. As in Fu et al. [11], we disable and shorten two front legs of the ant such that it cannot walk without substantially changing its gait. Similar to Point-Maze-Shift, all imitation policies fail to maneuver the disabled ant in the right direction.
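In both domains, reward adaptation follows the same recipe stated earlier: in the modified environment the ground-truth reward is unavailable, so sampled transitions are relabeled with the frozen learned reward before each policy-optimization step. A rough sketch, where the reward weights, context, and dimensions are illustrative stand-ins (the paper optimizes the policy with TRPO, which stands in here for any RL algorithm run on the relabeled batch):

```python
import numpy as np

rng = np.random.default_rng(1)
S_DIM, A_DIM, K = 4, 2, 3        # toy state, action, and context sizes

def learned_reward(state, action, context, Wr):
    """Stand-in for the frozen learned reward, conditioned on the
    context inferred from a single demonstration."""
    return float(Wr @ np.concatenate([state, action, context]))

def relabel(transitions, context, Wr):
    """Replace the (unavailable) ground-truth reward of each sampled
    transition with the learned reward; a policy optimizer such as
    TRPO would then be trained on the relabeled batch."""
    return [(s, a, learned_reward(s, a, context, Wr))
            for (s, a, _) in transitions]

Wr = rng.normal(size=S_DIM + A_DIM + K)     # stand-in reward weights
context = rng.normal(size=K)                # inferred from the demo
transitions = [(rng.normal(size=S_DIM), rng.normal(size=A_DIM), 0.0)
               for _ in range(8)]           # rollouts under new dynamics
batch = relabel(transitions, context, Wr)
```

Because the reward, not the demonstrated trajectory, drives the new policy, the agent can discover a different gait or path that still achieves the demonstrated goal.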
As shown in Figure 3, reward functions learned by PEMIRL encourage the RL policy to orient the ant towards the demonstrated direction and move along that direction using its two healthy legs, which is only possible when the inferred reward corresponds to the true underlying goal and is disentangled from the dynamics. In contrast, the learned reward of the original AIRL, as well as the discriminator output of Meta-InfoGAIL, cannot infer the underlying goal of the task or provide a precise supervision signal, which leads to the unsatisfactory performance of the induced RL policies. Quantitative results are presented in Table 2.

6 Conclusion

In this paper, we propose a new meta-inverse reinforcement learning algorithm, PEMIRL, which can efficiently infer robust reward functions that are disentangled from the dynamics and highly correlated with the ground-truth rewards in meta-learning settings. To our knowledge, PEMIRL is the first model-free Meta-IRL algorithm that can achieve this and scale to complex domains with continuous state-action spaces. PEMIRL generalizes to new tasks by performing inference over a latent context variable from a single demonstration, on which the recovered policy and reward function are conditioned. Extensive experimental results demonstrate the scalability and effectiveness of our method against strong baselines.

Acknowledgments

This research was supported by Toyota Research Institute, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), and AFOSR (FA9550-19-1-0024). The authors would like to thank Chris Cundy for discussions over the paper draft.

References
[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1, 2004.

[2] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety.
arXiv preprint arXiv:1606.06565, 2016.

[3] Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. arXiv preprint arXiv:1606.04474, 2016.

[4] Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, 1991.

[5] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL^2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[6] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Neural Information Processing Systems (NIPS), 2017.

[7] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.

[8] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, June 2016.

[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

[10] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning (CoRL), 2017.

[11] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

[12] Adam Gleave and Oliver Habryka.
Multi-task maximum entropy inverse reinforcement learning. arXiv preprint arXiv:1805.08882, 2018.

[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[14] Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav Sukhatme, and Joseph J Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, pages 1235–1245, 2017.

[15] Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.

[16] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, pages 4565–4573, 2016.

[17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[18] Alex Kuefler and Mykel J. Kochenderfer. Burn-in demonstrations for multi-modal imitation learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '18, 2018.

[19] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

[20] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.

[21] Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pages 3812–3822, 2017.

[22] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International Conference on Machine Learning (ICML), 2017.

[23] Andrew Y. Ng and Stuart J. Russell.
Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, 2000.

[24] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.

[25] Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. arXiv preprint arXiv:1810.00821, 2018.

[26] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.

[27] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, 2006.

[28] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.

[29] Steindór Sæmundsson, Katja Hofmann, and Marc Peter Deisenroth. Meta reinforcement learning with latent variable gaussian processes. arXiv preprint arXiv:1803.07551, 2018.

[30] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.

[31] Stefan Schaal, Auke Ijspeert, and Aude Billard. Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 2003.

[32] Jürgen Schmidhuber.
Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

[33] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, 2015.

[34] Arjun Sharma, Mohit Sharma, Nicholas Rhinehart, and Kris M. Kitani. Directed-Info GAIL: Learning hierarchical policies from unsegmented demonstrations using directed information. arXiv preprint arXiv:1810.01266, 2018.

[35] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.

[36] Kelvin Xu, Ellis Ratner, Anca Dragan, Sergey Levine, and Chelsea Finn. Learning a prior over intent via meta-inverse reinforcement learning. arXiv preprint arXiv:1805.12573, 2018.

[37] Tianhe Yu, Pieter Abbeel, Sergey Levine, and Chelsea Finn. One-shot hierarchical imitation learning of compound visuomotor tasks. arXiv preprint arXiv:1810.11043, 2018.

[38] Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. In Robotics: Science and Systems (R:SS), 2018.

[39] Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. arXiv preprint arXiv:1710.04615, 2017.

[40] Shengjia Zhao, Jiaming Song, and Stefano Ermon. The information autoencoding family: A lagrangian perspective on latent variable generative models. arXiv preprint arXiv:1806.06514, 2018.

[41] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.

[42] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey.
Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438, Chicago, IL, USA, 2008.