{"title": "A Bayesian Approach to Generative Adversarial Imitation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7429, "page_last": 7439, "abstract": "Generative adversarial training for imitation learning has shown promising results on high-dimensional and continuous control tasks. This paradigm is based on reducing the imitation learning problem to the density matching problem, where the agent iteratively refines the policy to match the empirical state-action visitation frequency of the expert demonstration. Although this approach has shown to robustly learn to imitate even with scarce demonstration, one must still address the inherent challenge that collecting trajectory samples in each iteration is a costly operation. To address this issue, we first propose a Bayesian formulation of generative adversarial imitation learning (GAIL), where the imitation policy and the cost function are represented as stochastic neural networks. Then, we show that we can significantly enhance the sample efficiency of GAIL leveraging the predictive density of the cost, on an extensive set of imitation learning tasks with high-dimensional states and actions.", "full_text": "A Bayesian Approach to Generative Adversarial\n\nImitation Learning\n\nWonseok Jeon1, Seokin Seo1, Kee-Eung Kim1,2\n1 School of Computing, KAIST, Republic of Korea\n\n2 PROWLER.io\n\n{wsjeon, siseo}@ai.kaist.ac.kr, kekim@cs.kaist.ac.kr\n\nAbstract\n\nGenerative adversarial training for imitation learning has shown promising results\non high-dimensional and continuous control tasks. This paradigm is based on\nreducing the imitation learning problem to the density matching problem, where\nthe agent iteratively re\ufb01nes the policy to match the empirical state-action visitation\nfrequency of the expert demonstration. 
Although this approach can robustly learn to imitate even with scarce demonstrations, one must still address the inherent challenge that collecting trajectory samples in each iteration is a costly operation. To address this issue, we first propose a Bayesian formulation of generative adversarial imitation learning (GAIL), where the imitation policy and the cost function are represented as stochastic neural networks. Then, we show that we can significantly enhance the sample efficiency of GAIL by leveraging the predictive density of the cost, on an extensive set of imitation learning tasks with high-dimensional states and actions.\n\n1 Introduction\n\nImitation learning is the problem where an agent learns to mimic the demonstration provided by the expert, in an environment with unknown cost function. Imitation learning with policy gradients [Ho et al., 2016] is a recently proposed approach that uses gradient-based stochastic optimizers. Along with trust-region policy optimization (TRPO) [Schulman et al., 2015] as the optimizer, it is shown to be one of the most practical approaches that scale well to large-scale environments, i.e., high-dimensional state and action spaces. Generative adversarial imitation learning (GAIL) [Ho and Ermon, 2016], which is of our primary interest, is a recent instance of imitation learning algorithms with policy gradients. GAIL reformulates the imitation learning problem as a density matching problem, and makes use of generative adversarial networks (GANs) [Goodfellow et al., 2014]. This is achieved by generalizing the representation of the underlying cost function using neural networks, instead of restricting it to the class of linear functions for the sake of simpler optimization. As a result, the policy being learned becomes the generator, and the cost function becomes the discriminator. 
Based on the promising results from GAIL, a number of improvements have appeared in the literature [Wang et al., 2017, Li et al., 2017].\n\nYet, one of the fundamental challenges lies in the fact that obtaining trajectory samples from the environment is often very costly, e.g., for physical robots situated in the real world. Among a number of improved variants of GAIL, we remark that generative moment matching imitation learning (GMMIL) [Kim and Park, 2018], which uses kernel mean embedding to improve the discriminator training just as in generative moment matching networks (GMMNs) [Li et al., 2015], was experimentally shown to converge much faster and more stably than GAIL. This gives us a hint that a robust discriminator is an important factor in improving the sample efficiency of generative-adversarial approaches to imitation learning.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nIn this work, we also aim to enhance the sample efficiency of the generative-adversarial approach to imitation learning. Our main idea is to use a Bayesian discriminator in GAIL, e.g., a Bayesian neural network; we thus refer to our algorithm as Bayes-GAIL (BGAIL). To achieve this, we first reformulate GAIL in the Bayesian framework. As a result, we show that GAIL can be seen as optimizing a surrogate objective in our approach, with iterative updates being maximum-likelihood (ML) point estimations. In our work, instead of using the ML point estimate, we propose to use the predictive density of the cost. This gives more informative cost signals for the policy training and makes BGAIL significantly more sample-efficient compared to the original GAIL.\n\n2 Preliminaries\n\n2.1 Reinforcement Learning (RL) and Notations\n\nWe first define the basic notions from RL. 
The RL problem considers an agent that chooses an action after observing an environment state, and an environment that reacts to the agent's action with a cost and a successor state. The agent-environment interaction is modeled by a Markov decision process (MDP) M := ⟨S, A, c, PT, ν, γ⟩; S is a state space, A is an action space, c(s, a) is a cost function, PT(s'|s, a) is the state transition distribution, ν(s) is the initial state distribution, and γ ∈ [0, 1] is a discount factor. M− denotes an MDP M without the cost function (MDP\\C), i.e., ⟨S, A, PT, ν, γ⟩. The (stochastic) policy π(a|s) is defined as the probability of choosing action a in state s.\n\nGiven the cost function c, the objective of RL is to find the policy π that minimizes the expected long-term cost η(π, c) := E_π[Σ_{t=0}^∞ γ^t c(s_t, a_t)], where the subscript π in the expectation implies that the trajectory (s_0, a_0, s_1, a_1, ...) is generated from the policy π with the transition distribution of M−. The state value function V^c_π and the action value function Q^c_π are defined as V^c_π(s) := E_π[Σ_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s] and Q^c_π(s, a) := E_π[Σ_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s, a_0 = a], respectively. The optimal value functions V^c_*, Q^c_* for c are the value functions for the optimal policy π^c_* := arg min_π η(π, c) under the cost function c. The γ-discounted state visitation occupancy ρ_π for policy π is defined as ρ_π(s) := E_π[Σ_{t=0}^∞ γ^t δ(s − s_t)] for the Dirac delta function δ when the state space S is assumed to be continuous. For convenience, we denote the γ-discounted state-action visitation occupancy for π as ρ_π(s, a) := ρ_π(s)π(a|s). It can be simply shown that η(π, c) = E_{(s,a)∼ρ_π}[c(s, a)] := Σ_{s,a} ρ_π(s, a)c(s, a). Throughout this paper, bold-math letters are used to indicate random variables, and their realizations are written as non-bold letters.\n\n2.2 Imitation Learning\n\nHistorically, behavioral cloning (BC) [Pomerleau, 1991] is one of the simplest approaches to imitation learning; it learns to map states to demonstrated actions using supervised learning. However, BC is susceptible to compounding error, which refers to small prediction errors accumulating over time to a catastrophic level [Bagnell, 2015]. Inverse reinforcement learning (IRL) [Russell, 1998, Ng and Russell, 2000, Ziebart et al., 2008] is a more modern approach, where the objective is to learn the underlying unknown cost function that makes the expert optimal. Although this is a more principled approach to imitation learning, IRL algorithms usually involve planning as an inner loop, which usually requires knowledge of the transition distribution and substantially increases the computational complexity of IRL. In addition, IRL is fundamentally an ill-posed problem, i.e., there exist infinitely many cost functions that can describe identical policies, and it thus requires some form of preference on the choice of cost functions [Ng and Russell, 2000]. 
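As a concrete illustration of the quantities defined in Section 2.1, the long-term cost η(π, c) can be estimated by truncated Monte-Carlo rollouts. The sketch below uses a toy two-state MDP and a uniformly random policy; the MDP, cost function, and all names are illustrative assumptions, not objects from the paper:

```python
import random

def rollout_cost(step, cost, s0, gamma=0.99, horizon=200):
    """One truncated Monte-Carlo sample of the discounted long-term cost
    eta(pi, c) = E_pi[sum_t gamma^t c(s_t, a_t)], under a uniformly random policy."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = random.choice([0, 1])   # pi(a|s): uniform over two actions
        total += discount * cost(s, a)
        discount *= gamma
        s = step(s, a)              # sample the successor state s' ~ P_T(.|s, a)
    return total

def estimate_eta(step, cost, s0, n_traj=500, gamma=0.99):
    """Average over trajectories to approximate eta(pi, c)."""
    return sum(rollout_cost(step, cost, s0, gamma) for _ in range(n_traj)) / n_traj

# Toy deterministic two-state MDP: only the pair (s, a) = (0, 1) incurs cost.
toy_step = lambda s, a: (s + a) % 2
toy_cost = lambda s, a: 1.0 if (s, a) == (0, 1) else 0.0
```

Since every per-step cost here lies in [0, 1], any estimate is bounded above by Σ_t γ^t < 1/(1 − γ) = 100, which gives a quick sanity check on the rollouts.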
The Bayesian approach to IRL [Ramachandran and Amir, 2007, Choi and Kim, 2011] is one way of encoding the cost function preferences, and will be introduced in the following section.\n\nFinally, imitation learning with policy gradients [Ho et al., 2016] is one of the most recent approaches, which replaces the costly planning inner loop with the policy gradient update in RL, making the algorithm practical and scalable. Generative adversarial imitation learning (GAIL) [Ho and Ermon, 2016] is an instance of this approach, based on the adversarial training objective\n\nmax_π min_{D ∈ (0,1)^{S×A}} { E_{(s,a)∼ρ_{πE}}[log D(s, a)] + E_{(s,a)∼ρ_π}[log(1 − D(s, a))] }   (1)\n\nfor a set (0,1)^{S×A} of functions D : S × A → (0, 1). This is essentially the training objective of a GAN, where the generator is the policy π, and the discriminator D is the intermediate cost function to be used in the policy gradient update to match ρ_π to ρ_{πE}.\n\nFigure 1: Graphical model for GAIL. The state-action pairs are denoted by z := (s, a). Note that p(z_1) = ν(s_1)π_θ(a_1|s_1) and p(z_{t+1}|z_t) = π_θ(a_{t+1}|s_{t+1})PT(s_{t+1}|s_t, a_t). Also, the discriminator parameter φ and the policy parameter θ are regarded as random variables.\n\n2.3 Bayesian Inverse Reinforcement Learning (BIRL)\n\nThe Bayesian framework for IRL was proposed by Ramachandran and Amir [2007], where the cost function c is regarded as a random function. For the expert demonstration set D := {τ_n = (s^(n)_t, a^(n)_t)_{t=1}^{T^(n)} | n = 1, ..., N} collected under M−, the cost function preference and the optimality confidence on the expert's trajectories D are encoded as the prior p(c) and the likelihood p(D|c), respectively. As for the likelihood, the samples in D are assumed to independently follow a Gibbs distribution with the optimal action value function Q^c_* as the potential, i.e., p(D|c) := Π_{n=1}^N Π_{t=1}^{T^(n)} p(a^(n)_t | s^(n)_t, c) for p(a^(n)_t | s^(n)_t, c) ∝ exp(Q^c_*(s^(n)_t, a^(n)_t)/β) with the temperature parameter β. Under this model, reward inference and imitation learning using the posterior mean reward were suggested. Choi and Kim [2011] suggested a BIRL approach using maximum a posteriori (MAP) inference. Based on the reward optimality region [Ng and Russell, 2000], the authors found that there are cases where the posterior mean reward lies outside the optimality region, whereas the MAP reward is posed inside the region. In addition, it was shown that the existing works on IRL [Ng and Russell, 2000, Ratliff et al., 2006, Syed et al., 2008, Neu and Szepesvári, 2007, Ziebart et al., 2008] can be viewed as special cases of MAP inference if we choose the likelihood and the prior properly.\n\n3 Bayesian Generative Adversarial Imitation Learning\n\nIn order to formally present our approach, let us denote the agent's policy as π_A and the expert's policy as π_E. In addition, let us denote the sets D_A and D_E of trajectories generated by π_A and π_E, respectively, under M− as\n\nD_A := { τ^(n)_A = (s^(n)_{A,t}, a^(n)_{A,t})_{t=1}^T | n = 1, ..., N_A },   (2)\n\nwhere the quantities for the expert are defined in a similar way. In the remainder of this work, we drop the subscripts A and E if there is no confusion. 
Also, note that D_E will be given as input to the imitation learning algorithm, whereas D_A will be generated in each iteration of optimization. It is natural to assume that the agent's and the expert's trajectories τ_A and τ_E are independently generated, i.e., p(τ_A, τ_E) = p(τ_A)p(τ_E), with p(τ) := ν(s_1)π(a_1|s_1) Π_{t=2}^T PT(s_t|s_{t−1}, a_{t−1})π(a_t|s_t). In this work, we reformulate GAIL [Ho and Ermon, 2016] in the Bayesian framework as follows.\n\n3.1 Bayesian Framework for Adversarial Imitation Learning\n\nAgent-expert discrimination Suppose π_A is fixed for simplicity; it will later be parameterized for learning. Let us consider binary auxiliary random variables o_{A,t}, o_{E,t} for all t, where o_t becomes 1 if the given state-action pair (s_t, a_t) is generated by the expert, and becomes 0 otherwise. Then, the joint distribution of (τ_A, τ_E, o_A, o_E) can be written as\n\np(τ_A, τ_E, o_A, o_E) = p(τ_A)p(τ_E) [Π_{t=1}^T p(o_{A,t}|s_{A,t}, a_{A,t})] [Π_{t=1}^T p(o_{E,t}|s_{E,t}, a_{E,t})]   (3)\n\nfor o := (o_t)_{t=1}^T := (o_1, ..., o_T), where o_t is a realization of the random variable o_t. Although p(o_t|s_t, a_t) cannot be the same for both agent and expert and for all t, we can simplify the problem by applying a single approximate discriminator D_φ(s, a) with parameter φ such that\n\np(o_t|s_t, a_t) ≈ p̂(o_t|s_t, a_t; φ) := (1 − D_φ(s_t, a_t))^{o_t} D_φ(s_t, a_t)^{1−o_t} = { 1 − D_φ(s_t, a_t), if o_t = 1; D_φ(s_t, a_t), otherwise. }   (4)\n\nUsing the approximation in (4), the distribution in (3) is given by\n\np(τ_A, τ_E, o_A, o_E) ≈ p̂(τ_A, τ_E, o_A, o_E; φ)   (5)\n:= p(τ_A)p(τ_E) Π_{t=1}^T p̂(o_{A,t}|s_{A,t}, a_{A,t}; φ) Π_{t=1}^T p̂(o_{E,t}|s_{E,t}, a_{E,t}; φ).   (6)\n\nIt should be noted that the distribution in (6) holds for an arbitrary choice of τ_A, τ_E, o_A, o_E. Also, the graphical model for those random variables is shown in Figure 1 to clarify the dependencies between the random variables.\n\nNow, suppose a discrimination optimality event o_A = 0, o_E = 1 is observed for some fixed trajectories τ_A, τ_E, where 1 := (1)_{t=1}^T := (1, ..., 1) and 0 is defined in a similar way. Intuitively, the discrimination optimality event is an event such that the discriminator perfectly recognizes the policy that generated the given state-action pairs. By introducing a prior p(φ) on the discriminator parameter φ and the agent policy π_A(·|·; θ) parameterized with θ, we obtain the following posterior distribution conditioned on the discrimination optimality event and θ:\n\np(φ, τ_A, τ_E | 0_A, 1_E; θ) ∝ p(φ)p(τ_A; θ)p(τ_E)p(0_A|τ_A; φ)p(1_E|τ_E; φ).   (7)\n\nHere, 0_A and 1_E are defined as the events o_A = 0 and o_E = 1, respectively. 
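As a concrete illustration of (4)-(7): the per-pair likelihood p̂(o|s, a; φ) and the induced trajectory-level event likelihoods p(0_A|τ_A; φ) and p(1_E|τ_E; φ) are simple products over state-action pairs. The logistic discriminator below is an illustrative stand-in for D_φ, not the network used in the paper:

```python
import math

def pair_likelihood(D, s, a, o):
    """Equation (4): p_hat(o|s,a; phi) = (1 - D(s,a))^o * D(s,a)^(1-o).
    o = 1 labels the pair as expert-generated, o = 0 as agent-generated."""
    d = D(s, a)
    return 1.0 - d if o == 1 else d

def log_event_likelihood(D, trajectory, o):
    """log p(0_A|tau; phi) for o = 0, or log p(1_E|tau; phi) for o = 1,
    i.e. the log-product of (4) over all state-action pairs in the trajectory."""
    return sum(math.log(pair_likelihood(D, s, a, o)) for s, a in trajectory)

# Illustrative discriminator: logistic in s + a, so D(s, a) is always in (0, 1).
D = lambda s, a: 1.0 / (1.0 + math.exp(-(s + a)))
```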
By using the posterior p(φ|0_A, 1_E; θ), which marginalizes out τ_A and τ_E in (7), we can consider the full distribution of φ or select an appropriate point estimate for φ that maximizes the posterior.\n\nDiscrimination-based imitation Suppose we want to find the parameter θ of π_A that well approximates π_E based on the discrimination results. By considering the parameters (θ, φ) as random variables, the distribution for (τ_A, τ_E, o_A, o_E, θ, φ) is\n\np(τ_A, τ_E, o_A, o_E, θ, φ) = p(θ)p(φ)p(τ_A, τ_E, o_A, o_E; θ, φ)   (8)\n= p(θ)p(φ)p(τ_A; θ)p(τ_E) Π_{t=1}^T p̂(o_{A,t}|s_{A,t}, a_{A,t}; φ) Π_{t=1}^T p̂(o_{E,t}|s_{E,t}, a_{E,t}; φ),   (9)\n\nwhere φ is assumed to be independent of θ. Similar to the agent-expert discrimination case, suppose we observe the imitation optimality event o_A ≠ 0, irrespective of o_E. Note that the imitation optimality event implies preventing the occurrence of discrimination optimality events. To get the optimal policy parameter by using the discriminator, we can consider the following (conditional) posterior:\n\np(θ, τ_A | 0̃_A; φ) ∝ p(θ)p(τ_A; θ)p(0̃_A|τ_A; φ).   (10)\n\nHere, 0̃_A is defined as the probabilistic event o_A ≠ 0. Finally, by using p(θ|0̃_A; φ), which comes from the marginalization of τ_A in (10), either the full distribution of θ or the corresponding point estimate can be used.\n\n3.2 GAIL as an Iterative Point Estimator\n\nUnder our Bayesian framework, GAIL can be regarded as an algorithm that iteratively uses (7) and (10) for updating θ and φ using their point estimates. 
For the discriminator update, the objective of GAIL is to maximize the expected log-likelihood, with θ_prev given from the previous iteration and τ_A generated by using π_A(a|s; θ_prev):\n\narg max_φ E_{τ_A, τ_E | θ=θ_prev}[log p(0_A|τ_A, φ)p(1_E|τ_E, φ)]   (11)\n= arg max_φ E_{τ_A, τ_E | θ=θ_prev}[ Σ_{t=1}^T log D_φ(s_{A,t}, a_{A,t}) + Σ_{t=1}^T log(1 − D_φ(s_{E,t}, a_{E,t})) ].   (12)\n\nThis can be regarded as a surrogate objective with an uninformative prior p(φ), since\n\nlog p(0_A, 1_E | φ, θ_prev) = log E_{τ_A, τ_E | θ=θ_prev}[p(0_A|τ_A, φ)p(1_E|τ_E, φ)] + constant   (13)\n≥ E_{τ_A, τ_E | θ=θ_prev}[log p(0_A|τ_A, φ)p(1_E|τ_E, φ)] + constant,   (14)\n\nwhere the inequality in (14) follows from Jensen's inequality. 
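In code, the discriminator objective (12) is an ordinary binary-classification log-likelihood: ascend log D_φ on agent pairs and log(1 − D_φ) on expert pairs. The sketch below uses a hand-rolled logistic discriminator over scalar (s, a) features with closed-form gradients; the features, parameterization, and learning rate are illustrative assumptions, not the paper's architecture:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def surrogate(phi, agent_pairs, expert_pairs):
    """Objective (12): sum of log D_phi over agent pairs plus sum of
    log(1 - D_phi) over expert pairs, with
    D_phi(s, a) = sigmoid(phi[0]*s + phi[1]*a + phi[2])."""
    d = lambda s, a: sigmoid(phi[0] * s + phi[1] * a + phi[2])
    return (sum(math.log(d(s, a)) for s, a in agent_pairs)
            + sum(math.log(1.0 - d(s, a)) for s, a in expert_pairs))

def ascent_step(phi, agent_pairs, expert_pairs, lr=0.1):
    """One gradient-ascent step on (12); for a logistic D the gradients are closed-form."""
    grad = [0.0, 0.0, 0.0]
    for s, a in agent_pairs:                 # d/dphi log D = (1 - D) * x
        w = 1.0 - sigmoid(phi[0] * s + phi[1] * a + phi[2])
        for i, x in enumerate((s, a, 1.0)):
            grad[i] += w * x
    for s, a in expert_pairs:                # d/dphi log(1 - D) = -D * x
        w = -sigmoid(phi[0] * s + phi[1] * a + phi[2])
        for i, x in enumerate((s, a, 1.0)):
            grad[i] += w * x
    return [p + lr * g for p, g in zip(phi, grad)]
```

Because this log-likelihood is concave in φ, a sufficiently small ascent step is guaranteed to improve the objective.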
For the policy update, the objective of GAIL is\n\narg max_θ E_{τ_A|θ=θ}[log p(0̃_A|τ_A, φ_prev)] = arg min_θ E_{τ_A|θ=θ}[ Σ_{t=1}^T log D_{φ_prev}(s_{A,t}, a_{A,t}) ].   (15)\n\nSimilarly, for the uninformative prior p(θ), we can show that\n\nlog p(0̃_A | θ, φ_prev) = log E_{τ_A|θ=θ}[p(0̃_A|τ_A, φ_prev)] + constant   (16)\n≥ E_{τ_A|θ=θ}[log p(0̃_A|τ_A, φ_prev)] + constant,   (17)\n\nand thus the objective in (15) can be regarded as a surrogate objective. In addition, since the form of the objective in (15) is the same as policy optimization with an immediate cost function log D_{φ_prev}(·,·), GAIL uses TRPO, a state-of-the-art policy gradient algorithm, for updating θ.\n\nNote that our approach shares the same insight behind the probabilistic inference formulation of reinforcement learning, in which the reinforcement learning problem is cast into a probabilistic inference problem by introducing an auxiliary return optimality event [Toussaint, 2009, Neumann, 2011, Abdolmaleki et al., 2018]. Also, if we consider the maximization of log p(1_A | θ, φ_prev), which results from defining the imitation optimality event as o_A = 1, it can be shown that the corresponding surrogate objective becomes policy optimization with an immediate reward function log(1 − D_{φ_prev}(·,·)). This is in line with speeding up GAN training by either maximizing log(1 − D(·)) or minimizing log D(·), as suggested in Goodfellow et al. [2014]. Some recent work on adversarial inverse reinforcement learning also supports the use of such a reward function [Finn et al., 2016, Fu et al., 2018].\n\n3.3 Sample-efficient Imitation Learning with Predictive Cost Function\n\nSince model-free imitation learning algorithms (e.g., GAIL) require experience samples obtained from the environment, improving the sample efficiency is critical. From the Bayesian formulation in the previous section, GAIL can be seen as maximizing (minimizing) the expected log-likelihood in a point-wise manner for the discriminator (policy) updates, and this makes the algorithm quite inefficient compared to using the full predictive distribution.\n\nWe thus propose to use the posterior of the discriminator parameter so that more robust cost signals are available for policy training. Formally, let us consider iterative updates for the policy parameter θ and the discriminator parameter φ, where the point estimate of θ is obtained using the distribution over φ in each iteration. 
In other words, given θ_prev from the previous iteration, we want to utilize p_posterior(φ) := p(φ | 0_A, 1_E, θ_prev), which satisfies\n\nlog p_posterior(φ) = log{ p(φ) E_{τ_A|θ=θ_prev}[p(0_A|τ_A, φ)] E_{τ_E}[p(1_E|τ_E, φ)] } + constant.   (18)\n\nBy using Monte-Carlo estimation for the expectations over trajectories in (18), the log posterior in (18) can be approximated as\n\nlog p(φ) + log Σ_{n=1}^N exp(F^(n)_{A,φ}) + log Σ_{n=1}^N exp(F^(n)_{E,φ}) + constant,   (19)\n\nwhere F^(n)_{A,φ} := Σ_{t=1}^T log D_φ(s^(n)_{A,t}, a^(n)_{A,t}) and F^(n)_{E,φ} := Σ_{t=1}^T log(1 − D_φ(s^(n)_{E,t}, a^(n)_{E,t})). Note that we can also use the surrogate objective of GAIL in (14) with a prior on φ, which might be suitable for infinite-horizon problems.\n\nAt each iteration of our algorithm, we try to find the policy parameter θ that maximizes the log posterior log p(θ|0̃_A). For an uninformative prior on θ, the objective can be written as\n\narg max_θ log p(θ|0̃_A) = arg min_θ log p(0_A|θ) = arg min_θ log E_{τ_A|θ=θ, φ∼p_posterior}[p(0_A|τ_A, φ)].   (20)\n\nBy applying Jensen's inequality to (20), we obtain E_{τ_A|θ=θ, φ∼p_posterior}[log p(0_A|τ_A, φ)], which can be minimized by policy optimization. 
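The Monte-Carlo approximation (19) can be computed with a numerically stable log-sum-exp. The sketch below assumes a constant (uninformative) log-prior and represents each trajectory as a list of (s, a) pairs; these are illustrative simplifications rather than the paper's implementation:

```python
import math

def traj_log_score(D, traj, expert):
    """F^(n)_{A,phi} = sum_t log D(s,a) for agent trajectories (expert=False),
    F^(n)_{E,phi} = sum_t log(1 - D(s,a)) for expert trajectories (expert=True)."""
    return sum(math.log(1.0 - D(s, a)) if expert else math.log(D(s, a))
               for s, a in traj)

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_posterior(D, agent_trajs, expert_trajs, log_prior=0.0):
    """Approximation (19), up to an additive constant:
    log p(phi) + log sum_n exp(F_A^(n)) + log sum_n exp(F_E^(n))."""
    f_a = [traj_log_score(D, t, expert=False) for t in agent_trajs]
    f_e = [traj_log_score(D, t, expert=True) for t in expert_trajs]
    return log_prior + log_sum_exp(f_a) + log_sum_exp(f_e)
```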
In contrast to GAIL, which uses a single point estimate for the maximization of p_posterior, multiple parameters φ_1, ..., φ_K randomly sampled from p_posterior are used to estimate the objective:\n\nE_{τ_A|θ=θ}[ (1/K) Σ_{k=1}^K log p(0_A|τ_A, φ_k) ] = E_{τ_A|θ=θ}[ (1/K) Σ_{k=1}^K Σ_{t=1}^T log D_{φ_k}(s_{A,t}, a_{A,t}) ]   (21)\n= E_{τ_A|θ=θ}[ Σ_{t=1}^T { (1/K) Σ_{k=1}^K log D_{φ_k}(s_{A,t}, a_{A,t}) } ].   (22)\n\nNote that (22) implies we can perform RL policy optimization with the predictive cost function (1/K) Σ_{k=1}^K log D_{φ_k}(s, a). In addition, if we consider p(1_A|τ_A, φ_k) rather than p(0̃_A|τ_A, φ_k), the optimization problem becomes RL with the predictive reward function (1/K) Σ_{k=1}^K log(1 − D_{φ_k}(s, a)). The remaining question is how to get samples from the posterior, which will be discussed in the next section.\n\n4 Posterior Sampling Based on Stein Variational Gradient Descent (SVGD)\n\nSVGD [Liu and Wang, 2016] is a recently proposed Bayesian inference algorithm based on particle updates, which we briefly review as follows: suppose that a random variable x follows the distribution q^(0), and the target distribution p is known up to the normalization constant. Also, consider a sequence of transformations T^(0), T^(1), ..., where\n\nT^(i)(x) := x + ε^(i) ψ_{q^(i),p}(x),   ψ_{q,p}(x') := E_{x∼q}[k(x, x')∇_x log p(x) + ∇_x k(x, x')]   (23)\n\nwith a sufficiently small step size ε^(i), the probability distribution q^(i) of (T^(i−1) ∘ ··· ∘ T^(0))(x), and some positive definite kernel k(·,·). 
Interestingly, the deterministic transformation (23) turns out to be an iterative update of the probability distribution towards the target distribution p, and ψ_{q^(i),p} can be interpreted as the functional gradient in the reproducing kernel Hilbert space (RKHS) defined by the kernel k(·,·). SVGD was shown to minimize the kernelized Stein discrepancy S(q^(i), p) between q^(i) and p [Liu et al., 2016] in each iteration. In practice, SVGD uses a finite number of particles. More formally, for K particles {x^(0)_k}_{k=1}^K that are initially sampled, SVGD iteratively updates those particles by the following transformation that approximates (23):\n\nT^(i)(x) := x + ε^(i) ψ̂^(i)_p(x),   ψ̂^(i)_p(x) := (1/K) Σ_{k=1}^K [ k(x^(i)_k, x) ∇_{x^(i)_k} log p(x^(i)_k) + ∇_{x^(i)_k} k(x^(i)_k, x) ].   (24)\n\nEven with the approximate deterministic transform and a few particles, SVGD was experimentally shown to significantly outperform common Bayesian inference algorithms. In the extreme case where a single particle is used, SVGD is equivalent to MAP inference.\n\nIn our work, we use SVGD to draw samples of the discriminator parameters from the posterior (19). Specifically, we first choose a set of K initial particles (discriminator parameters) {φ^(0)_k}_{k=1}^K. Then, we use the gradient of (19) for those particles and apply the update rule in (24) to obtain particles generated from the posterior distribution in (19). Finally, by using those particles, the predictive cost function is derived. 
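A minimal one-dimensional sketch of the particle update (24) with an RBF kernel; the toy target density (a standard Gaussian, so ∇_x log p(x) = −x), the fixed bandwidth, and the step size are illustrative choices, standing in for the posterior gradient, median-heuristic bandwidth, and optimizer used for BGAIL:

```python
import math

def rbf(x, y, h=1.0):
    """RBF kernel k(x, y) = exp(-(x - y)^2 / h)."""
    return math.exp(-(x - y) ** 2 / h)

def grad_rbf_wrt_first(x, y, h=1.0):
    """d/dx k(x, y) for the RBF kernel above."""
    return -2.0 * (x - y) / h * rbf(x, y, h)

def svgd_step(particles, grad_log_p, eps=0.1, h=1.0):
    """One SVGD transform (24): x <- x + eps * psi_hat(x), where psi_hat(x) =
    (1/K) sum_k [k(x_k, x) grad log p(x_k) + grad_{x_k} k(x_k, x)]."""
    K = len(particles)
    updated = []
    for x in particles:
        psi = sum(rbf(xk, x, h) * grad_log_p(xk) + grad_rbf_wrt_first(xk, x, h)
                  for xk in particles) / K
        updated.append(x + eps * psi)
    return updated

# Toy target: standard Gaussian, so grad log p(x) = -x.
grad_log_p = lambda x: -x
```

For BGAIL, grad_log_p would be the gradient of (19) with respect to a discriminator particle φ_k, and the updated particles define the predictive cost (1/K) Σ_{k=1}^K log D_{φ_k}(s, a); the kernel's gradient term acts as a repulsive force that keeps the particles from collapsing onto a single MAP estimate.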
The complete BGAIL algorithm leveraging SVGD and the predictive cost function is summarized in Algorithm 1.\n\nAlgorithm 1 Bayesian Generative Adversarial Imitation Learning (BGAIL)\n1: Input: Expert trajectories D_E, initial policy parameter θ, a set of initial discriminator parameters {φ_k}_{k=1}^K, p(φ) for the preference of φ\n2: for each iteration do\n3: Sample trajectories by using policy π_θ.\n4: Update θ using policy optimization, e.g., TRPO, with cost function (1/K) Σ_{k=1}^K log D_{φ_k}(s, a).\n5: Sample trajectories from D_E.\n6: for k = 1, ..., K do\n7: Calculate the gradient δ_k of either (19) or its surrogate objective (17) for φ_k.\n8: end for\n9: for k = 1, ..., K do\n10: Update φ_k ← φ_k + α ψ̂(φ_k) for a step size parameter α, where ψ̂(φ) := (1/K) Σ_{j=1}^K ( k(φ_j, φ)δ_j + ∇_{φ_j} k(φ_j, φ) ). ▷ SVGD\n11: end for\n12: end for\n\n5 Experiments\n\nWe evaluated BGAIL on five continuous control tasks (Hopper-v1, Walker2d-v1, HalfCheetah-v1, Ant-v1, Humanoid-v1) from OpenAI Gym, implemented with the MuJoCo physics simulator [Todorov et al., 2012]. We summarize our experimental setting as follows. For all tasks, neural networks with 2 hidden layers, 100 hidden units per layer, and tanh activations were used for all policy and discriminator networks. Before training, the expert's trajectories were collected from the expert policy released by the authors of the original GAIL1, but our code was built on the GAIL implementation in OpenAI Baselines [Dhariwal et al., 2017], which uses TensorFlow [Abadi et al., 2016]. For the policy, a Gaussian policy was used with both mean and variance dependent on the observation. 
For the discriminator, the number of particles K was chosen to be 5. All discriminator parameters φ_1, ..., φ_K were initialized independently and randomly. For training, we used an uninformative prior and SVGD along with the Adam optimizer [Kingma and Ba, 2014], whereas Adagrad was used in the SVGD paper [Liu and Wang, 2016]. Our SVGD was implemented using the code released by the authors2, with the radial basis function (RBF) kernel (squared-exponential kernel) k(·,·) and the median heuristic for choosing the bandwidth parameter. In addition, 5 inner loops were used for updating the discriminator parameters, corresponding to lines 6 to 11 in Algorithm 1.\n\nFirst, we compare BGAIL to two different settings for GAIL. The first setting is the same as in the authors' code, where the variance of the Gaussian policy is a learnable constant parameter and a single discriminator update is performed in each iteration. Also, the state-action pairs of the expert demonstration were subsampled from complete trajectories. In the second setting, we made changes to the original setting to improve sample efficiency by (1) using a state-dependent variance, (2) performing 5 discriminator updates per iteration, and (3) using the whole trajectories without sub-sampling. In the remainder of this paper, these two settings shall be referred to as vanilla GAIL and tuned GAIL, respectively. In all settings of our experiments, the maximum number of expert trajectories was chosen as in Ho and Ermon [2016], i.e., 240 for Humanoid and 25 for all other tasks, and 50000 state-action pairs were used for each iteration in the first experiment. The numbers of training iterations were also chosen to be the same as in the GAIL paper. The imitation performances of vanilla GAIL, tuned GAIL and our algorithm are summarized in Table 1. Note that the evaluation in Table 1 was done in exactly the same setting as in the original GAIL paper. 
In that paper, the imitation learner was evaluated over 50 independent trajectories using a single trained policy, and the mean and the standard deviation over those 50 trajectories were given. Similarly, we evaluated each of the 5 trained policies over 50 independent trajectories, and we report the mean and the standard deviation over the 50 trajectories of the 3rd-best policy in terms of the mean score, for a fair comparison.\n\n1https://github.com/openai/imitation\n2https://github.com/DartML/Stein-Variational-Gradient-Descent\n\nFigure 2: Comparison with GAIL when either 1000 (Hopper-v1, Walker2d-v1, HalfCheetah-v1) or 5000 (Ant-v1, Humanoid-v1) state-action pairs are used for each training iteration. The numbers inside the bracket in the titles indicate (from left to right) the state dimension and the action dimension of the task, respectively. The tasks are ordered in ascending order of the state dimension.\n\nTable 1: Imitation performances for vanilla GAIL, tuned GAIL and BGAIL\nTask | Dataset size | GAIL (released) | GAIL (tuned) | BGAIL\nHopper-v1 | 25 | 3560.85 ± 3.09 | 3595.30 ± 5.89 | 3613.94 ± 10.25\nWalker2d-v1 | 25 | 6832.01 ± 254.64 | 7011.02 ± 25.18 | 7017.46 ± 33.32\nHalfCheetah-v1 | 25 | 4840.07 ± 95.36 | 5022.93 ± 81.46 | 4970.77 ± 363.48\nAnt-v1 | 25 | 4132.90 ± 878.67 | 4759.12 ± 416.15 | 4808.90 ± 78.10\nHumanoid-v1 | 240 | 10361.94 ± 61.28 | 10329.66 ± 59.37 | 10388.34 ± 99.03\n\nAs we can see, tuned GAIL and BGAIL perform slightly better than vanilla GAIL on most of the tasks and substantially better on Ant-v1. 
We attribute this to (1) the greater expressive power of the policy with state-dependent variance, (2) the stabilization of the algorithm from performing multiple discriminator updates per iteration, and (3) the efficient use of the expert's trajectories without the subsampling procedure.

Second, we examined the sample efficiency of our algorithm by reducing the number of state-action pairs used for each training iteration from 50000 to 1000 for Hopper-v1, Walker2d-v1 and HalfCheetah-v1, and to 5000 for the other, higher-dimensional tasks. Note that vanilla GAIL in this experiment still used 50000 state-action pairs, in order to gauge the sample efficiency of the original work, whereas tuned GAIL was trained with either 1000 or 5000 state-action pairs per iteration to compare its sample efficiency with our algorithm. Compared to vanilla GAIL, the performances of both tuned GAIL and BGAIL converge to the optimum (the expert's performance) much faster, as depicted in Figure 2. Note that 5 different policies were trained for both BGAIL and tuned GAIL, whereas a single policy was trained for vanilla GAIL. The shaded regions in Figure 2 indicate the standard deviation of scores over these 5 policies. Also, the performances of tuned GAIL and BGAIL are nearly identical on Hopper-v1, a relatively low-dimensional task.
On the other hand, as the dimension of the tasks increases, BGAIL becomes much more sample-efficient than tuned GAIL.

[Figure 2 panels: Hopper-v1 (11, 3), Walker2d-v1 (17, 6), HalfCheetah-v1 (17, 6), Ant-v1 (111, 8), Humanoid-v1 (376, 17); each plots the averaged score over training iterations for BGAIL, GAIL (tuned) and GAIL.]

6 Discussion

In this work, we analyzed GAIL from a Bayesian perspective and showed that such an approach can lead to highly sample-efficient model-free imitation learning. Our Bayesian approach is related to Bayesian GAN [Saatci and Wilson, 2017], which considered the posterior distributions of both the generator and the discriminator parameters in generative adversarial training. Similarly, in our work, the posterior over the agent-expert discriminator parameters was used to form the predictive density of the cost during training, whereas only a point estimate of the policy parameter was used for simplicity. We believe our algorithm can be straightforwardly extended to multi-policy imitation learning, and its sample efficiency may be further enhanced by also utilizing the posterior of the policy parameter, as in Stein variational policy gradient (SVPG) [Liu et al., 2017]. As for the theoretical analysis, ours slightly differs from that of Bayesian GAN due to the inter-trajectory correlation arising from the MDP formulation in our work. As a result, the objective of the original GAIL is regarded as a surrogate objective in our Bayesian approach, whereas the objective of Bayesian GAN reduces exactly to that of the original GAN under ML point estimation.
In addition, we think our analysis fills the gap between theory and experiments in GAIL: GAIL was theoretically analyzed based on the discounted occupancy measure, which is defined in the infinite-horizon setting, whereas its experiments were only conducted on finite-horizon tasks in the MuJoCo simulator. Finally, while BGAIL works effectively with an uninformative prior in our experiments, a more careful choice of the prior, such as a Gaussian prior with Fisher information covariance as in Abdolmaleki et al. [2018], may also enhance the sample efficiency.

Acknowledgement

This work was supported by the ICT R&D program of MSIT/IITP (No. 2017-0-01778, Development of Explainable Human-level Deep Machine Learning Inference Framework) and the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10063424, Development of Distant Speech Recognition and Multi-task Dialog Processing Technologies for In-door Conversational Robots).

References

Jonathan Ho, Jayesh Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2760–2769, 2016.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1889–1897, 2015.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NIPS) 29, pages 4565–4573, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS) 27, pages 2672–2680, 2014.

Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems (NIPS) 30, pages 5326–5335, 2017.

Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems (NIPS) 30, pages 3815–3825, 2017.

Kee-Eung Kim and Hyun Soo Park. Imitation learning via kernel mean embedding. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2018.

Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1718–1727, 2015.

Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.

J Andrew Bagnell. An invitation to imitation. Technical report, Carnegie Mellon University, 2015.

Stuart Russell. Learning agents for uncertain environments. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), pages 101–103, 1998.

Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 663–670, 2000.

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.

Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 2586–2591, 2007.

Jaedeug Choi and Kee-Eung Kim.
MAP inference for Bayesian inverse reinforcement learning. In Advances in Neural Information Processing Systems (NIPS) 24, pages 1989–1997, 2011.

Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 729–736, 2006.

Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1032–1039, 2008.

Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI), pages 295–302, 2007.

Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 1049–1056, 2009.

Gerhard Neumann. Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 817–824, 2011.

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimization. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.

Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm.
In Advances in Neural Information Processing Systems (NIPS) 29, pages 2378–2386, 2016.

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 276–284, 2016.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Proceedings of the 25th IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033, 2012.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Yunus Saatci and Andrew G Wilson. Bayesian GAN. In Advances in Neural Information Processing Systems (NIPS) 30, pages 3622–3631, 2017.

Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.