{"title": "State Aware Imitation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2911, "page_last": 2920, "abstract": "Imitation learning is the study of learning how to act given a set of demonstrations provided by a human expert. It is intuitively apparent that learning to take optimal actions is a simpler undertaking in situations that are similar to the ones shown by the teacher. However, imitation learning approaches do not tend to use this insight directly. In this paper, we introduce State Aware Imitation Learning (SAIL), an imitation learning algorithm that allows an agent to learn how to remain in states where it can confidently take the correct action and how to recover if it is led astray. Key to this algorithm is a gradient learned using a temporal difference update rule which leads the agent to prefer states similar to the demonstrated states. We show that estimating a linear approximation of this gradient yields theoretical guarantees similar to those of online temporal difference learning approaches and empirically show that SAIL can effectively be used for imitation learning in continuous domains with non-linear function approximators used for both the policy representation and the gradient estimate.", "full_text": "State Aware Imitation Learning\n\nYannick Schroecker\nCollege of Computing\nGeorgia Institute of Technology\nyannickschroecker@gatech.edu\n\nCharles Isbell\nCollege of Computing\nGeorgia Institute of Technology\nisbell@cc.gatech.edu\n\nAbstract\n\nImitation learning is the study of learning how to act given a set of demonstrations provided by a human expert. It is intuitively apparent that learning to take optimal actions is a simpler undertaking in situations that are similar to the ones shown by the teacher. However, imitation learning approaches do not tend to use this insight directly. 
In this paper, we introduce State Aware Imitation Learning (SAIL), an imitation learning algorithm that allows an agent to learn how to remain in states where it can confidently take the correct action and how to recover if it is led astray. Key to this algorithm is a gradient learned using a temporal difference update rule which leads the agent to prefer states similar to the demonstrated states. We show that estimating a linear approximation of this gradient yields theoretical guarantees similar to those of online temporal difference learning approaches and empirically show that SAIL can effectively be used for imitation learning in continuous domains with non-linear function approximators used for both the policy representation and the gradient estimate.\n\n1 Introduction\n\nOne of the foremost challenges in the field of Artificial Intelligence is to program or train an agent to act intelligently without perfect information and in arbitrary environments. Many avenues have been explored to derive such agents, but one of the most successful and practical approaches has been to learn how to imitate demonstrations provided by a human teacher. Such imitation learning approaches provide a natural way for a human expert to program agents and are often combined with other approaches such as reinforcement learning to narrow the search space and to help find a near-optimal solution. 
Success stories are numerous in the field of robotics [3], where imitation learning has long been a subject of research, but can also be found in software domains, with recent success stories including AlphaGo [23], which learns to play the game of Go from a database of expert games before improving further, and the benchmark domain of Atari games, where imitation learning combined with reinforcement learning has been shown to significantly improve performance over pure reinforcement learning approaches [9].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFormally, we define the problem domain as a Markov decision process, i.e. by its states, actions and unknown Markovian transition probabilities p(s′|s, a) of taking action a in state s leading to state s′. Imitation learning aims to find a policy π(a|s) that dictates the action an agent should take in any state by learning from a set of demonstrated states SD and the corresponding demonstrated actions AD. Likely the most straightforward approach to imitation learning is to employ a supervised learning algorithm such as a neural network in order to derive a policy, treating the demonstrated states and actions as training inputs and outputs respectively. However, while this can work well in practice and has a long history of successes starting with, among other examples, early ventures into autonomous driving [18], it also violates a key assumption of statistical supervised learning by having past predictions affect the distribution of inputs seen in the future. It has been shown that agents trained this way have a tendency to take actions that lead them to states that are dissimilar from any encountered during training and in which the agent is less likely to have an accurate model of how to act [18, 19]. 
Deviations from the demonstrations based on limitations of the learning model\nor randomness in the domain are therefore ampli\ufb01ed as time progresses. Several approaches exist\nthat are capable of addressing this problem. Interactive imitation learning methods (e.g. [5, 19, 20])\naddress this problem directly but require continuing queries to the human teacher which is often not\npractical. Inverse Reinforcement Learning (IRL) approaches attempt to learn the objective function\nthat the demonstrations are optimizing and show better generalization capabilities. However, IRL\napproaches often require a model of the domain, can be limited by the representation of the reward\nfunction and are learning a policy indirectly. A consequence of the latter is that small changes to the\nlearned objective function can lead to large changes in the learned policy.\nIn this paper we introduce State Aware Imitation Learning (SAIL). SAIL aims to address the\naforementioned problem by explicitly learning to reproduce demonstrated trajectories based on their\nstates as well as their actions. Intuitively, if an agent trained with SAIL \ufb01nds itself in a state similar\nto a demonstrated state it will prefer actions that are similar to the demonstrated action but it will also\nprefer to remain near demonstrated states where the trained policy is more likely to be accurate. An\nagent trained with SAIL will thus learn how to recover if it deviates from the demonstrated trajectories.\nWe achieve this in a principled way by \ufb01nding the maximum-a-posteriori (MAP) estimate of the\ncomplete trajectory. Thus, our objective is to \ufb01nd a policy which we de\ufb01ne to be a parametric\ndistribution \u03c0\u03b8(a|s) using parameters \u03b8. 
Natural choices would be linear functions or neural networks.\nThe MAP problem is then given by\n\nargmax\u03b8p(\u03b8|SD, AD) = argmax\u03b8 log p(AD|SD, \u03b8) + log p(SD|\u03b8) + log p(\u03b8).\n\n(1)\n\nNote that this equation differs from the naive supervised approach in which the second term\nlog p(SD|\u03b8) is assumed to be independent from the current policy and is thus irrelevant to the\noptimization problem. Maximizing this term leads to the agent actively trying to reproduce states\nthat are similar to the ones in SD. It seems natural that additional information about the domain\nis necessary in order to learn how to reach these states. In this work, we obtain this information\nusing unsupervised interactions with the environment. We would like to stress that our approach\ndoes not require further input from the human teacher, any additional measure of optimality, or any\nmodel of the environment. A key component of our algorithm is based on the work of Morimura\net al.[15] who estimate a gradient of the distribution of states observed when following the current\npolicy using a least squares temporal difference learning approach and use their results to derive an\nalternative policy gradient algorithm. We discuss their approach in detail in section 3.1 and extend\nthe idea to an online temporal difference learning approach in section 3.2. This adaptation gives us\ngreater \ufb02exibility for our choice of function approximator and also provides a natural way to deal\nwith an additional constraint to the optimization problem which we will introduce below. In section\n3.3, we describe the full SAIL algorithm in detail and show that the estimated gradient can be used\nto derive a principled and novel imitation learning approach. We then evaluate our approach on a\ntabular domain in section 4.1, comparing our results to a purely supervised approach to imitation\nlearning as well as to sample based inverse reinforcement learning. 
In section 4.2 we show that SAIL can successfully be applied to learn a neural network policy in a continuous bipedal walker domain and achieves significant improvements over supervised imitation learning in this domain.\n\n2 Related work\n\nOne of the main problems SAIL is trying to address is the problem of remaining close to states where the agent can act with high confidence. We identify three different classes of imitation learning algorithms that address this problem either directly or indirectly, under different assumptions and with different limitations. A specialized solution to this problem can be found in the field of robotics. Imitation learning approaches in robotics often do not aim to learn a full policy using general function approximators but instead try to predict a trajectory that the robot should follow. Trajectory representations such as Dynamic Movement Primitives [21] give the robot a sequence of states (or their derivatives) which the robot then follows using a given control law. The role of the control law is to drive the robot towards the demonstrated states, which is also a key objective of SAIL. However, this solution is highly domain specific and a controller needs to be chosen that fits the task and the representation of the state space. For example, it can be more challenging to use image-based state representations. For a survey of imitation learning methods applied to robotics, see [3].\n\nThe second class of algorithms is what we will call iterative imitation learning algorithms. A key characteristic of these algorithms is that the agent actively queries the expert for demonstrations in states that it sees when executing its current policy. One of the first approaches in this class is SEARN [5]. 
When applied to Imitation Learning, SEARN starts by following the expert's action at every step, then iteratively uses the demonstrations collected during the last episode to train a new policy and collects new episodes by taking actions according to a mixture of all previously trained policies and the expert's actions. Over time SEARN learns to follow its mixture of policies and stops relying on the expert to decide which actions to take. Ross et al. [19] first proved that the pure supervised approach to imitation learning can lead to the error rate growing over time. To alleviate this issue they introduced a similar iterative algorithm called SMILe and proved that its error rate increases near linearly with respect to the time horizon. Building on this, Ross et al. introduced DAGGER [20]. DAGGER provides similar theoretical guarantees and empirically outperforms SMILe by augmenting a single training set during each iteration based on queries to the expert on the states seen during execution. DAGGER does not require previous policies to be stored in order to calculate a mixture. Note that while these algorithms are guaranteed to address the issue of straying too far from demonstrations, they approach the problem from a different direction. Instead of preferring states on which the agent has demonstrations, these algorithms collect more demonstrations in the states the agent actually sees during execution. This can be effective but requires additional interaction with the human teacher, which is often not cheaply available in practice.\n\nAs mentioned above, our approach also shares significant similarities with Inverse Reinforcement Learning (IRL) approaches [17]. IRL methods aim to derive a reward function for which the provided demonstrations are optimal. 
This reward function can then be used to compute a complete policy. Note that the IRL problem is known to be ill-posed, as a set of demonstrations can have an infinite number of corresponding reward functions. Successful approaches such as Maximum Entropy IRL (MaxEntIRL) [27] thus attempt to disambiguate between possible reward functions by reasoning explicitly about the distribution of both states and actions. In fact, Choi and Kim [4] argue that many existing IRL methods can be rewritten as finding the MAP estimate for the reward function given the provided demonstrations using different probabilistic models. This provides a direct link to our work, which maximizes the same objective but with respect to the policy as opposed to the reward function. A significant downside of many IRL approaches is that they require a model describing the dynamics of the world. However, sample based approaches exist. Boularias et al. [1] formulate an objective function similar to MaxEntIRL but find the optimal solution based on samples. Relative Entropy IRL (RelEntIRL) aims to find a reward function corresponding to a distribution over trajectories that matches the observed features while remaining within a relative entropy bound to the uniform distribution. While RelEntIRL can be effective, it is limited to linear reward functions. Few sample based methods exist that are able to learn non-linear reward functions. Recently, Finn et al. proposed Guided Cost Learning [6], which optimizes an objective based on MaxEntIRL using importance sampling and iterative refinement of the sample policy. Refinement is based on optimal control with learned models and is thus best suited for problems in domains in which such methods have been shown to work well, e.g. robotic manipulation tasks. A different direction for sample based IRL has been proposed by Klein et al., 
who treat the scores of a score-based classifier trained using the provided demonstrations as a value function, i.e. the long-term expected reward, and use these values to derive a reward function. Structured Classification for IRL (SCIRL) [13] uses estimated feature expectations and linearity of the value function to derive the parameters of a linear reward function, while the more recent Cascaded Supervised IRL (CSI) [14] derives the reward function by training a Support Vector Machine based on the observed temporal differences. While non-linear classifiers could be used, the method is dependent on the interpretability of the score as a value function. Recently, Ho et al. [11] introduced an approach that aims to find a policy that implicitly maximizes a linear reward function but without the need to explicitly represent such a reward function. Generative Adversarial Imitation Learning [10] uses a method similar to Generative Adversarial Networks [7] to extend this approach to nonlinear reward functions. The resulting algorithm trains a discriminator to distinguish between demonstrated and sampled trajectories and uses the probability given by the discriminator as a reward to train a policy using reinforcement learning. The maximum likelihood approach presented here can be seen as an approximation of minimizing the KL divergence between the demonstrated states and actions and the reproduction by the learned policy. This can also be achieved by using the ratio of state-action probabilities pD(a, s) / (dπθ(s)πθ(a|s)) as a reward, which is a straightforward transformation of the output of the optimal discriminator [7]. Note however that this equality only holds assuming an infinite number of demonstrations. Furthermore note that unlike the gradient network introduced in this paper, the discriminator needs to learn about the distribution of the expert's demonstrations.\n\nFinally, we would like to point out the similarities our work shares with meta learning techniques that learn the gradients (e.g. [12]) or determine the weight updates (e.g. [22], [8]) for a neural network. Similar to these meta learning approaches, we propose to estimate the gradient w.r.t. the policy. While a complete review of this work is beyond the scope of this paper, we believe that many of the techniques developed to address challenges in this field can be applicable to our work as well.\n\n3 Approach\n\nSAIL is a gradient ascent based algorithm for finding the true MAP estimate of the policy. A significant part of estimating the gradient ∇θ log p(θ|SD, AD) is estimating the gradient of the (stationary) state distribution induced by following the current policy. We write the stationary state distribution as dπθ(s), assume that the Markov chain is ergodic (i.e. the distribution exists) and review the work by Morimura et al. [15] on estimating its gradient ∇θ log dπθ(s) in section 3.1. We outline our own online adaptation to retrieve this estimate in section 3.2 and use it in order to derive the full SAIL gradient ∇θ log p(θ|SD, AD) in section 3.3.\n\n3.1 A temporal difference approach to estimating ∇θ log dπ(s)\n\nWe first review the work by Morimura et al. [15], who first discovered a relationship between the gradient ∇θ log dπθ(s) and value functions as used in the field of reinforcement learning. Morimura et al. 
showed that the gradient can be written recursively and decomposed into an infinite sum so that a corresponding temporal difference loss can be derived.\n\nBy definition, the gradient of the stationary state distribution in a state s′ can be written in terms of prior states s and actions a:\n\n∇θ dπθ(s′) = ∇θ ∫ dπθ(s) πθ(a|s) p(s′|s, a) ds, a.  (2)\n\nUsing ∇θ(dπθ(s)πθ(a|s)p(s′|s, a)) = p(s, a, s′)(∇θ log dπθ(s) + ∇θ log πθ(a|s)) and dividing by dπθ(s′) on both sides, we obtain\n\n0 = ∫ q(s, a|s′) (∇θ log dπθ(s) + ∇θ log πθ(a|s) − ∇θ log dπθ(s′)) ds, a,  (3)\n\nwhere q denotes the reverse transition probabilities. This can be seen as an expected temporal difference error over the previous state and action, where the temporal difference error is defined as\n\nδ(s, a, s′) := ∇θ log dπθ(s) + ∇θ log πθ(a|s) − ∇θ log dπθ(s′).  (4)\n\nIn the original work, Morimura et al. derive a least squares estimator for ∇θ log dπθ(s′) based on minimizing the expected squared temporal difference error as well as a penalty to enforce the constraint E[∇θ log dπθ(s)] = 0, ensuring that dπθ remains a proper probability distribution, and apply it to policy gradient reinforcement learning. 
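Since the identity in Equation 3 underpins the rest of the derivation, it is worth sanity-checking numerically. The sketch below does so on a toy two-state, two-action MDP with a tabular softmax policy, computing ∇θ log dπθ by finite differences; the transition probabilities and all names here are illustrative, not from the paper.

```python
import numpy as np

# Toy MDP: p(s'|s, a) for 2 states and 2 actions (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
theta = np.array([[0.3, -0.2], [0.1, 0.4]])  # one logit per (s, a)

def policy(th):
    e = np.exp(th)
    return e / e.sum(axis=1, keepdims=True)

def stationary(th):
    # Stationary distribution of the induced chain T[s, s'] = sum_a pi(a|s) p(s'|s, a).
    T = np.einsum('sa,sat->st', policy(th), P)
    evals, evecs = np.linalg.eig(T.T)
    d = np.real(evecs[:, np.argmax(np.real(evals))])
    return d / d.sum()

def grad_log_d(th, eps=1e-6):
    # Finite-difference gradient of log d(s) w.r.t. every policy parameter.
    g = np.zeros(th.shape + (2,))
    for idx in np.ndindex(*th.shape):
        tp, tm = th.copy(), th.copy()
        tp[idx] += eps
        tm[idx] -= eps
        g[idx] = (np.log(stationary(tp)) - np.log(stationary(tm))) / (2 * eps)
    return g  # g[i, j, s] = d log d(s) / d theta[i, j]

def grad_log_pi(th, s, a):
    # For a per-state softmax: d log pi(a|s) / d theta[s, a'] = 1{a = a'} - pi(a'|s).
    g = np.zeros_like(th)
    g[s] = -policy(th)[s]
    g[s, a] += 1.0
    return g

def expected_td_error(th, sp):
    # E_q[ grad log d(s) + grad log pi(a|s) - grad log d(s') ] for a fixed s' (Equation 3).
    d, pi, gld = stationary(th), policy(th), grad_log_d(th)
    total = np.zeros_like(th)
    for s in range(2):
        for a in range(2):
            q = d[s] * pi[s, a] * P[s, a, sp] / d[sp]  # reverse probability q(s, a|s')
            total += q * (gld[:, :, s] + grad_log_pi(th, s, a) - gld[:, :, sp])
    return total
```

Evaluating `expected_td_error` at any s′ yields a matrix of zeros up to finite-difference error, confirming that the expected temporal difference error vanishes under the reverse transition probabilities.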
In the following sections we formulate an online update rule to estimate the gradient, argue convergence in the linear case, and use the estimated gradient to derive a novel imitation learning algorithm.\n\n3.2 Online temporal difference learning for ∇θ log dπ(s)\n\nIn this subsection we define the online temporal difference update rule for SAIL and show that its convergence properties are similar to those of average reward temporal difference learning [25]. Online temporal difference learning algorithms are computationally more efficient than their least squares batch counterparts and are essential when using high-dimensional non-linear function approximations to represent the gradient. We furthermore show that online methods give us a natural way to enforce the constraint E[∇θ log dπθ(s)] = 0. We aim to approximate ∇θ log dπ(s) up to an unknown constant vector c and thus define our target as f*(s) := ∇θ log dπ(s) + c. We use a temporal difference update to learn a parametric approximation fω(s) ≈ f*(s). 
The update rule based on taking action a in state s and transitioning to state s′ is given by\n\nωk+1 = ωk + α ∇ω fω(s′) (fω(s) + ∇θ log π(a|s) − fω(s′)).  (5)\n\nAlgorithm 1 State Aware Imitation Learning\n1: function SAIL(ω, αθ, αω, SD, AD)\n2:   θ ← SupervisedTraining(SD, AD)\n3:   for k ← 0..#Iterations do\n4:     SE, AE ← CollectUnsupervisedEpisode(πθ)\n5:     ω ← ω + αω (1/|SE|) Σ_{s,a,s′ ∈ transitions(SE, AE)} (fω(s) + ∇θ log πθ(a|s) − fω(s′)) ∇ω fω(s′)\n6:     μ ← (1/|SE|) Σ_{s ∈ SE} fω(s)\n7:     θ ← θ + αθ ((1/|SD|) Σ_{s,a ∈ pairs(SD, AD)} (∇θ log πθ(a|s) + (fω(s) − μ)) + ∇θ log p(θ))\n8:   return θ\n\nNote that if fω converges to an approximation of f* then, due to E[∇θ log dπθ(s)] = 0, we have ∇θ log dπ(s) ≈ fω(s) − E[fω(s)], where the expectation can be estimated based on samples. While convergence of temporal difference methods is not guaranteed in the general case, some guarantees can be made in the case of linear function approximation fω(s) := ωᵀφ(s) [25]. We note that E[∇θ log π(a|s)] = 0 and thus for each dimension of θ the update can be seen as a variation of average reward temporal difference learning where the scalar reward is replaced by the gradient vector ∇θ log π(a|s) and fω is bootstrapped based on the previous state as opposed to the next. 
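For a linear approximator fω(s) = ωᵀφ(s), the update in Equation 5 amounts to a few lines of code. The sketch below (names illustrative, not from the paper) stores ω as a matrix with one column per policy-parameter dimension, so fω(s) returns one estimate per dimension of θ:

```python
import numpy as np

def td_update(omega, phi_s, phi_sp, grad_log_pi, alpha):
    """One online update of f_omega(s) = omega^T phi(s) following Equation 5.

    omega: (n_features, n_policy_params) weight matrix.
    phi_s, phi_sp: feature vectors of state s and successor state s'.
    grad_log_pi: grad_theta log pi(a|s) for the observed transition.
    """
    f_s, f_sp = phi_s @ omega, phi_sp @ omega
    # grad_omega f_omega(s') = phi(s'); note the target bootstraps on the
    # *previous* state f_omega(s), not the next, mirroring Equation 5.
    return omega + alpha * np.outer(phi_sp, f_s + grad_log_pi - f_sp)
```

After convergence, the centered estimate fω(s) − mean over sampled states of fω approximates ∇θ log dπ(s), since f* is only identified up to the constant c.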
While the roles of the current and next state in this update rule are reversed, which might suggest that updates should be done in reverse order, the convergence results by Tsitsiklis and Van Roy [25] depend only on the limiting distribution induced by following the sample policy on the domain, which remains unchanged regardless of the ordering of updates [15]. It is therefore intuitively apparent that the convergence results still hold and that fω converges to an approximation of f*. We formalize this notion in Appendix A.\n\nIntroducing a discount factor. So far we have related the update rule to average reward temporal difference learning, as this was a natural consequence of the assumptions we were making. However, in practice we found that a formulation analogous to discounted reward temporal difference learning may work better. While this can be seen as a biased but lower variance approximation to the average reward problem [26], a perhaps more satisfying justification can be obtained by reexamining the simplifying assumption that the sampled states are distributed according to the stationary state distribution dπθ. An alternative simplifying assumption is that the previous states are distributed according to a mixture of the starting state distribution d0(s−1) and the stationary state distribution, p(s−1) = (1 − γ)d0(s−1) + γdπ(s−1) for γ ∈ [0, 1]. 
In this case, equation 3 has to be altered and we have\n\n0 = ∫ p(s, a|s′) (γ∇θ log dπθ(s) + (1 − γ)∇θ log d0(s) + ∇θ log πθ(a|s) − ∇θ log dπθ(s′)) ds, a.\n\nNote that ∇θ log d0(s) = 0 and thus we recover the discounted update rule\n\nωk+1 = ωk + α ∇ω f(s′) (γf(s) + ∇θ log π(a|s) − f(s′)).  (6)\n\n3.3 State aware imitation learning\n\nBased on this estimate of ∇θ log dπθ we can now derive the full State Aware Imitation Learning algorithm. SAIL aims to find the full MAP estimate as defined in Equation 1 via gradient ascent. The gradient decomposes into three parts:\n\n∇θ log p(θ|SD, AD) = ∇θ log p(AD|SD, θ) + ∇θ log p(SD|θ) + ∇θ log p(θ).  (7)\n\nThe first and last term make up the gradient used for gradient descent based supervised learning and can usually be computed analytically. To estimate ∇θ log p(SD|θ), we disregard information about the order of states and make the simplifying assumption that all states are drawn from the stationary distribution. Under this assumption, we can estimate ∇θ log p(SD|θ) = Σ_{s ∈ SD} ∇θ log dπθ(s) based on unsupervised transition samples using the approach described in section 3.2.\n\nFigure 1: a) The sum of probabilities of taking the optimal action doubles over the baseline. b) The reward (±2σ) obtained after 5000 iterations of SAIL is much closer to the optimal policy.\n\nThe full SAIL algorithm thus maintains a current policy as well as an estimate of ∇θ log p(SD|θ) and iteratively\n\n1. Collects unsupervised state and action samples SE and AE from the current policy,\n2. 
Updates the gradient estimate using Equation 5 and estimates E[fω(s)] using the sample mean of the unsupervised states, μ := (1/|SE|) Σ_{s ∈ SE} fω(s), or an exponentially moving sample mean,\n\n3. Updates the current policy using the estimated gradient fω(s) − μ as well as the analytical gradients for ∇θ log p(θ) and ∇θ log p(AD|SD, θ). The SAIL gradient is given by\n\n∇θ log p(θ|SD, AD) = Σ_{s,a ∈ pairs(SD, AD)} (fω(s) − μ + ∇θ log p(a|s, θ)) + ∇θ log p(θ).\n\nThe full algorithm is also outlined in Algorithm 1.\n\n4 Evaluation\n\nWe evaluate our approach on two domains. The first domain is a harder variation of the tabular racetrack domain first used in [1] with 7425 states and 5 actions. In section 4.1.1, we use this domain to show that SAIL can improve on the policy learned by a supervised baseline and learn to act in states the policy representation does not generalize to. In section 4.1.2 we evaluate the sample efficiency of an off-policy variant of SAIL. The tabular representation allows us to compare the results to RelEntIRL [1] as a baseline without restrictions arising from the chosen representation of the reward function. The second domain we use is a noisy variation of the bipedal walker domain found in OpenAI gym [2]. We use this domain to evaluate the performance of SAIL on tasks with continuous state and action spaces using neural networks to represent the policy as well as the gradient estimate and compare it against the supervised baseline using the same representations.\n\n4.1 Racetrack domain\n\nWe first evaluate SAIL on the racetrack domain. This domain is a more difficult variation of the domain used by Boularias et al. [1] and consists of a grid with 33 by 9 possible positions. 
Each position has 25 states associated with it, encoding the velocity (−2, −1, 0, +1, +2) in the x and y directions, which dictates the movement of the agent at each time step. The domain has 5 possible actions, allowing the agent to increase or reduce its velocity in either direction or to keep its current velocity. Randomness is introduced to the domain using the notion of a failure probability, which is set to be 0.8 if the absolute velocity in either direction is 2 and 0.1 otherwise. The goal of the agent is to complete a lap around the track without going off-track, which we define to be the area surrounding the track (x = 0, y = 0, x > 31 or y > 6) as well as the inner rectangle (2 < x < 31 and 2 < y < 6). Note that unlike in [1], the agent has the ability to go off-track as opposed to being constrained by a wall and has to learn to move back on track if random chance makes it stray from it. Furthermore, the probability of going off-track is higher as the track is narrower in this variation of the domain. This makes the domain more challenging to learn using imitation learning alone.\n\nFigure 2: Reward obtained using off-policy training. SAIL learns a near-optimal policy using only 1000 sample episodes. The scale is logarithmic on the x-axis after 5000 iterations (gray area).\n\nFor all our experiments, we use a set of 100 episodes collected from an oracle. To measure performance, we assign a score of −0.1 to being off-track, a score of 5 for completing the lap and −5 for crossing the finish line the wrong way. Note that this score is not used during training but is purely used to measure performance in this evaluation. 
We also use this score as a reward to derive an oracle.\n\n4.1.1 On-policy results\n\nFor our first experiment, we compare SAIL against a supervised baseline. As the oracle is deterministic and the domain is tabular, this means taking the optimal action in states encountered as part of one of the demonstrated episodes and uniformly random actions otherwise. For the evaluation of SAIL, we initialize the policy to the supervised baseline and use the algorithm to improve the policy over 5000 iterations. At each iteration, 20 unsupervised sample episodes are collected to estimate the SAIL gradient, using plain stochastic gradient descent with a learning rate of 0.1 for the temporal difference update and RMSprop with a learning rate of 0.01 for updating the policy. Figure 1b shows that SAIL stably converges to a policy that significantly outperforms the supervised baseline. While we do not expect SAIL to act optimally in previously unseen states but instead to exhibit recovery behavior, it is interesting to measure on how many states the learned policy agrees with the optimal policy, using a soft count for each state based on the probability of the optimal action. Figure 1a shows that the number of states in which the agent takes the optimal action roughly doubles its advantage over random chance and that the learned behavior is significantly closer to the optimal policy on states seen during execution.\n\n4.1.2 Off-policy sample efficiency\n\nFor our second experiment, we evaluate the sample efficiency of SAIL by reusing previous sample episodes. As a temporal difference method, SAIL can be adapted using any off-policy temporal difference learning technique. In this work we elected to use truncated importance weights [16] with emphatic decay [24]. 
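As a rough sketch of how truncated importance weights can enter the update (the full scheme additionally maintains an emphatic-decay trace [24], omitted here; all names are illustrative and this is not the paper's exact implementation):

```python
import numpy as np

def truncated_weight(pi_target, pi_behavior, c_max=1.0):
    """Truncated importance weight rho = min(c_max, pi(a|s) / mu(a|s)) [16].

    Truncation trades variance for bias when reusing episodes collected
    under an older behavior policy mu.
    """
    return min(c_max, pi_target / pi_behavior)

def off_policy_td_update(omega, phi_s, phi_sp, grad_log_pi, alpha, rho):
    # Weight the on-policy update of Equation 5 by rho; an emphatic-decay
    # follow-on trace [24] would further modulate alpha (omitted here).
    f_s, f_sp = phi_s @ omega, phi_sp @ omega
    return omega + alpha * rho * np.outer(phi_sp, f_s + grad_log_pi - f_sp)
```

With `rho = 0` a transition is ignored entirely, and with `rho = 1` the update reduces to the on-policy rule, so old episodes contribute in proportion to how likely the current policy is to have generated them.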
We evaluate the performance of SAIL by collecting one new unsupervised sample episode in each iteration, reusing the samples collected in the past 19 episodes, and compare the results against our implementation of Relative Entropy IRL [1]. We found that the importance sampling approach used by RelEntIRL makes interactions obtained by a pre-trained policy ineffective when using a tabular policy¹ and thus collect samples by taking actions uniformly at random. For comparability, we also evaluated SAIL using a fixed set of samples obtained by following a uniform policy. In this case, we found that the temporal-difference learning can become unstable in later iterations and thus decay the learning rate by a factor of 0.995 after each iteration.

We vary the number of unsupervised sample episodes and show the score achieved by the trained policy in Figure 2. The score for RelEntIRL is measured by computing the optimal policy given the learned reward function. Note that this requires a model that is not normally available. We found that in this domain, depending on the obtained samples, RelEntIRL has a tendency to learn shortcuts through the off-track area. Since small changes in the reward function can lead to large changes in the final policy, we average the results for RelEntIRL over 20 trials and bound the total score from below by the score achieved using the supervised baseline. We can see that SAIL is able to learn a near optimal policy using a low number of sample episodes. We can furthermore see that SAIL using uniform samples is able to learn a good policy and outperform the RelEntIRL baseline reliably.

¹The original work by Boularias et al. shows that a pre-trained sample policy can be used effectively if a trajectory-based representation is used.

[Figure 2: score of the trained policies for varying numbers of unsupervised sample episodes. Legend: Optimal policy, Supervised baseline, Uniform off-policy SAIL, Off-policy SAIL, RelEntIRL.]

Figure 3: a) The bipedal walker has to traverse the plain, controlling the 4 noisy joint motors in its legs. b) Failure rate of SAIL compared to the supervised baseline, measured over 1000 traversals. After 15000 iterations, SAIL traverses the plain far more reliably than the baseline.

4.2 Noisy bipedal walker

For our second experiment, we evaluate the performance of SAIL on a noisy variant of a two-dimensional bipedal walker domain (see Figure 3a). The goal of this domain is to learn a policy that enables the simulated robot to traverse a plain without falling. The state space in this domain consists of 4 dimensions for velocity in x and y directions, angle of the hull, and angular velocity, 8 dimensions for the position and velocity of the 4 joints in the legs, 2 dimensions that denote whether each leg has contact with the ground, and 10 dimensions corresponding to lidar readings, telling the robot about its surroundings. The action space is 4-dimensional and consists of the torque that is to be applied to each of the 4 joints. To make the domain more challenging, we also apply additional noise to each of the torques. The noise is sampled from a normal distribution with a standard deviation of 0.1 and is kept constant for five consecutive frames at a time. The noise thus has the ability to destabilize the walker. Our goal in this experiment is to learn a continuous policy from demonstrations, mapping the state to torques and enabling the robot to traverse the plain reliably. As a demonstration, we provide a single successful crossing of the plain. The demonstration has been collected from an oracle that has been trained on the bipedal walker domain without additional noise and is therefore not optimal and prone to failure.
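The frame-held torque noise described for this domain can be sketched as follows. This is our illustration of the scheme, not the simulator's code; the function name and array shapes are assumptions.

```python
import numpy as np

def noisy_torques(torques, sigma=0.1, hold=5, rng=None):
    """Perturb a sequence of torque commands with Gaussian noise that is
    resampled only every `hold` frames, so the same perturbation persists
    over consecutive frames and can destabilize the walker.

    torques: (n_frames, n_joints) array of clean torque commands.
    """
    rng = np.random.default_rng() if rng is None else rng
    torques = np.asarray(torques, dtype=float)
    out = torques.copy()
    for start in range(0, len(torques), hold):
        # One noise draw per joint, shared by the whole block of frames.
        out[start:start + hold] += rng.normal(0.0, sigma, size=torques.shape[1])
    return out

# Ten frames of zero torque on 4 joints: the added noise is identical within
# each 5-frame block but differs between blocks.
perturbed = noisy_torques(np.zeros((10, 4)), rng=np.random.default_rng(1))
```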
Our main metric for success on this domain is the failure rate, i.e. the fraction of times that the robot is not able to traverse the plain due to falling to the ground. While the reward metric used in [2] is more comprehensive, as it measures speed and control cost, it cannot be expected that a pure imitation learning approach can minimize control cost when trained with an imperfect demonstration that does not achieve this goal itself. Failure rate, on the other hand, can always be minimized by aiming to reproduce a demonstration of a successful traversal as well as possible.

To represent our policy, we use a single shallow neural network with one hidden layer consisting of 100 nodes with tanh activation. We train this policy using a pure supervised approach as a baseline as well as with SAIL and contrast the results. During evaluation and supervised training, the output of the neural network is taken to be the exact torques, whereas SAIL requires a probabilistic policy. Therefore, we add additional Gaussian noise, kept constant for 8 consecutive frames at a time.

To train the network in a purely supervised approach, we use RMSprop over 3000 epochs with a batch size of 128 frames and a learning rate of 10⁻⁵. After the training process has converged, we found that the neural network trained with pure supervised learning fails 1650 times out of 5000 runs.

To train the policy with SAIL, we first initialize it with the aforementioned supervised approach. This is then followed by training using the combined gradient estimated by SAIL until the failure rate stops decreasing. To represent the gradient of the logarithmic stationary distribution, we use a fully connected neural network with two hidden layers of 80 nodes each, using ReLU activations. Each episode is split into mini-batches of 16 frames.
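As a concrete picture of the policy architecture described above, the following NumPy sketch runs the deterministic forward pass from the 24-dimensional walker state to the 4 torques. The weights here are random placeholders for illustration; during SAIL training, Gaussian exploration noise would be added to these outputs.

```python
import numpy as np

def policy_forward(state, params):
    """Deterministic forward pass of the shallow policy network:
    24-dim state -> 100 tanh hidden units -> 4 torque outputs.
    """
    W1, b1, W2, b2 = params
    hidden = np.tanh(state @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
state_dim, hidden_dim, action_dim = 24, 100, 4  # 4 + 8 + 2 + 10 state dims
params = (0.1 * rng.standard_normal((state_dim, hidden_dim)),
          np.zeros(hidden_dim),
          0.1 * rng.standard_normal((hidden_dim, action_dim)),
          np.zeros(action_dim))
torques = policy_forward(rng.standard_normal(state_dim), params)
```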
The ∇θ log dπθ network is trained using RMSprop with a learning rate of 10⁻⁴, whereas the policy network is trained using RMSprop and a learning rate of 10⁻⁶, starting after the first 1000 episodes. As can be seen in Figure 3b, SAIL increases the success rate of 0.67 achieved by the baseline to 0.938 within 15000 iterations.

5 Conclusion

Imitation learning has long been a topic of active research. However, naive supervised learning has a tendency to lead the agent to states in which it cannot act with certainty, and alternative approaches either make additional assumptions or, in the case of IRL methods, address this problem only indirectly. In this work, we proposed a novel imitation learning algorithm that directly addresses this issue and learns a policy without relying on intermediate representations. We showed that the algorithm can generalize well and provides stable learning progress both in domains with a finite number of discrete states and in domains with continuous state and action spaces. We believe that explicit reasoning over states can be helpful even in situations where reproducing the distribution of states will not result in a desirable policy and see this as a promising direction for future research.

Acknowledgements

This work was supported by the Office of Naval Research under grant N000141410003.

References

[1] Abdeslam Boularias, Jens Kober, and Jan Peters. Relative Entropy Inverse Reinforcement Learning. International Conference on Artificial Intelligence and Statistics (AISTATS), 15:1–8, 2011.

[2] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[3] Sonia Chernova and Andrea L. Thomaz. Robot learning from human teachers.
Synthesis Lectures on Artificial Intelligence and Machine Learning, 8(3):1–121, 2014.

[4] Jaedeug Choi and Kee-Eung Kim. MAP Inference for Bayesian Inverse Reinforcement Learning. Neural Information Processing Systems (NIPS), 2011.

[5] Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning Journal (MLJ), 75(3):297–325, 2009.

[6] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. International Conference on Machine Learning (ICML), 2016.

[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[8] David Ha, Andrew Dai, and Quoc V. Le. HyperNetworks. arXiv preprint arXiv:1609.09106, 2016.

[9] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Learning from Demonstrations for Real World Reinforcement Learning. arXiv preprint arXiv:1704.03732, 2017.

[10] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[11] Jonathan Ho, Jayesh Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pages 2760–2769, 2016.

[12] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled Neural Interfaces using Synthetic Gradients. arXiv preprint arXiv:1608.05343, 2016.

[13] Edouard Klein, Matthieu Geist, Bilal Piot, and Olivier Pietquin.
Inverse Reinforcement Learning through Structured Classification. Neural Information Processing Systems (NIPS), 2012.

[14] Edouard Klein, Bilal Piot, Matthieu Geist, and Olivier Pietquin. A cascaded supervised learning approach to inverse reinforcement learning. Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), 2013.

[15] Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Jan Peters, and Kenji Doya. Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. Neural Computation, 22(2):342–376, 2010.

[16] Remi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and Efficient Off-Policy Reinforcement Learning. In Neural Information Processing Systems (NIPS), 2016.

[17] Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.

[18] Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. Neural Information Processing Systems (NIPS), 1989.

[19] Stéphane Ross and J. Andrew Bagnell. Efficient Reductions for Imitation Learning. International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[20] Stéphane Ross, Geoffrey Gordon, and J. Andrew Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

[21] Stefan Schaal. Robot learning from demonstration. Neural Information Processing Systems (NIPS), 1997.

[22] Juergen H. Schmidhuber. A self-referential weight matrix.
International Conference on Artificial Neural Networks, 1993.

[23] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[24] Richard S. Sutton, A. Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research (JMLR), 17:1–29, 2016.

[25] John N. Tsitsiklis and Benjamin Van Roy. Average cost temporal-difference learning. Automatica, 35:1799–1808, 1999.

[26] John N. Tsitsiklis and Benjamin Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2-3):179–191, 2002.

[27] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI Conference on Artificial Intelligence (AAAI), 2008.