{"title": "Nonlinear Inverse Reinforcement Learning with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 27, "abstract": "We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert's policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions.", "full_text": "Nonlinear Inverse Reinforcement Learning with\n\nGaussian Processes\n\nSergey Levine\n\nStanford University\n\nsvlevine@cs.stanford.edu\n\nZoran Popovi\u00b4c\n\nUniversity of Washington\n\nzoran@cs.washington.edu\n\nVladlen Koltun\n\nStanford University\n\nvladlen@cs.stanford.edu\n\nAbstract\n\nWe present a probabilistic algorithm for nonlinear inverse reinforcement learn-\ning. The goal of inverse reinforcement learning is to learn the reward function in a\nMarkov decision process from expert demonstrations. 
While most prior inverse re-\ninforcement learning algorithms represent the reward as a linear combination of a\nset of features, we use Gaussian processes to learn the reward as a nonlinear func-\ntion, while also determining the relevance of each feature to the expert\u2019s policy.\nOur probabilistic algorithm allows complex behaviors to be captured from subop-\ntimal stochastic demonstrations, while automatically balancing the simplicity of\nthe learned reward structure against its consistency with the observed actions.\n\n1\n\nIntroduction\n\nInverse reinforcement learning (IRL) methods learn a reward function in a Markov decision process\n(MDP) from expert demonstrations, allowing the expert\u2019s policy to be generalized to unobserved\nsituations [7]. Each task is consistent with many reward functions, but not all rewards provide a\ncompact, portable representation of the task, so the central challenge in IRL is to \ufb01nd a reward with\nmeaningful structure [7]. Many prior methods impose structure by describing the reward as a linear\ncombination of hand selected features [1, 12]. In this paper, we extend the Gaussian process model\nto learn highly nonlinear reward functions that still compactly capture the demonstrated behavior.\nGP regression requires input-output pairs [11], and was previously used for value function approx-\nimation [10, 4, 2]. Our Gaussian Process Inverse Reinforcement Learning (GPIRL) algorithm only\nobserves the expert\u2019s actions, not the rewards, so we extend the GP model to account for the stochas-\ntic relationship between actions and underlying rewards. This allows GPIRL to balance the simplic-\nity of the learned reward function against its consistency with the expert\u2019s actions, without assuming\nthe expert to be optimal. The learned GP kernel hyperparameters capture the structure of the reward,\nincluding the relevance of each feature. 
Once learned, the GP can recover the reward for the current state space, and can predict the reward for any unseen state space within the domain of the features.\nPrevious IRL algorithms generally learn the reward as a linear combination of features, either by finding a reward under which the expert\u2019s policy has a higher value than all other policies by a margin [7, 1, 12, 15], or else by maximizing the probability of the reward under a model of near-optimal expert behavior [6, 9, 17, 3]. While several margin-based methods learn nonlinear reward functions through feature construction [13, 14, 5], such methods assume optimal expert behavior. To the best of our knowledge, GPIRL is the first method to combine probabilistic reasoning about stochastic expert behavior with the ability to learn the reward as a nonlinear function of features, allowing it to outperform prior methods on tasks with inherently nonlinear rewards and suboptimal examples.\n\n2 Inverse Reinforcement Learning Preliminaries\n\nA Markov decision process is defined as a tuple M = {S, A, T, \u03b3, r}, where S is the state space, A is the set of actions, T^{sa}_{s'} is the probability of a transition from s \u2208 S to s' \u2208 S under action a \u2208 A, \u03b3 \u2208 [0, 1) is the discount factor, and r is the reward function. The optimal policy \u03c0* maximizes the expected discounted sum of rewards E[\u2211_{t=0}^\u221e \u03b3^t r_{s_t} | \u03c0*]. In inverse reinforcement learning, the algorithm is presented with M \\ r, as well as expert demonstrations, denoted D = {\u03b6_1, ..., \u03b6_N}, where \u03b6_i is a path \u03b6_i = {(s_{i,0}, a_{i,0}), ..., (s_{i,T}, a_{i,T})}. The algorithm is also presented with features of the form f : S \u2192 R that can be used to represent the unknown reward r.\nIRL aims to find a reward function r under which the optimal policy matches the expert\u2019s demonstrations.
To this end, we could assume that the examples D are drawn from the optimal policy \u03c0*. However, real human demonstrations are rarely optimal. One approach to learning from a suboptimal expert is to use a probabilistic model of the expert\u2019s behavior. We employ the maximum entropy IRL (MaxEnt) model [17], which is closely related to linearly-solvable MDPs [3], and has been used extensively to learn from human demonstrations [16, 17]. Under this model, the probability of taking a path \u03b6 is proportional to the exponential of the rewards encountered along that path. This model is convenient for IRL, because its likelihood is differentiable [17], and a complete stochastic policy uniquely determines the reward function [3]. Intuitively, such a stochastic policy is more deterministic when the stakes are high, and more random when all choices have similar value.\nUnder this policy, the probability of an action a in state s can be shown to be proportional to the exponential of the expected total reward after taking the action, denoted P(a|s) \u221d exp(Q^r_{sa}), where Q^r = r + \u03b3 T V^r. The value function V^r is computed with a \u201csoft\u201d version of the familiar Bellman backup operator: V^r_s = log \u2211_a exp(Q^r_{sa}). The probability of a in state s is therefore normalized by exp(V^r_s), giving P(a|s) = exp(Q^r_{sa} \u2212 V^r_s). Detailed derivations of these equations can be found in prior work [16]. The complete log likelihood of the data under r can be written as\n\nlog P(D|r) = \u2211_i \u2211_t log P(a_{i,t}|s_{i,t}) = \u2211_i \u2211_t (Q^r_{s_{i,t} a_{i,t}} \u2212 V^r_{s_{i,t}})    (1)\n\nWhile we can maximize Equation 1 directly to obtain r, such a reward is unlikely to exhibit meaningful structure, and would not be portable to novel state spaces. Prior methods address this problem by representing r as a linear combination of a set of provided features [17].
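As a concrete illustration, the soft value iteration and likelihood described in this section can be sketched in a few lines of NumPy (a minimal sketch on a toy two-state MDP; all function and variable names are ours, not the paper's implementation):

```python
import numpy as np

def soft_value_iteration(T, r, gamma=0.9, iters=500):
    """Soft value iteration for the MaxEnt model.

    T: (S, A, S) array of transition probabilities T[s, a, s'], r: (S,) state
    rewards.  Returns Q, V, and the stochastic policy P(a|s) = exp(Q_sa - V_s),
    where V_s = log sum_a exp(Q_sa).
    """
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        # Q^r = r + gamma * T V^r (expected soft value of the next state)
        Q = r[:, None] + gamma * np.einsum('sap,p->sa', T, V)
        # "soft" Bellman backup instead of a hard max over actions
        V = np.logaddexp.reduce(Q, axis=1)
    policy = np.exp(Q - V[:, None])
    return Q, V, policy

def log_likelihood(paths, policy):
    # Equation 1: sum over demonstrations of log P(a_{i,t} | s_{i,t})
    return sum(np.log(policy[s, a]) for path in paths for (s, a) in path)

# Toy 2-state MDP: action 0 stays in place, action 1 switches states.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
Q, V, policy = soft_value_iteration(T, np.array([0.0, 1.0]))
```

Each row of `policy` sums to one, and actions leading toward higher soft value receive exponentially higher probability, matching the intuition that the policy is more deterministic when the stakes are high.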
However, if r is not linear in the features, such methods are not sufficiently expressive. In the next section, we describe how Gaussian processes can be used to learn r as a general nonlinear function of the features.\n\n3 The Gaussian Process Inverse Reinforcement Learning Algorithm\n\nGPIRL represents the reward as a nonlinear function of feature values. This function is modeled as a Gaussian process, and its structure is determined by its kernel function. The Bayesian GP framework provides a principled method for learning the hyperparameters of this kernel, thereby learning the structure of the unknown reward. Since the reward is not known, we use Equation 1 to specify a distribution over GP outputs, and learn both the output values and the kernel function.\nIn GP regression, we use noisy observations y of the true underlying outputs u. GPIRL directly learns the true outputs u, which represent the rewards associated with feature coordinates Xu. These coordinates may simply be the feature values of all states or, as discussed in Section 5, a subset of all states. The rewards at states that are not included in this subset are inferred by the GP. We also learn the kernel hyperparameters \u03b8 in order to recover the structure of the reward.
The most likely values of u and \u03b8 are found by maximizing their probability under the expert demonstrations D:\n\nP(u, \u03b8|D, Xu) \u221d P(D, u, \u03b8|Xu) = [ \u222b_r P(D|r) P(r|u, \u03b8, Xu) dr ] P(u, \u03b8|Xu)    (2)\n\nThe log of P(D|r) (the IRL term) is given by Equation 1, the GP posterior P(r|u, \u03b8, Xu) is the probability of a reward function under the current values of u and \u03b8, and the GP probability P(u, \u03b8|Xu) is the prior probability of a particular assignment to u and \u03b8. The log of P(u, \u03b8|Xu) is the GP log marginal likelihood, which favors simple kernel functions and values of u that conform to the current kernel matrix [11]:\n\nlog P(u, \u03b8|Xu) = \u2212(1/2) u^T K^{-1}_{u,u} u \u2212 (1/2) log |K_{u,u}| \u2212 (n/2) log 2\u03c0 + log P(\u03b8)    (3)\n\nThe last term log P(\u03b8) is a hyperparameter prior, which is discussed in Section 4. The entries of the covariance matrix K_{u,u} are given by the kernel function. In order to determine the relevance of each feature, we use the automatic relevance detection (ARD) kernel, with hyperparameters \u03b8 = {\u03b2, \u039b}:\n\nk(x_i, x_j) = \u03b2 exp( \u2212(1/2) (x_i \u2212 x_j)^T \u039b (x_i \u2212 x_j) )\n\nThe hyperparameter \u03b2 is the overall variance, and the diagonal matrix \u039b specifies the weight on each feature. When \u039b is learned, less relevant features receive low weights, and more relevant features receive high weights.
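For concreteness, the ARD kernel and the GP log marginal likelihood of Equation 3 can be sketched as follows (our NumPy notation; `Lam` holds the diagonal of \u039b, and the small jitter term is our addition for numerical stability, whereas the paper instead handles near-singular kernels with the regularization of Section 4):

```python
import numpy as np

def ard_kernel(X1, X2, beta, Lam):
    # k(x_i, x_j) = beta * exp(-0.5 * (x_i - x_j)^T Lambda (x_i - x_j)),
    # with Lambda a diagonal matrix of per-feature weights (here a vector).
    d = X1[:, None, :] - X2[None, :, :]
    return beta * np.exp(-0.5 * np.einsum('ijk,k,ijk->ij', d, Lam, d))

def gp_log_marginal(u, Xu, beta, Lam):
    """Equation 3 without the hyperparameter prior log P(theta):
    -0.5 u^T K^{-1} u - 0.5 log|K| - (n/2) log(2 pi)."""
    n = len(u)
    K = ard_kernel(Xu, Xu, beta, Lam) + 1e-8 * np.eye(n)  # small jitter
    _, logdet = np.linalg.slogdet(K)
    return (-0.5 * u @ np.linalg.solve(K, u)
            - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi))

# A feature with zero weight in Lam drops out of the kernel entirely.
X = np.array([[0.0, 0.0], [1.0, 5.0]])
K = ard_kernel(X, X, 2.0, np.array([1.0, 0.0]))  # second feature irrelevant
```

With the second feature weighted zero, the two points differ only through the first feature, illustrating how learned weights determine which features influence the reward.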
States distinguished by highly-weighted features can take on different reward values, while those that have similar values for all highly-weighted features take on similar rewards.\nThe GP posterior P(r|u, \u03b8, Xu) is a Gaussian distribution with mean K^T_{r,u} K^{-1}_{u,u} u and covariance K_{r,r} \u2212 K^T_{r,u} K^{-1}_{u,u} K_{r,u}, where K_{r,u} is the covariance of the rewards at all states with the inducing point values u, located respectively at X_r and X_u [11]. Due to the complexity of P(D|r), the integral in Equation 2 cannot be computed in closed form. Instead, we can consider this problem as analogous to sparse approximation for GP regression [8], where a small set of inducing points u acts as the support for the full set of training points r. In this context, the Gaussian posterior distribution over r is called the training conditional. One approximation is to assume that the training conditional is deterministic \u2013 that is, has variance zero [8]. This approximation is particularly appropriate in our case, because if the learned GP is used to predict a reward for a novel state space, the most likely reward would have the same form as the mean of the training conditional. Under this approximation, the integral disappears, and r is set to K^T_{r,u} K^{-1}_{u,u} u. The resulting log likelihood is simply\n\nlog P(D, u, \u03b8|Xu) = log P(D|r = K^T_{r,u} K^{-1}_{u,u} u) + log P(u, \u03b8|Xu)    (4)\n\nwhere the first term is the IRL log likelihood and the second is the GP log likelihood. Once the likelihood is optimized, the reward r = K^T_{r,u} K^{-1}_{u,u} u can be used to recover the expert\u2019s policy on the entire state space. The GP can also predict the reward function for any novel state space in the domain of the features. The most likely reward for a novel state space is the posterior mean K^T_{*,u} K^{-1}_{u,u} u, where K_{*,u} is the covariance of the new states and the inducing points.
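Under the deterministic training conditional, recovering r (and predicting rewards on a novel state space) reduces to a single linear solve. A minimal sketch, with an illustrative ARD kernel and a small jitter term of our own for numerical stability:

```python
import numpy as np

def reward_from_inducing(Xq, Xu, u, beta, Lam, jitter=1e-8):
    """Mean of the deterministic training conditional: r = K_qu K_uu^{-1} u.

    Xq holds the feature values of the query states: the training states, or
    the states of a novel state space in the domain of the features.
    """
    def k(A, B):
        d = A[:, None, :] - B[None, :, :]
        return beta * np.exp(-0.5 * np.einsum('ijk,k,ijk->ij', d, Lam, d))
    Kuu = k(Xu, Xu) + jitter * np.eye(len(Xu))
    return k(Xq, Xu) @ np.linalg.solve(Kuu, u)

# At the inducing points themselves the GP interpolates the learned outputs u.
Xu = np.array([[0.0], [1.0], [2.0]])
u = np.array([1.0, -1.0, 0.5])
r = reward_from_inducing(Xu, Xu, u, 1.0, np.array([1.0]))
r_new = reward_from_inducing(np.array([[0.5]]), Xu, u, 1.0, np.array([1.0]))
```

The same call with new feature coordinates in `Xq` yields the predicted reward on an unseen state space.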
In our implementation, the likelihood is optimized with the L-BFGS method, with derivatives provided in the supplement. When the hyperparameters are learned, the likelihood is generally not convex. While this is not unusual for GP methods, it does mean that the method can suffer from local optima. In the supplement, we also describe a simple restart procedure we used to mitigate this problem.\n\n4 Regularization and Hyperparameter Priors\n\nIn GP regression, a noise term is often added to the diagonal of the kernel matrix to account for noisy observations. Since GPIRL learns the noiseless underlying outputs u, there is no cause to add a noise term, which means that the kernel matrix K_{u,u} can become singular. Intuitively, this indicates that two or more inducing points are deterministically covarying, and therefore redundant. To ensure that no inducing point is redundant, we assume that their positions in feature space Xu, rather than their values, are corrupted by white noise with variance \u03c3^2. The expected squared difference in the kth feature values between two points x_i and x_j is then given by (x_{ik} \u2212 x_{jk})^2 + 2\u03c3^2, and the new, regularized kernel function is given by\n\nk(x_i, x_j) = \u03b2 exp( \u2212(1/2) (x_i \u2212 x_j)^T \u039b (x_i \u2212 x_j) \u2212 1_{i \u2260 j} \u03c3^2 tr(\u039b) )    (5)\n\nThe regularization ensures that k(x_i, x_j) < k(x_i, x_i) so long as at least one feature is relevant \u2013 that is, tr(\u039b) > 0. While the regularized kernel prevents singular covariance matrices when many features become irrelevant, the log likelihood can still increase to infinity as \u039b \u2192 0 or \u03b2 \u2192 0: in both cases, \u2212(1/2) log |K_{u,u}| \u2192 \u221e and, so long as u \u2192 0, all other terms remain finite.\nTo prevent such degeneracies, we use a hyperparameter prior that discourages kernels under which two inducing points become deterministically covarying.
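The regularized kernel of Equation 5 can be sketched as follows (our NumPy notation; `Lam` holds the diagonal of \u039b):

```python
import numpy as np

def regularized_ard_kernel(X, beta, Lam, sigma2):
    # Equation 5:
    # k(x_i, x_j) = beta * exp(-0.5 (x_i - x_j)^T Lambda (x_i - x_j)
    #                          - 1[i != j] sigma^2 tr(Lambda))
    d = X[:, None, :] - X[None, :, :]
    quad = np.einsum('ijk,k,ijk->ij', d, Lam, d)
    off_diag = 1.0 - np.eye(len(X))  # indicator 1[i != j]
    return beta * np.exp(-0.5 * quad - off_diag * sigma2 * np.sum(Lam))

# Two identical inducing points: without regularization the kernel matrix
# would be singular; with it, the off-diagonal entries stay below the diagonal.
X = np.array([[0.0, 0.0], [0.0, 0.0]])
K = regularized_ard_kernel(X, 1.0, np.array([1.0, 1.0]), 0.005)
```

Even for duplicated inducing points the matrix remains nonsingular whenever tr(\u039b) > 0, which is exactly the property the regularization is meant to guarantee.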
As two points u_i and u_j become deterministically related, the magnitude of their partial correlation [K^{-1}_{u,u}]_{ij} becomes infinite. We can therefore prevent degeneracies with a prior term of the form \u2212(1/2) \u2211_{ij} [K^{-1}_{u,u}]^2_{ij} = \u2212(1/2) tr(K^{-2}_{u,u}), which discourages large partial correlations between inducing points. Such a prior is dependent on Xu. However, unlike in GP regression, Xu and u are parameters of the algorithm rather than data, and since the inducing point positions are fixed in advance, it is possible to condition the prior on Xu. To encourage sparse feature weights \u039b, we also use a sparsity-inducing penalty \u03c6(\u039b), resulting in the prior log P(\u03b8|Xu) = \u2212(1/2) tr(K^{-2}_{u,u}) \u2212 \u03c6(\u039b). A variety of penalties are suitable, but we obtained the best results with \u03c6(\u039b) = \u2211_i log(\u039b_{ii} + 1). Although we can also optimize for the noise variance \u03c3^2, we did not observe that this significantly altered the results, and instead fixed 2\u03c3^2 to 10^{-2}.\n\n5 Inducing Points and Large State Spaces\n\nA straightforward choice for the inducing points Xu is the feature values of all states in the state space S. Unfortunately, the kernel matrix K_{u,u} is constructed and inverted at each iteration of the optimization in order to compute the gradient. This is a costly procedure: constructing the matrix has running time O(d_X |Xu|^2) and inverting it is O(|Xu|^3), where d_X is the number of features. To make GPIRL tractable on large state spaces, we can instead choose Xu to be a small subset of S, so that only the construction of K_{r,u} depends on |S|, and this dependence is linear. In principle, the minimum size of Xu corresponds to the complexity of the reward function.
For example, if the true\nreward has two constant regions, it can be represented by just two properly placed inducing points.\nIn practice, Xu must cover the space of feature values well enough to represent an unknown reward\nfunction, but we can nonetheless use many fewer points than there are states in S.\nIn our implementation, we chose Xu to contain the feature values of all states visited in the example\npaths, as well as additional random states added to raise |Xu| to a desired size. While this heuristic\nworked well in our experiments, we can also view the choice of Xu as analogous to the choice of\nthe active set in sparse GP approximation. A number of methods have been proposed for selecting\nthese sets [8], and applying such methods to GPIRL is a promising avenue for future work.\n\n6 Alternative Kernels\n\nThe particular choice of kernel function in\ufb02uences the structure of the learned reward. The stationary\nkernel in Equation 5 favors rewards that are smooth with respect to feature values. Other kernels can\nbe used to learn other types of structure. For example, a reward function might have wide regions\nwith uniform values, punctuated by regions of high-frequency variation, as is the case for piecewise\nconstant rewards. A stationary kernel would have dif\ufb01culty representing such structure. Instead, we\ncan warp each coordinate xik of xi by a function wk(xik) to give high resolution to one region, and\nlow resolution everywhere else. 
One such function is a sigmoid centered at m_k and scaled by \u2113_k:\n\nw_k(x_{ik}) = 1 / (1 + exp(\u2212(x_{ik} \u2212 m_k) / \u2113_k))\n\nReplacing x_i by w(x_i) in Equation 5, we get a regularized warped kernel of the form\n\nk(x_i, x_j) = \u03b2 exp( \u2212(1/2) \u2211_k \u039b_{kk} [ (w_k(x_{ik}) \u2212 w_k(x_{jk}))^2 + 1_{i \u2260 j} \u03c3^2 (w^\u03c3_k(x_{ik}) + w^\u03c3_k(x_{jk})) ] )\n\nThe second term in the sum is the contribution of the noise to the expected distance. Assuming \u03c3^2 is small, this value can be approximated to first order by setting w^\u03c3_k(x_{ik}) = \u2202w_k/\u2202x_{ik} + s_k, where s_k is an additional parameter that increases the noise in the tails of the sigmoid to prevent degeneracies. The parameters m, \u2113, and s are added to \u03b8 and jointly optimized with u and the other hyperparameters, using unit variance Gaussian priors for \u2113 and s and gamma priors for m. Note that this procedure is not equivalent to merely fitting a sigmoid to the reward function, since the reward can still vary nonlinearly in the high resolution regions around each sigmoid center m_k. The accompanying supplement includes details about the priors placed on the warp parameters in our implementation, a complete derivation of w^\u03c3_k, and the derivatives of the warped kernel function.\nDuring the optimization, as the sigmoid scales \u2113 become small, the derivatives with respect to the sigmoid centers m fall to zero. If the centers have not yet converged to the correct values, the optimization will end in a local optimum. It is therefore more important to address local optima when using the warped kernel. As mentioned in Section 3, we mitigate the effects of local optima with a small number of random restarts.
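The sigmoid warp and the warped kernel above can be sketched as follows (our notation; `m`, `ell`, `s`, and `Lam` are per-feature arrays, and `w_sigma` uses the first-order approximation described in the text):

```python
import numpy as np

def warp(x, m, ell):
    # w_k(x) = 1 / (1 + exp(-(x - m_k) / ell_k)): high resolution near the
    # center m_k, low resolution in the tails
    return 1.0 / (1.0 + np.exp(-(x - m) / ell))

def warped_kernel(X, beta, Lam, m, ell, s, sigma2):
    W = warp(X, m, ell)             # (n, d) warped feature values
    dW = W * (1.0 - W) / ell + s    # w_sigma = dw/dx + s (first-order noise)
    sq = (W[:, None, :] - W[None, :, :]) ** 2
    noise = sigma2 * (dW[:, None, :] + dW[None, :, :])
    off = (1.0 - np.eye(len(X)))[:, :, None]  # indicator 1[i != j]
    return beta * np.exp(-0.5 * np.einsum('ijk,k->ij', sq + off * noise, Lam))

X = np.array([[0.0], [1.0], [4.0]])
m, ell, s = np.array([1.0]), np.array([0.5]), np.array([0.1])
K = warped_kernel(X, 1.0, np.array([1.0]), m, ell, s, 0.005)
```

Points far out in a sigmoid's tail map to nearly identical warped coordinates, so they receive nearly identical rewards, while points near the center m_k remain well separated.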
Details of the particular random restart technique we used\ncan also be found in the supplement.\nWe presented just one example of how an alternative kernel allows us to learn a reward with a\nparticular structure. Many kernels have been proposed for GPs [11], and this variety of kernel\nfunctions can be used to apply GPIRL to new domains and to extend its generality and \ufb02exibility.\n\n7 Experiments\n\nWe compared GPIRL with prior methods on several IRL tasks, using examples sampled from the\nstochastic MaxEnt policy (see Section 2) as well as human demonstrations. Examples drawn from\nthe stochastic policy can intuitively be viewed as noisy samples of an underlying optimal policy,\nwhile the human demonstrations contain the stochasticity inherent in human behavior. GPIRL was\ncompared with the MaxEnt IRL algorithm [17] and FIRL [5], as well as a variant of MaxEnt with\na sparsity-inducing Laplace prior, which we refer to as MaxEnt/Lp. We evaluated a variety of other\nmargin-based methods, including Abbeel and Ng\u2019s projection algorithm, MMP, MWAL, MMPBoost\nand LEARCH [1, 12, 15, 13, 14]. Since GPIRL, FIRL, and MaxEnt consistently produced better\nresults, the other algorithms are not shown here, but are included in the supplementary result tables.\nWe compare the algorithms using the \u201cexpected value difference\u201d score, which is a measure of how\nsuboptimal the learned policy is under the true reward. To compute this score, we \ufb01nd the optimal\ndeterministic policy under each learned reward, measure its expected sum of discounted rewards un-\nder the true reward function, and subtract this quantity from the expected sum of discounted rewards\nunder the true policy. While we could also evaluate the optimal stochastic policies, this would un-\nfairly penalize margin-based methods, which are unaware of the MaxEnt model. 
To determine how\nwell each algorithm captured the structure of the reward function, we evaluated the learned reward\non the environment on which it was learned, and on 4 additional random environments (denoted\n\u201ctransfer\u201d). Algorithms that do not express the reward function in terms of the correct features are\nexpected to perform poorly on the transfer environments, even if they perform well on the training\nenvironment. Methods that correctly identify relevant features should perform well on both. For\neach environment, we evaluated the algorithms with both discrete and continuous-valued features.\nIn the latter case, GPIRL used the warped kernel in Section 6 and FIRL, which requires discrete\nfeatures, was not tested. Each test was repeated 8 times with different random environments.\n\n7.1 Objectworld Experiments\nThe objectworld is an N \u00d7 N grid of states with \ufb01ve actions per state, corresponding to steps in\neach direction and staying in place. Each action has a 30% chance of moving in a different random\ndirection. Randomly placed objects populate the objectworld, and each is assigned one of C inner\nand outer colors. Object placement is randomized in the transfer environments, while N and C\nremain the same. There are 2C continuous features, each giving the Euclidean distance to the nearest\nobject with a speci\ufb01c inner or outer color. In the discrete feature case, there are 2CN binary features,\neach one an indicator for a corresponding continuous feature being less than d \u2208 {1, ..., N}. The\ntrue reward is positive in states that are both within 3 cells of outer color 1 and 2 cells of outer\ncolor 2, negative within 3 cells of outer color 1, and zero otherwise. Inner colors and all other outer\ncolors are distractors. 
The algorithms were provided example paths of length 8, and the number of examples and colors was varied to determine their ability to handle limited data and distractors.\n\nFigure 1: Results for 32\u00d732 objectworlds with C = 2 and varying numbers of examples. Shading shows standard error. GPIRL learned accurate rewards that generalized well to new state spaces.\n\nFigure 2: Objectworld evaluation with 32 examples and varying numbers of colors C. GPIRL was able to perform well even as the number of distractor features increased.\n\nBecause of the large number of irrelevant features and the nonlinearity of the reward, this example is particularly challenging for methods that learn linear reward functions. With 16 or more examples, GPIRL consistently learned reward functions that performed as well as the true reward, as shown in Figure 1, and was able to sustain this performance as the number of distractors increased, as shown in Figure 2. While the performance of MaxEnt and FIRL also improved with additional examples, they were consistently outperformed by GPIRL. In the case of FIRL, this was likely due to the suboptimal expert examples. In the case of MaxEnt, although the Laplace prior improved the results, the inability to represent nonlinear rewards limited the algorithm\u2019s accuracy. These issues are evident in Figure 3, which shows part of a reward function learned by each method. When using continuous features, the performance of MaxEnt suffered even more from the increased nonlinearity of the reward function, while GPIRL maintained a similar level of accuracy.\n\nFigure 3: Part of a reward function learned by each algorithm on an objectworld.
While GPIRL learned the correct reward function, MaxEnt was unable to represent the nonlinearities, and FIRL learned an overly complex reward under which the suboptimal expert would have been optimal.\n\nFigure 4: Results for 64-car-length highways with varying example counts. While GPIRL achieved only modest improvement over prior methods on the training environment, the large improvement in the transfer tests indicates that the underlying reward structure was captured more accurately.\n\nFigure 5: Evaluation on the highway environment with human demonstrations.
GPIRL learned a reward function that more accurately reflected the true policy the expert was attempting to emulate.\n\n7.2 Highway Driving Behavior\n\nIn addition to the objectworld environment, we evaluated the algorithms on more concrete behaviors in the context of a simple highway driving simulator, modeled on the experiment in [5] and similar evaluations in other work [1]. The task is to navigate a car on a three-lane highway, where all other vehicles move at a constant speed. The agent can switch lanes and drive at up to four times the speed of traffic. Other vehicles are either civilian or police, and each vehicle can be a car or motorcycle. Continuous features indicate the distance to the nearest vehicle of a specific class (car or motorcycle) or category (civilian or police) in front of the agent, either in the same lane, the lane to the right, the lane to the left, or any lane. Another set of features gives the distance to the nearest such vehicle in a given lane behind the agent. There are also features to indicate the current speed and lane. Discrete features again discretize the continuous features, with distances discretized in the same way as in the objectworld. In this section, we present results from synthetic and human demonstrations of a policy that drives as fast as possible, but avoids driving more than double the speed of traffic within two car-lengths of a police vehicle. Due to the connection between the police and speed features, the reward for this policy is nonlinear. We also evaluated a second policy that instead avoids driving more than double the speed of traffic in the rightmost lane. The results for this policy were similar to the first, and are included in the supplementary result tables.\nFigure 4 shows a comparison of GPIRL and prior algorithms on highways with varying numbers of 32-step synthetic demonstrations of the \u201cpolice\u201d task.
GPIRL only modestly outperformed prior methods on the training environments with discrete features, but achieved large improvement on the transfer experiment. This indicates that, while prior algorithms learned a reasonable reward, this reward was not expressed in terms of the correct features, and did not generalize correctly. With continuous features, the nonlinearity of the reward was further exacerbated, making it difficult for linear methods to represent it even on the training environment. In Figure 5, we also evaluate how GPIRL and prior methods were able to learn the \u201cpolice\u201d behavior from human demonstrations.
Road color indicates the\nreward at the highest speed, when the agent should be penalized for driving fast near police vehicles.\nThe reward learned by GPIRL most closely resembles the true one.\n\nAlthough the human demonstrations were suboptimal, GPIRL was still able to learn a reward func-\ntion that re\ufb02ected the true policy more accurately than prior methods. Furthermore, the similarity\nof GPIRL\u2019s performance with the human and synthetic demonstrations suggests that its model of\nsuboptimal expert behavior is a reasonable re\ufb02ection of actual human suboptimality. An example of\nrewards learned from human demonstrations is shown in Figure 6. Example videos of the learned\npolicies and human demonstrations, as well as source code for our implementation of GPIRL, can\nbe found at http://graphics.stanford.edu/projects/gpirl/index.htm\n\n8 Discussion and Future Work\n\nWe presented an algorithm for inverse reinforcement learning that represents nonlinear reward func-\ntions with Gaussian processes. Using a probabilistic model of a stochastic expert with a GP prior\non reward values, our method is able to recover both a reward function and the hyperparameters of\na kernel function that describes the structure of the reward. The learned GP can be used to predict a\nreward function consistent with the expert on any state space in the domain of the features.\nIn experiments with nonlinear reward functions, GPIRL consistently outperformed prior methods,\nespecially when generalizing the learned reward to new state spaces. However, like many GP mod-\nels, the GPIRL log likelihood is multimodal. When using the warped kernel function, a random\nrestart procedure was needed to consistently \ufb01nd a good optimum. 
More complex kernels might suffer more from local optima, potentially requiring more robust optimization methods.

It should also be noted that our experiments were intentionally chosen to be challenging for algorithms that construct rewards as linear combinations. When good features that form a linear basis for the reward are already known, prior methods such as MaxEnt would be expected to perform comparably to GPIRL. However, it is often difficult to ensure this is the case in practice, and previous work on margin-based methods suggests that nonlinear methods often outperform linear ones [13, 14].

When presented with a novel state space, GPIRL currently uses the posterior mean of the GP to estimate the reward function. In principle, we could leverage the fact that GPs learn distributions over functions to account for the uncertainty about the reward in states that are different from any of the inducing points. For example, such an approach could be used to learn a "conservative" policy that aims to achieve high rewards with some degree of certainty, avoiding regions where the reward distribution has high variance. In an interactive training setting, such a method could also inform the expert about states that have high reward variance and require additional demonstrations.

More generally, by introducing Gaussian processes into inverse reinforcement learning, GPIRL can benefit from the wealth of prior work on Gaussian process regression. For instance, we apply ideas from sparse GP approximation in the use of a small set of inducing points to learn the reward function in time linear in the number of states. A substantial body of prior work discusses techniques for automatically choosing or optimizing these inducing points [8], and such methods could be incorporated into GPIRL to learn reward functions with even smaller active sets.
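The posterior-mean reward prediction and the uncertainty-penalized "conservative" reward discussed above can be sketched concretely. The snippet below is an illustrative sketch, not the paper's implementation: it assumes a standard ARD RBF kernel rather than GPIRL's learned warped kernel, and the function names (`ard_rbf`, `gp_reward_posterior`), inducing points, and reward values are all made up for illustration.

```python
import numpy as np

def ard_rbf(A, B, beta, lam):
    """ARD RBF kernel: beta * exp(-0.5 * sum_k lam_k * (a_k - b_k)^2).
    A is (n, d), B is (m, d); returns the (n, m) kernel matrix."""
    diff = A[:, None, :] - B[None, :, :]                 # (n, m, d)
    return beta * np.exp(-0.5 * np.einsum('nmd,d->nm', diff**2, lam))

def gp_reward_posterior(X_q, X_u, u, beta=1.0, lam=None, noise=1e-6):
    """Posterior mean and variance of the reward at query feature points X_q,
    conditioned on inducing points X_u with learned reward values u."""
    if lam is None:
        lam = np.ones(X_u.shape[1])
    K_uu = ard_rbf(X_u, X_u, beta, lam) + noise * np.eye(len(X_u))
    K_qu = ard_rbf(X_q, X_u, beta, lam)
    mean = K_qu @ np.linalg.solve(K_uu, u)
    # Diagonal of the posterior covariance: prior variance minus explained part.
    var = beta - np.einsum('qu,uq->q', K_qu, np.linalg.solve(K_uu, K_qu.T))
    return mean, np.maximum(var, 0.0)

# Illustrative values: two query states, one near an inducing point, one far away.
X_u = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])     # inducing feature points
u = np.array([1.0, -1.0, 0.5])                            # learned rewards at X_u
X_q = np.array([[0.0, 0.0], [5.0, 5.0]])
mean, var = gp_reward_posterior(X_q, X_u, u)

# A "conservative" reward penalizes posterior uncertainty: far from all
# inducing points, the variance approaches the prior and the reward is low.
r_conservative = mean - 2.0 * np.sqrt(var)
```

Subtracting a multiple of the posterior standard deviation is one simple way to realize the "conservative" idea; a fuller treatment would propagate the entire reward distribution through planning rather than collapsing it to a pessimistic point estimate.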
We also demonstrate how different kernels can be used to learn different types of reward structure, and further investigation into the kinds of kernel functions that are useful for IRL is another exciting avenue for future work.

Acknowledgments. We thank Andrew Y. Ng and Krishnamurthy Dvijotham for helpful feedback and discussion. This work was supported by NSF Graduate Research Fellowship DGE-0645962.

[Figure 6 panel labels: True Reward, GPIRL, MaxEnt/Lp, FIRL.]

References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML '04: Proceedings of the 21st International Conference on Machine Learning, 2004.

[2] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian process dynamic programming. Neurocomputing, 72(7–9):1508–1524, 2009.

[3] K. Dvijotham and E. Todorov. Inverse optimal control with linearly-solvable MDPs. In ICML '10: Proceedings of the 27th International Conference on Machine Learning, pages 335–342, 2010.

[4] Y. Engel, S. Mannor, and R. Meir. Reinforcement learning with Gaussian processes. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 201–208, 2005.

[5] S. Levine, Z. Popović, and V. Koltun. Feature construction for inverse reinforcement learning. In Advances in Neural Information Processing Systems 23, 2010.

[6] G. Neu and C. Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Uncertainty in Artificial Intelligence (UAI), 2007.

[7] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In ICML '00: Proceedings of the 17th International Conference on Machine Learning, pages 663–670, 2000.

[8] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

[9] D. Ramachandran and E. Amir.
Bayesian inverse reinforcement learning. In IJCAI '07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2586–2591, 2007.

[10] C. E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems 16, 2003.

[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.

[12] N. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Maximum margin planning. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 729–736, 2006.

[13] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt. Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems 19, 2007.

[14] N. Ratliff, D. Silver, and J. A. Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.

[15] U. Syed and R. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, 2008.

[16] B. D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.

[17] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI 2008), pages 1433–1438, 2008.