{"title": "Compatible Reward Inverse Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2050, "page_last": 2059, "abstract": "Inverse Reinforcement Learning (IRL) is an effective approach to recover a reward function that explains the behavior of an expert by observing a set of demonstrations.  This paper is about a novel model-free IRL approach that, differently from most of the existing IRL algorithms, does not require to specify a function space where to search for the expert's reward function. Leveraging on the fact that the policy gradient needs to be zero for any optimal policy, the algorithm generates a set of basis functions that span the subspace of reward functions that make the policy gradient vanish. Within this subspace, using a second-order criterion, we search for the reward function that penalizes the most a deviation from the expert's policy. After introducing our approach for finite domains, we extend it to continuous ones. The proposed approach is empirically compared to other IRL methods both in the (finite) Taxi domain and in the (continuous) Linear Quadratic Gaussian (LQG) and Car on the Hill environments.", "full_text": "Compatible Reward Inverse Reinforcement Learning\n\nAlberto Maria Metelli\n\nDEIB\n\nPolitecnico di Milano, Italy\n\nMatteo Pirotta\nSequeL Team\n\nInria Lille, France\n\nMarcello Restelli\n\nDEIB\n\nPolitecnico di Milano, Italy\n\nalbertomaria.metelli@polimi.it\n\nmatteo.pirotta@inria.fr\n\nmarcello.restelli@polimi.it\n\nAbstract\n\nInverse Reinforcement Learning (IRL) is an effective approach to recover a reward\nfunction that explains the behavior of an expert by observing a set of demonstrations.\nThis paper is about a novel model-free IRL approach that, differently from most\nof the existing IRL algorithms, does not require to specify a function space where\nto search for the expert\u2019s reward function. 
Leveraging on the fact that the policy\ngradient needs to be zero for any optimal policy, the algorithm generates a set of\nbasis functions that span the subspace of reward functions that make the policy\ngradient vanish. Within this subspace, using a second-order criterion, we search\nfor the reward function that penalizes the most a deviation from the expert\u2019s policy.\nAfter introducing our approach for \ufb01nite domains, we extend it to continuous ones.\nThe proposed approach is empirically compared to other IRL methods both in the\n(\ufb01nite) Taxi domain and in the (continuous) Linear Quadratic Gaussian (LQG) and\nCar on the Hill environments.\n\n1\n\nIntroduction\n\nImitation learning aims to learn to perform a task by observing only expert\u2019s demonstrations. We\nconsider the settings where only expert\u2019s demonstrations are given, no information about the dynamics\nand the objective of the problem is provided (e.g., reward) or ability to query for additional samples.\nThe main approaches solving this problem are behavioral cloning [1] and inverse reinforcement\nlearning [2]. The former recovers the demonstrated policy by learning the state-action mapping in a\nsupervised learning way, while inverse reinforcement learning aims to learn the reward function that\nmakes the expert optimal. Behavioral Cloning (BC) is simple, but its main limitation is the intrinsic\ngoal, i.e., to replicate the observed policy. This task has several limitations: it requires a huge amount\nof data when the environment (or the expert) is stochastic [3]; it does not provide good generalization\nor a description of the expert\u2019s goal. On the contrary, Inverse Reinforcement Learning (IRL) accounts\nfor generalization and transferability by directly learning the reward function. This information can\nbe transferred to any new environment in which the features are well de\ufb01ned. 
As a consequence, IRL\nallows recovering the optimal policy a posteriori, even under variations of the environment. IRL has\nreceived a lot of attention in literature and has succeeded in several applications [e.g., 4, 5, 6, 7, 8].\nHowever, BC and IRL are tightly related by the intrinsic relationship between reward and optimal\npolicy. The reward function defines the space of optimal policies and to recover the reward it is\nrequired to observe/recover the optimal policy. The idea of this paper, and of some recent papers [e.g.,\n9, 8, 3], is to exploit the synergy between BC and IRL.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nUnfortunately, IRL approaches also present issues. First, several IRL methods require solving the\nforward problem as part of an inner loop [e.g., 4, 5]. Literature has extensively focused on removing\nthis limitation [10, 11, 9] in order to scale IRL to real-world applications [12, 3, 13]. Second, IRL\nmethods generally require designing the function space by providing features that capture the structure\nof the reward function [e.g., 4, 14, 5, 10, 15, 9]. This information, provided in addition to expert\u2019s\ndemonstrations, is critical for the success of the IRL approach. The issue of designing the function\nspace is a well-known problem in supervised learning, but it is even more critical in IRL since a wrong\nchoice might prevent the algorithm from finding good solutions to the IRL problem [2, 16], especially\nwhen linear reward models are considered. The importance of incorporating feature construction\nin IRL has been known in the literature for a while [4] but, as far as we know, it has been explicitly\naddressed only in [17]. Recently, the IRL literature, mimicking the supervised learning one, has focused\non exploiting the capability of neural networks to automatically construct relevant features out of the\nprovided data [12, 8, 13]. 
By exploiting a \u201cblack-box\u201d approach, these methods do not take advantage\nof the structure of the underlying Markov decision process (in the phase of feature construction).\nWe present an IRL algorithm that constructs reward features directly from expert\u2019s demonstrations.\nThe proposed algorithm is model-free and does not require solving the forward problem (i.e., finding\nan optimal policy given a candidate reward function) as an inner step. The Compatible Reward Inverse\nReinforcement Learning (CR-IRL) algorithm builds a reward function that is compatible with the\nexpert\u2019s policy. It mixes BC and IRL in order to recover the \u201coptimal\u201d and most \u201cinformative\u201d reward\nfunction in the space spanned by the recovered features. Inspired by the gradient-minimization IRL\napproach proposed in [9], we focus on the space of reward functions that make the policy gradient\nof the expert vanish. Since a zero gradient is only a necessary condition for optimality, we consider a\nsecond-order optimality criterion based on the policy Hessian to rank the reward functions and finally\nselect the best one (i.e., the one that penalizes the most a deviation from the expert\u2019s policy).\n\n2 Algorithm Overview\nA Markov Decision Process (MDP) [18] is defined as M = (S, A, P, R, \u03b3, \u00b5) where S is the state\nspace, A is the action space, P(s'|s, a) is a Markovian transition model that defines the conditional\ndistribution of the next state s' given the current state s and the current action a, \u03b3 \u2208 [0, 1] is\nthe discount factor, R(s, a) is the expected reward for performing action a in state s and \u00b5 is the\ndistribution of the initial state. The optimal policy \u03c0\u2217 is the policy that maximizes the discounted\nsum of rewards E[\u2211_{t=0}^{+\u221e} \u03b3^t R(s_t, a_t) | \u03c0, M].\n
CR-IRL takes as input a parametric policy space \u03a0\u0398 = {\u03c0\u03b8 : \u03b8 \u2208 \u0398 \u2286 R^k} and a set of rewardless\ntrajectories from the expert policy \u03c0E, denoted by D = {(s_{\u03c4i,0}, a_{\u03c4i,0}, . . . , s_{\u03c4i,T(\u03c4i)}, a_{\u03c4i,T(\u03c4i)})},\nwhere s_{\u03c4i,t} is the t-th state in trajectory \u03c4i and i = 1, . . . , N. CR-IRL is a non-iterative algorithm\nthat recovers a reward function for which the expert is optimal without requiring the specification of a reward\nfunction space. It starts by building the features {\u03c6i} of the value function that are compatible with\npolicy \u03c0E, i.e., that make the policy gradient vanish (Phase 1, see Sec. 3). This step requires a\nparametric representation \u03c0\u03b8E \u2208 \u03a0\u0398 of the expert\u2019s policy which can be obtained through behavioral\ncloning.1 The choice of the policy space \u03a0\u0398 influences the size of the functional space used by\nCR-IRL for representing the value function (and the reward function) associated with the expert\u2019s\npolicy. In order to formalize this notion, we introduce the policy rank, a quantity that represents\nthe ability of a parametric policy to reduce the dimensions of the approximation space for the value\nfunction of the expert\u2019s policy. Once these value features have been built, they can be transformed into\nreward features {\u03c8i} (Phase 2, see Sec. 4) by means of the Bellman equation [18] (model-based) or\nreward shaping [19] (model-free). All the rewards spanned by the features {\u03c8i} satisfy the first-order\nnecessary optimality condition [20], but we are not sure about their nature (minima, maxima or saddle\npoints). The final step is thus to recover a reward function that is maximized by the expert\u2019s policy\n(Phase 3, see Sec. 5). This is achieved by considering a second-order optimality condition, with the\nidea that we want the reward function that penalizes the most a deviation from the parameters of the\nexpert\u2019s policy \u03c0\u03b8E. 
This criterion is similar in spirit to what is done in [2, 4, 14], where the goal is to\nidentify the reward function that makes the expert\u2019s policy better than any other policy by a margin.\nThe algorithmic structure is reported in Alg. 1.\nIRL literature usually considers two different settings: optimal or sub-optimal expert. This distinction\nis necessary when a fixed reward space is provided. In fact, the demonstrated behavior may not be\noptimal under the considered reward space. In this case, the problem becomes somehow not well\ndefined and additional \u201coptimality\u201d criteria are required [16]. This is not the case for CR-IRL that is\nable to automatically generate the space of reward functions that make the policy gradient vanish,\nthus containing also reward functions under which the recovered expert\u2019s policy \u03c0\u03b8E is optimal. In\nthe rest of the paper, we will assume to have a parametric representation of the expert\u2019s policy that\nwe will denote for simplicity by \u03c0\u03b8.\n\n1We want to stress that our primary objective is to recover the reward function since we aim to explain the\nmotivations that guide the expert and to transfer it, not just to replicate the behavior. As explained in the\nintroduction, we aim to exploit the synergy between BC and IRL.\n\n3 Expert\u2019s Compatible Value Features\nIn this section, we present the procedure to obtain the set {\u03c6i}_{i=1}^p of Expert\u2019s COmpatible Q-features\n(ECO-Q) that make the policy gradient vanish2 (Phase 1). We start by introducing the policy gradient\nand the associated first-order optimality condition. We will indicate with T the set of all possible\ntrajectories, p\u03b8(\u03c4) the probability density of trajectory \u03c4 and R(\u03c4) the \u03b3-discounted trajectory reward\ndefined as R(\u03c4) = \u2211_{t=0}^{T(\u03c4)} \u03b3^t R(s_{\u03c4,t}, a_{\u03c4,t}) that, in our settings, is obtained as a linear combination of\nreward features. 
Given a policy \u03c0\u03b8, the expected \u03b3-discounted return for an infinite horizon MDP is:\n\nJ(\u03b8) = \u222b_S d_\u00b5^{\u03c0\u03b8}(s) \u222b_A \u03c0\u03b8(a|s) R(s, a) da ds = \u222b_T p\u03b8(\u03c4) R(\u03c4) d\u03c4,\n\nwhere d_\u00b5^{\u03c0\u03b8} is the \u03b3-discounted future state occupancy [21]. If \u03c0\u03b8 is differentiable w.r.t. the parameter\n\u03b8, the gradient of the expected reward (policy gradient) [21, 22, 23] is:\n\n\u2207\u03b8J(\u03b8) = \u222b_S \u222b_A d_\u00b5^{\u03c0\u03b8}(s, a) \u2207\u03b8 log \u03c0\u03b8(a|s) Q^{\u03c0\u03b8}(s, a) da ds = \u222b_T p\u03b8(\u03c4) \u2207\u03b8 log p\u03b8(\u03c4) R(\u03c4) d\u03c4, (1)\n\nwhere d_\u00b5^{\u03c0\u03b8}(s, a) = d_\u00b5^{\u03c0\u03b8}(s) \u03c0\u03b8(a|s) is the \u03b3-discounted future state-action occupancy, which represents\nthe expected discounted number of times action a is executed in state s given \u00b5 as initial state\ndistribution and following policy \u03c0\u03b8. When \u03c0\u03b8 is an optimal policy in the class of policies \u03a0\u0398 =\n{\u03c0\u03b8 : \u03b8 \u2208 \u0398 \u2286 R^k} then \u03b8 is a stationary point of the expected return and thus \u2207\u03b8J(\u03b8) = 0\n(first-order necessary conditions for optimality [20]).\nWe assume the space S \u00d7 A to be a Hilbert space [24] equipped with the weighted inner product:3\n\n\u27e8f, g\u27e9_{\u00b5,\u03c0\u03b8} = \u222b_S \u222b_A f(s, a) d_\u00b5^{\u03c0\u03b8}(s, a) g(s, a) ds da. (2)\n\nWhen \u03c0\u03b8 is optimal for the MDP, \u2207\u03b8 log \u03c0\u03b8 and Q^{\u03c0\u03b8} are orthogonal w.r.t. the inner product (2).\nWe can exploit the orthogonality property to build an approximation space for the Q-function. Let\nG\u03c0\u03b8 = {\u2207\u03b8 log \u03c0\u03b8 \u03b1 : \u03b1 \u2208 R^k} be the subspace spanned by the gradient of the log-policy \u03c0\u03b8. 
From\nequation (1) finding an approximation space for the Q-function is equivalent to finding the orthogonal\ncomplement of the subspace G\u03c0\u03b8, which in turn corresponds to finding the null space of the functional:\n\nG\u03c0\u03b8[\u03c6] = \u27e8\u2207\u03b8 log \u03c0\u03b8, \u03c6\u27e9_{\u00b5,\u03c0\u03b8}. (3)\n\nWe define an Expert\u2019s COmpatible Q-feature as any function \u03c6 making the functional (3) null. This\nspace G\u22a5_{\u03c0\u03b8} := null(G\u03c0\u03b8) represents the Hilbert subspace of the features for the Q-function that\nare compatible with the policy \u03c0\u03b8 in the sense that any Q-function optimized by policy \u03c0\u03b8 can be\nexpressed as a linear combination of those features. Sections 3.2 and 3.3 describe how to compute\nthe ECO-Q from samples in finite and continuous MDPs, respectively. The dimension of G\u22a5_{\u03c0\u03b8} is\ntypically very large since the number k of policy parameters is significantly smaller than the number\nof state-action pairs. A formal discussion of this issue for finite MDPs is presented in the next section.\n\n3.1 Policy rank\nThe parametrization of the expert\u2019s policy influences the size of G\u22a5_{\u03c0\u03b8}. Intuition suggests that the larger\nthe number k of the parameters the more the policy is informative to infer the Q-function and so the\nreward function. This is motivated by the following rationale. Consider representing the expert\u2019s\npolicy using two different policy models such that one model is a superclass of the other one (for\ninstance, assume to use linear models where the features used in the simpler model are a subset of the\nfeatures used by policies in the other model). 
All the reward functions that make the policy gradient\nvanish with the rich policy model, do the same with the simpler model, while the converse does not\nhold. This suggests that complex policy models are able to reduce the space of optimal reward\nfunctions more than simpler models. This notion plays an important role for finite MDPs, i.e., MDPs where\nthe state-action space is finite. We formalize the ability of a policy to infer the characteristics of the\nMDP with the concept of policy rank.\n\n2Notice that any linear combination of the ECO-Q also satisfies the first-order optimality condition.\n3The inner product as defined is clearly symmetric, positive definite and linear, but there could be state-action\npairs never visited, i.e., d_\u00b5^{\u03c0\u03b8}(s, a) = 0, making \u27e8f, f\u27e9_{\u00b5,\u03c0\u03b8} = 0 for non-zero f. To ensure the properties of the\ninner product, we assume to compute it only on visited state-action pairs.\n\nDefinition 1. Let \u03c0\u03b8 be a policy with k parameters belonging to the class \u03a0\u0398 and differentiable in \u03b8.\nThe policy rank is the dimension of the space of the linear combinations of the partial derivatives of\n\u03c0\u03b8 w.r.t. \u03b8:\n\nrank(\u03c0\u03b8) = dim(\u0393\u03c0\u03b8), \u0393\u03c0\u03b8 = {\u2207\u03b8\u03c0\u03b8 \u03b1 : \u03b1 \u2208 R^k}.\n\nA first important note is that the policy rank depends not only on the policy model \u03a0\u0398 but also on\nthe value of the parameters of the policy \u03c0\u03b8. So the policy rank is a property of the policy, not of the\npolicy model. The following bound on the policy rank holds (the proof can be found in App. A.1).\nProposition 1. 
Given a finite MDP M, let \u03c0\u03b8 be a policy with k parameters belonging to the class \u03a0\u0398\nand differentiable in \u03b8, then: rank(\u03c0\u03b8) \u2264 min{k, |S||A| \u2212 |S|}.\nFrom an intuitive point of view this is justified by the fact that \u03c0\u03b8(\u00b7|s) is a probability distribution.\nAs a consequence, for all s \u2208 S the probabilities \u03c0\u03b8(a|s) must sum up to 1, removing |S| degrees\nof freedom. This has a relevant impact on the algorithm since it induces a lower bound on the\ndimension of the orthogonal complement dim(G\u22a5_{\u03c0\u03b8}) \u2265 max{|S||A| \u2212 k, |S|}, thus even the most\nflexible policy (i.e., a policy model with a parameter for each state-action pair) cannot determine a\nunique reward function that makes the expert\u2019s policy optimal, leaving |S| degrees of freedom. It\nfollows that it makes no sense to consider a policy with more than |S||A| \u2212 |S| parameters. The\ngeneralization capabilities enjoyed by the recovered reward function are deeply related to the choice\nof the policy model. Complex policies (many parameters) would require finding a reward function\nthat explains the value of all the parameters, resulting in possible overfitting, whereas a simple\npolicy model (few parameters) would enforce generalization as the imposed constraints are fewer.\n\n3.2 Construction of ECO-Q in Finite MDPs\n\nWe now develop in detail the algorithm to generate ECO-Q in the case of finite MDPs. From now\non we will indicate with |D| the number of distinct state-action pairs visited by the expert along the\navailable trajectories. When the state-action space is finite the inner product (2) can be written in\nmatrix notation as:\n\n\u27e8f, g\u27e9_{\u00b5,\u03c0\u03b8} = f^T D_\u00b5^{\u03c0\u03b8} g,\n\nwhere f, g and d_\u00b5^{\u03c0\u03b8} are real vectors with |D| components and D_\u00b5^{\u03c0\u03b8} = diag(d_\u00b5^{\u03c0\u03b8}). 
The term \u2207\u03b8 log \u03c0\u03b8\nis a |D| \u00d7 k real matrix, thus finding the null space of the functional (3) is equivalent to finding the\nnull space of the matrix \u2207\u03b8 log \u03c0\u03b8^T D_\u00b5^{\u03c0\u03b8}. This can be done for instance through SVD which allows\nto obtain a set of orthogonal basis functions \u03a6. Given that the weight vector d_\u00b5^{\u03c0\u03b8}(s, a) is usually\nunknown, it needs to be estimated. Since the policy \u03c0\u03b8 is known, we need to estimate just d_\u00b5^{\u03c0\u03b8}(s), as\nd_\u00b5^{\u03c0\u03b8}(s, a) = d_\u00b5^{\u03c0\u03b8}(s) \u03c0\u03b8(a|s). A Monte Carlo estimate exploiting the expert\u2019s demonstrations in D is:\n\n\u02c6d_\u00b5^{\u03c0\u03b8}(s) = (1/N) \u2211_{i=1}^{N} \u2211_{t=0}^{T(\u03c4i)} \u03b3^t 1(s_{\u03c4i,t} = s). (4)\n\n3.3 Construction of ECO-Q in Continuous MDPs\n\nTo extend the previous approach to the continuous domain we assume that the state-action space is\nequipped with the Euclidean distance. Now we can adopt an approach similar to the one exploited\nto extend Proto-Value Functions (PVF) [25, 26] to infinite observation spaces [27]. The problem is\ntreated as a discrete one considering only the state-action pairs visited along the collected trajectories.\nA Nystr\u00f6m interpolation method is used to approximate the value of a feature in a non-visited\nstate-action pair as a weighted mean of the values of the closest k features. 
The weight of each feature\nis computed by means of a Gaussian kernel placed over the Euclidean space S \u00d7 A:\n\nK((s, a), (s', a')) = exp( \u2212 (1 / (2\u03c3_S^2)) \u2016s \u2212 s'\u2016_2^2 \u2212 (1 / (2\u03c3_A^2)) \u2016a \u2212 a'\u2016_2^2 ), (5)\n\nwhere \u03c3S and \u03c3A are respectively the state and action bandwidth. In our setting this approach is fully\nequivalent to a kernel k-Nearest Neighbors regression.\n\n4 Expert\u2019s Compatible Reward Features\n\nThe set of ECO-Q basis functions allows representing the optimal value function under the policy \u03c0\u03b8.\nIn this section, we will show how it is possible to exploit ECO-Q functions to generate basis functions\nfor the reward representation (Phase 2). In principle, we can use the Bellman equation to obtain the\nreward from the Q-function but this approach requires the knowledge of the transition model (see\nApp. B). The reward can be recovered in a model-free way by exploiting optimality-invariant reward\ntransformations.\nReversing the Bellman equation [e.g., 10] allows finding the reward space that generates the estimated\nQ-function. However, IRL is interested in finding just a reward space under which the expert\u2019s policy\nis optimal. This problem can be seen as an instance of reward shaping [19] where the authors show\nthat the space of all the reward functions sharing the same optimal policy is given by:\n\nR'(s, a) = R(s, a) + \u03b3 \u222b_S P(s'|s, a) \u03c7(s') ds' \u2212 \u03c7(s),\n\nwhere \u03c7(s) is a state-dependent potential function. A smart choice [19] is to set \u03c7 = V^{\u03c0\u03b8} under\nwhich the new reward space is given by the advantage function: R'(s, a) = Q^{\u03c0\u03b8}(s, a) \u2212 V^{\u03c0\u03b8}(s) =\nA^{\u03c0\u03b8}(s, a). 
Thus the expert\u2019s advantage function is an admissible reward optimized by the expert\u2019s\npolicy itself. This choice is, of course, related to using Q\u03c0\u03b8 as reward. However, the advantage\nfunction encodes a more local and more transferable information w.r.t. the Q-function.\nThe space of reward features can be recovered through matrix equality \u03a8 = (I \u2212 \u02dc\u03c0\u03b8)\u03a6, where \u02dc\u03c0\u03b8 is\na |D| \u00d7 |D| matrix obtained from \u03c0\u03b8 repeating the row of each visited state a number of times equal\nto the number of distinct actions performed by the expert in that state. Notice that this is a simple\nlinear transformation through the expert\u2019s policy. The speci\ufb01c choice of the state-potential function\nhas the advantage to improve the learning capabilities of any RL algorithm [19]. This is not the only\nchoice of the potential function possible, but it has the advantage of allowing model-free estimation.\nOnce the ECO-R basis functions have been generated, they can be used to feed any IRL algorithm\nthat represents the expert\u2019s reward through a linear combination of basis functions. In the next section,\nwe propose a new method based on the optimization of a second-order criterion that favors reward\nfunctions that signi\ufb01cantly penalize deviations from the expert\u2019s policy.\n\n5 Reward Selection via Second-Order Criterion\nAny linear combination of the ECO-R {\u03c8i}p\ni=1 makes the gradient vanish, however in general this is\nnot suf\ufb01cient to ensure that the policy parameter \u03b8 is a maximum of J(\u03b8). Combinations that lead to\nminima or saddle points should be discarded. Furthermore, provided that a subset of ECO-R leading\nto maxima has been selected, we should identify a single reward function in the space spanned by this\nsubset of features (Phase 3). 
Both these requirements can be enforced by imposing a second-order\noptimality criterion based on the policy Hessian that is given by [28, 29]:\n\nH\u03b8J(\u03b8, \u03c9) = \u222b_T p\u03b8(\u03c4) ( \u2207\u03b8 log p\u03b8(\u03c4) \u2207\u03b8 log p\u03b8(\u03c4)^T + H\u03b8 log p\u03b8(\u03c4) ) R(\u03c4, \u03c9) d\u03c4,\n\nwhere \u03c9 is the reward weight and R(\u03c4, \u03c9) = \u2211_{i=1}^p \u03c9i \u2211_{t=0}^{T(\u03c4)} \u03b3^t \u03c8i(s_{\u03c4,t}, a_{\u03c4,t}).\n\nIn order to retain only maxima we need to impose that the Hessian is negative definite. Furthermore,\nwe aim to find the reward function that best represents the optimal policy parametrization in the sense\nthat even a slight change of the parameters of the expert\u2019s policy induces a significant degradation of\nthe performance. Geometrically this corresponds to finding the reward function for which the expected\nreturn locally represents the sharpest hyper-paraboloid. These requirements can be enforced using\na Semi-Definite Programming (SDP) approach where the objective is to minimize the maximum\neigenvalue of the Hessian whose eigenvector corresponds to the direction of minimum curvature\n(maximum eigenvalue optimality criterion). This problem is not appealing in practice due to its high\ncomputational burden. Furthermore, it might be the case that the strict negative definiteness constraint\nis never satisfied due to blocked-to-zero eigenvalues (for instance in presence of policy parameters\nthat do not affect the policy performance). In these cases, we can consider maximizing an index of\nthe overall concavity. The trace of the Hessian, being the sum of the eigenvalues, can be used for this\npurpose. This problem can still be defined as an SDP problem (trace optimality criterion). See App. 
C\nfor details.\n\nAlg 1: CR-IRL algorithm.\nInput: D = {(s_{\u03c4i,0}, a_{\u03c4i,0}, . . . , s_{\u03c4i,T(\u03c4i)}, a_{\u03c4i,T(\u03c4i)})}_{i=1}^N a set of expert\u2019s trajectories and parametric expert\u2019s policy \u03c0\u03b8.\nOutput: trace heuristic ECO-R, R^{tr\u2212heu}.\nPhase 1\n1. Estimate d_\u00b5^{\u03c0\u03b8}(s) for the visited state-action pairs using Eq. (4) and compute d_\u00b5^{\u03c0\u03b8}(s, a) = d_\u00b5^{\u03c0\u03b8}(s) \u03c0\u03b8(a|s).\n2. Collect d_\u00b5^{\u03c0\u03b8}(s, a) in the |D| \u00d7 |D| diagonal matrix D_\u00b5^{\u03c0\u03b8} and \u2207\u03b8 log \u03c0\u03b8(s, a) in the |D| \u00d7 k matrix \u2207\u03b8 log \u03c0\u03b8.\n3. Get the set of ECO-Q by computing the null space of matrix \u2207\u03b8 log \u03c0\u03b8^T D_\u00b5^{\u03c0\u03b8} through SVD: \u03a6 = null(\u2207\u03b8 log \u03c0\u03b8^T D_\u00b5^{\u03c0\u03b8}).\nPhase 2\n4. Get the set of ECO-R by applying reward shaping to the set of ECO-Q: \u03a8 = (I \u2212 \u02dc\u03c0\u03b8)\u03a6.\nPhase 3\n5. Apply SVD to orthogonalize \u03a8.\n6. Estimate the policy Hessian for each ECO-R \u03c8i, i = 1, ..., p using equation:a\n\u02c6H\u03b8Ji(\u03b8) = (1/N) \u2211_{j=1}^{N} ( \u2207\u03b8 log p\u03b8(\u03c4j) \u2207\u03b8 log p\u03b8(\u03c4j)^T + H\u03b8 log p\u03b8(\u03c4j) ) ( \u03c8i(\u03c4j) \u2212 b ).\n7. Discard the ECO-R having indefinite Hessian, switch sign for those having positive semidefinite Hessian, compute the traces\nof each Hessian and collect them in the vector tr.\n8. Compute the trace heuristic ECO-R as: R^{tr\u2212heu} = \u03a8\u03c9, \u03c9 = \u2212tr/\u2016tr\u2016_2.\n9. (Optional) Apply penalization to unexplored state-action pairs.\naThe optimal baseline b is provided in [30, 31].\n\nTrace optimality criterion, although less demanding w.r.t. 
the eigenvalue-based one, still displays performance\ndegradation as the number of basis functions increases due to the negative definiteness constraint. Solving\nthe semidefinite programming problem of one of the previous optimality criteria is unfeasible for almost\nall the real world problems. We are interested in formulating a non-SDP problem, which is a surrogate\nof the trace optimality criterion, that can be solved more efficiently (trace heuristic criterion). 
In our framework, the reward function can be expressed as a linear combination of the ECO-R so we can\nrewrite the Hessian as H\u03b8J(\u03b8, \u03c9) = \u2211_{i=1}^p \u03c9i H\u03b8Ji(\u03b8) where Ji(\u03b8) is the expected return considering as\nreward function \u03c8i. We assume that the ECO-R are orthonormal in order to compare them.4 The main\nchallenge is how to select the weight \u03c9 in order to get a (sub-)optimal trace minimizer that preserves\nthe negative semidefinite constraint. From Weyl\u2019s inequality, we get a feasible solution by retaining only\nthe ECO-Rs yielding a semidefinite Hessian and switching sign to those with positive semidefinite Hessian. Our\nheuristic consists in looking for the weights \u03c9 that minimize the trace in this reduced space (in which\nall ECO-R have a negative semidefinite Hessian). Notice that in this way we can lose the optimal\nsolution since the trace minimizer might assign a non-zero weight to an ECO-R with indefinite Hessian.\nFor brevity, we will indicate with tri = tr(H\u03b8Ji(\u03b8)) and tr the vector whose components are tri.\nSDP is no longer needed:\n\nmin_\u03c9 \u03c9^T tr s.t. 
\u2016\u03c9\u2016_2^2 = 1. (6)\n\nThe constraint \u2016\u03c9\u2016_2^2 = 1 ensures that, when the ECO-R are orthonormal, the resulting ECO-R has\nEuclidean norm one. This is a convex programming problem with linear objective function and\nquadratic constraint; the closed form solution can be found with Lagrange multipliers: \u03c9i = \u2212tri/\u2016tr\u2016_2\n(see App. A.2 for the derivation). Refer to Algorithm 1 for a complete overview of CR-IRL (the\ncomputational analysis of CR-IRL is reported in App. E).\nCR-IRL does not assume to know the state space S and the action space A, thus the recovered reward\nis defined only in the state-action pairs visited by the expert along the trajectories in D. When the state\nand action spaces are known, we can complete the reward function also for unexplored state-action\npairs assigning a penalized reward (e.g., a large negative value), otherwise the penalization can be\nperformed online when the recovered reward is used to solve the forward RL problem.\n\n4A normalization condition is necessary since the magnitude of the trace of a matrix can be arbitrarily\nchanged by multiplying the matrix by a constant.\n\n6 Related Work\n\nThere has been a surge of recent interest in improving IRL in order to make it more appealing for\nreal-world applications. We highlight the lines of works that are more related to this paper.\nWe start investigating how IRL literature has faced the problem of designing a suitable reward space.\nAlmost all the IRL approaches share the necessity to define a priori a set of handcrafted features,\nspanning the approximation space of the reward functions. While a good set of basis functions can\ngreatly simplify the IRL problem, a bad choice may significantly harm the performance of any IRL\nalgorithm. 
The Feature construction for Inverse Reinforcement Learning (FIRL) algorithm [17], as far\nas we know, is the only approach that explicitly incorporates the feature construction as an inner step.\nFIRL alternates between optimization and \ufb01tting phases. The optimization phase aims to recover a\nreward function\u2014from the current feature set as a linear projection\u2014such that the associated optimal\npolicy is consistent with the demonstrations. In the \ufb01tting phase new features are created (using a\nregression tree) in order to better explain regions where the old features were too coarse. The method\nproved to be effective achieving also (features) transfer capabilities. However, FIRL requires the\nMDP model to solve the forward problem and the complete optimal policy for the \ufb01tting step in order\nto evaluate the consistency with demonstrations.\nRecent works have indirectly coped with the feature construction problem by exploiting neural\nnetworks [12, 3, 13]. Although effective, the black-box approach does not take into account the MDP\nstructure of the problem. RL has extensively investigated the feature construction for the forward\nproblem both for value function [25, 26, 32, 33] and policy [21] features. In this paper, we have\nfollowed this line of work mixing concepts deriving from policy and value \ufb01elds. We have leveraged\non the policy gradient theorem and on the associated concept of compatible functions to derive\nECO-Q features. First-order necessary conditions have already been used in literature to derive IRL\nalgorithm [9, 34]. However, in both the cases the authors assume a \ufb01xed reward space under which it\nmay not be possible to \ufb01nd a reward for which the expert is optimal. Although there are similarities,\nthis paper exploits \ufb01rst-order optimality to recover the reward basis while the \u201cbest\u201d reward function\nis selected according to a second-order criterion. 
This allows recovering a more robust solution,\novercoming the uncertainty issues raised by the use of first-order information only.\n\n7 Experimental results\n\nWe evaluate CR-IRL against some popular IRL algorithms in both discrete and continuous domains:\nthe Taxi problem (discrete), and the Linear Quadratic Gaussian and Car on the Hill environments\n(continuous). We provide here the most significant results; the full data are reported in App. D.\n\n7.1 Taxi\n\nThe Taxi domain is defined in [35]. We assume the expert plays an \u03b5-Boltzmann policy with fixed \u03b5:\n\n\u03c0_{\u03b8,\u03b5}(a|s) = (1 \u2212 \u03b5) [exp(\u03b8_a^T \u03b6_s) / \u2211_{a'\u2208A} exp(\u03b8_{a'}^T \u03b6_s)] + \u03b5 / |A|,\n\nwhere the policy features \u03b6_s are the following state features: current location, passenger location,\ndestination location, and whether the passenger has already been picked up.\nThis test is meant to compare the learning speed of the reward functions recovered by the considered\nIRL methods when a Boltzmann policy (\u03b5 = 0) is trained with REINFORCE [22]. To evaluate the\nrobustness to imperfect experts, we introduce noise (\u03b5) in the optimal policy. Figure 2 shows that\nCR-IRL, with 100 expert trajectories, outperforms the true reward function in terms of convergence\nspeed regardless of the exploration level. Behavioral Cloning (BC), obtained by recovering the maximum\nlikelihood \u03b5-Boltzmann policy (\u03b5 = 0, 0.1) from the expert\u2019s trajectories, is very susceptible to noise.\nWe also compare the second-order criterion that CR-IRL uses to single out the reward function with\nMaximum Entropy IRL (ME-IRL) [6] and Linear Programming Apprenticeship Learning (LPAL) [5],\nusing the set of ECO-R as reward features (comparisons with different sets of features are reported\nin App. D.2). We can see in Figure 2 that ME-IRL does not perform well when \u03b5 = 0, since the\ntransition model is badly estimated. 
The convergence speed also remains very slow for \u03b5 = 0.1, since\nME-IRL does not guarantee that the recovered reward is a maximum of J. LPAL outputs\nan apprenticeship policy (not a reward function) and, like BC, is very sensitive to noise and to the\nquality of the estimated transition model.\n\n7.2 Linear Quadratic Gaussian Regulator\n\nWe consider the one-dimensional Linear Quadratic Gaussian regulator [36] with an expert playing a\nGaussian policy \u03c0_K(\u00b7|s) \u223c N(Ks, \u03c3^2), where K is the parameter and \u03c3^2 is fixed.\n\nFigure 2: Average return of the Taxi problem as a function of the number of iterations of REINFORCE.\n\nFigure 3: Parameter value of LQG as a function of the number of iterations of REINFORCE.\n\nFigure 4: Average return of Car on the Hill as a function of the number of FQI iterations.\n\nWe compare CR-IRL with GIRL [9] using two linear parametrizations of the reward function:\nR(s, a, \u03c9) = \u03c9_1 s^2 + \u03c9_2 a^2 (GIRL-square) and R(s, a, \u03c9) = \u03c9_1 |s| + \u03c9_2 |a| (GIRL-abs-val). Figure 3\nshows the value of the parameter K learned with REINFORCE using a Gaussian policy with variance\n\u03c3^2 = 0.01. We notice that CR-IRL, fed with 20 expert trajectories, converges closer and faster to the\nexpert\u2019s parameter than the true reward, the advantage function, and GIRL with both parametrizations.\n\n7.3 Car on the Hill\n\nWe further test CR-IRL in the continuous Car on the Hill domain [37]. We build the optimal\npolicy via FQI [37] and consider a noisy expert policy in which a random action is selected with\nprobability \u03b5 = 0.1. We use 20 expert trajectories to estimate the parameters w of a Gaussian\npolicy \u03c0_w(a|s) \u223c N(y_w(s), \u03c3^2), where the mean y_w(s) is a radial basis function network (details\nand a comparison with \u03b5 = 0.2 are given in App. D.4). 
The reward function recovered by CR-IRL does\nnot necessarily need to be used only with policy gradient approaches. Here we compare the average\nreturn as a function of the number of iterations of FQI, fed with the different recovered rewards.\nFigure 4 shows that FQI converges faster to optimal policies when coupled with the reward recovered\nby CR-IRL than with the original reward. Moreover, it outperforms the\npolicy recovered via BC.\n\n8 Conclusions\n\nWe presented an algorithm, CR-IRL, that leverages the policy gradient to recover, from a set\nof expert demonstrations, a reward function that explains the expert\u2019s behavior and penalizes\ndeviations. Unlike a large part of the IRL literature, CR-IRL does not require specifying a priori\nan approximation space for the reward function. The empirical results show (quite unexpectedly)\nthat the reward function recovered by our algorithm allows learning policies that outperform both\nbehavioral cloning and those obtained with the true reward function (in terms of learning speed). Furthermore,\nthe Hessian trace heuristic criterion, when applied to ECO-R, outperforms classic IRL methods.\n\n[Figure 2 plots: panels \u03b5 = 0 and \u03b5 = 0.1; axes: iteration vs. average return; curves: Reward, CR-IRL, ME-IRL, LPAL, BC (\u03b5 = 0), BC (\u03b5 = 0.1), Expert. Figure 3 plot: iteration vs. parameter; curves: Reward, Advantage, GIRL-abs-val, GIRL-square, CR-IRL, Expert. Figure 4 plot: iteration vs. average return; curves: CR-IRL, BC, Expert, Reward.]\n\nAcknowledgments\n\nThis research was supported in part by the French Ministry of Higher Education and Research, the Nord-Pas-de-Calais\nRegional Council, and the French National Research Agency (ANR) under project ExTra-Learn\n(n.ANR-14-CE24-0010-01).\n\nReferences\n[1] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. 
Robotics and Autonomous Systems, 57(5):469\u2013483, 2009.\n\n[2] Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663\u2013670, 2000.\n\n[3] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, pages 4565\u20134573, 2016.\n\n[4] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, page 1. ACM, 2004.\n\n[5] Umar Syed, Michael H. Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In ICML, volume 307 of ACM International Conference Proceeding Series, pages 1032\u20131039. ACM, 2008.\n\n[6] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433\u20131438. Chicago, IL, USA, 2008.\n\n[7] Nathan D. Ratliff, David Silver, and J. Andrew Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25\u201353, 2009.\n\n[8] Jonathan Ho, Jayesh K. Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 2760\u20132769. JMLR.org, 2016.\n\n[9] Matteo Pirotta and Marcello Restelli. Inverse reinforcement learning through policy gradient minimization. In AAAI, pages 1993\u20131999, 2016.\n\n[10] Edouard Klein, Bilal Piot, Matthieu Geist, and Olivier Pietquin. A cascaded supervised learning approach to inverse reinforcement learning. In ECML/PKDD (1), volume 8188 of Lecture Notes in Computer Science, pages 1\u201316. Springer, 2013.\n\n[11] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted and reward-regularized classification for apprenticeship learning. In AAMAS, pages 1249\u20131256. IFAAMAS/ACM, 2014.\n\n[12] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. 
In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 49\u201358. JMLR.org, 2016.\n\n[13] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Learning from demonstrations for real world reinforcement learning. CoRR, abs/1704.03732, 2017.\n\n[14] Nathan D. Ratliff, J. Andrew Bagnell, and Martin Zinkevich. Maximum margin planning. In ICML, volume 148 of ACM International Conference Proceeding Series, pages 729\u2013736. ACM, 2006.\n\n[15] Julien Audiffren, Michal Valko, Alessandro Lazaric, and Mohammad Ghavamzadeh. Maximum entropy semi-supervised inverse reinforcement learning. In IJCAI, pages 3315\u20133321. AAAI Press, 2015.\n\n[16] Gergely Neu and Csaba Szepesv\u00e1ri. Training parsers by inverse reinforcement learning. Machine Learning, 77(2-3):303\u2013337, 2009.\n\n[17] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Feature construction for inverse reinforcement learning. In NIPS, pages 1342\u20131350. Curran Associates, Inc., 2010.\n\n[18] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. 1994.\n\n[19] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278\u2013287, 1999.\n\n[20] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer New York, 2006.\n\n[21] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057\u20131063. The MIT Press, 1999.\n\n[22] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229\u2013256, 1992.\n\n[23] Jan Peters and Stefan Schaal. 
Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682\u2013697, 2008.\n\n[24] Wendelin B\u00f6hmer, Steffen Gr\u00fcnew\u00e4lder, Yun Shen, Marek Musial, and Klaus Obermayer. Construction of approximation spaces for reinforcement learning. Journal of Machine Learning Research, 14(1):2067\u20132118, 2013.\n\n[25] Sridhar Mahadevan. Proto-value functions: Developmental reinforcement learning. In ICML, pages 553\u2013560. ACM, 2005.\n\n[26] Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(Oct):2169\u20132231, 2007.\n\n[27] Sridhar Mahadevan, Mauro Maggioni, Kimberly Ferguson, and Sarah Osentoski. Learning representation and control in continuous Markov decision processes. In AAAI, volume 6, pages 1194\u20131199, 2006.\n\n[28] Sham Kakade. A natural policy gradient. In NIPS, pages 1531\u20131538. MIT Press, 2001.\n\n[29] Thomas Furmston and David Barber. A unifying perspective of parametric policy search methods for Markov decision processes. In NIPS, pages 2717\u20132725, 2012.\n\n[30] Giorgio Manganini, Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Following Newton direction in policy gradient with parameter exploration. In International Joint Conference on Neural Networks (IJCNN), pages 1\u20138. IEEE, 2015.\n\n[31] Simone Parisi, Matteo Pirotta, and Marcello Restelli. Multi-objective reinforcement learning through continuous Pareto manifold approximation. Journal of Artificial Intelligence Research, 57:187\u2013227, 2016.\n\n[32] Ronald Parr, Christopher Painter-Wakefield, Lihong Li, and Michael L. Littman. Analyzing feature generation for value-function approximation. In ICML, volume 227 of ACM International Conference Proceeding Series, pages 737\u2013744. 
ACM, 2007.\n\n[33] Amir Massoud Farahmand and Doina Precup. Value pursuit iteration. In NIPS, pages 1349\u20131357, 2012.\n\n[34] Peter Englert and Marc Toussaint. Inverse KKT \u2013 learning cost functions of manipulation tasks from demonstrations. In Proceedings of the International Symposium of Robotics Research, 2015.\n\n[35] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227\u2013303, 2000.\n\n[36] Peter Dorato, Vito Cerone, and Chaouki Abdallah. Linear Quadratic Control: An Introduction. Krieger Publishing Co., Inc., Melbourne, FL, USA, 2000.\n\n[37] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503\u2013556, 2005.\n\n[38] C.-L. Hwang and Abu Syed Md Masud. Multiple objective decision making \u2013 methods and applications: a state-of-the-art survey, volume 164. Springer Science & Business Media, 2012.\n\n[39] Jos\u00e9 M. Vidal. Fundamentals of multiagent systems. 2006.\n\n[40] Emre Mengi, E. Alper Yildirim, and Mustafa Kilic. Numerical optimization of eigenvalues of Hermitian matrix functions. SIAM Journal on Matrix Analysis and Applications, 35(2):699\u2013724, 2014.\n\n[41] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.\n", "award": [], "sourceid": 1247, "authors": [{"given_name": "Alberto Maria", "family_name": "Metelli", "institution": "Politecnico di Milano"}, {"given_name": "Matteo", "family_name": "Pirotta", "institution": "INRIA Lille-Nord Europe"}, {"given_name": "Marcello", "family_name": "Restelli", "institution": "Politecnico di Milano"}]}