{"title": "Lifelong Inverse Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4502, "page_last": 4513, "abstract": "Methods for learning from demonstration (LfD) have shown success in acquiring behavior policies by imitating a user. However, even for a single task, LfD may require numerous demonstrations. For versatile agents that must learn many tasks via demonstration, this process would substantially burden the user if each task were learned in isolation. To address this challenge, we introduce the novel problem of lifelong learning from demonstration, which allows the agent to continually build upon knowledge learned from previously demonstrated tasks to accelerate the learning of new tasks, reducing the amount of demonstrations required. As one solution to this problem, we propose the first lifelong learning approach to inverse reinforcement learning, which learns consecutive tasks via demonstration, continually transferring knowledge between tasks to improve performance.", "full_text": "Lifelong Inverse Reinforcement Learning\n\nJorge A. Mendez, Shashank Shivkumar, and Eric Eaton\n\nDepartment of Computer and Information Science\n\nUniversity of Pennsylvania\n\n{mendezme,shashs,eeaton}@seas.upenn.edu\n\nAbstract\n\nMethods for learning from demonstration (LfD) have shown success in acquiring\nbehavior policies by imitating a user. However, even for a single task, LfD may\nrequire numerous demonstrations. For versatile agents that must learn many tasks\nvia demonstration, this process would substantially burden the user if each task\nwere learned in isolation. To address this challenge, we introduce the novel problem\nof lifelong learning from demonstration, which allows the agent to continually\nbuild upon knowledge learned from previously demonstrated tasks to accelerate\nthe learning of new tasks, reducing the amount of demonstrations required. 
As one solution to this problem, we propose the first lifelong learning approach to inverse reinforcement learning, which learns consecutive tasks via demonstration, continually transferring knowledge between tasks to improve performance.

1 Introduction

In many applications, such as personal robotics or intelligent virtual assistants, a user may want to teach an agent to perform some sequential decision-making task. Often, the user may be able to demonstrate the appropriate behavior, allowing the agent to learn the customized task through imitation. Research in inverse reinforcement learning (IRL) [29, 1, 43, 21, 31, 28] has shown success with framing the learning from demonstration (LfD) problem as optimizing a utility function from user demonstrations. IRL assumes that the user acts to optimize some reward function in performing the demonstrations, even if they cannot explicitly specify that reward function as in typical reinforcement learning (RL).¹ IRL seeks to recover this reward function from demonstrations, and then use it to train an optimal policy. Learning the reward function instead of merely copying the user's policy provides the agent with a portable representation of the task. Most IRL approaches have focused on an agent learning a single task. However, as AI systems become more versatile, it is increasingly likely that the agent will be expected to learn multiple tasks over its lifetime. If it learned each task in isolation, this process would cause a substantial burden on the user to provide numerous demonstrations.

To address this challenge, we introduce the novel problem of lifelong learning from demonstration, in which an agent will face multiple consecutive LfD tasks and must optimize its overall performance. By building upon its knowledge from previous tasks, the agent can reduce the number of user demonstrations needed to learn a new task.
As one illustrative example, consider a personal service robot learning to perform household chores from its human owner. Initially, the human might want to teach the robot to load the dishwasher by providing demonstrations of the task. At a later time, the user could teach the robot to set the dining table. These tasks are clearly related since they involve manipulating dinnerware and cutlery, and so we would expect the robot to leverage any relevant knowledge obtained from loading the dishwasher while setting the table for dinner. Additionally, we would hope the robot could improve its understanding of the dishwasher task with any additional knowledge it gains from setting the dining table. Over the robot's lifetime of many tasks, the ability to share knowledge between demonstrated tasks would substantially accelerate learning.

¹ Complex RL tasks require similarly complex reward functions, which are often hand-coded. This hand-coding would be very cumbersome for most users, making demonstrations better for training novel behavior.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We frame lifelong LfD as an online multi-task learning problem, enabling the agent to accelerate learning by transferring knowledge among tasks. This transfer can be seen as exploiting the underlying relations among different reward functions (e.g., breaking a wine glass is always undesired). Although lifelong learning has been studied in classification, regression, and RL [10, 34, 4], this is the first study of lifelong learning for IRL. Our framework wraps around existing IRL methods, performing lifelong function approximation of the learned reward functions. As an instantiation of our framework, we propose the Efficient Lifelong IRL (ELIRL) algorithm, which adapts Maximum Entropy (MaxEnt) IRL [43] into a lifelong learning setting.
We show that ELIRL can successfully transfer knowledge between IRL tasks to improve performance, and this improvement increases as it learns more tasks. It significantly outperforms the base learner, MaxEnt IRL, with little additional cost, and can achieve equivalent or better performance than IRL via Gaussian processes with far less computational cost.

2 Related Work

The IRL problem is under-defined, so approaches use different means of identifying which reward function best explains the observed trajectories. Among these, maximum margin IRL methods [29, 1] choose the reward function that most separates the optimal policy and the second-best policy. Variants of these methods have allowed for suboptimal demonstrations [32], non-linear reward functions [35], and game-theoretic learning [37]. Bayesian IRL approaches [31, 30] use prior knowledge to bias the search over reward functions, and can support suboptimal demonstrations [33]. Gradient-based algorithms optimize a loss to learn the reward while, for instance, penalizing deviations from the expert's policy [28]. Maximum entropy models [43, 21, 42] find the most likely reward function given the demonstrations, and produce a policy that matches the user's expected performance without making further assumptions on the preference over trajectories. Other work has avoided learning the reward altogether and focuses instead on modeling the user's policy via classification [27].

Note, however, that all these approaches focus on learning a single IRL task, and do not consider sharing knowledge between multiple tasks. Although other work has focused on multi-task IRL, existing methods either assume that the tasks share a state and action space, or scale poorly due to their computational cost; our approach differs in both respects.
An early approach to multi-task IRL [12] learned different tasks by sampling from a joint prior on the rewards and policies, assuming that the state-action spaces are shared. Tanwani and Billard [38] studied knowledge transfer for learning from multiple experts, by using previously learned reward functions to bootstrap the search when a new expert demonstrates trajectories. Although efficient, their approach does not optimize performance across all tasks, and only considers learning different experts' approaches to one task. The notion of transfer in IRL was also studied in an unsupervised setting [2, 11], where each task is assumed to be generated from a set of hidden intentions. These methods cluster an initial batch of tasks, and upon observing each new task, use the clusters to rapidly learn the corresponding reward function. However, they do not address how to update the clusters after observing a new task. Moreover, these methods assume the state-action space is shared across tasks, and, as an inner loop in the optimization, learn a single policy for all tasks. If the space were not shared, the repeated policy learning would become computationally infeasible for numerous tasks. Most recently, transfer in IRL has been studied for solving the one-shot imitation learning problem [13, 17]. In this setting, the agent is tasked with using knowledge from an initial set of tasks to generalize to a new task given a single demonstration of the new task. The main drawback of these methods is that they require a large batch of tasks available at training time, and so cannot handle tasks arriving sequentially.

Our work is most similar to that by Mangin and Oudeyer [25], which poses the multi-task IRL problem as batch dictionary learning of primitive tasks, but appears to be incomplete and unpublished. Finn et al.
[16] used IRL as a step for transferring knowledge in a lifelong RL setting, but they do not explore lifelong learning specifically for IRL. In contrast to existing work, our method can handle distinct state-action spaces. It is fully online and computationally efficient, enabling it to rapidly learn the reward function for each new task via transfer and then update a shared knowledge repository. New knowledge is transferred in reverse to improve the reward functions of previous tasks (without retraining on these tasks), thereby optimizing all tasks. We achieve this by adapting ideas from lifelong learning in the supervised setting [34], which we show achieves similar benefits in IRL.

3 Inverse Reinforcement Learning

We first describe IRL and the MaxEnt IRL method, before introducing the lifelong IRL problem.

3.1 The Inverse RL Problem

A Markov decision process (MDP) is defined as a tuple $\langle S, A, T, r, \gamma \rangle$, where $S$ is the set of states, $A$ is the set of actions, the transition function $T : S \times A \times S \mapsto [0, 1]$ gives the probability $P(s_{i+1} \mid s_i, a_i)$ that being in state $s_i$ and taking action $a_i$ will yield a next state $s_{i+1}$, $r : S \mapsto \mathbb{R}$ is the reward function², and $\gamma \in [0, 1)$ is the discount factor. A policy $\pi : S \times A \mapsto [0, 1]$ models the distribution $P(a_i \mid s_i)$ over actions the agent should take in any state. When fully specified, an MDP can be solved via linear or dynamic programming for an optimal policy $\pi^*$ that maximizes the rewards earned by the agent: $\pi^* = \operatorname{argmax}_\pi V^\pi$, with $V^\pi = \mathbb{E}_\pi\left[\sum_i \gamma^i r(s_i)\right]$.

In IRL [29], the agent does not know the MDP's reward function, and must infer it from demonstrations $Z = \{\zeta_1, \ldots, \zeta_n\}$ given by an expert user. Each demonstration $\zeta_j$ is a sequence of state-action pairs $[s_{0:H}, a_{0:H}]$ that is assumed to be generated by the user's unknown policy $\hat{\pi}^*$.
Once the reward function is learned, the MDP is complete and so can be solved for the optimal policy $\pi^*$. Given an MDP$\setminus r = \langle S, A, T, \gamma \rangle$ and expert demonstrations $Z$, the goal of IRL is to estimate the unknown reward function $r$ of the MDP. Previous work has defined the optimal reward such that the policy enacted by the user be (near-)optimal under the learned reward ($V^{\pi^*} = V^{\hat{\pi}^*}$), while (nearly) all other actions would be suboptimal. This problem is unfortunately ill-posed, since it has numerous solutions, and so it becomes necessary to make additional assumptions in order to find solutions that generalize well. These various assumptions and the strategies to recover the user's policy have been the focus of previous IRL research. We next focus on the MaxEnt approach to the IRL problem.

3.2 Maximum Entropy IRL

In the maximum entropy (MaxEnt) algorithm for IRL [43], each state $s_i$ is represented by a feature vector $x_{s_i} \in \mathbb{R}^d$. Each demonstrated trajectory $\zeta_j$ gives a feature count $x_{\zeta_j} = \sum_{i=0}^{H} \gamma^i x_{s_i}$, giving an approximate expected feature count $\tilde{x} = \frac{1}{n} \sum_j x_{\zeta_j}$ that must be matched by the agent's policy to satisfy the condition $V^{\pi^*} = V^{\hat{\pi}^*}$. The reward function is represented as a parameterized linear function with weight vector $\theta \in \mathbb{R}^d$ as $r_{s_i} = r(x_{s_i}, \theta) = \theta^\top x_{s_i}$, and so the cumulative reward of a trajectory $\zeta_j$ is given by $r_{\zeta_j} = r(x_{\zeta_j}, \theta) = \sum_{s_i \in \zeta_j} \gamma^i \theta^\top x_{s_i} = \theta^\top x_{\zeta_j}$.

The algorithm deals with the ambiguity of the IRL problem in a probabilistic way, by assuming that the user acts according to a MaxEnt policy. In this setting, the probability of a trajectory is given as $P(\zeta_j \mid \theta, T) \approx \frac{1}{Z(\theta, T)} \exp(r_{\zeta_j}) \prod_{(s_i, a_i, s_{i+1}) \in \zeta_j} T(s_{i+1} \mid s_i, a_i)$, where $Z(\theta, T)$ is the partition function, and the approximation comes from assuming that the transition uncertainty has little effect on behavior.
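To make these quantities concrete, the feature counts, the linear trajectory reward, and the log-likelihood gradient of this section can be sketched in NumPy. This is a minimal illustration under our own naming, with a tabular finite-horizon forward-backward pass in the style of Ziebart et al. [43]; it is not the authors' code.

```python
import numpy as np

def feature_counts(trajectory, features, gamma):
    """Discounted feature count x_zeta = sum_i gamma^i x_{s_i} of one demonstration.
    `trajectory` is a list of (state, action) pairs; `features` is an (S, d) matrix."""
    return sum(gamma**i * features[s] for i, (s, _) in enumerate(trajectory))

def trajectory_reward(trajectory, features, theta, gamma):
    """Cumulative linear reward r_zeta = theta^T x_zeta."""
    return theta @ feature_counts(trajectory, features, gamma)

def maxent_gradient(theta, features, P, demos, p0, horizon, gamma=1.0):
    """Gradient of the MaxEnt log-likelihood, x_tilde - sum_s D_s x_s, using the
    state visitation frequencies D_s from a forward-backward pass.
    P is an (S, A, S) transition tensor; p0 is the initial-state distribution."""
    S, A, _ = P.shape
    r = features @ theta
    # Backward pass (soft value iteration): partition-function recursion.
    Zs = np.ones(S)
    for _ in range(horizon):
        Za = np.exp(r)[:, None] * np.einsum('sat,t->sa', P, Zs)
        Zs = Za.sum(axis=1)
    policy = Za / Zs[:, None]          # MaxEnt stochastic policy pi(a | s)
    # Forward pass: accumulate expected state visitation frequencies D_s.
    d_t = p0.copy()
    D = d_t.copy()
    for _ in range(horizon - 1):
        d_t = np.einsum('s,sa,sat->t', d_t, policy, P)
        D += d_t
    x_tilde = sum(feature_counts(z, features, gamma) for z in demos) / len(demos)
    return x_tilde - D @ features
```

At the maximum of the concave objective this gradient vanishes, i.e., the expert's and the policy's feature counts match.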
This distribution does not prefer any trajectory over another with the same reward, and exponentially prefers trajectories with higher rewards. The IRL problem is then solved by maximizing the likelihood of the observed trajectories: $\theta^* = \operatorname{argmax}_\theta \log P(Z \mid \theta) = \operatorname{argmax}_\theta \sum_{\zeta_j \in Z} \log P(\zeta_j \mid \theta, T)$. The gradient of the log-likelihood is the difference between the user's and the agent's feature expectations, $\tilde{x} - \sum_{\tilde{\zeta} \in Z_{\mathrm{MDP}}} P(\tilde{\zeta} \mid \theta, T)\, x_{\tilde{\zeta}}$, where $Z_{\mathrm{MDP}}$ is the set of all possible trajectories; it can be expressed in terms of the state visitation frequencies $D_s$ as $\tilde{x} - \sum_{s \in S} D_s x_s$. The $D_s$ can be computed efficiently via a forward-backward algorithm [43]. The maximum of this concave objective is then achieved when the feature counts match, and so $V^{\pi^*} = V^{\hat{\pi}^*}$.

² Although we typically notate functions as uppercase non-bold symbols, we notate the reward function as $r$, since primarily it will be represented as a parameterized function of the state features and a target for learning.

4 The Lifelong Inverse RL Problem

We now introduce the novel problem of lifelong IRL. In contrast to most previous work on IRL, which focuses on single-task learning, this paper focuses on online multi-task IRL. Formally, in the lifelong learning setting, the agent faces a sequence of IRL tasks $T^{(1)}, \ldots, T^{(N_{\max})}$, each of which is an MDP$\setminus r$: $T^{(t)} = \langle S^{(t)}, A^{(t)}, T^{(t)}, \gamma^{(t)} \rangle$. The agent will learn tasks consecutively, receiving multiple expert demonstrations for each task before moving on to the next. We assume that a priori the agent does not know the total number of tasks $N_{\max}$, their distribution, or the order of the tasks.
The agent's goal is to learn a set of reward functions $R = \{r(\theta^{(1)}), \ldots, r(\theta^{(N_{\max})})\}$ with a corresponding set of parameters $\Theta = \{\theta^{(1)}, \ldots, \theta^{(N_{\max})}\}$. At any time, the agent may be evaluated on any previous task, and so must strive to optimize its performance for all tasks $T^{(1)}, \ldots, T^{(N)}$, where $N$ denotes the number of tasks seen so far ($1 \le N \le N_{\max}$). Intuitively, when the IRL tasks are related, knowledge transfer between their reward functions has the potential to improve the learned reward function for each task and reduce the number of expert demonstrations needed.

After $N$ tasks, the agent must optimize the likelihood of all observed trajectories over those tasks:

$$\max_{r^{(1)}, \ldots, r^{(N)}} \; P\left(r^{(1)}, \ldots, r^{(N)}\right) \prod_{t=1}^{N} \left( \prod_{j=1}^{n_t} P\left(\zeta_j \mid r^{(t)}\right) \right)^{\frac{1}{n_t}} , \qquad (1)$$

where $P(r^{(1)}, \ldots, r^{(N)})$ is a reward prior to encourage relationships among the reward functions, and each task is given equal importance by weighting it by the number of associated trajectories $n_t$.

5 Lifelong Inverse Reinforcement Learning

The key idea of our framework is to use lifelong function approximation to represent the reward functions for all tasks, enabling continual online transfer between the reward functions with efficient per-task updates. Intuitively, this framework exploits the fact that certain aspects of the reward functions are often shared among different (but related) tasks, such as the negative reward a service robot might receive for dropping objects. We assume the reward functions $r^{(t)}$ for the different tasks are related via a latent basis of reward components $L$. These components can be used to reconstruct the true reward functions via a sparse combination of such components with task-specific coefficients $s^{(t)}$, using $L$ as a mechanism for transfer that has shown success in previous work [19, 26].

This section develops our framework for lifelong IRL, instantiating it following the MaxEnt approach to yield the ELIRL algorithm.
Although we focus on MaxEnt IRL, ELIRL can easily be adapted to other IRL approaches, as shown in Appendix D. We demonstrate the merits of the novel lifelong IRL problem by showing that 1) transfer between IRL tasks can significantly increase their accuracy and 2) this transfer can be achieved by adapting ideas from lifelong learning in supervised settings.

5.1 The Efficient Lifelong IRL Algorithm

As described in Section 4, the lifelong IRL agent must optimize its performance over all IRL tasks observed so far. Using the MaxEnt assumption that the reward function $r^{(t)}_{s_i} = \theta^{(t)\top} x^{(t)}_{s_i}$ for each task is linear and parameterized by $\theta^{(t)} \in \mathbb{R}^d$, we can factorize these parameters into a linear combination $\theta^{(t)} = L s^{(t)}$ to facilitate transfer between parametric models, following Kumar and Daumé [19] and Maurer et al. [26]. The matrix $L \in \mathbb{R}^{d \times k}$ represents a set of $k$ latent reward vectors that are shared between all tasks, with sparse task-specific coefficients $s^{(t)} \in \mathbb{R}^k$ to reconstruct $\theta^{(t)}$.

Using this factorized representation to facilitate transfer between tasks, we place a Laplace prior on the $s^{(t)}$'s to encourage them to be sparse, and a Gaussian prior on $L$ to control its complexity, thereby encouraging the reward functions to share structure. This gives rise to the following reward prior:

$$P\left(r^{(1)}, \ldots, r^{(N)}\right) = \frac{1}{Z(\lambda, \mu)} \exp\left(-\lambda N \|L\|_F^2\right) \prod_{t=1}^{N} \exp\left(-\mu \|s^{(t)}\|_1\right) , \qquad (2)$$

where $Z(\lambda, \mu)$ is the partition function, which has no effect on the optimization. We can substitute the prior in Equation 2 along with the MaxEnt likelihood into Equation 1.
After taking logs and re-arranging terms, this yields the equivalent objective:

$$\min_{L} \frac{1}{N} \sum_{t=1}^{N} \min_{s^{(t)}} \left\{ -\frac{1}{n_t} \sum_{\zeta^{(t)}_j \in Z^{(t)}} \log P\left(\zeta^{(t)}_j \mid L s^{(t)}, T^{(t)}\right) + \mu \|s^{(t)}\|_1 \right\} + \lambda \|L\|_F^2 . \qquad (3)$$

Note that Equation 3 is separably, but not jointly, convex in $L$ and the $s^{(t)}$'s; typical multi-task approaches would optimize similar objectives [19, 26] using alternating optimization.

To enable Equation 3 to be solved online when tasks are observed consecutively, we adapt concepts from the lifelong learning literature. Ruvolo and Eaton [34] approximate a multi-task objective with a similar form to Equation 3 online as a series of efficient online updates. Note, however, that their approach is designed for the supervised setting, using a general-purpose supervised loss function in place of the MaxEnt negative log-likelihood in Equation 3, but with a similar factorization of the learned parametric models. Following their approach but substituting in the IRL loss function, for each new task $t$, we can take a second-order Taylor expansion around the single-task point estimate $\alpha^{(t)} = \operatorname{argmin}_\alpha -\sum_{\zeta^{(t)}_j \in Z^{(t)}} \log P\left(\zeta^{(t)}_j \mid \alpha, T^{(t)}\right)$, and then simplify to reformulate Equation 3 as

$$\min_{L} \frac{1}{N} \sum_{t=1}^{N} \min_{s^{(t)}} \left\{ \left(\alpha^{(t)} - L s^{(t)}\right)^\top H^{(t)} \left(\alpha^{(t)} - L s^{(t)}\right) + \mu \|s^{(t)}\|_1 \right\} + \lambda \|L\|_F^2 , \qquad (4)$$

where the Hessian $H^{(t)}$ of the MaxEnt negative log-likelihood is given by (derivation in Appendix A):

$$H^{(t)} = \frac{1}{n_t} \nabla^2_{\theta,\theta}\, \mathcal{L}\left(r\left(L s^{(t)}\right), Z^{(t)}\right) = \sum_{\tilde{\zeta} \in Z_{\mathrm{MDP}}} x_{\tilde{\zeta}} x^\top_{\tilde{\zeta}}\, P(\tilde{\zeta} \mid \theta) - \left(\sum_{\tilde{\zeta} \in Z_{\mathrm{MDP}}} x_{\tilde{\zeta}}\, P(\tilde{\zeta} \mid \theta)\right) \left(\sum_{\tilde{\zeta} \in Z_{\mathrm{MDP}}} x^\top_{\tilde{\zeta}}\, P(\tilde{\zeta} \mid \theta)\right) . \qquad (5)$$

Since $H^{(t)}$ is non-linear in the feature counts, we cannot make use of the state visitation frequencies obtained for the MaxEnt gradient in the lifelong learning setting. This creates the need for obtaining a sample-based approximation. We first solve the MDP for an optimal policy $\pi_{\alpha^{(t)}}$ from the parameterized reward learned by single-task MaxEnt. We compute the feature counts for a fixed number of finite horizon paths by following the stochastic policy $\pi_{\alpha^{(t)}}$. We then obtain the sample covariance of the feature counts of the paths as an approximation of the true covariance in Equation 5.

Given each new consecutive task $t$, we first estimate $\alpha^{(t)}$ as described above. Then, Equation 4 can be approximated online as a series of efficient update equations [34]:

$$s^{(t)} \leftarrow \operatorname{argmin}_{s}\; \ell\left(L_N, s, \alpha^{(t)}, H^{(t)}\right) \qquad L_{N+1} \leftarrow \operatorname{argmin}_{L}\; \lambda \|L\|_F^2 + \frac{1}{N} \sum_{t=1}^{N} \ell\left(L, s^{(t)}, \alpha^{(t)}, H^{(t)}\right) , \qquad (6)$$

where $\ell(L, s, \alpha, H) = \mu \|s\|_1 + (\alpha - Ls)^\top H (\alpha - Ls)$, and $L$ can be built incrementally in practice (see [34] for details). Critically, this online approximation removes the dependence of Equation 3 on the numbers of training samples and tasks, making it scalable for lifelong learning, and provides guarantees on its convergence with equivalent performance to the full multi-task objective [34]. Note that the $s^{(t)}$ coefficients are only updated while training on task $t$ and otherwise remain fixed.

This process yields the estimated reward function as $r^{(t)}_{s_i} = \left(L s^{(t)}\right)^\top x_{s_i}$. We can then solve the now-complete MDP for the optimal policy using standard RL. The complete ELIRL algorithm is given as Algorithm 1.
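The per-task updates in Equation 6 can be sketched as follows. This is a minimal NumPy illustration under our own naming: a proximal-gradient (ISTA) loop stands in for a LASSO solver in the $s^{(t)}$ step, the Hessian is approximated by the sample covariance of feature counts as described above, and the $L$ step uses a closed-form solve in the style of Ruvolo and Eaton's ELLA [34]. It is a sketch, not the authors' implementation.

```python
import numpy as np

def sample_hessian(feature_count_samples):
    """Sample-based approximation of H^(t) in Eq. (5): the covariance of the
    feature counts of trajectories drawn from the single-task MaxEnt policy."""
    X = np.asarray(feature_count_samples)
    return np.cov(X, rowvar=False, bias=True)

def update_s(L, alpha, H, mu, iters=500, lr=1e-2):
    """Sparse-code task t against the basis:
    argmin_s (alpha - Ls)^T H (alpha - Ls) + mu ||s||_1, via ISTA."""
    s = np.zeros(L.shape[1])
    for _ in range(iters):
        grad = -2.0 * L.T @ H @ (alpha - L @ s)
        s = s - lr * grad
        s = np.sign(s) * np.maximum(np.abs(s) - lr * mu, 0.0)  # soft-threshold
    return s

def update_L(alphas, hessians, S, lam):
    """Closed-form minimizer of lam ||L||_F^2 +
    (1/N) sum_t (alpha_t - L s_t)^T H_t (alpha_t - L s_t), via the
    vectorization identity (s s^T kron H) vec(L) = vec(H L s s^T)."""
    d, k, N = alphas[0].shape[0], S[0].shape[0], len(alphas)
    A = lam * np.eye(d * k)
    b = np.zeros(d * k)
    for a, H, s in zip(alphas, hessians, S):
        A += np.kron(np.outer(s, s), H) / N
        b += (H @ np.outer(a, s)).reshape(-1, order='F') / N
    return np.linalg.solve(A, b).reshape(d, k, order='F')
```

In practice [34] maintains the sums inside `update_L` incrementally, so each new task only adds its own term rather than looping over all previous tasks.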
ELIRL can either support a common feature space across tasks, or can support different feature spaces across tasks by making use of prior work in autonomous cross-domain transfer [3], as shown in Appendix C.

Algorithm 1 ELIRL ($k$, $\lambda$, $\mu$)
  $L \leftarrow$ RandomMatrix$_{d,k}$
  while some task $T^{(t)}$ is available do
    $Z^{(t)} \leftarrow$ getExampleTrajectories($T^{(t)}$)
    $\alpha^{(t)}, H^{(t)} \leftarrow$ inverseReinforcementLearner($Z^{(t)}$)
    $s^{(t)} \leftarrow \operatorname{argmin}_s (\alpha^{(t)} - Ls)^\top H^{(t)} (\alpha^{(t)} - Ls) + \mu \|s\|_1$
    $L \leftarrow$ updateL($L, s^{(t)}, \alpha^{(t)}, H^{(t)}, \lambda$)
  end while

5.2 Improving Performance on Earlier Tasks

As ELIRL is trained over multiple IRL tasks, it gradually refines the shared knowledge in $L$. Since each reward function's parameters are modeled as $\theta^{(t)} = L s^{(t)}$, subsequent changes to $L$ after training on task $t$ can affect $\theta^{(t)}$. Typically, this process improves performance in lifelong learning [34], but it might occasionally decrease performance through negative transfer, due to the ELIRL simplifications restricting that $s^{(t)}$ is fixed except when training on task $t$. To prevent this problem, we introduce a novel technique. Whenever ELIRL is tested on a task $t$, it can either directly use the $\theta^{(t)}$ vector obtained from $L s^{(t)}$, or optionally repeat the optimization step for $s^{(t)}$ in Equation 6 to account for potential major changes in the $L$ matrix since the last update to $s^{(t)}$. This latter optional step only involves running an instance of the LASSO, which is highly efficient. Critically, it does not require either re-running MaxEnt or recomputing the Hessian, since the optimization is always done around the optimal single-task parameters, $\alpha^{(t)}$.
Consequently, ELIRL can pay a small cost to do this optimization when it is faced with performing on a previous task, but it gains potentially improved performance on that task by benefiting from up-to-date knowledge in $L$, as shown in our results.

5.3 Computational Complexity

The addition of a new task to ELIRL requires an initial run of single-task MaxEnt to obtain $\alpha^{(t)}$, which we assume to be of order $O(i\, \xi(d, |A|, |S|))$, where $i$ is the number of iterations required for MaxEnt to converge. The next step is computing the Hessian, which costs $O(MH + Md^2)$, where $M$ is the number of trajectories sampled for the approximation and $H$ is their horizon. Finally, the complexity of the update steps for $L$ and $s^{(t)}$ is $O(k^2 d^3)$ [34]. This yields a total per-task cost of $O(i\, \xi(d, |A|, |S|) + MH + Md^2 + k^2 d^3)$ for ELIRL. The optional step of re-updating $s^{(t)}$ when needing to perform on task $t$ would incur a computational cost of $O(d^3 + kd^2 + dk^2)$ for constructing the target of the optimization and running LASSO [34].

Notably, there is no dependence on the number of tasks $N$, which is precisely what makes ELIRL suitable for lifelong learning. Since IRL in general requires finding the optimal policy for different choices of the reward function as an inner loop in the optimization, the additional dependence on $N$ would make any IRL method intractable in a lifelong setting. Moreover, the only step that depends on the size of the state and action spaces is single-task MaxEnt. Thus, for high-dimensional tasks (e.g., robotics tasks), replacing the base learner would allow our algorithm to scale gracefully.

5.4 Theoretical Convergence Guarantees

ELIRL inherits the theoretical guarantees shown by Ruvolo and Eaton [34]. Specifically, the optimization is guaranteed to converge to a local optimum of the approximate cost function in Equation 4 as the number of tasks grows large.
Intuitively, the quality of this approximation depends on how much the factored representation $\theta^{(t)} = L s^{(t)}$ deviates from $\alpha^{(t)}$, which in turn depends on how well this representation can capture the task relatedness. However, we emphasize that this approximation is what allows the method to solve the multi-task learning problem online, and it has been shown empirically in the contexts of supervised learning [34] and RL [4] that this approximate solution can achieve equivalent performance to exact multi-task learning in a variety of problems.

6 Experimental Results

We evaluated ELIRL on two environments, chosen to allow us to create arbitrarily many tasks with distinct reward functions. This also gives us known rewards as ground truth. No previous multi-task IRL method was tested on such a large task set, nor on tasks with varying state spaces as we do.

Objectworld: Similar to the environment presented by Levine et al. [21], Objectworld is a 32 × 32 grid populated by colored objects in random cells. Each object has one of five outer colors and one of two inner colors, and induces a constant reward on its surrounding 5 × 5 grid. We generated 100 tasks by randomly choosing 2–4 outer colors, and assigning to each a reward sampled uniformly from [−10, 5]; the inner colors are distractor features. The agent's goal is then to move toward objects with "good" (positive) colors and away from objects with "bad" (negative) colors. Ideally, each column of $L$ would learn the impact field around one color, and the $s^{(t)}$'s would encode how good or bad each color is in each task. There are $d = 31 \times (5 + 2)$ features, representing the distance to the nearest object with each outer and inner color, discretized as binary indicators of whether the distance is less than 1–31.
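For illustration, the distance-threshold features just described might be constructed as follows. This is a sketch under our own assumptions (in particular, Manhattan distance on the grid, which the text does not specify), with hypothetical names, not the authors' code.

```python
import numpy as np

def objectworld_features(agent_pos, objects, n_colors=7, max_dist=31):
    """Binary distance-threshold features: for each color c and each threshold
    t in 1..max_dist, indicate whether the nearest object of color c is closer
    than t. `objects` maps a color index to a list of (x, y) positions.
    Gives d = max_dist * n_colors features (217 for Objectworld)."""
    phi = np.zeros((n_colors, max_dist))
    for c in range(n_colors):
        positions = objects.get(c, [])
        if not positions:
            continue  # no object of this color: all indicators stay 0
        # Manhattan distance to the nearest object of color c (assumption).
        nearest = min(abs(agent_pos[0] - x) + abs(agent_pos[1] - y)
                      for x, y in positions)
        phi[c] = nearest < np.arange(1, max_dist + 1)
    return phi.reshape(-1)
```

With five outer and two inner colors this yields the stated $d = 31 \times 7$ feature vector.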
The agent can choose to move along the four cardinal directions or stay in place.

Highway: Highway simulations have been used to test various IRL methods [1, 21]. We simulate the behavior of 100 different drivers on a three-lane highway in which they can drive at four speeds. Each driver prefers either the left or the right lane, and either the second or fourth speed. Each driver's weight for those two factors is sampled uniformly from [0, 5]. Intuitively, each column of $L$ should learn a speed or lane, and the $s^{(t)}$'s should encode the drivers' preferences over them. There are $d = 4 + 3 + 64$ features, representing the current speed and lane, and the distances to the nearest cars in each lane in front and back, discretized in the same manner as Objectworld. Each time step, drivers can choose to move left or right, speed up or slow down, or maintain their current speed and lane.

In both environments, the agent's chosen action has a 70% probability of success and a 30% probability of a random outcome. The reward is discounted with each time step by a factor of $\gamma = 0.9$.

6.1 Evaluation Procedure

For each task, we created an instance of the MDP by placing the objects in random locations. We solved the MDP for the true optimal policy, and generated simulated user trajectories following this policy. Then, we gave the IRL algorithms the MDP$\setminus r$ and the trajectories to estimate the reward $r$. We compared the learned reward function with the true reward function by standardizing both and computing the $\ell_2$-norm of their difference. Then, we trained a policy using the learned reward function, and compared its expected return to that obtained by a policy trained using the true reward.

We tested ELIRL using $L$ trained on various subsets of tasks, ranging from 10 to 100 tasks.
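The reward-comparison metric above, standardizing both reward vectors over states and taking the $\ell_2$-norm of their difference, can be sketched in a few lines (our own naming, not the authors' evaluation code):

```python
import numpy as np

def reward_error(r_learned, r_true):
    """Standardize both reward vectors (zero mean, unit variance over states)
    and return the l2-norm of their difference, so the metric is invariant to
    the scale and offset ambiguity of learned rewards."""
    standardize = lambda r: (r - r.mean()) / r.std()
    return np.linalg.norm(standardize(r_learned) - standardize(r_true))
```

The standardization matters because IRL recovers rewards only up to affine transformations that leave the optimal policy unchanged.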
At each testing step, we evaluated performance on all 100 tasks; this includes as a subset evaluating all previously observed tasks, but it is significantly more difficult because the latent basis $L$, which is trained only on the initial tasks, must generalize to future tasks. The single-task learners were trained on all tasks, and we measured their average performance across all tasks. All learners were given $n_t = 32$ trajectories for Objectworld and $n_t = 256$ trajectories for Highway, all of length $H = 16$. We chose the size $k$ of $L$ via domain knowledge, and initialized $L$ sequentially with the $\alpha^{(t)}$'s of the first $k$ tasks. We measured performance on a new random instance of the MDP for each task, so as not to conflate overfitting the training environment with high performance. Results were averaged over 20 trials, each using a random task ordering.

We compared ELIRL with both the original (ELIRL) and re-optimized (ELIRLre) $s^{(t)}$ vectors to MaxEnt IRL (the base learner) and GPIRL [21] (a strong single-task baseline). None of the existing multi-task IRL methods were suitable for this experimental setting: other methods assume a shared state space and are prohibitively expensive for more than a few tasks [12, 2, 11], or only learn different experts' approaches to a single task [38]. Appendix B includes a comparison to MTMLIRL [2] on a simplified version of Objectworld, since MTMLIRL was unable to handle the full version.

Figure 1: Average reward and value difference in the lifelong setting. Reward difference measures the error between learned and true reward. Value difference compares expected return from the policy trained on the learned reward and the policy trained on the true reward. The whiskers denote std. error. ELIRL improves as the number of tasks increases, achieving better performance than its base learner, MaxEnt IRL.
Using re-optimization after learning all tasks allows earlier tasks to benefit from the latest knowledge, increasing ELIRL's performance above GPIRL. (Best viewed in color.)

[Figure 1 plots: average reward difference and average value difference vs. number of tasks trained, for ELIRL, ELIRLre, MaxEnt IRL, and GPIRL, on (a) Objectworld and (b) Highway.]

[Figure 2 panels: (a) green and yellow; (b) green, blue, yellow; (c) orange.]

Figure 2: Example latent reward functions from Objectworld learned by ELIRL. Each column of $L$ can be visualized as a reward function, and captures a reusable chunk of knowledge. The grayscale values show the learned reward and the arrows show the corresponding optimal policy. Each latent component has specialized to focus on objects of particular colors, as labeled.
(Best viewed in color.)

Figure 3: Reverse transfer. Difference in error in the learned reward between when a task was first trained and after the full model had been trained, as a function of task order: (a) Objectworld with the original s(t)'s; (b) Objectworld with re-optimized s(t)'s; (c) Highway with the original s(t)'s; (d) Highway with re-optimized s(t)'s. Positive change in error indicates positive transfer; negative change indicates interference from negative transfer. Note that the re-optimization has both decreased negative transfer on the earliest tasks and significantly increased the magnitude of positive reverse transfer. Red curves show the best-fit exponential.

6.2 Results

Figure 1 shows the advantage of sharing knowledge among IRL tasks. ELIRL learned the reward functions more accurately than its base learner, MaxEnt IRL, after sufficient tasks were used to train the knowledge base L. This directly translated to increased performance of the policy trained using the learned reward function. Moreover, the s(t) re-optimization (Section 5.2) allowed ELIRLre to outperform GPIRL by making use of the most updated knowledge.

As shown in Table 1, ELIRL requires little extra training time versus MaxEnt IRL, even with the optional s(t) re-optimization, and runs significantly faster than GPIRL.
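For concreteness, the two evaluation metrics reported in Figure 1 can be sketched for a small tabular MDP along the following lines. This is a minimal illustration, not the paper's implementation: the value-iteration routine, the mean-absolute-error form of the reward difference, and all names are our own assumptions.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, iters=500):
    """P: (A, S, S) transition tensor; r: (S,) state rewards.
    Returns the converged value function and a greedy policy."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * (P @ V)       # Q[a, s] = r[s] + gamma * E[V(s')]
        V = Q.max(axis=0)
    return V, Q.argmax(axis=0)

def policy_value(P, r, policy, gamma=0.9, iters=500):
    """Expected return of a fixed policy, evaluated under reward r."""
    S = r.shape[0]
    Pp = P[policy, np.arange(S)]      # (S, S) transitions under the policy
    V = np.zeros(S)
    for _ in range(iters):
        V = r + gamma * (Pp @ V)
    return V

def irl_metrics(P, r_true, r_learned, gamma=0.9):
    # Reward difference: error between the learned and true rewards
    # (here, mean absolute error; the paper's exact form is not specified).
    reward_diff = np.abs(r_true - r_learned).mean()
    # Value difference: return of the policy trained on the true reward
    # minus the return (also under the true reward) of the policy
    # trained on the learned reward.
    _, pi_true = value_iteration(P, r_true, gamma)
    _, pi_learned = value_iteration(P, r_learned, gamma)
    value_diff = (policy_value(P, r_true, pi_true, gamma).mean()
                  - policy_value(P, r_true, pi_learned, gamma).mean())
    return reward_diff, value_diff
```

A policy trained on a perfect reward estimate gives a value difference of zero, and any error in the learned reward can only make the value difference nonnegative (up to numerical convergence), which is why lower is better for both metrics.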
The re-optimization's additional time is nearly imperceptible. This represents a clear advantage for ELIRL when learning multiple tasks in real time.

Table 1: The average learning time per task. The standard error is reported after the ±.

             Objectworld (sec)    Highway (sec)
ELIRL        17.055 ± 0.091       21.438 ± 0.173
ELIRLre      17.068 ± 0.091       21.440 ± 0.173
MaxEnt IRL   16.572 ± 0.407       18.283 ± 0.775
GPIRL        1008.181 ± 67.261    392.117 ± 18.484

In order to analyze how ELIRL captures the latent structure underlying the tasks, we created new instances of Objectworld and used a single learned latent component as the reward of each new MDP (i.e., a column of L, which can be treated as a latent reward function factor). Figure 2 shows example latent components learned by the algorithm, revealing that each latent component represents the 5×5 grid around a particular color or small subset of the colors.

We also examined how performance on the earliest tasks changed during the lifelong learning process. Recall that as ELIRL learns new tasks, the shared knowledge in L continually changes. Consequently, the modeled reward functions for all tasks continue to be refined automatically over time, without retraining on the tasks. To measure this effect of "reverse transfer" [34], we compared the performance on each task when it was first encountered to its performance after learning all tasks, averaged over 20 random task orders. Figure 3 reveals that ELIRL improves previous tasks' performance as L is refined, achieving reverse transfer in IRL. Reverse transfer was further improved by the s(t) re-optimization.

Figure 4: Results for extensions of ELIRL. Whiskers denote standard errors. (a) Cross-domain transfer: reward difference (lower is better) between MaxEnt IRL, in-domain ELIRL, and cross-domain ELIRL; transferring knowledge across domains improved the accuracy of the learned reward. (b) Continuous domains: value difference (lower is better) obtained by ELIRL and AME-IRL on the planar navigation environment; ELIRL improves the performance of AME-IRL, and this improvement increases as ELIRL observes more tasks.

6.3 ELIRL Extensions to Cross-Domain Transfer and Continuous State-Action Spaces

We performed additional experiments to show how simple extensions to ELIRL can transfer knowledge across tasks with different feature spaces and with continuous state-action spaces.

ELIRL can support transfer across task domains with different feature spaces by adapting prior work in cross-domain transfer [3]; details of this extension are given in Appendix C.
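The shared-basis mechanism these results rely on can be illustrated with a toy sketch: factor each task's reward parameters as α(t) ≈ L s(t), update L as tasks arrive, and afterwards re-optimize every s(t) against the final L, mirroring the ELIRLre step. Everything below is our own simplification; in particular, we use plain least squares in place of ELIRL's sparse-coded, base-learner-weighted updates, and all names are hypothetical.

```python
import numpy as np

def fit_code(L, alpha):
    """Code s(t) reconstructing alpha(t) from the basis L.
    (Plain least squares; ELIRL itself uses a sparse, weighted objective.)"""
    return np.linalg.lstsq(L, alpha, rcond=None)[0]

def refit_basis(alphas, codes):
    """Refit L so that L @ s(t) approximates alpha(t) for all seen tasks."""
    S, A = np.stack(codes), np.stack(alphas)       # (T, k), (T, d)
    return np.linalg.lstsq(S, A, rcond=None)[0].T  # (d, k)

rng = np.random.default_rng(0)
T, d, k = 20, 8, 3
true_L = rng.standard_normal((d, k))
# Synthetic per-task reward parameters sharing a low-rank structure.
alphas = [true_L @ rng.standard_normal(k) + 0.01 * rng.standard_normal(d)
          for _ in range(T)]

# Initialize L with the first k tasks' parameters (as in the experiments),
# so those tasks' codes start as unit vectors.
L = np.stack(alphas[:k], axis=1)
codes = [np.eye(k)[i] for i in range(k)]
for t in range(k, T):                              # lifelong phase
    codes.append(fit_code(L, alphas[t]))
    L = refit_basis(alphas[: t + 1], codes[: t + 1])

# ELIRLre-style step: re-optimize every task's code against the final L.
err_orig = [np.linalg.norm(L @ s - a) for s, a in zip(codes, alphas)]
codes_re = [fit_code(L, a) for a in alphas]
err_re = [np.linalg.norm(L @ s - a) for s, a in zip(codes_re, alphas)]
```

Because re-optimization solves for the best code under the final basis, err_re can never exceed err_orig for any task, and the gap tends to be largest for the earliest tasks, whose original codes were fit against a less mature L: a toy analogue of the reverse-transfer effect in Figure 3.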
To evaluate cross-domain transfer, we constructed 40 Objectworld domains with different feature spaces by varying the grid sizes from 5 to 24 and letting the number of outer colors be either 3 or 5. We created 10 tasks per domain, and provided the agents with 16 demonstrations per task, with lengths varying according to the number of cells in each domain. We compared MaxEnt IRL, in-domain ELIRL with the original (ELIRL) and re-optimized (ELIRLre) s(t)'s, and cross-domain ELIRL with the original (CD-ELIRL) and re-optimized (CD-ELIRLre) s(t)'s, averaged over 10 random task orderings. Figure 4a shows how cross-domain transfer improved the performance of an agent trained only on tasks within each domain. Notice how the s(t) re-optimization compensates for the major changes in the shared knowledge that occur when the agent encounters tasks from different domains.

We also explored an extension of ELIRL to continuous state spaces, as detailed in Appendix D. To evaluate this extension, we used a continuous planar navigation task similar to that presented by Levine and Koltun [20]. Analogous to Objectworld, this continuous environment contains randomly distributed objects that have associated rewards (sampled randomly), and each object has an area of influence defined by a radial basis function. Figure 4b shows the performance of ELIRL on 50 continuous navigation tasks averaged over 20 different task orderings, compared against the average performance of the single-task AME-IRL algorithm [20] across all tasks. These results show that ELIRL is able to achieve better performance in the continuous space than the single-task learner, once a sufficient number of tasks has been observed.

7 Conclusion

We introduced the novel problem of lifelong IRL, and presented a general framework that is capable of sharing learned knowledge about the reward functions between IRL tasks.
We derived an algorithm for lifelong MaxEnt IRL, and showed how it can be easily extended to handle different single-task IRL methods and diverse task domains. In future work, we intend to study how more powerful base learners can be used for learning more complex tasks, potentially from human demonstrations.

Acknowledgements

This research was partly supported by AFRL grant #FA8750-16-1-0109 and DARPA agreement #FA8750-18-2-0117. We would like to thank the anonymous reviewers for their helpful feedback.

References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML-04), 2004.

[2] Monica Babes, Vukosi N. Marivate, Kaushik Subramanian, and Michael L. Littman. Apprenticeship learning about multiple intentions. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.

[3] Haitham Bou Ammar, Eric Eaton, Jose Marcio Luna, and Paul Ruvolo. Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI-15), 2015.

[4] Haitham Bou Ammar, Eric Eaton, Paul Ruvolo, and Matthew E. Taylor. Online multi-task learning for policy gradient methods. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.

[5] Haitham Bou Ammar, Eric Eaton, Paul Ruvolo, and Matthew E. Taylor. Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. In Proceedings of the 29th Conference on Artificial Intelligence (AAAI-15), 2015.

[6] Haitham Bou Ammar, Decebal Constantin Mocanu, Matthew E. Taylor, Kurt Driessens, Karl Tuyls, and Gerhard Weiss. Automatically mapped transfer between reinforcement learning tasks via three-way restricted Boltzmann machines.
In Proceedings of the 2013 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-13), 2013.

[7] Haitham Bou Ammar and Matthew E. Taylor. Common subspace transfer for reinforcement learning tasks. In Proceedings of the Adaptive and Learning Agents Workshop at the 10th Autonomous Agents and Multi-Agent Systems Conference (AAMAS-11), 2011.

[8] Haitham Bou Ammar, Matthew E. Taylor, Karl Tuyls, and Gerhard Weiss. Reinforcement learning transfer using a sparse coded inter-task mapping. In Proceedings of the 11th European Workshop on Multi-Agent Systems (EUMAS-13), 2013.

[9] Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS-11), 2011.

[10] Zhiyuan Chen and Bing Liu. Lifelong Machine Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2016.

[11] Jaedeug Choi and Kee-Eung Kim. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In Advances in Neural Information Processing Systems 25 (NIPS-12), 2012.

[12] Christos Dimitrakakis and Constantin A. Rothkopf. Bayesian multitask inverse reinforcement learning. In Proceedings of the 9th European Workshop on Reinforcement Learning (EWRL-11), 2011.

[13] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems 30 (NIPS-17), 2017.

[14] Anestis Fachantidis, Ioannis Partalas, Matthew E. Taylor, and Ioannis Vlahavas. Transfer learning via multiple inter-task mappings. In Proceedings of the 9th European Workshop on Reinforcement Learning (EWRL-11), 2011.

[15] Anestis Fachantidis, Ioannis Partalas, Matthew E.
Taylor, and Ioannis Vlahavas. Transfer learning with probabilistic mapping selection. Adaptive Behavior, 2015.

[16] Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR-17), 2017.

[17] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In Proceedings of the 1st Annual Conference on Robot Learning (CoRL-17), 2017.

[18] George Konidaris, Ilya Scheidwasser, and Andrew Barto. Transfer in reinforcement learning via shared features. Journal of Machine Learning Research (JMLR), 2012.

[19] Abhishek Kumar and Hal Daumé III. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012.

[20] Sergey Levine and Vladlen Koltun. Continuous inverse optimal control with locally optimal examples. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012.

[21] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems 24 (NIPS-11), 2011.

[22] Yong Luo, Dacheng Tao, and Yonggang Wen. Exploiting high-order information in heterogeneous multi-task feature learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17), 2017.

[23] Yong Luo, Yonggang Wen, and Dacheng Tao. On combining side information and unlabeled data for heterogeneous multi-task metric learning. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16), 2016.

[24] James MacGlashan. Brown-UMBC reinforcement learning and planning (BURLAP) Java library, version 3.0.
Available online at http://burlap.cs.brown.edu, 2016.

[25] Olivier Mangin and Pierre-Yves Oudeyer. Feature learning for multi-task inverse reinforcement learning. Available online at https://olivier.mangin.com/media/pdf/mangin.2014.firl.pdf, 2013.

[26] Andreas Maurer, Massi Pontil, and Bernardino Romera-Paredes. Sparse coding for multitask and transfer learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013.

[27] Francisco S. Melo and Manuel Lopes. Learning from demonstration using MDP induced metrics. In Proceedings of the 2010 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-10), 2010.

[28] Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI-07), 2007.

[29] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML-00), 2000.

[30] Qifeng Qiao and Peter A. Beling. Inverse reinforcement learning with Gaussian process. In Proceedings of the 2011 American Control Conference (ACC-11). IEEE, 2011.

[31] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), 2007.

[32] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning (ICML-06), 2006.

[33] Constantin A. Rothkopf and Christos Dimitrakakis. Preference elicitation and inverse reinforcement learning.
In Proceedings of the 2011 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-11), 2011.

[34] Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013.

[35] David Silver, J. Andrew Bagnell, and Anthony Stentz. Perceptual interpretation for autonomous navigation through dynamic imitation learning. In Proceedings of the 14th International Symposium on Robotics Research (ISRR-09), 2009.

[36] Jonathan Sorg and Satinder Singh. Transfer via soft homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-09), 2009.

[37] Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20 (NIPS-07), 2007.

[38] Ajay Kumar Tanwani and Aude Billard. Transfer in inverse reinforcement learning for multiple strategies. In Proceedings of the 2013 International Conference on Intelligent Robots and Systems (IROS-13). IEEE, 2013.

[39] Matthew E. Taylor, Gregory Kuhlmann, and Peter Stone. Autonomous transfer for reinforcement learning. In Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-08), 2008.

[40] Matthew E. Taylor and Peter Stone. Cross-domain transfer for reinforcement learning. In Proceedings of the 24th International Conference on Machine Learning (ICML-07), 2007.

[41] Matthew E. Taylor, Shimon Whiteson, and Peter Stone. Transfer via inter-task mappings in policy search reinforcement learning. In Proceedings of the 6th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-07), 2007.

[42] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning.
arXiv preprint arXiv:1507.04888, 2015.

[43] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd Conference on Artificial Intelligence (AAAI-08), 2008.