{"title": "Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 305, "page_last": 313, "abstract": "We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains.", "full_text": "Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions\n\nJaedeug Choi and Kee-Eung Kim\nDepartment of Computer Science\nKorea Advanced Institute of Science and Technology\nDaejeon 305-701, Korea\njdchoi@ai.kaist.ac.kr, kekim@cs.kaist.ac.kr\n\nAbstract\n\nWe present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. 
We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains.\n\n1 Introduction\n\nInverse reinforcement learning (IRL) aims to find the agent\u2019s underlying reward function given the behaviour data and the model of the environment [1]. IRL algorithms often assume that the behaviour data is from an agent who behaves optimally without mistakes with respect to a single reward function. From the Markov decision process (MDP) perspective, IRL can be defined as the problem of finding the reward function given the trajectory data of an optimal policy, consisting of state-action histories. Under this assumption, a number of studies on IRL have appeared in the literature [2, 3, 4, 5]. In addition, IRL has been applied to various practical problems, including inferring taxi drivers\u2019 route preferences from their GPS data [6], estimating patients\u2019 preferences to determine the optimal timing of living-donor liver transplants [7], and implementing simulated users to assess the quality of dialogue management systems [8].\n\nIn practice, the behaviour data is often gathered collectively from multiple agents whose reward functions are potentially different from each other. The amount of data generated from a single agent may be severely limited, and hence we may suffer from the sparsity of data if we try to infer the reward function individually. Moreover, even when we have enough data from a single agent, the reward function may change depending on the situation.\n\nHowever, most of the previous IRL algorithms assume that the behaviour data is generated by a single agent optimizing a fixed reward function, although there are a few exceptions that address IRL for multiple reward functions. 
Dimitrakakis and Rothkopf [9] proposed a multi-task learning approach, generalizing the Bayesian approach to IRL [4]. In this work, the reward functions are individually estimated for each trajectory and are assumed to share a common prior. Other than the common prior assumption, there is no effort to group trajectories that are likely to be generated from the same or similar reward functions. On the other hand, Babe\u015f-Vroman et al. [10] took a more direct approach that combines EM clustering with an IRL algorithm. The behaviour data are clustered based on the inferred reward functions, where the reward functions are defined per cluster. However, the number of clusters (hence the number of reward functions) has to be specified as a parameter in order to use the approach.\n\nIn this paper, we present a nonparametric Bayesian approach using the Dirichlet process mixture model in order to address the IRL problem with multiple reward functions. We develop an efficient Metropolis-Hastings (MH) sampler utilizing the gradient of the reward function posterior to infer reward functions from the behaviour data. In addition, after completing IRL on the behaviour data, we can efficiently estimate the reward function for a new trajectory by computing the mean of the reward function posterior given the pre-learned results.\n\n2 Preliminaries\n\nWe assume that the environment is modeled as an MDP $\\langle S, A, T, R, \\gamma, b_0 \\rangle$ where: $S$ is the finite set of states; $A$ is the finite set of actions; $T(s, a, s')$ is the state transition probability of changing to state $s'$ from state $s$ when action $a$ is taken; $R(s, a)$ is the immediate reward of executing action $a$ in state $s$; $\\gamma \\in [0, 1)$ is the discount factor; $b_0(s)$ denotes the probability of starting in state $s$. For notational convenience, we use the vector $r = [r_1, \\ldots, r_D]$ to denote the reward function (footnote 1).\n\nA policy is a mapping $\\pi : S \\to A$. 
The value of policy $\\pi$ is the expected discounted return of executing the policy, defined as $V^{\\pi} = E\\left[\\sum_{t=0}^{\\infty} \\gamma^t R(s_t, a_t) \\mid b_0, \\pi\\right]$. The value function of policy $\\pi$ for each state $s$ is computed by $V^{\\pi}(s) = R(s, \\pi(s)) + \\gamma \\sum_{s' \\in S} T(s, \\pi(s), s') V^{\\pi}(s')$ so that the value is calculated by $V^{\\pi} = \\sum_{s \\in S} b_0(s) V^{\\pi}(s)$. Similarly, the Q-function is defined as $Q^{\\pi}(s, a) = R(s, a) + \\gamma \\sum_{s' \\in S} T(s, a, s') V^{\\pi}(s')$. Given an MDP, the agent\u2019s objective is to execute an optimal policy $\\pi^*$ that maximizes the value function for all the states, which should satisfy the Bellman optimality equation: $V^*(s) = \\max_{a \\in A} \\left[ R(s, a) + \\gamma \\sum_{s' \\in S} T(s, a, s') V^*(s') \\right]$.\n\nWe assume that the agent\u2019s behaviour data is generated by executing an optimal policy with some unknown reward function(s) $R$, given as the set $X$ of $M$ trajectories where the $m$-th trajectory is an $H$-step sequence of state-action pairs: $X_m = \\{(s_{m,1}, a_{m,1}), (s_{m,2}, a_{m,2}), \\ldots, (s_{m,H}, a_{m,H})\\}$ (footnote 2).\n\n2.1 Bayesian Inverse Reinforcement Learning (BIRL)\n\nRamachandran and Amir [4] proposed a Bayesian approach to IRL with the assumption that the behaviour data is generated from a single reward function. The prior encodes the reward function preference and the likelihood measures the compatibility of the reward function with the data.\n\nAssuming that the reward function entries are independently distributed, the prior is defined as $P(r) = \\prod_{d=1}^{D} P(r_d)$. We can use various distributions for the reward prior. For instance, the uniform distribution can be used if we have no knowledge or preference on rewards other than their range, and the normal or Laplace distributions can be used if we prefer rewards to be close to some specific values. 
The Beta distribution can also be used if we treat rewards as the parameter of the Bernoulli distribution, i.e. $P(\\xi_d = 1) = r_d$ with auxiliary binary random variable $\\xi_d$ [11].\n\nThe likelihood is defined as an independent exponential distribution, analogous to the softmax distribution over actions:\n\n$$P(X \\mid r, \\eta) = \\prod_{m=1}^{M} \\prod_{h=1}^{H} P(a_{m,h} \\mid s_{m,h}, r, \\eta) = \\prod_{m=1}^{M} \\prod_{h=1}^{H} \\frac{\\exp(\\eta Q^*(s_{m,h}, a_{m,h}; r))}{\\sum_{a'} \\exp(\\eta Q^*(s_{m,h}, a'; r))} \\qquad (1)$$\n\nwhere $\\eta$ is the confidence parameter of choosing optimal actions and $Q^*(\\cdot, \\cdot; r)$ denotes the optimal Q-function computed using reward function $r$.\n\nFor the sake of exposition, we assume that the reward function entries are independently and normally distributed with mean $\\mu$ and variance $\\sigma^2$ so that the prior is defined as $P(r \\mid \\mu, \\sigma) = \\prod_{d=1}^{D} N(r_d; \\mu, \\sigma)$, but our approach to be presented in later sections can be generalized to use many other distributions for the prior. The posterior over the reward functions is then formulated by Bayes rule as follows:\n\n$$P(r \\mid X, \\eta, \\mu, \\sigma) \\propto P(X \\mid r, \\eta) P(r \\mid \\mu, \\sigma). \\qquad (2)$$\n\nWe can infer the reward function from the model by computing the posterior mean using a Markov chain Monte Carlo (MCMC) algorithm [4] or the maximum-a-posteriori (MAP) estimates using a gradient method [12]. Fig. 1 shows the graphical model used in BIRL.\n\nFootnote 1: $D$ denotes the number of features. Note that we can assign individual reward values to every state-action pair by using $|S||A|$ indicator functions for features.\n\nFootnote 2: Although we assume that all trajectories are of length $H$ for notational brevity, our formulation trivially extends to different lengths.\n\nFigure 1: Graphical model for BIRL.\n\nFigure 2: Graphical model for DPM-BIRL.\n\nAlgorithm 1: MH algorithm for DPM-BIRL\nInitialize $c$ and $\\{r_k\\}_{k=1}^{K}$\nfor $t = 1$ to MaxIter do\n    for $m = 1$ to $M$ do\n        $c_m^* \\sim P(c \\mid c_{-m}, \\alpha)$\n        if $c_m^* \\notin c_{-m}$ then $r_{c_m^*} \\sim P(r \\mid \\mu, \\sigma)$\n        $\\langle c_m, r_{c_m} \\rangle \\leftarrow \\langle c_m^*, r_{c_m^*} \\rangle$ with prob. of $\\min\\left\\{1, \\frac{P(X_m \\mid r_{c_m^*}, \\eta)}{P(X_m \\mid r_{c_m}, \\eta)}\\right\\}$\n    for $k = 1$ to $K$ do\n        $\\epsilon \\sim N(0, 1)$\n        $r_k^* \\leftarrow r_k + \\frac{\\tau^2}{2} \\nabla \\log f(r_k) + \\tau \\epsilon$\n        $r_k \\leftarrow r_k^*$ with prob. of $\\min\\left\\{1, \\frac{f(r_k^*) g(r_k^*, r_k)}{f(r_k) g(r_k, r_k^*)}\\right\\}$\n\n3 Nonparametric Bayesian IRL for Multiple Reward Functions\n\nIn this section, we present our approach to IRL for multiple reward functions. We assume that each trajectory in the behaviour data is generated by an agent with a fixed reward function. In other words, we assume that the reward function does not change within a trajectory. However, the trajectories as a whole are assumed to be generated by one or more agents whose reward functions are distinct from each other. We do not assume any information regarding which trajectory is generated by which agent, nor the number of agents. Hence, the goal is to infer an unknown number of reward functions from the unlabeled behaviour data.\n\nA naive approach to this problem setting would be solving $M$ separate and independent IRL problems by treating each trajectory as the sole behaviour data and employing one of the well-known IRL algorithms designed for a single reward function. We can then use an unsupervised learning method with the $M$ reward functions as data points. However, this approach would suffer from the sparsity of data, since each trajectory may not contain a sufficient amount of data to infer the reward function reliably, or the number of trajectories may not be enough for the unsupervised learning method to yield a meaningful result. Babe\u015f-Vroman et al. [10] proposed an algorithm that combines EM clustering with an IRL algorithm. 
It clusters trajectories and assumes that all the trajectories in a cluster are generated by a single reward function. However, as a consequence of using EM clustering, we need to specify the number of clusters (i.e. the number of distinct reward functions) as a parameter.\n\nWe take a nonparametric Bayesian approach to IRL using the Dirichlet process mixture model. Our approach has three main advantages. First, we do not need to specify the number of distinct reward functions due to the nonparametric nature of our model. Second, we can encode our preference or domain knowledge on the reward function into the prior since it is a Bayesian approach to IRL. Third, we can acquire rich information from the behaviour data such as the distribution over the reward functions.\n\n3.1 Dirichlet Process Mixture Models\n\nThe Dirichlet process mixture (DPM) model [13] provides a nonparametric Bayesian framework for clustering using mixture models with a countably infinite number of mixture components. The prior of the mixing distribution is given by the Dirichlet process, which is a distribution over distributions parameterized by base distribution $G_0$ and concentration parameter $\\alpha$. The DPM model for data $\\{x_m\\}_{m=1}^{M}$ using a set of latent parameters $\\{\\theta_m\\}_{m=1}^{M}$ can be defined as:\n\n$$G \\mid \\alpha, G_0 \\sim DP(\\alpha, G_0), \\qquad \\theta_m \\mid G \\sim G, \\qquad x_m \\mid \\theta_m \\sim F(\\theta_m)$$\n\nwhere $G$ is the prior used to draw each $\\theta_m$ and $F(\\theta_m)$ is the parameterized distribution for data $x_m$. This is equivalent to the following form with $K \\to \\infty$:\n\n$$\\begin{aligned} p \\mid \\alpha &\\sim \\mathrm{Dirichlet}(\\alpha/K, \\ldots, \\alpha/K) \\\\\\\\ c_m \\mid p &\\sim \\mathrm{Multinomial}(p_1, \\ldots, p_K) \\\\\\\\ \\phi_k &\\sim G_0 \\\\\\\\ x_m \\mid c_m, \\phi &\\sim F(\\phi_{c_m}) \\end{aligned} \\qquad (3)$$\n\nwhere $p = \\{p_k\\}_{k=1}^{K}$ is the mixing proportion for the latent classes, $c_m \\in \\{1, \\ldots, K\\}$ is the class assignment of $x_m$ so that $c_m = k$ when $x_m$ is assigned to class $k$, $\\phi_k$ is the parameter of the data distribution for class $k$, and $\\phi = \\{\\phi_k\\}_{k=1}^{K}$.\n\n3.2 DPM-BIRL for Multiple Reward Functions\n\nWe address IRL for multiple reward functions by extending BIRL with the DPM model. We place a Dirichlet process prior on the reward functions $r_k$. The base distribution $G_0$ is defined as the reward function prior, i.e. the product of the normal distribution for each reward entry $\\prod_{d=1}^{D} N(r_{k,d}; \\mu, \\sigma)$. The cluster assignment $c_m = k$ indicates that the trajectory $X_m$ belongs to the cluster $k$, which represents that the trajectory is generated by the agent with the reward function $r_k$. We can thus regard the behaviour data $X = \\{X_1, \\ldots, X_M\\}$ as being drawn from the following generative process:\n\n1. The cluster assignment $c_m$ is drawn by the first two equations in Eqn. (3).\n\n2. The reward function $r_k$ is drawn from $\\prod_{d=1}^{D} N(r_{k,d}; \\mu, \\sigma)$.\n\n3. The trajectory $X_m$ is drawn from $P(X_m \\mid r_{c_m}, \\eta)$ in Eqn. (1).\n\nFig. 2 shows the graphical model of DPM-BIRL. The joint posterior of the cluster assignment $c = \\{c_m\\}_{m=1}^{M}$ and the set of reward functions $\\{r_k\\}_{k=1}^{K}$ is defined as:\n\n$$P(c, \\{r_k\\}_{k=1}^{K} \\mid X, \\eta, \\mu, \\sigma, \\alpha) = P(c \\mid \\alpha) \\prod_{k=1}^{K} P(r_k \\mid X_{c(k)}, \\eta, \\mu, \\sigma) \\qquad (4)$$\n\nwhere $X_{c(k)} = \\{X_m \\mid c_m = k \\text{ for } m = 1, \\ldots, M\\}$ and $P(r_k \\mid X, \\eta, \\mu, \\sigma)$ is taken from Eqn. (2).\n\nThe inference in DPM-BIRL can be done using the Metropolis-Hastings (MH) algorithm that samples each hidden variable in turn. First, note that we can safely assume that there are $K$ distinct values of $c_m$\u2019s so that $c_m \\in \\{1, \\ldots, K\\}$ without loss of generality. 
The conditional distribution to sample $c_m$ for the MH update can be defined as\n\n$$\\begin{aligned} P(c_m \\mid c_{-m}, \\{r_k\\}_{k=1}^{K}, X, \\eta, \\alpha) &\\propto P(X_m \\mid r_{c_m}, \\eta) P(c_m \\mid c_{-m}, \\alpha) \\\\\\\\ P(c_m \\mid c_{-m}, \\alpha) &\\propto \\begin{cases} n_{-m,c_j}, & \\text{if } c_m = c_j \\text{ for some } j \\\\\\\\ \\alpha, & \\text{if } c_m \\neq c_j \\text{ for all } j \\end{cases} \\end{aligned} \\qquad (5)$$\n\nwhere $c_{-m} = \\{c_i \\mid i \\neq m \\text{ for } i = 1, \\ldots, M\\}$, $P(X_m \\mid r_{c_m}, \\eta)$ is the likelihood defined in Eqn. (1), and $n_{-m,c_j} = |\\{c_i = c_j \\mid i \\neq m \\text{ for } i = 1, \\ldots, M\\}|$ is the number of trajectories, excluding $X_m$, assigned to the cluster $c_j$. Note that if the sampled $c_m \\neq c_j$ for all $j$ then $X_m$ is assigned to a new cluster. The conditional distribution to sample $r_k$ for the MH update is defined as\n\n$$P(r_k \\mid c, r_{-k}, X, \\eta, \\mu, \\sigma) \\propto P(X_{c(k)} \\mid r_k, \\eta) P(r_k \\mid \\mu, \\sigma)$$\n\nwhere $P(X_{c(k)} \\mid r_k, \\eta)$ is again the likelihood defined in Eqn. (1) and $P(r_k \\mid \\mu, \\sigma) = \\prod_{d=1}^{D} N(r_{k,d}; \\mu, \\sigma)$.\n\nIn Alg. 1, we present the MH algorithm for DPM-BIRL that uses the above MH updates. The algorithm consists of two steps. The first step updates the cluster assignment $c$. We sample a new assignment $c_m^*$ from Eqn. (5). If $c_m^*$ is not in $c_{-m}$, i.e., $c_m^* \\neq c_j$ for all $j$, we draw a new reward function $r_{c_m^*}$ from the reward prior $P(r \\mid \\mu, \\sigma)$. We then set $c_m = c_m^*$ with the acceptance probability of $\\min\\left\\{1, \\frac{P(X_m \\mid r_{c_m^*}, \\eta)}{P(X_m \\mid r_{c_m}, \\eta)}\\right\\}$, since we are using a non-conjugate prior [13]. The second step updates the reward functions $\\{r_k\\}_{k=1}^{K}$. We sample a new reward function $r_k^*$ using the equation\n\n$$r_k^* = r_k + \\frac{\\tau^2}{2} \\nabla \\log f(r_k) + \\tau \\epsilon$$\n\nwhere $\\epsilon$ is a sample from the standard normal distribution $N(0, 1)$, $\\tau$ is a non-negative scalar for the scaling parameter, and $f(r_k)$ is the target distribution of the MH update $P(X_{c(k)} \\mid r_k, \\eta) P(r_k \\mid \\mu, \\sigma)$, which is the unnormalized posterior of the reward function $r_k$. We then set $r_k = r_k^*$ with the acceptance probability of $\\min\\left\\{1, \\frac{f(r_k^*) g(r_k^*, r_k)}{f(r_k) g(r_k, r_k^*)}\\right\\}$ where\n\n$$g(x, y) = \\frac{1}{(2\\pi\\tau^2)^{D/2}} \\exp\\left(-\\frac{1}{2\\tau^2} \\left\\| y - x - \\frac{1}{2}\\tau^2 \\nabla \\log f(x) \\right\\|_2^2\\right)$$\n\nis the density of proposing $y$ from $x$.\n\nThis step is motivated by the Langevin algorithm [14] which exploits local information (i.e. gradient) of $f$ in order to efficiently move towards the high probability region. This algorithm is known to be more efficient than random walk MH algorithms. We can compute the gradient of $f$ using the results of Choi and Kim [12].\n\n3.3 Information Transfer to a New Trajectory\n\nSuppose that we would like to infer the reward function of a new trajectory after we finish IRL on the behaviour data consisting of $M$ trajectories. A naive approach would be running IRL from scratch using all of the $M + 1$ trajectories. However, it would be more desirable to transfer the relevant information from the pre-computed IRL results. In order to do so, Babe\u015f-Vroman et al. [10] use the weighted average of cluster reward functions assuming that the new trajectory is generated from the same population of the behaviour data. 
Note that we can relax this assumption and allow the new trajectory to be generated by a novel reward function, as a direct result of using the DPM model.\n\nGiven the cluster assignment $c$ and the reward functions $\\{r_k\\}_{k=1}^{K}$ computed from the behaviour data, the conditional prior of the reward function $r$ for the new trajectory can be defined as:\n\n$$P(r \\mid c, \\{r_k\\}_{k=1}^{K}, \\mu, \\sigma, \\alpha) = \\frac{\\alpha}{\\alpha + M} P(r \\mid \\mu, \\sigma) + \\frac{1}{\\alpha + M} \\sum_{k=1}^{K} n_k \\delta(r - r_k) \\qquad (6)$$\n\nwhere $n_k = |\\{X_m \\mid c_m = k \\text{ for } m = 1, \\ldots, M\\}|$ is the number of trajectories assigned to cluster $k$ and $\\delta(x)$ is the Dirac delta function. Running Alg. 1 on the behaviour data $X$, we already have a set of $N$ samples $\\{c^{(n)}, \\{r_k^{(n)}\\}_{k=1}^{K^{(n)}}\\}_{n=1}^{N}$ drawn from the joint posterior. The conditional posterior of $r$ for the new trajectory $X_{new}$ is then:\n\n$$\\begin{aligned} P(r \\mid X_{new}, X, \\Theta) &\\propto P(X_{new} \\mid r, \\eta) P(r \\mid X, \\Theta) \\\\\\\\ &= P(X_{new} \\mid r, \\eta) \\int P(r \\mid c, \\{r_k\\}_{k=1}^{K}, \\mu, \\sigma, \\alpha) \\, dP(c, \\{r_k\\}_{k=1}^{K} \\mid X, \\Theta) \\\\\\\\ &\\approx P(X_{new} \\mid r, \\eta) \\frac{1}{N} \\sum_{n=1}^{N} P(r \\mid c^{(n)}, \\{r_k^{(n)}\\}_{k=1}^{K^{(n)}}, \\mu, \\sigma, \\alpha) \\\\\\\\ &= P(X_{new} \\mid r, \\eta) \\left[ \\frac{\\alpha}{\\alpha + M} P(r \\mid \\mu, \\sigma) + \\frac{1}{\\alpha + M} \\sum_{n=1}^{N} \\sum_{k=1}^{K^{(n)}} \\frac{n_k^{(n)}}{N} \\delta(r - r_k^{(n)}) \\right] \\end{aligned}$$\n\nwhere $\\Theta = \\{\\eta, \\mu, \\sigma, \\alpha\\}$.\n\nWe can then re-draw samples of $r$ using the approximated posterior and take the sample average as the inferred reward function. However, we present a more efficient way of calculating the posterior mean of $r$ without re-drawing the samples. Note that Eqn. (6) is a mixture of a continuous distribution $P(r \\mid \\mu, \\sigma)$ with a number of point mass distributions on $\\{r_k\\}_{k=1}^{K}$. If we approximate the continuous one by a point mass distribution, i.e., $P(r \\mid \\mu, \\sigma) \\approx \\delta(r - \\hat{r})$, the posterior mean is analytically computable using the above approximation:\n\n$$E[r \\mid X_{new}, X, \\Theta] = \\int r \\, dP(r \\mid X_{new}, X, \\Theta) \\approx \\frac{1}{Z} \\left[ \\alpha P(X_{new} \\mid \\hat{r}, \\eta) \\hat{r} + \\sum_{n=1}^{N} \\sum_{k=1}^{K^{(n)}} \\frac{n_k^{(n)}}{N} P(X_{new} \\mid r_k^{(n)}, \\eta) r_k^{(n)} \\right] \\qquad (7)$$\n\nwhere $Z$ is the normalizing constant. We choose $\\hat{r} = \\arg\\max_r P(X_{new} \\mid r, \\eta) P(r \\mid \\mu, \\sigma)$, which is the MAP estimate of the reward function for the new trajectory $X_{new}$ only, ignoring the previous behaviour data $X$.\n\nFigure 3: Results with increasing number of trajectories per agent in the gridworld problem. DPM-BIRL uses the uniform (U) and the standard normal (N) priors. Panels, each plotted against the number of trajectories per agent: average EVD, F-score, NMI, number of clusters, and EVD for the new trajectory.\n\n4 Experimental Results\n\nWe compared the performance of DPM-BIRL to the EM-MLIRL algorithm [10] and the baseline algorithm which runs BIRL separately on each trajectory. The experiments consisted of two tasks: The first task was finding multiple reward functions from the behaviour data with a number of trajectories. 
The second task was inferring the reward function underlying a new trajectory, while exploiting the results learned in the first task.\n\nThe performance of each algorithm was evaluated by the expected value difference (EVD) $|V^*(r_A) - V^{\\pi^*(r_L)}(r_A)|$ where $r_A$ is the agent\u2019s ground truth reward function, $r_L$ is the learned reward function, $\\pi^*(r)$ is the optimal policy induced by reward function $r$, and $V^{\\pi}(r)$ is the value of policy $\\pi$ measured using $r$. The EVD thus measures the performance difference between the agent\u2019s optimal policy and the optimal policy induced by the learned reward function. In the first task, we evaluated the EVD for the true and learned reward functions of each trajectory and computed the average EVD over the trajectories in the behaviour data. In the second task, we evaluated the EVD for the new trajectory. The clustering quality on the behaviour data was evaluated by F-score and normalized mutual information (NMI).\n\nIn all the experiments, we assumed that the reward function was linearly parameterized such that $R(s, a) = \\sum_{d=1}^{D} r_d \\phi_d(s, a)$ with feature functions $\\phi_d : S \\times A \\to \\mathbb{R}$, hence $r = [r_1, \\ldots, r_D]$.\n\n4.1 Gridworld Problem\n\nIn order to extensively evaluate our approach, we first performed experiments on a small toy domain, an 8\u00d78 gridworld, where each of the 64 cells corresponds to a state. The agent can move north, south, east, or west, but with probability 0.2 it fails and moves in a random direction. The initial state is randomly chosen from the states. The grid is partitioned into non-overlapping regions of size 2\u00d72, and the feature function is defined by a binary indicator function for each region. 
Random instances of IRL with three reward functions were generated as follows: each element of $r$ was sampled to have a non-zero value with probability 0.2, and the value was drawn from the uniform distribution between -1 and 1. We obtained trajectories of 40 time steps and measured the performance as we increased the number of trajectories per reward function.\n\nFig. 3 shows the averages and standard errors of the performance results over 10 problem instances. The left four panels in the figure present the results for the first task of learning multiple reward functions from the behaviour data. When the size of the behaviour data was small, the clustering performance of both DPM-BIRL and EM-MLIRL was poor due to the sparsity of data, hence their EVD results were similar to that of the baseline algorithm that independently runs BIRL on each trajectory. However, as we increased the size of the data, both DPM-BIRL and EM-MLIRL achieved better EVD results than the baseline since they could utilize more information by grouping the trajectories to infer the reward functions. As for EM-MLIRL, we set the parameter $K$ used for the maximum number of clusters to 3 (ground truth), 6 (2x), and 9 (3x). DPM-BIRL achieved significantly better results than EM-MLIRL with all of the parameter settings, in terms of EVD and clustering quality. The rightmost panel in the figure presents the results for the second task of inferring the reward function for a new trajectory. DPM-BIRL clearly outperformed EM-MLIRL since it exploits the rich information from the reward function posterior. 
The relatively large error bars of the EM-MLIRL results are due to the local convergence inherent to EM clustering.\n\nFigure 4: CPU timing results in the gridworld problem (average EVD versus CPU time in seconds; curves for EM-MLIRL(3), EM-MLIRL(6), EM-MLIRL(9), DPM-BIRL(U), DPM-BIRL(G)).\n\nFigure 5: Screenshots of Simulated-highway problem (left) and Mario Bros (right).\n\nTable 1: Results in Simulated-highway problem.\nAlgorithm | Average EVD | F-score | NMI | # of clusters | EVD for Xnew\nBIRL | 0.52\u00b10.05 | n.a. | n.a. | n.a. | 0.41\u00b10.00\nEM-MLIRL(3) | 4.53\u00b10.96 | 0.80\u00b10.05 | 0.74\u00b10.09 | 2.20\u00b10.20 | 4.14\u00b10.88\nEM-MLIRL(6) | 0.89\u00b10.57 | 0.96\u00b10.02 | 0.96\u00b10.03 | 3.10\u00b10.18 | 0.82\u00b10.53\nDPM-BIRL(U) | 0.35\u00b10.04 | 0.98\u00b10.01 | 0.97\u00b10.01 | 3.30\u00b10.15 | 0.32\u00b10.04\nDPM-BIRL(N) | 0.36\u00b10.05 | 0.99\u00b10.01 | 0.99\u00b10.01 | 3.10\u00b10.10 | 0.30\u00b10.04\n\nFig. 4 compares the average CPU timing results of DPM-BIRL and EM-MLIRL with 10 trajectories per reward function. DPM-BIRL using Alg. 1 took much shorter time to converge than EM-MLIRL. This is mainly due to the fact that, whereas EM-MLIRL performs full single-reward IRL multiple times in each iteration, DPM-BIRL takes a sample from the posterior leveraging the gradient that does not involve a full IRL.\n\n4.2 Simulated-highway Problem\n\nThe second set of experiments was conducted in Simulated-highway problem [15] where the agent drives on a three lane road. The left panel in Fig. 
5 shows a screenshot of the problem. The agent can move one lane left or right and drive at speed 2 or 3, but it fails to change lanes with probability 0.2 at speed 2 and 0.4 at speed 3. All the other cars on the road constantly drive at speed 1 and do not change lanes. The reward function is defined by using 6 binary feature functions: one function for indicating the agent\u2019s collision with other cars, 3 functions for indicating the agent\u2019s current lane, and 2 functions for indicating the agent\u2019s current speed. We generated three agents having different driving styles. The first one prefers driving at speed 3 in the left-most lane and avoiding collisions. The second one prefers driving at speed 3 in the right-most lane and avoiding collisions. The third one prefers driving at speed 2 and colliding with other cars. We prepared 3 trajectories of 40 time steps per driver agent for the first task and 20 trajectories of 40 time steps yielded by a driver randomly chosen among the three for the second task.\n\nTbl. 1 presents the averages and standard errors of the results over 10 sets of the behaviour data. DPM-BIRL significantly outperformed the others while EM-MLIRL suffered from the convergence to a local optimum.\n\n4.3 Mario Bros.\n\nFor the third set of experiments, we used the open source simulator of the game Mario Bros, which is a challenging problem due to its huge state space. The right panel in Fig. 5 is a screenshot of the game. Mario can move left, move right, or jump. Mario\u2019s goal is to reach the end of the level by traversing from left to right while collecting coins and avoiding or killing enemies. 
We used 8 binary feature functions, each being an indicator for: Mario successfully reaching the end of the level; Mario getting killed; Mario killing an enemy; Mario collecting a coin; Mario receiving damage by an enemy; existence of a wall preventing Mario from moving in the current direction; Mario moving to the right; Mario moving to the left. We collected the behaviour data from 4 players: The expert player is good at both collecting coins and killing enemies. The coin collector likes to collect coins but avoids killing enemies. The enemy killer likes to kill enemies but avoids collecting coins. The speedy Gonzales avoids both collecting coins and killing enemies. All the players commonly try to reach the end of the level while acting according to their own preferences. The behaviour data consisted of 3 trajectories per player. Since only the simulator of the environment is available instead of the complete model, we used the relative entropy IRL [16] which is a model-free IRL algorithm.\n\nTable 2: Cluster assignments in Mario Bros.\nExpert player\n\nc\n\nDPM-BIRL\n\nEM-MLIRL(4)\nEM-MLIRL(8)\n\nCoin collector\n\nEnemy killer\n4\n3\n2\n1\n1\n3\n\n2\n2\n2\n\n3\n2\n3\n\n1\n1\n1\n\n1\n1\n1\n\n1\n1\n1\n\n1\n1\n1\n\n2\n1\n2\n\nSpeedy Gonzales\n\n5\n3\n3\n\n5\n3\n3\n\n5\n3\n3\n\nTable 3: Results of DPM-BIRL in Mario Bros.\nReward function entry ($r_{k,d}$), by cluster k from DPM-BIRL:\nFeature | k=1 | k=2 | k=3 | k=4 | k=5\n\u03c6_enemy-killed | 1.00 | -0.81 | 1.00 | 1.00 | -1.00\n\u03c6_coin-collected | 1.00 | 1.00 | -1.00 | -0.42 | -1.00\nAverage feature counts, by cluster k from DPM-BIRL:\nFeature | k=1 | k=2 | k=3 | k=4 | k=5\n\u03c6_enemy-killed | 3.10 | 1.60 | 2.80 | 1.90 | 0.55\n\u03c6_coin-collected | 21.60 | 21.55 | 7.55 | 7.85 | 6.75\n\nTbl. 2 presents the cluster assignment results. Each column represents each trajectory and the number denotes the cluster assignment $c_m$ of trajectory $X_m$. For example, DPM-BIRL produced 5 clusters and trajectories $X_1, \\ldots, X_4$ are assigned to the cluster 1 representing the expert player. 
EM-MLIRL failed to group the trajectories in a way that aligns well with the players, even though we restarted it 100 times in order to mitigate the convergence to bad local optima. On the other hand, DPM-BIRL was incorrect on only one trajectory, assigning a coin collector\u2019s trajectory to the expert player cluster. Tbl. 3 presents the reward function entries ($r_{k,d}$) learned from DPM-BIRL and the average feature counts acquired by the players with the learned reward functions. For the sake of brevity, we present only two important features (d = enemy-killed, coin-collected) that determine the playing style. To compute each player\u2019s feature counts, we executed an n-step lookahead policy yielded by each reward function $r_k$ on the simulator in 20 randomly chosen levels. The reward function entries align well with each playing style. For example, cluster 2 represents the coin collector, and its reward function entry for killing an enemy is negative but that for collecting a coin is positive.\n\nAs a demonstration, we implemented a small piece of software that visualizes the posterior probability of a gamer\u2019s behaviour belonging to one of the clusters including a new one. A demo video is provided as supplementary material.\n\n5 Conclusion\n\nWe proposed a nonparametric Bayesian approach to IRL for multiple reward functions using the Dirichlet process mixture model, which extends the previous Bayesian approach to IRL that assumes a single reward function. Our approach learns an appropriate number of reward functions from the behaviour data due to its nonparametric nature, and facilitates incorporating domain knowledge on the reward function by utilizing a Bayesian approach. We presented an efficient Metropolis-Hastings sampling algorithm that draws samples from the posterior of DPM-BIRL, leveraging the gradient of the posterior. We also provided an analytical way to compute the approximate posterior mean for the information transfer task. 
In addition, we showed that DPM-BIRL outperforms the previous\napproach in various problem domains.\n\nAcknowledgments\n\nThis work was supported by the National Research Foundation of Korea (Grant# 2012-007881), the\nDefense Acquisition Program Administration and Agency for Defense Development of Korea (Contract#\nUD080042AD), and the SW Computing R&D Program of KEIT (2011-10041313) funded by\nthe Ministry of Knowledge Economy of Korea.\n\nReferences\n\n[1] Stuart Russell. Learning agents for uncertain environments (extended abstract). In Proceedings of COLT, 1998.\n\n[2] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of ICML, 2000.\n\n[3] Gergely Neu and Csaba Szepesv\u00e1ri. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of UAI, 2007.\n\n[4] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proceedings of IJCAI, 2007.\n\n[5] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of AAAI, 2008.\n\n[6] Brian D. Ziebart, Andrew L. Maas, Anind K. Dey, and J. Andrew Bagnell. Navigate like a cabbie: probabilistic reasoning from observed context-aware behavior. In Proceedings of the international conference on Ubiquitous computing, 2008.\n\n[7] Zeynep Erkin, Matthew D. Bailey, Lisa M. Maillart, Andrew J. Schaefer, and Mark S. Roberts. Eliciting patients\u2019 revealed preferences: An inverse Markov decision process approach. Decision Analysis, 7(4), 2010.\n\n[8] Senthilkumar Chandramohan, Matthieu Geist, Fabrice Lefevre, and Olivier Pietquin. User simulation in dialogue systems using inverse reinforcement learning. In Proceedings of Interspeech, 2011.\n\n[9] Christos Dimitrakakis and Constantin A. Rothkopf. 
Bayesian multitask inverse reinforcement learning. In Proceedings of the European Workshop on Reinforcement Learning, 2011.\n\n[10] Monica Babe\u015f-Vroman, Vukosi Marivate, Kaushik Subramanian, and Michael Littman. Apprenticeship learning about multiple intentions. In Proceedings of ICML, 2011.\n\n[11] Peter Dayan and Geoffrey E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2), 1997.\n\n[12] Jaedeug Choi and Kee-Eung Kim. MAP inference for Bayesian inverse reinforcement learning. In Proceedings of NIPS, 2011.\n\n[13] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 2000.\n\n[14] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1), 1998.\n\n[15] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of ICML, 2004.\n\n[16] Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of AISTATS, 2011.", "award": [], "sourceid": 159, "authors": [{"given_name": "Jaedeug", "family_name": "Choi", "institution": null}, {"given_name": "Kee-eung", "family_name": "Kim", "institution": null}]}