{"title": "Multi-Agent Generative Adversarial Imitation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7461, "page_last": 7472, "abstract": "Imitation learning algorithms can be used to learn a policy from expert demonstrations without access to a reward signal. However, most existing approaches are not applicable in multi-agent settings due to the existence of multiple (Nash) equilibria and non-stationary environments.\nWe propose a new framework for multi-agent imitation learning for general Markov games, where we build upon a generalized notion of inverse reinforcement learning. We further introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents.", "full_text": "Multi-Agent Generative Adversarial Imitation\n\nLearning\n\nJiaming Song\n\nStanford University\n\ntsong@cs.stanford.edu\n\nHongyu Ren\n\nStanford University\n\nhyren@cs.stanford.edu\n\nDorsa Sadigh\n\nStanford University\n\ndorsa@cs.stanford.edu\n\nStefano Ermon\n\nStanford University\n\nermon@cs.stanford.edu\n\nAbstract\n\nImitation learning algorithms can be used to learn a policy from expert demonstra-\ntions without access to a reward signal. However, most existing approaches are not\napplicable in multi-agent settings due to the existence of multiple (Nash) equilibria\nand non-stationary environments. We propose a new framework for multi-agent\nimitation learning for general Markov games, where we build upon a generalized\nnotion of inverse reinforcement learning. We further introduce a practical multi-\nagent actor-critic algorithm with good empirical performance. 
Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents.

1 Introduction

Reinforcement learning (RL) methods are becoming increasingly successful at optimizing reward signals in complex, high-dimensional environments [1]. A key limitation of RL, however, is the difficulty of designing suitable reward functions for complex and not well-specified tasks [2, 3]. If the reward function does not cover all important aspects of the task, the agent could easily learn undesirable behaviors [4]. This problem is further exacerbated in multi-agent scenarios, such as multiplayer games [5], multi-robot control [6] and social interactions [7]; in these cases, agents do not necessarily share the same reward function and might even have conflicting rewards.

Imitation learning methods address these problems via expert demonstrations [8-11]; the agent directly learns desirable behaviors by imitating an expert. Notably, inverse reinforcement learning (IRL) frameworks assume that the expert is (approximately) optimizing an underlying reward function, and attempt to recover a reward function that rationalizes the demonstrations; an agent policy is subsequently learned through RL [12, 13]. Unfortunately, this paradigm is not suitable for general multi-agent settings, because the environment is non-stationary from the perspective of each individual agent [14] and multiple equilibrium solutions may exist [15].
The optimal policy of one agent could depend on the policies of other agents, and vice versa, so there could exist multiple solutions in which each agent's policy is the optimal response to the others'.

In this paper, we propose a new framework for multi-agent imitation learning: provided with demonstrations of a set of experts interacting with each other in the same environment, we aim to learn multiple parametrized policies that imitate the behavior of each expert respectively. Using the framework of Markov games, we integrate multi-agent RL with a suitable extension of multi-agent inverse RL. The resulting procedure strictly generalizes Generative Adversarial Imitation Learning (GAIL, [16]) in the single-agent case. Imitation learning in our setting corresponds to a two-player game between a generator and a discriminator. The generator controls the policies of all the agents in a distributed way, and the discriminator contains a classifier for each agent that is trained to distinguish that agent's behavior from that of the corresponding expert. Upon training, the behaviors produced by the policies should be indistinguishable from the training data. We can incorporate prior knowledge into the discriminators, including the presence of cooperative or competitive agents. In addition, we propose a novel multi-agent natural policy gradient algorithm that addresses the issue of high-variance gradient estimates commonly observed in reinforcement learning [14, 17].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Empirical results demonstrate that our method can imitate complex behaviors in high-dimensional environments, such as particle environments and cooperative robotic control tasks, with multiple cooperative or competitive agents; the imitated behaviors are close to the expert behaviors with respect to "true" reward functions which the agents do not have access to during training.

2 Preliminaries

2.1 Markov games

We consider an extension of Markov decision processes (MDPs) called Markov games [18]. A Markov game (MG) for $N$ agents is defined via a set of states $\mathcal{S}$ and $N$ sets of actions $\{\mathcal{A}_i\}_{i=1}^N$. The function $P : \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathcal{P}(\mathcal{S})$ describes the (stochastic) transition process between states, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over the set $\mathcal{S}$. Given that we are in state $s_t$ at time $t$, the agents take actions $(a_1, \ldots, a_N)$ and the state transitions to $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_1, \ldots, a_N)$. Each agent $i$ obtains a (bounded) reward given by a function $r_i : \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathbb{R}$, and aims to maximize its own total expected return $R_i = \sum_{t=0}^{\infty} \gamma^t r_{i,t}$, where $\gamma$ is the discount factor, by selecting actions through a (stationary and Markovian) stochastic policy $\pi_i : \mathcal{S} \times \mathcal{A}_i \to [0, 1]$. The initial states are determined by a distribution $\eta : \mathcal{S} \to [0, 1]$. The joint policy is defined as $\pi(a \mid s) = \prod_{i=1}^N \pi_i(a_i \mid s)$, where we use bold variables without subscript $i$ to denote the concatenation of the corresponding variables for all agents (e.g., $\pi$ denotes the joint policy $\prod_{i=1}^N \pi_i$ in a multi-agent setting, $r$ denotes all rewards, $a$ denotes the actions of all agents). We use an expectation with respect to a policy $\pi$ to denote an expectation with respect to the trajectories it generates, and use the subscript $-i$ to denote all agents except agent $i$.
For example, $(a_i, a_{-i})$ represents $(a_1, \ldots, a_N)$, the actions of all $N$ agents.

2.2 Reinforcement learning and Nash equilibrium

In reinforcement learning (RL), the goal of each agent is to maximize the total expected return $\mathbb{E}_\pi[r(s, a)]$ given access to the reward signal $r$. In single-agent RL, an optimal Markovian policy exists but might not be unique (e.g., all policies are optimal for an identically zero reward; see [19], Chapter 3.8). An entropy regularizer can be introduced to resolve this ambiguity. The optimal policy is found via the following RL procedure:

$$\mathrm{RL}(r) = \arg\max_{\pi \in \Pi} H(\pi) + \mathbb{E}_\pi[r(s, a)], \quad (1)$$

where $H(\pi)$ is the $\gamma$-discounted causal entropy [20] of policy $\pi \in \Pi$.

Definition 1 ($\gamma$-discounted Causal Entropy). The $\gamma$-discounted causal entropy for a policy $\pi$ is defined as follows:

$$H(\pi) \triangleq \mathbb{E}_\pi[-\log \pi(a \mid s)] = \mathbb{E}_{s_t, a_t \sim \pi}\left[ -\sum_{t=0}^{\infty} \gamma^t \log \pi(a_t \mid s_t) \right].$$

The addition of $H(\pi)$ in (1) resolves this ambiguity: the policy with both the highest reward and the highest entropy¹ is unique because the entropy function is strictly concave with respect to $\pi$.

In Markov games, however, the optimal policy of an agent depends on the other agents' policies. One approach is to use an equilibrium solution concept, such as Nash equilibrium [15]. Informally, a set of policies $\{\pi_i\}_{i=1}^N$ is a Nash equilibrium if no agent can achieve a higher reward by unilaterally changing its policy, i.e., $\forall i \in [1, N], \forall \hat{\pi}_i \neq \pi_i$: $\mathbb{E}_{\pi_i, \pi_{-i}}[r_i] \geq \mathbb{E}_{\hat{\pi}_i, \pi_{-i}}[r_i]$.

¹We use the term "entropy" to denote the γ-discounted causal entropy for policies in the rest of the paper.
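The unilateral-deviation condition above can be checked directly in a small normal-form (matrix) game, the one-shot special case of a Markov game. A minimal sketch, using matching pennies as a toy example (the helper function and payoff matrices below are illustrative, not from the paper):

```python
import numpy as np

def is_approx_nash(payoffs, strategies, eps=1e-8):
    """Check the unilateral-deviation condition for a two-player matrix game.

    payoffs[i] is player i's payoff matrix (rows: player 1's actions,
    columns: player 2's actions); strategies[i] is player i's mixed strategy.
    """
    p1, p2 = strategies
    # Expected payoff of each player under the current strategy profile.
    v1 = p1 @ payoffs[0] @ p2
    v2 = p1 @ payoffs[1] @ p2
    # The best unilateral deviation is always attained at a pure strategy.
    best1 = np.max(payoffs[0] @ p2)   # player 1 deviates, player 2 fixed
    best2 = np.max(p1 @ payoffs[1])   # player 2 deviates, player 1 fixed
    return best1 <= v1 + eps and best2 <= v2 + eps

# Matching pennies: the unique Nash equilibrium is uniform play for both.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
payoffs = [A, -A]  # zero-sum: r1 = -r2
uniform = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
print(is_approx_nash(payoffs, uniform))  # prints True
print(is_approx_nash(payoffs, [np.array([1.0, 0.0]), np.array([1.0, 0.0])]))  # prints False
```

Checking only pure-strategy deviations suffices because the expected payoff is linear in each player's own mixed strategy.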
The process of finding a Nash equilibrium can be defined as a constrained optimization problem ([21], Theorem 3.7.2):

$$\min_{\pi \in \Pi, v \in \mathbb{R}^{S \times N}} f_r(\pi, v) = \sum_{i=1}^{N} \sum_{s \in \mathcal{S}} \left( v_i(s) - \mathbb{E}_{a_i \sim \pi_i(\cdot \mid s)}\, q_i(s, a_i) \right) \quad (2)$$

$$\text{s.t.} \quad v_i(s) \geq q_i(s, a_i) \triangleq \mathbb{E}_{\pi_{-i}}\left[ r_i(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, v_i(s') \right] \quad \forall i \in [N],\ s \in \mathcal{S},\ a_i \in \mathcal{A}_i \quad (3)$$

where the joint action $a \triangleq (a_i, a_{-i}) \triangleq (a_1, \ldots, a_N)$ includes $a_i$ and actions $a_{-i}$ sampled from $\pi_{-i}$, and $v \triangleq [v_1; \ldots; v_N]$. Intuitively, $v$ can be thought of as a value function and $q$ represents the Q-function that corresponds to $v$. The constraints enforce the Nash equilibrium condition: when the constraints are satisfied, $(v_i(s) - q_i(s, a_i))$ is non-negative for every $i \in [N]$. Hence $f_r(\pi, v)$ is always non-negative for a feasible $(\pi, v)$. Moreover, this objective has a global minimum of zero if a Nash equilibrium exists, and $\pi$ forms a Nash equilibrium if and only if $f_r(\pi, v)$ reaches zero while being a feasible solution ([22], Theorem 2.4).

2.3 Inverse reinforcement learning

Suppose we do not have access to the reward signal $r$, but have demonstrations $\mathcal{D}$ provided by an expert ($N$ expert agents in Markov games). Imitation learning aims to learn policies that behave similarly to these demonstrations. In Markov games, we assume all experts/players operate in the same environment, and the demonstrations $\mathcal{D} = \{(s_j, a_j)\}_{j=1}^{M}$ are collected by sampling $s_0 \sim \eta(s)$, $a_t \sim \pi_E(\cdot \mid s_t)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$; we assume knowledge of $N$, $\gamma$, $\mathcal{S}$, $\mathcal{A}$, as well as access to $T$ and $\eta$ as black boxes.
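The sampling process for $\mathcal{D}$ described above amounts to rolling out the joint expert policy against the black-box dynamics. A minimal sketch (the callables standing in for $\eta$, $\pi_E$, and $P$, and the toy two-state chain, are hypothetical):

```python
def collect_demonstrations(eta, pi_E, P, num_episodes, horizon):
    """Roll out the joint expert policy to collect state-action demonstrations.

    eta() samples an initial state, pi_E(s) samples a joint expert action,
    and P(s, a) samples the next state; all three are treated as black boxes.
    """
    demos = []
    for _ in range(num_episodes):
        s = eta()
        for _ in range(horizon):
            a = pi_E(s)
            demos.append((s, a))
            s = P(s, a)
    return demos

# Toy two-state chain with a fixed "expert" that always picks action 1 for both agents.
demos = collect_demonstrations(
    eta=lambda: 0,
    pi_E=lambda s: (1, 1),          # joint action of N = 2 agents
    P=lambda s, a: min(s + 1, 1),   # deterministic transition
    num_episodes=2,
    horizon=3,
)
print(len(demos))  # prints 6 (state-action pairs)
```

Note that the demonstrations record only states and joint actions, matching the paper's assumption that no reward signal and no further expert interaction are available.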
We further assume that once we obtain $\mathcal{D}$, we cannot ask for additional expert interactions with the environment (unlike in DAgger [23] or CIRL [24]).

Let us first consider imitation in Markov decision processes (as a special case of Markov games) and the framework of single-agent Maximum Entropy IRL [8, 16], where the goal is to recover a reward function $r$ that rationalizes the expert behavior $\pi_E$:

$$\mathrm{IRL}(\pi_E) = \arg\max_{r \in \mathbb{R}^{S \times A}} \mathbb{E}_{\pi_E}[r(s, a)] - \left( \max_{\pi \in \Pi} H(\pi) + \mathbb{E}_\pi[r(s, a)] \right)$$

In practice, expectations with respect to $\pi_E$ are evaluated using samples from $\mathcal{D}$.

The IRL objective is ill-defined [12, 10] and there are often multiple valid solutions to the problem when we consider all $r \in \mathbb{R}^{S \times A}$. To resolve this ambiguity, [16] introduce a convex reward function regularizer $\psi : \mathbb{R}^{S \times A} \to \mathbb{R}$, which can be used, for example, to restrict rewards to be linear in a pre-determined set of features [16]:

$$\mathrm{IRL}_\psi(\pi_E) = \arg\max_{r \in \mathbb{R}^{S \times A}} -\psi(r) + \mathbb{E}_{\pi_E}[r(s, a)] - \left( \max_{\pi \in \Pi} H(\pi) + \mathbb{E}_\pi[r(s, a)] \right) \quad (4)$$

2.4 Imitation by matching occupancy measures

[16] interprets the imitation learning problem as matching two occupancy measures, i.e., the distributions over states and actions encountered when navigating the environment with a policy. Formally, for a policy $\pi$, the occupancy measure is defined as $\rho_\pi(s, a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$.
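The occupancy measure defined above can be approximated from sampled trajectories by discounted visitation counts. A minimal tabular sketch (the toy rollouts are illustrative):

```python
from collections import defaultdict

def estimate_occupancy(rollouts, gamma):
    """Monte Carlo estimate of rho_pi(s, a): discounted state-action visitation
    counts, averaged over the sampled trajectories."""
    rho = defaultdict(float)
    for traj in rollouts:
        for t, (s, a) in enumerate(traj):
            rho[(s, a)] += gamma ** t
    for key in rho:
        rho[key] /= len(rollouts)
    return dict(rho)

# Two identical length-2 trajectories in a toy MDP.
rollouts = [[(0, 'a'), (1, 'b')], [(0, 'a'), (1, 'b')]]
rho = estimate_occupancy(rollouts, gamma=0.9)
print(rho)  # prints {(0, 'a'): 1.0, (1, 'b'): 0.9}
```

Matching such estimates between expert and learner trajectories is exactly the occupancy-measure view of imitation discussed next.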
[16] draws a connection between IRL and occupancy measure matching, showing that the former is a dual of the latter:

Proposition 1 (Proposition 3.1 in [16]).

$$\mathrm{RL} \circ \mathrm{IRL}_\psi(\pi_E) = \arg\min_{\pi \in \Pi} -H(\pi) + \psi^\star(\rho_\pi - \rho_{\pi_E})$$

Here $\psi^\star(x) = \sup_y x^\top y - \psi(y)$ is the convex conjugate of $\psi$, which can be interpreted as a measure of similarity between the occupancy measures of the expert policy and the agent's policy. One instance, $\psi = \psi_{\mathrm{GA}}$, gives rise to the Generative Adversarial Imitation Learning (GAIL) method:

$$\psi^\star_{\mathrm{GA}}(\rho_\pi - \rho_{\pi_E}) = \max_{D \in (0,1)^{S \times A}} \mathbb{E}_{\pi_E}[\log(D(s, a))] + \mathbb{E}_\pi[\log(1 - D(s, a))] \quad (5)$$

The resulting imitation learning method from Proposition 1 involves a discriminator (a classifier $D$) competing with a generator (a policy $\pi$). The discriminator attempts to distinguish real vs. synthetic trajectories (produced by $\pi$) by optimizing (5). The generator, on the other hand, aims to perform optimally under the reward function defined by the discriminator, thus "fooling" the discriminator with synthetic trajectories that are difficult to distinguish from the expert ones.

3 Generalizing IRL to Markov games

Extending imitation learning to multi-agent settings is difficult because there are multiple rewards (one for each agent) and the notion of optimality is complicated by the need to consider an equilibrium solution [15]. We use $\mathrm{MARL}(r)$ to denote the set of (stationary and Markovian) policies that form a Nash equilibrium under $r$ and have the maximum $\gamma$-discounted causal entropy (among all equilibria):

$$\mathrm{MARL}(r) = \arg\min_{\pi \in \Pi, v \in \mathbb{R}^{S \times N}} f_r(\pi, v) - H(\pi) \quad \text{s.t.} \quad v_i(s) \geq q_i(s, a_i) \quad \forall i \in [N],\ s \in \mathcal{S},\ a_i \in \mathcal{A}_i \quad (6)$$

where $q$ is defined as in Equation (3).
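In practice, the inner maximization in Equation (5) is a binary classification (logistic regression) between expert and policy samples. A minimal sketch with a linear discriminator on hypothetical 2-D state-action features (the data, names, and optimizer below are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_objective(theta, expert_sa, policy_sa):
    """E_expert[log D] + E_policy[log(1 - D)] for a logistic D(s, a) = sigmoid(x . theta)."""
    return (np.mean(np.log(sigmoid(expert_sa @ theta)))
            + np.mean(np.log(1.0 - sigmoid(policy_sa @ theta))))

def gradient(theta, expert_sa, policy_sa):
    # Gradient of the objective: the expert term pushes D up, the policy term pushes D down.
    d_exp = sigmoid(expert_sa @ theta)
    d_pol = sigmoid(policy_sa @ theta)
    return (expert_sa.T @ (1.0 - d_exp) / len(d_exp)
            - policy_sa.T @ d_pol / len(d_pol))

# Hypothetical features of expert and policy state-action pairs (separable by design).
expert_sa = rng.normal(loc=+1.0, size=(500, 2))
policy_sa = rng.normal(loc=-1.0, size=(500, 2))

theta = np.zeros(2)
for _ in range(100):  # crude gradient ascent on the inner maximization
    theta += 0.1 * gradient(theta, expert_sa, policy_sa)

# An uninformative D = 1/2 attains 2 log(1/2); training must do at least as well.
print(discriminator_objective(theta, expert_sa, policy_sa) > 2 * np.log(0.5))  # prints True
```

The trained discriminator's log-odds then act as the similarity measure $\psi^\star_{\mathrm{GA}}$ between the two occupancy measures.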
Our goal is to define a suitable inverse operator MAIRL, in analogy to IRL in Equation (4), which chooses a reward that creates a margin between the expert and every other policy. However, the constraints in the Nash equilibrium optimization (Equation (6)) can make this challenging. To that end, we derive an equivalent Lagrangian formulation of (6), where we "move" the constraints into the objective function, so that we can define a margin between the expected rewards of two sets of policies that captures their "difference".

3.1 Equivalent constraints via temporal difference learning

Intuitively, the Nash equilibrium constraints imply that no agent $i$ can improve $\pi_i$ via 1-step temporal difference learning; if the condition in Equation (3) is not satisfied for some $v_i$, $q_i$, and $(s, a_i)$, this would suggest that we can update the policy for agent $i$ and its value function. Based on this notion, we can derive equivalent versions of the constraints corresponding to $t$-step temporal difference (TD) learning.

Theorem 1.
For a certain policy $\pi$ and reward $r$, let $\hat{v}_i(s; \pi, r)$ be the unique solution to the Bellman equation:

$$\hat{v}_i(s; \pi, r) = \mathbb{E}_{a \sim \pi}\left[ r_i(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \hat{v}_i(s'; \pi, r) \right] \quad \forall s \in \mathcal{S}.$$

Denote $\hat{q}^{(t)}_i(\{s^{(j)}, a^{(j)}\}_{j=0}^{t-1}, s^{(t)}, a^{(t)}_i; \pi, r)$ as the discounted expected return for the $i$-th agent conditioned on visiting the trajectory $\{s^{(j)}, a^{(j)}\}_{j=0}^{t-1}, s^{(t)}$ in the first $(t-1)$ steps and choosing action $a^{(t)}_i$ at the $t$-th step, when the other agents use policy $\pi_{-i}$:

$$\hat{q}^{(t)}_i(\{s^{(j)}, a^{(j)}\}_{j=0}^{t-1}, s^{(t)}, a^{(t)}_i; \pi, r) = \sum_{j=0}^{t-1} \gamma^j r_i(s^{(j)}, a^{(j)}) + \gamma^t\, \mathbb{E}_{a_{-i} \sim \pi_{-i}}\left[ r_i(s^{(t)}, a^{(t)}) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s^{(t)}, a^{(t)})\, \hat{v}_i(s'; \pi, r) \right].$$

Then $\pi$ is a Nash equilibrium if and only if for all $t \in \mathbb{N}_+$, $i \in [N]$, $j \in [t]$, $s^{(j)} \in \mathcal{S}$, $a^{(j)} \in \mathcal{A}$:

$$\hat{v}_i(s^{(0)}; \pi, r) \geq \mathbb{E}_{a_{-i} \sim \pi_{-i}}\left[ \hat{q}^{(t)}_i(\{s^{(j)}, a^{(j)}\}_{j=0}^{t-1}, s^{(t)}, a^{(t)}_i; \pi, r) \right] \triangleq Q^{(t)}_i(\{s^{(j)}, a^{(j)}_i\}_{j=0}^{t}; \pi, r). \quad (7)$$

Intuitively, Theorem 1 states that if we replace the 1-step constraints with $(t+1)$-step constraints, we obtain the same solution as $\mathrm{MARL}(r)$, since $(t+1)$-step TD updates (over one agent at a time) are still stationary with respect to a Nash equilibrium solution.
So the constraints can be unrolled for $t$ steps and rewritten as $\hat{v}_i(s^{(0)}) \geq Q^{(t)}_i(\{s^{(j)}, a^{(j)}_i\}_{j=0}^{t}; \pi, r)$ (corresponding to Equation (7)).

3.2 Multi-agent inverse reinforcement learning

We are now ready to construct the Lagrangian dual of the primal in Equation (6), using the equivalent formulation from Theorem 1. The first observation is that for any policy $\pi$, $f_r(\pi, \hat{v}) = 0$ when $\hat{v}$ is defined as in Theorem 1 (see Lemma 1 in the appendix). Therefore, we only need to consider the "unrolled" constraints from Theorem 1, obtaining the following dual problem:

$$\max_{\lambda \geq 0} \min_{\pi} L^{(t+1)}_r(\pi, \lambda) \triangleq \sum_{i=1}^{N} \sum_{\tau_i \in \mathcal{T}^t_i} \lambda(\tau_i) \left( Q^{(t)}_i(\tau_i; \pi, r) - \hat{v}_i(s^{(0)}; \pi, r) \right) \quad (8)$$

where $\mathcal{T}^t_i$ is the set of all length-$t$ trajectories of the form $\{s^{(j)}, a^{(j)}_i\}_{j=0}^{t}$ with $s^{(0)}$ as the initial state, $\lambda$ is a vector of $N \cdot |\mathcal{T}^t_i|$ Lagrange multipliers, and $\hat{v}$ is defined as in Theorem 1. This dual formulation is a sum over agents and trajectories, which uniquely corresponds to the constraints in Equation (7).

In the following theorem, we show that for a specific choice of $\lambda$ we can recover the difference of the sums of expected rewards between two policies, a performance gap similar to the one used in single-agent IRL in Equation (4). This amounts to "relaxing" the primal problem.

Theorem 2. For any two policies $\pi^\star$ and $\pi$, let

$$\lambda^\star_\pi(\tau_i) = \eta(s^{(0)})\, \pi_i(a^{(0)}_i \mid s^{(0)}) \prod_{j=1}^{t} \pi_i(a^{(j)}_i \mid s^{(j)}) \sum_{a^{(j-1)}_{-i}} P(s^{(j)} \mid s^{(j-1)}, a^{(j-1)})\, \pi^\star_{-i}(a^{(j-1)}_{-i} \mid s^{(j-1)})$$

be the probability of generating the sequence $\tau_i$ using policy $\pi_i$ and $\pi^\star_{-i}$.
Then

$$\lim_{t \to \infty} L^{(t+1)}_r(\pi^\star, \lambda^\star_\pi) = \sum_{i=1}^{N} \mathbb{E}_{\pi_i, \pi^\star_{-i}}[r_i(s, a)] - \sum_{i=1}^{N} \mathbb{E}_{\pi^\star_i, \pi^\star_{-i}}[r_i(s, a)] \quad (9)$$

where $L^{(t+1)}_r(\pi^\star, \lambda^\star_\pi)$ corresponds to the dual function where the multipliers are the probability of generating their respective trajectories of length $t$.

We provide a proof in Appendix A.3. Intuitively, the $\lambda^\star(\tau_i)$ weights correspond to the probability of generating trajectory $\tau_i$ when the policy is $\pi_i$ for agent $i$ and $\pi^\star_{-i}$ for the other agents. As $t \to \infty$, the first term of the objective in Equation (8), $\sum_{i=1}^{N} \sum_{\tau_i \in \mathcal{T}^t_i} \lambda(\tau_i) Q^{(t)}_i(\tau_i)$, converges to the expected total reward $\mathbb{E}_{\pi_i, \pi^\star_{-i}}[r_i]$, the first term of the right-hand side of Equation (9). The marginal of $\lambda^\star$ over the initial state is the initial state distribution, so the second term of the objective, $\sum_{s} \hat{v}_i(s) \eta(s)$, converges to $\mathbb{E}_{\pi^\star_i, \pi^\star_{-i}}[r_i]$, the second term of the right-hand side. Thus, the two sides of Equation (9) coincide as $t \to \infty$. We can also view the right-hand side of Equation (9) as the case where the policies $\pi^\star_{-i}$ are part of the environment.

Theorem 2 motivates the following definition of multi-agent IRL with regularizer $\psi$:

$$\mathrm{MAIRL}_\psi(\pi_E) = \arg\max_{r} -\psi(r) + \sum_{i=1}^{N} \mathbb{E}_{\pi_E}[r_i] - \sum_{i=1}^{N} \max_{\pi} \left( \beta H_i(\pi_i) + \mathbb{E}_{\pi_i, \pi_{E_{-i}}}[r_i] \right),$$

where $H_i(\pi_i) = \mathbb{E}_{\pi_i, \pi_{E_{-i}}}[-\log \pi_i(a_i \mid s)]$ is the discounted causal entropy for policy $\pi_i$ when the other agents follow $\pi_{E_{-i}}$, and $\beta$ is a hyper-parameter controlling the strength of the entropy regularization term as in [16].
This formulation is a strict generalization of the single-agent IRL in [16]:

Corollary 2.1. If $N = 1$ and $\beta = 1$, then $\mathrm{MAIRL}_\psi(\pi_E) = \mathrm{IRL}_\psi(\pi_E)$.

Furthermore, if the regularizer $\psi$ is additively separable and, for each agent $i$, $\pi_{E_i}$ is the unique optimal response to the other experts $\pi_{E_{-i}}$, we obtain the following:

Theorem 3. Assume that $\psi(r) = \sum_{i=1}^{N} \psi_i(r_i)$ with $\psi_i$ convex for each $i \in [N]$, and that $\mathrm{MARL}(r)$ has a unique solution² for all $r \in \mathrm{MAIRL}_\psi(\pi_E)$. Then

$$\mathrm{MARL} \circ \mathrm{MAIRL}_\psi(\pi_E) = \arg\min_{\pi \in \Pi} \sum_{i=1}^{N} -\beta H_i(\pi_i) + \psi^\star_i(\rho_{\pi_i, E_{-i}} - \rho_{\pi_E}) \quad (10)$$

where $\pi_{i, E_{-i}}$ denotes $\pi_i$ for agent $i$ and $\pi_{E_{-i}}$ for the other agents.

²The set of Nash equilibria is not always convex, so we have to assume MARL(r) returns a unique solution.

The above theorem suggests that $\psi$-regularized multi-agent inverse reinforcement learning seeks, for each agent $i$, a policy whose occupancy measure is close to the one obtained by replacing policy $\pi_i$ with the expert $\pi_{E_i}$, as measured by the convex function $\psi^\star_i$.

However, we do not assume access to the expert policy $\pi_E$ during training, so it is not possible to obtain $\rho_{\pi_i, E_{-i}}$. Therefore, we consider an alternative approach where we match the occupancy measures of $\rho_{\pi_E}$ and $\rho_\pi$. We obtain our practical algorithm by selecting an adversarial reward function regularizer and removing the effect of the entropy regularizers.

Proposition 2. If $\beta = 0$ and $\psi(r) = \sum_{i=1}^{N} \psi_i(r_i)$, where $\psi_i(r_i) = \mathbb{E}_{\pi_E}[g(r_i)]$ with

$$g(x) = \begin{cases} -x - \log(1 - e^x) & \text{if } x < 0 \\ +\infty & \text{otherwise} \end{cases}$$

(the adversarial regularizer of [16], applied per agent), then

$$\arg\min_{\pi} \sum_{i=1}^{N} \psi^\star_i(\rho_{\pi_i, \pi_{E_{-i}}} - \rho_{\pi_E}) = \arg\min_{\pi} \sum_{i=1}^{N} \psi^\star_i(\rho_{\pi_i, \pi_{-i}} - \rho_{\pi_E}) \quad (11)$$

and both are equal to $\pi_E$.

Theorem 3 and Proposition 2 highlight the differences from the single-agent scenario. In Theorem 3 we assume that $\mathrm{MARL}(r)$ has a unique solution, which is always true in the single-agent case due to the convexity of the space of optimal policies. In Proposition 2 we remove the entropy regularizer because the causal entropy for $\pi_i$ may depend on the policies of the other agents: the entropy on the left-hand side of Equation (11) conditions on $\pi_{E_{-i}}$, while the entropy on the right-hand side conditions on $\pi_{-i}$ (both would disappear in the single-agent case).

4 Practical multi-agent imitation learning

Despite recent successes in deep RL, it is notoriously hard to train policies with RL algorithms because of high-variance gradient estimates. This is further exacerbated in Markov games, since an agent's optimal policy depends on the other agents [14, 17]. In this section, we address these problems and propose practical algorithms for multi-agent imitation.

4.1 Multi-agent generative adversarial imitation learning

We select $\psi_i$ as our reward function regularizer, as in Proposition 2; this corresponds to the two-player game introduced in Generative Adversarial Imitation Learning (GAIL, [16]). For each agent $i$, we have a discriminator (denoted $D_{\omega_i}$) mapping state-action pairs to scores, optimized to discriminate expert demonstrations from behaviors produced by $\pi_i$.
Implicitly, $D_{\omega_i}$ plays the role of a reward function for the generator, which in turn attempts to train the agent to maximize its reward, thus fooling the discriminator. We optimize the following objective:

$$\min_{\theta} \max_{\omega} \mathbb{E}_{\pi_\theta}\left[ \sum_{i=1}^{N} \log D_{\omega_i}(s, a_i) \right] + \mathbb{E}_{\pi_E}\left[ \sum_{i=1}^{N} \log(1 - D_{\omega_i}(s, a_i)) \right] \quad (12)$$

We update $\pi_\theta$ through reinforcement learning, where we also use a baseline $V_\phi$ to reduce variance. We outline the algorithm, Multi-Agent GAIL (MAGAIL), in Appendix B.

We can augment the reward regularizer $\psi(r)$ using an indicator $y(r)$ denoting whether $r$ fits our prior knowledge; the augmented reward regularizer $\hat{\psi} : \mathbb{R}^{S \times A} \to \mathbb{R} \cup \{\infty\}$ is then $\psi(r)$ if $y(r) = 1$ and $\infty$ if $y(r) = 0$. We introduce three types of $y(r)$ for common settings.

Centralized. The easiest case is to assume that the agents are fully cooperative, i.e., they share the same reward function. Here $y(r) = \mathbb{I}(r_1 = r_2 = \ldots = r_N)$ and $\psi(r) = \psi_{\mathrm{GA}}(r)$. One could argue this corresponds to the GAIL case, where the RL procedure operates on multiple agents (a joint policy).

Decentralized. We make no prior assumptions about the correlation between the rewards. Here $y(r) = \mathbb{I}(r_i \in \mathbb{R}^{O_i \times A_i})$ and $\psi_i(r_i) = \psi_{\mathrm{GA}}(r_i)$. This corresponds to one discriminator for each agent, which discriminates the trajectories as observed by agent $i$. However, these discriminators are not learned independently, as they interact indirectly via the environment.

Figure 1: Different MAGAIL algorithms obtained with different priors on the reward structure: (a) centralized (cooperative), (b) decentralized (mixed), (c) zero-sum (competitive). The discriminator tries to assign higher rewards to the top row and lower rewards to the bottom row.
In centralized and decentralized, the policies interact with the environment to match the expert rewards. In zero-sum, the policies do not interact with the environment; expert and policy trajectories are paired together as input to the discriminator.

Zero-sum. Assume there are two agents that receive opposite rewards, so $r_1 = -r_2$. As such, $\psi$ is no longer additively separable. Nevertheless, an adversarial training procedure can be designed using the following fact:

$$v(\pi_{E_1}, \pi_2) \geq v(\pi_{E_1}, \pi_{E_2}) \geq v(\pi_1, \pi_{E_2}) \quad (13)$$

where $v(\pi_1, \pi_2) = \mathbb{E}_{\pi_1, \pi_2}[r_1(s, a)]$ is the expected outcome for agent 1, and is modeled by the discriminator. The discriminator can then try to maximize $v$ for trajectories from $(\pi_{E_1}, \pi_2)$ and minimize $v$ for trajectories from $(\pi_1, \pi_{E_2})$, according to Equation (13). These three settings are summarized in Figure 1.

4.2 Multi-agent actor-critic with Kronecker factors

To optimize over the generator parameters $\theta$ in Eq. (12), we wish to use an algorithm for multi-agent RL with good sample efficiency in practice. Our algorithm, which we refer to as Multi-agent Actor-Critic with Kronecker-factors (MACK), is based on Actor-Critic with Kronecker-factored Trust Region (ACKTR, [25-27]), a state-of-the-art natural policy gradient [28, 29] method in deep RL. MACK uses the framework of centralized training with decentralized execution [17]: policies are trained with additional information to reduce variance, but that information is not used at execution time.
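MACK's k-step advantage estimate with a centralized baseline (Equation (14) below) can be sketched as follows; the tabular rollout and the baseline V here are hypothetical stand-ins for MACK's learned value networks:

```python
def k_step_advantage(rewards, states, joint_actions, V, i, k, gamma):
    """k-step advantage for agent i with a centralized baseline V(s, a_minus_i),
    illustrating centralized training with decentralized execution:
    the baseline sees the other agents' actions, the policy does not."""
    advantages = []
    T = len(rewards)
    for t in range(T):
        horizon = min(k, T - t)
        # Discounted sum of agent i's rewards over up to k steps.
        ret = sum(gamma ** j * rewards[t + j] for j in range(horizon))
        if t + horizon < T:  # bootstrap with the centralized baseline
            s, a = states[t + horizon], joint_actions[t + horizon]
            ret += gamma ** horizon * V(s, a[:i] + a[i + 1:])
        s, a = states[t], joint_actions[t]
        advantages.append(ret - V(s, a[:i] + a[i + 1:]))
    return advantages

# Toy rollout: constant reward 1 for agent 0, baseline that always predicts 0.
adv = k_step_advantage(
    rewards=[1.0, 1.0, 1.0], states=[0, 1, 2],
    joint_actions=[(0, 0), (0, 0), (0, 0)],
    V=lambda s, a_minus_i: 0.0, i=0, k=2, gamma=0.9,
)
print(adv)  # prints [1.9, 1.9, 1.0]
```

Only the baseline conditions on $a_{-i}$; each agent's policy still acts from its own observation, so the extra information is discarded at execution time.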
We let the advantage function of every agent be a function of all agents' observations and actions:

$$A^{\pi_i}_{\phi_i}(s_t, a_t) = \sum_{j=0}^{k-1} \gamma^j r_i(s_{t+j}, a_{t+j}) + \gamma^k V^{\pi_i}_{\phi_i}(s_{t+k}, a_{-i,t+k}) - V^{\pi_i}_{\phi_i}(s_t, a_{-i,t}) \quad (14)$$

where $V^{\pi_i}_{\phi_i}(s_k, a_{-i})$ is the baseline for agent $i$, utilizing the additional information ($a_{-i}$) for variance reduction. We use (approximate) natural policy gradients to update both $\theta$ and $\phi$, but without trust regions for scheduling the learning rate; we use a linearly decaying learning-rate schedule instead.

MACK has some notable differences from Multi-Agent Deep Deterministic Policy Gradient [14]. On the one hand, MACK does not assume knowledge of the other agents' policies, nor does it try to infer them; the value estimator merely collects experience from other agents (and treats them as black boxes). On the other hand, MACK does not require gradient estimators such as Gumbel-softmax [30, 31] to optimize over discrete actions, which is necessary for DDPG [32].

5 Experiments

We evaluate the performance of (centralized, decentralized, and zero-sum versions of) MAGAIL under two types of environments. One is a particle environment which allows for complex interactions and behaviors; the other is a control task, where multiple agents try to cooperate and move a plank forward. We collect results by averaging over 5 random seeds. Our implementation is based on OpenAI baselines [33]; please refer to Appendix C for implementation details.³

³Code for reproducing the experiments is at https://github.com/ermongroup/multiagent-gail.

Figure 2: Average true reward from cooperative tasks. Performance of experts and random policies are normalized to one and zero respectively.
We use inverse log scale for better comparison.

We compare our methods (centralized, decentralized, zero-sum MAGAIL) with two baselines. The first is behavior cloning (BC), which learns a maximum likelihood estimate for $a_i$ given each state $s$ and does not require actions from other agents. The second baseline is a GAIL IRL baseline that operates on each agent separately: for each agent, we first pretrain the other agents with BC, and then train the agent with GAIL; we then gather the trained GAIL policies from all the agents and evaluate their performance.

5.1 Particle environments

We first consider the particle environment proposed in [14], which consists of several agents and landmarks. We consider two cooperative environments and two competitive ones. All environments have an underlying true reward function that allows us to evaluate the performance of learned agents. The environments include: Cooperative Communication - two agents must cooperate to reach one of three colored landmarks. One agent ("speaker") knows the goal but cannot move, so it must convey the message to the other agent ("listener") that moves but does not observe the goal. Cooperative Navigation - three agents must cooperate through physical actions to reach three landmarks; ideally, each agent should cover a single landmark. Keep-Away - two agents have contradictory goals, where agent 1 tries to reach one of the two targeted landmarks, while agent 2 (the adversary) tries to keep agent 1 from reaching its target. The adversary does not observe the target, so it must act based on agent 1's actions.
Predator-Prey - three slower cooperating adversaries must chase the faster agent in a randomly generated environment with obstacles; the adversaries are rewarded by touching the agent, while the agent is penalized.

For the cooperative tasks, we use an analytic expression defining the expert policy; for the competitive tasks, we use MACK to train expert policies based on the true underlying rewards (using larger policy and value networks than the ones we use for imitation). We then use the expert policies to simulate trajectories $\mathcal{D}$, and then perform imitation learning on $\mathcal{D}$ as demonstrations, where we assume the underlying rewards are unknown. Following [34], we pretrain our Multi-Agent GAIL methods and the GAIL baseline using behavior cloning as initialization to reduce the sample complexity of exploration. We consider 100 to 400 episodes of expert demonstrations, each with 50 timesteps, which is close to the amount of timesteps used for the control tasks in [16]. Moreover, we randomly sample the starting positions of agents and landmarks in each episode, so our policies have to learn to generalize when they encounter new settings.

5.1.1 Cooperative tasks

We evaluate performance in cooperative tasks via the average expected reward obtained by all the agents in an episode. In this environment, the starting state is randomly initialized, so generalization is crucial. We do not consider the zero-sum case, since it violates the cooperative nature of the task. We display the performance of centralized, decentralized, GAIL and BC in Figure 2.

Naturally, the performance of BC and MAGAIL increases with more expert demonstrations. MAGAIL performs consistently better than BC in all settings; interestingly, in the cooperative communication task, centralized MAGAIL is able to achieve expert-level performance with only 200 demonstrations, but BC fails to come close even with 400 trajectories.
Moreover, the centralized MAGAIL performs slightly better than decentralized MAGAIL due to the better prior, but decentralized MAGAIL still learns a highly correlated reward between the two agents.

[Figure 2: Normalized rewards (inverse log scale) versus the number of expert demonstrations (100–400) on Cooperative Communication and Cooperative Navigation, for Expert, Random, BC, GAIL, Centralized, and Decentralized.]

Table 1: Average agent rewards in competitive tasks. We compare behavior cloning (BC), GAIL (G), Centralized (C), Decentralized (D), and Zero-Sum (ZS) methods; whether high or low rewards are preferable depends on the agent vs. adversary role.

Predator-Prey
  Agent BC vs. Adversary BC / G / C / D / ZS:  -93.20 / -93.71 / -93.75 / -95.22 / -95.48
  Agent G / C / D / ZS vs. Adversary BC:  -90.55 / -91.36 / -85.00 / -89.4

Keep-Away
  Agent BC vs. Adversary BC / G / C / D / ZS:  24.22 / 24.04 / 23.28 / 23.56 / 23.19
  Agent G / C / D / ZS vs. Adversary BC:  26.22 / 26.61 / 28.73 / 27.80

5.1.2 Competitive tasks

We consider all three types of Multi-Agent GAIL (centralized, decentralized, zero-sum) and BC in both competitive tasks. Since there are two opposing sides, it is hard to measure performance directly. Therefore, we compare by letting (agents trained by) BC play against (adversaries trained by) other methods, and vice versa.
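This cross-play protocol can be sketched as follows. Here `play_episode` is a toy stand-in for an environment rollout and the scalar "policies" are purely illustrative assumptions; the real evaluation rolls out trained neural policies in the particle environments:

```python
import itertools
import random

def play_episode(agent_skill, adversary_skill, seed):
    # Toy stand-in for one environment rollout: returns the agent's
    # episode reward. A real rollout would step the particle environment
    # with both trained policies and accumulate the true reward.
    rng = random.Random(seed)
    return agent_skill - adversary_skill + rng.gauss(0, 0.1)

def cross_play(agents, adversaries, n_episodes=50):
    # Pit every agent method against every adversary method and
    # average the agent's episode rewards, as in Table 1.
    table = {}
    for (a_name, a_pi), (d_name, d_pi) in itertools.product(
            agents.items(), adversaries.items()):
        rewards = [play_episode(a_pi, d_pi, seed) for seed in range(n_episodes)]
        table[(a_name, d_name)] = sum(rewards) / n_episodes
    return table

# Hypothetical per-method "skill" scalars standing in for trained policies.
agents = {"BC": 0.0, "D": 0.5}
adversaries = {"BC": 0.0, "ZS": 0.4}
results = cross_play(agents, adversaries)
```

Fixing one side to BC in both directions gives a common opponent, so the per-method averages are directly comparable even though the game is adversarial.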
From Table 1, decentralized and zero-sum MAGAIL often perform better than centralized MAGAIL and BC, which suggests that selecting a suitable prior ψ̂ is important for good empirical performance.

5.2 Cooperative control

In some cases we are presented with sub-optimal expert demonstrations because the environment has changed; we consider this case in a cooperative control task [35], where N bipedal walkers cooperate to move a long plank forward; the agents have an incentive to collaborate, since the plank is much longer than any single agent. The expert demonstrates its policy in an environment with no bumps on the ground and heavy weights, while we perform imitation in a new environment with bumps and lighter weights (so an agent is likely to use too much force). Agents trained with BC tend to act too aggressively and fail, whereas agents trained with centralized MAGAIL can adapt to the new environment. With 10 (imperfect) expert demonstrations, BC agents fail 39.8% of the time (with a reward of 1.26), while centralized MAGAIL agents fail only 26.2% of the time (with a reward of 26.57). We show videos of the respective policies in the supplementary material.

6 Discussion

There is a vast literature on single-agent imitation learning [36]. Behavior Cloning (BC) learns the policy through supervised learning [37]. Inverse Reinforcement Learning (IRL) assumes the expert policy optimizes some unknown reward, recovers that reward, and then learns the policy through reinforcement learning (RL). BC does not require knowledge of transition probabilities or access to the environment, but it suffers from compounding errors and covariate shift [38, 23].

Most existing work in multi-agent imitation learning assumes the agents have very specific reward structures.
The most common case is fully cooperative agents [39], where the challenges mainly lie in other factors, such as unknown role assignments [40], scalability to swarm systems [41] and agents with partial observations [42]. In non-cooperative settings, [43] consider the case of IRL for two-player zero-sum games and cast the IRL problem as Bayesian inference, while [44] assume agents are non-cooperative but the reward function is a linear combination of pre-specified features.

Our work is the first to propose a general multi-agent IRL framework that combines state-of-the-art multi-agent reinforcement learning methods [14, 17] and implicit generative models such as generative adversarial networks [45]. Experimental results demonstrate that it is able to imitate complex behaviors in high-dimensional environments with both cooperative and adversarial interactions. An interesting future direction is to explore new paradigms for learning from experts, such as allowing the expert to participate in the agent's learning process [24].

Acknowledgements

This work was supported by Toyota Research Institute and Future of Life Institute. The authors would like to thank Lantao Yu for discussions on the implementation.

References

[1] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, “Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures,” arXiv preprint arXiv:1802.01561, 2018.

[2] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan, “Inverse reward design,” in Advances in Neural Information Processing Systems, pp. 6768–6777, 2017.

[3] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in ai safety,” arXiv preprint arXiv:1606.06565, 2016.

[4] D. Amodei and J.
Clark, “Faulty reward functions in the wild,” 2016.

[5] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang, “Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games,” arXiv preprint arXiv:1703.10069, 2017.

[6] L. Matignon, L. Jeanpierre, A.-I. Mouaddib, et al., “Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes,” in AAAI, 2012.

[7] J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel, “Multi-agent reinforcement learning in sequential social dilemmas,” in Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473, International Foundation for Autonomous Agents and Multiagent Systems, 2017.

[8] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in AAAI, vol. 8, pp. 1433–1438, Chicago, IL, USA, 2008.

[9] P. Englert and M. Toussaint, “Inverse KKT: learning cost functions of manipulation tasks from demonstrations,” in Proceedings of the International Symposium of Robotics Research, 2015.

[10] C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” in International Conference on Machine Learning, pp. 49–58, 2016.

[11] B. Stadie, P. Abbeel, and I. Sutskever, “Third person imitation learning,” in ICLR, 2017.

[12] A. Y. Ng, S. J. Russell, et al., “Algorithms for inverse reinforcement learning,” in ICML, pp. 663–670, 2000.

[13] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the twenty-first international conference on Machine learning, p. 1, ACM, 2004.

[14] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I.
Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” arXiv preprint arXiv:1706.02275, 2017.

[15] J. Hu, M. P. Wellman, et al., “Multiagent reinforcement learning: theoretical framework and an algorithm,” in ICML, vol. 98, pp. 242–250, Citeseer, 1998.

[16] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

[17] J. Foerster, Y. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.

[18] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Proceedings of the eleventh international conference on machine learning, vol. 157, pp. 157–163, 1994.

[19] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, vol. 1. MIT Press, Cambridge, 1998.

[20] M. Bloem and N. Bambos, “Infinite time horizon maximum causal entropy inverse reinforcement learning,” in Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pp. 4911–4916, IEEE, 2014.

[21] J. Filar and K. Vrieze, Competitive Markov decision processes. Springer Science & Business Media, 2012.

[22] H. Prasad and S. Bhatnagar, “A study of gradient descent schemes for general-sum stochastic games,” arXiv preprint arXiv:1507.00093, 2015.

[23] S. Ross, G. J. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in AISTATS, p. 6, 2011.

[24] D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan, “Cooperative inverse reinforcement learning,” in Advances in neural information processing systems, pp. 3909–3917, 2016.

[25] Y. Wu, E. Mansimov, R. B. Grosse, S.
Liao, and J. Ba, \u201cScalable trust-region method for\ndeep reinforcement learning using kronecker-factored approximation,\u201d in Advances in neural\ninformation processing systems, pp. 5285\u20135294, 2017.\n\n[26] Y. Song, J. Song, and S. Ermon, \u201cAccelerating natural gradient with higher-order invariance,\u201d in\n\nInternational Conference on Machine Learning (ICML), 2018.\n\n[27] Y. Song, R. Shu, N. Kushman, and S. Ermon, \u201cConstructing unrestricted adversarial examples\n\nwith generative models,\u201d arXiv preprint arXiv:1805.07894, 2018.\n\n[28] S.-I. Amari, \u201cNatural gradient works ef\ufb01ciently in learning,\u201d Neural computation, vol. 10, no. 2,\n\npp. 251\u2013276, 1998.\n\n[29] S. M. Kakade, \u201cA natural policy gradient,\u201d in Advances in neural information processing\n\nsystems, pp. 1531\u20131538, 2002.\n\n[30] E. Jang, S. Gu, and B. Poole, \u201cCategorical reparameterization with gumbel-softmax,\u201d arXiv\n\npreprint arXiv:1611.01144, 2016.\n\n[31] C. J. Maddison, A. Mnih, and Y. W. Teh, \u201cThe concrete distribution: A continuous relaxation of\n\ndiscrete random variables,\u201d arXiv preprint arXiv:1611.00712, 2016.\n\n[32] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra,\n\u201cContinuous control with deep reinforcement learning,\u201d arXiv preprint arXiv:1509.02971, 2015.\n\n[33] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor,\n\nand Y. Wu, \u201cOpenai baselines.\u201d https://github.com/openai/baselines, 2017.\n\n[34] Y. Li, J. Song, and S. Ermon, \u201cInfogail: Interpretable imitation learning from visual demonstra-\n\ntions,\u201d arXiv preprint arXiv:1703.08840, 2017.\n\n[35] J. K. Gupta and M. Egorov, \u201cMulti-agent deep reinforcement learning environment.\u201d https:\n\n//github.com/sisl/madrl, 2017.\n\n[36] J. A. Bagnell, \u201cAn invitation to imitation,\u201d tech. 
rep., Carnegie Mellon University, Robotics Institute, 2015.

[37] D. A. Pomerleau, “Efficient training of artificial neural networks for autonomous navigation,” Neural Computation, vol. 3, no. 1, pp. 88–97, 1991.

[38] S. Ross and D. Bagnell, “Efficient reductions for imitation learning,” in AISTATS, pp. 3–5, 2010.

[39] S. Barrett, A. Rosenfeld, S. Kraus, and P. Stone, “Making friends on the fly: Cooperating with new teammates,” Artificial Intelligence, vol. 242, pp. 132–171, 2017.

[40] H. M. Le, Y. Yue, and P. Carr, “Coordinated multi-agent imitation learning,” arXiv preprint arXiv:1703.03121, 2017.

[41] A. Šošić, W. R. KhudaBukhsh, A. M. Zoubir, and H. Koeppl, “Inverse reinforcement learning in swarm systems,” stat, vol. 1050, p. 17, 2016.

[42] K. Bogert and P. Doshi, “Multi-robot inverse reinforcement learning under occlusion with interactions,” in Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pp. 173–180, International Foundation for Autonomous Agents and Multiagent Systems, 2014.

[43] X. Lin, P. A. Beling, and R. Cogill, “Multi-agent inverse reinforcement learning for zero-sum games,” arXiv preprint arXiv:1403.6508, 2014.

[44] T. S. Reddy, V. Gopikrishna, G. Zaruba, and M. Huber, “Inverse reinforcement learning for decentralized non-cooperative multiagent systems,” in Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on, pp. 1930–1935, IEEE, 2012.

[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.

[46] J. Martens and R.
Grosse, “Optimizing neural networks with kronecker-factored approximate curvature,” in International Conference on Machine Learning, pp. 2408–2417, 2015.

[47] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.