{"title": "Cooperative Inverse Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3909, "page_last": 3917, "abstract": "For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans. We propose a formal definition of the value alignment problem as cooperative inverse reinforcement learning (CIRL). A CIRL problem is a cooperative, partial- information game with two agents, human and robot; both are rewarded according to the human\u2019s reward function, but the robot does not initially know what this is. In contrast to classical IRL, where the human is assumed to act optimally in isolation, optimal CIRL solutions produce behaviors such as active teaching, active learning, and communicative actions that are more effective in achieving value alignment. We show that computing optimal joint policies in CIRL games can be reduced to solving a POMDP, prove that optimality in isolation is suboptimal in CIRL, and derive an approximate CIRL algorithm.", "full_text": "Cooperative Inverse Reinforcement Learning\n\nDylan Had\ufb01eld-Menell\u2217\n\nUniversity of California at Berkeley\n\nBerkeley, CA 94709\n\nAnca Dragan\n\nPieter Abbeel\nElectrical Engineering and Computer Science\n\nStuart Russell\n\nAbstract\n\nFor an autonomous system to be helpful to humans and to pose no unwarranted\nrisks, it needs to align its values with those of the humans in its environment in\nsuch a way that its actions contribute to the maximization of value for the humans.\nWe propose a formal de\ufb01nition of the value alignment problem as cooperative\ninverse reinforcement learning (CIRL). 
A CIRL problem is a cooperative, partial-\ninformation game with two agents, human and robot; both are rewarded according\nto the human\u2019s reward function, but the robot does not initially know what this\nis. In contrast to classical IRL, where the human is assumed to act optimally in\nisolation, optimal CIRL solutions produce behaviors such as active teaching, active\nlearning, and communicative actions that are more effective in achieving value\nalignment. We show that computing optimal joint policies in CIRL games can be\nreduced to solving a POMDP, prove that optimality in isolation is suboptimal in\nCIRL, and derive an approximate CIRL algorithm.\n\n1\n\nIntroduction\n\n\u201cIf we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere\neffectively . . . we had better be quite sure that the purpose put into the machine is the purpose which\nwe really desire.\u201d So wrote Norbert Wiener (1960) in one of the earliest explanations of the problems\nthat arise when a powerful autonomous system operates with an incorrect objective. This value\nalignment problem is far from trivial. Humans are prone to mis-stating their objectives, which can\nlead to unexpected implementations. In the myth of King Midas, the main character learns that\nwishing for \u2018everything he touches to turn to gold\u2019 leads to disaster. 
In a reinforcement learning\ncontext, Russell & Norvig (2010) describe a seemingly reasonable, but incorrect, reward function for\na vacuum robot: if we reward the action of cleaning up dirt, the optimal policy causes the robot to\nrepeatedly dump and clean up the same dirt.\nA solution to the value alignment problem has long-term implications for the future of AI and its\nrelationship to humanity (Bostrom, 2014) and short-term utility for the design of usable AI systems.\nGiving robots the right objectives and enabling them to make the right trade-offs is crucial for\nself-driving cars, personal assistants, and human\u2013robot interaction more broadly.\nThe \ufb01eld of inverse reinforcement learning or IRL (Russell, 1998; Ng & Russell, 2000; Abbeel\n& Ng, 2004) is certainly relevant to the value alignment problem. An IRL algorithm infers the\nreward function of an agent from observations of the agent\u2019s behavior, which is assumed to be\noptimal (or approximately so). One might imagine that IRL provides a simple solution to the value\nalignment problem: the robot observes human behavior, learns the human reward function, and\nbehaves according to that function. This simple idea has two \ufb02aws. The \ufb01rst \ufb02aw is obvious: we\ndon\u2019t want the robot to adopt the human reward function as its own. For example, human behavior\n(especially in the morning) often conveys a desire for coffee, and the robot can learn this with IRL,\nbut we don\u2019t want the robot to want coffee! This \ufb02aw is easily \ufb01xed: we need to formulate the value\n\n\u2217{dhm, anca, pabbeel, russell}@cs.berkeley.edu\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\falignment problem so that the robot always has the \ufb01xed objective of optimizing reward for the\nhuman, and becomes better able to do so as it learns what the human reward function is.\nThe second \ufb02aw is less obvious, and less easy to \ufb01x. 
IRL assumes that observed behavior is optimal\nin the sense that it accomplishes a given task ef\ufb01ciently. This precludes a variety of useful teaching\nbehaviors. For example, ef\ufb01ciently making a cup of coffee, while the robot is a passive observer, is an\ninef\ufb01cient way to teach a robot to get coffee. Instead, the human should perhaps explain the steps in\ncoffee preparation and show the robot where the backup coffee supplies are kept and what to do if the\ncoffee pot is left on the heating plate too long, while the robot might ask what the button with the\npuffy steam symbol is for and try its hand at coffee making with guidance from the human, even if\nthe \ufb01rst results are undrinkable. None of these things \ufb01t in with the standard IRL framework.\nCooperative inverse reinforcement learning. We propose, therefore, that value alignment should\nbe formulated as a cooperative and interactive reward maximization process. More precisely, we\nde\ufb01ne a cooperative inverse reinforcement learning (CIRL) game as a two-player game of partial\ninformation, in which the \u201chuman\u201d, H, knows the reward function (represented by a generalized\nparameter \u03b8), while the \u201crobot\u201d, R, does not; the robot\u2019s payoff is exactly the human\u2019s actual reward.\nOptimal solutions to this game maximize human reward; we show that solutions may involve active\ninstruction by the human and active learning by the robot.\nReduction to POMDP and Suf\ufb01cient Statistics. As one might expect, the structure of CIRL games\nis such that they admit more ef\ufb01cient solution algorithms than are possible for general partial-\ninformation games. Let (\u03c0H, \u03c0R) be a pair of policies for human and robot, each depending, in\ngeneral, on the complete history of observations and actions. A policy pair yields an expected sum of\nrewards for each player. 
CIRL games are cooperative, so there is a well-de\ufb01ned optimal policy pair\nthat maximizes value.2 In Section 3 we reduce the problem of computing an optimal policy pair to the\nsolution of a (single-agent) POMDP. This shows that the robot\u2019s posterior over \u03b8 is a suf\ufb01cient statistic,\nin the sense that there are optimal policy pairs in which the robot\u2019s behavior depends only on this\nstatistic. Moreover, the complexity of solving the POMDP is exponentially lower than the NEXP-hard\nbound that (Bernstein et al., 2000) obtained by reducing a CIRL game to a general Dec-POMDP.\nApprenticeship Learning and Suboptimality of IRL-Like Solutions. In Section 3.3 we model\napprenticeship learning (Abbeel & Ng, 2004) as a two-phase CIRL game. In the \ufb01rst phase, the\nlearning phase, both H and R can take actions and this lets R learn about \u03b8. In the second phase,\nthe deployment phase, R uses what it learned to maximize reward (without supervision from H).\nWe show that classic IRL falls out as the best-response policy for R under the assumption that the\nhuman\u2019s policy is \u201cdemonstration by expert\u201d (DBE), i.e., acting optimally in isolation as if no robot\nexists. But we show also that this DBE/IRL policy pair is not, in general, optimal: even if the robot\nexpects expert behavior, demonstrating expert behavior is not the best way to teach that algorithm.\nWe give an algorithm that approximately computes H\u2019s best response when R is running IRL under\nthe assumption that rewards are linear in \u03b8 and state features. Section 4 compares this best-response\npolicy with the DBE policy in an example game and provides empirical con\ufb01rmation that the best-\nresponse policy, which turns out to \u201cteach\u201d R about the value landscape of the problem, is better than\nDBE. 
Thus, designers of apprenticeship learning systems should expect that users will violate the\nassumption of expert demonstrations in order to better communicate information about the objective.\n\n2 Related Work\n\nOur proposed model shares aspects with a variety of existing models. We divide the related work into\nthree categories: inverse reinforcement learning, optimal teaching, and principal\u2013agent models.\n\nInverse Reinforcement Learning. Ng & Russell (2000) de\ufb01ne inverse reinforcement learning\n(IRL) as follows: \u201cGiven measurements of an [actor]\u2019s behavior over time.\n. . . Determine the\nreward function being optimized.\u201d The key assumption IRL makes is that the observed behavior is\noptimal in the sense that the observed trajectory maximizes the sum of rewards. We call this the\ndemonstration-by-expert (DBE) assumption. One of our contributions is to prove that this may be\nsuboptimal behavior in a CIRL game, as H may choose to accept less reward on a particular action\nin order to convey more information to R. In CIRL the DBE assumption prescribes a \ufb01xed policy\n\n2A coordination problem of the type described in Boutilier (1999) arises if there are multiple optimal policy\n\npairs; we defer this issue to future work.\n\n2\n\n\fFigure 1: The difference between demonstration-by-expert and instructive demonstration in the\nmobile robot navigation problem from Section 4. Left: The ground truth reward function. Lighter\ngrid cells indicates areas of higher reward. Middle: The demonstration trajectory generated by the\nexpert policy, superimposed on the maximum a-posteriori reward function the robot infers. The robot\nsuccessfully learns where the maximum reward is, but little else. Right: An instructive demonstration\ngenerated by the algorithm in Section 3.4 superimposed on the maximum a-posteriori reward function\nthat the robot infers. 
This demonstration highlights both points of high reward and so the robot learns\na better estimate of the reward.\nfor H. As a result, many IRL algorithms can be derived as state estimation for a best response to\ndifferent \u03c0H, where the state includes the unobserved reward parametrization \u03b8.\nNg & Russell (2000), Abbeel & Ng (2004), and Ratliff et al. (2006) compute constraints that\ncharacterize the set of reward functions so that the observed behavior maximizes reward. In general,\nthere will be many reward functions consistent with this constraint. They use a max-margin heuristic\nto select a single reward function from this set as their estimate. In CIRL, the constraints they compute\ncharacterize R\u2019s belief about \u03b8 under the DBE assumption.\nRamachandran & Amir (2007) and Ziebart et al. (2008) consider the case where \u03c0H is \u201cnoisily\nexpert,\u201d i.e., \u03c0His a Boltzmann distribution where actions or trajectories are selected in proportion\nto the exponent of their value. Ramachandran & Amir (2007) adopt a Bayesian approach and place\nan explicit prior on rewards. Ziebart et al. (2008) places a prior on reward functions indirectly by\nassuming a uniform prior over trajectories. In our model, these assumptions are variations of DBE\nand both implement state estimation for a best response to the appropriate \ufb01xed H.\nNatarajan et al. (2010) introduce an extension to IRL where R observes multiple actors that cooperate\nto maximize a common reward function. This is a different type of cooperation than we consider,\nas the reward function is common knowledge and R is a passive observer. Waugh et al. (2011) and\nKuleshov & Schrijvers (2015) consider the problem of inferring payoffs from observed behavior in a\ngeneral (i.e., non-cooperative) game given observed behavior. 
It would be interesting to consider an\nanalogous extension to CIRL, akin to mechanism design, in which R tries to maximize collective\nutility for a group of Hs that may have competing objectives.\nFern et al. (2014) consider a hidden-goal MDP, a special case of a POMDP where the goal is an\nunobserved part of the state. This can be considered a special case of CIRL, where \u03b8 encodes a\nparticular goal state. The frameworks share the idea that R helps H. The key difference between the\nmodels lies in the treatment of the human (the agent in their terminology). Fern et al. (2014) model\nthe human as part of the environment. In contrast, we treat H as an actor in a decision problem that\nboth actors collectively solve. This is crucial to modeling the human\u2019s incentive to teach.\n\nOptimal Teaching. Because CIRL incentivizes the human to teach, as opposed to maximizing\nreward in isolation, our work is related to optimal teaching: \ufb01nding examples that optimally train\na learner (Balbach & Zeugmann, 2009; Goldman et al., 1993; Goldman & Kearns, 1995). The key\ndifference is that ef\ufb01cient learning is the objective of optimal teaching, while it emerges as a property\nof optimal equilibrium behavior in CIRL.\nCakmak & Lopes (2012) consider an application of optimal teaching where the goal is to teach the\nlearner the reward function for an MDP. The teacher gets to pick initial states from which an expert\nexecutes the reward-maximizing trajectory. The learner uses IRL to infer the reward function, and\nthe teacher picks initial states to minimize the learner\u2019s uncertainty. In CIRL, this approach can be\ncharacterized as an approximate algorithm for H that greedily minimizes the entropy of R\u2019s belief.\nBeyond teaching, several models focus on taking actions that convey some underlying state, not\nnecessarily a reward function. 
Examples include \ufb01nding a motion that best communicates an agent\u2019s\nintention (Dragan & Srinivasa, 2013), or \ufb01nding a natural language utterance that best communicates\na particular grounding (Golland et al., 2010). All of these approaches model the observer\u2019s inference\nprocess and compute actions (motion or speech) that maximize the probability an observer infers the\ncorrect hypothesis or goal. Our approximate solution to CIRL is analogous to these approaches, in\nthat we compute actions that are informative of the correct reward function.\n\nPrincipal\u2013agent models. Value alignment problems are not intrinsic to arti\ufb01cial agents. Kerr\n(1975) describes a wide variety of misaligned incentives in the aptly titled \u201cOn the folly of rewarding\nA, while hoping for B.\u201d In economics, this is known as the principal\u2013agent problem: the principal\n(e.g., the employer) speci\ufb01es incentives so that an agent (e.g., the employee) maximizes the principal\u2019s\npro\ufb01t (Jensen & Meckling, 1976).\nPrincipal\u2013agent models study the problem of generating appropriate incentives in a non-cooperative\nsetting with asymmetric information. In this setting, misalignment arises because the agents that\neconomists model are people and intrinsically have their own desires. In AI, misalignment arises\nentirely from the information asymmetry between the principal and the agent; if we could characterize\nthe correct reward function, we could program it into an arti\ufb01cial agent. 
Gibbons (1998) provides a\nuseful survey of principal\u2013agent models and their applications.\n\n3 Cooperative Inverse Reinforcement Learning\n\nThis section formulates CIRL as a two-player Markov game with identical payoffs, reduces the\nproblem of computing an optimal policy pair for a CIRL game to solving a POMDP, and characterizes\napprenticeship learning as a subclass of CIRL games.\n\n3.1 CIRL Formulation\n\nDe\ufb01nition 1. A cooperative inverse reinforcement learning (CIRL) game M is a two-player Markov\ngame with identical payoffs between a human or principal, H, and a robot or agent, R. The\ngame is described by a tuple, M = \u27e8S,{AH,AR}, T (\u00b7|\u00b7,\u00b7,\u00b7),{\u0398, R(\u00b7,\u00b7,\u00b7;\u00b7)}, P0(\u00b7,\u00b7), \u03b3\u27e9, with the\nfollowing de\ufb01nitions:\n\nS a set of world states: s \u2208 S.\nAH a set of actions for H: aH \u2208 AH.\nAR a set of actions for R: aR \u2208 AR.\nT (\u00b7|\u00b7,\u00b7,\u00b7) a conditional distribution on the next world state, given previous state and action for both agents: T (s\u2032|s, aH, aR).\n\u0398 a set of possible static reward parameters, only observed by H: \u03b8 \u2208 \u0398.\nR(\u00b7,\u00b7,\u00b7;\u00b7) a parameterized reward function that maps world states, joint actions, and reward parameters to real numbers: R : S \u00d7 AH \u00d7 AR \u00d7 \u0398 \u2192 R.\nP0(\u00b7,\u00b7) a distribution over the initial state, represented as tuples: P0(s0, \u03b8).\n\u03b3 a discount factor: \u03b3 \u2208 [0, 1].\n\nWe write the reward for a state\u2013parameter pair as R(s, aH, aR; \u03b8) to distinguish the static reward\nparameters \u03b8 from the changing world state s. The game proceeds as follows. First, the initial\nstate, a tuple (s, \u03b8), is sampled from P0. H observes \u03b8, but R does not. 
This observation model\ncaptures the notion that only the human knows the reward function, while both actors know a prior\ndistribution over possible reward functions. At each timestep t, H and R observe the current state st\nand select their actions aHt and aRt. Both actors receive reward rt = R(st, aHt, aRt; \u03b8) and observe\neach other\u2019s action selection. A state for the next timestep is sampled from the transition distribution,\nst+1 \u223c PT (s\u2032|st, aHt, aRt), and the process repeats.\nBehavior in a CIRL game is de\ufb01ned by a pair of policies, (\u03c0H, \u03c0R), that determine action selection\nfor H and R respectively. In general, these policies can be arbitrary functions of their observation\nhistories; \u03c0H : [AH \u00d7 AR \u00d7 S]\u2217 \u00d7 \u0398 \u2192 AH, \u03c0R : [AH \u00d7 AR \u00d7 S]\u2217 \u2192 AR. The optimal joint\npolicy is the policy that maximizes value. The value of a state is the expected sum of discounted\nrewards under the initial distribution of reward parameters and world states.\nRemark 1. A key property of CIRL is that the human and the robot get rewards determined by the\nsame reward function. This incentivizes the human to teach and the robot to learn without explicitly\nencoding these as objectives of the actors.\n\n\f3.2 Structural Results for Computing Optimal Policy Pairs\nThe analogue in CIRL to computing an optimal policy for an MDP is the problem of computing an\noptimal policy pair. This is a pair of policies that maximizes the expected sum of discounted rewards.\nThis is not the same as \u2018solving\u2019 a CIRL game, as a real-world implementation of a CIRL agent must\naccount for coordination problems and strategic uncertainty (Boutilier, 1999). The optimal policy pair\nrepresents the best H and R can do if they can coordinate perfectly before H observes \u03b8. 
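The interaction protocol in Definition 1 can be sketched in code. This is an illustrative sketch only, not an implementation from the paper; `play_cirl_episode` and its arguments are hypothetical stand-ins for the components of the tuple M:

```python
def play_cirl_episode(T, R, P0, pi_H, pi_R, gamma=0.9, horizon=10):
    """Simulate one episode of a CIRL game (hypothetical helper).

    T, R, P0, pi_H, pi_R stand in for the transition distribution,
    parameterized reward, initial distribution, and the two policies.
    """
    s, theta = P0()                     # (s0, theta) ~ P0; H observes theta
    history = []                        # common history of (aH, aR, s) tuples
    value, discount = 0.0, 1.0
    for t in range(horizon):
        a_H = pi_H(history, s, theta)   # H's policy may condition on theta
        a_R = pi_R(history, s)          # R's policy cannot
        value += discount * R(s, a_H, a_R, theta)  # identical payoff for both
        discount *= gamma
        history.append((a_H, a_R, s))
        s = T(s, a_H, a_R)              # next state from the transition model
    return value
```

Both agents are scored by the same R(s, aH, aR; θ), which is what incentivizes teaching and learning without encoding them as explicit objectives.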
Computing\nan optimal joint policy for a cooperative game is the solution to a decentralized, partially observable\nMarkov decision process (Dec-POMDP). Unfortunately, Dec-POMDPs are NEXP-complete (Bernstein\net al., 2000), so general Dec-POMDP algorithms have a computational complexity that is doubly\nexponential. Fortunately, CIRL games have special structure that reduces this complexity.\nNayyar et al. (2013) show that a Dec-POMDP can be reduced to a coordination-POMDP. The actor in\nthis POMDP is a coordinator that observes all common observations and speci\ufb01es a policy for each\nactor. These policies map each actor\u2019s private information to an action. The structure of a CIRL game\nimplies that the private information is limited to H\u2019s initial observation of \u03b8. This allows the reduction\nto a coordination-POMDP to preserve the size of the (hidden) state space, making the problem easier.\nTheorem 1. Let M be an arbitrary CIRL game with state space S and reward space \u0398. There exists\na (single-actor) POMDP MC with (hidden) state space SC such that |SC| = |S| \u00b7 |\u0398| and, for any\npolicy pair in M, there is a policy in MC that achieves the same sum of discounted rewards.\nTheorem proofs can be found in the supplementary material. An immediate consequence of this\nresult is that R\u2019s belief about \u03b8 is a suf\ufb01cient statistic for optimal behavior.\nCorollary 1. Let M be a CIRL game. There exists an optimal policy pair (\u03c0H\u2217, \u03c0R\u2217) that only\ndepends on the current state and R\u2019s belief.\nRemark 2. In a general Dec-POMDP, the hidden state for the coordinator-POMDP includes each\nactor\u2019s history of observations. In CIRL, \u03b8 is the only private information so we get an exponential\ndecrease in the complexity of the reduced problem. 
This allows one to apply general POMDP\nalgorithms to compute optimal joint policies in CIRL.\n\nIt is important to note that the reduced problem may still be very challenging. POMDPs are dif\ufb01cult\nin their own right and the reduced problem still has a much larger action space. That being said,\nthis reduction is still useful in that it characterizes optimal joint policy computation for CIRL as\nsigni\ufb01cantly easier than Dec-POMDPs. Furthermore, this theorem can be used to justify approximate\nmethods (e.g., iterated best response) that only depend on R\u2019s belief state.\n\n3.3 Apprenticeship Learning as a Subclass of CIRL Games\nA common paradigm for robot learning from humans is apprenticeship learning. In this paradigm,\na human gives demonstrations of a sample task to a robot, and the robot is asked to imitate it in a\nsubsequent task. In what follows, we formulate apprenticeship learning as turn-based CIRL with a\nlearning phase and a deployment phase. We characterize IRL as the best response (i.e., the policy\nthat maximizes reward given a \ufb01xed policy for the other player) to a demonstration-by-expert policy\nfor H. We also show that this policy is, in general, not part of an optimal joint policy and so IRL is\ngenerally a suboptimal approach to apprenticeship learning.\nDe\ufb01nition 2. (ACIRL) An apprenticeship cooperative inverse reinforcement learning (ACIRL) game\nis a turn-based CIRL game with two phases: a learning phase where the human and the robot take\nturns acting, and a deployment phase, where the robot acts independently.\n\nExample. Consider an example apprenticeship task where R needs to help H make of\ufb01ce supplies.\nH and R can make paperclips and staples, and the unobserved \u03b8 describes H\u2019s preference for paperclips\nvs. staples. We model the problem as an ACIRL game in which the learning and deployment phases\neach consist of an individual action. 
The world state in this problem is a tuple (ps, qs, t) where ps\nand qs respectively represent the number of paperclips and staples H owns. t is the round number.\nAn action is a tuple (pa, qa) that produces pa paperclips and qa staples. The human can make 2\nitems total: AH = {(0, 2), (1, 1), (2, 0)}. The robot has different capabilities. It can make 50\nunits of each item or it can choose to make 90 of a single item: AR = {(0, 90), (50, 50), (90, 0)}.\nWe let \u0398 = [0, 1] and de\ufb01ne R so that \u03b8 indicates the relative preference between paperclips and\nstaples: R(s, (pa, qa); \u03b8) = \u03b8pa + (1 \u2212 \u03b8)qa. R\u2019s action is ignored when t = 0 and H\u2019s is ignored\nwhen t = 1. At t = 2, the game is over, so the game transitions to a sink state, (0, 0, 2).\n\n\fDeployment phase \u2014 maximize mean reward estimate. It is simplest to analyze the deployment\nphase \ufb01rst. R is the only actor in this phase so it gets no more observations of its reward. We have\nshown that R\u2019s belief about \u03b8 is a suf\ufb01cient statistic for the optimal policy. This belief about \u03b8 induces\na distribution over MDPs. A straightforward extension of a result due to Ramachandran & Amir\n(2007) shows that R\u2019s optimal deployment policy maximizes reward for the mean reward function.\nTheorem 2. Let M be an ACIRL game. In the deployment phase, the optimal policy for R maximizes\nreward in the MDP induced by the mean \u03b8 from R\u2019s belief.\nIn our example, suppose that \u03c0H selects (0, 2) if \u03b8 \u2208 [0, 1/3), (1, 1) if \u03b8 \u2208 [1/3, 2/3], and (2, 0) otherwise.\nR begins with a uniform prior on \u03b8 so observing, e.g., aH = (0, 2) leads to a posterior distribution\nthat is uniform on [0, 1/3). Theorem 2 shows that the optimal action maximizes reward for the mean \u03b8,\nso an optimal R behaves as though \u03b8 = 1/6 during the deployment phase.\n\nLearning phase \u2014 expert demonstrations are not optimal. 
A wide variety of apprenticeship\nlearning approaches assume that demonstrations are given by an expert. We say that H satis\ufb01es the\ndemonstration-by-expert (DBE) assumption in ACIRL if she greedily maximizes immediate reward\non her turn. This is an \u2018expert\u2019 demonstration because it demonstrates a reward-maximizing action\nbut does not account for that action\u2019s impact on R\u2019s belief. We let \u03c0E represent the DBE policy.\nTheorem 2 enables us to characterize the best response for R when \u03c0H = \u03c0E: use IRL to compute\nthe posterior over \u03b8 during the learning phase and then act to maximize reward under the mean \u03b8 in\nthe deployment phase. We can also analyze the DBE assumption itself. In particular, we show that\n\u03c0E is not H\u2019s best response when \u03c0R is a best response to \u03c0E.\nTheorem 3. There exist ACIRL games where the best response for H to \u03c0R violates the expert\ndemonstrator assumption. In other words, if br(\u03c0) is the best response to \u03c0, then br(br(\u03c0E)) \u2260 \u03c0E.\nThe supplementary material proves this theorem by computing the optimal equilibrium for our\nexample. In that equilibrium, H selects (1, 1) if \u03b8 \u2208 [41/92, 51/92]. In contrast, \u03c0E only chooses (1, 1) if\n\u03b8 = 0.5. The change arises because there are situations (e.g., \u03b8 = 0.49) where the immediate loss of\nreward to H is worth the improvement in R\u2019s estimate of \u03b8.\nRemark 3. We should expect experienced users of apprenticeship learning systems to present\ndemonstrations optimized for fast learning rather than demonstrations that maximize reward.\nCrucially, the demonstrator is incentivized to deviate from R\u2019s assumptions. This has implications\nfor the design and analysis of apprenticeship systems in robotics. 
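The office-supplies example can be checked numerically. The sketch below is ours, not the paper's code; it assumes R best-responds to each demonstration convention via Theorem 2 (maximizing reward under the posterior mean of \u03b8), and shows that at \u03b8 = 0.49 the teaching action (1, 1) beats the greedy action (0, 2) in total reward:

```python
# Numerical check (a sketch, not the paper's code) of the office-supplies
# ACIRL example at theta = 0.49.
A_H = [(0, 2), (1, 1), (2, 0)]          # H makes 2 items total
A_R = [(0, 90), (50, 50), (90, 0)]      # R's capabilities

def reward(action, theta):
    p, q = action
    return theta * p + (1 - theta) * q  # R(s, (pa, qa); theta)

def deploy(theta_mean):
    """Theorem 2: R maximizes reward under the posterior mean of theta."""
    return max(A_R, key=lambda a: reward(a, theta_mean))

theta = 0.49
# Expert convention: H plays (0,2) whenever theta < 0.5, so observing (0,2)
# leaves R's posterior uniform on [0, 0.5), with mean 0.25.
expert_total = reward((0, 2), theta) + reward(deploy(0.25), theta)
# Equilibrium convention: (1,1) signals theta in [41/92, 51/92], mean 0.5.
teach_total = reward((1, 1), theta) + reward(deploy(0.5), theta)
assert teach_total > expert_total       # teaching beats expert demonstration
```

Here the expert convention nets 1.02 + 45.9 = 46.92 (R deploys (0, 90)), while the teaching convention nets 1.0 + 50 = 51.0 (R deploys (50, 50)), illustrating Theorem 3's deviation incentive.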
Inaccurate assumptions about user\nbehavior are notorious for exposing bugs in software systems (see, e.g., Leveson & Turner (1993)).\n\n3.4 Generating Instructive Demonstrations\nNow, we consider the problem of computing H\u2019s best response when R uses IRL as a state estimator.\nFor our toy example, we computed solutions exhaustively; for realistic problems we need a more\nef\ufb01cient approach. Section 3.2 shows that this can be reduced to a POMDP where the state is a\ntuple of world state, reward parameters, and R\u2019s belief. While this is easier than solving a general\nDec-POMDP, it is a computational challenge. If we restrict our attention to the case of linear reward\nfunctions, we can develop an ef\ufb01cient algorithm to compute an approximate best response.\nSpeci\ufb01cally, we consider the case where the reward for a state (s, \u03b8) is de\ufb01ned as a linear combination\nof state features for some feature function \u03c6: R(s, aH, aR; \u03b8) = \u03c6(s)\u22a4\u03b8. Standard results from the\nIRL literature show that policies with the same expected feature counts have the same value (Abbeel\n& Ng, 2004). Combined with Theorem 2, this implies that the optimal \u03c0R under the DBE assumption\ncomputes a policy that matches the observed feature counts from the learning phase.\nThis suggests a simple approximation scheme. To compute a demonstration trajectory \u03c4 H, \ufb01rst\ncompute the feature counts R would observe in expectation from the true \u03b8 and then select actions\nthat maximize similarity to these target features. If \u03c6\u03b8 are the expected feature counts induced by \u03b8,\nthen this scheme amounts to the following decision rule:\n\n\u03c4 H \u2190 argmax\u03c4 \u03c6(\u03c4 )\u22a4\u03b8 \u2212 \u03b7||\u03c6\u03b8 \u2212 \u03c6(\u03c4 )||2. (1)\n\nThis rule selects a trajectory that trades off between the sum of rewards \u03c6(\u03c4 )\u22a4\u03b8 and the feature\ndissimilarity ||\u03c6\u03b8 \u2212 \u03c6(\u03c4 )||2. 
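Over a finite candidate set, this decision rule can be sketched directly. The code below is illustrative only (the name `instructive_trajectory` is ours): score each candidate by \u03c6(\u03c4)\u22a4\u03b8 \u2212 \u03b7||\u03c6\u03b8 \u2212 \u03c6(\u03c4)||2 and keep the best.

```python
import math

def instructive_trajectory(candidates, phi, theta, phi_theta, eta=1.0):
    """Pick the demonstration maximizing phi(tau)^T theta - eta*||phi_theta - phi(tau)||_2.

    candidates: finite set of trajectories; phi maps a trajectory to its
    feature counts; phi_theta is the expected feature count under theta.
    """
    def score(tau):
        f = phi(tau)
        reward = sum(fi * ti for fi, ti in zip(f, theta))               # phi(tau)^T theta
        dissim = math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_theta, f)))
        return reward - eta * dissim
    return max(candidates, key=score)
```

With \u03b7 = 0 this degenerates to greedy reward maximization over the candidate set (the DBE behavior); larger \u03b7 trades immediate reward for similarity to the target feature counts.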
Note that this is generally distinct from the action selected by the\ndemonstration-by-expert policy. The goal is to match the expected sum of features under a distribution\nof trajectories with the sum of features from a single trajectory. The correct measure of feature\nsimilarity is regret: the difference between the reward R would collect if it knew the true \u03b8 and the\nreward R actually collects using the inferred \u03b8. Computing this similarity is expensive, so we use an\n\u21132 norm as a proxy measure of similarity.\n\n\fFigure 2: Left, Middle: Comparison of \u2018expert\u2019 demonstration (\u03c0E) with \u2018instructive\u2019 demonstration (br).\nLower numbers are better. Using the best response causes R to infer a better distribution over \u03b8 so it does a\nbetter job of maximizing reward. Right: The regret of the instructive demonstration policy as a function of how\noptimal R expects H to be. \u03bb = 0 corresponds to a robot that expects purely random behavior and \u03bb = \u221e\ncorresponds to a robot that expects optimal behavior. Regret is minimized for an intermediate value of \u03bb: if \u03bb is\ntoo small, then R learns nothing from its observations; if \u03bb is too large, then R expects many values of \u03b8 to lead\nto the same trajectory so H has no way to differentiate those reward functions.\n\n4 Experiments\n4.1 Cooperative Learning for Mobile Robot Navigation\nOur experimental domain is a 2D navigation problem on a discrete grid. In the learning phase of\nthe game, H teleoperates a trajectory while R observes. In the deployment phase, R is placed in a\nrandom state and given control of the robot. We use a \ufb01nite horizon H, and let the \ufb01rst H/2 timesteps\nbe the learning phase. There are N\u03c6 state features de\ufb01ned as radial basis functions where the centers\nare common knowledge. 
Rewards are linear in these features and \u03b8. The initial world state is in the\nmiddle of the map. We use a uniform distribution on [\u22121, 1]N\u03c6 for the prior on \u03b8. Actions move in\none of the four cardinal directions {N, S, E, W} and there is an additional no-op \u2205 that each actor\nexecutes deterministically on the other agent\u2019s turn.\nFigure 1 shows an example comparison between demonstration-by-expert and the approximate best\nresponse policy in Section 3.4. The leftmost image is the ground truth reward function. Next to\nit are demonstration trajectories produced by these two policies. Each path is superimposed on the\nmaximum a-posteriori reward function the robot infers from the demonstration. We can see that the\ndemonstration-by-expert policy immediately goes to the highest reward and stays there. In contrast,\nthe best response policy moves to both areas of high reward. The reward function the robot\ninfers from the best response demonstration is much more representative of the true reward function,\nwhen compared with the reward function it infers from demonstration-by-expert.\n\n4.2 Demonstration-by-Expert vs Best Responder\nHypothesis. When R plays an IRL algorithm that matches features, H prefers the best response\npolicy from Section 3.4 to \u03c0E: the best response policy will signi\ufb01cantly outperform the DBE policy.\nManipulated Variables. Our experiment consists of 2 factors: H-policy and num-features. We\nmake the assumption that R uses an IRL algorithm to compute its estimate of \u03b8 during learning\nand maximizes reward under this estimate during deployment. We use Maximum-Entropy\nIRL (Ziebart et al., 2008) to implement R\u2019s policy. H-policy varies H\u2019s strategy \u03c0H and has two\nlevels: demonstration-by-expert (\u03c0E) and best-responder (br). In the \u03c0E level H maximizes reward\nduring the demonstration. 
In the br level H uses the approximate algorithm from Section 3.4 to compute an approximate best response to πR. The trade-off between reward and communication η is set by cross-validation before the game begins. The num-features factor varies the dimensionality of φ across two levels: 3 features and 10 features. We do this to test whether and how the difference between experts and best-responders is affected by dimensionality. We use a factorial design that leads to 4 distinct conditions. We test each condition against a random sample of N = 500 different reward parameters. We use a within-subjects design with respect to the H-policy factor, so the same reward parameters are tested for πE and br.

Dependent Measures. We use the regret with respect to a fully-observed setting where the robot knows the ground truth θ as a measure of performance. We let θ̂ be the robot's estimate of the reward parameters and let θGT be the ground truth reward parameters. The primary measure is the regret of R's policy: the difference between the value of the policy that maximizes the inferred reward θ̂ and the value of the policy that maximizes the true reward θGT. We also use two secondary measures. The first is the KL-divergence between the maximum-entropy trajectory distribution induced by θ̂ and the maximum-entropy trajectory distribution induced by θGT. The second is the ℓ2-norm between the vector of rewards defined by θ̂ and the vector induced by θGT.

Results. There was relatively little correlation between the measures (Cronbach's α of .47), so we ran a factorial repeated measures ANOVA for each measure.
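As an illustration, the three dependent measures can be sketched on a toy problem where R's "policy" reduces to picking the best of a few candidate trajectories. All feature counts and parameters below are hypothetical, and for brevity the ℓ2 measure is computed over trajectory rewards rather than state rewards.

```python
import numpy as np

# Feature counts of a few candidate trajectories (hypothetical values).
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.7, 0.7]])
theta_gt = np.array([1.0, 0.2])    # ground-truth reward parameters
theta_hat = np.array([0.3, 1.0])   # parameters R inferred from the demonstration

# Regret: true value lost by optimizing the inferred reward instead of the true one.
value_gt = phi @ theta_gt
regret = float(value_gt.max() - value_gt[np.argmax(phi @ theta_hat)])

# KL divergence between maximum-entropy trajectory distributions,
# using the simplified forward model P(tau | theta) proportional to exp(phi(tau)^T theta).
def maxent(theta):
    p = np.exp(phi @ theta)
    return p / p.sum()

p, q = maxent(theta_gt), maxent(theta_hat)
kl = float(np.sum(p * np.log(p / q)))

# l2 distance between the reward vectors the two parameter settings define.
l2 = float(np.linalg.norm(phi @ theta_gt - phi @ theta_hat))
```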
Across all measures, we found a significant effect for H-policy, with br outperforming πE on all measures as we hypothesized (all with F > 962, p < .0001). We did find an interaction effect with num-features for KL-divergence and the ℓ2-norm of the reward vector, but post-hoc Tukey HSD showed br to always outperform πE. The interaction effect arises because the gap between the two levels of H-policy is larger with fewer reward parameters; we interpret this as evidence that num-features = 3 is an easier teaching problem for H. Figure 2 (Left, Middle) shows the dependent measures from our experiment.

4.3 Varying R's Expectations

Maximum-Entropy IRL includes a free parameter λ that controls how optimal R expects H to behave. If λ = 0, R will update its belief as if H's observed behavior is independent of her preferences θ. If λ = ∞, R will update its belief as if H's behavior is exactly optimal. We ran a follow-up experiment to determine how varying λ changes the regret of the br policy.

Changing λ changes the forward model in R's belief update: the mapping R hypothesizes between a given reward parameter θ and the observed feature counts φθ. This mapping is many-to-one for extreme values of λ. λ ≈ 0 means that all values of θ lead to the same expected feature counts because trajectories are chosen uniformly at random. Alternatively, λ ≫ 0 means that almost all probability mass falls on the optimal trajectory, and many values of θ will lead to the same optimal trajectory. This suggests that it is easier for H to differentiate values of θ if R assumes she is noisily optimal, but only up to a maximum noise level.
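The role of λ can be made concrete with a simplified forward model P(τ | θ) proportional to exp(λ φ(τ)⊤θ): λ = 0 yields a uniform distribution over trajectories regardless of θ, while a large λ concentrates nearly all mass on the optimal trajectory. A minimal sketch with toy feature counts (all values hypothetical):

```python
import numpy as np

# Toy feature counts for three candidate trajectories (hypothetical values).
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.5, 0.5]])
theta = np.array([1.0, -0.5])  # one hypothetical reward parameter

def maxent_dist(theta, lam):
    """Forward model P(tau | theta) proportional to exp(lam * phi(tau)^T theta)."""
    logits = lam * (phi @ theta)
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    return p / p.sum()

uniform = maxent_dist(theta, lam=0.0)  # lambda = 0: H looks random, theta is unidentifiable
sharp = maxent_dist(theta, lam=50.0)   # large lambda: mass collapses onto the optimal trajectory
```

Both extremes make θ hard for H to signal, which is why regret is minimized at an intermediate λ.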
Figure 2 plots regret as a function of λ and supports this analysis: H has less regret for intermediate values of λ.

5 Conclusion and Future Work

In this work, we presented a game-theoretic model for cooperative learning, CIRL. Key to this model is that the robot knows that it is in a shared environment and is attempting to maximize the human's reward (as opposed to estimating the human's reward function and adopting it as its own). This leads to cooperative learning behavior and provides a framework in which to design HRI algorithms and analyze the incentives of both actors in a reward learning environment.

We reduced the problem of computing an optimal policy pair to solving a POMDP. This is a useful theoretical tool and can be used to design new algorithms, but it is clear that optimal policy pairs are only part of the story. In particular, when it performs a centralized computation, the reduction assumes that we can effectively program both actors to follow a set coordination policy. This is clearly infeasible in reality, although it may nonetheless be helpful in training humans to be better teachers. An important avenue for future research will be to consider the coordination problem: the process by which two independent actors arrive at policies that are mutual best responses. Returning to Wiener's warning, we believe that the best solution is not to put a specific purpose into the machine at all, but instead to design machines that provably converge to the right purpose as they go along.

Acknowledgments

This work was supported by the DARPA Simplifying Complexity in Scientific Discovery (SIMPLEX) program, the Berkeley Deep Drive Center, the Center for Human Compatible AI, the Future of Life Institute, and the Defense Sciences Office contract N66001-15-2-4048. Dylan Hadfield-Menell is also supported by an NSF Graduate Research Fellowship.

References

Abbeel, P and Ng, A.
Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

Balbach, F and Zeugmann, T. Recent developments in algorithmic teaching. In Language and Automata Theory and Applications. Springer, 2009.

Bernstein, D, Zilberstein, S, and Immerman, N. The complexity of decentralized control of Markov decision processes. In UAI, 2000.

Bostrom, N. Superintelligence: Paths, dangers, strategies. Oxford, 2014.

Boutilier, C. Sequential optimality and coordination in multiagent systems. In IJCAI, volume 99, pp. 478–485, 1999.

Cakmak, M and Lopes, M. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.

Dragan, A and Srinivasa, S. Generating legible motion. In Robotics: Science and Systems, 2013.

Fern, A, Natarajan, S, Judah, K, and Tadepalli, P. A decision-theoretic model of assistance. JAIR, 50(1):71–104, 2014.

Gibbons, R. Incentives in organizations. Technical report, National Bureau of Economic Research, 1998.

Goldman, S and Kearns, M. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

Goldman, S, Rivest, R, and Schapire, R. Learning binary relations and total orders. SIAM Journal on Computing, 22(5):1006–1034, 1993.

Golland, D, Liang, P, and Klein, D. A game-theoretic approach to generating spatial descriptions. In EMNLP, pp. 410–419, 2010.

Jensen, M and Meckling, W. Theory of the firm: Managerial behavior, agency costs and ownership structure. Journal of Financial Economics, 3(4):305–360, 1976.

Kerr, S. On the folly of rewarding A, while hoping for B. Academy of Management Journal, 18(4):769–783, 1975.

Kuleshov, V and Schrijvers, O. Inverse game theory. Web and Internet Economics, 2015.

Leveson, N and Turner, C. An investigation of the Therac-25 accidents. IEEE Computer, 26(7):18–41, 1993.

Natarajan, S, Kunapuli, G, Judah, K, Tadepalli, P, Kersting, K, and Shavlik, J.
Multi-agent inverse reinforcement learning. In Int'l Conference on Machine Learning and Applications, 2010.

Nayyar, A, Mahajan, A, and Teneketzis, D. Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644–1658, 2013.

Ng, A and Russell, S. Algorithms for inverse reinforcement learning. In ICML, 2000.

Ramachandran, D and Amir, E. Bayesian inverse reinforcement learning. In IJCAI, 2007.

Ratliff, N, Bagnell, J, and Zinkevich, M. Maximum margin planning. In ICML, 2006.

Russell, S and Norvig, P. Artificial Intelligence. Pearson, 2010.

Russell, S. Learning agents for uncertain environments (extended abstract). In COLT, 1998.

Waugh, K, Ziebart, B, and Bagnell, J. Computational rationalization: The inverse equilibrium problem. In ICML, 2011.

Wiener, N. Some moral and technical consequences of automation. Science, 131, 1960.

Ziebart, B, Maas, A, Bagnell, J, and Dey, A. Maximum entropy inverse reinforcement learning. In AAAI, 2008.