{"title": "Inverse Reward Design", "book": "Advances in Neural Information Processing Systems", "page_first": 6765, "page_last": 6774, "abstract": "Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.", "full_text": "Inverse Reward Design\n\nDylan Had\ufb01eld-Menell\n\nPieter Abbeel\u2217 Stuart Russell\nDepartment of Electrical Engineering and Computer Science\n\nSmitha Milli\n\nAnca D. Dragan\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94709\n\n{dhm, smilli, pabbeel, russell, anca}@cs.berkeley.edu\n\nAbstract\n\nAutonomous agents optimize the reward function we give them. What they don\u2019t\nknow is how hard it is for us to design a reward function that actually captures\nwhat we want. When designing the reward, we might think of some speci\ufb01c\ntraining scenarios, and make sure that the reward will lead to the right behavior\nin those scenarios. 
Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.

1 Introduction

Robots² are becoming more capable of optimizing their reward functions. But along with that comes the burden of making sure we specify these reward functions correctly. Unfortunately, this is a notoriously difficult task. Consider the example from Figure 1. Alice, an AI engineer, wants to build a robot, we'll call it Rob, for mobile navigation. She wants it to reliably navigate to a target location and expects it to primarily encounter grass lawns and dirt pathways. She trains a perception system to identify each of these terrain types and then uses this to define a reward function that incentivizes moving towards the target quickly, avoiding grass where possible. When Rob is deployed into the world, it encounters a novel terrain type; for dramatic effect, we'll suppose that it is lava. The terrain prediction goes haywire on this out-of-distribution input and generates a meaningless classification which, in turn, produces an arbitrary reward evaluation. As a result, Rob might then drive to its demise.
This failure occurs because the reward function Alice specified implicitly, through the terrain predictors, ends up outputting arbitrary values for lava, and so differs from the one Alice intended, which would actually penalize traversing lava.

In the terminology of Amodei et al. (2016), this is a negative side effect of a misspecified reward: a failure mode of reward design where leaving out important aspects leads to poor behavior. Examples date back to King Midas, who wished that everything he touched turn to gold, leaving out that he didn't mean his food or family. Another failure mode is reward hacking, which happens when, e.g., a vacuum cleaner ejects collected dust so that it can collect even more (Russell & Norvig, 2010), or a racing boat in a game loops in place to collect points instead of actually winning the race (Amodei & Clark, 2016). Short of requiring that the reward designer anticipate and penalize all possible misbehavior in advance, how can we alleviate the impact of such reward misspecification?

*OpenAI, International Computer Science Institute (ICSI)
²Throughout this paper, we will use robot to refer generically to any artificial agent.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: An illustration of a negative side effect. Alice designs a reward function so that her robot navigates to the pot of gold and prefers dirt paths. She does not consider that her robot might encounter lava in the real world and leaves that out of her reward specification. The robot maximizing this proxy reward function drives through the lava to its demise. In this work, we formalize the (Bayesian) inverse reward design (IRD) problem as the problem of inferring (a distribution on) the true reward function from the proxy.
We show that IRD can help mitigate unintended consequences of misspecified reward functions, such as negative side effects and reward hacking.

We leverage a key insight: the designed reward function should merely be an observation about the intended reward, rather than the definition, and should be interpreted in the context in which it was designed. First, a robot should have uncertainty about its reward function, instead of treating it as fixed. This enables it to, e.g., be risk-averse when planning in scenarios where it is not clear what the right answer is, or to ask for help. Being uncertain about the true reward, however, is only half the battle. To be effective, a robot must acquire the right kind of uncertainty, i.e. know what it knows and what it doesn't. We propose that the 'correct' shape of this uncertainty depends on the environment for which the reward was designed.

In Alice's case, the situations where she tested Rob's learning behavior did not contain lava. Thus, the lava-avoiding reward would have produced the same behavior as Alice's designed reward function in the (lava-free) environments that Alice considered. A robot that knows the settings it was evaluated in should also know that, even though the designer specified a lava-agnostic reward, they might have actually meant the lava-avoiding reward. Two reward functions that would produce similar behavior in the training environment should be treated as equally likely, regardless of which one the designer actually specified. We formalize this in a probabilistic model that relates the proxy (designed) reward to the true reward via the following assumption:

Assumption 1.
Proxy reward functions are likely to the extent that they lead to high true-utility behavior in the training environment.

Formally, we assume that the observed proxy reward function is the approximate solution to a reward design problem (Singh et al., 2010). Extracting the true reward is the inverse reward design problem.

The idea of using human behavior as observations about the reward function is far from new. Inverse reinforcement learning uses human demonstrations (Ng & Russell, 2000; Ziebart et al., 2008), shared autonomy uses human operator control signals (Javdani et al., 2015), and preference-based reward learning uses answers to comparison queries (Jain et al., 2015), all as evidence of what the human wants (Hadfield-Menell et al., 2017). We observe that, even when the human behavior is to actually write down a reward function, this should still be treated as an observation, demanding its own observation model.

Our paper makes three contributions. First, we define the inverse reward design (IRD) problem as the problem of inferring the true reward function given a proxy reward function, an intended decision problem (e.g., an MDP), and a set of possible reward functions. Second, we propose a solution to IRD and justify how an intuitive algorithm, which treats the proxy reward as a set of expert demonstrations, can serve as an effective approximation. Third, we show that this inference approach, combined with risk-averse planning, leads to algorithms that are robust to misspecified rewards, alleviating both negative side effects as well as reward hacking. We build a system that 'knows-what-it-knows' about reward evaluations and automatically detects and avoids distributional shift in situations with high-dimensional features.
Our approach substantially outperforms the baseline of literal reward interpretation.

2 Inverse Reward Design

Definition 1. (Markov Decision Process; Puterman (2009)) A (finite-horizon) Markov decision process (MDP), $M$, is a tuple $M = \langle S, A, T, r, H \rangle$. $S$ is a set of states. $A$ is a set of actions. $T$ is a probability distribution over the next state, given the previous state and action. We write this as $T(s_{t+1} \mid s_t, a)$. $r$ is a reward function that maps states to rewards, $r : S \mapsto \mathbb{R}$. $H \in \mathbb{Z}^+$ is the finite planning horizon for the agent.

A solution to $M$ is a policy: a mapping from the current timestep and state to a distribution over actions. The optimal policy maximizes the expected sum of rewards. We will use $\xi$ to represent trajectories. In this work, we consider reward functions that are linear combinations of feature vectors $\phi(\xi)$. Thus, the reward for a trajectory, given weights $w$, is $r(\xi; w) = w^\top \phi(\xi)$.

The MDP formalism defines optimal behavior, given a reward function. However, it provides no information about where this reward function comes from (Singh et al., 2010). We refer to an MDP without a reward function as a world model. In practice, a system designer needs to select a reward function that encapsulates the intended behavior. This process is reward engineering or reward design:

Definition 2. (Reward Design Problem (Singh et al., 2010)) A reward design problem (RDP) is defined as a tuple $P = \langle r^*, \tilde{M}, \tilde{R}, \pi(\cdot \mid \tilde{r}, \tilde{M}) \rangle$. $r^*$ is the true reward function. $\tilde{M}$ is a world model. $\tilde{R}$ is a set of proxy reward functions.
$\pi(\cdot \mid \tilde{r}, \tilde{M})$ is an agent model that defines a distribution on trajectories given a (proxy) reward function and a world model.

In an RDP, the designer believes that an agent, represented by the policy $\pi(\cdot \mid \tilde{r}, \tilde{M})$, will be deployed in $\tilde{M}$. She must specify a proxy reward function $\tilde{r} \in \tilde{R}$ for the agent. Her goal is to specify $\tilde{r}$ so that $\pi(\cdot \mid \tilde{r}, \tilde{M})$ obtains high reward according to the true reward function $r^*$. We let $\tilde{w}$ represent weights for the proxy reward function and $w^*$ represent weights for the true reward function.

In this work, our motivation is that system designers are fallible, so we should not expect them to perfectly solve the reward design problem. Instead, we consider the case where the system designer is approximately optimal at solving a known RDP, which is distinct from the MDP the robot currently finds itself in. By inverting the reward design process to infer (a distribution on) the true reward function $r^*$, the robot can understand where its reward evaluations have high variance and plan to avoid those states. We refer to this inference problem as the inverse reward design problem:

Definition 3. (Inverse Reward Design) The inverse reward design (IRD) problem is defined by a tuple $\langle R, \tilde{M}, \tilde{R}, \pi(\cdot \mid \tilde{r}, \tilde{M}), \tilde{r} \rangle$. $R$ is a space of possible reward functions. $\tilde{M}$ is a world model. $\langle -, \tilde{M}, \tilde{R}, \pi(\cdot \mid \tilde{r}, \tilde{M}) \rangle$ partially specifies an RDP $P$, with an unobserved reward function $r^* \in R$. $\tilde{r} \in \tilde{R}$ is the observed proxy reward that is an (approximate) solution to $P$.

In solving an IRD problem, the goal is to recover $r^*$. We will explore Bayesian approaches to IRD, so we will assume a prior distribution on $r^*$ and infer a posterior distribution $P(r^* \mid \tilde{r}, \tilde{M})$.

3 Related Work

Optimal reward design. Singh et al. (2010) formalize and study the problem of designing optimal rewards. They consider a designer faced with a distribution of environments, a class of reward functions to give to an agent, and a fitness function. They observe that, in the case of bounded agents, it may be optimal to select a proxy reward that is distinct from the fitness function. Sorg et al. (2010) and subsequent work study the computational problem of selecting an optimal proxy reward. In our work, we consider an alternative situation where the system designer is the bounded agent. In this case, the proxy reward function is distinct from the fitness function (the true utility function in our terminology) because system designers can make mistakes. IRD formalizes the problem of determining a true utility function given an observed proxy reward function. This enables us to design agents that are robust to misspecifications in their reward function.

Inverse reinforcement learning. In inverse reinforcement learning (IRL) (Ng & Russell, 2000; Ziebart et al., 2008; Evans et al., 2016; Syed & Schapire, 2007), the agent observes demonstrations of (approximately) optimal behavior and infers the reward function being optimized. IRD is a similar problem, as both approaches infer an unobserved reward function. The difference is in the observation: IRL observes behavior, while IRD directly observes a reward function. Key to IRD is assuming that this observed reward incentivizes behavior that is approximately optimal with respect to the true reward. In Section 4.2, we show how ideas from IRL can be used to approximate IRD.
Ultimately, we consider both IRD and IRL to be complementary strategies for value alignment (Hadfield-Menell et al., 2016): approaches that allow designers or users to communicate preferences or goals.

Pragmatics. The pragmatic interpretation of language is the interpretation of a phrase or utterance in the context of alternatives (Grice, 1975). For example, the utterance "some of the apples are red" is often interpreted to mean that "not all of the apples are red," although this is not literally implied. This is because, in context, we typically assume that a speaker who meant to say "all the apples are red" would simply say so.

Recent models of pragmatic language interpretation use two levels of Bayesian reasoning (Frank et al., 2009; Goodman & Lassiter, 2014). At the lowest level, there is a literal listener that interprets language according to a shared literal definition of words or utterances. Then, a speaker selects words in order to convey a particular meaning to the literal listener. To model pragmatic inference, we consider the probable meaning of a given utterance from this speaker. We can think of IRD as a model of pragmatic reward interpretation: the speaker in pragmatic interpretation of language is directly analogous to the reward designer in IRD.

4 Approximating the Inference over True Rewards

We solve IRD problems by formalizing Assumption 1: the idea that proxy reward functions are likely to the extent that they incentivize high utility behavior in the training MDP. This will give us a probabilistic model for how $\tilde{w}$ is generated from the true $w^*$ and the training MDP $\tilde{M}$. We will invert this probability model to compute a distribution $P(w = w^* \mid \tilde{w}, \tilde{M})$ on the true utility function.

4.1 Observation Model

Recall that $\pi(\xi \mid \tilde{w}, \tilde{M})$ is the designer's model of the probability that the robot will select trajectory $\xi$, given proxy reward $\tilde{w}$. We will assume that $\pi(\xi \mid \tilde{w}, \tilde{M})$ is the maximum entropy trajectory distribution from Ziebart et al. (2008), i.e. the designer models the robot as approximately optimal: $\pi(\xi \mid \tilde{w}, \tilde{M}) \propto \exp(\tilde{w}^\top \phi(\xi))$. An optimal designer chooses $\tilde{w}$ to maximize expected true value, i.e. so that $\mathbb{E}[w^{*\top} \phi(\xi) \mid \xi \sim \pi(\xi \mid \tilde{w}, \tilde{M})]$ is high. We model an approximately optimal designer:

$$P(\tilde{w} \mid w^*, \tilde{M}) \propto \exp\left(\beta\, \mathbb{E}\left[w^{*\top} \phi(\xi) \mid \xi \sim \pi(\xi \mid \tilde{w}, \tilde{M})\right]\right) \quad (1)$$

with $\beta$ controlling how close to optimal we assume the person to be. This is now a formal statement of Assumption 1. $w^*$ can be pulled out of the expectation, so we let $\tilde{\phi} = \mathbb{E}[\phi(\xi) \mid \xi \sim \pi(\xi \mid \tilde{w}, \tilde{M})]$. Our goal is to invert (1) and sample from (or otherwise estimate) $P(w^* \mid \tilde{w}, \tilde{M}) \propto P(\tilde{w} \mid w^*, \tilde{M}) P(w^*)$. The primary difficulty this entails is that we need to know the normalized probability $P(\tilde{w} \mid w^*, \tilde{M})$. This depends on its normalizing constant, $\tilde{Z}(w)$, which integrates over possible proxy rewards:

$$P(w = w^* \mid \tilde{w}, \tilde{M}) \propto \frac{\exp\left(\beta w^\top \tilde{\phi}\right)}{\tilde{Z}(w)} P(w), \qquad \tilde{Z}(w) = \int_{\tilde{w}} \exp\left(\beta w^\top \tilde{\phi}\right) d\tilde{w}. \quad (2)$$

4.2 Efficient approximations to the IRD posterior

To compute $P(w = w^* \mid \tilde{w}, \tilde{M})$, we must compute $\tilde{Z}$, which is intractable if $\tilde{w}$ lies in an infinite or large finite set. Notice that computing the value of the integrand for $\tilde{Z}$ is highly non-trivial, as it involves solving a planning problem. This is an example of what is referred to as a doubly-intractable likelihood (Murray et al., 2006). We consider two methods to approximate this normalizing constant.

Figure 2: An example from the Lavaland domain. Left: The training MDP where the designer specifies a proxy reward function. This incentivizes movement toward targets (yellow) while preferring dirt (brown) to grass (green), and generates the gray trajectory. Middle: The testing MDP has lava (red). The proxy does not penalize lava, so optimizing it makes the agent go straight through (gray). This is a negative side effect, which the IRD agent avoids (blue): it treats the proxy as an observation in the context of the training MDP, which makes it realize that it cannot trust the (implicit) weight on lava. Right: The testing MDP has cells in which two sensor indicators no longer correlate: they look like grass to one sensor but target to the other. The proxy puts weight on the first, so the literal agent goes to these cells (gray). The IRD agent knows that it can't trust the distinction and goes to the target on which both sensors agree (blue).

Sample to approximate the normalizing constant. This approach, inspired by methods in approximate Bayesian computation (Sunnåker et al., 2013), samples a finite set of weights $\{w_i\}$ to approximate the integral in Equation 2. We found empirically that it helped to include the candidate sample $w$ in the sum. This leads to the normalizing constant

$$\hat{Z}(w) = \exp\left(w^\top \phi_w\right) + \sum_{i=0}^{N-1} \exp\left(\beta w^\top \phi_i\right), \quad (3)$$

where $\phi_i$ and $\phi_w$ are the feature counts realized by optimizing $w_i$ and $w$, respectively.

Bayesian inverse reinforcement learning. During inference, the normalizing constant serves a calibration purpose: it computes how good the behavior produced by all proxy rewards in that MDP would be with respect to the true reward. Reward functions which increase the reward for all trajectories are not preferred in the inference. This creates an invariance to linear shifts in the feature encoding: if we were to change the MDP by shifting features by some vector $\phi_0$, $\phi \leftarrow \phi + \phi_0$, the posterior over $w$ would remain the same.

We can achieve a similar calibration and maintain the same property by directly integrating over the possible trajectories in the MDP:

$$Z(w) = \left( \int_\xi \exp\left(w^\top \phi(\xi)\right) d\xi \right)^\beta; \qquad \hat{P}(w \mid \tilde{w}) \propto \frac{\exp\left(\beta w^\top \tilde{\phi}\right)}{Z(w)}. \quad (4)$$

Proposition 1. The posterior distribution that the IRD model induces on $w^*$ (i.e., Equation 2) and the posterior distribution induced by IRL (i.e., Equation 4) are invariant to linear translations of the features in the training MDP.

Proof.
See supplementary material.

This choice of normalizing constant approximates the posterior of an IRD problem with the posterior from maximum entropy IRL (Ziebart et al., 2008). The result has an intuitive interpretation. The proxy $\tilde{w}$ determines the average feature counts for a hypothetical dataset of expert demonstrations, and $\beta$ determines the effective size of that dataset. The agent solves $\tilde{M}$ with reward $\tilde{w}$ and computes the corresponding feature expectations $\tilde{\phi}$. The agent then pretends like it got $\beta$ demonstrations with feature counts $\tilde{\phi}$, and runs IRL. The more the robot believes the human is good at reward design, the more demonstrations it pretends to have gotten from the person. The fact that reducing the proxy to behavior in $\tilde{M}$ approximates IRD is not surprising: the main point of IRD is that the proxy reward is merely a statement about what behavior is good in the training environment.

Figure 3: Our challenge domain with latent rewards. Each terrain type (grass, dirt, target, lava) induces a different distribution over high-dimensional features: $\phi_s \sim \mathcal{N}(\mu_{I_s}, \Sigma_{I_s})$. The designer never builds an indicator for lava, and yet the agent still needs to avoid it in the test MDPs.

5 Evaluation

5.1 Experimental Testbed

We evaluated our approaches in a model of the scenario from Figure 1 that we call Lavaland. Our system designer, Alice, is programming a mobile robot, Rob. We model this as a gridworld with movement in the four cardinal directions and four terrain types: target, grass, dirt, and lava. The true objective for Rob, $w^*$, encodes that it should get to the target quickly, stay off the grass, and avoid lava. Alice designs a proxy that performs well in a training MDP that does not contain lava. Then, we measure Rob's performance in a test MDP that does contain lava.
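To make the sample-based approximation of Section 4.2 concrete, the following minimal sketch computes an IRD posterior over a small discrete set of candidate weight vectors. The toy trajectories, candidate weights, and the value of β are invented for illustration; the paper's implementation plans in Lavaland and is paired with risk-averse trajectory optimization.

```python
import math

# Toy training environment: each candidate trajectory is summarized by its
# feature counts phi = [target, grass, lava]. Lava never appears in training.
TRAJECTORIES = [
    [1.0, 0.0, 0.0],   # reach the target on dirt
    [1.0, 2.0, 0.0],   # reach the target, cutting across grass
    [0.0, 0.0, 0.0],   # wander without reaching the target
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feature_expectations(w):
    """MaxEnt trajectory distribution pi(xi | w) ~ exp(w . phi(xi)),
    then E[phi(xi)] under that distribution."""
    scores = [math.exp(dot(w, phi)) for phi in TRAJECTORIES]
    z = sum(scores)
    return [sum(s * phi[k] for s, phi in zip(scores, TRAJECTORIES)) / z
            for k in range(len(TRAJECTORIES[0]))]

def ird_posterior(proxy_w, candidates, beta=5.0):
    """Sample-based IRD posterior (Equations 1-3): a candidate true reward is
    likely to the extent that the proxy's training behavior scores well under
    it; Z-hat is approximated with the candidate set itself (flat prior)."""
    phi_proxy = feature_expectations(proxy_w)
    phis = [feature_expectations(w) for w in candidates]
    post = []
    for w, phi_w in zip(candidates, phis):
        numer = math.exp(beta * dot(w, phi_proxy))
        z_hat = math.exp(dot(w, phi_w)) + sum(
            math.exp(beta * dot(w, phi_i)) for phi_i in phis)
        post.append(numer / z_hat)
    total = sum(post)
    return [p / total for p in post]

# The proxy ignores lava (weight 0); the candidates disagree only on lava.
proxy = [1.0, -1.0, 0.0]
candidates = [[1.0, -1.0, 0.0], [1.0, -1.0, -10.0]]
posterior = ird_posterior(proxy, candidates)
```

Because no training trajectory ever touches lava, both candidates induce identical training behavior, so the posterior splits evenly between them; this is exactly the uncertainty that risk-averse planning can then exploit in the test MDP.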
Our results show that combining IRD and risk-averse planning creates incentives for Rob to avoid unforeseen scenarios.

We experiment with four variations of this environment: two proof-of-concept conditions, in which the reward is misspecified but the agent has direct access to feature indicators for the different categories (i.e., conveniently having a feature for lava); and two challenge conditions, in which the right features are latent: the reward designer does not build an indicator for lava, but by reasoning in the raw observation space and then using risk-averse planning, the IRD agent still avoids lava.

5.1.1 Proof-of-Concept Domains

These domains contain feature indicators for the four categories: grass, dirt, target, and lava.

Side effects in Lavaland. Alice expects Rob to encounter 3 types of terrain: grass, dirt, and target, and so she only considers the training MDP from Figure 2 (left). She provides a $\tilde{w}$ to encode a trade-off between path length and time spent on grass.

The training MDP contains no lava, but it is introduced when Rob is deployed. An agent that treats the proxy reward literally might go on the lava in the test MDP. However, an agent that runs IRD will know that it can't trust the weight on the lava indicator, since all such weights would produce the same behavior in the training MDP (Figure 2, middle).

Reward Hacking in Lavaland. Reward hacking refers generally to reward functions that can be gamed or tricked. To model this within Lavaland, we use features that are correlated in the training domain but are uncorrelated in the testing environment. There are 6 features: three from one sensor and three from another sensor.
In the training environment, the features from both sensors are correct indicators of the state's terrain category (grass, dirt, target).

At test time, this correlation gets broken: lava looks like the target category to the second sensor, but the grass category to the first sensor. This is akin to how, in a racing game (Amodei & Clark, 2016), winning and game points can be correlated at reward design time, but test environments might contain loopholes for maximizing points without winning. We want agents to hedge their bets between winning and points, or, in Lavaland, between the two sensors. An agent that treats the proxy reward function literally might go to these new cells if they are closer. In contrast, an agent that runs IRD will know that a reward function with the same weights put on the first sensor is just as likely as the proxy. Risk-averse planning makes it go to the target for which both sensors agree (Figure 2, right).

Figure 4: The results of our experiment comparing our proposed method to a baseline that directly plans with the proxy reward function. By solving an inverse reward design problem, we are able to create generic incentives to avoid unseen or novel states.

5.1.2 Challenge Domain: Latent Rewards, No More Feature Indicators

The previous examples allow us to explore reward hacking and negative side effects in an isolated experiment, but they are unrealistic in that they assume the existence of a feature indicator for unknown, unplanned-for terrain. To investigate misspecified objectives in a more realistic setting, we make the terrain type latent, inducing raw observations: we use a model where the terrain category determines the mean and variance of a multivariate Gaussian distribution over observed features. Figure 3 shows a depiction of this scenario.
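The latent-feature generative model depicted in Figure 3 can be sketched as below. The feature dimension, class means, and a shared diagonal covariance are invented for illustration (the experiments use 50-dimensional features); the structure is the paper's: the terrain category of a cell parameterizes a Gaussian over the features the robot observes.

```python
import random

# Illustrative latent-feature model from Figure 3: a cell's terrain category
# determines a Gaussian over observed features, phi_s ~ N(mu_I, Sigma_I).
MU = {
    "grass":  [0.0, 1.0, 0.0, 0.0],
    "dirt":   [1.0, 0.0, 0.0, 1.0],
    "target": [0.0, 0.0, 1.0, 0.0],
    "lava":   [3.0, 3.0, 3.0, 3.0],  # never seen at reward design time
}
SIGMA = 0.1  # standard deviation shared across dimensions for simplicity

def observe(terrain, rng):
    """Sample the raw feature vector the robot sees for one grid cell."""
    return [rng.gauss(m, SIGMA) for m in MU[terrain]]

rng = random.Random(0)
obs = observe("grass", rng)
```

The proxy reward is defined only on observations drawn from the three safe categories, so its evaluation of samples from the lava distribution is arbitrary, which is what makes the posterior uncertainty over implicit lava penalties so important.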
The designer has in mind a proxy reward on dirt, target, and grass, but forgets that lava might exist. Because the robot does not have access to the terrain types themselves, we consider two realistic ways through which a designer might actually specify the proxy reward function: 1) directly on the raw observations: collect samples of the training terrain types (dirt, grass, target) and train a (linear) reward predictor; or 2) classifier features: build a classifier to classify terrain as dirt, grass, or target, and define a proxy on its output.

Note that this domain allows for both negative side effects and reward hacking. Negative side effects can occur because the feature distribution for lava is different from the feature distribution for the three safe categories, and the proxy reward is trained only on the three safe categories. Thus, in the testing MDP, the evaluation of the lava cells will be arbitrary, so maximizing the proxy reward will likely lead the agent into lava. Reward hacking occurs when features that are correlated for the safe categories are uncorrelated for the lava category.

5.2 Experiment

Lavaland Parameters. We defined a distribution on map layouts with a log likelihood function that prefers maps where neighboring grid cells are the same. We mixed this log likelihood with a quadratic cost for deviating from a target ratio of grid cells to ensure similar levels of the lava feature in the testing MDPs. Our training MDP is 70% dirt and 30% grass. Our testing MDP is 5% lava, 66.5% dirt, and 28.5% grass.

In the proof-of-concept experiments, we selected the proxy reward function uniformly at random. For latent rewards, we picked a proxy reward function that evaluated to +1 for target, +0.1 for dirt, and −0.2 for grass. To define a proxy on raw observations, we sampled 1000 examples of grass, dirt, and target and did a linear regression.
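The first specification route (a linear regression from raw observations to the target terrain rewards) can be sketched as follows. The two-dimensional features, class means, sample counts, and gradient-descent hyperparameters are invented for illustration; only the terrain rewards (+1 target, +0.1 dirt, −0.2 grass) come from the experiment description above.

```python
import random

# Sketch of defining a proxy on raw observations: sample labeled features for
# the safe terrains, then fit a linear reward predictor by least squares
# (via plain gradient descent). Means and hyperparameters are illustrative.
TERRAIN_REWARD = {"target": 1.0, "dirt": 0.1, "grass": -0.2}
MU = {"target": [1.0, 0.0], "dirt": [0.0, 1.0], "grass": [-1.0, 0.0]}

def sample_dataset(n_per_class, rng):
    data = []
    for terrain, mu in MU.items():
        for _ in range(n_per_class):
            x = [rng.gauss(m, 0.05) for m in mu]
            data.append((x, TERRAIN_REWARD[terrain]))
    return data

def fit_linear_reward(data, steps=2000, lr=0.1):
    """Minimize mean squared error of w . x + b against terrain rewards."""
    w, b, n = [0.0, 0.0], 0.0, len(data)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in data:
            err = (w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0] / n
            gw[1] += err * x[1] / n
            gb += err / n
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

rng = random.Random(0)
w, b = fit_linear_reward(sample_dataset(50, rng))
```

A predictor fit this way evaluates lava observations (drawn from a distribution it never saw) by extrapolation, which is precisely the failure the IRD posterior is meant to flag.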
With classifier features, we simply used the target rewards as the weights on the classified features. We used 50 dimensions for our feature vectors. We selected trajectories via risk-averse trajectory optimization. Details of our planning method, and our approach and rationale in selecting it, can be found in the supplementary material.

IVs and DVs. We measured the fraction of runs that encountered a lava cell on the test MDP as our dependent measure. This tells us the proportion of trajectories where the robot gets 'tricked' by the misspecified reward function; if a grid cell has never been seen, then a conservative robot should plan to avoid it. We manipulate two factors: literal-optimizer and Z-approx. literal-optimizer is true if the robot interprets the proxy reward literally and false otherwise. Z-approx varies the approximation technique used to compute the IRD posterior. It varies across the two levels described in Section 4.2: sample to approximate the normalizing constant (Sample-Z) or use the normalizing constant from maximum entropy IRL (MaxEnt-Z) (Ziebart et al., 2008).

Results. Figure 4 compares the approaches. On the left, we see that IRD alleviates negative side effects (avoids the lava) and reward hacking (does not go as much on cells that look deceptively like the target to one of the sensors). This is important, in that the same inference method generalizes across different consequences of misspecified rewards.
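One simple way to turn the IRD posterior into conservative behavior is to score each candidate trajectory by its worst-case reward over posterior weight samples and pick the maximizer. This min-over-samples rule is a simplified stand-in for the paper's risk-averse trajectory optimization (whose details are in the supplementary material); the features and samples below are invented.

```python
def risk_averse_choice(trajectory_features, weight_samples):
    """Pick the trajectory maximizing worst-case reward across posterior
    weight samples (a maximin stand-in for risk-averse planning)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def worst_case(phi):
        return min(dot(w, phi) for w in weight_samples)

    return max(range(len(trajectory_features)),
               key=lambda i: worst_case(trajectory_features[i]))

# Features: [progress, lava]. One posterior sample penalizes lava heavily,
# so the short path through lava has terrible worst-case value.
paths = [[1.0, 1.0],   # short but crosses lava
         [0.8, 0.0]]   # longer, lava-free
samples = [[1.0, 0.0], [1.0, -10.0]]
best = risk_averse_choice(paths, samples)
```

Under the two samples above, the lava path's worst-case value is dominated by the lava-penalizing sample, so the maximin rule selects the lava-free path even though it makes less progress.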
Figure 2 shows example behaviors.

In the more realistic latent reward setting, the IRD agent avoids the lava cells despite the designer forgetting to penalize lava, and despite not even having an indicator for it: because lava is latent in the feature space, reward functions that would implicitly penalize lava are as likely as the one actually specified, and risk-averse planning therefore avoids it.

We also see a distinction between raw observations and classifier features. The first essentially matches the proof-of-concept results (note the different axes scales), while the latter is much more difficult across all methods. The proxy performs worse because each grid cell is classified before being evaluated, so there is a relatively good chance that at least one of the lava cells is misclassified as target. IRD performs worse because the behaviors considered in inference plan in the already classified terrain: a non-linear transformation of the features. The inference must both determine a good linear reward function to match the behavior and discover the corresponding uncertainty about it. When the proxy is a linear function of raw observations, the first job is considerably easier.

6 Discussion

Summary. In this work, we motivated and introduced the Inverse Reward Design problem as an approach to mitigating the risk from misspecified objectives. We introduced an observation model, identified the challenging inference problem this entails, and gave several simple approximation schemes. Finally, we showed how to use the solution to an inverse reward design problem to avoid side effects and reward hacking in a 2D navigation problem. We showed that we are able to avoid these issues reliably in simple problems where features are binary indicators of terrain type. Although this result is encouraging, in real problems we won't have convenient access to binary indicators for what matters.
Thus, our challenge evaluation domain gave the robot access only to a high-dimensional observation space. The reward designer specified a reward over this observation space that forgot to penalize a rare but catastrophic terrain. IRD inference still enabled the robot to understand that rewards which would implicitly penalize the catastrophic terrain are also likely.
Limitations and future work. IRD gives the robot a posterior distribution over reward functions, but much work remains in understanding how best to leverage this posterior. Risk-averse planning can work sometimes, but it has the limitation that the robot does not just avoid bad things like lava; it also avoids potentially good things, like a giant pot of gold. We anticipate that leveraging the IRD posterior for follow-up queries to the reward designer will be key to addressing misspecified objectives.
Another limitation stems from the complexity of the environments and reward functions considered here. The approaches we used in this work rely on explicitly solving a planning problem, and this is a bottleneck during inference. In future work, we plan to explore the use of different agent models that plan approximately or leverage, e.g., meta-learning (Duan et al., 2016) to scale IRD up to complex environments. Another key limitation is the use of linear reward functions. We cannot expect IRD to perform well unless the prior places weight on (a reasonable approximation to) the true reward function. If, e.g., we encoded terrain types as RGB values in Lavaland, there is unlikely to be a reward function in our hypothesis space that represents the true reward well.
Finally, this work considers one relatively simple error model for the designer. This encodes some implicit assumptions about the nature and likelihood of errors (e.g., IID errors).
In future work, we plan to investigate more sophisticated error models that allow for systematically biased errors from the designer, and to perform human-subject studies to empirically evaluate these models.
Overall, we are excited about the implications IRD has not only in the short term, but also for its contribution to the general study of the value alignment problem.

Acknowledgements

This work was supported by the Center for Human Compatible AI and the Open Philanthropy Project, the Future of Life Institute, AFOSR, and NSF Graduate Research Fellowship Grant No. DGE 1106400.

References

Amodei, Dario and Clark, Jack. Faulty Reward Functions in the Wild. https://blog.openai.com/faulty-reward-functions/, 2016.

Amodei, Dario, Olah, Chris, Steinhardt, Jacob, Christiano, Paul, Schulman, John, and Mané, Dan. Concrete Problems in AI Safety. CoRR, abs/1606.06565, 2016. URL http://arxiv.org/abs/1606.06565.

Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L., Sutskever, Ilya, and Abbeel, Pieter. RL2: Fast Reinforcement Learning via Slow Reinforcement Learning. CoRR, abs/1611.02779, 2016. URL http://arxiv.org/abs/1611.02779.

Evans, Owain, Stuhlmüller, Andreas, and Goodman, Noah D. Learning the Preferences of Ignorant, Inconsistent Agents. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 323–329. AAAI Press, 2016.

Frank, Michael C, Goodman, Noah D, Lai, Peter, and Tenenbaum, Joshua B. Informative Communication in Word Production and Word Learning. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pp. 1228–1233. Cognitive Science Society, Austin, TX, 2009.

Goodman, Noah D and Lassiter, Daniel. Probabilistic Semantics and Pragmatics: Uncertainty in Language and Thought. Handbook of Contemporary Semantic Theory. Wiley-Blackwell, 2, 2014.

Grice, H. Paul. Logic and Conversation, pp. 43–58.
Academic Press, 1975.

Hadfield-Menell, Dylan, Dragan, Anca, Abbeel, Pieter, and Russell, Stuart. Cooperative Inverse Reinforcement Learning. In Proceedings of the Thirtieth Annual Conference on Neural Information Processing Systems, 2016.

Hadfield-Menell, Dylan, Dragan, Anca D., Abbeel, Pieter, and Russell, Stuart J. The Off-Switch Game. In Proceedings of the International Joint Conference on Artificial Intelligence, 2017.

Jain, Ashesh, Sharma, Shikhar, Joachims, Thorsten, and Saxena, Ashutosh. Learning Preferences for Manipulation Tasks from Online Coactive Feedback. The International Journal of Robotics Research, 34(10):1296–1313, 2015.

Javdani, Shervin, Bagnell, J. Andrew, and Srinivasa, Siddhartha S. Shared Autonomy via Hindsight Optimization. In Proceedings of Robotics: Science and Systems XI, 2015. URL http://arxiv.org/abs/1503.07619.

Murray, Iain, Ghahramani, Zoubin, and MacKay, David. MCMC for Doubly-Intractable Distributions. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, 2006.

Ng, Andrew Y and Russell, Stuart J. Algorithms for Inverse Reinforcement Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 663–670, 2000.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2009.

Russell, Stuart and Norvig, Peter. Artificial Intelligence: A Modern Approach. Pearson, 2010.

Singh, Satinder, Lewis, Richard L., and Barto, Andrew G. Where do rewards come from? In Proceedings of the International Symposium on AI Inspired Biology - A Symposium at the AISB 2010 Convention, pp. 111–116, 2010. ISBN 1902956923.

Sorg, Jonathan, Lewis, Richard L, and Singh, Satinder P.
Reward Design via Online Gradient Ascent. In Proceedings of the Twenty-Third Conference on Neural Information Processing Systems, pp. 2190–2198, 2010.

Sunnåker, Mikael, Busetto, Alberto Giovanni, Numminen, Elina, Corander, Jukka, Foll, Matthieu, and Dessimoz, Christophe. Approximate Bayesian Computation. PLoS Comput Biol, 9(1):e1002803, 2013.

Syed, Umar and Schapire, Robert E. A Game-Theoretic Approach to Apprenticeship Learning. In Proceedings of the Twentieth Conference on Neural Information Processing Systems, pp. 1449–1456, 2007.

Ziebart, Brian D, Maas, Andrew L, Bagnell, J Andrew, and Dey, Anind K. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.