{"title": "Learning to Share and Hide Intentions using Information Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 10249, "page_last": 10259, "abstract": "Learning to cooperate with friends and compete with foes is a key component of multi-agent reinforcement learning. Typically to do so, one requires access to either a model of or interaction with the other agent(s). Here we show how to learn effective strategies for cooperation and competition in an asymmetric information game with no such model or interaction. Our approach is to encourage an agent to reveal or hide their intentions using an information-theoretic regularizer. We consider both the mutual information between goal and action given state, as well as the mutual information between goal and state. We show how to stochastically optimize these regularizers in a way that is easy to integrate with policy gradient reinforcement learning. Finally, we demonstrate that cooperative (competitive) policies learned with our approach lead to more (less) reward for a second agent in two simple asymmetric information games.", "full_text": "Learning to Share and Hide Intentions using\n\nInformation Regularization\n\nDJ Strouse1, Max Kleiman-Weiner2, Josh Tenenbaum2\n\nMatt Botvinick3,4, David Schwab5\n\n1 Princeton University, 2 MIT, 3 DeepMind\n\n4 UCL, 5 CUNY Graduate Center\n\nAbstract\n\nLearning to cooperate with friends and compete with foes is a key component\nof multi-agent reinforcement learning. Typically to do so, one requires access to\neither a model of or interaction with the other agent(s). Here we show how to learn\neffective strategies for cooperation and competition in an asymmetric information\ngame with no such model or interaction. Our approach is to encourage an agent\nto reveal or hide their intentions using an information-theoretic regularizer. 
We consider both the mutual information between goal and action given state, as well as the mutual information between goal and state. We show how to optimize these regularizers in a way that is easy to integrate with policy gradient reinforcement learning. Finally, we demonstrate that cooperative (competitive) policies learned with our approach lead to more (less) reward for a second agent in two simple asymmetric information games.\n\n1 Introduction\n\nIn order to effectively interact with others, an intelligent agent must understand the intentions of others. In order to successfully cooperate, collaborative agents that share their intentions will do a better job of coordinating their plans together [Tomasello et al., 2005]. This is especially salient when information pertinent to a goal is known asymmetrically between agents. When competing with others, a sophisticated agent might aim to hide this information from its adversary in order to deceive or surprise them. This type of sophisticated planning is thought to be a distinctive aspect of human intelligence compared to other animal species [Tomasello et al., 2005].\n\nFurthermore, agents that share their intentions might have behavior that is more interpretable and understandable by people. Reinforcement learning (RL) systems often plan in ways that can seem opaque to an observer. In particular, when an agent\u2019s reward function is not aligned with the designer\u2019s goal, the resulting behavior often deviates from what is expected [Hadfield-Menell et al., 2016]. If these agents are also trained to share high-level and often abstract information about their behavior (i.e. intentions), it is more likely that a human operator or collaborator can understand, predict, and explain those agents\u2019 decisions. 
This is a key requirement for building machines that people can trust.\n\nPrevious approaches have tackled aspects of this problem but all share a similar structure [Dragan et al., 2013, Ho et al., 2016, Hadfield-Menell et al., 2016, Shafto et al., 2014]. They optimize their behavior against a known model of an observer which has a theory-of-mind [Baker et al., 2009, Ullman et al., 2009, Rabinowitz et al., 2018] or is doing some form of inverse-RL [Ng et al., 2000, Abbeel and Ng, 2004]. In this work we take an alternative approach based on an information-theoretic formulation of the problem of sharing and hiding intentions. This approach does not require an explicit model of or interaction with the other agent, which could be especially useful in settings where interactive training is expensive or dangerous. Our approach also naturally combines with scalable policy-gradient methods commonly used in deep reinforcement learning.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Hiding and revealing intentions via information-theoretic regularization\n\nWe consider multi-goal environments in the form of a discrete-time finite-horizon discounted Markov decision process (MDP) defined by the tuple M \u2261 (S, A, G, P, \u03c1G, \u03c1S, r, \u03b3, T), where S is a state set, A an action set, P : S \u00d7 A \u00d7 S \u2192 R+ a (goal-independent) probability distribution over transitions, G a goal set, \u03c1G : G \u2192 R+ a distribution over goals, \u03c1S : S \u2192 R+ a probability distribution over initial states, r : S \u00d7 G \u2192 R a (goal-dependent) reward function, \u03b3 \u2208 [0, 1] a discount factor, and T the horizon.\n\nIn each episode, a goal is sampled and determines the reward structure for that episode. 
One agent, Alice, will have access to this goal and thus knowledge of the environment\u2019s reward structure, while a second agent, Bob, will not and instead must infer it from observing Alice. We assume that Alice knows in advance whether Bob is a friend or foe and wants to make his task easier or harder, respectively, but that she has no model of him and must train without any interaction with him.\n\nOf course, Alice also wishes to maximize her own expected reward \u03b7[\u03c0] = E\u03c4[\u03a3_{t=0}^T \u03b3^t r(st, g)], where \u03c4 = (g, s0, a0, s1, a1, . . . , sT) denotes the episode trajectory, g \u223c \u03c1G, s0 \u223c \u03c1S, at \u223c \u03c0g(at | st), and st+1 \u223c P(st+1 | st, at), and \u03c0g(a | s; \u03b8) : G \u00d7 S \u00d7 A \u2192 R+ is Alice\u2019s goal-dependent probability distribution over actions (policy) parameterized by \u03b8.\n\nIt is common in RL to consider loss functions of the form J[\u03c0] = \u03b7[\u03c0] + \u03b2\u2113[\u03c0], where \u2113 is a regularizer meant to help guide the agent toward desirable solutions. For example, the policy entropy is a common choice to encourage exploration [Mnih et al., 2016], while pixel prediction and control have been proposed to encourage exploration in visually rich environments with sparse rewards [Jaderberg et al., 2017].\n\nThe setting we imagine is one in which we would like Alice to perform well in a joint environment with rewards rjoint, but we are only able to train her in a solo setting with rewards rsolo. How do we make sure that Alice\u2019s learned behavior in the solo environment transfers well to the joint environment? We propose the training objective Jtrain = E[rsolo] + \u03b2I (where I is some sort of task-relevant information measure) as a useful proxy for the test objective Jtest = E[rjoint]. The structure of rjoint determines whether the task is cooperative or competitive, and therefore the appropriate sign of \u03b2. 
For example, in the spatial navigation game of section 4.1, a competitive rjoint might provide +1 reward only to the first agent to reach the correct goal (and -1 for reaching the wrong one), whereas a cooperative rjoint might provide each of Alice and Bob with the sum of their individual rewards. In figure 2, we plot related metrics after training Alice with Jtrain. On the bottom row, we plot the percentage of time Alice beats Bob to the goal (which is her expected reward for the competitive rjoint). On the top row, we plot Bob\u2019s expected time steps per unit reward, relative to Alice\u2019s. Their combined steps per unit reward would be more directly related to the cooperative rjoint described above, but we plot Bob\u2019s individual contribution (relative to Alice\u2019s), since his individual contribution to the joint reward rate varies dramatically with \u03b2, whereas Alice\u2019s does not. We note that one advantage of our approach is that it unifies cooperative and competitive strategies in the same one-parameter (\u03b2) family.\n\nBelow, we will consider two different information regularizers meant to encourage/discourage Alice from sharing goal information with Bob: the (conditional) mutual information between goal and action given state, Iaction[\u03c0] \u2261 I(A; G | S), which we will call the \"action information\", and the mutual information between state and goal, Istate[\u03c0] \u2261 I(S; G), which we will call the \"state information.\" Since the mutual information is a general measure of dependence (linear and non-linear) between two variables, Iaction and Istate measure the ease of inferring the goal from the actions and states, respectively, generated by the policy \u03c0. Thus, if Alice wants Bob to do well, she should choose a policy with high information, and one with low information if not.\n\nWe consider both action and state informations because they have different advantages and disadvantages. 
Using action information assumes that Bob (the observer) can see both Alice\u2019s states and actions, which may be unrealistic in some environments, such as one in which the actions are the torques a robot applies to its joint angles [Eysenbach et al., 2019]. Using state information instead only assumes that Bob can observe Alice\u2019s states (and not actions); however, it does so at the cost of requiring Alice to count goal-dependent state frequencies under the current policy. Optimizing action information, on the other hand, does not require state counting. So, in summary, action information is simpler to optimize, but state information may be more appropriate to use in a setting where an observer can\u2019t observe (or infer) the observee\u2019s actions.\n\nThe generality with which mutual information measures dependence is at once its biggest strength and weakness. On the one hand, using information allows Alice to prepare for interaction with Bob with neither a model of nor interaction with him. On the other hand, Bob might have limited computational resources (for example, perhaps his policy is linear with respect to his observations of Alice) and so he may not be able to \u201cdecode\u201d all of the goal information that Alice makes available to him. Nevertheless, Iaction and Istate can at least be considered upper bounds on Bob\u2019s inference performance; if Iaction = 0 or Istate = 0, it would be impossible for Bob to guess the goal (above chance) from Alice\u2019s actions or states, respectively, alone.\n\nOptimizing information can be equivalent to optimizing reward under certain conditions, such as in the following example. Consider Bob\u2019s subtask of identifying the correct goal in a 2-goal setup. If his belief over the goal is represented by p(g), then he should guess g\u2217 = argmax_g p(g), which results in error probability perr = 1 \u2212 max_g p(g). 
Since the binary entropy function H(g) \u2261 H[p(g)]\nincreases monotonically with perr, optimizing one is equivalent to optimizing the other. Denoting the\nparts of Alice\u2019s behavior observable by Bob as x, then H(g | x) is the post-observation entropy in\nBob\u2019s beliefs, and optimizing it is equivalent to optimizing I(g; x) = H(g) \u2212 H(g | x), since the\npre-observation entropy H(g) is not dependent on Alice\u2019s behavior. If Bob receives reward r when\nidentifying the right goal, and 0 otherwise, then his expected reward is (1 \u2212 perr) r. Thus, in this\nsimpli\ufb01ed setup, optimizing information is directly related to optimizing reward. In general, when\none considers the temporal dynamics of an episode, more than two goals, or more complicated reward\nstructures, the relationship becomes more complicated. However, information is useful in abstracting\naway that complexity, and preparing Alice generically for a plethora of possible task setups.\n\n2.1 Optimizing action information: Iaction \u2261 I(A; G | S)\n\nFirst, we discuss regularization via optimizing the mutual information between goal and action\n(conditioned on state), Iaction \u2261 I(A; G | S), where G is the goal for the episode, A is the chosen\naction, and S is the state of the agent. 
That is, we will train an agent to maximize the objective Jaction[\u03c0] \u2261 E[r] + \u03b2Iaction, where \u03b2 is a tradeoff parameter whose sign determines whether we want the agent to signal (positive) or hide (negative) their intentions, and whose magnitude determines the relative preference for rewards and intention signaling/hiding.\n\nIaction is a functional of the multi-goal policy \u03c0g(a | s) \u2261 p(a | s, g), that is, the probability distribution over actions given the current goal and state, and is given by:\n\nIaction \u2261 I(A; G | S) = \u03a3_s p(s) I(A; G | S = s)   (1)\n= \u03a3_g \u03c1G(g) \u03a3_s p(s | g) \u03a3_a \u03c0g(a | s) log [\u03c0g(a | s) / p(a | s)].   (2)\n\nThe quantity involving the sum over actions is a KL divergence between two distributions: the goal-dependent policy \u03c0g(a | s) and a goal-independent policy p(a | s). This goal-independent policy comes from marginalizing out the goal, that is p(a | s) = \u03a3_g \u03c1G(g) \u03c0g(a | s), and can be thought of as a fictitious policy that represents the agent\u2019s \u201chabit\u201d in the absence of knowing the goal. We will denote \u03c00(a | s) \u2261 p(a | s) and refer to it as the \u201cbase policy,\u201d whereas we will refer to \u03c0g(a | s) as simply the \u201cpolicy.\u201d Thus, we can rewrite the information above as:\n\nIaction = \u03a3_g \u03c1G(g) \u03a3_s p(s | g) KL[\u03c0g(a | s) | \u03c00(a | s)] = E\u03c4[KL[\u03c0g(a | s) | \u03c00(a | s)]].   (3)\n\nWriting the information this way suggests a method for stochastically estimating it. First, we sample a goal g from p(g), that is, we initialize an episode of some task. Next, we sample states s from p(s | g), that is, we generate state trajectories using our policy \u03c0g(a | s). At each step, we measure the KL between the policy and the base policy. 
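As a concrete illustration of this estimator (our own sketch, not code from the paper), the following computes the Monte Carlo estimate of Iaction for a tabular multi-goal policy; the toy cyclic dynamics, uniform initial state, and all function names here are assumptions standing in for sampling the true MDP:

```python
import numpy as np

def base_policy(pi, rho_g):
    # Goal-marginalized "base policy": pi0(a|s) = sum_g rho_G(g) pi_g(a|s).
    # pi has shape (n_goals, n_states, n_actions).
    return np.einsum("g,gsa->sa", rho_g, pi)

def kl(p, q):
    # KL divergence between two discrete distributions with full support.
    return float(np.sum(p * np.log(p / q)))

def estimate_action_info(pi, rho_g, n_states, n_episodes, horizon, rng):
    # Sample a goal, roll out states under pi_g, and average
    # KL[pi_g(.|s) || pi_0(.|s)] over episodes and steps.
    pi0 = base_policy(pi, rho_g)
    kls = []
    for _ in range(n_episodes):
        g = rng.choice(len(rho_g), p=rho_g)  # goal ~ rho_G
        s = int(rng.integers(n_states))      # toy uniform initial state
        for _ in range(horizon):
            kls.append(kl(pi[g, s], pi0[s]))
            a = rng.choice(pi.shape[2], p=pi[g, s])
            s = (s + (1 if a == 1 else -1)) % n_states  # toy cyclic dynamics
    return float(np.mean(kls))

rng = np.random.default_rng(0)
rho_g = np.array([0.5, 0.5])
# Goal-revealing policy: mostly action 0 under goal 0, action 1 under goal 1.
pi_reveal = np.zeros((2, 5, 2))
pi_reveal[0, :, 0] = pi_reveal[1, :, 1] = 0.99
pi_reveal[0, :, 1] = pi_reveal[1, :, 0] = 0.01
# Goal-hiding policy: identical behavior under both goals.
pi_hide = np.tile(np.full((5, 2), 0.5), (2, 1, 1))
print(estimate_action_info(pi_reveal, rho_g, 5, 50, 10, rng))  # ~0.64 nats
print(estimate_action_info(pi_hide, rho_g, 5, 50, 10, rng))    # 0.0
```

The goal-revealing policy leaks nearly one bit (log 2 \u2248 0.69 nats) about the goal at every step, while the goal-hiding policy, whose goal-conditioned policies coincide with the base policy, leaks none.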
Averaging this quantity over episodes and steps gives us our estimate of Iaction.\n\nAlgorithm 1 Action information regularized REINFORCE with value baseline.\nInput: \u03b2, \u03c1G, \u03b3, and ability to sample MDP M\nInitialize \u03c0, parameterized by \u03b8\nInitialize V, parameterized by \u03c6\nfor i = 1 to Nepisodes do\n  Generate trajectory \u03c4 = (g, s0, a0, s1, a1, . . . , sT)\n  for t = 0 to T \u2212 1 do\n    Update policy in direction of \u2207\u03b8Jaction(t) using equation 6\n    Update value in direction of \u2212\u2207\u03c6(Vg(st) \u2212 \u02dcRt)^2, with \u02dcr(t) according to equation 7\n  end for\nend for\n\nOptimizing Iaction with respect to the policy parameters \u03b8 is a bit trickier, however, because the expectation above is with respect to a distribution that depends on \u03b8. Thus, the gradient of Iaction with respect to \u03b8 has two terms:\n\n\u2207\u03b8Iaction = \u03a3_g \u03c1G(g) \u03a3_s (\u2207\u03b8 p(s | g)) KL[\u03c0g(a | s) | \u03c00(a | s)]   (4)\n+ \u03a3_g \u03c1G(g) \u03a3_s p(s | g) \u2207\u03b8 KL[\u03c0g(a | s) | \u03c00(a | s)].   (5)\n\nThe second term involves the same sum over goals and states as in equation 3, so it can be written as an expectation over trajectories, E\u03c4[\u2207\u03b8 KL[\u03c0g(a | s) | \u03c00(a | s)]], and therefore is straightforward to estimate from samples. The first term is more cumbersome, however, since it requires us to model (the policy dependence of) the goal-dependent state probabilities, which in principle involves knowing the dynamics of the environment. Perhaps surprisingly, however, the gradient can still be estimated purely from sampled trajectories, by employing the so-called \u201clog derivative\u201d trick to rewrite the term as an expectation over trajectories. 
The calculation is identical to the proof of the policy gradient theorem [Sutton et al., 2000], except with reward replaced by the KL divergence above.\n\nThe resulting Monte Carlo policy gradient (MCPG) update is:\n\n\u2207\u03b8Jaction(t) = Aaction(t) \u2207\u03b8 log \u03c0g(at | st) + \u03b2\u2207\u03b8 KL[\u03c0g(a | st) | \u03c00(a | st)],   (6)\n\nwhere Aaction(t) \u2261 \u02dcRt \u2212 Vg(st) is a modified advantage, Vg(st) is a goal-state value function regressed toward \u02dcRt, \u02dcRt = \u03a3_{t\u2032=t}^T \u03b3^(t\u2032\u2212t) \u02dcrt\u2032 is a modified return, and the following is the modified reward feeding into that return:\n\n\u02dcrt \u2261 rt + \u03b2 KL[\u03c0g(a | st) | \u03c00(a | st)].   (7)\n\nThe second term in equation 6 encourages the agent to alter the policy to share or hide information in the present state. The first term, on the other hand, encourages modifications which lead the agent to states in the future which result in reward and the sharing or hiding of information. Together, this optimizes Jaction. This algorithm is summarized in Algorithm 1.\n\n2.2 Optimizing state information: Istate \u2261 I(S; G)\n\nWe now consider how to regularize an agent by the information one\u2019s states give away about the goal, using the mutual information between state and goal, Istate \u2261 I(S; G). 
This can be written:\n\nIstate = \u03a3_g \u03c1G(g) \u03a3_s p(s | g) log [p(s | g) / p(s)] = E\u03c4[log [p(s | g) / p(s)]].   (8)\n\nIn order to estimate this quantity, we could track and plug into the above equation the empirical state frequencies pemp(s | g) \u2261 Ng(s)/Ng and pemp(s) \u2261 N(s)/N, where Ng(s) is the number of times state s was visited during episodes with goal g, Ng \u2261 \u03a3_s Ng(s) is the total number of steps taken under goal g, N(s) \u2261 \u03a3_g Ng(s) is the number of times state s was visited across all goals, and N \u2261 \u03a3_{g,s} Ng(s) = \u03a3_g Ng = \u03a3_s N(s) is the total number of state visits across all goals and states. Thus, keeping a moving average of log [pemp(st | g) / pemp(st)] across episodes and steps yields an estimate of Istate.\n\nAlgorithm 2 State information regularized REINFORCE with value baseline.\nInput: \u03b2, \u03c1G, \u03b3, and ability to sample MDP M\nInitialize \u03c0, parameterized by \u03b8\nInitialize V, parameterized by \u03c6\nInitialize the state counts Ng(s)\nfor i = 1 to Nepisodes do\n  Generate trajectory \u03c4 = (g, s0, a0, s1, a1, . . . , sT)\n  Update Ng(s) (and therefore pemp(s | g)) according to \u03c4\n  for t = 0 to T \u2212 1 do\n    Update policy in direction of \u2207\u03b8Jstate(t) using equation 11\n    Update value in direction of \u2212\u2207\u03c6(Vg(st) \u2212 \u02dcRt)^2, with \u02dcr(t) according to equation 12\n  end for\nend for\n\nHowever, we are of course interested in optimizing Istate and so, as in the last section, we need to employ a slightly more sophisticated estimation procedure. 
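Before turning to that gradient, the count-based plug-in estimate just described can be sketched as follows (our own minimal illustration, not code from the paper); it uses the identity pemp(s | g)/pemp(s) = pemp(g, s)/(pemp(g) pemp(s)), with empirical goal frequencies standing in for \u03c1G:

```python
import numpy as np

def state_info_estimate(counts):
    # Plug-in estimate of Istate = I(S;G) from a table of visit counts
    # N_g(s) with shape (n_goals, n_states).
    N = counts.sum()
    p_gs = counts / N                      # joint empirical frequencies
    p_g = p_gs.sum(axis=1, keepdims=True)  # empirical goal frequencies
    p_s = p_gs.sum(axis=0, keepdims=True)  # p_emp(s)
    mask = p_gs > 0                        # only visited (g, s) pairs contribute
    return float(np.sum(p_gs[mask] * np.log(p_gs[mask] / (p_g * p_s)[mask])))

# States perfectly predict the goal: I(S;G) = log 2 nats (1 bit).
print(state_info_estimate(np.array([[10.0, 0.0], [0.0, 10.0]])))  # ~0.693
# States independent of the goal: I(S;G) = 0.
print(state_info_estimate(np.array([[5.0, 5.0], [5.0, 5.0]])))    # 0.0
```

Updating the count table after each trajectory, as in Algorithm 2, and evaluating this estimate tracks Istate under the current policy.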
Taking the gradient of Istate with respect to the policy parameters \u03b8, we get:\n\n\u2207\u03b8Istate = \u03a3_g \u03c1G(g) \u03a3_s (\u2207\u03b8 p(s | g)) log [p(s | g) / p(s)]   (9)\n+ \u03a3_g \u03c1G(g) \u03a3_s p(s | g) (\u2207\u03b8 p(s | g) / p(s | g) \u2212 \u2207\u03b8 p(s) / p(s)).   (10)\n\nThe calculation is similar to that for evaluating \u2207\u03b8Iaction and details can be found in section S1. The resulting MCPG update is:\n\n\u2207\u03b8Jstate(t) = Astate(t) \u2207\u03b8 log \u03c0g(at | st) \u2212 \u03b2 \u03a3_{g\u2032\u2260g} \u03c1G(g\u2032) Rcf(t, g, g\u2032) \u2207\u03b8 log \u03c0g\u2032(at | st),   (11)\n\nwhere Astate(t) \u2261 \u02dcRt \u2212 Vg(st) is a modified advantage, Vg(st) is a goal-state value function regressed toward \u02dcRt, \u02dcRt \u2261 \u03a3_{t\u2032=t}^T \u03b3^(t\u2032\u2212t) \u02dcrt\u2032 is a modified return, Rcf(t, g, g\u2032) \u2261 \u03a3_{t\u2032=t}^T \u03b3^(t\u2032\u2212t) rcf(t\u2032, g, g\u2032) is a \u201ccounterfactual goal return\u201d, and the following are a modified reward and a \u201ccounterfactual goal reward\u201d, respectively, which feed into the above returns:\n\n\u02dcrt \u2261 rt + \u03b2(1 \u2212 pemp(g | st) + log [pemp(st | g) / pemp(st)]),   (12)\n\nrcf(t, g, g\u2032) \u2261 (\u03a0_{t\u2032=0}^t [\u03c0g\u2032(at\u2032 | st\u2032) / \u03c0g(at\u2032 | st\u2032)]) pemp(st | g) / pemp(st),   (13)\n\nwhere pemp(g | st) \u2261 \u03c1G(g) pemp(st | g) / pemp(st). The modified reward can be viewed as adding a \u201cstate uniqueness bonus\u201d log [pemp(st | g) / pemp(st)] that tries to increase the frequency of the present state under the present goal to the extent that the present state is more common under the present goal. If the present state is less common than average under the present goal, then this bonus becomes a penalty. The counterfactual goal reward, on the other hand, tries to make the present state less common under other goals, and is again scaled by uniqueness under the present goal, pemp(st | g) / pemp(st). It also includes importance sampling weights to account for the fact that the trajectory was generated under the current goal, but the policy is being modified under other goals. This algorithm is summarized in Algorithm 2.\n\n3 Related work\n\nWhye Teh et al. [2017] recently proposed an algorithm similar to our action information regularized approach (Algorithm 1), but with very different motivations. They argued that constraining goal-specific policies to be close to a distilled base policy promotes transfer by sharing knowledge across goals. Due to this difference in motivation, they only explored the \u03b2 < 0 regime (i.e. our \u201ccompetitive\u201d regime). They also did not derive their update from an information-theoretic cost function, but instead proposed the update directly. Because of this, their approach differs in that it did not include the \u03b2\u2207\u03b8KL[\u03c0g | \u03c00] term, and instead only included the modified return. Moreover, they did not calculate the full KLs in the modified return, but instead estimated them from single samples (e.g. KL[\u03c0g(a | st) | \u03c00(a | st)] \u2248 log [\u03c0g(at | st) / \u03c00(at | st)]). Nevertheless, the similarity in our approaches suggests a link between transfer and competitive strategies, although we do not explore this here.\n\nEysenbach et al. [2019] also recently proposed an algorithm similar to ours, which used both Istate and Iaction but with the \u201cgoal\u201d replaced by a randomly sampled \u201cskill\u201d label in an unsupervised setting (i.e. no reward). Their motivation was to learn a diversity of skills that would later be useful for a supervised (i.e. reward-yielding) task. 
Their approach to optimizing Istate differs from ours in that it uses a discriminator, a powerful approach but one that, in our setting, would imply a more specific model of the observer, which we wanted to avoid.\n\nTsitsiklis and Xu [2018] derive an inverse tradeoff between an agent\u2019s delay in reaching a goal and the ability of an adversary to predict that goal. Their approach relies on a number of assumptions about the environment (e.g. the agent\u2019s only source of reward is reaching the goal, the opponent need only identify the correct goal and not reach it as well, a nearly uniform goal distribution), but is suggestive of the general tradeoff. It is an interesting open question as to under what conditions our information-regularized approach achieves the optimal tradeoff.\n\nDragan et al. [2013] considered training agents to reveal their goals (in the setting of a robot grasping task), but did so by building an explicit model of the observer. Ho et al. [2016] used a similar model to capture human-generated actions that \u201cshow\u201d a goal, also using an explicit model of the observer. There is also a long history of work on training RL agents to cooperate and compete through interactive training and a joint reward (e.g. [Littman, 1994, 2001, Kleiman-Weiner et al., 2016, Leibo et al., 2017, Peysakhovich and Lerer, 2018, Hughes et al., 2018]), or through modeling one\u2019s effect on another agent\u2019s learning or behavior (e.g. [Foerster et al., 2018, Jaques et al., 2018]). Our approach differs in that it requires neither access to an opponent\u2019s rewards, nor even interaction with or a model of the opponent. Without this knowledge, one can still be cooperative (competitive) with others by being as (un)clear as possible about one\u2019s own intentions. Our work achieves this by directly optimizing the information shared.\n\n4 Experiments\n\nWe demonstrate the effectiveness of our approach in two stages. 
First, we show that training Alice (who has access to the goal of the episode) with information regularization effectively encourages both goal signaling and hiding, depending on the sign of the coefficient \u03b2. Second, we show that Alice\u2019s goal signaling and hiding translate to higher and lower rates of reward acquisition for Bob (who does not have access to the goal and must infer it from observing Alice), respectively. We demonstrate these results in two different simple settings. Our code is available at https://github.com/djstrouse/InfoMARL.\n\n4.1 Spatial navigation\n\nThe first setting we consider is a simple grid world spatial navigation task, where we can fully visualize and understand Alice\u2019s regularized policies. The 5 \u00d7 5 environment contains two possible goals: the top left state or the top right. On any given episode, one goal is chosen randomly (so \u03c1G is uniform) and that goal state is worth +1 reward. The other goal state is then worth \u22121. Both are terminal. Each of Alice and Bob spawns in a random (non-terminal) state and takes actions in A = {left, right, up, down, stay}. A step into a wall is equivalent to the stay action but results in a penalty of \u2212.1 reward. We first train Alice alone, and then freeze her parameters and introduce Bob.\n\nAlice was trained using implementations of Algorithms 1 and 2 in TensorFlow [Abadi et al., 2016]. Given the small, discrete environment, we used tabular representations for both \u03c0 and V. See section S2.1 for training parameters.\n\nExamples of Alice\u2019s resulting policies are shown in figure 1. The top row contains policies regularized with Iaction, the bottom with Istate. 
The left column contains \u201ccooperative\u201d policies encouraged to share goal information (\u03b2 = .025), the middle \u201cambivalent\u201d policies that are unregularized (\u03b2 = 0), and the right \u201ccompetitive\u201d policies encouraged to hide goal information (\u03b2 = \u2212.025). In each case, two policies are shown: at left, the policy when the goal is in the top left, and at right, when it is in the top right. The arrows are proportional to Alice\u2019s action probabilities. The color scale indicates KL[\u03c0g | \u03c00] and log [pemp(s | g) / pemp(s)] in bits for action and state information regularization, respectively. Thus, bluer indicates more goal information shared in that state.\n\nFigure 1: Information-regularized policies. Top row: regularization with Iaction. Bottom row: regularization with Istate. Left column: \u03b2 = .025 (cooperative). Center column: \u03b2 = 0 (ambivalent). Right column: \u03b2 = \u2212.025 (competitive). See main text for additional details.\n\nIn the cooperative case with action information regularization, Alice wants to maximize KL[\u03c0g | \u03c00] and thus she wants her goal-dependent policies to differ as much as possible. This primarily results in her going left in most states when the goal is in the top left, and right when it is in the top right. She can also choose to have some probability of going up in those states, as long as she only does so under one goal or the other and not both. Ultimately this means that no matter what state Alice spawns in, she immediately signals the goal with her very first action, and continues to do so with each subsequent action.\n\nIn the competitive case with action information regularization, Alice instead wants to minimize KL[\u03c0g | \u03c00] and thus she wants her goal-dependent policies to match as much as possible. 
In the present case, this means that she chooses to go up as far as she can until she must pivot in order to still reach the goal in the minimal number of steps. This leads to her hiding the goal for as long as possible without sacrificing reward.

Without information regularization, Alice's policy is a random mixture of the competitive and cooperative strategies, the details of which are determined by initialization and the randomness of training trajectories. Regularization with state information yields similar policies to those generated with action information in all three cases, since in this deterministic setting, different action choices are tightly coupled with different state trajectories.

To demonstrate that Alice's goal revealing and hiding behaviors are useful for cooperation and competition, respectively, we then trained a second agent, Bob, who does not have access to the goal and instead must infer it from observing Alice. Thus, while Alice's inputs at time t were the present goal g and her state s_t^alice, Bob's are Alice's present state and action s_t^alice and a_t^alice, as well as his own state s_t^bob. Details are available in section S2.1, but in brief, Bob processes Alice's state-action trajectories with an RNN to form a belief about the goal, which then feeds into his policy, all of which is trained end-to-end via REINFORCE.

We trained 5 of each of the 3 versions of Alice above, and 10 Bobs per Alice. We plot the results for the best performing Bob for each Alice (so 5 × 3 = 15 curves) in figure 2. We use all 5 Alices to estimate the variance in our approach, but the best-of-10 Bob to provide a reasonable estimate of the best performance of a friend/foe.

We measure Bob's performance in terms of his episode length, relative to Alice's, as well as the percentage of time he beats Alice to the goal.
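The state-information quantity log(pemp(s|g)/pemp(s)) from the figure 1 color scale can be estimated directly from state-visit counts; a minimal sketch, with hypothetical state names and counts:

```python
from collections import Counter
import math

# Hypothetical per-goal state-visit counts from rollouts; a count-based
# estimate of log2( p_emp(s|g) / p_emp(s) ). States and counts are
# made up for illustration.
visits = {"g0": Counter({"left_cell": 8, "mid_cell": 2}),
          "g1": Counter({"right_cell": 8, "mid_cell": 2})}

def state_info_bits(s, g, visits):
    n_g = sum(visits[g].values())                      # visits under goal g
    n = sum(sum(c.values()) for c in visits.values())  # visits under all goals
    p_s_given_g = visits[g][s] / n_g                   # p_emp(s | g)
    p_s = sum(c[s] for c in visits.values()) / n       # p_emp(s)
    return math.log2(p_s_given_g / p_s)                # > 0: s reveals goal g

print(state_info_bits("left_cell", "g0", visits))  # 1.0 bit: reveals g0
print(state_info_bits("mid_cell", "g0", visits))   # 0.0 bits: visited under both goals
```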
For both action and state information regularization, encouraging Alice to hide goal information leads to Bob taking about 30% longer to reach the goal relative to when Alice is encouraged to share goal information. Information-hiding Alice receives a boost of similar magnitude in the frequency with which she beats Bob to the goal. Training without information regularization leads to results in between the competitive and cooperative strategies, although closer to the cooperative strategy in this case. We also note that the variance in Bob's performance was higher for the unregularized case, and much higher for the competitive case, with nearly zero variance in performance for the cooperative case, indicating that information hiding and sharing make training harder and easier, respectively.

Figure 2: The effect of Alice hiding/sharing goal information on Bob's performance. Left column: regularization with Iaction. Right column: regularization with Istate. Top row: Bob's episode length relative to Alice's (moving average over 500 episodes). Bottom row: the percentage of time Alice beats Bob to the goal (moving average over 1000 episodes).

4.2 Key-and-door game

In the above spatial navigation task, information regularization of Alice breaks symmetries between equally-preferred (in terms of reward) navigation strategies. However, in many scenarios, it might be worthwhile for an agent to give up some reward if it means large gains in the ability to hide or share information. To demonstrate that our approach could also discover such "lossy" strategies, we designed a simple key-and-door game with this feature (figure 3, left).

It is again a two-goal (door) game with the same action space and reward structure as the spatial navigation setting. Alice again alone receives the goal, and Bob must infer it from observing her.
The difference is that, in order to enter the terminal states, Alice and Bob must first pick up an appropriate key. Each agent has goal-specific keys that only they can pick up (top/bottom rows, color-coded to door, labeled with A/B for Alice/Bob). Alice also has access to a master key that can open both doors (center right). Agents can only pick up one key per episode - the first they encounter. Bob spawns in the same location every time (the "B"), while Alice spawns in any of the 3 spaces between her two goal-specific keys (the "A" and spaces above/below). This means that Bob has a shorter path to the goals, and thus if Alice telegraphs the goal right away, Bob will beat her to it. While Alice's master key is strictly on a longer path to the goal, picking it up allows her to delay informing Bob of the goal such that she can beat him to it.

We trained Alice with action information regularization as in the previous section (see section S2.2 for training parameters). When unregularized or encouraged to share goal information (β = .25), Alice simply took the shortest path to the goal, never picking up the master key. When Bob was trained on these Alices, he beat/tied her to the goal on approximately 100% of episodes (figure 3, right). When encouraged to hide information (β = −.25), however, we found that Alice learned to take the longer path via the master key on about half of initializations (example in figure 3, center). When Bob was trained on these Alices, he beat/tied her to the goal much less than half the time (figure 3, right). Thus, our approach successfully encourages Alice to forgo rewards during solo training in order to later compete more effectively in an interactive setting.

Figure 3: Key-and-door game results. Left: depiction of game.
Center: percentage of episodes in which Alice picks up the goal-specific vs master key during training in an example run (moving average over 100 episodes). Right: percentage of episodes in which Bob beats/ties Alice to the goal (moving average over 1000 episodes).

5 Discussion

In this work, we developed a new framework for building agents that balance reward-seeking with information-hiding/sharing behavior. We demonstrate that our approach allows agents to learn effective cooperative and competitive strategies in asymmetric information games without an explicit model of or interaction with the other agent(s). Such an approach could be particularly useful in settings where interactive training with other agents could be dangerous or costly, such as the training of expensive robots or the deployment of financial trading strategies.

We have here focused on simple environments with discrete and finite states, goals, and actions, and so we briefly describe how to generalize our approach to more complex environments. When optimizing Iaction with many or continuous actions, one could stochastically approximate the action sum in KL[πg | π0] and its gradient (as in [Whye Teh et al., 2017]). Alternatively, one could choose a form for the policy πg and base policy π0 such that the KL is analytic. For example, it is common for πg to be Gaussian when actions are continuous. If one also chooses to use a Gaussian approximation for π0 (forming a variational bound on Iaction), then KL[πg | π0] is closed form. For optimizing Istate with continuous states, one can no longer count states exactly, so these counts could be replaced with, for example, a pseudo-count based on an approximate density model [Bellemare et al., 2016, Ostrovski et al., 2017]. Of course, for both types of information regularization, continuous states or actions also necessitate using function approximation for the policy representation.
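For reference, the closed-form expression alluded to above: if πg = N(μg, Σg) and the Gaussian approximation to the base policy is π0 = N(μ0, Σ0) over k-dimensional actions, the standard Gaussian KL identity gives

```latex
\mathrm{KL}\left[\pi_g \,\|\, \pi_0\right]
= \frac{1}{2}\left[ \operatorname{tr}\left(\Sigma_0^{-1}\Sigma_g\right)
+ (\mu_0-\mu_g)^{\top} \Sigma_0^{-1} (\mu_0-\mu_g)
- k + \ln\frac{\det\Sigma_0}{\det\Sigma_g} \right]
```

With diagonal covariances this reduces to a sum of k one-dimensional terms, so the regularizer and its gradient remain cheap to evaluate.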
Finally, although we have assumed access to the goal distribution ρG, one could also approximate it from experience.

Acknowledgements

The authors would like to acknowledge Dan Roberts and our anonymous reviewers for careful comments on the original draft; Jane Wang, David Pfau, and Neil Rabinowitz for discussions on the original idea; and funding from the Hertz Foundation (DJ and Max), The Center for Brain, Minds and Machines (NSF #1231216) (Max and Josh), the NSF Center for the Physics of Biological Function (PHY-1734030) (David), and as a Simons Investigator in the MMLS (David).

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, 2016.

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), 2004.

Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS) 29, pages 1471–1479. 2016.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.
In Proceedings of the 2014 Conference on Empirical Methods in\nNatural Language Processing (EMNLP), pages 1724\u20131734, 2014.\n\nAnca D. Dragan, Kenton C.T. Lee, and Siddhartha S. Srinivasa. Legibility and predictability of robot\n\nmotion. International Conference on Human-Robot Interaction (HRI), pages 301\u2013308, 2013.\n\nBenjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is All You\nNeed: Learning Skills without a Reward Function. In International Conference on Learning\nRepresentations (ICLR), 2019.\n\nJakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor\nMordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International\nConference on Autonomous Agents and MultiAgent Systems (AAMAS), pages 122\u2013130, 2018.\n\nDylan Had\ufb01eld-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse\nreinforcement learning. In Advances in Neural Information Processing Systems (NIPS) 29, pages\n3909\u20133917, 2016.\n\nMark K Ho, Michael Littman, James MacGlashan, Fiery Cushman, and Joseph L Austerweil. Showing\nversus doing: Teaching by demonstration. In Advances In Neural Information Processing Systems\n(NIPS) 29, pages 3027\u20133035, 2016.\n\nEdward Hughes, Joel Z Leibo, Matthew Phillips, Karl Tuyls, Edgar Due\u00f1ez Guzman, Antonio Garc\u00eda\nCasta\u00f1eda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, Heather Roff, and Thore\nGraepel. Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in\nNeural Information Processing Systems (NIPS) 31, pages 3330\u20133340. 2018.\n\nMax Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David\nSilver, and Koray Kavukcuoglu. Reinforcement Learning with Unsupervised Auxiliary Tasks. 
In International Conference on Learning Representations (ICLR), 2017.

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Çaglar Gülçehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, and Nando de Freitas. Intrinsic social motivation via causal influence in multi-agent RL. CoRR, abs/1810.08647, 2018.

Max Kleiman-Weiner, Mark K Ho, Joseph L Austerweil, Michael L Littman, and Joshua B Tenenbaum. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, 2016.

Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pages 464–473, 2017.

Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning (ICML), pages 157–163, 1994.

Michael L Littman. Friend-or-foe Q-learning in general-sum games. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 322–328, 2001.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 1928–1937, 2016.

Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 663–670, 2000.

Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models.
In Proceedings of the 34th International Conference on Machine\nLearning (ICML), pages 2721\u20132730, 2017.\n\nAlexander Peysakhovich and Adam Lerer. Prosocial learning agents solve generalized stag hunts\nbetter than sel\ufb01sh ones. In Proceedings of the 17th International Conference on Autonomous\nAgents and MultiAgent Systems (AAMAS), pages 2043\u20132044, 2018.\n\nNeil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew\nBotvinick. Machine theory of mind. In Proceedings of the 35th International Conference on\nMachine Learning (ICML), pages 4218\u20134227, 2018.\n\nPatrick Shafto, Noah D Goodman, and Thomas L Grif\ufb01ths. A rational account of pedagogical\n\nreasoning: Teaching by, and learning from, examples. Cognitive psychology, 71:55\u201389, 2014.\n\nRichard S Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient meth-\nods for reinforcement learning with function approximation. In Advances in Neural Information\nProcessing Systems (NIPS) 12, pages 1057\u20131063. 2000.\n\nMichael Tomasello, Malinda Carpenter, Josep Call, Tanya Behne, and Henrike Moll. Understanding\nand sharing intentions: The origins of cultural cognition. Behavioral and Brain Sciences, 28(05):\n675\u2013691, 2005.\n\nJohn N. Tsitsiklis and Kuang Xu. Delay-predictability trade-offs in reaching a secret goal. Operations\n\nResearch, 66(2):587\u2013596, 2018.\n\nTomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, and Joshua B. Tenen-\nbaum. Help or hinder: Bayesian models of social goal inference. In Advances in Neural Information\nProcessing Systems (NIPS) 22, pages 1874\u20131882. 2009.\n\nYee Whye Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell,\nNicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances\nin Neural Information Processing Systems (NIPS) 30, pages 4496\u20134506. 
2017.