{"title": "Biases for Emergent Communication in Multi-agent Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 13111, "page_last": 13121, "abstract": "We study the problem of emergent communication, in which language arises because speakers and listeners must communicate information in order to solve tasks. In temporally extended reinforcement learning domains, it has proved hard to learn such communication without centralized training of agents, due in part to a difficult joint exploration problem. We introduce inductive biases for positive signalling and positive listening, which ease this problem. In a simple one-step environment, we demonstrate how these biases ease the learning problem. We also apply our methods to a more extended environment, showing that agents with these inductive biases achieve better performance, and analyse the resulting communications protocols.", "full_text": "Biases for Emergent Communication in\nMulti-agent Reinforcement Learning\n\nYoram Bachrach\n\nDeepMind\nLondon, UK\n\nyorambac@google.com\n\nTom Eccles\nDeepMind\nLondon, UK\n\neccles@google.com\n\nGuy Lever\nDeepMind\nLondon, UK\n\nguylever@google.com\n\nThore Graepel\n\nDeepMind\nLondon, UK\n\nthore@google.com\n\nAngeliki Lazaridou\n\nDeepMind\nLondon, UK\n\nangeliki@google.com\n\nAbstract\n\nWe study the problem of emergent communication, in which language arises\nbecause speakers and listeners must communicate information in order to solve\ntasks. In temporally extended reinforcement learning domains, it has proved hard\nto learn such communication without centralized training of agents, due in part to\na dif\ufb01cult joint exploration problem. We introduce inductive biases for positive\nsignalling and positive listening, which ease this problem. In a simple one-step\nenvironment, we demonstrate how these biases ease the learning problem. We\nalso apply our methods to a more extended environment, showing that agents\nwith these inductive biases achieve better performance, and analyse the resulting\ncommunication protocols.\n\nIntroduction\n\n1\nEnvironments where multiple learning agents interact can model important real-world problems,\nranging from multi-robot or autonomous vehicle control to societal social dilemmas [4, 26, 22].\nFurther, such systems leverage implicit natural curricula, and can serve as building blocks in the\nroute for constructing arit\ufb01cial general intelligence [21, 1]. Multi-agent games provide longstanding\ngrand-challenges for AI [16], with important recent successes such as learning a cooperative and\ncompetitive multi-player \ufb01rst-person video game to human level [14]. An important unsolved problem\nin multi-agent reinforcement learning (MARL) is communication between independent agents. In\nmany domains, agents can bene\ufb01t from sharing information about their beliefs, preferences and\nintents with their peers, allowing them to coordinate joint plans or jointly optimize objectives.\nA natural question that arises when agents inhibiting the same environment are given a communication\nchannel without an agreed protocol of communication is that of emergent communication [32, 35, 8]:\nhow would the agents learn a \u201clanguage\u201d over the joint channel, allowing them to maximize their\nutility? 
The most naturalistic model for emergent communication in MARL is that used in Reinforced Inter-Agent Learning (RIAL) [8], where agents optimize a message policy via reinforcement from the environment's reward signal. Unfortunately, straightforward implementations perform poorly [8], driving recent research to focus on differentiable communication models [8, 29, 33, 12], even though these models are less generally applicable or realistic.
RIAL offers the advantage of having decentralized training and execution; similarly to human communication, each agent treats others as a part of its environment, without the need to have access to other agents' internal parameters or to back-propagate gradients "through" parameters of others. Further, agents communicate with discrete symbols, providing symbolic scaffolding for extending to natural language. We build on these advantages, while facilitating joint exploration and learning via communication-specific inductive biases.
We tackle emergent communication through the lens of Paul Grice [10, 30], and capitalize on the dual view of communication in which interaction takes place between a speaker, whose goal is to be informative and relevant (adhering to the equivalent Gricean maxims), and a listener, who receives a piece of information and assumes that their speaker is cooperative (providing informative and relevant information). Our methodology is inspired by the recent work of Lowe et al. [24], who proposed a set of comprehensive measures of emergent communication along the two axes of positive signalling and positive listening, aiming to distinguish real cases of communication from pathological ones.
Our contribution: we formulate losses which encourage positive signalling and positive listening, which are used as auxiliary speaker and listener losses, respectively, and are appended to the RIAL communication framework. We design measures in the spirit of Lowe et al. [24], but rather than using these as an introspection tool, we use them as an optimization objective for emergent communication. We design two sets of experiments that help us clearly isolate the real contribution of communication to task success. In a one-step environment based on summing MNIST digits, we show that the biases we use facilitate the emergence of communication, and analyze how they change the learning problem. In a gridworld environment based on searching for a treasure, we show that the biases we use make communication appear more consistently, and we interpret the resulting protocol.

1.1 Related Work
Differentiable communication was considered for discrete messages [8, 29] and continuous messages [33, 12], by allowing gradients to flow through the communication channel. This improves performance, but effectively models multiple agents as a single entity. In contrast, we assume agents are independent learners, making the communication channel non-differentiable.
Earlier work on emergent communication focused on cooperative "embodied" agents, showing how communication helps accomplish a common goal [8, 29, 6], or investigating communication in mixed cooperative-competitive environments [25, 3, 15], studying properties of the emergent protocols [20, 17, 24].
Previous research has investigated independent reinforcement learners in cooperative settings [34], with more recent work focusing on canonical RL algorithms. One version of decentralized Q-learning converges to optimal policies in deterministic tabular environments without additional communication [18], but does not trivially extend to stochastic environments or function approximation. Centralized critics [25, 9] improve stability by allowing agents to use information from other agents during training, but these violate our assumptions of independence, and may not scale well.

2 Setting
We apply multi-agent reinforcement learning (MARL) in partially-observable Markov games (i.e. partially-observable stochastic games) [31, 23, 11], in environments where agents have a joint communication channel. In every state, agents take actions given partial observations of the true world state, including messages sent on a shared channel, and each agent obtains an individual reward. Through their individual experiences interacting with one another and the environment, agents learn to broadcast appropriate messages, interpret messages received from peers and act accordingly.
Formally, we consider an N-player partially observable Markov game G [31, 23] defined on a finite state set S, with action sets (A_1, ..., A_N) and message sets (M_1, ..., M_N). An observation function O : S × {1, ..., N} → R^d defines each agent's d-dimensional restricted view of the true state space. On each timestep t, each agent i receives as an observation o^i_t = O(S_t, i), and the messages m^j_{t-1} sent in the previous state for all j ≠ i. Each agent i then selects an environment action a^i_t ∈ A_i and a message action m^i_t ∈ M_i. Given the joint action (a^1_t, ..., a^N_t) ∈ (A_1, ..., A_N), the state changes based on a transition function T : S × A_1 × ··· × A_N → Δ(S); this is a stochastic transition, and we denote the set of discrete probability distributions over S as Δ(S). Every agent gets an individual reward r^i_t : S × A_1 × ··· × A_N → R for player i. We use the notation a_t = (a^1_t, ..., a^N_t), m_t = (m^1_t, ..., m^N_t), and o_t = (o^1_t, ..., o^N_t). We write m_{-i,t} for (m^1_t, ..., m^N_t) excluding m^i_t, and M_{-i} for (M_1, ..., M_N) excluding M_i.
In our fully cooperative setting, each agent receives the same reward at each timestep, r^i_t = r^j_t ∀ i, j ≤ N, which we denote by r_t. Each agent maintains an action and a message policy from which actions and messages are sampled, a^i_t ∼ π^i_A(·|x^i_t) and m^i_t ∼ π^i_M(·|x^i_t), which can in general be functions of their entire trajectory of experience x^i_t := (m_0, o^i_1, a^i_1, ..., a^i_{t-1}, m_{t-1}, o^i_t). These policies are optimized to maximize the discounted cumulative joint reward

J(π_A, π_M) := E_{π_A, π_M, T} [ Σ_{t=1}^∞ γ^{t-1} r_t ]

(which is discounted by γ < 1 to ensure convergence), where π_A := {π^1_A, ..., π^N_A} and π_M := {π^1_M, ..., π^N_M}. Although the objective J(π_A, π_M) is a joint objective, our model is that of decentralized learning and execution, where every agent has its own experience in the environment, and independently optimizes the objective J with respect to its own action and message policies π^i_A and π^i_M; there is no communication between agents other than using the actions and message channel in the environment. Applying independent reinforcement learning to cooperative Markov games results in a problem for each agent which is non-stationary and non-Markov, and presents difficult joint exploration and coordination problems [2, 5, 19, 27].
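To make this interaction pattern concrete, the following is a minimal sketch of decentralised execution in such a game: each agent sees only its own partial observation plus the other agents' previous messages, and emits both an environment action and a message. The class, method and environment interface names below are our own illustrative placeholders, not an API from the paper; real agents would sample from learned recurrent policies rather than acting uniformly at random.

import random

class IndependentAgent:
    """One decentralised learner holding its own action and message policies."""

    def __init__(self, action_space, message_space):
        self.action_space = action_space
        self.message_space = message_space

    def act(self, observation, received_messages):
        # Placeholder policies: a real agent would sample from pi_A and pi_M,
        # conditioned on its whole trajectory (e.g. via a recurrent network).
        action = random.choice(self.action_space)
        message = random.choice(self.message_space)
        return action, message

def run_episode(env, agents, max_steps=100):
    """Decentralised execution: agents interact only through the environment and the channel.

    Assumes a hypothetical env with reset() -> observations and
    step(actions, messages) -> (observations, rewards, done)."""
    observations = env.reset()
    messages = [0 for _ in agents]            # m_0: initial messages on the shared channel
    returns = [0.0 for _ in agents]
    for _ in range(max_steps):
        actions, new_messages = [], []
        for i, agent in enumerate(agents):
            others = [m for j, m in enumerate(messages) if j != i]
            a, m = agent.act(observations[i], others)
            actions.append(a)
            new_messages.append(m)
        observations, rewards, done = env.step(actions, new_messages)
        returns = [g + r for g, r in zip(returns, rewards)]
        messages = new_messages
        if done:
            break
    return returns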
3 Shaping Losses for Facilitating Communication
One difficulty in emergent communication is getting the communication channel to help with the task at all. There is an equilibrium where the speaker produces random symbols, and the listener's policy is independent of the communication. This might seem like an unstable equilibrium: if one agent uses the communication channel, however weakly, the other will have some learning signal. However, this is not the case in some tasks. If the task without communication has a single, deterministic optimal policy, then messages from policies sufficiently close to the uniform message policy should be ignored by the listener. Furthermore, any entropy costs imposed on the communication channel, which are often crucial for exploration, exacerbate the problem, as they produce a positive pressure for the speaker's policy to be close to random. Empirically, we often see agents fail to use the communication channel at all; but when agents start to use the channel meaningfully, they are then able to find at least a locally optimal solution to the communication problem.
We propose two shaping losses for communication to alleviate these problems. The first is for positive signalling [24]: encouraging the speaker to produce diverse messages in different situations. The second is for positive listening [24]: encouraging the listener to act differently for different messages. In each case, the goal is for one agent to learn to ascribe some meaning to the communication, even while the other does not, which eases the exploration problem for the other agent.
We note that most policies which maximize these biases do not lead to high reward. Much information about an agent's state is unhelpful to the task at hand, so with a limited communication channel positive signalling is not sufficient for useful communication. For positive listening, the situation is even worse: most ways of conditioning actions on messages are actively unhelpful to the task, particularly when the speaker has not developed a good protocol. These losses should therefore not be expected to lead directly to good communication. Rather, they are intended to ensure that the agents begin to use the communication channel at all; after this, MARL can find a useful protocol.

3.1 Bias for positive signalling
The first inductive bias we use promotes positive signalling, incentivizing the speaker to produce different messages in different situations. We add a loss term which is minimized by message policies that have high mutual information with the speaker's trajectory. This encourages the speaker to produce messages uniformly at random overall, but non-randomly when conditioned on the speaker's trajectory.
We denote by π̄^i_M the average message policy for agent i over all trajectories, weighted by how often they are visited under the current action policies for all agents. The mutual information of agent i's message m^i_t with their trajectory x^i_t is:

I(m^i_t, x^i_t) = H(m^i_t) − H(m^i_t | x^i_t)    (1)
                = −Σ_{m∈M_i} π̄^i_M(m) log(π̄^i_M(m)) + E_{x^i_t} Σ_{m∈M_i} π^i_M(m | x^i_t) log(π^i_M(m | x^i_t))    (2)

We estimate this mutual information from a batch of rollouts of the agent policy. We calculate H(m^i_t | x^i_t) exactly for each timestep from the agent's policy. To estimate H(π̄^i_M), we estimate π̄^i_M as the average message policy in the batch of experience. Intuitively, we would like to maximize I(m^i_t, x^i_t), so that the speaker's message depends maximally on their current trajectory. However, adding this objective as a loss for gradient descent leads to poor solutions. We hypothesize that this is due to properties of the loss landscape for this loss. Policies which maximize mutual information are deterministic for any particular trajectory x^i_t, but uniformly random unconditional on x^i_t. At such policies, the gradient of the term H(π^i_M(·|x^i_t)) is infinite. Further, for any c < log(2) the space of policies which have entropy at most c is disconnected, in that there is no continuous path in policy space between some policies in this set.
To overcome these problems, we instead use a loss which is minimized by a high value of H(π̄^i_M) and a target value of H(π^i_M | s^i). The loss we use is:

L_ps(π^i_M, s^i) = −E[ λ H(π̄^i_M) − (H(m^i_t | x_t) − H_target)^2 ],    (3)

for some target entropy H_target, which is a hyperparameter. This loss has finite gradients around its minima, and for suitable choices of H_target the space of policies which minimizes this loss is connected.
In practice, we found low sensitivity to H_target, and typically use a value of around log(|A|)/2, which is half the maximum possible entropy.

Algorithm 1 Calculation of positive signalling loss
Inputs (for each rollout b in a batch of B): observations o^i_t, actions a^i_t, and other agents' messages m_{t,-i} ∈ M_{-i} for 1 ≤ t ≤ T; initial hidden state h^i_0; action set A_i, message set M_i, observation space O_i, hidden state space H_i; message policy π^i_M : O_i × A_i × H_i × M_{-i} → M_i; hidden state update rule h^i : O_i × A_i × H_i × M_{-i} → H_i; target conditional entropy H_target; weighting λ for the conditional entropy term.
1: π̄_M ← 0
2: L_ps ← 0
3: for b = 1; b ≤ B; b++ do    # batch of rollouts
4:   for t = 1; t ≤ T; t++ do
5:     h^i_t ← h^i(o^i_t, a^i_{t-1}, h^i_{t-1}, m_{t-1,-i})
6:     p^i_t ← π^i_M(o^i_t, a^i_{t-1}, h^i_t, m_{t-1,-i})
7:     π̄_M ← π̄_M + p^i_t / (T × B)
8:     H^i_t ← −Σ_{m∈M_i} p^i_t(m) log(p^i_t(m))
9:     L_ps ← L_ps + λ (H^i_t − H_target)^2
10:   end for
11: end for
12: H ← Σ_{m∈M_i} π̄_M(m) log(π̄_M(m))
13: L_ps ← L_ps + T × B × H
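As a concrete illustration of this computation, the following is a minimal sketch of the positive signalling loss for a batch of per-timestep message distributions (for example, the softmax outputs of the message head). The array shapes, names and the use of NumPy are our own choices rather than the paper's implementation; following Algorithm 1, the weighting is applied to the conditional-entropy term.

import numpy as np

def positive_signalling_loss(message_probs, h_target, lam):
    """Positive signalling loss estimated from a batch of rollouts.

    message_probs: array [B, T, M] of per-timestep message distributions pi_M(.|x_t).
    h_target: target conditional entropy, e.g. log(M) / 2.
    lam: weight on the conditional-entropy term.
    """
    eps = 1e-8
    # Per-timestep conditional entropy H(m_t | x_t), computed exactly from the policy.
    cond_entropy = -np.sum(message_probs * np.log(message_probs + eps), axis=-1)   # [B, T]
    # Average message policy over the batch, an estimate of the marginal pi_bar.
    avg_policy = message_probs.mean(axis=(0, 1))                                   # [M]
    avg_entropy = -np.sum(avg_policy * np.log(avg_policy + eps))
    # Push the marginal entropy up, and the conditional entropy towards h_target.
    return -avg_entropy + lam * np.mean((cond_entropy - h_target) ** 2)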
3.2 Bias for positive listening
The second bias promotes positive listening: encouraging the listener to condition their actions on the communication channel. This gives the speaker some signal to learn to communicate, as its messages have an effect on the listener's policy and thus on the speaker's reward. The way we encourage positive listening is akin to the causal influence of communication, or CIC [15, 24]. In [15], this was used as a bias for the speaker, to produce influential messages, and in [24] as a measure of whether communication is taking place. We use a similar measure as a loss for the listener to be influenced by messages. In [15, 24], CIC was defined over one timestep as the mutual information between the speaker's message and the listener's action. We extend this to multiple timesteps using the mutual information between all of the speaker's previous messages and a single listener action; using this as an objective encourages the listener to pay attention to all the speaker's messages, rather than just the most recent. For a listener trajectory x_t = (o_1, a_1, m_1, ..., o_{t-1}, a_{t-1}, m_{t-1}, o_t), we define x'_t = (o_1, a_1, ..., a_{t-1}, o_t) (this is the trajectory x_t, with the messages removed). We define the multiple timestep CIC as:

CIC(x_t) = H(a_t | x'_t) − H(a_t | x_t)    (4)
         = D_KL( (a_t | x_t) || (a_t | x'_t) )    (5)

We estimate this multiple timestep CIC by learning the distribution π^i_A(·|x'_t). We do this by performing a rollout of the agent's policy network, with the actual observations and actions in the trajectory, and zero inputs in the place of the messages. We fit the resulting function π^i_A(·|x'_t) to predict π^i_A(·|x_t), using a cross-entropy loss between these distributions:

L_ce(x_t) = −Σ_{a∈A_i} π^i_A(a | x_t) log(π^i_A(a | x'_t)),    (6)

where we backpropagate only through the π^i_A(a|x'_t) term. For a given policy π^i_A, this loss is minimized in expectation when π^i_A(·|x'_t) = E(π^i_A(·|x_t)). Thus π^i_A(·|x'_t) is trained to be an approximation of the listener's policy unconditioned on the messages it has received. The multi-timestep CIC can then be estimated by the KL divergence between the message-conditioned policy and the unconditioned policy:

CIC(x_t) ≈ D_KL( π^i_A(·|x_t) || π^i_A(·|x'_t) ).    (7)

For training positive listening we use a different divergence between these two distributions, which we empirically find achieves more stable training. We use the L1 norm between the two distributions:

L_pl(x_t) = −Σ_{a∈A_i} | π^i_A(a | x_t) − π^i_A(a | x'_t) |.    (8)

Algorithm 2 Calculation of positive listening losses
Inputs: observations o^i_t, actions a^i_t, and other agents' messages m_{t,-i} ∈ M_{-i} for 1 ≤ t ≤ T; initial hidden states h^i_0 = h'_0; action set A_i, observation space O_i, hidden state space H_i; action policy π^i_A : O_i × A_i × H_i × M_{-i} → A_i; hidden state update rule h^i : O_i × A_i × H_i × M_{-i} → H_i.
1: L_ce ← 0
2: L_pl ← 0
3: for t = 1; t ≤ T; t++ do
4:   h^i_t ← h^i(o^i_t, a^i_{t-1}, h^i_{t-1}, m_{t-1,-i})
5:   p^i_t ← π^i_A(o^i_t, a^i_{t-1}, h^i_t, m_{t-1,-i})
6:   h'_t ← h^i(o^i_t, a^i_{t-1}, h'_{t-1}, 0)
7:   p'_t ← π^i_A(o^i_t, a^i_{t-1}, h'_t, 0)
8:   L_ce ← L_ce − Σ_{a∈A_i} stop_gradient(p^i_t(a)) log(p'_t(a))
9:   L_pl ← L_pl − Σ_{a∈A_i} | p^i_t(a) − stop_gradient(p'_t(a)) |
10: end for
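For illustration, the sketch below computes both quantities for a recurrent listener by running its policy twice over a trajectory, once with the received messages and once with a zeroed message input. It is a schematic under our own naming, with the recurrent step supplied by the caller; the stop-gradient behaviour of Algorithm 2 is only indicated in comments, since plain NumPy has no autodiff.

import numpy as np

def positive_listening_losses(policy_step, obs, actions, messages, h0, zero_msg):
    """Cross-entropy and positive listening losses for one listener trajectory.

    policy_step(o, a_prev, h, m_prev) -> (action_probs, new_hidden) is the listener's
    recurrent policy; obs, actions and messages are length-T sequences; zero_msg is the
    placeholder used in place of received messages. Handling of the first-step previous
    action (here None) is left to the caller's convention.
    """
    eps = 1e-8
    h, h_nc = h0, h0                       # hidden states with and without messages
    l_ce, l_pl = 0.0, 0.0
    for t in range(len(obs)):
        a_prev = actions[t - 1] if t > 0 else None
        m_prev = messages[t - 1] if t > 0 else zero_msg
        p, h = policy_step(obs[t], a_prev, h, m_prev)          # conditioned on messages
        p_nc, h_nc = policy_step(obs[t], a_prev, h_nc, zero_msg)  # messages zeroed out
        # Fit the no-message policy to the full policy; gradients should flow into p_nc only.
        l_ce += -np.sum(p * np.log(p_nc + eps))
        # Encourage the full policy to differ from the no-message policy (L1 distance);
        # gradients should flow into p only.
        l_pl += -np.sum(np.abs(p - p_nc))
    return l_ce, l_pl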
4 Empirical Analysis
We consider two environments. The first is a simple one-step environment, where agents must sum MNIST digits by communicating their value. This environment has the advantage of being very amenable to analysis, as we can readily quantify how valuable the communication channel currently is to each agent. In this environment, we provide evidence for our hypotheses about how the biases we introduce in Section 3 ease the learning of communication protocols. The second environment is a new multi-step MARL environment which we name Treasure Hunt. It is designed to have a clear performance ceiling for agents which do not utilise a communication channel. In this environment, we show that the biases enable agents to learn to communicate in a multi-step reinforcement learning environment. We also analyze the resulting protocols, finding interpretable protocols that allow us to intervene in the environment and observe the effect on listener behaviour. The full details of the Treasure Hunt environment, together with the hyperparameters used in our agents, can be found in the supplementary material.

4.1 Summing MNIST digits
In this task, depicted in Figure 1, the speaker and listener agents each observe a different MNIST digit (as an image), and must determine the sum of the digits. The speaker observes an MNIST digit, and selects one of 20 possible messages. The listener receives this message, observes an independent MNIST digit, and must produce one of 19 possible actions. If this action matches the sum of the digits, both agents get a fixed reward of 1; otherwise, both receive no reward. The agents used in this environment consist of a convolutional neural net, followed by a multi-layer perceptron and a linear layer to produce policy logits. For the listener, we concatenate the message sent, as a one-hot vector, to the output of the convnet. The agents are trained independently with REINFORCE.

Figure 1: Summing MNIST environment. In this example, both agents would get reward 1 if a = 9.

The purpose of this environment is to test whether and how the biases we propose ease the learning task. To do this, we quantify how useful the communication channel is to the speaker and to the listener. We periodically calculate the rewards for the following policies:

1. The optimal listener policy π_lc, given the current speaker and the labels of the listener's MNIST digits.
2. The optimal listener policy π_lnc, given the labels of the listener's MNIST digits and no communication channel.
3. The optimal speaker policy π_sc, given the current listener and the labels of the speaker's MNIST digits.
4. The uniform speaker policy π_su, given the current listener.

We calculate these quantities by running over many batches of MNIST digits, and calculating the optimal policies explicitly. The reward the listener can gain from using the communication channel is P_l(π_s) = R(π_lc, π_s) − R(π_lnc, π_s), so this is a proxy for the strength of the learning signal for the listener to use the channel. Similarly, P_s(π_l) = R(π_l, π_sc) − R(π_l, π_su) is how much reward the speaker can gain from using the communication channel, and so is a proxy for the strength of the learning signal for the speaker.

Figure 2: (a) Both biases lead to more reward. (b, c, d) Listener and speaker power in various settings. Listener power increases first with positive signalling, and speaker power increases first with positive listening.

The results (Figure 2) support the hypothesis that the bias for positive signalling eases the learning problem for the listener, and the bias for positive listening eases the learning problem for the speaker. When neither agent has any inductive bias, we see both P_l and P_s stay low throughout training, and the final reward of 0.1 is exactly what can be achieved in this environment with no communication. When we add a bias for positive signalling or positive listening, we see the communication channel used in most runs (Table 1), leading to greater reward, and P_s and P_l both increase. Importantly, when we add our inductive bias for positive listening, we see P_s increase initially, followed by P_l. This is consistent with the hypothesis that the positive listening bias produces a stronger learning signal for the speaker; then once the speaker has begun to learn to communicate meaningfully, the listener also has a strong learning signal. When we add the bias for positive signalling the reverse is true: P_l increases before P_s. This again fits the hypothesis that the speaker's bias produces a stronger learning signal for the listener.
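For intuition, the listener-side quantity defined above can be computed in closed form once the speaker's per-digit message distribution is known. The sketch below (our own notation, assuming a uniform digit prior) returns the expected reward of the optimal listener with and without the channel; their difference is P_l.

import numpy as np

def listener_channel_value(p_msg_given_digit, digit_prior=None):
    """Expected reward of the optimal listener with and without the channel.

    p_msg_given_digit: array [10, n_messages], the speaker's message distribution per digit.
    Because the listener knows its own digit, guessing the sum correctly is equivalent to
    guessing the speaker's digit correctly.
    """
    n_digits, _ = p_msg_given_digit.shape
    prior = np.full(n_digits, 1.0 / n_digits) if digit_prior is None else np.asarray(digit_prior)
    joint = prior[:, None] * p_msg_given_digit        # P(speaker digit, message)
    # With the channel: for each message, guess the most likely speaker digit.
    reward_with = joint.max(axis=0).sum()
    # Without the channel: guess the single most likely speaker digit a priori.
    reward_without = prior.max()
    return reward_with, reward_without

# P_l is the difference of the two returned values; for a near-uniform speaker it is close
# to zero, and reward_without matches the no-communication reward of 0.1 noted in the text.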
We also ran experiments with the speaker getting an extra reward for positive listening, as in [15]. However, we did not see any gain from this over the no-bias baseline; in our setup, it seems the speaker agent was unable to gain any influence over the listener. We think that there is a natural reason this bias would not help in this environment; for a fixed listener, the speaker policy which optimizes positive listening has no relation to the speaker's input. Thus this bias does not force the speaker to produce different messages for different inputs, and so does not increase the learning signal for the listener.

Biases | Proportion of good runs | CI | Final reward of good runs
No bias | 0.00 | 0.00-0.07 | N/A
Social influence | 0.00 | 0.00-0.07 | N/A
Positive listening | 0.88 | 0.76-0.94 | 0.92 ± 0.03
Positive signalling | 0.98 | 0.90-1.00 | 0.99 ± 0.00
Both | 1.00 | 0.93-1.00 | 0.98 ± 0.01

Table 1: Both biases lead to consistent discovery of useful communication. We define a good run to be one with final average reward greater than 0.2. Averages are over 50 runs for each setting.

4.2 Treasure Hunt
We propose a new cooperative RL environment called Treasure Hunt, where agents explore several tunnels to find treasure (see footnote 1). When successful, both agents receive a reward of 1. The agents have a limited field of view; one agent is able to efficiently find the treasure, but can never reach it, while the other can reach the treasure but must perform costly exploration to find it. In the optimal solution, the agent which can see the treasure finds it and communicates the position to the agent which can reach it. Agents communicate by sending one of five discrete symbols on each timestep. The precise generation rules for the environment can be found in the supplementary material.
The agents used in this environment are Advantage Actor-Critic agents [28] with the V-trace correction [7]. The agent architecture employs a single convolutional layer, followed by a multi-layer perceptron. The message from the other agent is concatenated to the output of the MLP, and fed into an LSTM. The network's action policy, message policy and value function heads are linear layers.

1 Videos for this environment can be found at https://youtu.be/eueK8WPkBYs and https://youtu.be/HJbVwh10jYk.

Figure 3: Treasure hunt environment.

Figure 4: Positive signalling and listening biases lead to more reward.

Biases | Proportion good | CI | Final reward | Final reward (good runs)
No bias | 0.28 | 0.18-0.42 | 12.45 ± 0.48 | 14.67 ± 0.29
Positive signalling | 0.84 | 0.71-0.92 | 14.22 ± 0.36 | 14.69 ± 0.18
Positive listening | 0.64 | 0.50-0.76 | 13.94 ± 0.44 | 14.95 ± 0.20
Both | 0.94 | 0.84-0.98 | 15.14 ± 0.33 | 15.41 ± 0.14

Table 2: Proportion and average reward of good runs. Values are means over 50 runs with 95% confidence intervals, calculated using the Wilson approximation in the case of Bernoulli variables.

Run | Mean visit time (unmodified) | Mean visit time (modified)
Median | 100.6 ± 14.7 | 36.1 ± 3.3
Best | 85.4 ± 14.1 | 41.3 ± 7.9

Table 3: Visit time to tunnel, with and without modified messages. Values are means over 100 episodes with 95% confidence intervals.

Our training follows the independent multi-agent reinforcement learning paradigm: each agent is trained independently using its own experience of states and actions. We use RMSProp [13] to adjust the weights of the agent's neural network. We co-train two agents, each in a consistent role (finder or collector) across episodes.
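Since the shaping terms are auxiliary losses, each agent's update simply adds them to its ordinary actor-critic loss. The sketch below shows this structure; the coefficients and the decomposition are illustrative placeholders of our own (the actual hyperparameters are in the supplementary material), not the paper's exact implementation.

def total_agent_loss(a2c_loss, l_ps, l_pl, l_ce, alpha_ps=1.0, alpha_pl=1.0):
    """Schematic per-agent training loss combining RL and shaping terms.

    a2c_loss: the agent's usual actor-critic loss (policy gradient, baseline and entropy terms).
    l_ps:     positive signalling loss on the agent's own message policy (speaker side).
    l_pl:     positive listening loss on the agent's own action policy (listener side).
    l_ce:     cross-entropy loss that fits the message-ablated policy used by l_pl.
    alpha_ps, alpha_pl: illustrative weighting coefficients, not the paper's values.
    """
    # Each agent minimises this with respect to its own parameters only; no gradients flow
    # between agents, so training remains fully decentralised.
    return a2c_loss + alpha_ps * l_ps + alpha_pl * l_pl + l_ce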
The results are shown in Table 2. We find that biases for positive signalling and positive listening both lead to increased reward, and adding either bias to the agents leads to more consistent discovery of useful communication protocols; we define these as runs which get reward greater than 13, the maximum final reward in 50 runs with no communication. With or without biases, the agents still frequently discover only local optima, for example protocols where the agent which can find treasure reports on the status of only one tunnel, leaving the other to search the remaining tunnels. This demonstrates a limitation of these methods; positive signalling and listening biases are useful for finding some helpful communication protocol, but they do not completely solve the joint exploration problem in emergent communication. However, among runs which achieve some communication, we see greater reward on average among runs with both biases, corresponding to reaching better local optima for the communication protocol on average.
We also ran experiments with the speaker getting an extra reward for influencing the listener, as in [15]. Here, we used the centralized model in [15], where the listener calculates the social influence of the speaker's messages, and the speaker gets an intrinsic reward for increasing this influence. We did not see a significant improvement in task reward, as compared to communication with no additional bias.
We analyze the communication protocols for two runs, which correspond to the two videos linked in footnote 1. One is a typical solution among runs where communication emerges; we picked this run by taking the median final reward out of all runs with both positive signalling and positive listening biases enabled. Qualitatively, the behaviour is simple: the finder finds the rightmost tunnel, and then reports whether there is treasure in that tunnel for the remainder of the episode. The other run we analyze is the one with the greatest final reward; this has more complicated communication behaviour. To analyze these runs, we rolled out 100 episodes using the final policies from each.
First, we relate the finder's communication protocol to the actual location of the treasure on each frame; in both runs, we see that these are well correlated. In the median run, we see that one symbol relates strongly to the presence of treasure; when this symbol is sent, the treasure is in the rightmost tunnel around 75% of the time. In the best run, where multiple tunnels appear to be reported on by the finder, the protocol is more complicated, with various symbols correlating with one or more tunnels. Details of the correlations between tunnels and symbols can be found in the supplementary material.
Next, we intervene in the environment to demonstrate that these communication protocols have the expected effect on the collector. For each of these pairs of agents, we produce a version of the environment where the message channel is overridden, starting after 100 frames.
We override the channel with a constant message, using the symbol which most strongly indicates a particular tunnel. We then measure how long the collector takes to reach a square 3 from the bottom, where the agent is just out of view of the treasure. In Table 3, we compare this to the baseline where we do not override the communication channel. In both cases, the collector reaches the tunnel significantly faster than in the baseline, indicating that the finder's consistent communication is being acted on as expected.

5 Conclusion
We introduced two new shaping losses to encourage the emergence of communication in decentralized learning; one on the speaker's side for positive signalling, and one on the listener's side for positive listening. In a simple environment, we showed that these losses have the intended effect of easing the learning problem for the other agent, and so increase the consistency with which agents learn useful communication protocols. In a temporally extended environment, we again showed that these losses increase the consistency with which agents learn to communicate.
Several questions remain open for future research. Firstly, we investigate only fully cooperative environments; can this approach help in environments which are neither fully cooperative nor fully competitive? In such settings, both positive signalling and positive listening can be harmful to an agent, as it becomes more easily exploited via the communication channel. However, since the losses we use mainly serve to ensure the communication channel starts to be used, this may not be as large a problem as it initially seems. Secondly, the environments investigated here have difficult communication problems, but are otherwise simple; can these methods be extended to improve the performance of decentralized agents in large-scale multi-agent domains? There are a few dimensions along which these experiments could be scaled: to more complex observations and action spaces, but also to environments with more than two players, and to larger communication channels.

References
[1] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
[2] Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman. The complexity of decentralized control of Markov Decision Processes. In UAI '00: Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, Stanford University, Stanford, California, USA, June 30 - July 3, 2000, pages 32-37, 2000.
[3] Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark. Emergent communication through negotiation. arXiv preprint arXiv:1804.03980, 2018.
[4] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Trans. Industrial Informatics, 9(1):427-438, 2013.
[5] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems.
In Proceedings of the Fifteenth National Conference on Arti\ufb01cial Intelligence\nand Tenth Innovative Applications of Arti\ufb01cial Intelligence Conference, AAAI 98, IAAI 98, July\n26-30, 1998, Madison, Wisconsin, USA., pages 746\u2013752, 1998.\n\n[6] Abhishek Das, Th\u00e9ophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Michael Rab-\nbat, and Joelle Pineau. Tarmac: Targeted multi-agent communication. arXiv preprint\narXiv:1810.11187, 2018.\n\n9\n\n\f[7] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu,\nT. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable Distributed Deep-RL\nwith Importance Weighted Actor-Learner Architectures. ArXiv e-prints, February 2018.\n\n[8] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to\ncommunicate with deep multi-agent reinforcement learning. In Advances in Neural Information\nProcessing Systems, pages 2137\u20132145, 2016.\n\n[9] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon\nWhiteson. Counterfactual multi-agent policy gradients. In Proceedings of the Thirty-Second\nAAAI Conference on Arti\ufb01cial Intelligence, (AAAI-18), the 30th innovative Applications of\nArti\ufb01cial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in\nArti\ufb01cial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages\n2974\u20132982, 2018.\n\n[10] H Paul Grice. Utterer\u2019s meaning, sentence-meaning, and word-meaning.\n\nLanguage, and Arti\ufb01cial Intelligence, pages 49\u201366. Springer, 1968.\n\nIn Philosophy,\n\n[11] Eric A. Hansen, Daniel S. Bernstein, and Shlomo Zilberstein. Dynamic programming for\npartially observable stochastic games. In Proceedings of the 19th National Conference on\nArti\ufb01cal Intelligence, AAAI\u201904, pages 709\u2013715. AAAI Press, 2004.\n\n[12] Matthew John Hausknecht. Cooperation and communication in multiagent deep reinforcement\n\nlearning. PhD thesis, University of Texas at Austin, Austin, USA, 2016.\n\n[13] G. Hinton, N. Srivastava, , and K. Swersky. Lecture 6a: Overview of mini\u2013batch gradient\n\ndescent. Coursera, 2012.\n\n[14] Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garc\u00eda\nCasta\u00f1eda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas\nSonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray\nKavukcuoglu, and Thore Graepel. Human-level performance in \ufb01rst-person multiplayer games\nwith population-based deep reinforcement learning. CoRR, abs/1807.01281, 2018.\n\n[15] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A Ortega,\nDJ Strouse, Joel Z Leibo, and Nando de Freitas. Intrinsic social motivation via causal in\ufb02uence\nin multi-agent rl. International Conference of Machine Learning, 2019.\n\n[16] Hiroaki Kitano, Minoru Asada, Yasuo Kuniyoshi, Itsuki Noda, and Eiichi Osawa. Robocup:\n\nThe robot world cup initiative. In Agents, pages 340\u2013347, 1997.\n\n[17] Satwik Kottur, Jos\u00e9 MF Moura, Stefan Lee, and Dhruv Batra. Natural language does not\n\nemerge\u2019naturally\u2019in multi-agent dialog. arXiv preprint arXiv:1706.08502, 2017.\n\n[18] Martin Lauer and Martin A. Riedmiller. An algorithm for distributed reinforcement learning in\ncooperative multi-agent systems. 
In Proceedings of the Seventeenth International Conference\non Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2,\n2000, pages 535\u2013542, 2000.\n\n[19] Guillaume J. Laurent, La\u00ebtitia Matignon, and Nadine Le Fort-Piat. The world of Indepen-\ndent learners is not Markovian. International Journal of Knowledge-Based and Intelligent\nEngineering Systems, 15(1):55\u201364, March 2011.\n\n[20] Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of\nlinguistic communication from referential games with symbolic and pixel input. arXiv preprint\narXiv:1804.03984, 2018.\n\n[21] Joel Z. Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. Autocurricula and the\nemergence of innovation from social interaction: A manifesto for multi-agent intelligence\nresearch. CoRR, abs/1903.00742, 2019.\n\n[22] Joel Z. Leibo, Vin\u00edcius Flores Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel.\nMulti-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th\nConference on Autonomous Agents and MultiAgent Systems, AAMAS 2017, S\u00e3o Paulo, Brazil,\nMay 8-12, 2017, pages 464\u2013473, 2017.\n\n10\n\n\f[23] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In\nIn Proceedings of the Eleventh International Conference on Machine Learning, pages 157\u2013163.\nMorgan Kaufmann, 1994.\n\n[24] Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. On the pitfalls\n\nof measuring emergent communication. arXiv preprint arXiv:1903.05168, 2019.\n\n[25] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-\nagent actor-critic for mixed cooperative-competitive environments. In Advances in Neural\nInformation Processing Systems, pages 6379\u20136390, 2017.\n\n[26] La\u00ebtitia Matignon, Laurent Jeanpierre, and Abdel-Illah Mouaddib. Coordinated multi-robot\nexploration under communication constraints using decentralized markov decision processes.\nIn Proceedings of the Twenty-Sixth AAAI Conference on Arti\ufb01cial Intelligence, July 22-26, 2012,\nToronto, Ontario, Canada., 2012.\n\n[27] La\u00ebtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Independent reinforcement\nlearners in cooperative markov games: a survey regarding coordination problems. Knowledge\nEng. Review, 27(1):1\u201331, 2012.\n\n[28] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lilli-\ncrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep\nreinforcement learning. In International conference on machine learning, pages 1928\u20131937,\n2016.\n\n[29] Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-\n\nagent populations. In Thirty-Second AAAI Conference on Arti\ufb01cial Intelligence, 2018.\n\n[30] Stephen Neale. Paul grice and the philosophy of language. Linguistics and philosophy,\n\n15(5):509\u2013559, 1992.\n\n[31] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences of the\n\nUnited States of America, 39(10):1095\u20131100, 1953.\n\n[32] Luc Steels. Evolving grounded communication for robots. Trends in cognitive sciences,\n\n7(7):308\u2013312, 2003.\n\n[33] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication\nwith backpropagation. 
In Advances in Neural Information Processing Systems 29: Annual\nConference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona,\nSpain, pages 2244\u20132252, 2016.\n\n[34] Ming Tan. Multi-agent reinforcement learning: Independent versus cooperative agents. In Ma-\nchine Learning, Proceedings of the Tenth International Conference, University of Massachusetts,\nAmherst, MA, USA, June 27-29, 1993, pages 330\u2013337, 1993.\n\n[35] Kyle Wagner, James A Reggia, Juan Uriagereka, and Gerald S Wilkinson. Progress in the\nsimulation of emergent communication and language. Adaptive Behavior, 11(1):37\u201369, 2003.\n\n11\n\n\f", "award": [], "sourceid": 7196, "authors": [{"given_name": "Tom", "family_name": "Eccles", "institution": "DeepMind"}, {"given_name": "Yoram", "family_name": "Bachrach", "institution": null}, {"given_name": "Guy", "family_name": "Lever", "institution": "Google DeepMind"}, {"given_name": "Angeliki", "family_name": "Lazaridou", "institution": "DeepMind"}, {"given_name": "Thore", "family_name": "Graepel", "institution": "DeepMind"}]}