{"title": "Learning to Communicate with Deep Multi-Agent Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2137, "page_last": 2145, "abstract": "We consider the problem of multiple agents sensing and acting in environments with the goal of maximising their shared utility. In these environments, agents must learn communication protocols in order to share information that is needed to solve the tasks. By embracing deep neural networks, we are able to demonstrate end-to-end learning of protocols in complex environments inspired by communication riddles and multi-agent computer vision problems with partial observability. We propose two approaches for learning in these domains: Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL). The former uses deep Q-learning, while the latter exploits the fact that, during learning, agents can backpropagate error derivatives through (noisy) communication channels. Hence, this approach uses centralised learning but decentralised execution. Our experiments introduce new environments for studying the learning of communication protocols and present a set of engineering innovations that are essential for success in these domains.", "full_text": "Learning to Communicate with\n\nDeep Multi-Agent Reinforcement Learning\n\nJakob N. Foerster1,\u2020\n\njakob.foerster@cs.ox.ac.uk\n\nYannis M. Assael1,\u2020\n\nyannis.assael@cs.ox.ac.uk\n\nNando de Freitas1,2,3\n\nnandodefreitas@google.com\n\nShimon Whiteson1\n\nshimon.whiteson@cs.ox.ac.uk\n\n1University of Oxford, United Kingdom\n\n2Canadian Institute for Advanced Research, CIFAR NCAP Program\n\n3Google DeepMind\n\nAbstract\n\nWe consider the problem of multiple agents sensing and acting in environments\nwith the goal of maximising their shared utility. In these environments, agents must\nlearn communication protocols in order to share information that is needed to solve\nthe tasks. By embracing deep neural networks, we are able to demonstrate end-\nto-end learning of protocols in complex environments inspired by communication\nriddles and multi-agent computer vision problems with partial observability. We\npropose two approaches for learning in these domains: Reinforced Inter-Agent\nLearning (RIAL) and Differentiable Inter-Agent Learning (DIAL). The former uses\ndeep Q-learning, while the latter exploits the fact that, during learning, agents can\nbackpropagate error derivatives through (noisy) communication channels. Hence,\nthis approach uses centralised learning but decentralised execution. Our experi-\nments introduce new environments for studying the learning of communication\nprotocols and present a set of engineering innovations that are essential for success\nin these domains.\n\n1\n\nIntroduction\n\nHow language and communication emerge among intelligent agents has long been a topic of intense\ndebate. Among the many unresolved questions are: Why does language use discrete structures?\nWhat role does the environment play? What is innate and what is learned? And so on. Some of the\ndebates on these questions have been so \ufb01ery that in 1866 the French Academy of Sciences banned\npublications about the origin of human language.\nThe rapid progress in recent years of machine learning, and deep learning in particular, opens the\ndoor to a new perspective on this debate. How can agents use machine learning to automatically\ndiscover the communication protocols they need to coordinate their behaviour? 
What, if anything,\ncan deep learning offer to such agents? What insights can we glean from the success or failure of\nagents that learn to communicate?\nIn this paper, we take the \ufb01rst steps towards answering these questions. Our approach is programmatic:\n\ufb01rst, we propose a set of multi-agent benchmark tasks that require communication; then, we formulate\nseveral learning algorithms for these tasks; \ufb01nally, we analyse how these algorithms learn, or fail to\nlearn, communication protocols for the agents.\n\n\u2020These authors contributed equally to this work.\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fThe tasks that we consider are fully cooperative, partially observable, sequential multi-agent decision\nmaking problems. All the agents share the goal of maximising the same discounted sum of rewards.\nWhile no agent can observe the underlying Markov state, each agent receives a private observation\ncorrelated with that state. In addition to taking actions that affect the environment, each agent can\nalso communicate with its fellow agents via a discrete limited-bandwidth channel. Due to the partial\nobservability and limited channel capacity, the agents must discover a communication protocol that\nenables them to coordinate their behaviour and solve the task.\nWe focus on settings with centralised learning but decentralised execution. In other words, com-\nmunication between agents is not restricted during learning, which is performed by a centralised\nalgorithm; however, during execution of the learned policies, the agents can communicate only via the\nlimited-bandwidth channel. While not all real-world problems can be solved in this way, a great many\ncan, e.g., when training a group of robots on a simulator. Centralised planning and decentralised\nexecution is also a standard paradigm for multi-agent planning [1, 2].\nTo address this setting, we formulate two approaches. The \ufb01rst, reinforced inter-agent learning\n(RIAL), uses deep Q-learning [3] with a recurrent network to address partial observability. In one\nvariant of this approach, which we refer to as independent Q-learning, the agents each learn their\nown network parameters, treating the other agents as part of the environment. Another variant trains\na single network whose parameters are shared among all agents. Execution remains decentralised, at\nwhich point they receive different observations leading to different behaviour.\nThe second approach, differentiable inter-agent learning (DIAL), is based on the insight that cen-\ntralised learning affords more opportunities to improve learning than just parameter sharing. In\nparticular, while RIAL is end-to-end trainable within an agent, it is not end-to-end trainable across\nagents, i.e., no gradients are passed between agents. The second approach allows real-valued mes-\nsages to pass between agents during centralised learning, thereby treating communication actions as\nbottleneck connections between agents. As a result, gradients can be pushed through the communica-\ntion channel, yielding a system that is end-to-end trainable even across agents. During decentralised\nexecution, real-valued messages are discretised and mapped to the discrete set of communication\nactions allowed by the task. 
Because DIAL passes gradients from agent to agent, it is an inherently deep learning approach.

Experiments on two benchmark tasks, based on the MNIST dataset and a well-known riddle, show that not only can these methods solve these tasks, they often discover elegant communication protocols along the way. To our knowledge, this is the first time that either differentiable communication or reinforcement learning with deep neural networks has succeeded in learning communication protocols in complex environments involving sequences and raw images. The results also show that deep learning, by better exploiting the opportunities of centralised learning, is a uniquely powerful tool for learning such protocols. Finally, this study advances several engineering innovations that are essential for learning communication protocols in our proposed benchmarks.

2 Related Work

Research on communication spans many fields, e.g. linguistics, psychology, evolution and AI. In AI, it is split along a few axes: a) predefined or learned communication protocols, b) planning or learning methods, c) evolution or RL, and d) cooperative or competitive settings.

Given the topic of our paper, we focus on related work that deals with the cooperative learning of communication protocols. Out of the plethora of work on multi-agent RL with communication, e.g., [4-7], only a few fall into this category. Most assume a pre-defined communication protocol, rather than trying to learn protocols. One exception is the work of Kasai et al. [7], in which tabular Q-learning agents have to learn the content of a message to solve a predator-prey task with communication. Another example of open-ended communication learning in a multi-agent task is given in [8]. Here evolutionary methods are used for learning the protocols, which are evaluated on a similar predator-prey task. Their approach uses a fitness function that is carefully designed to accelerate learning. In general, heuristics and handcrafted rules have prevailed widely in this line of research. Moreover, typical tasks have been necessarily small so that global optimisation methods, such as evolutionary algorithms, can be applied. The use of deep representations and gradient-based optimisation as advocated in this paper is an important departure, essential for scalability and further progress. A similar rationale is provided in [9], another example of making an RL problem end-to-end differentiable.

Unlike the recent work in [10], we consider discrete communication channels. One of the key components of our methods is the signal binarisation during the decentralised execution. This is related to recent research on fitting neural networks in low-powered devices with memory and computational limitations using binary weights, e.g. [11], and previous work on discovering binary codes for documents [12].
3 Background

Deep Q-Networks (DQN). In a single-agent, fully-observable RL setting [13], an agent observes the current state s_t ∈ S at each discrete time step t, chooses an action u_t ∈ U according to a potentially stochastic policy π, observes a reward signal r_t, and transitions to a new state s_{t+1}. Its objective is to maximise an expectation over the discounted return, R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ···, where r_t is the reward received at time t and γ ∈ [0, 1] is a discount factor. The Q-function of a policy π is Q^π(s, u) = E[R_t | s_t = s, u_t = u]. The optimal action-value function Q*(s, u) = max_π Q^π(s, u) obeys the Bellman optimality equation Q*(s, u) = E_{s'}[r + γ max_{u'} Q*(s', u') | s, u]. Deep Q-learning [3] uses neural networks parameterised by θ to represent Q(s, u; θ). DQNs are optimised by minimising L_i(θ_i) = E_{s,u,r,s'}[(y_i^{DQN} − Q(s, u; θ_i))²] at each iteration i, with target y_i^{DQN} = r + γ max_{u'} Q(s', u'; θ_i^−). Here, θ_i^− are the parameters of a target network that is frozen for a number of iterations while updating the online network Q(s, u; θ_i). The action u is chosen from Q(s, u; θ_i) by an action selector, which typically implements an ε-greedy policy that selects the action that maximises the Q-value with a probability of 1 − ε and chooses randomly with a probability of ε. DQN also uses experience replay: during learning, the agent builds a dataset of episodic experiences and is then trained by sampling mini-batches of experiences.

Independent DQN. DQN has been extended to cooperative multi-agent settings, in which each agent a observes the global s_t, selects an individual action u^a_t, and receives a team reward, r_t, shared among all agents. Tampuu et al. [14] address this setting with a framework that combines DQN with independent Q-learning, in which each agent a independently and simultaneously learns its own Q-function Q^a(s, u^a; θ^a_i). While independent Q-learning can in principle lead to convergence problems (since one agent's learning makes the environment appear non-stationary to other agents), it has a strong empirical track record [15, 16], and was successfully applied to two-player Pong.

Deep Recurrent Q-Networks. Both DQN and independent DQN assume full observability, i.e., the agent receives s_t as input. By contrast, in partially observable environments, s_t is hidden and the agent receives only an observation o_t that is correlated with s_t, but in general does not disambiguate it. Hausknecht and Stone [17] propose deep recurrent Q-networks (DRQN) to address single-agent, partially observable settings. Instead of approximating Q(s, u) with a feed-forward network, they approximate Q(o, u) with a recurrent neural network that can maintain an internal state and aggregate observations over time. This can be modelled by adding an extra input h_{t−1} that represents the hidden state of the network, yielding Q(o_t, h_{t−1}, u). For notational simplicity, we omit the dependence of Q on θ.

4 Setting

In this work, we consider RL problems with both multiple agents and partial observability. All the agents share the goal of maximising the same discounted sum of rewards R_t. While no agent can observe the underlying Markov state s_t, each agent a receives a private observation o^a_t correlated with s_t. In every time-step t, each agent selects an environment action u^a_t ∈ U that affects the environment, and a communication action m^a_t ∈ M that is observed by other agents but has no direct impact on the environment or reward. We are interested in such settings because it is only when multiple agents and partial observability coexist that agents have the incentive to communicate.
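To make this setting concrete, the following minimal sketch (Python; all class, method, and variable names are illustrative and not taken from the released implementation) shows the interaction pattern of one episode: each agent sees only its private observation and the messages sent at the previous step, and returns an environment action together with a communication action, while a single team reward is shared by all agents.

import random

class RandomAgent:
    """Placeholder policy: a learned agent would instead condition on its
    whole action-observation history (e.g. with a recurrent network)."""
    def __init__(self, n_env_actions, n_messages):
        self.n_env_actions = n_env_actions
        self.n_messages = n_messages

    def act(self, observation, incoming_messages):
        u = random.randrange(self.n_env_actions)   # environment action u_t^a in U
        m = random.randrange(self.n_messages)      # communication action m_t^a in M
        return u, m

class ToyEnv:
    """Toy stand-in for a partially observable, fully cooperative task."""
    def __init__(self, n_agents):
        self.n_agents, self.state, self.t = n_agents, 0, 0

    def observe(self, a):
        return (self.state + a) % 2                # each agent gets a different partial view

    def step(self, env_actions):
        self.state, self.t = self.state + 1, self.t + 1
        reward = 1.0 if len(set(env_actions)) == 1 else 0.0   # one shared team reward
        return reward, self.t >= 5

def run_episode(env, agents):
    prev_messages = [0] * len(agents)              # no messages exist at t = 0
    total, done = 0.0, False
    while not done:
        acts, msgs = [], []
        for a, agent in enumerate(agents):
            u, m = agent.act(env.observe(a), prev_messages)
            acts.append(u)
            msgs.append(m)
        reward, done = env.step(acts)              # only env actions affect state and reward
        total += reward
        prev_messages = msgs                       # messages reach the others next step
    return total

agents = [RandomAgent(n_env_actions=2, n_messages=2) for _ in range(2)]
print(run_episode(ToyEnv(n_agents=2), agents))

The key structural point is that the messages never enter the environment's transition or reward function; they only change what the other agents observe at the next step.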
As no communication protocol is given a priori, the agents must develop and agree upon such a protocol to solve the task. Since protocols are mappings from action-observation histories to sequences of messages, the space of protocols is extremely high-dimensional. Automatically discovering effective protocols in this space remains an elusive challenge. In particular, the difficulty of exploring this space of protocols is exacerbated by the need for agents to coordinate the sending and interpreting of messages. For example, if one agent sends a useful message to another agent, it will only receive a positive reward if the receiving agent correctly interprets and acts upon that message. If it does not, the sender will be discouraged from sending that message again. Hence, positive rewards are sparse, arising only when sending and interpreting are properly coordinated, which is hard to discover via random exploration.

We focus on settings where communication between agents is not restricted during centralised learning, but during the decentralised execution of the learned policies, the agents can communicate only via a limited-bandwidth channel.

5 Methods

In this section, we present two approaches for learning communication protocols.

5.1 Reinforced Inter-Agent Learning

The most straightforward approach, which we call reinforced inter-agent learning (RIAL), is to combine DRQN with independent Q-learning for action and communication selection. Each agent's Q-network represents Q^a(o^a_t, m^{a'}_{t−1}, h^a_{t−1}, u^a), which conditions on that agent's individual hidden state h^a_{t−1} and observation o^a_t, as well as the messages from the other agents m^{a'}_{t−1}.

To avoid needing a network with |U||M| outputs, we split the network into Q^a_u and Q^a_m, the Q-values for the environment and communication actions, respectively. Similarly to [18], the action selector separately picks u^a_t and m^a_t from Q_u and Q_m, using an ε-greedy policy. Hence, the network requires only |U| + |M| outputs and action selection requires maximising over U and then over M, but not maximising over U × M.

Both Q_u and Q_m are trained using DQN with the following two modifications, which were found to be essential for performance. First, we disable experience replay to account for the non-stationarity that occurs when multiple agents learn concurrently, as it can render experience obsolete and misleading. Second, to account for partial observability, we feed in the actions u and m taken by each agent as inputs on the next time-step. Figure 1(a) shows how information flows between agents and the environment, and how Q-values are processed by the action selector in order to produce the environment action, u^a_t, and message, m^a_t. Since this approach treats agents as independent networks, the learning phase is not centralised, even though our problem setting allows it to be. Consequently, the agents are treated exactly the same way during decentralised execution as during learning.

(a) RIAL - RL based communication    (b) DIAL - Differentiable communication

Figure 1: The bottom and top rows represent the communication flow for agent a1 and agent a2, respectively. In RIAL (a), all Q-values are fed to the action selector, which selects both environment and communication actions. Gradients, shown in red, are computed using DQN for the selected action and flow only through the Q-network of a single agent. In DIAL (b), the message m^a_t bypasses the action selector and instead is processed by the DRU (Section 5.2) and passed as a continuous value to the next C-network. Hence, gradients flow across agents, from the recipient to the sender. For simplicity, at each time step only one agent is highlighted, while the other agent is greyed out.
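As a rough illustration of RIAL's split output, the sketch below (Python/NumPy; a simplification for exposition, not the authors' implementation) selects the environment action and the message with two independent ε-greedy choices over the separate heads, so only |U| + |M| values are needed rather than |U|·|M|, and shows the 1-step target used to train each head.

import numpy as np

def select_actions(q_u, q_m, epsilon, rng):
    """Pick the environment action and the message with two independent
    epsilon-greedy choices over the separate heads Q_u and Q_m."""
    u = int(rng.integers(len(q_u))) if rng.random() < epsilon else int(np.argmax(q_u))
    m = int(rng.integers(len(q_m))) if rng.random() < epsilon else int(np.argmax(q_m))
    return u, m

def dqn_target(reward, q_next, gamma=1.0, terminal=False):
    """1-step Q-learning target, computed from the frozen target network's
    Q-values for the next step; the same form is used for both heads."""
    return reward if terminal else reward + gamma * np.max(q_next)

rng = np.random.default_rng(0)
q_u = np.array([0.1, 0.7, -0.2])       # |U| = 3 environment actions
q_m = np.array([0.3, -0.1])            # |M| = 2 messages
print(select_actions(q_u, q_m, epsilon=0.05, rng=rng))
print(dqn_target(reward=0.0, q_next=np.array([0.4, 0.9])))

Per the two modifications above, in RIAL these targets are used without an experience replay buffer, and the previously selected u and m are appended to the next step's network input.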
Parameter Sharing. RIAL can be extended to take advantage of the opportunity for centralised learning by sharing parameters among the agents. This variation learns only one network, which is used by all agents. However, the agents can still behave differently because they receive different observations and thus evolve different hidden states. In addition, each agent receives its own index a as input, allowing it to specialise. The rich representations in deep Q-networks can facilitate the learning of a common policy while also allowing for specialisation. Parameter sharing also dramatically reduces the number of parameters that must be learned, thereby speeding learning. Under parameter sharing, the agents learn two Q-functions Q_u(o^a_t, m^{a'}_{t−1}, h^a_{t−1}, u^a_{t−1}, m^a_{t−1}, a, u^a_t) and Q_m(o^a_t, m^{a'}_{t−1}, h^a_{t−1}, u^a_{t−1}, m^a_{t−1}, a, m^a_t). During decentralised execution, each agent uses its own copy of the learned network, evolving its own hidden state, selecting its own actions, and communicating with other agents only through the communication channel.

5.2 Differentiable Inter-Agent Learning

While RIAL can share parameters among agents, it still does not take full advantage of centralised learning. In particular, the agents do not give each other feedback about their communication actions. Contrast this with human communication, which is rich with tight feedback loops. For example, during face-to-face interaction, listeners send fast nonverbal cues to the speaker indicating the level of understanding and interest. RIAL lacks this feedback mechanism, which is intuitively important for learning communication protocols.

To address this limitation, we propose differentiable inter-agent learning (DIAL). The main insight behind DIAL is that the combination of centralised learning and Q-networks makes it possible not only to share parameters but also to push gradients from one agent to another through the communication channel. Thus, while RIAL is end-to-end trainable within each agent, DIAL is end-to-end trainable across agents. Letting gradients flow from one agent to another gives them richer feedback, reducing the required amount of learning by trial and error, and easing the discovery of effective protocols.

DIAL works as follows: during centralised learning, communication actions are replaced with direct connections between the output of one agent's network and the input of another's. Thus, while the task restricts communication to discrete messages, during learning the agents are free to send real-valued messages to each other. Since these messages function as any other network activation, gradients can be passed back along the channel, allowing end-to-end backpropagation across agents.
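The core mechanism can be seen in a toy sketch (PyTorch; layer sizes and names are chosen here for illustration and do not reflect the paper's architecture): agent 1's real-valued message is fed directly into agent 2's network during centralised learning, so a loss computed at agent 2 produces gradients in agent 1's parameters. The sigmoid and Gaussian noise on the channel anticipate the DRU defined next.

import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyCNet(nn.Module):
    """Toy network mapping (observation, incoming message) -> (Q-values, outgoing message)."""
    def __init__(self, obs_dim=4, msg_dim=1, n_actions=2):
        super().__init__()
        self.body = nn.Linear(obs_dim + msg_dim, 16)
        self.q_head = nn.Linear(16, n_actions)
        self.msg_head = nn.Linear(16, msg_dim)

    def forward(self, obs, msg_in):
        h = torch.relu(self.body(torch.cat([obs, msg_in], dim=-1)))
        return self.q_head(h), self.msg_head(h)

agent1, agent2 = TinyCNet(), TinyCNet()
obs1, obs2 = torch.randn(1, 4), torch.randn(1, 4)
no_msg = torch.zeros(1, 1)

# Agent 1 emits a real-valued message.
_, msg_1to2 = agent1(obs1, no_msg)
# During centralised learning the (noisy, squashed) message is fed straight
# into agent 2 instead of being discretised into a communication action.
noisy = torch.sigmoid(msg_1to2 + 2.0 * torch.randn_like(msg_1to2))
q2, _ = agent2(obs2, noisy)

# Any loss on agent 2's Q-values back-propagates through the channel
# into agent 1's parameters.
loss = q2.sum()
loss.backward()
print(agent1.msg_head.weight.grad is not None)   # True: gradients reached the sender

In a full training step the loss would be the DQN error rather than a plain sum, but the gradient path from receiver to sender is the same.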
Concretely, the network, which we call a C-Net, outputs two distinct types of values, as shown in Figure 1(b): a) Q(·), the Q-values for the environment actions, which are fed to the action selector, and b) m^a_t, the real-valued vector message to other agents, which bypasses the action selector and is instead processed by the discretise/regularise unit, DRU(m^a_t). The DRU regularises the message during centralised learning, DRU(m^a_t) = Logistic(N(m^a_t, σ)), where σ is the standard deviation of the noise added to the channel, and discretises it during decentralised execution, DRU(m^a_t) = 1{m^a_t > 0}.

Figure 1 shows how gradients flow differently in RIAL and DIAL. The gradient chains for Q_u in RIAL and Q in DIAL are based on the DQN loss. However, in DIAL the gradient term for m is the backpropagated error from the recipient of the message to the sender. Using this inter-agent gradient for training provides a richer training signal than the DQN loss for Q_m in RIAL. While the DQN error is nonzero only for the selected message, the incoming gradient is a |m|-dimensional vector that can contain more information. It also allows the network to directly adjust messages in order to minimise the downstream DQN loss, reducing the need for trial-and-error learning of good protocols. While we limit our analysis to discrete messages, DIAL naturally handles continuous message spaces, as they are used anyway during centralised learning. At the same time, DIAL can also scale to large discrete message spaces, since it learns binary encodings instead of the one-hot encoding used in RIAL, so that |m| = O(log(|M|)). Further algorithmic details and pseudocode are in the supplementary material.

6 Experiments

In this section, we evaluate RIAL and DIAL, with and without parameter sharing, in two multi-agent problems and compare them with a no-communication shared-parameter baseline (NoComm). Results presented are the average performance across several runs, where those without parameter sharing (-NS) are represented by dashed lines. Across plots, rewards are normalised by the highest average reward achievable given access to the true state (Oracle). In our experiments, we use an ε-greedy policy with ε = 0.05, the discount factor is γ = 1, and the target network is reset every 100 episodes. To stabilise learning, we execute parallel episodes in batches of 32. The parameters are optimised using RMSProp [19] with a learning rate of 5 × 10^-4. The architecture uses rectified linear units (ReLU) and gated recurrent units (GRU) [20], which have performance similar to long short-term memory (LSTM) [21] units [22]. Unless stated otherwise, we set the standard deviation of noise added to the channel to σ = 2, which was found to be essential for good performance.¹

¹ Source code is available at: https://github.com/iassael/learning-to-communicate
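For reference, the DRU defined in Section 5.2 amounts to only a few lines. A minimal sketch follows (NumPy; only the forward computation is shown, the function name is ours, Logistic is taken to be the standard sigmoid, and N(m, σ) is interpreted as a Gaussian sample centred at the message):

import numpy as np

def dru(message, sigma=2.0, training=True, rng=None):
    """Discretise/regularise unit.
    Centralised learning: logistic(message + Gaussian channel noise).
    Decentralised execution: hard threshold 1{message > 0}."""
    if training:
        rng = rng or np.random.default_rng()
        noisy = message + rng.normal(0.0, sigma, size=np.shape(message))
        return 1.0 / (1.0 + np.exp(-noisy))         # soft, real-valued message
    return (np.asarray(message) > 0).astype(float)   # discrete 1-bit message

rng = np.random.default_rng(0)
m = np.array([-3.0, 0.1, 4.0])
print(dru(m, sigma=2.0, training=True, rng=rng))     # values depend on the sampled noise
print(dru(m, training=False))                        # [0., 1., 1.]

In DIAL this operation sits on the differentiable path between agents during centralised learning, so in the actual model it is implemented with an autodiff framework rather than NumPy.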
6.1 Model Architecture

RIAL and DIAL share the same individual model architecture. For brevity, we describe only the DIAL model here. As illustrated in Figure 2, each agent consists of a recurrent neural network (RNN), unrolled for T time-steps, that maintains an internal state h, an input network for producing a task embedding z, and an output network for the Q-values and the messages m. The input for agent a is defined as a tuple of (o^a_t, m^{a'}_{t−1}, u^a_{t−1}, a). The inputs a and u^a_{t−1} are passed through lookup tables, and m^{a'}_{t−1} through a 1-layer MLP, both producing embeddings of size 128. o^a_t is processed through a task-specific network that produces an additional embedding of the same size. The state embedding is produced by element-wise summation of these embeddings, z^a_t = (TaskMLP(o^a_t) + MLP[|M|, 128](m_{t−1}) + Lookup(u^a_{t−1}) + Lookup(a)). We found that performance and stability improved when a batch normalisation layer [23] was used to preprocess m_{t−1}. z^a_t is processed through a 2-layer RNN with GRUs, h^a_{1,t} = GRU[128, 128](z^a_t, h^a_{1,t−1}), which is used to approximate the agent's action-observation history. Finally, the output h^a_{2,t} of the top GRU layer is passed through a 2-layer MLP, Q^a_t, m^a_t = MLP[128, 128, (|U| + |M|)](h^a_{2,t}).

Figure 2: DIAL architecture.

6.2 Switch Riddle

Figure 3: Switch: Every day one prisoner gets sent to the interrogation room where he sees the switch and chooses from "On", "Off", "Tell" and "None".

The first task is inspired by a well-known riddle described as follows: "One hundred prisoners have been newly ushered into prison. The warden tells them that starting tomorrow, each of them will be placed in an isolated cell, unable to communicate amongst each other. Each day, the warden will choose one of the prisoners uniformly at random with replacement, and place him in a central interrogation room containing only a light bulb with a toggle switch. The prisoner will be able to observe the current state of the light bulb. If he wishes, he can toggle the light bulb. He also has the option of announcing that he believes all prisoners have visited the interrogation room at some point in time. If this announcement is true, then all prisoners are set free, but if it is false, all prisoners are executed [...]" [24].

Architecture. In our formalisation, at time-step t, agent a observes o^a_t ∈ {0, 1}, which indicates if the agent is in the interrogation room. Since the switch has two positions, it can be modelled as a 1-bit message, m^a_t. If agent a is in the interrogation room, then its actions are u^a_t ∈ {"None", "Tell"}; otherwise the only action is "None". The episode ends when an agent chooses "Tell" or when the maximum time-step, T, is reached. The reward r_t is 0 unless an agent chooses "Tell", in which case it is 1 if all agents have been to the interrogation room and −1 otherwise. Following the riddle definition, in this experiment m^a_{t−1} is available only to the agent a in the interrogation room. Finally, we set the time horizon T = 4n − 6 in order to keep the experiments computationally tractable.

Complexity. The switch riddle poses significant protocol learning challenges. At any time-step t, there are |o|^t possible observation histories for a given agent, with |o| = 3: the agent either is not in the interrogation room or receives one of two messages when it is. For each of these histories, an agent can choose between 4 = |U||M| different options, so at time-step t, the single-agent policy space is (|U||M|)^{|o|^t} = 4^{3^t}. The product of all policies for all time-steps defines the total policy space for an agent: ∏_t 4^{3^t} = 4^{(3^{T+1}−3)/2}, where T is the final time-step. The size of the multi-agent policy space grows exponentially in n, the number of agents: 4^{n(3^{T+1}−3)/2}. We consider a setting where T is proportional to the number of agents, so the total policy space is 4^{n3^{O(n)}}. For n = 4, the size is 4^{354288}. Our approach using DIAL is to model the switch as a continuous message, which is binarised during decentralised execution.
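A compact sketch of the switch riddle as a multi-agent environment follows (plain Python; the reward, termination, and T = 4n − 6 follow the formalisation above, while the observation handling is simplified and all names are illustrative). The last two lines also check the policy-space count quoted above for n = 4.

import random

class SwitchRiddle:
    """n prisoners, one interrogation room with a 1-bit switch."""
    def __init__(self, n_agents, seed=None):
        self.n = n_agents
        self.T = 4 * n_agents - 6          # time horizon used in the paper
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.switch = 0                    # the 1-bit message m_t
        self.visited = set()
        self.in_room = self.rng.randrange(self.n)
        self.visited.add(self.in_room)
        return self.in_room

    def step(self, action, new_switch):
        """`action` in {"None", "Tell"} is taken by the agent currently in the
        room; `new_switch` in {0, 1} is the switch position it leaves behind.
        Agents outside the room can only act "None" and observe nothing."""
        if action == "Tell":
            reward = 1.0 if len(self.visited) == self.n else -1.0
            return reward, True
        self.switch = new_switch
        self.t += 1
        if self.t >= self.T:
            return 0.0, True               # horizon reached without an announcement
        self.in_room = self.rng.randrange(self.n)   # uniform with replacement
        self.visited.add(self.in_room)
        return 0.0, False

# Sanity check of the exponent quoted above for n = 4 (T = 4n - 6 = 10):
n, T = 4, 10
print(n * (3 ** (T + 1) - 3) // 2)         # 354288, i.e. a policy space of size 4**354288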
(a) Evaluation of n = 3    (b) Evaluation of n = 4    (c) Protocol of n = 3

Figure 4: Switch: (a-b) Performance of DIAL and RIAL, with and without (-NS) parameter sharing, and the NoComm baseline, for n = 3 and n = 4 agents. (c) The decision tree extracted for n = 3 to interpret the communication protocol discovered by DIAL.

Experimental results. Figure 4(a) shows our results for n = 3 agents. All four methods learn an optimal policy in 5k episodes, substantially outperforming the NoComm baseline. DIAL with parameter sharing reaches optimal performance substantially faster than RIAL. Furthermore, parameter sharing speeds up both methods. Figure 4(b) shows results for n = 4 agents. DIAL with parameter sharing again outperforms all other methods. In this setting, RIAL without parameter sharing was unable to beat the NoComm baseline. These results illustrate how difficult it is for agents to learn the same protocol independently. Hence, parameter sharing can be crucial for learning to communicate. DIAL-NS performs similarly to RIAL, indicating that the gradient provides a richer and more robust source of information. We also analysed the communication protocol discovered by DIAL for n = 3 by sampling 1K episodes, for which Figure 4(c) shows a decision tree corresponding to an optimal strategy. When a prisoner visits the interrogation room after day two, there are only two options: either one or two prisoners may have visited the room before. If three prisoners had already been, the third prisoner would have finished the game. These two remaining options can be encoded via the "On" and "Off" positions, respectively.

6.3 MNIST Games

In this section, we consider two tasks based on the well-known MNIST digit classification dataset [25].

Colour-Digit MNIST is a two-player game in which each agent observes the pixel values of a random MNIST digit in red or green, while the colour label and digit value are hidden. The reward consists of two components that are antisymmetric in the action, colour, and parity of the digits. As only one bit of information can be sent, agents must agree to encode/decode either colour or parity, with parity yielding greater rewards. The game has two steps; in the first step, both agents send a 1-bit message, in the second step they select a binary action.

Multi-Step MNIST is a grayscale variant that requires agents to develop a communication protocol that integrates information across 5 time-steps in order to guess each other's digits. At each step, the agents exchange a 1-bit message and at the final step, t = 5, they are awarded r = 0.5 for each correctly guessed digit.
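As a rough sketch of the Multi-Step MNIST game logic (plain Python; digits appear as integers here instead of the raw 28×28 images the agents actually receive, the agents are arbitrary callables, and everything beyond the stated message budget and reward of 0.5 per correct guess at t = 5 is a simplification of details given in the supplementary material):

import random

def play_multi_step_mnist(agents, n_steps=5, rng=None):
    """Two agents each hold a hidden digit and exchange one bit per step.
    At the final step each guesses the other's digit; r = 0.5 per correct guess."""
    rng = rng or random.Random()
    digits = [rng.randrange(10), rng.randrange(10)]
    messages = [0, 0]                       # bits sent at the previous step
    guesses = [None, None]
    for t in range(1, n_steps + 1):
        new_messages = []
        for a in (0, 1):
            incoming_bit = messages[1 - a]
            # An agent sees only its own digit and the incoming bit.
            bit, guess = agents[a](digits[a], incoming_bit, t)
            new_messages.append(bit)
            if t == n_steps:
                guesses[a] = guess          # guess of the *other* agent's digit
        messages = new_messages
    reward = 0.5 * sum(guesses[a] == digits[1 - a] for a in (0, 1))
    return reward

# A trivial baseline agent: sends random bits and guesses at random.
random_agent = lambda digit, bit_in, t: (random.randrange(2), random.randrange(10))
print(play_multi_step_mnist([random_agent, random_agent]))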
Further details on both tasks are provided in the supplementary material.

Architecture. The input processing network is a 2-layer MLP, TaskMLP[(|c| × 28 × 28), 128, 128](o^a_t). Figure 5 depicts the generalised setting for both games. Our experimental evaluation showed improved training time using batch normalisation after the first layer.

Figure 5: MNIST games architectures.

(a) Evaluation of Multi-Step    (b) Evaluation of Colour-Digit    (c) Protocol of Multi-Step

Figure 6: MNIST Games: (a,b) Performance of DIAL and RIAL, with and without (-NS) parameter sharing, and NoComm, for both MNIST games. (c) Extracted coding scheme for multi-step MNIST.

Experimental results. Figures 6(a) and 6(b) show that DIAL substantially outperforms the other methods on both games. Furthermore, parameter sharing is crucial for reaching the optimal protocol. In multi-step MNIST, results were obtained with σ = 0.5. In this task, RIAL fails to learn, while in colour-digit MNIST it fluctuates around local minima in the protocol space; the NoComm baseline is stagnant at zero. DIAL's performance can be attributed to directly optimising the messages in order to reduce the global DQN error while RIAL must rely on trial and error. DIAL can also optimise the message content with respect to rewards taking place many time-steps later, due to the gradient passing between agents, leading to optimal performance in multi-step MNIST. To analyse the protocol that DIAL learned, we sampled 1K episodes. Figure 6(c) illustrates the communication bit sent at time-step t by agent 1, as a function of its input digit. Thus, each agent has learned a binary encoding and decoding of the digits. These results illustrate that differentiable communication in DIAL is essential to fully exploiting the power of centralised learning and is thus an important tool for studying the learning of communication protocols.

6.4 Effect of Channel Noise

The question of why language evolved to be discrete has been studied for centuries; see, e.g., the overview in [26]. Since DIAL learns to communicate in a continuous channel, our results offer an illuminating perspective on this topic. In particular, Figure 7 shows that, in the switch riddle, DIAL without noise in the communication channel learns centred activations. By contrast, the presence of noise forces messages into two different modes during learning. Similar observations have been made in relation to adding noise when training document models [12] and performing classification [11]. In our work, we found that adding noise was essential for successful training. More analysis on this appears in the supplementary material.

Figure 7: DIAL's learned activations with and without noise in the DRU.
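The pressure that noise exerts can be illustrated with a short Monte Carlo estimate (NumPy; an illustration, not an experiment from the paper): a message whose magnitude is small relative to σ is frequently flipped by the channel, whereas a saturated message survives, which is what pushes the learned activations into two discrete modes.

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
samples = 100_000

for m in (0.2, 2.0, 6.0):                     # candidate message magnitudes
    noisy = m + sigma * rng.standard_normal(samples)
    # The receiver reads the intended "1" whenever logistic(noisy) > 0.5, i.e. noisy > 0.
    p_correct = np.mean(1.0 / (1.0 + np.exp(-noisy)) > 0.5)
    print(f"|m| = {m}: decoded as intended {p_correct:.1%} of the time")

# With sigma = 2, a message of magnitude 0.2 gets through only about 54% of the
# time, while a magnitude of 6 is decoded correctly almost always.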
7 Conclusions

This paper advanced novel environments and successful techniques for learning communication protocols. It presented a detailed comparative analysis covering important factors involved in the learning of communication protocols with deep networks, including differentiable communication, neural network architecture design, channel noise, tied parameters, and other methodological aspects.

This paper should be seen as a first attempt at learning communication and language with deep learning approaches. The gargantuan task of understanding communication and language in their full splendour, covering compositionality, concept lifting, conversational agents, and many other important problems, still lies ahead. We are, however, optimistic that the approaches proposed in this paper can help tackle these challenges.

References

[1] F. A. Oliehoek, M. T. J. Spaan, and N. Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. JAIR, 32:289-353, 2008.
[2] L. Kraemer and B. Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82-94, 2016.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[4] M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In ICML, 1993.
[5] F. S. Melo, M. Spaan, and S. J. Witwicki. QueryPOMDP: POMDP-based communication in multiagent systems. In Multi-Agent Systems, pages 189-204. 2011.
[6] L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387-434, 2005.
[7] T. Kasai, H. Tenmoto, and A. Kamiya. Learning of communication codes in multi-agent reinforcement learning problem. In IEEE Soft Computing in Industrial Applications, pages 1-6, 2008.
[8] C. L. Giles and K. C. Jim. Learning communication for multi-agent systems. In Innovative Concepts for Agent-Based Systems, pages 377-390. Springer, 2002.
[9] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[10] S. Sukhbaatar, A. Szlam, and R. Fergus. Learning multiagent communication with backpropagation. arXiv preprint arXiv:1605.07736, 2016.
[11] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[12] G. Hinton and R. Salakhutdinov. Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science, 3(1):74-91, 2011.
[13] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[14] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente. Multiagent cooperation and competition with deep reinforcement learning. arXiv preprint arXiv:1511.08779, 2015.
[15] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, New York, 2009.
[16] E. Zawadzki, A. Lipson, and K. Leyton-Brown. Empirically evaluating multiagent learning algorithms. arXiv preprint arXiv:1401.8074, 2014.
[17] M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
[18] K. Narasimhan, T. Kulkarni, and R. Barzilay. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.
[19] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
[20] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[21] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[22] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[23] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448-456, 2015.
[24] W. Wu. 100 prisoners and a lightbulb. Technical report, OCF, UC Berkeley, 2002.
[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[26] M. Studdert-Kennedy. How did language go discrete? In M. Tallerman, editor, Language Origins: Perspectives on Evolution, chapter 3. Oxford University Press, 2005.
", "award": [], "sourceid": 1124, "authors": [{"given_name": "Jakob", "family_name": "Foerster", "institution": "University of Oxford"}, {"given_name": "Ioannis Alexandros", "family_name": "Assael", "institution": "University of Oxford"}, {"given_name": "Nando", "family_name": "de Freitas", "institution": "University of Oxford"}, {"given_name": "Shimon", "family_name": "Whiteson", "institution": "University of Oxford"}]}