{"title": "Learning Attentional Communication for Multi-Agent Cooperation", "book": "Advances in Neural Information Processing Systems", "page_first": 7254, "page_last": 7264, "abstract": "Communication could potentially be an effective way for multi-agent cooperation. However, information sharing among all agents or in predefined communication architectures that existing methods adopt can be problematic. When there is a large number of agents, agents cannot differentiate valuable information that helps cooperative decision making from globally shared information. Therefore, communication barely helps, and could even impair the learning of multi-agent cooperation. Predefined communication architectures, on the other hand, restrict communication among agents and thus restrain potential cooperation. To tackle these difficulties, in this paper, we propose an attentional communication model that learns when communication is needed and how to integrate shared information for cooperative decision making. Our model leads to efficient and effective communication for large-scale multi-agent cooperation. Empirically, we show the strength of our model in a variety of cooperative scenarios, where agents are able to develop more coordinated and sophisticated strategies than existing methods.", "full_text": "Learning Attentional Communication for\n\nMulti-Agent Cooperation\n\nJiechuan Jiang\nPeking University\n\njiechuan.jiang@pku.edu.cn\n\nZongqing Lu\u21e4\nPeking University\n\nzongqing.lu@pku.edu.cn\n\nAbstract\n\nCommunication could potentially be an effective way for multi-agent cooperation.\nHowever, information sharing among all agents or in prede\ufb01ned communication\narchitectures that existing methods adopt can be problematic. When there is a\nlarge number of agents, agents cannot differentiate valuable information that helps\ncooperative decision making from globally shared information. 
Therefore, commu-\nnication barely helps, and could even impair the learning of multi-agent cooperation.\nPrede\ufb01ned communication architectures, on the other hand, restrict communication\namong agents and thus restrain potential cooperation. To tackle these dif\ufb01culties,\nin this paper, we propose an attentional communication model that learns when\ncommunication is needed and how to integrate shared information for cooperative\ndecision making. Our model leads to ef\ufb01cient and effective communication for\nlarge-scale multi-agent cooperation. Empirically, we show the strength of our\nmodel in a variety of cooperative scenarios, where agents are able to develop more\ncoordinated and sophisticated strategies than existing methods.\n\n1\n\nIntroduction\n\nBiologically, communication is closely related to and probably originated from cooperation. For\nexample, vervet monkeys can make different vocalizations to warn other members of the group\nabout different predators [3]. Similarly, communication can be crucially important in multi-agent\nreinforcement learning (MARL) for cooperation, especially for the scenarios where a large number\nof agents work in a collaborative way, such as autonomous vehicles planning [1], smart grid control\n[20], and multi-robot control [15].\nDeep reinforcement learning (RL) has achieved remarkable success in a series of challenging\nproblems, such as game playing [17, 22, 9] and robotics [13, 12, 6]. MARL can be simply seen\nas independent RL, where each learner treats the other agents as part of its environment. However,\nthe strategies of other agents are uncertain and changing as training progresses, so the environment\nbecomes unstable from the perspective of any individual agent and thus it is hard for agents to\ncollaborate. 
Moreover, policies learned using independent RL can easily overfit to the other agents' policies [10].
We argue that one of the keys to solving this problem is communication, which can enhance strategy coordination. There are several approaches for learning communication in MARL, including DIAL [4], CommNet [23], BiCNet [19], and Master-Slave [8]. However, information sharing among all agents, or in the predefined communication architectures these methods adopt, can be problematic. When there is a large number of agents, agents cannot differentiate valuable information that helps cooperative decision making from globally shared information, and hence communication barely helps and could even jeopardize the learning of cooperation. Moreover, in real-world applications, it is costly for all agents to communicate with each other, since receiving a large amount of information requires high bandwidth and incurs long delay and high computational complexity. Predefined communication architectures, e.g., Master-Slave [8], might help, but they restrict communication among specific agents and thus restrain potential cooperation.
To tackle these difficulties, we propose an attentional communication model, called ATOC, to enable agents to learn effective and efficient communication in partially observable distributed environments for large-scale MARL. Inspired by recurrent models of visual attention, we design an attention unit that receives the encoded local observation and action intention of an agent and determines whether the agent should communicate with other agents to cooperate in its observable field. If so, the agent, called initiator, selects collaborators to form a communication group for coordinated strategies. The communication group dynamically changes and is retained only when necessary. 

*Corresponding author

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Further, communication in terms of sequences of discrete symbols is investigated in [7] and [18].
CommNet [23] is a large feed-forward neural network that maps the inputs of all agents to their actions, where each agent occupies a subset of units and additionally has access to a broadcasting communication channel to share information. At a single communication step, each agent sends its hidden state as the communication message to the channel. The averaged message from the other agents is the input of the next layer. However, it is a single large network for all agents, so it cannot easily scale and would perform poorly in environments with a large number of agents. It is worth mentioning that CommNet has been extended to abstractive summarization [2] in natural language processing.
BiCNet [19] is based on the actor-critic model for continuous action, using recurrent networks to connect each individual agent's policy and value networks. BiCNet is able to handle real-time strategy games such as StarCraft micromanagement tasks. Master-Slave [8] is also a communication architecture for real-time strategy games, where the action of each slave agent is composed of contributions from both the slave agent and the master agent. However, both works assume that agents know the global state of the environment, which is not realistic in practice. Moreover, predefined communication architectures restrict communication and hence restrain potential cooperation among agents. Therefore, they cannot adapt to changes of scenario.
MADDPG [14] is an extension of the actor-critic model for mixed cooperative-competitive environments. COMA [5] is proposed to solve multi-agent credit assignment in cooperative settings. 
MADDPG and COMA both use a centralized critic that takes as input the observations and actions of all agents. However, MADDPG and COMA have to train an independent policy network for each agent, where each agent would learn a policy specializing in specific tasks [11], and the policy network easily overfits to the number of agents. Therefore, MADDPG and COMA are infeasible in large-scale MARL.
Mean Field [24] takes as input the observation and the mean action of neighboring agents to make the decision. However, the mean action eliminates the difference among neighboring agents in terms of action and observation and thus incurs the loss of important information that could help cooperative decision making.

3 Background

Deep Q-Networks (DQN). Combining reinforcement learning with a class of deep neural networks, DQN [17] has performed at a level that is comparable to a professional game player. At each timestep t, the agent observes the state s_t ∈ S, chooses an action a_t ∈ A according to the policy π, gets a reward r_t, and transitions to the next state s_{t+1}. The objective is to maximize the total expected discounted reward R_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}, where γ ∈ [0, 1] is a discount factor. DQN learns the action-value function Q^π(s, a) = E[R_t | s_t = s, a_t = a], which can be recursively rewritten as Q^π(s, a) = E_{s'}[r(s, a) + γ E_{a'∼π}[Q^π(s', a')]], by minimizing the loss L(θ) = E_{s,a,r,s'}[(y − Q(s, a; θ))^2], where y = r + γ max_{a'} Q(s', a'; θ). The agent selects the action that maximizes the Q value with a probability of 1 − ε or acts randomly with a probability of ε.

Deterministic Policy Gradient (DPG). Different from value-based algorithms like DQN, the main idea of policy gradient methods is to directly adjust the parameters θ of the policy to maximize the objective J(θ) = E_{s∼p^π, a∼π_θ}[R] along the direction of the policy gradient ∇_θ J(θ), which can be written as ∇_θ J(θ) = E_{s∼p^π, a∼π_θ}[∇_θ log π_θ(a|s) Q^π(s, a)]. This can be further extended to deterministic policies [21] μ_θ: S ↦ A, with ∇_θ J(θ) = E_{s∼D}[∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)}]. To ensure that ∇_a Q^μ(s, a) exists, the action space must be continuous.

Deep Deterministic Policy Gradient (DDPG). DDPG [13] is an actor-critic algorithm based on DPG. It uses deep neural networks parameterized by θ^μ and θ^Q to approximate the deterministic policy a = μ(s|θ^μ) and the action-value function Q(s, a|θ^Q), respectively. The policy network infers actions according to states, corresponding to the actor; the Q-network approximates the value function of state-action pairs and provides the gradient, corresponding to the critic.

Recurrent Attention Model (RAM). In the process of perceiving an image, instead of processing the whole perception field, humans focus attention on some important parts to obtain information when and where it is needed and then move from one part to another. RAM [16] uses an RNN to model this attention mechanism. At each timestep, an agent obtains and processes a partial observation via a bandwidth-limited sensor. The glimpse feature extracted from past observations is stored in an internal state, which is encoded into the hidden layer of the RNN. 
By decoding the internal state, the agent decides the location of the sensor and the action interacting with the environment.

4 Methods

ATOC is instantiated as an extension of the actor-critic model, but it can also be realized using value-based methods. ATOC consists of a policy network, a Q-network, an attention unit, and a communication channel, as illustrated in Figure 1.
We consider the partially observable distributed environment for MARL, where each agent i receives a local observation o^i_t correlated with the state s_t at time t. The policy network takes the local observation as input and extracts a hidden layer as thought, which encodes both local observation and action intention, represented as h^i_t = μ_I(o^i_t; θ^μ). Every T timesteps, the attention unit takes h^i_t as input and determines whether communication is needed for cooperation in its observable field. If needed, the agent, called initiator, selects other agents, called collaborators, in its field to form a communication group, and the group stays the same for T timesteps. Communication is fully determined (when and how long to communicate) by the attention unit when T is equal to 1. T can also be tuned for the consistency of cooperation. The communication channel connects each agent of the communication group, takes as input the thought of each agent, and outputs the integrated thought that guides the agents to generate coordinated actions. The integrated thought h̃^i_t is merged with h^i_t and fed into the rest of the policy network. Then, the policy network outputs the action a^i_t = μ_II(h^i_t, h̃^i_t; θ^μ). By sharing encodings of local observation and action intention within a dynamically formed group, individual agents can build up a relatively more global perception of the environment, infer the intent of other agents, and cooperate on decision making.

4.1 Attention model

When a coach directs a team, instead of managing the whole scene, she focuses attention selectively on a key position and gives directional but not specific instructions to the players near that location. Inspired by this, we introduce the attention mechanism to learning multi-agent communication. Different from the coach, our attention unit never senses the environment in full; it only uses the encoding of the observable field and action intention of an agent and decides whether communication is helpful in terms of cooperation. The attention unit can be instantiated by an RNN or MLP. The first part of the actor network that produces the thought corresponds to the glimpse network, and the thought h^i_t can be considered as the glimpse feature vector. The attention unit takes the thought representation as input and produces the probability that the observable field of the agent becomes an attention focus (i.e., the probability of communication).
Unlike existing work on learning communication in MARL, e.g., CommNet and BiCNet, where all agents communicate with each other all the time, our attention unit enables dynamic communication among agents only when necessary. This is much more practical, because in real-world applications communication is restricted by bandwidth and/or range and incurs additional cost, and thus it may not be possible, or may cost too much, to maintain full connectivity among all the agents. On the other hand, dynamic communication keeps an agent from receiving useless information compared to full connectivity. As will be discussed in the next section, useless information may negatively impact cooperative decision making among agents. 
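As a concrete sketch, the attention unit can be instantiated as a small two-layer MLP over the thought vector. The layer sizes, initialization, and plain-NumPy formulation below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AttentionUnit:
    """Two-layer MLP mapping an agent's thought h_t to the probability of
    initiating communication (illustrative sizes and random-normal init)."""

    def __init__(self, thought_dim=128, hidden_dim=32):
        self.w1 = rng.normal(0.0, 0.1, (thought_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, (hidden_dim, 1))
        self.b2 = np.zeros(1)

    def comm_prob(self, thought):
        hidden = np.maximum(thought @ self.w1 + self.b1, 0.0)  # ReLU
        return float(sigmoid(hidden @ self.w2 + self.b2)[0])

unit = AttentionUnit()
p = unit.comm_prob(rng.normal(size=128))  # a value in (0, 1)
```

An agent whose probability exceeds a chosen threshold would act as an initiator; in the full model this unit is trained with the classifier loss of Section 4.3 rather than used with random weights.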
Overall, the attention unit leads to more effective and efficient communication.

Figure 1: ATOC architecture.

4.2 Communication

When an initiator selects its collaborators, it only considers the agents in its observable field and ignores those who cannot be perceived. This setting complies with the following facts: (i) one of the purposes of communication is to share partial observations, and adjacent agents can understand each other easily; (ii) cooperative decision making can be more easily accomplished among adjacent agents; (iii) all agents share one policy network, which means adjacent agents may have similar behaviors, whereas communication can increase the diversity of their strategies. There are three types of agents in the observable field of the initiator: other initiators; agents who have been selected by other initiators; and agents who have not been selected. We assume a fixed communication bandwidth, which means each initiator can select at most m collaborators. The initiator first chooses collaborators from agents who have not been selected, then from agents selected by other initiators, and finally from other initiators, all based on proximity.
When an agent is selected by multiple initiators, it participates in the communication of each group. Assume agent k is selected by two initiators p and q sequentially. Agent k first participates in the communication of p's group, where the communication channel integrates their thoughts: {h̃^p_t, ..., h̃^{k'}_t} = g(h^p_t, ..., h^k_t). Then agent k communicates with q's group: {h̃^q_t, ..., h̃^{k''}_t} = g(h^q_t, ..., h̃^{k'}_t). The agent shared by multiple groups bridges the information gap and strategy division among individual groups. It can disseminate the thought within a group to other groups, which can eventually lead to coordinated strategies among the groups. This is especially critical for the case where all agents collaborate on a single task. In addition, to deal with the issue of role assignment and heterogeneous agent types, we can fix the position of agents who participate in communication.
The bi-directional LSTM unit acts as the communication channel. It plays the role of integrating the internal states of agents within a group and guiding the agents towards coordinated decision making. Unlike CommNet and BiCNet, which integrate the shared information of agents by arithmetic mean and weighted mean, respectively, our LSTM unit can selectively output information that promotes cooperation and forget information that impedes cooperation through its gates.

Figure 2: Illustration of experimental scenarios: cooperative navigation (left), cooperative pushball (mid), predator-prey (right).

4.3 Training

The training of ATOC is an extension of DDPG. More concretely, consider a game with N agents, where the critic, actor, communication channel, and attention unit of ATOC are parameterized by θ^Q, θ^μ, θ^g, and θ^p, respectively. Note that we drop time t in the following notations for simplicity. The experience replay buffer R contains the tuples (O, A, R, O', C) recording the experiences of all agents, where O = (o^1, ..., o^N), A = (a^1, ..., a^N), R = (r^1, ..., r^N), O' = (o'^1, ..., o'^N), and C is an N × N matrix that records the communication groups. 
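The channel g of Section 4.2, which integrates the thoughts of one communication group, can be sketched as a bidirectional LSTM pass over the group members. The dimensions, initialization, and single-example NumPy cell below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def lstm_cell(x, h, c, W):
    """One LSTM step; W maps the concatenated [x, h] to the four gate pre-activations."""
    z = np.concatenate([x, h]) @ W          # shape (4 * hidden,)
    i, f, o, g = np.split(z, 4)
    i, f, o = (1 / (1 + np.exp(-i)), 1 / (1 + np.exp(-f)), 1 / (1 + np.exp(-o)))
    c_new = f * c + i * np.tanh(g)
    return o * np.tanh(c_new), c_new

def bidirectional_channel(thoughts, W_fwd, W_bwd):
    """Run an LSTM over the group members in both directions and concatenate
    the two hidden states, yielding one integrated thought per member."""
    hdim = W_fwd.shape[1] // 4
    out_f, out_b = [], []
    h = c = np.zeros(hdim)
    for x in thoughts:                      # forward pass over the group
        h, c = lstm_cell(x, h, c, W_fwd)
        out_f.append(h)
    h = c = np.zeros(hdim)
    for x in reversed(thoughts):            # backward pass
        h, c = lstm_cell(x, h, c, W_bwd)
        out_b.append(h)
    out_b.reverse()
    return [np.concatenate([f, b]) for f, b in zip(out_f, out_b)]

dim, hdim = 8, 8
W_f = rng.normal(0.0, 0.1, (dim + hdim, 4 * hdim))
W_b = rng.normal(0.0, 0.1, (dim + hdim, 4 * hdim))
group = [rng.normal(size=dim) for _ in range(3)]   # thoughts of a 3-agent group
integrated = bidirectional_channel(group, W_f, W_b)
```

In the full model the channel is trained jointly with the policy network via the gradient ∇_{θ^g} J(θ^g) below, rather than applied with random weights.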
We select experiences where the action is determined by an agent independently (i.e., without communication) and experiences with communication, respectively, to update the action-value function Q^μ as:

L(θ^Q) = E_{o,a,r,o'}[(Q^μ(o, a) − y)^2], where y = r + γ Q^{μ'}(o', a')|_{a'=μ'(o')}.

The policy gradient can be written as:

∇_{θ^μ} J(θ^μ) = E_{o,a∼R}[∇_{θ^μ} μ(a|o) ∇_a Q^μ(o, a)|_{a=μ(o)}].

By the chain rule, the gradient of the integrated thought can be further derived as:

∇_{θ^g} J(θ^g) = E_{o,a∼R}[∇_{θ^g} g(h̃|H) ∇_{h̃} μ(a|h̃) ∇_a Q^μ(o, a)|_{a=μ(o)}].

The gradients are backpropagated to the policy network and the communication channel to update their parameters. Then, we softly update the target networks as θ' = τθ + (1 − τ)θ'.
The attention unit is trained as a binary classifier for communication. For each initiator i and its group G_i, we calculate the difference of mean Q values between coordinated actions and independent actions (denoted as ā):

ΔQ_i = (1/|G_i|) (Σ_{j∈G_i} Q(o^j, a^j|θ^Q) − Σ_{j∈G_i} Q(o^j, ā^j|θ^Q))

and store (ΔQ_i, h^i) into a queue D, where ΔQ weights the performance enhancement produced by communication. When an episode ends, we perform min-max normalization on the ΔQ values in D and get ΔQ̂ ∈ [0, 1]. ΔQ̂ can be used as the tag of the binary classifier, and we use the log loss to update θ^p as:

L(θ^p) = −ΔQ̂_i log(p(h^i|θ^p)) − (1 − ΔQ̂_i) log(1 − p(h^i|θ^p)).

5 Experiments

Experiments are performed based on the multi-agent particle environment [14, 18], which is a two-dimensional world with continuous space and discrete time, consisting of agents and landmarks. We made a few modifications to the environment so as to accommodate a large number of agents. Each agent has only local observation, acts independently and cooperatively, and collects its own reward or a shared global reward. 
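Before turning to the scenarios, the scalar pieces of the Section 4.3 training procedure, the TD target, the soft target update, and the ΔQ-based tags for the attention classifier, can be sketched as follows; the function names and NumPy formulation are illustrative, not the paper's code:

```python
import numpy as np

def td_target(reward, next_q, gamma=0.96):
    """Critic target y = r + gamma * Q'(o', a') computed with the target networks."""
    return reward + gamma * next_q

def soft_update(target, online, tau=0.001):
    """Soft target-network update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]

def delta_q(q_coordinated, q_independent):
    """Delta Q_i: mean Q-value gain of coordinated over independent actions
    for one initiator's group G_i."""
    return float(np.mean(q_coordinated) - np.mean(q_independent))

def attention_tags(delta_qs):
    """Min-max normalize the Delta Q values collected in the queue D over an
    episode into [0, 1] tags for the attention unit's binary classifier."""
    q = np.asarray(delta_qs, dtype=float)
    span = q.max() - q.min()
    return np.zeros_like(q) if span == 0.0 else (q - q.min()) / span

def attention_log_loss(tag, p):
    """Log loss used to train the attention unit as a binary classifier."""
    return float(-tag * np.log(p) - (1.0 - tag) * np.log(1.0 - p))

tags = attention_tags([delta_q([1.0, 2.0], [0.5, 0.5]), 0.0, 1.5])
# the largest gain maps to tag 1.0, the smallest to 0.0
```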
We perform experiments in three scenarios, as illustrated in Figure 2, to investigate the cooperation of agents for local reward, shared global reward, and reward in competition, respectively. We compare ATOC with CommNet, BiCNet, and DDPG. CommNet and BiCNet are full-communication models, and DDPG is exactly ATOC without communication. MADDPG has to train an independent policy network for each agent, which makes it infeasible in large-scale MARL.

5.1 Hyperparameters

In all the experiments, we use the Adam optimizer with a learning rate of 0.001. The discount factor γ is 0.96. For the soft update of the target networks, we use τ = 0.001. The neural networks use ReLU and batch normalization for some hidden layers. The actor network has four hidden layers; the second layer is the thought (128 units), and the output layer uses the tanh activation function. The critic network has two hidden layers with 512 and 256 units, respectively. We use a two-layer MLP to implement the attention unit, but it can also be realized by an RNN. For communication, T is 15. We initialize all of the parameters with random normal initialization. The capacity of the replay buffer is 10^5 and we take a minibatch of 2560 each time. We note that a large minibatch can accelerate the convergence process, especially in the case of sparse reward. We accumulate experiences in the first thirty episodes before training. As in DDPG, we use an Ornstein-Uhlenbeck process with θ = 0.15 and σ = 0.2 for the exploration noise.

5.2 Cooperative Navigation

In this scenario, N agents cooperatively reach L landmarks, while avoiding collisions. Each agent is rewarded based on its proximity to the nearest landmark, while it is penalized when colliding with other agents. 
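The reward just described can be sketched as a small helper (the function name and signature are hypothetical; the signal is minus the distance to the nearest landmark, with an extra −1 on collision):

```python
import math

def navigation_reward(agent_pos, landmarks, collided):
    """Per-timestep reward in cooperative navigation: -d to the nearest
    landmark, with an additional -1 penalty when a collision occurs."""
    d = min(math.dist(agent_pos, lm) for lm in landmarks)
    return -d - (1.0 if collided else 0.0)

r = navigation_reward((0.0, 0.0), [(3.0, 4.0), (0.0, 2.0)], collided=False)
# nearest landmark is 2 away, so r == -2.0
```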
Ideally, each agent predicts the actions of nearby agents based on its own observation and the information received from other agents, and determines its own action towards occupying a landmark without colliding with other agents.
We trained ATOC and the baselines with the settings of N = 50 and L = 50, where each agent can observe the three nearest agents and four landmarks with relative positions and velocities. At each timestep, the reward of an agent is −d, where d denotes the distance between the agent and its nearest landmark, or −d − 1 if a collision occurs. Figure 3 shows the learning curves over 3000 episodes in terms of mean reward, averaged over all agents and timesteps. We can see that ATOC converges to a higher mean reward than the baselines. We evaluate ATOC and the baselines by running 30 test games and measure average mean reward, number of collisions, and percentage of occupied landmarks.
As shown in Table 1, ATOC largely outperforms all the baselines. In the experiment, CommNet, BiCNet, and DDPG all fail to learn the strategy that ATOC obtains: an agent first tries to occupy the nearest landmark, and if that landmark is more likely to be occupied by another agent, the agent turns to a vacant landmark rather than keeping probing and approaching the nearest one. The strategy of DDPG is more aggressive, i.e., multiple agents usually approach a landmark simultaneously, which can lead to collisions. Both CommNet and BiCNet agents are more conservative, i.e., they are more willing to avoid collisions than to seize a landmark, which eventually leads to a small number of occupied landmarks. Moreover, both CommNet and BiCNet agents are more likely to surround a landmark and observe the actions of other agents. Nevertheless, gathered agents are prone to collisions.
As ATOC without communication is exactly DDPG and ATOC outperforms DDPG, we can see communication indeed helps. 
However, CommNet and BiCNet also have communication, so why is their performance much worse? CommNet performs an arithmetic mean on the information of the hidden layers. This operation implicitly treats information from different agents equally. However, information from various agents has different value for an agent making decisions. For example, the information from a nearby agent who intends to seize the same landmark is much more useful than the information from an agent far away. In a scenario with a large number of agents, there is a lot of useless information, which can be seen as noise that interferes with the decisions of agents. BiCNet uses an RNN as the communication channel, which can be seen as a weighted mean. However, as the number of agents increases, the RNN also fails to capture the importance of information from different agents. Unlike CommNet and BiCNet, ATOC exploits the attention unit to perform communication dynamically, so most information comes from nearby agents and is thus helpful for decision making.

Figure 3: Reward of ATOC against baselines during training on cooperative navigation.

Table 1: Cooperative Navigation

                            ATOC   ATOC w/o Comm.   DDPG   CommNet   BiCNet
N = 50, L = 50
  mean reward              -0.04       -0.52        -0.60   -0.14     -0.22
  # collisions                13          51           59      32        47
  % occupied landmarks       92%         16%          12%     22%       40%
N = 100, L = 100
  mean reward              -0.05       -0.73        -0.65   -0.23
  # collisions                28          91           68      53
  % occupied landmarks       89%          9%          17%     25%

Figure 4: Visualizations of communications among ATOC agents on cooperative navigation. The rightmost figure illustrates actions taken by a group of agents with and without communication.

It is essential for agents to share a policy network, as they do in the experiments. The primary reason is that most real-world applications are open systems, i.e., agents come and go. If each agent is trained with an independent policy network, the network is apt to overfit to the number of agents in the environment and thus hard to generalize, not to mention the effort needed to train numerous independent policy networks, as in MADDPG, in large-scale multi-agent environments. However, agents that share a policy network may be homogeneous in terms of strategy, e.g., DDPG agents are all aggressive in seizing the landmarks while CommNet and BiCNet agents are all conservative. Nevertheless, unlike these baselines, ATOC agents behave differently: when a landmark is more likely to be occupied by an agent, nearby agents will turn to other landmarks. The primary reason behind this is the communication scheme of ATOC. An agent can share its local observation and intent with nearby agents, i.e., the dynamically formed communication group. Although the size of a communication group is small, the shared information may be further encoded and forwarded among groups by an agent who belongs to multiple groups. Thus, each agent can obtain more and diverse information. Based on the received information, agents may infer the actions of other agents and behave accordingly. 
Overall, ATOC agents show cooperative strategies to occupy the landmarks.
To investigate the scalability of ATOC and the baselines, we directly apply the models trained under the setting of N = 50 and L = 50 to the scenario of N = 100 and L = 100. With the increase in agent density, the number of collisions of all the methods increases. However, as shown in Table 1, ATOC is still much better than the baselines in terms of all the metrics, which demonstrates the scalability of ATOC. Interestingly, the percentage of occupied landmarks increases for DDPG and CommNet. As discussed before, the learned strategy of CommNet is conservative in the original setting, and thus it might lead to more occupied landmarks when agents are dense and decisions are more conflicting. The percentage of occupied landmarks of DDPG increases slightly, though the number of collisions also increases. The largely degraded performance of BiCNet in terms of all the metrics shows its poor scalability.
We visualize the communications among ATOC agents to trace the effect of the attention unit. As illustrated in Figure 4 (the left three figures), attentional communications occur in the regions where agents are dense and situations are complex. As the game progresses, the agents occupy more landmarks and communication is less needed. 
We select a communication group and observe their behaviors with/without communication. We find that agents without communication are more likely to target the same landmarks, which may lead to collisions, while agents with communication can spread to different landmarks, as depicted in Figure 4 (the rightmost figure).

Figure 5: Heatmap of attention corresponding to communication among ATOC agents in cooperative navigation.

Figure 6: Reward of ATOC against baselines during training on cooperative pushball.

Figure 7: Learning curves of ATOC (left) and CommNet (right) during learning on predator-prey.

To investigate the correlation between communication and attention, we further visualize the communication among ATOC agents at a certain timestep and its corresponding heatmap of attention in Figure 5. The regions where communications occur are the attention focuses, as illustrated in Figure 5. Only in the regions where agents are dense and landmarks are not occupied is communication needed for cooperative decision making. Our attention unit learns exactly what we expect, i.e., carrying out communication only when needed. Further, we turn off the communication of ATOC agents (without retraining) and the performance drops, as shown in Table 1. 
Therefore, we argue that communication during execution is also essential for better cooperation.

5.3 Cooperative Pushball

In this scenario, N agents who share a global reward cooperatively push a heavy ball to a designated location. The ball is 200 times heavier and 144 times bigger than an agent. Agents push the ball by collisions, not by forces, and control the moving direction by hitting the ball at different angles. However, agents are not given prior knowledge of how to control the direction, which must be learned during training. The inertial mass of the ball makes it difficult for agents to change its state of motion, and the round surfaces of the ball and agents make the task more complicated. Therefore, the task is very challenging. In the experiment, there are 50 agents, each agent can observe the relative locations of the ball and at most 10 agents within a predefined distance, and the designated location is the center of the playground. The reward of agents at each timestep is −d, where d denotes the distance from the ball to the center of the playground.
Figure 6 shows the learning curve in terms of normalized mean reward for ATOC and the baselines. ATOC converges to a much higher reward than all the baselines. CommNet and BiCNet have comparable rewards, which are higher than DDPG's. We evaluate ATOC and the baselines by running 30 test games.
The normalized mean reward is illustrated in Table 2.

Table 2: Cooperative Pushball

                         ATOC   ATOC w/o Comm.   DDPG   CommNet   BiCNet
normalized mean reward   0.95   0.86             0.50   0.71      0.77

ATOC agents learn sophisticated strategies: agents push the ball by hitting the center of the ball; they change the moving direction by striking the side of the ball; when the ball is approaching the target location, some agents turn to the opposite of the ball's moving direction and collide with the ball to reduce its speed, so as to keep the ball from passing the target location; at the end, agents split into two parts of equal size and strike the ball from two opposite directions, and eventually the ball is stabilized at the target location. The control of moving direction and the reduction of speed embody the division of work and cooperation among agents, which is accomplished by communication. By visualizing the communication structures and behaviors of agents, we find that agents in the same communication group behave homogeneously, e.g., one group of agents pushes the ball, one group controls the direction, and one group reduces the speed when the ball approaches the target location.
DDPG agents all behave similarly and show no division of work. That is, almost all agents push the ball from the same direction, which can lead to a deviated direction or quickly passing the target location. Not until the ball is pushed far from the target location do DDPG agents realize they are pushing in the wrong direction and switch to the opposite together. Therefore, the ball is pushed back and forth and hardly stabilized at the target location. Communication indeed helps, which explains why CommNet and BiCNet are better than DDPG. ATOC is better than CommNet and BiCNet, which is reflected in the experiments by ATOC's much smaller amplitude of oscillation.
The primary reason has been explained in the previous section.
To investigate the effect of communication in ATOC, we turn off the communication of ATOC agents (without retraining), and the result is shown in Table 2. The performance of ATOC drops, but it is still better than all the baselines. The reason behind this is that communication stabilizes the environment during training. Moreover, in ATOC, cooperative policy gradients can backpropagate to update individual policy networks, which enables agents to infer the actions of other agents without communication and thus behave cooperatively.

5.4 Predator-Prey

Figure 8: Cross-comparison between ATOC and baselines in terms of predator score on predator-prey.

In this variant of the classic predator-prey game, 60 slower predators chase 20 faster preys around an environment with 5 landmarks impeding the way. Each time a predator collides with a prey, the predator is rewarded with +10 while the prey is penalized with −10. Each agent observes the relative positions and velocities of the five nearest predators and the three nearest preys, and the positions of the two nearest landmarks. To keep preys in the playground instead of letting them run away, a prey is also given a reward based on its coordinates (x, y) at each timestep. The reward is −f(x) − f(y), where f(a) = 0 if a ≤ 0.9, f(a) = 10 × (a − 0.9) if 0.9 < a < 1, and f(a) = e^(2a−2) otherwise.
In this scenario, predators collaborate to surround and seize preys, while preys cooperate to perform temptation and evasion. In the experiment, we focus on the cooperation of predators/preys rather than the competition between them.
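The exponential branch of the boundary penalty above is reconstructed from the damaged extraction; e^(2a−2) is the form that makes f continuous at a = 1, where both 10 × (a − 0.9) and e^(2a−2) equal 1. A minimal sketch of this reward shaping follows; the function names are our own, and we apply f to the raw coordinate exactly as written in the text:

```python
import math

def boundary_penalty(a):
    """Piecewise penalty f(a) on one prey coordinate.

    Zero well inside the playground, linear near the boundary,
    exponential beyond it; the branches meet continuously at a = 1.
    """
    if a <= 0.9:
        return 0.0
    elif a < 1.0:
        return 10.0 * (a - 0.9)
    else:
        return math.exp(2.0 * a - 2.0)

def prey_position_reward(x, y):
    # Shaping term -f(x) - f(y) that discourages preys from leaving
    # the playground (added to the collision penalty each timestep).
    return -boundary_penalty(x) - boundary_penalty(y)
```

For example, a prey at (0.5, 0.5) receives no boundary penalty, while one at (0.95, 0.5) is penalized by 0.5.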
For each method, predator and prey agents are trained together. Figure 7 shows the learning curves of predators and preys for ATOC and CommNet. As the learning curves of DDPG and BiCNet are not stable in this scenario, we only show the learning curves of ATOC and CommNet. From Figure 7, we can see that ATOC converges much faster than CommNet: ATOC is stabilized after 1000 episodes, whereas CommNet is stabilized only after 2500 episodes. As the setting of the scenario appears to be more favorable for predators than preys, which is also indicated in Figure 7, the predators of both ATOC and CommNet converge more quickly than the preys.
To evaluate the performance, we perform a cross-comparison between ATOC and the baselines. That is, we play the game using ATOC predators against preys of the baselines and vice versa. The results are shown in terms of the 0-1 normalized mean predator score over 30 test runs for each game, as illustrated in Figure 8. The first bar cluster shows the games between predators and preys of the same method, from which we can see that the game setting is indeed more favorable for predators than preys, since predators have positive scores for all the methods. The second bar cluster shows the scores of the games where ATOC predators play against DDPG, CommNet, and BiCNet preys. We can see that ATOC predators have higher scores than the predators of all the baselines and hence are stronger than other predators. The third bar cluster shows the games where DDPG, CommNet, and BiCNet predators play against ATOC preys. The predator scores are all low, comparable to the scores in the first cluster.
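The text does not spell out how the predator scores are brought to the 0-1 range. One plausible reading is a min-max rescaling of the mean scores across the compared match-ups, sketched below under that assumption (the function name is ours):

```python
def min_max_normalize(scores):
    """Min-max rescale a list of mean scores to [0, 1].

    A sketch of one plausible reading of the "0-1 normalized mean
    predator score"; the exact scheme is not stated in the text.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: all scores equal, map everything to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```

For instance, min_max_normalize([2, 4, 6]) returns [0.0, 0.5, 1.0], so the weakest and strongest match-ups anchor the two ends of each bar chart.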
Therefore, we argue that ATOC leads to better cooperation than the baselines even in competitive environments, and that the learned policies of ATOC predators and preys can generalize to opponents with different policies.

6 Conclusions

We have proposed an attentional communication model for large-scale multi-agent environments, where agents learn an attention unit that dynamically determines whether communication is needed for cooperation, together with a bi-directional LSTM unit that serves as a communication channel to interpret encoded information from other agents. Unlike existing methods for communication, ATOC can effectively and efficiently exploit communication to make cooperative decisions. Empirically, ATOC outperforms existing methods in a variety of cooperative multi-agent environments.

Acknowledgments
This work was supported in part by Peng Cheng Laboratory and NSFC under grant 61872009.

References
[1] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438, 2013.
[2] Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. Deep communicating agents for abstractive summarization. In NAACL, 2018.
[3] Dorothy Cheney and Robert Seyfarth. Constraints and preadaptations in the earliest stages of language evolution. Linguistic Review, 22(2-4):135–159, 2005.
[4] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2016.
[5] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients.
In AAAI Conference on Artificial Intelligence (AAAI), 2018.
[6] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), 2017.
[7] Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In Advances in Neural Information Processing Systems (NIPS), 2017.
[8] Xiangyu Kong, Bo Xin, Fangchen Liu, and Yizhou Wang. Revisiting the master-slave architecture in multi-agent deep reinforcement learning. arXiv preprint arXiv:1712.07305, 2017.
[9] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 2017.
[10] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2017.
[11] Hoang M Le, Yisong Yue, Peter Carr, and Patrick Lucey. Coordinated multi-agent imitation learning. In International Conference on Machine Learning (ICML), 2017.
[12] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[13] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.
[14] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments.
In Advances in Neural Information Processing Systems (NIPS), 2017.
[15] Laëtitia Matignon, Laurent Jeanpierre, Abdel-Illah Mouaddib, et al. Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes. In AAAI Conference on Artificial Intelligence (AAAI), 2012.
[16] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems (NIPS), 2014.
[17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[18] Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
[19] Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.
[20] Manisa Pipattanasomporn, Hassan Feroze, and Saifur Rahman. Multi-agent systems in a distributed smart grid: Design and implementation. In IEEE/PES Power Systems Conference and Exposition, 2009.
[21] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014.
[22] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[23] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation.
In Advances in Neural Information Processing Systems (NIPS), 2016.
[24] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), 2018.