{"title": "Learning Multiagent Communication with Backpropagation", "book": "Advances in Neural Information Processing Systems", "page_first": 2244, "page_last": 2252, "abstract": "Many tasks in AI require the collaboration of multiple agents. Typically, the communication protocol between agents is manually specified and not altered during training. In this paper we explore a simple neural model, called CommNet, that uses continuous communication for fully cooperative tasks. The model consists of multiple agents and the communication between them is learned alongside their policy. We apply this model to a diverse set of tasks, demonstrating the ability of the agents to learn to communicate amongst themselves, yielding improved performance over non-communicative agents and baselines. In some cases, it is possible to interpret the language devised by the agents, revealing simple but effective strategies for solving the task at hand.", "full_text": "Learning Multiagent Communication\n\nwith Backpropagation\n\nSainbayar Sukhbaatar\nDept. of Computer Science\n\nCourant Institute, New York University\n\nsainbar@cs.nyu.edu\n\nArthur Szlam\n\nRob Fergus\n\nFacebook AI Research\n\nFacebook AI Research\n\nNew York\n\naszlam@fb.com\n\nNew York\n\nrobfergus@fb.com\n\nAbstract\n\nMany tasks in AI require the collaboration of multiple agents. Typically, the\ncommunication protocol between agents is manually speci\ufb01ed and not altered\nduring training. In this paper we explore a simple neural model, called CommNet,\nthat uses continuous communication for fully cooperative tasks. The model consists\nof multiple agents and the communication between them is learned alongside their\npolicy. We apply this model to a diverse set of tasks, demonstrating the ability\nof the agents to learn to communicate amongst themselves, yielding improved\nperformance over non-communicative agents and baselines. In some cases, it\nis possible to interpret the language devised by the agents, revealing simple but\neffective strategies for solving the task at hand.\n\nIntroduction\n\n1\nCommunication is a fundamental aspect of intelligence, enabling agents to behave as a group, rather\nthan a collection of individuals. It is vital for performing complex tasks in real-world environments\nwhere each actor has limited capabilities and/or visibility of the world. Practical examples include\nelevator control [3] and sensor networks [5]; communication is also important for success in robot\nsoccer [25]. In any partially observed environment, the communication between agents is vital to\ncoordinate the behavior of each individual. While the model controlling each agent is typically\nlearned via reinforcement learning [1, 28], the speci\ufb01cation and format of the communication is\nusually pre-determined. For example, in robot soccer, the bots are designed to communicate at each\ntime step their position and proximity to the ball.\nIn this work, we propose a model where cooperating agents learn to communicate amongst themselves\nbefore taking actions. Each agent is controlled by a deep feed-forward network, which additionally\nhas access to a communication channel carrying a continuous vector. Through this channel, they\nreceive the summed transmissions of other agents. However, what each agent transmits on the\nchannel is not speci\ufb01ed a-priori, being learned instead. 
Because the communication is continuous, the model can be trained via back-propagation, and thus can be combined with standard single-agent RL algorithms or supervised learning. The model is simple and versatile, which allows it to be applied to a wide range of problems involving partial visibility of the environment, where the agents learn a task-specific communication that aids performance. In addition, the model allows dynamic variation at run time in both the number and type of agents, which is important in applications such as communication between moving cars.

We consider the setting where we have J agents, all cooperating to maximize reward R in some environment. We make the simplifying assumption of full cooperation between agents: each agent receives R independent of its contribution. In this setting, there is no difference between each agent having its own controller and viewing the agents as pieces of a larger model controlling all of them. Taking the latter perspective, our controller is a large feed-forward neural network that maps inputs for all agents to their actions, each agent occupying a subset of units. A specific connectivity structure between layers (a) instantiates the broadcast communication channel between agents and (b) propagates the agent state.

We explore this model on a range of tasks. In some, supervision is provided for each action, while for others it is given sporadically. In the former case, the controller for each agent is trained by backpropagating the error signal through the connectivity structure of the model, enabling the agents to learn how to communicate amongst themselves to maximize the objective. In the latter case, reinforcement learning must be used as an additional outer loop to provide a training signal at each time step (see the supplementary material for details).

2 Communication Model

We now describe the model used to compute the distribution over actions p(a(t)|s(t), θ) at a given time t (omitting the time index for brevity). Let s_j be the j-th agent's view of the state of the environment. The input to the controller is the concatenation of all state-views s = {s_1, ..., s_J}, and the controller Φ is a mapping a = Φ(s), where the output a is a concatenation of discrete actions a = {a_1, ..., a_J} for each agent. Note that this single controller Φ encompasses the individual controllers for each agent, as well as the communication between agents.

2.1 Controller Structure

We now detail our architecture for Φ, which is built from modules f^i that take the form of multilayer neural networks. Here i ∈ {0, ..., K}, where K is the number of communication steps in the network. Each f^i takes two input vectors for each agent j: the hidden state h^i_j and the communication c^i_j, and outputs a vector h^{i+1}_j.
The main body of the model then takes as input the concatenated vectors h^0 = [h^0_1, h^0_2, ..., h^0_J], and computes:

$$h^{i+1}_j = f^i(h^i_j, c^i_j) \qquad (1)$$

$$c^{i+1}_j = \frac{1}{J-1} \sum_{j' \neq j} h^{i+1}_{j'} \qquad (2)$$

In the case that f^i is a single linear layer followed by a non-linearity σ, we have h^{i+1}_j = σ(H^i h^i_j + C^i c^i_j), and the model can be viewed as a feedforward network with layers h^{i+1} = σ(T^i h^i), where h^i is the concatenation of all h^i_j and T^i takes the block form (where \bar{C}^i = C^i/(J-1)):

$$T^i = \begin{pmatrix} H^i & \bar{C}^i & \bar{C}^i & \cdots & \bar{C}^i \\ \bar{C}^i & H^i & \bar{C}^i & \cdots & \bar{C}^i \\ \bar{C}^i & \bar{C}^i & H^i & \cdots & \bar{C}^i \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \bar{C}^i & \bar{C}^i & \bar{C}^i & \cdots & H^i \end{pmatrix}$$

A key point is that T^i is dynamically sized, since the number of agents may vary. This motivates the normalizing factor J − 1 in equation (2), which rescales the communication vector by the number of communicating agents. Note also that T^i is permutation invariant, so the order of the agents does not matter.

At the first layer of the model an encoder function h^0_j = r(s_j) is used. This takes as input the state-view s_j and outputs a feature vector h^0_j (in R^{d_0} for some d_0). The form of the encoder is problem dependent, but for most of our tasks it is a single-layer neural network. Unless otherwise noted, c^0_j = 0 for all j.

At the output of the model, a decoder function q(h^K_j) is used to output a distribution over the space of actions. q(·) takes the form of a single-layer network, followed by a softmax. To produce a discrete action, we sample from this distribution: a_j ∼ q(h^K_j).

Thus the entire model (shown in Fig. 1), which we call a Communication Neural Net (CommNet), (i) takes the state-view of all agents s and passes it through the encoder h^0 = r(s), (ii) iterates h and c via equations (1) and (2) to obtain h^K, and (iii) samples actions a for all agents according to q(h^K).

Figure 1: An overview of our CommNet model. Left: view of module f^i for a single agent j. Note that the parameters are shared across all agents. Middle: a single communication step, where each agent's module propagates its internal state h, as well as broadcasting a communication vector c on a common channel (shown in red). Right: full model Φ, showing input states s for each agent, two communication steps and the output actions for each agent.
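To make the controller concrete, the following is a minimal sketch of a CommNet forward pass in PyTorch, assuming the linear form of f^i with a tanh non-linearity. It is an illustration under stated assumptions, not the authors' released implementation; the class and parameter names (CommNet, hidden_dim, n_actions, K) are ours.

```python
# A minimal sketch of the CommNet controller, assuming the linear form of
# f^i with a tanh non-linearity. Names are illustrative, not from the paper.
import torch
import torch.nn as nn

class CommNet(nn.Module):
    def __init__(self, state_dim, hidden_dim, n_actions, K=2):
        super().__init__()
        self.K = K
        self.encoder = nn.Linear(state_dim, hidden_dim)   # r(s_j)
        # One pair (H^i, C^i) per communication step; parameters are shared
        # across agents but not across steps.
        self.H = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(K)])
        self.C = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim, bias=False) for _ in range(K)])
        self.decoder = nn.Linear(hidden_dim, n_actions)   # q(h^K_j); softmax via Categorical

    def forward(self, s):
        # s: (J, state_dim), one state-view per agent; J may vary between calls.
        J = s.size(0)
        h = torch.tanh(self.encoder(s))   # h^0
        c = torch.zeros_like(h)           # c^0_j = 0
        for i in range(self.K):
            h = torch.tanh(self.H[i](h) + self.C[i](c))           # eq. (1), linear case
            c = (h.sum(dim=0, keepdim=True) - h) / max(J - 1, 1)  # eq. (2): mean over j' != j
        return torch.distributions.Categorical(logits=self.decoder(h))

# Sample one discrete action per agent: a_j ~ q(h^K_j).
dist = CommNet(state_dim=8, hidden_dim=32, n_actions=5)(torch.randn(4, 8))
actions = dist.sample()   # shape (4,)
```

Because the communication term is just a rescaled sum of the other agents' hidden states, the whole forward pass is differentiable, which is what allows the protocol itself to be learned by backpropagation.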
2.2 Model Extensions

Local Connectivity: An alternative to the broadcast framework described above is to allow agents to communicate with others within a certain range. Let N(j) be the set of agents present within communication range of agent j. Then (2) becomes:

$$c^{i+1}_j = \frac{1}{|N(j)|} \sum_{j' \in N(j)} h^{i+1}_{j'} \qquad (3)$$

As the agents move, enter and exit the environment, N(j) will change over time. In this setting, our model has a natural interpretation as a dynamic graph, with N(j) being the set of vertices connected to vertex j at the current time. The edges within the graph represent the communication channel between agents, with (3) being equivalent to belief propagation [22]. Furthermore, the use of multi-layer nets at each vertex makes our model similar to an instantiation of the GGSNN work of Li et al. [14].

Skip Connections: For some tasks, it is useful to have the input encoding h^0_j present as an input for communication steps beyond the first layer. Thus for agent j at step i, we have:

$$h^{i+1}_j = f^i(h^i_j, c^i_j, h^0_j) \qquad (4)$$

Temporal Recurrence: We also explore making the network a recurrent neural network (RNN). This is achieved by simply replacing the communication step i in Eqns. (1) and (2) with a time step t, and using the same module f^t for all t. At every time step, actions are sampled from q(h^t_j). Note that agents can leave or join the swarm at any time step. If f^t is a single-layer network, we obtain plain RNNs that communicate with each other. In later experiments, we also use an LSTM as the f^t module.
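The three extensions compose naturally. Below is a sketch, under the same assumptions as the previous snippet, of a recurrent CommNet with skip connections and local connectivity. The neighbourhood is supplied as a 0/1 adjacency matrix adj (our notation, not the paper's) with zero diagonal, and for brevity only the final step's action distribution is returned, whereas the paper samples from q(h^t_j) at every time step.

```python
# Sketch of the extensions: eq. (3) neighbourhood mean, eq. (4) skip
# connection on h^0, and temporal recurrence (one module f reused each step).
import torch
import torch.nn as nn

class CommNetRNN(nn.Module):
    def __init__(self, state_dim, hidden_dim, n_actions):
        super().__init__()
        self.encoder = nn.Linear(state_dim, hidden_dim)
        self.H = nn.Linear(hidden_dim, hidden_dim)               # same f at every step t
        self.C = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.S = nn.Linear(hidden_dim, hidden_dim, bias=False)   # skip input h^0
        self.decoder = nn.Linear(hidden_dim, n_actions)

    def forward(self, s, adj, T=3):
        # adj: (J, J) float 0/1 matrix, adj[j, k] = 1 iff k is in N(j);
        # zero diagonal, since agents do not receive their own broadcast.
        h0 = torch.tanh(self.encoder(s))
        h, c = h0, torch.zeros_like(h0)
        for _ in range(T):
            h = torch.tanh(self.H(h) + self.C(c) + self.S(h0))   # eq. (4)
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            c = (adj @ h) / deg                                  # eq. (3)
        # The paper samples an action from q(h^t_j) at every time step;
        # this sketch returns only the final distribution for brevity.
        return torch.distributions.Categorical(logits=self.decoder(h))
```

Recomputing adj at each step is what gives the dynamic-graph interpretation: agents entering or leaving communication range simply change which rows of h are averaged into each c_j.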
3 Related Work

Our model combines a deep network with reinforcement learning [8, 20, 13]. Several recent works have applied these methods to multi-agent domains, such as Go [16, 24] and Atari games [29], but they assume full visibility of the environment and lack communication. There is a rich literature on multi-agent reinforcement learning (MARL) [1], particularly in the robotics domain [18, 25, 5, 21, 2]. Amongst fully cooperative algorithms, many approaches [12, 15, 33] avoid the need for communication by making strong assumptions about visibility of other agents and the environment. Others use communication, but with a pre-determined protocol [30, 19, 37, 17].

A few notable approaches involve learning to communicate between agents under partial visibility: Kasai et al. [10] and Varshavskaya et al. [32] both use distributed tabular-RL approaches for simulated tasks. Giles & Jim [6] use an evolutionary algorithm, rather than reinforcement learning. Guestrin et al. [7] use a single large MDP to control a collection of agents, via a factored message passing framework where the messages are learned. In contrast to these approaches, our model uses a deep network for both agent control and communication.

From a MARL perspective, the closest approach to ours is the concurrent work of Foerster et al. [4]. This also uses deep reinforcement learning in multi-agent partially observable tasks, specifically two riddle problems (similar in spirit to our levers task) which necessitate multi-agent communication.
ercase,communicationisanaction,and44willbetreatedassuchbythereinforcementlearning.Inthecontinuouscase,thesignalspassed45betweenagentsarenodifferentthanhiddenstatesinaneuralnetwork;thuscreditassignmentforthe46communicationcanbeperformedusingstandardbackpropagation(withintheouterRLloop).47Weusepolicygradient[33]withastatespeci\ufb01cbaselinefordeliveringagradienttothemodel.48Denotethestatesinanepisodebys(1),...,s(T),andtheactionstakenateachofthosestates49asa(1),...,a(T),whereTisthelengthoftheepisode.Thebaselineisascalarfunctionofthe50statesb(s,\u2713),computedviaanextraheadonthemodelproducingtheactionprobabilities.Beside51maximizingtheexpectedrewardwithpolicygradient,themodelsarealsotrainedtominimizethe52distancebetweenthebaselinevalueandactualreward.Thus,after\ufb01nishinganepisode,weupdate53themodelparameters\u2713by54\u2713=TXt=1logp(a(t)|s(t),\u2713)\u2713TXi=tr(i)b(s(t),\u2713)\u2713TXi=tr(i)b(s(t),\u2713)2.Herer(t)isrewardgivenattimet,andthehyperparameterisforbalancingtherewardandthe55baselineobjectives,setto0.03inallexperiments.563Model57Wenowdescribethemodelusedtocomputep(a(t)|s(t),\u2713)atagiventimet(ommitingthetime58indexforbrevity).Letsjbethejthagent\u2019sviewofthestateoftheenvironment.Theinputtothe59controlleristheconcatenationofallstate-viewss={s1,...,sJ},andthecontrollerisamapping60a=(s),wheretheoutputaisaconcatenationofdiscreteactionsa={a1,...,aJ}foreachagent.61Notethatthissinglecontrollerencompassestheindividualcontrollersforeachagents,aswellas62thecommunicationbetweenagents.63Oneobviouschoiceforisafully-connectedmulti-layerneuralnetwork,whichcouldextract64featureshfromsandusethemtopredictgoodactionswithourRLframework.Thismodelwould65allowagentstocommunicatewitheachotherandshareviewsoftheenvironment.However,it66isin\ufb02exiblewithrespecttothecompositionandnumberofagentsitcontrols;cannotdealwell67withagentsjoiningandleavingthegroupandeventheorderoftheagentsmustbe\ufb01xed.Onthe68otherhand,ifnocommunicationisusedthenwecanwritea={(s1),...,(sJ)},whereisa69per-agentcontrollerappliedindependently.Thiscommunication-freemodelsatis\ufb01esthe\ufb02exibility70requirements1,butisnotabletocoordinatetheiractions.713.1ControllerStructure72Wenowdetailthearchitectureforthathasthemodularityofthecommunication-freemodelbut73stillallowscommunication.isbuiltfrommodulesfi,whichtaketheformofmultilayerneural74networks.Herei2{0,..,K},whereKisthenumberofcommunicationlayersinthenetwork.75Eachfitakestwoinputvectorsforeachagentj:thehiddenstatehijandthecommunicationcij,76andoutputsavectorhi+1j.Themainbodyofthemodelthentakesasinputtheconcatenatedvectors771Assumingsjincludestheidentityofagentj.22ProblemFormulation33WeconsiderthesettingwherewehaveMagents,allcooperatingtomaximizerewardRinsome34environment.WemakethesimplifyingassumptionthateachagentreceivesR,independentoftheir35contribution.Inthissetting,thereisnodifferencebetweeneachagenthavingitsowncontroller,or36viewingthemaspiecesofalargermodelcontrollingallagents.Takingthelatterperspective,our37controllerisalargefeed-forwardneuralnetworkthatmapsinputsforallagentstotheiractions,each38agentoccupyingasubsetofunits.Aspeci\ufb01cconnectivitystructurebetweenlayers(a)instantiatesthe39broadcastcommunicationchannelbetweenagentsand(b)propagatestheagentstateinthemannerof40anRNN.41Becausetheagentswillreceivereward,butnotnecessarilysupervisionforeachaction,reinforcement42learningisusedtomaximizeexpectedfuturereward.Weexploretwoformsofcommunicationwithin43thecontroller:(i)discreteand(ii)continuous.Intheformercase,communicationisanaction,and44willbetreatedassuchbyther
einforcementlearning.Inthecontinuouscase,thesignalspassed45betweenagentsarenodifferentthanhiddenstatesinaneuralnetwork;thuscreditassignmentforthe46communicationcanbeperformedusingstandardbackpropagation(withintheouterRLloop).47Weusepolicygradient[33]withastatespeci\ufb01cbaselinefordeliveringagradienttothemodel.48Denotethestatesinanepisodebys(1),...,s(T),andtheactionstakenateachofthosestates49asa(1),...,a(T),whereTisthelengthoftheepisode.Thebaselineisascalarfunctionofthe50statesb(s,\u2713),computedviaanextraheadonthemodelproducingtheactionprobabilities.Beside51maximizingtheexpectedrewardwithpolicygradient,themodelsarealsotrainedtominimizethe52distancebetweenthebaselinevalueandactualreward.Thus,after\ufb01nishinganepisode,weupdate53themodelparameters\u2713by54\u2713=TXt=1logp(a(t)|s(t),\u2713)\u2713TXi=tr(i)b(s(t),\u2713)\u2713TXi=tr(i)b(s(t),\u2713)2.Herer(t)isrewardgivenattimet,andthehyperparameterisforbalancingtherewardandthe55baselineobjectives,setto0.03inallexperiments.563Model57Wenowdescribethemodelusedtocomputep(a(t)|s(t),\u2713)atagiventimet(ommitingthetime58indexforbrevity).Letsjbethejthagent\u2019sviewofthestateoftheenvironment.Theinputtothe59controlleristheconcatenationofallstate-viewss={s1,...,sJ},andthecontrollerisamapping60a=(s),wheretheoutputaisaconcatenationofdiscreteactionsa={a1,...,aJ}foreachagent.61Notethatthissinglecontrollerencompassestheindividualcontrollersforeachagents,aswellas62thecommunicationbetweenagents.63Oneobviouschoiceforisafully-connectedmulti-layerneuralnetwork,whichcouldextract64featureshfromsandusethemtopredictgoodactionswithourRLframework.Thismodelwould65allowagentstocommunicatewitheachotherandshareviewsoftheenvironment.However,it66isin\ufb02exiblewithrespecttothecompositionandnumberofagentsitcontrols;cannotdealwell67withagentsjoiningandleavingthegroupandeventheorderoftheagentsmustbe\ufb01xed.Onthe68otherhand,ifnocommunicationisusedthenwecanwritea={(s1),...,(sJ)},whereisa69per-agentcontrollerappliedindependently.Thiscommunication-freemodelsatis\ufb01esthe\ufb02exibility70requirements1,butisnotabletocoordinatetheiractions.713.1ControllerStructure72Wenowdetailthearchitectureforthathasthemodularityofthecommunication-freemodelbut73stillallowscommunication.isbuiltfrommodulesfi,whichtaketheformofmultilayerneural74networks.Herei2{0,..,K},whereKisthenumberofcommunicationlayersinthenetwork.75Eachfitakestwoinputvectorsforeachagentj:thehiddenstatehijandthecommunicationcij,76andoutputsavectorhi+1j.Themainbodyofthemodelthentakesasinputtheconcatenatedvectors771Assumingsjincludestheidentityofagentj.2h0=[h01,h02,...,h0J],andcomputes:78hi+1j=fi(hij,cij)(1)79ci+1j=1J1Xj06=jhi+1j0.(2)Inthecasethatfiisasinglelinearlayerfollowedbyanonlinearity,wehave:hi+1j=(Hihij+80Cicij)andthemodelcanbeviewedasafeedforwardnetworkwithlayershi+1=(Tihi)wherehi81istheconcatenationofallhijandTtakestheblockform:82Ti=HiCiCi...CiCiHiCi...CiCiCiHi...Ci...............CiCiCi...Hi,ConnectingNeuralModelsAnonymousAuthor(s)Af\ufb01liationAddressemailAbstractabstract11Introduction2Inthisworkwemaketwocontributions.First,wesimplifyandextendthegraphneuralnetwork3architectureof??.Second,weshowhowthisarchitecturecanbeusedtocontrolgroupsofcooperating4agents.52Model6Thesimplestformofthemodelconsistsofmultilayerneuralnetworksfithattakeasinputvectors7hiandciandoutputavectorhi+1.Themodeltakesasinputasetofvectors{h01,h02,...,h0m},and8computes9hi+1j=fi(hij,cij)10ci+1j=Xj06=jhi+1j0;Wesetc0j=0forallj,andi2{0,..,K}(wewillcallKthenumberofhopsinthenetwork).11Ifdesired,wecantakethe\ufb01nalhKjandoutputthemdirectly,sothatthem
odeloutputsavector12correspondingtoeachinputvector,orwecanfeedthemintoanothernetworktogetasinglevectoror13scalaroutput.14Ifeachfiisasimplelinearlayerfollowedbyanonlinearity:15hi+1j=(Aihij+Bicij),thenthemodelcanbeviewedasafeedforwardnetworkwithlayers16Hi+1=(TiHi),whereTiswritteninblockform17Ti=AiBiBi...BiBiAiBi...BiBiBiAi...Bi...............BiBiBi...Ai.ThekeyideaisthatTisdynamicallysized,andthematrixcanbedynamicallysizedbecausethe18blocksareappliedbytype,ratherthanbycoordinate.19Submittedto29thConferenceonNeuralInformationProcessingSystems(NIPS2016).Donotdistribute.ConnectingNeuralModelsAnonymousAuthor(s)Af\ufb01liationAddressemailAbstractabstract11Introduction2Inthisworkwemaketwocontributions.First,wesimplifyandextendthegraphneuralnetwork3architectureof??.Second,weshowhowthisarchitecturecanbeusedtocontrolgroupsofcooperating4agents.52Model6Thesimplestformofthemodelconsistsofmultilayerneuralnetworksfithattakeasinputvectors7hiandciandoutputavectorhi+1.Themodeltakesasinputasetofvectors{h01,h02,...,h0m},and8computes9hi+1j=fi(hij,cij)10ci+1j=Xj06=jhi+1j0;Wesetc0j=0forallj,andi2{0,..,K}(wewillcallKthenumberofhopsinthenetwork).11Ifdesired,wecantakethe\ufb01nalhKjandoutputthemdirectly,sothatthemodeloutputsavector12correspondingtoeachinputvector,orwecanfeedthemintoanothernetworktogetasinglevectoror13scalaroutput.14Ifeachfiisasimplelinearlayerfollowedbyanonlinearity:15hi+1j=(Aihij+Bicij),thenthemodelcanbeviewedasafeedforwardnetworkwithlayers16Hi+1=(TiHi),whereTiswritteninblockform17Ti=AiBiBi...BiBiAiBi...BiBiBiAi...Bi...............BiBiBi...Ai.ThekeyideaisthatTisdynamicallysized,andthematrixcanbedynamicallysizedbecausethe18blocksareappliedbytype,ratherthanbycoordinate.19Submittedto29thConferenceonNeuralInformationProcessingSystems(NIPS2016).Donotdistribute.ConnectingNeuralModelsAnonymousAuthor(s)Af\ufb01liationAddressemailAbstractabstract11Introduction2Inthisworkwemaketwocontributions.First,wesimplifyandextendthegraphneuralnetwork3architectureof??.Second,weshowhowthisarchitecturecanbeusedtocontrolgroupsofcooperating4agents.52Model6Thesimplestformofthemodelconsistsofmultilayerneuralnetworksfithattakeasinputvectors7hiandciandoutputavectorhi+1.Themodeltakesasinputasetofvectors{h01,h02,...,h0m},and8computes9hi+1j=fi(hij,cij)10ci+1j=Xj06=jhi+1j0;Wesetc0j=0forallj,andi2{0,..,K}(wewillcallKthenumberofhopsinthenetwork).11Ifdesired,wecantakethe\ufb01nalhKjandoutputthemdirectly,sothatthemodeloutputsavector12correspondingtoeachinputvector,orwecanfeedthemintoanothernetworktogetasinglevectoror13scalaroutput.14Ifeachfiisasimplelinearlayerfollowedbyanonlinearity:15hi+1j=(Aihij+Bicij),thenthemodelcanbeviewedasafeedforwardnetworkwithlayers16Hi+1=(TiHi),whereTiswritteninblockform17Ti=AiBiBi...BiBiAiBi...BiBiBiAi...Bi...............BiBiBi...Ai.ThekeyideaisthatTisdynamicallysized,andthematrixcanbedynamicallysizedbecausethe18blocksareappliedbytype,ratherthanbycoordinate.19Submittedto29thConferenceonNeuralInformationProcessingSystems(NIPS2016).Donotdistribute.ConnectingNeuralModelsAnonymousAuthor(s)Af\ufb01liationAddressemailAbstractabstract11Introduction2Inthisworkwemaketwocontributions.First,wesimplifyandextendthegraphneuralnetwork3architectureof??.Second,weshowhowthisarchitecturecanbeusedtocontrolgroupsofcooperating4agents.52Model6Thesimplestformofthemodelconsistsofmultilayerneuralnetworksfithattakeasinputvectors7hiandciandoutputavectorhi+1.Themodeltakesasinputasetofvectors{h01,h02,...,h0m},and8computes9hi+1j=fi(hij,cij)10ci+1j=Xj06=jhi+1j0;Wesetc0j=0forallj,andi2{0,..,K}(wewillcallKthenumberofhopsinthenetwork).11Ifdesi
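To make the block structure concrete, here is a minimal NumPy sketch (ours, not the paper's released code; all names and sizes are illustrative, and tanh stands in for the nonlinearity \sigma). It computes one communication step both directly via equations (1)-(2) and via the block matrix T^i, and checks that the two agree and that the step is permutation invariant.

import numpy as np

def comm_step(h, H, C):
    # One step for all agents: h is (J, d); returns sigma(H h_j + C c_j) per agent,
    # where c_j is the mean of the other agents' hidden states.
    J = h.shape[0]
    c = (h.sum(axis=0, keepdims=True) - h) / (J - 1)
    return np.tanh(h @ H.T + c @ C.T)

def comm_step_block(h, H, C):
    # The same step as one big linear layer: diagonal blocks H, off-diagonal C/(J-1).
    J, d = h.shape
    Cbar = C / (J - 1)
    T = np.block([[H if r == s else Cbar for s in range(J)] for r in range(J)])
    return np.tanh(T @ h.reshape(J * d)).reshape(J, d)

rng = np.random.default_rng(0)
J, d = 4, 8
h = rng.normal(size=(J, d))
H, C = rng.normal(size=(d, d)), rng.normal(size=(d, d))
assert np.allclose(comm_step(h, H, C), comm_step_block(h, H, C))
perm = rng.permutation(J)   # permuting the agents just permutes the outputs
assert np.allclose(comm_step(h, H, C)[perm], comm_step(h[perm], H, C))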
A key point is that T^i is dynamically sized, since the number of agents may vary between episodes. This motivates the normalizing factor J-1 in equation (2), which rescales the communication vector by the number of communicating agents. Moreover, the blocks are applied by type rather than by coordinate: in this simple form of the model, the types are just "self" and "teammate", but as we will see below, the communication architecture can be more complicated than "broadcast to all" and so may require more types. Note also that T^i is permutation invariant, so the order of the agents does not matter.

Figure 1: An overview of our CommNet model. Left: view of module f^i for a single agent j; note that the parameters are shared across all agents. Middle: a single communication step, in which each agent's module propagates its internal state h while broadcasting a communication vector c on a common channel. Right: the full model, showing the input states s for each agent, two communication steps, and the output actions for each agent.

At the first layer of the model an encoder function h^0_j = p(s_j) is used. This takes as input the state-view s_j and outputs a feature vector h^0_j (in R^{d_0} for some d_0). The form of the encoder is problem dependent, but for most of our tasks it consists of a lookup-table embedding (or a bag of such vectors). Unless otherwise noted, c^0_j = 0 for all j.

At the output of the model, a decoder function q(h^K_j) is used to output a distribution over the space of actions; q(.) takes the form of a single-layer network followed by a softmax. To produce a discrete action, we sample from this distribution.

Thus the entire model, which we call a Communication Neural Net (CommNet), (i) takes the state-views s of all agents and passes them through the encoder h^0 = p(s), (ii) iterates h and c via equations (1) and (2) to obtain h^K, and (iii) samples actions a for all agents according to q(h^K).
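The same pipeline, steps (i)-(iii), fits in a few lines; the sketch below is ours, with illustrative shapes, initializations, and tanh/softmax choices, using a lookup-table encoder and a linear-plus-softmax decoder as described above.

import numpy as np
rng = np.random.default_rng(0)

J, d, K, n_states, n_actions = 3, 16, 2, 10, 5
embed = rng.normal(size=(n_states, d)) * 0.1   # encoder p: lookup table
Hs = rng.normal(size=(K, d, d)) * 0.1          # per-step weights H^i
Cs = rng.normal(size=(K, d, d)) * 0.1          # per-step weights C^i
D = rng.normal(size=(n_actions, d)) * 0.1      # decoder q: single layer + softmax

def commnet_forward(s):
    h = embed[s]                                # (i)  h^0_j = p(s_j)
    for i in range(K):                          # (ii) iterate eqs. (1) and (2)
        c = (h.sum(0, keepdims=True) - h) / (len(s) - 1)
        h = np.tanh(h @ Hs[i].T + c @ Cs[i].T)
    logits = h @ D.T                            # (iii) per-agent action distribution
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    return np.array([rng.choice(n_actions, p=row) for row in p])

print(commnet_forward(np.array([2, 7, 4])))     # one sampled action per agent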
2.2 Model Extensions

Local Connectivity: An alternative to the broadcast framework described above is to allow agents to communicate only with others within a certain range. Let N(j) be the set of agents within communication range of agent j. Then (2) becomes:

c^{i+1}_j = \frac{1}{|N(j)|} \sum_{j' \in N(j)} h^{i+1}_{j'}    (3)
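In code, this amounts to replacing the global mean with a masked one; a minimal sketch (ours), assuming the environment supplies a boolean adjacency mask derived from the agents' positions:

import numpy as np

def local_comm(h_next, mask):
    # Eq. (3): c_j is the mean of h_{j'} over j' in N(j).
    # h_next: (J, d) hidden states; mask[j, j'] is True iff j' is in range of j.
    mask = mask & ~np.eye(len(mask), dtype=bool)              # exclude self
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1)   # guard isolated agents
    return (mask @ h_next) / counts                           # isolated agents get 0

rng = np.random.default_rng(1)
J, d = 5, 8
h = rng.normal(size=(J, d))
near = rng.random((J, J)) < 0.5
c = local_comm(h, near | near.T)    # symmetric communication range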
Skip Connections: For some tasks it is also useful to give the later communication steps direct access to the initial encoding, by feeding h^0_j as an extra input to each module:

h^{i+1}_j = f^i(h^i_j, c^i_j, h^0_j)    (4)

This variant is used, for example, in the lever-pulling task of Section 4.2.
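A sketch of one such module, following the description of the lever-pulling experiment below (a two-layer net over the concatenation of h^i_j, c^i_j and h^0_j); the weights, sizes and placement of the ReLU are our illustrative choices:

import numpy as np
rng = np.random.default_rng(2)

d = 128
W1 = rng.normal(size=(d, 3 * d)) * 0.05   # first layer over [h; c; h0]
W2 = rng.normal(size=(d, d)) * 0.05       # second layer

def f_skip(h, c, h0):
    # Eq. (4): h^{i+1}_j = f^i(h^i_j, c^i_j, h^0_j), here a two-layer MLP.
    x = np.concatenate([h, c, h0], axis=-1)   # (J, 3d)
    return np.maximum(x @ W1.T, 0.0) @ W2.T   # (J, d)

J = 5
h, c, h0 = (rng.normal(size=(J, d)) for _ in range(3))
h_next = f_skip(h, c, h0)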
3 Related Work

Our work is closely related to the concurrent approach of Foerster et al. [4], which also learns communication between deep reinforcement learning agents. Like our approach, the communication is learned rather than being pre-determined. However, their agents communicate in a discrete manner through their actions. This contrasts with our model, where multiple continuous communication cycles are used at each time step to decide the actions of all agents. Furthermore, our approach is amenable to dynamic variation in the number of agents.

The Neural GPU [9] has similarities to our model but differs in that a 1-D ordering on the input is assumed and it employs convolution, as opposed to the global pooling in our approach (thus permitting unstructured inputs). Our model can be regarded as an instantiation of the GNN construction of Scarselli et al. [23], as expanded on by Li et al. [14]. In particular, in [23], the output of the model is the fixed point of iterating equations (3) and (1) to convergence, using recurrent models. In [14], these recurrence equations are unrolled a fixed number of steps and the model is trained via backprop through time. In this work, we do not require the model to be recurrent, nor do we aim to reach a steady state. Additionally, we regard Eqn. (3) as a pooling operation, conceptually making our model a single feed-forward network with local connections.

4 Experiments

4.1 Baselines

We describe three baseline models for \Phi to compare against our model.

Independent controller: A simple baseline is to control the agents independently, without any communication between them. We can write \Phi as a = {\phi(s_1), ..., \phi(s_J)}, where \phi is a per-agent controller applied independently. The advantages of this communication-free model are modularity and flexibility (assuming s_j includes the identity of agent j). It can thus deal well with agents joining and leaving the group, but it is not able to coordinate the agents' actions.

Fully-connected: Another obvious choice is to make \Phi a fully-connected multi-layer neural network that takes the concatenation of the h^0_j as input and outputs actions {a_1, ..., a_J} using multiple softmax heads. It is equivalent to allowing T to be an arbitrary matrix of fixed size. This model allows agents to communicate with each other and share views of the environment. Unlike our model, however, it is not modular, it is inflexible with respect to the composition and number of agents it controls, and even the order of the agents must be fixed.

Discrete communication: An alternative way for agents to communicate is via discrete symbols, the meaning of these symbols being learned during training. Since \Phi now contains discrete operations and is not differentiable, reinforcement learning is used to train it in this setting. However, unlike actions in the environment, an agent has to output a discrete symbol at every communication step. If these are viewed as internal time steps of the agent, then the communication output can be treated as an action of the agent at a given (internal) time step and we can directly employ policy gradient [35].

At communication step i, agent j outputs the index w^i_j corresponding to a particular symbol, sampled according to:

w^i_j \sim \text{Softmax}(D h^i_j)    (5)

where the matrix D is a model parameter. Let \hat{w} be a 1-hot binary vector representation of w. In our broadcast framework, at the next step the agent receives a bag of vectors from all the other agents (where \wedge is the element-wise OR operation):

c^{i+1}_j = \bigwedge_{j' \neq j} \hat{w}^i_{j'}    (6)
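For reference, one discrete communication step per eqs. (5)-(6) can be sketched as follows (ours; the vocabulary size is an arbitrary illustrative choice):

import numpy as np
rng = np.random.default_rng(3)

J, d, vocab = 4, 16, 10
D = rng.normal(size=(vocab, d)) * 0.1
h = rng.normal(size=(J, d))

logits = h @ D.T                                       # (J, vocab)
p = np.exp(logits - logits.max(1, keepdims=True))
p /= p.sum(1, keepdims=True)
w = np.array([rng.choice(vocab, p=row) for row in p])  # eq. (5): one symbol per agent
w_hot = np.eye(vocab, dtype=bool)[w]                   # 1-hot vectors \hat{w}^i_j
# Eq. (6): agent j receives the element-wise OR of the other agents' symbols.
c_next = np.array([np.any(np.delete(w_hot, j, axis=0), axis=0) for j in range(J)])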
4.2 Simple Demonstration with a Lever Pulling Task

We start with a very simple game that requires the agents to communicate in order to win. It consists of m levers and a pool of N agents. At each round, m agents are drawn at random from the total pool of N agents and must each choose a lever to pull, simultaneously with the other m-1 agents, after which the round ends. The goal is for each of them to pull a different lever; correspondingly, all agents receive reward proportional to the number of distinct levers pulled. Each agent can see its own identity, and nothing else; thus s_j = j.
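The game and its metric are easy to state in code; a sketch of one round (ours), which also makes clear why the task needs communication: pulling the lever given by one's rank among the current participants requires knowing the other participants' IDs.

import numpy as np
rng = np.random.default_rng(4)
N, m = 500, 5

def play_round(choose):
    # choose maps the m drawn agent IDs to m lever choices.
    ids = rng.choice(N, size=m, replace=False)
    return len(set(choose(ids))) / m            # reward: fraction of distinct levers

independent = lambda ids: [int(j) % m for j in ids]   # sees only its own ID
ranked = lambda ids: np.argsort(np.argsort(ids))      # rank among participants

print(np.mean([play_round(independent) for _ in range(1000)]))  # about 0.67
print(np.mean([play_round(ranked) for _ in range(1000)]))       # always 1.0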
We implement the game with m = 5 and N = 500. We use a CommNet with two communication steps (K = 2) and skip connections from (4). The encoder is a lookup-table with N entries of 128D. Each f^i is a two-layer neural net with ReLU non-linearities that takes in the concatenation of (h^i, c^i, h^0) and outputs a 128D vector. The decoder is a linear layer plus softmax, producing a distribution over the m levers, from which we sample to determine the lever to be pulled. We compare against the independent controller, which has the same architecture as our model except that the communication c is zeroed. The results are shown in Table 1. The metric is the number of distinct levers pulled divided by m = 5, averaged over 500 trials, after seeing 50000 batches of size 64 during training. We explore both reinforcement learning (see the supplementary material) and direct supervision (using the solution given by sorting the agent IDs and having each agent pull the lever corresponding to its relative order among the current m agents). In both cases, the CommNet performs significantly better than the independent controller. See the supplementary material for an analysis of a trained model.

Model \Phi     Supervised   Reinforcement
Independent    0.59         0.59
CommNet        0.99         0.94

Table 1: Results of the lever game, (#distinct levers pulled)/(#levers), for our CommNet and independent controller models, using two different training approaches. Allowing the agents to communicate enables them to succeed at the task.

4.3 Multi-turn Games

In this section, we consider two multi-agent tasks in the MazeBase environment [26] that use reward as their training signal. The first task is to control cars passing through a traffic junction so as to maximize the flow while minimizing collisions. The second is to control multiple agents in combat against enemy bots.

We experimented with several module types. With a feedforward MLP, the module f^i is a single-layer network and K = 2 communication steps are used. For an RNN module, we also used a single-layer network for f^t, but shared its parameters across time steps. Finally, we used an LSTM for f^t. In all modules, the hidden layer size is set to 50, and MLP modules use skip connections. Both tasks are trained for 300 epochs, each epoch being 100 weight updates with RMSProp [31] on mini-batches of 288 game episodes (distributed over multiple CPU cores). In total, the models experience ~8.6M episodes during training. We repeat all experiments 5 times with different random initializations, and report the mean value along with the standard deviation. Training time varies from a few hours to a few days, depending on the task and module type.

4.3.1 Traffic Junction

This task consists of a 4-way junction on a 14 x 14 grid, as shown in Fig. 2(left). At each time step, new cars enter the grid with probability p_arrive from each of the four directions, but the total number of cars at any given time is limited to N_max = 10. Each car occupies a single cell at any given time and is randomly assigned one of three possible routes (keeping to the right-hand side of the road). At every time step, a car has two possible actions: gas, which advances it by one cell on its route, and brake, which keeps it at its current location. A car is removed once it reaches its destination at the edge of the grid.

Two cars collide if their locations overlap. A collision incurs a reward r_coll = -10, but does not affect the simulation in any other way. To discourage traffic jams, each car also gets a reward of \tau r_time = -0.01\tau at every time step, where \tau is the number of time steps since the car arrived. The total reward at time t is therefore:

r(t) = C^t r_{coll} + \sum_{i=1}^{N^t} \tau_i r_{time},

where C^t is the number of collisions occurring at time t, and N^t is the number of cars present. The simulation is terminated after 40 steps and is classified as a failure if one or more collisions have occurred.

Figure 2: Left: the traffic junction task, where agent-controlled cars (colored circles) have to pass through the junction without colliding. Middle: the combat task, where model-controlled agents (red circles) fight against enemy bots (blue circles). In both tasks each agent has limited visibility (orange region), and thus cannot see the locations of all other agents. Right: as visibility in the environment decreases, the importance of communication grows in the traffic junction task.

Each car is represented by a set of one-hot binary vectors {n, l, r} encoding its unique ID, current location and assigned route number respectively. Each agent controlling a car can only observe other cars in its vision range (a surrounding 3 x 3 neighborhood), but it can communicate to all other cars. The state vector s_j for each agent is thus a concatenation of all these vectors, having dimension 3^2 x |n| x |l| x |r|.
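A direct transcription of the per-step reward above (ours; the collision count C^t and the per-car ages \tau_i would come from the simulator):

def junction_reward(num_collisions, car_ages, r_coll=-10.0, r_time=-0.01):
    # r(t) = C^t * r_coll + sum_i tau_i * r_time
    return num_collisions * r_coll + r_time * sum(car_ages)

# e.g. one collision, three cars that arrived 3, 1 and 7 steps ago:
print(junction_reward(1, [3, 1, 7]))   # about -10.11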
In Table 2(left), we show the probability of failure for a variety of model \Phi and module f pairs. Compared to the baseline models, CommNet significantly reduces the failure rate for all module types, achieving the best performance with the LSTM module (a video showing this model before and after training can be found at http://cims.nyu.edu/~sainbar/commnet).

We also explored how partial visibility within the environment affects the advantage given by communication. As the vision range of each agent decreases, the advantage of communication increases, as shown in Fig. 2(right). Impressively, with zero visibility (the cars are driving blind) the CommNet model is still able to succeed 90% of the time.

Table 2(right) shows the results on easy and hard versions of the game. The easy version is a junction of two one-way roads, while the harder version consists of four connected junctions of two-way roads. Details of the other game variations can be found in the supplementary material. Discrete communication works well on the easy version, but CommNet with local connectivity gives the best performance in the hard case.

4.3.2 Analysis of Communication

We now attempt to understand what the agents communicate when performing the junction task. We start by recording the hidden state h^i_j of each agent and the corresponding communication vectors \tilde{c}^{i+1}_j = C^{i+1} h^i_j (the contribution agent j at step i+1 makes to the hidden state of other agents). Fig. 3(left) and Fig. 3(right) show the 2D PCA projections of the communication and hidden state vectors respectively. These plots show a diverse range of hidden states but far more clustered communication vectors, many of which are close to zero. This suggests that while the hidden state carries information, the agent often prefers not to communicate it to the others unless necessary. This is a possible consequence of the broadcast channel: if everyone talks at the same time, no one can understand. See the supplementary material for the norms of the communication vectors and the brake locations.

Model \Phi       MLP           RNN           LSTM
Independent      20.6 ± 14.1   19.5 ± 4.5    9.4 ± 5.6
Fully-connected  12.5 ± 4.4    34.8 ± 19.7   4.8 ± 2.4
Discrete comm.   15.8 ± 9.3    15.2 ± 2.1    8.4 ± 3.4
CommNet          2.2 ± 0.6     7.6 ± 1.4     1.6 ± 1.0

Model \Phi       Easy (MLP)    Hard (RNN)
Independent      15.8 ± 12.5   26.9 ± 6.0
Discrete comm.   1.1 ± 2.4     28.2 ± 5.7
CommNet          0.3 ± 0.1     22.5 ± 6.1
CommNet local    -             21.1 ± 3.4

Table 2: Traffic junction task. Left: failure rates (%) for different types of model and module function f(.). CommNet consistently improves performance over the baseline models. Right: game variants. In the easy case, discrete communication does help, but still less than CommNet. On the hard version, local communication (see Section 2.2) does at least as well as broadcasting to all agents.
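The 2D projections used in this analysis are an ordinary PCA; a sketch (ours), assuming the recorded vectors have been stacked into an array:

import numpy as np

def pca_2d(x):
    # Project the rows of x onto their first two principal components.
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T

# comm = np.stack(recorded_comm_vectors)   # (num_samples, d), gathered during play
# xy = pca_2d(comm)                        # scatter xy to obtain plots like Fig. 3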
Figure 3: Left: first two principal components of communication vectors \tilde{c} from multiple runs on the traffic junction task of Fig. 2(left). While the majority are "silent" (i.e. have a small norm), distinct clusters are also present. Middle: for three of these clusters, we probe the model to understand their meaning (see text for details). Right: first two principal components of hidden state vectors h from the same runs as on the left, with corresponding color coding. Note how many of the "silent" communication vectors accompany non-zero hidden state vectors, showing that the two pathways carry different information.

To better understand the meaning behind the communication vectors, we ran the simulation with only two cars and recorded their communication vectors and locations whenever one of them braked. Vectors belonging to the clusters A, B & C in Fig. 3(left) were consistently emitted when one of the cars was in a specific location, shown by the colored circles in Fig. 3(middle) (or a pair of locations for cluster C). They also strongly correlated with the other car braking at the locations indicated in red, which happen to be relevant to avoiding collision.

4.3.3 Combat Task

We simulate a simple battle involving two opposing teams on a 15 x 15 grid, as shown in Fig. 2(middle). Each team consists of m = 5 agents, and their initial positions are sampled uniformly in a 5 x 5 square around the team center, which is picked uniformly in the grid. At each time step, an agent can perform one of the following actions: move one cell in one of four directions; attack another agent by specifying its ID j (there are m attack actions, each corresponding to one enemy agent); or do nothing. If agent A attacks agent B, then B's health points are reduced by 1, but only if B is inside the firing range of A (its surrounding 3 x 3 area). Agents need one time step of cooling down after an attack, during which they cannot attack. All agents start with 3 health points, and die when their health reaches 0. A team wins if all agents in the other team die. The simulation ends when one team wins, or when neither team has won within 40 time steps (a draw).

The model controls one team during training, and the other team consists of bots that follow a hard-coded policy: attack the nearest enemy agent if it is within firing range; if not, approach the nearest visible enemy agent within visual range. An agent is visible to all bots if it is inside the visual range of any individual bot; this shared vision gives an advantage to the bot team.

When input to a model, each agent is represented by a set of one-hot binary vectors {i, t, l, h, c} encoding its unique ID, team ID, location, health points and cooldown. A model controlling an agent also sees other agents in its visual range (a 3 x 3 surrounding area). The model gets a reward of -1 if its team loses or draws at the end of the game. In addition, it gets a reward of -0.1 times the total health points of the enemy team, which encourages it to attack enemy bots.
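The bots' hard-coded policy is simple enough to state directly; a sketch (ours; the real environment allows only the four cardinal moves, so the sign-based step below is a simplification):

import numpy as np

def bot_act(bot_pos, visible_enemies, fire_range=1):
    # visible_enemies: positions visible to *any* bot (shared vision).
    # Attack the nearest one inside the 3x3 firing area (Chebyshev distance <= 1),
    # otherwise step toward the nearest visible one, otherwise do nothing.
    if len(visible_enemies) == 0:
        return ("noop",)
    pos = np.asarray(visible_enemies)
    dist = np.abs(pos - np.asarray(bot_pos)).max(axis=1)
    k = int(dist.argmin())
    if dist[k] <= fire_range:
        return ("attack", k)
    return ("move", tuple(np.sign(pos[k] - np.asarray(bot_pos))))

print(bot_act((7, 7), [(7, 8), (2, 3)]))   # -> ('attack', 0)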
Win rates (%) by module type:

Model Φ           MLP           RNN           LSTM
Independent       34.2 ± 1.3    37.3 ± 4.6    44.3 ± 0.4
Fully-connected   17.7 ± 7.1    2.9 ± 1.8     19.6 ± 4.2
Discrete comm.    29.1 ± 6.7    33.4 ± 9.4    46.4 ± 0.7
CommNet           44.5 ± 13.4   44.4 ± 11.9   49.5 ± 12.6

Other game variations (MLP):

Model Φ           m = 3         m = 10        5 × 5 vision
Independent       29.2 ± 5.9    30.5 ± 8.7    60.5 ± 2.1
CommNet           51.0 ± 14.1   45.4 ± 12.4   73.0 ± 0.7

Table 3: Win rates (%) on the combat task for different communication approaches and module choices. Continuous communication consistently outperforms the other approaches. The fully-connected baseline does worse than the independent model without communication. On the right we explore the effect of varying the number of agents m and agent visibility. Even with 10 agents on each team, communication clearly helps.

Table 3 shows the win rates for the different model types and module choices f(.). Among the module choices, the LSTM achieved the best performance. Continuous communication with CommNet improved all module types. Relative to the independent controller, the fully-connected model degraded performance, but discrete communication improved the LSTM module. We also explored several variations of the task: varying the number of agents in each team by setting m = 3, 10, and increasing the visual range of agents to a 5×5 area. The results on those tasks are shown on the right side of Table 3. The CommNet model consistently improves the win rate, even with the greater environment observability of the 5×5 vision case.

4.4 bAbI Tasks
We apply our model to the bAbI [34] toy Q&A dataset, which consists of 20 tasks, each requiring a different kind of reasoning. The goal is to answer a question after reading a short story. We can formulate this as a multi-agent task by giving each sentence of the story its own agent. Communication among agents allows them to exchange the useful information necessary to answer the question.

The input is {s_1, s_2, ..., s_J, q}, where s_j is the j-th sentence of the story and q is the question sentence. We use the same encoder representation as [27] to convert them to vectors. The f(.) module consists of a two-layer MLP with ReLU non-linearities. After K = 2 communication steps, we add the final hidden states together and pass the sum through a softmax decoder layer to sample an output word y. The model is trained in a supervised fashion using a cross-entropy loss between y and the correct answer y*. The hidden layer size is set to 100 and weights are initialized from N(0, 0.2). We train the model for 100 epochs with learning rate 0.003 and mini-batch size 32 with the Adam optimizer [11] (β_1 = 0.9, β_2 = 0.99, ε = 10^-6). We used 10% of the training data as a validation set to find the optimal hyper-parameters for the model.
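For concreteness, a minimal sketch of this bAbI controller is given below in PyTorch. It follows the description above (K = 2 communication steps, a two-layer MLP with ReLU as f(.), hidden size 100, summed final states fed to a softmax decoder), but several details are assumptions made for illustration: the communication is taken as the mean of the other agents' hidden states, f receives the hidden and communication vectors concatenated, the vocabulary size is a placeholder, and the sentence encoder of [27] and the handling of the question are omitted. All names are illustrative.

import torch
import torch.nn as nn

class CommNetBabi(nn.Module):
    def __init__(self, hidden=100, vocab_size=180, steps=2):
        super().__init__()
        # One f module per communication step: a two-layer MLP with ReLU,
        # applied to the concatenated [hidden state, communication] vector.
        self.f = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU())
            for _ in range(steps))
        self.decoder = nn.Linear(hidden, vocab_size)
        for p in self.parameters():      # weights (and, for simplicity,
            nn.init.normal_(p, std=0.2)  # biases) drawn from N(0, 0.2)

    def forward(self, h):
        # h: (J, hidden), one encoded story sentence (agent) per row.
        J = h.size(0)
        for f in self.f:
            # Each agent receives the other agents' states; averaging
            # rather than summing them is an assumption of this sketch.
            c = (h.sum(dim=0, keepdim=True) - h) / max(J - 1, 1)
            h = f(torch.cat([h, c], dim=1))
        # Sum the final hidden states and decode a single answer word.
        return torch.log_softmax(self.decoder(h.sum(dim=0)), dim=-1)

Training then uses the hyper-parameters quoted above, e.g. torch.optim.Adam(model.parameters(), lr=0.003, betas=(0.9, 0.99), eps=1e-6), with a cross-entropy loss (nn.NLLLoss on the log-probabilities) against y*.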
Results on the 10K version of the bAbI tasks are shown in Table 4, along with other baselines (see the supplementary material for a detailed breakdown). Our model outperforms the LSTM baseline, but is worse than the MemN2N model [27], which is specifically designed to solve reasoning over long stories. However, it successfully solves most of the tasks, including ones that require information sharing between two or more agents through communication.

Model                      Mean error (%)   Failed tasks (err. > 5%)
LSTM [27]                  36.4             16
MemN2N [27]                4.2              3
DMN+ [36]                  2.8              1
Independent (MLP module)   15.2             9
CommNet (MLP module)       7.1              3

Table 4: Experimental results on bAbI tasks.

5 Discussion and Future Work
We have introduced CommNet, a simple controller for MARL that is able to learn continuous communication between a dynamically changing set of agents. Evaluations on four diverse tasks clearly show that the model outperforms models without communication, fully-connected models, and models using discrete communication. Despite the simplicity of the broadcast channel, examination of the traffic task reveals that the model has learned a sparse communication protocol that conveys meaningful information between agents. Code for our model (and baselines) can be found at http://cims.nyu.edu/~sainbar/commnet/.

One aspect of our model that we did not fully exploit is its ability to handle heterogeneous agent types, and we hope to explore this in future work. Furthermore, we believe the model will scale gracefully to large numbers of agents, perhaps requiring more sophisticated connectivity structures; we also leave this to future work.

Acknowledgements
The authors wish to thank Daniel Lee and Y-Lan Boureau for their advice and guidance. Rob Fergus is grateful for the support of CIFAR.

References
[1] L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, 38(2):156–172, 2008.
[2] Y. Cao, W. Yu, W. Ren, and G. Chen. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 1(9):427–438, 2013.
[3] R. H. Crites and A. G. Barto. Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2):235–262, 1998.
[4] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate to solve riddles with deep distributed recurrent Q-networks. arXiv, abs/1602.02672, 2016.
[5] D. Fox, W. Burgard, H. Kruppa, and S. Thrun. A probabilistic approach to collaborative multi-robot localization. Autonomous Robots, 8(3):325–344, 2000.
[6] C. L. Giles and K. C. Jim. Learning communication for multi-agent systems. In Innovative Concepts for Agent Based Systems, pages 377–390. Springer, 2002.
[7] C. Guestrin, D. Koller, and R. Parr. Multiagent planning with factored MDPs. In NIPS, 2001.
[8] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, 2014.
[9] L. Kaiser and I. Sutskever. Neural GPUs learn algorithms. In ICLR, 2016.
[10] T. Kasai, H. Tenmoto, and A. Kamiya. Learning of communication codes in multi-agent reinforcement learning problem. In IEEE Conference on Soft Computing in Industrial Applications, pages 1–6, 2008.
[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[12] M. Lauer and M. A. Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In ICML, 2000.
[13] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
[14] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In ICLR, 2015.
[15] M. L. Littman. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66, 2001.
[16] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver. Move evaluation in Go using deep convolutional neural networks. In ICLR, 2015.
[17] D. Maravall, J. De Lope, and R. Domínguez. Coordination of communication in robot teams by reinforcement learning. Robotics and Autonomous Systems, 61(7):661–666, 2013.
[18] M. Matarić. Reinforcement learning in the multi-robot domain. Autonomous Robots, 4(1):73–83, 1997.
[19] F. S. Melo, M. Spaan, and S. J. Witwicki. QueryPOMDP: POMDP-based communication in multiagent systems. In Multi-Agent Systems, pages 189–204, 2011.
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[21] R. Olfati-Saber, J. Fax, and R. Murray. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215–233, 2007.
[22] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. In AAAI, 1982.
[23] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[24] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[25] P. Stone and M. Veloso. Towards collaborative and adversarial learning: A case study in robotic soccer. International Journal of Human Computer Studies, (48), 1998.
[26] S. Sukhbaatar, A. Szlam, G. Synnaeve, S. Chintala, and R. Fergus. MazeBase: A sandbox for learning from games. CoRR, abs/1511.07401, 2015.
[27] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. In NIPS, 2015.
[28] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[29] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, and R. Vicente. Multiagent cooperation and competition with deep reinforcement learning. arXiv:1511.08779, 2015.
[30] M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In ICML, 1993.
[31] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[32] P. Varshavskaya, L. P. Kaelbling, and D. Rus. Efficient distributed reinforcement learning through agreement. In Distributed Autonomous Robotic Systems 8, pages 367–378, 2009.
[33] X. Wang and T. Sandholm. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In NIPS, pages 1571–1578, 2002.
[34] J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR, 2016.
[35] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, pages 229–256, 1992.
[36] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.
[37] C. Zhang and V. Lesser. Coordinating multi-agent reinforcement learning with limited communication. In Proc. AAMAS, pages 1101–1108, 2013.