{"title": "VAIN: Attentional Multi-agent Predictive Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 2701, "page_last": 2711, "abstract": "Multi-agent predictive modeling is an essential step for understanding physical, social and team-play systems. Recently, Interaction Networks (INs) were proposed for the task of modeling multi-agent physical systems. One of the drawbacks of INs is scaling with the number of interactions in the system (typically quadratic or higher order in the number of agents). In this paper we introduce VAIN, a novel attentional architecture for multi-agent predictive modeling that scales linearly with the number of agents. We show that VAIN is effective for multi-agent predictive modeling. Our method is evaluated on tasks from challenging multi-agent prediction domains: chess and soccer, and outperforms competing multi-agent approaches.", "full_text": "VAIN: Attentional Multi-agent Predictive Modeling\n\nYedid Hoshen\n\nFacebook AI Research, NYC\n\nyedidh@fb.com\n\nAbstract\n\nMulti-agent predictive modeling is an essential step for understanding physical,\nsocial and team-play systems. Recently, Interaction Networks (INs) were proposed\nfor the task of modeling multi-agent physical systems. One of the drawbacks of\nINs is scaling with the number of interactions in the system (typically quadratic\nor higher order in the number of agents).\nIn this paper we introduce VAIN,\na novel attentional architecture for multi-agent predictive modeling that scales\nlinearly with the number of agents. We show that VAIN is effective for multi-\nagent predictive modeling. Our method is evaluated on tasks from challenging\nmulti-agent prediction domains: chess and soccer, and outperforms competing\nmulti-agent approaches.\n\n1\n\nIntroduction\n\nModeling multi-agent interactions is essential for understanding the world. 
The physical world is governed by (relatively) well-understood multi-agent interactions, including fundamental forces (e.g. gravitational attraction, electrostatic interactions) as well as more macroscopic phenomena (electrical conductors and insulators, astrophysics). The social world is also governed by multi-agent interactions (e.g. psychology and economics) which are often imperfectly understood. Games such as Chess or Go have simple and well defined rules, but move dynamics are governed by very complex policies. Modeling and inference of multi-agent interactions from observational data is therefore an important step towards machine intelligence.
Deep Neural Networks (DNNs) have had much success in machine perception, e.g. Computer Vision [1, 2, 3], Natural Language Processing [4] and Speech Recognition [5, 6]. These problems usually have temporal and/or spatial structure, which makes them amenable to particular neural architectures - Convolutional and Recurrent Neural Networks (CNN [7] and RNN [8]). Multi-agent interactions are different from machine perception in several ways:

• The data is no longer sampled on a spatial or temporal grid.
• The number of agents changes frequently.
• Systems are quite heterogeneous; there is not a canonical large network that can be used for finetuning.
• Multi-agent systems have an obvious factorization (into point agents), whereas signals such as images and speech do not.

To model simple interactions in a physics simulation context, Interaction Networks (INs) were proposed by Battaglia et al. [9]. Interaction networks model each interaction in the physical interaction graph (e.g. the force between every two gravitating bodies) by a neural network. By the additive sum of the vector outputs of all the interactions, a global interaction vector is obtained. The global interaction alongside object features is then used to predict the future velocity of the object.
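The IN computation just described can be made concrete with a minimal sketch (random, untrained weights; the tiny two-layer perceptrons and all dimensions are our own illustrative stand-ins, not the architecture of [9]):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(d_in, d_out):
    """A tiny random two-layer perceptron standing in for a trained network."""
    W1, W2 = rng.normal(size=(d_in, 32)), rng.normal(size=(32, d_out))
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

N, d = 5, 4                      # 5 agents, 4 features each (e.g. position + velocity)
x = rng.normal(size=(N, d))

psi = mlp(2 * d, 8)              # interaction network: one agent pair -> effect vector
theta = mlp(8 + d, 2)            # object network: summed effects + own features -> prediction

# Sum the pairwise interaction effects for every agent: O(N^2) psi evaluations.
effects = np.zeros((N, 8))
for i in range(N):
    for j in range(N):
        if i != j:
            effects[i] += psi(np.concatenate([x[i], x[j]]))

# Global interaction alongside object features predicts e.g. next-step velocity.
pred = theta(np.concatenate([effects, x], axis=1))   # shape (5, 2)
```

The double loop is the quadratic cost that VAIN avoids: every ordered pair of agents requires its own psi evaluation.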
It was shown that Interaction Networks can be trained for different numbers of physical agents and generate accurate results for simple physical scenarios in which the nature of the interaction is additive and binary (i.e. pairwise interaction between two agents) and the number of agents is small [9].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Although Interaction Networks are suitable for the physical domain for which they were introduced, they have significant drawbacks that prevent them from being efficiently extensible to general multi-agent interaction scenarios. The network complexity is O(N^d), where N is the number of objects and d is the typical interaction clique size. Fundamental physics interactions simulated by the method have d = 2, resulting in a quadratic dependence, and higher order interactions become completely unmanageable. In Social LSTM [10], this was remedied by pooling a local neighborhood of interactions. That solution however cannot work for scenarios with long-range interactions. Another solution offered by Battaglia et al. [9] is to add several fully connected layers modeling the high-order interactions. This approach struggles when the objective is to select one of the agents (e.g. which agent will move), as it results in a distributed representation and loses the structure of the problem.
In this work we present VAIN (Vertex Attention Interaction Network), a novel multi-agent attentional neural network for predictive modeling. VAIN's attention mechanism helps with modeling the locality of interactions and improves performance by determining which agents will share information. VAIN can be seen as a CommNet [11] with a novel attention mechanism, or as a factorized Interaction Network [9]. This will be made more concrete in Sec. 2.
We show that VAIN can model high-order interactions with linear complexity in the number of vertices while preserving the structure of the problem. This gives it lower complexity than IN in cases where there are many fewer vertices than edges (in many cases linear vs. quadratic in the number of agents).
For evaluation we introduce two non-physical tasks which more closely resemble real-world and game-playing multi-agent predictive modeling, as well as a physical Bouncing Balls task. Our non-physical tasks are taken from Chess and Soccer and contain different types of interactions and different data regimes. The interaction graph on these tasks is not known a priori, as is typical in nature.
An informal analysis of our architecture is presented in Sec. 2. Our method is presented in Sec. 3. Descriptions of our experimental evaluation scenarios and our results are provided in Sec. 4. Conclusions and future work are presented in Sec. 5.

Related Work

This work is primarily concerned with learning multi-agent interactions with graph structures. The seminal works in graph neural networks were presented by Scarselli et al. [12, 13] and Li et al. [14]. Another notable iterative graph-like neural algorithm is the Neural-GPU [15]. Notable work in graph NNs includes Spectral Networks [16] and work by Duvenaud et al. [17] for fingerprinting of chemical molecules.
Two related approaches that learn multi-agent interactions on a graph structure are: Interaction Networks [9], which learn a physical simulation of objects that exhibit binary relations, and Communication Networks (CommNets) [11], presented for learning optimal communications between agents. The differences between our approach VAIN and the previous approaches INs and CommNets are analyzed in detail in Sec. 2.
Another recent approach is PointNet [18], where every point in a point cloud is embedded by a deep neural net and all embeddings are pooled globally.
The resulting descriptor is used for classification and segmentation. Although a related approach, that paper is focused on 3D point clouds rather than multi-agent systems. A different approach is presented by Social LSTM [10], which learns social interaction by jointly training multiple interacting LSTMs. The complexity of that approach is quadratic in the number of agents, requiring the use of local pooling that only deals with short-range interactions to limit the number of interacting bodies.
The attentional mechanism in VAIN has some connection to Memory Networks [19, 20] and Neural Turing Machines [21]. Other works dealing with multi-agent reinforcement learning include [22] and [23].
There has been much work on board game bots (although the approach of modeling board games as interactions in a neural network multi-agent system is new). Approaches include [24, 25] for Chess, [26, 27, 28] for Backgammon and [29] for Go.

Concurrent work: We found on arXiv two concurrent submissions which are relevant to this work. Santoro et al. [30] discovered that an architecture nearly identical to Interaction Nets achieves excellent performance on the CLEVR dataset [31]. We leave a comparison on CLEVR for future work. Vaswani et al. [32] use an architecture that bears similarity to VAIN for achieving state-of-the-art performance in machine translation. The differences between our work and Vaswani et al.'s concurrent work are substantial in application and precise details.

2 Factorizing Multi-Agent Interactions

In this section we give an informal analysis of the multi-agent interaction architectures presented by Interaction Networks [9], CommNets [11] and VAIN.
Interaction Networks model each interaction by a neural network. For simplicity of analysis, let us restrict the interactions to be of 2nd order. Let ψint(xi, xj) be the interaction between agents Ai and Aj, and φ(xi) be the non-interacting features of agent Ai.
The output is given by a function θ() of the sum of all of the interactions of Ai, Σ_{j≠i} ψint(xi, xj), and of the non-interacting features φ(xi):

oi = θ( Σ_{j≠i} ψint(xi, xj), φ(xi) )    (1)

A single step evaluation of the output for the entire system requires O(N^2) evaluations of ψint().
An alternative architecture is presented by CommNets, where interactions are not modeled explicitly. Instead an interaction vector ψcom(xi) is computed for each agent. The output is computed by:

oi = θ( Σ_{j≠i} ψcom(xj), φ(xi) )    (2)

A single step evaluation of the CommNet architecture requires O(N) evaluations of ψcom(). A significant drawback of this representation is that it does not explicitly model the interactions, putting the whole burden of modeling on θ. This can often result in weaker performance (as shown in our experiments).
VAIN's architecture preserves the complexity advantages of CommNet while addressing its limitations in comparison to IN. Instead of requiring a full network evaluation for every interaction pair ψint(xi, xj), it learns a communication vector ψ^c_vain(xi) for each agent and additionally an attention vector ai = ψ^a_vain(xi). The strength of interaction between agents is modulated by the kernel function e^(−||ai−aj||^2).
The interaction is approximated by:

ψint(xi, xj) = e^(−||ai−aj||^2) ψvain(xj)    (3)

The output is given by:

oi = θ( Σ_{j≠i} e^(−||ai−aj||^2) ψvain(xj), φ(xi) )    (4)

In cases where the kernel function is a good approximation for the relative strength of interaction (in some high-dimensional linear space), VAIN presents an efficient linear approximation of IN which preserves CommNet's complexity in ψ().
Although physical interactions are often additive, many other interesting cases (games, social scenarios, team play) are not additive. In such cases the average instead of the sum of ψ should be used (in [9] only physical scenarios were presented and therefore the sum was always used, whereas in [11] only non-physical cases were considered and therefore only averaging was used). In non-additive cases VAIN uses a softmax:

Ki,j = e^(−||ai−aj||^2) / Σ_j e^(−||ai−aj||^2)    (5)

3 Model Architecture

In this section we model the interaction between N agents, denoted by A1...AN. The output can either be a prediction for every agent or a system-level prediction (e.g. predict which agent will act next).

Figure 1: A schematic of a single-hop VAIN: i) The agent features Fi are embedded by singleton encoder Es() to yield encoding es_i, and by communications encoder Ec() to yield vector ec_i and attention vector ai. ii) For each agent an attention-weighted sum of all embeddings ec_j is computed: Pi = Σ_j wi,j * ec_j. The attention weights wi,j are computed by a Softmax over −||ai − aj||^2. The diagonal wi,i is set to zero to exclude self-interactions. iii) The singleton codes es_i are concatenated with the pooled feature Pi to yield intermediate feature Ci. iv) The feature is passed through decoding network D() to yield per-agent vector oi. For Regression: oi is the final output of the network.
v) For Classification: oi is scalar and is passed through a Softmax.

Although it is possible to use multiple hops, our presentation here only uses a single hop (and multiple hops did not help in our experiments).
Features are extracted for every agent Ai and we denote the features by Fi. The features are guided by basic domain knowledge (such as agent type or position).
We use two agent encoding functions: i) a singleton encoder Es() for single-agent features, and ii) a communication encoder Ec() for interaction with other agents. The singleton encoding function Es() is applied to all agent features Fi to yield singleton encodings es_i:

Es(Fi) = es_i    (6)

We define the communication encoding function Ec(). The encoding function is applied to all agent features Fi to yield both an encoding ec_i and an attention vector ai. The attention vector is used for addressing the agents with whom information exchange is sought. Ec() is implemented by fully connected neural networks (from now on, FCNs).

Ec(Fi) = (ec_i, ai)    (7)

For each agent we compute the pooled feature Pi, the attention-weighted sum of the interaction vectors from the other agents. We exclude self-interactions by setting the self-interaction weight to 0:

Pi = Σ_j ec_j * Softmax_j(−||ai − aj||^2) * (1 − δ_{j=i})    (8)

This is in contrast to the average pooling mechanism used in CommNets, and we show that it yields better results. The motivation is to average only information from relevant agents (e.g. nearby or particularly influential agents). The weights wi,j = Softmax_j(−||ai − aj||^2) give a measure of the interaction between agents. Although naively this operation scales quadratically in the number of agents, it is multiplied by the feature dimension rather than by a full Ec() evaluation, and is therefore significantly smaller than the cost of the (linear number of) Ec() evaluations carried out by the algorithm.
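The attention-weighted pooling of Eq. 8 can be sketched directly (a minimal numpy illustration with random encodings standing in for the trained Ec() outputs; names and dimensions are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_a, d_c = 6, 10, 16          # agents, attention dim (10 as in the paper), comm dim
a = rng.normal(size=(N, d_a))    # attention vectors a_i from E_c
e_c = rng.normal(size=(N, d_c))  # communication vectors e^c_i from E_c

# Negative squared distances -||a_i - a_j||^2, softmax over j, zero self-weight.
d2 = ((a[:, None, :] - a[None, :, :]) ** 2).sum(-1)      # (N, N) pairwise distances
logits = -d2
np.fill_diagonal(logits, -np.inf)                        # exclude self-interaction
w = np.exp(logits - logits.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)                        # w[i, j] = Softmax_j(-||a_i - a_j||^2)

P = w @ e_c                                              # pooled features P_i, shape (N, d_c)
```

Note that only the weight matrix is quadratic in N, and it costs a d_a-dimensional distance per pair rather than a network evaluation; the encoder itself runs once per agent.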
In case the number of agents is very large (>1000) the cost can still be mitigated: the Softmax operation often yields a sparse matrix, and in such cases the interaction can be modeled by the K nearest neighbors (measured by attention). This calculation is far cheaper than evaluating Ec() O(N^2) times as in IN. In cases where even this cheap operation is too expensive, we recommend using CommNets as a default, as they truly have O(N) complexity.
The pooled feature Pi is concatenated with the singleton encoding es_i to form intermediate features Ci:

Ci = (Pi, es_i)    (9)

The features Ci are passed through the decoding function D(), which is also implemented by FCNs. The result is denoted by oi:

oi = D(Ci)    (10)

For regression problems, oi is the per-agent output of VAIN. For classification problems, D() is designed to give scalar outputs. The result is passed through a softmax layer yielding agent probabilities:

Prob(i) = Softmax(oi)    (11)

Several advantages of VAIN over Interaction Networks [9] are apparent:
Representational Power: VAIN does not assume that the interaction graph is pre-specified (in fact the attention weights wi,j learn the graph). Pre-specifying the graph structure is advantageous when it is clearly known, e.g. spring systems where locality makes a significant difference. In many multi-agent scenarios the graph structure is not known a priori. Multiple hops can give VAIN the potential to model higher-order interactions than IN, although this was not found to be advantageous in our experiments.
Complexity: As explained in Sec. 2, VAIN features better complexity than INs.
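Putting Eqs. 6-11 together, a full single-hop forward pass for an agent-selection task can be sketched as follows (random, untrained weights; the layer widths and dimensions are illustrative, not the paper's trained configuration):

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(d_in, d_out):
    """A tiny random two-layer FCN standing in for a trained encoder/decoder."""
    W1, W2 = rng.normal(size=(d_in, 64)) * 0.1, rng.normal(size=(64, d_out)) * 0.1
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

N, d_f, d_e, d_a = 8, 12, 32, 10           # agents, feature/encoding/attention dims
F = rng.normal(size=(N, d_f))              # per-agent input features F_i

E_s = mlp(d_f, d_e)                        # singleton encoder (Eq. 6)
E_c = mlp(d_f, d_e + d_a)                  # communication encoder (Eq. 7)
D = mlp(2 * d_e, 1)                        # decoder to a scalar "vote" per agent

e_s = E_s(F)
ec_a = E_c(F)
e_c, a = ec_a[:, :d_e], ec_a[:, d_e:]      # split into e^c_i and attention a_i

logits = -((a[:, None, :] - a[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(logits, -np.inf)          # no self-interaction
w = np.exp(logits - logits.max(1, keepdims=True))
w /= w.sum(1, keepdims=True)
P = w @ e_c                                # attention-weighted pooling (Eq. 8)

C = np.concatenate([P, e_s], axis=1)       # intermediate features C_i (Eq. 9)
o = D(C)[:, 0]                             # per-agent scalar outputs o_i (Eq. 10)
prob = np.exp(o - o.max()); prob /= prob.sum()   # agent probabilities (Eq. 11)
```

Note there are exactly N evaluations each of E_s, E_c and D, which is the linear-complexity property the section describes.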
The complexity advantage increases with the order of interaction.

4 Evaluation

We presented VAIN, an efficient attentional model for predictive modeling of multi-agent interactions. In this section we show that our model achieves better results than competing methods while having a lower computational complexity.
We perform experiments on tasks from different multi-agent domains to highlight the utility and generality of VAIN: chess move prediction, soccer player prediction and physical simulation.

4.1 Chess Piece Prediction

Chess is a board game involving complex multi-agent interactions. Chess is difficult from a multi-agent perspective due to having 12 different types of agents and non-local high-order interactions. In this experiment we do not attempt to create an optimal chess player. Rather, we are given a board position from a professional game, and our task is to identify the piece that will move next (MPP). There are 32 possible pieces, each encoded by one-hot encodings of piece type, x position and y position. Missing pieces are encoded with all zeros. The output is the id of the piece that will move next.
For training and evaluation of this task we downloaded 10k games from the FICS Games Dataset, an on-line repository of chess games. All the games used are standard games between professionally ranked players. 9k randomly sampled games were used for training, and the remaining 1k games for evaluation. Moves later in the game than 100 (i.e. 50 Black and 50 White moves) were dropped from the dataset so as not to bias it towards particularly long games. The total number of moves is around 600k.
We use the following methods for evaluation: Rand: Random piece selection. FC: A standard FCN with three hidden layers (64 hidden nodes each); this method requires indexing to be learned. SMax: Each piece is encoded by a neural network into a scalar "vote".
The "votes" from all input pieces are fed to a Softmax classifier predicting the output label. This approach does not require learning to index, but cannot model interactions. 1hop−FC: Each piece is encoded as in SMax, but to a vector rather than a scalar; a deep (3-layer) classifier predicts the MPP from the concatenation of the vectors. CommNet: A standard CommNet (no attention) [11]; the protocol for CommNet is the same as for VAIN. IN: An Interaction Network followed by a Softmax (as for VAIN); inference for this IN required around 8 times more computation than VAIN and CommNet. ours: VAIN.

Table 1: Accuracy (%) for the Next Moving Piece (MPP) experiments.

Rand  FC    SMax  1hop−FC  CommNet  IN    ours
4.5   21.6  13.3  18.6     27.2     28.3  30.1

The results for next moving chess piece prediction can be seen in Table 1. Our method clearly outperforms the competing baselines, illustrating that VAIN is effective at selection-type problems - i.e. selecting 1-of-N agents according to some criterion (in this case likelihood to move). The non-interactive method SMax performs much better than Rand (+9%) due to its use of the statistics of moves. Interactive methods (FC, 1hop−FC, CommNet, IN and VAIN) naturally perform better, as the interactions between pieces are important for deciding the next mover. It is interesting that the simple FC method performs better than 1hop−FC (+3%); we think this is because the classifier in 1hop−FC finds it hard to recover the indexes after the average pooling layer. This shows that one-hop networks followed by fully connected classifiers (such as the original formulation of Interaction Networks) struggle at selection-type problems. Our method VAIN performs much better than 1hop−FC (+11.5%) due to the per-vertex outputs oi and the coupling between agents.
VAIN also performs significantly better than FC (+8.5%) as it does not have to learn indexing. It outperforms vanilla CommNet by 2.9%, showing the advantages of our attentional mechanism. It also outperforms INs followed by a per-agent Softmax (similar to the formulation for VAIN) by 1.8%, even though the IN performs around 8 times more computation than VAIN.

4.2 Soccer Players

Team-player interaction is a promising application area for end-to-end multi-agent modeling, as the rules of sports interaction are quite complex and not easily formulated by hand-coded rules. An additional advantage is that predictive modeling can be self-supervised and no labeled data is necessary. In team-play situations many agents may be present and interacting at the same time, making the complexity of the method critical for its application.
In order to evaluate the performance of VAIN on team-play interactions, we use the Soccer Video and Player Position Dataset (SVPP) [33]. The SVPP dataset contains the parameters of soccer players tracked during two home matches played by Tromsø IL, a Norwegian soccer team. The sensors were positioned on each home team player, and recorded the player's location, heading direction and movement velocity (as well as other parameters that we did not use in this work). The data was re-sampled by [33] to occur at regular 20 Hz intervals. We further subsampled the data to 2 Hz. We only use sensor data rather than raw pixels; end-to-end inference from raw-pixel data is left to future work.
The task that we use for evaluation is predicting, from the current state of all players, the position of each player at each time-step during the next 4 seconds (i.e. at T + 0.5, T + 1.0 ... T + 4.0). Note that for this task we just use a single frame rather than several previous frames, and therefore do not use RNN encoders.
We use the following methods for evaluation: Static: trivial prediction of 0-motion.
PALV: Linearly extrapolating the agent displacement by the current linear velocity. PALAF: A linear regressor predicting the agent's velocity using all features, including the velocity, but also the agent's heading direction and, most significantly, the agent's current field position. PAD: a predictive model using all the above features but with three fully-connected layers (with 256, 256 and 16 nodes). CommNet: A standard CommNet (no attention) [11]; the protocol for CommNet is the same as for VAIN. IN: An Interaction Network [9], requiring O(N^2) network evaluations. ours: VAIN.
We excluded the second half of the Anzhi match due to large sensor errors for some of the players (occasional 60 m position changes in 1-2 seconds).
A few visualizations of the Soccer scenario can be seen in Fig. 4. The positions of the players are indicated by green circles, apart from a target player (chosen by us), who is indicated by a blue circle. The brightness of each circle is chosen to be proportional to the strength of attention between that player and the target player. Arrows are proportional to player velocity. We can see in this scenario that the attention to the nearest players (attackers to attackers, midfielders to midfielders) is strongest, but attention is given to all field players. The goal keeper normally receives no attention (due to being far away, and in normal situations not affecting play). This is an example of mean-field rather than sparse attention.

Figure 2: a) A soccer match used for the Soccer task. b) A chess position illustrating the high-order nature of the interactions in next move prediction.
Note that in both cases, VAIN uses agent positional and sensor data rather than raw pixels.

Table 2: Soccer Prediction errors (meters).

Dataset  Time-step  Static  PALV  PALAF  PAD   IN    CommNet  ours
1103a    0.5        0.54    0.14  0.14   0.14  0.16  0.15     0.14
1103a    2.0        1.99    1.16  1.14   1.13  1.09  1.10     1.09
1103a    4.0        3.58    2.67  2.62   2.58  2.47  2.48     2.47
1103b    0.5        0.49    0.13  0.13   0.13  0.14  0.13     0.13
1103b    2.0        1.81    1.06  1.06   1.04  1.02  1.02     1.02
1103b    4.0        3.27    2.42  2.41   2.38  2.30  2.31     2.30
1107a    0.5        0.61    0.17  0.17   0.17  0.17  0.17     0.17
1107a    2.0        2.23    1.36  1.34   1.32  1.26  1.26     1.25
1107a    4.0        3.95    3.10  3.03   2.99  2.82  2.81     2.79
Mean                1.84    1.11  1.10   1.08  1.04  1.04     1.03

We evaluated our methods on the SVPP dataset. The prediction errors in Table 2 are broken down by time-step and by train / test dataset split. It can be seen that the non-interactive baselines generally fare poorly on this task, as the general configuration of agents is informative for the motion of agents beyond a simple extrapolation of motion. Examples of patterns that can be picked up include: running back to the goal to help the defenders, and running up to the other team's goal area to join an attack. A linear model including all the features performs better than a velocity-only model (as position is very informative). A non-linear per-player model with all features improves on the linear models. The Interaction Network, CommNet and VAIN significantly outperform the non-interactive methods. VAIN outperformed CommNet and IN, achieving this with only 4% of the number of encoder evaluations performed by IN.
This validates our premise that VAIN's architecture can model object interactions without modeling each interaction explicitly.

4.3 Bouncing Balls

Following Battaglia et al. [9], we present a simple physics-based experiment. In this scenario, balls are bouncing inside a 2D square container of size L. There are N identical balls (we use N = 50) which are of constant size and are perfectly elastic. The balls are initialized at random positions and with initial velocities sampled at random from [−v0..v0] (we use v0 = 3 ms^−1). The balls collide with other balls and with the walls, where the collisions are governed by the laws of elastic collisions.
The task which we evaluate is the prediction of the displacement and change in velocity of each ball in the next time step. We evaluate the prediction accuracy of our method VAIN as well as Interaction Networks [9] and CommNets [11]. We found it useful to replace VAIN's attention mechanism by an unnormalized attention function due to the additive nature of physical forces:

pi,j = e^(−||ai−aj||^2) − δi,j    (12)

In Fig. 4 we can observe the attention maps for two different balls in the Bouncing Balls scenario. The position of each ball is represented by a circle. The velocity of each ball is indicated by a line extending from the center of the circle; the length of the line is proportional to the speed of the ball.

Figure 3: Accuracy differences between VAIN and IN for different computation budgets: VAIN outperforms IN by spending its computation budget on a few larger networks (one for each agent) rather than many small networks (one for every pair of agents). This is even more significant for small computation budgets.

Table 3: RMS accuracy of Bouncing Ball next step prediction.

       VEL0   VEL-CONST  COMMNET  IN     VAIN
RMS    0.561  0.510      0.547    0.139  0.135

For each figure we choose a target ball Ai and paint it blue.
The attention strength of each agent Aj with respect to Ai is indicated by the shade of the circle: the brighter the circle, the stronger the attention. In the first scenario we observe that the two balls near the target receive attention whereas other balls are suppressed. This shows that the system exploits the sparsity due to locality that is inherent to this multi-agent system. In the second scenario we observe that the ball on a collision course with the target receives much stronger attention, relative to a ball that is much closer to the target but is not likely to collide with it. This indicates that VAIN learns important attention features beyond the simple positional hand-crafted features typically used.
The results of our bouncing balls experiments can be seen in Tab. 3. We see that in this physical scenario VAIN significantly outperformed CommNets, and achieves better performance than Interaction Networks for similar computation budgets. In Fig. 3 we see that the difference increases for small computation budgets. The attention mechanism is shown to be critical to the success of the method.

4.4 Analysis and Limitations

Our experiments showed that VAIN achieves better performance than other architectures of similar complexity, and equivalent performance to higher-complexity architectures, mainly due to its attention mechanism. There are two ways in which the attention mechanism implicitly encodes the interactions of the system: i) Sparse: if only a few agents significantly interact with agent Ao, the attention mechanism will highlight these agents (finding the K spatial nearest neighbors is a special case of such attention). In this case CommNets will fail. ii) Mean-field: if a space can be found where the important interactions act in an additive way (e.g. in the soccer team dynamics scenario), the attention mechanism will find the correct weights for the mean field. In this case CommNets would work,
In this case CommNets would work,\nbut VAIN can still improve on them.\nVAIN is less well-suited for cases where both: interactions are not sparse such that the K most\nimportant interactions will not give a good representation and where the interactions are strong and\nhighly non-linear so that a mean-\ufb01eld approximation is non-trivial. One such scenario is the M body\ngravitation problem. Interaction Networks are particularly well suited for this scenario and VAIN\u2019s\nfactorization will not yield an advantage.\nImplementation\n\n8\n\n\fBouncing Balls (a)\n\nBouncing Balls (b)\n\nSoccer (a)\n\nSoccer (b)\n\nFigure 4: A visualization of attention in the Bouncing Balls and Soccer scenarios. The target ball\nis blue, and others are green. The brightness of each ball indicates the strength of attention with\nrespect to the (blue) target ball. The arrows indicate direction of motion. Bouncing Balls: Left\nimage: The ball nearer to target ball receives stronger attention. Right image: The ball on collision\ncourse with the target ball receives much stronger attention than the nearest neighbor of the target\nball. Soccer: This is an example of mean-\ufb01eld type attention, where the nearest-neighbors receive\nprivileged attention, but also all other \ufb01eld players receive roughly equal attention. The goal keeper\ntypically receives no attention due to being far away.\n\nSoccer: The encoding and decoding functions Ec(), Es() and D() were implemented by fully-\nconnected neural networks with two layers, each of 256 hidden units and with ReLU activations. The\nencoder outputs had 128 units. For IN each layer was followed by a BatchNorm layer (otherwise the\nsystem converged slowly to a worse minimum). For VAIN no BatchNorm layers were used. Chess:\nThe encoding and decoding functions E() and D() were implemented by fully-connected neural\nnetworks with three layers, each of width 64 and with ReLU activations. 
They were followed by\nBatchNorm layers for both IN and VAIN. Bouncing Balls: The encoding and decoding function Ec(),\nEs() and D() were implemented with FCNs with 256 hidden units and three layer. The encoder\noutputs had 128 units. No BatchNorm units were used. For Soccer, Ec() and D() architectures\nfor VAIN and IN was the same. For Chess we evaluate INs with Ec() being 4 times smaller than\nfor VAIN, this still takes 8 times as much computation as used by VAIN. For Bouncing Balls the\ncomputation budget was balanced between VAIN and IN by decreasing the number of hidden units in\nEc() for IN by a constant factor.\nIn all scenarios the attention vector ai is of dimension 10 and shared features with the encoding\nvectors ei. Regression problems were trained with L2 loss, and classi\ufb01cation problems were trained\nwith cross-entropy loss. All methods were implemented in PyTorch [34] in a Linux environment.\nEnd-to-end optimization was carried out using ADAM [35] with \u03b1 = 1e\u2212 3 and no L2 regularization\nwas used. The learning rate was halved every 10 epochs. The chess prediction training for the MPP\ntook several hours on a M40 GPU, other tasks had shorter training times due to smaller datasets.\n\n5 Conclusion and Future Work\n\nWe have shown that VAIN, a novel architecture for factorizing interaction graphs, is effective\nfor predictive modeling of multi-agent systems with a linear number of neural network encoder\nevaluations. We analyzed how our architecture relates to Interaction Networks and CommNets.\nExamples were shown where our approach learned some of the rules of the multi-agent system. An\ninteresting future direction to pursue is interpreting the rules of the game in symbolic form, from\nVAIN\u2019s attention maps wi,j. 
Initial experiments that we performed have shown that some chess rules can be learned (movement of pieces, relative values of pieces), but further research is required.

Acknowledgements

We thank Rob Fergus for significant contributions to this work. We also thank Gabriel Synnaeve and Arthur Szlam for fruitful comments on the manuscript.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[2] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 2014.

[3] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.

[4] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[5] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.

[6] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In ICML, 2016.

[7] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.

[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural Computation, 1997.

[9] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In NIPS, 2016.

[10] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.

[11] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In NIPS, 2016.

[12] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009.

[13] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In IJCNN, 2005.

[14] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. ICLR, 2016.

[15] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. ICLR, 2016.

[16] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. ICLR, 2014.

[17] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.

[18] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. CVPR, 2017.

[19] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

[20] Sainbayar Sukhbaatar, Jason Weston, and Rob Fergus. End-to-end memory networks. In NIPS, 2015.

[21] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

[22] Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala.
Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks. ICLR, 2017.

[23] Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.

[24] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[25] Yuandong Tian and Yan Zhu. Better computer Go player with neural network and long-term prediction. ICLR, 2016.

[26] Murray Campbell, A Joseph Hoane, and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence, 2002.

[27] Matthew Lai. Giraffe: Using deep reinforcement learning to play chess. arXiv preprint arXiv:1509.01549, 2015.

[28] Omid E David, Nathan S Netanyahu, and Lior Wolf. DeepChess: End-to-end deep neural network for automatic learning in chess. In ICANN, 2016.

[29] Gerald Tesauro. Neurogammon: A neural-network backgammon program. In IJCNN, 1990.

[30] Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017.

[31] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv preprint arXiv:1612.06890, 2016.

[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
arXiv preprint arXiv:1706.03762, 2017.

[33] Svein Arne Pettersen, Dag Johansen, Håvard Johansen, Vegard Berg-Johansen, Vamsidhar Reddy Gaddam, Asgeir Mortensen, Ragnar Langseth, Carsten Griwodz, Håkon Kvale Stensland, and Pål Halvorsen. Soccer video and player position dataset. In Proceedings of the 5th ACM Multimedia Systems Conference, pages 18–23. ACM, 2014.

[34] PyTorch. https://github.com/pytorch/pytorch/, 2017.

[35] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.