{"title": "Interaction Networks for Learning about Objects, Relations and Physics", "book": "Advances in Neural Information Processing Systems", "page_first": 4502, "page_last": 4510, "abstract": "Reasoning about objects, relations, and physics is central to human intelligence, and a key goal of artificial intelligence. Here we introduce the interaction network, a model which can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Our model takes graphs as input, performs object- and relation-centric reasoning in a way that is analogous to a simulation, and is implemented using deep neural networks. We evaluate its ability to reason about several challenging physical domains: n-body problems, rigid-body collision, and non-rigid dynamics. Our results show it can be trained to accurately simulate the physical trajectories of dozens of objects over thousands of time steps, estimate abstract quantities such as energy, and generalize automatically to systems with different numbers and configurations of objects and relations. Our interaction network implementation is the first general-purpose, learnable physics engine, and a powerful general framework for reasoning about object and relations in a wide variety of complex real-world domains.", "full_text": "Interaction Networks for Learning about Objects,\n\nRelations and Physics\n\nAnonymous Author(s)\n\nAf\ufb01liation\nAddress\nemail\n\nAbstract\n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n\n16\n\n17\n18\n19\n20\n21\n22\n23\n24\n\n25\n26\n27\n28\n29\n30\n31\n\n32\n33\n34\n35\n36\n\nReasoning about objects, relations, and physics is central to human intelligence, and\na key goal of arti\ufb01cial intelligence. 
Here we introduce the interaction network, a model which can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Our model takes graphs as input, performs object- and relation-centric reasoning in a way that is analogous to a simulation, and is implemented using deep neural networks. We evaluate its ability to reason about several challenging physical domains: n-body problems, rigid-body collision, and non-rigid dynamics. Our results show it can be trained to accurately simulate the physical trajectories of dozens of objects over thousands of time steps, estimate abstract quantities such as energy, and generalize automatically to systems with different numbers and configurations of objects and relations. Our interaction network implementation is the first general-purpose, learnable physics engine, and a powerful general framework for reasoning about objects and relations in a wide variety of complex real-world domains.

1 Introduction

Representing and reasoning about objects, relations and physics is a "core" domain of human common sense knowledge [25], and among the most basic and important aspects of intelligence [27, 15]. Many everyday problems, such as predicting what will happen next in physical environments or inferring underlying properties of complex scenes, are challenging because their elements can be composed in combinatorially many possible arrangements. People can nevertheless solve such problems by decomposing the scenario into distinct objects and relations, and reasoning about the consequences of their interactions and dynamics. Here we introduce the interaction network, a model that can perform an analogous form of reasoning about objects and relations in complex systems.

Interaction networks combine three powerful approaches: structured models, simulation, and deep learning.
Structured models [7] can exploit rich, explicit knowledge of relations among objects, independent of the objects themselves, which supports general-purpose reasoning across diverse contexts. Simulation is an effective method for approximating dynamical systems, predicting how the elements in a complex system are influenced by interactions with one another, and by the dynamics of the system. Deep learning [23, 16] couples generic architectures with efficient optimization algorithms to provide highly scalable learning and inference in challenging real-world settings.

Interaction networks explicitly separate how they reason about relations from how they reason about objects, assigning each task to distinct models which are fundamentally object- and relation-centric, and independent of the observation modality and task specification (see the Model section 2 below and Fig. 1a). This lets interaction networks automatically generalize their learning across variable numbers of arbitrarily ordered objects and relations, and also recompose their knowledge of entities and interactions in novel and combinatorially many ways. They take relations as explicit input, allowing them to selectively process different potential interactions for different input data, rather than being forced to consider every possible interaction or those imposed by a fixed architecture.

We evaluate interaction networks by testing their ability to make predictions and inferences about various physical systems, including n-body problems, rigid-body collision, and non-rigid dynamics. Our interaction networks learn to capture the complex interactions that can be used to predict future states and abstract physical properties, such as energy. We show that they can roll out thousands of realistic future state predictions, even when trained only on single-step predictions. We also explore how they generalize to novel systems with different numbers and configurations of elements. Though they are not restricted to physical reasoning, the interaction networks used here represent the first general-purpose learnable physics engine, and even have the potential to learn novel physical systems for which no physics engines currently exist.

Submitted to 30th Conference on Neural Information Processing Systems (NIPS 2016). Do not distribute.

Figure 1: Schematic of an interaction network. a. For physical reasoning, the model takes objects and relations as input, reasons about their interactions, and applies the effects and physical dynamics to predict new states. b. For more complex systems, the model takes as input a graph that represents a system of objects, oj, and relations, ⟨i, j, rk⟩k, instantiates the pairwise interaction terms, bk, and computes their effects, ek, via a relational model, fR(·). The ek are then aggregated and combined with the oj and external effects, xj, to generate input (as cj) for an object model, fO(·), which predicts how the interactions and dynamics influence the objects, p.

Related work Our model draws inspiration from previous work that reasons about graphs and relations using neural networks. The "graph neural network" [22] is a framework that shares learning across nodes and edges, the "recursive autoencoder" [24] adapts its processing architecture to exploit an input parse tree, the "neural programmer-interpreter" [21] is a composable neural network that mimics the execution trace of a program, and the "spatial transformer" [11] learns to dynamically modify network connectivity to capture certain types of interactions. Others have explored deep learning of logical and arithmetic relations [26], and relations suitable for visual question-answering [1].

The behavior of our model is similar in spirit to a physical simulation engine [2], which generates sequences of states by repeatedly applying rules that approximate the effects of physical interactions and dynamics on objects over time. The interaction rules are relation-centric, operating on two or more objects that are interacting, and the dynamics rules are object-centric, operating on individual objects and the aggregated effects of the interactions they participate in.

Previous AI work on physical reasoning explored commonsense knowledge, qualitative representations, and simulation techniques for approximating physical prediction and inference [28, 9, 6]. The "NeuroAnimator" [8] was perhaps the first quantitative approach to learning physical dynamics, by training neural networks to predict and control the state of articulated bodies. Ladický et al. [14] recently used regression forests to learn fluid dynamics. Recent advances in convolutional neural networks (CNNs) have led to efforts that learn to predict coarse-grained physical dynamics from images [19, 17, 18]. Notably, Fragkiadaki et al. [5] used CNNs to predict and control a moving ball from an image centered at its coordinates. Mottaghi et al. [20] trained CNNs to predict the 3D trajectory of an object after an external impulse is applied. Wu et al.
[29] used CNNs to parse objects from images, which were then input to a physics engine that supported prediction and inference.

2 Model

Definition To describe our model, we use physical reasoning as an example (Fig. 1a), and build from a simple model to the full interaction network (abbreviated IN). To predict the dynamics of a single object, one might use an object-centric function, fO, which inputs the object's state, ot, at time t, and outputs a future state, ot+1. If two or more objects are governed by the same dynamics, fO could be applied to each, independently, to predict their respective future states. But if the objects interact with one another, then fO is insufficient because it does not capture their relationship. Assuming two objects and one directed relationship, e.g., a fixed object attached by a spring to a freely moving mass, the first (the sender, o1) influences the second (the receiver, o2) via their interaction. The effect of this interaction, et+1, can be predicted by a relation-centric function, fR. The fR takes as input o1, o2, as well as attributes of their relationship, r, e.g., the spring constant. The fO is modified so it can input both et+1 and the receiver's current state, o2,t, enabling the interaction to influence its future state, o2,t+1:

    et+1 = fR(o1,t, o2,t, r)        o2,t+1 = fO(o2,t, et+1)

The above formulation can be expanded to larger and more complex systems by representing them as a graph, G = ⟨O, R⟩, where the nodes, O, correspond to the objects, and the edges, R, to the relations (see Fig. 1b). We assume an attributed, directed multigraph because the relations have attributes, and there can be multiple distinct relations between two objects (e.g., rigid and magnetic interactions). For a system with NO objects and NR relations, the inputs to the IN are,

    O = {oj}j=1...NO,    R = {⟨i, j, rk⟩k}k=1...NR where i ≠ j and 1 ≤ i, j ≤ NO,    X = {xj}j=1...NO

The O represents the states of each object. The triplet, ⟨i, j, rk⟩k, represents the k-th relation in the system, from sender, oi, to receiver, oj, with relation attribute, rk. The X represents external effects, such as active control inputs or gravitational acceleration, which we define as not being part of the system, and which are applied to each object separately.

The basic IN is defined as,

    IN(G) = φO(a(G, X, φR(m(G))))                                        (1)

    m(G) = B = {bk}k=1...NR        a(G, X, E) = C = {cj}j=1...NO
    fR(bk) = ek                    fO(cj) = pj
    φR(B) = E = {ek}k=1...NR       φO(C) = P = {pj}j=1...NO              (2)

The marshalling function, m, rearranges the objects and relations into interaction terms, bk = ⟨oi, oj, rk⟩ ∈ B, one per relation, which correspond to each interaction's receiver, sender, and relation attributes. The relational model, φR, predicts the effect of each interaction, ek ∈ E, by applying fR to each bk. The aggregation function, a, collects all effects, ek ∈ E, that apply to each receiver object, merges them, and combines them with O and X to form a set of object model inputs, cj ∈ C, one per object. The object model, φO, predicts how the interactions and dynamics influence the objects by applying fO to each cj, and returning the results, pj ∈ P.
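In code, one IN step amounts to these four functions applied in sequence. Below is a minimal, loop-based sketch of equations (1)-(2); `f_R` and `f_O` are toy stand-in callables (not the learned MLPs described later), and the helper name `interaction_network` is ours:

```python
import numpy as np

def interaction_network(O, R, X, f_R, f_O):
    """One interaction-network step (eqs. 1-2), written as explicit loops.
    Assumes at least one relation. f_R and f_O are arbitrary callables here,
    standing in for the learned models."""
    # m(G): marshal one interaction term b_k = (o_i, o_j, r_k) per relation.
    B = [(O[i], O[j], r_k) for (i, j, r_k) in R]
    # phi_R: apply f_R to every interaction term to get its effect e_k.
    E = [f_R(o_i, o_j, r_k) for (o_i, o_j, r_k) in B]
    # a(G, X, E): sum the effects arriving at each receiver j (summation is
    # commutative and associative, as the text requires of the aggregation).
    e_bar = [np.zeros_like(E[0]) for _ in O]
    for (i, j, _), e_k in zip(R, E):
        e_bar[j] = e_bar[j] + e_k
    # phi_O: apply f_O to every object input c_j = (o_j, x_j, aggregated e).
    return [f_O(o_j, x_j, e) for o_j, x_j, e in zip(O, X, e_bar)]
```

Because `f_R` and `f_O` are shared across relations and objects, the same function handles any number of objects and relations, which is the source of the generalization discussed below.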
This basic IN can predict the evolution of states in a dynamical system – for physical simulation, P may equal the future states of the objects, Ot+1.

The IN can also be augmented with an additional component to make abstract inferences about the system. The pj ∈ P, rather than serving as output, can be combined by another aggregation function, g, and input to an abstraction model, φA, which returns a single output, q, for the whole system. We explore this variant in our final experiments that use the IN to predict potential energy.

An IN applies the same fR and fO to every bk and cj, respectively, which makes their relational and object reasoning able to handle variable numbers of arbitrarily ordered objects and relations. But one additional constraint must be satisfied to maintain this: the a function must be commutative and associative over the objects and relations. Using summation within a to merge the elements of E into C satisfies this, but division would not.

Here we focus on binary relations, which means there is one interaction term per relation, but another option is to have the interactions correspond to n-th order relations by combining n senders in each bk. The interactions could even have variable order, where each bk includes all sender objects that interact with a receiver, but would require an fR that can handle variable-length inputs.
These possibilities are beyond the scope of this work, but are interesting future directions.

Implementation The general definition of the IN in the previous section is agnostic to the choice of functions and algorithms, but we now outline a learnable implementation capable of reasoning about complex systems with nonlinear relations and dynamics. We use standard deep neural network building blocks, multilayer perceptrons (MLP), matrix operations, etc., which can be trained efficiently from data using gradient-based optimization, such as stochastic gradient descent.

We define O as a DS × NO matrix, whose columns correspond to the objects' DS-length state vectors. The relations are a triplet, R = ⟨Rr, Rs, Ra⟩, where Rr and Rs are NO × NR binary matrices which index the receiver and sender objects, respectively, and Ra is a DR × NR matrix whose DR-length columns represent the NR relations' attributes. The k-th column of Rr is a one-hot vector which indicates the receiver object's index; Rs indicates the sender similarly. For the graph in Fig. 1b,

    Rr = [0 0; 1 1; 0 0]    and    Rs = [1 0; 0 0; 0 1]

(rows index objects, columns index relations, and a semicolon separates rows). The X is a DX × NO matrix, whose columns are DX-length vectors that represent the external effect applied to each of the NO objects.

The marshalling function, m, computes the matrix products, ORr and ORs, and concatenates them with Ra: m(G) = [ORr; ORs; Ra] = B. The resulting B is a (2DS + DR) × NR matrix, whose columns represent the interaction terms, bk, for the NR relations (we denote vertical and horizontal matrix concatenation with a semicolon and comma, respectively). The way m constructs interaction terms can be modified, as described in our Experiments section (3).

The B is input to φR, which applies fR, an MLP, to each column. The output of fR is a DE-length vector, ek, a distributed representation of the effects. The φR concatenates the NR effects to form the DE × NR effect matrix, E.

The G, X, and E are input to a, which computes the DE × NO matrix product, Ē = E RrT, whose j-th column is equivalent to the elementwise sum across all ek whose corresponding relation has receiver object, j. The Ē is concatenated with O and X: a(G, X, E) = [O; X; Ē] = C. The resulting C is a (DS + DX + DE) × NO matrix, whose NO columns represent the object states, external effects, and per-object aggregate interaction effects.

The C is input to φO, which applies fO, another MLP, to each of the NO columns. The output of fO is a DP-length vector, pj, and φO concatenates them to form the output matrix, P.

To infer abstract properties of a system, an additional φA is appended and takes P as input. The g aggregation function performs an elementwise sum across the columns of P to return a DP-length vector, P̄.
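The matrix operations above compose into a complete forward pass. The following NumPy sketch uses the Fig. 1b relation matrices and replaces the learned MLPs fR and fO with fixed random single-layer maps; the weights and dimension values are illustrative placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimension names follow the text: O is D_S x N_O, Ra is D_R x N_R,
# Rr/Rs are N_O x N_R one-hot receiver/sender index matrices.
N_O, N_R, D_S, D_R, D_X, D_E, D_P = 3, 2, 4, 1, 2, 5, 2
O  = rng.normal(size=(D_S, N_O))
Ra = rng.normal(size=(D_R, N_R))
X  = rng.normal(size=(D_X, N_O))
Rr = np.array([[0, 0], [1, 1], [0, 0]], dtype=float)  # receivers (Fig. 1b)
Rs = np.array([[1, 0], [0, 0], [0, 1]], dtype=float)  # senders

# Placeholder single-layer stand-ins for the MLPs f_R and f_O, applied
# column-wise to all relations / objects at once.
W_R = rng.normal(size=(D_E, 2 * D_S + D_R))
W_O = rng.normal(size=(D_P, D_S + D_X + D_E))
f_R = lambda B: np.tanh(W_R @ B)
f_O = lambda C: W_O @ C

B = np.vstack([O @ Rr, O @ Rs, Ra])   # m(G):   (2*D_S + D_R) x N_R
E = f_R(B)                            # phi_R:  D_E x N_R
E_bar = E @ Rr.T                      # per-receiver effect sums: D_E x N_O
C = np.vstack([O, X, E_bar])          # a(G,X,E): (D_S + D_X + D_E) x N_O
P = f_O(C)                            # phi_O:  D_P x N_O
```

The product `E @ Rr.T` is exactly the commutative, associative summation of effects per receiver: a column of `Rr.T` selects every relation whose receiver is that object.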
The P̄ is input to φA, another MLP, which returns a DA-length vector, q, that represents an abstract, global property of the system.

Training an IN requires optimizing an objective function over the learnable parameters of φR and φO. Note that m and a involve matrix operations that do not contain learnable parameters.

Because φR and φO are shared across all relations and objects, respectively, training them is statistically efficient. This is similar to CNNs, which are very efficient due to their weight-sharing scheme. A CNN treats a local neighborhood of pixels as related, interacting entities: each pixel is effectively a receiver object and its neighboring pixels are senders. The convolution operator is analogous to φR, where fR is the local linear/nonlinear kernel applied to each neighborhood. Skip connections, recently popularized by residual networks, are loosely analogous to how the IN inputs O to both φR and φO, though in CNNs relation- and object-centric reasoning are not delineated. But because CNNs exploit local interactions in a fixed way which is well-suited to the specific topology of images, capturing longer-range dependencies requires either broad, insensitive convolution kernels, or deep stacks of layers, in order to implement sufficiently large receptive fields. The IN avoids this restriction by being able to process arbitrary neighborhoods that are explicitly specified by the R input.

3 Experiments

Physical reasoning tasks Our experiments explored two types of physical reasoning tasks: predicting future states of a system, and estimating their abstract properties, specifically potential energy. We evaluated the IN's ability to learn to make these judgments in three complex physical domains: n-body systems; balls bouncing in a box; and strings composed of springs that collide with rigid objects.
We simulated the 2D trajectories of the elements of these systems with a physics engine, and recorded their sequences of states. See the Supplementary Material for full details.

In the n-body domain, such as solar systems, all n bodies exert distance- and mass-dependent gravitational forces on each other, so there were n(n − 1) relations input to our model. Across simulations, the objects' masses varied, while all other fixed attributes were held constant. The training scenes always included 6 bodies, and for testing we used 3, 6, and 12 bodies. In half of the systems, bodies were initialized with velocities that would cause stable orbits, if not for the interactions with other objects; the other half had random velocities.

In the bouncing balls domain, moving balls could collide with each other and with static walls. The walls were represented as objects whose shape attribute represented a rectangle, and whose inverse-mass was 0. The relations input to the model were between the n objects (which included the walls), for n(n − 1) relations. Collisions are more difficult to simulate than gravitational forces, and the data distribution was much more challenging: each ball participated in a collision on less than 1% of the steps, following straight-line motion at all other times. The model thus had to learn that despite there being a rigid relation between two objects, they only had meaningful collision interactions when they were in contact. We also varied more of the object attributes – shape, scale and mass (as before) – as well as the coefficient of restitution, which was a relation attribute. Training scenes contained 6 balls inside a box with 4 variably sized walls, and test scenes contained either 3, 6, or 9 balls.

The string domain used two types of relations (indicated in rk), relation structures that were more sparse and specific than all-to-all, as well as variable external effects. Each scene contained a string, comprised of masses connected by springs, and a static, rigid circle positioned below the string. The n masses had spring relations with their immediate neighbors (2(n − 1)), and all masses had rigid relations with the rigid object (2n). Gravitational acceleration, with a magnitude that was varied across simulation runs, was applied so that the string always fell, usually colliding with the static object. The gravitational acceleration was an external input (not to be confused with the gravitational attraction relations in the n-body experiments). Each training scene contained a string with 15 point masses, and test scenes contained either 5, 15, or 30 mass strings. In training, one of the point masses at the end of the string, chosen at random, was always held static, as if pinned to the wall, while the other masses were free to move. In the test conditions, we also included strings that had both ends pinned, and no ends pinned, to evaluate generalization.

Our model takes as input the state of each system, G, decomposed into the objects, O (e.g., n-body objects, balls, walls, point masses that represented string elements), and their physical relations, R (e.g., gravitational attraction, collisions, springs), as well as the external effects, X (e.g., gravitational acceleration). Each object state, oj, could be further divided into a dynamic state component (e.g., position and velocity) and a static attribute component (e.g., mass, size, shape). The relation attributes, Ra, represented quantities such as the coefficient of restitution and spring constant.
The input represented the system at the current time. The prediction experiment's target outputs were the velocities of the objects on the subsequent time step, and the energy estimation experiment's targets were the potential energies of the system on the current time step. We also generated multi-step rollouts for the prediction experiments (Fig. 2), to assess the model's effectiveness at creating visually realistic simulations. The output velocity, vt, on time step t became the input velocity on t + 1, and the position at t + 1 was updated by the predicted velocity at t.

Data Each of the training, validation, and test data sets was generated by simulating 2000 scenes over 1000 time steps, and randomly sampling 1 million, 200k, and 200k one-step input/target pairs, respectively. The model was trained for 2000 epochs, randomly shuffling the data indices between each. We used mini-batches of 100, and balanced their data distributions so the targets had similar per-element statistics. The performance reported in the Results was measured on held-out test data.

We explored adding a small amount of Gaussian noise to 20% of the data's input positions and velocities during the initial phase of training, which was reduced to 0% from epochs 50 to 250. The noise std. dev. was 0.05× the std. dev. of each element's values across the dataset. It allowed the model to experience physically impossible states which could not have been generated by the physics engine, and learn to project them back to nearby, possible states.
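The multi-step rollout scheme described above (predicted velocities fed back in as inputs, positions advanced by them) can be sketched as follows; `predict_velocity` stands in for a trained IN, and the integration step `dt` is an assumption of this sketch:

```python
import numpy as np

def rollout(positions, velocities, predict_velocity, n_steps, dt=0.01):
    """Multi-step rollout from a one-step model: the predicted velocity at
    step t becomes the input velocity at t+1, and positions are advanced by
    the predicted velocity. `predict_velocity` stands in for a trained IN."""
    trajectory = [positions]
    for _ in range(n_steps):
        velocities = predict_velocity(positions, velocities)
        positions = positions + dt * velocities
        trajectory.append(positions)
    return np.stack(trajectory)  # (n_steps + 1, n_objects, 2)
```

Note that only single-step predictions are ever trained; the thousands-of-steps trajectories reported later come entirely from iterating this loop.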
Our error measure did not reflect clear differences with or without noise, but rollouts from models trained with noise were slightly more visually realistic, and static objects were less subject to drift over many steps.

Figure 2: Prediction rollouts. Each column contains three panels of three video frames (with motion blur), each spanning 1000 rollout steps. Columns 1-2 are ground truth and model predictions for n-body systems, 3-4 are bouncing balls, and 5-6 are strings. Each model column was generated by a single model, trained on the underlying states of a system of the size in the top panel. The middle and bottom panels show its generalization to systems of different sizes and structure. For n-body, the training was on 6 bodies, and generalization was to 3 and 12 bodies. For balls, the training was on 6 balls, and generalization was to 3 and 9 balls. For strings, the training was on 15 masses with 1 end pinned, and generalization was to 30 masses with 0 and 2 ends pinned.

Model architecture The fR and fO MLPs contained multiple hidden layers of linear transforms plus biases, followed by rectified linear units (ReLUs), and an output layer that was a linear transform plus bias. The best model architecture was selected by a grid search over layer sizes and depths. All inputs (except Rr and Rs) were normalized by centering at the median and rescaling the 5th and 95th percentiles to -1 and 1.
All training objectives and test measures used mean squared error (MSE) between the model's prediction and the ground truth target.

All prediction experiments used the same architecture, with parameters selected by a hyperparameter search. The fR MLP had four 150-length hidden layers, and output length DE = 50. The fO MLP had one 100-length hidden layer, and output length DP = 2, which targeted the x, y-velocity. The m and a were customized so that the model was invariant to the absolute positions of objects in the scene. The m concatenated three terms for each bk: the difference vector between the dynamic states of the receiver and sender, the concatenated receiver and sender attribute vectors, and the relation attribute vector. The a only output the velocities, not the positions, for input to φO.

The energy estimation experiments used the IN from the prediction experiments with an additional φA MLP which had one 25-length hidden layer. Its P inputs' columns were length DP = 10, and its output length was DA = 1.

We optimized the parameters using Adam [13], with a waterfall schedule that began with a learning rate of 0.001 and down-scaled the learning rate by 0.8 each time the validation error, estimated over a window of 40 epochs, stopped decreasing.

Two forms of L2 regularization were explored: one applied to the effects, E, and another to the model parameters. Regularizing E improved generalization to different numbers of objects and reduced drift over many rollout steps. It likely incentivizes sparser communication between the φR and φO, prompting them to operate more independently. Regularizing the parameters generally improved performance and reduced overfitting.
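The combined objective (MSE plus the two L2 penalties) can be written as a small function; the penalty coefficients below are illustrative placeholders rather than the values used in the paper, which were chosen by grid search:

```python
import numpy as np

def in_loss(P_pred, P_true, E, params, lam_E=1e-3, lam_W=1e-5):
    """Training objective sketch: MSE on predicted velocities plus the two L2
    penalties described above, on the effect matrix E and on the model
    parameters. lam_E and lam_W are placeholder penalty factors."""
    mse = np.mean((P_pred - P_true) ** 2)
    effect_l2 = lam_E * np.sum(E ** 2)    # regularize the effects E
    param_l2 = lam_W * sum(np.sum(W ** 2) for W in params)
    return mse + effect_l2 + param_l2
```

Penalizing E directly, rather than only the weights, is what pressures the relational model to send a sparse signal to the object model.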
Both penalty factors were selected by a grid search.

Few competing models are available in the literature to compare our model against, but we considered several alternatives: a constant velocity baseline which output the input velocity; an MLP baseline, with two 300-length hidden layers, which took as input a flattened vector of all of the input data; and a variant of the IN with the φR component removed (the interaction effects, E, were set to a 0-matrix).

4 Results

Prediction experiments Our results show that the IN can predict the next-step dynamics of our task domains very accurately after training, with orders of magnitude lower test error than the alternative models (Fig. 3a, d and g, and Table 1). Because the dynamics of each domain depended crucially on interactions among objects, the IN was able to learn to exploit these relationships for its predictions. The dynamics-only IN had no mechanism for processing interactions, and performed similarly to the constant velocity model. The baseline MLP's connectivity makes it possible, in principle, for it to learn the interactions, but that would require learning how to use the relation indices to selectively process the interactions. It would also not benefit from sharing its learning across relations and objects, instead being forced to approximate the interactive dynamics in parallel for each object.

The IN also generalized well to systems with fewer and greater numbers of objects (Figs. 3b-c, e-f and h-k, and Table SM1 in Supp. Mat.). For each domain, we selected the best IN model from the system size on which it was trained, and evaluated its MSE on a different system size. When tested on smaller n-body and spring systems than those on which it was trained, its performance actually exceeded a model trained on the smaller system.
This may be due to the model\u2019s ability to exploit its\ngreater experience with how objects and relations behave, available in the more complex system.\nWe also found that the IN trained on single-step predictions can be used to simulate trajectories over\nthousands of steps very effectively, often tracking the ground truth closely, especially in the n-body\nand string domains. When rendered into images and videos, the model-generated trajectories are\nusually visually indistinguishable from those of the ground truth physics engine (Fig. 2; see Supp.\nMat. for videos of all images). This is not to say that given the same initial conditions, they cohere\nperfectly: the dynamics are highly nonlinear and imperceptible prediction errors by the model can\nrapidly lead to large differences in the systems\u2019 states. But the incoherent rollouts do not violate\npeople\u2019s expectations, and might be roughly on par with people\u2019s understanding of these domains.\n\nEstimating abstract properties We trained an abstract-estimation variant of our model to predict\npotential energies in the n-body and string domains (the ball domain\u2019s potential energies were always\n0), and found it was much more accurate (n-body MSE 1.4, string MSE 1.1) than the MLP baseline\n\n7\n\n\fFigure 3: Prediction experiment accuracy and generalization. Each colored bar represents the MSE between a\nmodel\u2019s predicted velocity and the ground truth physics engine\u2019s (the y-axes are log-scaled). Sublots (a-c) show\nn-body performance, (d-f) show balls, and (g-k) show string. The leftmost subplots in each (a, d, g) for each\ndomain compare the constant velocity model (black), baseline MLP (grey), dynamics-only IN (red), and full IN\n(blue). The other panels show the IN\u2019s generalization performance to different numbers and con\ufb01gurations of\nobjects, as indicated by the subplot titles. 
For the string systems, the subplot titles give (number of masses, number of pinned ends).

Table 1: Prediction experiment MSEs

Domain   Constant velocity   Baseline MLP   Dynamics-only IN   IN
n-body   82                  79             76                 0.25
Balls    0.074               0.072          0.074              0.0020
String   0.018               0.016          0.017              0.0011

(n-body MSE 19, string MSE 425). The IN presumably learns the gravitational and spring potential energy functions, applies them to the relations in their respective domains, and combines the results.

5 Discussion

We introduced interaction networks as a flexible and efficient model for explicit reasoning about objects and relations in complex systems. Our results provide surprisingly strong evidence of their ability to learn accurate physical simulations and generalize their training to novel systems with different numbers and configurations of objects and relations. They could also learn to infer abstract properties of physical systems, such as potential energy. The alternative models we tested performed much more poorly, with orders of magnitude greater error. Simulation over rich mental models is thought to be a crucial mechanism by which humans reason about physics and other complex domains [4, 12, 10], and Battaglia et al. [3] recently posited a simulation-based "intuitive physics engine" model to explain human physical scene understanding.
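The potential-energy targets regressed in the abstract-estimation experiment can be written down explicitly. The sketch below assumes standard pairwise gravitational and Hookean spring forms; the constants (G, K, REST) and exact functional forms used in the simulations are assumptions, not taken from the paper:

```python
import numpy as np

# Assumed constants in simulation units; the paper does not specify these.
G = 1.0       # gravitational constant for the n-body domain
K = 1.0       # spring stiffness for the string domain
REST = 0.1    # spring rest length between adjacent masses

def nbody_potential(pos, mass):
    """Total pairwise gravitational potential energy: sum over pairs of -G*m_i*m_j / r_ij."""
    pe = 0.0
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(pos[i] - pos[j])
            pe -= G * mass[i] * mass[j] / r
    return pe

def string_potential(pos):
    """Total spring potential energy along the string: sum of 0.5*K*(stretch)^2 per link."""
    pe = 0.0
    for a, b in zip(pos[:-1], pos[1:]):
        stretch = np.linalg.norm(np.asarray(a) - np.asarray(b)) - REST
        pe += 0.5 * K * stretch ** 2
    return pe
```

Both quantities decompose into per-relation terms, which is consistent with the interpretation above that the IN learns per-relation energy functions and sums them.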
Our interaction network implementation is the first learnable physics engine that can scale up to real-world problems, and is a promising template for new AI approaches to reasoning about other physical and mechanical systems, scene understanding, social perception, hierarchical planning, and analogical reasoning.
In the future, it will be important to develop techniques that allow interaction networks to handle very large systems with many interactions, such as by culling interaction computations that will have negligible effects. The interaction network may also serve as a powerful model for model-predictive control, with active control signals input as external effects; because it is differentiable, it naturally supports gradient-based planning. It will also be important to prepend a perceptual front-end that can infer objects and relations from raw observations, which can then be provided as input to an interaction network that reasons about the underlying structure of a scene. By adapting the interaction network into a recurrent neural network, even more accurate long-term predictions might be possible, though preliminary tests found little benefit beyond its already-strong performance. By modifying the interaction network to be a probabilistic generative model, it may also support probabilistic inference over unknown object properties and relations.
By combining three powerful tools from the modern machine learning toolkit – relational reasoning over structured knowledge, simulation, and deep learning – interaction networks offer flexible, accurate, and efficient learning and inference in challenging domains.
Decomposing complex systems into objects and relations, and reasoning about them explicitly, provides for combinatorial generalization to novel contexts, one of the most important future challenges for AI, and a crucial step toward closing the gap between how humans and machines think.

References
[1] J Andreas, M Rohrbach, T Darrell, and D Klein. Learning to compose neural networks for question answering. NAACL, 2016.
[2] D Baraff. Physically based modeling: Rigid body simulation. SIGGRAPH Course Notes, ACM SIGGRAPH, 2(1):2–1, 2001.
[3] PW Battaglia, JB Hamrick, and JB Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013.
[4] K.J.W. Craik. The nature of explanation. Cambridge University Press, 1943.
[5] K Fragkiadaki, P Agrawal, S Levine, and J Malik. Learning visual predictive models of physics for playing billiards. ICLR, 2016.
[6] F. Gardin and B. Meltzer. Analogical representations of naive physics. Artificial Intelligence, 38(2):139–159, 1989.
[7] Z. Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, 2015.
[8] R Grzeszczuk, D Terzopoulos, and G Hinton.
NeuroAnimator: Fast neural network emulation and control of physics-based models. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pages 9–20. ACM, 1998.
[9] P.J. Hayes. The naive physics manifesto. Université de Genève, Institut pour les études sémantiques et cognitives, 1978.
[10] M. Hegarty. Mechanical reasoning by mental simulation. TICS, 8(6):280–285, 2004.
[11] M Jaderberg, K Simonyan, and A Zisserman. Spatial transformer networks. In NIPS, pages 2008–2016, 2015.
[12] P.N. Johnson-Laird. Mental models: towards a cognitive science of language, inference, and consciousness, volume 6. Cambridge University Press, 1983.
[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[14] L Ladický, S Jeong, B Solenthaler, M Pollefeys, and M Gross. Data-driven fluid simulations using regression forests. ACM Transactions on Graphics (TOG), 34(6):199, 2015.
[15] B Lake, T Ullman, J Tenenbaum, and S Gershman. Building machines that learn and think like people. arXiv:1604.00289, 2016.
[16] Y LeCun, Y Bengio, and G Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[17] A Lerer, S Gross, and R Fergus. Learning physical intuition of block towers by example. arXiv:1603.01312, 2016.
[18] W Li, S Azimi, A Leonardis, and M Fritz. To fall or not to fall: A visual approach to physical stability prediction. arXiv:1604.00066, 2016.
[19] R Mottaghi, H Bagherinezhad, M Rastegari, and A Farhadi. Newtonian image understanding: Unfolding the dynamics of objects in static images. arXiv:1511.04048, 2015.
[20] R Mottaghi, M Rastegari, A Gupta, and A Farhadi. "What happens if..." Learning to predict the effect of forces in images. arXiv:1603.05600, 2016.
[21] SE Reed and N de Freitas. Neural programmer-interpreters. ICLR, 2016.
[22] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, and G. Monfardini.
The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009.
[23] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[24] R Socher, E Huang, J Pennington, C Manning, and A Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS, pages 801–809, 2011.
[25] E Spelke, K Breinlinger, J Macomber, and K Jacobson. Origins of knowledge. Psychol. Rev., 99(4):605–632, 1992.
[26] I Sutskever and GE Hinton. Using matrices to model symbolic relationships. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, NIPS 21, pages 1593–1600, 2009.
[27] J.B. Tenenbaum, C. Kemp, T.L. Griffiths, and N.D. Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279, 2011.
[28] P Winston and B Horn. The psychology of computer vision, volume 73. McGraw-Hill, New York, 1975.
[29] J Wu, I Yildirim, JJ Lim, B Freeman, and J Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In NIPS, pages 127–135, 2015.