{"title": "Recurrent Relational Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3368, "page_last": 3378, "abstract": "This paper is concerned with learning to solve tasks that require a chain of interdependent steps of relational inference, like answering complex questions about the relationships between objects, or solving puzzles where the smaller elements of a solution mutually constrain each other. We introduce the recurrent relational network, a general purpose module that operates on a graph representation of objects. As a generalization of Santoro et al. [2017]\u2019s relational network, it can augment any neural network model with the capacity to do many-step relational reasoning. We achieve state of the art results on the bAbI textual question-answering dataset with the recurrent relational network, consistently solving 20/20 tasks. As bAbI is not particularly challenging from a relational reasoning point of view, we introduce Pretty-CLEVR, a new diagnostic dataset for relational reasoning. In the Pretty-CLEVR set-up, we can vary the question to control for the number of relational reasoning steps that are required to obtain the answer. Using Pretty-CLEVR, we probe the limitations of multi-layer perceptrons, relational and recurrent relational networks. Finally, we show how recurrent relational networks can learn to solve Sudoku puzzles from supervised training data, a challenging task requiring upwards of 64 steps of relational reasoning. 
We achieve state-of-the-art results amongst comparable methods by solving 96.6% of the hardest Sudoku puzzles.", "full_text": "Recurrent Relational Networks\n\nRasmus Berg Palm\nTechnical University of Denmark\nTradeshift\nrapal@dtu.dk\n\nUlrich Paquet\nDeepMind\nupaq@google.com\n\nOle Winther\nTechnical University of Denmark\nolwi@dtu.dk\n\nAbstract\n\nThis paper is concerned with learning to solve tasks that require a chain of interdependent steps of relational inference, like answering complex questions about the relationships between objects, or solving puzzles where the smaller elements of a solution mutually constrain each other. We introduce the recurrent relational network, a general purpose module that operates on a graph representation of objects. As a generalization of Santoro et al. [2017]\u2019s relational network, it can augment any neural network model with the capacity to do many-step relational reasoning. We achieve state of the art results on the bAbI textual question-answering dataset with the recurrent relational network, consistently solving 20/20 tasks. As bAbI is not particularly challenging from a relational reasoning point of view, we introduce Pretty-CLEVR, a new diagnostic dataset for relational reasoning. In the Pretty-CLEVR set-up, we can vary the question to control for the number of relational reasoning steps that are required to obtain the answer. Using Pretty-CLEVR, we probe the limitations of multi-layer perceptrons, relational and recurrent relational networks. Finally, we show how recurrent relational networks can learn to solve Sudoku puzzles from supervised training data, a challenging task requiring upwards of 64 steps of relational reasoning. 
We achieve state-of-the-art results amongst comparable methods by solving 96.6% of the hardest Sudoku puzzles.\n\n1 Introduction\n\nA central component of human intelligence is the ability to abstractly reason about objects and their interactions [Spelke et al., 1995, Spelke and Kinzler, 2007]. As an illustrative example, consider solving a Sudoku. A Sudoku consists of 81 cells that are arranged in a 9-by-9 grid, which must be filled with digits 1 to 9 so that each digit appears exactly once in each row, column and 3-by-3 non-overlapping box, with a number of digits given [1]. To solve a Sudoku, one methodically reasons about the puzzle in terms of its cells and their interactions over many steps. One tries placing digits in cells and sees how that affects other cells, iteratively working toward a solution.\n\nContrast this with the canonical deep learning approach to solving problems, the multilayer perceptron (MLP), or multilayer convolutional neural net (CNN). These architectures take the entire Sudoku as an input and output the entire solution in a single forward pass, ignoring the inductive bias that objects exist in the world, and that they affect each other in a consistent manner. Not surprisingly, these models fall short when faced with problems that require even basic relational reasoning [Lake et al., 2016, Santoro et al., 2017].\n\nThe relational network of Santoro et al. 
[2017] is an important first step towards a simple module for reasoning about objects and their interactions, but it is limited to performing a single relational operation, and was evaluated on datasets that require a maximum of three steps of reasoning (which, surprisingly, can be solved by a single relational reasoning step, as we show). Looking beyond relational networks, there is a rich literature on logic and reasoning in artificial intelligence and machine learning, which we discuss in section 5.\n\n[1] We invite the reader to solve the Sudoku in the supplementary material to appreciate the difficulty of solving a Sudoku in which 17 cells are initially filled.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nToward generally realizing the ability to methodically reason about objects and their interactions over many steps, this paper introduces a composite function, the recurrent relational network. It serves as a modular component for many-step relational reasoning in end-to-end differentiable learning systems. It encodes the inductive biases that 1) objects exist in the world, 2) they can be sufficiently described by properties, 3) properties can change over time, 4) objects can affect each other, and 5) given the properties, the effects objects have on each other are invariant to time.\n\nAn important insight from the work of Santoro et al. [2017] is to decompose a function for relational reasoning into two components or \u201cmodules\u201d: a perceptual front-end, which is tasked to recognize objects in the raw input and represent them as vectors, and a relational reasoning module, which uses the representation to reason about the objects and their interactions. Both modules are trained jointly end-to-end. 
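In rough terms, this two-module decomposition amounts to a pair of function types plus a composition. The sketch below is illustrative only; the names (`FrontEnd`, `Reasoner`, `compose`) are ours and not the paper's code:

```python
from typing import Callable, List, Tuple

Vector = List[float]           # real-valued object representation
Edges = List[Tuple[int, int]]  # directed edges between node indices

# Perceptual front-end: raw input -> one vector per recognized object.
FrontEnd = Callable[[object], List[Vector]]

# Relational reasoning module: graph of vector nodes -> updated node vectors.
Reasoner = Callable[[List[Vector], Edges], List[Vector]]

def compose(front_end: FrontEnd, reasoner: Reasoner,
            edges_fn: Callable[[int], Edges]) -> Callable[[object], List[Vector]]:
    """Joint model: perceive objects, build a graph, reason over it."""
    def model(raw_input: object) -> List[Vector]:
        nodes = front_end(raw_input)
        return reasoner(nodes, edges_fn(len(nodes)))
    return model
```

In the full model both parts would be neural networks trained jointly; the types only pin down the boundary between them.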
In computer science parlance, the relational reasoning module implements an interface: it operates on a graph of nodes and directed edges, where the nodes are represented by real valued vectors, and is differentiable. This paper chiefly develops the relational reasoning side of that interface.\n\nSome of the tasks we evaluate on can be efficiently and perfectly solved by hand-crafted algorithms that operate on the symbolic level. For example, 9-by-9 Sudokus can be solved in a fraction of a second with constraint propagation and search [Norvig, 2006] or with dancing links [Knuth, 2000]. These symbolic algorithms are superior in every respect but one: they don\u2019t comply with the interface, as they are not differentiable and don\u2019t work with real-valued vector descriptions. They therefore cannot be used in a combined model with a deep learning perceptual front-end and learned end-to-end.\n\nFollowing Santoro et al. [2017], we use the term \u201crelational reasoning\u201d liberally for an object- and interaction-centric approach to problem solving. Although the term \u201crelational reasoning\u201d is similar to terms in other branches of science, like relational logic or first order logic, no direct parallel is intended.\n\nThis paper considers many-step relational reasoning, a challenging task for deep learning architectures. We develop a recurrent relational reasoning module, which constitutes our main contribution. We show that it is a powerful architecture for many-step relational reasoning on three varied datasets, achieving state-of-the-art results on bAbI and Sudoku.\n\n2 Recurrent Relational Networks\n\nWe ground the discussion of a recurrent relational network in something familiar, solving a Sudoku puzzle. A simple strategy works by noting that if a certain Sudoku cell is given as a \u201c7\u201d, one can safely remove \u201c7\u201d as an option from other cells in the same row, column and box. 
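This elimination rule can be written down directly on candidate sets (a toy symbolic sketch for illustration; the index arithmetic and helper names are ours, not part of the learned model):

```python
def peers(i):
    """Indices of the 20 cells sharing a row, column or 3-by-3 box with cell i (0-80)."""
    r, c = divmod(i, 9)
    same_row = {r * 9 + k for k in range(9)}
    same_col = {k * 9 + c for k in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    same_box = {(br + dr) * 9 + (bc + dc) for dr in range(3) for dc in range(3)}
    return (same_row | same_col | same_box) - {i}

def eliminate(candidates, i, digit):
    """A given digit in cell i rules that digit out for every peer of i."""
    for j in peers(i):
        candidates[j].discard(digit)

# All 81 cells start with all nine candidate digits.
candidates = [set(range(1, 10)) for _ in range(81)]
eliminate(candidates, 0, 7)  # cell 0 is given as a "7"
```

The recurrent relational network replaces this hand-crafted rule with learned messages passed along exactly these row, column and box edges.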
In a message passing framework, that cell needs to send a message to each other cell in the same row, column, and box, broadcasting its value as \u201c7\u201d, and informing those cells not to take the value \u201c7\u201d. In an iteration t, these messages are sent simultaneously, in parallel, between all cells. Each cell i should then consider all incoming messages, and update its internal state h^t_i to h^{t+1}_i. With the updated state each cell should send out new messages, and the process repeats.\n\nMessage passing on a graph. The recurrent relational network will learn to pass messages on a graph. For Sudoku, the graph has i \u2208 {1, 2, ..., 81} nodes, one for each cell in the Sudoku. Each node has an input feature vector x_i, and edges to and from all nodes that are in the same row, column and box in the Sudoku. The graph is the input to the relational reasoning module, and the vectors x_i would generally be the output of a perceptual front-end, for instance a convolutional neural network. Keeping with our Sudoku example, each x_i encodes the initial cell content (empty or given) and the row and column position of the cell.\n\nAt each step t each node has a hidden state vector h^t_i, which is initialized to the features, such that h^0_i = x_i. At each step t, each node sends a message to each of its neighboring nodes. We define the message m^t_{ij} from node i to node j at step t by\n\nm^t_{ij} = f(h^{t-1}_i, h^{t-1}_j),   (1)\n\nwhere f, the message function, is a multi-layer perceptron. This allows the network to learn what kind of messages to send. In our experiments, MLPs with linear outputs were used.\n\nFigure 1: A recurrent relational network on a fully connected graph with 3 nodes. The nodes\u2019 hidden states h^t_i are highlighted with green, the inputs x_i with red, and the outputs o^t_i with blue. The dashed lines indicate the recurrent connections. Subscripts denote node indices and superscripts denote steps t. For a figure of the same graph unrolled over 2 steps see the supplementary material.\n\nSince a node needs to consider all the incoming messages we sum them with\n\nm^t_j = \sum_{i \in N(j)} m^t_{ij},   (2)\n\nwhere N(j) are all the nodes that have an edge into node j. For Sudoku, N(j) contains the nodes in the same row, column and box as j. In our experiments, since the messages in (1) are linear, this is similar to how log-probabilities are summed in belief propagation [Murphy et al., 1999].\n\nRecurrent node updates. Finally we update the node hidden state via\n\nh^t_j = g(h^{t-1}_j, x_j, m^t_j),   (3)\n\nwhere g, the node function, is another learned neural network. The dependence on the previous node hidden state h^{t-1}_j allows the network to iteratively work towards a solution instead of starting with a blank slate at every step. Injecting the feature vector x_j at each step like this allows the node function to focus on the messages from the other nodes instead of trying to remember the input.\n\nSupervised training. The above equations for sending messages and updating node states define a recurrent relational network\u2019s core. To train a recurrent relational network in a supervised manner to solve a Sudoku we introduce an output probability distribution over the digits 1-9 for each of the nodes in the graph. The output distribution o^t_i for node i at step t is given by\n\no^t_i = r(h^t_i),   (4)\n\nwhere r is an MLP that maps the node hidden state to the output probabilities, e.g. using a softmax nonlinearity. Given the target digits y = {y_1, y_2, ..., y_81}, the loss at step t is then the sum of cross-entropy terms, one for each node: l^t = -\sum_{i=1}^{I} \log o^t_i[y_i], where o_i[y_i] is the y_i\u2019th component of o_i. Equations (1) to (4) are illustrated in figure 1.\n\nConvergent message passing. A distinctive feature of our proposed model is that we minimize the cross entropy between the output and target distributions at every step. At test time we only consider the output probabilities at the last step, but having a loss at every step during training is beneficial. Since the target digits y_i are constant over the steps, it encourages the network to learn a convergent message passing algorithm. Secondly, it helps with the vanishing gradient problem.\n\nVariations. If the edges are unknown, the graph can be assumed to be fully connected. In this case the network will need to learn which objects interact with each other. If the edges have attributes, e_ij, the message function in equation 1 can be modified such that m^t_{ij} = f(h^{t-1}_i, h^{t-1}_j, e_ij). If the output of interest is for the whole graph instead of for each node, the output in equation 4 can be modified such that there is a single output o^t = r(\sum_i h^t_i). The loss can be modified accordingly.\n\n3 Experiments\n\nCode to reproduce all experiments can be found at github.com/rasmusbergpalm/recurrent-relational-networks.\n\n3.1 bAbI question-answering tasks\n\nTable 1: bAbI results. Trained jointly on all 20 tasks using the 10,000 training samples. 
Entries marked with an asterisk are our own experiments; the rest are from the respective papers.\n\nMethod | N | Mean Error (%) | Failed tasks (err. >5%)\nRRN* (this work) | 15 | 0.46 \u00b1 0.77 | 0.13 \u00b1 0.35\nSDNC [Rae et al., 2016] | 15 | 6.4 \u00b1 2.5 | 4.1 \u00b1 1.6\nDAM [Rae et al., 2016] | 15 | 8.7 \u00b1 6.4 | 5.4 \u00b1 3.4\nSAM [Rae et al., 2016] | 15 | 11.5 \u00b1 5.9 | 7.1 \u00b1 3.4\nDNC [Rae et al., 2016] | 15 | 12.8 \u00b1 4.7 | 8.2 \u00b1 2.5\nNTM [Rae et al., 2016] | 15 | 26.6 \u00b1 3.7 | 15.5 \u00b1 1.7\nLSTM [Rae et al., 2016] | 15 | 28.7 \u00b1 0.5 | 17.1 \u00b1 0.8\nEntNet [Henaff et al., 2016] | 5 | 9.7 \u00b1 2.6 | 5 \u00b1 1.2\nReMo [Yang et al., 2018] | 1 | 1.2 | 1\nRN [Santoro et al., 2017] | 1 | N/A | 2\nMemN2N [Sukhbaatar et al., 2015] | 1 | 7.5 | 6\n\nbAbI is a text based QA dataset from Facebook [Weston et al., 2015] designed as a set of prerequisite tasks for reasoning. It consists of 20 types of tasks, with 10,000 questions each, including deduction, induction, spatial and temporal reasoning. Each question, e.g. \u201cWhere is the milk?\u201d, is preceded by a number of facts in the form of short sentences, e.g. \u201cDaniel journeyed to the garden. Daniel put down the milk.\u201d The target is a single word, in this case \u201cgarden\u201d, one-hot encoded over the full bAbI vocabulary of 177 words. A task is considered solved if a model achieves greater than 95% accuracy. The most difficult tasks require reasoning about three facts.\n\nTo map the questions into a graph we treat the facts related to a question as the nodes in a fully connected graph, up to a maximum of the last 20 facts. The fact and question sentences are both encoded by Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] layers with 32 hidden units each. We concatenate the last hidden state of each LSTM and pass that through an MLP. The output is considered the node features x_i. 
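Schematically, the fact-to-graph mapping looks like the following. This is a sketch with a stand-in sentence encoder; the actual model uses trained LSTMs and an MLP, and `toy_encode` and `babi_graph` are our hypothetical names:

```python
import hashlib

MAX_FACTS = 20  # only the last 20 facts become nodes

def toy_encode(sentence, dim=8):
    """Stand-in for the learned LSTM sentence encoder: any fixed-size vector."""
    digest = hashlib.sha256(sentence.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def babi_graph(facts, question):
    """Map a bAbI story to a fully connected graph over its most recent facts."""
    facts = facts[-MAX_FACTS:]
    q = toy_encode(question)
    # Node features: fact encoding concatenated with the question encoding
    # (the real model passes this concatenation through an MLP).
    nodes = [toy_encode(f) + q for f in facts]
    edges = [(i, j) for i in range(len(facts)) for j in range(len(facts)) if i != j]
    return nodes, edges

nodes, edges = babi_graph(
    ["Daniel journeyed to the garden.", "Daniel put down the milk."],
    "Where is the milk?")
```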
Following [Santoro et al., 2017], all edge features e_ij are set to the question encoding. We train the network for three steps. At each step, we sum the node hidden states and pass that through an MLP to get a single output for the whole graph. For details see the supplementary material.\n\nOur trained network solves 20 of 20 tasks in 13 out of 15 runs. This is state-of-the-art and markedly more stable than competing methods. See table 1. We perform ablation experiments to see which parts of the model are important, including varying the number of steps. We find that using dropout and appending the question encoding to the fact encodings is important for the performance. See the supplementary material for details.\n\nSurprisingly, we find that we only need a single step of relational reasoning to solve all the bAbI tasks. This is surprising since the hardest tasks require reasoning about three facts. It\u2019s possible that there are superficial correlations in the tasks that the model learns to exploit. Alternatively, the model learns to compress all the relevant fact-relations into the 128 floats resulting from the sum over the node hidden states, and performs the remaining reasoning steps in the output MLP. Regardless, it appears multiple steps of relational reasoning are not important for the bAbI dataset.\n\n3.2 Pretty-CLEVR\n\nGiven that bAbI did not require multiple steps of relational reasoning, and in order to test our hypothesis that our proposed model is better suited for tasks requiring more steps of relational reasoning, we create a diagnostic dataset, \u201cPretty-CLEVR\u201d. It can be seen as an extension of the \u201cSort-of-CLEVR\u201d data set by Santoro et al. [2017], which has questions of a non-relational and relational nature. 
\u201cPretty-CLEVR\u201d takes this a step further and has non-relational questions as well as questions requiring varying degrees of relational reasoning.\n\n(a) Samples. (b) Results.\n\nFigure 2: 2a: Two samples of the Pretty-CLEVR diagnostic dataset. Each sample has 128 associated questions, exhibiting varying levels of relational reasoning difficulty. For the topmost sample the solution to the question \u201cgreen, 3 jumps\u201d, which is \u201cplus\u201d, is shown with arrows. 2b: Random corresponds to picking one of the eight possible outputs at random (colors or shapes, depending on the input). The RRN is trained for four steps, but since it predicts at each step we can evaluate the performance for each step. The number of steps is stated in parentheses.\n\nPretty-CLEVR consists of scenes with eight colored shapes and associated questions. Questions are of the form: \u201cStarting at object X which object is N jumps away?\u201d. Objects are uniquely defined by their color or shape. If the start object is defined by color, the answer is a shape, and vice versa. Jumps are defined as moving to the closest object, without going to an object already visited. See figure 2a. Questions with zero jumps are non-relational and correspond to: \u201cWhat color is shape X?\u201d or \u201cWhat shape is color X?\u201d. We create 100,000 random scenes, and 128 questions for each (8 start objects, 0-7 jumps, output is color or shape), resulting in 12.8M questions. We also render the scenes as images. The \u201cjump to nearest\u201d type question is chosen in an effort to eliminate simple correlations between the scene state and the answer. It is highly non-linear in the sense that slight differences in the distance between objects can cause the answer to change drastically. It is also asymmetrical, i.e. if the question \u201cx, n jumps\u201d equals \u201cy\u201d, there is no guarantee that \u201cy, n jumps\u201d equals \u201cx\u201d. 
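The jump semantics can be pinned down with a short sketch (the `answer` helper and the scene encoding as dicts are our own illustration, not the dataset generator):

```python
import math

def answer(objects, start, n_jumps):
    """Follow n jumps, each to the nearest object not yet visited.

    `objects` is a list of dicts with "pos", "color" and "shape"; `start` is an
    index. The dataset's answer is the final object's shape if the question named
    a color, and its color if it named a shape.
    """
    visited = {start}
    current = start
    for _ in range(n_jumps):
        nearest = min(
            (i for i in range(len(objects)) if i not in visited),
            key=lambda i: math.dist(objects[i]["pos"], objects[current]["pos"]),
        )
        visited.add(nearest)
        current = nearest
    return objects[current]
```

Zero jumps simply returns the start object, which is why those questions are non-relational.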
We find it is a surprisingly difficult task to solve, even with a powerful model such as the RRN. We hope others will use it to evaluate their relational models. [2]\n\nSince we are solely interested in examining the effect of multiple steps of relational reasoning we train on the state descriptions of the scene. We consider each scene as a fully connected undirected graph with 8 nodes. The feature vector for each object consists of the position, shape and color. We encode the question as the start object shape or color and the number of jumps. As we did for bAbI, we concatenate the question and object features and pass them through an MLP to get the node features x_i. To make the task easier we set the edge features to the euclidean distance between the objects. We train our network for four steps and compare to a single step relational network and a baseline MLP that considers the entire scene state, all pairwise distances, and the question as a single vector. For details see the supplementary material.\n\n[2] Pretty-CLEVR is available online as part of the code for reproducing experiments.\n\nMirroring the results from the \u201cSort-of-CLEVR\u201d dataset, the MLP perfectly solves the non-relational questions, but struggles with even single jump questions and seems to lower bound the performance of the relational networks. The relational network solves the non-relational questions as well as the ones requiring a single jump, but the accuracy sharply drops off with more jumps. This matches the performance of the recurrent relational network, which generally performs well as long as the number of steps is greater than or equal to the number of jumps. See figure 2b. 
It seems that, despite our best efforts, there are spurious correlations in the data such that questions with six to seven jumps are easier to solve than those with four to five jumps.\n\n3.3 Sudoku\n\nWe create training, validation and testing sets totaling 216,000 Sudoku puzzles with a uniform distribution of givens between 17 and 34. We consider each of the 81 cells in the 9x9 Sudoku grid a node in a graph, with edges to and from each other cell in the same row, column and box. The node features x_i are the output of an MLP which takes as input the digit for the cell (0-9, 0 if not given), and the row and column position (1-9). Edge features are not used. We run the network for 32 steps and at every step the output function r maps each node hidden state to nine output logits corresponding to the nine possible digits. For details see the supplementary material.\n\nFigure 3: Example of how the trained network solves part of a Sudoku. Only the top row of a full 9x9 Sudoku is shown for clarity. From top to bottom steps 0, 1, 8 and 24 are shown. See the supplementary material for a full Sudoku. Each cell displays the digits 1-9 with the font size scaled (non-linearly for legibility) to the probability the network assigns to each digit. Notice how the network eliminates the given digits 6 and 4 from the other cells in the first step. Animations showing how the trained network solves Sudokus, including a failure case, can be found at imgur.com/a/ALsfB.\n\nOur network learns to solve 94.1% of even the hardest 17-givens Sudokus after 32 steps. We only consider a puzzle solved if all the digits are correct, i.e. no partial credit is given for getting individual digits correct. For more givens the accuracy (fraction of test puzzles solved) quickly approaches 100%. Since the network outputs a probability distribution for each step, we can visualize how the network arrives at the solution step by step. 
For an example of this see figure 3.\n\nTo examine our hypothesis that multiple steps are required we plot the accuracy as a function of the number of steps. See figure 4. We can see that even simple Sudokus with 33 givens require upwards of 10 steps of relational reasoning, whereas the harder ones with 17 givens continue to improve even after 32 steps. Figure 4 also shows that the model has learned a convergent algorithm. The model was trained for 32 steps, but seeing that the accuracy increased with more steps, we ran the model for 64 steps during testing. At 64 steps the accuracy for the 17-givens puzzles increases to 96.6%.\n\nWe also examined the importance of the row and column features by multiplying the row and column embeddings by zero and re-testing our trained network. At 64 steps with 17 givens, the accuracy changed to 96.7%. It thus seems the network does not use the row and column position information to solve the task.\n\nFigure 4: Fraction of test puzzles solved as a function of the number of steps. Even simple Sudokus with 33 givens require about 10 steps of relational reasoning to be solved. The dashed vertical line indicates the 32 steps the network was trained for. The network appears to have learned a convergent relational reasoning algorithm such that more steps beyond 32 improve on the hardest Sudokus.\n\nWe compare our network to several other differentiable methods. See table 2. We train two relational networks: a node centric and a graph centric. For details see the supplementary material. Of the two, the node centric was considerably better. 
The node centric corresponds exactly to our proposed network with a single step, yet fails to solve any Sudoku. This shows that multiple steps are crucial for complex relational reasoning. Our network outperforms loopy belief propagation, with parallel and random message passing updates [Bauke, 2008]. It also outperforms a version of loopy belief propagation modified specifically for solving Sudokus that uses 250 steps, Sinkhorn balancing every two steps, and iteratively picks the most probable digit [Khan et al., 2014]. We also compare to learning the messages in parallel loopy BP as presented in Lin et al. [2015]. We tried a few variants, including a single step as presented and 32 steps with and without a loss on every step, but could not get it to solve any 17-given Sudokus. Finally we outperform Park [2016], which treats the Sudoku as a 9x9 image, uses 10 convolutional layers, iteratively picks the most probable digit, and evaluates on easier Sudokus with 24-36 givens. We also tried to train a version of our network that only had a loss at the last step. It was harder to train, performed worse, and didn\u2019t learn a convergent algorithm.\n\nTable 2: Comparison of methods for solving Sudoku puzzles. Only methods that are differentiable are included in the comparison. 
Entries marked with an asterisk are our own experiments; the rest are from the respective papers.\n\nMethod | Givens | Accuracy\nRecurrent Relational Network* (this work) | 17 | 96.6%\nLoopy BP, modified [Khan et al., 2014] | 17 | 92.5%\nLoopy BP, random [Bauke, 2008] | 17 | 61.7%\nLoopy BP, parallel [Bauke, 2008] | 17 | 53.2%\nDeeply Learned Messages* [Lin et al., 2015] | 17 | 0%\nRelational Network, node* [Santoro et al., 2017] | 17 | 0%\nRelational Network, graph* [Santoro et al., 2017] | 17 | 0%\nDeep Convolutional Network [Park, 2016] | 24-36 | 70%\n\n3.4 Age arithmetic\n\nAnonymous reviewer 2 suggested the following task, which we include here. The task is to infer the age of a person given a single absolute age and a set of age differences, e.g. \u201cAlice is 20 years old. Alice is 4 years older than Bob. Charlie is 6 years younger than Bob. How old is Charlie?\u201d. Please see the supplementary material for details on the task and results.\n\n4 Discussion\n\nWe have proposed a general relational reasoning model for solving tasks requiring an order of magnitude more complex relational reasoning than the current state-of-the-art. bAbI and Sort-of-CLEVR require a few steps, Pretty-CLEVR requires up to eight steps, and Sudoku requires more than ten steps. Our relational reasoning module can be added to any deep learning model to add a powerful relational reasoning capacity. We get state-of-the-art results on Sudoku, solving 96.6% of the hardest Sudokus with 17 givens. We also markedly improve the state-of-the-art on the bAbI dataset, solving 20/20 tasks in 13 out of 15 runs with a single model trained jointly on all tasks.\n\nOne potential issue with having a loss at every step is that it might encourage the network to learn a greedy algorithm that gets stuck in a local minimum. 
However, the output function r separates the node hidden states and messages from the output probability distributions. The network therefore has the capacity to use a small part of the hidden state for retaining a current best guess, which can remain constant over several steps, and other parts of the hidden state for running a non-greedy multi-step algorithm.\n\nSending messages for all nodes in parallel and summing all the incoming messages might seem like an unsophisticated approach that risks resulting in oscillatory behavior and drowning out the important messages. However, since the receiving node hidden state is an input to the message function, the receiving node can in a sense determine which messages it wishes to receive. As such, the sum can be seen as an implicit attention mechanism over the incoming messages. Similarly, the network can learn an optimal message passing schedule, by ignoring messages based on the history and current state of the receiving and sending node.\n\n5 Related work\n\nRelational networks [Santoro et al., 2017] and interaction networks [Battaglia et al., 2016] are the most directly comparable to ours. These models correspond to using a single step of equation 3. Since they only do one step, they cannot naturally do complex multi-step relational reasoning. In order to solve tasks that require more than a single step, they must compress all the relevant relations into a fixed size vector, then perform the remaining relational reasoning in the last forward layers. Relational networks, interaction networks and our proposed model can all be seen as instances of Graph Neural Networks [Scarselli et al., 2009, Gilmer et al., 2017].\n\nGraph neural networks with message passing computations go back to Scarselli et al. [2009]. However, there are key differences that we found important for implementing stable multi-step relational reasoning. Including the node features x_j at every step in eq. 
3 is important to the stability of the\nnetwork. Scarselli et al. [2009], eq. 3 has the node features, ln, inside the message function. Battaglia\net al. [2016] use an xj in the node update function, but this is an external driving force. Sukhbaatar\net al. [2016] also proposed to include the node features at every step. Optimizing the loss at every\nstep in order to learn a convergent message passing algorithm is novel to the best of our knowledge.\nScarselli et al. [2009] introduces an explicit loss term to ensure convergence. Ross et al. [2011] trains\nthe inference machine predictors on every step, but there are no hidden states; the node states are the\noutput marginals directly, similar to how belief propagation works.\nOur model can also be seen as a completely learned message passing algorithm. Belief propagation\nis a hand-crafted message passing algorithm for performing exact inference in directed acyclic\ngraphical models. If the graph has cycles, one can use a variant, loopy belief propagation, but it is\nnot guaranteed to be exact, unbiased or converge. Empirically it works well though and it is widely\nused [Murphy et al., 1999]. Several works have proposed replacing parts of belief propagation with\nlearned modules [Heess et al., 2013, Lin et al., 2015]. Our work differs by not being rooted in loopy\nBP, and instead learning all parts of a general message passing algorithm. Ross et al. [2011] proposes\n\n8\n\n\fInference Machines which ditch the belief propagation algorithm altogether and instead train a series\nof regressors to output the correct marginals by passing messages on a graph. Wei et al. [2016]\napplies this idea to pose estimation using a series of convolutional layers and Deng et al. 
[2016] introduces a recurrent node update for the same domain.

There is a rich literature on combining symbolic reasoning and logic with sub-symbolic distributed representations, which goes all the way back to the birth of the idea of parallel distributed processing [McCulloch and Pitts, 1943]. See Raedt et al. [2016] and Besold et al. [2017] for two recent surveys. Here we describe only a few recent methods. Serafini and Garcez [2016] introduce the Logic Tensor Network (LTN), which describes a first-order logic in which symbols are grounded as vector embeddings, and predicates and functions are grounded as tensor networks. The embeddings and tensor networks are then optimized jointly to maximize a fuzzy satisfiability measure over a set of known facts and fuzzy constraints. Šourek et al. [2015] introduces the Lifted Relational Network, which combines relational logic with neural networks by creating neural networks from lifted rules and training examples, such that the connections between neurons created from the same lifted rule share weights. Our approach differs fundamentally in that we do not aim to bridge symbolic and sub-symbolic methods. Instead, we stay completely in the sub-symbolic realm. We do not introduce or consider any explicit logic, aim to discover (fuzzy) logic rules, or attempt to include prior knowledge in the form of logical constraints.

Amos and Kolter [2017] introduce OptNet, a neural network layer that solves quadratic programs using an efficient differentiable solver. OptNet is trained to solve 4x4 Sudokus amongst other problems and beats the deep convolutional network baseline described in Park [2016]. Unfortunately, we cannot compare to OptNet directly, as it has computational issues scaling to 9x9 Sudokus (Brandon Amos, 2018, personal communication).

Sukhbaatar et al. [2016] proposes the Communication Network (CommNet) for learning multi-agent cooperation and communication using back-propagation.
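To make the contrast concrete, the following minimal NumPy sketch (our own illustration, not either paper's implementation; `message` and `update` stand in for the learned neural functions, and all names are hypothetical) compares CommNet's broadcast of a single averaged state with pairwise message passing, where each receiver sums messages computed from every (sender, receiver) pair of hidden states:

```python
import numpy as np

def commnet_step(h, update):
    """CommNet-style update: every node receives the same message,
    the average of all node hidden states."""
    c = h.mean(axis=0)
    return np.stack([update(h_j, c) for h_j in h])

def pairwise_step(h, message, update):
    """Pairwise message passing: node j sums messages computed
    from each (sender i, receiver j) pair of hidden states."""
    n = len(h)
    msgs = [sum(message(h[i], h[j]) for i in range(n) if i != j)
            for j in range(n)]
    return np.stack([update(h[j], msgs[j]) for j in range(n)])
```

Because the pairwise message is a function of both sender and receiver states, the receiver can modulate what it effectively attends to, whereas the averaged broadcast cannot distinguish between senders.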
It is similar to our recurrent relational network, but differs in key aspects. The messages passed between all nodes at a given step are the same, corresponding to the average of all the node hidden states. Also, it is not trained to minimize the loss on every step of the algorithm.

Acknowledgments

We'd like to thank the anonymous reviewers for the valuable comments and suggestions, especially reviewer 2, who suggested the age arithmetic task. This research was supported by the NVIDIA Corporation with the donation of TITAN X GPUs.

References

Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443, 2017.

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016.

Heiko Bauke. Passing messages to lonely numbers. Computing in Science & Engineering, 10(2):32–40, 2008.

Tarek R Besold, Artur d'Avila Garcez, Sebastian Bader, Howard Bowman, Pedro Domingos, Pascal Hitzler, Kai-Uwe Kühnberger, Luis C Lamb, Daniel Lowd, Priscila Machado Vieira Lima, et al. Neural-symbolic learning and reasoning: A survey and interpretation. arXiv preprint arXiv:1711.03902, 2017.

Zhiwei Deng, Arash Vahdat, Hexiang Hu, and Greg Mori. Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4772–4781, 2016.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.

Nicolas Heess, Daniel Tarlow, and John Winn.
Learning to pass expectation propagation messages. In Advances in Neural Information Processing Systems, pages 3219–3227, 2013.

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the world state with recurrent entity networks. arXiv preprint arXiv:1612.03969, 2016.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Sheehan Khan, Shahab Jabbari, Shahin Jabbari, and Majid Ghanbarinejad. Solving Sudoku using probabilistic graphical models. 2014.

Donald E Knuth. Dancing links. arXiv preprint cs/0011047, 2000.

Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, pages 1–101, 2016.

Guosheng Lin, Chunhua Shen, Ian Reid, and Anton van den Hengel. Deeply learning the messages in message passing inference. In Advances in Neural Information Processing Systems, pages 361–369, 2015.

Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

Kevin P Murphy, Yair Weiss, and Michael I Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 467–475. Morgan Kaufmann Publishers Inc., 1999.

Peter Norvig. Solving every Sudoku puzzle, 2006. URL http://norvig.com/sudoku.html.

Kyubyong Park. Can neural networks crack Sudoku?, 2016. URL https://github.com/Kyubyong/sudoku.

Jack Rae, Jonathan J Hunt, Ivo Danihelka, Timothy Harley, Andrew W Senior, Gregory Wayne, Alex Graves, and Tim Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes.
In Advances in Neural Information Processing Systems, pages 3621–3629, 2016.

Luc De Raedt, Kristian Kersting, Sriraam Natarajan, and David Poole. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(2):1–189, 2016.

Stephane Ross, Daniel Munoz, Martial Hebert, and J Andrew Bagnell. Learning message-passing inference machines for structured prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2737–2744. IEEE, 2011.

Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Luciano Serafini and Artur S d'Avila Garcez. Learning and reasoning with logic tensor networks. In AI*IA 2016 Advances in Artificial Intelligence, pages 334–348. Springer, 2016.

Gustav Šourek, Vojtech Aschenbrenner, Filip Železný, and Ondřej Kuželka. Lifted relational neural networks. In Proceedings of the 2015 International Conference on Cognitive Computation: Integrating Neural and Symbolic Approaches, Volume 1583, pages 52–60. CEUR-WS.org, 2015.

Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental Science, 10(1):89–96, 2007.

Elizabeth S Spelke, Grant Gutheil, and Gretchen Van de Walle. The development of object perception. 1995.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.

Sainbayar Sukhbaatar, Rob Fergus, et al.
Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.

Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.

Hyochang Yang, Sungzoon Cho, et al. Finding ReMO (related memory object): A simple neural architecture for text based reasoning. arXiv preprint arXiv:1801.08459, 2018.