{"title": "Graph Normalizing Flows", "book": "Advances in Neural Information Processing Systems", "page_first": 13578, "page_last": 13588, "abstract": "We introduce graph normalizing flows: a new, reversible graph neural network model for prediction and generation. On supervised tasks, graph normalizing flows perform similarly to message passing neural networks, but at a significantly reduced memory footprint, allowing them to scale to larger graphs. In the unsupervised case, we combine graph normalizing flows with a novel graph auto-encoder to create a generative model of graph structures. Our model is permutation-invariant, generating entire graphs with a single feed-forward pass, and achieves competitive results with the state-of-the art auto-regressive models, while being better suited to parallel computing architectures.", "full_text": "Graph Normalizing Flows\n\nJenny Liu\u2217\n\nUniversity of Toronto\n\nVector Institute\n\njyliu@cs.toronto.edu\n\nAviral Kumar\u2217\u2020\n\nUC Berkeley\n\naviralk@berkeley.edu\n\nJimmy Ba\n\nUniversity of Toronto\n\nVector Institute\n\njba@cs.toronto.edu\n\nJamie Kiros\n\nGoogle Research\n\nkiros@google.com\n\nKevin Swersky\nGoogle Research\n\nkswersky@google.com\n\nAbstract\n\nWe introduce graph normalizing \ufb02ows: a new, reversible graph neural network\nmodel for prediction and generation. On supervised tasks, graph normalizing \ufb02ows\nperform similarly to message passing neural networks, but at a signi\ufb01cantly reduced\nmemory footprint, allowing them to scale to larger graphs. In the unsupervised\ncase, we combine graph normalizing \ufb02ows with a novel graph auto-encoder to\ncreate a generative model of graph structures. 
Our model is permutation-invariant,\ngenerating entire graphs with a single feed-forward pass, and achieves competitive\nresults with the state-of-the art auto-regressive models, while being better suited to\nparallel computing architectures.\n\n1\n\nIntroduction\n\nGraph-structured data is ubiquitous in science and engineering, and modeling graphs is an important\ncomponent of being able to make predictions and reason about these domains. Machine learning\nhas recently turned its attention to modeling graph-structured data using graph neural networks\n(GNNs) [8, 23, 16, 13, 6] that can exploit the structure of relational systems to create more accurate\nand generalizable predictions. For example, these can be used to predict the properties of molecules\nin order to aid in search and discovery [5, 6], or to learn physical properties of robots such that new\nrobots with different structures can be controlled without re-learning a control policy [27].\n\nIn this paper, we introduce a new formulation for graph neural networks by extending the framework\nof normalizing \ufb02ows [22, 3, 4] to graph-structured data. We call these models graph normalizing\n\ufb02ows (GNFs). GNFs have the property that the message passing computation is exactly reversible,\nmeaning that one can exactly reconstruct the input node features from the GNN representation; this\nresults in GNFs having several useful properties.\n\nIn the supervised case, we leverage a similar mechanism to [7] to obtain signi\ufb01cant memory savings\nin a model we call reversible graph neural networks, or GRevNets. Ordinary GNNs require the\nstorage of hidden states after every message passing step in order to facilitate backpropagation. This\nmeans one needs to store O(#nodes \u00d7 #message passing steps) states, which can be costly for large\ngraphs. In contrast, GRevNets can reconstruct hidden states in lower layers from higher layers\nduring backpropagation, meaning one only needs to store O(#nodes) states. 
A recent approach for\nmemory saving based on recurrent backpropagation (RBP) [1, 20, 18] requires running message\npassing to convergence, followed by the approximate, iterative inversion of a large matrix. Conversely,\nGRevNets get the exact gradients at a minor additional cost, equivalent to one extra forward pass. We\n\n\u2217Equal Contribution\n\u2020Work done during an internship at Google\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fshow that GRevNets are competitive with conventional memory-inef\ufb01cient GNNs, and outperform\nRBP on standard benchmarks.\n\nIn the unsupervised case, we use GNFs to develop a generative model of graphs. Learned generative\nmodels of graphs are a relatively new and less explored area. Machine learning has been quite\nsuccessful at generative modeling of complex domains such as images, audio, and text. However,\nrelational data poses new and interesting challenges such as permutation invariance, as permuting the\nnodes results in the same underlying graph.\n\nOne of the most successful approaches so far is to model the graph using an auto-regressive pro-\ncess [17, 30]. These generate each node in sequence, and for each newly generated node, the\ncorresponding edges to previously generated nodes are also created. In theory, this is capable of\nmodeling the full joint distribution, but computing the full likelihood requires marginalizing over all\npossible node-orderings. Sequential generation using RNNs also potentially suffers from trying to\nmodel long-range dependencies.\n\nNormalizing \ufb02ows are primarily designed for continuous-valued data, and the GNF models a dis-\ntribution over a structured, continuous space over sets of variables. We combine this with a novel\npermutation-invariant graph auto-encoder to generate embeddings that are decoded into an adjacency\nmatrix in a similar manner to [15, 19]. 
The result is a fully permutation-invariant model that achieves
competitive results compared to GraphRNN [30], while being better suited to parallel computing
architectures.

2 Background

2.1 Graph Neural Networks

Notation: A graph is defined as G = (H, Ω), where H ∈ R^{N × d_n}, H = (h^(1), ..., h^(N)) is the
node feature matrix consisting of node features, of size d_n, for each of the N nodes (h^(v) for node
v) in the graph. Ω ∈ R^{N × N × (d_e + 1)} is the edge feature matrix for the graph. The first channel of Ω
is the adjacency matrix of the graph (i.e., Ω_{i,j,0} = 1 if e_{ij} is an edge in the graph). The rest of the
matrix, Ω_{i,j,1:(d_e+1)}, is the set of edge features of size d_e for each possible edge (i, j) in the graph.

Graph Neural Networks (GNNs) or Message Passing Neural Nets (MPNNs) [6] are a generalization/unification of a number of neural net architectures on graphs used in the literature for a variety of
tasks ranging from molecular modeling to network relational modeling. In general, MPNNs have two
phases in the forward pass – a message passing (MP) phase and a readout (R) phase. The MP phase
runs for T time steps, t = 1, . . . , T, and is defined in terms of message generation functions M_t and
vertex update functions U_t. During each step in the message passing phase, hidden node features
h^(v)_t at each node in the graph are updated based on messages m^(v)_{t+1} according to

    m^(v)_{t+1} = Agg({ M_t(h^(v)_t, h^(u)_t, Ω_{u,v}) }_{u ∈ N(v)})    (1)

    h^(v)_{t+1} = U_t(h^(v)_t, m^(v)_{t+1})    (2)

where Agg is an aggregation function (e.g., sum), and N(v) denotes the set of neighbours to node v
in the graph.
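For illustration, one MP step of Equations (1)–(2) with sum aggregation can be sketched in NumPy as follows; the toy M_t and U_t below stand in for learned networks (they are illustrative assumptions, not the architecture used in our experiments), and edge features Ω are omitted:

```python
import numpy as np

def mp_step(H, A, M_t, U_t):
    """One message passing step (Equations 1-2) with Agg = sum.

    H: (N, d) node features h_t; A: (N, N) binary adjacency matrix;
    M_t(h_v, h_u) builds the message from neighbour u to node v;
    U_t(H, m) produces the updated node features h_{t+1}.
    """
    N = H.shape[0]
    messages = np.zeros_like(H)
    for v in range(N):
        for u in np.flatnonzero(A[v]):        # u in N(v)
            messages[v] += M_t(H[v], H[u])    # sum aggregation
    return U_t(H, messages)

# Toy stand-ins for the learned message/update functions.
M_t = lambda h_v, h_u: np.tanh(h_u)           # message depends on the sender
U_t = lambda H, m: H + m                      # residual-style update

A = np.array([[0, 1], [1, 0]])                # two connected nodes
H = np.ones((2, 3))
H_next = mp_step(H, A, M_t, U_t)              # each entry: 1 + tanh(1)
```

In practice M_t and U_t are neural networks and the loops are batched as (sparse) matrix products; the Python loops here are only for clarity.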
The R phase converts the final node embeddings at MP step T into task-specific features
by, e.g., max-pooling.

One particularly useful aggregation function is graph attention [26], which uses attention [2, 25] to
weight the messages from adjacent nodes. This involves computing an attention coefficient α between
adjacent nodes using a linear transformation W, an attention mechanism a, and a nonlinearity σ:

    e^(v,u)_{t+1} = a(W h^(v)_t, W h^(u)_t),    α^(v,u)_{t+1} = exp(e^(v,u)_{t+1}) / Σ_{w ∈ N(v)} exp(e^(v,w)_{t+1})

    m^(v)_{t+1} = σ( Σ_{u ∈ N(v)} α^(v,u)_{t+1} M(h^(v)_t, h^(u)_t, Ω_{u,v}) )

Multi-headed attention [25] applies attention with multiple weights W and concatenates the results.

2.2 Normalizing Flows

Normalizing flows (NFs) [22, 3, 4] are a class of generative models that use invertible mappings to
transform an observed vector x ∈ R^d to a latent vector z ∈ R^d using a mapping function z = f(x)
with inverse x = f^{-1}(f(x)). The change of variables formula relates a density function over x,
P(x), to one over z by

    P(z) = P(x) |∂f(x)/∂x|^{-1}

With a sufficiently expressive mapping, NFs can learn to map a complicated distribution into one
that is well modeled as a Gaussian; the key is to find a mapping that is expressive, but with an
efficiently computable determinant. We base our formulation on non-volume preserving flows, a.k.a.
RealNVP [4]. Specifically, the affine coupling layer involves partitioning the dimensions of x into
two sets of variables, x^(0) and x^(1), and mapping them onto variables z^(0) and z^(1) by

    z^(0) = x^(0)
    z^(1) = x^(1) ⊙ exp(s(x^(0))) + t(x^(0))

where s and t are nonlinear functions and ⊙ is the Hadamard product.
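The affine coupling layer above, together with its exact inverse and log-determinant, can be sketched as follows (a minimal illustration with placeholder s and t networks, not a full RealNVP):

```python
import numpy as np

def coupling_forward(x0, x1, s, t):
    """z0 = x0;  z1 = x1 * exp(s(x0)) + t(x0)  (elementwise)."""
    return x0, x1 * np.exp(s(x0)) + t(x0)

def coupling_inverse(z0, z1, s, t):
    """Exact inverse: x1 = (z1 - t(z0)) * exp(-s(z0))."""
    return z0, (z1 - t(z0)) * np.exp(-s(z0))

def coupling_logdet(x0, s):
    """The Jacobian is triangular, so log|det J| = sum(s(x0))."""
    return float(np.sum(s(x0)))

s = lambda x: np.tanh(x)      # placeholder scale network
t = lambda x: 0.5 * x         # placeholder translation network

x0 = np.array([0.3, -1.2])
x1 = np.array([2.0, 0.1])
z0, z1 = coupling_forward(x0, x1, s, t)
r0, r1 = coupling_inverse(z0, z1, s, t)   # recovers (x0, x1) exactly
```

Note that s and t never need to be inverted themselves, which is what allows them to be arbitrary neural networks.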
The resulting Jacobian is
lower triangular and its determinant is therefore efficiently computable.

3 Methods

3.1 Reversible Graph Neural Networks (GRevNets)

GRevNets are a family of reversible message passing neural network models. To achieve reversibility,
the node feature matrix of a GNN is split into two parts along the feature dimension, H^(0)_t and H^(1)_t.
For a particular node v in the graph, the two parts of its features at time t in the message passing
phase are called h^(0)_t and h^(1)_t respectively, such that h^(v)_t = concat(h^(0)_t, h^(1)_t).

One step of the message passing procedure is broken down into two intermediate steps, each
of which is denoted as a half-step. F1(·), F2(·), G1(·), and G2(·) denote instances of the MP
transformation given in Equations (1) and (2), with F1/G1 used for scaling and F2/G2 for
translation. These functions consist of applying M_t and then U_t to
one set of the partitioned features, given the graph adjacency matrix Ω.
Figure 1 depicts the procedure
in detail:

    H^(0)_{t+1/2} = H^(0)_t ⊙ exp(F1(H^(1)_t)) + F2(H^(1)_t),    H^(1)_{t+1/2} = H^(1)_t
    H^(1)_{t+1} = H^(1)_{t+1/2} ⊙ exp(G1(H^(0)_{t+1/2})) + G2(H^(0)_{t+1/2}),    H^(0)_{t+1} = H^(0)_{t+1/2}    (3)

This architecture is easily reversible given H^(0)_{t+1} and H^(1)_{t+1}, with the reverse procedure given by

    H^(0)_{t+1/2} = H^(0)_{t+1},    H^(1)_{t+1/2} = (H^(1)_{t+1} − G2(H^(0)_{t+1/2})) ⊙ exp(−G1(H^(0)_{t+1/2}))
    H^(1)_t = H^(1)_{t+1/2},    H^(0)_t = (H^(0)_{t+1/2} − F2(H^(1)_t)) ⊙ exp(−F1(H^(1)_t))    (4)

3.2 GNFs for Structured Density Estimation

In the same spirit as NFs, we can use the change of variables to give us the rule for exact density
transformation. If we assume Ht ∼ P(Ht), then the density in terms of P(Ht−1) is given by

    P(H_{t−1}) = P(H_t) |det(∂H_t/∂H_{t−1})|,  so that  P(G) = P(H_T) |det(∂H_T/∂H_0)| = P(H_T) ∏_{t=1}^{T} |det(∂H_t/∂H_{t−1})|

Figure 1: Architecture of 1 step of message passing in a GRevNet: H^(0)_t, H^(1)_t denote the two parts
of the node features of a particular node. F1(·), F2(·) and G1(·), G2(·) are 1-step MP transforms
consisting of applying M_t and U_t once each, with F1, G1 performing scaling and F2, G2 performing
translation.

Figure 2: A summary of our graph generation pipeline using GNFs. The learned node features X
from the auto-encoder are used to train the GNF.
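Equations (3)–(4) can be sketched directly in NumPy; the F/G functions below are toy stand-ins for the 1-step MP transforms (an illustrative assumption, not the networks used in our experiments), and the round trip shows that the inverse recovers the inputs exactly, which is what lets GRevNets recompute rather than store hidden states:

```python
import numpy as np

def grevnet_forward(H0, H1, F1, F2, G1, G2):
    """One reversible MP step: two affine coupling half-steps."""
    H0_half = H0 * np.exp(F1(H1)) + F2(H1)              # update H^(0)
    H1_next = H1 * np.exp(G1(H0_half)) + G2(H0_half)    # update H^(1)
    return H0_half, H1_next                             # H^(0)_{t+1}, H^(1)_{t+1}

def grevnet_inverse(H0_next, H1_next, F1, F2, G1, G2):
    """Recover (H^(0)_t, H^(1)_t) exactly; no stored activations needed."""
    H0_half = H0_next
    H1 = (H1_next - G2(H0_half)) / np.exp(G1(H0_half))
    H0 = (H0_half - F2(H1)) / np.exp(F1(H1))
    return H0, H1

# Toy stand-in for a 1-step MP transform over the half-features.
f = lambda X: np.tanh(X.mean(axis=1, keepdims=True))

rng = np.random.default_rng(0)
H0, H1 = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
H0n, H1n = grevnet_forward(H0, H1, f, f, f, f)
H0r, H1r = grevnet_inverse(H0n, H1n, f, f, f, f)  # == (H0, H1)
```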
At generation time, the GNF generates node features\nwhich are then fed into the decoder to get the predicted adjacency matrix.\n\nwith H0 being the input node features. The Jacobians are given by lower triangular matrices, hence\nmaking density computations tractable.\n\nprior P (HT ) = QN\n\nGNFs can model expressive distributions in continuous spaces over sets of vectors. We choose the\ni=1 N (hi|0, I) to be a product of independent, standard Gaussian distributions.\nSampling simply involves sampling a set of Gaussian vectors and running the inverse mapping. One\nfree variable is the number of nodes that must be generated before initiating message passing. We\nsimply model this as a \ufb01xed prior P (N ), where the distribution is given by the empirical distribution\nin the training set. Sampling graphs uniformly from the training set is equivalent to sampling N from\nthis distribution, and then sampling G uniformly from the set of training graphs with N nodes.\n\nNotice that the graph message passing induces dependencies between the nodes in the input space,\nwhich is re\ufb02ected in the Jacobian. This also allows us to cast the RealNVP in the GNF framework:\nsimply remove the edges from the graph so that all nodes become independent. Then each node\ntransformation will be a sequence of reversible non-linear mappings, with no between-node depen-\ndencies3. We use this as a baseline to demonstrate that the GNF bene\ufb01ts when the nodes must model\ndependencies between each other.\n\n3This formulation is speci\ufb01cally applicable to unstructured vector spaces, as opposed to images, which would\n\ninvolve checkerboard partitions and other domain-speci\ufb01c heuristics.\n\n4\n\n\fIn the absence of a known graph structure, as is the case for generation, we use a fully connected\ngraph neural network. This allows the model to learn how to organize nodes in order to match a\nspeci\ufb01c distribution. 
However, this poses a problem for certain aggregation functions like sum and\nmean. In the sum case, the message variance will increase with the number of nodes, and in both sum\nand mean cases, the messages from each node will have to contend with the messages from every\nother node. If there is a salient piece of information being sent from one node to another, then it could\nget drowned out by less informative messages. Instead, we opt to use graph attention as discussed in\nSection 2.1. This allows each node to choose the messages that it deems to be the most informative.\n\nThe result of using a fully connected graph is that the computational cost of message passing is\nO(N 2), similar to the GraphRNN. However, each step of the GNF is expressible in terms of matrix\noperations, making it more amenable to parallel architectures. This is a similar justi\ufb01cation for using\ntransformers over RNNs [25].\n\n3.3 Graph Auto-Encoders\n\nWhile GNFs are expressive models for structured, continuous spaces, our objective is to train a\ngenerative model of graph structures, an inherently discrete problem. Our strategy to solve this is to\nuse a two-step process: (1) train a permutation invariant graph auto-encoder to create a graph encoder\nthat embeds graphs into a continuous space; (2) train a GNF to model the distribution of the graph\nembeddings, and use the decoder to generate graphs. Each stage is trained separately. A similar\nstrategy has been employed in prior work on generative models in [15, 19].\n\nNote that in contrast to the GraphVAE [12], which generates a single vector to model the entire\ngraph, we instead embed a set of nodes in a graph jointly, but each node will be mapped to its own\nembedding vector. 
This avoids the issue of having to run a matching process in the decoder.\n\nThe graph auto-encoder takes in a graph G and reconstructs the elements of the adjacency matrix,\nA, where Aij = 1 if node vi has an edge connecting it to node vj , and 0 otherwise. We focus on\nundirected graphs, meaning that we only need to predict the upper (or lower) triangular portion of A,\nbut this methodology could easily extend to directed graphs.\nThe encoder takes in a set of node features H \u2208 RN \u00d7d and an adjacency matrix A \u2208 {0, 1}N \u00d7 N\n2 ( N\n2\nsince the graph is undirected) and outputs a set of node embeddings X \u2208 RN \u00d7k. The decoder takes\nthese embeddings and outputs a set of edge probabilities \u02c6A \u2208 [0, 1]N \u00d7 N\n2 . For parameters \u03b8, we use\nthe binary cross entropy loss function,\n\nL(\u03b8) = \u2212\n\nN\n\nN\n2\n\nXi=1\n\nXj=1\n\nAij log( \u02c6Aij) + (1 \u2212 Aij) log(1 \u2212 \u02c6Aij).\n\n(5)\n\nWe use a relatively simple decoder. Given node embeddings xi and xj , our decoder outputs the edge\nprobability as\n\nwhere C is a temperature hyperparameter, set to 10 in our experiments. This re\ufb02ects the idea that\nnodes that are close in the embedding space should have a high probability of being connected.\n\nThe encoder is a standard GNN with multi-head dot-product attention, that uses the adjacency matrix\nA as the edge structure (and no additional edge features). In order to break symmetry, we need some\nway to distinguish the nodes from each other. If we are just interested in learning structure, then we\ndo not have access to node features, only the adjacency matrix. In this case, we generate node features\nH using random Gaussian variables hi \u223c N (0, \u03c32I), where we use \u03c32 = 0.3. This allows the graph\nnetwork to learn how to appropriately separate and cluster nodes according to A. We generate a new\nset of random features each time we encode a graph. 
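A minimal NumPy sketch of this distance-based decoder (with the C = 10 temperature) and the cross-entropy objective of Equation (5); an illustration, not our actual implementation:

```python
import numpy as np

def edge_probs(X, C=10.0):
    """Edge probability as a sigmoid of squared embedding distance:
    close embeddings give probabilities near 1, distant ones near 0."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return 1.0 / (1.0 + np.exp(C * (sq - 1.0)))

def bce_loss(A, A_hat, eps=1e-9):
    """Binary cross-entropy over the upper triangle (undirected graph)."""
    iu = np.triu_indices(A.shape[0], k=1)
    a, p = A[iu], A_hat[iu]
    return float(-np.sum(a * np.log(p + eps) + (1 - a) * np.log(1 - p + eps)))

# Two nearby embeddings and one distant embedding.
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
P = edge_probs(X)          # P[0,1] is near 1; P[0,2] and P[1,2] near 0
A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
loss = bce_loss(A, P)      # small, since the predictions match A
```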
This way, the graph can only rely on the features
to break symmetry, and must rely on the graph structure to generate a useful encoding.

Putting the GNF together with the graph encoder, we map training graphs from H to X and use this
as training inputs for the GNF. Generating involves sampling Z ∼ N(0, I) followed by inverting the
GNF, X = f^{-1}(Z), and finally decoding X into A and thresholding to get binary edges.

4 Supervised Experiments

In this section we study the capabilities of the supervised GNF, the GRevNet architecture.

    Â_{ij} = 1 / (1 + exp(C(‖x_i − x_j‖²₂ − 1)))    (6)

Datasets/Tasks: We experiment on two types of tasks. Transductive learning tasks consist of
semi-supervised document classification in citation networks (Cora and Pubmed datasets), where we
test our model with the authors' dataset splits [28], as well as a 1% train split for a fair comparison
against [18]. Inductive learning tasks consist of the PPI (Protein-Protein Interaction) dataset [31]
and the QM9 molecule property prediction dataset [21]. For transductive learning tasks we report
classification accuracy; for PPI we report Micro F1 score; and for QM9, Mean Absolute Error (MAE).
More details on datasets are provided in the supplementary material.

Baselines: We compare GRevNets to: (1) a vanilla GNN architecture with an identical architecture
and the same number of message-passing steps; (2) Neumann-RBP [18], which, to the best of our
knowledge, is the state-of-the-art in the domain of memory-efficient GNNs.

4.1 Performance on benchmark tasks

Table 1 compares GRevNets to GNNs and Neumann RBP (NRBP). 1% train uses 1% of the data
for training to replicate the settings in [18]. For these, we provide average numbers for GNN and
GRevNet and best numbers for NRBP. The GRevNet architecture is competitive with a standard
GNN, and outperforms NRBP.

We also compare GRevNets to the GAT architecture [26].
While GAT outperforms the naive GRevNet,
we find that converting GAT into a reversible model by using it as the F and G functions within a
GRevNet (GAT-GRevNet) only leads to a small drop in performance while allowing the benefits of a
reversible model.

Dataset/Task                 GNN     GRevNet   Neumann RBP   GAT    GAT-GRevNet
Cora (Semi-Supervised)       71.9    74.5      56.5          83.0   82.7
Cora (1% Train)              55.5    55.8      54.6          –      –
Pubmed (Semi-Supervised)     76.3    76.0      62.4          79.0   78.6
Pubmed (1% Train)            76.6    77.0      58.5          –      –
PPI (Inductive)              0.78    0.76      0.70          –      –

Model      mu      alpha   HOMO    LUMO    gap     R2
GNN        0.474   0.421   0.097   0.124   0.170   27.150
GRevNet    0.462   0.414   0.098   0.124   0.169   26.380

Model      ZPVE    U0      U       H       G       Cv
GNN        0.035   0.410   0.396   0.381   0.373   0.198
GRevNet    0.036   0.390   0.407   0.418   0.359   0.195

Table 1: Top: performance in terms of accuracy (Cora, Pubmed) and Micro F1 scores (PPI). For
GNN and GRevNet, the number of MP steps is fixed to 4. For Neumann RBP, we use 100 steps of MP.
For GAT and GAT-GRevNet, we use 8 steps of MP. These values are averaged over 3-5 runs
with different seeds. Bottom: performance in terms of Mean Absolute Error (lower is better) for
independent regression tasks on the QM9 dataset. The number of MP steps is fixed to 4. The model was
trained for 350k steps, as in [6].

4.2 Analysis of Memory Footprint

We first provide a more rigorous theoretical derivation of the memory footprint and then provide
some quantitative results. Let us assume that the node feature dimension is d, and the maximum
number of nodes in a graph is N. Let us assume the weights (parameters) of the message passing function
form a matrix of size W. For simplicity, assume a parameter-free aggregation function that sums over
messages from neighbouring nodes.
Finally, assume that the final classifier weights are C in size.
Suppose we run K message passing steps. The total memory that needs to be allocated for a run of a GNN
(ignoring gradients for now; gradients scale by a similar factor) is W + C + K × N × d
(memory allotted to the weights, the intermediate graph-sized tensors generated, and the adjacency matrix). For
a GNF, the total memory is W + C + N × d. Note the lack of multiplicative dependence on the
number of message passing steps in the latter term.

MODEL     MOG (NLL)   MOG RING (NLL)   6-HALF MOONS (NLL)
REALNVP   4.2         5.2              -1.2
GNF       3.6         4.2              -1.7

Table 2: Per-node negative log likelihoods (NLL) on synthetic datasets for REALNVP and GNF.

As a quantitative example, consider a semi-supervised classification task on the Pubmed network
(N = 19717, d = 500). We assume that the message passing function for a GNN is as follows:
FC(500) → ReLU() → FC(750) → ReLU() → FC(500). Each of the functions F1(·) and
F2(·) (see Figure 1 for notation) in the corresponding GNF has the following
architecture: FC(250) → FC(750) → FC(250). We can compute the total memory allocated to
weights/parameters: W_GNN = 500 × 750 + 750 × 500, W_GNF = 2 × (250 × 750 + 750 × 250).
We perform K = 5 message passing steps for Pubmed. So, the amount of memory allocated to
intermediate tensors in a GNN is 19717 × 500 × 5 + 19717 × 750 × 5, and correspondingly for a
GNF it is 19717 × 500. Summing up, the overall memory requirements are: GNN = 945.9 M and GNF
= 80.2 M. Hence, in this case, GNFs are more than 10× more memory-efficient than GNNs. Further, we
use self-attention in our experiments, which scales according to O(N²). GNNs will store attention
affinity matrices for each message passing step.
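The activation-count arithmetic above can be checked with a short script (counting stored tensor entries only; exact megabyte figures additionally depend on the data type, gradients, and framework overhead):

```python
# Pubmed example: intermediate activations kept for backpropagation.
N, K = 19717, 5              # nodes, message passing steps
d_in, d_hidden = 500, 750    # widths in FC(500) -> ReLU -> FC(750) -> ReLU -> FC(500)

gnn_acts = K * N * (d_in + d_hidden)   # every step's tensors are stored
gnf_acts = N * d_in                    # reversible: only the current states
ratio = gnn_acts / gnf_acts            # more than 10x activation savings
```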
In this case, a similar argument can show that this\ncauses a difference of 11G memory. When using 12G GPU machines, this difference is signi\ufb01cant.\n\n5 Unsupervised Experiments\n\n5.1 Structured Density Estimation\n\nWe compare the performance of GNFs with RealNVP for structured density estimation on 3 datasets.\nDetails of the model architecture can be found in the supplementary material.\n\nDatasets. The \ufb01rst dataset is MIXTURE OF GAUSSIANS (MOG), where each training example is a\nset of 4 points in a square con\ufb01guration. Each point is drawn from a separate isotropic Gaussian, so\nno two points should land in the same area. MIXTURE OF GAUSSIANS RING (MOG RING) takes\neach example from MOG and rotates it randomly about the origin, creating an aggregate training\ndistribution that forms a ring. 6-HALF MOONS interpolates the original half moons dataset using 6\npoints with added noise.\n\nResults. Our results are shown in Table 2. We outperform REALNVP on all three datasets. We\nalso compare the generated samples of the two models on the MOG dataset in Figure 3.\n\n(a) Training examples\n\n(b) GNF samples\n\n(c) RealNVP samples\n\nFigure 3: (a) shows the aggregate training distribution for the MOG dataset in gray, as well as 5\nindividual training examples. Each training example is shown in a different color and is a structured\nset of nodes where each node is drawn from a different Gaussian. (b) and (c) each show 5 generated\nsamples from GNF and RealNVP, selected randomly. Each sample is shown in a different color. Note\nthat, GNF learns to generate structured samples where each node resembles a sample from a different\nGaussian, while RealNVP by design cannot model these dependencies. Best viewed in color.\n\n5.2 Graph Generation\n\nBaselines. We compare our graph generation model on two datasets, COMMUNITY-SMALL and\nEGO-SMALL from GraphRNN [30]. 
COMMUNITY-SMALL is a procedurally-generated set of 100 2-community
graphs, where 12 ≤ |V| ≤ 20. EGO-SMALL is a set of 200 graphs, where 4 ≤ |V| ≤ 18,
drawn from the larger Citeseer network dataset [24]. For all experiments described in this section, we
used scripts from the GraphRNN codebase [29] to generate and split the data. 80% of the data was
used for training and the remainder for testing.

                  BINARY CE          TOTAL # INCORRECT EDGES   TOTAL # EDGES
DATASET           TRAIN     TEST     TRAIN       TEST          TRAIN    TEST
EGO-SMALL         9.8E-04   11E-04   24          32            3758     984
COMMUNITY-SMALL   5E-04     7E-04    10          2             1329     353

Table 3: Train and test binary cross-entropy (CE) as described in Equation (5), averaged over the total
number of nodes. TOTAL # INCORRECT EDGES measures the number of incorrect edge predictions
(either missing or extraneous) in the reconstructed graphs over the entire dataset. TOTAL # EDGES
lists the total number of edges in each dataset. As we use Gaussian noise for initial node features, we
averaged 5 runs of our model to obtain these metrics.

5.2.1 Graph Auto-Encoder

We first train a graph auto-encoder with attention, as described in Section 3.3. Every training epoch,
we generate new Gaussian noise features for each graph as input to the encoder. The GNN consists of
10 MP steps, where each MP step uses a self-attention module followed by a multi-layer perceptron.
Additional details can be found in the supplementary material.

Table 3 shows that our auto-encoder generalizes well to unseen test graphs, with a small gap between
train and test cross-entropy.
The total # of incorrect edges metric shows that the model achieves good\ntest reconstruction on EGO-SMALL and near-perfect test reconstruction on COMMUNITY-SMALL.\n\n5.2.2 Graph Normalizing Flows for Permutation Invariant Graph Generation\n\nOur trained auto-encoder gives us a distribution over node embeddings that are useful for graph\nreconstruction. We then train a GNF to maximize the likelihood of these embeddings using an\nisotropic Gaussian as the prior. Once trained, at generation time the model \ufb02ows N random Gaussian\nembeddings sampled from the prior to N node embeddings that describe a graph adjacency when run\nthrough the decoder.\n\nOur GNF consists of 10 MP steps with attention and an MLP for each of F1, F2, G1, and G2. For\nmore details on the architecture see the supplementary material.\n\nEvaluating Generated Graphs. We evaluate our model by providing visual samples and by using\nthe quantitative evaluation technique in GraphRNN [30], which calculates the MMD distance [9]\nbetween the generated graphs and a previously unseen test set on three statistics based on degrees,\nclustering coef\ufb01cients, and orbit counts. We use the implementation of GraphRNN provided by the\nauthors to train their model and their provided evaluation script to generate all quantitative results.\n\nIn [29], the MMD evaluation was performed by using a test set of N ground truth graphs, computing\ntheir distribution over |V |, and then searching for a set of N generated graphs from a much larger\nset of samples from the model that closely matches this distribution over |V |. These results tend to\nexhibit considerable variance as the graph test sets were quite small.\n\nTo achieve more certain trends, we also performed an evaluation by generating 1024 graphs for each\nmodel and computing the MMD distance between this generated set of graphs and the ground truth\ntest set. We report both evaluation settings in Table 4. 
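As an illustration of the metric, a biased MMD² estimate between two sets of per-graph statistics (e.g., fixed-length degree histograms) with a Gaussian kernel can be sketched as follows; the actual GraphRNN evaluation uses its own kernels and statistics [9, 30], so this is only a schematic:

```python
import numpy as np

def gaussian_mmd2(X, Y, sigma=1.0):
    """Biased estimate of MMD^2 between samples X and Y (rows = graphs,
    columns = a fixed-length statistic such as a degree histogram)."""
    def k(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

rng = np.random.default_rng(0)
stats = rng.normal(size=(32, 4))       # stand-in graph statistics
shifted = stats + 2.0                  # a clearly different distribution
```

Identical sample sets give an MMD² of zero, while distribution shift drives it up, which is how the metric separates good generators from bad ones.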
We also report results directly from [30] on two
other graph generation models, GRAPHVAE and DEEPGMG, evaluated on the same graph datasets.

Results. We provide a visualization of generated graphs from GRAPHRNN and GNF in Figure 4.
As shown in Table 4, GNF outperforms GRAPHVAE and DEEPGMG, and is competitive with
GRAPHRNN. Error margins for GNF and GRAPHRNN and a larger set of visualizations are
provided in the supplementary material.

6 Conclusion
We propose GNFs, normalizing flows using GNNs based on the RealNVP, by making the message
passing steps reversible. In the supervised case, reversibility allows for backpropagation without

                  COMMUNITY-SMALL            EGO-SMALL
MODEL             DEGREE  CLUSTER  ORBIT     DEGREE  CLUSTER  ORBIT
GRAPHVAE          0.35    0.98     0.54      0.13    0.17     0.05
DEEPGMG           0.22    0.95     0.4       0.04    0.10     0.02
GRAPHRNN          0.08    0.12     0.04      0.09    0.22     0.003
GNF               0.20    0.20     0.11      0.03    0.10     0.001
GRAPHRNN(1024)    0.03    0.01     0.01      0.04    0.05     0.06
GNF(1024)         0.12    0.15     0.02      0.01    0.03     0.0008

Table 4: Graph generation results depicting MMD for various graph statistics between the test set
and generated graphs. GRAPHVAE and DEEPGMG are reported directly from [30]. The second set
of results (GRAPHRNN, GNF) was obtained with the GraphRNN evaluation scheme with node-distribution
matching turned on; we trained 5 separate models of each type and performed 3 trials
per model, then averaged the result over the 15 runs. The third set of results (GRAPHRNN (1024), GNF
(1024)) was obtained by evaluating all 1024 generated graphs against the test set (no sub-sampling
of the generated graphs based on node similarity). In this case, we trained and evaluated the result
over 5 separate runs per model.

(a) Training data    (b) GNF samples    (c) GRAPHRNN samples

Figure 4: Dataset examples and samples, drawn randomly, from the generative models.
Top row:\nEGO-SMALL, bottom row: COMMUNITY-SMALL.\n\nthe need to store hidden node states. This provides signi\ufb01cant memory savings, further pushing the\nscalability limits of GNN architectures. On several benchmark tasks, GNFs match the performance of\nGNNs, and outperform Neumann RBP. In the unsupervised case, GNFs provide a \ufb02exible distribution\nover a set of continuous vectors. Using the pre-trained embeddings of a novel graph auto-encoder,\nwe use GNFs to learn a distribution over the embedding space, and then use the decoder to generate\ngraphs. This model is permutation invariant, yet competitive with the state-of-the-art auto-regressive\nGraphRNN model. Future work will focus on applying GNFs to larger graphs, and training the GNF\nand auto-encoder in an end-to-end approach.\n\n7 Acknowledgements\n\nWe would like to thank Harris Chan, William Chan, Steve Kearnes, Renjie Liao, and Mohammad\nNorouzi for their helpful discussions and feedback, as well as Justin Gilmer for his help with the\nMPNN code.\n\n9\n\n\fReferences\n\n[1] L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment.\nIn M. Caudil and C. Butler, editors, Proceedings of the IEEE First International Conference on Neural\nNetworks San Diego, CA, pages 609\u2013618, 1987.\n\n[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning\n\nto align and translate. International Conference on Learning Representations, 2015.\n\n[3] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation.\n\narXiv preprint arXiv:1410.8516, 2014.\n\n[4] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. ICLR, 2017.\n\n[5] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Al\u00e1n\nAspuru-Guzik, and Ryan P Adams. 
Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[6] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, volume 70, pages 1263–1272, 2017.

[7] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. NIPS, 2017. URL http://arxiv.org/abs/1707.04585.

[8] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pages 729–734. IEEE, 2005.

[9] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, March 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2188385.2188410.

[10] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://dblp.uni-trier.de/db/journals/corr/corr1412.html#KingmaB14.

[12] Thomas N. Kipf and Max Welling. Variational graph auto-encoders. CoRR, abs/1611.07308, 2016.

[13] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[14] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017.

[15] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks.
In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1718–1727, 2015.

[16] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[17] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, March 2018.

[18] Renjie Liao, Yuwen Xiong, Ethan Fetaya, Lisa Zhang, KiJung Yoon, Xaq Pitkow, Raquel Urtasun, and Richard Zemel. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, volume 80, pages 3082–3091, 2018.

[19] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016.

[20] Fernando J. Pineda. Generalization of back propagation to recurrent and higher order neural networks. In D. Z. Anderson, editor, Neural Information Processing Systems, pages 602–611. American Institute of Physics, 1988. URL http://papers.nips.cc/paper/67-generalization-of-back-propagation-to-recurrent-and-higher-order-neural-networks.pdf.

[21] Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1:140022, 2014.

[22] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, volume 37, pages 1530–1538, 2015.

[23] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model.
IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[24] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. Technical report, 2008.

[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[26] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.

[27] Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. NerveNet: Learning structured policy with graph neural networks. ICLR, 2018.

[28] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 40–48. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045396.

[29] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Code for GraphRNN: Generating realistic graphs with deep auto-regressive models. https://github.com/JiaxuanYou/graph-generation, 2018.

[30] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5708–5717, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/you18a.html.

[31] Marinka Zitnik and Jure Leskovec.
Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.
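As a note on reproducing the evaluation summarized in Table 4: the reported numbers are MMD [9] values between distributions of graph statistics (degree, clustering coefficient, orbit counts) for test and generated graphs. The sketch below is a minimal, illustrative version of this idea only, computing a biased squared MMD between degree histograms under a Gaussian kernel; the function names, kernel choice, bandwidth, and toy graphs are our own, and the GraphRNN evaluation code [29] uses its own kernels and statistics.

```python
import numpy as np

def degree_histogram(adj, max_degree=10):
    """Normalized degree histogram of a graph given as a 0/1 adjacency matrix."""
    degrees = adj.sum(axis=1).astype(int)
    hist = np.bincount(degrees, minlength=max_degree + 1)[: max_degree + 1]
    return hist / hist.sum()

def gaussian_mmd2(X, Y, sigma=1.0):
    """Biased squared MMD between two sample sets under a Gaussian kernel."""
    def k(A, B):
        # Pairwise squared Euclidean distances, then the Gaussian kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Toy usage: compare degree statistics of two small graph sets.
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # path graph on 3 nodes
tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])   # triangle
ref = np.stack([degree_histogram(g) for g in [path, path]])
gen = np.stack([degree_histogram(g) for g in [tri, tri]])
print(gaussian_mmd2(ref, ref))  # → 0.0 for identical sets
print(gaussian_mmd2(ref, gen))  # positive: degree distributions differ
```

Lower values in Table 4 therefore indicate generated graphs whose statistics better match the test distribution.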