{"title": "Efficient Graph Generation with Graph Recurrent Attention Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4255, "page_last": 4265, "abstract": "We propose a new family of efficient and expressive deep generative models of graphs, called Graph Recurrent Attention Networks (GRANs).\nOur model generates graphs one block of nodes and associated edges at a time.\nThe block size and sampling stride allow us to trade off sample quality for efficiency.\nCompared to previous RNN-based graph generative models, our framework better captures the auto-regressive conditioning between the already-generated and to-be-generated parts of the graph using Graph Neural Networks (GNNs) with attention.\nThis not only reduces the dependency on node ordering but also bypasses the long-term bottleneck caused by the sequential nature of RNNs.\nMoreover, we parameterize the output distribution per block using a mixture of Bernoulli, which captures the correlations among generated edges within the block. \nFinally, we propose to handle node orderings in generation by marginalizing over a family of canonical orderings.\nOn standard benchmarks, we achieve state-of-the-art time efficiency and sample quality compared to previous models.\nAdditionally, we show our model is capable of generating large graphs of up to 5K nodes with good quality.\nOur code is released at: \\url{https://github.com/lrjconan/GRAN}.", "full_text": "Ef\ufb01cient Graph Generation with\n\nGraph Recurrent Attention Networks\n\nRenjie Liao1,2,3, Yujia Li4, Yang Song5, Shenlong Wang1,2,3,\n\nWilliam L. 
Hamilton6,7, David Duvenaud1,3, Raquel Urtasun1,2,3, Richard Zemel1,3,8\n\nUniversity of Toronto1, Uber ATG Toronto2, Vector Institute3,\n\nDeepMind4, Stanford University5, McGill University6,\n\nMila \u2013 Quebec Arti\ufb01cial Intelligence Institute7, Canadian Institute for Advanced Research8\n\n{rjliao, slwang, duvenaud, urtasun, zemel}@cs.toronto.edu\n\n{yujiali, charlienash}@google.com, yangsong@cs.stanford.edu, wlh@cs.mcgill.ca\n\nAbstract\n\nWe propose a new family of ef\ufb01cient and expressive deep generative models\nof graphs, called Graph Recurrent Attention Networks (GRANs). Our model\ngenerates graphs one block of nodes and associated edges at a time. The block size\nand sampling stride allow us to trade off sample quality for ef\ufb01ciency. Compared to\nprevious RNN-based graph generative models, our framework better captures the\nauto-regressive conditioning between the already-generated and to-be-generated\nparts of the graph using Graph Neural Networks (GNNs) with attention. This not\nonly reduces the dependency on node ordering but also bypasses the long-term\nbottleneck caused by the sequential nature of RNNs. Moreover, we parameterize\nthe output distribution per block using a mixture of Bernoulli, which captures\nthe correlations among generated edges within the block. Finally, we propose to\nhandle node orderings in generation by marginalizing over a family of canonical\norderings. On standard benchmarks, we achieve state-of-the-art time ef\ufb01ciency and\nsample quality compared to previous models. Additionally, we show our model is\ncapable of generating large graphs of up to 5K nodes with good quality. Our code\nis released at: https://github.com/lrjconan/GRAN.\n\n1\n\nIntroduction\n\nGraphs are the natural data structure to represent relational and structural information in many\ndomains, such as knowledge bases, social networks, molecule structures and even the structure of\nprobabilistic models. 
The ability to generate graphs therefore has many applications; for example, a generative model of molecular graph structures can be employed for drug design [10, 21, 19, 36], generative models for computation graph structures can be useful in model architecture search [35], and graph generative models also play a significant role in network science [34, 1, 18].

The study of generative models for graphs dates back at least to the early work by Erdős and Rényi [8] in the 1960s. These traditional approaches to graph generation focus on various families of random graph models [38, 8, 13, 34, 2, 1], which typically formalize a simple stochastic generation process (e.g., random, preferential attachment) and have well-understood mathematical properties. However, due to their simplicity and hand-crafted nature, these random graph models generally have limited capacity to model complex dependencies and are only capable of modeling a few statistical properties of graphs. For example, Erdős–Rényi graphs do not have the heavy-tailed degree distribution that is typical of many real-world networks.

More recently, building graph generative models using neural networks has attracted increasing attention [27, 10, 21]. Compared to traditional random graph models, these deep generative models have a greater capacity to learn structural information from data and can model graphs with complicated topology and constrained structural properties, such as molecules.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Several paradigms have been developed in the context of graph generative models. The first category of models generates the components of the graph independently, or with only a weak dependency structure between the generation decisions.
Examples of these models include the variational auto-encoder (VAE) model with convolutional neural networks for sequentialized molecule graphs [10] and Graph VAE models [16, 29, 23]. These models generate the individual entries in the graph adjacency matrix (i.e., edges) independently given the latents; this independence assumption makes the models efficient and generally parallelizable, but it can seriously compromise the quality of the generated graphs [21, 37].

The second category of deep graph generative models makes auto-regressive decisions when generating graphs. By modeling graph generation as a sequential process, these approaches naturally accommodate complex dependencies between generated edges. Previous work on auto-regressive graph generation utilizes recurrent neural networks (RNNs) on domain-specific sequentializations of graphs (e.g., SMILES strings) [27], as well as auto-regressive models that sequentially add nodes and edges [21, 37] or small graph motifs [14] (via junction-tree VAEs). However, a key challenge in this line of work is finding a way to exploit the graph structure during the generation process. For instance, while applying RNNs to SMILES strings (as in [27]) is computationally efficient, this approach is limited by the domain-specific sequentialization of the molecule graph structure. A similar issue arises in junction-tree VAEs that rely on small molecule-specific graph motifs.
On the other hand, previous work that exploits general graph structures using graph neural networks (GNNs) [21] does not scale well, with a maximum generated graph size not exceeding 100 nodes.

Currently, the most scalable auto-regressive framework that is both general (i.e., not molecule-specific) and able to exploit graph structure is the GraphRNN model [37], where the entries of a graph adjacency matrix are generated sequentially, one entry or one column at a time, through an RNN. Without using GNNs in the loop, these models can scale up significantly, to generate graphs with hundreds of nodes. However, the GraphRNN model has some important limitations: (1) the number of generation steps in the full GraphRNN is still very large (O(N^2) for the best model, where N is the number of nodes); (2) due to the sequential ordering, two nodes nearby in the graph could be far apart in the generation process of the RNN, which presents significant bottlenecks in handling such long-term dependencies. In addition, handling permutation invariance is vital for generative models on graphs, since computing the likelihood requires marginalizing out the possible permutations of the node orderings for the adjacency matrix. This becomes more challenging as graphs scale up, since it is impossible to enumerate all permutations as was done in previous methods for molecule graphs [21]. GraphRNN relies on a random breadth-first search (BFS) ordering of the nodes across all graphs, which is efficient to compute but arguably suboptimal.

In this paper, we propose an efficient auto-regressive graph generative model, called Graph Recurrent Attention Network (GRAN), which overcomes the shortcomings of previous approaches.
In particular:

• Our approach is a generation process with O(N) auto-regressive decision steps, where a block of nodes and associated edges is generated at each step; varying the block size along with the sampling stride allows us to explore the efficiency-quality trade-off.

• We propose an attention-based GNN that better utilizes the topology of the already-generated part of the graph to effectively model complex dependencies between this part and newly added nodes. The GNN reduces the dependency on the node ordering, as it is permutation equivariant w.r.t. the node representations. Moreover, the attention mechanism helps distinguish multiple newly added nodes.

• We parameterize the output distribution per generation step using a mixture of Bernoullis, which can capture the correlation between multiple generated edges.

• We propose a solution to handle node orderings by marginalizing over a family of "canonical" node orderings (e.g., DFS, BFS, or k-core). This formulation has a variational interpretation as adaptively choosing the optimal ordering for each graph.

Altogether, we obtain a model that achieves state-of-the-art performance on standard benchmarks and permits a flexible trade-off between generation efficiency and sample quality.
Moreover, we successfully apply our model to a large graph dataset with graphs of up to 5k nodes, a scale significantly beyond the limits of existing deep graph generative models.

Figure 1: Overview of our model. Dashed lines are augmented edges. Nodes with the same color belong to the same block (block size = 2). In the middle right, for simplicity, we visualize the output distribution as a single Bernoulli where the line width indicates the probability of generating the edge.

2 Model

2.1 Representation of Graphs and the Generation Process

We aim to learn a distribution p(G) over simple graphs, which have at most one edge between any pair of nodes. Given a simple graph G = (V, E) and an ordering \pi over the nodes in the graph, we have a bijection A^\pi \leftrightarrow (G, \pi) between the set of possible adjacency matrices and the set of node-ordered graphs. We use the superscript \pi in A^\pi to emphasize that the adjacency matrix implicitly assumes a node ordering via the order of its rows/columns.

Based on the bijection above, we can model p(G) = \sum_\pi p(G, \pi) = \sum_\pi p(A^\pi) by modeling the distribution p(A^\pi) over adjacency matrices. For undirected graphs, A^\pi is symmetric, thus we can model only its lower triangular part, denoted L^\pi. Intuitively, our approach generates the entries of L^\pi one row (or one block of rows) at a time, producing a sequence of vectors \{L^\pi_i \mid i = 1, \dots, |V|\}, where L^\pi_i \in R^{1 \times |V|} is the i-th row of L^\pi padded with zeros. The process continues until we reach the maximum number of time steps (rows). Once L^\pi is generated, we have A^\pi = L^\pi + (L^\pi)^T.¹

Note that our approach generates all the entries in one row (or one block of rows) in one pass conditioned on the already-generated graph. This significantly shortens the sequence of auto-regressive graph generation decisions, by a factor of O(N), where N = |V|. By increasing the block size and the stride of generation, we trade off model expressiveness for speed. We visualize the overall process per generation step in Figure 1.

2.2 Graph Recurrent Attention Networks

Formally, our model generates one block of B rows of L^\pi at a time. The t-th block contains the rows with indices in b_t = \{B(t-1)+1, \dots, Bt\}. We use bold subscripts to denote a block of rows or vectors, and normal subscripts for individual rows or vectors. The number of steps to generate a graph is therefore T = \lceil N/B \rceil. We factorize the probability of generating L^\pi as

p(L^\pi) = \prod_{t=1}^{T} p(L^\pi_{b_t} \mid L^\pi_{b_1}, \dots, L^\pi_{b_{t-1}}).   (1)

The conditional distribution p(L^\pi_{b_t} \mid L^\pi_{b_1}, \dots, L^\pi_{b_{t-1}}) defines the probability of generating the current block (all edges from the nodes in this block to each other and to the nodes generated earlier) conditioned on the existing graph. RNNs are standard neural network models for handling this type of sequential dependency structure. However, two nodes that are nearby in terms of graph topology may be far apart in the sequential node ordering.
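The block-wise factorization above can be sketched as a simple generation loop; this is a minimal illustration, not the paper's implementation, and `edge_prob_fn` is a hypothetical stand-in for the learned conditional distribution over edges.

```python
import numpy as np

def generate_graph(N, B, edge_prob_fn, rng):
    """Sample a graph by generating B rows of the lower-triangular part L
    of the adjacency matrix per step, i.e. T = ceil(N / B) sequential steps.
    edge_prob_fn is a placeholder for the learned model: given the
    already-generated subgraph and a row index, it returns edge probabilities."""
    L = np.zeros((N, N), dtype=int)
    T = -(-N // B)  # ceil(N / B)
    for t in range(T):
        for i in range(B * t, min(B * (t + 1), N)):
            # Row i may condition on the already-generated graph L[:i, :i].
            probs = edge_prob_fn(L[:i, :i], i)
            L[i, :i] = rng.random(i) < probs
    # Undirected graph: symmetrize the strictly lower-triangular matrix.
    return L + L.T

rng = np.random.default_rng(0)
A = generate_graph(10, 3, lambda subgraph, i: 0.3, rng)
```

Here the constant 0.3 stands in for the model's per-edge probabilities; in GRAN these come from the GNN-based conditional described next.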
This causes the so-called long-term bottleneck for RNNs. To improve the conditioning, we propose to use GNNs rather than RNNs to make the generation decisions for the current block directly depend on the graph structure. We do not carry hidden states of the GNN from one generation step to the next. By doing so, we enjoy parallel training similar to PixelCNN [31], which is more efficient than the training procedure of typical deep auto-regressive models. We now explain the details of one generation step below.

¹One can easily generalize the aforementioned representation to directed graphs by adding one more pass to sequentially generate the upper triangular part.

Node Representation: At the t-th generation step, we first compute the initial node representations of the already-generated graph via a linear mapping,

h^0_{b_i} = W L^\pi_{b_i} + b,   \forall i < t.   (2)

A block L^\pi_{b_i} is represented as a vector [L^\pi_{B(i-1)+1}, \dots, L^\pi_{Bi}] \in R^{BN}, where [\cdot] is the vector concatenation operator and N is the maximum allowed graph size for generation; graphs smaller than this size have their L^\pi_i vectors padded with 0s. These h vectors will then be used as the initial node representations in the GNN, hence the superscript 0. For the current block L^\pi_{b_t}, since we have not generated anything yet, we set h^0_{b_t} = 0. Note also that h_{b_t} \in R^{B \times H} contains a representation vector of size H for each node in the block b_t. In practice, computing h^0_{b_{t-1}} alone at the t-th generation step is enough, as \{h^0_{b_i} \mid i < t-1\} can be cached from previous steps. The main goal of this linear layer is to reduce the embedding size to better handle large-scale graphs.

Graph Neural Networks with Attentive Messages: From these node representations, all the edges associated with the current block are generated using a GNN.
These edges include connections within the block as well as edges linking the current block with the previously generated nodes. For the t-th generation step, we construct a graph G_t that contains the already-generated subgraph of B(t-1) nodes and the edges between these nodes, as well as the B nodes in the block to be generated. For these B new nodes, we add edges to connect them with each other and with the previous B(t-1) nodes. We call these new edges augmented edges and depict them with dashed lines in Figure 1. We then use a GNN on this augmented graph G_t to get updated node representations that encode the graph structure. More concretely, the r-th round of message passing in the GNN is implemented with the following equations:

m^r_{ij} = f(h^r_i - h^r_j),   (3)
\tilde{h}^r_i = [h^r_i, x_i],   (4)
a^r_{ij} = Sigmoid(g(\tilde{h}^r_i - \tilde{h}^r_j)),   (5)
h^{r+1}_i = GRU(h^r_i, \sum_{j \in N(i)} a^r_{ij} m^r_{ij}).   (6)

Here h^r_i is the hidden representation of node i after round r, and m^r_{ij} is the message vector from node i to node j. \tilde{h}^r_i is h^r_i augmented with a B-dimensional binary mask x_i indicating whether node i is among the existing B(t-1) nodes (in which case x_i = 0) or in the new block of B nodes (x_i is a one-of-B encoding of the relative position of node i in this block). a^r_{ij} is an attention weight associated with edge (i, j). The dependence on \tilde{h}^r_i and \tilde{h}^r_j makes it possible for the model to distinguish between existing nodes and nodes in the current block, and to learn different attention weights for different types of edges. Both the message function f and the attention function g are implemented as 2-layer MLPs with ReLU nonlinearities. Finally, the node representations are updated through a GRU, similar to [20], after aggregating all incoming messages through an attention-weighted sum over the neighborhood N(i) of each node i. Note that the node representations h^0_{b_{t-1}} from Eq. 2 are reused as inputs to the GNN in all subsequent generation steps t' \ge t. One can also untie the parameters of the model at each propagation round to improve model capacity.

Output Distribution: After R rounds of message passing, we obtain the final node representation vector h^R_i for each node i, and then model the probability of generating the edges in block L^\pi_{b_t} via a mixture of Bernoulli distributions:

p(L^\pi_{b_t} \mid L^\pi_{b_1}, \dots, L^\pi_{b_{t-1}}) = \sum_{k=1}^{K} \alpha_k \prod_{i \in b_t} \prod_{1 \le j \le i} \theta_{k,i,j},   (7)
\alpha_1, \dots, \alpha_K = Softmax(\sum_{i \in b_t, 1 \le j \le i} MLP_\alpha(h^R_i - h^R_j)),   (8)
\theta_{1,i,j}, \dots, \theta_{K,i,j} = Sigmoid(MLP_\theta(h^R_i - h^R_j)),   (9)

where both MLP_\alpha and MLP_\theta contain two hidden layers with ReLU nonlinearities and have K-dimensional outputs. Here K is the number of mixture components. When K = 1, the distribution degenerates to a Bernoulli, which assumes independence between the potential edges conditioned on the existing graph. This is a strong assumption and may compromise model capacity. We illustrate the single-Bernoulli output distribution in the middle right of Figure 1 using the line width. When K > 1, the generation of individual edges is no longer independent, due to the latent mixture components. The mixture model therefore provides a cheap way to capture dependence in the output distribution: within each mixture component the distribution is fully factorial, and all the mixture components can be computed in parallel.

Block Size and Stride: The main limiting factor for graph generation speed is the number of generation steps T, as the T steps have to be performed sequentially and therefore cannot benefit from parallelization. To improve speed, it is beneficial to use a large block size B, as the number of steps needed to generate a graph of size N is \lceil N/B \rceil.
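Returning to the mixture-of-Bernoulli output distribution above, its likelihood for one block can be sketched as follows. This is a minimal sketch assuming the per-component edge probabilities and mixture weights are already given (in GRAN they come from Eqs. 7-9); it is not the paper's implementation.

```python
import numpy as np

def mixture_bernoulli_log_prob(edges, alpha, theta):
    """Log-likelihood of a binary edge vector under a K-component mixture
    of Bernoullis: edges are independent within a component, but summing
    over components induces correlations between them.
    edges: (E,) binary vector; alpha: (K,) mixture weights; theta: (K, E)."""
    # log p(edges | k) = sum_e log Bernoulli(edges_e; theta[k, e])
    log_pk = (edges * np.log(theta) + (1 - edges) * np.log1p(-theta)).sum(axis=1)
    log_joint = np.log(alpha) + log_pk  # log alpha_k + log p(edges | k)
    m = log_joint.max()
    return m + np.log(np.exp(log_joint - m).sum())  # stable log-sum-exp

alpha = np.array([0.6, 0.4])
theta = np.array([[0.9, 0.1, 0.8],
                  [0.2, 0.7, 0.3]])
lp = mixture_bernoulli_log_prob(np.array([1, 0, 1]), alpha, theta)
```

Note that each component's probability factorizes, so all components can be evaluated in parallel, matching the efficiency argument above.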
On the other hand, as B grows, modeling the generation of large blocks becomes increasingly difficult, and model quality may suffer.

We propose "strided sampling" to allow a trained model to improve its performance without being retrained or fine-tuned. More concretely, after generating a block of size B, we may choose to keep only the first S rows of the block, and in the next step generate another block starting from the (S+1)-th row. We call S (1 \le S \le B) the "stride" of generation, inspired by the stride in convolutional neural networks. The standard generation process corresponds to a stride of S = B. With S < B, neighboring blocks have an overlap of B - S rows, and T = \lfloor (N-B)/S \rfloor + 1 steps are needed to generate a graph of size N. During training, we train with block size B and stride 1, hence learning to generate the next B rows conditioned on all possible subgraphs under a particular node ordering; at test time we can use the model with different stride values. Setting S to B maximizes speed, while using a smaller S can improve quality, as the dependency between the rows in a block can then be modeled over more than one step.

2.3 Learning with Families of Canonical Orderings

Ordering is important for an auto-regressive model. Previous work [21, 37] explored learning and generation under a canonical ordering (e.g., based on BFS), and Li et al. [21] also explored training with a uniform random ordering. Here, we propose to train under a chosen family of canonical orderings, allowing the model to consider multiple orderings with different structural biases while avoiding the intractable full space of factorially many orderings. A similar strategy has been exploited for learning relational pooling functions in [25]. Concretely, we aim to maximize the log-likelihood log p(G) = log \sum_\pi p(G, \pi).
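Given per-ordering log-joints log p(G, \pi) for a candidate set of orderings, the log of their sum can be computed stably with log-sum-exp, and normalizing the same quantities gives a posterior over the orderings. A minimal sketch, assuming the log-joints are precomputed:

```python
import math

def ordering_log_bound(log_joint):
    """log sum_pi p(G, pi) over a set of orderings, via stable log-sum-exp.
    When the set is restricted to a family Q, this is a lower bound on
    the full log-marginal log p(G)."""
    m = max(log_joint)
    return m + math.log(sum(math.exp(x - m) for x in log_joint))

def ordering_posterior(log_joint):
    """Posterior weights p(G, pi) / sum_pi' p(G, pi'): a softmax of the
    per-ordering log-joints."""
    m = max(log_joint)
    w = [math.exp(x - m) for x in log_joint]
    z = sum(w)
    return [x / z for x in w]

log_joint = [-10.0, -12.0]   # hypothetical log p(G, pi) values
bound = ordering_log_bound(log_joint)
weights = ordering_posterior(log_joint)
```

The bound is always at least as large as the best single-ordering term max_\pi log p(G, \pi).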
However, this computation is intractable, since the number of orderings \pi is factorial in the graph size. We therefore limit the set of orderings to a family of canonical orderings Q = \{\pi_1, \dots, \pi_M\}, and instead learn to maximize the lower bound

log p(G) \ge log \sum_{\pi \in Q} p(G, \pi).   (10)

Since Q is a strict subset of the N! orderings, log \sum_{\pi \in Q} p(G, \pi) is a valid lower bound of the true log-likelihood, and it is a tighter bound than any single term log p(G, \pi), which is the objective obtained by picking a single arbitrary ordering and maximizing the log-likelihood under that ordering, a popular training strategy [37, 21]. On the other hand, increasing the size of Q makes the bound tighter. Choosing a set Q of the right size can therefore achieve a good trade-off between tightness of the bound (which usually correlates with better model quality) and computational cost.

Variational Interpretation: This new objective has an intuitive variational interpretation. To see this, we write out the variational evidence lower bound (ELBO) on the log-likelihood,

log p(G) \ge E_{q(\pi|G)}[log p(G, \pi)] + H(q(\pi|G)),   (11)

where q(\pi|G) is a variational posterior over orderings given the graph G. When restricting \pi to a set of M canonical orderings, q(\pi|G) is simply a categorical distribution over M items, and the optimal q^*(\pi|G) can be solved analytically with Lagrange multipliers:

q^*(\pi|G) = p(G, \pi) / \sum_{\pi' \in Q} p(G, \pi'),   \forall \pi \in Q.   (12)

Substituting q^*(\pi|G) into Eq. 11, we recover the objective defined in Eq. 10. In other words, by optimizing the objective in Eq. 10, we are implicitly picking the optimal (combination of) node orderings from the set Q and maximizing the log-likelihood of the graph under this optimal ordering.

Canonical Node Orderings: Different types or domains of graphs may favor different node orderings.
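To make the kind of graph-based orderings used to form Q concrete, here is a minimal pure-Python sketch of two of them (degree-descending and BFS from the largest-degree node) on an adjacency-list representation; tie-breaking by node id is an arbitrary choice for illustration, not the paper's specification.

```python
from collections import deque

def degree_descent(adj):
    """Nodes sorted by degree, descending; ties broken by node id."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))

def bfs_order(adj):
    """BFS ordering rooted at a largest-degree node, with neighbors
    visited in sorted order (an arbitrary deterministic tie-break)."""
    root = degree_descent(adj)[0]
    order, seen, queue = [], {root}, deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in sorted(adj[v]):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return order

# Small example: a 4-node graph given as an adjacency list.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
```

A DFS variant is analogous with a stack in place of the queue, and the k-core ordering sorts nodes by descending core number.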
For example, some canonical orderings are more effective than others in molecule generation, as shown in [21]. Therefore, incorporating prior knowledge into the design of Q can help in domain-specific applications. Since we aim at a universal deep generative model of graphs, we consider the following canonical node orderings, which are based solely on graph properties: the default ordering used in the data², the descending node degree ordering, the orderings of the BFS/DFS tree rooted at the largest-degree node (similar to [37]), and a novel core descending ordering. We present the details of the core node ordering in the appendix. In our experiments, we explore different combinations of orderings from this generic set to form our Q.

3 Related Work

In addition to the graph generation approaches mentioned in Sec. 1, there are a number of other notable works that our research builds upon.

Traditional approaches. Exponential random graph models (ERGMs) [33] are an early approach to learning generative graph models from data. ERGMs rely on an expressive probabilistic model that learns weights over node features to model edge likelihoods, but in practice this approach is limited by the fact that it only captures a set of hand-engineered graph sufficient statistics. The Kronecker graph model [18] relies on Kronecker matrix products to efficiently generate large adjacency matrices. While scalable and able to learn some graph properties (e.g., degree distributions) from data, this approach remains highly constrained in terms of the graph structures it can represent.

Non-autoregressive deep models. A number of approaches have been proposed to improve non-auto-regressive deep graph generative models. For example, Grover et al. [11] use a graph neural network encoder and an iterative decoding procedure to define a generative model over a single fixed set of nodes. Ma et al.
[24], on the other hand, propose to regularize the graph VAE objective with domain-specific constraints; however, as with other previous work on non-autoregressive graph VAE approaches, they are only able to scale to graphs with fewer than one hundred nodes. NetGAN [4] builds a generative adversarial network on top of random walks over the graph; a sampled graph is then constructed from multiple generated random walks. Relying on reversible GNNs and graph attention networks [32], Liu et al. [22] propose a normalizing-flow based prior within the GraphVAE framework. However, the graph in the latent space is assumed to be fully connected, which significantly limits scalability.

Auto-regressive deep models. Among auto-regressive models, Dai et al. [6] chose to model sequentialized graphs using RNNs, and enforced constraints in the output space using attribute grammars to ensure syntactic and semantic validity. Similarly, Kusner et al. [17] predicted the logits of different graph components independently but used syntactic constraints that depend on context to guarantee validity. Jin et al. [14] proposed a different approach to molecule graph generation by converting molecule graphs into equivalent junction trees. This approach works well for molecules but is not efficient for modeling large tree-width graphs. The molecule generation model of Li et al. [19] is the most similar to ours: one column of the adjacency matrix is also generated at each step, and the generation process is modeled as either Markovian or recurrent using a global RNN, which works well for molecule generation. Chu et al. [5] improve GraphRNN with a random-walk based encoder and successfully apply the model to road layout generation.
Unlike these previous works, which focus on generating domain-specific and relatively small graphs, e.g., molecules, we target the problem of efficiently generating large and generic graphs.

4 Experiments

In this section we empirically verify the effectiveness of our model on both synthetic and real graph datasets with drastically varying sizes and characteristics.

²In our case, it is the default ordering used by NetworkX [12].

4.1 Datasets and Evaluation Metrics

Our experimental setup closely follows You et al. [37]. To benchmark our GRAN against the models proposed in the literature, we utilize one synthetic dataset containing random grid graphs and two real-world datasets containing protein graphs and 3D point clouds, respectively.

Datasets: (1) Grid: We generate 100 standard 2D grid graphs with 100 \le |V| \le 400. (2) Protein: This dataset contains 918 protein graphs [7] with 100 \le |V| \le 500. Each protein is represented by a graph, where nodes are amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart. (3) Point Cloud: FirstMM-DB is a dataset of 41 simulated 3D point clouds of household objects [26], with an average graph size of over 1k nodes and a maximum graph size of over 5k nodes. Each object is represented by a graph where nodes represent points, and edges connect the k-nearest neighbors measured by Euclidean distance in 3D space. We use the same protocol as [37] and create random 80%/20% splits of the graphs in each dataset for training and testing. 20% of the training data in each split is used as the validation set. We generate the same number of samples as the test set for each dataset.

Metrics: Evaluating generative models is known to be challenging [30].
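The k-nearest-neighbor construction used for the point cloud dataset above can be sketched as follows; this is a generic illustration (dense distance matrix, symmetrized edges), not the dataset's reference preprocessing code.

```python
import numpy as np

def knn_graph(points, k):
    """Adjacency matrix of a k-nearest-neighbor graph over 3D points,
    with neighbors measured by Euclidean distance. Each node links to its
    k closest points; edges are symmetrized to get an undirected graph."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbor
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:
            A[i, j] = A[j, i] = 1
    return A
```

The O(n^2) distance matrix is fine at this scale of illustration; a spatial index would be used for large clouds.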
Since it is difficult to measure likelihoods for auto-regressive graph generative models that rely on an ordering, we follow [37, 21] and evaluate model performance by comparing the distributions of graph statistics between generated and ground-truth graphs. In previous work, You et al. [37] computed degree distributions, clustering coefficient distributions, and the number of occurrences of all orbits with 4 nodes, and then used the maximum mean discrepancy (MMD) over these graph statistics, relying on Gaussian kernels with the first Wasserstein distance, i.e., the earth mover's distance (EMD). In practice, we found computing this MMD with the Gaussian EMD kernel to be very slow for moderately large graphs. Therefore, we use the total variation (TV) distance, which greatly speeds up the evaluation and remains consistent with EMD. In addition to the node degree, clustering coefficient, and orbit counts (used by [36]), we also compare the spectra of the graphs by computing the eigenvalues of the normalized graph Laplacian (quantized to approximate a probability density). This spectral comparison provides a view of global graph properties, whereas the previous metrics focus on local graph statistics.

4.2 Benchmarking Sample Quality

In the first experiment we compare the quality of our GRAN model against other existing models in the literature, including GraphVAE [29], GraphRNN, and its variant GraphRNN-S [37]. We also add an Erdős–Rényi baseline whose edge probability is estimated via maximum likelihood on the training graphs. For a fair comparison, we keep the data split exactly the same for all methods. We implement a GraphVAE model where the encoder is a 3-layer GCN and the decoder is an MLP with 2 hidden layers. All hidden dimensions are set to 128. For GraphRNN and GraphRNN-S, we use the best settings reported in the paper and re-train the models on our split.
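The statistics entering the TV-based comparison described above can be sketched as follows; a minimal NumPy version assuming dense adjacency matrices (the actual evaluation aggregates such histograms over sets of graphs inside an MMD).

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def degree_hist(A, n_bins):
    """Normalized degree histogram of a graph from its adjacency matrix."""
    hist, _ = np.histogram(A.sum(axis=1), bins=n_bins, range=(0, n_bins))
    return hist / hist.sum()

def laplacian_spectrum(A):
    """Eigenvalues of the symmetric normalized graph Laplacian
    I - D^{-1/2} A D^{-1/2}, which lie in [0, 2]; quantizing them into a
    histogram gives a global, spectral view of the graph."""
    d = A.sum(axis=1).astype(float)
    inv_sqrt = np.zeros_like(d)
    nz = d > 0
    inv_sqrt[nz] = 1.0 / np.sqrt(d[nz])
    L = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return np.linalg.eigvalsh(L)
```

Unlike the degree, clustering, and orbit statistics, the spectrum depends on the whole graph, which is why it is used here as a global complement to the local metrics.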
We also tried to run the DeepGMG model [21] but failed to obtain results in a reasonable amount of time due to its scalability issues on these datasets. For our GRAN, hidden dimensions are set to 128, 512, and 256 on the three datasets, respectively. Block size and stride are both set to 1. The number of Bernoulli mixture components is 20 for all experiments. We stack 7 layers of GNNs and unroll each layer for 1 step. All of our models are trained with the Adam optimizer [15] and a constant learning rate of 1e-4. We choose the best model based on the validation set. The sample evaluation results on the test set are reported in Table 1, with a few sample graphs generated by the models shown in Figure 2. For all metrics, smaller is better. Note that none of the GraphVAE, GraphRNN, and GraphRNN-S models were able to scale to the point cloud dataset due to out-of-memory issues, and the running time of the GraphRNN models becomes prohibitively long for large graphs. Overall, our proposed GRAN model achieves state-of-the-art performance on all benchmarks in terms of sample quality. On the other hand, from Figure 2 we can see that even though the quantitative metrics of GraphRNN are similar to ours, the visual difference in the generated grid graphs is particularly noticeable. This implies that the current set of graph statistics may not give a complete picture of model performance.
We show more visual examples and results on one more synthetic random lobster graph dataset in the appendix.

Figure 2: Visualization of sample graphs generated by different models (rows: Grid, Protein; columns: Train, GraphVAE, GraphRNN, GRAN (Ours)).

Table 1: Comparison with other graph generative models. For all MMD metrics, smaller is better. *: our own implementation; -: not applicable due to memory issues. Deg.: degree distribution, Clus.: clustering coefficients, Orbit: the number of 4-node orbits, Spec.: spectrum of the graph Laplacian.

Grid (|V|max = 361, |E|max = 684, |V|avg ≈ 210, |E|avg ≈ 392)
                Deg.      Clus.     Orbit     Spec.
Erdős-Rényi     0.79      2.00      1.08      0.68
GraphVAE*       7.07e-2   7.33e-2   0.12      1.44e-2
GraphRNN-S      0.13      3.73e-2   0.18      0.19
GraphRNN        1.12e-2   7.73e-5   1.03e-3   1.18e-2
GRAN            8.23e-4   3.79e-3   1.59e-3   1.62e-2

Protein (|V|max = 500, |E|max = 1575, |V|avg ≈ 258, |E|avg ≈ 646)
                Deg.      Clus.     Orbit     Spec.
Erdős-Rényi     5.64e-2   1.00      1.54      9.13e-2
GraphVAE*       0.48      7.14e-2   0.74      0.11
GraphRNN-S      4.02e-2   4.79e-2   0.23      0.21
GraphRNN        1.06e-2   0.14      0.88      1.88e-2
GRAN            1.98e-3   4.86e-2   0.13      5.13e-3

3D Point Cloud (|V|max = 5037, |E|max = 10886, |V|avg ≈ 1377, |E|avg ≈ 3074)
                Deg.      Clus.     Orbit     Spec.
Erdős-Rényi     0.31      1.22      1.27      4.26e-2
GraphVAE*       -         -         -         -
GraphRNN-S      -         -         -         -
GraphRNN        -         -         -         -
GRAN            1.75e-2   0.51      0.21      7.45e-3

4.3 Efficiency vs. Sample Quality

Another important feature of GRAN is its efficiency. In this experiment, we quantify graph generation efficiency and show the efficiency-quality trade-off by varying the generation stride. The main results are reported in Figure 3, where the models are trained with block size 16. We trained our models on grid graphs and evaluated model quality on the validation set. To measure the run time for each setting we used a single GTX 1080Ti GPU.
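The efficiency side of this trade-off comes from simple step-count arithmetic: each generation step advances `stride` rows of the adjacency matrix, so the number of steps shrinks roughly linearly as the stride grows. A minimal sketch of that arithmetic only (the released code additionally handles block boundaries and conditioning on the partial graph):

```python
import math

def generation_steps(n_nodes, stride):
    """Number of block-generation steps for an n-node graph when each
    step adds `stride` new rows of the adjacency matrix."""
    return math.ceil(n_nodes / stride)

# Largest grid graph in the benchmark has 361 nodes.
for stride in (1, 4, 16):
    print(stride, generation_steps(361, stride))
```

With stride 1 the model takes one step per node; with stride 16 the step count drops by roughly 16x, at the cost of the quality degradation reported below.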
We measure the speed improvement by computing the ratio of GraphRNN's average time per graph to ours. GraphRNN takes around 9.5 seconds to generate one grid graph on average. Our best-performing model with stride 1 is about 6 times as fast as GraphRNN; increasing the sampling stride trades off quality for speed. With a stride of 16, our model is more than 80x faster than GraphRNN, but the sample quality is also noticeably worse. We leave the full details of the quantitative results in the appendix.

Figure 3: Efficiency vs. sample quality. The bar and line plots are the MMD (left y-axis) and speed ratio (right y-axis) respectively.

4.4 Ablation Study

In this experiment we isolate the contributions of different parts of our model and present an ablation study. We again trained all models on the random grid graphs and report results on the validation set. From Table 2, we can see that increasing the number of Bernoulli mixtures improves the performance, especially w.r.t. the orbit metric. Note that since grid graphs are somewhat simple in terms of topology, the clustering coefficients and degree distribution are not discriminative enough to reflect the changes.

Table 2: Ablation study on grid graphs. B: block size, K: number of Bernoulli mixtures, π1: DFS, π2: BFS, π3: k-core, π4: degree descent, π5: default.
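The mixture-of-Bernoulli output being ablated here (K components) can be sketched as below. Shapes and names are ours for illustration; in GRAN, both sets of logits are produced by the attentive GNN from node representations.

```python
import numpy as np

def sample_block_edges(mix_logits, edge_logits, rng=None):
    """Sample the edges of one block from a K-component mixture of Bernoulli.

    mix_logits:  shape (K,),   unnormalized mixture weights
    edge_logits: shape (K, E), per-component logits for the E candidate edges

    A single component k is drawn for the whole block, so edges within the
    block are correlated through the shared k, unlike independent Bernoulli.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.exp(mix_logits - mix_logits.max())   # softmax over mixture weights
    k = rng.choice(len(mix_logits), p=w / w.sum())
    theta = 1.0 / (1.0 + np.exp(-edge_logits[k]))  # Bernoulli probabilities
    return (rng.random(theta.shape) < theta).astype(int)
```

With K = 1 this reduces to independent Bernoulli edges; larger K lets the model place probability mass on several distinct edge configurations for the same block, which is what the ablation over K probes.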
We also tried different numbers of mixtures on the protein graphs and confirmed that increasing the number of Bernoulli mixtures brings a significant performance gain. We set it to 20 for all experiments, as this achieves a good balance between performance and computational cost. We also compare different families of canonical node orderings. We found that using all orderings and using the DFS ordering alone are similarly good on grid graphs. Therefore, we choose the DFS ordering due to its lower computational cost. In general, the optimal set of canonical orderings Q is application/dataset dependent. For example, for molecule generation, adding the SMILES string ordering would boost the performance over using DFS alone. More importantly, our learning scheme is simple yet flexible enough to support different choices of Q. Finally, we test different block sizes while fixing the stride at 1. It is clear that the larger the block size, the worse the performance, which indicates that the learning task becomes more difficult.

5 Conclusion

In this paper, we propose the Graph Recurrent Attention Network (GRAN) for efficient graph generation. Our model generates one block of the adjacency matrix at a time through an O(N)-step generation process. The model uses GNNs with attention to condition the generation of a block on the existing graph, and we further propose a mixture-of-Bernoulli output distribution to capture correlations among the edges generated in each step. Varying the block size and sampling stride permits effective exploration of the efficiency-quality trade-off. We achieve state-of-the-art performance on standard benchmarks and show strong results on a large graph dataset whose scale is beyond the limit of other deep graph generative models.
In the future, we would like to explore this model in applications where graphs are latent or partially observed.

Acknowledgments

RL was supported by the Connaught International Scholarship and an RBC Fellowship. WLH was supported by a Canada CIFAR Chair in AI. RL, RU and RZ were supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

References

[1] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.

[2] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[3] Vladimir Batagelj and Matjaz Zaversnik. An O(m) algorithm for cores decomposition of networks. arXiv preprint cs/0310049, 2003.

[4] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. NetGAN: Generating graphs via random walks. arXiv preprint arXiv:1803.00816, 2018.

[5] Hang Chu, Daiqing Li, David Acuna, Amlan Kar, Maria Shugrina, Xinkai Wei, Ming-Yu Liu, Antonio Torralba, and Sanja Fidler. Neural turtle graphics for modeling city road layouts. In ICCV, 2019.

[6] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.

[7] Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without alignments.
Journal of Molecular Biology, 330(4):771–783, 2003.

[8] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.

[9] Solomon W Golomb. Polyominoes: Puzzles, Patterns, Problems, and Packings, volume 16. Princeton University Press, 1996.

[10] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

[11] Aditya Grover, Aaron Zweig, and Stefano Ermon. Graphite: Iterative generative modeling of graphs. arXiv preprint arXiv:1803.10459, 2018.

[12] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States), 2008.

[13] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[14] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.

[15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[16] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.

[17] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.

[18] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: An approach to modeling networks.
Journal of Machine Learning Research, 11(Feb):985–1042, 2010.

[19] Yibo Li, Liangren Zhang, and Zhenming Liu. Multi-objective de novo drug design with conditional graph generative model. Journal of Cheminformatics, 10(1):33, 2018.

[20] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[21] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.

[22] Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, and Kevin Swersky. Graph normalizing flows, 2019.

[23] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Constrained graph variational autoencoders for molecule design. In NIPS, pages 7795–7804, 2018.

[24] Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via regularizing variational autoencoders. arXiv preprint arXiv:1809.02630, 2018.

[25] Ryan L Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Relational pooling for graph representations. arXiv preprint arXiv:1903.02541, 2019.

[26] Marion Neumann, Plinio Moreno, Laura Antanas, Roman Garnett, and Kristian Kersting. Graph kernels for object category prediction in task-dependent robot grasping. In International Workshop on Mining and Learning with Graphs at KDD, 2013.

[27] Marwin HS Segler, Thierry Kogej, Christian Tyrchan, and Mark P Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 4(1):120–131, 2017.

[28] Stephen B Seidman. Network structure and minimum degree. Social Networks, 5(3):269–287, 1983.

[29] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders.
arXiv preprint arXiv:1802.03480, 2018.

[30] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[31] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In NIPS, 2016.

[32] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[33] Stanley Wasserman and Philippa Pattison. Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika, 61(3):401–425, Sep 1996.

[34] Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440, 1998.

[35] Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.

[36] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In NIPS, pages 6410–6421, 2018.

[37] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In ICML, pages 5694–5703, 2018.

[38] George Udny Yule. II.—A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London.
Series B, Containing Papers of a Biological Character, 213(402-410):21–87, 1925.