Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This manuscript describes a new deep learning model, the Graph Recurrent Attention Network (GRAN), for generic graph generation. The model has the following distinctive features: 1) it combines an RNN-based generator with an attention-based GNN that can model complex, long-range dependencies; 2) it needs only O(N) generation steps, where N is the graph size (i.e., the number of nodes), since each step may generate a block of nodes; 3) it can handle different node orderings when generating graphs. The authors show that their algorithm generates graphs of similar quality much more efficiently, and their experimental results indicate a real advantage over existing methods. One concern is the presentation of the experimental results. The authors use several metrics to measure the quality of the generated graphs, but do not explain how these metrics are calculated or how they should be interpreted. For example, when two methods have spectral scores of 0.01 and 0.02, respectively, how different are they?
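To make the metrics question above concrete: evaluations of this kind typically compare distributions of graph statistics (degree, clustering coefficient, spectrum) between generated and reference graphs via maximum mean discrepancy (MMD). Below is a minimal sketch, assuming a Gaussian kernel over the total-variation distance between histograms; the exact kernel in the paper may differ, and all names here are illustrative, not the authors' code.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Gaussian kernel on the total-variation distance between two
    # histograms (one common choice in this evaluation style; the
    # paper's exact kernel may differ).
    tv = 0.5 * np.abs(x - y).sum()
    return np.exp(-tv * tv / (2 * sigma ** 2))

def mmd_squared(set_a, set_b, kernel=gaussian_kernel):
    """Squared maximum mean discrepancy between two sets of histograms.
    Lower means the generated distribution is closer to the reference;
    the raw number has no absolute scale, which is why comparing 0.01
    vs. 0.02 is hard to interpret without context."""
    k_aa = np.mean([kernel(x, y) for x in set_a for y in set_a])
    k_bb = np.mean([kernel(x, y) for x in set_b for y in set_b])
    k_ab = np.mean([kernel(x, y) for x in set_a for y in set_b])
    return k_aa + k_bb - 2 * k_ab

# Toy example: degree histograms of reference vs. generated graphs.
ref = [np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])]
gen = [np.array([0.1, 0.2, 0.7]), np.array([0.2, 0.1, 0.7])]
score = mmd_squared(ref, gen)  # identical sets would give 0
```

The reviewer's point stands: without stating the kernel, the statistic, and the score's variance, a reader cannot tell whether a 2x gap in such a score is meaningful.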
Significance: 1. The paper is very well written and easy to follow. However, most of the experimental details are missing, and in the absence of code this makes the work difficult to reproduce. 2. For example, nowhere is it mentioned how Eq. (11) is trained. It says a 2-layer GCN is used, but what are the input node representations? And what is the ground truth used to train the mixture network? 3. A current drawback is that the model needs to know the maximum allowed graph size N (required for the RNN module). 4. The results in Table 1 suggest that GRAN (this approach) performs best on the Grid data; on the other two datasets, GraphRNN is better. Clarification: 1. Are the a's in Eq. (5) scalars? Are they normalized using a sigmoid activation function? 2. What happens if the RNN modules for node generation are not used? h could be initialized randomly and updated based on the GRU outputs of the GNN module, or something similar. Please include this in Table 2 of the ablation study. 3. How is overall training performed? Is it a teacher-forcing approach? 4. Please include a model size comparison.
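On the "what is the GT for the mixture network" question above: in a mixture-of-Bernoulli output of the kind the paper describes, the ground truth is simply the corresponding binary entries of the training graph's adjacency matrix, and training maximizes the mixture log-likelihood of those entries. A minimal sketch of such a block likelihood, with all function names and shapes assumed for illustration rather than taken from the authors' implementation:

```python
import numpy as np

def log_mixture_bernoulli(a, theta, log_alpha):
    """Log-likelihood of binary edge labels `a` (shape [E]) under a
    K-component mixture of Bernoullis.

    theta:     [K, E] per-component edge probabilities
    log_alpha: [K]    log mixture weights
    The ground truth `a` is just the adjacency-matrix entries of the
    training graph for the current block of candidate edges."""
    eps = 1e-8
    # per-component log-likelihood of the whole block of edges
    log_p = (a * np.log(theta + eps)
             + (1 - a) * np.log(1 - theta + eps)).sum(axis=1)
    # log-sum-exp over mixture components, for numerical stability
    z = log_alpha + log_p
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

# Toy example: 3 candidate edges, 2 mixture components.
a = np.array([1, 0, 1])
theta = np.array([[0.9, 0.1, 0.8],
                  [0.5, 0.5, 0.5]])
log_alpha = np.log(np.array([0.7, 0.3]))
ll = log_mixture_bernoulli(a, theta, log_alpha)
```

Maximizing this quantity over the weights that produce `theta` and `log_alpha` would answer the reviewer's training question, but the paper should state this explicitly.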
The general presentation of the paper is quite good. The paper is well written, and the authors present convincing derivations of the statistical foundations of their approach. Key arguments are broadly supported by mathematical facts or well-placed references. The experimental evaluation presents convincing results, with both quantitative and visual content to assess the quality of the generated samples. However, the fact that most of the comparison is made on purely synthetic data is slightly disappointing. Indeed, the only real dataset used is the protein dataset. Yet it only contains small graphs, which is not the configuration where the proposed method is supposed to be the most relevant. The authors could have made use of open-source social interaction graph datasets, so as to prove their point about scalability and demonstrate high sample quality on real-world large-graph datasets.
Here is an update of my review. I am keeping my score but have a concern.

- Some of your more recent experiments show that models without an RNN can perform well in some instances. But from my reading, the RNN was an important part of the message of your paper: you are finding a balance between independent generation (graph VAE) and introducing structural dependence with an RNN. Please be careful to make sure the message of the entire paper is clear, consistent, and as unified as possible. Perhaps you can offer some guidance to the reader on when an RNN will be useful, and why it may not have been needed in some of your tasks. I know you mentioned SMILES.

- Optimization of the GNN. My question was not about the type and tuning of the optimizer (Adam, learning rate, etc.); sorry for not making that clear. My suggestion was to clearly indicate the final, overall loss function used to train all the weights. You have a GNN that defines a Bernoulli distribution over edges, another that learns a variational posterior over orderings, as well as an RNN. Together, these components build up a likelihood function that serves as the loss used to _optimize_ the weights. In my humble opinion, the final objective function could be more salient, especially for purposes of implementation in the reader's favorite framework; for instance, an Algorithm environment with a line that refers to the equations that together define the loss. However, I know space can constrain that suggestion.

======================

This paper focuses on the approach of using an auto-regressive scheme for graph generative models, which better captures dependencies among vertices in the graph. The authors note that many existing works leverage domain-specific knowledge rather than learning from the data, and scale poorly.
A central baseline that addresses these issues is GraphRNN, which still suffers from O(N^2) generation steps (N is the number of vertices), may lose long-range dependencies between distant vertices in the graph, and only presents one scheme for ordering vertices. The authors propose generating a graph by focusing on B rows of the lower triangle of the adjacency matrix at a time (a "block" of size B). To generate a new block of the graph, rows of the already-generated portion of the lower triangle are passed to a GRU. The representations from the GRU are passed to a graph neural network with attention, and the resulting vertex representations are used to calculate a probability of edge formation (via an MLP). To learn an ordering, the authors take a variational approach with a categorical distribution over orderings as the variational posterior. Notably, this generative scheme, dubbed GRAN, can be seen to address each of the concerns discussed previously. The experiments compare summary statistics of generated graphs to actual graphs in real and simulated data against the baselines of GraphVAE and GraphRNN, demonstrate a speed-quality tradeoff controlled by B, and perform an ablation study.

*Pros*
- The characterization of existing work and discussion of limitations are clear.
- The stated goals are clear, and the scheme developed is clearly motivated by these goals.
- This is an important line of work, and I am not aware of work that takes the proposed approach.
- The authors propose an interesting approach for learning long-range dependencies and mixing over canonical orderings.
- The experiments demonstrate that GRAN makes graph generation via an autoregressive approach more computationally efficient if a user is willing to compromise on quality.

*Cons*
Issues with the experimental design preclude me from rating this paper above a 6. A few details in the description of the algorithm/training are also lacking.
In particular:
- It appears that one train-test split was chosen to be consistent with a previous experiment. It has been discussed that this practice should be avoided, e.g. "Pitfalls of Graph Neural Network Evaluation".
- The experiments do not report confidence intervals or any quantification of the variability in the results (related to the above).
- The choice/optimization of weights for the GNN that produces edge probabilities needs to be more clearly discussed.
- If possible, experiments could address whether the proposed model better learns long-range dependencies.
- It could be made clearer how the size of the generated graph is determined.
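For readers trying to picture the block-wise generation scheme this review summarizes (O(N/B) steps, each emitting B new rows of the lower-triangular adjacency matrix), here is a hedged structural sketch. `edge_prob_fn` is a hypothetical stand-in for the paper's GRU + attention-GNN + MLP pipeline, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_graph(max_nodes, block_size, edge_prob_fn):
    """Sketch of block-wise autoregressive generation: at each of
    ceil(N / B) steps, sample every edge from the B new nodes to all
    nodes generated so far, conditioning only on previous blocks."""
    adj = np.zeros((max_nodes, max_nodes), dtype=int)
    for start in range(0, max_nodes, block_size):
        stop = min(start + block_size, max_nodes)
        # freeze the graph state: the whole block is generated
        # conditioned on previously emitted blocks only
        context = adj.copy()
        for i in range(start, stop):      # each node in the new block
            for j in range(i):            # all earlier nodes (lower triangle)
                # stand-in for the learned per-edge Bernoulli probability
                p = edge_prob_fn(context, i, j)
                adj[i, j] = adj[j, i] = int(rng.random() < p)
    return adj

# Toy stand-in: a constant edge probability instead of a learned model.
g = generate_graph(max_nodes=6, block_size=2,
                   edge_prob_fn=lambda A, i, j: 0.5)
```

Setting `block_size=1` recovers one-node-at-a-time generation; larger blocks trade per-step dependency modeling for fewer steps, which is the speed-quality tradeoff the experiments probe. Note the sketch leaves the reviewer's last question open: `max_nodes` is fixed up front, and how the model decides the generated graph's size is exactly what needs clarifying.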