NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 9347
Title:On the equivalence between graph isomorphism testing and function approximation with GNNs

Reviewer 1

The paper targets the problem of measuring the representation power of graph neural networks (GNNs), an interesting and important topic that has become popular recently (partially due to two prominent works: Xu et al., ICLR 2019, and Morris et al., AAAI 2019). There are three main contributions:

1. Establishing the equivalence between two methods for measuring GNN representation power: (i) the ability to approximate permutation-invariant functions, and (ii) the ability to distinguish non-isomorphic graphs. Although not very surprising, this is a nice observation.

2. The authors further suggest reformulating these GNN representation power measures in the language of sigma-algebras: given a class of GNN models, one can naturally associate a sigma-algebra (defined on the domain of these models). The authors show that these sigma-algebras are an equivalent way to measure the representation power of GNNs; for instance, the inclusion of the sigma-algebras associated with two models is equivalent to saying that one model is more powerful than the other. This is a potentially useful observation. I would see it as a stronger contribution if the authors could identify a case in which the sigma-algebra formulation is genuinely useful (e.g., easier to work with than the other representation power measures).

3. The authors propose a variant of the order-2 G-invariant neural networks (Maron et al., ICLR 2019). This variant is shown to be more powerful than many popular graph learning models, such as message-passing networks and the order-2 G-invariant networks mentioned above. I think this is a very good direction, although the experimental section lacks a proper comparison on multiple datasets.

A few additional comments/reservations:

1. Clarity: in general the paper is well structured, but there are several places that can be improved. For example, Definition 3 is long and hard to parse, and the high-level idea behind Lemmata 3-4 is not conveyed. Similarly, Definition 5 is too long. Illustrations might help.
2.
Section 4: can the authors discuss the case of infinite K?
3. I am missing a more thorough evaluation of Ring-GNN.
4. Figure 1: can the authors clearly state how each arrow is obtained?
5. The long proofs in the appendix are hard to follow. Can the authors try to make them more accessible (perhaps by using illustrations)?

In general, this paper presents two potentially useful theoretical observations on measuring the expressive power of GNNs and proposes a strong GNN variant that looks promising.

-------------------- Post Rebuttal --------------------

After reading all the other reviews and the author response, I remain on the positive side. The new experimental results strengthen the paper. The SVD part, as other reviewers pointed out, should be changed: I think the authors should either compare to other eigenvalue-based methods or provide results without the SVD part.

Reviewer 2

Originality:
O1. The paper is based on G-invariant networks but uses an interesting idea to improve their power.
O2. The use of measure theory is not new (e.g., Bloem-Reddy and Teh [2]), but the connections to function approximation were interesting.

Quality:
Q1. I like the paper very much. It is great to see work on GNNs adding more formalism. Ring-GNN is a simple but interesting idea: adding the powers of A to the G-invariant network representation.
Q1.1. I really liked the use of the CSL task; it provides a clean way to show progress in representation power.
Q2. [IMPORTANT] Adding SVD to Ring-GNN feels like cheating. What if the k-GNN and RP-GIN methods were also given the eigenvalues (say, concatenated with their embeddings and passed through an MLP)? The eigenvalues of the 10 CSL classes are likely different; even a simple MLP may be able to distinguish the classes without the GNN. Either show this is not the case, or remove the SVD approach.
Q3. The sigma-algebra formalism has many parallels to that of Bloem-Reddy and Teh [2], but I like the former better for the added probabilistic interpretation. Noise outsourcing should allow this paper to connect the sigma-algebra formalism with representation learning and generative models. I think tightening the connection with Bloem-Reddy and Teh would strengthen this paper. Also, introducing the concept of an orbit would be useful in the formalism.
Q4. One of the main competitors in experiment 6.1 (RP-GIN) should be described in the introduction together with GIN.
Q5. When introducing CSL graphs (first mention of Figure 5), it would be useful to cite Murphy et al. [19], where they are described in more detail; otherwise it looks like something the reader should just know.

Clarity:
C1. I really enjoyed reading the appendix; thanks for all the effort that went into it!
(The illustrations, examples, detailed descriptions, and extensions are all great.)
C2. The statement "Such a dependence on the graph size was been theoretically overcame by the very recent work [13]", appearing right after "showed the universal approximation of G-invariant networks, constructed based on the linear invariant and equivariant layers studied in [16], if the order of the tensor involved in the networks can grow as the graph gets larger", is strangely worded, as it seems to imply that [13] was able to overcome the large-order tensors that universality needs (which is contradicted after the comma on the next page).
C3. "[19] proposes relational pooling", "[2] studies the", "[16] studies the spaces": it is odd to start a sentence with a number. Please use the first author's name rather than using citation numbers as nouns.
C4. Section 6.1 does not comment on Ring-GNN-SVD. Overall, I feel the SVD addition to Ring-GNN, without adding the eigenvalues to the other methods, is cheating.

Typo: "sigmas-algebras" should be "sigma-algebras".
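The Q2 claim above can be sanity-checked directly. Below is a minimal numpy sketch; it assumes (following Murphy et al.'s setup, not stated in this review) that CSL(n, k) is the 4-regular circulant graph on n = 41 nodes with cycle edges plus skip-k edges, and that the ten classes use k in {2, 3, 4, 5, 6, 9, 11, 12, 13, 16}. Under those assumptions, the ten sorted spectra are pairwise distinct, so an eigenvalue-only baseline could in principle separate the classes without any message passing.

```python
import numpy as np

def csl_adjacency(n, k):
    """Circular skip link graph CSL(n, k): an n-node cycle plus a skip edge
    from each node i to node (i + k) mod n (assumed construction)."""
    A = np.zeros((n, n))
    for i in range(n):
        for j in ((i + 1) % n, (i + k) % n):
            A[i, j] = A[j, i] = 1.0
    return A

# Skip lengths assumed from Murphy et al.'s CSL benchmark on 41 nodes.
n, skips = 41, [2, 3, 4, 5, 6, 9, 11, 12, 13, 16]
spectra = {k: np.sort(np.linalg.eigvalsh(csl_adjacency(n, k))) for k in skips}

# Every pair of classes has a different sorted spectrum, so the classes are
# separable from the eigenvalues alone.
for i, a in enumerate(skips):
    for b in skips[i + 1:]:
        assert not np.allclose(spectra[a], spectra[b]), (a, b)
```

This is only a separability check on the spectra, not evidence about what an MLP would actually learn; but it supports the concern that the SVD features alone may carry enough signal for the CSL task.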

Reviewer 3

Pros:
1. The theoretical analysis of the connection between the universal approximation of permutation-invariant functions and the capacity limitations of GNNs is fundamental.
2. The sigma-algebra formulation of expressiveness is very novel and interesting.
3. The synthetic experiments on circular skip link graphs are very impressive and interesting.

Cons:
1. Although I like the theoretical results on the universal approximation and limitations of GNNs, it is a bit disappointing that the newly proposed Ring-GNN does not come with any universal approximation guarantee, if I understood correctly.
2. To me, it seems that the reason Ring-GNN outperforms other models on the circular skip link graphs is that the construction of the adjacency matrix at different layers uses graph information at different scales. For example, the adjacency matrix at layer J+1 is min(A^{2^J}, 1). From this perspective, a discussion of and comparison with [1] seems necessary, since one can easily treat the exponent of the adjacency matrix as a hyperparameter and learn spectral filters as in [1]. Moreover, [1] avoids the full SVD, which brings the complexity down to O(kN^2), where k is the number of top eigenvalues required.
3. Details of the model are sparse. Since space is limited, I suggest the authors trim some of the remarks and lemmas that do not directly help in understanding the main results, and instead use the space to properly introduce the G-invariant network, which would significantly help readers understand the model in Section 5. For example, in Definition 5, what is the function rho used in Ring-GNN?
4. The computational complexity of Ring-GNN is quite high, which may be a bottleneck for practical use.
5. The performance of the proposed Ring-GNN on the real IMDB datasets is considerably worse than GIN and other baselines. Do you have any idea why this is the case? Also, please provide a reference for the IMDB datasets in the text.
6. The problem setting of the theoretical analysis differs from the practical setting, where node features are present. Specifically, a graph neural network is not a function mapping an n-by-n adjacency matrix to a scalar, but a function mapping a tuple of an n-by-n adjacency matrix and an n-by-d node feature matrix to a scalar/vector. I wonder how the analysis applies to such a setting.

[1] Liao, R., Zhao, Z., Urtasun, R. and Zemel, R. S., 2019. LanczosNet: Multi-scale deep graph convolutional networks. arXiv preprint arXiv:1901.01484.

==================== Post Rebuttal ====================

Thanks for the response! The clarification on Ring-GNN and node features helps me understand the contribution better. However, I disagree about the comparison with LanczosNet. First, the randomized starting vector would not be a problem if you simply fixed it for every run, which does not hurt the theoretical guarantees of the Lanczos algorithm in general (a smart choice of starting vector leads to better convergence). Second, the functional form of min(A^{2^J}, 1) will not make a big difference, since if the learnable spectral filter (essentially an MLP) in LanczosNet takes A^{2^J} as input, min(·, 1) can be approximated by the filter to arbitrary accuracy. Finally, your example of the hexagon and the two disconnected triangles can be distinguished by LanczosNet, because the eigenvalues/spectra of their adjacency matrices are different: one is [2, 1, 1, -1, -1, -2] and the other is [2, 2, -1, -1, -1, -1]. LanczosNet not only takes the node features and propagates them, but also takes the spectrum as input and extracts features from it, which means the final representations of these two graphs are different. Overall, the lack of a comparison with GNNs that also leverage the eigenvalues significantly degrades the contribution of the proposed algorithm.
On the other hand, if the SVD part were removed, i.e., if no GNN used eigenvalues explicitly, then the comparison would seem more fair and convincing. Therefore, I would like to keep my original score.
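The spectra quoted in the rebuttal exchange above are easy to verify numerically. Below is a minimal numpy sketch; the last line merely illustrates the min(A^{2^J}, 1) multi-scale operator referred to in Cons point 2, not the authors' exact architecture.

```python
import numpy as np

def cycle(n):
    """Adjacency matrix of the n-node cycle (C3 is the triangle K3)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return A

hexagon = cycle(6)
two_triangles = np.block([[cycle(3), np.zeros((3, 3))],
                          [np.zeros((3, 3)), cycle(3)]])

spec_hex = np.sort(np.linalg.eigvalsh(hexagon))        # [-2, -1, -1, 1, 1, 2]
spec_tri = np.sort(np.linalg.eigvalsh(two_triangles))  # [-1, -1, -1, -1, 2, 2]
assert not np.allclose(spec_hex, spec_tri)  # distinguishable from spectra alone

# The multi-scale operators from Cons point 2: A_{J+1} = min(A^{2^J}, 1), J = 0..2.
scales = [np.minimum(np.linalg.matrix_power(hexagon, 2 ** J), 1.0)
          for J in range(3)]
```

Both graphs are 2-regular with six nodes, so they agree on the trace and the sum of squared eigenvalues, yet the full multisets differ, which is what a spectrum-based model would exploit.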