__ Summary and Contributions__: The paper proposes a novel graph neural network architecture which is motivated both by the Weisfeiler-Lehman isomorphism test and kernel theory. In more detail, the paper states that the natural representation of a neighborhood of a node is not a simple sum but a sum of features in the kernel space, because this yields a meaningful kernel/distance between sets. Accordingly, the paper suggests that the lth-layer representation of a node should not be the transformed sum of previous layer representations in the neighborhood, but rather the sum of transformed previous-layer representations. [It should also be noted that the transformation suggested in this paper is more involved than usual, using an MLP first and then a Taylor approximation of the feature space of an RBF kernel, which is in turn dimensionality-reduced via random projections to keep the dimensionaliy of the representation manageable.] (this characterization is imprecise; as the authors pointed out in their feedback, the Taylor approximation and dimensionality reduction is replaced with an MLP) After all this processing, the inner product/Euclidean distance between node representations can be interpreted as a node kernel/distance and the inner product/Euclidean distance between the sum of node representations forms a well-defined graph kernel/distance. This scheme is evaluated in node classification, graph classification, graph regression, and graph generation tasks, yielding performance gains in all cases, with particularly impressive gains in graph kernel classification.

__ Strengths__: The paper has numerous strengths, the most striking of which are the clear motivation of each architectural component as well as the extensive experimental evaluation. The paper starts from the straightforward Weisfeiler-Lehman test, which is closely related to many graph neural network variants, and then extends it logically with kernel concepts to construct a well-behaving graph similarity/distance. Each additional complexity is properly motivated either from the kernel world or from the isomorphism testing world and yields an overall architecture that is fairly simple to understand and yet very expressive and powerful.
With respect to experiments, the paper uses an established methodology and compares against strong baselines from the literature which represent the state-of-the-art well. Still, especially in terms of graph classification, the proposed method yields impressive gains in performance, sometimes tens of percentage points. It is also impressive that the evaluation covers such a wide range of tasks, giving a very comprehensive impression of the capabilities of the proposed model.
Finally, I regard the paper as highly valuable to the community, both because it provides improvements to a model type that has gained widespread interest in recent years and because the proposed extensions are easy to incorporate, such that the paper has the potential for widespread adoption in the next years.

__ Weaknesses__: The main weakness of the paper is, in my eyes, the motivation for the similarity interpretation between kernels. If one already has a well-performing graph neural net for a task at hand, it is not immediately obvious that an interpretation of the resulting vector space in terms of similarities/distances is necessary. Here, it may be valuable to point to work which assumes that distances in the representation space have meaning, e.g. adversarial attacks, counterfactuals, or few-shot learning.
Furthermore, the focus of the paper could be improved. Currently, several paragraphs discuss the problems of aggregation-first schemes with a very specific example (2.4), while related work is covered only briefly (with graph neural networks being compressed to a single paragraph). The proposed architecture is only presented textually, whereas a flow diagram of a single layer in the architecture would help to appreciate the modules involved. By contrast, Algorithm 1 is currently too abstract to understand the specific implementation proposed by the authors (it can be kept, but does not, in itself, suffice to grasp the approach).
Finally, I believe that some of the related work is not covered (more on that below).

__ Correctness__: The theoretical claims are correct and well justified and the experimenta methodology follows standards in the field. The only problem I see is that the graph classification results are obtained using only graph kernel variations of the baselines, whereas the graph neural network results for comparison are in the appendix. It should be noted in the text that the graph kernel classification is _not_ superior to an (arguably more straightforward) classification via an output layer (at least that is what I gather from the appendix). This is, again, related to the issue of motivation: The paper currently makes the added value of a kernel interpretation not sufficiently clear.

__ Clarity__: The paper is overall well-written and easy to follow. The only issues in terms of clarity are mentioned above (i.e. a flow chart for the weisfeiler-lehman layer could be added and it should be made explicit that graph kernel classification does not outperform standard graph neural network classification).

__ Relation to Prior Work__: The relation to related work is characterized correctly, in principle. However, graph neural networks are represented too briefly to be informative if one is not already very familiar with the fields and some works on deep learning and graph kernels are missing. At least one work on learned graph kernels (e.g. Yanardag and Vishwanathan, 2015, doi: 10.1145/2783258.2783417) should be included.

__ Reproducibility__: Yes

__ Additional Feedback__: With regards to reproducibility, I believe that the provided documentation is outstanding and makes reproducibility straightforward.
With regards to the broader impact statement, I believe that potentially hazardous applications are not sufficiently covered. For example, if one would apply the proposed method to infer a 'risk score' (in terms of credibility or bad credit) of a person from their social network, this could be ethically questionable. It is worthwhile, I think, to caution readers against naively applying the proposed method for such tasks.
typos and style:
* line 132: I believe, the top index of \hat f should be i, not i-1.
* in line 153, the concatenation should probably be \phi(f^{i-1}(v)) \oplus \hat f^i(v), i.e. the first argument of the concat operator should be first plugged in \phi.

__ Summary and Contributions__: The paper proposes a transform-sum-cat scheme for graph representation learning, in contrast to the sum-transform scheme that is typically applied in GNNs.
Here, authors first identify problematic aspects of sum-transform schemes when applied on graphs with continuous node attributes, and proceed to present solutions which are both applicable for graph kernels as well as GNNs (called WLS).
Experiments are conducted on a diverse range of tasks, ranging from graph classification, node classification, graph regression and graph generation.

__ Strengths__: The paper is well-written and easy to follow, and I certainly enjoyed reading it.
Its main contribution lies in the fact that GNNs, which rely on a (linear) feature transformation, lose expressive power when operating on continuous input node features.
As a result, the authors propose to non-linearly transform node features before aggregating them. This scheme is unified into a single algorithm which is applicable for both graph kernels and GNNs.
Furthermore, the experimental evaluation looks quite convincing to me since the proposed approach is tested on a wide range of different tasks, a welcome contribution!

__ Weaknesses__: * Overall, the paper feels quite incremental to me. Transforming node representations before aggregation is not a novel concept, and can be found in literature quite often. This should be at least discussed in a revised manuscript.
In particular, the official formulation of GIN [1] is based on this concept, and only drops the first transformation in case input features are categorial. With this, WLS-GNN and GIN do not seem to differ much at all.
* Although the experiments are conducted on a wide range of tasks, its comparison to recent methods feels quite shallow. For example, none of the presented baselines in Table 1 tackles the problem of graph classification.
Why are authors not comparing against graph kernels/GNNs that are also evaluating on these graph classification benchmark datasets?
Furthermore, it seems like the graph regression experiments on ZINC are not really comparable, since baselines do not seem to make use of edge features, while WLS does.
[1] Xu et al.: How powerful are Graph Neural Networks?

__ Correctness__: yes

__ Clarity__: yes

__ Relation to Prior Work__: I feel that this is not really the case. There are a lot of GNNs that perform transform-sum-cat like aggregations, and, in particular, the GIN formulation is mostly identical to the presented solution in case an MLP is added on top of the input node features.

__ Reproducibility__: Yes

__ Additional Feedback__: * From reading the paper, it is not clear to me how the presented approach can handle the problem illustrated in Figure 2(b). I feel like the results of Appendix B.3 should also be discussed in the main paper.
* How is the readout function of the WLS-GNN defined? Is it the same as the one used in the WLS kernel?
============= Comments after Rebuttal ==================
I thank the authors for their informative rebuttal. I still feel that this is an interesting paper regarding kernel theory, but my rating will remain unchanged. This is mostly due to the limited contributions regarding GNNs.
For the revised manuscript, I highly recommend to give proper credit to the related work (Algorithm 1, COMBINE as concatenation, "transform-first-add-later" as described in GIN), and to also enhance the Graph Classification Table by the results reported in the related work.

__ Summary and Contributions__: Motivated by Weisfeilerâ€“Leman test, this paper studied a novel way of basic embedding unit for graph neural network. This unit works in a transform-sum-cat fashion, which is in a different order from previous ones. Authors claimed that this method exploits the similarity preserving property in some metric space before and after embedding. The proposed method worked efficiently in experiments with less parameter size, and outperformed several selected counterparts in multiple tasks.

__ Strengths__: The paper is well written and thus easy to follow.
Theory in this seems rational and convincing.
The proposed method is novel and I never see this in previous works.
Graph neural network is of high relevance to several topic in machine learning, and thus is within the scope of Neurips.

__ Weaknesses__: I mainly have the following concerns.
1) In general, this paper is incremental to GIN [1], which limits the contribution of this paper. While GIN is well motivated by WL test with solid theoretical background, this paper lacks deeper analysis and new motivation behind the algorithm design. I suggest the authors to give more insightful analysis and motivation.
2) I noticed that in Sec 5.3, a generator equipped with a standard R-GCN as discriminator tends to collapse after several (around 20), while the proposed module will not. The reason behind this fact can be essential to show the mechanism how the proposed method differs from previous one. However, this part is missing in this version of submission. I would like to see why the proposed module can prevent a generator from collapsing.
3) I understand that stochastic/random projection is with high probability to preserve the metric before mapping . My concern is that when stacking multiple layers of WLS units, the probability of the failure case of stochastic/random projection also increases (since projection is performed at each layer). This may greatly hinder the scenario of the proposed method from forming deeper GNN. In this case, authors should justify the stability of the proposed method. How stable is the proposed method? And what happens when stacking more layers?

__ Correctness__: To me, claims in this paper seem to be appropriate and the experimental settings are correct.

__ Clarity__: This paper is well written.

__ Relation to Prior Work__: The contribution of the this paper compared to GIN is not well explained.

__ Reproducibility__: Yes

__ Additional Feedback__: I would consider to raise the rating if authors can address my concerns in the weakness part.
---after rebuttal---
I think the authors have addressed most of my concerns and other viewers'. While the reason of the collapse of the generator remains opaque, it might be difficult to give a thorough analysis during the rebuttal period. As such, I would like to raise my rating from 5 to 6.

__ Summary and Contributions__: This paper provides a set of tools for working on graphs that is different from existing work mostly by moving aggregation of adjacent node attributes from inside the loop to outside. Prior work operates on aggregated neighbor attributes and then learning is done on the aggregate, but this work learns on the neighbor attributes and then combines the attributes using concatentation. This allows finer comparison between graphs with different attributes but same adjacency. The contributions are - change in order of operations in computing node representation, a graph similarity based on this change and a demonstration of the use of this change as a graph neural network on some benchmarking data, synthetic and from biology/chemistry.

__ Strengths__: The revisiting of order of operations is an interesting idea, and does provide benefits of distinguishing graphs of the roots of unity as described by the toy example in figure 2 that existing methods do not differentiate. I am not sure how many existing problems have that property where the sum of attributes sum to zero, so it might be artificial, but the empirical evaluation on the chemistry data set seems convincing. This approach does seem relevant to the segment of the NeurIPS community that works with graphs. I make no comment on the significance and novelty of the contribution and I am not that familiar with the prior art.

__ Weaknesses__: There could be some cases where users of algorithms might have better generalization performance if the neighbor attributes were treated as a bag instead of separate entities like this method. The same way that linear SVMs sometimes generalized better than RBF kernels in the SVM era. I suppose in that case a linear kernel for the transform part would take the part of the aggregate, and then it should perform similarly to the aggregate first methods. You can sort of see the effect in Table 1 on the BZR and COX2 data set where WLSIn has a lower variance than WLS. Maybe a linear kernel baseline would help as well to convince the reader. Also since in line 235 the hyperparameters were tuned for each split, it's hard to tell if the method generalizes to an unseen split that wasn't tuned against.

__ Correctness__: I'm a bit concerned about tuning each split individually as in line 235. It could lead to false security about generalization performance. In production system there is usually a secret test set that is not tuned against by the hyperparameter tuner to check for overfitting via the hyperparameter tuning process.

__ Clarity__: Yes, although I am not familiar with the particular approach, the toy diagrams and clear explanations worked for me.

__ Relation to Prior Work__: Yes, the idea of moving aggregation outside of the learning loop made sense to me.

__ Reproducibility__: Yes

__ Additional Feedback__: No change to my review post rebuttal, I find the paper interesting in terms of the kernel theory usage and stick to my score.