__ Summary and Contributions__: This work is motivated by the (previously shown) fact that the node message aggregation functions used in GNNs (sum, mean, min, max, standard deviation) fail to distinguish between certain graphs. To address this limitation, this work proposes a framework, Principal Neighbourhood Aggregation (PNA), in which multiple aggregators are combined along with a scaling term dependent on node degree. The use of multiple aggregators is theoretically justified as necessary for distinguishing between multisets (node neighborhoods) in continuous space.
The specific aggregators considered in the PNA framework are mean, standard deviation, min, and max. The commonly used sum aggregator is implicitly included, since the scaling operators combined with mean aggregation generalize sum aggregation, as shown in the paper. Higher-order moment aggregators are mentioned as possibly useful in some graph domains, but are not included in the presented framework; this choice is empirically motivated (in the appendix) by the tendency of higher-order moments to overfit, especially on graphs with small average degree.
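For concreteness, the combined aggregation can be sketched as follows. This is a minimal NumPy sketch of the general idea, not the authors' implementation; the function name, the per-node formulation, and the normalization constant `delta` (the average log-degree over the training set) are assumptions based on the paper's description:

```python
import numpy as np

def pna_aggregate(neighbor_feats, delta=1.0):
    """Sketch of PNA-style aggregation for a single node.

    neighbor_feats: (d, f) array of messages from the node's d neighbours.
    delta: normalization constant (average log-degree of the training set).
    Returns the concatenation of four aggregators (mean, std, min, max),
    each replicated under three degree-based scalers
    (identity, amplification, attenuation).
    """
    d = neighbor_feats.shape[0]
    aggregated = np.concatenate([
        neighbor_feats.mean(axis=0),
        neighbor_feats.std(axis=0),
        neighbor_feats.min(axis=0),
        neighbor_feats.max(axis=0),
    ])
    s = np.log(d + 1) / delta          # logarithmic degree scaler
    scalers = [1.0, s, 1.0 / s]        # identity, amplification, attenuation
    return np.concatenate([c * aggregated for c in scalers])
```

Each aggregator output appears under three scalings, so the downstream update network sees a 12x-wider message vector per input feature than a single-aggregator GNN would.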
Post-rebuttal update:
I have carefully read the author rebuttal. They have provided new results, which they will include in the appendix, that address my concerns about the multi-task regularization effect and graph size. These additional results strengthen the submission.

__ Strengths__: Novelty:
The key contribution of the work is the theoretical finding that multiple aggregators are necessary for distinguishing between node neighborhoods in continuous input spaces. This finding dovetails with that of Xu et al. [1], but the consideration of continuous input spaces is particularly valuable because the intermediate latent node features of deep GNN models live in such continuous spaces. Understanding the representational power of GNN models over these latent features is crucial to understanding the models.
The idea of combining multiple aggregators is not novel. The paper correctly identifies an example of combined aggregation functions in Dehmamy et al. [2]. However, the particular theoretical development regarding the necessity of multiple aggregator functions (Theorem 1) and the moment-based aggregators are new in this work, as is the use of mean aggregation combined with scaling based on neighborhood size.
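The equivalence between degree-scaled mean and sum aggregation is simple to illustrate. The one-liner below uses a linear scaler equal to the degree for clarity; the paper's logarithmic scaler is a smoothed variant of this idea:

```python
import numpy as np

msgs = np.array([2.0, 5.0, 7.0])   # messages from a node's 3 neighbours
d = len(msgs)
# Amplifying the mean by the node degree recovers sum aggregation exactly:
assert np.isclose(d * msgs.mean(), msgs.sum())
```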
Significance and Impact:
The idea of aggregating and scaling node messages is general enough that it can be applied to most GNN architectures, and several (GCN, GAT, GIN, and MPNN) are indeed considered in the work. The theoretical findings have implications across all architectures that aggregate neighborhood messages throughout a graph.
[1] Xu, Hu, Leskovec, Jegelka. How Powerful are Graph Neural Networks? arXiv preprint arXiv:1810.00826, 2018.
[2] Dehmamy, Barabási, Yu. Understanding the Representation Power of Graph Neural Networks in Learning Graph Topology. NeurIPS 2019.

__ Weaknesses__: Methodological:
The work here places importance on topology/structure. For example, the message scaling is dependent on node degree. This method is therefore apt for applications where structure is paramount; one such application mentioned is reasoning about social networks, where the degree of a node/user provides a lot of information about that node/user. Though useful in many domains, there are others where GNNs are useful but topology is not important. This is reflected empirically on the regular grid graphs of the computer vision datasets, where PNA does not significantly improve over other methods.
Empirical:
* The training on multiple tasks may have a regularizing effect that is not discussed in the work. Results on individual tasks may be useful to gain some insights into the power of the method on these data.
* The synthetic dataset comprises a nice variety of graph types, but results are reported in aggregate over all of them. It is not clear how much the framework improves on tasks for different graph types. Splitting these results, perhaps in the appendix, may provide a more nuanced insight into the benefits of PNA.
* Only undirected and unweighted graphs are considered, excluding many important graph applications, e.g., computation graphs and nonsymmetric social networks.
* The empirical analysis on synthetic data only considers relatively small graphs (15-50 nodes), which do not capture the scale of many real-world graph applications. The real datasets considered (molecules, CIFAR10, and MNIST) are limited. For example, the computer vision datasets are regular grids, where topology is less important. Indeed, the results shown on those datasets (no significant improvement over SotA) highlight the fact that the representational shortcomings addressed in this work concern distinguishing *structure* in the underlying graphs, the importance of which is application-dependent.
* Though the synthetic dataset is highlighted as one of the contributions, the novelty and impact of this particular contribution are quite secondary. Random graphs are commonly used to evaluate GNN models, and the particular tasks considered here are tasks of identifying different graph properties. Though sound and necessary for evaluation, it is not a particularly noteworthy contribution. To be clear, the paper's contribution does not hinge on this anyway.

__ Correctness__: The theorems and propositions appear to be correct, following the proofs provided in the appendix. The empirical methodology mostly follows the field's conventions for evaluating GNNs.

__ Clarity__: The paper is well written, and the motivation and development are clearly structured.

__ Relation to Prior Work__: Prior empirical and theoretical work on graph representation power is correctly cited, and the connections to the most closely related works are clearly discussed.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper proposes Principal Neighbourhood Aggregation (PNA) as a new aggregator for Graph Neural Networks. Principal Neighbourhood Aggregation combines multiple existing GNN aggregators and achieves superior performance.
------------------------
Updates:
The authors provide additional experimental results in the response, which answer my questions.
Overall, I like the technical contribution, although the experiments can be improved in the original manuscript. With these new results, I'm happy to vote for acceptance.

__ Strengths__: The proposed approach is technically sound.
The experimental results are strong under the proposed experimental settings.

__ Weaknesses__: The paper does not report performance on standard GNN benchmarks, such as node classification on Cora and graph classification on the TU datasets. These evaluations are simple to conduct, and the authors should at least give a brief discussion.
The GNN model used is also not standard. The paper uses a GNN with recurrent layers, where weights are shared across different layers. This kind of GNN has less representational power, and it is likely that in this setting the proposed Principal Neighbourhood Aggregation has a clearer advantage, since it uses an ensemble of aggregators. I think using standard GNNs, where different layers do not share weights, would make the experimental results more convincing.
Important baselines are missing. It would be really helpful to show the performance of GNNs with the combinations of max, min, and mean aggregators commonly used in the current literature, without the proposed normalized-moments aggregation and degree-based scalers.

__ Correctness__: Yes.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: Overall, the paper pursues an interesting direction for improving GNN models, namely more sophisticated aggregation functions. However, I think the current evaluation is not convincing and can be greatly improved.
Besides the comments I mentioned above, here are my additional comments.
(1) How does Principal Neighbourhood Aggregation compare with a naive baseline where max, mean, and min aggregators are used? This baseline should presumably work better than a single aggregator, yet the paper does not make the comparison. This important baseline would greatly help justify the proposed method.
(2) The GNN model as well as the evaluation setup are not standard. The proposed aggregation function should be orthogonal to the GNN architecture and the dataset, so it is not clear why standard setups are not used.

__ Summary and Contributions__: This work improves the original GIN model in multiple ways: generalizing the injective sum aggregation function to more expressive multiple aggregators, a new architecture called Principal Neighbourhood Aggregation combining graph convolution with a recurrent mechanism, and learning on multiple tasks simultaneously.

__ Strengths__: It is an important modelling contribution toward more expressive aggregation operations for GNNs. Numerical results are also convincing.

__ Weaknesses__: Question on the baseline: what is the performance of the simplest architecture using only the new multiple aggregators defined in Eq. 8, i.e., a CNN-like architecture that stacks M x GC layers (each layer with its own weights), with no GRU and no S2S in Fig. 2?
An alternative question: are the GRU and S2S critical for good performance? And if yes, why?

__ Correctness__: Sounds correct.

__ Clarity__: Yes, the paper is nicely written.

__ Relation to Prior Work__: Yes, great work.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors propose a new aggregation method for GNNs. They theoretically prove that mixing multiple aggregation methods performs better than a single one. They also introduce moment aggregators. Empirical results show that the proposed method outperforms existing methods.

__ Strengths__: The paper addresses an important problem in GNNs: the aggregation of information from neighboring nodes. The authors theoretically prove that using multiple aggregators can improve the performance of GNNs when the node features are continuous. The proofs seem to be correct, and the experiments are carried out well, showing improvements over the baselines. There are also insightful ablation studies in the paper and supplement.

__ Weaknesses__: The proposed method is indeed novel in geometric deep learning. However, the idea of mixed pooling [1] has already been explored in CNNs. Although there are obvious differences, they are in the same spirit as the proposed method. It would be interesting to add a few ablation studies:
1- A more comprehensive ablation study (besides the one in the supplement) on the number of moments needed (preferably on real-world datasets such as the social networks mentioned in the paper). Does it depend linearly on the average degree? As mentioned in the supplement, max and min seem to be the more consistent aggregators. Could they reduce the number of required moments (for example, to sub-linear)?
2- Which of the four aggregators in PNA is most important? Are they equally important?
[1] Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree. Lee et al., AISTATS 2016.

__ Correctness__: The claims seem to be theoretically correct.

__ Clarity__: The paper is well written and clearly organized.

__ Relation to Prior Work__: Relations to prior work have been adequately discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: ==== post rebuttal ====
I have read the authors' response as well as other reviews during the discussion period in August. I appreciate their efforts and I keep my score as is.