__ Summary and Contributions__: The authors introduce a hybrid method based on Geometric Scattering and Graph Convolutional Networks. They introduce a Graph residual convolution approach to aggregate multi-scale information and describe some of the properties of the introduced architecture.

__ Strengths__: It's a novel approach relevant to recent literature and probably a step in the right direction when it comes to graph classification.
The results are good.
The paper is written clearly.

__ Weaknesses__: The Geometric Scattering approach is a rather ad-hoc choice for the task and properties that it introduces, e.g. section 7, are not experimentally demonstrated.

__ Correctness__: The methods are correct. To prove all claims more experiments should be necessary.

__ Clarity__: Yes it is very clear

__ Relation to Prior Work__: Yes prior work is clearly discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: The additional information sketched by the two lemmas in section 7 are not verified empirically in section 8.
The authors claim that introducing wavelets improves sensitivity in high frequencies.
Has that been verified experimentally? Scattering Transforms typically average over high frequencies (to achieve stability)
and are known to lose high-frequency information. It is positive that the authors are attempting a hybrid model
but the rationale of using scattering for high frequencies can be challenged and experimental
evidence for the contrary are weak. Have you considered comparing with [1] which loosens the constraints on wavelet structure?
Can you include performance of graph scattering methods in Tables 1-2. The results as they are make it difficult to draw any conclusion with respect to the quality of the contribution, i.e. what brings the performance? There is little effort to verify the rationale.
Minor comments:
line 111 small potion/portion
[1] https://arxiv.org/pdf/2006.05722.pdf
Post rebuttal feedback:
On high frequencies, scattering compared to DL representations considerably fails at capturing high frequencies but I guess that would lead us to futile debate.
If you do what you say in the rest of the rebuttal I believe the quality of the paper will improve and therefore I maintain my acceptance score.

__ Summary and Contributions__: *** Post rebuttal update ***
I have read the authors' response and am satisfied with way they addressed the points raised and the modifications they'll introduce based on my feedback; thus, my 'accept' score remains unchanged.
**********************************************
The paper proposes a modification of the graph convolutional network (GCN) architecture to add the scattering transform (Mallat, 2012) and a residual layer. The scattering transform layer has a geometric interpretation (analogous to multiscale structure in graph signal processing) and is intended to deal with the oversmoothness problem in GCNs. The residual layer is intended to keep information localized at a node level.
The paper provides an algorithmic modification to GCN and a detailed empirical study to support the superior performance of the architecture on semi supervised graph labeling classification tasks.

__ Strengths__: The contribution is mostly methodological; the paper also connects ideas from signal processing on graphs and neural networks in a new and interesting way.
The paper is solidly grounded in the theory behind scattering transforms and the broader signal processing on graphs work of the last few years.
The paper is relevant to the NeurIPS community as it provides an architecture that can enable practitioners working on semi supervised graph classification tasks.

__ Weaknesses__: While the application of the scattering transform is explained from an intuitive point of view, the paper can be made stronger by fleshing out a few remaining issues:
-- R is lazy random walk (line 224). The statement "which can be interpreted as an interpolation between the completely lazy random walk and the non-resting random walk R" is ambiguous and needs to be clarified.
-- How is the residual layer helping in this architecture and why? The scattering layer already 'collects' information at the node level as defined (low- vs band pass filtering). Why do we need the extra threshold on the wavelets in Fig 1e?

__ Correctness__: I did not find inaccurate claims, although there are some points that can be further fleshed out as highlighted in the "Weaknesses" section above.

__ Clarity__: Overall the paper is clearly written and barring a few typos here and there it is of very high quality in terms of clarity.

__ Relation to Prior Work__: The paper discusses prior work (scattering transforms and relevant GCN architectures) in much details and sufficiently well describes how it differs from related work.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper introduces an extension and combination of two existing methods (graph convolution and geometric scattering) and applies it to a novel application (node classification). The proposed combination alleviate undesirable oversmoothing of previous methods which is particularly important in node classification (whereas previous methods were focusing on whole graph classification problems for which that over-smoothing was less of an issue).
The paper reintroduces most of the necessary tools and existing method, explains their limitations (over-smoothing) to motivate the proposed approach.
Then, the proposed method is described.
Its non-over-smoothing behavior is convincingly demonstrated on toy-task (periodic and colored graph) as well as on experimental classification task.

__ Strengths__: Straightforward to read, apart perhaps some notations issue (see additional feedback). Good motivation and efficient and elegant solution to address the issue.

__ Weaknesses__: Perhaps some lack of clarity on the notations. I found the notations where often misleading and inconsistent with prior work on scattering, which makes the paper a bit hard to read in my opinion.
Although, perhaps on the novelty side, in section 5 the proposed approach seems to be a concatenation of prior work GCN together some additional high frequencies features, and it seems a super-set of a representation is bound to do a bit better than the original representation.

__ Correctness__: The paper seems correct overall, maybe with a reservation on some of the notations and equations that I found a bit confusing.
The experimental methodology seems quite nice. For prior work methods, they downloaded original code and reran it on the data they used, making sure the number were matching original publications when available, which I think is really the best possible practice, although I haven't actually checked those numbers.

__ Clarity__: The motivations and the general explanations are nice and straightforward. However, some of the notations and equations were a bit unclear to me, see additional feedback for details. I might probably be willing to increase my rating if the equations were easier to follow and the relation to prior work notations were a bit more straightforward.
The paper is a bit too packed maybe, for example Figure 1 has a lot of different purpose, Table 1 and 2 are a bit too close to the rest of the text etc...

__ Relation to Prior Work__: There seems to be a solid prior work discussion both in the motivations and in the experimental section. Again, really nice practice of downloading and rerunning prior work code and making sure the numbers agree with past publications.

__ Reproducibility__: Yes

__ Additional Feedback__: l49-50: "compute with attention mechanisms computed from": double use of computed
l111: "potion": portion?
l121: "h_j^l = ...": j doesn't appear in the LHS? This is a bit confusing. If I understand correctly, actually A and \tilde D actually both depend on j through their def l117? Then, maybe it would be clearer to also index those A and
\tilde D by j and theta by i j or am I missing something?
l133 potion again
l157: "J:= (k_1, k_m)", I found this notation quite confusing as J is meant to be the maximum / smoothing scale in other scattering papers, whereas here it is the tuple of frequencies. Also Phi in scattering paper is usually used for the last smoothing operator, whereas here it is used for the unsmoothed version?
l161: "we reinvent": reapply?
l189: I don't think defining B from b adds much to the readability.

__ Summary and Contributions__: The paper introduces scattering transforms to tackle the oversmoothing behavior of standard graph convolutional neural nets. Additionally, graph residual convolutions are introduced as a mechanism to restrict the captured frequency spectrum.
The proposed approach is evaluated on standard benchmarks for semi-supervised node classification in citation graphs, and comparisons are made with state-of-the-art geometric deep learning methods and other baselines.

__ Strengths__: The paper identifies an innovative solution to an important limitation of GCNs. The combination of classical GCNs with band-pass channels through scattering transforms seems very promising.
Another strength is the self-contained nature of the paper with a pretty complete yet concise introduction of the necessary background theory.
The experiment on reducing the amount of labelled nodes is very informative and relevant for practical applications.

__ Weaknesses__: The main weakness in my opinion is the experimental evaluation which is limited to the well known citation graph benchmarks. Although, these provide a good way to compare with published methods and additional baselines, it is unclear how well these results translate to more interesting applications that are discussed in the introduction and used to motivate the work.
Another weakness is the relatively little insights provided about why and how exactly the scattering transform helps with the node classification. I appreciate Sec. 7 which does a theoretical analysis of the complementary nature of the extracted information, but it would also be valuable to have more practical insights. For example, from the experiments it is quite clear that the proposed method shines for the DBLP benchmark where a lot of the information is hidden in the connectivity structure of the graph. But what exactly is it that the scattering layers extract from there what the other methods can't. At least to me this isn't very clear.

__ Correctness__: Seems correct, but I could not follow all derivations without digging deeper into the signal processing theory (which is not my main area of expertise).

__ Clarity__: The paper is dense but generally well written. The background sections on graph signal processing and GCNs are very helpful.

__ Relation to Prior Work__: I believe relevant work is cited and appropriate comparisons are made to the relevant state-of-the-art approaches.

__ Reproducibility__: Yes

__ Additional Feedback__: I'd be interested to hear whether the proposed approach could also benefit from an attention mechanism similar to GAT.
I wasn't entirely sure about the setup for the experiment where the training size is reduced. Is this taking a fixed graph and then simply hiding an increasing portion of the node labels, or is the graph structure different between the settings with reduced training size? Are the number of nodes for which labels are predicted the same between each setting? Is each unlabelled node always connected to at least one labelled node or does the reduction of training size also mean that the nearest labelled node might be further away in the low training size regime?
A schematic illustration of the geometric scattering may help with the accessibility of the paper. A toy example with a simple node classification problem might be used for that (similar to Fig. 2).
Typo in l. 326 "interesred" -> "interested"
The broader impact statement is inadequate. The authors statement that "This work is computational in nature and addresses the foundations of graph processing and geometric deep learning. As such, by itself, it is not expected to raise ethical concerns nor to have adverse effects on society." is missing the point of the impact statement. Graph neural networks are heavily used for the analysis of social networks (e.g., fake news detection) and thus take center stage when it comes to ethical concerns and there are huge societal implications. To clarify, this has not impacted my review score but though it should be mentioned here.
----
AFTER REBUTTAL: I am happy with the rebuttal which answered the questions I had (mostly about possible extensions and experimental setup). I think this is a good paper and I am keeping my recommendation as is.