__ Summary and Contributions__: I do not think the rebuttal addresses my concerns on any of the three issues: (1) mathematical mistakes in the inference, (2) extremely high computational cost, and (3) an incorrect experimental setting.
For (1), the correct formulation is $p(Y_U \mid A, X, Y_L) = \mathbb{E}_{q(z \mid A, X, Y)}\left[p(Y_U \mid A, X, z)\right]$, so you need to sample from $q(z \mid A, X, Y)$ and perform Monte Carlo estimation with those samples, not with samples from the prior; this is a serious mistake. For (2), I maintain my opinion that a 4000% increase in computation time does not justify a 4% relative improvement. For (3), the authors claim "the total number of labeled nodes in the case of 0.5% under the label-scarce scenario is still more than that under the standard experimental scenario", so why do the authors claim that this is a low-label setting?? This setting has more labels than the standard scenario...
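To make the point in (1) concrete, here is a minimal sketch of the estimator I have in mind; `posterior_params` and `decode` are hypothetical toy stand-ins, not the authors' actual modules:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (not the authors' code): the approximate posterior
# q(z | A, X, Y) is a diagonal Gaussian, and decode() plays the role
# of p(Y_U | A, X, z).
def posterior_params(A, X, Y):
    return np.zeros(4), 0.5 * np.ones(4)  # toy (mu, sigma)

def decode(A, X, z):
    logits = np.array([z.sum(), -z.sum()])
    e = np.exp(logits - logits.max())
    return e / e.sum()  # class probabilities

def predict_mc(A, X, Y, L=40):
    """Monte Carlo estimate of p(Y_U | A, X, Y_L): draw z from the
    approximate posterior q(z | A, X, Y) -- not from the prior --
    decode each sample, and average the class probabilities."""
    mu, sigma = posterior_params(A, X, Y)
    samples = [decode(A, X, mu + sigma * rng.standard_normal(mu.shape))
               for _ in range(L)]
    return np.mean(samples, axis=0)

probs = predict_mc(A=None, X=None, Y=None, L=40)
```

Sampling from the prior instead would amount to dropping the conditioning on the observed labels entirely.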
------------------------------------------------------
The work targets semi-supervised node classification tasks and aims to design a robust graph neural network using a latent variable model. The idea is to model the joint distribution with latent variables and use mean-field variational inference for approximations.

__ Strengths__: The general formulation is concrete. The authors nicely organize the paper, first presenting the formulation of the latent variable model and then describing in detail how they instantiate each module. There is also extensive analysis of the algorithm's complexity. I find the figure helpful; it makes the paper clearer. The mathematical notation is consistent throughout the paper.
The authors conduct extensive experiments on multiple setups including normal node classification, low-label regime and adversarial defence against malicious attack. The proposed model also shows strong empirical performance over previous baselines in all three setups with comparable standard deviation. This is pretty impressive considering GSNN contains more stochasticity than the baselines.

__ Weaknesses__: This paper combines latent variable models with GNNs; this is not especially novel, and there are many previous works with similar ideas in graph generation. The difference is that the formulation in this paper is closer to a conditional generative model and targets node classification tasks.
Based on the implementation of the method, I think the model is similar to RGCN in some aspects. Admittedly, there are differences: the model does not directly learn a Gaussian representation but instead samples from a Gaussian latent variable and concatenates the sample with the node's features. However, both aim to inject noise and, in essence, decrease the mutual information between the representation and the original node features, so that the model captures only the key attributes, which makes it more robust than vanilla GNNs.
One concern is about the inference procedure. Why does the model directly sample from the prior instead of the posterior? I find the statement “Since the posterior would be close to the prior after model training, so we directly sample from prior” (lines 181-184) very sketchy. A well-trained latent space should be structured: if we train a VAE on MNIST, the posteriors of the same digit will be close to each other while the posteriors of different digits will be scattered across the latent space. It is not the case that, after training, every posterior looks exactly like the prior; if it were, the mutual information between $z$ and the node would be close to 0 and $z$ would be useless.
Another issue is that the comparison with baselines in the experiments does not seem fair, since the proposed model needs to sample $z$ $L$ times, with $L=40$ in the experiments. This means the model is at least **40** times slower than vanilla GNNs, and possibly worse with the additional overhead. It would be very helpful if the authors could show results for different $L$, e.g. 1, 5, 20, 40. A minor point is that GSNN also has more parameters, but I think this is fine if the authors can show GSNN has a parameter count comparable to GAT.
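The ablation over $L$ that I am asking for is cheap to run. A toy sketch of why inference cost scales linearly with the number of samples (`forward_once` is a hypothetical placeholder, not the actual GSNN forward pass):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16))  # placeholder weights, not the real model

def forward_once(x):
    # Stand-in for one full GNN forward pass with a fresh latent sample.
    return np.tanh(W @ (x + 0.1 * rng.standard_normal(x.shape)))

def predict(x, L):
    # Averaging over L latent samples requires L full forward passes,
    # so inference cost grows linearly with L (L=40 => ~40x one pass).
    return np.mean([forward_once(x) for _ in range(L)], axis=0)

x = np.ones(16)
preds = {L: predict(x, L) for L in (1, 5, 20, 40)}
```

Reporting accuracy (and wall-clock time) for each of these $L$ values would make the accuracy/cost trade-off explicit.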
Also, in the adversarial attack setting, where the graph structure is attacked by adding/deleting edges, I am not fully convinced that the model can alleviate this issue simply by injecting noise into the features. A nicer way that could possibly achieve a larger performance boost is to also model the structure $A$ with latent variable models.
I list some additional points below.
- There is no explanation of why the authors did not use $Y_L$ as input to qnet1.
- The paper does not clearly describe how the training nodes were (randomly) selected in the label-scarce scenario. For example, is the selection completely random, or did you select a fixed number of nodes per class? I also did not find these details in the appendix.
- In the low-data regime, another important previous work is PPNP [1], which also shows strong performance over standard GNNs. I suggest that the authors compare their method with PPNP.
- Why does the model perform better in the scarce-label setting than in the clean setting? As listed in Tables 1 and 2, GSNN-M/A has higher accuracy on 0.5% Pubmed than on the original Pubmed. The same holds for the adversarial attack setting: the model achieves a higher score on the perturbed graph than on the pristine graph. This is rather counterintuitive.
[1] Predict then Propagate: Graph Neural Networks meet Personalized PageRank

__ Correctness__: I think the method is correct at a high level. I am concerned about the inference steps, as mentioned in the weaknesses section.

__ Clarity__: Overall I think the paper is easy to follow. Some suggestions and typos include:
- line 91: moset -> most
- line 291: stat-of-the-art -> state-of-the-art
- Some notations in figure 1 need to be in bold so that they are consistent with the main body.
- $N$ is the same as $|V|$ in Sec. 3.3, there is no need to further introduce $N$ as a new notation to represent the number of nodes.

__ Relation to Prior Work__: The paper has extensive discussion of prior work, but there is room for improvement in the discussion of and comparison between RGCN and GSNN.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: Most existing models learn a deterministic classification function, which lack sufficient flexibility to explore better choices in the presence of noisy observations, scarce labeled nodes, and noisy graph structure. This paper proposes a novel framework named Graph Stochastic Neural Networks (GSNN), to model the uncertainty of the classification function by simultaneously learning a family of stochastic functions.

__ Strengths__: (1) The paper proposes a stochastic framework for GNNs, which models the uncertainty of the classification function by simultaneously learning a family of stochastic functions.
(2) Comparison against several baselines.
(3) The paper is well written.

__ Weaknesses__: (1) The paper uses only three small datasets (Cora, Citeseer, Pubmed) with relatively high autocorrelation, which makes it difficult to see how the proposed method would generalize to other graphs with more noise and low autocorrelation.
(2) The two proposed methods (GSNN-M, GSNN-A) seem to provide similar results; it is unclear which one is better for which graphs, i.e., for graphs with different noise in attributes and structure. It would be good to explore this with synthetic experiments and add a discussion of the findings.

__ Correctness__: I think the claims are correct.

__ Clarity__: The paper is mostly well written, however it should be checked for typos, for example:
- Line 91 "Moset existing GNN ..." -> "Most existing GNN ..."

__ Relation to Prior Work__: Related work is discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: ===After author response and discussion with other reviewers===
I thank the Authors for responding to my comments in an acceptable way.

__ Summary and Contributions__: This paper attempts to address an unresolved but meaningful problem. To overcome the inflexibility of GNNs in the face of imperfect observed data, the paper proposes a novel framework, Graph Stochastic Neural Networks (GSNN), to model the uncertainty of the classification function by simultaneously learning a family of functions. GSNN treats the classification function as a stochastic function and uses a learnable graph neural network parameterized by a high-dimensional latent variable to model its distribution. To infer the missing labels under this stochastic classification function, GSNN wisely adopts variational inference to approximate the intractable joint posterior over the missing labels and the latent variable. Extensive experimental results show that GSNN achieves substantial performance gains in different scenarios, such as the label-scarce scenario and the adversarial attack scenario.

__ Strengths__: Significance and novelty: Most GNN-based models learn a deterministic classification function, which leaves them without sufficient flexibility to cope with various kinds of imperfect observed data, such as scarce labels or deliberate noise in the graph structure. To address this, the paper proposes to model the uncertainty of the classification function and simultaneously learn a family of functions, which is well motivated. The idea is novel and different from previous works, and it provides a new perspective on the graph-based semi-supervised learning problem.
Soundness of the claims: The authors skillfully formalize the problem of modeling the uncertainty of the classification function. They treat the classification function to be learned as a stochastic function and combine GNN models with a high-dimensional latent variable to model its distribution. Variational inference makes the missing labels inferable. The overall solution, including the theoretical analysis, practical model design, and experimental evaluation, is technically sound. Extensive experimental results also show significant performance gains over state-of-the-art baselines, which further demonstrates the effectiveness of the proposed method.
Relevance: This paper has the potential to attract wide attention at NeurIPS 2020.

__ Weaknesses__: 1. Theoretically, this work fills the gap between deterministic and stochastic classification functions. A toy motivating example in the introduction is encouraged; this would help the proposed GSNN work reach a broader audience.
2. There are some typos in the paper.

__ Correctness__: The basic theory and the practical model design are clear and correct. The overall logic and derivation are rigorous. It is praiseworthy that the authors provide a proof of the consistency between the practical approximate inference (i.e., Eq. (11)) and the theoretical definition (Eq. (4)), which is meaningful for demonstrating the completeness and correctness of the proposed method. The overall experiments are well designed, and the evaluation results support most of the claims made by the authors. It would be better to describe the parameter settings of the baselines and the proposed model in Section 4.1 in more detail.

__ Clarity__: The paper is well organized and easy to follow. The formalization is clear, and the notation is defined rigorously and used consistently throughout the paper.

__ Relation to Prior Work__: The paper carefully explains the differences between the proposed model and two types of related work: deterministic graph neural networks and methods using uncertainty modeling.
To help the audience understand the problem more clearly, the baselines in this paper, especially the second type of methods, should be introduced in more detail.

__ Reproducibility__: Yes

__ Additional Feedback__: (1) Some typos need to be fixed. For example, “Moset” in line 91 should be “Most”.
(2) The parameter settings of the baselines and the proposed model should be described in more detail.
(3) It would be appreciated if the authors could provide a toy motivating example to make the idea more intuitive and understandable.