__ Summary and Contributions__: In this paper, the authors studied the sampling and its variance for GCNs and GNNs. They proposed to formulate the optimization of the sampling variance as an adversary bandit problem. Experimental results on several benchmark datasets demonstrated the effectiveness of the proposed method.
=========================
post-rebuttal edit
=========================
Thank the author's response much. It addressed part of my concerns. I'd like to keep my score.
My main concerns are still related to the experiments. I encourage the authors to try more baselines and datasets to further verify the advantages of the proposed method.

__ Strengths__: 1. The research problem of studying how to reduce the sampling variance of graph-structured data is interesting.
2. This paper seems theoretically solid, with very rich and detailed analysis.
3. Experimental results on benchmark datasets indicate the effectiveness of the proposed method.

__ Weaknesses__: 1. Although the problem studied in this paper is interesting, this pape is not easy to follow.
2. More baselines (sampling approaches designed for GNNs) are needed. In Table 2, S-GCN [5] is a simple sampler. ClusterGCN and GraphSAINT are designed for sampling (sub)graphs. The same for Table 3.
3. To be honest, I am kind of confused about Table 3. It would be better if the authors provide more analysis for Table 3. And more analysis when Tables 2,3 are considered together.
4. How did the authors determine the hyper-parameter settings?
5. I am interested in seeing more experimental comparison on the datasets with a large number of nodes.
6. How about the comparison in terms of computation cost / running time?

__ Correctness__: It seems to be correct.

__ Clarity__: Yes, but it is not easy to follow.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: No

__ Additional Feedback__:

__ Summary and Contributions__: The authors propose to use a bandit approach to optimally sample the neighbors in GNN embeddings. Previous approaches include random and importance sampling and proposed approach scales even to GNNS with attention since they can change across iterations. Also a nice theoretical bound shows a multiplicative factor of 3 over the optimal variance.

__ Strengths__: (A) Good theoretical grounding in bandits using multi-play MAB is proposed for neighbor selection and a variance bound is shown.
(B) Strong empirical results as shown on a wide-variety of datasets.
(C) Casting node selection to a bandit problem seems novel to me and could lead to other bandit extensions as well as applications to other graph settings.
(D) Highly relevant since training GNNs is expensive and improvements are highly welcome.

__ Weaknesses__: (A) How about using combinatorial bandits (CMAB) for selecting the neighbors?
(B) I might have missed it but further insight into why this approach works could have been interesting. Are you able to better sample neighbors in fewer rounds while random/importance sampling explore unnecessarily?

__ Correctness__: I haven't checked all the proofs but its seems to be correct.

__ Clarity__: Yes, the paper is clearly laid out and easy to follow. Could use an additional pass though.

__ Relation to Prior Work__: The prior work is discussed and also extensively compared in the experiments.

__ Reproducibility__: Yes

__ Additional Feedback__: Update after discussion:
The authors answered most of the questions and there seems to be some concern about novelty. However, keeping my score since its application to GNN is interesting and the fact both attention weights and embeddings change and are handled in the paper.

__ Summary and Contributions__: This paper studies the problem of accelerating the training of graph neural networks (GNNs) using sampling. The embedding of a node is computed by aggregating the embeddings of its neighboring nodes. Instead of sampling all neighbors - which can be expensive - we can sample a subset of the nodes. The resulting estimator's variance can be reduced using importance sampling. For GNNs, it is difficult to determine the optimal importance sampling weights since they depend on unknown quantities. This paper proposes the use of adversarial bandits methods to learn the optimal sampling weights.

__ Strengths__: The paper is relevant to the neurips community. The paper is well-written, and the authors have compared their sampling scheme to existing sampling schemes. They also provide a regret guarantee for their algorithm. The contribution is novel to the best of my knowledge.

__ Weaknesses__: I do not follow why the gradient of the variance can be set as the reward for the bandit (the writeup after equation 7). The objective is to choose a good Q_i^t, and the quantity on the lhs of equation 7 is to be minimized. This is upper bounded by the inner product on the rhs. In the reward function however, only the gradient is present. Could you elaborate this?
In the experiments, I would have liked to see two plots. First, a plot comparing the empirical regret with the upper bound in Theorem 1. This would help validate the expression in Theorem 1 and understand whether it predicts the dependence on various quantities correctly. Second, a plot measuring the gains on simulated toy graphs. For instance, how does the convergence change if we consider graphs where all nodes have a fixed degree. I would imagine that as the degree increases, the convergence would be slower (for a fixed sample of the neighborhood)
I would also encourage releasing the code for reproducibility.

__ Correctness__: Yes. As long as the authors can answer the question about the gradient being chosen as the reward above.

__ Clarity__: Yes

__ Relation to Prior Work__: Yes. For completeness, the authors may look at the work by David Tse's group on using bandits as samplers (although they use it in the stochastic setting, not the adversarial setting).

__ Reproducibility__: Yes

__ Additional Feedback__: == after rebuttal ==
Thank you for clarifying eqn 7 and the plots.

__ Summary and Contributions__: The paper describes about the variance reduced samplers for training GCNs and attentive GNNs by formulating the optimisation of samplers as a bandit problem and proposes two multi-armed bandit algorithms.

__ Strengths__: The paper is written well. The idea of using mutli-arm bandit for sampling the neighbours training of GCNs and attentive GNNs is interesting. The technical details and equations appears to be correct. The authors applied their multi- arm bandit approach to GAT and GeniePath and shown its effective over layer sampling approaches.

__ Weaknesses__: The authors conduct experiments with 2 layer architecture. However, for few problems and dataset, it may require more complex architecture and the authors could clarify how the proposed algorithm performs in terms of scalability and computation cost.
The notation used in the paper sometimes can be confusing. For example - in equation 4 - alpha_ij is a value not a function - alpha_ij(t) can be noted as function of 't'
It is mentioned that the rewards vary as the training proceeds and it would have been interesting to explore how any of simple bandit algorithms perform in the experiments or how to apply simple bandit algorithms for the current experiments. The authors could try and adapt a simple eps-greedy method to solve this problem
The algorithm based on MAB where a combinatorial set of neighbors with size k needs to be chosen. The size parameters 'k' will be a hyperparameter for the model which needs to tuned using validation set. The authors can provide clarification on how to select the size k.

__ Correctness__: The technical content of paper appears to be correct. There could be additional experiments conducted to see how complex architecture can benefit such sampling approach for effective training of the model

__ Clarity__: The paper is written and some notation in the equations can be clarified.

__ Relation to Prior Work__: The authors summarize three types of related works for training graph neural networks. The authors clearly mentioned and highlighted their contributions with respect to related work.

__ Reproducibility__: No

__ Additional Feedback__: