__ Summary and Contributions__: The paper investigates networks with stochastic architectures (NSA) and identifies two associated issues: training/test disparity and mode collapse. Correspondingly, the authors propose solutions to alleviate these issues. Experimental results demonstrate the effectiveness of their techniques and also show promising extensions of NSA to various applications. Overall, it is novel and interesting to explore these problems, although some of the analyses are not clear enough to illustrate the problems and solutions.
[The authors' feedback mostly addresses my concerns, and I am in favour of acceptance.]

__ Strengths__: 1. The paper investigates a number of interesting and promising points in neural architecture search, e.g., train/test disparity, mode collapse, and architecture generalization, which are seldom explored in previous work.
2. The authors conduct a set of analyses to support their claims, and the experiments are adequate and informative.
3. Various extensions of NSA are provided, demonstrating the superiority of NSA in specific scenarios.

__ Weaknesses__: 1. The paper seeks to answer a set of important research questions; however, for some of them the analysis is somewhat complicated and not very convincing. See the detailed comments below.
2. Throughout the analysis, some sections present a solution without first verifying the premises or assumptions. See the detailed comments below.
3. The writing is a bit hard to follow. See the comments in [Clarity].

__ Correctness__: Most analyses are based on intuition and verified by empirical results. It is hard to say they are scientifically correct, but they could be empirically effective.

__ Clarity__: The overall structure is clear, but the writing is a bit hard to follow in some parts. For instance:
1) In L212-L216, it is hard to follow the exact solution. It would be better to add some formulas or diagrams, even if only in the Appendix.
2) In L295, how is the mutual information used in that experiment?

__ Relation to Prior Work__: The relevant literature is adequately incorporated.

__ Reproducibility__: Yes

__ Additional Feedback__: 1. Have the authors tried tracking the values of var(\mu) for NSA and NSA-i to verify the assumptions underlying the improvement, namely that h_{i,\alpha} and h_{j,\alpha} are highly correlated while h_{i,\alpha_i} and h_{j,\alpha_j} are (approximately) i.i.d.? This would be a more direct justification of the proposed NSA-i.
2. In Sec 4.1, while the proposed NSA-id shows higher ensemble accuracies, it seems as if the models are still not diverse, as the rise in accuracy saturates quickly. In other words, it remains questionable whether the rise in accuracy comes from the diversity of architectures.
3. In L206: how do we know that weight sharing is the root cause of mode collapse? Is there a more formal explanation of the cause?
4. The paper focuses on a refined search space grounded on Wide-ResNet, which could be very different from the common search spaces employed in ENAS/DARTS. So far the empirical findings only hold for NSA, not for general NAS.
5. How long does the search algorithm take to train? I see many connections among different layers of Wide-ResNet in the visualizations in the Appendix, which could be very time-consuming.
6. In Table 2, NSA uses significantly more parameters (and probably more computational FLOPs) compared to DARTS and ENAS, so the slight improvement in accuracy is less surprising.

__ Summary and Contributions__: The authors investigate the source of multiple issues in neural architecture search (NAS) by drilling down on problems with the underlying networks with stochastic architectures (NSA). Specifically, the authors identify batch norm as the root cause of the observed train-test disparity in NSA. Batch norm typically uses different methods to calculate batch statistics at train vs test time, and this causes test-time predictions of NSA to be much more stochastic and lower performing than train-time predictions. Additionally, the authors look at “mode collapse” of model weights in NSA. This refers to how NSA would naively converge on weights that are robust to different architectures rather than weights that support diverse predictions. The authors solve this issue by augmenting NSA models with architecture-specific weights. Finally, the authors investigate how training on a limited subset of the architecture space affects performance when NSA weights are applied to previously unseen architectures. The authors report that when NSA-i models are trained on “enough” (~500) architectures they generalize well to unseen architectures.
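The batch-norm train/test mismatch described in this summary can be sketched minimally as follows (an illustrative toy, not the paper's implementation; the class name, momentum, and constants are my own choices). In training mode the layer normalizes with the current batch's statistics; in eval mode it falls back on running estimates, so the same batch can produce different outputs in the two modes:

```python
import numpy as np

rng = np.random.default_rng(0)

class BatchNorm1D:
    """Minimal batch-norm sketch (no learnable scale/shift for brevity)."""
    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.running_mean, self.running_var = 0.0, 1.0

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(), x.var()
            # exponential-moving-average update of the stats used at test time
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)

bn = BatchNorm1D()
for _ in range(5):                                  # a few "training" batches
    bn(rng.normal(3.0, 2.0, size=256), training=True)

batch = rng.normal(3.0, 2.0, size=256)
train_out = bn(batch, training=True)                # uses batch statistics
eval_out = bn(batch, training=False)                # uses running statistics

# Train-mode output is exactly zero-mean by construction; eval-mode output
# is not, because the running estimates lag behind the true batch statistics.
print(train_out.mean(), eval_out.mean())
```

In an NSA, each sampled architecture induces different activation statistics, so the single set of running estimates matches none of them well; this is the mechanism behind the disparity the reviewer paraphrases.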

__ Strengths__: - Strong experiments that demonstrate the authors' claims
- Fig 2 vs Fig 3 is a compelling demonstration of how changing batch norm properties can improve train-test performance
- Good baselines in Table 2
- The batch norm implementation over the architectures seems like a great step for NSA

__ Weaknesses__: I would be interested to see how NSA performs on regression tasks. The authors focus on classification tasks alone. In regression tasks, one can investigate the variance of the predictions over many architectures as a metric of uncertainty quantification. In classification, there's no equivalent metric since one typically does not consider the variance of p(y_i=K|x_i) for a particular class because the "variance" of a probability is a non-standard notion.

__ Correctness__: The claims all appear to be correct to me. I checked all the equations and found no errors and the empirical methodology is sound.

__ Clarity__: Yes. The authors succinctly explain their concepts and present them in an approachable way. I noticed no clunky phrases or grammatical issues/typos. Well done.

__ Relation to Prior Work__: Yes, with the caveat that the authors missed a reference to a related piece of work from NeurIPS 2019: https://arxiv.org/pdf/1812.09584.pdf

__ Reproducibility__: Yes

__ Additional Feedback__: Updating my review to say that I acknowledge the author feedback and I've updated my score by a point because of the thoughtfulness of the feedback.

__ Summary and Contributions__: This work explores the properties of networks with stochastic architectures (NSA). The authors propose solutions to the issues they find and apply NSA to model ensembling, uncertainty estimation, and semi-supervised learning. Their evaluation shows that their proposals improve performance on the aforementioned applications.

__ Strengths__: * The paper is well-written, well-structured, and easy to understand. I really enjoyed reading this paper.
* The paper identifies two problems and proposes solutions that empirically improve the optimization.
* The paper presents a diverse evaluation and improves on the state of the art in classification, uncertainty estimation, and semi-supervised learning.
==== Post-rebuttal ====
Thank you to the authors for the rebuttal and for clarifying some open questions. I found this paper really interesting and therefore keep my score and vote in favour of accepting it.

__ Weaknesses__: * The paper is mostly empirical, and a major part focuses on the drawbacks of batch normalisation.
* The conclusion is more of a summary. The authors could have discussed more about what could be investigated in the future.

__ Correctness__: The claims and empirical methodology are adequate.

__ Clarity__: I do believe the paper is well-structured and well-written, making it easy to understand.

__ Relation to Prior Work__: I do believe the authors clearly described how their work differs from previous contributions.

__ Reproducibility__: Yes

__ Additional Feedback__: I have only some further questions:
* l.81: What are typical choices for the distribution of alpha, and how does this choice affect the optimization?
* Eq.3 and Eq.1 look exactly the same to me; what is the difference? In the case of instance-specific architectures, shouldn't the loss be calculated per instance?
* Figure 4: Shouldn't the dependent weights improve overall performance, as claimed in the paper? I found it surprising that performance decreases. I do not particularly agree that the "ensemble gain is more obvious compared to NSA-i", as the plot does not clearly show this. Can you quantify it?