__ Summary and Contributions__: The paper presents a thorough empirical study of how the number of networks in an ensemble and the size of each network relate to each other. It focuses on the calibrated NLL (CNLL) of the ensembles and seeks the optimal network size and number of networks given a fixed network training budget. The paper relates each finding to a power law while varying the size and the number of the models. The experiments are carried out systematically on two architectures (VGG, WideResNets) and one dataset (CIFAR-100).

__ Strengths__: - The paper is very well motivated. The use of ensembles is very important for many reasons, and studying the power law under a given memory budget is very important to the field.
- The paper studies the relation between CNLL, the size of each model, and the number of models in the ensemble, and relates it to a power law.
- Although the paper doesn't propose anything novel, the paper is clearly valuable as an empirical study in the field.
- The study provides very thorough explanation and evaluation, and presents its results in a clear and interpretable manner.
- The paper does a very thorough and systematic analysis of the models over the CNLL metric.
- To my knowledge, this is the only paper of such kind.

__ Weaknesses__: - I would like to see a more thorough related work section. A thorough empirical study should also cite the important related works. I would like the authors to consider citing the following papers, which are important for the study of deep ensembles [1,2,3,4,5,6,7,8].
- I would like the authors to discuss the relation between the ensemble architecture considered in the paper and the one in [2], which uses shared encoders and n heads.
- Another important point to comment on is the limit of the number of models n → ∞ and how the ensemble then approaches a Bayesian NN [see 4,5].
- Many recent papers have also suggested the importance of diversity in NNs, as a lack of diversity can cause the NNs to collapse to the same output space [2,4,5] (and [9], which is a more classical take on the subject). The results seem to suggest that even if that problem exists, the models are still able to learn meaningful uncertainty. Why might that be?
- It would also be good if the authors considered adding another dataset alongside CIFAR-100, as CIFAR-100 is a dataset of small 32×32 images.
- The authors correctly recognize CNLL as an important metric for ensembles, but it would also be interesting to see how the models perform on some benchmark OOD detection tasks (such as the ones considered in [2]). This is not imperative; however, showing it would definitely add interest to the paper.
- It is also important to do a timing analysis on how long it takes to train n-models of some size and how that varies. A benchmarked timing analysis is very important since training more networks and deeper networks obviously comes at a cost, which should be discussed in an empirical study.
[1] https://arxiv.org/abs/1804.00382
[2] https://arxiv.org/abs/2003.04514
[3] https://arxiv.org/abs/1912.02757
[4] https://arxiv.org/abs/2001.10995
[5] https://arxiv.org/abs/2002.08791
[6] http://papers.nips.cc/paper/7219-simple-and-scalable-predictive-uncertainty-estimation-using-deep-ensembles
[7] https://arxiv.org/abs/2005.07292
[8] http://papers.nips.cc/paper/6270-stochastic-multiple-choice-learning-for-training-diverse-deep-ensembles
[9] https://dl.acm.org/doi/abs/10.1145/1015330.1015385

__ Correctness__: - The paper is an empirical analysis; the experiments are properly supported and the claims are well-founded.
- I agree with all the claims and methodology used in the paper. The paper is thorough in validating claims and chooses an appropriate metric to perform the empirical analysis.

__ Clarity__: The paper is very well written and the experiments are well presented in the plots. However, the related works section needs significant improvement. Empirical studies should have a strong related works section, as they serve to represent a whole body of work, and the literature on ensembling NNs is extensive (see weaknesses for some suggestions).

__ Relation to Prior Work__: The paper is an empirical study, and makes justified choices. The paper mentions concurrent work but the work seems to be sufficiently different.

__ Reproducibility__: Yes

__ Additional Feedback__: Happy to increase my score further, if the authors incorporate my suggestions in the rebuttal.
POST REBUTTAL:
I have decided to stick with my score after reading the rebuttal and the other reviews. The rebuttal sufficiently addresses my points, although I do not agree with the point about bagging reducing the uncertainty estimation. As the functional diversity of the ensembles increases, the uncertainty should not decrease, since ensembles suffer from a notion of "posterior collapse". I suggest that the authors omit or rephrase that discussion in the final camera-ready paper. Regardless, I do think this paper is very important for the community and should, without a doubt, be accepted.

__ Summary and Contributions__: This paper looks at the calibrated log likelihood as a proxy for uncertainty estimation of a deep ensemble. It shows that one can fit a power law to the CNLL as a function of ensemble size, network size (width), and total parameter count of the ensemble. This has a number of benefits, including the ability to reason about the uncertainty properties of an infinite ensemble size and to determine the optimal split of ensemble size and network width given a fixed memory budget.
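
To make the kind of fit described above concrete, here is a minimal sketch in plain Python. It assumes a saturating form CNLL(n) ≈ c + b·n^(-a), which is an illustrative choice and not necessarily the parameterization used in the paper; the function name `fit_power_law` and the synthetic data are hypothetical. Under this form, c plays the role of the predicted CNLL of an infinitely large ensemble.

```python
def fit_power_law(ns, values, exponents=None):
    """Fit values ~ c + b * n**(-a).

    Assumed functional form (not taken from the paper). The exponent a is
    found by grid search; for each candidate a, (b, c) come from ordinary
    least squares on the feature x = n**(-a).
    """
    if exponents is None:
        exponents = [0.05 * k for k in range(1, 61)]  # candidate a in (0, 3]
    m = len(ns)
    y_mean = sum(values) / m
    best = None
    for a in exponents:
        xs = [n ** (-a) for n in ns]
        x_mean = sum(xs) / m
        sxx = sum((x - x_mean) ** 2 for x in xs)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
        b = sxy / sxx
        c = y_mean - b * x_mean
        sse = sum((c + b * x - y) ** 2 for x, y in zip(xs, values))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


# Synthetic, noise-free CNLL values (made up for illustration only).
ensemble_sizes = [1, 2, 4, 8, 16, 32]
cnll_values = [0.9 + 0.5 * n ** -1.0 for n in ensemble_sizes]
a, b, c = fit_power_law(ensemble_sizes, cnll_values)
print(f"exponent a={a:.2f}, scale b={b:.2f}, infinite-ensemble limit c={c:.2f}")
```

On real measurements one would likely use a proper nonlinear least-squares routine and check the residuals against alternative functional forms.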

__ Strengths__: This paper is extremely thorough and addresses an important practical problem with neural networks, which is that they provide poor uncertainty estimates. This paper is, I believe, the first to provide tools for predicting the uncertainty properties of an ensemble, and even to make theoretical claims about the performance of an infinitely large ensemble. I would hope that this provides building blocks for future work along the lines of what the Neural Tangent Kernel was able to do for infinite-width neural networks.

__ Weaknesses__: The entire work is somewhat premised on the idea that CNLL is a good proxy for uncertainty. While this might be a valid assumption, it would be nice to see the results empirically validated with other potential metrics, such as ECE or TACE to measure calibration, and perhaps also under distribution shift using a dataset like CIFAR10-C.
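
As a reference point for the ECE suggestion, a minimal sketch of the usual binned estimator follows. The function name, the equal-width binning, and the toy inputs are assumptions for illustration; they are not taken from the paper under review.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    Uses equal-width bins [k/num_bins, (k+1)/num_bins); 1.0 goes in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy example: ten predictions at 90% confidence, only half of them correct.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(f"ECE of an overconfident toy model: {overconfident:.2f}")
```

For an ensemble, `confidences` would be computed from the averaged predictive distribution of the members.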

__ Correctness__: As far as I can tell, the experiments are very thorough and methodologically correct.
The original DeepEnsembles paper did have an adversarial training component to it, which doesn't appear to be part of this implementation. I don't think that's a substantial problem, but it might be good to mention the difference.

__ Clarity__: Generally it is very well written. There are a couple of typos and grammatical errors sprinkled throughout, so it would be good to do a proofreading pass.

__ Relation to Prior Work__: Yes, as far as I'm aware.

__ Reproducibility__: Yes

__ Additional Feedback__: AFTER REBUTTAL:
After seeing the other reviewers' takes and the authors' response, I am dropping slightly to an 8. I still think this is a very strong and valuable paper, but maybe not as earth-shattering as my initial review indicated.

__ Summary and Contributions__: The authors study the dependence of the calibrated negative log likelihood of an ensemble of networks on the ensemble size n and the network size s (width in particular). They propose to model these dependencies as power laws, give a theoretical argument for why that is appropriate, and observe the behavior of real models (VGG and WideResNet) on CIFAR-100. Based on the fitted power laws, they infer the optimal split between model size s and number of models n, make a prediction, and verify its agreement with empirical observations.

__ Strengths__: Overall I find the paper well written, interesting, and attacking a really important question. I have been wondering about a similar question for some time and I am happy to see a very good attempt at resolving it. The question is clear, well-specified, and well-answered. I particularly appreciate that the authors test their memory split prediction against empirical observations.
I think this is a good paper overall.

__ Weaknesses__: 1) How valid / necessary is the power-law fit? Looking at the values in Figure 1, the range of negative log likelihoods in the leftmost panel is 0.8 -1.8. The middle panel show the linear scale and beyond the n = 1 point, the curves look pretty straight to me. That is reflected in the right panel where you show the index of the powerlaw is just a bit above -1.0, so essentially 1/n and a bit, but a bit slower decay. Many functions would locally look like this. What are the main reasons for using the powerlaw fit in particular?
2) Derivation in equation 2. What would be the conditions on the derivation in equation 2 to work? You characterize the distribution by its first two moments, which seems fine, but how sensitive is your conclusions that the resulting NLL depends on 1/n (=> the powerlaw) on the your assumption about the distribution of p*? In particular since you work with the log of those, will there be any other additional complications caused by that?
3) Different architectures, different power-laws? I wonder how to use this in practice given that different architectures seem to produce different powerlaws. Would I have to do an exploration of the (n,s) plane first, make a fit, and only then be able to predict the optimal memory split? This could defeat the practical purpose if that exploration took a lot of resources.
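
For concreteness, the kind of argument question 2 refers to can be sketched with a second-order Taylor expansion (delta method). This is a generic sketch and may differ from the paper's equation 2; the symbols \mu and \sigma^2 and the assumption of i.i.d. member predictions are illustrative assumptions.

```latex
% Assumption: member probabilities p_i^* of the true class are i.i.d.
% with mean \mu > 0 and variance \sigma^2.
\bar{p}_n = \frac{1}{n} \sum_{i=1}^{n} p_i^*,
\qquad
\mathbb{E}[\bar{p}_n] = \mu,
\qquad
\operatorname{Var}[\bar{p}_n] = \frac{\sigma^2}{n}.

% Expand -\log around \mu:
-\log \bar{p}_n
  = -\log \mu
    - \frac{\bar{p}_n - \mu}{\mu}
    + \frac{(\bar{p}_n - \mu)^2}{2 \mu^2}
    + \dots

% Taking expectations, the linear term vanishes:
\mathbb{E}\left[-\log \bar{p}_n\right]
  \approx -\log \mu + \frac{\sigma^2}{2 \mu^2} \cdot \frac{1}{n}.
```

The approximation relies on the remainder being negligible, which requires finite higher moments and \bar{p}_n staying away from zero; the logarithm makes the expansion least reliable exactly when the true-class probability is small.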

__ Correctness__: The paper seems correct and sound, as far as I can tell.

__ Clarity__: The paper is clearly written.

__ Relation to Prior Work__: The relevant work seems broadly mentioned, however, I am not an expert in the negative log likelihood of ensembles in particular, so I cannot judge the completeness.

__ Reproducibility__: Yes

__ Additional Feedback__: POST REBUTTAL:
The authors addressed my questions well and I will keep my score at 7 = accept. I really like the paper; its message is clear and I think it will be useful to the field.

__ Summary and Contributions__: Investigate the presence of power laws in the calibrated negative log likelihood (CNLL) of deep ensembles. Theoretically motivate their presence and verify it experimentally. Show that the power laws can be used to predict the CNLL of different memory splits (several medium-size networks while keeping the total number of parameters constant) and thus to find the optimal split.

__ Strengths__: - Significance: Prediction based on the observed power laws is useful in choosing deep ensemble memory splits
- Theoretical grounding and empirical evaluation look good, but I haven't checked them in detail
- Provides more data on the double descent behavior, which the authors find does not occur in CNLL, in contrast to NLL (Figure 4).

__ Weaknesses__: No explanation for why calibration removes the double-descent behaviour
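
To make the NLL versus CNLL comparison concrete, here is a minimal sketch that assumes calibration means temperature scaling: a single scalar temperature is chosen on held-out data and then applied at test time. The function names, the grid of temperatures, and the toy logits are hypothetical, and the paper's exact calibration protocol may differ. The sketch does not explain the missing double descent; it only shows what the calibration step changes.

```python
import math


def nll(logits, labels, temperature=1.0):
    """Mean negative log likelihood of softmax(logits / temperature)."""
    total = 0.0
    for row, label in zip(logits, labels):
        scaled = [z / temperature for z in row]
        peak = max(scaled)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)


def calibrated_nll(val_logits, val_labels, test_logits, test_labels,
                   temperatures=None):
    """Pick the temperature minimizing validation NLL, then score the test set."""
    if temperatures is None:
        temperatures = [0.1 * k for k in range(1, 101)]  # grid over (0, 10]
    best_t = min(temperatures, key=lambda t: nll(val_logits, val_labels, t))
    return nll(test_logits, test_labels, best_t), best_t


# Toy overconfident model: very large margins, one of four predictions wrong.
toy_logits = [[10.0, 0.0], [10.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
toy_labels = [0, 0, 0, 0]
plain = nll(toy_logits, toy_labels)
cnll, temp = calibrated_nll(toy_logits, toy_labels, toy_logits, toy_labels)
print(f"NLL={plain:.3f}, CNLL={cnll:.3f} at temperature {temp:.1f}")
```

In this toy case the plain NLL is dominated by the single confident mistake, while the calibrated value is not. The toy example reuses the same data for validation and test purely for brevity.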

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: Please include the code in the supplementary material