__ Summary and Contributions__: This paper introduces the concept of consistency of deep learning models and proposes a dynamic snapshot ensemble method to improve consistency. Experimental results are provided for sensitivity analysis of ensemble methods with the proposed consistency measure.

__ Strengths__: A new formally defined consistency measure and a new ensemble method called dynamic snapshot ensemble are introduced.

__ Weaknesses__: The major weakness is that Theorem 1 and the other theoretical results actually show only that the distance between the centroid vectors (related to the consistency of the average predictor) is not greater than the average distance between pairs of prediction vectors. This does not imply that the consistency of the average predictor is better than that of any individual predictor. Hence this theoretical justification for the better consistency of ensemble methods is insufficient.
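To make the gap concrete, here is a small numeric sketch (my own hypothetical example with two base learners per run and a single test point, not taken from the paper): the centroid distance satisfies the averaged pairwise bound, yet still exceeds the distance for one individual learner, so the ensemble can be less consistent than that learner.

```python
import numpy as np

# Two training runs, each producing m = 2 base learners.
# Prediction (probability) vectors on a single test point.
a1 = np.array([0.9, 0.1]); b1 = np.array([0.9, 0.1])   # learner 1: identical across runs
a2 = np.array([0.2, 0.8]); b2 = np.array([0.8, 0.2])   # learner 2: disagrees across runs

cent_a = (a1 + a2) / 2   # average predictor, run A
cent_b = (b1 + b2) / 2   # average predictor, run B

d_centroid = np.linalg.norm(cent_a - cent_b)
d_pairs = [np.linalg.norm(a1 - b1), np.linalg.norm(a2 - b2)]

# Theorem-style bound (triangle inequality): centroid distance <= average pairwise distance.
assert d_centroid <= np.mean(d_pairs) + 1e-12
# ...yet the centroid distance still exceeds learner 1's distance (which is 0),
# so the ensemble is LESS consistent than that individual learner.
assert d_centroid > d_pairs[0]
```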

__ Correctness__:
The proofs appear correct. However, the theoretical results (e.g., the theorems and corollaries) are not enough to justify the claim that ensemble methods improve consistency.

__ Clarity__: This paper is well presented.

__ Relation to Prior Work__: Related works are discussed.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: - The paper proposes consistency and correct-consistency as two metrics for evaluating the trustworthiness of a classification model.
- Presents proofs showing that an ensemble is more consistent than its individual base learners, and that adding a base learner whose accuracy is higher than the average accuracy of the existing base learners improves correct-consistency with some probability.
- Proposes a dynamic snapshot ensemble with pruning and empirically shows that its performance in terms of accuracy and correct-consistency is better than snapshot ensembles and comparable with the deep ensembles of Lakshminarayanan et al., while being computationally more efficient.

__ Strengths__: The paper considers the notion of consistency in terms of producing the same predictions for the same test data points when retraining a model multiple times on one training set or on different training sets. Similarly, correct-consistency is defined as reproducing the correct predictions when the model is retrained. These metrics can indeed be important, especially in setups where more training data becomes available for periodic retraining (see also weaknesses). The definitions of the metrics are intuitive, and the proposed algorithms are motivated by theorems proved for the Euclidean distance between prediction (probability) vectors. The experiments empirically validate the claims (to some extent) and compare the proposed algorithm with three other approaches to forming ensembles.
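For clarity on what is being measured, a minimal sketch of the two metrics as described above (my own reading of the definitions as agreement percentages between two retrained models; the paper's exact formulations may differ):

```python
import numpy as np

def consistency(pred_a, pred_b):
    """Fraction of test points on which two retrained models agree."""
    return float(np.mean(pred_a == pred_b))

def correct_consistency(pred_a, pred_b, labels):
    """Fraction of test points on which both models are correct (hence agree on the truth)."""
    return float(np.mean((pred_a == labels) & (pred_b == labels)))

# Hypothetical predictions from two training runs on the same 5-point test set.
labels = np.array([0, 1, 1, 0, 2])
run1   = np.array([0, 1, 1, 0, 1])
run2   = np.array([0, 1, 2, 0, 1])

print(consistency(run1, run2))                   # 0.8 — the runs agree on 4 of 5 points
print(correct_consistency(run1, run2, labels))   # 0.6 — both runs correct on 3 of 5 points
```

Note that the two metrics can diverge: the runs agree on the last point, but both are wrong there, so it counts toward consistency and not correct-consistency.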

__ Weaknesses__: Based on the defined metrics and the provided theorems, the scope seems limited to single-label classification problems. Moreover, as the authors also mention in the paper, an important application of this notion of consistency is when more training data becomes available. Although the introduced metrics consider the percentage of agreement in predictions/correct predictions between two models, they do not necessarily encourage reproducing all correct predictions in addition to producing new correct predictions. The experiments should show how the accuracy and the percentages of correct->incorrect and incorrect->correct predictions change for D1->D2 and D2->D3 under the different approaches to forming ensembles.
Additionally, the effect of the random initialization and shuffling in the algorithm is not clear from the experiments. How would the results look if a larger N were used and the top half of the best-performing learners were kept for the ensemble?
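The requested transition percentages could be computed along these lines (a hedged sketch with made-up predictions; `transition_fractions` is my own illustrative helper, not from the paper):

```python
import numpy as np

def transition_fractions(pred_old, pred_new, labels):
    """Fractions of test points whose correctness flips between model generations."""
    old_ok = pred_old == labels
    new_ok = pred_new == labels
    return {
        "correct->incorrect": float(np.mean(old_ok & ~new_ok)),
        "incorrect->correct": float(np.mean(~old_ok & new_ok)),
    }

# Hypothetical predictions: a model trained on D1, then retrained on D2 (D1 ⊂ D2).
labels   = np.array([0, 1, 1, 0])
d1_model = np.array([0, 1, 0, 0])
d2_model = np.array([0, 0, 1, 0])

print(transition_fractions(d1_model, d2_model, labels))
# {'correct->incorrect': 0.25, 'incorrect->correct': 0.25}
```

A method can keep correct->incorrect low while also keeping incorrect->correct low, which is exactly the failure mode the metrics do not penalize.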

__ Correctness__: The claims, method, and empirical evaluations seem sound and correct. The experimental results could be improved to better illustrate the comparisons between methods, as mentioned above.

__ Clarity__: The paper is written pretty clearly. Some experimental details for reproducibility are missing, although the authors have said the code for the experiments will be provided which should solve this issue.

__ Relation to Prior Work__: The paper discusses related work and differentiates itself from former notions of reproducibility and discusses similarities and differences from previous ensemble methods in DL.

__ Reproducibility__: No

__ Additional Feedback__: Update after rebuttal:
I read the authors’ response, but I still have my main concerns.
As mentioned in my review, the introduced metrics consider the percentage of agreement in predictions/correct predictions between two models, but they do not necessarily encourage reproducing all correct predictions in addition to producing new correct predictions. The results provided in the authors' rebuttal in fact seem to show this: although the proposed method has higher correct->correct percentages, its percentages of incorrect->correct predictions are among the lowest. It is desirable to also have higher percentages of incorrect->correct as more training data becomes available, which is an important application, as the authors also mention in the paper (as in the paper's experimental setup, where D1 is a subset of D2 and D2 is a subset of D3).
Moreover, the effect of the random initialization and shuffling in the proposed algorithm is still not clear from the experiments. A possible ablation study would be to set N=2m and keep the top-performing half (m). The current ablation studies consider snapshot without any pruning, i.e., a larger N with beta=1 (keeping all).

__ Summary and Contributions__: Additional comments:
- The terms "significant improvements (L282)", "significantly reduce (L298)", and "significant performance improvement (L303)" should be used very carefully, because readers may take them to mean the results are STATISTICALLY significant.
- "ACC and CON seem to be correlated with each other" -- The first example provided by the authors is not a good one, as the differences are very subtle. The second example seems unrelated to my concern.
- Other concerns are well addressed.
In summary, I have a concern about the terminology used in the paper; please consider using a different term. One concern remains unaddressed, but the other concerns have been addressed. My overall score is unchanged.
----------
The authors point out the importance of producing consistently correct outputs for improving a model's trustworthiness. They then theoretically and empirically investigate why and how ensembles improve the consistency and correct-consistency of DL models. The authors formally define the consistency of a DL classifier and provide theoretical and empirical explanations of how to improve it, which can bring more attention to consistency estimation in DL.

__ Strengths__: - This paper formally defines the consistency of a classifier and investigates the benefits of ensemble approaches for improving consistency. The concept of consistency has been discussed before, but few works have systematically defined it and performed a theoretical and empirical assessment of it. This is an interesting and timely problem for improving the trustworthiness of DL.
- The proposed method is intuitive and built on clear theoretical motivation.

__ Weaknesses__: - The proposed method is not significantly more powerful than ExtBagging in terms of accuracy and consistency.
- In Table 2, ACC and CON seem to be correlated with each other. A comparison with methods that have similar accuracy to DynSnap but significantly lower consistency would be more interesting and would help support the usefulness of the consistency measure. Based on the current experimental results, CON seems redundant for evaluating ensemble approaches, because ACC and CON appear correlated.

__ Correctness__: Yes.

__ Clarity__: Yes. This paper is very well written.
- New terms are clearly defined.
- The scope of the paper is well clarified.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: - Generalization of consistency: do the theoretical findings still hold when a different distance measure is used to define consistency?
Minor:
(1) Table 2: bold the best results.
(2) Provide more details about the replication results. How many replications did the authors perform?

__ Summary and Contributions__: The paper formally defines and studies ‘consistency’ and ‘correct-consistency’ in the context of periodic retraining of a deployed model, where the outputs of successive generations might not agree on the correct labels assigned to the same input. The authors present an ensemble-based technique to improve ‘consistency’ and ‘correct-consistency’, provide a theoretical explanation, and validate the proposed approach on several datasets.

__ Strengths__: The paper formally defines and tackles the ‘consistency’ and ‘correct-consistency’ of a model over successive generations. This is an important problem to solve, since consistent model behavior is important for increasing users' trust. These terms are well defined, with appropriate examples.
The authors give a theoretical justification of why ensemble-based techniques increase ‘consistency’ and ‘correct-consistency’ and validate these theorems with extensive experiments on three datasets.
Based on this theoretical justification, the authors propose a novel ensemble-based technique to improve ‘correct-consistency’ and compare it to a bagging approach in which ensemble models are trained from scratch with random initialization and shuffling of the data. The proposed approach achieves comparable correct-consistency with much less training time.

__ Weaknesses__: The motivation behind some parts of the proposed approach is not clearly stated. For example, it is not explained why learning-rate scheduling is important. An ablation study to understand the contribution of these components might help in understanding the proposed approach better.
The authors do not discuss how the theoretical justification applies in a setting where the training data distribution changes over successive model generations, although the empirical evaluation does take this factor into account.

__ Correctness__: The claims and empirical methodology used in the paper are correct.

__ Clarity__: The paper is well written and the proposed technique and experimentations are easy to understand.

__ Relation to Prior Work__: There is not much prior work in the direction of ‘correct-consistency’, but the authors do mention prior work on reproducibility and on ensemble-based approaches in deep learning.

__ Reproducibility__: Yes

__ Additional Feedback__: