NeurIPS 2019
Reviewer 1
Originality: Many recent benchmarks have been proposed that examine uncertainty in deep learning models. However, the authors provide a sufficiently diverse suite of tasks, not just a single task, which strengthens their argument by demonstrating transferability to more than one domain. Moreover, to the best of my knowledge, this is the first large-scale empirical comparison of uncertainty calibration methods (e.g. temperature scaling) against Bayesian deep learning methods (e.g. SVI and MC Dropout).

Quality & Clarity: The ideas are clearly communicated and the paper is carefully written. The experimental setup is described adequately and the results seem to align with our intuition.

Significance: The empirical results and motivation of this paper can be very impactful, especially for practitioners but also for researchers. With exact Bayesian inference methods (e.g. HMC), we would not expect to observe a degradation of uncertainty as the data distribution shifts, but due to the approximations in the variational methods (e.g. SVI, Dropout) this side effect emerges. The paper highlights this and goes against the intuition many may have held about deep-neural-network-based models.
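For concreteness, a minimal sketch of the post-hoc temperature scaling mentioned above (Guo et al., 2017), assuming NumPy arrays of held-out logits and integer labels; the helper names are illustrative and not taken from the paper's code:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, T=1.0):
    # Divide logits by the temperature T before normalizing.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(T, logits, labels):
    # Negative log-likelihood of the held-out labels under temperature-scaled probabilities.
    p = softmax(logits, T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(val_logits, val_labels):
    # Fit the single scalar T on a validation set by minimizing NLL;
    # at test time, predictions use softmax(test_logits, T).
    res = minimize_scalar(nll, bounds=(0.05, 10.0),
                          args=(val_logits, val_labels), method="bounded")
    return res.x
```

Because the transformation is monotone, the predicted class (and hence accuracy) is unchanged; only the confidences are rescaled, which is why calibration fitted on in-distribution validation data need not carry over once the data distribution shifts.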
Reviewer 2
UPDATE: I thank the authors for their feedback, which I have read. The authors have done well in addressing my (and other reviewers') concerns and I am raising my score to an 8.

----------------------------------------

Thanks to the authors for the hard work on this paper.

==Originality==
The work is original. I know of no other suite of benchmarks that evaluates predictive uncertainty for a variety of methods on a set of datasets.

==Quality==
The work is good. The experiments are well-designed, and the metrics are appropriate and informative. There are some questions I had about the work that I wish were answered, listed below.

Why does SVI perform well on MNIST but poorly on every other dataset considered? I believe it is within the scope of this work to offer at least a basic explanation of this observation.

It would be nice to have a proposed procedure for an applied ML practitioner who wants to compare a variety of uncertainty estimation procedures. Should one create one's own skew and OOD datasets? Are there any principles that are important to keep in mind while doing that? How hard will it be to use your eventually released code to do that? I ask these questions because your paper is extremely useful already, and it would be great to have this additional bridge discussion to allow someone to figure out whether the SVI MNIST results are a one-time fluke or if SVI might be great for their own dataset. As of now, the reader has no idea what it means that SVI works well on MNIST but is hard to train and/or performs poorly on the other datasets in this work.

Some questions/comments regarding the Ensemble method:
- Comparing accuracy between the ensemble method and the other methods is unfair: ensembles are likely to do better. It might be obvious to some readers, but I recommend pointing this out as well. It also raises the question of whether the ensemble method performs best simply because it has k times the number of parameters of (most of) the other methods. Can this be confirmed somehow? Does more capacity mean better OOD calibration?
- In the original Ensembles paper (Lakshminarayanan et al., 2017), there are two more tricks used to make this method work: (a) the heteroscedastic loss in eq. 1 and (b) adversarial training in sec. 2.3 (see the sketch after this review). Are either of these used in your implementation, or is it just a plain ensemble? I know the code will be released and readers will be able to answer this question themselves, but it is probably worthwhile to address it in the main text or the appendix.

Minor: In section 4.4 SVI is not used, and the reason is explained. In section 4.3 SVI is not used, but the reason is not explained.

==Clarity==
The text is clear and well-written.

==Significance==
The work is fairly significant. It will ground the study of uncertainty estimation and hopefully provide a standard suite for future researchers to reach for when they are developing new uncertainty estimation algorithms.
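For reference on the two ensemble ingredients discussed above, a minimal sketch of plain ensemble averaging and of the heteroscedastic Gaussian NLL of eq. 1 in Lakshminarayanan et al. (2017); the function names are illustrative, not the authors' released code:

```python
import numpy as np

def ensemble_predict(member_probs):
    # Plain deep ensemble: average the K members' predictive distributions.
    # member_probs has shape (K, N, num_classes).
    return np.mean(member_probs, axis=0)

def heteroscedastic_nll(y, mu, log_var):
    # Gaussian NLL (eq. 1 of Lakshminarayanan et al., 2017, up to a constant):
    # for regression, each network predicts a mean mu(x) and variance sigma^2(x)
    # and is trained to minimize this loss rather than plain squared error.
    return np.mean(0.5 * log_var + 0.5 * (y - mu) ** 2 / np.exp(log_var))
```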
Reviewer 3
Strengths
– Addresses an important problem, and the need for predictive uncertainty under dataset shift is well-motivated.
– Performs and presents a comprehensive set of experiments on several state-of-the-art methods, on datasets across multiple modalities.
– Surveys and summarizes the dominant threads in prior work clearly.
– Includes comprehensive supplementary material and experimental details to enable reproducibility.

Weaknesses / Questions
– The paper conducts a large-scale empirical study and draws direct conclusions. However, there is little by way of analysis or actionable takeaways from these conclusions. For example, have the authors analyzed why SVI and ensembles tend to perform best for MNIST and CIFAR-10 respectively, towards the goal of building a better general understanding of what kinds of approaches are best suited to any given dataset?
– Have the authors investigated the performance on different OOD datasets that are more / less similar to the source dataset (say MSCOCO vs SVHN as OOD for ImageNet), and checked whether consistent orderings of the various approaches are observed?
– As an analysis paper with several different sets of experiments, I found the experimental section somewhat disorganized. In particular, Figures 1-2 and 4 required repeated cross-referencing, and a consistent ordering scheme for plots would have made for a significantly better read.
– L180: "SVI .. outperforms all methods by a large margin ..": While SVI does seem to have lower Brier scores with shift, accuracies don't appear to be any better (figure 1b) – how was this observation made?
– L135: "However, both measure .. not directly measured by proper scoring rules": What are these additional properties that ECE and entropy capture? (A minimal sketch of the two metrics follows at the end of this review.)

==================== Final recommendation =========================
I have read and am satisfied with the author response and will raise my score to 7. I also recommend that (as mentioned in the author response) experiments to better understand the relation between SVI performance and model specification, and better tuning of dropout models, be included in the final version.
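For the two metrics raised in the last two questions, a minimal sketch of how the Brier score (a proper scoring rule) and ECE are commonly computed, assuming NumPy arrays of predicted class probabilities and integer labels; helper names are illustrative, not the paper's implementation:

```python
import numpy as np

def brier_score(probs, labels):
    # Proper scoring rule: squared error between the predictive distribution
    # and the one-hot label, averaged over examples.
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def expected_calibration_error(probs, labels, num_bins=10):
    # ECE: bin predictions by top-class confidence and take the weighted
    # average of |accuracy - confidence| per bin. Unlike a proper scoring
    # rule, it looks only at the confidence of the predicted class.
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = np.mean(pred[mask] == labels[mask])
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece
```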