NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:1674
Title:Addressing Failure Prediction by Learning Model Confidence

Reviewer 1

Originality: The use of the True Class Probability (TCP) is novel in the area of uncertainty estimation for the task of detecting misclassifications. However, as the TCP cannot be known at test time, the authors use an additional network component to predict this score based on intermediate features derived from the predictive model. This is very similar to the task of confidence score estimation in speech recognition. While the use of the TCP as a target is novel, confidence score estimation is not, thus the work has limited novelty.

Quality: The paper is technically sound, and the experiments are sensible and in line with standard practice in the area. The authors are generally honest about the evaluation of their work; however, they do not analyse the limitations of the given approach.

Clarity: The paper is very clearly written, easy to understand and pleasant to read. The authors place the work well within the context of current work. To address the authors' question regarding the use of information-theoretic uncertainty measures for misclassification detection (line 158): it was shown by Malinin and Gales that they perform worse than MCP for misclassification detection.

Significance: The proposed method, while sensible and well evaluated, does not seem to provide a significant advantage over baseline approaches as the complexity of the task grows. One of the issues with the proposed approach, which the authors do not discuss, is that the model is trained to predict the TCP on the training data. Thus, if the model is very good and fits the training data well, then the TCP will be, more often than not, equal to the MCP. Thus, there will only be a few cases where the confidence scores predicted by the model will differ significantly from the MCP. One way to remedy this is to use some kind of meta-learning style approach and train the ConfidNet module on holdout data not used to train the predictive model. Additionally, it may be possible to balance the loss so that it is more sensitive to misclassifications than to correct classifications. Another limitation of the proposed approach is that the theoretical guarantees derived for the TCP do not necessarily hold for the confidence scores predicted by the model.

Reasons to accept: Very well written paper, pleasant to read. Approach is sensible, good experimental evaluation.

Reasons to reject: Limited originality - related to speech recognition confidence score estimation, but such work is not cited. Method does not provide significant gains over baseline approaches.

---POST REBUTTAL COMMENTS---

I will change my rating to a 7 and think this paper should be accepted. I actually quite liked reading this paper and thought that it was professionally written and executed. Furthermore, the authors have engaged with the reviewers' concerns and addressed them or provided new results. I still feel that the method lacks a certain degree of novelty and the gains are not as great relative to baseline models. At the same time, I am aware that 'predicting your own mistakes' is a rather challenging task. Furthermore, the experiments are extensive and have been further expanded. Overall, I feel like this paper is a good demonstration of 'good science practice', and thus, I vote accept.
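To make the TCP/MCP point in the Significance comments concrete, here is a minimal sketch of the two scores and of one possible way to balance the loss toward misclassifications. It is an illustration only, not the authors' implementation: the function names, the squared-error form of the loss, and the `error_weight` parameter are all assumptions introduced here.

```python
def mcp(probs):
    """Maximum class probability: the softmax score of the predicted class."""
    return max(probs)


def tcp(probs, true_label):
    """True class probability: the softmax score of the correct class.
    Requires the label, so it is only available at training time."""
    return probs[true_label]


def weighted_confidence_loss(pred_conf, probs, true_label, error_weight=5.0):
    """Squared error between a predicted confidence and the TCP target.
    Assumption: misclassified samples are up-weighted by a hypothetical
    `error_weight`; correctly classified samples keep weight 1."""
    predicted_label = max(range(len(probs)), key=probs.__getitem__)
    weight = error_weight if predicted_label != true_label else 1.0
    return weight * (pred_conf - tcp(probs, true_label)) ** 2


# Two softmax outputs for a sample whose true label is class 0.
correct_case = [0.7, 0.2, 0.1]  # predicted class 0: TCP equals MCP
error_case = [0.3, 0.6, 0.1]    # predicted class 1: TCP is below MCP
```

When the classifier fits the training set almost perfectly, nearly every training sample looks like `correct_case`, so the TCP target carries little information beyond the MCP. That is the concern raised above.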

Reviewer 2

I've read the author response and increased my score to a 7 - I vote for acceptance, conditional on the authors including coverage-accuracy curves in the final version as they agreed to in the response, and coverage-accuracy numbers (something like table 2 in would suffice) in the supplementary. I think these are important mainly because past work on selective classification uses these, and it would be very helpful for future researchers to compare these numbers. The author response addresses my concerns. It would be good to add intuition about why TCP does better than BCE as a conjecture (future work can check this). The new comparisons against baselines, and TCP on the validation set, were helpful. It would be good to add a note that there could be better ways of using the validation set that maintain accuracy and do even better on your metrics.

Suggestions for improvement:
- Run on more datasets. BCE is the obvious baseline (train a classifier to predict correct vs wrong examples). The supplementary shows that TCP does better than BCE on 3 datasets, and about 0-2% better. This isn't statistically significant even at p = 0.1. It would be nice if the authors had a chance to include at least one more dataset (even if TCP turns out to be worse than BCE there, that would be good to know), and mention these results in the main paper.
- To make the paper even more compelling, I would advocate seeing if this idea still helps when train/test are different domains, e.g. MNIST -> SVHN, or for out-of-domain detection.

----------------------------------

Novelty: Medium. Their goal is to identify incorrectly classified examples. The naive baseline is to directly predict whether an example is correctly or incorrectly classified. This is similar to their BCE loss, and has been done before (e.g. Blatz et al 2004, DeVries and Taylor 2018). Their specific novelty is to predict the softmax probability of the true label. Also they apply it to in-domain selective classification. The novelty is satisfactory, but not particularly high.

Clarity: High. The paper is very clearly written.

Quality: Medium.
- In practice, neural networks are often trained until they achieve 0 training error. In that case, max-confidence and true-confidence on the training set are identical, because the model gets all the examples correct. Did you stop training your models before they reached 0 training error? My concern is that the method may be less applicable for modern neural networks where we typically train until they get 0 error. At the very least, this caveat should be discussed in the paper, since the method seems sensitive to the training procedure.
- It seems like you train the model and the true confidence predictor on the same training set? What if you train the true confidence predictor on the validation set? Could this improve the results, and make it less sensitive to training procedure? My intuition is that this would be better - at least for the naive baseline/BCE where we try and classify an example as being predicted correctly or incorrectly.
- What were the test set accuracies of your models? It could potentially be easier to predict incorrect examples for a model that’s less accurate, so it would be good to see these.
- The improvement of the proposed method over BCE seems fairly small (a 1% - 2.5% improvement in AUPR), and BCE seems to be the obvious baseline. While BCE isn’t exactly what DeVries and Taylor did, it’s fairly close.
- I think the experiments were fairly extensive. One could always ask for more experiments, for example the method could be compared to DeVries and Taylor, or other selective prediction methods like (Geifman and El-Yaniv, 2017). However, it seems satisfactory to me. It would be nice to see coverage/accuracy curves/scores though (El-Yaniv and Wiener, 2010).
- I could not find the GitHub link to the code, even though the reproducibility checklist says it was provided. The appendix says 'all models are trained in a standard way' - more details are needed for reproducibility.

Significance: Medium. The problem of identifying incorrectly classified examples, also known as selective prediction, is important. They show that a simple method can work well. The improvements are modest, but the method is a lot simpler than the alternatives so future work may build off of it.

References:
Confidence Estimation for Machine Translation. John Blatz et al. COLING 2004.
Learning Confidence for Out-of-Distribution Detection in Neural Networks. DeVries and Taylor. arXiv 2018.
Selective Classification for Deep Neural Networks. Geifman and El-Yaniv. NIPS 2017.
On the Foundations of Noise-free Selective Classification. El-Yaniv and Wiener. JMLR 2010.
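For readers unfamiliar with the coverage-accuracy curves requested above, the following is a minimal sketch of how such a curve can be computed from per-sample confidence scores and correctness flags. The function name and the input format are assumptions made for illustration. The sketch is not taken from the paper or from the cited selective classification work.

```python
def coverage_accuracy_curve(confidences, correct):
    """Return a list of (coverage, accuracy) pairs.

    Samples are accepted from most to least confident. After accepting the
    k most confident samples, coverage is k / n and accuracy is the
    fraction of those k samples that were classified correctly.
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    curve = []
    n_correct = 0
    for rank, i in enumerate(order, start=1):
        n_correct += int(correct[i])
        curve.append((rank / len(order), n_correct / rank))
    return curve


# Toy example: four test samples, one misclassified at confidence 0.6.
curve = coverage_accuracy_curve([0.9, 0.8, 0.6, 0.4],
                                [True, True, False, True])
```

A better confidence score ranks errors lower, so accuracy stays high for longer as coverage increases. Reporting accuracy at a few fixed coverage levels gives numbers that later work can compare against.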

Reviewer 3

Update after author rebuttal:
1. The authors have addressed concerns over the use of a held-out set for confidence estimator training, and it would definitely be useful if this discussion is added to the paper. The authors mention challenges in adopting such an approach with small dataset tasks. However, I would recommend a focus on large scale datasets, where even preliminary experiments by the authors show relatively more interesting observations.
2. On calibration issues in the confidence estimation branch, the authors have reported better calibration performance compared to MCP. However, as these details are not clear from the rebuttal, I cannot comment further. I would encourage the authors to not limit their discussion on calibration to a comparison with MCP and to provide a discussion of any calibration issues observed.

------------------------------------------------

In this paper the authors propose a confidence estimation network which is trained for a classifier and shares parameters with the classifier network. The authors are interested in confidence estimation when the model is used in matched conditions. Hence they train even the confidence estimator using the same training data. However, models typically perform very well on the training data, even when compared to matched test data (i.e., test data drawn from the same true distribution). Hence it is not clear whether choosing a held-out set for training the confidence estimator might result in better confidence estimation on test data. It would be very interesting to see if even the confidence estimation network suffers from the same calibration issues as the primary classification neural network. Related analysis would be very informative. Further analyzing the behavior of these confidence estimation models as the test data deviates from the training data would be useful. It is not clear why the confidence estimator needs to share the parameters with the primary classification network. Is it for reducing the computational complexity?
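On the calibration question raised in this review, one common way to quantify calibration is the expected calibration error (ECE). The review does not name a specific metric, so the choice of ECE here is an assumption, as are the function name, the equal-width binning, and the default of 10 bins. This is a sketch of the standard binned estimate, not the authors' evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between mean confidence and accuracy.

    Confidences in [0, 1] are grouped into `n_bins` equal-width bins. Each
    non-empty bin contributes |mean confidence - accuracy|, weighted by the
    fraction of samples that fall into it.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

Applying the same function to the MCP scores and to the scores from the confidence estimation branch would show whether the branch inherits the calibration issues of the primary classifier.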