Reviews: Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration

POST REBUTTAL COMMENTS

This did not affect my decision or score, but the citations on calibration should be improved. The oldest work on calibration that the authors cite is from 2000 (Platt). The definitions of calibration and calibration error, and many of the key ideas, were proposed several decades ago by statisticians and meteorologists such as Brier, Murphy, Winkler, DeGroot, and Fienberg. The calibration error metric proposed there differs from ECE (root-mean-squared instead of mean absolute value); it would be good if the authors could mention that the RMS calibration error is another possible metric. I've included some pointers to the literature below (there are many other papers, but it should be OK to cite just a few):
- Verification of forecasts expressed in terms of probability. GW Brier 1950.
- Reliability of subjective probability forecasts of precipitation. AH Murphy and RL Winkler 1977.
- The comparison and evaluation of forecasters. MH DeGroot and SE Fienberg 1983.

Additionally, vector scaling is not neural specific, and the labeling of this should be changed (the labeling did not affect my decision either).

I've read all the other reviews and the author response, and discussed the paper extensively with the other reviewers. I think this is a good piece of work, although there are several ways it can be strengthened. I stand by my original decision, that this paper is marginally above the acceptance threshold.

I think the connection between Dirichlet calibration and matrix scaling is neat. I do think it's a fairly natural extension of beta calibration: see page 4 of http://proceedings.mlr.press/v54/kull17a/kull17a.pdf where those authors assume P(model outputs | class) is a Beta/2-variable-Dirichlet distribution for each class, and derive the connection to Platt scaling. Here, this paper assumes P(model outputs | class) is a k-variable-Dirichlet (k > 2), and derives the connection to matrix scaling.
However, it is still a good piece of work, and to the best of my knowledge nobody seems to have discovered this extension in the 2 years since.

Detailed thoughts on author response:
- Similarity between matrix scaling and Dirichlet scaling: In their response, the authors say they are not aware of prior use of log-transformed class probabilities in multi-class calibration, so composing a log-transform with matrix scaling is novel. This does not convince me. In fact, regular vector scaling is applied *before* the softmax, which is very close to operating on log-transformed probabilities. Additionally, as I understand it, using log-transformed probabilities is standard when calibrating probabilities (it generally works better). Further, note that vector scaling can essentially be viewed as their method with a very high regularization penalty on the off-diagonal entries.
- Generalizing to other exponential families: The authors say the significance of their theoretical results is that they can be extended to other exponential families. In this context, isn't the Dirichlet the most general member in the discrete exponential family? That is, if the set of possible outcomes is {x \in R^k | x_i \in {0, 1}, sum x_i = 1}, as it is in their multiclass setting, then the most general distribution on x is a k-category Dirichlet distribution. So maybe the authors are suggesting they could add constraints, perhaps if some classes seem more related than others. To strengthen the paper, the authors could sketch this out in more detail.
- Neural effect sizes: for the neural experiments, the authors should explain why the improvement in log-loss/Brier score (e.g. 0.8%/0.7%) is significant (even if classwise-ECE drops by 2%). I don't know if this is very positive or not.
- Non-neural experiments: I'm less familiar with non-neural calibration.
My reservations are: they did not try vector scaling on non-neural calibration (after applying a log-transform), even though this method does best in the neural experiments and would be the natural baseline. I've discussed this with the other reviewers, and we all agree that vector scaling is not "neural specific". Isotonic regression does better in classwise-ECE (but not statistically significantly), and is 1% worse in Brier score (I'm not sure if this is statistically significant). Isotonic is going to be worse in log loss, because the authors' implementation uses sklearn, which optimizes MSE; further, the probabilities are then normalized, but the normalized version is not trained to optimize MSE/log-loss.
- In general, I agree with R2 and R3 that it's not convincing that we should use the new methods over vector scaling. But a calibration method does not need to be better on *all* datasets. It is quite likely that each method is more amenable to certain kinds of data. As mentioned in the review, they could show that there is a significant difference on certain kinds of datasets.
- The p-classwise-ECE metric could be better motivated (e.g. looking at table 20, in many cases all methods get a score of 0/are considered not calibrated). Their argument for the non-neural experiments is that p-classwise-ECE and classwise-ECE show complementary information, but p-classwise-ECE is significant while classwise-ECE is not; is that a good enough reason for using the former? My take is that if the p-classwise-ECE is better but classwise-ECE is worse, perhaps that suggests there are a small number of datasets where the method does much worse, while it often does slightly better? That could still be useful, but I'm not 100% clear about what's going on here.

-----------------------------------------

Originality: Medium.
- The authors say that they demonstrate that the multiclass setting introduces numerous subtleties that have not been recognized by other authors.
Nixon et al 2019 point out similar issues, and their metric is the same as classwise-ECE. They also have some experiments showing that temperature scaling performs worse than vector scaling on this metric. This paper provides much more extensive evidence of this, but it does reduce the novelty of this paper.
- They give a probabilistic interpretation for matrix scaling, showing an equivalence to assuming P(q | Y) ~ Dirichlet(alpha). There is some novelty here. However, it is important to note that the insights are similar to Kull, Filho, and Flach 2017. In that paper on beta calibration, those authors show an equivalence between assuming P(q | Y) ~ Beta(alpha) and (a minor variant of) Platt scaling. This paper’s result is basically the multi-dimensional extension of that result. This is not a bad thing - it just makes me borderline on novelty here. The authors should comment more on the relative novelty compared to that paper - what are some differences in the proof techniques, and what potential impact might this have?
- The off-diagonal regularization for matrix scaling is novel to the best of my knowledge, although it is worth noting that the prior method (vector scaling) is an extreme version of this, where all the off-diagonal elements are set to 0.
- Related work is well covered. The paper could cite the Hosmer-Lemeshow test (Hosmer and Lemeshow 1980), which is a significance test for miscalibration that appears close to some of the metrics they use. They could also cite Kuleshov and Liang 2015, which deals with calibrated structured prediction; a specific instantiation of their method is for multi-class calibration. They have a notion of “event pooling” and their metric is weaker than classwise calibration. They could cite Nixon et al.

Quality: Medium-High
- The experiments are extensive and well done, and I commend the authors for including so many details (on many datasets and metrics). These details could guide practitioners in the future.
That said, I have a few suggestions and questions.
- If we look at classwise-ECE (table 10 supplementary) or confidence-ECE (table 9), isotonic regression seems to do better. Why is p-classwise-ECE chosen over classwise-ECE in the main paper? It’s also hard for me to ascertain the effect size of many of these results - significance and effect sizes can be very different things. For example, their results show that Dirichlet scaling typically leads to better accuracy (but worse classwise-ECE) than isotonic regression on the non-neural network experiments - but what are the effect sizes of these? For the neural network experiments the effect sizes look rather small, and vector scaling does better at classwise-ECE. The paper says that the effect size is small (2%), and regularized matrix scaling does better at log loss, but the latter effect size is very small too. Given the similarity of these methods, I’m not convinced about the value of their regularized matrix scaling.
- Minor: for the non-neural network experiments, it wasn’t clear to me which methods were better at calibrating which models. One could imagine that all methods are roughly the same at calibrating SVMs and MLPs, but some methods are much better at calibrating decision trees and Naive Bayes (which might have very different kinds of biases). I think this is a minor issue, but the authors could think about whether there is a simple way to show this. Again, I realize the authors provide ranks in the supplementary material (e.g. table 10), but effect sizes (for at least some of these) would be good.

Clarity: Medium-High.
The paper is well written. There are some typos, for example the average ranks in table 3 and 4 are identical. The section on “Interpretability” was confusing - I didn’t really follow that paragraph and the associated Figure 2.
It also isn’t clear why vector scaling isn’t a general-purpose calibrator - for methods that output “probabilities” (like Naive Bayes), can’t you apply a log transform and then do vector scaling?

Significance: Medium.
I think this paper has interesting content. However, given that prior work has already focused on classwise-ECE, and their proposed method does not appear to do better than prior methods (isotonic regression, vector scaling), I’m borderline on the significance. The connection between Dirichlet calibration and matrix scaling could lead to future ideas, and their focus on multiclass calibration is in the right direction (they have far more extensive experiments than Nixon et al, and the latter paper only came out a couple of months before the NeurIPS deadline), which could be useful for the community.

References:
Measuring Calibration in Deep Learning. Jeremy Nixon, Michael Dusenberry, Linchuan Zhang, Ghassen Jerfel, Dustin Tran. Arxiv 2019.
Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. Meelis Kull, Telmo de Menezes e Silva Filho, Peter Flach. AISTATS 2017.
Calibrated Structured Prediction. Volodymyr Kuleshov and Percy Liang. NeurIPS 2015.
Goodness of fit tests for the multiple logistic regression model. Hosmer, D.W., Hosmer, T. and Lemeshow, S. Communications in Statistics - Theory and Methods 1980.
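
To make two of the points in this review concrete, here is a minimal NumPy sketch; it is an illustration under stated assumptions, not the authors' implementation. It assumes the calibration map has the form softmax(W log q + b) acting on predicted probabilities q, so that a diagonal W corresponds to the "log transform, then vector scaling" baseline asked about above, and a full W to matrix scaling on log-probabilities. It also assumes a simple equal-width binned estimator of classwise calibration error, with an `order` parameter switching between the mean-absolute (ECE-style) and root-mean-squared variants. The function names, the bin count, and the `eps` clipping are all hypothetical choices made for this sketch.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def log_linear_calibrate(probs, W, b, eps=1e-12):
    """Apply softmax(W @ log(q) + b) to each row q of probs.

    Assumed parameterisation (hypothetical helper, not from the paper's code):
      full W               -> matrix scaling on log-probabilities
      diagonal W           -> vector scaling on log-probabilities
      W = I / T and b = 0  -> temperature scaling with temperature T
    """
    log_q = np.log(np.clip(probs, eps, 1.0))  # clip to avoid log(0)
    return softmax(log_q @ W.T + b)


def classwise_calibration_error(probs, labels, n_bins=10, order=1):
    """Binned classwise calibration error, averaged over classes.

    For each class, predictions are grouped into equal-width bins and the gap
    |mean predicted probability - observed class frequency| is computed per bin.
    order=1 averages the gaps weighted by bin mass (mean absolute, ECE-style);
    order=2 gives the root-mean-squared variant.
    """
    n_classes = probs.shape[1]
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    per_class = []
    for j in range(n_classes):
        p_j = probs[:, j]
        y_j = (labels == j).astype(float)
        bin_ids = np.digitize(p_j, inner_edges)  # values in 0 .. n_bins - 1
        total = 0.0
        for m in range(n_bins):
            mask = bin_ids == m
            if mask.any():
                gap = abs(p_j[mask].mean() - y_j[mask].mean())
                total += mask.mean() * gap ** order  # mask.mean() is the bin mass
        per_class.append(total ** (1.0 / order))
    return float(np.mean(per_class))
```

Under this parameterisation, zeroing the off-diagonal entries of W (for example with `np.diag(np.diag(W))`) turns the matrix-scaling map into the vector-scaling one, which is the sense in which vector scaling is the limit of a very strong off-diagonal penalty. Note that a diagonal W applied to log-probabilities is close to, but not in general identical to, vector scaling applied to raw logits, because the per-sample log-normaliser is rescaled differently for each class.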

Paper ID:	6658
Title:	Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration
