Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Mingyuan Zhang, Shivani Agarwal
A fundamental question in multiclass classification concerns understanding the consistency properties of surrogate risk minimization algorithms, which minimize a (often convex) surrogate to the multiclass 0-1 loss. In particular, the framework of calibrated surrogates has played an important role in analyzing the Bayes consistency properties of such algorithms, i.e. in studying convergence to a Bayes optimal classifier (Zhang, 2004; Tewari and Bartlett, 2007). However, follow-up work has suggested this framework can be of limited value when studying H-consistency; in particular, concerns have been raised that even when the data comes from an underlying linear model, minimizing certain convex calibrated surrogates over linear scoring functions fails to recover the true model (Long and Servedio, 2013). In this paper, we investigate this apparent conundrum. We find that while some calibrated surrogates can indeed fail to provide H-consistency when minimized over a natural-looking but naively chosen scoring function class F, the situation can potentially be remedied by minimizing them over a more carefully chosen class of scoring functions F. In particular, for the popular one-vs-all hinge and logistic surrogates, both of which are calibrated (and therefore provide Bayes consistency) under realizable models, but were previously shown to pose problems for realizable H-consistency, we derive a form of scoring function class F that enables H-consistency. When H is the class of linear models, the class F consists of certain piecewise linear scoring functions that are characterized by the same number of parameters as in the linear case, and minimization over which can be performed using an adaptation of the min-pooling idea from neural network training. Our experiments confirm that the one-vs-all surrogates, when trained over this class of nonlinear scoring functions F, yield better linear multiclass classifiers than when trained over standard linear scoring functions.