{"title": "Classification Calibration Dimension for General Multiclass Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 2078, "page_last": 2086, "abstract": "We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of \\emph{classification calibration dimension} of a multiclass loss matrix, which measures the smallest `size' of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al.\\ (2010) for analyzing the difficulty of designing `low-dimensional' convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems.", "full_text": "Classi\ufb01cation Calibration Dimension for\n\nGeneral Multiclass Losses\n\nHarish G. Ramaswamy\n\nShivani Agarwal\n\nDepartment of Computer Science and Automation\nIndian Institute of Science, Bangalore 560012, India\n\n{harish gurup,shivani}@csa.iisc.ernet.in\n\nAbstract\n\nWe study consistency properties of surrogate loss functions for general multiclass\nclassi\ufb01cation problems, de\ufb01ned by a general loss matrix. 
We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest 'size' of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing 'low-dimensional' convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems.

1 Introduction

There has been significant interest and progress in recent years in understanding consistency of learning methods for various finite-output learning problems, such as binary classification, multiclass 0-1 classification, and various forms of ranking and multi-label prediction problems [1-15]. Such finite-output problems can all be viewed as instances of a general multiclass learning problem, whose structure is defined by a loss function, or equivalently, by a loss matrix.
While the studies above have contributed to the understanding of learning problems corresponding to certain forms of loss matrices, a framework for analyzing consistency properties for a general multiclass learning problem, defined by a general loss matrix, has remained elusive.

In this paper, we analyze consistency of surrogate losses for general multiclass learning problems, building on the results of [3, 5-7] and others. We start in Section 2 with some background and examples that will be used as running examples to illustrate concepts throughout the paper, and formalize the notion of classification calibration with respect to a general loss matrix. In Section 3, we derive both necessary and sufficient conditions for classification calibration with respect to general multiclass losses; these are both of independent interest and useful in our later results. Section 4 introduces the notion of classification calibration dimension of a loss matrix, a fundamental quantity that measures the smallest 'size' of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. As one application, in Section 5, we provide a different route from the recent result of Duchi et al. [10] for analyzing the difficulty of designing 'low-dimensional' convex surrogates that are consistent with respect to certain pairwise subset ranking losses. We conclude in Section 6 with some future directions.

2 Preliminaries, Examples, and Background

Setup. We are given training examples $(X_1, Y_1), \ldots, (X_m, Y_m)$ drawn i.i.d. from a distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is an instance space and $\mathcal{Y} = [n] = \{1, \ldots, n\}$ is a finite set of class labels. We are also given a finite set $\mathcal{T} = [k] = \{1, \ldots, k\}$ of target labels in which predictions are to be made, and a loss function $\ell : \mathcal{Y} \times \mathcal{T} \to [0, \infty)$, where $\ell(y, t)$ denotes the loss incurred on predicting $t \in \mathcal{T}$ when the label is $y \in \mathcal{Y}$. In many common learning problems, $\mathcal{T} = \mathcal{Y}$, but in general, these could be different (e.g. when there is an 'abstain' option available to a classifier, in which case $k = n + 1$). We will find it convenient to represent the loss function $\ell$ as a loss matrix $\mathbf{L} \in \mathbb{R}_+^{n \times k}$ (here $\mathbb{R}_+ = [0, \infty)$), and for each $y \in [n]$, $t \in [k]$, will denote by $\ell_{yt}$ the $(y,t)$-th element of $\mathbf{L}$, $\ell_{yt} = (\mathbf{L})_{yt} = \ell(y, t)$, and by $\boldsymbol{\ell}_t$ the $t$-th column of $\mathbf{L}$, $\boldsymbol{\ell}_t = (\ell_{1t}, \ldots, \ell_{nt})^\top \in \mathbb{R}^n$. Some examples follow:

Example 1 (0-1 loss). Here $\mathcal{Y} = \mathcal{T} = [n]$, and the loss incurred is 1 if the predicted label $t$ is different from the actual class label $y$, and 0 otherwise: $\ell_{\text{0-1}}(y, t) = \mathbf{1}(t \neq y)$, where $\mathbf{1}(\cdot)$ is 1 if the argument is true and 0 otherwise. The loss matrix $\mathbf{L}_{\text{0-1}}$ for $n = 3$ is shown in Figure 1(a).

Example 2 (Ordinal regression loss). Here $\mathcal{Y} = \mathcal{T} = [n]$, and predictions $t$ farther away from the actual class label $y$ are penalized more heavily, e.g. using absolute distance: $\ell_{\text{ord}}(y, t) = |t - y|$. The loss matrix $\mathbf{L}_{\text{ord}}$ for $n = 3$ is shown in Figure 1(b).

Example 3 (Hamming loss). Here $\mathcal{Y} = \mathcal{T} = [2^r]$ for some $r \in \mathbb{N}$, and the loss incurred on predicting $t$ when the actual class label is $y$ is the number of bit-positions in which the $r$-bit binary representations of $t - 1$ and $y - 1$ differ: $\ell_{\text{Ham}}(y, t) = \sum_{i=1}^r \mathbf{1}((t-1)_i \neq (y-1)_i)$, where for any $z \in \{0, \ldots, 2^r - 1\}$, $z_i \in \{0, 1\}$ denotes the $i$-th bit in the $r$-bit binary representation of $z$. The loss matrix $\mathbf{L}_{\text{Ham}}$ for $r = 2$ is shown in Figure 1(c). This loss is used in sequence labeling tasks [16].

Example 4 ('Abstain' loss).
Here $\mathcal{Y} = [n]$ and $\mathcal{T} = [n+1]$, where $t = n+1$ denotes 'abstain'. One possible loss function in this setting assigns a loss of 1 to incorrect predictions in $[n]$, 0 to correct predictions, and $\frac{1}{2}$ for abstaining: $\ell_{(?)}(y, t) = \mathbf{1}(t \neq y)\,\mathbf{1}(t \in [n]) + \frac{1}{2}\,\mathbf{1}(t = n+1)$. The loss matrix $\mathbf{L}_{(?)}$ for $n = 3$ is shown in Figure 1(d).

The goal in the above setting is to learn from the training examples a function $h : \mathcal{X} \to [k]$ with low expected loss on a new example drawn from $D$, which we will refer to as the $\ell$-risk of $h$:

$$\mathrm{er}^\ell_D[h] \;\doteq\; \mathbf{E}_{(X,Y) \sim D}\, \ell(Y, h(X)) \;=\; \mathbf{E}_X \sum_{y=1}^n p_y(X)\, \ell(y, h(X)) \;=\; \mathbf{E}_X\, \mathbf{p}(X)^\top \boldsymbol{\ell}_{h(X)}\,, \qquad (1)$$

where $p_y(x) = \mathbf{P}(Y = y \mid X = x)$ under $D$, and $\mathbf{p}(x) = (p_1(x), \ldots, p_n(x))^\top \in \mathbb{R}^n$ denotes the conditional probability vector at $x$. In particular, the goal is to learn a function with $\ell$-risk close to the optimal $\ell$-risk, defined as

$$\mathrm{er}^{\ell,*}_D \;\doteq\; \inf_{h : \mathcal{X} \to [k]} \mathrm{er}^\ell_D[h] \;=\; \inf_{h : \mathcal{X} \to [k]} \mathbf{E}_X\, \mathbf{p}(X)^\top \boldsymbol{\ell}_{h(X)} \;=\; \mathbf{E}_X \min_{t \in [k]} \mathbf{p}(X)^\top \boldsymbol{\ell}_t\,. \qquad (2)$$

Minimizing the discrete $\ell$-risk directly is typically difficult computationally; consequently, one usually employs a surrogate loss function $\psi : \mathcal{Y} \times \hat{\mathcal{T}} \to \mathbb{R}_+$ operating on a surrogate target space $\hat{\mathcal{T}} \subseteq \mathbb{R}^d$ for some appropriate $d \in \mathbb{N}$,¹ and minimizes (approximately, based on the training sample) the $\psi$-risk instead, defined for a (vector) function $f : \mathcal{X} \to \hat{\mathcal{T}}$ as

$$\mathrm{er}^\psi_D[f] \;\doteq\; \mathbf{E}_{(X,Y) \sim D}\, \psi(Y, f(X)) \;=\; \mathbf{E}_X \sum_{y=1}^n p_y(X)\, \psi(y, f(X))\,. \qquad (3)$$

The learned function $f : \mathcal{X} \to \hat{\mathcal{T}}$ is then used to make predictions in $[k]$ via some transformation $\mathrm{pred} : \hat{\mathcal{T}} \to [k]$: the prediction on a new instance $x \in \mathcal{X}$ is given by $\mathrm{pred}(f(x))$, and the $\ell$-risk incurred is $\mathrm{er}^\ell_D[\mathrm{pred} \circ f]$.
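To make Eqs. (1)-(2) concrete, the loss matrices of Examples 1-4 and the optimal prediction $\mathrm{argmin}_{t} \mathbf{p}^\top \boldsymbol{\ell}_t$ at a fixed conditional probability vector $\mathbf{p}$ can be sketched in a few lines of Python. This is our own illustration, not from the paper; the helper names are ours, and exact rational arithmetic is used to avoid rounding issues:

```python
from fractions import Fraction as F

# Loss matrices from Examples 1-4 (rows: true label y, columns: prediction t).
L_01  = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]                        # 0-1 loss, n = 3
L_ord = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]                        # ordinal regression, n = 3
L_ham = [[bin(y ^ t).count("1") for t in range(4)]               # Hamming loss, r = 2:
         for y in range(4)]                                      # popcount of (y-1) XOR (t-1)
L_abs = [[0, 1, 1, F(1, 2)], [1, 0, 1, F(1, 2)],                 # 'abstain' loss, n = 3;
         [1, 1, 0, F(1, 2)]]                                     # column t = 4 abstains

def bayes_prediction(L, p):
    """Return the t (1-based, as in the paper) minimizing p^T ell_t, Eq. (2)."""
    k = len(L[0])
    risks = [sum(p[y] * L[y][t] for y in range(len(L))) for t in range(k)]
    return min(range(k), key=lambda t: risks[t]) + 1
```

For instance, at $\mathbf{p} = (0.3, 0.3, 0.4)$ the abstain loss prefers abstaining (expected losses $0.7, 0.7, 0.6, 0.5$), while the 0-1 loss predicts label 3.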
As an example, several algorithms for multiclass classification with respect to 0-1 loss learn a function of the form $f : \mathcal{X} \to \mathbb{R}^n$ and predict according to $\mathrm{pred}(f(x)) = \mathrm{argmax}_{t \in [n]} f_t(x)$. Below we will find it useful to represent the surrogate loss function $\psi$ via $n$ real-valued functions $\psi_y : \hat{\mathcal{T}} \to \mathbb{R}_+$ defined as $\psi_y(\hat{t}) = \psi(y, \hat{t})$ for $y \in [n]$, or equivalently, as a vector-valued function $\boldsymbol{\psi} : \hat{\mathcal{T}} \to \mathbb{R}^n_+$ defined as $\boldsymbol{\psi}(\hat{t}) = (\psi_1(\hat{t}), \ldots, \psi_n(\hat{t}))^\top$. We will also define the sets

$$\mathcal{R}_\psi \;\doteq\; \{\boldsymbol{\psi}(\hat{t}) : \hat{t} \in \hat{\mathcal{T}}\} \qquad \text{and} \qquad \mathcal{S}_\psi \;\doteq\; \mathrm{conv}(\mathcal{R}_\psi)\,, \qquad (4)$$

where for any $A \subseteq \mathbb{R}^n$, $\mathrm{conv}(A)$ denotes the convex hull of $A$.

¹Equivalently, one can define $\psi : \mathcal{Y} \times \mathbb{R}^d \to \bar{\mathbb{R}}_+$, where $\bar{\mathbb{R}}_+ = \mathbb{R}_+ \cup \{\infty\}$ and $\psi(y, \hat{t}) = \infty \;\forall \hat{t} \notin \hat{\mathcal{T}}$.

Figure 1: Loss matrices corresponding to Examples 1-4: (a) $\mathbf{L}_{\text{0-1}}$ for $n = 3$, rows $(0,1,1)$, $(1,0,1)$, $(1,1,0)$; (b) $\mathbf{L}_{\text{ord}}$ for $n = 3$, rows $(0,1,2)$, $(1,0,1)$, $(2,1,0)$; (c) $\mathbf{L}_{\text{Ham}}$ for $r = 2$ ($n = 4$), rows $(0,1,1,2)$, $(1,0,2,1)$, $(1,2,0,1)$, $(2,1,1,0)$; (d) $\mathbf{L}_{(?)}$ for $n = 3$, rows $(0,1,1,\frac{1}{2})$, $(1,0,1,\frac{1}{2})$, $(1,1,0,\frac{1}{2})$.

Under suitable conditions, algorithms that approximately minimize the $\psi$-risk based on a training sample are known to be consistent with respect to the $\psi$-risk, i.e. to converge (in probability) to the optimal $\psi$-risk, defined as

$$\mathrm{er}^{\psi,*}_D \;\doteq\; \inf_{f : \mathcal{X} \to \hat{\mathcal{T}}} \mathrm{er}^\psi_D[f] \;=\; \inf_{f : \mathcal{X} \to \hat{\mathcal{T}}} \mathbf{E}_X\, \mathbf{p}(X)^\top \boldsymbol{\psi}(f(X)) \;=\; \mathbf{E}_X \inf_{z \in \mathcal{R}_\psi} \mathbf{p}(X)^\top z \;=\; \mathbf{E}_X \inf_{z \in \mathcal{S}_\psi} \mathbf{p}(X)^\top z\,. \qquad (5)$$

This raises the natural question of whether, for a given loss $\ell$, there are surrogate losses $\psi$ for which consistency with respect to the $\psi$-risk also guarantees consistency with respect to the $\ell$-risk, i.e. guarantees convergence (in probability) to the optimal $\ell$-risk (defined in Eq. (2)). This question has been studied in detail for the 0-1 loss, and for losses of the form $\ell(y, t) = a_y \mathbf{1}(t \neq y)$, which can be analyzed similarly to the 0-1 loss [6, 7]. In this paper, we consider this question for general multiclass losses $\ell : [n] \times [k] \to \mathbb{R}_+$, including rectangular losses with $k \neq n$. The only assumption we make on $\ell$ is that for each $t \in [k]$, $\exists p \in \Delta_n$ such that $\mathrm{argmin}_{t' \in [k]} p^\top \boldsymbol{\ell}_{t'} = \{t\}$ (otherwise the label $t$ never needs to be predicted and can simply be ignored).²

Definitions and Results. We will need the following definitions and basic results, generalizing those of [5-7]. The notion of classification calibration will be central to our study; as Theorem 3 below shows, classification calibration of a surrogate loss $\psi$ w.r.t. $\ell$ corresponds to the property that consistency w.r.t. the $\psi$-risk implies consistency w.r.t. the $\ell$-risk. Proofs of these results are straightforward generalizations of those in [6, 7] and are omitted.

Definition 1 (Classification calibration).
A surrogate loss function $\psi : [n] \times \hat{\mathcal{T}} \to \mathbb{R}_+$ is said to be classification calibrated with respect to a loss function $\ell : [n] \times [k] \to \mathbb{R}_+$ over $\mathcal{P} \subseteq \Delta_n$ if there exists a function $\mathrm{pred} : \hat{\mathcal{T}} \to [k]$ such that

$$\forall p \in \mathcal{P} : \quad \inf_{\hat{t} \in \hat{\mathcal{T}} :\, \mathrm{pred}(\hat{t}) \notin \mathrm{argmin}_t p^\top \boldsymbol{\ell}_t} p^\top \boldsymbol{\psi}(\hat{t}) \;>\; \inf_{\hat{t} \in \hat{\mathcal{T}}} p^\top \boldsymbol{\psi}(\hat{t})\,.$$

Lemma 2. Let $\ell : [n] \times [k] \to \mathbb{R}_+$ and $\psi : [n] \times \hat{\mathcal{T}} \to \mathbb{R}_+$. Then $\psi$ is classification calibrated with respect to $\ell$ over $\mathcal{P} \subseteq \Delta_n$ iff there exists a function $\mathrm{pred}' : \mathcal{S}_\psi \to [k]$ such that

$$\forall p \in \mathcal{P} : \quad \inf_{z \in \mathcal{S}_\psi :\, \mathrm{pred}'(z) \notin \mathrm{argmin}_t p^\top \boldsymbol{\ell}_t} p^\top z \;>\; \inf_{z \in \mathcal{S}_\psi} p^\top z\,.$$

Theorem 3. Let $\ell : [n] \times [k] \to \mathbb{R}_+$ and $\psi : [n] \times \hat{\mathcal{T}} \to \mathbb{R}_+$. Then $\psi$ is classification calibrated with respect to $\ell$ over $\Delta_n$ iff $\exists$ a function $\mathrm{pred} : \hat{\mathcal{T}} \to [k]$ such that for all distributions $D$ on $\mathcal{X} \times [n]$ and all sequences of random (vector) functions $f_m : \mathcal{X} \to \hat{\mathcal{T}}$ (depending on $(X_1, Y_1), \ldots, (X_m, Y_m)$),³

$$\mathrm{er}^\psi_D[f_m] \xrightarrow{P} \mathrm{er}^{\psi,*}_D \quad \text{implies} \quad \mathrm{er}^\ell_D[\mathrm{pred} \circ f_m] \xrightarrow{P} \mathrm{er}^{\ell,*}_D\,.$$

Definition 4 (Positive normals). Let $\psi : [n] \times \hat{\mathcal{T}} \to \mathbb{R}_+$. For each point $z \in \mathcal{S}_\psi$, the set of positive normals at $z$ is defined as⁴

$$\mathcal{N}_{\mathcal{S}_\psi}(z) \;\doteq\; \{p \in \Delta_n : p^\top (z - z') \leq 0 \;\forall z' \in \mathcal{S}_\psi\}\,.$$

Definition 5 (Trigger probabilities). Let $\ell : [n] \times [k] \to \mathbb{R}_+$. For each $t \in [k]$, the set of trigger probabilities of $t$ with respect to $\ell$ is defined as

$$\mathcal{Q}^\ell_t \;\doteq\; \{p \in \Delta_n : p^\top (\boldsymbol{\ell}_t - \boldsymbol{\ell}_{t'}) \leq 0 \;\forall t' \in [k]\} \;=\; \{p \in \Delta_n : t \in \mathrm{argmin}_{t' \in [k]} p^\top \boldsymbol{\ell}_{t'}\}\,.$$

Examples of trigger probability sets for various losses are shown in Figure 2.

²Here $\Delta_n$ denotes the probability simplex in $\mathbb{R}^n$, $\Delta_n = \{p \in \mathbb{R}^n : p_i \geq 0 \;\forall i \in [n], \sum_{i=1}^n p_i = 1\}$.
³Here $\xrightarrow{P}$ denotes convergence in probability.
⁴The set of positive normals is non-empty only at points $z$ in the boundary of $\mathcal{S}_\psi$.

Figure 2: Trigger probability sets for (a) 0-1 loss $\ell_{\text{0-1}}$; (b) ordinal regression loss $\ell_{\text{ord}}$; and (c) 'abstain' loss $\ell_{(?)}$; all for $n = 3$, for which the probability simplex can be visualized easily.

(a) $\mathcal{Q}^{\text{0-1}}_1 = \{p \in \Delta_3 : p_1 \geq \max(p_2, p_3)\}$; $\mathcal{Q}^{\text{0-1}}_2 = \{p \in \Delta_3 : p_2 \geq \max(p_1, p_3)\}$; $\mathcal{Q}^{\text{0-1}}_3 = \{p \in \Delta_3 : p_3 \geq \max(p_1, p_2)\}$.
(b) $\mathcal{Q}^{\text{ord}}_1 = \{p \in \Delta_3 : p_1 \geq \frac{1}{2}\}$; $\mathcal{Q}^{\text{ord}}_2 = \{p \in \Delta_3 : p_1 \leq \frac{1}{2},\, p_3 \leq \frac{1}{2}\}$; $\mathcal{Q}^{\text{ord}}_3 = \{p \in \Delta_3 : p_3 \geq \frac{1}{2}\}$.
(c) $\mathcal{Q}^{(?)}_1 = \{p \in \Delta_3 : p_1 \geq \frac{1}{2}\}$; $\mathcal{Q}^{(?)}_2 = \{p \in \Delta_3 : p_2 \geq \frac{1}{2}\}$; $\mathcal{Q}^{(?)}_3 = \{p \in \Delta_3 : p_3 \geq \frac{1}{2}\}$; $\mathcal{Q}^{(?)}_4 = \{p \in \Delta_3 : \max(p_1, p_2, p_3) \leq \frac{1}{2}\}$.

Calculations of these sets can be found in the appendix.
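The trigger probability sets in Figure 2 are easy to check numerically: by Definition 5, $p \in \mathcal{Q}^\ell_t$ iff $t$ attains $\min_{t'} p^\top \boldsymbol{\ell}_{t'}$. The following sketch (our own illustration, not from the paper) verifies the stated characterization of the sets $\mathcal{Q}^{\text{ord}}_t$ on a grid of rational points in $\Delta_3$:

```python
from fractions import Fraction as F
from itertools import product

L_ord = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]  # ordinal regression loss matrix, n = 3

def in_trigger_set(L, p, t):
    """p is in Q_t iff t (1-based) minimizes p^T ell_t' (Definition 5)."""
    risks = [sum(p[y] * L[y][s] for y in range(len(L))) for s in range(len(L[0]))]
    return risks[t - 1] == min(risks)

# Grid over the simplex Delta_3 with denominator 20 (exact rationals, so ties
# on the boundary of the sets are handled correctly).
grid = [(F(a, 20), F(b, 20), F(20 - a - b, 20))
        for a, b in product(range(21), repeat=2) if a + b <= 20]

for p in grid:
    # Figure 2(b): Q1 = {p1 >= 1/2}, Q2 = {p1 <= 1/2, p3 <= 1/2}, Q3 = {p3 >= 1/2}
    assert in_trigger_set(L_ord, p, 1) == (p[0] >= F(1, 2))
    assert in_trigger_set(L_ord, p, 2) == (p[0] <= F(1, 2) and p[2] <= F(1, 2))
    assert in_trigger_set(L_ord, p, 3) == (p[2] >= F(1, 2))
```

The same membership test, with the appropriate loss matrix, reproduces the sets in Figures 2(a) and 2(c) as well.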
We note that such sets have also been studied in [17, 18].

3 Necessary and Sufficient Conditions for Classification Calibration

We start by giving a necessary condition for classification calibration of a surrogate loss $\psi$ with respect to any multiclass loss $\ell$ over $\Delta_n$, which requires the positive normals of all points $z \in \mathcal{S}_\psi$ to be 'well-behaved' w.r.t. $\ell$ and generalizes the 'admissibility' condition used for 0-1 loss in [7]. All proofs not included in the main text can be found in the appendix.

Theorem 6. Let $\psi : [n] \times \hat{\mathcal{T}} \to \mathbb{R}_+$ be classification calibrated with respect to $\ell : [n] \times [k] \to \mathbb{R}_+$ over $\Delta_n$. Then for all $z \in \mathcal{S}_\psi$, there exists some $t \in [k]$ such that $\mathcal{N}_{\mathcal{S}_\psi}(z) \subseteq \mathcal{Q}^\ell_t$.

We note that, as in [7], it is possible to give a necessary and sufficient condition for classification calibration in terms of a similar property holding for positive normals associated with projections of $\mathcal{S}_\psi$ in lower dimensions. Instead, below we give a different sufficient condition that will be helpful in showing classification calibration of certain surrogates. In particular, we show that for a surrogate loss $\psi$ to be classification calibrated with respect to $\ell$ over $\Delta_n$, it is sufficient for the above property of positive normals to hold only at a finite number of points in $\mathcal{R}_\psi$, as long as their positive normal sets jointly cover $\Delta_n$:

Theorem 7. Let $\ell : [n] \times [k] \to \mathbb{R}_+$ and $\psi : [n] \times \hat{\mathcal{T}} \to \mathbb{R}_+$. Suppose there exist $r \in \mathbb{N}$ and $z_1, \ldots, z_r \in \mathcal{R}_\psi$ such that $\bigcup_{j=1}^r \mathcal{N}_{\mathcal{S}_\psi}(z_j) = \Delta_n$ and for each $j \in [r]$, $\exists t \in [k]$ such that $\mathcal{N}_{\mathcal{S}_\psi}(z_j) \subseteq \mathcal{Q}^\ell_t$. Then $\psi$ is classification calibrated with respect to $\ell$ over $\Delta_n$.

Computation of $\mathcal{N}_{\mathcal{S}_\psi}(z)$. The conditions in the above results both involve the sets of positive normals $\mathcal{N}_{\mathcal{S}_\psi}(z)$ at various points $z \in \mathcal{S}_\psi$. Thus in order to use the above results to show that a surrogate $\psi$ is (or is not) classification calibrated with respect to a loss $\ell$, one needs to be able to compute or characterize the sets $\mathcal{N}_{\mathcal{S}_\psi}(z)$. Here we give a method for computing these sets for certain surrogate losses $\psi$ and points $z \in \mathcal{S}_\psi$.

Lemma 8. Let $\hat{\mathcal{T}} \subseteq \mathbb{R}^d$ be a convex set and let $\boldsymbol{\psi} : \hat{\mathcal{T}} \to \mathbb{R}^n_+$ be convex.⁵ Let $z = \boldsymbol{\psi}(\hat{t})$ for some $\hat{t} \in \hat{\mathcal{T}}$ such that for each $y \in [n]$, the subdifferential of $\psi_y$ at $\hat{t}$ can be written as $\partial \psi_y(\hat{t}) = \mathrm{conv}(\{w^y_1, \ldots, w^y_{s_y}\})$ for some $s_y \in \mathbb{N}$ and $w^y_1, \ldots, w^y_{s_y} \in \mathbb{R}^d$.⁶ Let $s = \sum_{y=1}^n s_y$, and let

$$\mathbf{A} = \big[w^1_1 \ldots w^1_{s_1}\; w^2_1 \ldots w^2_{s_2} \;\ldots\ldots\; w^n_1 \ldots w^n_{s_n}\big] \in \mathbb{R}^{d \times s}\,; \qquad \mathbf{B} = [b_{yj}] \in \mathbb{R}^{n \times s}\,,$$

where $b_{yj}$ is 1 if the $j$-th column of $\mathbf{A}$ came from $\{w^y_1, \ldots, w^y_{s_y}\}$ and 0 otherwise. Then

$$\mathcal{N}_{\mathcal{S}_\psi}(z) = \{p \in \Delta_n : p = \mathbf{B}q \text{ for some } q \in \mathrm{Null}(\mathbf{A}) \cap \Delta_s\}\,,$$

where $\mathrm{Null}(\mathbf{A}) \subseteq \mathbb{R}^s$ denotes the null space of the matrix $\mathbf{A}$.

⁵A vector function is convex if all its component functions are convex.
⁶Recall that the subdifferential of a convex function $\phi : \mathbb{R}^d \to \bar{\mathbb{R}}_+$ at a point $u_0 \in \mathbb{R}^d$ is defined as $\partial \phi(u_0) = \{w \in \mathbb{R}^d : \phi(u) - \phi(u_0) \geq w^\top(u - u_0) \;\forall u \in \mathbb{R}^d\}$ and is a convex set in $\mathbb{R}^d$ (e.g. see [19]).

We give an example illustrating the use of Theorem 7 and Lemma 8 to show classification calibration of a certain surrogate loss with respect to the ordinal regression loss $\ell_{\text{ord}}$ defined in Example 2:

Example 5 (Classification calibrated surrogate for ordinal regression loss).
Consider the ordinal regression loss $\ell_{\text{ord}}$ defined in Example 2 for $n = 3$. Let $\hat{\mathcal{T}} = \mathbb{R}$, and let $\psi : \{1, 2, 3\} \times \mathbb{R} \to \mathbb{R}_+$ be defined as (see Figure 3)

$$\psi(y, \hat{t}) = |\hat{t} - y| \qquad \forall y \in \{1, 2, 3\},\; \hat{t} \in \mathbb{R}\,. \qquad (6)$$

Thus $\mathcal{R}_\psi = \{\boldsymbol{\psi}(\hat{t}) = (|\hat{t} - 1|, |\hat{t} - 2|, |\hat{t} - 3|)^\top : \hat{t} \in \mathbb{R}\}$. We will show there are 3 points in $\mathcal{R}_\psi$ satisfying the conditions of Theorem 7. Specifically, consider $\hat{t}_1 = 1$, $\hat{t}_2 = 2$, and $\hat{t}_3 = 3$, giving $z_1 = \boldsymbol{\psi}(\hat{t}_1) = (0, 1, 2)^\top$, $z_2 = \boldsymbol{\psi}(\hat{t}_2) = (1, 0, 1)^\top$, and $z_3 = \boldsymbol{\psi}(\hat{t}_3) = (2, 1, 0)^\top$ in $\mathcal{R}_\psi$. Observe that $\hat{\mathcal{T}}$ here is a convex set and $\boldsymbol{\psi} : \hat{\mathcal{T}} \to \mathbb{R}^3$ is a convex function. Moreover, for $\hat{t}_1 = 1$, we have

$$\partial \psi_1(1) = [-1, 1] = \mathrm{conv}(\{+1, -1\})\,; \quad \partial \psi_2(1) = \{-1\} = \mathrm{conv}(\{-1\})\,; \quad \partial \psi_3(1) = \{-1\} = \mathrm{conv}(\{-1\})\,.$$

Therefore, we can use Lemma 8 to compute $\mathcal{N}_{\mathcal{S}_\psi}(z_1)$. Here $s = 4$, and

$$\mathbf{A} = [\,+1\;\; -1\;\; -1\;\; -1\,]\,; \qquad \mathbf{B} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

This gives

$$\mathcal{N}_{\mathcal{S}_\psi}(z_1) = \{p \in \Delta_3 : p = (q_1 + q_2, q_3, q_4) \text{ for some } q \in \Delta_4,\; q_1 - q_2 - q_3 - q_4 = 0\} = \{p \in \Delta_3 : p = (q_1 + q_2, q_3, q_4) \text{ for some } q \in \Delta_4,\; q_1 = \tfrac{1}{2}\} = \{p \in \Delta_3 : p_1 \geq \tfrac{1}{2}\} = \mathcal{Q}^{\text{ord}}_1\,.$$

A similar procedure yields $\mathcal{N}_{\mathcal{S}_\psi}(z_2) = \mathcal{Q}^{\text{ord}}_2$ and $\mathcal{N}_{\mathcal{S}_\psi}(z_3) = \mathcal{Q}^{\text{ord}}_3$. Thus, by Theorem 7, we get that $\psi$ is classification calibrated with respect to $\ell_{\text{ord}}$ over $\Delta_3$.

We note that in general, computational procedures such as Fourier-Motzkin elimination [20] can be helpful in computing $\mathcal{N}_{\mathcal{S}_\psi}(z)$ via Lemma 8.

Figure 3: The surrogate $\psi$ defined in Eq. (6).

4 Classification Calibration Dimension

We now turn to the study of a fundamental quantity associated with the property of classification calibration with respect to a general multiclass loss $\ell$. Specifically, in the above example, we saw that to develop a classification calibrated surrogate loss w.r.t. the ordinal regression loss for $n = 3$, it was sufficient to consider a surrogate target space $\hat{\mathcal{T}} = \mathbb{R}$, with dimension $d = 1$; in addition, this yielded a convex surrogate $\psi : \mathbb{R} \to \mathbb{R}^3_+$ which can be used in developing computationally efficient algorithms. In fact the same surrogate target space with $d = 1$ can be used to develop a similar convex, classification calibrated surrogate loss w.r.t. the ordinal regression loss for any $n \in \mathbb{N}$. However not all losses $\ell$ have such 'low-dimensional' surrogates. This raises the natural question of what is the smallest dimension $d$ that supports a convex classification calibrated surrogate for a given multiclass loss $\ell$, and leads us to the following definition:

Definition 9 (Classification calibration dimension). Let $\ell : [n] \times [k] \to \mathbb{R}_+$. Define the classification calibration dimension (CC dimension) of $\ell$ as

$$\mathrm{CCdim}(\ell) \;\doteq\; \min\big\{d \in \mathbb{N} : \exists \text{ a convex set } \hat{\mathcal{T}} \subseteq \mathbb{R}^d \text{ and a convex surrogate } \boldsymbol{\psi} : \hat{\mathcal{T}} \to \mathbb{R}^n_+ \text{ that is classification calibrated w.r.t. } \ell \text{ over } \Delta_n\big\}\,,$$

if the above set is non-empty, and $\mathrm{CCdim}(\ell) = \infty$ otherwise.

From the above discussion, $\mathrm{CCdim}(\ell_{\text{ord}}) = 1$ for all $n$.
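The one-dimensional surrogate of Eq. (6), which underlies $\mathrm{CCdim}(\ell_{\text{ord}}) = 1$, can also be sanity-checked numerically: minimizing $p^\top \boldsymbol{\psi}(\hat{t})$ over $\hat{t}$ picks out a weighted median of $\{1, 2, 3\}$ under $p$, and rounding $\hat{t}$ to the nearest label gives an $\ell_{\text{ord}}$-optimal prediction. The sketch below is our own check (grid search over $\hat{t}$, exact rationals), not a procedure from the paper:

```python
from fractions import Fraction as F

L_ord = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

def surrogate_risk(p, t_hat):
    # p^T psi(t_hat), with psi(y, t_hat) = |t_hat - y| as in Eq. (6)
    return sum(p[y] * abs(t_hat - (y + 1)) for y in range(3))

def ell_risk(p, t):
    return sum(p[y] * L_ord[y][t - 1] for y in range(3))

def check(p):
    # Minimize the surrogate risk over a fine grid of t_hat in [1, 3] ...
    grid = [1 + F(i, 100) for i in range(201)]
    t_hat = min(grid, key=lambda u: surrogate_risk(p, u))
    pred = min((1, 2, 3), key=lambda t: abs(t_hat - t))  # round to nearest label
    # ... and check that the resulting prediction is ell_ord-optimal for p.
    assert ell_risk(p, pred) == min(ell_risk(p, t) for t in (1, 2, 3))

for p in [(F(6, 10), F(2, 10), F(2, 10)), (F(2, 10), F(2, 10), F(6, 10)),
          (F(3, 10), F(4, 10), F(3, 10)), (F(45, 100), F(10, 100), F(45, 100))]:
    check(p)
```

This is only a spot check on a few distributions, of course; the calibration guarantee itself comes from Theorem 7 as shown above.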
In the following, we will be interested in developing an understanding of the CC dimension for general losses $\ell$, and in particular in deriving upper and lower bounds on this.

4.1 Upper Bounds on the Classification Calibration Dimension

We start with a simple result that establishes that the CC dimension of any multiclass loss $\ell$ is finite, and in fact is strictly smaller than the number of class labels $n$.

Lemma 10. Let $\ell : [n] \times [k] \to \mathbb{R}_+$. Let $\hat{\mathcal{T}} = \{\hat{t} \in \mathbb{R}^{n-1}_+ : \sum_{j=1}^{n-1} \hat{t}_j \leq 1\}$, and for each $y \in [n]$, let $\psi_y : \hat{\mathcal{T}} \to \mathbb{R}_+$ be given by

$$\psi_y(\hat{t}) = \mathbf{1}(y \neq n)\,(\hat{t}_y - 1)^2 + \sum_{j \in [n-1],\, j \neq y} \hat{t}_j^2\,.$$

Then $\psi$ is classification calibrated with respect to $\ell$ over $\Delta_n$. In particular, since $\psi$ is convex, $\mathrm{CCdim}(\ell) \leq n - 1$.

It may appear surprising that the convex surrogate $\psi$ in the above lemma is classification calibrated with respect to all multiclass losses $\ell$ on $n$ classes. However this makes intuitive sense, since in principle, for any multiclass problem, if one can estimate the conditional probabilities of the $n$ classes accurately (which requires estimating $n-1$ real-valued functions on $\mathcal{X}$), then one can predict a target label that minimizes the expected loss according to these probabilities. Minimizing the above surrogate effectively corresponds to such class probability estimation. Indeed, the above lemma can be shown to hold for any surrogate that is a strictly proper composite multiclass loss [21].

In practice, when the number of class labels $n$ is large (such as in a sequence labeling task, where $n$ is exponential in the length of the input sequence), the above result is not very helpful; in such cases, it is of interest to develop algorithms operating on a surrogate target space in a lower-dimensional space.
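The probability-estimation view of Lemma 10 is easy to verify directly: for the quadratic surrogate above, the expected surrogate $\sum_y p_y \psi_y(\hat{t})$ has partial derivative $2(\hat{t}_j - p_j)$ in each coordinate, so it is minimized at $\hat{t} = (p_1, \ldots, p_{n-1})$, i.e. at the conditional class probabilities. A small numerical sketch of this (our own check, not from the paper; here $y$ is 0-based, with index $n-1$ playing the role of class $n$):

```python
import random

random.seed(0)

def psi(y, t_hat):
    # Lemma 10 surrogate: psi_y(t_hat) = 1(y != n)(t_hat_y - 1)^2 + sum_{j != y} t_hat_j^2
    n_minus_1 = len(t_hat)
    val = sum(t_hat[j] ** 2 for j in range(n_minus_1) if j != y)
    if y < n_minus_1:                 # classes 1..n-1; y == n_minus_1 is class n
        val += (t_hat[y] - 1) ** 2
    return val

def expected_psi(p, t_hat):
    return sum(p[y] * psi(y, t_hat) for y in range(len(p)))

n = 4
for _ in range(20):
    raw = [random.random() for _ in range(n)]
    p = [x / sum(raw) for x in raw]
    t_star = p[:n - 1]                # the claimed minimizer (p_1, ..., p_{n-1})
    base = expected_psi(p, t_star)
    for _ in range(50):
        # Random perturbations never do better (the quadratic objective is
        # minimized at t_star even over all of R^{n-1}).
        pert = [t_star[j] + random.uniform(-0.1, 0.1) for j in range(n - 1)]
        assert expected_psi(p, pert) >= base
```

Once $\hat{t}$ recovers $p$, predicting $\mathrm{argmin}_t p^\top \boldsymbol{\ell}_t$ is optimal for any loss matrix, which is exactly why one surrogate serves all losses here.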
Next we give a different upper bound on the CC dimension that depends on the loss $\ell$, and for certain losses, can be significantly tighter than the general bound above.

Theorem 11. Let $\ell : [n] \times [k] \to \mathbb{R}_+$. Then $\mathrm{CCdim}(\ell) \leq \mathrm{rank}(\mathbf{L})$, the rank of the loss matrix $\mathbf{L}$.

Proof. Let $\mathrm{rank}(\mathbf{L}) = d$. We will construct a convex classification calibrated surrogate loss $\psi$ for $\ell$ with surrogate target space $\hat{\mathcal{T}} \subseteq \mathbb{R}^d$. Let $\boldsymbol{\ell}_{t_1}, \ldots, \boldsymbol{\ell}_{t_d}$ be linearly independent columns of $\mathbf{L}$. Let $\{e_1, \ldots, e_d\}$ denote the standard basis in $\mathbb{R}^d$. We can define a linear function $\tilde{\boldsymbol{\psi}} : \mathbb{R}^d \to \mathbb{R}^n$ by

$$\tilde{\boldsymbol{\psi}}(e_j) = \boldsymbol{\ell}_{t_j} \quad \forall j \in [d]\,.$$

Then for each $z$ in the column space of $\mathbf{L}$, there exists a unique vector $u \in \mathbb{R}^d$ such that $\tilde{\boldsymbol{\psi}}(u) = z$. In particular, there exist unique vectors $u_1, \ldots, u_k \in \mathbb{R}^d$ such that for each $t \in [k]$, $\tilde{\boldsymbol{\psi}}(u_t) = \boldsymbol{\ell}_t$. Let $\hat{\mathcal{T}} = \mathrm{conv}(\{u_1, \ldots, u_k\})$, and define $\boldsymbol{\psi} : \hat{\mathcal{T}} \to \mathbb{R}^n_+$ as

$$\boldsymbol{\psi}(\hat{t}) = \tilde{\boldsymbol{\psi}}(\hat{t})\,;$$

we note that the resulting vectors are always in $\mathbb{R}^n_+$, since by definition, for any $\hat{t} = \sum_{t=1}^k \alpha_t u_t$ for $\alpha \in \Delta_k$, $\boldsymbol{\psi}(\hat{t}) = \sum_{t=1}^k \alpha_t \boldsymbol{\ell}_t$, and $\boldsymbol{\ell}_t \in \mathbb{R}^n_+ \;\forall t \in [k]$. The function $\boldsymbol{\psi}$ is clearly convex. To show $\psi$ is classification calibrated w.r.t. $\ell$ over $\Delta_n$, we will use Theorem 7. Specifically, consider the $k$ points $z_t = \boldsymbol{\psi}(u_t) = \boldsymbol{\ell}_t \in \mathcal{R}_\psi$ for $t \in [k]$. By definition of $\boldsymbol{\psi}$, we have $\mathcal{S}_\psi = \mathrm{conv}(\{\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_k\})$; from the definitions of positive normals and trigger probabilities, it then follows that $\mathcal{N}_{\mathcal{S}_\psi}(z_t) = \mathcal{N}_{\mathcal{S}_\psi}(\boldsymbol{\ell}_t) = \mathcal{Q}^\ell_t$ for all $t \in [k]$. Thus by Theorem 7, $\psi$ is classification calibrated w.r.t. $\ell$ over $\Delta_n$.

Example 6 (CC dimension of Hamming loss). Consider the Hamming loss $\ell_{\text{Ham}}$ defined in Example 3, for $n = 2^r$. For each $i \in [r]$, define $\sigma_i \in \mathbb{R}^n$ as

$$\sigma_{iy} = \begin{cases} +1 & \text{if } (y-1)_i \text{, the } i\text{-th bit in the } r\text{-bit binary representation of } (y-1) \text{, is } 1 \\ -1 & \text{otherwise.} \end{cases}$$

Then the loss matrix $\mathbf{L}_{\text{Ham}}$ satisfies

$$\mathbf{L}_{\text{Ham}} = \frac{r}{2}\, e e^\top - \frac{1}{2} \sum_{i=1}^r \sigma_i \sigma_i^\top\,,$$

where $e$ is the $n \times 1$ all ones vector. Thus $\mathrm{rank}(\mathbf{L}_{\text{Ham}}) \leq r + 1$, giving us $\mathrm{CCdim}(\ell_{\text{Ham}}) \leq r + 1$. For $r \geq 3$, this is a significantly tighter upper bound than the bound of $2^r - 1$ given by Lemma 10.

We note that the upper bound of Theorem 11 need not always be tight: for example, for the ordinal regression loss, for which we already know $\mathrm{CCdim}(\ell_{\text{ord}}) = 1$, the theorem actually gives an upper bound of $n$, which is even weaker than that implied by Lemma 10.

4.2 Lower Bound on the Classification Calibration Dimension

In this section we give a lower bound on the CC dimension of a loss function $\ell$ and illustrate it by using it to calculate the CC dimension of the 0-1 loss. In Section 5 we will explore consequences of the lower bound for classification calibrated surrogates for certain types of ranking losses. We will need the following definition:

Definition 12. The feasible subspace dimension of a convex set $\mathcal{C}$ at $p \in \mathcal{C}$, denoted by $\mu_\mathcal{C}(p)$, is defined as the dimension of the subspace $\mathcal{F}_\mathcal{C}(p) \cap (-\mathcal{F}_\mathcal{C}(p))$, where $\mathcal{F}_\mathcal{C}(p)$ is the cone of feasible directions of $\mathcal{C}$ at $p$.⁷

The following gives a lower bound on the CC dimension of a loss $\ell$ in terms of the feasible subspace dimension of the trigger probability sets $\mathcal{Q}^\ell_t$ at certain points $p \in \mathcal{Q}^\ell_t$:

Theorem 13. Let $\ell : [n] \times [k] \to \mathbb{R}_+$. Then for all $p \in \mathrm{relint}(\Delta_n)$ and $t \in \mathrm{argmin}_{t'} p^\top \boldsymbol{\ell}_{t'}$ (i.e.
such\nthat p \u2208 Q\ufffd\n\nt at certain points p \u2208 Q\ufffd\nt:\n\nt): 8\n\nCCdim(\ufffd) \u2265 n \u2212 \u00b5Q\ufffd\n\nt\n\n(p) \u2212 1 .\n\nThe proof requires extensions of the de\ufb01nition of positive normals and the necessary condition of\nTheorem 6 to sequences of points in S\u03c8 and is quite technical. In the appendix, we provide a proof\nin the special case when p \u2208 relint(\u0394n) is such that inf z\u2208S\u03c8 p\ufffdz is achieved in S\u03c8, which does not\nrequire these extensions. Full proof details will be provided in a longer version of the paper. Both\nthe proof of the lower bound and its applications make use of the following lemma, which gives a\nmethod to calculate the feasible subspace dimension for certain convex sets C and points p \u2208 C:\nLemma 14. Let C = \ufffdu \u2208 Rn : A1u \u2264 b1, A2u \u2264 b2, A3u = b3\ufffd. Let p \u2208 C be such that\nA3\ufffd\ufffd, the dimension of the null space of\ufffdA1\nA3\ufffd.\nA1p = b1, A2p < b2. Then \u00b5C(p) = nullity\ufffd\ufffdA1\n\nThe above lower bound allows us to calculate precisely the CC dimension of the 0-1 loss:\n\nExample 7 (CC dimension of 0-1 loss). Consider the 0-1 loss \ufffd0-1 de\ufb01ned in Example 1. Take\nfor all t \u2208 [k] = [n] (see Figure 2); in particular,\np = ( 1\nwe have p \u2208 Q0-1\n\nn , . . . , 1\n\nn )\ufffd \u2208 relint(\u0394n). Then p \u2208 Q0-1\nt\n1 . Now Q0-1\n1 can be written as\nQ0-1\n\n1\n\n= \ufffdq \u2208 \u0394n : q1 \u2265 qy \u2200y \u2208 {2, . . . , n}\ufffd\n= \ufffdq \u2208 Rn :\ufffd\u2212en\u22121 In\u22121\ufffdq \u2264 0,\u2212q \u2264 0, e\ufffdn q = 1} ,\n\nwhere en\u22121, en denote the (n \u2212 1) \u00d7 1 and n \u00d7 1 all ones vectors, respectively, and In\u22121 denotes\nthe (n\u2212 1)\u00d7 (n\u2212 1) identity matrix. Moreover, we have\ufffd\u2212en\u22121 In\u22121\ufffdp = 0, \u2212p < 0. Therefore,\nby Lemma 14, we have\n\nThus by Theorem 13, we get CCdim(\ufffd0-1) \u2265 n \u2212 1. 
Combined with the upper bound of Lemma 10,\nthis gives CCdim(\ufffd0-1) = n \u2212 1.\n\nFA(p) = {v \u2208 Rn : \u2203\ufffd0 > 0 such that p + \ufffdv \u2208 C \u2200\ufffd \u2208 (0, \ufffd0)}.\n\n7For a set C \u2286 Rn and point p \u2208 C, the cone of feasible directions of C at p is de\ufb01ned as\n8Here relint(\u0394n) denotes the relative interior of \u0394n: relint(\u0394n) = {p \u2208 \u0394n : py > 0 \u2200y \u2208 [n]}.\n\n7\n\n(p) = nullity\ufffd\ufffd\u2212en\u22121 In\u22121\n\ne\ufffdn\n\n\u00b5\n\nQ0-1\n\n1\n\n\ufffd\ufffd = nullity\uf8eb\uf8ec\uf8ec\uf8ec\uf8ec\uf8ed\n\n\uf8ee\uf8ef\uf8ef\uf8ef\uf8ef\uf8f0\n\n\u22121 1 0 . . . 0\n\u22121 0 1 . . . 0\n\n...\n\n\u22121 0 0 . . . 1\n1 1 1 . . . 1\n\n\uf8f9\uf8fa\uf8fa\uf8fa\uf8fa\uf8fb\n\n\uf8f6\uf8f7\uf8f7\uf8f7\uf8f7\uf8f8\n\n= 0 .\n\n\f5 Application to Pairwise Subset Ranking\n\nWe consider an application of the above framework to analyzing certain types of subset ranking\nproblems, where each instance x \u2208 X consists of a query together with a set of r documents (for\nsimplicity, r \u2208 N here is \ufb01xed), and the goal is to learn a predictor which given such an instance will\nreturn a ranking (permutation) of the r documents [8]. Duchi et al. [10] showed recently that for\ncertain pairwise subset ranking losses \ufffd, \ufb01nding a predictor that minimizes the \ufffd-risk is an NP-hard\nproblem. They also showed that several common pairwise convex surrogate losses that operate on\n\nrespect to such losses \ufffd, even under some low-noise conditions on the distribution, and proposed\n\n\ufffdT = Rr (and are used to learn scores for the r documents) fail to be classi\ufb01cation calibrated with\nan alternative convex surrogate, also operating on \ufffdT = Rr, that is classi\ufb01cation calibrated under\n\ncertain conditions on the distribution (i.e. 
over a strict subset of the associated probability simplex).
Here we provide an alternative route to analyzing the difficulty of obtaining consistent surrogates
for such pairwise subset ranking problems using the classification calibration dimension.
Specifically, we will show that even for a simple setting of such problems, the classification
calibration dimension of the underlying loss ℓ is greater than r, and therefore no convex surrogate
operating on T̂ ⊆ R^r can be classification calibrated w.r.t. such a loss over the full probability
simplex.

Formally, we will identify the set of class labels Y with a set G of 'preference graphs', which are
simply directed acyclic graphs (DAGs) over r vertices; for each directed edge (i, j) in a preference
graph g ∈ G associated with an instance x ∈ X, the i-th document in the document set in x is
preferred over the j-th document. Here we will consider a simple setting where each preference
graph has exactly one edge, so that |Y| = |G| = r(r − 1); in this setting, we can associate each
g ∈ G with the edge (i, j) it contains, which we will write as g_{(i,j)}. The target labels consist of
permutations over r objects, so that T = S_r with |T| = r!. Consider now the following simple
pairwise loss ℓ_pair : Y × T → R₊:

    ℓ_pair(g_{(i,j)}, σ) = 1(σ(i) > σ(j)) .                                      (7)

Let p = (1/(r(r−1)), ..., 1/(r(r−1)))ᵀ ∈ relint(Δ_{r(r−1)}), and observe that pᵀℓ_pair^σ = 1/2 for
all σ ∈ T. Thus pᵀ(ℓ_pair^σ − ℓ_pair^{σ′}) = 0 ∀σ, σ′ ∈ T, and so p ∈ Q_σ^pair ∀σ ∈ T.

Let (σ_1, ..., σ_{r!}) be any fixed ordering of the permutations in T, and consider Q_{σ_1}^pair,
defined by the intersection of the r! − 1 half-spaces of the form qᵀ(ℓ_pair^{σ_1} − ℓ_pair^{σ_t}) ≤ 0
for t = 2, ..., r! and the simplex constraints q ∈ Δ_{r(r−1)}. Moreover, from the above observation,
p ∈ Q_{σ_1}^pair satisfies pᵀ(ℓ_pair^{σ_1} − ℓ_pair^{σ_t}) = 0 ∀t = 2, ..., r!. Therefore, by
Lemma 14, we get

    μ_{Q_{σ_1}^pair}(p) = nullity([(ℓ_pair^{σ_1} − ℓ_pair^{σ_2}), ..., (ℓ_pair^{σ_1} − ℓ_pair^{σ_{r!}}), e]ᵀ) ,      (8)

where e is the r(r−1) × 1 all ones vector. It is not hard to see that the set {ℓ_pair^σ : σ ∈ T}
spans an (r(r−1)/2)-dimensional space, and hence the nullity of the above matrix is at most
r(r−1) − (r(r−1)/2 − 1) = r(r−1)/2 + 1. Thus by Theorem 13, we get that
CCdim(ℓ_pair) ≥ r(r−1) − (r(r−1)/2 + 1) − 1 = r(r−1)/2 − 2. In particular, for r ≥ 5, this gives
CCdim(ℓ_pair) > r, and therefore establishes that no convex surrogate ψ operating on a surrogate
target space T̂ ⊆ R^r can be classification calibrated with respect to ℓ_pair on the full probability
simplex Δ_{r(r−1)}.

6 Conclusion

We developed a framework for analyzing consistency for general multiclass learning problems
defined by a general loss matrix, introduced the notion of classification calibration dimension of a
multiclass loss, and used this to analyze consistency properties of surrogate losses for various
general multiclass problems. An interesting direction would be to develop a generic procedure for
designing consistent convex surrogates operating on a 'minimal' surrogate target space according to
the classification calibration dimension of the loss matrix.
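Via Lemma 14, the feasible subspace dimension computations in Example 7 and Section 5 reduce to matrix rank computations (nullity = number of columns minus rank), so they are easy to sanity-check numerically. The following sketch is our own illustration (assuming numpy; the sizes n = 5 and r = 4 are arbitrary illustrative choices, not from the text):

```python
import numpy as np
from itertools import permutations

def nullity(A):
    """Dimension of the null space of A: number of columns minus rank."""
    return A.shape[1] - np.linalg.matrix_rank(A)

# Example 7: mu at p = (1/n, ..., 1/n) for the 0-1 loss, via Lemma 14.
# Stack the active inequality block [-e_{n-1}  I_{n-1}] on the equality row e_n^T.
n = 5  # illustrative choice; the nullity is 0 for every n
A = np.vstack([
    np.hstack([-np.ones((n - 1, 1)), np.eye(n - 1)]),
    np.ones((1, n)),
])
mu_01 = nullity(A)  # -> 0, giving CCdim >= n - 0 - 1 = n - 1

# Section 5: the matrix of Eq. (8) for the pairwise loss, small r.
r = 4  # illustrative choice
edges = [(i, j) for i in range(r) for j in range(r) if i != j]  # |Y| = r(r-1)
perms = list(permutations(range(r)))                            # T = S_r
# Loss vectors l_pair^sigma in R^{r(r-1)}: entry 1(sigma(i) > sigma(j)).
L = np.array([[float(s[i] > s[j]) for (i, j) in edges] for s in perms])
M = np.vstack([L[0] - L[1:],              # rows (l^{sigma_1} - l^{sigma_t})
               np.ones((1, len(edges)))])  # appended all-ones row e^T
mu_pair = nullity(M)
# Bound used in the text: nullity <= r(r-1) - (r(r-1)/2 - 1).
print(mu_01, mu_pair, r * (r - 1) - (r * (r - 1) // 2 - 1))
```

Any nullity at or below the printed bound is consistent with the chain of inequalities leading to CCdim(ℓ_pair) ≥ r(r−1)/2 − 2; a smaller actual nullity only strengthens the lower bound from Theorem 13.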
It would also be of interest to extend the
results here to account for noise conditions as in [9, 10].

Acknowledgments

We would like to thank the anonymous reviewers for helpful comments. HG thanks Microsoft
Research India for a travel grant. This research is supported in part by a Ramanujan Fellowship to
SA from DST and an Indo-US Joint Center Award from the Indo-US Science & Technology Forum.

References

[1] Gábor Lugosi and Nicolas Vayatis. On the Bayes-risk consistency of regularized boosting
methods. Annals of Statistics, 32(1):30–55, 2004.

[2] Wenxin Jiang. Process consistency for AdaBoost. Annals of Statistics, 32(1):13–29, 2004.

[3] Tong Zhang. Statistical behavior and consistency of classification methods based on convex
risk minimization. Annals of Statistics, 32(1):56–134, 2004.

[4] Ingo Steinwart. Consistency of support vector machines and other regularized kernel
classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.

[5] Peter Bartlett, Michael Jordan, and Jon McAuliffe. Convexity, classification and risk bounds.
Journal of the American Statistical Association, 101(473):138–156, 2006.

[6] Tong Zhang. Statistical analysis of some multi-category large margin classification methods.
Journal of Machine Learning Research, 5:1225–1251, 2004.

[7] Ambuj Tewari and Peter Bartlett. On the consistency of multiclass classification methods.
Journal of Machine Learning Research, 8:1007–1025, 2007.

[8] David Cossock and Tong Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE
Transactions on Information Theory, 54(11):5140–5154, 2008.

[9] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning
to rank: Theory and algorithm. In International Conference on Machine Learning, 2008.

[10] John Duchi, Lester Mackey, and Michael Jordan.
On the consistency of ranking algorithms. In
International Conference on Machine Learning, 2010.

[11] Pradeep Ravikumar, Ambuj Tewari, and Eunho Yang. On NDCG consistency of listwise
ranking methods. In International Conference on Artificial Intelligence and Statistics (AISTATS),
volume 15. JMLR: W&CP, 2011.

[12] David Buffoni, Clément Calauzènes, Patrick Gallinari, and Nicolas Usunier. Learning scoring
functions with order-preserving losses and standardized supervision. In International
Conference on Machine Learning, 2011.

[13] Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. In Conference on
Learning Theory, 2011.

[14] Wojciech Kotlowski, Krzysztof Dembczynski, and Eyke Huellermeier. Bipartite ranking
through minimization of univariate loss. In International Conference on Machine Learning,
2011.

[15] Ingo Steinwart. How to compare different loss functions and their risks. Constructive
Approximation, 26:225–287, 2007.

[16] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Neural
Information Processing Systems, 2003.

[17] Deirdre O'Brien, Maya Gupta, and Robert Gray. Cost-sensitive multi-class classification from
probability estimates. In International Conference on Machine Learning, 2008.

[18] Nicolas Lambert and Yoav Shoham. Eliciting truthful answers to multiple-choice questions.
In ACM Conference on Electronic Commerce, 2009.

[19] Dimitri Bertsekas, Angelia Nedic, and Asuman Ozdaglar. Convex Analysis and Optimization.
Athena Scientific, 2003.

[20] Jean Gallier. Notes on convex sets, polytopes, polyhedra, combinatorial topology, Voronoi
diagrams and Delaunay triangulations. Technical report, Department of Computer and
Information Science, University of Pennsylvania, 2009.

[21] Elodie Vernet, Robert C. Williamson, and Mark D. Reid.
Composite multiclass losses. In
Neural Information Processing Systems, 2011.