{"title": "Composite Multiclass Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 1224, "page_last": 1232, "abstract": "We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a ``proper composite loss'', which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We also show that the integral representation for binary proper losses can not be extended to multiclass losses. We subsume existing results on ``classification calibration'' by relating it to properness. We draw conclusions concerning the design of multiclass losses.", "full_text": "Composite Multiclass Losses\n\nElodie Vernet\nENS Cachan\nevernet@ens-cachan.fr\n\nRobert C. Williamson\nANU and NICTA\nBob.Williamson@anu.edu.au\n\nMark D. Reid\nANU and NICTA\nMark.Reid@anu.edu.au\n\nAbstract\n\nWe consider loss functions for multiclass prediction problems. We show when\na multiclass loss can be expressed as a \u201cproper composite loss\u201d, which is the\ncomposition of a proper loss and a link function. We extend existing results for\nbinary losses to multiclass losses. We determine the stationarity condition, Bregman\nrepresentation, order-sensitivity, existence and uniqueness of the composite\nrepresentation for multiclass losses. We subsume existing results on \u201cclassification\ncalibration\u201d by relating it to properness and show that the simple integral\nrepresentation for binary proper losses can not be extended to multiclass losses.\n\n1 Introduction\n\nThe motivation of this paper is to understand the intrinsic structure and properties of suitable loss\nfunctions for the problem of multiclass prediction, which includes multiclass probability estimation.\nSuppose we are given a data sample S := (xi,yi)i\u2208[m] where xi \u2208 X is an observation and yi \u2208\n{1, ..,n} =: [n] is its corresponding class. We assume the sample S is drawn iid according to some\ndistribution P = PX,Y on X \u00d7 [n]. Given a new observation x we want to predict the probability\npi := P(Y = i|X = x) of x belonging to class i, for i \u2208 [n]. Multiclass classification requires the\nlearner to predict the most likely class of x; that is to find \u0177 = argmaxi\u2208[n] pi.\nA loss measures the quality of prediction. Let \u2206n := {(p1, . . . , pn): \u2211i\u2208[n] pi = 1, and 0 \u2264 pi \u2264 1, \u2200i \u2208\n[n]} denote the n-simplex. For multiclass probability estimation, \u2113: \u2206n \u2192 R^n_+. For classification, the\nloss \u2113: [n] \u2192 R^n_+. The partial losses \u2113i are the components of \u2113(q) = (\u21131(q), . . . , \u2113n(q))\u2032.\nProper losses are particularly suitable for probability estimation. They have been studied in detail\nwhen n = 2 (the \u201cbinary case\u201d) where there is a nice integral representation [1, 2, 3], and\ncharacterization [4] when differentiable. Classification calibrated losses are an analog of proper losses for\nthe problem of classification [5]. The relationship between classification calibration and properness\nwas determined in [4] for n = 2. Most of these results have had no multiclass analogue until now.\nThe design of losses for multiclass prediction has received recent attention [6, 7, 8, 9, 10, 11, 12]\nalthough none of these papers developed the connection to proper losses, and most restrict\nconsideration to margin losses (which imply certain symmetry conditions). 
Glasmachers [13] has shown\nthat certain learning algorithms can still behave well when the losses do not satisfy the conditions in\nthese earlier papers because the requirements are actually stronger than needed.\nOur contributions are: We relate properness, classification calibration, and the notion used in [8]\nwhich we rename \u201cprediction calibrated\u201d (\u00a73); we provide a novel characterization of multiclass\nproperness (\u00a74); we study composite proper losses (the composition of a proper loss with an invertible\nlink) presenting new uniqueness and existence results (\u00a75); we show how the above results can aid in\nthe design of proper losses (\u00a76); and we present a (somewhat surprising) negative result concerning\nthe integral representation of proper multiclass losses (\u00a77). Many of our results are characterisations.\nFull proofs are provided in the extended version [14].\n\n2 Formal Setup\n\nSuppose X is some set and Y = {1, . . . ,n} = [n] is a set of labels. We suppose we are given\ndata (xi,yi)i\u2208[m] such that Yi \u2208 Y is the label corresponding to xi \u2208 X. These data follow a joint\ndistribution PX,Y. We denote by EX,Y and EY|X respectively, the expectation and the conditional\nexpectation with respect to PX,Y.\nThe conditional risk L associated with a loss \u2113 is the function\n\nL: \u2206n \u00d7 \u2206n \u220b (p,q) \u21a6 L(p,q) = EY\u223cp \u2113Y(q) = p\u2032 \u00b7 \u2113(q) = \u2211_{i\u2208[n]} pi \u2113i(q) \u2208 R+,\n\nwhere Y \u223c p means Y is drawn according to a multinomial distribution with parameter p. In a typical\nlearning problem one will make an estimate q: X \u2192 \u2206n. The full risk is L(q) = EX EY|X \u2113Y(q(X)).\nMinimizing L(q) over q: X \u2192 \u2206n is equivalent to minimizing L(p(x),q(x)) over q(x) \u2208 \u2206n for all\nx \u2208 X where p(x) = (p1(x), . . . , pn(x))\u2032, p\u2032 is the transpose of p, and pi(x) = P(Y = i|X = x). 
Thus\nit suffices to only consider the conditional risk; confer [3].\nA loss \u2113: \u2206n \u2192 R^n_+ is proper if L(p, p) \u2264 L(p,q), \u2200p,q \u2208 \u2206n. It is strictly proper if the inequality is\nstrict when p \u2260 q. The conditional Bayes risk is L: \u2206n \u220b p \u21a6 infq\u2208\u2206n L(p,q). This function is always\nconcave [2]. If \u2113 is proper, then L(p) = L(p, p) = p\u2032 \u00b7 \u2113(p). Strictly proper losses induce Fisher\nconsistent estimators of probabilities: if \u2113 is strictly proper, p = argminq L(p,q).\nIn order to differentiate the losses we project the n-simplex into a subset of R^{n\u22121}. We denote\nby \u03a0\u2206 : \u2206n \u220b p = (p1, . . . , pn)\u2032 \u21a6 p\u0303 = (p1, . . . , pn\u22121)\u2032 \u2208 \u2206\u0303n := {(p1, . . . , pn\u22121)\u2032 : pi \u2265 0, \u2200i \u2208\n[n], \u2211_{i=1}^{n\u22121} pi \u2264 1} the projection of the n-simplex \u2206n, and by \u03a0\u2206^{\u22121} : \u2206\u0303n \u220b p\u0303 = (p\u03031, . . . , p\u0303n\u22121) \u21a6 p =\n(p\u03031, . . . , p\u0303n\u22121, 1 \u2212 \u2211_{i=1}^{n\u22121} p\u0303i)\u2032 \u2208 \u2206n its inverse.\nThe losses above are defined on the simplex \u2206n since the argument (an estimator) represents\na probability vector. However it is sometimes desirable to use another set V of predictions.\nOne can consider losses \u2113: V \u2192 R^n_+. Suppose there exists an invertible function \u03c8 : \u2206n \u2192 V.\nThen \u2113 can be written as a composition of a loss \u03bb defined on the simplex with \u03c8^{\u22121}. That is,\n\u2113(v) = \u03bb^\u03c8(v) := \u03bb(\u03c8^{\u22121}(v)). Such a function \u03bb^\u03c8 is a composite loss. If \u03bb is proper, we say \u2113 is a\nproper composite loss, with associated proper loss \u03bb and link \u03c8.\nWe use the following notation. The kth unit vector ek is the n-vector with all components zero except\nthe kth which is 1. The n-vector 1n := (1, . . . ,1)\u2032. The derivative of a function f is denoted D f and\nits Hessian H f. Let \u2206\u030an := {(p1, . . . , pn): \u2211i\u2208[n] pi = 1, and 0 < pi < 1, \u2200i \u2208 [n]} and \u2202\u2206n := \u2206n \\ \u2206\u030an.\n\n3 Relating Properness to Classification Calibration\n\nProperness is an attractive property of a loss for the task of class probability estimation. However if\none is merely interested in classifying (predicting \u0177 \u2208 [n] given x \u2208 X) then one requires less. We\nrelate classification calibration (the analog of properness for classification problems) to properness.\nSuppose c \u2208 \u2206\u030an. We cover \u2206n with n subsets each representing one class:\n\nTi(c) := {p \u2208 \u2206n : \u2200 j \u2260 i, pi cj \u2265 pj ci}.\n\nObserve that for i \u2260 j, the sets {p \u2208 \u2206n : pi cj = pj ci} are subsets of dimension n\u2212 2 through c and\nall ek such that k \u2260 i and k \u2260 j. Each of these subsets divides \u2206n into two parts; Ti(c) is the\nintersection, over j \u2260 i, of the parts that lie on the same side as ei.\nWe will make use of the following properties of Ti(c).\nLemma 1 Suppose c \u2208 \u2206\u030an, i \u2208 [n]. Then the following hold:\n\n1. For all p \u2208 \u2206n, there exists i such that p \u2208 Ti(c).\n2. Suppose p \u2208 \u2206n. Ti(c) \u2229 Tj(c) \u2286 {p \u2208 \u2206n : pi cj = pj ci}, a subspace of dimension n\u2212 2.\n3. Suppose p \u2208 \u2206n. If p \u2208 \u22c2_{i=1}^{n} Ti(c) then p = c.\n4. For all p,q \u2208 \u2206n, p \u2260 q, there exists c \u2208 \u2206\u030an, and i \u2208 [n] such that p \u2208 Ti(c) and q \u2209 Ti(c).\n\nClassification calibrated losses have been developed and studied under some different definitions\nand names [6, 5]. 
Below we generalise the notion of c-calibration which was proposed for n = 2 in\n[4] as a generalisation of the notion of classification calibration in [5].\nDefinition 2 Suppose \u2113: \u2206n \u2192 R^n_+ is a loss and c \u2208 \u2206\u030an. We say \u2113 is c-calibrated at p \u2208 \u2206n if for all\ni \u2208 [n] such that p \u2209 Ti(c) then \u2200q \u2208 Ti(c), L(p) < L(p,q). We say that \u2113 is c-calibrated if \u2200p \u2208 \u2206n,\n\u2113 is c-calibrated at p.\n\nDefinition 2 means that if the probability vector q one predicts doesn\u2019t belong to the same subset\n(i.e. doesn\u2019t predict the same class) as the real probability vector p, then the conditional risk L(p,q)\nis strictly larger than the Bayes risk L(p).\nClassification calibration in the sense used in [5] corresponds to 1/2-calibrated losses when n = 2. If\ncmid := (1/n, . . . , 1/n)\u2032, cmid-calibration induces Fisher-consistent estimates in the case of classification.\nFurthermore \u201c\u2113 is cmid-calibrated and for all i \u2208 [n], \u2113i is continuous and bounded below\u201d is\nequivalent to \u201c\u2113 is infinite sample consistent as defined by [6]\u201d. This is because if \u2113 is continuous\nand Ti(c) is closed, then \u2200q \u2208 Ti(c), L(p) < L(p,q) if and only if L(p) < infq\u2208Ti(c) L(p,q).\nThe following result generalises the correspondence between binary classification calibration and\nproperness [4, Theorem 16] to multiclass losses (n > 2).\nProposition 3 A continuous loss \u2113: \u2206n \u2192 R^n_+ is strictly proper if and only if it is c-calibrated for\nall c \u2208 \u2206\u030an.\nIn particular, a continuous strictly proper loss is cmid-calibrated. Thus for any estimator q\u0302n of the\nconditional probability vector one constructs by minimizing the empirical average of a continuous\nstrictly proper loss, one can build an estimator of the label (corresponding to the largest probability\nof q\u0302n) which is Fisher consistent for the problem of classification.\nIn the binary case, \u2113 is classification calibrated if and only if the following implication holds [5]:\n\n(L(fn) \u2192 min_g L(g)) \u21d2 (PX,Y(Y \u2260 fn(X)) \u2192 min_g PX,Y(Y \u2260 g(X))). (1)\n\nTewari and Bartlett [8] have characterised when (1) holds in the multiclass case. Since there is no\nreason to assume the equivalence between classification calibration and (1) still holds for n > 2, we\ngive different names for these two notions. We keep the name of classification calibration for the\nnotion linked to Fisher consistency (as defined before) and call prediction calibrated the notion of\nTewari and Bartlett (equivalent to (1)).\nDefinition 4 Suppose \u2113: V \u2192 R^n_+ is a loss. Let C\u2113 = co({\u2113(v): v \u2208 V}), the convex hull of the\nimage of V. \u2113 is said to be prediction calibrated if there exists a prediction function pred: R^n \u2192 [n]\nsuch that\n\n\u2200p \u2208 \u2206n : inf_{z\u2208C\u2113, p_pred(z) < maxi pi} p\u2032 \u00b7 z > inf_{z\u2208C\u2113} p\u2032 \u00b7 z = L(p).\n\nObserve that the class is predicted from \u2113(p) and not directly from p (which is equivalent if the\nloss is invertible). Suppose that \u2113: \u2206n \u2192 R^n_+ is such that \u2113 is prediction calibrated and pred(\u2113(p)) \u2208\nargmaxi pi. 
Then \u2113 is cmid-calibrated almost everywhere.\nBy introducing a reference \u201clink\u201d \u03c8\u0304 (which corresponds to the actual link if \u2113 is a proper composite\nloss) we now show how the pred function can be canonically expressed in terms of argmaxi pi.\nProposition 5 Suppose \u2113: V \u2192 R^n_+ is a loss. Let \u03c8\u0304(p) \u2208 argminv\u2208V L(p,v) and \u03bb = \u2113\u25e6\u03c8\u0304. Then\n\u03bb is proper. If \u2113 is prediction calibrated then pred(\u03bb(p)) \u2208 argmaxi pi.\n\n4 Characterizing Properness\n\nWe first present some simple (but new) consequences of properness. We say f : C \u2282 R^n \u2192 R^n is\nmonotone on C when for all x and y in C, (f(x) \u2212 f(y))\u2032 \u00b7 (x \u2212 y) \u2265 0; confer [15].\nProposition 6 Suppose \u2113: \u2206n \u2192 R^n_+ is a loss. If \u2113 is proper, then \u2212\u2113 is monotone.\n\nProposition 7 If \u2113 is strictly proper then it is invertible.\n\nA theme of the present paper is the extensibility of results concerning binary losses to multiclass\nlosses. The following proposition shows how the characterisation of properness in the general (not\nnecessarily differentiable) multiclass case can be reduced to the binary case. In the binary case,\nthe two classes are often denoted \u22121 and 1 and the loss is denoted \u2113 = (\u21131, \u2113\u22121)\u2032. We project the\n2-simplex \u22062 into [0,1]: \u03b7 \u2208 [0,1] is the projection of (\u03b7, 1\u2212\u03b7) \u2208 \u22062.\nProposition 8 Suppose \u2113: \u2206n \u2192 R^n_+ is a loss. Define\n\n\u2113\u0303^{p,q} : [0,1] \u220b \u03b7 \u21a6 (\u2113\u0303^{p,q}_1(\u03b7), \u2113\u0303^{p,q}_{\u22121}(\u03b7))\u2032 = (q\u2032 \u00b7 \u2113(p + \u03b7(q\u2212p)), p\u2032 \u00b7 \u2113(p + \u03b7(q\u2212p)))\u2032.\n\nThen \u2113 is (strictly) proper if and only if \u2113\u0303^{p,q} is (strictly) proper \u2200p,q \u2208 \u2202\u2206n.\nThis proposition shows that in order to check if a loss is proper one needs only to check the\nproperness in each line. One could use the easy characterization of properness for differentiable binary\nlosses (\u2113: [0,1] \u2192 R^2_+ is proper if and only if \u2200\u03b7 \u2208 [0,1], \u2212\u2113\u2032_1(\u03b7)/(1\u2212\u03b7) = \u2113\u2032_{\u22121}(\u03b7)/\u03b7 \u2265 0, [4]). However this\nneeds to be checked for all lines defined by p,q \u2208 \u2202\u2206n. We now extend some characterisations of\nproperness to the multiclass case by using Proposition 8.\nLambert [16] proved that in the binary case, properness is equivalent to the fact that the further your\nprediction is from reality, the larger the loss (\u201corder sensitivity\u201d). The result relied upon the total\norder of R. In the multiclass case, there does not exist such a total order. Yet, one can compare\ntwo predictions if they are on the same line as the true class probability. The next result is a\ngeneralization of the binary case equivalence of properness and order sensitivity.\nProposition 9 Suppose \u2113: \u2206n \u2192 R^n_+ is a loss. Then \u2113 is (strictly) proper if and only if \u2200p,q \u2208 \u2206n,\n\u22000 \u2264 h1 \u2264 h2, L(p, p + h1(q\u2212p)) \u2264 L(p, p + h2(q\u2212p)) (the inequality is strict if h1 \u2260 h2).\n\u201cOrder sensitivity\u201d tells us more about properness: the true class probability minimizes the risk\nand if the prediction moves away from the true class probability in a line then the risk increases.\nThis property appears convenient for optimization purposes: if one reaches a local minimum in the\nsecond argument of the risk and the loss is strictly proper then it is a global minimum. 
If the loss is\nproper, such a local minimum is a global minimum, or the risk is constant in an open set. But observe that\ntypically one is minimising the full risk L(q(\u00b7)) over functions q: X \u2192 \u2206n. Order sensitivity of \u2113\ndoes not imply this optimisation problem is well behaved; one needs convexity of q \u21a6 L(p,q) for\nall p \u2208 \u2206n to ensure convexity of the functional optimisation problem.\nThe order sensitivity along a line leads to a new characterisation of differentiable proper losses. As\nin the binary case, one condition comes from the fact that the derivative is zero at a minimum and\nthe other ensures that it is really a minimum.\nCorollary 10 Suppose \u2113: \u2206n \u2192 R^n_+ is a loss such that \u2113\u0303 = \u2113\u25e6\u03a0\u2206^{\u22121} is differentiable. Let M(p) =\nD\u2113\u0303(\u03a0\u2206(p)) \u00b7 D\u03a0\u2206(p). Then \u2113 is proper if and only if\n\np\u2032 \u00b7 M(p) = 0 and (q\u2212r)\u2032 \u00b7 M(p) \u00b7 (q\u2212r) \u2264 0, \u2200q,r \u2208 \u2206n, \u2200p \u2208 \u2206\u030an. (2)\n\nWe know that for any loss, its Bayes risk L(p) = infq\u2208\u2206n L(p,q) = infq\u2208\u2206n p\u2032 \u00b7 \u2113(q) is concave. If \u2113 is\nproper, L(p) = p\u2032 \u00b7 \u2113(p). Rather than working with the loss \u2113: V \u2192 R^n_+ we will now work with the\nsimpler associated conditional Bayes risk L: V \u2192 R+.\nWe need two definitions from [15]. Suppose f : R^n \u2192 R is concave. Then lim_{t\u21930} (f(x+td) \u2212 f(x))/t exists,\nand is called the directional derivative of f at x in the direction d and is denoted D f(x,d). By\nanalogy with the usual definition of subdifferential, the superdifferential \u2202f(x) of f at x is\n\n\u2202f(x) := {s \u2208 R^n : s\u2032 \u00b7 y \u2265 D f(x,y), \u2200y \u2208 R^n} = {s \u2208 R^n : f(y) \u2264 f(x) + s\u2032 \u00b7 (y\u2212x), \u2200y \u2208 R^n}.\n\nA vector s \u2208 \u2202f(x) is called a supergradient of f at x.\nThe next proposition is a restatement of the well known Bregman representation of proper losses;\nsee [17] for the differentiable case, and [2, Theorem 3.2] for the general case.\n\nProposition 11 Suppose \u2113: \u2206n \u2192 R^n_+ is a loss. Then \u2113 is proper if and only if there exists a concave\nfunction f and \u2200q \u2208 \u2206n, there exists a supergradient A(q) \u2208 \u2202f(q) such that\n\n\u2200p,q \u2208 \u2206n, p\u2032 \u00b7 \u2113(q) = L(p,q) = f(q) + (p\u2212q)\u2032 \u00b7 A(q).\n\nThen f is unique and f(p) = L(p, p) = L(p).\n\nThe fact that f is defined on a simplex is not a problem. Indeed, the superdifferential becomes\n\u2202f(x) = {s \u2208 R^n : s\u2032 \u00b7 d \u2265 D f(x,d), \u2200d \u2208 \u2206n} = {s \u2208 R^n : f(y) \u2264 f(x) + s\u2032 \u00b7 (y\u2212x), \u2200y \u2208 \u2206n}.\nIf f\u0303 = f\u25e6\u03a0\u2206^{\u22121} is differentiable at q\u0303 \u2208 \u2206\u0303n, A(q) = (D f\u0303(\u03a0\u2206(q)), 0)\u2032 + \u03b1 1\u2032n, \u03b1 \u2208 R. Then (p\u2212q)\u2032 \u00b7 A(q) =\nD f\u0303(\u03a0\u2206(q)) \u00b7 (\u03a0\u2206(p) \u2212 \u03a0\u2206(q)). Hence for any concave differentiable function f, there exists a\nunique proper loss whose Bayes risk is equal to f (we say that f is differentiable when f\u0303 is\ndifferentiable).\nThe last property gives us the form of the proper losses associated with a Bayes risk. Suppose\nL: \u2206n \u2192 R+ is concave. 
The proper losses whose Bayes risk is equal to L are\n\n\u2113: \u2206n \u220b q \u21a6 (L(q) + (ei \u2212 q)\u2032 \u00b7 A(q))_{i=1}^{n} \u2208 R^n_+, \u2200A(q) \u2208 \u2202L(q). (3)\n\nThis result suggests that some information is lost by representing a proper loss via its Bayes risk\n(when the latter is not differentiable). The next proposition elucidates this by showing that proper\nlosses which have the same Bayes risk are equal almost everywhere.\n\nProposition 12 Two proper losses \u21131 and \u21132 have the same conditional Bayes risk function L if and\nonly if \u21131 = \u21132 almost everywhere. If L is differentiable, \u21131 = \u21132 everywhere.\nWe say that L is differentiable at p if L\u0303 = L\u25e6\u03a0\u2206^{\u22121} is differentiable at p\u0303 = \u03a0\u2206(p).\nProposition 13 Suppose \u2113: \u2206n \u2192 R^n_+ is a proper loss. Then \u2113 is continuous in \u2206\u030an if and only if L is\ndifferentiable on \u2206\u030an; \u2113 is continuous at p \u2208 \u2206\u030an if and only if L is differentiable at p \u2208 \u2206\u030an.\n\n5 The Proper Composite Representation: Uniqueness and Existence\n\nIt is sometimes helpful to define a loss on some set V rather than \u2206n; confer [4]. Composite losses\n(see the definition in \u00a72) are a way of constructing such losses: given a proper loss \u03bb : \u2206n \u2192 R^n_+ and\nan invertible link \u03c8 : \u2206n \u2192 V, one defines \u03bb^\u03c8 : V \u2192 R^n_+ using \u03bb^\u03c8 = \u03bb\u25e6\u03c8^{\u22121}. We now consider the\nquestion: given a loss \u2113: V \u2192 R^n_+, when does \u2113 have a proper composite representation (whereby \u2113\ncan be written as \u2113 = \u03bb\u25e6\u03c8^{\u22121}), and is this representation unique? 
We first consider the binary case\nand study the uniqueness of the representation of a loss as a proper composite loss.\nProposition 14 Suppose \u2113 = \u03bb\u25e6\u03c8^{\u22121} : V \u2192 R^2_+ is a proper composite loss and that the proper loss\n\u03bb is differentiable and the link function \u03c8 is differentiable and invertible. Then the proper loss \u03bb\nis unique. Furthermore \u03c8 is unique if \u2200v1,v2 \u2208 R, \u2203v \u2208 [v1,v2], \u2113\u2032_1(v) \u2260 0 or \u2113\u2032_{\u22121}(v) \u2260 0. If there\nexists v\u03041, v\u03042 \u2208 R such that \u2113\u2032_1(v) = \u2113\u2032_{\u22121}(v) = 0 \u2200v \u2208 [v\u03041, v\u03042], one can choose any \u03c8|[v\u03041, v\u03042] such that\n\u03c8 is differentiable, invertible and continuous in [v\u03041, v\u03042] and obtain \u2113 = \u03bb\u25e6\u03c8^{\u22121}, and \u03c8 is uniquely\ndefined where \u2113 is invertible.\nProposition 15 Suppose \u2113: V \u2192 R^2_+ is a differentiable binary loss such that \u2200v \u2208 V, \u2113\u2032_{\u22121}(v) \u2260 0\nor \u2113\u2032_1(v) \u2260 0. Then \u2113 can be expressed as a proper composite loss if and only if the following\nthree conditions hold: 1) \u21131 is decreasing (increasing); 2) \u2113\u22121 is increasing (decreasing); and 3)\nf : V \u220b v \u21a6 \u2113\u2032_1(v)/\u2113\u2032_{\u22121}(v) is strictly increasing (decreasing) and continuous.\n\nObserve that the last condition is always satisfied if both \u21131 and \u2113\u22121 are convex.\nSuppose \u03d5 : R \u2192 R+ is a function. The loss defined via \u2113\u03d5 : V \u220b v \u21a6 (\u2113\u22121(v), \u21131(v))\u2032 =\n(\u03d5(\u2212v), \u03d5(v))\u2032 \u2208 R^2_+ is called a binary margin loss. Binary margin losses are often used for\nclassification problems. We will now show how the above proposition applies to them.\n\nCorollary 16 Suppose \u03d5 : R \u2192 R+ is differentiable and \u2200v \u2208 R, \u03d5\u2032(v) \u2260 0 or \u03d5\u2032(\u2212v) \u2260 0. Then \u2113\u03d5\ncan be expressed as a proper composite loss if and only if f : R \u220b v \u21a6 \u2212\u03d5\u2032(v)/\u03d5\u2032(\u2212v) is strictly monotonic\ncontinuous and \u03d5 is monotonic.\n\nIf \u03d5 is convex or concave then f defined above is monotonic. However not all binary margin losses\nare composite proper losses. One can even build a smooth margin loss which cannot be expressed as\na proper composite loss. Consider \u03d5(x) = 1 \u2212 (1/\u03c0) arctan(x\u22121). Then f(v) = \u03d5\u2032(\u2212v)/(\u03d5\u2032(\u2212v)+\u03d5\u2032(v)) = (v^2\u22122v+2)/(2v^2+4)\nwhich is not invertible.\nWe now generalize the above results to the multiclass case.\nProposition 17 Suppose \u2113 has two proper composite representations \u2113 = \u03bb\u25e6\u03c8^{\u22121} = \u00b5\u25e6\u03c6^{\u22121} where\n\u03bb and \u00b5 are proper losses and \u03c8 and \u03c6 are continuous and invertible. Then \u03bb = \u00b5 almost everywhere.\nIf \u2113 is continuous and has a composite representation, then the proper loss (in the decomposition) is\nunique (\u03bb = \u00b5 everywhere).\nIf \u2113 is invertible and has a composite representation, then the representation is unique.\nGiven a loss \u2113: V \u2192 R^n_+, we denote by S\u2113 = \u2113(V) + [0,\u221e)^n = {\u03bb : \u2203v \u2208 V, \u2200i \u2208 [n], \u03bbi \u2265 \u2113i(v)} the\nsuperprediction set of \u2113 (confer e.g. [18]). We introduce a set of hyperplanes for p \u2208 \u2206n and \u03b2 \u2208 R,\nh^\u03b2_p = {x \u2208 R^n : x\u2032 \u00b7 p = \u03b2}. 
A hyperplane h^\u03b2_p supports a set A at\nx \u2208 A when x \u2208 h^\u03b2_p and for all a \u2208 A, a\u2032 \u00b7 p \u2265 \u03b2 or for all a \u2208 A, a\u2032 \u00b7 p \u2264 \u03b2. We say that S\u2113 is strictly\nconvex in its inner part when for all p \u2208 \u2206n, there exists a unique x \u2208 \u2113(V) such that there exists a\nhyperplane h^\u03b2_p supporting S\u2113 at x. S\u2113 is said to be smooth when for all x \u2208 \u2113(V), there exists a unique\nhyperplane supporting S\u2113 at x. If \u2113 is invertible, we can express these two definitions in terms of v \u2208 V\nrather than x \u2208 \u2113(V). If \u2113: V \u2192 R^n_+ is strictly convex, then S\u2113 will be strictly convex in its inner part.\n\n[Figure: the superprediction set S\u2113 and the image \u2113(V), drawn with axes \u21131(v) and \u21132(v), together with a supporting hyperplane at the point x = \u2113(v).]\n\nProposition 18 Suppose \u2113: V \u2192 R^n_+ is a continuous invertible loss. Then \u2113 has a strictly proper\ncomposite representation if and only if S\u2113 is convex, smooth and strictly convex in its inner part.\nProposition 19 Suppose \u2113: V \u2192 R^n_+ is a continuous loss. If \u2113 has a proper composite\nrepresentation, then S\u2113 is convex and smooth. If \u2113 is also invertible, then S\u2113 is strictly convex in its inner\npart.\n\n6 Designing Proper Losses\n\nWe now build a family of conditional Bayes risks. Suppose we are given n(n\u22121)/2 concave\nfunctions {Li1,i2 : \u22062 \u2192 R}1\u2264i1<i2\u2264n on \u22062, and we want to build a concave function L on \u2206n\nwhich is equal to one of the given functions on each edge of the simplex (\u22001 \u2264 i1 < i2 \u2264 n,\nL(0, .,0, pi1,0, .,0, pi2,0, .,0) = Li1,i2(pi1, pi2)). This is equivalent to choosing a binary loss function,\nknowing that the observation is in the class i1 or i2. The result below gives one possible construction.\n(There exist infinitely many solutions: one can simply add any concave function equal to zero on\neach edge.)\nLemma 20 Suppose we have a family of concave functions {Li1,i2 : \u22062 \u2192 R}1\u2264i1<i2\u2264n, then\n\nL: \u2206n \u220b p \u21a6 L(p1, . . . , pn) = \u2211_{1\u2264i1<i2\u2264n} (pi1 + pi2) Li1,i2(pi1/(pi1 + pi2), pi2/(pi1 + pi2))\n\nis concave and \u22001 \u2264 i1 < i2 \u2264 n, L(0, .,0, pi1,0, .,0, pi2,0, .,0) = Li1,i2(pi1, pi2).\n\nUsing this family of Bayes risks, one can build a family of proper losses.\nLemma 21 Suppose we have a family of binary proper losses \u2113^{i1,i2} : \u22062 \u2192 R^2. Then\n\n\u2113: \u2206n \u220b p \u21a6 \u2113(p) = (\u2211_{i=1}^{j\u22121} \u2113^{i,j}_{\u22121}(pi/(pi + pj)) + \u2211_{i=j+1}^{n} \u2113^{i,j}_1(pj/(pi + pj)))_{j=1}^{n} \u2208 R^n_+\n\nis a proper n-class loss such that\n\n\u2113i((0, .,0, pi1,0, .,0, pi2,0, .,0)) = \u2113^{i1,i2}_1(pi1) if i = i1; \u2113^{i1,i2}_{\u22121}(pi1) if i = i2; 0 otherwise.\n\nObserve that it is much easier to work at first with the Bayes risk and then to use the correspondence\nbetween Bayes risks and proper losses.\n\n7 Integral Representations of Proper Losses\n\nUnlike the natural generalisation of the results from proper binary to proper multiclass losses above,\nthere is one result that does not carry over: the integral representation of proper losses [1]. In the\nbinary case there exists a family of \u201cextremal\u201d loss functions (cost-weighted generalisations of the\n0-1 loss) each parametrised by c \u2208 [0,1] and defined for all \u03b7 \u2208 [0,1] by \u2113^c_{\u22121}(\u03b7) := c\u27e6\u03b7 \u2265 c\u27e7 and\n\u2113^c_1(\u03b7) := (1\u2212c)\u27e6\u03b7 < c\u27e7. 
As shown in [1, 3], given these extremal functions, any proper binary loss \u2113\ncan be expressed as the weighted integral \u2113 = \u222b_0^1 \u2113^c w(c) dc + constant with w(c) = \u2212L\u2032\u2032(c). This\nrepresentation is a special case of a representation from Choquet theory [19] which characterises\nwhen every point in some set can be expressed as a weighted combination of the \u201cextremal points\u201d\nof the set. Although there is such a representation when n > 2, the difficulty is that the set of extremal\npoints is much larger and this rules out the existence of a nice small set of \u201cprimitive\u201d proper losses\nwhen n > 2. The rest of this section makes this statement precise.\nA convex cone K is a set of points closed under linear combinations of positive coefficients. That\nis, K = \u03b1K + \u03b2K for any \u03b1,\u03b2 \u2265 0. A point f \u2208 K is extremal if f = (1/2)(g + h) for g,h \u2208 K\nimplies \u2203\u03b1 \u2208 R+ such that g = \u03b1 f. That is, f cannot be represented as a non-trivial combination of\nother points in K. The set of extremal points for K will be denoted ex K. Suppose U is a bounded\nclosed convex set in R^d, and Kb(U) is the set of convex functions on U bounded by 1, then Kb(U)\nis compact with respect to the topology of uniform convergence. Theorem 2.2 of [20] shows that the\nextremal points of the convex cone K(U) = {\u03b1 f + \u03b2 g : f,g \u2208 Kb(U), \u03b1,\u03b2 \u2265 0} are dense (w.r.t. the\ntopology of uniform convergence) in K(U) when d > 1. This means for any function f \u2208 K(U)\nthere is a sequence of functions (gi)i such that for all i, gi \u2208 ex K(U) and lim_{i\u2192\u221e} \u2016f \u2212 gi\u2016\u221e = 0,\nwhere \u2016f\u2016\u221e := sup_{u\u2208U} |f(u)|. We use this result to show that the set of extremal Bayes risks is\ndense in the set of Bayes risks when n > 2.\nIn order to simplify our analysis, we restrict attention to fair proper losses. 
A loss is fair if each\npartial loss is zero on its corresponding vertex of the simplex (\u2113i(ei) = 0, \u2200i \u2208 [n]). A proper loss is\nfair if and only if its Bayes risk is zero at each vertex of the simplex (in this case the Bayes risk is\nalso called fair). One does not lose generality by studying fair proper losses since any proper loss is\na sum of a fair proper loss and a constant vector.\nThe set of fair proper losses defined on \u2206n forms a closed convex cone, denoted Ln. The set of\nconcave functions which are zero on all the vertices of the simplex \u2206n is denoted Fn and is also a\nclosed convex cone.\nProposition 22 Suppose n > 2. Then for any fair proper loss \u2113 \u2208 Ln there exists a sequence (\u2113i)i\nof extremal fair proper losses (\u2113i \u2208 ex Ln) which converges almost everywhere to \u2113.\nThe proof of Proposition 22 requires the following lemma which relies upon the correspondence\nbetween a proper loss and its Bayes risk (Proposition 11) and the fact that two continuous functions\nequal almost everywhere are equal everywhere.\nLemma 23 If \u2113 \u2208 ex Ln then its corresponding Bayes risk L is extremal in Fn. Conversely, if\nL \u2208 ex Fn then all the proper losses \u2113 with Bayes risk equal to L are extremal in Ln.\n\nWe also need a correspondence between the uniform convergence of a sequence of Bayes risk\nfunctions and the convergence of their associated proper losses.\nLemma 24 Suppose L,Li \u2208 Fn for i \u2208 N and suppose \u2113 and \u2113i, i \u2208 N are associated proper losses.\nThen (Li)i converges uniformly to L if and only if (\u2113i)i converges almost everywhere to \u2113.\n\nBronshtein [20] and Johansen [21] showed how to construct a set of extremal\nconvex functions which is dense in K(U). 
With a trivial change of sign\nthis leads to a family of extremal proper\nfair Bayes risks that is dense in the set\nof Bayes risks in the topology of uniform\nconvergence. This means that it is not\npossible to have a small set of extremal\n(\u201cprimitive\u201d) losses from which one can\nconstruct any proper fair loss by linear\ncombinations when n > 2.\nA convex polytope is a compact convex\nintersection of a finite set of half-spaces\nand is therefore the convex hull of its\nvertices. Let {ai}i be a finite family\nof affine functions defined on \u2206n. Now\ndefine the convex polyhedral function f\nby f(x) := max_i ai(x). The set K :=\n{Pi = {x \u2208 \u2206n : f(x) = ai(x)}} is a cover-\ning of \u2206n by polytopes. Theorem 2.1 of [20] shows that for f, Pi and K so defined, f is extremal\nif the following two conditions are satisfied: 1) for all polytopes Pi in K and for every face F of Pi,\nF \u2229 \u2206n \u2260 \u2205 implies F has a vertex in \u2206n; 2) every vertex of Pi in \u2206n belongs to n distinct polytopes\nof K. The set of all such f is dense in K(U).\nUsing this result it is straightforward to exhibit some sets of extremal fair Bayes risks {L_c(p) : c \u2208\n\u2206n}. Two examples are when L_c(p) = \u2211_{i=1}^{n} (p_i/c_i) \u220f_{j\u2260i} \u27e6p_i/c_i \u2264 p_j/c_j\u27e7 or L_c(p) = \u22c0_{i\u2208[n]} (1 \u2212 p_i)/(1 \u2212 c_i).\n\nFigure 1: Complexity of extremal concave functions in two dimensions (corresponds to n = 3). Graph of an extremal concave function in two dimensions. Lines are where the slope changes. The pattern of these lines can be arbitrarily complex.\n\n8 Conclusion\n\nWe considered loss functions for multiclass prediction problems and made four main contributions:\n\u2022 We extended existing results for binary losses to multiclass prediction problems includ-\ning several characterisations of proper losses and the relationship between properness and\nclassification calibration;\n\u2022 We related the notion of prediction calibration to classification calibration;\n\u2022 We developed some new existence and uniqueness results for proper composite losses\n(which are new even in the binary case) which characterise when a loss has a proper com-\nposite representation in terms of the geometry of the associated superprediction set; and\n\u2022 We showed that the attractive (simply parametrised) integral representation for binary\nproper losses can not be extended to the multiclass case.\n\nOur results suggest that in order to design losses for multiclass prediction problems it is helpful to\nuse the composite representation, and design the proper part via the Bayes risk as suggested for the\nbinary case in [1]. The proper composite representation is used in [22].\n\nAcknowledgements\n\nThe work was performed whilst Elodie Vernet was visiting ANU and NICTA, and was supported by\nthe Australian Research Council and NICTA, through backing Australia\u2019s ability.\n\nReferences\n[1] Andreas Buja, Werner Stuetzle and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, November 2005. http://www-stat.wharton.upenn.edu/~buja/PAPERS/paper-proper-scoring.pdf.\n\n[2] Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. 
Journal of the American Statistical Association, 102(477):359-378, March 2007.\n\n[3] Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731-817, March 2011.\n\n[4] Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387-2422, 2010.\n\n[5] Peter L. Bartlett, Michael I. Jordan and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, March 2006.\n\n[6] Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.\n\n[7] Simon I. Hill and Arnaud Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research, 30:525-564, 2007.\n\n[8] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, 2007.\n\n[9] Yufeng Liu. Fisher consistency of multicategory support vector machines. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, pages 289-296, 2007.\n\n[10] Ra\u00fal Santos-Rodr\u00edguez, Alicia Guerrero-Curieses, Roc\u00edo Alaiz-Rodriguez and Jes\u00fas Cid-Sueiro. Cost-sensitive learning based on Bregman divergences. Machine Learning, 76:271-285, 2009. http://dx.doi.org/10.1007/s10994-009-5132-8.\n\n[11] Hui Zou, Ji Zhu and Trevor Hastie. New multicategory boosting algorithms based on multicategory Fisher-consistent losses. The Annals of Applied Statistics, 2(4):1290-1306, 2008.\n\n[12] Zhihua Zhang, Michael I. Jordan, Wu-Jun Li and Dit-Yan Yeung. Coherence functions for multicategory margin-based classification methods. 
Proceedings of the Twelfth Conference on Artificial Intelligence and Statistics (AISTATS), 2009.\n\n[13] Tobias Glasmachers. Universal consistency of multi-class support vector classification. Advances in Neural Information Processing Systems (NIPS), 2010.\n\n[14] Elodie Vernet, Robert C. Williamson and Mark D. Reid. Composite multiclass losses (with proofs). To appear in NIPS 2011, October 2011. http://users.cecs.anu.edu.au/~williams/papers/P188.pdf.\n\n[15] Jean-Baptiste Hiriart-Urruty and Claude Lemar\u00e9chal. Fundamentals of Convex Analysis. Springer, Berlin, 2001.\n\n[16] Nicolas S. Lambert. Elicitation and evaluation of statistical forecasts. Technical report, Stanford University, March 2010. http://www.stanford.edu/~nlambert/lambert_elicitation.pdf.\n\n[17] Jes\u00fas Cid-Sueiro and An\u00edbal R. Figueiras-Vidal. On the structure of strict sense Bayesian cost functions and its applications. IEEE Transactions on Neural Networks, 12(3):445-455, May 2001.\n\n[18] Yuri Kalnishkan and Michael V. Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74:1228-1244, 2008.\n\n[19] Robert R. Phelps. Lectures on Choquet\u2019s Theorem, volume 1757 of Lecture Notes in Mathematics. Springer, 2nd edition, 2001.\n\n[20] Efim Mikhailovich Bronshtein. Extremal convex functions. Siberian Mathematical Journal, 19:6-12, 1978.\n\n[21] S\u00f8ren Johansen. The extremal convex functions. Mathematica Scandinavica, 34:61-68, 1974.\n\n[22] Tim van Erven, Mark D. Reid and Robert C. Williamson. Mixability is Bayes risk curvature relative to log loss. Proceedings of the 24th Annual Conference on Learning Theory, 2011. To appear. http://users.cecs.anu.edu.au/~williams/papers/P186.pdf.\n\n[23] Rolf Schneider. Convex Bodies: The Brunn-Minkowski Theory. 
Cambridge University Press, 1993.", "award": [], "sourceid": 719, "authors": [{"given_name": "Elodie", "family_name": "Vernet", "institution": null}, {"given_name": "Mark", "family_name": "Reid", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}