{"title": "On the Efficient Minimization of Classification Calibrated Surrogates", "book": "Advances in Neural Information Processing Systems", "page_first": 1201, "page_last": 1208, "abstract": "Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable --- a set whose losses span the exponential, logistic and squared losses ---, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zero-sum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates.", "full_text": "On the Ef\ufb01cient Minimization of Classi\ufb01cation\n\nCalibrated Surrogates\n\nRichard Nock\n\nCEREGMIA \u2014 Univ. Antilles-Guyane\n\n97275 Schoelcher Cedex, Martinique, France\n\nFrank Nielsen\n\nLIX - Ecole Polytechnique\n\n91128 Palaiseau Cedex, France\n\nrnock@martinique.univ-ag.fr\n\nnielsen@lix.polytechnique.fr\n\nAbstract\n\nBartlett et al (2006) recently proved that a ground condition for convex surrogates,\nclassi\ufb01cation calibration, ties up the minimization of the surrogates and classi\ufb01-\ncation risks, and left as an important problem the algorithmic questions about the\nminimization of these surrogates. 
In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate that is strictly convex and differentiable (a set whose losses span the exponential, logistic and squared losses), with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zero-sum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, which shed light on new surrogates and on the potential of data-dependent strategies to tune surrogates.

1 Introduction

A very active supervised learning trend has been flourishing over the last decade: it studies functions known as surrogates (upper bounds of the empirical risk, generally with particular convexity properties) whose minimization remarkably impacts empirical / true risk minimization [3, 4, 10]. Surrogates play fundamental roles in some of the most successful supervised learning algorithms, including AdaBoost, additive logistic regression, decision tree induction, and Support Vector Machines [13, 7, 10]. As their popularity has been rapidly spreading, authors have begun to stress the need to set in order the huge set of surrogates, and to better understand their properties. Statistical consistency properties have been shown for a wide set containing most of the surrogates relevant to learning, classification calibrated surrogates (CCS) [3]; other important properties, like the algorithmic questions about minimization, have been explicitly left as important problems to settle [3].

In this paper, we address and solve this problem for all strictly convex differentiable CCS, a set referred to as strictly convex surrogates (SCS).
We propose a minimization algorithm, ULS, which outputs linear separators, with two key properties: it provably achieves the optimum of the surrogate, and it meets boosting-type convergence under a weak learning assumption. There is more, as we show that SCS strictly contains another set of surrogates of important rationale, balanced convex surrogates (BCS). This set, which contains the logistic and squared losses but not the exponential loss, coincides with the set of losses satisfying three common requirements about losses in learning. In fact, BCS spans a large subset of the expected losses for the zero-sum games of [9], by which ULS may also be viewed as an efficient learner for decision making (in simple environments, though).

Section 2 gives preliminary definitions; section 3 presents surrogate losses and risks; sections 4 and 5 present ULS and its properties; section 6 discusses experiments with ULS; section 7 concludes.

2 Preliminary definitions

Unless otherwise stated, bold-faced variables like w denote vectors (components are wi, i = 1, 2, ...), calligraphic upper-cases like S denote sets, and blackboard faces like O denote subsets of R, the set of real numbers. We let set O denote a domain (Rn, [0, 1]n, etc., where n is the number of description variables), whose elements are observations. An example is an ordered pair (o, c) ∈ O × {c−, c+}, where {c−, c+} denotes the set of classes (or labels), and c+ (resp. c−) is the positive class (resp. negative class). Classes are abstracted by a bijective mapping to one of two other sets:

c ∈ {c−, c+} ↔ y∗ ∈ {−1, +1} ↔ y ∈ {0, 1} .   (1)

The convention is c+ ↔ +1 ↔ 1 and c− ↔ −1 ↔ 0. We thus have three distinct notations for an example: (o, c), (o, y∗), (o, y), that shall be used without ambiguity.
We suppose given a set of m examples, S := {(oi, ci), i = 1, 2, ..., m}. We wish to build a classifier H, which can either be a function H : O → O ⊆ R (hereafter, O is assumed to be symmetric with respect to 0), or a function H : O → [0, 1]. Following a convention of [6], we compute to which extent the outputs of H and the labels in S disagree, ε(S, H), by summing a loss which quantifies pointwise disagreements:

ε(S, H) := Σᵢ ℓ(ci, H(oi)) .   (2)

The fundamental loss is the 0/1 loss, ℓ0/1(c, H) (to ease readability, the second argument is written H instead of H(o)). It takes on two forms depending on im(H):

ℓ0/1_R(y∗, H) := 1_{y∗ ≠ σ∘H} if im(H) = O , or ℓ0/1_[0,1](y, H) := 1_{y ≠ τ∘H} if im(H) = [0, 1] .   (3)

The following notations are introduced in (3): for a clear distinction of the output of H, we put in index to ℓ and ε an indication of the loss' domain of parameters: R, meaning it is actually some O ⊆ R, or [0, 1]. The exponent to ℓ gives the indication of the loss name. Finally, 1_π is the indicator variable that takes value 1 iff predicate π is true, and 0 otherwise; σ : R → {−1, +1} is +1 iff x ≥ 0 and −1 otherwise; τ : [0, 1] → {0, 1} is 1 iff x ≥ 1/2, and 0 otherwise.

Both losses ℓ_R and ℓ_[0,1] are defined simultaneously via popular transforms on H, such as the logit transform logit(p) := log(p/(1 − p)), ∀p ∈ [0, 1] [7]. We have indeed ℓ0/1_[0,1](y, H) = ℓ0/1_R(y∗, logit(H)) and ℓ0/1_R(y∗, H) = ℓ0/1_[0,1](y, logit⁻¹(H)). We have implicitly closed the domain of the logit, adding two symbols ±∞ to ensure that the eventual infinite values for H can be mapped back to [0, 1].

In supervised learning, the objective is to carry out the minimization of the expectation of the 0/1 loss in generalization, the so-called true risk. Very often however, this task can be relaxed to the minimization of the empirical risk of H, which is (2) with the 0/1 loss [6]:

ε0/1(S, H) := Σᵢ ℓ0/1(ci, H(oi)) .   (4)

The main classifiers we investigate are linear separators (LS). In this case, H(o) := Σₜ αₜhₜ(o) for features hₜ with im(hₜ) ⊆ R and leveraging coefficients αₜ ∈ R.

3 Losses and surrogates

A serious alternative to directly minimizing (4) is to rather focus on the minimization of a surrogate risk [3]. This is a function ε(S, H) as in (2) whose surrogate loss ℓ(c, H(o)) satisfies ℓ0/1(c, H(o)) ≤ ℓ(c, H(o)). Four are particularly important in supervised learning, defined via the following surrogate losses:

ℓexp_R(y∗, H) := exp(−y∗H) ,   (5)
ℓlog_R(y∗, H) := log(1 + exp(−y∗H)) ,   (6)
ℓsqr_R(y∗, H) := (1 − y∗H)² ,   (7)
ℓhinge_R(y∗, H) := max{0, 1 − y∗H} .   (8)

(5) is the exponential loss, (6) is the logistic loss, (7) is the squared loss and (8) is the hinge loss.

Definition 1 A Strictly Convex Loss (SCL) is a strictly convex function ψ : X → R₊ differentiable on int(X), with X a symmetric interval with respect to zero, s.t.
∇ψ(0) < 0.

Table 1: permissible functions, their corresponding BCLs and the matching [0, 1] predictions (φ̄ := −φ).

φ(x) | a_φ | im(∇φ̄) ⊇ im(H) | F_φ(y∗H) = (φ̄⋆(−y∗H) − a_φ)/b_φ | P̂r[c = c+|H; o] = ∇⁻¹_φ̄(H)
φ_µ(x) := µ + (1 − µ)√(x(1 − x)), µ ∈ (0, 1) | µ | R | (−y∗H + √((1 − µ)² + (y∗H)²))/(1 − µ) | 1/2 + H/(2√((1 − µ)² + H²))
φ_M(x) := √(x(1 − x)) | 0 | R | −y∗H + √(1 + (y∗H)²) | 1/2 + H/(2√(1 + H²))
φ_Q(x) := −x log x − (1 − x) log(1 − x) | 0 | R | log₂(1 + exp(−y∗H)) | exp(H)/(1 + exp(H))
φ_B(x) := x(1 − x) | 0 | [−1, 1] | (1 − y∗H)² | 1/2 + H/2

∇. is the gradient notation (here, the derivative). Any surrogate risk built from a SCL is called a Strictly Convex Surrogate (SCS). From Theorem 4 in [3], it comes that SCL contains all classification calibrated losses (CCL) that are strictly convex and differentiable, such as (5), (6), (7).

Fix ψ ∈ SCL. The Legendre conjugate ψ⋆ of ψ is ψ⋆(x) := sup_{x′∈int(X)} {xx′ − ψ(x′)}. Because of the strict convexity of ψ, the analytic expression of the Legendre conjugate becomes ψ⋆(x) = x∇⁻¹_ψ(x) − ψ(∇⁻¹_ψ(x)). ψ⋆ is also strictly convex and differentiable. A function φ : [0, 1] → R₊ is called permissible iff it is differentiable on (0, 1), strictly concave, symmetric about x = 1/2, and with φ(0) = φ(1) = a_φ ≥ 0. We let b_φ := φ(1/2) − a_φ > 0. Permissible functions with a_φ = 0 span a very large subset of generalized entropies [9].
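As a quick sanity check on the surrogacy property, the four losses (5)-(8) can be compared numerically against the 0/1 loss they upper-bound. A minimal sketch in plain Python (function names are ours); note that the logistic loss (6) dominates the 0/1 loss only after rescaling by 1/log 2, i.e. in base 2, which is a standard fact rather than a claim of the paper:

```python
import math

def loss_01(y_star, h):
    # 0/1 loss (3) with the sign-based decision sigma(h) = +1 iff h >= 0
    return 0.0 if (1 if h >= 0 else -1) == y_star else 1.0

def loss_exp(y_star, h):    # exponential loss (5)
    return math.exp(-y_star * h)

def loss_log(y_star, h):    # logistic loss (6)
    return math.log(1.0 + math.exp(-y_star * h))

def loss_sqr(y_star, h):    # squared loss (7); im(H) restricted to [-1, 1]
    return (1.0 - y_star * h) ** 2

def loss_hinge(y_star, h):  # hinge loss (8)
    return max(0.0, 1.0 - y_star * h)

# upper-bound check on a grid of margins
for k in range(-20, 21):
    h = k / 10.0
    for y in (-1, 1):
        z = loss_01(y, h)
        assert loss_exp(y, h) >= z
        assert loss_hinge(y, h) >= z
        assert loss_log(y, h) / math.log(2.0) >= z   # base-2 rescaling
        if -1.0 <= h <= 1.0:
            assert loss_sqr(y, h) >= z
```

The grid check passes silently; removing the 1/log 2 rescaling makes the logistic assertion fail near the decision boundary, which illustrates why normalized versions of the losses (e.g. F_φ below, with F_φ(0) = 1) are convenient.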
Permissible functions are useful to define the following subclass of SCL, of particular interest (here, φ̄ := −φ).

Definition 2 Let φ be permissible. The Balanced Convex Loss (BCL) with signature φ, F_φ, is:

F_φ(x) := (φ̄⋆(−x) − a_φ)/b_φ .   (9)

Balanced Convex Surrogates (BCS) are defined accordingly. All BCL share a common shape. Indeed, φ̄⋆(x) satisfies the following relationships:

φ̄⋆(x) = φ̄⋆(−x) + x ,   (10)
lim_{x → inf im(∇φ̄)} φ̄⋆(x) = a_φ .   (11)

Figure 1: Bold curves depict plots of φ̄⋆(−x) for the φ in Table 1 (φ_B, φ_M, φ_{µ=1/3}, φ_Q); thin dotted half-lines are their asymptotes.

Noting that F_φ(0) = 1 and ∇F_φ(0) = −(1/b_φ)∇⁻¹_φ̄(0) < 0, it follows that BCS ⊂ SCS, where the strict inequality comes from the fact that (5) is a SCL but not a BCL. It also follows lim_{x → sup im(∇φ̄)} F_φ(x) = 0 from (11), and lim_{x → inf im(∇φ̄)} F_φ(x) = −x/b_φ from (10). We get that the asymptotes of any BCL can be summarized as ℓ(x) := x(σ(x) − 1)/(2b_φ). When b_φ = 1, this is the linear hinge loss [8], a generalization of (8) for which x := y∗H − 1. Thus, while the hinge loss is not a BCL, it is the limit behavior of any BCL (see Figure 1).

Table 1 (left column) gives some examples of permissible φ. When scaled so that φ(1/2) = 1, some coincide with popular choices: φ_B with the Gini index, φ_Q with the bit-entropy, and φ_M with Matsushita's error [10, 11].
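The construction in Definition 2 can be checked numerically: conjugating φ̄ := −φ on a fine grid recovers the closed forms of Table 1. A grid-based sketch for Matsushita's φ_M (names are ours; the grid supremum is an approximation of the exact conjugate):

```python
import math

def phi_M(x):
    # Matsushita's permissible function: a_phi = 0, b_phi = phi_M(1/2) = 1/2
    return math.sqrt(x * (1.0 - x))

def conj_phi_bar(z, steps=100000):
    # grid approximation of the Legendre conjugate of phi_bar := -phi_M:
    # (phi_bar)*(z) = sup_x { z*x - phi_bar(x) } = sup_x { z*x + phi_M(x) }
    return max(z * (k / steps) + phi_M(k / steps) for k in range(1, steps))

def F_M(x):
    # the BCL (9) of signature phi_M, with a_phi = 0 and b_phi = 1/2
    return (conj_phi_bar(-x) - 0.0) / 0.5

# Table 1's closed form for this BCL is -x + sqrt(1 + x^2)
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(F_M(x) - (-x + math.sqrt(1.0 + x * x))) < 1e-3
```

The same recipe with φ_B(x) = x(1 − x) (restricting x to [−1, 1]) reproduces the squared loss row of Table 1, and F_M(0) = 1 checks the common normalization of all BCLs.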
Table 1 also gives the expressions of F_φ along with the im(H) = O ⊆ R allowed by the BCL, for the corresponding permissible function. It is interesting to note the constraint on im(H) for the squared loss to be a BCL, which makes it monotonous in the interval, but implies rescaling the outputs of classifiers like linear separators to remain in [−1, 1].

4 ULS: the efficient minimization of any SCS

For any strictly convex function ψ : X → R differentiable on int(X), the Bregman Loss Function (BLF) D_ψ with generator ψ is [5]:

D_ψ(x||x′) := ψ(x) − ψ(x′) − (x − x′)∇ψ(x′) .   (12)

The following Lemma states some relationships that are easy to check using ψ⋆⋆ = ψ. They are particularly interesting when im(H) = O ⊆ R.

Algorithm 1: Algorithm ULS(M, ψ)
Input: M ∈ R^{m×T}, SCL ψ with dom(ψ) = R;
Let α₁ ← 0; Let w₀ ← ∇⁻¹_ψ̃(0)1;
for j = 1, 2, ..., J do
  [WU] (weight update) wⱼ ← (Mαⱼ) ⋄ w₀ ;
  Let Tⱼ ⊆ {1, 2, ..., T}; let δⱼ ← 0;
  [LC] (leveraging coefficients) ∀t ∈ Tⱼ, pick δ_{j,t} such that: Σᵢ₌₁ᵐ m_it((Mδⱼ) ⋄ wⱼ)ᵢ = 0 ;
  Let α_{j+1} ← αⱼ + δⱼ;
Output: H(x) := Σₜ₌₁ᵀ α_{J+1,t}hₜ(x) ∈ LS

Lemma 1 For any SCL ψ, ψ(y∗H) = D_{ψ⋆}(0||∇⁻¹_{ψ⋆}(y∗H)) − ψ⋆(0). Furthermore, for any BCL F_φ, D_φ̄(y||∇⁻¹_φ̄(H)) = b_φF_φ(y∗H) and D_φ̄(y||∇⁻¹_φ̄(H)) = D_φ̄(1||∇⁻¹_φ̄(y∗H)).

The second equality is important because it ties real predictions (right) with [0, 1] predictions (left). It also separates SCL and BCL, as for any ψ in SCL, it can be shown that there exists a function ϕ such that D_ϕ(y||∇⁻¹_ϕ(H)) = ψ(y∗H) iff ψ ∈ BCL. We now focus on the minimization of any SCS. We show that there exists an algorithm, ULS, which fits a linear separator H to the minimization of any SCS ε^ψ_R := Σᵢ ψ(y∗ᵢH(oᵢ)), for any SCL ψ with dom(ψ) = R, in order not to restrict the LS built. To simplify notations, we let:

ψ̃(x) := ψ⋆(−x) .   (13)

With this notation, the first equality in Lemma 1 becomes:

ψ(y∗H) = D_ψ̃(0||∇⁻¹_ψ̃(−y∗H)) − ψ̃(0) .   (14)

We let W := dom(∇ψ̃) = −im(∇ψ), where this latter equality comes from ∇ψ̃(x) = −∇ψ⋆(−x) = −∇⁻¹_ψ(−x). It also comes im(∇ψ̃) = R. Because any BLF is strictly convex in its first argument, we can compute its Legendre conjugate. In fact, we shall essentially need the argument that realizes the supremum: for any x ∈ R, for any p ∈ W, we let:

x ⋄ p := arg sup_{p′∈W} {xp′ − D_ψ̃(p′||p)} .   (15)

We do not make reference to ψ̃ in the ⋄ notation, as it shall be clear from context. We name x ⋄ p the Legendre dual of the ordered pair (x, p), closely following a notation by [6].
The Legendre dual is unique and satisfies:

∇ψ̃(x ⋄ p) = x + ∇ψ̃(p) ,   (16)
∀x, x′ ∈ R, ∀p ∈ W, x ⋄ (x′ ⋄ p) = (x + x′) ⋄ p .   (17)

To state ULS, we follow the setting of [6] and suppose that we have T features hₜ (t = 1, 2, ..., T) known in advance, the problem thus reducing to the computation of the leveraging coefficients. We define the m × T matrix M with:

m_it := −y∗ᵢhₜ(oᵢ) .   (18)

Given a leveraging coefficients vector α ∈ Rᵀ, we get:

−y∗ᵢH(oᵢ) = (Mα)ᵢ .   (19)

We can specialize this setting to classical greedy induction frameworks for LS: in classical boosting, at step j, we would fit a single αₜ [6]; in totally corrective boosting, we would rather fit {αₜ, 1 ≤ t ≤ j} [14]. Intermediate schemes may be used as well for Tⱼ, provided they ensure that, at each step j of the algorithm and for any feature hₜ, it may be chosen at some j′ > j. ULS is displayed in Algorithm 1. In Algorithm 1, notations are vector-based: the Legendre duals are computed component-wise; furthermore, Tⱼ may be chosen according to whichever scheme underlined above. The following Theorem provides a first general convergence property for ULS.

Theorem 1 ULS(M, ψ) converges to a classifier H realizing the minimum of ε^ψ_R.

Proof sketch: In step [WU] in ULS, (17) brings w_{j+1} = (Mα_{j+1}) ⋄ w₀ = (Mδⱼ) ⋄ wⱼ.
After a few derivations involving the choice of δⱼ and step [LC] in ULS, we obtain (with vector notations, BLFs are the sum of the component-wise BLFs):

D_ψ̃(0||w_{j+1}) − D_ψ̃(0||wⱼ) = −D_ψ̃(w_{j+1}||wⱼ) .   (20)

Let A_ψ̃(w_{j+1}, wⱼ) := −D_ψ̃(w_{j+1}||wⱼ), which is just, from (20) and (14), the difference between two successive SCL in Algorithm 1. Thus, A_ψ̃(w_{j+1}, wⱼ) < 0 whenever w_{j+1} ≠ wⱼ. Should we be able to prove that when ULS has converged, w. ∈ Ker M⊤, this would make A_ψ̃(w_{j+1}, wⱼ) an auxiliary function for ULS, which is enough to prove the convergence of ULS towards the optimum [6]. Thus, suppose that w_{j+1} = wⱼ (ULS has converged). Suppose that Tⱼ is a singleton (e.g. the classical boosting scheme). In this case, δⱼ = 0 and so ∀t = 1, 2, ..., T, Σᵢ₌₁ᵐ m_it(0 ⋄ wⱼ)ᵢ = Σᵢ₌₁ᵐ m_it w_{j,i} = 0, i.e. wⱼ⊤M = w_{j+1}⊤M = 0⊤, and wⱼ, w_{j+1} ∈ Ker M⊤. The case of totally corrective boosting is simpler, as after the last iteration we would have w_{J+1} ∈ Ker M⊤. Intermediate choices for Tⱼ ⊂ {1, 2, ..., T} are handled in the same way.

We emphasize the fact that Theorem 1 proves convergence towards the global optimum of ε^ψ_R, regardless of ψ. The optimum is defined by the LS with features in M that realizes the smallest ε^ψ_R. Notice that in practice, it may be a tedious task to satisfy exactly (20), in particular for totally corrective boosting [14].

ULS has the flavor of boosting algorithms, repeatedly modifying a set of weights w over the examples. In fact, this similarity is more than syntactical, as ULS satisfies two first popular algorithmic boosting properties, the first of which being that step [LC] in ULS is equivalent to saying that this LS has zero edge on w_{j+1} [14].
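The loop of Algorithm 1 can be made concrete with a minimal one-feature-per-iteration sketch for the logistic specialization (signature φ_Q, so ψ̃ = φ̄_Q, ∇ψ̃ = logit, W = [0, 1], w₀ = (1/2)1, and the Legendre dual (16) reads x ⋄ p = logit⁻¹(x + logit(p))). The toy data are ours, and bisection stands in for an exact [LC] solve:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

# toy 1-D sample (ours), with two features: h1(o) = o and a bias h2(o) = 1
xs = [-2.0, -1.0, 0.5, 1.5, -0.5, 2.0]
ys = [-1, -1, 1, 1, 1, -1]                      # labels y* in {-1, +1}
m, T = len(xs), 2
feats = [[x, 1.0] for x in xs]                  # h_t(o_i)
M = [[-ys[i] * feats[i][t] for t in range(T)] for i in range(m)]   # (18)

def surrogate(alpha):
    # logistic SCS: sum_i log(1 + exp(-y_i* H(o_i))), with -y_i* H(o_i) = (M alpha)_i
    return sum(math.log(1.0 + math.exp(sum(M[i][t] * alpha[t] for t in range(T))))
               for i in range(m))

alpha = [0.0] * T
w = [0.5] * m                                   # w_0 = (1/2)1: logit vanishes at 1/2
losses = [surrogate(alpha)]
for j in range(30):
    t = j % T                                   # classical boosting: |T_j| = 1
    # [LC]: pick delta with sum_i m_it ((delta m_t) <> w_j)_i = 0, where the
    # Legendre dual is x <> p = sigmoid(x + logit(p))
    def edge(delta):
        return sum(M[i][t] * sigmoid(delta * M[i][t] + logit(w[i])) for i in range(m))
    lo, hi = -50.0, 50.0                        # edge is increasing in delta
    for _ in range(80):                         # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if edge(mid) > 0.0 else (mid, hi)
    alpha[t] += 0.5 * (lo + hi)
    # [WU]: w_{j+1} = (M alpha_{j+1}) <> w_0, i.e. w_i = sigmoid(-y_i* H(o_i))
    w = [sigmoid(sum(M[i][u] * alpha[u] for u in range(T))) for i in range(m)]
    losses.append(surrogate(alpha))
```

Consistent with (20), each [LC]/[WU] round can only decrease the surrogate, so `losses` is non-increasing; with the exponential loss instead (W = R₊, x ⋄ p = p·exp(x)), the same loop recovers real AdaBoost with unnormalized weights.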
The following Lemma shows that this edge condition is sound.

Lemma 2 Suppose that there does not exist some hₜ with all m_it of the same sign, ∀i = 1, 2, ..., m. Then, for any choice of Tⱼ, step [LC] in ULS always has a finite solution.

Proof: Let:

Z := D_ψ̃(0||(Mα_{j+1}) ⋄ w₀) .   (21)

We have Z = mψ̃(0) + Σᵢ₌₁ᵐ ψ(−(M(δⱼ + αⱼ))ᵢ) from (14), a function convex in all leveraging coefficients. Define the |Tⱼ| × |Tⱼ| matrix E with e_uv := ∂²Z/(∂δ_{j,u}∂δ_{j,v}) (for the sake of simplicity, Tⱼ = {1, 2, ..., |Tⱼ|}, where |.| denotes the cardinal). We have e_uv = Σᵢ₌₁ᵐ m_iu m_iv/ϕ(((Mδⱼ) ⋄ wⱼ)ᵢ), with ϕ(x) := d²ψ̃(x)/dx² a function strictly positive on int(W) since ψ̃ is strictly convex. Let q_{i,j} := 1/ϕ(((Mδⱼ) ⋄ wⱼ)ᵢ) > 0. It is easy to show that x⊤Ex = Σᵢ₌₁ᵐ q_{i,j}⟨x, m̃ᵢ⟩² ≥ 0, ∀x ∈ R^{|Tⱼ|}, with m̃ᵢ ∈ R^{|Tⱼ|} the vector with m̃_it := m_it. Thus, E is positive semidefinite; as such, step [LC] in ULS, which is the same as solving ∂Z/∂δ_{j,u} = 0, ∀u ∈ Tⱼ (i.e. minimizing Z), always has a solution.

The condition for the Lemma to work is absolutely not restrictive, as if such an hₜ were to exist, we would not need to run ULS: indeed, we would have either ε0/1(S, hₜ) = 0, or ε0/1(S, −hₜ) = 0. The second property met by ULS is illustrated in the second example below.

We give two examples of specializations of ULS. Take for example ψ(x) = exp(−x) (5). In this case, W = R₊, w₀ = 1 and it is not hard to see that ULS matches real AdaBoost with unnormalized weights [13]. The difference is syntactical: the LS output by ULS and real AdaBoost are the same. Now, take any BCL.
In this case, ψ̃ = φ̄, W = [0, 1] (scaling issues underlined for the logit in Section 2 make it desirable to close W), and w₀ = (1/2)1. In all these cases, where W ⊆ R₊, wⱼ is always a distribution up to a normalization factor, and this would also be the case for any strictly monotonous SCS ψ. The BCL case brings an appealing display of how the weights behave. Figure 2 displays a typical Legendre dual for a BCL. Consider example (oᵢ, yᵢ), and its weight update, w_{j,i} ← (Mαⱼ)ᵢ ⋄ w_{0,i} = (−y∗ᵢH(oᵢ)) ⋄ w_{0,i} for the current classifier H. Fix p = w_{0,i} and x = −y∗ᵢH(oᵢ) in Figure 2. We see that the new weight of the example gets larger iff x > 0, i.e. iff the example is given the wrong class by H, which is the second boosting property met by ULS.

Figure 2: A typical ∇φ̄ (strictly increasing, symmetric with respect to the point (1/2, 0)), with the Legendre dual x ⋄ p computed from x and p.

ULS turns out to meet a third boosting property, and the most important as it contributes to root the algorithm in the seminal boosting theory of the early nineties: we have guarantees on its convergence rate under a generalization of the well-known "Weak Learning Assumption" (WLA) [13]. To state the WLA, we plug the iteration in the index of the distribution normalization coefficient in (21), and define Zⱼ := ||wⱼ||₁ (||.||ₖ is the Lₖ norm). The WLA is:

(WLA) ∀j, ∃γⱼ > 0 : |(1/|Tⱼ|) Σ_{t∈Tⱼ} (1/Zⱼ) Σᵢ₌₁ᵐ m_it w_{j,i}| ≥ γⱼ .   (22)

This is indeed a generalization of the usual WLA for boosting algorithms, which we obtain taking |Tⱼ| = 1, hₜ ∈ {−1, +1} [12]. Few algorithms are known that formally boost the WLA, in the sense that requiring only the WLA implies guaranteed rates for the minimization of ε^ψ_R.
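For a single {−1, +1}-valued feature, the quantity inside |.| in (22) reduces to the familiar weak-learning edge |1 − 2ε_w|, where ε_w is the weighted error; a quick check on made-up labels, feature values and weights (all ours):

```python
# made-up example: one {-1,+1}-valued feature h_t and positive example weights
ys = [1, 1, -1, 1, -1]                    # labels y*
hs = [1, -1, -1, -1, -1]                  # feature values h_t(o_i)
ws = [0.1, 0.3, 0.2, 0.25, 0.15]          # weights w_j,i; here Z_j = 1

Z = sum(ws)
# with |T_j| = 1, (22) reads |(1/Z_j) sum_i m_it w_j,i| and m_it = -y_i* h_t(o_i)
edge = abs(sum(-y * h * w for y, h, w in zip(ys, hs, ws))) / Z
err = sum(w for y, h, w in zip(ys, hs, ws) if h != y) / Z   # weighted error
assert abs(edge - abs(2.0 * err - 1.0)) < 1e-12
```

So requiring γⱼ > 0 in (22) is exactly requiring the weak hypothesis to do better than random guessing under the current weights, as in the usual boosting WLA.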
We show that ULS meets this property ∀ψ ∈ SCL. To state this, we need a few more definitions. Let mₜ denote the tᵗʰ column vector of M, a_m := maxₜ ||mₜ||₂ and a_Z := minⱼ Zⱼ. Let a_γ denote the average of the γⱼ (∀j), and a_ϕ := min_{x∈int(W)} ϕ(x) (ϕ defined in the proof of Lemma 2).

Theorem 2 Under the WLA, ULS terminates in at most J = O(ma_m²/(a_ϕa_Z²a_γ²)) iterations.

Proof sketch: We use Taylor expansions with Lagrange remainder for ψ̃, and then the mean-value theorem, and obtain that ∀w, w + ∆ ∈ W, ∃w⋆ ∈ [min{w + ∆, w}, max{w + ∆, w}] such that D_ψ̃(w + ∆||w) = ∆²ϕ(w⋆)/2 ≥ (∆²/2)a_ϕ ≥ 0. We use m times this inequality with w = w_{j,i} and ∆ = (w_{j+1,i} − w_{j,i}), sum the inequalities, combine with the Cauchy-Schwartz and Jensen inequalities, and obtain:

D_ψ̃(w_{j+1}||wⱼ) ≥ a_ϕ(a_Zγⱼ/(2a_m))² .   (23)

Using (20), we obtain that D_ψ̃(0||w_{J+1}) − mψ̃(0) equals:

−mψ̃(0) + D_ψ̃(0||w₁) + Σⱼ₌₁ᴶ (D_ψ̃(0||w_{j+1}) − D_ψ̃(0||wⱼ)) = mψ(0) − Σⱼ₌₁ᴶ D_ψ̃(w_{j+1}||wⱼ) .   (24)

But, (14) together with the definition of wⱼ in [WU] (see ULS) yields D_ψ̃(0||w_{J+1,i}) = ψ̃(0) + ψ(y∗ᵢH(oᵢ)), ∀i = 1, 2, ..., m, which ties up the SCS to (24); the guaranteed decrease in the rhs of (24) by (23) makes that there remains to check when the rhs becomes negative to conclude that ULS has terminated.
This gives the bound of the Theorem.

The bound in Theorem 2 is mainly useful to prove that the WLA guarantees a convergence rate of order O(m/a_γ²) for ULS, but it is not the best possible, as it is in some cases far from being optimal.

5 ULS, BCL, maximum likelihood and zero-sum games

BCL matches, through the second equality in Lemma 1, the set of losses that satisfy the main requirements about losses used in machine learning. This is a strong rationale for its use. Suppose im(H) ⊆ [0, 1], and consider the following requirements about some loss ℓ[0,1](y, H):

(R1) The loss is lower-bounded: ∃z ∈ R such that inf_{y,H} ℓ[0,1](y, H) = z.
(R2) The loss is a proper scoring rule. Consider a singleton domain O = {o}. Then, the best (constant) prediction is arg min_{x∈[0,1]} ε[0,1](S, x) = p := P̂r[c = c+|o] ∈ [0, 1], where p is the relative proportion of positive examples with observation o.
(R3) The loss is symmetric in the following sense: ℓ[0,1](y, H) = ℓ[0,1](1 − y, 1 − H).

R1 is standard. For R2, we can write ε[0,1](S, x) = pℓ[0,1](1, x) + (1 − p)ℓ[0,1](0, x) = L(p, x), which is just the expected loss of the zero-sum games used in [9] (eq. (8)) with Nature states reduced to the class labels. The fact that the minimum is achieved at x = p makes the loss a proper scoring rule. R3 implies ℓ[0,1](1, 1) = ℓ[0,1](0, 0), which is virtually assumed for any domain; otherwise, it scales to H ∈ [0, 1] a well-known symmetry in the cost matrix that holds for domains without class-dependent misclassification costs. For these domains indeed, it is assumed ℓ[0,1](1, 0) = ℓ[0,1](0, 1). Finally, we say that loss ℓ[0,1] is properly defined iff dom(ℓ[0,1]) = [0, 1]² and it is twice differentiable on (0, 1)².
This is only a technical convenience: even the 0/1 loss coincides on {0, 1} with properly defined losses. In addition, the differentiability condition would be satisfied by many popular losses. The proof of the following Lemma involves Theorem 3 in [1] and additional facts to handle R3.

Lemma 3 Assume im(H) ⊆ [0, 1]. Loss ℓ[0,1](., .) is properly defined and meets requirements R1, R2, R3 iff ℓ[0,1](y, H) = z + D_φ̄(y||H) for some permissible φ.

Thus, φ may be viewed as the "signature" of the loss. The second equality in Lemma 1 makes a tight connection between the predictions of H in [0, 1] and R. Let us make it more formal: the matching [0, 1] prediction for some H with im(H) = O is:

P̂r_φ[c = c+|H; o] := ∇⁻¹_φ̄(H(o)) .   (25)

With this definition, illustrated in Table 1, Lemma 3 and the second equality in Lemma 1 show that BCL matches the set of losses of Lemma 3. This definition also brings out the true nature of the minimization of any BCS with real-valued hypotheses like linear separators (in ULS). From Lemma 3 and [2], there exists a bijection between BCL and a subclass of the exponential families whose members' pdfs may be written as: Pr_φ[y|θ] = exp(−D_φ̄(y||∇⁻¹_φ̄(θ)) + φ̄(y) − ν(y)), where θ ∈ R is the natural parameter and ν(.) is used for normalization. Plugging θ = H(o), using (25) and the second equality in Lemma 1, we obtain that any BCS can be rewritten as ε^φ_R = U + Σᵢ −log Pr_φ[yᵢ|H(oᵢ)], where U does not play a role in its minimization.
We obtain the following Lemma, in which we suppose im(H) = O.

Lemma 4 Minimizing any BCS with classifier H yields the maximum likelihood estimation, for each observation, of the natural parameter θ = H(o) of an exponential family defined by signature φ.

In fact, one exponential family is concerned in fine. To see this, we can factor the pdf as Pr[y|θ] := exp(θλ(y) − ψ(θ))/z, with ψ = φ̄⋆ the cumulant function, λ(y) the sufficient statistic and z the normalization function. Since y ∈ {0, 1}, we easily end up with Pr_φ[y = 1|θ] = 1/(1 + exp(−θ)), the logistic prediction for a Bernoulli prior. To summarize, minimizing any loss that meets R1, R2 and R3 (i.e. any BCL) amounts to the same ultimate goal; since ULS works for any of the corresponding surrogate risks, the crux of the choice of the BCL relies on data-dependent considerations.

Finally, we can go further in the parallel with game theory developed above for R2: using the notations of [9], the loss function of the decision maker can be written L(X, q) = D_φ̄(1||q(X)). R3 makes it easy to recover losses like the log loss or the Brier score [9] respectively from φ_Q and φ_B (Table 1). In this sense, ULS is also a sound learner for decision making in the zero-sum game of [9]. Notice however that, to work, it requires that Nature has a restricted sample space size ({0, 1}).

6 Experiments

We have compared against each other 11 flavors of ULS, including real AdaBoost [13], on a benchmark of 52 domains (49 from the UCI repository). True risks are estimated via stratified 10-fold cross-validation; ULS is run for r (fixed) features hₜ, each of which is a Boolean rule: If Monomial then Class = +1 else Class = −1, with at most l (fixed) literals, induced following the greedy minimization of the BCS at hand.
Leveraging coefficients ([LC] in ULS) are approximated up to 10⁻¹⁰ precision. Figure 3 summarizes the results for two values of the couple (l, r). Histograms are ordered from left to right in increasing average true risk over all domains (shown below the histograms). The italic numbers give, for each algorithm, the number of algorithms it beats according to a Student paired t-test over all domains with .1 threshold probability. Out of the 10 flavors of ULS, the first four pick φ in Table 1. The fifth uses another permissible function: φ_υ(x) := (x(1 − x))^υ, ∀υ ∈ (0, 1). The last five adaptively tune the BCS at hand out of a bag of BCS. The first four of these fit the BCS at each stage of the inner loop (for j ...) of ULS. Two (noted "F.") pick the BCS which minimizes the empirical risk in the bag; two others (noted "E.") pick the BCS which maximizes the current edge. There are two different bags corresponding to four permissible functions each: the first (index "1") contains the φ in Table 1, the second (index "2") replaces φ_B by φ_υ. We wanted to evaluate φ_B because it forces renormalizing the leveraging coefficients in H each time it is selected, to ensure that the output of H lies in [−1, 1].
The last adaptive flavor, F∗, "externalizes" the choice of the BCS: it selects, for each fold, the BCS which yields the smallest empirical risk in a bag corresponding to five φ: those of Table 1 plus φυ.

[Figure 3 histograms omitted; below each histogram, the average true risk over all domains, with the number of algorithms beaten in parentheses:]

l = 2, r = 10: F∗ 14.18 (10); φM 14.70 (5); φυ 14.71 (3); φµ 14.83 (2); F2 15.03 (1); φQ 15.06 (1); E1 15.22 (1); φB 15.25 (1); AdaBoost 15.35 (1); E2 15.36 (1); F1 17.37 (0).

l = 3, r = 100: F∗ 12.15 (10); φQ 12.39 (3); AdaBoost 12.56 (3); φM 12.59 (3); φB 12.62 (3); E2 12.63 (3); φυ 12.74 (2); φµ 12.79 (2); F2 13.10 (2); F1 17.57 (1); E1 23.60 (0).

Figure 3: Summary of our results over the 52 domains for the 11 algorithms (top: l = 2, r = 10; bottom: l = 3, r = 100). Vertical (red) bars show the average rank over all domains (see text).

Three main conclusions emerge from Figure 3. First, F∗ appears superior to all other approaches, while the slightly more sophisticated choices for the SCS (i.e. E., F.) fail to improve the results; this strongly advocates a dedicated treatment of this surrogate tuning problem. Second, Matsushita's BCL, built from φM, appears to be a serious alternative to the logistic loss.
Third and last, a remark previously made by [10] for decision trees seems to hold as well for linear separators: stronger concave regimes for φ in BCLs tend to improve performance, at least for small r.

Conclusion

In this paper, we have shown the existence of a supervised learning algorithm which minimizes any strictly convex, differentiable classification calibrated surrogate [3], inducing linear separators. Since the surrogate is now an input of the algorithm, along with the learning sample, this opens the interesting problem of tuning the surrogate to the data at hand so as to further reduce the true risk. While the strategies we have experimentally tested are, in this respect, a simple primer for possible solutions, they already display the potential and the non-triviality of such solutions.

References

[1] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. on Information Theory, 51:2664–2669, 2005.

[2] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

[3] P. Bartlett, M. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the Am. Stat. Assoc., 101:138–156, 2006.

[4] P. Bartlett and M. Traskin. AdaBoost is consistent. In NIPS*19, 2006.

[5] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200–217, 1967.

[6] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. In COLT'00, pages 158–169, 2000.

[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Ann. of Stat., 28:337–374, 2000.

[8] C. Gentile and M. Warmuth.
Linear hinge loss and average margin. In NIPS*11, pages 225–231, 1998.

[9] P. Grünwald and P. Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. of Statistics, 32:1367–1433, 2004.

[10] M. J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. Journal of Comp. Syst. Sci., 58:109–128, 1999.

[11] K. Matsushita. Decision rule, based on distance, for the classification problem. Ann. of the Inst. for Stat. Math., 8:67–77, 1956.

[12] R. Nock and F. Nielsen. A real generalization of discrete AdaBoost. Artif. Intell., 171:25–41, 2007.

[13] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In COLT'98, pages 80–91, 1998.

[14] M. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In ICML'06, pages 1001–1008, 2006.