{"title": "On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost", "book": "Advances in Neural Information Processing Systems", "page_first": 1049, "page_last": 1056, "abstract": "The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations.", "full_text": "On the Design of Loss Functions for Classi\ufb01cation:\n\ntheory, robustness to outliers, and SavageBoost\n\nHamed Masnadi-Shirazi\n\nStatistical Visual Computing Laboratory,\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92039\n\nhmasnadi@ucsd.edu\n\nAbstract\n\nNuno Vasconcelos\n\nStatistical Visual Computing Laboratory,\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92039\nnuno@ucsd.edu\n\nThe machine learning problem of classi\ufb01er design is studied from the perspective\nof probability elicitation, in statistics. 
This shows that the standard approach of proceeding from the speci\ufb01cation of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the speci\ufb01cation of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classi\ufb01cation problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classi\ufb01er design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations.\n\n1 Introduction\n\nThe binary classi\ufb01cation of examples x is usually performed with recourse to the mapping \u02c6y = sign[f (x)], where f is a function from a pre-de\ufb01ned class F, and \u02c6y the predicted class label. Most state-of-the-art classi\ufb01er design algorithms, including SVMs, boosting, and logistic regression, determine the optimal function f \u2217 by a three step procedure: 1) de\ufb01ne a loss function \u03c6(yf (x)), where y is the class label of x, 2) select a function class F, and 3) search within F for the function f \u2217 which minimizes the expected value of the loss, known as the risk. Although tremendously successful, these methods have been known to suffer from some limitations, such as slow convergence, or too much sensitivity to the presence of outliers in the data [1, 2]. Such limitations can be attributed to the loss functions \u03c6(\u00b7) on which the algorithms are based. 
These are convex\nbounds on the so-called 0-1 loss, which produces classi\ufb01ers of minimum probability of error, but is\ntoo dif\ufb01cult to handle from a computational point of view.\n\nIn this work, we analyze the problem of classi\ufb01er design from a different perspective, that has long\nbeen used to study the problem of probability elicitation, in the statistics literature. We show that the\ntwo problems are identical, and probability elicitation can be seen as a reverse procedure for solving\nthe classi\ufb01cation problem: 1) de\ufb01ne the functional form of expected elicitation loss, 2) select a\nfunction class F, and 3) derive a loss function \u03c6. Both probability elicitation and classi\ufb01er design\nreduce to the problem of minimizing a Bregman divergence. We derive equivalence results, which\nallow the representation of the classi\ufb01er design procedures in \u201cprobability elicitation form\u201d, and the\nrepresentation of the probability elicitation procedures in \u201cmachine learning form\u201d. This equivalence\nis useful in two ways. From the elicitation point of view, the risk functions used in machine learning\ncan be used as new elicitation losses. From the machine learning point of view, new insights on the\nrelationship between loss \u03c6, optimal function f \u2217, and minimum risk are obtained. 
In particular, it is shown that the classical progression from loss to risk is overly restrictive: once a loss \u03c6 is speci\ufb01ed, both the optimal f \u2217 and the functional form of the minimum risk are immediately pinned down. This is, however, not the case for the reverse progression: it is shown that any functional form of the minimum conditional risk, which satis\ufb01es some mild constraints, supports many (\u03c6, f \u2217) pairs. Hence, once the risk is selected, one degree of freedom remains: by selecting a class of f \u2217, it is possible to tailor the loss \u03c6, so as to guarantee classi\ufb01ers with desirable traits. In addition to this, the elicitation view reveals that the machine learning emphasis on convex losses \u03c6 is misguided. In particular, it is shown that what matters is the convexity of the minimum conditional risk. Once a functional form is selected for this quantity, the convexity of the loss \u03c6 does not affect the convexity of the Bregman divergence to be optimized.\n\nThese results suggest that many new loss functions can be derived for classi\ufb01er design. We illustrate this by deriving a new loss that trades convexity for boundedness. Unlike all previous \u03c6, the one now proposed remains constant for strongly negative values of its argument. This is akin to robust loss functions proposed in the statistics literature to reduce the impact of outliers. We derive a new boosting algorithm, denoted SavageBoost, by combining the new loss with the procedure used by Friedman to derive RealBoost [3]. Experimental results show that the new boosting algorithm is indeed more outlier resistant than classical methods, such as AdaBoost, RealBoost, and LogitBoost.\n\n2 Classi\ufb01cation and risk minimization\n\nA classi\ufb01er is a mapping g : X \u2192 {\u22121, 1} that assigns a class label y \u2208 {\u22121, 1} to a feature vector x \u2208 X , where X is some feature space. 
If feature vectors are drawn with probability density PX(x), PY (y) is the probability distribution of the labels y \u2208 {\u22121, 1}, and L(x, y) a loss function, the classi\ufb01cation risk is R(f ) = EX,Y [L(g(x), y)]. Under the 0-1 loss, L0/1(x, y) = 1 if g(x) \u2260 y and 0 otherwise, this risk is the expected probability of classi\ufb01cation error, and is well known to be minimized by the Bayes decision rule. Denoting by \u03b7(x) = PY |X(1|x) this can be written as\n\ng\u2217(x) = sign[2\u03b7(x) \u2212 1].\n\n(1)\n\nClassi\ufb01ers are usually implemented with mappings of the form g(x) = sign[f (x)], where f is some mapping from X to R. The minimization of the 0-1 loss requires that\n\nsign[f \u2217(x)] = sign[2\u03b7(x) \u2212 1], \u2200x.\n\n(2)\n\nWhen the classes are separable, any f (x) such that yf (x) \u2265 0, \u2200x has zero classi\ufb01cation error. The 0-1 loss can be written as a function of this quantity\n\nL0/1(x, y) = \u03c60/1[yf (x)] = sign[\u2212yf (x)].\n\nThis motivates the minimization of the expected value of this loss as a goal for machine learning. However, this minimization is usually dif\ufb01cult. Many algorithms have been proposed to minimize alternative risks, based on convex upper-bounds of the 0-1 loss. These risks are of the form\n\nR\u03c6(f ) = EX,Y [\u03c6(yf (x))],\n\n(3)\n\nwhere \u03c6(\u00b7) is a convex upper bound of \u03c60/1(\u00b7). Some examples of \u03c6(\u00b7) functions in the literature are given in Table 1. Since these functions are non-negative, the risk is minimized by minimizing the conditional risk EY |X[\u03c6(yf (x))|X = x] for every x \u2208 X .\n\nTable 1: Machine learning algorithms progress from loss \u03c6, to inverse link function f \u2217\u03c6(\u03b7), and minimum conditional risk C \u2217\u03c6(\u03b7).\n\nAlgorithm | \u03c6(v) | f \u2217\u03c6(\u03b7) | C \u2217\u03c6(\u03b7)\nLeast squares | (1 \u2212 v)^2 | 2\u03b7 \u2212 1 | 4\u03b7(1 \u2212 \u03b7)\nModified LS | max(1 \u2212 v, 0)^2 | 2\u03b7 \u2212 1 | 4\u03b7(1 \u2212 \u03b7)\nSVM | max(1 \u2212 v, 0) | sign(2\u03b7 \u2212 1) | 1 \u2212 |2\u03b7 \u2212 1|\nBoosting | exp(\u2212v) | (1/2) log(\u03b7/(1 \u2212 \u03b7)) | 2\u221a(\u03b7(1 \u2212 \u03b7))\nLogistic Regression | log(1 + e^(\u2212v)) | log(\u03b7/(1 \u2212 \u03b7)) | \u2212\u03b7 log \u03b7 \u2212 (1 \u2212 \u03b7) log(1 \u2212 \u03b7)\n\nThis conditional risk can be written as\n\nC\u03c6(\u03b7, f ) = \u03b7\u03c6(f ) + (1 \u2212 \u03b7)\u03c6(\u2212f ),\n\n(4)\n\nwhere we have omitted the dependence of \u03b7 and f on x for notational convenience. Various authors have shown that, for the \u03c6(\u00b7) of Table 1, the function f \u2217\u03c6 which minimizes (4),\n\nf \u2217\u03c6(\u03b7) = arg min_f C\u03c6(\u03b7, f ),\n\n(5)\n\nsatis\ufb01es (2) [3, 4, 5]. These functions are also presented in Table 1. It can, in fact, be shown that (2) holds for any f \u2217\u03c6(\u00b7) which minimizes (4) whenever \u03c6(\u00b7) is convex, differentiable at the origin, and has derivative \u03c6\u2032(0) < 0 [5].\n\nWhile learning algorithms based on the minimization of (4), such as SVMs, boosting, or logistic regression, can perform quite well, they are known to be overly sensitive to outliers [1, 2]. These are points for which yf (x) < 0. As can be seen from Figure 1, the sensitivity stems from the large (in\ufb01nite) weight given to these points by the \u03c6(\u00b7) functions when yf (x) \u2192 \u2212\u221e. In this work, we show that this problem can be eliminated by allowing non-convex \u03c6(\u00b7). 
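The f \u2217\u03c6 and C \u2217\u03c6 entries of Table 1 can be checked by direct numerical minimization of the conditional risk (4). A minimal sketch for the boosting (exponential) loss, assuming NumPy is available and using a brute-force grid purely for illustration:

```python
import numpy as np

eta = 0.8                                   # class posterior P(y=1|x)
phi = lambda v: np.exp(-v)                  # boosting loss, Table 1

# conditional risk C_phi(eta, f) = eta*phi(f) + (1-eta)*phi(-f), eq. (4)
v = np.linspace(-5.0, 5.0, 200_001)
risk = eta * phi(v) + (1.0 - eta) * phi(-v)

f_star = v[np.argmin(risk)]                 # numerical minimizer
C_star = risk.min()                         # minimum conditional risk

# Table 1 closed forms: f* = (1/2) log(eta/(1-eta)), C* = 2 sqrt(eta(1-eta))
assert abs(f_star - 0.5 * np.log(eta / (1.0 - eta))) < 1e-3
assert abs(C_star - 2.0 * np.sqrt(eta * (1.0 - eta))) < 1e-6
```

The same grid search reproduces the other rows of the table by swapping in the corresponding `phi`.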
This may, at \ufb01rst thought, seem like a bad idea, given the widely held belief that the success of the aforementioned algorithms is precisely due to the convexity of these functions. We will see, however, that the convexity of \u03c6(\u00b7) is not important. What really matters is the fact, noted by [4], that the minimum conditional risk\n\nC \u2217\u03c6(\u03b7) = inf_f C\u03c6(\u03b7, f ) = C\u03c6(\u03b7, f \u2217\u03c6)\n\n(6)\n\nsatis\ufb01es two properties. First, it is a concave function of \u03b7 (\u03b7 \u2208 [0, 1])1. Second, if f \u2217\u03c6 is differentiable, then C \u2217\u03c6(\u03b7) is differentiable and, for any pair (v, \u02c6\u03b7) such that v = f \u2217\u03c6(\u02c6\u03b7),\n\nC\u03c6(\u03b7, v) \u2212 C \u2217\u03c6(\u03b7) = B\u2212C\u2217\u03c6(\u03b7, \u02c6\u03b7),\n\n(7)\n\nwhere\n\nBF (\u03b7, \u02c6\u03b7) = F (\u03b7) \u2212 F (\u02c6\u03b7) \u2212 (\u03b7 \u2212 \u02c6\u03b7)F \u2032(\u02c6\u03b7)\n\n(8)\n\nis the Bregman divergence of the convex function F . The second property provides an interesting interpretation of the learning algorithms as methods for the estimation of the class posterior probability \u03b7(x): the search for the f (x) which minimizes (4) is equivalent to a search for the probability estimate \u02c6\u03b7(x) which minimizes (7). This raises the question of whether minimizing a cost of the form of (4) is the best way to elicit the posterior probability \u03b7(x).\n\n3 Probability elicitation\n\nThis question has been extensively studied in statistics. 
In particular, Savage studied the problem of designing reward functions that encourage probability forecasters to make accurate predictions [6]. The problem is formulated as follows.\n\n\u2022 let I1(\u02c6\u03b7) be the reward for the prediction \u02c6\u03b7 when the event y = 1 holds.\n\u2022 let I\u22121(\u02c6\u03b7) be the reward for the prediction \u02c6\u03b7 when the event y = \u22121 holds.\n\nThe expected reward is\n\nI(\u03b7, \u02c6\u03b7) = \u03b7I1(\u02c6\u03b7) + (1 \u2212 \u03b7)I\u22121(\u02c6\u03b7).\n\n(9)\n\nSavage asked the question of which functions I1(\u00b7), I\u22121(\u00b7) make the expected reward maximal when \u02c6\u03b7 = \u03b7, \u2200\u03b7. These are the functions such that\n\nI(\u03b7, \u02c6\u03b7) \u2264 I(\u03b7, \u03b7) = J(\u03b7), \u2200\u03b7,\n\n(10)\n\nwith equality if and only if \u02c6\u03b7 = \u03b7. Using the linearity of I(\u03b7, \u02c6\u03b7) on \u03b7, and the fact that J(\u03b7) is supported by I(\u03b7, \u02c6\u03b7) at, and only at, \u03b7 = \u02c6\u03b7, this implies that J(\u03b7) is strictly convex [6, 7]. Savage then showed that (10) holds if and only if\n\nI1(\u03b7) = J(\u03b7) + (1 \u2212 \u03b7)J \u2032(\u03b7)\n\n(11)\n\nI\u22121(\u03b7) = J(\u03b7) \u2212 \u03b7J \u2032(\u03b7).\n\n(12)\n\nDe\ufb01ning the loss of the prediction of \u03b7 by \u02c6\u03b7 as the difference to the maximum reward\n\nL(\u03b7, \u02c6\u03b7) = I(\u03b7, \u03b7) \u2212 I(\u03b7, \u02c6\u03b7)\n\n1Here, and throughout the paper, we omit the dependence of \u03b7 on x, whenever we are referring to functions of \u03b7, i.e. mappings whose range is [0, 1].\n\nTable 2: Probability elicitation form for various machine learning algorithms, and Savage\u2019s procedure. 
In Savage 1 and 2, m\u2032 = m + k.\n\nAlgorithm | I1(\u03b7) | I\u22121(\u03b7) | J(\u03b7)\nLeast squares | \u22124(1 \u2212 \u03b7)^2 | \u22124\u03b7^2 | \u22124\u03b7(1 \u2212 \u03b7)\nModified LS | \u22124(1 \u2212 \u03b7)^2 | \u22124\u03b7^2 | \u22124\u03b7(1 \u2212 \u03b7)\nSVM | sign[2\u03b7 \u2212 1] \u2212 1 | \u2212sign[2\u03b7 \u2212 1] \u2212 1 | |2\u03b7 \u2212 1| \u2212 1\nBoosting | \u2212\u221a((1 \u2212 \u03b7)/\u03b7) | \u2212\u221a(\u03b7/(1 \u2212 \u03b7)) | \u22122\u221a(\u03b7(1 \u2212 \u03b7))\nLog. Regression | log \u03b7 | log(1 \u2212 \u03b7) | \u03b7 log \u03b7 + (1 \u2212 \u03b7) log(1 \u2212 \u03b7)\nSavage 1 | \u2212k(1 \u2212 \u03b7)^2 + m\u2032 + l | \u2212k\u03b7^2 + m | k\u03b7^2 + l\u03b7 + m\nSavage 2 | \u2212k(1/\u03b7 + log \u03b7) + m\u2032 + l | \u2212k log \u03b7 + m\u2032 | m + l\u03b7 \u2212 k log \u03b7\n\nit follows that\n\nL(\u03b7, \u02c6\u03b7) = BJ (\u03b7, \u02c6\u03b7),\n\n(13)\n\ni.e. the loss is the Bregman divergence of J. Hence, for any probability \u03b7, the best prediction \u02c6\u03b7 is the one of minimum Bregman divergence with \u03b7. Savage went on to investigate which functions J(\u03b7) are admissible. He showed that for losses of the form L(\u03b7, \u02c6\u03b7) = H(h(\u03b7) \u2212 h(\u02c6\u03b7)), with H(0) = 0 and H(v) > 0, v \u2260 0, and h(v) any function, only two cases are possible. In the first h(v) = v, i.e. the loss only depends on the difference \u03b7 \u2212 \u02c6\u03b7, and the admissible J are\n\nJ1(\u03b7) = k\u03b7^2 + l\u03b7 + m,\n\n(14)\n\nfor some integers (k, l, m). In the second h(v) = log(v), i.e. the loss only depends on the ratio \u03b7/\u02c6\u03b7, and the admissible J are of the form\n\nJ2(\u03b7) = m + l\u03b7 \u2212 k log \u03b7.\n\n(15)\n\n4 Classi\ufb01cation vs. probability elicitation\n\nThe discussion above shows that the optimization carried out by the learning algorithms is identical to Savage\u2019s procedure for probability elicitation. 
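This equivalence can be checked numerically: by (7) and (13), the excess conditional risk of predicting \u02c6\u03b7 in place of the optimum equals the Bregman divergence of J = \u2212C \u2217\u03c6. A small sketch (assuming NumPy is available) for the boosting row of the tables, with arbitrary test values:

```python
import numpy as np

# J = -C*_phi for the boosting (exponential) loss, from Table 2
def J(eta):
    return -2.0 * np.sqrt(eta * (1.0 - eta))

def J_prime(eta):
    return -(1.0 - 2.0 * eta) / np.sqrt(eta * (1.0 - eta))

def bregman(F, F_prime, eta, eta_hat):
    # Bregman divergence B_F(eta, eta_hat) of eq. (8)
    return F(eta) - F(eta_hat) - (eta - eta_hat) * F_prime(eta_hat)

eta, eta_hat = 0.7, 0.4                          # arbitrary probabilities
v = 0.5 * np.log(eta_hat / (1.0 - eta_hat))      # v = f*_phi(eta_hat), Table 1

# excess conditional risk C_phi(eta, v) - C*_phi(eta), with phi(v) = exp(-v)
# and C*_phi = -J
excess = eta * np.exp(-v) + (1.0 - eta) * np.exp(v) + J(eta)

# eq. (7)/(13): the excess risk is exactly the Bregman divergence of J
assert abs(excess - bregman(J, J_prime, eta, eta_hat)) < 1e-12
```

The identity holds for any \u03b7, \u02c6\u03b7 in (0, 1), which is what licenses reading the minimization of (4) as probability estimation.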
Both procedures reduce to the search for\n\n\u02c6\u03b7\u2217 = arg min_\u02c6\u03b7 BF (\u03b7, \u02c6\u03b7),\n\n(16)\n\nwhere F (\u03b7) is a convex function. In both cases, this is done indirectly. Savage starts from the speci\ufb01cation of F (\u03b7) = J(\u03b7), from which the conditional rewards I1(\u03b7) and I\u22121(\u03b7) are derived, using (11) and (12). \u02c6\u03b7\u2217 is then found by maximizing the expected reward I(\u03b7, \u02c6\u03b7) of (9) with respect to \u02c6\u03b7. The learning algorithms start from the loss \u03c6(\u00b7). The conditional risk C\u03c6(\u03b7, f ) is then minimized with respect to f, so as to obtain the minimum conditional risk C \u2217\u03c6(\u03b7) and the corresponding f \u2217\u03c6(\u02c6\u03b7). This is identical to solving (16) with F (\u03b7) = \u2212C \u2217\u03c6(\u03b7). Using the relation J(\u03b7) = \u2212C \u2217\u03c6(\u03b7) it is possible to express the learning algorithms in \u201cSavage form\u201d, i.e. as procedures for the maximization of (9), by deriving the conditional reward functions associated with each of the C \u2217\u03c6(\u03b7) in Table 1. This is done with (11) and (12) and the results are shown in Table 2. In all cases I1(\u03b7) = \u2212\u03c6(f \u2217\u03c6(\u03b7)) and I\u22121(\u03b7) = \u2212\u03c6(\u2212f \u2217\u03c6(\u03b7)).\n\nThe opposite question, of whether Savage\u2019s algorithms can be expressed in \u201cmachine learning form\u201d, i.e. as the minimization of (4), is more dif\ufb01cult. It requires that the Ii(\u03b7) satisfy\n\nI1(\u03b7) = \u2212\u03c6(f (\u03b7))\n\n(17)\n\nI\u22121(\u03b7) = \u2212\u03c6(\u2212f (\u03b7))\n\n(18)\n\nfor some f (\u03b7), and therefore constrains J(\u03b7). To understand the relationship between J, \u03c6, and f \u2217\u03c6 it helps to think of the latter as an inverse link function. Or, assuming that f \u2217\u03c6 is invertible, to think of \u03b7 = (f \u2217\u03c6)\u22121(v) as a link function, which maps a real v into a probability \u03b7. Under this interpretation, it is natural to consider link functions which exhibit the following symmetry\n\nf \u22121(\u2212v) = 1 \u2212 f \u22121(v).\n\n(19)\n\nNote that this implies that f \u22121(0) = 1/2, i.e. f maps v = 0 to \u03b7 = 1/2. We refer to such link functions as symmetric, and show that they impose a special symmetry on J(\u03b7).\n\nTheorem 1. Let I1(\u03b7) and I\u22121(\u03b7) be two functions derived from a continuously differentiable function J(\u03b7) according to (11) and (12), and f (\u03b7) be an invertible function which satis\ufb01es (19). Then (17) and (18) hold if and only if\n\nJ(\u03b7) = J(1 \u2212 \u03b7).\n\n(20)\n\nIn this case,\n\n\u03c6(v) = \u2212J[f \u22121(v)] \u2212 (1 \u2212 f \u22121(v))J \u2032[f \u22121(v)].\n\n(21)\n\nThe theorem shows that for any pair J(\u03b7), f (\u03b7), such that J(\u03b7) has the symmetry of (20) and f (\u03b7) the symmetry of (19), the expected reward of (9) can be written in the \u201cmachine learning form\u201d of (4), using (17) and (18) with the \u03c6(v) given by (21). The following corollary specializes this result to the case where J(\u03b7) = \u2212C \u2217\u03c6(\u03b7).\n\nTable 3: Probability elicitation form progresses from minimum conditional risk, and link function (f \u2217\u03c6)\u22121(v), to loss \u03c6. f \u2217\u03c6(\u03b7) is not invertible for the SVM and modified LS methods.\n\nAlgorithm | \u03c6(v) | (f \u2217\u03c6)\u22121(v) | J(\u03b7)\nLeast squares | (1 \u2212 v)^2 | (v + 1)/2 | \u22124\u03b7(1 \u2212 \u03b7)\nModified LS | max(1 \u2212 v, 0)^2 | N/A | \u22124\u03b7(1 \u2212 \u03b7)\nSVM | max(1 \u2212 v, 0) | N/A | |2\u03b7 \u2212 1| \u2212 1\nBoosting | exp(\u2212v) | e^(2v)/(1 + e^(2v)) | \u22122\u221a(\u03b7(1 \u2212 \u03b7))\nLogistic Regression | log(1 + e^(\u2212v)) | e^v/(1 + e^v) | \u03b7 log \u03b7 + (1 \u2212 \u03b7) log(1 \u2212 \u03b7)\n\nCorollary 2. Let I1(\u03b7) and I\u22121(\u03b7) be two functions derived with (11) and (12) from any continuously differentiable J(\u03b7) = \u2212C \u2217\u03c6(\u03b7), such that\n\nC \u2217\u03c6(\u03b7) = C \u2217\u03c6(1 \u2212 \u03b7),\n\n(22)\n\nand f\u03c6(\u03b7) be any invertible function which satis\ufb01es (19). Then\n\nI1(\u03b7) = \u2212\u03c6(f\u03c6(\u03b7))\n\n(23)\n\nI\u22121(\u03b7) = \u2212\u03c6(\u2212f\u03c6(\u03b7))\n\n(24)\n\nwith\n\n\u03c6(v) = C \u2217\u03c6[(f\u03c6)\u22121(v)] + (1 \u2212 (f\u03c6)\u22121(v))(C \u2217\u03c6)\u2032[(f\u03c6)\u22121(v)].\n\n(25)\n\nNote that there could be many pairs \u03c6, f\u03c6 for which the corollary holds2. Selecting a particular f\u03c6 \u201cpins down\u201d \u03c6, according to (25). This is the case of the algorithms in Table 1, for which C \u2217\u03c6(\u03b7) and f \u2217\u03c6 have the symmetries required by the corollary. The link functions associated with these algorithms are presented in Table 3. From these and (25) it is possible to recover \u03c6(v), also shown in the table.\n\n5 New loss functions\n\nThe discussion above provides an integrated picture of the \u201cmachine learning\u201d and \u201cprobability elicitation\u201d view of the classi\ufb01cation problem. Table 1 summarizes the steps of the \u201cmachine learning view\u201d: start from the loss \u03c6(v), and \ufb01nd 1) the inverse link function f \u2217\u03c6(\u03b7) of minimum conditional risk, and 2) the value of this risk C \u2217\u03c6(\u03b7). 
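The loss-recovery formula (25) can be exercised directly in code. A small sketch, assuming NumPy is available; the logistic regression row is used because all quantities have closed forms, and (C \u2217\u03c6)\u2032 is taken by a central difference so that only C \u2217\u03c6 and the link need to be specified:

```python
import numpy as np

def C_star(eta):
    # minimum conditional risk of logistic regression, Table 1
    return -eta * np.log(eta) - (1.0 - eta) * np.log(1.0 - eta)

def f_inv(v):
    # inverse link (f*_phi)^-1(v) = e^v / (1 + e^v), Table 3
    return 1.0 / (1.0 + np.exp(-v))

def phi_from_risk(v, h=1e-6):
    # eq. (25): recover the loss from C* and the link; (C*)' by central difference
    eta_hat = f_inv(v)
    dC = (C_star(eta_hat + h) - C_star(eta_hat - h)) / (2.0 * h)
    return C_star(eta_hat) + (1.0 - eta_hat) * dC

v = np.linspace(-3.0, 3.0, 61)
# (25) should reproduce the logistic loss phi(v) = log(1 + e^-v) of Table 1
assert np.allclose(phi_from_risk(v), np.log(1.0 + np.exp(-v)), atol=1e-6)
```

Swapping in another symmetric C \u2217\u03c6 and link pair yields the corresponding \u03c6 of Table 3 in the same way.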
Table 3 summarizes the steps of the \u201cprobability elicitation view\u201d: start from 1) the expected maximum reward function J(\u03b7) and 2) the link function (f \u2217\u03c6)\u22121(v), and determine the loss function \u03c6(v). If J(\u03b7) = \u2212C \u2217\u03c6(\u03b7), the two procedures are equivalent, since they both reduce to the search for the probability estimate \u02c6\u03b7\u2217 of (16).\nComparing to Table 2, it is clear that the least squares procedures are special cases of Savage 1, with k = \u2212l = 4 and m = 0, and the link function \u03b7 = (v + 1)/2. The constraint k = \u2212l is necessary for (22) to hold, but not the others. For Savage 2, a \u201cmachine learning form\u201d is not possible (at this point), because J(\u03b7) \u2260 J(1 \u2212 \u03b7). We currently do not know if such a form can be derived in cases like this, i.e. where the symmetries of (19) and/or (22) are absent. \n\n2This makes the notation f\u03c6 and C \u2217\u03c6 technically inaccurate. C \u2217f,\u03c6 would be more suitable. We, nevertheless, retain the C \u2217\u03c6 notation for the sake of consistency with the literature.\n\nFigure 1: Loss function \u03c6(v) (left) and minimum conditional risk C \u2217\u03c6(\u03b7) (right) associated with the different methods discussed in the text (curves: Least squares, Modified LS, SVM, Boosting, Logistic Reg., Savage Loss, Zero-One). 
From the probability elicitation point of view, an important contribution of the machine learning research (in addition to the algorithms themselves) has been to identify new J functions, namely those associated with the techniques other than least squares. From the machine learning point of view, the elicitation perspective is interesting because it enables the derivation of new \u03c6 functions.\nThe main observation is that, under the customary speci\ufb01cation of \u03c6, both C \u2217\u03c6(\u03b7) and f \u2217\u03c6(\u03b7) are immediately set, leaving no open degrees of freedom. In fact, the selection of \u03c6 can be seen as the indirect selection of a link function (f \u2217\u03c6)\u22121 and a minimum conditional risk C \u2217\u03c6(\u03b7). The latter is an approximation to the minimum conditional risk of the 0-1 loss, C \u2217\u03c60/1(\u03b7) = 1 \u2212 max(\u03b7, 1 \u2212 \u03b7). The approximations associated with the existing algorithms are shown in Figure 1. The approximation error is smallest for the SVM, followed by least squares, logistic regression, and boosting, but all approximations are comparable. The alternative, suggested by the probability elicitation view, is to start with the selection of the approximation directly. In addition to allowing direct control over the quantity that is usually of interest (the minimum expected risk of the classi\ufb01er), the selection of C \u2217\u03c6(\u03b7) (which is equivalent to the selection of J(\u03b7)) has the added advantage of leaving one degree of freedom open. As stated by Corollary 2 it is further possible to select across \u03c6 functions, by controlling the link function f\u03c6. This allows tailoring detailed properties of the classi\ufb01er, while keeping its expected risk constant.\n\nWe demonstrate this point, by proposing a new loss function \u03c6. 
We start by selecting the minimum conditional risk of least squares (using Savage\u2019s version with k = \u2212l = 1, m = 0) C \u2217\u03c6(\u03b7) = \u03b7(1 \u2212 \u03b7), because it provides the best approximation to the Bayes error, while avoiding the lack of differentiability of the SVM. We next replace the traditional link function of least squares by the logistic link function (classically used with logistic regression) f \u2217\u03c6 = (1/2) log(\u03b7/(1 \u2212 \u03b7)). When used in the context of boosting (LogitBoost [3]), this link function has been found less sensitive to outliers than other variants [8]. We then resort to (25) to \ufb01nd the \u03c6 function, which we denote by Savage loss,\n\n\u03c6(v) = 1/(1 + e^(2v))^2.\n\n(26)\n\nA plot of this function is presented in Figure 1, along with those associated with all the algorithms of Table 1. Note that the proposed loss is very similar to that of least squares in the region where |v| is small (the margin), but quickly becomes constant as v \u2192 \u2212\u221e. This is unlike all other previous \u03c6 functions, and suggests that classi\ufb01ers designed with the new loss should be more robust to outliers.\n\nIt is also interesting to note that the new loss function is not convex, violating what has been a hallmark of the \u03c6 functions used in the literature. The convexity of \u03c6 is, however, not important, a fact that is made clear by the elicitation view. Note that the convexity of the expected reward of (9) only depends on the convexity of the functions I1(\u03b7) and I\u22121(\u03b7). These, in turn, only depend on the choice of J(\u03b7), as shown by (11) and (12). From Corollary 2 it follows that, as long as the symmetries of (22) and (19) hold, and \u03c6 is selected according to (25), the selection of C \u2217\u03c6(\u03b7) completely determines the convexity of the conditional risk of (4). Whether \u03c6 is itself convex does not matter.\n\n6 SavageBoost\n\nWe have hypothesized that classi\ufb01ers designed with (26) should be more robust than those derived from the previous \u03c6 functions. To test this we designed a boosting algorithm based on the new loss, using the procedure proposed by Friedman to derive RealBoost [3]. At each iteration the algorithm searches for the weak learner G(x) which further reduces the conditional risk EY |X[\u03c6(y(f (x) + G(x)))|X = x] of the current f (x), for every x \u2208 X . The optimal weak learner is\n\nG\u2217(x) = arg min_G(x) {\u03b7(x)\u03c6w(G(x)) + (1 \u2212 \u03b7(x))\u03c6w(\u2212G(x))}\n\n(27)\n\nwhere\n\n\u03c6w(yG(x)) = 1/(1 + w(x, y)^2 e^(2yG(x)))^2\n\n(28)\n\nand\n\nw(x, y) = e^(yf (x)).\n\n(29)\n\nThe minimization is by gradient descent. Setting the gradient with respect to G(x) to zero results in\n\nG\u2217(x) = (1/2) log(Pw(y = 1|x)/Pw(y = \u22121|x))\n\n(30)\n\nwhere Pw(y = i|x) are probability estimates obtained from the re-weighted training set. At each iteration the optimal weak learner is found from (30) and reweighting is performed according to (29). We refer to the algorithm as SavageBoost, and summarize it in Algorithm 1.\n\nAlgorithm 1 SavageBoost\nInput: Training set D = {(x1, y1), . . . , (xn, yn)}, where y \u2208 {1, \u22121} is the class label of example x, and number M of weak learners in the \ufb01nal decision rule.\nInitialization: Select uniform weights w(1)_i = 1/|D|, \u2200i.\nfor m = {1, . . . , M } do\ncompute the gradient step Gm(x) with (30).\nupdate weights wi according to w(m+1)_i = w(m)_i \u00d7 e^(yi Gm(xi)).\nend for\nOutput: decision rule h(x) = sgn[\u2211_(m=1..M) Gm(x)].\n\n7 Experimental results\n\nWe compared SavageBoost to AdaBoost [9], RealBoost [3], and LogitBoost [3]. The latter is generally considered more robust to outliers [8] and thus a good candidate for comparison. 
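The closed form (26) produced by this construction, and the saturation that drives the robustness argument, can be checked numerically; a small sketch, assuming NumPy is available:

```python
import numpy as np

def C_star(eta):
    # chosen minimum conditional risk: eta(1 - eta)
    return eta * (1.0 - eta)

def f_inv(v):
    # inverse of the logistic link f* = (1/2) log(eta/(1-eta))
    return 1.0 / (1.0 + np.exp(-2.0 * v))

def phi(v):
    # loss recovered through eq. (25); here (C*)'(eta) = 1 - 2*eta in closed form
    eta_hat = f_inv(v)
    return C_star(eta_hat) + (1.0 - eta_hat) * (1.0 - 2.0 * eta_hat)

v = np.linspace(-20.0, 3.0, 231)
# matches the Savage loss of eq. (26) ...
assert np.allclose(phi(v), 1.0 / (1.0 + np.exp(2.0 * v)) ** 2)
# ... and saturates for strongly negative v (bounded, hence outlier-robust)
assert abs(phi(-20.0) - 1.0) < 1e-12
```

Algebraically, (25) reduces here to \u03c6(v) = (1 \u2212 \u02c6\u03b7)^2 with \u02c6\u03b7 = e^(2v)/(1 + e^(2v)), which is exactly (26).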
Ten binary UCI data sets were used: Pima-diabetes, breast cancer diagnostic, breast cancer prognostic, original Wisconsin breast cancer, liver disorder, sonar, echo-cardiogram, Cleveland heart disease, tic-tac-toe and Haberman\u2019s survival. We followed the training/testing procedure outlined in [2] to explore the robustness of the algorithms to outliers. In all cases, \ufb01ve fold validation was used with varying levels of outlier contamination. Figure 2 shows the average error of the four methods on the Liver-Disorder set. Table 4 shows the number of times each method produced the smallest error (#wins) over the ten data sets at a given contamination level, as well as the average error% over all data sets (at that contamination level). Our results con\ufb01rm previous studies that have noted AdaBoost\u2019s sensitivity to outliers [1]. Among the previous methods AdaBoost indeed performed the worst, followed by RealBoost, with LogitBoost producing the best results. This con\ufb01rms previous reports that LogitBoost is less sensitive to outliers [8]. SavageBoost produced generally better results than Ada and RealBoost at all contamination levels, including 0% contamination. LogitBoost achieves comparable results at low contamination levels (0%, 5%) but has higher error when contamination is signi\ufb01cant. With 40% contamination SavageBoost has 6 wins, compared to 3 for LogitBoost and, on average, about 6% less error. Although, in all experiments, each algorithm was allowed 50 iterations, SavageBoost converged much faster than the others, requiring an average of 25 iterations at 0% contamination. This is in contrast to 50 iterations for LogitBoost and 45 iterations for RealBoost. We attribute fast convergence to the bounded nature of the new loss, which prevents so-called \u201cearly stopping\u201d problems [10]. Fast convergence is, of course, a great bene\ufb01t in terms of the computational ef\ufb01ciency of training and testing. This issue will be studied in greater detail in the future.\n\nFigure 2: Average error for four boosting methods at different contamination levels (curves: Savage Loss (SavageBoost), Exp Loss (RealBoost), Log Loss (LogitBoost), Exp Loss (AdaBoost)).\n\nTable 4: (number of wins, average error%) for each method and outlier percentage.\n\nMethod | 0% outliers | 5% outliers | 40% outliers\nSavage Loss (SavageBoost) | (4, 19.22%) | (4, 19.91%) | (6, 25.9%)\nLog Loss (LogitBoost) | (4, 20.96%) | (4, 22.04%) | (3, 31.73%)\nExp Loss (RealBoost) | (2, 23.99%) | (2, 25.34%) | (0, 33.18%)\nExp Loss (AdaBoost) | (0, 24.58%) | (0, 26.45%) | (1, 38.22%)\n\nReferences\n\n[1] T. G. Dietterich, \u201cAn experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization,\u201d Machine Learning, 2000.\n\n[2] Y. Wu and Y. Liu, \u201cRobust truncated-hinge-loss support vector machines,\u201d JASA, 2007.\n\n[3] J. Friedman, T. Hastie, and R. 
Tibshirani, \u201cAdditive logistic regression: A statistical view of boosting,\u201d Annals of Statistics, 2000.\n\n[4] T. Zhang, \u201cStatistical behavior and consistency of classi\ufb01cation methods based on convex risk minimization,\u201d Annals of Statistics, 2004.\n\n[5] P. Bartlett, M. Jordan, and J. D. McAuliffe, \u201cConvexity, classi\ufb01cation, and risk bounds,\u201d JASA, 2006.\n\n[6] L. J. Savage, \u201cThe elicitation of personal probabilities and expectations,\u201d JASA, vol. 66, pp. 783\u2013801, 1971.\n\n[7] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge: Cambridge University Press, 2004.\n\n[8] R. McDonald, D. Hand, and I. Eckley, \u201cAn empirical comparison of three boosting algorithms on real data sets with arti\ufb01cial class noise,\u201d in International Workshop on Multiple Classi\ufb01er Systems, 2003.\n\n[9] Y. Freund and R. Schapire, \u201cA decision-theoretic generalization of on-line learning and an application to boosting,\u201d Journal of Computer and System Sciences, 1997.\n\n[10] T. Zhang and B. Yu, \u201cBoosting with early stopping: Convergence and consistency,\u201d Annals of Statistics, 2005.", "award": [], "sourceid": 584, "authors": [{"given_name": "Hamed", "family_name": "Masnadi-shirazi", "institution": null}, {"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}]}