{"title": "Margin Maximizing Loss Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1237, "page_last": 1244, "abstract": "", "full_text": "Margin Maximizing Loss Functions\n\nSaharon Rosset\n\nWatson Research Center\n\nIBM\n\nYorktown, NY, 10598\nsrosset@us.ibm.com\n\nJi Zhu\n\nDepartment of Statistics\nUniversity of Michigan\nAnn Arbor, MI, 48109\n\njizhu@umich.edu\n\nTrevor Hastie\n\nDepartment of Statistics\n\nStanford University\nStanford, CA, 94305\n\nhastie@stat.stanford.edu\n\nAbstract\n\nMargin maximizing properties play an important role in the analysis of classi\u00a3-\ncation models, such as boosting and support vector machines. Margin maximiza-\ntion is theoretically interesting because it facilitates generalization error analysis,\nand practically interesting because it presents a clear geometric interpretation of\nthe models being built. We formulate and prove a suf\u00a3cient condition for the\nsolutions of regularized loss functions to converge to margin maximizing separa-\ntors, as the regularization vanishes. This condition covers the hinge loss of SVM,\nthe exponential loss of AdaBoost and logistic regression loss. We also generalize\nit to multi-class classi\u00a3cation problems, and present margin maximizing multi-\nclass versions of logistic regression and support vector machines.\n\n(cid:1)\n\n1 Introduction\ni=1 with yi \u2208 {\u22121, +1}. We\nAssume we have a classi\u00a3cation \u201clearning\u201d sample {xi, yi}n\nwish to build a model F (x) for this data by minimizing (exactly or approximately) a loss\ni C(yiF (xi)) which is a function of the margins yiF (xi)\ncriterion\nof this model on this data. Most common classi\u00a3cation modeling approaches can be cast\nin this framework: logistic regression, support vector machines, boosting and more. 
The model $F(x)$ which these methods actually build is a linear combination of dictionary functions coming from a dictionary $\mathcal{H}$ which can be large or even infinite:

$$F(x) = \sum_{h_j \in \mathcal{H}} \beta_j h_j(x)$$

and our prediction at point $x$ based on this model is $\mathrm{sgn}\, F(x)$.

When $|\mathcal{H}|$ is large, as is the case in most boosting or kernel SVM applications, some regularization is needed to control the "complexity" of the model $F(x)$ and the resulting over-fitting. Thus, it is common that the quantity actually minimized on the data is a regularized version of the loss function:

$$\hat\beta(\lambda) = \arg\min_\beta \sum_i C(y_i \beta' h(x_i)) + \lambda \|\beta\|_p^p \qquad (1)$$

where the second term penalizes the $l_p$ norm of the coefficient vector $\beta$ ($p \geq 1$ for convexity, and in practice usually $p \in \{1, 2\}$), and $\lambda \geq 0$ is a tuning regularization parameter. The 1- and 2-norm support vector machine training problems with slack can be cast in this form ([6], chapter 12). In [8] we have shown that boosting approximately follows the "path" of regularized solutions traced by (1) as the regularization parameter $\lambda$ varies, with the appropriate loss and an $l_1$ penalty.

The main question that we answer in this paper is: for what loss functions does $\hat\beta(\lambda)$ converge to an "optimal" separator as $\lambda \to 0$? The definition of "optimal" which we will use depends on the $l_p$ norm used for regularization, and we will term it the "$l_p$-margin maximizing separating hyper-plane".
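The behavior of (1) as $\lambda \to 0$ can be explored numerically. A minimal gradient-descent sketch with the exponential loss and an $l_2$ penalty on invented, separable data (learning rate and step counts are arbitrary choices, not from the paper); for this data the $l_2$ margin maximizer is the direction $(1, 0)$, and the normalized solution stabilizes there as $\lambda$ shrinks:

```python
import numpy as np

# Hypothetical sketch of (1): minimize sum_i C(y_i * beta' x_i) + lam * ||beta||_2^2
# with the exponential loss, and watch the *normalized* solution as lam -> 0.
X = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def solve(lam, steps=20000, lr=0.05):
    beta = np.zeros(2)
    for _ in range(steps):
        m = y * (X @ beta)
        grad = -(y * np.exp(-m)) @ X + 2 * lam * beta   # d/dbeta of loss + penalty
        beta -= lr * grad
    return beta

for lam in [1.0, 0.1, 0.01]:
    b = solve(lam)
    print(lam, b / np.linalg.norm(b))   # direction stabilizes near (1, 0)
```

Note that the norm of the un-normalized solution grows as $\lambda$ shrinks, a numeric hint of Lemma 2.2 below.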
More precisely, we will investigate for which loss functions and under which conditions we have:

$$\lim_{\lambda \to 0} \frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta' h(x_i) \qquad (2)$$

This margin maximizing property is interesting for three distinct reasons. First, it gives us a geometric interpretation of the "limiting" model as we relax the regularization. It tells us that this loss seeks to optimally separate the data by maximizing a distance between a separating hyper-plane and the "closest" points. A theorem by Mangasarian [7] allows us to interpret $l_p$ margin maximization as $l_q$ distance maximization, with $1/p + 1/q = 1$, and hence make a clear geometric interpretation. Second, from a learning theory perspective large margins are an important quantity: generalization error bounds that depend on the margins have been derived for support vector machines ([10], using $l_2$ margins) and boosting ([9], using $l_1$ margins). Thus, showing that a loss function is "margin maximizing" in this sense is useful and promising information regarding this loss function's potential for generating good prediction models. Third, practical experience shows that exact or approximate margin maximization (such as non-regularized kernel SVM solutions, or "infinite" boosting) may actually lead to good classification prediction models. This is certainly not always the case, and we return to this hotly debated issue in our discussion.

Our main result is a sufficient condition on the loss function which guarantees that (2) holds if the data is separable, i.e. if the maximum on the RHS of (2) is positive. This condition is presented and proven in section 2.
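The sufficient condition just mentioned (stated as condition (3) in section 2) is a ratio-divergence requirement on the loss: $C(t(1-\epsilon))/C(t)$ must blow up as $t$ approaches the critical value $T$. An illustrative numeric check, for the case $T = \infty$, that the exponential and logistic losses satisfy it while a polynomial loss such as $C(m) = 1/m$ does not:

```python
import numpy as np

# Ratio C(t(1-eps)) / C(t) for growing t (T = infinity case).
eps = 0.1
ts = np.array([10.0, 50.0, 100.0])

exp_ratio = np.exp(-ts * (1 - eps)) / np.exp(-ts)        # = e^{eps*t}, diverges
log_ratio = np.log1p(np.exp(-ts * (1 - eps))) / np.log1p(np.exp(-ts))
inv_ratio = (1 / (ts * (1 - eps))) / (1 / ts)            # = 1/(1-eps), bounded

print(exp_ratio)   # grows without bound
print(inv_ratio)   # constant ~1.11: condition fails for C(m) = 1/m
```

For the logistic loss the ratio tracks the exponential one closely, since $\log(1+u) \approx u$ for small $u$, which is why both losses will fall under the $T = \infty$ case of the theorem.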
It covers the hinge loss of support vector machines, the logistic log-likelihood loss of logistic regression, and the exponential loss, most notably used in boosting. We discuss these and other examples in section 3. Our result generalizes elegantly to multi-class models and loss functions. We present the resulting margin-maximizing versions of SVMs and logistic regression in section 4.

2 Sufficient condition for margin maximization

The following theorem shows that if the loss function vanishes "quickly" enough, then it will be margin-maximizing as the regularization vanishes. It provides us with a unified margin-maximization theory, covering SVMs, logistic regression and boosting.

Theorem 2.1 Assume the data $\{x_i, y_i\}_{i=1}^n$ is separable, i.e. $\exists \beta$ s.t. $\min_i y_i \beta' h(x_i) > 0$. Let $C(y, f) = C(yf)$ be a monotone non-increasing loss function depending on the margin only. If $\exists T > 0$ (possibly $T = \infty$) such that:

$$\lim_{t \uparrow T} \frac{C(t \cdot [1 - \epsilon])}{C(t)} = \infty, \quad \forall \epsilon > 0 \qquad (3)$$

then $C$ is a margin maximizing loss function, in the sense that any convergence point of the normalized solutions $\frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p}$ to the regularized problems (1) as $\lambda \to 0$ is an $l_p$ margin-maximizing separating hyper-plane. Consequently, if this margin-maximizing hyper-plane is unique, then the solutions converge to it:

$$\lim_{\lambda \to 0} \frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta' h(x_i) \qquad (4)$$

Proof We prove the result separately for $T = \infty$ and $T < \infty$.

a. $T = \infty$:

Lemma 2.2 $\|\hat\beta(\lambda)\|_p \to \infty$ as $\lambda \to 0$.

Proof Since $T = \infty$, $C(m) > 0$ $\forall m > 0$, and $\lim_{m \to \infty} C(m) = 0$.
Therefore, for the loss + penalty to vanish as $\lambda \to 0$, $\|\hat\beta(\lambda)\|_p$ must diverge, to allow the margins to diverge.

Lemma 2.3 Assume $\beta_1, \beta_2$ are two separating models, with $\|\beta_1\|_p = \|\beta_2\|_p = 1$, and $\beta_1$ separates the data better, i.e.: $0 < m_2 = \min_i y_i h(x_i)' \beta_2 < m_1 = \min_i y_i h(x_i)' \beta_1$. Then $\exists U = U(m_1, m_2)$ such that:

$$\forall t > U, \quad \sum_i C(y_i h(x_i)' (t\beta_1)) < \sum_i C(y_i h(x_i)' (t\beta_2))$$

In words, if $\beta_1$ separates better than $\beta_2$, then scaled-up versions of $\beta_1$ will incur smaller loss than scaled-up versions of $\beta_2$, if the scaling factor is large enough.

Proof Since condition (3) holds with $T = \infty$, there exists $U$ such that $\forall t > U$, $\frac{C(t m_2)}{C(t m_1)} > n$ (apply (3) with $\epsilon = 1 - m_2/m_1 > 0$ along the sequence $t m_1 \to \infty$). Thus, from $C$ being non-increasing we immediately get:

$$\forall t > U, \quad \sum_i C(y_i h(x_i)' (t\beta_1)) \leq n \cdot C(t m_1) < C(t m_2) \leq \sum_i C(y_i h(x_i)' (t\beta_2))$$

Proof of case a.: Assume $\beta^*$ is a convergence point of $\frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p}$ as $\lambda \to 0$, with $\|\beta^*\|_p = 1$. Now assume by contradiction that $\tilde\beta$ has $\|\tilde\beta\|_p = 1$ and a bigger minimal $l_p$ margin.
Denote the minimal margins of the two models by $m^*$ and $\tilde m$, respectively, with $m^* < \tilde m$. By continuity of the minimal margin in $\beta$, there exist some open neighborhood of $\beta^*$ on the $l_p$ sphere:

$$N_{\beta^*} = \{\beta : \|\beta\|_p = 1,\ \|\beta - \beta^*\|_2 < \delta\}$$

and an $\epsilon > 0$, such that:

$$\min_i y_i \beta' h(x_i) < \tilde m - \epsilon, \quad \forall \beta \in N_{\beta^*}$$

Now by lemma 2.3, there exists $U = U(\tilde m, \tilde m - \epsilon)$ such that $t\tilde\beta$ incurs smaller loss than $t\beta$ for any $t > U$, $\beta \in N_{\beta^*}$. Therefore $\beta^*$ cannot be a convergence point of $\frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p}$.

b. $T < \infty$:

Lemma 2.4 $C(T) = 0$ and $C(T - \delta) > 0$, $\forall \delta > 0$.

Proof From condition (3), $\frac{C(T - T\epsilon)}{C(T)} = \infty$. Both results follow immediately, with $\delta = T\epsilon$.

Lemma 2.5 $\lim_{\lambda \to 0} \min_i y_i \hat\beta(\lambda)' h(x_i) = T$.

Proof Assume by contradiction that there is a sequence $\lambda_1, \lambda_2, \ldots \downarrow 0$ and $\epsilon > 0$ s.t. $\forall j$, $\min_i y_i \hat\beta(\lambda_j)' h(x_i) \leq T - \epsilon$. Pick any separating normalized model $\tilde\beta$, i.e. $\|\tilde\beta\|_p = 1$ and $\tilde m := \min_i y_i \tilde\beta' h(x_i) > 0$. Then for any $\lambda < \tilde m^p \frac{C(T - \epsilon)}{T^p}$ we get:

$$\sum_i C\!\left(y_i \frac{T}{\tilde m} \tilde\beta' h(x_i)\right) + \lambda \left\|\frac{T}{\tilde m} \tilde\beta\right\|_p^p < C(T - \epsilon)$$

since the first term (the loss) is 0 (every margin is at least $T$, and $C(T) = 0$ by lemma 2.4) and the penalty $\lambda (T/\tilde m)^p$ is smaller than $C(T - \epsilon)$ by the condition on $\lambda$. But $\exists j_0$ s.t. $\lambda_{j_0} < \tilde m^p \frac{C(T - \epsilon)}{T^p}$, and so we get a contradiction to the optimality of $\hat\beta(\lambda_{j_0})$, since we assumed $\min_i y_i \hat\beta(\lambda_{j_0})' h(x_i) \leq T - \epsilon$ and thus:

$$\sum_i C(y_i \hat\beta(\lambda_{j_0})' h(x_i)) \geq C(T - \epsilon)$$

We have thus proven that $\liminf_{\lambda \to 0} \min_i y_i \hat\beta(\lambda)' h(x_i) \geq T$. It remains to prove equality. Assume by contradiction that for some value of $\lambda$ we have $m := \min_i y_i \hat\beta(\lambda)' h(x_i) > T$. Then the re-scaled model $\frac{T}{m} \hat\beta(\lambda)$ has the same zero loss as $\hat\beta(\lambda)$, but a smaller penalty, since $\|\frac{T}{m} \hat\beta(\lambda)\| = \frac{T}{m} \|\hat\beta(\lambda)\| < \|\hat\beta(\lambda)\|$. So we get a contradiction to the optimality of $\hat\beta(\lambda)$.

Proof of case b.: Assume $\beta^*$ is a convergence point of $\frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p}$ as $\lambda \to 0$, with $\|\beta^*\|_p = 1$. Now assume by contradiction that $\tilde\beta$ has $\|\tilde\beta\|_p = 1$ and a bigger minimal margin. Denote the minimal margins of the two models by $m^*$ and $\tilde m$, respectively, with $m^* < \tilde m$. Let $\lambda_1, \lambda_2, \ldots \downarrow 0$ be a sequence along which $\frac{\hat\beta(\lambda_j)}{\|\hat\beta(\lambda_j)\|_p} \to \beta^*$. By lemma 2.5 and our assumption, $\|\hat\beta(\lambda_j)\|_p \to \frac{T}{m^*} > \frac{T}{\tilde m}$. Thus, $\exists j_0$ such that $\forall j > j_0$, $\|\hat\beta(\lambda_j)\|_p > \frac{T}{\tilde m}$, and consequently:

$$\sum_i C(y_i \hat\beta(\lambda_j)' h(x_i)) + \lambda \|\hat\beta(\lambda_j)\|_p^p > \lambda \left(\frac{T}{\tilde m}\right)^p = \sum_i C\!\left(y_i \frac{T}{\tilde m} \tilde\beta' h(x_i)\right) + \lambda \left\|\frac{T}{\tilde m} \tilde\beta\right\|_p^p$$

So we get a contradiction to the optimality of $\hat\beta(\lambda_j)$.

Thus we conclude, for both cases a. and b., that any convergence point of $\frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p}$ must maximize the $l_p$ margin. Since $\|\frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p}\|_p = 1$, such convergence points obviously exist. If the $l_p$-margin-maximizing separating hyper-plane is unique, then we can conclude:

$$\frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p} \to \hat\beta := \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta' h(x_i)$$

Necessity results

A necessity result for margin maximization on any separable data seems to require either additional assumptions on the loss or a relaxation of condition (3). We conjecture that if we also require that the loss is convex and vanishing (i.e. $\lim_{m \to \infty} C(m) = 0$), then condition (3) is sufficient and necessary. However, this is still a subject for future research.

3 Examples

Support vector machines

Support vector machines (linear or kernel) can be described as a regularized problem:

$$\min_\beta \sum_i [1 - y_i \beta' h(x_i)]_+ + \lambda \|\beta\|_p^p \qquad (5)$$

where $p = 2$ for the standard ("2-norm") SVM and $p = 1$ for the 1-norm SVM.
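The regularized hinge-loss problem (5) can be sketched directly. A minimal subgradient-descent illustration (data, learning rate and step count are invented for this example) on toy points for which the $l_2$ margin maximizer is the direction $(1, 0)$ with normalized margin 2; for small $\lambda$ the normalized solution recovers it:

```python
import numpy as np

# Hinge-loss sketch of (5): minimize sum_i [1 - y_i beta' x_i]_+ + lam ||beta||_2^2.
X = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def solve_hinge(lam, steps=50000, lr=0.01):
    beta = np.zeros(2)
    for _ in range(steps):
        m = y * (X @ beta)
        active = m < 1                            # examples inside the margin
        grad = -(y[active] @ X[active]) + 2 * lam * beta
        beta -= lr * grad
    return beta

b = solve_hinge(0.001)
d = b / np.linalg.norm(b)
print(d, (y * (X @ d)).min())   # direction ~(1, 0), normalized margin ~2
```

This matches the theorem's $T = 1$ case: once all margins of the scaled solution exceed 1 the loss is exactly zero, and only the penalty determines the scale.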
This formulation is equivalent to the better known "norm minimization" SVM formulation, in the sense that they have the same set of solutions as the regularization parameter $\lambda$ varies in (5) or the slack bound varies in the norm minimization formulation.

The loss in (5) is termed the "hinge loss", since it is linear for margins less than 1, then fixed at 0 (see figure 1). The theorem obviously holds with $T = 1$, and it verifies our knowledge that the non-regularized SVM solution, which is the limit of the regularized solutions, maximizes the appropriate margin (Euclidean for the standard SVM, $l_1$ for the 1-norm SVM). Note that our theorem indicates that the squared hinge loss (AKA truncated squared loss):

$$C(y_i, F(x_i)) = [1 - y_i F(x_i)]_+^2$$

is also a margin-maximizing loss.

Logistic regression and boosting

The two loss functions we consider in this context are:

$$\text{Exponential:} \quad C_e(m) = \exp(-m) \qquad (6)$$

$$\text{Log likelihood:} \quad C_l(m) = \log(1 + \exp(-m)) \qquad (7)$$

These two loss functions are of great interest in the context of two-class classification: $C_l$ is used in logistic regression and more recently for boosting [4], while $C_e$ is the implicit loss function used by AdaBoost, the original and most famous boosting algorithm [3].

In [8] we showed that boosting approximately follows the regularized path of solutions $\hat\beta(\lambda)$ using these loss functions and $l_1$ regularization. We also proved that the two loss functions are very similar for positive margins, and that their regularized solutions converge to margin-maximizing separators. Theorem 2.1 provides a new proof of this result, since the theorem's condition holds with $T = \infty$ for both loss functions.

Some interesting non-examples

Commonly used classification loss functions which are not margin-maximizing include polynomial loss functions, for example $C(m) = 1/m$ or $C(m) = m^2$; such losses
do not guarantee convergence of regularized solutions to margin-maximizing solutions.

Another interesting method in this context is linear discriminant analysis. Although it does not correspond to the loss + penalty formulation we have described, it does find a "decision hyper-plane" in the predictor space.

For both polynomial loss functions and linear discriminant analysis it is easy to find examples which show that they are not necessarily margin maximizing on separable data.

4 A multi-class generalization

Our main result can be elegantly extended to versions of multi-class logistic regression and support vector machines, as follows. Assume the response is now multi-class, with $K \geq 2$ possible values, i.e. $y_i \in \{c_1, \ldots, c_K\}$. Our model consists of a "prediction" for each class:

$$F_k(x) = \sum_{h_j \in \mathcal{H}} \beta_j^{(k)} h_j(x)$$

with the obvious prediction rule at $x$ being $\arg\max_k F_k(x)$. This gives rise to a $K - 1$ dimensional "margin" for each observation.
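This $(K-1)$-dimensional margin, defined formally in (8) just below, is straightforward to compute. A small sketch with made-up class scores (the helper names `margin_vector` and `multiclass_hinge` are illustrative, not from the paper; the second helper anticipates the hinge-style loss of section 4.1):

```python
import numpy as np

# For an observation of class k, the (K-1)-dim margin collects f_k - f_j
# for every other class j.
def margin_vector(k, f):
    f = np.asarray(f, dtype=float)
    return np.delete(f[k] - f, k)        # drop the zero entry f_k - f_k

# Hinge-style multi-class loss (section 4.1): sum_j [1 - m_j]_+
def multiclass_hinge(k, f):
    return np.clip(1.0 - margin_vector(k, f), 0.0, None).sum()

f = [2.0, 0.5, -1.0]                     # K = 3 class scores f_1, f_2, f_3
print(margin_vector(0, f))               # [1.5, 3.0]: class 0 beats both others
print(multiclass_hinge(0, f))            # 0.0: both margins exceed 1
print(multiclass_hinge(1, f))            # 2.5: class 1 is not well separated
```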
For $y = c_k$, define the margin vector as:

$$m(c_k, f_1, \ldots, f_K) = (f_k - f_1, \ldots, f_k - f_{k-1}, f_k - f_{k+1}, \ldots, f_k - f_K) \qquad (8)$$

and our loss is a function of this $K - 1$ dimensional margin:

$$C(y, f_1, \ldots, f_K) = \sum_k I\{y = c_k\} C(m(c_k, f_1, \ldots, f_K))$$

[Figure 1: Margin maximizing loss functions for 2-class problems (left: hinge, exponential and logistic) and the SVM 3-class loss function of section 4.1 (right).]

The $l_p$-regularized problem is now:

$$\hat\beta(\lambda) = \arg\min_{\beta^{(1)}, \ldots, \beta^{(K)}} \sum_i C(y_i, h(x_i)' \beta^{(1)}, \ldots, h(x_i)' \beta^{(K)}) + \lambda \sum_k \|\beta^{(k)}\|_p^p \qquad (9)$$

where $\hat\beta(\lambda) = (\hat\beta^{(1)}(\lambda), \ldots, \hat\beta^{(K)}(\lambda))' \in \mathbb{R}^{K \cdot |\mathcal{H}|}$.

In this formulation, the concept of margin maximization corresponds to maximizing the minimal of all $n \cdot (K - 1)$ normalized $l_p$-margins generated by the data:

$$\max_{\|\beta^{(1)}\|_p^p + \ldots + \|\beta^{(K)}\|_p^p = 1}\ \min_i\ \min_{k \neq y_i}\ h(x_i)' (\beta^{(y_i)} - \beta^{(k)}) \qquad (10)$$

Note that this margin maximization problem still has a natural geometric interpretation, as $h(x_i)' (\beta^{(y_i)} - \beta^{(k)}) > 0$ $\forall i, k \neq y_i$ implies that the hyper-plane $h(x)' (\beta^{(j)} - \beta^{(k)}) = 0$ successfully separates classes $j$ and $k$, for any two classes.

Here is a generalization of the optimal separation theorem 2.1 to multi-class models:

Theorem 4.1 Assume $C(m)$ is commutative (i.e. symmetric in its coordinates) and decreasing in each coordinate. If $\exists T > 0$ (possibly $T = \infty$) such that:

$$\lim_{t \uparrow T} \frac{C(t[1 - \epsilon], t u_1, \ldots, t u_{K-2})}{C(t, t v_1, \ldots, t v_{K-2})} = \infty, \quad \forall \epsilon > 0,\ u_1 \geq 1, \ldots, u_{K-2} \geq 1,\ v_1 \geq 1, \ldots, v_{K-2} \geq 1 \qquad (11)$$

then $C$ is a margin-maximizing loss function for multi-class models, in the sense that any convergence point of the normalized solutions to (9), $\frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p}$, attains the optimal separation as defined in (10).

Idea of proof The proof is essentially identical to the two-class case, now considering the $n \cdot (K - 1)$ margins on which the loss depends. Condition (11) implies that as the regularization vanishes the model is determined by the minimal margin, and so an optimal model puts the emphasis on maximizing that margin.

Corollary 4.2 In the 2-class case, theorem 4.1 reduces to theorem 2.1.

Proof The loss depends on $\beta^{(1)} - \beta^{(2)}$, the penalty on $\|\beta^{(1)}\|_p^p + \|\beta^{(2)}\|_p^p$. An optimal solution to the regularized problem must thus have $\beta^{(1)} + \beta^{(2)} = 0$, since by transforming:

$$\beta^{(1)} \to \beta^{(1)} - \frac{\beta^{(1)} + \beta^{(2)}}{2}, \quad \beta^{(2)} \to \beta^{(2)} - \frac{\beta^{(1)} + \beta^{(2)}}{2}$$

we are not changing the loss, but reducing the penalty, by Jensen's inequality:

$$\left\|\beta^{(1)} - \frac{\beta^{(1)} + \beta^{(2)}}{2}\right\|_p^p + \left\|\beta^{(2)} - \frac{\beta^{(1)} + \beta^{(2)}}{2}\right\|_p^p = 2 \left\|\frac{\beta^{(1)} - \beta^{(2)}}{2}\right\|_p^p \leq \|\beta^{(1)}\|_p^p + \|\beta^{(2)}\|_p^p$$

So we can conclude that $\hat\beta^{(1)}(\lambda) = -\hat\beta^{(2)}(\lambda)$, and consequently that the two margin maximization tasks (2), (10) are equivalent.

4.1 Margin maximization in multi-class SVM and logistic regression

Here we apply theorem 4.1 to versions of multi-class logistic regression and SVM.

For logistic regression, we use a slightly different formulation than the "standard"
logistic regression model, which uses class $K$ as a "reference" class, i.e. assumes $\beta^{(K)} = 0$. This is required for non-regularized fitting, since without it the solution is not uniquely defined. However, using regularization as in (9) guarantees that the solution will be unique, and consequently we can "symmetrize" the model, which allows us to apply theorem 4.1. So the loss function we use is (assuming $y = c_k$, i.e. the observation belongs to class $k$):

$$C(y, f_1, \ldots, f_K) = -\log \frac{e^{f_k}}{e^{f_1} + \ldots + e^{f_K}} = \log\left(e^{f_1 - f_k} + \ldots + e^{f_{k-1} - f_k} + 1 + e^{f_{k+1} - f_k} + \ldots + e^{f_K - f_k}\right) \qquad (12)$$

with the linear model $f_j(x_i) = h(x_i)' \beta^{(j)}$. It is not difficult to verify that condition (11) holds for this loss function with $T = \infty$, using the fact that $\log(1 + \epsilon) = \epsilon + O(\epsilon^2)$. The sum of exponentials which results from applying this first-order approximation satisfies (11), and as $\epsilon \to 0$, the second-order term can be ignored.

For support vector machines, consider a multi-class loss which is a natural generalization of the two-class loss:

$$C(m) = \sum_{j=1}^{K-1} [1 - m_j]_+ \qquad (13)$$

where $m_j$ is the $j$'th component of the multi-margin $m$ as in (8). Figure 1 shows this loss for $K = 3$ classes as a function of the two margins. The loss + penalty formulation using (13) is equivalent to a standard optimization formulation of multi-class SVM (e.g.
[11]):

$$\max\ c \quad \text{s.t.} \quad h(x_i)' (\beta^{(y_i)} - \beta^{(k)}) \geq c(1 - \xi_{ik}), \quad \xi_{ik} \geq 0, \quad \sum_{i,k} \xi_{ik} \leq B, \quad \sum_k \|\beta^{(k)}\|_p^p = 1$$

for $i \in \{1, \ldots, n\}$ and $k \in \{1, \ldots, K\}$ with $c_k \neq y_i$. As both theorem 4.1 (using $T = 1$) and the optimization formulation indicate, the regularized solutions to this problem converge to the $l_p$ margin maximizing multi-class solution.

5 Discussion

What are the properties we would like to have in a classification loss function? Recently there has been a lot of interest in Bayes-consistency of loss functions and algorithms ([1] and references therein) as the data size increases. It turns out that practically all "reasonable" loss functions are consistent in that sense, although convergence rates and other measures of "degree of consistency" may vary.

Margin maximization, on the other hand, is a finite-sample optimality property of loss functions, which is potentially of decreasing interest as sample size grows, since the training data-set is less likely to be separable. Note, however, that in very high dimensional predictor spaces, such as those typically used by boosting or kernel SVM, separability of any finite-size data-set is a mild assumption, which is violated only in pathological cases.

We have shown that the margin maximizing property is shared by some popular loss functions used in logistic regression, support vector machines and boosting. Knowing that these algorithms "converge", as regularization vanishes, to the same model (provided they use the same regularization) is an interesting insight. So, for example, we can conclude that 1-norm support vector machines, exponential boosting and $l_1$-regularized logistic regression all facilitate the same non-regularized solution, which is an $l_1$-margin maximizing separating hyper-plane.
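That shared $l_1$-margin-maximizing limit can be computed directly in low dimensions. A brute-force sketch (2-D only; the toy data is invented here, and a linear program would be the proper tool in general) that searches the $l_1$ unit sphere for the direction maximizing the minimal margin:

```python
import numpy as np

# Brute-force l1-margin maximization in 2-D: search ||beta||_1 = 1 for the
# direction maximizing min_i y_i beta' x_i.
X = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

best, best_margin = None, -np.inf
for u in np.linspace(-1, 1, 4001):
    for s in (1.0, -1.0):
        beta = np.array([u, s * (1 - abs(u))])   # parametrizes ||beta||_1 = 1
        m = (y * (X @ beta)).min()
        if m > best_margin:
            best, best_margin = beta, m

print(best, best_margin)   # ~(1, 0) with l1 margin 2
```

By Mangasarian's theorem, this $l_1$-margin maximizer is also the hyper-plane maximizing the $l_\infty$ distance to the closest points.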
From Mangasarian's theorem [7] we know that this hyper-plane maximizes the $l_\infty$ distance from the closest points on either side.

The most interesting statistical question which arises is: are these "optimal" separating models really good for prediction, or should we expect regularized models to always do better in practice? Statistical intuition supports the latter, as do some margin-maximizing experiments by Breiman [2] and Grove and Schuurmans [5]. However, it has also been observed that in many cases margin maximization leads to reasonable prediction models, and does not necessarily result in over-fitting. We have had similar experience with boosting and kernel SVM. Settling this issue is an intriguing research topic, and one that is critical in determining the practical importance of our results, as well as that of margin-based generalization error bounds.

References

[1] Bartlett, P., Jordan, M. & McAuliffe, J. (2003). Convexity, classification and risk bounds. Technical report, Dept. of Statistics, UC Berkeley.
[2] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 7:1493-1517.
[3] Freund, Y. & Schapire, R.E. (1995). A decision theoretic generalization of on-line learning and an application to boosting. Proc. of 2nd European Conf. on Computational Learning Theory.
[4] Friedman, J.H., Hastie, T. & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics 28, pp. 337-407.
[5] Grove, A.J. & Schuurmans, D. (1998). Boosting in the limit: maximizing the margin of learned ensembles. Proc. of 15th National Conf. on AI.
[6] Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag.
[7] Mangasarian, O.L. (1999). Arbitrary-norm separating plane. Operations Research Letters, Vol. 24, 1-2:15-23.
[8] Rosset, S., Zhu, J. & Hastie, T. (2003). Boosting as a regularized path to a maximum margin classifier. Technical report, Dept. of Statistics, Stanford Univ.
[9] Schapire, R.E., Freund, Y., Bartlett, P. & Lee, W.S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26(5):1651-1686.
[10] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer.
[11] Weston, J. & Watkins, C. (1998). Multi-class support vector machines. Technical report CSD-TR-98-04, Dept. of CS, Royal Holloway, University of London.
", "award": [], "sourceid": 2433, "authors": [{"given_name": "Saharon", "family_name": "Rosset", "institution": null}, {"given_name": "Ji", "family_name": "Zhu", "institution": null}, {"given_name": "Trevor", "family_name": "Hastie", "institution": null}]}