{"title": "A PAC-Bayes Risk Bound for General Loss Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 449, "page_last": 456, "abstract": null, "full_text": "A PAC-Bayes Risk Bound for General Loss Functions\n\nPascal Germain\n\nD\u00b4epartement IFT-GLO\n\nUniversit\u00b4e Laval\nQu\u00b4ebec, Canada\n\nAlexandre Lacasse\nD\u00b4epartement IFT-GLO\n\nUniversit\u00b4e Laval\nQu\u00b4ebec, Canada\n\nPascal.Germain.1@ulaval.ca\n\nAlexandre.Lacasse@ift.ulaval.ca\n\nFranc\u00b8ois Laviolette\nD\u00b4epartement IFT-GLO\n\nUniversit\u00b4e Laval\nQu\u00b4ebec, Canada\n\nMario Marchand\n\nD\u00b4epartement IFT-GLO\n\nUniversit\u00b4e Laval\nQu\u00b4ebec, Canada\n\nFrancois.Laviolette@ift.ulaval.ca\n\nMario.Marchand@ift.ulaval.ca\n\nAbstract\n\nWe provide a PAC-Bayesian bound for the expected loss of convex combinations\nof classi\ufb01ers under a wide class of loss functions (which includes the exponential\nloss and the logistic loss). Our numerical experiments with Adaboost indicate that\nthe proposed upper bound, computed on the training set, behaves very similarly\nas the true loss estimated on the testing set.\n\n1 Intoduction\n\nThe PAC-Bayes approach [1, 2, 3, 4, 5] has been very effective at providing tight risk bounds for\nlarge-margin classi\ufb01ers such as the SVM [4, 6]. Within this approach, we consider a prior distribu-\ntion P over a space of classi\ufb01ers that characterizes our prior belief about good classi\ufb01ers (before the\nobservation of the data) and a posterior distribution Q (over the same space of classi\ufb01ers) that takes\ninto account the additional information provided by the training data. A remarkable result that came\nout from this line of research, known as the \u201cPAC-Bayes theorem\u201d, provides a tight upper bound\non the risk of a stochastic classi\ufb01er (de\ufb01ned on the posterior Q) called the Gibbs classi\ufb01er. 
In the context of binary classification, the Q-weighted majority vote classifier (related to this stochastic classifier) labels any input instance with the label output by the stochastic classifier with probability more than half. Since at least half of the Q measure of the classifiers err on an example incorrectly classified by the majority vote, it follows that the error rate of the majority vote is at most twice the error rate of the Gibbs classifier. Therefore, given enough training data, the PAC-Bayes theorem will give a small risk bound on the majority vote classifier only when the risk of the Gibbs classifier is small. While the Gibbs classifiers related to the large-margin SVM classifiers do indeed have a low risk [6, 4], this is clearly not the case for the majority vote classifiers produced by bagging [7] and boosting [8], where the risk of the associated Gibbs classifier is normally close to 1/2. Consequently, the PAC-Bayes theorem is currently not able to recognize the predictive power of the majority vote in these circumstances.

In an attempt to progress towards a theory giving small risk bounds for low-risk majority votes having a large risk for the associated Gibbs classifier, we provide here a risk bound for convex combinations of classifiers under quite arbitrary loss functions, including those normally used for boosting (like the exponential loss) and those that can give a tighter upper bound to the zero-one loss of weighted majority vote classifiers (like the sigmoid loss).
Our numerical experiments with Adaboost [8] indicate that the proposed upper bound for the exponential loss and the sigmoid loss, computed on the training set, behaves very similarly to the true loss estimated on the testing set.

2 Basic Definitions and Motivation

We consider binary classification problems where the input space X consists of an arbitrary subset of R^n and the output space Y = {-1, +1}. An example is an input-output pair (x, y) where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, probability distribution D on X × Y. We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers and produce a convex combination f_Q of binary classifiers taken from H. Each binary classifier h ∈ H contributes to f_Q with a weight Q(h) ≥ 0. For any input example x ∈ X, the real-valued output f_Q(x) is given by

    f_Q(x) = sum_{h∈H} Q(h) h(x),

where h(x) ∈ {-1, +1}, f_Q(x) ∈ [-1, +1], and sum_{h∈H} Q(h) = 1. Consequently, Q(h) will be called the posterior distribution¹.

Since f_Q(x) is also the expected class label returned by a binary classifier randomly chosen according to Q, the margin y f_Q(x) of f_Q on example (x, y) is related to the fraction W_Q(x, y) of binary classifiers that err on (x, y) under measure Q as follows. Let I(a) = 1 when predicate a is true and I(a) = 0 otherwise. We then have:

    W_Q(x, y) - 1/2 = E_{h~Q} I(h(x) ≠ y) - 1/2 = E_{h~Q} [-y h(x) / 2] = -(1/2) sum_{h∈H} Q(h) y h(x) = -(1/2) y f_Q(x).

Since E_{(x,y)~D} W_Q(x, y) is the Gibbs error rate (by definition), we see that the expected margin is just one minus twice the Gibbs error rate.
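The identity above is easy to check numerically. The sketch below is purely illustrative (the hypothesis set, posterior weights, and examples are made up) and verifies that W_Q(x, y) - 1/2 = -(1/2) y f_Q(x) for a small finite H:

```python
# Hypothetical finite hypothesis set: three hard-coded binary classifiers.
H = [lambda x: 1 if x > 0 else -1,
     lambda x: 1 if x > 1 else -1,
     lambda x: -1]
Q = [0.5, 0.3, 0.2]  # posterior weights, summing to 1

def f_Q(x):
    """Convex combination f_Q(x) = sum_h Q(h) h(x)."""
    return sum(q * h(x) for q, h in zip(Q, H))

def W_Q(x, y):
    """Fraction (under measure Q) of classifiers that err on (x, y)."""
    return sum(q for q, h in zip(Q, H) if h(x) != y)

# Check the margin identity on a few made-up examples.
for x, y in [(0.5, 1), (2.0, -1), (-1.0, 1)]:
    assert abs((W_Q(x, y) - 0.5) - (-0.5 * y * f_Q(x))) < 1e-12
```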
In contrast, the error of the Q-weighted majority vote is given by

    E_{(x,y)~D} I( W_Q(x, y) > 1/2 )
      = E_{(x,y)~D} lim_{β→∞} [ (1/2) tanh(β [2 W_Q(x, y) - 1]) + 1/2 ]
      ≤ E_{(x,y)~D} [ tanh(β [2 W_Q(x, y) - 1]) + 1 ]    (∀β > 0)
      ≤ E_{(x,y)~D} exp(β [2 W_Q(x, y) - 1])    (∀β > 0).

Hence, for large enough β, the sigmoid loss (or tanh loss) of f_Q should be very close to the error rate of the Q-weighted majority vote. Moreover, the error rate of the majority vote is always upper bounded by twice that sigmoid loss for any β > 0. The sigmoid loss is, in turn, upper bounded by the exponential loss (which is used, for example, in Adaboost [9]).

More generally, we will provide tight risk bounds for any loss function that can be expanded by a Taylor series around W_Q(x, y) = 1/2. Hence we consider any loss function ζ_Q(x, y) that can be written as

    ζ_Q(x, y) def= 1/2 + (1/2) sum_{k=1}^∞ g(k) (2 W_Q(x, y) - 1)^k    (1)
              = 1/2 + (1/2) sum_{k=1}^∞ g(k) ( E_{h~Q} [-y h(x)] )^k,    (2)

and our task is to provide tight bounds for the expected loss ζ_Q that depend on the empirical loss ζ̂_Q measured on a training sequence S = ⟨(x_1, y_1), ..., (x_m, y_m)⟩ of m examples, where

    ζ_Q def= E_{(x,y)~D} ζ_Q(x, y);    ζ̂_Q def= (1/m) sum_{i=1}^m ζ_Q(x_i, y_i).    (3)

Note that by upper bounding ζ_Q, we are taking into account all moments of W_Q.
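The chain of bounds above can be checked pointwise. The following sketch (illustrative only; the values of β and W_Q are made up) verifies that, for a single example, the majority-vote indicator is bounded by twice the sigmoid loss, which is in turn bounded by the exponential term:

```python
import math

def majority_vote_error_bounds(beta, w):
    """For one example with Gibbs error fraction w = W_Q(x, y), compare the
    majority-vote indicator with its tanh and exponential upper bounds."""
    x = beta * (2.0 * w - 1.0)
    indicator = 1.0 if w > 0.5 else 0.0
    tanh_bound = math.tanh(x) + 1.0   # twice the sigmoid (tanh) loss
    exp_bound = math.exp(x)           # exponential term exp(beta [2 W_Q - 1])
    assert indicator <= tanh_bound <= exp_bound
    return indicator, tanh_bound, exp_bound

# The chain holds for any beta > 0 and any w in [0, 1].
for beta in (0.5, math.log(2), 2.0):
    for w in (0.0, 0.25, 0.5, 0.75, 1.0):
        majority_vote_error_bounds(beta, w)
```

The second inequality is the elementary fact tanh(x) + 1 ≤ e^x, equivalent to 2 ≤ e^x + e^{-x}.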
In contrast, the PAC-Bayes theorem [2, 3, 4, 5] currently upper bounds only the first moment E_{(x,y)~D} W_Q(x, y).

¹When H is a continuous set, Q(h) denotes a density and the summations over h are replaced by integrals.

3 A PAC-Bayes Risk Bound for Convex Combinations of Classifiers

The PAC-Bayes theorem [2, 3, 4, 5] is a statement about the expected zero-one loss of a Gibbs classifier. Given any distribution over a space of classifiers, the Gibbs classifier labels any example x ∈ X according to a classifier randomly drawn from that distribution. Hence, to obtain a PAC-Bayesian bound for the expected general loss ζ_Q of a convex combination of classifiers, let us relate ζ_Q to the zero-one loss of a Gibbs classifier. For this task, let us first write

    E_{(x,y)~D} ( E_{h~Q} [-y h(x)] )^k = E_{h_1~Q} E_{h_2~Q} ... E_{h_k~Q} E_{(x,y)} (-y)^k h_1(x) h_2(x) ... h_k(x).

Note that the product h_1(x) h_2(x) ... h_k(x) defines another binary classifier that we denote as h_{1-k}(x).
We now define the error rate R(h_{1-k}) of h_{1-k} as

    R(h_{1-k}) def= E_{(x,y)~D} I( (-y)^k h_{1-k}(x) = sgn(g(k)) )    (4)
               = 1/2 + (1/2) sgn(g(k)) E_{(x,y)~D} (-y)^k h_{1-k}(x),

where sgn(g) = +1 if g > 0 and -1 otherwise. If we now use E_{h_{1-k}~Q^k} to denote E_{h_1~Q} E_{h_2~Q} ... E_{h_k~Q}, Equation 2 now becomes

    ζ_Q = 1/2 + (1/2) sum_{k=1}^∞ g(k) E_{(x,y)~D} ( E_{h~Q} [-y h(x)] )^k
        = 1/2 + (1/2) sum_{k=1}^∞ |g(k)| sgn(g(k)) E_{h_{1-k}~Q^k} E_{(x,y)~D} (-y)^k h_{1-k}(x)
        = 1/2 + sum_{k=1}^∞ |g(k)| E_{h_{1-k}~Q^k} ( R(h_{1-k}) - 1/2 ).    (5)

Apart from constant factors, Equation 5 relates ζ_Q to the zero-one loss of a new type of Gibbs classifier. Indeed, if we define

    c def= sum_{k=1}^∞ |g(k)|,    (6)

Equation 5 can be rewritten as

    (1/c) ( ζ_Q - 1/2 ) + 1/2 = (1/c) sum_{k=1}^∞ |g(k)| E_{h_{1-k}~Q^k} R(h_{1-k}) def= R(G_Q).    (7)

The new type of Gibbs classifier is denoted above by G_Q, where Q is a distribution over the product classifiers h_{1-k} with variable length k. More precisely, given an example x to be labelled by G_Q, we first choose at random a number k ∈ N⁺ according to the discrete probability distribution given by |g(k)|/c and then we choose h_{1-k} randomly according to Q^k to classify x with h_{1-k}(x). The risk R(G_Q) of this new Gibbs classifier is then given by Equation 7.

We will present a tight PAC-Bayesian bound for R(G_Q) which will automatically translate into a bound for ζ_Q via Equation 7.
This bound will depend on the empirical risk R_S(G_Q), which relates to the empirical loss ζ̂_Q (measured on the training sequence S of m examples) through the equation

    (1/c) ( ζ̂_Q - 1/2 ) + 1/2 = (1/c) sum_{k=1}^∞ |g(k)| E_{h_{1-k}~Q^k} R_S(h_{1-k}) def= R_S(G_Q),    (8)

where

    R_S(h_{1-k}) def= (1/m) sum_{i=1}^m I( (-y_i)^k h_{1-k}(x_i) = sgn(g(k)) ).

Note that Equations 7 and 8 imply that

    ζ_Q - ζ̂_Q = c · [ R(G_Q) - R_S(G_Q) ].

Hence, any looseness in the bound for R(G_Q) will be amplified by the scaling factor c on the bound for ζ_Q. Therefore, within this approach, the bound for ζ_Q can be tight only for small values of c. Note, however, that loss functions having a small value of c are commonly used in practice. Indeed, learning algorithms for feed-forward neural networks, and other approaches that construct a real-valued function f_Q(x) ∈ [-1, +1] from binary classification data, typically use a loss function of the form |f_Q(x) - y|^r / 2, for r ∈ {1, 2}. In these cases we have

    (1/2) |f_Q(x) - y|^r = (1/2) | E_{h~Q} y h(x) - 1 |^r = 2^{r-1} |W_Q(x, y)|^r,

which gives c = 1 for r = 1, and c = 3 for r = 2.

Given a set H of classifiers, a prior distribution P on H, and a training sequence S of m examples, the learner will output a posterior distribution Q on H which, in turn, gives a convex combination f_Q that suffers the expected loss ζ_Q.
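The values c = 1 and c = 3 quoted above can be recovered from the Taylor coefficients of Equation 1: writing x = 2 W_Q - 1, we get 2^{r-1} W_Q^r = (1/2)(1 + x)^r, so g(k) is the binomial coefficient C(r, k) and c = 2^r - 1. A short illustrative check:

```python
from math import comb

def c_for_power_loss(r):
    """For the loss (1/2)|f_Q(x) - y|^r = 2**(r-1) * W_Q**r, the expansion
    around W_Q = 1/2 has g(k) = C(r, k), so c = sum_k C(r, k) = 2**r - 1."""
    return sum(comb(r, k) for k in range(1, r + 1))

assert c_for_power_loss(1) == 1   # r = 1 gives c = 1
assert c_for_power_loss(2) == 3   # r = 2 gives c = 3
```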
Although Equation 7 holds only for a distribution Q defined by the absolute values of the Taylor coefficients g(k) and the product distribution Q^k, the PAC-Bayesian theorem will hold for any prior P and posterior Q defined on

    H* def= ∪_{k∈N⁺} H^k,    (9)

and for any zero-one valued loss function ℓ(h(x), y) defined ∀h ∈ H* and ∀(x, y) ∈ X × Y (not just the one defined by Equation 4). This PAC-Bayesian theorem upper-bounds the value of kl( R_S(G_Q) ‖ R(G_Q) ), where

    kl(q ‖ p) def= q ln(q/p) + (1 - q) ln((1 - q)/(1 - p))

denotes the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p. Note that an upper bound on kl( R_S(G_Q) ‖ R(G_Q) ) provides both an upper and a lower bound on R(G_Q). The upper bound on kl( R_S(G_Q) ‖ R(G_Q) ) depends on the value of KL(Q ‖ P), where

    KL(Q ‖ P) def= E_{h~Q} ln( Q(h) / P(h) )

denotes the Kullback-Leibler divergence between distributions Q and P defined on H*.

In our case, since we want a bound on R(G_Q) that translates into a bound for ζ_Q, we need a Q that satisfies Equation 7. To minimize the value of KL(Q ‖ P), it is desirable to choose a prior P having properties similar to those of Q. Namely, the probabilities assigned by P to the possible values of k will also be given by |g(k)|/c. Moreover, we will restrict ourselves to the case where the k classifiers from H are chosen independently, each according to the prior P on H (however, other choices for P are clearly possible).
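The key property behind this choice of prior is that the KL divergence between k-fold product distributions is additive. A small numerical check of KL(Q^k ‖ P^k) = k · KL(Q ‖ P), with made-up distributions Q and P over three classifiers:

```python
import itertools
import math

Q = [0.7, 0.2, 0.1]              # hypothetical posterior on three classifiers
P = [1 / 3, 1 / 3, 1 / 3]        # uniform prior

def kl_div(q, p):
    """KL divergence between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def kl_product(q, p, k):
    """KL divergence between the k-fold product distributions Q^k and P^k."""
    total = 0.0
    for tup in itertools.product(range(len(q)), repeat=k):
        qk = math.prod(q[i] for i in tup)
        pk = math.prod(p[i] for i in tup)
        total += qk * math.log(qk / pk)
    return total

for k in (1, 2, 3):
    assert abs(kl_product(Q, P, k) - k * kl_div(Q, P)) < 1e-9
```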
In this case we have

    KL(Q ‖ P) = (1/c) sum_{k=1}^∞ |g(k)| E_{h_{1-k}~Q^k} ln( |g(k)| Q^k(h_{1-k}) / ( |g(k)| P^k(h_{1-k}) ) )
              = (1/c) sum_{k=1}^∞ |g(k)| E_{h_1~Q} ... E_{h_k~Q} sum_{i=1}^k ln( Q(h_i) / P(h_i) )
              = (1/c) sum_{k=1}^∞ |g(k)| · k E_{h~Q} ln( Q(h) / P(h) )
              = k̄ · KL(Q ‖ P),    (10)

where

    k̄ def= (1/c) sum_{k=1}^∞ |g(k)| · k.    (11)

We then have the following theorem.

Theorem 1  For any set H of binary classifiers, any prior distribution P on H*, and any δ ∈ (0, 1], we have

    Pr_{S~D^m} ( ∀Q on H*: kl( R_S(G_Q) ‖ R(G_Q) ) ≤ (1/m) [ KL(Q ‖ P) + ln((m + 1)/δ) ] ) ≥ 1 - δ.

Proof  The proof directly follows from the fact that we can apply the PAC-Bayes theorem of [4] to priors and posteriors defined on the space H* of binary classifiers with any zero-one valued loss function.

Note that Theorem 1 directly provides upper and lower bounds on ζ_Q when we use Equations 7 and 8 to relate R(G_Q) and R_S(G_Q) to ζ_Q and ζ̂_Q, and when we use Equation 10 for KL(Q ‖ P). Consequently, we have the following theorem.

Theorem 2  Consider any loss function ζ_Q(x, y) defined by Equation 1. Let ζ_Q and ζ̂_Q be, respectively, the expected loss and its empirical estimate (on a sample of m examples) as defined by Equation 3. Let c and k̄ be defined by Equations 6 and 11 respectively. Then for any set H of binary
Then for any set H of binary\nclassi\ufb01ers, any prior distribution P on H, and any \u03b4 \u2208 (0, 1], we have\n\n+\n\n1\n2\n\n\u03b6Q \u2212 1\n(cid:183)\n2\nk \u00b7 KL(Q(cid:107)P ) + ln m + 1\n\n(cid:184)(cid:33)\n\n\u2265 1 \u2212 \u03b4 .\n\n\u2264 1\nm\n\n\u03b4\n\n(cid:195)\n\nPr\n\nS\u223cDm\n\n\u2200Q on H : kl\n\n(cid:195)\n\n(cid:184)\n\n(cid:183)(cid:99)\u03b6Q \u2212 1\n\n2\n\n1\nc\n\n+\n\n1\n2\n\n(cid:183)\n\n(cid:176)(cid:176)(cid:176)(cid:176)(cid:176) 1\n\nc\n\n(cid:184)\n\n(cid:33)\n\n4 Bound Behavior During Adaboost\n\nWe have decided to examine the behavior of the proposed bounds during Adaboost since\nthis learning algorithm generally produces a weighted majority vote having a large Gibbs risk\nE (x,y)WQ(x, y) (i.e., small expected margin) and a small Var (x,y)WQ(x, y) (i.e., small variance\nof the margin). Indeed, recall that one of our main motivations was to \ufb01nd a tight risk bound for the\nmajority vote precisely under these circumstances.\nWe have used the \u201csymmetric\u201d version of Adaboost [10, 9] where, at each boosting round t, the\nweak learning algorithm produces a classi\ufb01er ht with the smallest empirical error\n\nm(cid:88)\n\n\u0001t =\n\nDt(i)I[ht(xi) (cid:54)= yi]\n\nwith respect to the boosting distribution Dt(i) on the indices i \u2208 {1, . . . , m} of the training exam-\nples. After each boosting round t, this distribution is updated according to\n\ni=1\n\nwhere Zt is the normalization constant required for Dt+1 to be a distribution, and where\n\nDt+1(i) =\n\n1\nZt\n\nDt(i) exp(\u2212yi\u03b1tht(xi)) ,\n\n(cid:181)\n\n(cid:182)\n\n.\n\n\u03b1t =\n\n1\n2\n\nln\n\n1 \u2212 \u0001t\n\u0001t\n\nSince our task is not to obtain the majority vote with the smallest possible risk but to investigate\nthe tightness of the proposed bounds, we have used the standard \u201cdecision stumps\u201d for the set H\n\n\fof classi\ufb01ers that can be chosen by the weak learner. 
Each decision stump is a threshold classifier that depends on a single attribute: it outputs +y when the tested attribute exceeds the threshold and predicts -y otherwise, where y ∈ {-1, +1}. For each decision stump h ∈ H, its boolean complement is also in H. Hence, we have 2[k(i) - 1] possible decision stumps on an attribute i having k(i) possible discrete values. Hence, for data sets having n attributes, we have exactly |H| = sum_{i=1}^n 2[k(i) - 1] classifiers. Data sets having continuous-valued attributes have been discretized in our numerical experiments.

From Theorem 2 and Equation 10, the bound on ζ_Q depends on KL(Q ‖ P). We have chosen a uniform prior P(h) = 1/|H| ∀h ∈ H. We therefore have

    KL(Q ‖ P) = sum_{h∈H} Q(h) ln( Q(h) / P(h) ) = sum_{h∈H} Q(h) ln Q(h) + ln|H| def= -H(Q) + ln|H|.

At boosting round t, Adaboost changes the distribution from D_t to D_{t+1} by putting more weight on the examples that are incorrectly classified by h_t. This strategy is supported by the proposed bound on ζ_Q since it has the effect of increasing the entropy H(Q) as a function of t.
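With the uniform prior, the KL term of the bound is therefore just ln|H| minus the entropy of Q, so spreading the posterior over more stumps tightens the bound. A small illustration (the posterior weights are made up):

```python
import math

def kl_to_uniform(weights):
    """KL(Q || P) for a uniform prior P(h) = 1/|H|; equals ln|H| - H(Q)."""
    n = len(weights)
    entropy = -sum(q * math.log(q) for q in weights if q > 0)
    kl = sum(q * math.log(q * n) for q in weights if q > 0)
    assert abs(kl - (math.log(n) - entropy)) < 1e-12
    return kl

# Spreading Q over more classifiers (higher entropy) shrinks the KL term.
peaked = [0.97, 0.01, 0.01, 0.01]
spread = [0.25, 0.25, 0.25, 0.25]
assert kl_to_uniform(spread) < kl_to_uniform(peaked)
```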
Indeed, apart from tiny fluctuations, the entropy was seen to be nondecreasing as a function of t in all of our boosting experiments.

We have focused our attention on two different loss functions: the exponential loss and the sigmoid loss.

4.1 Results for the Exponential Loss

The exponential loss E_Q(x, y) is the obvious choice for boosting since the typical analysis [8, 10, 9] shows that the empirical estimate of the exponential loss decreases at each boosting round². More precisely, we have chosen

    E_Q(x, y) def= (1/2) exp( β [2 W_Q(x, y) - 1] ).    (12)

For this loss function, we have

    c = e^β - 1
    k̄ = β / (1 - e^{-β}).

Since c increases exponentially rapidly with β, so will the risk upper-bound for E_Q. Hence, unfortunately, we can obtain a tight upper-bound only for small values of β.

All the data sets used were obtained from the UCI repository. Each data set was randomly split into two halves of the same size: one for the training set and the other for the testing set. Figure 1 illustrates the typical behavior of the exponential loss bound on the Mushroom and Sonar data sets, containing 8124 examples and 208 examples respectively.

We first note that, although the test error of the majority vote (generally) decreases as a function of the number T of boosting rounds, the risk of the Gibbs classifier E_{(x,y)} W_Q(x, y) increases as a function of T, but its variance Var_{(x,y)} W_Q(x, y) decreases dramatically. Another striking feature is the fact that the exponential loss bound curve, computed on the training set, is essentially parallel to the true exponential loss curve computed on the testing set. This same parallelism was observed for all the UCI data sets we have examined so far³. Unfortunately, as we can see in Figure 2, the risk bound increases rapidly as a function of β.
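The values of c and k̄ above follow from the Taylor coefficients g(k) = β^k / k! of Equation 12 expanded around W_Q = 1/2. A quick numerical confirmation against truncated sums of the series:

```python
import math

def series_c_kbar(beta, terms=60):
    """c and k_bar computed from the Taylor coefficients g(k) = beta**k / k!
    of (1/2) exp(beta (2 W_Q - 1)) expanded around W_Q = 1/2."""
    g = [beta**k / math.factorial(k) for k in range(1, terms + 1)]
    c = sum(g)                                              # Equation 6
    k_bar = sum(k * gk for k, gk in enumerate(g, start=1)) / c   # Equation 11
    return c, k_bar

for beta in (0.5, math.log(2), 2.0):
    c, k_bar = series_c_kbar(beta)
    assert abs(c - (math.exp(beta) - 1)) < 1e-9              # c = e^beta - 1
    assert abs(k_bar - beta / (1 - math.exp(-beta))) < 1e-9  # closed form
```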
Interestingly, however, the risk bound curves remain parallel to the true risk curves.

²In fact, this is true only for the positive linear combination produced by Adaboost. The empirical exponential risk of the convex combination f_Q is not always decreasing, as we shall see.

³These include the following data sets: Wisconsin-breast, breast cancer, German credit, ionosphere, kr-vs-kp, USvotes, mushroom, and sonar.

Figure 1: Behavior of the exponential risk bound (E_Q bound), the true exponential risk (E_Q on test), the Gibbs risk (E(W_Q) on test), its variance (Var(W_Q) on test), and the test error of the majority vote (MV error on test) as a function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2.

Figure 2: Behavior of the true exponential risk (left) and the exponential risk bound (right) for different values of β on the Mushroom data set.

4.2 Results for the Sigmoid Loss

We have also investigated the sigmoid loss T_Q(x, y) defined by

    T_Q(x, y) def= 1/2 + (1/2) tanh( β [2 W_Q(x, y) - 1] ).    (13)

Since the Taylor series expansion of tanh(x) about x = 0 converges only for |x| < π/2, we are limited to β ≤ π/2.
Under these circumstances, we have

    c = tan(β)
    k̄ = β / (cos(β) sin(β)).

As in Figure 1, we see in Figure 3 that the sigmoid loss bound curve, computed on the training set, is essentially parallel to the true sigmoid loss curve computed on the testing set. Moreover, the bound appears to be as tight as the one for the exponential risk in Figure 1.

Figure 3: Behavior of the sigmoid risk bound (T_Q bound), the true sigmoid risk (T_Q on test), the Gibbs risk (E(W_Q) on test), its variance (Var(W_Q) on test), and the test error of the majority vote (MV error on test) as a function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2.

5 Conclusion

By trying to obtain a tight PAC-Bayesian risk bound for the majority vote, we have obtained a PAC-Bayesian risk bound for any loss function ζ_Q that has a convergent Taylor expansion around W_Q = 1/2 (such as the exponential loss and the sigmoid loss). Unfortunately, the proposed risk bound is tight only for small values of the scaling factor c involved in the relation between the expected loss ζ_Q of a convex combination of binary classifiers and the zero-one loss of a related Gibbs classifier G_Q. However, it is quite encouraging to notice in our numerical experiments with Adaboost that the proposed loss bound (for the exponential loss and the sigmoid loss) behaves very similarly to the true loss.

Acknowledgments

Work supported by NSERC Discovery grants 262067 and 122405.

References

[1] David McAllester. Some PAC-Bayesian theorems.
Machine Learning, 37:355-363, 1999.

[2] Matthias Seeger. PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3:233-269, 2002.

[3] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5-21, 2003.

[4] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273-306, 2005.

[5] François Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed Gibbs classifiers. Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 481-488, 2005.

[6] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423-430. MIT Press, Cambridge, MA, 2003.

[7] Leo Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.

[8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119-139, 1997.

[9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297-336, 1999.

[10] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651-1686, 1998.
", "award": [], "sourceid": 3120, "authors": [{"given_name": "Pascal", "family_name": "Germain", "institution": null}, {"given_name": "Alexandre", "family_name": "Lacasse", "institution": null}, {"given_name": "Fran\u00e7ois", "family_name": "Laviolette", "institution": null}, {"given_name": "Mario", "family_name": "Marchand", "institution": null}]}