{"title": "A PAC-Bayes approach to the Set Covering Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 731, "page_last": 738, "abstract": null, "full_text": "A PAC-Bayes approach to the Set Covering Machine\n\nFrançois Laviolette, Mario Marchand\nIFT-GLO, Université Laval\nSainte-Foy (QC) Canada, G1K-7P4\ngiven name.surname@ift.ulaval.ca\n\nMohak Shah\nSITE, University of Ottawa\nOttawa, Ont. Canada, K1N-6N5\nmshah@site.uottawa.ca\n\nAbstract\n\nWe design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial margin-sparsity trade-off.\n\n1 Introduction\n\nLearning algorithms try to produce classifiers with small prediction error by trying to optimize some function that can be computed from a training set of examples and a classifier. We currently do not know exactly what function should be optimized, but several forms have been proposed. At one end of the spectrum, we have the set covering machine (SCM), proposed by Marchand and Shawe-Taylor (2002), which tries to find the sparsest classifier making few training errors. At the other end, we have the support vector machine (SVM), proposed by Boser et al. (1992), which tries to find the maximum soft-margin separating hyperplane on the training data. Since both of these learning machines can produce classifiers having small prediction error, we have recently investigated (Laviolette et al., 2005) whether better classifiers could be found by learning algorithms that try to optimize a non-trivial function that depends on both the sparsity of a classifier and the magnitude of its separating margin. 
Our main result was a general data-compression risk bound that applies to any algorithm producing classifiers represented by two complementary sources of information: a subset of the training set, called the compression set, and a message string of additional information. In addition, we proposed a new algorithm for the SCM where the information string was used to encode radius values for data-dependent balls and, consequently, the location of the decision surface of the classifier. Since a small message string is sufficient when large regions of equally good radius values exist for balls, the data-compression risk bound applied to this version of the SCM exhibits, indirectly, a non-trivial margin-sparsity trade-off. However, this version of the SCM suffers from the fact that the radius values used in the final classifier depend on an a priori chosen distance scale R. In this paper, we use a new PAC-Bayes approach that applies to the sample-compression setting and present a new learning algorithm for the SCM that does not suffer from this scaling problem. Moreover, we propose a risk bound that depends more explicitly on the margin and which is also minimized by classifiers achieving a non-trivial margin-sparsity trade-off.\n\n2 Definitions\n\nWe consider binary classification problems where the input space X consists of an arbitrary subset of R^n and the output space Y = {0, 1}. An example z := (x, y) is an input-output pair where x ∈ X and y ∈ Y. In the probably approximately correct (PAC) setting, we assume that each example z is generated independently according to the same (but unknown) distribution D. 
The (true) risk R(f) of a classifier f : X → Y is defined to be the probability that f misclassifies z on a random draw according to D:\n\nR(f) := Pr_{(x,y)∼D}( f(x) ≠ y ) = E_{(x,y)∼D} I( f(x) ≠ y ),\n\nwhere I(a) = 1 if predicate a is true and 0 otherwise. Given a training set S = (z_1, ..., z_m) of m examples, the task of a learning algorithm is to construct a classifier with the smallest possible risk without any information about D. To achieve this goal, the learner can compute the empirical risk R_S(f) of any given classifier f according to:\n\nR_S(f) := (1/m) Σ_{i=1}^m I( f(x_i) ≠ y_i ) := E_{(x,y)∼S} I( f(x) ≠ y ).\n\nWe focus on learning algorithms that construct a conjunction (or disjunction) of features called data-dependent balls from a training set. Each data-dependent ball is defined by a center and a radius value. The center is an input example x_i chosen from the training set S. For any test example x, the output of a ball h of radius ρ, centered on example x_i, is given by\n\nh_{i,ρ}(x) := y_i if d(x, x_i) ≤ ρ, and ȳ_i otherwise,\n\nwhere ȳ_i denotes the boolean complement of y_i and d(x, x_i) denotes the distance between the two points. Note that any metric can be used for the distance here.\n\nTo specify a conjunction of balls we first need to list all the examples that participate as centers for the balls in the conjunction. For this purpose, we use a vector i := (i_1, ..., i_|i|) of indices i_j ∈ {1, ..., m} such that i_1 < i_2 < ... < i_|i|, where |i| is the number of indices present in i (and thus the number of balls in the conjunction). To complete the specification of a conjunction of balls, we need a vector ρ = (ρ_{i_1}, ρ_{i_2}, ..., ρ_{i_|i|}) of radius values, where i_j ∈ {1, ..., m} for j ∈ {1, ..., |i|}. On any input example x, the output C_{i,ρ}(x) of a conjunction of balls is given by:\n\nC_{i,ρ}(x) := 1 if h_{j,ρ_j}(x) = 1 ∀ j ∈ i, and 0 if ∃ j ∈ i : h_{j,ρ_j}(x) = 0.\n\nFinally, any algorithm that builds a conjunction can be used to build a disjunction just by exchanging the roles of the positive and negative labelled examples. Due to lack of space, we describe here only the case of a conjunction.\n\n3 A PAC-Bayes Risk Bound\n\nThe PAC-Bayes approach, initiated by McAllester (1999a), aims at providing PAC guarantees to “Bayesian” learning algorithms. These algorithms are specified in terms of a prior distribution P over a space of classifiers that characterizes our prior belief about good classifiers (before the observation of the data) and a posterior distribution Q (over the same space of classifiers) that takes into account the additional information provided by the training data. A remarkable result that came out of this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier called the Gibbs classifier. Given an input example x, the label G_Q(x) assigned to x by the Gibbs classifier is defined by the following process. We first choose a classifier h according to the posterior distribution Q and then use h to assign the label h(x) to x. The PAC-Bayes theorem was first proposed by McAllester (1999b) and later improved by others (see Langford (2005) for a survey). However, for all these versions of the PAC-Bayes theorem, the prior P must be defined without reference to the training data. 
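As a concrete illustration of the data-dependent balls and their conjunction from Section 2, here is a minimal sketch (the Euclidean metric and the toy points are our own choices for illustration, not from the paper):

```python
import math

# Data-dependent ball h_{i,rho}: outputs the center's label y_i inside
# radius rho and the complement label outside. Any metric is allowed by
# the definition; we pick L2 here as an assumption.
def ball_output(center, center_label, rho, x):
    d = math.dist(center, x)
    return center_label if d <= rho else 1 - center_label

# Conjunction C_{i,rho}(x): outputs 1 iff every ball outputs 1.
def conjunction_output(balls, x):
    return int(all(ball_output(c, y, r, x) == 1 for (c, y, r) in balls))

# Toy usage with two balls centered on (hypothetical) training points.
balls = [((0.0, 0.0), 1, 1.0),  # positive center, radius 1.0
         ((3.0, 0.0), 0, 1.0)]  # negative center, radius 1.0
print(conjunction_output(balls, (0.5, 0.0)))  # inside the positive ball,
                                              # outside the negative one: 1
```

Exchanging the roles of the two classes turns the same code into a disjunction builder, as noted above.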
Consequently, these theorems cannot be applied to the sample-compression setting, where classifiers are partly described by a subset of the training data (as is the case for the SCM).\n\nIn the sample-compression setting, each classifier is described by a subset S_i of the training data, called the compression set, and a message string σ that represents the additional information needed to obtain a classifier. In other words, in this setting, there exists a reconstruction function R that outputs a classifier R(σ, S_i) when given an arbitrary compression set S_i and a message string σ.\n\nGiven a training set S, the compression set S_i ⊆ S is defined by a vector of indices i := (i_1, ..., i_|i|) that points to individual examples in S. For the case of a conjunction of balls, each j ∈ i will point to a training example that is used for a ball center, and the message string σ will be the vector ρ of radius values (defined above) that are used for the balls. Hence, given S_i and ρ, the classifier obtained from R(ρ, S_i) is just the conjunction C_{i,ρ} defined previously.¹\n\nRecently, Laviolette and Marchand (2005) have extended the PAC-Bayes theorem to the sample-compression setting. Their proposed risk bound depends on a data-independent prior P and a data-dependent posterior Q that are both defined on I × M, where I denotes the set of the 2^m possible index vectors i and M denotes, in our case, the set of possible radius vectors ρ. The posterior Q is used by a stochastic classifier, called the sample-compressed Gibbs classifier G_Q, defined as follows. 
Given a training set S and a new (testing) input example x, a sample-compressed Gibbs classifier G_Q chooses (i, ρ) randomly according to Q to obtain the classifier R(ρ, S_i), which is then used to determine the class label of x.\n\nIn this paper we focus on the case where, given any training set S, the learner returns a Gibbs classifier defined with a posterior distribution Q having all its weight on a single vector i. Hence, a single compression set S_i will be used for the final classifier. However, the radius ρ_i for each i ∈ i will be chosen stochastically according to the posterior Q. We therefore consider posteriors Q such that Q(i′, ρ) = I(i = i′) Q_i(ρ), where i is the vector of indices chosen by the learner. Given a training set S, the true risk R(G_{Q_i}) of G_{Q_i} and its empirical risk R_S(G_{Q_i}) are then defined by\n\nR(G_{Q_i}) := E_{ρ∼Q_i} R( R(ρ, S_i) )  ;  R_S(G_{Q_i}) := E_{ρ∼Q_i} R_{S_ī}( R(ρ, S_i) ),\n\nwhere ī denotes the set of indices not present in i. Thus, i ∩ ī = ∅ and i ∪ ī = (1, ..., m).\n\nIn contrast with the posterior Q, the prior P assigns a nonzero weight to several vectors i. Let P_I(i) denote the prior probability that P assigns to vector i and let P_i(ρ) denote the probability density function associated with prior P given i.\n\n¹ We assume that the examples in S_i are ordered as in S so that the kth radius value in ρ is assigned to the kth example in S_i.\n\n
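The stochastic prediction rule just described can be sketched as follows (a minimal Monte-Carlo illustration under our own naming, not from the paper; it assumes a posterior that draws each radius uniformly from a per-ball interval, anticipating the choice of Q_i made below):

```python
import math
import random

# Sample-compressed Gibbs classifier G_{Q_i}: draw one radius per ball
# center from the posterior (here uniform on [a_i, b_i]) and evaluate the
# resulting conjunction of balls on x.
def gibbs_predict(centers, intervals, x, rng):
    for (cx, cy), (a, b) in zip(centers, intervals):
        rho = rng.uniform(a, b)            # stochastic radius for this ball
        out = cy if math.dist(cx, x) <= rho else 1 - cy
        if out == 0:
            return 0                       # the conjunction already fails
    return 1

# Monte-Carlo estimate of the Gibbs risk on a labelled sample.
def gibbs_risk(centers, intervals, sample, n_draws=1000, seed=0):
    rng = random.Random(seed)
    errors = sum(gibbs_predict(centers, intervals, x, rng) != y
                 for _ in range(n_draws) for (x, y) in sample)
    return errors / (n_draws * len(sample))
```

In practice the Gibbs risk admits the closed-form expression derived later, so this sampling estimate is only meant to make the stochastic rule concrete.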
The risk bound depends on the Kullback-Leibler divergence KL(Q‖P) between the posterior Q and the prior P which, in our case, gives\n\nKL(Q_i‖P) = E_{ρ∼Q_i} ln [ Q_i(ρ) / ( P_I(i) P_i(ρ) ) ].\n\nFor these classes of posteriors Q and priors P, the PAC-Bayes theorem of Laviolette and Marchand (2005) reduces to the following simpler version.\n\nTheorem 1 (Laviolette and Marchand (2005)) Given all our previous definitions, for any prior P and for any δ ∈ (0, 1],\n\nPr_{S∼D^m} [ ∀ Q_i : kl( R_S(G_{Q_i}) ‖ R(G_{Q_i}) ) ≤ (1/(m − |i|)) ( KL(Q_i‖P) + ln((m + 1)/δ) ) ] ≥ 1 − δ,\n\nwhere\n\nkl(q‖p) := q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)).\n\nTo obtain a bound for R(G_{Q_i}) we need to specify Q_i(ρ), P_I(i), and P_i(ρ). Since all vectors i having the same size |i| are, a priori, equally “good”, we choose\n\nP_I(i) = p(|i|) / (m choose |i|)\n\nfor any p(·) such that Σ_{d=0}^m p(d) = 1. We could choose p(d) = 1/(m + 1) for d ∈ {0, 1, ..., m} if we have complete ignorance about the size |i| of the final classifier. But since the risk bound deteriorates for large |i|, it is generally preferable to choose, for p(d), a slowly decreasing function of d.\n\nFor the specification of P_i(ρ), we assume that each radius value, in some predefined interval [0, R], is equally likely to be chosen for each ρ_i such that i ∈ i. Here R is some “large” distance specified a priori. For Q_i(ρ), a margin interval [a_i, b_i] ⊆ [0, R] of equally good radius values is chosen by the learner for each i ∈ i. 
Hence, we choose\n\nP_i(ρ) = Π_{i∈i} (1/R) = (1/R)^{|i|}  ;  Q_i(ρ) = Π_{i∈i} 1/(b_i − a_i).\n\nTherefore, the Gibbs classifier returned by the learner will draw each radius ρ_i uniformly in [a_i, b_i]. A deterministic classifier is then specified by fixing each radius value ρ_i ∈ [a_i, b_i]. It is tempting at this point to choose ρ_i = (a_i + b_i)/2 ∀ i ∈ i (i.e., the middle of each interval). However, we will see shortly that the PAC-Bayes theorem offers a better guarantee for another type of deterministic classifier.\n\nConsequently, with these choices for Q_i(ρ), P_I(i), and P_i(ρ), the KL divergence between Q_i and P is given by\n\nKL(Q_i‖P) = ln (m choose |i|) + ln( 1/p(|i|) ) + Σ_{i∈i} ln( R/(b_i − a_i) ).\n\nNotice that the KL divergence is small for small values of |i| (whenever p(|i|) is not too small) and for large margin values (b_i − a_i). Hence, the KL divergence term in Theorem 1 favors both sparsity (small |i|) and large margins. In practice, the minimum might therefore occur for some G_{Q_i} that sacrifices sparsity whenever larger margins can be found.\n\nSince the posterior Q is identified by i and by the intervals [a_i, b_i] ∀ i ∈ i, we will now refer to the Gibbs classifier G_{Q_i} by G^i_ab, where a and b are the vectors formed by the unions of the a_i s and the b_i s respectively. To obtain a risk bound for G^i_ab, we need to find a closed-form expression for R_S(G^i_ab). For this task, let U[a, b] denote the uniform distribution over [a, b] and let σ^i_{a,b}(x) be the probability that a ball with center x_i assigns to x the class label y_i when its radius ρ is drawn according to U[a, b]:\n\nσ^i_{a,b}(x) := Pr_{ρ∼U[a,b]}( h_{i,ρ}(x) = y_i ) =\n  1 if d(x, x_i) ≤ a,\n  (b − d(x, x_i))/(b − a) if a ≤ d(x, x_i) ≤ b,\n  0 if d(x, x_i) ≥ b.\n\nTherefore,\n\nζ^i_{a,b}(x) := Pr_{ρ∼U[a,b]}( h_{i,ρ}(x) = 1 ) = σ^i_{a,b}(x) if y_i = 1, and 1 − σ^i_{a,b}(x) if y_i = 0.\n\nNow let G^i_ab(x) denote the probability that C_{i,ρ}(x) = 1 when each ρ_i ∈ ρ is drawn according to U[a_i, b_i]. We then have\n\nG^i_ab(x) = Π_{i∈i} ζ^i_{a_i,b_i}(x).\n\nConsequently, the risk R_{(x,y)}(G^i_ab) on a single example (x, y) is given by G^i_ab(x) if y = 0 and by 1 − G^i_ab(x) otherwise. Therefore\n\nR_{(x,y)}(G^i_ab) = y(1 − G^i_ab(x)) + (1 − y) G^i_ab(x) = (1 − 2y)( G^i_ab(x) − y ).\n\nHence, the empirical risk R_S(G^i_ab) of the Gibbs classifier G^i_ab is given by\n\nR_S(G^i_ab) = (1/(m − |i|)) Σ_{j∈ī} (1 − 2y_j)( G^i_ab(x_j) − y_j ).\n\nFrom this expression we see that R_S(G^i_ab) is small when G^i_ab(x_j) → y_j ∀ j ∈ ī. Training points where G^i_ab(x_j) ≈ 1/2 should therefore be avoided.\n\nThe PAC-Bayes theorem below provides a risk bound for the Gibbs classifier G^i_ab. Since the Bayes classifier B^i_ab just performs a majority vote under the same posterior distribution as the one used by G^i_ab, we have that B^i_ab(x) = 1 iff G^i_ab(x) > 1/2. From the above definitions, note that the decision surface of the Bayes classifier, given by G^i_ab(x) = 1/2, differs from the decision surface of classifier C_{i,ρ} when ρ_i = (a_i + b_i)/2 ∀ i ∈ i. In fact there does not exist any classifier C_{i,ρ} that has the same decision surface as the Bayes classifier B^i_ab. From the relation between B^i_ab and G^i_ab, it also follows that R_{(x,y)}(B^i_ab) ≤ 2 R_{(x,y)}(G^i_ab) for any (x, y). Consequently, R(B^i_ab) ≤ 2 R(G^i_ab). Hence, we have the following theorem.\n\nTheorem 2 Given all our previous definitions, for any δ ∈ (0, 1], for any p satisfying Σ_{d=0}^m p(d) = 1, and for any fixed distance value R, we have:\n\nPr_{S∼D^m} [ ∀ i, a, b : R(G^i_ab) ≤ sup{ ε : kl( R_S(G^i_ab) ‖ ε ) ≤ (1/(m − |i|)) ( ln (m choose |i|) + ln( 1/p(|i|) ) + Σ_{i∈i} ln( R/(b_i − a_i) ) + ln((m + 1)/δ) ) } ] ≥ 1 − δ.\n\nFurthermore: R(B^i_ab) ≤ 2 R(G^i_ab) ∀ i, a, b.\n\nRecall that the KL divergence is small for small values of |i| (whenever p(|i|) is not too small) and for large margin values (b_i − a_i). Furthermore, the Gibbs empirical risk R_S(G^i_ab) is small when the training points are located far away from the Bayes decision surface G^i_ab(x) = 1/2 (with G^i_ab(x_j) → y_j ∀ j ∈ ī). Consequently, the Gibbs classifier with the smallest guarantee of risk should perform a non-trivial margin-sparsity trade-off.\n\n4 A Soft Greedy Learning Algorithm\n\nTheorem 2 suggests that the learner should try to find the Bayes classifier B^i_ab that uses a small number of balls (i.e., a small |i|), each with a large separating margin (b_i − a_i), while keeping the empirical Gibbs risk R_S(G^i_ab) at a low value. To achieve this goal, we have adapted the greedy algorithm for the set covering machine (SCM) proposed by Marchand and Shawe-Taylor (2002). 
It consists of choosing the (Boolean-valued) feature i with the largest utility U_i, defined as U_i = |N_i| − p|P_i|, where N_i is the set of negative examples covered (classified as 0) by feature i, P_i is the set of positive examples misclassified by this feature, and p is a learning parameter that gives a penalty p for each misclassified positive example. Once the feature with the largest U_i is found, we remove N_i and P_i from the training set S and then repeat (on the remaining examples) until either no more negative examples are present or a maximum number of features has been reached.\n\nIn our case, however, we need to keep the Gibbs risk on S low instead of the risk of a deterministic classifier. Since the Gibbs risk is a “soft measure” that uses the piece-wise linear functions σ^i_{a,b} instead of “hard” indicator functions, we need a “softer” version of the utility function U_i. Indeed, a negative example that falls in the linear region of a σ^i_{a,b} is in fact partly covered. Following this observation, let k be the vector of indices of the examples that we have used as ball centers so far for the construction of the classifier. Let us first define the covering value C(G^k_ab) of G^k_ab as the “amount” of negative examples assigned to class 0 by G^k_ab:\n\nC(G^k_ab) := Σ_{j∈k̄} (1 − y_j) [ 1 − G^k_ab(x_j) ].\n\nWe also define the positive-side error E(G^k_ab) of G^k_ab as the “amount” of positive examples assigned to class 0:\n\nE(G^k_ab) := Σ_{j∈k̄} y_j [ 1 − G^k_ab(x_j) ].\n\nWe now want to add another ball, centered on an example with index i, to obtain a new vector k′ containing this new index in addition to those present in k. Hence, we introduce the covering contribution of ball i (centered on x_i) as\n\nC^k_ab(i) := C(G^{k′}_{a′b′}) − C(G^k_ab) = (1 − y_i) [ 1 − ζ^i_{a_i,b_i}(x_i) G^k_ab(x_i) ] + Σ_{j∈k̄′} (1 − y_j) [ 1 − ζ^i_{a_i,b_i}(x_j) ] G^k_ab(x_j),\n\nand the positive-side error contribution of ball i as\n\nE^k_ab(i) := E(G^{k′}_{a′b′}) − E(G^k_ab) = y_i [ 1 − ζ^i_{a_i,b_i}(x_i) G^k_ab(x_i) ] + Σ_{j∈k̄′} y_j [ 1 − ζ^i_{a_i,b_i}(x_j) ] G^k_ab(x_j).\n\nTypically, the covering contribution of ball i should increase its “utility” and its positive-side error should decrease it. Hence, we define the utility U^k_ab(i) of adding ball i to G^k_ab as\n\nU^k_ab(i) := C^k_ab(i) − p E^k_ab(i),\n\nwhere parameter p represents the penalty of misclassifying a positive example. For a fixed value of p, the “soft greedy” algorithm simply consists of adding, to the current Gibbs classifier, a ball with maximum added utility until either the maximum number of possible features (balls) has been reached or all the negative examples have been (totally) covered. It is understood that, during this soft greedy algorithm, we can remove an example (x_j, y_j) from S whenever it is totally covered. This occurs whenever G^k_ab(x_j) = 0.\n\nThe term Σ_{i∈i} ln( R/(b_i − a_i) ), present in the risk bound of Theorem 2, favors “soft balls” having large margins b_i − a_i. Hence, we introduce a margin parameter γ ≥ 0 that we use as follows. At each greedy step, we first search among balls having b_i − a_i = γ. 
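The soft utility of one candidate ball can be sketched in code (a simplified illustration under our own naming, not the paper's implementation; it folds the covering and positive-side error contributions into a single loop over the remaining examples and omits the candidate center's own term):

```python
import math

# Soft coverage sigma_{a,b}(d): probability that a ball whose radius is
# drawn uniformly in [a, b] places a point at distance d inside the ball.
def sigma(a, b, d):
    if d <= a:
        return 1.0
    if d >= b:
        return 0.0
    return (b - d) / (b - a)

# Soft utility U = C - p*E of one candidate ball (center, label, interval
# [a, b]), given the current Gibbs output g(x) on each remaining example.
def soft_utility(center, label, a, b, remaining, g, p):
    util = 0.0
    for (x, y) in remaining:
        s = sigma(a, b, math.dist(center, x))
        zeta = s if label == 1 else 1.0 - s  # prob. the ball outputs 1
        amount = (1.0 - zeta) * g(x)         # amount pushed toward class 0
        util += amount if y == 0 else -p * amount
    return util
```

The greedy step would evaluate `soft_utility` for every candidate center and interval and keep the maximizer.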
Once such a ball, of center x_i, having maximum utility has been found, we try to increase its utility further by searching among all possible values of a_i and b_i > a_i while keeping its center x_i fixed². Both p and γ are chosen by cross-validation on the training set.\n\nWe conclude this section with an analysis of the running time of this soft greedy learning algorithm for fixed p and γ. For each potential ball center, we first sort the m − 1 other examples with respect to their distances from the center in O(m log m) time. Then, for this center x_i, the set of a_i values that we examine are those specified by the distances (from x_i) of the m − 1 sorted examples³. Since the examples are sorted, it takes time in O(km) to compute the covering contributions and the positive-side errors for all the m − 1 values of a_i. Here k is the largest number of examples falling into the margin. We always use small enough γ values to have k ∈ O(log m) since, otherwise, the results are poor. It therefore takes time in O(m log m) to compute the utility values of all the m − 1 different balls of a given center. This gives a time in O(m² log m) to compute the utilities for all the m possible centers. Once a ball with the largest utility value has been chosen, we then try to increase its utility further by searching among O(m²) pair values for (a_i, b_i). We then remove the examples covered by this ball and repeat the algorithm on the remaining examples. It is well known that greedy algorithms of this kind have the following guarantee: if there exist r balls that cover all the m examples, the greedy algorithm will find at most r ln(m) balls. 
Since we almost always have r ∈ O(1), the running time of the whole algorithm will almost always be in O(m² log²(m)).\n\n5 Empirical Results on Natural Data\n\nWe have compared the new PAC-Bayes learning algorithm (called here SCM-PB) with the old algorithm (called here SCM). Both of these algorithms were also compared with the SVM equipped with an RBF kernel of variance σ² and a soft-margin parameter C. Each SCM algorithm used the L2 metric since this is the metric present in the argument of the RBF kernel. However, in contrast with Laviolette et al. (2005), each SCM was constrained to use only balls having centers of the same class (negative for conjunctions and positive for disjunctions).\n\n² The possible values for a_i and b_i are defined by the location of the training points.\n³ Recall that for each value of a_i, the value of b_i is set to a_i + γ at this stage.\n\nTable 1: SVM and SCM results on UCI data sets.\n\nData Set                 SVM results              SCM        SCM-PB\nName      train  test    σ²    C    SVs  errs    b   errs    γ    b   errs\nbreastw   343    340     5     1    38   15      1   12      .08  4   10\nbupa      170    175     .17   2    169  66      5   62      .1   6   67\ncredit    353    300     2     100  282  51      3   58      .09  11  55\nglass     107    107     .17   10   51   29      5   22      .04  16  19\nheart     150    147     .17   1    64   26      1   23      0    1   28\nhaberman  144    150     1     2    81   39      1   39      .2   1   38\nUSvotes   235    200     25    1    53   13      10  27      .14  18  12\n\nEach algorithm was tested on the UCI data sets of Table 1. Each data set was randomly split in two parts. About half of the examples were used for training and the remaining examples were used for testing. The corresponding numbers of examples are given in the “train” and “test” columns of Table 1. The learning parameters of all algorithms were determined from the training set only. 
The parameters C and σ² for the SVM were determined by the 5-fold cross-validation (CV) method performed on the training set. The parameters that gave the smallest 5-fold CV error were then used to train the SVM on the whole training set, and the resulting classifier was then run on the testing set. Exactly the same method (with the same 5-fold split) was used to determine the learning parameters of both SCM and SCM-PB.\n\nThe SVM results are reported in Table 1, where the “SVs” column refers to the number of support vectors present in the final classifier and the “errs” column refers to the number of classification errors obtained on the testing set. This notation is also used for all the SCM results reported in Table 1. In addition, the “b” and “γ” columns refer, respectively, to the number of balls and the margin parameter (divided by the average distance between the positive and the negative examples). The results reported for SCM-PB refer to the Bayes classifier only; the results for the Gibbs classifier are similar. We observe that, except for bupa and heart, the generalization error of SCM-PB was always smaller than that of SCM. However, the only significant difference occurs on USvotes. We also observe that SCM-PB generally sacrifices sparsity (compared to SCM) to obtain some margin γ > 0.\n\nReferences\n\nB. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.\n\nJohn Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.\n\nFrançois Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed Gibbs classifiers. 
Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 481–488, 2005.\n\nFrançois Laviolette, Mario Marchand, and Mohak Shah. Margin-sparsity trade-off for the set covering machine. Proceedings of the 16th European Conference on Machine Learning (ECML 2005); Lecture Notes in Artificial Intelligence, 3720:206–217, 2005.\n\nMario Marchand and John Shawe-Taylor. The set covering machine. Journal of Machine Learning Research, 3:723–746, 2002.\n\nDavid McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999a.\n\nDavid A. McAllester. PAC-Bayesian model averaging. In COLT, pages 164–170, 1999b.\n", "award": [], "sourceid": 2844, "authors": [{"given_name": "Fran\u00e7ois", "family_name": "Laviolette", "institution": null}, {"given_name": "Mario", "family_name": "Marchand", "institution": null}, {"given_name": "Mohak", "family_name": "Shah", "institution": null}]}