{"title": "Tighter PAC-Bayes Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 9, "page_last": 16, "abstract": null, "full_text": "Tighter PAC-Bayes Bounds\n\nAmiran Ambroladze\nDep. of Mathematics\nLund University/LTH\n\nBox 118, S-221 00 Lund, SWEDEN\n\namiran.ambroladze@math.lth.se\n\nEmilio Parrado-Hern\u00b4andez\n\nDep. of Signal Processing and Communications\n\nUniversity Carlos III of Madrid\n\nLegan\u00b4es, 28911, SPAIN\nemipar@tsc.uc3m.es\n\nJohn Shawe-Taylor\n\nDep. of Computer Science\nUniversity College London\n\nGower Street,\n\nLondon WC1E 6BT, UK\njst@cs.ucl.ac.uk\n\nAbstract\n\nThis paper proposes a PAC-Bayes bound to measure the performance of Support\nVector Machine (SVM) classi\ufb01ers. The bound is based on learning a prior over\nthe distribution of classi\ufb01ers with a part of the training samples. Experimental\nwork shows that this bound is tighter than the original PAC-Bayes, resulting in an\nenhancement of the predictive capabilities of the PAC-Bayes bound. In addition,\nit is shown that the use of this bound as a means to estimate the hyperparameters\nof the classi\ufb01er compares favourably with cross validation in terms of accuracy of\nthe model, while saving a lot of computational burden.\n\n1 Introduction\n\nSupport vector machines (SVM) implement linear classi\ufb01ers in a high-dimensional feature space\nusing the kernel trick to enable a dual representation and ef\ufb01cient computation.\nThe danger of over\ufb01tting in such high-dimensional spaces is countered by maximising the margin\nof the classi\ufb01er on the training examples. For this reason there has been considerable interest in\nbounds on the generalisation in terms of the margin.\nEarly bounds have relied on covering number computations [7], while later bounds have considered\nRademacher complexity. The tightest bounds for practical applications appear to be the PAC-Bayes\nbound [4, 5]. In particular the form given in [3] is specially attractive for margin classi\ufb01ers, like\nSVM. The PAC-Bayesian bounds are also present in other Machine Learning models such as Gaus-\nsian Processes [6].\nThe aim of this paper is to consider a re\ufb01nement of the PAC-Bayes approach and investigate whether\nit can improve on the original PAC-Bayes bound and uphold its capabilities of delivering reliable\nmodel selection.\nThe standard PAC-Bayes bound uses a Gaussian prior centred at the origin in weight space. The\nkey to the new bound is to use part of the training set to compute a more informative prior and then\ncompute the bound on the remainder of the examples relative to this prior. The bounds are tested\nexperimentally in several classi\ufb01cation tasks, including the model selection, on common benchmark\ndatasets.\n\n\fThe rest of the document is organised as follows. Section 2 brie\ufb02y reviews the PAC-Bayes bound for\nSVMs obtained in [3]. The new bound obtained by means of the re\ufb01nement of the prior is presented\nin Section 3. The experimental work, included in Section 4, compares the tightness of the new bound\nwith the original one and indicates about its usability in a model selection task. Finally, the main\nconclusions of this work are outlined in Section 5.\n\n2 PAC-Bayes Bound\n\nThis section is devoted to a brief review of the PAC-Bayes Bound Theorem of [3]. Let us consider a\ndistribution D of patterns x lying in a certain input space X , with their corresponding output labels\ny, y \u2208 {\u22121, 1}. In addition, let us also consider a distribution Q over the classi\ufb01ers c. For every\nclassi\ufb01er c, the following two error measures are de\ufb01ned:\n\nDe\ufb01nition (True error) The true error cD of a classi\ufb01er c is de\ufb01ned as the probability of misclassi-\nfying a pair pattern-label (x, y) selected at random from D\n\ncD \u2261 Pr(x,y)\u223cD(c(x) 6= y)\n\nDe\ufb01nition (Empirical error) The empirical error \u02c6cS of a classi\ufb01er c on a sample S of size m is\nde\ufb01ned as the rate of errors on a set S\n\n\u02c6cS \u2261 Pr(x,y)\u223cS(c(x) 6= y) =\n\n1\nm\n\nI(c(xi) 6= yi)\n\nmX\n\ni=1\n\nwhere I(\u00b7) is a function equal to 1 if the argument is true and equal to 0 if the argument is false.\n\nthe true error, QD \u2261\nNow we can de\ufb01ne two error measures on the distribution of classi\ufb01ers:\nEc\u223cQcD, as the probability of misclassifying an instance x chosen from D with a classi\ufb01er c chosen\naccording to Q; and the empirical error \u02c6QS \u2261 Ec\u223cQ\u02c6cS, as the probability of classi\ufb01er c chosen\naccording to Q misclassifying an instance x chosen from a sample S.\nFor these two quantities we can derive the PAC-Bayes Bound on the true error of the distribution of\nclassi\ufb01ers:\n\nTheorem 2.1 (PAC-Bayes Bound) For all prior distributions P (c) over the classi\ufb01ers c, and for\nany \u03b4 \u2208 (0, 1]\n\n(cid:18)\n\nPrS\u223cDm\n\n\u2200Q(c) : KL( \u02c6QS||QD) \u2264 KL(Q(c)||P (c)) + ln( m+1\nwhere KL is the Kullback-Leibler divergence, KL(p||q) = q ln q\nKL(Q(c)||P (c)) = Ec\u223cQ ln Q(c)\nP (c) .\n\nm\n\n\u03b4\n\n(cid:19)\n\n)\n\n\u2265 1 \u2212 \u03b4,\n\np + (1 \u2212 q) ln 1\u2212q\n\n1\u2212p and\n\nThe proof of the theorem can be found in [3].\nThis bound can be particularised for the case of linear classi\ufb01ers in the following way. The m\ntraining patterns de\ufb01ne a linear classi\ufb01er that can be represented by the following equation1:\n\nc(x) = sign(wT \u03c6(x))\n\n(1)\nwhere \u03c6(x) is a nonlinear projection to a certain feature space where a linear classi\ufb01cation actually\ntakes place, and w is a vector from that feature space that determines the separating plane.\nFor any vector w we can de\ufb01ne a stochastic classi\ufb01er in the following way: we choose the dis-\ntribution Q = Q(w, \u00b5) to be a spherical Gaussian with identity covariance matrix centred on the\ndirection given by w at a distance \u00b5 from the origin. Moreover, we can choose the prior P (c) to be\na spherical Gaussian with identity covariance matrix centred on the origin. Then, for classi\ufb01ers of\nthe form in equation (1) performance can be bounded by\n\n1We are considering here unbiased classi\ufb01ers, i.e., with b = 0.\n\n\fCorollary 2.2 (PAC-Bayes Bound for margin classi\ufb01ers [3]) For all distributions D, for all classi-\n \n\ufb01ers given by w and \u00b5 > 0, for all \u03b4 \u2208 (0, 1], we have\nKL( \u02c6QS(w, \u00b5)||QD(w, \u00b5)) \u2264 \u00b52\n\n2 + ln( m+1\n\n\u2265 1 \u2212 \u03b4.\n\n!\n\nPr\n\n)\n\n\u03b4\n\nm\n\nIt can be shown (see [3]) that\n\n(2)\nwhere Em is the average over the m training examples, \u03b3(x, y) is the normalised margin of the\ntraining patterns\n\n\u02c6QS(w, \u00b5) = Em[ \u02dcF (\u00b5\u03b3(x, y))]\n\n\u03b3(x, y) = ywT \u03c6(x)\nk\u03c6(x)kkwk\nand \u02dcF = 1 \u2212 F , where F is the cumulative normal distribution\ne\u2212x2/2dx.\n\nZ x\n\nF (x) =\n\n1\u221a\n2\u03c0\n\n\u2212\u221e\n\n(3)\n\n(4)\n\nNote that the SVM is a thresholded linear classi\ufb01er expressed as (1) computed by means of the\nkernel trick [2]. The generalisation error of such a classi\ufb01er can be bounded by at most twice the\ntrue (stochastic) error QD(w, \u00b5) in Corollary 2.2, (see [4]);\n\nPr(x,y)\u223cD(cid:0)sign(wT \u03c6(x)) 6= y(cid:1) \u2264 2QD(w, \u00b5)\n\nfor all \u00b5.\n\n3 Choosing a prior for the PAC-Bayes Bound\n\nk=m\u2212r+1 in the description below.\n\nOur \ufb01rst contribution is motivated by the fact that the PAC-Bayes bound allows us to choose the\nprior distribution, P (c). In the standard application of the bound this is chosen to be a Gaussian\ncentred at the origin. We now consider learning a different prior based on training an SVM on a\nsubset R of the training set comprising r training patterns and labels. In the experiments this is\ntaken as a random subset but for simplicity of the presentation we will assume these to be the last r\nexamples {xk, yk}m\nWith these r examples we can determine an SVM classi\ufb01er, wr and form a prior P (w|wr) consist-\ning of a Gaussian distribution with identity covariance matrix centred on wr.\nThe introduction of this prior P (w|wr) in Theorem 2.1 results in the following new bound.\nCorollary 3.1 (Single Prior based PAC-Bayes Bound for margin classi\ufb01ers) Let us consider a prior\non the distribution of classi\ufb01ers consisting in a spherical Gaussian with identity covariance centred\nalong the direction given by wr at a distance \u03b7 from the origin. Then, for all distributions D, for all\nclassi\ufb01ers wm and \u00b5 > 0, for all \u03b4 \u2208 (0, 1], we have\nKL( \u02c6QS\\R(wm, \u00b5)||QD(wm, \u00b5)) \u2264\n\n+ ln( m\u2212r+1\n\n||\u03b7wr\u2212\u00b5wm||2\n\n\u2265 1 \u2212 \u03b4\n\nPrS\u223cD\n\n \n\n!\n\n)\n\n2\n\n\u03b4\n\nm \u2212 r\n\nwhere \u02c6QS\\R is a stochastic measure of the error of the classi\ufb01er on the m \u2212 r samples not used to\nlearn the prior. This stochastic error is computed as indicated in equation (2) averaged over S\\R.\nProof Since we separate r instances to learn the prior, the actual size of the training set to which we\napply the bound is m \u2212 r. In addition, the stochastic error must be computed only on the instances\nnot used to learn the prior, i.e. the subset S\\R.\nThe KL divergence between prior and posterior is computed as follows:\n\nKL(Q(w)||P (w)) = Ew\u223cQ ln Q(w)\n= Ew\u223cQ ln exp(\u2212 1\nexp(\u2212 1\n2(w \u2212 \u00b5wm)T (w \u2212 \u00b5wm) + 1\n= Ew\u223cQ\nmwm \u2212 Ew\u223cQ\n= Ew\u223cQ\n\n(cid:2)\u2212 1\n(cid:0)\u00b5wT\nmw(cid:1) \u2212 1\n\n2 (w\u2212\u00b5wm)T (w\u2212\u00b5wm))\n2 (w\u2212\u03b7wr)T (w\u2212\u03b7wr))\n\n2 \u00b52wT\n\nP (w)\n\n2(w \u2212 \u03b7wr)T (w \u2212 \u03b7wr)(cid:3)\n(cid:0)\u03b7wT wr\n\n(cid:1) + 1\n\n2 \u03b72wT\n\nr wr\n\n\fTaking expectations using Ew\u223cQw = \u00b5wm we arrive at\n\n||\u00b5wm \u2212 \u03b7wr||2\n\n1\n2\n\nIntuitively, if the selection of the prior is appropriate, the bound can be tighter than the one given in\nCorollary 2.2 when applied to the SVM weight vector on the whole training set. It is perhaps worth\nstressing that the bound holds for all wm and so can be applied to the SVM trained on the whole set.\nThis might at \ufb01rst appear as \u2019cheating\u2019, but the critical point is that the bound is evaluated on the set\nS\\R not involved in generating the prior. The experimental work illustrates how in fact this bound\ncan be tighter than the standard PAC-Bayes bound.\nMoreover, the selection of the prior may be further re\ufb01ned in exchange for a very small increase in\nthe penalty term. This can be achieved with the application of the following result.\nTheorem 3.2 (Bound for several priors) Let {Pj(c)}J\nlected with positive weights {\u03c0j}J\nfor all posterior distributions Q(c), for all \u03b4 \u2208 (0, 1],\n\nj=1 be a set of possible priors that can be se-\nj=1,\n\nj=1 \u03c0j = 1. Then, for all priors P (c) \u2208 {Pj(c)}J\n\nj=1 so thatPJ\n\n \n\u2200Q(c),\u2200j : KL( \u02c6QS||QD) \u2264 KL(Q(c)||Pj(c)) + ln m+1\n\n\u03b4 + ln 1\n\u03c0j\n\n\u2265 1 \u2212 \u03b4,\n\nPrS\u223cDm\n\n!\n\nm\n\nProof The bound in Theorem 2.1 can be particularised for a certain Pj(c) with associated weight\n\u03c0j and with con\ufb01dence \u03b4\u03c0j\n\nPrS\u223cDm\n\n\u2200Q(c) : KL( \u02c6QS||QD) >\n\nKL(Q(c)||Pj(c)) + ln( m+1\n\n\u03b4\u03c0j\n\n)\n\nm\n\n< \u03b4\u03c0j,\n\n \n\n!\n\n!\n\nNow let us combine the bounds for all the priors {Pj(c)}J\nfact that P (a \u222a b) \u2264 P (a) + P (b)).\n\n \u2200Q(c),\u2203P (c) \u2208 {Pj(c)}J\n\nj=1 :\n\nj=1 with the union operation (we use the\n\nPrS\u223cDm\n\nKL( \u02c6QS||QD) >\n\nKL(Q(c)||Pj (c))+ln m+1\n\n\u03b4 +ln 1\n\u03c0j\n\n< \u03b4,\n\n(5)\n\nm\n\nFinally, let us take the negation of (5) to arrive at the \ufb01nal result.\n\nj=1 from the origin where {\u03b7j}J\n\nThis result can be also particularised for the case of SVM classi\ufb01ers. The set of priors is constructed\nby allocating Gaussian distributions with identity covariance matrix along the direction given by wr\nat distances {\u03b7j}J\nCorollary 3.3 (Multiple Prior PAC-Bayes Bound for linear classi\ufb01ers) Let us consider a set\n{Pj(w|wr, \u03b7j)}J\nj=1 of prior distributions of classi\ufb01ers consisting in spherical Gaussian distribu-\ntions with identity covariance matrix centred on \u03b7jwr, where {\u03b7j}J\nj=1 are real numbers. Then, for\nall distributions D, for all classi\ufb01ers w, for all \u00b5 > 0, for all \u03b4 \u2208 (0, 1], we have\n) + ln J\n\nj=1 are real numbers. In such a case, we obtain\n\n||\u03b7j wr\u2212\u00b5w||2\n\n!\n\n \n\nPrS\u223cD\n\nKL( \u02c6QS\\R(w, \u00b5)||QD(w, \u00b5)) \u2264\n\n\u2265 1 \u2212 \u03b4\n\n2\n\n+ ln( m\u2212r+1\nm \u2212 r\n\n\u03b4\n\nProof The proof is straightforward, substituting \u03c0j = 1\nthe KL divergence between prior and posterior as in the proof of Corollary 3.1.\nNote that the {\u03b7j}J\nj=1 must be chosen before we actually compute the posterior. However, the bound\nholds for all \u00b5. Therefore, a linear search can be implemented for the value of \u00b5 that leads to the\ntightest bound. In the case of several priors, the search is repeated for every prior and the reported\nvalue of the bound is the tightest. In Section 4 we present experimental results comparing this new\nbound to the standard PAC-Bayes bound and using it to guide model selection.\n\nJ for all j in Theorem 3.2 and computing\n\n\f4 Experiments\n\nThe tightness of the new bound is evaluated in a model selection and classi\ufb01cation task using some\nUCI [1] datasets (see their description in terms of number of instances, input dimension and number\nof positive/negative examples in Table 1).\n\nProblem # samples\nWdbc\nImage\n\n569\n2310\n5000\n7400\n\nWaveform\nRingnorm\n\ninput dim.\n\n30\n18\n21\n20\n\nPos/Neg\n357 / 212\n1320 / 990\n1647 / 3353\n3664 / 3736\n\nTable 1: Description of the datasets: for every set we give the number of patterns, number of input\nvariables and number of positive/negative examples.\n\nFor every dataset, we obtain 50 different training/test set partitions with 80% of the samples forming\nthe training set and the remaining 20% forming the test set.\nWith each of the partitions we learn a SVM classi\ufb01er with Gaussian RBF kernel preceded by a model\nselection. The model selection consists in the determination of an optimal pair of hyperparameters\n(C, \u03c3). C is the SVM trade-off between the maximisation of the margin and the minimisation of\nthe hinge loss of the training samples, while \u03c3 is the width of the Gaussian kernel. The best pair is\nsought in a 15\u00d7 15 grid of parameters where C \u2208 {0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100,\n\u221a\n200, 500, 1000} and \u03c3 \u2208 { 1\nd, 5\n\u221a\nd,\n6\nFor completeness, this model selection is guided by the PAC-Bayes bound: we select the model\ncorresponding to the pair that yields a lower value of QD in the bound. Table 2 shows the value of\nthe PAC-Bayes Bound averaged over the 50 training/test partitions. For every partition we use the\nminimum value of the bound resulting from all the pairs (C, \u03c3) of the grid. Note that this procedure\nis computationally less costly than the commonly used N-fold cross validation model selection,\nsince it saves the training of N classi\ufb01ers (one for each fold) for each parameter combination.\n\n\u221a\nd}, where d is the input space dimension.\n\n\u221a\nd, 3\n\n\u221a\nd, 2\n\n\u221a\nd, 7\n\n\u221a\nd, 8\n\n\u221a\nd, 4\n\nd, 1\n7\n\nd, 1\n6\n\nd, 1\n5\n\n\u221a\n\n\u221a\n\nd, 1\n4\n\nd, 1\n3\n\nd, 1\n2\n\nd,\n\n\u221a\n\n8\n\n\u221a\n\n\u221a\n\n\u221a\n\n\u221a\n\nProblem PAC-Bayes Bound Test error rate\n0.073 \u00b1 0.021\nWdbc\n0.074 \u00b1 0.014\nImage\n0.089 \u00b1 0.008\n0.026 \u00b1 0.005\n\n0.334 \u00b1 0.005\n0.254 \u00b1 0.003\n0.198 \u00b1 0.002\n0.212 \u00b1 0.002\n\nWaveform\nRingnorm\n\nTable 2: Averaged PAC-Bayes Bound and Test Error Rate obtained by the model that yielded the\nlowest bound in each of the 50 training/test partitions.\n\nWe repeated this experiment using the Prior PAC-Bayes Bound with different con\ufb01gurations for\nlearning the prior distribution of classi\ufb01ers. These con\ufb01gurations are de\ufb01ned by variations on the\npercentage of training patterns separated to compute the prior and on the number of scalings of\nthe magnitude of that prior. The scalings represent different lengths \u03b7 of ||wr|| equally spaced\nbetween \u03b7 = 1 and \u03b7 = 100. To summarize, for every training/test partition and for every pair (%\npatterns, # of scalings) we look at the pair (C, \u03c3) that outputs the smaller value of QD.\nIn this case, the use of the Prior PAC-Bayes Bound to perform the model selection increases the\ncomputational burden of using the PAC-Bayes one in the training of one classi\ufb01er (the one used to\nlearn the prior), in comparison to the extra N classi\ufb01ers needed by N-fold cross validation.\nTable 3 displays both the average value and the sample standard deviation over the 50 realisations.\nIt seems that ten scalings of the prior are enough to obtain tighter bounds, since the use of 100 or\n500 scalings does not improve the best results. With respect to the percentage of training instances\nleft out to learn the prior, something close to 50% of the training set works well in the considered\nproblems. It is worth mentioning that we treat each position in the Table as a separate experiment.\n\n\fWinsconsin Database of Breast Cancer (PAC-Bayes Bound = 0.334\u00b10.005)\n\n10 %\n\n0.341 \u00b1 0.006\n0.337 \u00b1 0.010\n0.319 \u00b1 0.007\n0.324 \u00b1 0.007\n\n20%\n\n40%\n\n30%\n\n0.364 \u00b1 0.009\n0.314 \u00b1 0.012\n0.313 \u00b1 0.011\n0.319 \u00b1 0.011\n\nPercentage of training set used to compute the prior\n0.351 \u00b1 0.007\n0.379 \u00b1 0.011\n0.323 \u00b1 0.012\n0.310 \u00b1 0.013\n0.315 \u00b1 0.013\n0.315 \u00b1 0.010\n0.320 \u00b1 0.009\n0.321 \u00b1 0.013\nImage Segmentation (PAC-Bayes Bound = 0.254\u00b10.003)\nPercentage of training set used to compute the prior\n0.284 \u00b1 0.005\n0.262 \u00b1 0.005\n0.188 \u00b1 0.007\n0.203 \u00b1 0.006\n0.203 \u00b1 0.007\n0.187 \u00b1 0.007\n0.204 \u00b1 0.007\n0.189 \u00b1 0.007\n\n0.274 \u00b1 0.003\n0.200 \u00b1 0.005\n0.196 \u00b1 0.005\n0.198 \u00b1 0.005\n\n40%\n\n20%\n\n30%\n\nWaveform (PAC-Bayes Bound = 0.198\u00b10.002)\n\n10 %\n\n0.255 \u00b1 0.003\n0.215 \u00b1 0.004\n0.217 \u00b1 0.004\n0.218 \u00b1 0.004\n\n10 %\n\n0.197 \u00b1 0.003\n0.161 \u00b1 0.004\n0.161 \u00b1 0.004\n0.162 \u00b1 0.004\n\n20%\n\n30%\n\nPercentage of training set used to compute the prior\n0.201 \u00b1 0.003\n0.214 \u00b1 0.004\n0.150 \u00b1 0.005\n0.156 \u00b1 0.004\n0.152 \u00b1 0.005\n0.155 \u00b1 0.004\n0.157 \u00b1 0.004\n0.154 \u00b1 0.005\n\n0.207 \u00b1 0.003\n0.153 \u00b1 0.004\n0.153 \u00b1 0.004\n0.155 \u00b1 0.004\n\n40%\n\nRingnorm (PAC-Bayes Bound = 0.212\u00b10.002)\n\n10 %\n\n0.216 \u00b1 0.001\n0.172 \u00b1 0.068\n0.173 \u00b1 0.068\n0.173 \u00b1 0.068\n\n30%\n\n20%\n\nPercentage of training set used to compute the prior\n0.225 \u00b1 0.002\n0.249 \u00b1 0.004\n0.140 \u00b1 0.047\n0.116 \u00b1 0.030\n0.117 \u00b1 0.030\n0.139 \u00b1 0.047\n0.140 \u00b1 0.047\n0.117 \u00b1 0.030\n\n0.236 \u00b1 0.002\n0.126 \u00b1 0.037\n0.126 \u00b1 0.037\n0.127 \u00b1 0.037\n\n40%\n\n50%\n\n0.398 \u00b1 0.013\n0.306 \u00b1 0.018\n0.315 \u00b1 0.017\n0.322 \u00b1 0.017\n\n50%\n\n0.300 \u00b1 0.008\n0.184 \u00b1 0.010\n0.186 \u00b1 0.009\n0.188 \u00b1 0.009\n\n50%\n\n0.222 \u00b1 0.005\n0.151 \u00b1 0.005\n0.153 \u00b1 0.005\n0.155 \u00b1 0.005\n\n50%\n\n0.265 \u00b1 0.002\n0.109 \u00b1 0.024\n0.110 \u00b1 0.024\n0.110 \u00b1 0.024\n\nScalings\n\n1\n10\n100\n500\n\nScalings\n\n1\n10\n100\n500\n\nScalings\n\n1\n10\n100\n500\n\nScalings\n\n1\n10\n100\n500\n\nTable 3: Averaged Prior PAC-Bayes bound for different settings of percentage of training instances\nreserved to compute the prior and of number of scalings of the normalised prior.\n\nHowever, one could have included the tuning of the pair (% patterns, # of scalings) in\nthe model selection. This would have involved a further application of the union bound with the 20\nentries of the Table for each problem, at the cost of adding an extra ln(20)/m (0.0053 for Wdbc and\nless for the other datasets) in the right part of Theorem 3.2. We decided to \ufb01x the number of scalings\nand the amount of training patterns to compute the prior since to perform all of the different options\nwould augment the computational burden of the model selection.\nIn order to evaluate the predictive capabilities of the Prior PAC-Bayes bound as a means to select\nmodels with low test error rate, Table 4 displays the averaged test error corresponding to the mod-\nels selected in the previous experiment (note that in this case the computational burden involved in\ndetermining the model is increased by the training of the SVM that learns the prior wr). Table 5\ndisplays the test error rate obtained by SVMs with their hyperparameters tuned on the above men-\ntioned grid by means of ten-fold cross-validation, that serves as a baseline method for comparison\npurposes.\nAccording to the values shown in the tables, the Prior PAC-Bayes bound achieves tighter predictions\nof the generalization error of the randomized classi\ufb01er in almost all cases.\nNotice how the length of the prior is not so critical in comparison with its direction. The goodness\nof the latter relying on the subset of samples left out for the purpose of learning the prior classi\ufb01er.\nMoreover it has to be remarked that this tightening of the bound does not appear to deliver any\nreduction in the capabilities to select a good model (such a case would imply that we can predict\nmore accurately a bigger error rate, but our bound is able to predict accurately the same error rate as\nthe PAC-Bayes Bound).\n\n\fWinsconsin Database of Breast Cancer (PAC-Bayes Test Error = 0.073\u00b10.021)\n\n10 %\n\n0.076 \u00b1 0.020\n0.075 \u00b1 0.021\n0.076 \u00b1 0.021\n0.076 \u00b1 0.020\n\n20%\n\n40%\n\n30%\n\n0.076 \u00b1 0.021\n0.075 \u00b1 0.021\n0.074 \u00b1 0.021\n0.074 \u00b1 0.020\n\nPercentage of training set used to compute the prior\n0.076 \u00b1 0.021\n0.076 \u00b1 0.021\n0.076 \u00b1 0.021\n0.074 \u00b1 0.021\n0.074 \u00b1 0.020\n0.076 \u00b1 0.021\n0.076 \u00b1 0.021\n0.073 \u00b1 0.020\nImage Segmentation (PAC-Bayes Test Error = 0.074\u00b10.014)\nPercentage of training set used to compute the prior\n0.083 \u00b1 0.019\n0.078 \u00b1 0.011\n0.054 \u00b1 0.010\n0.066 \u00b1 0.011\n0.063 \u00b1 0.011\n0.059 \u00b1 0.011\n0.063 \u00b1 0.011\n0.059 \u00b1 0.011\n\n0.078 \u00b1 0.011\n0.063 \u00b1 0.014\n0.061 \u00b1 0.011\n0.061 \u00b1 0.011\n\n40%\n\n20%\n\n30%\n\nWaveform (PAC-Bayes Test Error = 0.089\u00b10.008)\n\n10 %\n\n0.078 \u00b1 0.011\n0.064 \u00b1 0.011\n0.064 \u00b1 0.011\n0.064 \u00b1 0.011\n\nScalings\n\n1\n10\n100\n500\n\nScalings\n\n1\n10\n100\n500\n\nScalings\n\n1\n10\n100\n500\n\nScalings\n\n1\n10\n100\n500\n\n50%\n\n0.076 \u00b1 0.021\n0.072 \u00b1 0.021\n0.072 \u00b1 0.021\n0.072 \u00b1 0.021\n\n50%\n\n0.100 \u00b1 0.019\n0.056 \u00b1 0.011\n0.057 \u00b1 0.012\n0.057 \u00b1 0.012\n\n50%\n\n0.091 \u00b1 0.009\n0.089 \u00b1 0.009\n0.089 \u00b1 0.009\n0.089 \u00b1 0.009\n\n50%\n\n0.038 \u00b1 0.005\n0.026 \u00b1 0.008\n0.026 \u00b1 0.008\n0.025 \u00b1 0.005\n\n10 %\n\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n\n30%\n\n20%\n\nPercentage of training set used to compute the prior\n0.089 \u00b1 0.008\n0.091 \u00b1 0.009\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n\n0.090 \u00b1 0.009\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n0.089 \u00b1 0.008\n\n40%\n\nRingnorm (PAC-Bayes Test Error = 0.026\u00b10.005)\n\n10 %\n\n0.025 \u00b1 0.004\n0.020 \u00b1 0.007\n0.020 \u00b1 0.007\n0.020 \u00b1 0.007\n\n20%\n\n30%\n\nPercentage of training set used to compute the prior\n0.030 \u00b1 0.007\n0.036 \u00b1 0.007\n0.021 \u00b1 0.007\n0.025 \u00b1 0.008\n0.025 \u00b1 0.008\n0.021 \u00b1 0.007\n0.021 \u00b1 0.007\n0.025 \u00b1 0.008\n\n0.038 \u00b1 0.005\n0.021 \u00b1 0.007\n0.021 \u00b1 0.007\n0.021 \u00b1 0.007\n\n40%\n\nTable 4: Averaged Test Error Rate corresponding to the model determined by the bound for the\ndifferent settings of Table 3.\n\nProblem Cross-validation error rate Test error rate\n0.072 \u00b1 0.024\nWdbc\n0.024 \u00b1 0.008\nImage\n0.085 \u00b1 0.009\n0.017 \u00b1 0.004\n\n0.060 \u00b1 0.006\n0.022 \u00b1 0.002\n0.079 \u00b1 0.011\n0.015 \u00b1 0.001\n\nWaveform\nRingnorm\n\nTable 5: Averaged test error rate. For every partition we select the test error rate corresponding to\nthe model reporting the smaller cross-validation error.\n\nHowever, the comparison with Table 5 points out that the PAC-Bayes bound is not as accurate\nas Ten Fold cross-validation when it comes to selecting a model that yields a low test error rate.\nNevertheless, in two out of the four problems (waveform, and wdbc) the bound provided a model\nas good as the one found by cross-validation, added to the fact that in ringnorm the error bars\noverlap. We conclude the discussion by pointing that the Cross-validation error rate cannot be used\ndirectly as a prediction on the expected test error rate in the sense of worse case performances. Of\ncourse the values of the cross-validation error rate and the test error rate are close, but it is dif\ufb01cult\nto predict how close they are going to be.\n\n\f5 Conclusions and ongoing research\n\nIn this paper we have presented a version of the PAC-Bayes bound for linear classi\ufb01ers that intro-\nduces the learning of the prior distribution over the classi\ufb01ers. This prior distribution is a Gaussian\nwith identity covariance matrix. The mean weight vector is learnt in the following way: its direction\nis determined from a separate subset of the training examples, while its length has to be chosen from\nan a priori \ufb01xed set of lengths.\nThe experimental work shows that this new version of the bound achieves tighter predictions of the\ngeneralization error of the stochastic classi\ufb01er, compared to the original PAC-Bayes bound predic-\ntions. Moreover, if the model selection is driven by the bound, the Prior PAC-Bayes does not degrade\nthe quality of the model selected by the original bound. Nevertheless, it has to be said that in some\nof our experiments the model selected by the bounds resulted as accurate as the ones selected by\nten-fold cross-validation in terms of test error rate on a separate test. This fact is remarkable since\nto include the model selection in the training of the classi\ufb01er roughly multiplies by ten the computa-\ntional burden of the training when using ten-fold cross-validation but roughly by two when using the\nprior PAC-Bayes bound. Of course the original PAC-Bayes provides with a cheaper model selection,\nbut its predictions about the generalization capabilities are more pessimistic.\nThe amount of training patterns used to learn the prior seems to be a key aspect in the goodness of\nthis prior and thus in the tightness of the bound. Therefore, ongoing research includes methods to\nsystematically determine an amount of patterns that provides with suitable priors. Another line of\nresearch explores the use of these bounds to reinforce different properties of the design of classi\ufb01ers,\nsuch as sparsity. Finally, a deeper study about which dataset structure causes differences among the\nperformances of cross-validation and bound-driven model selections is also being carried out.\n\nAcknowledgments\n\nThis work has been supported by the IST Programme of the European Community under the PAS-\nCAL Network of Excellence IST2002-506788. E. P-H. acknowledges support from Spain CICYT\ngrant TEC2005-04264/TCM.\n\nReferences\n[1] C L Blake and C J Merz.\n\nUCI Repository of machine learning databases.\nInformation and Computer Sciences,\n\nUniversity of California,\n[http://www.ics.uci.edu/\u223cmlearn/MLRepository.html], 1998.\n\nIrvine, Dept. of\n\n[2] Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal\n\nmargin classi\ufb01ers. In Computational Learing Theory, pages 144\u2013152, 1992.\n\n[3] J Langford. Tutorial on practical prediction theory for classi\ufb01cation. Journal of Machine Learn-\n\ning Research, 6(Mar):273\u2013306, 2005.\n\n[4] J Langford and J Shawe-Taylor. PAC-Bayes & Margins. In Advances in Neural Information\n\nProcessing Systems, volume 14, Cambridge MA, 2002. MIT Press.\n\n[5] D McAllester. Pac-bayesian stochastic model selection. Machine Learning, 51(1):5\u201321, 2003.\n[6] M Seeger. PAC-Bayesian Generalization Error Bounds for Gaussian Process Classi\ufb01cation.\n\nJournal of Machine Learning Research, 3:233\u2013269, 2002.\n\n[7] J Shawe-Taylor, P L Bartlett, R C Williamson, and M Anthony. Structural risk minimization\n\nover data-dependent hierarchies. IEEE Trans. Information Theory, 44(5):1926 \u2013 1940, 1998.\n\n\f", "award": [], "sourceid": 3058, "authors": [{"given_name": "Amiran", "family_name": "Ambroladze", "institution": null}, {"given_name": "Emilio", "family_name": "Parrado-hern\u00e1ndez", "institution": null}, {"given_name": "John", "family_name": "Shawe-taylor", "institution": null}]}