{"title": "PAC-Bayes bounds for stable algorithms with instance-dependent priors", "book": "Advances in Neural Information Processing Systems", "page_first": 9214, "page_last": 9224, "abstract": "PAC-Bayes bounds have been proposed to get risk estimates based on a training sample. In this paper the PAC-Bayes approach is combined with stability of the hypothesis learned by a Hilbert space valued algorithm. The PAC-Bayes setting is used with a Gaussian prior centered at the expected output. Thus a novelty of our paper is using priors defined in terms of the data-generating distribution. Our main result estimates the risk of the randomized algorithm in terms of the hypothesis stability coefficients. We also provide a new bound for the SVM classifier, which is compared to other known bounds experimentally. Ours appears to be the first uniform hypothesis stability-based bound that evaluates to non-trivial values.", "full_text": "PAC-Bayes bounds for stable algorithms with\n\ninstance-dependent priors\n\nOmar Rivasplata\n\nUCL\n\nEmilio Parrado-Hern\u00b4andez\nUniversity Carlos III of Madrid\n\nJohn Shawe-Taylor\n\nUCL\n\nShiliang Sun\n\nEast China Normal University\n\nCsaba Szepesv\u00b4ari\n\nDeepMind\n\nAbstract\n\nPAC-Bayes bounds have been proposed to get risk estimates based on a training\nsample. In this paper the PAC-Bayes approach is combined with stability of the\nhypothesis learned by a Hilbert space valued algorithm. The PAC-Bayes setting is\nused with a Gaussian prior centered at the expected output. Thus a novelty of our\npaper is using priors de\ufb01ned in terms of the data-generating distribution. Our main\nresult estimates the risk of the randomized algorithm in terms of the hypothesis\nstability coef\ufb01cients. We also provide a new bound for the SVM classi\ufb01er, which\nis compared to other known bounds experimentally. 
Ours appears to be the \ufb01rst\nuniform hypothesis stability-based bound that evaluates to non-trivial values.\n\n1\n\nIntroduction\n\nThis paper combines two directions of research: stability of learning algorithms, and PAC-Bayes\nbounds for algorithms that randomize with a data-dependent distribution. The combination of these\nideas enables the development of risk bounds that exploit stability of the learned hypothesis but are\nindependent of the complexity of the hypothesis class. The PAC-Bayes framework (Shawe-Taylor\nand Williamson [1997], McAllester [1999a,b]) is used here with \u2018priors\u2019 de\ufb01ned in terms of the\ndata-generating distribution, as introduced by Catoni [2007] and developed further e.g. by Lever\net al. [2010], Parrado-Hern\u00b4andez et al. [2012] and Dziugaite and Roy [2018]. Speci\ufb01cally, our work\nderives PAC-Bayes bounds for hypothesis stable Hilbert space valued algorithms.\nThe analysis introduced by Bousquet and Elisseeff [2002], which followed and extended Lugosi\nand Pawlak [1994] and was further developed by Celisse and Guedj [2016], Abou-Moustafa and\nSzepesv\u00b4ari [2017] and Liu et al. [2017] among others, shows that stability of learning algorithms\ncan be used to give bounds on the generalisation of the learned functions. Intuitively, this is because\nstable learning should ensure that slightly different training sets give similar solutions. In this paper\nstability is measured by the sensitivity coef\ufb01cients (see our De\ufb01nition 1) of the hypothesis learned\nby a Hilbert space valued algorithm. We provide an analysis leading to a PAC-Bayes bound for\nrandomized classi\ufb01ers under Gaussian randomization. As a by-product of the stability analysis we\nderive a concentration inequality for the learned hypothesis. 
Applying it to Support Vector Machines\n(Shawe-Taylor and Cristianini [2004], Steinwart and Christmann [2008]) we deduce a concentration\nbound for the SVM weight vector, and a PAC-Bayes performance bound for SVM with Gaussian\nrandomization. Experimental results compare our new bound with other stability-based bounds, and\nwith a more standard PAC-Bayes bound. We also experiment with their use in model selection.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00b4eal, Canada.\n\n\f2 De\ufb01nitions and Main Result(s)\n\nWe consider a learning problem where the learner observes pairs (Xi, Yi) of patterns (inputs) Xi\nfrom the space1 X and labels Yi in the space Y. A training set (or sample) is a \ufb01nite sequence\nSn = ((X1, Y1), . . . , (Xn, Yn)) of such observations. Each pair (Xi, Yi) is a random element of\nX\u21e5Y whose (joint) probability law is2 P 2 M1(X\u21e5Y ). We think of P as the underlying \u2018true\u2019 (but\nunknown) data-generating distribution. Examples are i.i.d. (independent and identically distributed)\nin the sense that the joint distribution of Sn is the n-fold product measure P n = P \u2326\u00b7\u00b7\u00b7\u2326 P .\nA learning algorithm is a function A : [n(X\u21e5Y )n !Y X that maps training samples (of any size)\nto predictor functions. Given Sn, the algorithm produces a learned hypothesis A(Sn) : X!Y that\nwill be used to predict the label of unseen input patterns X 2X . Typically X\u21e2 Rd and Y\u21e2 R.\nFor instance, Y = {1, 1} for binary classi\ufb01cation, and Y = R for regression. A loss function\n` : R \u21e5Y! [0,1) is used to assess the quality of hypotheses h : X!Y . Say if a pair (X, Y ) is\nsampled, then `(h(X), Y ) quanti\ufb01es the dissimilarity between the label h(X) predicted by h, and\nthe actual label Y . We may write `h(X, Y ) = `(h(X), Y ) to express the losses (of h) as function of\nthe training examples. 
The (theoretical) risk of hypothesis h under data-generating distribution P is $R(h, P) = \langle \ell_h, P \rangle$.3 It is also called the error of h under P. The empirical risk of h on a sample Sn is $R(h, P_n) = \langle \ell_h, P_n \rangle$, where $P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(X_i, Y_i)}$ is the empirical measure4 on X×Y associated to the sample. Notice that the risk (empirical or theoretical) is tied to the choice of a loss function. For instance, consider binary classification with the 0-1 loss $\ell_{01}(y', y) = \mathbf{1}[y' \neq y]$, where $\mathbf{1}[\cdot]$ is an indicator function equal to 1 when the argument is true and equal to 0 when the argument is false. In this case the risk is $R_{01}(c, P) = P[c(X) \neq Y]$, i.e., the probability of misclassifying the random example $(X, Y) \sim P$ when using c; and the empirical risk is $R_{01}(c, P_n) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[c(X_i) \neq Y_i]$, i.e., the in-sample proportion of misclassified examples.

Our main theorem concerns Hilbert space valued algorithms, in the sense that the learned hypotheses live in a Hilbert space H. In this case we may use the Hilbert space norm $\|w\|_{\mathcal{H}} = \sqrt{\langle w, w \rangle_{\mathcal{H}}}$ to measure the difference between the hypotheses learned from two slightly different samples.

To shorten the notation we will write Z = X×Y. A generic element of this space is z = (x, y), the observed examples are Zi = (Xi, Yi) and the sample of size n is Sn = (Z1, . . . , Zn).

Definition 1. Consider a learning algorithm $A : \cup_n \mathcal{Z}^n \to \mathcal{H}$ where H is a separable Hilbert space. We define5 the hypothesis sensitivity coefficient of A at sample size n as follows:

$$\beta_n = \sup_{i \in [n]} \;\sup_{z_i, z'_i} \;\big\| A(z_{1:i-1}, z_i, z_{i+1:n}) - A(z_{1:i-1}, z'_i, z_{i+1:n}) \big\|_{\mathcal{H}} \,.$$

This is close in spirit to what is called "uniform stability" in the literature, except that our definition concerns stability of the learned hypothesis itself (measured by a distance on the hypothesis space), while e.g. Bousquet and Elisseeff [2002] deal with stability of the loss functional.
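For intuition, the sensitivity coefficient of Definition 1 can be probed numerically for a concrete Hilbert space valued algorithm. The sketch below is our own illustration, not part of the paper: it uses ridge regression (a regularized ERM whose learned weight vector lives in R^d) as the algorithm A, replaces one training example, and records the norm of the change in the learned weights; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Minimizer of (1/n)||Xw - y||^2 + lam*||w||^2, a regularized ERM (hence stable)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def empirical_sensitivity(X, y, lam, trials=50, seed=0):
    """Probe Definition 1: swap one example for a random replacement and
    record the largest observed ||A(S) - A(S')|| over the trials."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = ridge_weights(X, y, lam)
    worst = 0.0
    for _ in range(trials):
        i = rng.integers(n)
        X2, y2 = X.copy(), y.copy()
        X2[i] = rng.standard_normal(d)   # replacement example z'_i
        y2[i] = rng.standard_normal()
        worst = max(worst, np.linalg.norm(w - ridge_weights(X2, y2, lam)))
    return worst

rng = np.random.default_rng(1)
X, y = rng.standard_normal((200, 5)), rng.standard_normal(200)
beta_small_lam = empirical_sensitivity(X, y, lam=0.1)
beta_large_lam = empirical_sensitivity(X, y, lam=10.0)
assert beta_large_lam < beta_small_lam  # stronger regularization => more stable
```

Since only random replacement examples are tried, this lower-bounds the supremum in Definition 1; still, it exhibits the expected behavior that stronger regularization yields smaller sensitivity.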
The latter could be\ncalled \u201closs stability\u201d (in terms of \u201closs sensitivity coef\ufb01cients\u201d) for the sake of informative names.\nWriting z1:n \u21e1 z01:n when these n-tuples differ at one entry (at most), an equivalent formulation to\nthe above is n = supz1:n\u21e1z01:n kA(z1:n) A(z01:n)kH. In particular, if two samples Sn and S0n\ndiffer only on one example, then kA(Sn) A(S0n)kH \uf8ff n. Thus our de\ufb01nition implies stability\nwith respect to replacing one example with an independent copy. Alternatively, one could de\ufb01ne\nn = ess supSn\u21e1S0n kA(Sn) A(S0n)kH, which corresponds to the \u201cuniform argument stability\u201d\nof Liu et al. [2017]. We avoid the \u2018almost-sure\u2019 technicalities by de\ufb01ning our n\u2019s as the maximal\ndifference (in norm) with respect to all n-tuples z1:n \u21e1 z01:n. The extension to sensitivity when\nchanging several examples is natural: kA(z1:n) A(z01:n)kH \uf8ff nPn\ni=1 1[zi 6= z0i]. Note that\nn is a Lipschitz factor with respect to the Hamming distance. The \u201ctotal Lipschitz stability\u201d of\nKontorovich [2014] is a similar notion for stability of the loss functional. The \u201ccollective stability\u201d\nof London et al. [2013] is not comparable to ours (different setting) despite the similar look.\n\n1All spaces where random variables take values are assumed to be measurable spaces.\n2M1(Z) denotes the set of all probability measures over the space Z.\n3 Mathematicians write hf, \u232bi def= RX\u21e5Y\nwith respect to a (not necessarily probability) measure \u232b on X\u21e5Y .\n4 Integrals with respect to Pn evaluate as follows:RX\u21e5Y\n\nf (x, y) d\u232b(x, y) for the integral of a function f : X\u21e5Y! R\ni=1 `(h(Xi), Yi).\n5 For a list \u21e01,\u21e0 2,\u21e0 3, . . . and indexes i < j, we write \u21e0i:j = (\u21e0i, . . . 
, ξ_j), i.e., the segment from ξ_i to ξ_j.

$\int \ell(h(x), y)\, dP_n(x, y) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(X_i), Y_i)$.

We will consider randomized classifiers that operate as follows. Let C be the classifier space, and let Q ∈ M1(C) be a probability distribution over the classifiers. To make a prediction the randomized classifier picks c ∈ C according to Q and predicts a label with the chosen c. Each prediction is made with a fresh draw of c. For simplicity we use the same label Q for the probability distribution and for the corresponding randomized classifier. The risk measures R(c, P) and R(c, Pn) are extended to randomized classifiers: $R(Q, P) \equiv \int_{\mathcal{C}} R(c, P)\, dQ(c)$ is the average theoretical risk of Q, and $R(Q, P_n) \equiv \int_{\mathcal{C}} R(c, P_n)\, dQ(c)$ its average empirical risk. Given two distributions Q′, Q ∈ M1(C), the Kullback-Leibler divergence (a.k.a. relative entropy) of Q with respect to Q′ is

$$\mathrm{KL}(Q \| Q') = \int_{\mathcal{C}} \log\frac{dQ}{dQ'}\, dQ \,.$$

Of course this makes sense when Q is absolutely continuous with respect to Q′, which ensures that the Radon-Nikodym derivative dQ/dQ′ exists. For Bernoulli distributions with parameters q and q′ we write $\mathrm{kl}(q \| q') = q \log\big(\frac{q}{q'}\big) + (1-q)\log\big(\frac{1-q}{1-q'}\big)$, and $\mathrm{kl}^{+}(q \| q') = \mathrm{kl}(q \| q')\,\mathbf{1}[q < q']$.

2.1 Main theorem: a PAC-Bayes bound for stable algorithms with Gaussian randomization

Theorem 2. Let A be a Hilbert space valued algorithm. Suppose that (once trained) the algorithm will randomize according to Gaussian distributions Q = N(A(Sn), σ²I). If A has hypothesis stability coefficient β_n, then for any randomization variance σ² > 0, for any δ ∈ (0, 1), with probability 1 − 2δ we have

$$\mathrm{kl}^{+}\big(R(Q, P_n)\,\big\|\,R(Q, P)\big) \;\le\; \frac{\frac{n\beta_n^2}{2\sigma^2}\Big(1 + \sqrt{\frac{1}{2}\log\frac{1}{\delta}}\Big)^2 + \log\big(\frac{n+1}{\delta}\big)}{n} \,.$$

The proof (see Section 4 below) combines stability of the learned hypothesis (as in our Definition 1) and a PAC-Bayes theorem quoted there for reference. Note that the randomizing distribution Q is data-dependent.
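A bound of this form is turned into a numeric risk estimate by inverting the binary KL divergence: given the empirical term and the right-hand side B, the risk bound is the largest p with kl(R(Q, Pn) ‖ p) ≤ B. A minimal sketch of this inversion, together with the right-hand side of Theorem 2 as we read it (our own illustration; the numbers plugged in at the end are hypothetical):

```python
import math

def kl_bin(q, p):
    """Binary KL divergence kl(q||p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, bound):
    """Largest p in [q_hat, 1) with kl(q_hat||p) <= bound, via bisection
    (kl(q_hat||p) is increasing in p for p >= q_hat)."""
    lo, hi = q_hat, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bin(q_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def theorem2_rhs(n, beta_n, sigma2, delta):
    """Right-hand side of Theorem 2: stability term plus log term, divided by n."""
    kl_term = n * beta_n**2 / (2 * sigma2) * (1 + math.sqrt(0.5 * math.log(1 / delta)))**2
    return (kl_term + math.log((n + 1) / delta)) / n

# Hypothetical numbers: empirical randomized risk 0.1, n = 7200, beta_n of order 1/n
rhs = theorem2_rhs(n=7200, beta_n=2 / (0.5 * 7200), sigma2=1.0, delta=0.01)
risk_bound = kl_inverse(0.1, rhs)
assert 0.1 < risk_bound < 1.0
```

The same inversion routine serves any of the kl-form bounds listed in Section 3.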
Literature on the PAC-Bayes framework for learning linear classifiers includes Germain et al. [2015] and Parrado-Hernández et al. [2012] with references. Applications of this framework to neural networks are given, e.g., by London [2017], Dziugaite and Roy [2017], and Dziugaite and Roy [2018]. The latter combines PAC-Bayes and a substantially different stability notion: they use distributional stability for choosing a prior in a data-dependent manner.

2.2 Application: a PAC-Bayes bound for SVM with Gaussian randomization

For a Support Vector Machine (SVM) with feature map φ : X → H into a separable Hilbert space H, we may identify6 a linear classifier $c_w(\cdot) = \mathrm{sign}(\langle w, \varphi(\cdot)\rangle)$ with a vector w ∈ H. With this identification we can regard an SVM as a Hilbert space7 valued mapping that based on a training sample Sn learns a weight vector Wn = SVM(Sn) ∈ H. In this context, stability of the SVM's solution then reduces to stability of the learned weight vector.

To be specific, let SVM(Sn) be the SVM that regularizes the empirical risk over the sample Sn by solving the following optimization problem:

$$\arg\min_{w}\;\Big( \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell(c_w(X_i), Y_i) \Big). \qquad \text{(svm)}$$

Our stability coefficients in this case satisfy $\beta_n \le \frac{2}{\lambda n}$ (Example 2 of Bousquet and Elisseeff [2002], adapted to our setting). Then a direct application of our Theorem 2 together with a concentration argument for the SVM weight vector (see our Corollary 9 below) gives the following:

Corollary 3. Let Wn = SVM(Sn). Suppose that (once trained) the algorithm will randomize according to Gaussian8 distributions Q = N(Wn, σ²I). For any randomization variance σ² > 0, for any δ ∈ (0, 1), with probability 1 − 2δ we have

$$\mathrm{kl}^{+}\big(R(Q, P_n)\,\big\|\,R(Q, P)\big) \;\le\; \frac{\frac{2}{\lambda^2\sigma^2 n}\Big(1 + \sqrt{\frac{1}{2}\log\frac{1}{\delta}}\Big)^2 + \log\big(\frac{n+1}{\delta}\big)}{n} \,.$$

6 Riesz representation theorem is behind this identification.
7 H may be infinite-dimensional (e.g.
Gaussian kernel).
8 See Section 5 about the interpretation of Gaussian randomization for a Hilbert space valued algorithm.

In closing this section we mention that our main theorem is general in that it may be specialized to any Hilbert space valued algorithm. This covers any regularized ERM algorithm [Liu et al., 2017]. We applied it to SVMs, whose hypothesis sensitivity coefficients (as in our Definition 1) are known. It can be argued that neural networks (NNs) fall under this framework as well. An appealing future research direction, with deep learning in view, is then to figure out the sensitivity coefficients of NNs trained by Stochastic Gradient Descent. Our main theorem could then be applied to provide non-vacuous bounds for the performance of NNs, which we believe is very much needed.

3 Comparison to other bounds

For reference we list several risk bounds (including ours). They are in the context of binary classification (Y = {−1, +1}). For clarity, risks under the 0-1 loss are denoted by R01 and risks with respect to the (clipped) hinge loss are denoted by Rhi. Bounds requiring a Lipschitz loss function do not apply to the 0-1 loss. However, the 0-1 loss is upper bounded by the hinge loss, allowing us to upper bound the risk with respect to the former in terms of the risk with respect to the latter. On the other hand, results requiring a bounded loss function do not apply to the regular hinge loss.
In those cases the clipped hinge loss is used, which enjoys boundedness and Lipschitz continuity.

3.1 P@EW: Our new instance-dependent PAC-Bayes bound

Our Corollary 3, with Q = N(Wn, σ²I), a Gaussian centered at Wn = SVM(Sn) with randomization variance σ², gives the following risk bound, which holds with probability 1 − 2δ:

$$\mathrm{kl}^{+}\big(R_{01}(Q, P_n)\,\big\|\,R_{01}(Q, P)\big) \;\le\; \frac{2}{\lambda^2\sigma^2 n^2}\Big(1 + \sqrt{\tfrac{1}{2}\log\tfrac{1}{\delta}}\Big)^2 + \frac{1}{n}\log\frac{n+1}{\delta} \,.$$

As will be clear from the proof (see Section 4 below), this bound is obtained from the PAC-Bayes bound (Theorem 4) using a prior centered at the expected weight.

3.2 P@O: Prior at the origin PAC-Bayes bound

The PAC-Bayes bound of Theorem 4, again with Q = N(Wn, σ²I), gives the following risk bound, which holds with probability 1 − δ:

$$\forall \sigma^2, \quad \mathrm{kl}^{+}\big(R_{01}(Q, P_n)\,\big\|\,R_{01}(Q, P)\big) \;\le\; \frac{1}{2\sigma^2 n}\|W_n\|^2 + \frac{1}{n}\log\frac{n+1}{\delta} \,.$$

3.3 Bound of Liu et al. [2017]

From Corollary 1 of Liu et al. [2017] (but with λ as in formulation (svm)) we get the following risk bound, which holds with probability 1 − 2δ:

$$R_{01}(W_n, P) \;\le\; R_{\mathrm{hi}}(W_n, P) \;\le\; R_{\mathrm{hi}}(W_n, P_n) + \frac{8}{\lambda n}\sqrt{2\log\tfrac{2}{\delta}} + \sqrt{\tfrac{1}{2n}\log\tfrac{1}{\delta}} \,.$$

We use Corollary 1 of Liu et al. [2017] with B = 1, L = 1 and M = 1 (clipped hinge loss).

3.4 Bound of Bousquet and Elisseeff [2002]

From Example 2 of Bousquet and Elisseeff [2002] (but with λ as in formulation (svm)) we get the following risk bound, which holds with probability 1 − δ:

$$R_{01}(W_n, P) \;\le\; R_{\mathrm{hi}}(W_n, P) \;\le\; R_{\mathrm{hi}}(W_n, P_n) + \frac{2}{\lambda n} + \Big(1 + \frac{4}{\lambda}\Big)\sqrt{\tfrac{1}{2n}\log\tfrac{1}{\delta}} \,.$$

We use Example 2 and Theorem 17 (based on Theorem 12) of Bousquet and Elisseeff [2002] with κ = 1 (normalized kernel) and M = 1 (clipped hinge loss).

In Appendix C below there is a list of different SVM formulations, and how to convert between them. We found it useful when implementing code for experiments.

There are obvious differences in the nature of these bounds: the last two (Liu et al.
[2017] and\nBousquet and Elisseeff [2002]) are risk bounds for the (un-randomized) classi\ufb01ers, while the \ufb01rst\ntwo (P@EW, P@O) give an upper bound on the binary KL-divergence between the average risks\n(empirical to theoretical) of the randomized classi\ufb01ers. Of course inverting the KL-divergence we\nget a bound for the average theoretical risk. Also, the \ufb01rst two bounds have an extra parameter, the\nrandomization variance (2), that can be optimized. Note that P@O bound is not based on stability,\nwhile the other three bounds are based on stability notions. Next let us comment on how these\nbounds compare quantitatively.\nOur P@EW bound and the P@O bound are similar except for the \ufb01rst term on the right hand side.\nThis term comes from the KL-divergence between the Gaussian distributions. Our P@EW bound\u2019s\n\ufb01rst term improves with larger values of , which in turn penalize the norm of the weight vector\nof the corresponding SVM, resulting in a small \ufb01rst term in P@O bound. Note that P@O bound\nis equivalent to the setting of Q = N (\u00b5Wn/kWnk, 2I), a Gaussian with center in the direction\nof Wn, at distance \u00b5 > 0 from the origin (as discussed by Langford [2005] and implemented by\nParrado-Hern\u00b4andez et al. [2012]).\nThe \ufb01rst term on the right hand side of our P@EW bound comes from the concentration of the weight\n(see our Corollary 9). Lemma 1 of Liu et al. [2017] implies a similar concentration inequality for\nthe weight vector, but it is not hard to see that our concentration bound is slightly better.\nFinally, in the experiments we compare our P@EW bound with Bousquet and Elisseeff [2002].\n\n4 Proofs\n\nAs we said before, the proof of our Theorem 2 combines stability of the learned hypothesis (in the\nsense of our De\ufb01nition 1) and a well-known PAC-Bayes bound, quoted next for reference:\nTheorem 4. (PAC-Bayes bound) Consider a learning algorithm A : [n(X\u21e5Y )n !C . 
For any Q′ ∈ M1(C), and for any δ ∈ (0, 1), with probability 1 − δ we have

$$\forall Q \in M_1(\mathcal{C}), \quad \mathrm{kl}^{+}\big(R(Q, P_n)\,\big\|\,R(Q, P)\big) \;\le\; \frac{\mathrm{KL}(Q \| Q') + \log\big(\frac{n+1}{\delta}\big)}{n} \,.$$

The probability is over the generation of the training sample Sn ∼ Pⁿ.

This is borrowed from Langford [2005], originally Theorem 1 of Seeger [2002]. See also Theorem 2.1 of Germain et al. [2009]. To use the PAC-Bayes bound, we will use Q′ = N(E[A(Sn)], σ²I) and Q = N(A(Sn), σ²I), a Gaussian distribution centered at the expected output and a Gaussian (posterior) distribution centered at the random output A(Sn), both with covariance operator σ²I. The KL-divergence between those Gaussians scales with ‖A(Sn) − E[A(Sn)]‖². More precisely:

$$\mathrm{KL}(Q \| Q') = \frac{1}{2\sigma^2}\,\big\| A(S_n) - \mathbb{E}[A(S_n)] \big\|^2 \,.$$

Therefore, bounding ‖A(Sn) − E[A(Sn)]‖ will give (via the PAC-Bayes bound of Theorem 4 above) a corresponding bound on the binary divergence between the average empirical risk R(Q, Pn) and the average theoretical risk R(Q, P) of the randomized classifier Q. Hypothesis stability (in the form of our Definition 1) implies a concentration inequality for ‖A(Sn) − E[A(Sn)]‖. This is done in our Corollary 8 (see Section 4.3 below) and completes the circle of ideas to prove our main theorem. The proof of our concentration inequality is based on an extension of the bounded differences theorem of McDiarmid to vector-valued functions, discussed next.

4.1 McDiarmid's inequality for real-valued functions of the sample

To shorten the notation let's present the training sample as Sn = (Z1, . . . , Zn) where each example Zi is a random variable taking values in the (measurable) space Z. We quote a well-known theorem:

Theorem 5. (McDiarmid inequality) Let Z1, . . . , Zn be independent Z-valued random variables, and f : Zⁿ → R a real-valued function such that for each i and for each list of 'complementary' arguments z1, . . . , zi−1, zi+1, . . .
, zn we have

$$\sup_{z_i, z'_i} \big| f(z_{1:i-1}, z_i, z_{i+1:n}) - f(z_{1:i-1}, z'_i, z_{i+1:n}) \big| \;\le\; c_i \,.$$

Then for every ε > 0, $\Pr\big\{ f(Z_{1:n}) - \mathbb{E}[f(Z_{1:n})] > \varepsilon \big\} \le \exp\Big( \frac{-2\varepsilon^2}{\sum_{i=1}^{n} c_i^2} \Big)$.

McDiarmid's inequality applies to a real-valued function of independent random variables. Next we present an extension to vector-valued functions of independent random variables. The proof follows the steps of the proof of the classic result above, but we have not found this result in the literature, hence we include the details.

4.2 McDiarmid's inequality for vector-valued functions of the sample

Let Z1, . . . , Zn be independent Z-valued random variables and f : Zⁿ → H a function into a separable Hilbert space. We will prove that bounded differences in norm9 implies concentration of f(Z1:n) around its mean in norm, i.e., that ‖f(Z1:n) − E f(Z1:n)‖ is small with high probability. Notice that McDiarmid's theorem can't be applied directly to f(Z1:n) − E f(Z1:n) when f is vector-valued. We will apply McDiarmid to the real-valued ‖f(Z1:n) − E f(Z1:n)‖, which will give an upper bound for ‖f − E f‖ in terms of E‖f − E f‖. The next lemma upper bounds E‖f − E f‖ for a function f with bounded differences in norm. Its proof is in Appendix A.

Lemma 6. Let Z1, . . . , Zn be independent Z-valued random variables, and f : Zⁿ → H a function into a Hilbert space H satisfying the bounded differences property: for each i and for each list of 'complementary' arguments z1, . . . , zi−1, zi+1, . . . , zn we have

$$\sup_{z_i, z'_i} \big\| f(z_{1:i-1}, z_i, z_{i+1:n}) - f(z_{1:i-1}, z'_i, z_{i+1:n}) \big\| \;\le\; c_i \,.$$

Then $\mathbb{E}\big\| f(Z_{1:n}) - \mathbb{E}[f(Z_{1:n})] \big\| \le \sqrt{\sum_{i=1}^{n} c_i^2}$.

If the vector-valued function f(z1:n) has bounded differences in norm (as in Lemma 6) and C ∈ H is any constant, then the real-valued function ‖f(z1:n) − C‖ has the bounded differences property (as in McDiarmid's theorem).
In particular this is true for ‖f(z1:n) − E f(Z1:n)‖ (notice that E f(Z1:n) is constant over replacing Zi by an independent copy Z′i), so applying McDiarmid's inequality to it, combining with Lemma 6, we get the following theorem:

Theorem 7. Under the assumptions of Lemma 6, for any δ ∈ (0, 1), with probability 1 − δ we have

$$\big\| f(Z_{1:n}) - \mathbb{E}[f(Z_{1:n})] \big\| \;\le\; \sqrt{\sum_{i=1}^{n} c_i^2} + \sqrt{\frac{\sum_{i=1}^{n} c_i^2}{2}\,\log\frac{1}{\delta}} \,.$$

Notice that the vector c1:n = (c1, . . . , cn) of difference bounds appears in the above inequality only through its Euclidean norm $\|c_{1:n}\|_2 = \sqrt{\sum_{i=1}^{n} c_i^2}$.

4.3 Stability implies concentration

The hypothesis sensitivity coefficients give concentration of the learned hypothesis:

Corollary 8. Let A be a Hilbert space valued algorithm. Suppose A has hypothesis sensitivity coefficient β_n at sample size n. Then for any δ ∈ (0, 1), with probability 1 − δ we have

$$\big\| A(S_n) - \mathbb{E}[A(S_n)] \big\| \;\le\; \sqrt{n}\,\beta_n \left( 1 + \sqrt{\tfrac{1}{2}\log\tfrac{1}{\delta}} \right).$$

This is a consequence of Theorem 7 since $c_i \le \beta_n$ for i = 1, . . . , n, hence $\|c_{1:n}\| \le \sqrt{n}\,\beta_n$. Last (not least) we deduce concentration of the weight vector Wn = SVM(Sn).

Corollary 9. Let Wn = SVM(Sn). Suppose that the kernel used by the SVM is bounded by B. For any λ > 0, for any δ ∈ (0, 1), with probability 1 − δ we have

$$\big\| W_n - \mathbb{E}[W_n] \big\| \;\le\; \frac{2B}{\lambda\sqrt{n}} \left( 1 + \sqrt{\tfrac{1}{2}\log\tfrac{1}{\delta}} \right).$$

Under these conditions we have hypothesis sensitivity coefficients $\beta_n \le \frac{2B}{\lambda n}$ (we follow Bousquet and Elisseeff [2002], Example 2 and Lemma 16, adapted to our setting). Then apply Corollary 8.

9 The Hilbert space norm, induced by the inner product of H.

5 Gaussian distributions over the Hilbert space of classifiers?

This section aims to provide a rigorous explanation for Gaussian randomization in Hilbert spaces, which has been used here and in several previous machine learning works. For instance in the setting of SVM classifiers with feature map φ : X → H, the output is a weight vector that lives in the Hilbert space H.
With the Gaussian kernel in mind, we are facing an in\ufb01nite-dimensional\nseparable H, which upon the choice of an orthonormal basis {e1, e2, . . .} can be identi\ufb01ed with the\nspace10 `2 \u21e2 RN of square summable sequences of real numbers, via the isometric isomorphism\nH! `2 that maps the vector w = P1i=1 wiei 2H to the sequence (w1, w2, . . .) 2 `2. Thus\nwithout loss of generality we may regard the feature map as : X! `2 \u21e2 RN.\nThe PAC-Bayes approach applied to SVMs says that instead of committing to the weight vector\nWn = SVM(Sn) we will randomize by choosing a fresh W 2H according to some probability\ndistribution on H for each prediction. Suppose the randomized classi\ufb01er is to be chosen according\nto a Gaussian distribution. Although it commonly appears in the literature, it is worth wondering\njust what is a Gaussian distribution over the space H = `2.\nTwo possibilities come to mind for the Gaussian random classi\ufb01er W : (1) according to a Gaussian\nmeasure on `2, say W \u21e0N (\u00b5, \u2303) with mean \u00b5 2 `2 and covariance operator \u2303 meeting the\nrequirements (positive, trace-class) for this to be a Gaussian measure on `2; or (2) according to a\nGaussian measure on the bigger RN, e.g. W \u21e0N (\u00b5, I) by which we mean the measure constructed\nas the product of a sequence N (\u00b5i, 1) of independent real-valued Gaussians with unit variance.\nThese two possibilities are mutually exclusive since the \ufb01rst choice gives a measure on RN whose\nmass is supported on `2, while the second choice leads to a measure on RN supported outside of `2.\nA good reference for this topic is Bogachev [1998].\n\nLet us go with the second choice: N (0, I) =N1i= N (0, 1), a \u2018standard Gaussian\u2019 on RN. This is\na legitimate probability measure on RN (by Kolmogorov\u2019s Extension theorem). 
But it is supported\noutside of `2, so when sampling a W 2 RN according to this measure, with probability one such W\nwill be outside of our feature space `2. Then we have to wonder about the meaning of hW,\u00b7i when\nW is not in the Hilbert space carrying this inner product.\nLet us write W = (\u21e01,\u21e0 2, . . .) a sequence of i.i.d. standard (real-valued) Gaussian random variables.\nLet x = (x1, x2, . . .) 2 `2, and consider the formal expression hx, Wi =P1i=1 xi\u21e0i. Notice that\nThen (see e.g. Bogachev [1998], Theorem 1.1.4) our formal object hx, Wi =P1i=1 xi\u21e0i is actually\n\nwell-de\ufb01ned in the sense that the series is convergent almost surely (i.e. with probability one),\nalthough as we pointed out such W is outside `2.\n\nE[|xi\u21e0i|2] =\n\n|xi|2 < 1 .\n\n1Xi=1\n\n1Xi=1\n\n5.1 Predicting with the Gaussian random classi\ufb01er\nLet Wn = SVM(Sn) be the weight vector found by running SVM on the sample Sn. We write it as\n\ni=1 \u21b5iYi(Xi). Let \uf8ff(\u00b7,\u00b7) be the kernel doing the \u201ckernel trick.\u201d\n\nWn =Pn\nAlso as above let W be a Gaussian random vector in RN, and write it as W = P1j=1 \u21e0jej with\n\u21e01,\u21e0 2, . . . i.i.d. standard Gaussians. As usual ej stands for the canonical unit vectors having a 1 in\nthe jth coordinate and zeros elsewhere.\nFor an input x 2X with corresponding feature vector (x) 2H , we predict with\n\n\u21b5iYi\uf8ff(Xi, x) +\n\n\u21e0jhej, (x)i .\n\n1Xj=1\n\n(hej, (x)i)2 = k(x)k2 ,\n\n1Xi=1\n\n7\n\nThis is well-de\ufb01ned since\n\nhWn + W, (x)i =\n\nnXi=1\nE[(\u21e0jhej, (x)i)2] =\n\n1Xi=1\n\nso the seriesP1j=1 \u21e0jhej, (x)i converges almost surely (Bogachev [1998], Theorem 1.1.4).\n\n10Just to be sure: RN stands for the set of all in\ufb01nite sequences of real numbers.\n\n\fFigure 1: Tightness of P@O bound on PIM (left) and RIN (right) shown as the difference between\nthe bound and the test error of the underlying randomized classi\ufb01er. 
Smaller values are preferred.

Figure 2: Tightness of P@EW bound (the bound derived here) on PIM (left) and RIN (right) shown as the difference between the bound and the test error of the underlying randomized classifier. Smaller values are preferred.

6 Experiments

The purpose of the experiments was to explore the strengths and potential weaknesses of our new bound in relation to the previous alternatives, as well as to explore the bound's ability to help model selection. For this, to facilitate comparisons, taking the setup of Parrado-Hernández et al. [2012], we experimented with the five UCI datasets described there. However, we present results for PIM and RIN only, as the results on the other datasets mostly followed the results on these and in a way these two datasets are the most extreme. In particular, they are the smallest and largest with dimensions 768 × 8 (768 examples, and 8 dimensional feature space), and 7200 × 20, respectively.

Model and data preparation We used an offset-free SVM classifier with a Gaussian RBF kernel $\kappa(x, y) = \exp(-\|x - y\|_2^2 / (2\gamma^2))$ with RBF width parameter γ > 0. The SVM used the so-called standard SVM-C formulation which multiplies the total (hinge) loss by C > 0; the conversion to our formulation (svm) is given by C = 1/(λn), where n is the number of training examples and λ > 0 is the penalty factor.
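The conversion C = 1/(λn), equivalently λ = 1/(Cn), is simple enough to state as a helper; a small sketch (our own, with hypothetical example values):

```python
def c_to_lambda(C, n):
    """Penalty factor lambda in formulation (svm) for a given SVM-C value."""
    return 1.0 / (C * n)

def lambda_to_c(lam, n):
    """SVM-C value corresponding to penalty factor lambda, via C = 1/(lambda*n)."""
    return 1.0 / (lam * n)

n = 768  # e.g. the PIM dataset size mentioned above
lam = c_to_lambda(C=1.0, n=n)
assert abs(lambda_to_c(lam, n) - 1.0) < 1e-12  # round trip recovers C
```

Keeping this conversion explicit avoids mixing up the two parameterizations when comparing bounds stated in terms of λ against software configured in terms of C.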
The datasets were split into a training and a test set using the train_test_split method of scikit-learn, keeping 80% of the data for training and 20% for testing.

Model parameters Following the procedure suggested in Section 2.3.1 of Chapelle and Zien [2005], we set up a geometric 7 × 7 grid over the (C, γ)-parameter space, where C ranges between 2⁻⁸C0 and 2²C0 and γ ranges between 2⁻³γ0 and 2³γ0; here γ0 is the median of the Euclidean distance between pairs of data points of the training set, and, given γ0, C0 is obtained as the reciprocal value of the empirical variance of data in feature space underlying the RBF kernel with width γ0. The grid size was selected for economy of computation. The grid lower and upper bounds for γ were ad-hoc, though they were inspired by the literature; for C, we enlarged the lower range to focus on the region of the parameter space where the stability-based bounds have a better chance to be effective: in particular, the stability-based bounds grow with C in a linear fashion, with a coefficient that was empirically observed to be close to one.

Computations For each of the (C, γ) pairs on the said grid, we trained an SVM model using a Python implementation of the SMO algorithm of Platt [1999], adjusted to SVMs with no offset (Steinwart and Christmann [2008] argue that "the offset term has neither a known theoretical nor an empirical advantage" for the Gaussian RBF kernel). We then calculated various bounds using the obtained model, as well as the corresponding test error rates (recall that the randomized classifiers' test error is different than the test error of the SVM model that uses no randomization). The bounds compared were the two mentioned hinge-loss based bounds: the bound by Liu et al. [2017] and that of Bousquet and Elisseeff [2002].
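The grid initialization can be reproduced in a few lines. Below is our own sketch of it, under stated assumptions (the exponent ranges and the exact Chapelle-Zien recipe details are our reading, not guaranteed to match the paper's code): γ0 is the median pairwise Euclidean distance in the training set, and C0 is the reciprocal of the empirical variance of the data in the feature space of the RBF kernel with width γ0, which for a kernel satisfying κ(x, x) = 1 equals one minus the mean of the kernel matrix.

```python
import numpy as np

def rbf_grid_init(X):
    """Median-distance width gamma0, and C0 = 1 / (empirical variance in feature space)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    iu = np.triu_indices(n, k=1)
    gamma0 = np.median(np.sqrt(d2[iu]))
    K = np.exp(-d2 / (2 * gamma0 ** 2))                   # RBF kernel matrix at width gamma0
    var_feat = K.trace() / n - K.mean()                   # (1/n) sum k(xi,xi) - mean(K)
    return gamma0, 1.0 / var_feat

def log_grid(center, lo_exp, hi_exp, num=7):
    """Geometric grid center * 2**e for num exponents e between lo_exp and hi_exp."""
    return [center * 2.0 ** e for e in np.linspace(lo_exp, hi_exp, num)]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))                          # hypothetical training inputs
gamma0, C0 = rbf_grid_init(X)
C_grid = log_grid(C0, -8, 2)        # assumed range [2^-8 C0, 2^2 C0]
gamma_grid = log_grid(gamma0, -3, 3)  # assumed range [2^-3 gamma0, 2^3 gamma0]
assert gamma0 > 0 and C0 > 0 and len(C_grid) == 7
```

The 7 × 7 grid is then the Cartesian product of `C_grid` and `gamma_grid`.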
In addition we calculated the P@O and (our) P@EW bound. When these latter were calculated, we optimized the randomization variance parameter σ²_noise by minimizing the error estimate obtained from the respective bound (the KL divergence was inverted numerically). Further details of this can be found in Appendix D.

Results As explained earlier, our primary interest is to explore the various bounds' strengths and weaknesses. In particular, we are interested in their tightness, as well as their ability to support model selection. As the qualitative results were insensitive to the split, results are shown for a single "random" (arbitrary) split only.

Tightness The hinge loss based bounds gave trivial bounds over almost all pairs of (C, γ). Upon investigating this we found that it is because the hinge loss takes much larger values than the training error rate unless C takes large values (cf. Fig. 3 in Appendix D). However, for large values of C, both of those bounds are vacuous. In general, the stability based bounds (Liu et al. [2017], Bousquet and Elisseeff [2002] and our bound) are sensitive to large values of C. Fig. 1 shows the difference between the P@O bound and the test error of the underlying respective randomized classifiers as a function of (C, γ), while Fig. 2 shows the difference between the P@EW bound and the test error of the underlying randomized classifier. (Figs. 7 and 9 in the appendix show the test errors for these classifiers, while Figs. 6 and 8 show the bounds.) The meticulous reader may worry that, on the smaller dataset, PIM, the difference shown for P@O appears to be sometimes negative. As it turns out, this is due to the randomness of the test error: once we add a confidence correction that accounts for the randomness of the small test set (ntest = 154), this difference disappears.
From the figures, the most obvious difference between the bounds is that the P@EW bound is sensitive to the value of C and becomes loose for larger values of C. This is expected: as noted earlier, stability-based bounds, of which P@EW is an instance, are sensitive to C. The P@O bound shows a weaker dependence on C, if any. In the appendix, Fig. 10 shows the advantage (or disadvantage) of the P@EW bound over the P@O bound. From this figure we can see that on PIM, P@EW is to be preferred almost uniformly for small values of C (C ≤ 0.5), while on RIN, the advantage of P@EW is limited both to smaller values of C and to a certain range of the RBF width. Two comments are in order in connection to this: (i) We find it remarkable that a stability-based bound can be competitive with the P@O bound, which is known as one of the best bounds available. (ii) While comparing bounds is interesting for learning about their qualities, the bounds can also be used together (e.g., at the price of an extra union bound).
Model selection To evaluate a bound's capability to help model selection, it is worth comparing the correlation between the bound and the test error of the underlying classifiers. Comparing Figs. 6 and 7 with Figs. 8 and 9 suggests that the behavior of the P@EW bound (at least for small values of C) follows more closely the behavior of the corresponding test error surface. This is particularly visible on RIN, where the P@EW bound seems able to pick better values for both C and σ, leading to a much smaller test error (around 0.12) than what one can obtain by using the P@O bound.

7 Discussion

We have developed a PAC-Bayes bound for randomized classifiers. We proceeded by investigating the stability of the hypothesis learned by a Hilbert space valued algorithm, a special case of which is the SVM.
We applied our main theorem to SVMs, leading to our P@EW bound, and we compared it to other stability-based bounds and to a previously known PAC-Bayes bound. The main finding is that P@EW appears to be the first nontrivial bound that uses (uniform) hypothesis stability. Our work can be viewed as contributing to a line of research that aims to develop 'self-bounding algorithms' (Freund [1998], Langford and Blum [2003]), in the sense that besides producing a predictor the algorithm also creates a performance certificate based on the available data.

Acknowledgements

Omar Rivasplata is sponsored by DeepMind via an Overseas Impact Studentship to undertake graduate studies at the UCL Department of Computer Science. Csaba Szepesvári gratefully acknowledges the Alberta Machine Intelligence Institute (Amii), with funding from Alberta Innovates – Technology Futures and from the Natural Sciences and Engineering Research Council of Canada. Shiliang Sun is supported by NSFC Project 61673179 and Shanghai Knowledge Service Platform Project ZF1213. John Shawe-Taylor acknowledges support of the UK Defence Science and Technology Laboratory (Dstl) and the Engineering and Physical Sciences Research Council (EPSRC) under grant EP/R018693/1, part of the collaboration between US DOD, UK MOD and UK EPSRC under the Multidisciplinary University Research Initiative (MURI). This work was done in part while John Shawe-Taylor was visiting the Simons Institute for the Theory of Computing at UC Berkeley.

References

Karim Abou-Moustafa and Csaba Szepesvári. An a priori exponential tail bound for k-folds cross-validation. ArXiv e-prints, 2017.

Vladimir I. Bogachev. Gaussian Measures. American Mathematical Society, 1998.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

Olivier Catoni.
PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Technical report, Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2007.

Alain Celisse and Benjamin Guedj. Stability revisited: new generalisation bounds for the leave-one-out. arXiv preprint arXiv:1608.06412, 2016.

Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density separation. In AISTATS, 2005.

Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI, 2017.

Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via differential privacy. arXiv preprint arXiv:1802.09583, 2018.

Yoav Freund. Self bounding learning algorithms. In Proc. of the 11th Annual Conference on Computational Learning Theory, pages 247–258. ACM, 1998.

Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In Proc. of the 26th International Conference on Machine Learning, pages 353–360. ACM, 2009.

Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand, and Jean-Francis Roy. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research, 16:787–860, 2015.

Aryeh Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In Proc. of the 31st International Conference on Machine Learning, pages 28–36, 2014.

John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(Mar):273–306, 2005.

John Langford and Avrim Blum. Microchoice bounds and self bounding learning algorithms. Machine Learning, 51(2):165–179, 2003.

Guy Lever, François Laviolette, and John Shawe-Taylor.
Distribution-dependent PAC-Bayes priors. In International Conference on Algorithmic Learning Theory, pages 119–133. Springer, 2010.

Tongliang Liu, Gábor Lugosi, Gergely Neu, and Dacheng Tao. Algorithmic stability and hypothesis complexity. In Proc. of the 34th International Conference on Machine Learning, pages 2159–2167, 2017.

Ben London. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2931–2940, 2017.

Ben London, Bert Huang, Ben Taskar, and Lise Getoor. Collective stability in structured prediction: Generalization from one example. In Proc. of the 30th International Conference on Machine Learning, pages 828–836, 2013.

Gábor Lugosi and Miroslaw Pawlak. On the posterior-probability estimate of the error rate of nonparametric classification rules. IEEE Transactions on Information Theory, 40(2):475–481, 1994.

David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999a.

David A. McAllester. PAC-Bayesian model averaging. In Proc. of the 12th Annual Conference on Computational Learning Theory, pages 164–170. ACM, 1999b.

Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, and Shiliang Sun. PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research, 13:3507–3531, 2012.

John C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 185–208. MIT Press, Cambridge, MA, 1999.

Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.

John Shawe-Taylor and Robert C. Williamson. A PAC analysis of a Bayesian estimator. In Proc. of the 10th Annual Conference on Computational Learning Theory, pages 2–9. ACM, 1997.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.