{"title": "PAC-Bayes Un-Expected Bernstein Inequality", "book": "Advances in Neural Information Processing Systems", "page_first": 12202, "page_last": 12213, "abstract": "We present a new PAC-Bayesian generalization bound. Standard bounds contain a $\\sqrt{L_n \\cdot \\KL/n}$ complexity term which dominates unless $L_n$, the empirical error of the learning algorithm's randomized predictions, vanishes. We manage to replace $L_n$ by a term which vanishes in many more situations, essentially whenever the employed learning algorithm is sufficiently stable on the dataset at hand. Our new bound consistently beats state-of-the-art bounds both on a toy example and on UCI datasets (with large enough $n$). Theoretically, unlike existing bounds, our new bound can be expected to converge to $0$ faster whenever a Bernstein/Tsybakov condition holds, thus connecting PAC-Bayesian generalization and {\\em excess risk\\/} bounds---for the latter it has long been known that faster convergence can be obtained under Bernstein conditions. Our main technical tool is a new concentration inequality which is like Bernstein's but with $X^2$ taken outside its expectation.", "full_text": "PAC-Bayes Un-Expected Bernstein Inequality\n\nZakaria Mhammedi\n\nThe Australian National University and Data61\n\nzak.mhammedi@anu.edu.au\n\nPeter D. Gr\u00fcnwald\n\nCWI and Leiden University\n\npdg@cwi.nl\n\nBenjamin Guedj\n\nInria and University College London\n\nbenjamin.guedj@inria.fr\n\nAbstract\n\n\u0001\n\nWe present a new PAC-Bayesian generalization bound. Standard bounds contain a\n\nLn\u22c5 KL~n complexity term which dominates unless Ln, the empirical error of\n\nthe learning algorithm\u2019s randomized predictions, vanishes. We manage to replace\nLn by a term which vanishes in many more situations, essentially whenever the\nemployed learning algorithm is suf\ufb01ciently stable on the dataset at hand. 
Our new bound consistently beats state-of-the-art bounds both on a toy example and on UCI datasets (with large enough $n$). Theoretically, unlike existing bounds, our new bound can be expected to converge to $0$ faster whenever a Bernstein/Tsybakov condition holds, thus connecting PAC-Bayesian generalization and excess risk bounds---for the latter it has long been known that faster convergence can be obtained under Bernstein conditions. Our main technical tool is a new concentration inequality which is like Bernstein's but with $X^2$ taken outside its expectation.

1 Introduction

PAC-Bayesian generalization bounds [1, 8, 9, 17, 18, 20, 28, 29, 30] have recently obtained renewed interest within the context of deep neural networks [14, 34, 42]. In particular, Zhou et al. [42] and Dziugaite and Roy [14] showed that, by extending an idea due to Langford and Caruana [23], one can obtain nontrivial (but still not very strong) generalization bounds on real-world datasets such as MNIST and ImageNet. Since nontrivial generalization bounds are even harder to obtain using alternative methods, there remains a strong interest in improved PAC-Bayesian bounds. In this paper, we provide a considerably improved bound whenever the employed learning algorithm is sufficiently stable on the given data.

Most standard bounds have an order $\sqrt{L_n \cdot \mathrm{COMP}_n/n}$ term on the right, where $\mathrm{COMP}_n$ represents model complexity in the form of a Kullback-Leibler divergence between a prior and a posterior, and $L_n$ is the posterior expected loss on the training sample. The latter only vanishes if there is a sufficiently large neighborhood around the "center" of the posterior at which the training error is $0$. In the two papers [14, 42] mentioned above, this is not the case. For example, the various deep net experiments reported by Dziugaite et al. [14, Table 1] with $n = 150000$ all have $L_n$ around $0.03$, so that $\sqrt{\mathrm{COMP}_n/n}$ is multiplied by a non-negligible $\sqrt{0.03} \approx 0.17$. 
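The arithmetic behind this observation is easy to reproduce. The following sketch uses the $L_n$ and $n$ quoted above together with a purely hypothetical value for $\mathrm{COMP}_n$ (the actual complexity values of [14] are not reproduced here):

```python
import math

# L_n = 0.03 and n = 150000 are quoted in the text above;
# comp_n below is a hypothetical complexity value, for illustration only.
n = 150_000
L_n = 0.03
comp_n = 30_000

multiplier = math.sqrt(L_n)                  # the non-vanishing factor
dominant_term = math.sqrt(L_n * comp_n / n)  # order of the sqrt(L_n * COMP_n / n) term
print(round(multiplier, 3))   # 0.173
print(round(dominant_term, 4))
```

Even with a moderate complexity value, the term stays bounded away from zero as long as $L_n$ does.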
Furthermore, they have $\mathrm{COMP}_n$ increasing substantially with $n$, making $\sqrt{L_n \cdot \mathrm{COMP}_n/n}$ converge to $0$ at a rate slower than $1/\sqrt{n}$.

In this paper, we provide a bound (Theorem 3) with $L_n$ replaced by a second-order term $V_n$---a term which will go to $0$ in many cases in which $L_n$ does not. This can be viewed as an extension of an earlier second-order approach by Tolstikhin and Seldin [39] (TS from now on); they also replace $L_n$, but by a term that, while usually smaller than $L_n$, will tend to be larger than our $V_n$. Specifically, as they write, in classification settings (our primary interest), their replacement is not much smaller than $L_n$ itself. Instead, our $V_n$ can be very close to $0$ in classification even when $L_n$ is large. While the TS bound is based on an "empirical" Bernstein inequality due to [27]¹, our bound is based on a different modification of Bernstein's moment inequality in which the occurrence of $X^2$ is taken outside of its expectation (see Lemma 13). We note that an empirical Bernstein inequality was introduced in [4, Theorem 1], and the name "Empirical Bernstein" was coined in [32].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The term $V_n$ in our bound goes to $0$---and our bound improves on existing bounds---whenever the employed learning algorithm is relatively stable on the given data; for example, if the predictor learned on an initial segment (say, 50%) of the dataset performs similarly (i.e. assigns similar losses to the same samples) to the predictor based on the full data. This improvement is reflected in our experiments where, except for very small sample sizes, we consistently outperform existing bounds both on a toy classification problem with label noise and on standard UCI datasets [13]. 
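This loss-similarity notion of stability can be illustrated with a minimal sketch (a hypothetical mean-predictor toy model of ours; the paper's actual $V_n$ is defined in Corollary 1): we compare, sample by sample, the losses assigned by a predictor fit on the first half of the data with those of a predictor fit on the full data.

```python
import random

random.seed(0)

def fit_mean(ys):
    # toy "learning algorithm": predict the average label
    return sum(ys) / len(ys)

def sq_loss(pred, y):
    return (pred - y) ** 2

# i.i.d. data from a noisy but stable source
ys = [random.gauss(0.5, 0.1) for _ in range(2000)]
m = len(ys) // 2

h_half = fit_mean(ys[:m])   # predictor from the first half
h_full = fit_mean(ys)       # predictor from all the data

# average squared difference between the losses the two predictors
# assign to the same samples: small when the algorithm is stable,
# even though the losses themselves need not be small
instability = sum((sq_loss(h_half, y) - sq_loss(h_full, y)) ** 2
                  for y in ys) / len(ys)
avg_loss = sum(sq_loss(h_full, y) for y in ys) / len(ys)
print(instability < avg_loss)  # loss similarity, not loss smallness
```

Here the per-sample losses stay bounded away from zero (the labels are noisy), yet the half-data and full-data predictors assign nearly identical losses, which is exactly the regime in which a $V_n$-type term is small while $L_n$ is not.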
Of course, the importance of stability for generalization has been recognized before in landmark papers such as [7, 33, 38], and recently also in the context of PAC-Bayes bounds [35]. However, the data-dependent stability notion $V_n$ occurring in our bound seems very different from any of the notions discussed in those papers.

Theoretically, a further contribution is that we connect our PAC-Bayesian generalization bound to excess risk bounds; we show (Theorem 7) that our generalization bound can be of comparable size to excess risk bounds, up to an irreducible remainder term that is independent of model complexity. The excess risk bound that can be attained for any given problem depends both on the complexity of the set of predictors $\mathcal{H}$ and on the inherent "easiness" of the problem. The latter is often measured in terms of the exponent $\beta \in [0,1]$ of the Bernstein condition that holds for the given problem [6, 15, 19], which generalizes the exponent in the celebrated Tsybakov margin condition [5, 40]. The larger $\beta$, the faster the excess risk converges. In Section 5, we essentially show that the rate at which the $\sqrt{V_n \cdot \mathrm{COMP}_n/n}$ term goes to $0$ can also be bounded by a quantity that gets smaller as $\beta$ gets larger. In contrast, previous PAC-Bayesian bounds do not have such a property.

Contents. In Section 2, we introduce the problem setting and provide a first, simplified version of our main theorem. Section 3 gives our main bound. Experiments are presented in Section 4, followed by theoretical motivation in Section 5. The proof of our main bound is provided in Section 6, where we first present the convenient ESI language for expressing stochastic inequalities, and (our main tool) the unexpected Bernstein lemma (Lemma 13). The paper ends with an outlook for future work.

2 Problem Setting, Background, and Simplified Version of Our Bound

Setting and Notation. Let $Z_1, \ldots$
, $Z_n$ be i.i.d. random variables in some set $\mathcal{Z}$, with $Z_1 \sim D$. Let $\mathcal{H}$ be a hypothesis set and $\ell : \mathcal{H} \times \mathcal{Z} \to [0, b]$, $b > 0$, be a bounded loss function, such that $\ell_h(Z) := \ell(h, Z)$ denotes the loss that hypothesis $h$ makes on $Z$. We call any such tuple $(D, \ell, \mathcal{H})$ a learning problem. For a given hypothesis $h \in \mathcal{H}$, we denote its risk (expected loss on a test sample of size 1) by $L(h) := \mathbf{E}_{Z \sim D}[\ell_h(Z)]$ and its empirical error by $L_n(h) := \frac{1}{n} \sum_{i=1}^{n} \ell_h(Z_i)$. For any distribution $P$ on $\mathcal{H}$, we write $L(P) := \mathbf{E}_{h \sim P}[L(h)]$ and $L_n(P) := \mathbf{E}_{h \sim P}[L_n(h)]$.

For any $m \in [n]$ and any variables $Z_1, \ldots, Z_n$ in $\mathcal{Z}$, we denote $Z_{\le m} := (Z_1, \ldots, Z_m)$ and $Z_{>m} := (Z_{m+1}, \ldots, Z_n)$, with the convention that $Z_{>n} = \emptyset$. As is customary in PAC-Bayesian works, a learning algorithm is a (computable) function $P : \bigcup_{i=1}^{n} \mathcal{Z}^i \to \mathcal{P}(\mathcal{H})$ that, upon observing input $Z_{\le n} \in \mathcal{Z}^n$, outputs a "posterior" distribution $P(Z_{\le n})(\cdot)$ on $\mathcal{H}$. The posterior could be a Gibbs or a generalized-Bayesian posterior, but also the output of other algorithms. When no confusion can arise, we abbreviate $P(Z_{\le n})$ to $P_n$, and we denote by $P_0$ any "prior" distribution, i.e. a distribution on $\mathcal{H}$ which has to be specified in advance, before seeing the data; we use the convention $P(\emptyset) = P_0$. Finally, we denote the Kullback-Leibler divergence between $P_n$ and $P_0$ by $\mathrm{KL}(P_n \| P_0)$.

Comparing Bounds. Both existing state-of-the-art PAC-Bayes bounds and ours essentially take the following form: there exist constants $P, A, C \ge 0$ and a function $\varepsilon_{\delta,n}$, logarithmic in $1/\delta$ and $n$, such that for all $\delta \in \,]0,1[$, with probability at least $1 - \delta$ over the sample $Z_1, \ldots, Z_n$, it holds that

$$L(P_n) - L_n(P_n) \le P \cdot \sqrt{\frac{R_n \cdot (\mathrm{COMP}_n + \varepsilon_{\delta,n})}{n}} + A \cdot \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{n} + C \cdot \sqrt{\frac{R'_n \cdot \varepsilon_{\delta,n}}{n}}, \qquad (1)$$

where $R_n, R'_n \ge 0$ are sample-dependent quantities which may differ from one bound to another. Existing classical bounds that after slight relaxations take on this form are due to Langford and Seeger [24, 37], Catoni [10], Maurer [26], and Tolstikhin and Seldin (TS) [39] (see the latter for a nice overview). In all these cases, $\mathrm{COMP}_n = \mathrm{KL}(P_n \| P_0)$, $R'_n = 0$, and---except for the TS bound---$R_n = L_n(P_n)$. For the TS bound, $R_n$ is equal to the empirical loss variance. Our bound in Theorem 3 also fits (1) (after a relaxation), but with considerably different choices for $\mathrm{COMP}_n$, $R'_n$, and $R_n$.

¹An alternative form of empirical Bernstein inequality appears in [41], based on an inequality due to [11].

Of special relevance in our experiments is the bound due to Maurer [26], which, as noted by TS [39], tightens the PAC-Bayes-kl inequality due to Seeger [36] and is one of the tightest known generalization bounds in the literature. It can be stated as follows: for $\delta \in \,]0,1[$, $n \ge 8$, and any learning algorithm $P$, with probability at least $1 - \delta$,

$$\mathrm{kl}(L(P_n), L_n(P_n)) \le \frac{\mathrm{KL}(P_n \| P_0) + \ln \frac{2\sqrt{n}}{\delta}}{n}, \qquad (2)$$

where $\mathrm{kl}$ is the binary Kullback-Leibler divergence. Applying the inequality $p \le q + \sqrt{2q \, \mathrm{kl}(p \| q)} + 2 \, \mathrm{kl}(p \| q)$ to (2) yields a bound of the form (1) (see [39] for more details). 
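In practice, kl-form bounds such as (2) are turned into explicit bounds on $L(P_n)$ by numerically inverting the binary KL divergence. A minimal bisection sketch (standard practice, not code from the paper):

```python
import math

def binary_kl(p, q):
    # kl(p || q) for Bernoulli means p, q, clipped away from {0, 1}
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_inverse_upper(p_hat, budget, tol=1e-9):
    # largest q >= p_hat with kl(p_hat || q) <= budget, by bisection
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(p_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return hi

# e.g. empirical error 0.03 and a right-hand side of 0.01 in (2)
bound = kl_inverse_upper(0.03, 0.01)
print(0.03 < bound < 1.0)
```

The returned value is the tightest upper bound on the true (randomized) error implied by the kl inequality for the given complexity budget.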
Note also that using Pinsker's inequality together with (2), one recovers McAllester's classical PAC-Bayesian bound [28]. We now present a simplified version of our bound in Theorem 3 below as a corollary.

Corollary 1. For any $1 \le m < n$ and any deterministic estimator $\hat{h} : \bigcup_{i=1}^{n} \mathcal{Z}^i \to \mathcal{H}$ (such as ERM), there exist $P, A, C > 0$ such that (1) holds with probability at least $1 - \delta$, with

$$\mathrm{COMP}_n := \mathrm{KL}(P_n \| P(Z_{\le m})) + \mathrm{KL}(P_n \| P(Z_{>m})),$$
$$R'_n := V'_n := \frac{1}{n} \Big( \sum_{i=1}^{m} \ell_{\hat{h}(Z_{>m})}(Z_i)^2 + \sum_{j=m+1}^{n} \ell_{\hat{h}(Z_{\le m})}(Z_j)^2 \Big),$$
$$R_n := V_n := \frac{1}{n} \, \mathbf{E}_{h \sim P_n} \Big[ \sum_{i=1}^{m} \big( \ell_h(Z_i) - \ell_{\hat{h}(Z_{>m})}(Z_i) \big)^2 + \sum_{j=m+1}^{n} \big( \ell_h(Z_j) - \ell_{\hat{h}(Z_{\le m})}(Z_j) \big)^2 \Big]. \qquad (3)$$

Like TS's and Catoni's bounds, but unlike McAllester's and Maurer's, our $\varepsilon_{\delta,n}$ grows as $\ln \frac{\ln n}{\delta}$.

Another difference is that our complexity term is a sum of two KL divergences, in which the prior (in this case $P(Z_{\le m})$ or $P(Z_{>m})$) is "informed"---when $m = n/2$, it is really the posterior based on half the sample. Our experiments confirm that this tends to be much smaller than $\mathrm{KL}(P_n \| P_0)$. While the idea to use part of the sample to create an informed prior is due to [2], we are the first to combine all parts (halves) into a single bound, which requires a novel technique. This technique can be applied to other existing bounds as well (see Section 3).

A larger difference between our bound and others lies in the fact that we have $R_n = V_n$ instead of the typical empirical error $R_n = L_n(P_n)$. Only TS [39] have an $R_n$ that is somewhat reminiscent of ours; in their case, $R_n = \mathbf{E}_{h \sim P_n}[\sum_{i=1}^{n} (\ell_h(Z_i) - L_n(h))^2]/(n-1)$ is the empirical loss variance. The crucial difference to our $V_n$ is that the empirical loss variance cannot be close to $0$ unless a sizeable $P_n$-posterior region of hypotheses $h$ has empirical error that is almost constant on most data instances. For classification with 0-1 loss, this is a strong condition, since the empirical loss variance is then equal to $n L_n(P_n)(1 - L_n(P_n))/(n-1)$, which is only close to $0$ if $L_n(P_n)$ is itself close to $0$ or $1$. In contrast, our $V_n$ can go to $0$ even if the empirical error and variance do not, as long as the learning algorithm is sufficiently stable. This can be witnessed in our experiments in Section 4. In Section 5, we argue more formally that under a Bernstein condition, the $\sqrt{V_n \cdot \mathrm{COMP}_n/n}$ term in our bound can be much smaller than $\sqrt{\mathrm{COMP}_n/n}$. Note, finally, that the term $V_n$ has a two-fold cross-validation flavor, but in contrast to a cross-validation error, for $V_n$ to be small it is sufficient that the losses are similar, not that they are small.

The price we pay for having $R_n = V_n$ in our bound is the right-most, irreducible remainder term in (1), of order at most $b/\sqrt{n}$. Note, however, that this term is decoupled from the complexity $\mathrm{COMP}_n$, and thus it is not affected by $\mathrm{COMP}_n$ growing with the "size" of $\mathcal{H}$. The following lemma gives a tighter bound (tighter than the $b/\sqrt{n}$ just mentioned) on the irreducible term:

Lemma 2. Suppose that the loss is bounded by 1 (i.e. 
$b = 1$) and that $n$ is even, and let $m = n/2$. For $\delta \in \,]0,1[$, $R'_n$ as in (3), and any estimator $\hat{h} : \bigcup_{i=1}^{n} \mathcal{Z}^i \to \mathcal{H}$, we have, with probability at least $1 - \delta$,

$$\sqrt{\frac{R'_n}{n}} \le \sqrt{\frac{2 \big( L(\hat{h}(Z_{>m})) + L(\hat{h}(Z_{\le m})) \big)}{n}} + \frac{4 \sqrt{\ln \frac{4}{\delta}}}{n}. \qquad (4)$$

Behind the proof of the lemma is an application of Hoeffding's and the empirical Bernstein inequality [27] (see Section C). Note that in the realizable setting, the first term on the RHS of (4) can be of order $O(1/n)$ with the right choice of estimator $\hat{h}$ (e.g. ERM). In this case (still in the realizable setting), our irreducible term would go to zero at the same rate as for other bounds, which have $R_n = L_n(P_n)$.

3 Main Bound

We now present our main result in its most general form. Let $\vartheta(\eta) := (-\ln(1 - \eta) - \eta)/\eta^2$ and $c_\eta := \eta \cdot \vartheta(\eta b)$, for $\eta \in \,]0, 1/b[$, where $b > 0$ is an upper bound on the loss $\ell$.

Theorem 3 (Main Theorem). Let $Z_1, \ldots, Z_n$ be i.i.d. with $Z_1 \sim D$. Let $m \in [0..n]$ and let $\pi$ be any distribution with support on a finite or countable grid $\mathcal{G} \subset \,]0, 1/b[$. For any $\delta \in \,]0,1[$ and any learning algorithms $P, Q : \bigcup_{i=1}^{n} \mathcal{Z}^i \to \mathcal{P}(\mathcal{H})$, we have, with probability at least $1 - \delta$,

$$L(P_n) \le L_n(P_n) + \inf_{\eta \in \mathcal{G}} \Big\{ c_\eta \cdot V_n + \frac{\mathrm{COMP}_n + 2 \ln \frac{1}{\delta \cdot \pi(\eta)}}{\eta \cdot n} \Big\} + \inf_{\nu \in \mathcal{G}} \Big[ c_\nu \cdot V'_n + \frac{\ln \frac{1}{\delta \cdot \pi(\nu)}}{\nu \cdot n} \Big], \qquad (5)$$

where $\mathrm{COMP}_n$, $V'_n$, and $V_n$ are the random variables defined by

$$\mathrm{COMP}_n := \mathrm{KL}(P_n \| P(Z_{\le m})) + \mathrm{KL}(P_n \| P(Z_{>m})), \qquad (6)$$
$$V'_n := \frac{1}{n} \Big( \sum_{i=1}^{m} \mathbf{E}_{h \sim Q(Z_{>i})} \big[ \ell_h(Z_i)^2 \big] + \sum_{j=m+1}^{n} \mathbf{E}_{h \sim Q(Z_{<j})} \big[ \ell_h(Z_j)^2 \big] \Big),$$
$$V_n := \frac{1}{n} \, \mathbf{E}_{h \sim P_n} \Big[ \sum_{i=1}^{m} \big( \ell_h(Z_i) - \mathbf{E}_{h' \sim Q(Z_{>i})}[\ell_{h'}(Z_i)] \big)^2 + \sum_{j=m+1}^{n} \big( \ell_h(Z_j) - \mathbf{E}_{h' \sim Q(Z_{<j})}[\ell_{h'}(Z_j)] \big)^2 \Big].$$

While the result holds for all $0 \le m \le n$, in the remainder of this paper we assume for simplicity that $n$ is even and that $m = n/2$. We will also be using the grid $\mathcal{G}$ and distribution $\pi$ defined by

$$\mathcal{G} := \Big\{ \frac{1}{2b}, \ldots, \frac{1}{2^K b} \Big\}, \quad K := \Big\lceil \log_2 \sqrt{\frac{n}{\ln \frac{1}{\delta}}} \, \Big\rceil, \quad \text{and } \pi \equiv \text{the uniform distribution over } \mathcal{G}. \qquad (7)$$

Roughly speaking, this choice of $\mathcal{G}$ ensures that the infima over $\eta$ and $\nu$ in (5) are attained within $[\min \mathcal{G}, \max \mathcal{G}]$. Using the relaxation $c_\eta \le \eta/2 + \eta^2 \cdot 11b/20$, for $\eta \le 1/(2b)$, in (5), and tuning $\eta$ and $\nu$ within the grid $\mathcal{G}$ defined in (7), leads to a bound of the form (1). Furthermore, we see that the expression of $V_n$ in Corollary 1 follows when $Q$ is chosen such that, for $1 \le i \le m < j \le n$, $Q(Z_{>i}) \equiv \delta(\hat{h}(Z_{>m}))$ and $Q(Z_{<j}) \equiv \delta(\hat{h}(Z_{\le m}))$, where $\delta(h)$ denotes the Dirac distribution at $h$. It is clear that Theorem 3 is considerably more general than its Corollary 1.

Online Estimators. For $j > m$, in the RHS sum of $V_n$, we could use posteriors $Q(Z_{<j})$ which converge to the final $\hat{h}(Z_{\le n})$ based on the full data. Doing this would likely improve our bounds, but we did not try it in our experiments since it is computationally demanding.

Informed Priors. Other bounds can also be modified to make use of "informed priors" from each half of the data; in this case, the $\mathrm{KL}(P_n \| P_0)$ term in these bounds can be replaced by $\mathrm{COMP}_n$ defined in (6). As revealed by additional experiments in Appendix H, doing this substantially improves the corresponding bounds when the learning algorithm is sufficiently stable. Here we show how this can be done for Maurer's bound in (2) (the details for other bounds are postponed to Appendix A).

Lemma 4. Let $\delta \in \,]0,1[$ and $m \in [0..n]$. In the setting of Theorem 3, we have, with probability at least $1 - \delta$,

$$\mathrm{kl}(L(P_n), L_n(P_n)) \le \frac{\mathrm{KL}(P_n \| P(Z_{\le m})) + \mathrm{KL}(P_n \| P(Z_{>m})) + \ln \frac{4 \sqrt{m(n-m)}}{\delta}}{n}.$$

Remark 5. (Useful for Section 5 below.) Though this may deteriorate the bound in practice, Theorem 3 allows choosing a learning algorithm $P$ such that for $1 \le m < n$, $P(Z_{\le m}) \equiv P(Z_{>m}) \equiv P_0$ (i.e. no informed priors); this results in $\mathrm{COMP}_n = 2 \, \mathrm{KL}(P_n \| P_0)$---the bound is otherwise unchanged.

Biasing. The term $V_n$ in our bound can be seen as the result of "biasing" the loss when evaluating the generalization error on each half of the sample. The TS bound, having a second-order variance term, can be used in such a way as to arrive at a bound like ours with the same $V_n$ as in Corollary 1. The idea here is to apply the TS bound twice (once on each half of the sample) to the biased losses $\ell(h, \cdot) - \ell(\hat{h}(Z_{\le m}), \cdot)$ and $\ell(h, \cdot) - \ell(\hat{h}(Z_{>m}), \cdot)$, and then to combine the results with a union bound. The resulting bound cannot, however, be stated with a $V_n$ term as in Theorem 3, i.e. with the online posteriors $Q(Z_{>i})$ and $Q(Z_{<j})$.

4 Experiments

We consider binary classification, with hypotheses $h \in \mathbb{R}^d$ predicting the label $1\{\phi(h^\top X) > 1/2\}$ for input $X$, where $\phi(w) := 1/(1 + e^{-w})$, $w \in \mathbb{R}$. 
We learn our hypotheses using regularized logistic regression: given a sample $S = (Z_p, \ldots, Z_q)$, with $(p, q) \in \{(1, m), (m+1, n), (1, n)\}$ and $m = n/2$, we compute

$$\hat{h}(S) := \arg\min_{h \in \mathbb{R}^d} \; - \sum_{i=p}^{q} \Big( Y_i \cdot \ln \phi(h^\top X_i) + (1 - Y_i) \cdot \ln\big(1 - \phi(h^\top X_i)\big) \Big) + \lambda \|h\|^2. \qquad (8)$$

[Figure 1: values of the bounds ("BOUND", ranging over 0--0.6) as a function of sample size ("SAMPLE SIZE", 1000--7000), for input dimension $d = 10$ (left panel) and $d = 50$ (right panel).]

For $Z_{\le n} \in \mathcal{Z}^n$ and $1 \le i \le m < j \le n$, we choose algorithm $Q$ in Theorem 3 such that $Q(Z_{>i}) \equiv \delta(\hat{h}(Z_{>m}))$ and $Q(Z_{<j}) \equiv \delta(\hat{h}(Z_{\le m}))$. Given a sample $S \ne \emptyset$, we set the "posterior" $P(S)$ to be a Gaussian centered at $\hat{h}(S)$ with variance $\sigma^2 > 0$; that is, $P(S) \equiv \mathcal{N}(\hat{h}(S), \sigma^2 I_d)$. The prior distribution is set to $P_0 \equiv \mathcal{N}(0, \sigma_0^2 I_d)$, for $\sigma_0 > 0$.

Parameters. We set $\delta = 0.05$. For all datasets, we use $\lambda = 0.01$, and we (approximately) solve (8) using the BFGS algorithm. For each bound, we pick the $\sigma^2 \in \{1/2, \ldots, 1/2^J\}$, $J := \lceil \log_2 n \rceil$, which minimizes it on the given data (with $n$ instances). In order for the bounds to still hold with probability at least $1 - \delta$, we replace $\delta$ on the RHS of each bound by $\delta/\lceil \log_2 n \rceil$ (this follows from the application of a union bound). We choose the prior variance such that $\sigma_0^2 = 1/2$ (this was the best value on average for the bounds we compare against). We choose the grid $\mathcal{G}$ in Theorem 3 as in (7). Finally, we approximate Gaussian expectations using Monte Carlo sampling.

Synthetic data. We generate synthetic data for $d \in \{10, 50\}$ and sample sizes between 800 and 8000. For a given sample size $n$, we 1) draw $X_1, \ldots, X_n$ [resp. $\epsilon_1, \ldots, \epsilon_n$] identically and independently from the multivariate Gaussian distribution $\mathcal{N}(0, I_d)$ [resp. the Bernoulli distribution $\mathcal{B}(0.9)$]; and 2) we set $Y_i = 1\{\phi(h_*^\top X_i) > 1/2\} \cdot \epsilon_i$, for $i \in [n]$, where $h_* \in \mathbb{R}^d$ is the vector constructed from the first $d$ digits of $\pi$. For example, if $d = 10$, then $h_* = (3, 1, 4, 1, 5, 9, 2, 6, 5, 3)^\top$. Figure 1 shows the results averaged over 10 independent runs for each sample size.

UCI datasets. For the second experiment, we use several UCI datasets. These are listed in Table 1 (where Breast-C. stands for Breast Cancer). We encode categorical variables as appropriate 0-1 vectors. This effectively increases the dimension of the input space (reported as $d$ in Table 1). After removing any rows (i.e. instances) containing missing features and performing the encoding, the input data is scaled such that every column has values between -1 and 1. We used a 5-fold train-test split ($n$ in Table 1 is the training set size), and the results in Table 1 are averages over 5 runs. We only compare with Maurer's bound, since the other bounds were worse than both Maurer's and ours on all datasets.

Discussion. As the dimension $d$ of the input space increases, the complexity $\mathrm{KL}(P_n \| P_0)$---and thus all the PAC-Bayes bounds discussed in this paper---gets larger. Our bound suffers less from this increase in $d$, since for a large enough sample size $n$, the term $V_n$ is small enough (see Figure 1) to absorb any increase in the complexity. In fact, for large enough $n$, the irreducible (complexity-free) term involving $V'_n$ in our bound becomes the dominant one. This, combined with the fact that for the 0-1 loss, $V'_n \approx L_n(P_n)$ for large enough $n$ (see Figure 1), makes our bound tighter than the others. Adding a regularization term in the objective (8) is important, as it stabilizes $\hat{h}$.

5 Theoretical Motivation

The Bernstein condition [3, 5, 6, 15, 22] essentially characterizes the "easiness" of the learning problem: we say that $(D, \ell, \mathcal{H})$ satisfies the $(B, \beta)$-Bernstein condition (BC), for $B > 0$ and $\beta \in [0,1]$, if for all $h \in \mathcal{H}$,

$$\mathbf{E}_{Z \sim D} \big[ (\ell_h(Z) - \ell_{h^*}(Z))^2 \big] \le B \cdot \mathbf{E}_{Z \sim D} \big[ \ell_h(Z) - \ell_{h^*}(Z) \big]^{\beta},$$

where $h^* \in \arg\inf_{h \in \mathcal{H}} \mathbf{E}_{Z \sim D}[\ell_h(Z)]$ is a risk minimizer within the closure of $\mathcal{H}$. The condition implies that the variance of the excess loss random variable $\ell_h(Z) - \ell_{h^*}(Z)$ gets smaller the closer the risk of hypothesis $h \in \mathcal{H}$ gets to that of the risk minimizer $h^*$. For bounded loss functions, the BC with $\beta = 0$ always holds. The BC with $\beta = 1$ (the "easiest" learning setting) is also known as the Massart noise condition [25]; it holds in our experiment with synthetic data in Section 4, and also, e.g., whenever $\mathcal{H}$ is convex and $h \mapsto \ell_h(z)$ is exp-concave for all $z \in \mathcal{Z}$ [15, 31]. For more examples of learning settings where a BC holds, see [22, Section 3].

Our aim in this section is to give an upper bound on the infimum term involving $V_n$ in (5), under a BC, in terms of the complexity $\mathrm{COMP}_n$ and the excess risks $\bar{L}(P_n)$, $\bar{L}(Q(Z_{>m}))$, and $\bar{L}(Q(Z_{\le m}))$, where for a distribution $P \in \mathcal{P}(\mathcal{H})$ the excess risk is defined by

$$\bar{L}(P) := \mathbf{E}_{h \sim P} \big[ \mathbf{E}_{Z \sim D}[\ell_h(Z)] \big] - \mathbf{E}_{Z \sim D}[\ell_{h^*}(Z)].$$

In the next theorem, we denote $Q_{\le m} := Q(Z_{\le m})$ and $Q_{>m} := Q(Z_{>m})$, for $m \in [n]$. 
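For intuition, the Bernstein condition can be verified exactly in tiny examples. The sketch below (a hypothetical two-hypothesis problem with 0-1 loss and Massart noise, constructed by us and not one of the paper's experiments) checks the $\beta = 1$ case with exact rational arithmetic:

```python
from fractions import Fraction

# Hypothetical example: X uniform on {0, 1}, label Y = X flipped with
# probability p < 1/2 (Massart noise). H = {h*, h} with h*(x) = x the
# risk minimizer and h(x) = 1 - x, under 0-1 loss.
p = Fraction(1, 10)

# excess loss g(Z) = loss_h(Z) - loss_h*(Z): +1 w.p. 1-p, -1 w.p. p
mean_excess = (1 - p) - p      # E[g] = 1 - 2p
second_moment = (1 - p) + p    # E[g^2] = 1

# beta = 1 Bernstein condition E[g^2] <= B * E[g]^1, with B = 1/(1 - 2p)
B = 1 / (1 - 2 * p)
print(second_moment <= B * mean_excess)  # True
```

As $p$ approaches $1/2$ (vanishing margin), the constant $B = 1/(1-2p)$ blows up, matching the intuition that the problem gets harder and the effective $\beta$ in a finite-$B$ condition drops.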
To simplify the presentation further (and for consistency with Section 4), we assume that $Q$ is chosen such that

$$Q(Z_{>i}) = Q_{>m}, \text{ for } 1 \le i \le m, \quad \text{and} \quad Q(Z_{<j}) = Q_{\le m}, \text{ for } m < j \le n. \qquad (9)$$

Theorem 7. Suppose that the learning problem satisfies the $(B, \beta)$-Bernstein condition for some $B > 0$ and $\beta \in [0,1]$. Then for any learning algorithms $P$ and $Q$ (with $Q$ satisfying (9)), there exists a $C > 0$ such that for all $n \ge 1$ and $m = n/2$, with probability at least $1 - \delta$,

$$C \cdot \inf_{\eta \in \mathcal{G}} \Big\{ c_\eta \cdot V_n + \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{\eta \cdot n} \Big\} \le \bar{L}(P_n) + \bar{L}(Q_{\le m}) + \bar{L}(Q_{>m}) + \Big( \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{n} \Big)^{\frac{1}{2-\beta}} + \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{n}. \qquad (10)$$

In addition to the "ESI" tools provided in Section 6 and Lemma 13, the proof of Theorem 7, presented in Appendix E, also uses an "ESI version" of the Bernstein condition due to [22].

First note that the only terms in our main bound (5), other than the infimum on the LHS of (10), are the empirical error $L_n(P_n)$ and a $\tilde{O}(1/\sqrt{n})$ complexity-free term, which is typically smaller than $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$ (e.g. when the dimension of $\mathcal{H}$ is large enough). The term $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$ is often the dominating one in other PAC-Bayesian bounds when $\liminf_{n \to \infty} L_n(P_n) > 0$.

Now consider the remaining term in our main bound, which matches the infimum term on the LHS of (10), and let us choose algorithm $P$ as per Remark 5, so that $\mathrm{COMP}_n = 2 \, \mathrm{KL}(P_n \| P_0)$. Suppose that, with high probability (w.h.p.), $\mathrm{KL}(P_n \| P_0)/n$ converges to $0$ as $n \to \infty$ (otherwise no PAC-Bayesian bound would converge to $0$). Then $(\mathrm{COMP}_n/n)^{1/(2-\beta)} + \mathrm{COMP}_n/n$---essentially the sum of the last two terms on the RHS of (10)---converges to $0$ at a faster rate than $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$ w.h.p. for $\beta > 0$, and at an equal rate for $\beta = 0$. Thus, in light of Theorem 7, to argue that our bound can be better than others (still when $\liminf_{n \to \infty} L_n(P_n) > 0$), it remains to show that there exist algorithms $P$ and $Q$ for which the sum of the excess risks on the RHS of (10) is smaller than $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$.

One choice of estimator with small excess risk is the Empirical Risk Minimizer (ERM). When $m = n/2$, if one chooses $Q$ such that it outputs a Dirac around the ERM on a given sample, then under a BC with exponent $\beta$ and for "parametric" $\mathcal{H}$ (such as the $d$-dimensional linear classifiers in Sec. 
4), $\bar{L}(Q_{\le m})$ and $\bar{L}(Q_{>m})$ are of order $\tilde{O}(n^{-1/(2-\beta)})$ w.h.p. [3, 19]. However, setting $P_n \equiv \delta(\mathrm{ERM}(Z_{\le n}))$ is not allowed, since otherwise $\mathrm{KL}(P_n \| P_0) = \infty$. Instead, one can choose $P_n$ to be the generalized-Bayes/Gibbs posterior. In this case too, under a BC with exponent $\beta$ and for parametric $\mathcal{H}$, the excess risk is of order $\tilde{O}(n^{-1/(2-\beta)})$ w.h.p. for clever choices of prior $P_0$ [3, 19].

6 Detailed Analysis

We start this section by presenting the convenient ESI notation and use it to present our main technical Lemma 13 (proofs of the ESI results are in Appendix D). We then continue with a proof of Theorem 3.

Definition 8 (ESI: Exponential Stochastic Inequality, pronounced "easy"; [19, 22]). Let $\eta > 0$, and let $X, Y$ be any two random variables with joint distribution $D$. We define

$$X \trianglelefteq^{D}_{\eta} Y \;\Longleftrightarrow\; X - Y \trianglelefteq^{D}_{\eta} 0 \;\Longleftrightarrow\; \mathbf{E}_{(X,Y) \sim D} \big[ e^{\eta (X - Y)} \big] \le 1. \qquad (11)$$

Definition 8 can be extended to the case where $\eta = \hat{\eta}$ is itself a random variable, in which case the expectation in (11) needs to be replaced by the expectation over the joint distribution of $(X, Y, \hat{\eta})$. When no ambiguity can arise, we omit $D$ from the ESI notation. Besides simplifying notation, ESIs are useful in that they simultaneously capture "with high probability" and "in expectation" results:

Proposition 9 (ESI Implications). For fixed $\eta > 0$, if $X \trianglelefteq_{\eta} Y$ then $\mathbf{E}[X] \le \mathbf{E}[Y]$. For both fixed and random $\hat{\eta}$, if $X \trianglelefteq_{\hat{\eta}} Y$, then for all $\delta \in \,]0,1[$, $X \le Y + \frac{\ln \frac{1}{\delta}}{\hat{\eta}}$ with probability at least $1 - \delta$.

In the next proposition, we present two results concerning transitivity and additivity properties of ESIs:

Proposition 10 (ESI Transitivity and Chain Rule). (a) Let $Z_1, \ldots, Z_n$ be any random variables on $\mathcal{Z}$ (not necessarily independent). If for some $(\gamma_i)_{i \in [n]} \in \,]0, +\infty[^n$, $Z_i \trianglelefteq_{\gamma_i} 0$ for all $i \in [n]$, then $\sum_{i=1}^{n} Z_i \trianglelefteq_{\nu_n} 0$, where $\nu_n := \big( \sum_{i=1}^{n} \frac{1}{\gamma_i} \big)^{-1}$ (so if $\gamma_i = \gamma > 0$ for all $i \in [n]$, then $\nu_n = \gamma/n$). (b) Suppose now that $Z_1, \ldots, Z_n$ are i.i.d., and let $X : \mathcal{Z} \times \bigcup_{i=1}^{n} \mathcal{Z}^i \to \mathbb{R}$ be any real-valued function. If for some $\eta > 0$, $X(Z_i; z_{>i}) \trianglelefteq_{\eta} 0$ for all $i \in [n]$ and all $z_{>i} \in \mathcal{Z}^{n-i}$, then $\sum_{i=1}^{n} X(Z_i; Z_{>i}) \trianglelefteq_{\eta} 0$.

We now give a basic PAC-Bayesian result for the ESI context:

Proposition 11 (ESI PAC-Bayes). Let $\eta > 0$ and let $\{Y_h : h \in \mathcal{H}\}$ be any family of random variables such that $Y_h \trianglelefteq_{\eta} 0$ for all $h \in \mathcal{H}$. Let $P_0$ be any distribution on $\mathcal{H}$ and let $P : \bigcup_{i=1}^{n} \mathcal{Z}^i \to \mathcal{P}(\mathcal{H})$ be a learning algorithm. We have:

$$\mathbf{E}_{h \sim P_n}[Y_h] \trianglelefteq_{\eta} \frac{\mathrm{KL}(P_n \| P_0)}{\eta}, \quad \text{where } P_n := P(Z_{\le n}).$$

In many applications (especially for our main result) it is desirable to work with a random (i.e. data-dependent) $\eta$ in the ESI inequalities; one can then tune $\eta$ after seeing the data.

Proposition 12 (ESI from fixed to random $\eta$). Let $\mathcal{G}$ be a countable subset of $]0, +\infty[$ and let $\pi$ be a prior distribution over $\mathcal{G}$. 
Given a countable collection $\{Y_\eta : \eta \in \mathcal{G}\}$ of random variables satisfying $Y_\eta \trianglelefteq_{\eta} 0$ for all fixed $\eta \in \mathcal{G}$, we have, for an arbitrary estimator $\hat{\eta}$ with support on $\mathcal{G}$,

$$Y_{\hat{\eta}} \trianglelefteq_{\hat{\eta}} \frac{-\ln \pi(\hat{\eta})}{\hat{\eta}}.$$

The following key lemma, which is of independent interest, is central to our main result.

Lemma 13 (Key result: un-expected Bernstein). Let $X \sim D$ be a random variable bounded from above by $b > 0$ almost surely, and let $\vartheta(u) := (-\ln(1-u) - u)/u^2$. For all $0 < \eta < 1/b$, we have (a):

$$\mathbf{E}[X] - X \trianglelefteq^{D}_{\eta} c \cdot X^2, \quad \text{for all } c \ge \eta \cdot \vartheta(\eta b). \qquad (12)$$

(b) The result is tight: for every $c < \eta \cdot \vartheta(\eta b)$, there exists a distribution $D$ such that (12) does not hold.

Lemma 13 is reminiscent of the following slight variation of Bernstein's inequality [12]: let $X$ be any random variable bounded from below by $-b$, and let $\kappa(x) := (e^x - x - 1)/x^2$. For all $\eta > 0$, we have

$$\mathbf{E}[X] - X \trianglelefteq_{\eta} s \cdot \mathbf{E}[X^2], \quad \text{for all } s \ge \eta \cdot \kappa(\eta b). \qquad (13)$$

Note that the un-expected Bernstein Lemma 13 has the $X^2$ lifted out of the expectation. In Appendix G, we prove (13) and compare it to standard versions of Bernstein's inequality. We also compare (12) to the related but distinct empirical Bernstein inequality due to [27, Theorem 4]. We now prove part (a) of Lemma 13, which follows easily from the proof of an existing result [16, 21]. Part (b) is novel; its proof is postponed to Appendix F.

Proof of Lemma 13-Part (a). 
[16] (see also [21]) showed in the proof of their Lemma 4.1 that

$$\exp\big(\lambda\xi - \lambda^2 \vartheta(\lambda)\xi^2\big) \leq 1 + \lambda\xi, \quad \text{for all } \lambda \in [0,1[ \text{ and } \xi \geq -1. \qquad (14)$$

Letting $\eta = \lambda/b$ and $\xi = -X/b$, (14) becomes

$$\exp\big(-\eta X - \eta^2 \vartheta(\eta b) X^2\big) \leq 1 - \eta X, \quad \text{for all } \eta \in \,]0, 1/b[. \qquad (15)$$

Taking expectation on both sides of (15) and using the fact that $1 - \eta\,\mathbf{E}[X] \leq \exp(-\eta\,\mathbf{E}[X])$ on the right-hand side yields $\mathbf{E}\big[\exp\big(\eta\,(\mathbf{E}[X] - X - \eta\,\vartheta(\eta b)\cdot X^2)\big)\big] \leq 1$, which is (12) for $c = \eta \cdot \vartheta(\eta b)$; the case $c > \eta \cdot \vartheta(\eta b)$ follows because $X^2 \geq 0$.

Proof of Theorem 3. Let $\eta \in \,]0, 1/b[$ and $c_\eta := \eta \cdot \vartheta(\eta b)$. For $1 \leq i \leq m < j \leq n$, define

$$X_h(Z_i; z_{>i}) := \ell_h(Z_i) - \mathbf{E}_{h'\sim Q(z_{>i})}[\ell_{h'}(Z_i)], \quad \text{for } z_{>i} \in \mathcal{Z}^{n-i},$$
$$\tilde X_h(Z_j; z_{<j}) := \ell_h(Z_j) - \mathbf{E}_{h'\sim Q(z_{<j})}[\ell_{h'}(Z_j)], \quad \text{for } z_{<j} \in \mathcal{Z}^{j-1}.$$

Using the un-expected Bernstein Lemma 13, we have, for all $1 \leq i \leq m < j \leq n$,

$$\forall z_{>i} \in \mathcal{Z}^{n-i}, \quad Y^\eta_h(Z_i; z_{>i}) := \mathbf{E}_{Z'_i\sim D}\big[X_h(Z'_i; z_{>i})\big] - X_h(Z_i; z_{>i}) - c_\eta \cdot X_h(Z_i; z_{>i})^2 \;\trianglelefteq_\eta\; 0,$$
$$\forall z_{<j} \in \mathcal{Z}^{j-1}, \quad \tilde Y^\eta_h(Z_j; z_{<j}) := \mathbf{E}_{Z'_j\sim D}\big[\tilde X_h(Z'_j; z_{<j})\big] - \tilde X_h(Z_j; z_{<j}) - c_\eta \cdot \tilde X_h(Z_j; z_{<j})^2 \;\trianglelefteq_\eta\; 0.$$

Since $Z_1, \ldots, Z_n$ are i.i.d., we can chain the ESIs above using Proposition 10-(b) to get:

$$\sum_{i=1}^m Y^\eta_h(Z_i; Z_{>i}) \;\trianglelefteq_\eta\; 0 \quad \text{and} \quad \sum_{j=m+1}^n \tilde Y^\eta_h(Z_j; Z_{<j}) \;\trianglelefteq_\eta\; 0. \qquad (16)$$

Applying Proposition 11 to the two ESIs in (16), with priors $P(Z_{>m})$ and $P(Z_{\leq m})$, respectively, and common posterior $P_n = P(Z_{\leq n})$ on $\mathcal{H}$, we get, with $\mathrm{KL}_{>m} := \mathrm{KL}(P_n \| P(Z_{>m}))$ and $\mathrm{KL}_{\leq m} := \mathrm{KL}(P_n \| P(Z_{\leq m}))$:

$$\mathbf{E}_{h\sim P_n}\Big[\sum_{i=1}^m Y^\eta_h(Z_i; Z_{>i})\Big] \;\trianglelefteq_\eta\; \frac{\mathrm{KL}_{>m}}{\eta} \quad \text{and} \quad \mathbf{E}_{h\sim P_n}\Big[\sum_{j=m+1}^n \tilde Y^\eta_h(Z_j; Z_{<j})\Big] \;\trianglelefteq_\eta\; \frac{\mathrm{KL}_{\leq m}}{\eta}.$$

We now apply Proposition 10-(a) to chain these two ESIs, which yields

$$\mathbf{E}_{h\sim P_n}\Big[\sum_{i=1}^m Y^\eta_h(Z_i; Z_{>i}) + \sum_{j=m+1}^n \tilde Y^\eta_h(Z_j; Z_{<j})\Big] \;\trianglelefteq_{\eta/2}\; \frac{\mathrm{COMP}_n}{\eta}, \quad \text{where } \mathrm{COMP}_n := \mathrm{KL}_{>m} + \mathrm{KL}_{\leq m}.$$

With the prior $\pi$ on $\mathcal{G}$, we have for any $\hat\eta = \hat\eta(Z_{\leq n}) \in \mathcal{G} \subset [1/\sqrt{nb^2}, 1/b[$ (see Proposition 12), after expanding the definitions of $Y^{\hat\eta}_h$ and $\tilde Y^{\hat\eta}_h$:

$$n\cdot\big(L(P_n) - L_n(P_n)\big) \;\trianglelefteq_{\hat\eta/2}\; n \cdot c_{\hat\eta} \cdot V_n + \frac{\mathrm{COMP}_n + 2\ln\frac{1}{\pi(\hat\eta)}}{\hat\eta} + \Big[\sum_{i=1}^m \big(\mathbf{E}_{Z'_i\sim D}\big[\bar\ell_{Q_{>i}}(Z'_i)\big] - \bar\ell_{Q_{>i}}(Z_i)\big) + \sum_{j=m+1}^n \big(\mathbf{E}_{Z'_j\sim D}\big[\bar\ell_{Q_{<j}}(Z'_j)\big] - \bar\ell_{Q_{<j}}(Z_j)\big)\Big], \qquad (17)$$

where $\bar\ell_{Q_{>i}}(Z_i) := \mathbf{E}_{h\sim Q(Z_{>i})}[\ell_h(Z_i)]$, $\bar\ell_{Q_{<j}}(Z_j) := \mathbf{E}_{h\sim Q(Z_{<j})}[\ell_h(Z_j)]$, and

$$V_n := \frac{1}{n}\,\mathbf{E}_{h\sim P_n}\Big[\sum_{i=1}^m X_h(Z_i; Z_{>i})^2 + \sum_{j=m+1}^n \tilde X_h(Z_j; Z_{<j})^2\Big].$$

Using the un-expected Bernstein Lemma 13, together with Propositions 10-(b) and 12 and the fact that, by Jensen's inequality, $\bar\ell_{Q_{>i}}(Z_i)^2 \leq \mathbf{E}_{h'\sim Q(Z_{>i})}\big[\ell_{h'}(Z_i)^2\big]$ (and similarly for $\bar\ell_{Q_{<j}}$), the quantity between the square brackets in (17) satisfies, for any $\hat\nu = \hat\nu(Z_{\leq n}) \in \mathcal{G}$:

$$\Big[\,\cdots\,\Big] \;\trianglelefteq_{\hat\nu}\; n \cdot c_{\hat\nu} \cdot V'_n + \frac{\ln\frac{1}{\pi(\hat\nu)}}{\hat\nu}, \quad \text{where } V'_n := \frac{1}{n}\Big(\sum_{i=1}^m \mathbf{E}_{h'\sim Q(Z_{>i})}\big[\ell_{h'}(Z_i)^2\big] + \sum_{j=m+1}^n \mathbf{E}_{h'\sim Q(Z_{<j})}\big[\ell_{h'}(Z_j)^2\big]\Big). \qquad (18)$$

By chaining (18) and (17) using Proposition 10-(a) and dividing by $n$, we get:

$$L(P_n) \;\trianglelefteq_{\frac{n\,\hat\eta\,\hat\nu}{\hat\eta + 2\hat\nu}}\; L_n(P_n) + c_{\hat\eta} \cdot V_n + \frac{\mathrm{COMP}_n + 2\ln\frac{1}{\pi(\hat\eta)}}{\hat\eta \cdot n} + c_{\hat\nu} \cdot V'_n + \frac{\ln\frac{1}{\pi(\hat\nu)}}{\hat\nu \cdot n}. \qquad (19)$$

We now apply Proposition 9 to (19) to obtain the following inequality with probability at least $1 - \delta$:

$$L(P_n) \leq L_n(P_n) + \bigg\{c_{\hat\eta} \cdot V_n + \frac{\mathrm{COMP}_n + 2\ln\frac{1}{\pi(\hat\eta)\cdot\delta}}{\hat\eta \cdot n}\bigg\} + \bigg[c_{\hat\nu} \cdot V'_n + \frac{\ln\frac{1}{\pi(\hat\nu)\cdot\delta}}{\hat\nu \cdot n}\bigg]. \qquad (20)$$

Inequality (5) follows after picking $\hat\eta$ and $\hat\nu$ to be, respectively, estimators which achieve the infimum over the closure of $\mathcal{G}$ of the quantities between the braces and the square brackets in (20).

7 Conclusion and Future Work

The main goal of this paper was to introduce a new PAC-Bayesian bound based on a new proof technique; we also theoretically motivated the bound in terms of a Bernstein condition. The simple experiments we provided are to be considered as a basic sanity check—in future work, we plan to put the bound to real practical use by applying it to deep nets in the style of, e.g., [42].

Acknowledgments

An anonymous referee made some highly informed remarks on our paper, which led us to substantially rewrite the paper and made us understand our own work much better. Part of this work was performed while Zakaria Mhammedi was interning at the Centrum Wiskunde & Informatica (CWI). This work was also supported by the Australian Research Council and Data61.

References

[1] Pierre Alquier and Benjamin Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902, 2018.

[2] Amiran Ambroladze, Emilio Parrado-Hernández, and John Shawe-Taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information Processing Systems, pages 9–16, 2007.

[3] Jean-Yves Audibert. PAC-Bayesian statistical learning theory. Thèse de doctorat de l'Université Paris 6, 2004.

[4] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments.
In International Conference on Algorithmic Learning Theory, pages 150–165. Springer, 2007.

[5] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[6] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.

[7] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[8] Olivier Catoni. A PAC-Bayesian approach to adaptive classification. Preprint, 2003.

[9] Olivier Catoni. PAC-Bayesian Supervised Classification. Lecture Notes-Monograph Series. IMS, 2007.

[10] Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Lecture Notes-Monograph Series. IMS, 2007.

[11] Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.

[12] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning and Games. Cambridge University Press, Cambridge, UK, 2006.

[13] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[14] Gintare K. Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI, 2017.

[15] Tim Van Erven, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793–1861, 2015.

[16] Xiequan Fan, Ion Grama, and Quansheng Liu. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 20, 2015.

[17] Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 353–360. ACM, 2009.

[18] Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand, and Jean-Francis Roy. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. The Journal of Machine Learning Research, 16(1):787–860, 2015.

[19] Peter D. Grünwald and Nishant A. Mehta. Fast rates for general unbounded loss functions: from ERM to generalized Bayes. Journal of Machine Learning Research, 2019. Accepted pending minor revision. Available as arXiv preprint arXiv:1605.00252.

[20] Benjamin Guedj. A primer on PAC-Bayesian learning. arXiv preprint arXiv:1901.05353, 2019.

[21] Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.

[22] Wouter M. Koolen, Peter D. Grünwald, and Tim van Erven. Combining adversarial guarantees and stochastic fast rates in online learning. In Advances in Neural Information Processing Systems, pages 4457–4465, 2016.

[23] John Langford and Rich Caruana. (Not) bounding the true error. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 809–816. MIT Press, 2002.

[24] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems, pages 439–446, 2003.

[25] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

[26] Andreas Maurer. A note on the PAC-Bayesian theorem. arXiv preprint cs/0411099, 2004.

[27] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings COLT 2009, 2009.

[28] David A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh ACM Conference on Computational Learning Theory (COLT' 98), pages 230–234. ACM Press, 1998.

[29] David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth ACM Conference on Computational Learning Theory (COLT' 99), pages 164–171. ACM Press, 1999.

[30] David A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.

[31] Nishant A. Mehta. Fast rates with high probability in exp-concave statistical learning. In Artificial Intelligence and Statistics, pages 1085–1093, 2017.

[32] Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pages 672–679. ACM, 2008.

[33] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161–193, 2006.

[34] Behnam Neyshabur, Srinadh Bhojanapalli, David A. McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR, 2018.

[35] Omar Rivasplata, Csaba Szepesvári, John S. Shawe-Taylor, Emilio Parrado-Hernandez, and Shiliang Sun. PAC-Bayes bounds for stable algorithms with instance-dependent priors. In Advances in Neural Information Processing Systems, pages 9214–9224, 2018.

[36] Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3(Oct):233–269, 2002.

[37] Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.

[38] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010.

[39] Ilya O. Tolstikhin and Yevgeny Seldin. PAC-Bayes-empirical-Bernstein inequality. In Advances in Neural Information Processing Systems, pages 109–117, 2013.

[40] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

[41] Olivier Wintenberger. Optimal learning with Bernstein online aggregation. Machine Learning, 106(1):119–141, 2017.

[42] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. In ICLR, 2019.