{"title": "PAC-Bayes-Empirical-Bernstein Inequality", "book": "Advances in Neural Information Processing Systems", "page_first": 109, "page_last": 117, "abstract": "We present PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on combination of PAC-Bayesian bounding technique with Empirical Bernstein bound. It allows to take advantage of small empirical variance and is especially useful in regression. We show that when the empirical variance is significantly smaller than the empirical loss PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. PAC-Bayes-Empirical-Bernstein inequality is an interesting example of application of PAC-Bayesian bounding technique to self-bounding functions. We provide empirical comparison of PAC-Bayes-Empirical-Bernstein inequality with PAC-Bayes-kl inequality on a synthetic example and several UCI datasets.", "full_text": "PAC-Bayes-Empirical-Bernstein Inequality\n\nIlya Tolstikhin\nComputing Centre, Russian Academy of Sciences\niliya.tolstikhin@gmail.com\n\nYevgeny Seldin\nQueensland University of Technology and UC Berkeley\nyevgeny.seldin@gmail.com\n\nAbstract\n\nWe present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. 
The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions.\n\n1 Introduction\n\nPAC-Bayesian analysis is a general and powerful tool for data-dependent analysis in machine learning. By now it has been applied in such diverse areas as supervised learning [1\u20134], unsupervised learning [4, 5], and reinforcement learning [6]. PAC-Bayesian analysis combines the best aspects of PAC learning and Bayesian learning: (1) it provides strict generalization guarantees (like VC-theory), (2) it is flexible and allows the incorporation of prior knowledge (like Bayesian learning), and (3) it provides data-dependent generalization guarantees (akin to Rademacher complexities).\n\nPAC-Bayesian analysis provides concentration inequalities for the divergence between expected and empirical loss of randomized prediction rules. For a hypothesis space H a randomized prediction rule associated with a distribution \u03c1 over H operates by picking a hypothesis at random according to \u03c1 from H each time it has to make a prediction. If \u03c1 is a delta-distribution we recover classical prediction rules that pick a single hypothesis h \u2208 H. Otherwise, the prediction strategy resembles Bayesian prediction from the posterior distribution, with the distinction that \u03c1 does not have to be the Bayes posterior. Importantly, many PAC-Bayesian inequalities hold for all posterior distributions \u03c1 simultaneously (with high probability over a random draw of a training set). Therefore, PAC-Bayesian bounds can be used in two ways. Ideally, we prefer to derive new algorithms that find the posterior distribution \u03c1 that minimizes the PAC-Bayesian bound on the expected loss. 
However, we can also use PAC-Bayesian bounds in order to estimate the expected loss of posterior distributions \u03c1 that were found by other algorithms, such as empirical risk minimization, regularized empirical risk minimization, Bayesian posteriors, and so forth. In such applications PAC-Bayesian bounds can be used to provide generalization guarantees for other methods and can be applied as a substitute for cross-validation in parameter tuning (since the bounds hold for all posterior distributions \u03c1 simultaneously, we can apply the bounds to test multiple posterior distributions \u03c1 without suffering from over-fitting, in contrast with extensive applications of cross-validation).\n\nThere are two forms of PAC-Bayesian inequalities that are currently known to be the tightest, depending on the situation. One is the PAC-Bayes-kl inequality of Seeger [7] and the other is the PAC-Bayes-Bernstein inequality of Seldin et al. [8]. However, the PAC-Bayes-Bernstein inequality is expressed in terms of the true expected variance, which is rarely accessible in practice. Therefore, in order to apply the PAC-Bayes-Bernstein inequality we need an upper bound on the expected variance (or, more precisely, on the average of the expected variances of losses of each hypothesis h \u2208 H weighted according to the randomized prediction rule \u03c1). If the loss is bounded in the [0, 1] interval the expected variance can be upper bounded by the expected loss and this bound can be used to recover the PAC-Bayes-kl inequality from the PAC-Bayes-Bernstein inequality (with slightly suboptimal constants and suboptimal behavior for small sample sizes). In fact, for the binary loss this result cannot be significantly improved (see Section 3). However, when the loss is not binary it may be possible to obtain a tighter bound on the variance, which will lead to a tighter bound on the loss than the PAC-Bayes-kl inequality. For example, in Seldin et al. [6] a deterministic upper bound on the variance of importance-weighted sampling combined with the PAC-Bayes-Bernstein inequality yielded an order of magnitude improvement relative to application of the PAC-Bayes-kl inequality to the same problem. We note that the bound on the variance used by Seldin et al. [6] depends on specific properties of importance-weighted sampling and does not apply to other problems.\n\nIn this work we derive the PAC-Bayes-Empirical-Bernstein bound, in which the expected average variance of the loss weighted by \u03c1 is replaced by the weighted average of the empirical variance of the loss. Bounding the expected variance by the empirical variance is generally tighter than bounding it by the empirical loss. Therefore, the PAC-Bayes-Empirical-Bernstein bound is generally tighter than the PAC-Bayes-kl bound, although the exact comparison also depends on the divergence between the posterior and the prior and the sample size. In Section 5 we provide an empirical comparison of the two bounds on several synthetic and UCI datasets.\n\nThe PAC-Bayes-Empirical-Bernstein bound is derived in two steps. In the first step we combine the PAC-Bayesian bounding technique with the Empirical Bernstein inequality [9] and derive a PAC-Bayesian bound on the variance. The PAC-Bayesian bound on the variance bounds the divergence between averages (weighted by \u03c1) of expected and empirical variances of the losses of hypotheses in H and holds with high probability for all averaging distributions \u03c1 simultaneously. In the second step the PAC-Bayesian bound on the variance is substituted into the PAC-Bayes-Bernstein inequality yielding the PAC-Bayes-Empirical-Bernstein bound.\n\nThe remainder of the paper is organized as follows. 
We start with some formal definitions and review the major PAC-Bayesian bounds in Section 2, provide our main results in Section 3 and their proof sketches in Section 4, and finish with experiments in Section 5 and conclusions in Section 6. Detailed proofs are provided in the supplementary material.\n\n2 Problem Setting and Background\n\nWe start by providing the problem setting and then give some background on PAC-Bayesian analysis.\n\n2.1 Notations and Definitions\n\nWe consider a supervised learning setting with an input space X, an output space Y, an i.i.d. training sample $S = \\{(X_i, Y_i)\\}_{i=1}^n$ drawn according to an unknown distribution D on the product-space X \u00d7 Y, a loss function \u2113 : Y\u00b2 \u2192 [0, 1], and a hypothesis class H. The elements of H are functions h : X \u2192 Y from the input space to the output space. We use \u2113h(X, Y) = \u2113(Y, h(X)) to denote the loss of a hypothesis h on a pair (X, Y).\n\nFor a fixed hypothesis h \u2208 H denote its expected loss by $L(h) = \\mathbb{E}_{(X,Y)\\sim\\mathcal{D}}[\\ell_h(X, Y)]$, the empirical loss by $L_n(h) = \\frac{1}{n}\\sum_{i=1}^n \\ell_h(X_i, Y_i)$, and the variance of the loss by\n\\[ \\mathbb{V}(h) = \\mathrm{Var}_{(X,Y)\\sim\\mathcal{D}}[\\ell_h(X, Y)] = \\mathbb{E}_{(X,Y)\\sim\\mathcal{D}}\\left[\\left(\\ell_h(X, Y) - \\mathbb{E}_{(X,Y)\\sim\\mathcal{D}}[\\ell_h(X, Y)]\\right)^2\\right]. \\]\n\nWe define the Gibbs regression rule G\u03c1 associated with a distribution \u03c1 over H in the following way: for each point X the Gibbs regression rule draws a hypothesis h according to \u03c1 and applies it to X. The expected loss of the Gibbs regression rule is denoted by L(G\u03c1) = Eh\u223c\u03c1[L(h)] and the empirical loss is denoted by Ln(G\u03c1) = Eh\u223c\u03c1[Ln(h)]. We use $\\mathrm{KL}(\\rho\\|\\pi) = \\mathbb{E}_{h\\sim\\rho}\\left[\\ln\\frac{\\rho(h)}{\\pi(h)}\\right]$ to denote the Kullback-Leibler divergence between two probability distributions [10]. For two Bernoulli distributions with biases p and q we use kl(q\u2016p) as a shorthand for KL([q, 1 \u2212 q]\u2016[p, 1 \u2212 p]). In the sequel we use E\u03c1[\u00b7] as a shorthand for Eh\u223c\u03c1[\u00b7].\n\n2.2 PAC-Bayes-kl bound\n\nBefore presenting our results we review several existing PAC-Bayesian bounds. The result in Theorem 1 was presented by Maurer [11, Theorem 5] and is one of the tightest known concentration bounds on the expected loss of the Gibbs regression rule. Theorem 1 generalizes (and slightly tightens) the PAC-Bayes-kl inequality of Seeger [7, Theorem 1] from binary to arbitrary loss functions bounded in the [0, 1] interval.\n\nTheorem 1. For any fixed probability distribution \u03c0 over H, for any n \u2265 8 and \u03b4 > 0, with probability greater than 1 \u2212 \u03b4 over a random draw of a sample S, for all distributions \u03c1 over H simultaneously:\n\\[ \\mathrm{kl}\\left(L_n(G_\\rho) \\,\\|\\, L(G_\\rho)\\right) \\le \\frac{\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{2\\sqrt{n}}{\\delta}}{n}. \\tag{1} \\]\n\nSince by Pinsker\u2019s inequality $|p - q| \\le \\sqrt{\\mathrm{kl}(q\\|p)/2}$, Theorem 1 directly implies (up to minor factors) the more explicit PAC-Bayesian bound of McAllester [12]:\n\\[ L(G_\\rho) \\le L_n(G_\\rho) + \\sqrt{\\frac{\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{2\\sqrt{n}}{\\delta}}{2n}}, \\tag{2} \\]\nwhich holds with probability greater than 1 \u2212 \u03b4 for all \u03c1 simultaneously. We note that kl is easy to invert numerically and for small values of Ln(G\u03c1) (less than 1/4) the implicit bound in (1) is significantly tighter than the explicit bound in (2). This can be seen from another relaxation suggested by McAllester [2], which follows from (1) by the inequality $p \\le q + \\sqrt{2q\\,\\mathrm{kl}(q\\|p)} + 2\\,\\mathrm{kl}(q\\|p)$ for q < p:\n\\[ L(G_\\rho) \\le L_n(G_\\rho) + \\sqrt{\\frac{2 L_n(G_\\rho)\\left(\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{2\\sqrt{n}}{\\delta}\\right)}{n}} + \\frac{2\\left(\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{2\\sqrt{n}}{\\delta}\\right)}{n}. \\tag{3} \\]\n\nFrom inequality (3) we clearly see that inequality (1) achieves a \u201cfast convergence rate\u201d or, in other words, when L(G\u03c1) is zero (or small compared to 1/\u221an) the bound converges at the rate of 1/n rather than 1/\u221an as a function of n.\n\n2.3 PAC-Bayes-Bernstein Bound\n\nSeldin et al. [8] introduced a general technique for combining PAC-Bayesian analysis with concentration of measure inequalities and derived the PAC-Bayes-Bernstein bound cited below. (The PAC-Bayes-Bernstein bound of Seldin et al. holds for martingale sequences, but for simplicity in this paper we restrict ourselves to i.i.d. variables.)\n\nTheorem 2. 
For any fixed distribution \u03c0 over H, for any \u03b41 > 0, and for any fixed c1 > 1, with probability greater than 1 \u2212 \u03b41 (over a draw of S) we have\n\\[ L(G_\\rho) \\le L_n(G_\\rho) + (1 + c_1)\\sqrt{\\frac{(e-2)\\,\\mathbb{E}_\\rho[\\mathbb{V}(h)]\\left(\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{\\nu_1}{\\delta_1}\\right)}{n}} \\tag{4} \\]\nsimultaneously for all distributions \u03c1 over H that satisfy\n\\[ \\sqrt{\\frac{\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{\\nu_1}{\\delta_1}}{(e-2)\\,\\mathbb{E}_\\rho[\\mathbb{V}(h)]}} \\le \\sqrt{n}, \\]\nwhere\n\\[ \\nu_1 = \\left\\lceil \\frac{1}{\\ln c_1} \\ln\\left(\\sqrt{\\frac{(e-2)n}{4\\ln(1/\\delta_1)}}\\right) \\right\\rceil + 1, \\]\nand for all other \u03c1 we have:\n\\[ L(G_\\rho) \\le L_n(G_\\rho) + 2\\,\\frac{\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{\\nu_1}{\\delta_1}}{n}. \\]\nFurthermore, the result holds if E\u03c1[V(h)] is replaced by an upper bound V\u0304(\u03c1), as long as E\u03c1[V(h)] \u2264 V\u0304(\u03c1) \u2264 1/4 for all \u03c1.\n\nA few comments on Theorem 2 are in order here. First, we note that Seldin et al. worked with cumulative losses and variances, whereas we work with normalized losses and variances, which means that their losses and variances differ by a multiplicative factor of n from our definitions. Second, we note that the statement on the possibility of replacing E\u03c1[V(h)] by an upper bound is not part of [8, Theorem 8], but it is mentioned and analyzed explicitly in the text. The requirement that V\u0304(\u03c1) \u2264 1/4 is not mentioned explicitly, but it follows directly from the necessity to preserve the relevant range of the trade-off parameter \u03bb in the proof of the theorem. Since 1/4 is a trivial upper bound on the variance of a random variable bounded in the [0, 1] interval, the requirement is not a limitation. 
Finally, we note that since we are working with \u201cone-sided\u201d variables (namely, the loss is bounded in the [0, 1] interval rather than the \u201ctwo-sided\u201d [\u22121, 1] interval, which was considered in [8]) the variance is bounded by 1/4 (rather than 1), which leads to a slight improvement in the value of \u03bd1.\n\nSince in reality we rarely have access to the expected variance E\u03c1[V(h)] the tightness of Theorem 2 entirely depends on the tightness of the upper bound V\u0304(\u03c1). If we use the trivial upper bound E\u03c1[V(h)] \u2264 1/4 the result is roughly equivalent to (2), which is inferior to Theorem 1. Design of a tighter upper bound on E\u03c1[V(h)] that holds for all \u03c1 simultaneously is the subject of the following section.\n\n3 Main Results\n\nThe key result of our paper is a PAC-Bayesian bound on the average expected variance E\u03c1[V(h)] given in terms of the average empirical variance E\u03c1[Vn(h)] = Eh\u223c\u03c1[Vn(h)], where\n\\[ \\mathbb{V}_n(h) = \\frac{1}{n-1}\\sum_{i=1}^n \\left(\\ell_h(X_i, Y_i) - L_n(h)\\right)^2 \\tag{5} \\]\nis an unbiased estimate of the variance V(h). The bound is given in Theorem 3 and it holds with high probability for all distributions \u03c1 simultaneously. Substitution of this bound into Theorem 2 yields the PAC-Bayes-Empirical-Bernstein inequality given in Theorem 4. Thus, the PAC-Bayes-Empirical-Bernstein inequality is based on two successive applications of the PAC-Bayesian bounding technique.\n\n3.1 PAC-Bayesian bound on the variance\n\nTheorem 3 is based on an application of the PAC-Bayesian bounding technique to the difference E\u03c1[V(h)] \u2212 E\u03c1[Vn(h)]. We note that Vn(h) is a second-order U-statistic [13] and Theorem 3 provides an interesting example of combining PAC-Bayesian analysis with concentration inequalities for self-bounding functions.\n\nTheorem 3. 
For any fixed distribution \u03c0 over H, any c2 > 1 and \u03b42 > 0, with probability greater than 1 \u2212 \u03b42 over a draw of S, for all distributions \u03c1 over H simultaneously:\n\\[ \\mathbb{E}_\\rho[\\mathbb{V}(h)] \\le \\mathbb{E}_\\rho[\\mathbb{V}_n(h)] + (1 + c_2)\\sqrt{\\frac{\\mathbb{E}_\\rho[\\mathbb{V}_n(h)]\\left(\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{\\nu_2}{\\delta_2}\\right)}{2(n-1)}} + \\frac{2c_2\\left(\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{\\nu_2}{\\delta_2}\\right)}{n-1}, \\tag{6} \\]\nwhere\n\\[ \\nu_2 = \\left\\lceil \\frac{1}{\\ln c_2} \\ln\\left(\\frac{1}{2}\\sqrt{\\frac{n-1}{\\ln(1/\\delta_2)} + 1} + \\frac{1}{2}\\right) \\right\\rceil. \\]\n\nNote that (6) closely resembles the explicit bound on L(G\u03c1) in (3). If the empirical variance Vn(h) is close to zero the impact of the second term of the bound (that scales with 1/\u221an) is relatively small and we obtain a \u201cfast convergence rate\u201d of E\u03c1[Vn(h)] to E\u03c1[V(h)]. Finally, we note that the impact of c2 on ln \u03bd2 is relatively small and so c2 can be taken very close to 1.\n\n3.2 PAC-Bayes-Empirical-Bernstein bound\n\nTheorem 3 controls the average variance E\u03c1[V(h)] for all posterior distributions \u03c1 simultaneously. By taking \u03b41 = \u03b42 = \u03b4/2 we have the claims of Theorems 2 and 3 holding simultaneously with probability greater than 1 \u2212 \u03b4. Substitution of the bound on E\u03c1[V(h)] from Theorem 3 into the PAC-Bayes-Bernstein inequality in Theorem 2 yields the main result of our paper, the PAC-Bayes-Empirical-Bernstein inequality, that controls the loss of the Gibbs regression rule E\u03c1[L(h)] for all posterior distributions \u03c1 simultaneously.\n\nTheorem 4. Let Vn(\u03c1) denote the right hand side of (6) (with \u03b42 = \u03b4/2) and let V\u0304n(\u03c1) = min(Vn(\u03c1), 1/4). For any fixed distribution \u03c0 over H, for any \u03b4 > 0, and for any c1, c2 > 1, with probability greater than 1 \u2212 \u03b4 (over a draw of S) we have:\n\\[ L(G_\\rho) \\le L_n(G_\\rho) + (1 + c_1)\\sqrt{\\frac{(e-2)\\,\\bar{V}_n(\\rho)\\left(\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{2\\nu_1}{\\delta}\\right)}{n}} \\tag{7} \\]\nsimultaneously for all distributions \u03c1 over H that satisfy\n\\[ \\sqrt{\\frac{\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{2\\nu_1}{\\delta}}{(e-2)\\,\\bar{V}_n(\\rho)}} \\le \\sqrt{n}, \\]\nwhere \u03bd1 was defined in Theorem 2 (with \u03b41 = \u03b4/2), and for all other \u03c1 we have:\n\\[ L(G_\\rho) \\le L_n(G_\\rho) + 2\\,\\frac{\\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{2\\nu_1}{\\delta}}{n}. \\]\n\nNote that all the quantities in Theorem 4 are computable based on the sample.\n\nAs we can see immediately by comparing the O(1/\u221an) term in the PAC-Bayes-Empirical-Bernstein inequality (PB-EB for brevity) with the corresponding term in the relaxed version of the PAC-Bayes-kl inequality (PB-kl for brevity) in equation (3), the PB-EB inequality can potentially be tighter when E\u03c1[Vn(h)] \u2264 (1/(2(e \u2212 2)))Ln(G\u03c1) \u2248 0.7Ln(G\u03c1). We also note that when the loss is bounded in the [0, 1] interval we have Vn(h) \u2264 (n/(n \u2212 1))Ln(h) (since \u2113h(X, Y)\u00b2 \u2264 \u2113h(X, Y)). Therefore, the PB-EB bound is never much worse than the PB-kl bound and if the empirical variance is small compared to the empirical loss it can be much tighter. We note that for the binary loss (\u2113(y, y\u2032) \u2208 {0, 1}) we have V(h) = L(h)(1 \u2212 L(h)) and in this case the empirical variance cannot be significantly smaller than the empirical loss and PB-EB does not provide an advantage over PB-kl. We also note that the unrelaxed version of the PB-kl inequality in equation (1) has better behavior for very small sample sizes and in such cases PB-kl can be tighter than PB-EB even when the empirical variance is small. 
To summarize the discussion, when E\u03c1[Vn(h)] \u2264 0.7Ln(G\u03c1) the PB-EB inequality can be significantly tighter than the PB-kl bound and otherwise it is comparable (except for very small sample sizes). In Section 5 we provide a more detailed numerical comparison of the two inequalities.\n\n4 Proofs\n\nIn this section we present a sketch of a proof of Theorem 3 and a proof of Theorem 4. Full details of the proof of Theorem 3 are provided in the supplementary material. The proof of Theorem 3 is based on the following lemma, which is at the base of all PAC-Bayesian theorems. (Since we could not find a reference where the lemma is stated explicitly, its proof is provided in the supplementary material.)\n\nLemma 1. For any function $f_n : \\mathcal{H} \\times (\\mathcal{X} \\times \\mathcal{Y})^n \\to \\mathbb{R}$ and for any distribution \u03c0 over H, such that \u03c0 is independent of S, with probability greater than 1 \u2212 \u03b4 over a random draw of S, for all distributions \u03c1 over H simultaneously:\n\\[ \\mathbb{E}_\\rho[f_n(h, S)] \\le \\mathrm{KL}(\\rho\\|\\pi) + \\ln\\frac{1}{\\delta} + \\ln \\mathbb{E}_\\pi\\left[\\mathbb{E}_{S' \\sim \\mathcal{D}^n}\\left[e^{f_n(h, S')}\\right]\\right]. \\tag{8} \\]\n\nThe smart part is to choose fn(h, S) so that we get the quantities of interest on the left hand side of (8) and at the same time are able to bound the last term on the right hand side of (8). Bounding of the moment generating function (the last term in (8)) is usually done by involving some known concentration of measure results. In the proof of Theorem 3 we use the fact that nVn(h) satisfies the self-bounding property [14]. Specifically, for any \u03bb > 0:\n\\[ \\mathbb{E}_{S \\sim \\mathcal{D}^n}\\left[e^{\\lambda\\left(n\\mathbb{V}(h) - n\\mathbb{V}_n(h)\\right) - \\frac{\\lambda^2}{2}\\frac{n^2}{n-1}\\mathbb{V}(h)}\\right] \\le 1 \\tag{9} \\]\n(see, for example, [9, Theorem 10]). We take $f_n(h, S) = \\lambda\\left(n\\mathbb{V}(h) - n\\mathbb{V}_n(h)\\right) - \\frac{\\lambda^2}{2}\\frac{n^2}{n-1}\\mathbb{V}(h)$ and substitute fn and the bound on its moment generating function in (9) into (8). To complete the proof it remains to optimize the bound with respect to \u03bb. Since it is impossible to minimize the bound simultaneously for all \u03c1 with a single value of \u03bb, we follow the technique suggested by Seldin et al. and take a grid of \u03bb values in the form of a geometric progression and apply a union bound over this grid. Then, for each \u03c1 we pick the value of \u03bb from the grid that is closest to the value of \u03bb that minimizes the bound for the corresponding \u03c1. (The approximation of the optimal \u03bb by the closest \u03bb from the grid is behind the factor c2 in the bound and the ln \u03bd2 factor is the result of the union bound over the grid of \u03bb values.) Technical details of the derivation are provided in the supplementary material.\n\nProof of Theorem 4. By our choice of \u03b41 = \u03b42 = \u03b4/2 the upper bounds of Theorems 2 and 3 hold simultaneously with probability greater than 1 \u2212 \u03b4. Therefore, with probability greater than 1 \u2212 \u03b42 we have E\u03c1[V(h)] \u2264 V\u0304n(\u03c1) \u2264 1/4 and the result follows by Theorem 2.\n\nFigure 1: The ratio of the gap between PB-EB and Ln(G\u03c1) to the gap between PB-kl and Ln(G\u03c1) for different values of n, E\u03c1[Vn(h)], and Ln(G\u03c1). PB-EB is tighter below the dashed line with label 1. The axes of the graphs are in log scale. Panels: (a) n = 1000, (b) n = 4000.\n\n5 Experiments\n\nBefore presenting the experiments we present a general comparison of the behavior of the PB-EB and PB-kl bounds as a function of Ln(G\u03c1), E\u03c1[Vn(h)], and n. In Figure 1.a and 1.b we examine the ratio of the complexity parts of the two bounds\n\\[ \\frac{\\text{PB-EB} - L_n(G_\\rho)}{\\text{PB-kl} - L_n(G_\\rho)}, \\]\nwhere PB-EB is used to denote the value of the PB-EB bound in equation (7) and PB-kl is used to denote the value of the PB-kl bound in equation (1). 
The ratio is presented in the Ln(G\u03c1) by E\u03c1[Vn(h)] plane for two values of n. In the illustrative comparison we took KL(\u03c1\u2016\u03c0) = 18 and in all the experiments presented in this section we take c1 = c2 = 1.15 and \u03b4 = 0.05. As we wrote in the discussion of Theorem 4, PB-EB is never much worse than PB-kl and when E\u03c1[Vn(h)] \u226a Ln(G\u03c1) it can be significantly tighter. In the illustrative comparison in Figure 1, in the worst case the ratio is slightly above 2.5 and in the best case it is slightly above 0.3. We note that as the sample size grows the worst case ratio decreases (asymptotically down to 1.2) and the improvement of the best case ratio is unlimited.\n\n[Figure 1 axes: average empirical loss and average sample variance.]\n\nAs we already said, the advantage of the PB-EB inequality over the PB-kl inequality is most prominent in regression (for classification with zero-one loss it is roughly comparable to PB-kl). Below we provide regression experiments with L1 loss on synthetic data and three datasets from the UCI repository [15]. We use the PB-EB and PB-kl bounds to bound the loss of a regularized empirical risk minimization algorithm. In all our experiments the inputs Xi lie in a d-dimensional unit ball centered at the origin (\u2016Xi\u20162 \u2264 1) and the outputs Y take values in [\u22120.5, 0.5]. The hypothesis class HW is defined as\n\\[ \\mathcal{H}_W = \\left\\{ h_w(X) = \\langle w, X \\rangle : w \\in \\mathbb{R}^d, \\|w\\|_2 \\le 0.5 \\right\\}. \\]\nThis construction ensures that the L1 regression loss \u2113(y, y\u2032) = |y \u2212 y\u2032| is bounded in the [0, 1] interval. We use a uniform prior distribution over HW defined by $\\pi(w) = \\left(V(1/2, d)\\right)^{-1}$, where V(r, d) is the volume of a d-dimensional ball with radius r. The posterior distribution $\\rho_{\\hat{w}}$ is taken to be a uniform distribution on a d-dimensional ball of radius \u03b5 centered at the weight vector $\\hat{w}$, where $\\hat{w}$ is the solution of the following minimization problem:\n\\[ \\hat{w} = \\arg\\min_w \\frac{1}{n}\\sum_{i=1}^n |Y_i - \\langle w, X_i \\rangle| + \\lambda^* \\|w\\|_2^2. \\tag{10} \\]\n\nNote that (10) is a quadratic program and can be solved by various numerical solvers (we used Matlab quadprog). The role of the regularization term $\\lambda^* \\|w\\|_2^2$ is to ensure that the posterior distribution is supported by HW. We use binary search in order to find the minimal (non-negative) \u03bb\u2217, such that the posterior $\\rho_{\\hat{w}}$ is supported by HW (meaning that the ball of radius \u03b5 around $\\hat{w}$ is within the ball of radius 0.5 around the origin). In all the experiments below we used \u03b5 = 0.05.\n\n5.1 Synthetic data\n\nOur synthetic datasets are produced as follows. We take inputs X1, . . . , Xn uniformly distributed in a d-dimensional unit ball centered at the origin. Then we define\n\\[ Y_i = \\sigma_0(50 \\cdot \\langle w_0, X_i \\rangle) + \\epsilon_i \\]\nwith weight vector $w_0 \\in \\mathbb{R}^d$, centred sigmoid function $\\sigma_0(z) = \\frac{1}{1 + e^{-z}} - 0.5$ which takes values in [\u22120.5, 0.5], and noise \u03b5i independent of Xi and uniformly distributed in [\u2212ai, ai] with\n\\[ a_i = \\begin{cases} \\min\\left(0.1,\\, 0.5 - \\sigma_0(50 \\cdot \\langle w_0, X_i \\rangle)\\right), & \\text{for } \\sigma_0(50 \\cdot \\langle w_0, X_i \\rangle) \\ge 0; \\\\ \\min\\left(0.1,\\, 0.5 + \\sigma_0(50 \\cdot \\langle w_0, X_i \\rangle)\\right), & \\text{for } \\sigma_0(50 \\cdot \\langle w_0, X_i \\rangle) < 0. \\end{cases} \\]\n\nThis design ensures that Yi \u2208 [\u22120.5, 0.5]. The sigmoid function creates a mismatch between the data generating distribution and the linear hypothesis class. 
Together with the relatively small level of the noise (|\u03b5i| \u2264 0.1) this results in small empirical variance of the loss Vn(h) and medium to high empirical loss Ln(h). Let us denote the j-th coordinate of a vector $u \\in \\mathbb{R}^d$ by $u^j$ and the number of nonzero coordinates of u by \u2016u\u20160. We choose the weight vector w0 to have only a few nonzero coordinates and consider two settings. In the first setting d \u2208 {2, 5}, \u2016w0\u20160 = 2, $w_0^1 = 0.12$, and $w_0^2 = -0.04$ and in the second setting d \u2208 {3, 6}, \u2016w0\u20160 = 3, $w_0^1 = -0.08$, $w_0^2 = 0.05$, and $w_0^3 = 0.2$.\n\nFor each sample size ranging from 300 to 4000 we averaged the bounds over 10 randomly generated datasets. The results are presented in Figure 2. We see that except for very small sample sizes (n < 1000) the PB-EB bound outperforms the PB-kl bound. Inferior performance for very small sample sizes is a result of domination of the O(1/n) term in the PB-EB bound (7). As soon as n gets large enough this term significantly decreases and PB-EB dominates PB-kl.\n\nFigure 2: The values of the PAC-Bayes-kl and PAC-Bayes-Empirical-Bernstein bounds together with the test and train errors on synthetic data. The values are averaged over the 10 random draws of training and test sets. Panels: (a) d = 2, \u2016w0\u20160 = 2; (b) d = 5, \u2016w0\u20160 = 2; (c) d = 3, \u2016w0\u20160 = 3; (d) d = 6, \u2016w0\u20160 = 3. Axes: sample size and expected loss; curves: PB-EB, PB-kl, train error, test error.\n\n5.2 UCI datasets\n\nWe compare our PAC-Bayes-Empirical-Bernstein inequality (7) with the PAC-Bayes-kl inequality (1) on three UCI regression datasets: Wine Quality, Parkinsons Telemonitoring, and Concrete Compressive Strength. For each dataset we centred and normalised both outputs and inputs so that Yi \u2208 [\u22120.5, 0.5] and \u2016Xi\u2016 \u2264 1. The results for a 5-fold train-test split of the data together with basic descriptions of the datasets are presented in Table 1.\n\nTable 1: Results for the UCI datasets\n\nDataset      n     d   Train           Test            PB-kl bound     PB-EB bound\nwinequality  6497  11  0.106 \u00b1 0.0005  0.106 \u00b1 0.0022  0.175 \u00b1 0.0006  0.162 \u00b1 0.0006\nparkinsons   5875  16  0.188 \u00b1 0.0014  0.188 \u00b1 0.0055  0.266 \u00b1 0.0013  0.250 \u00b1 0.0012\nconcrete     1030  8   0.110 \u00b1 0.0008  0.111 \u00b1 0.0038  0.242 \u00b1 0.0010  0.264 \u00b1 0.0011\n\n6 Conclusions and future work\n\nWe derived a new PAC-Bayesian bound that controls the convergence of averages of empirical variances of losses of hypotheses in H to averages of expected variances of losses of hypotheses in H simultaneously for all averaging distributions \u03c1. This bound is an interesting example of a combination of the PAC-Bayesian bounding technique with concentration inequalities for self-bounding functions. We applied the bound to derive the PAC-Bayes-Empirical-Bernstein inequality, which is a powerful Bernstein-type inequality outperforming the state-of-the-art PAC-Bayes-kl inequality of Seeger [7] in situations where the empirical variance is smaller than the empirical loss and otherwise comparable to PAC-Bayes-kl. We also demonstrated an empirical advantage of the new PAC-Bayes-Empirical-Bernstein inequality over the PAC-Bayes-kl inequality on several synthetic and real-life regression datasets.\n\nOur work opens a number of interesting directions for future research. One of the most important of them is to derive algorithms that will directly minimize the PAC-Bayes-Empirical-Bernstein bound. Another interesting direction would be to decrease the last term in the bound in Theorem 3, as it is done in the PAC-Bayes-kl inequality. 
This can probably be achieved by deriving a PAC-Bayes-kl inequality for the variance.\n\nAcknowledgments\n\nThe authors are thankful to Anton Osokin for useful discussions and to the anonymous reviewers for their comments. This research was supported by an Australian Research Council Australian Laureate Fellowship (FL110100281) and Russian Foundation for Basic Research grants 13-07-00677, 14-07-00847.\n\nReferences\n\n[1] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems (NIPS), 2002.\n[2] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1), 2003.\n[3] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 2005.\n[4] Yevgeny Seldin and Naftali Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11, 2010.\n[5] Matthew Higgs and John Shawe-Taylor. A PAC-Bayes bound for tailored density estimation. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2010.\n[6] Yevgeny Seldin, Peter Auer, Fran\u00e7ois Laviolette, John Shawe-Taylor, and Ronald Ortner. PAC-Bayesian analysis of contextual bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.\n[7] Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 2002.\n[8] Yevgeny Seldin, Fran\u00e7ois Laviolette, Nicol\u00f2 Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58, 2012.\n[9] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2009.\n[10] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.\n[11] Andreas Maurer. A note on the PAC-Bayesian theorem. www.arxiv.org, 2004.\n[12] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37, 1999.\n[13] A.W. Van Der Vaart. Asymptotic statistics. Cambridge University Press, 1998.\n[14] St\u00e9phane Boucheron, G\u00e1bor Lugosi, and Olivier Bousquet. Concentration inequalities. In O. Bousquet, U.v. Luxburg, and G. R\u00e4tsch, editors, Advanced Lectures in Machine Learning. Springer, 2004.\n[15] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007. http://www.ics.uci.edu/~mlearn/MLRepository.html.\n", "award": [], "sourceid": 107, "authors": [{"given_name": "Ilya", "family_name": "Tolstikhin", "institution": "Russian Academy of Sciences"}, {"given_name": "Yevgeny", "family_name": "Seldin", "institution": "Queensland University of Technology & UC Berkeley"}]}