{"title": "PAC-Bayesian Theory Meets Bayesian Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1884, "page_last": 1892, "abstract": "We exhibit a strong link between frequentist PAC-Bayesian bounds and the Bayesian marginal likelihood. That is, for the negative log-likelihood loss function, we show that the minimization of PAC-Bayesian generalization bounds maximizes the Bayesian marginal likelihood. This provides an alternative explanation to the Bayesian Occam's razor criteria, under the assumption that the data is generated by an i.i.d. distribution. Moreover, as the negative log-likelihood is an unbounded loss function, we motivate and propose a PAC-Bayesian theorem tailored for the sub-gamma loss family, and we show that our approach is sound on classical Bayesian linear regression tasks.", "full_text": "PAC-Bayesian Theory Meets Bayesian Inference\n\nPascal Germain\u2020\n\nFrancis Bach\u2020 Alexandre Lacoste\u2021\n\nSimon Lacoste-Julien\u2020\n\n\u2020 INRIA Paris - \u00c9cole Normale Sup\u00e9rieure, firstname.lastname@inria.fr\n\n\u2021 Google, allac@google.com\n\nAbstract\n\nWe exhibit a strong link between frequentist PAC-Bayesian risk bounds and the\nBayesian marginal likelihood. That is, for the negative log-likelihood loss func-\ntion, we show that the minimization of PAC-Bayesian generalization risk bounds\nmaximizes the Bayesian marginal likelihood. This provides an alternative expla-\nnation to the Bayesian Occam\u2019s razor criteria, under the assumption that the data\nis generated by an i.i.d. distribution. 
Moreover, as the negative log-likelihood is an unbounded loss function, we motivate and propose a PAC-Bayesian theorem tailored for the sub-gamma loss family, and we show that our approach is sound on classical Bayesian linear regression tasks.\n\n1 Introduction\n\n
Since its early beginning [24, 34], the PAC-Bayesian theory claims to provide \u201cPAC guarantees to Bayesian algorithms\u201d (McAllester [24]). However, despite the amount of work dedicated to this statistical learning theory (many authors improved the initial results [8, 21, 25, 30, 35] and/or generalized them for various machine learning setups [4, 12, 15, 20, 28, 31, 32, 33]), it is mostly used as a frequentist method. That is, under the assumption that the learning samples are i.i.d.-generated by a data distribution, this theory expresses probably approximately correct (PAC) bounds on the generalization risk. In other words, with probability $1-\\delta$, the generalization risk is at most $\\varepsilon$ away from the training risk. The Bayesian side of PAC-Bayes comes mostly from the fact that these bounds are expressed on the averaging/aggregation/ensemble of multiple predictors (weighted by a posterior distribution) and incorporate prior knowledge. Although it is still sometimes referred to as a theory that bridges the Bayesian and frequentist approaches [e.g., 16], it has seldom been used to justify Bayesian methods until now (see footnote 1).\n\n
In this work, we provide a direct connection between Bayesian inference techniques [summarized by 5, 13] and PAC-Bayesian risk bounds in a general setup. Our study is based on a simple but insightful connection between the Bayesian marginal likelihood and PAC-Bayesian bounds (previously mentioned by Gr\u00fcnwald [14]) obtained by considering the negative log-likelihood loss function (Section 3). 
By doing so, we provide an alternative explanation for the Bayesian Occam\u2019s razor criterion [18, 22] in the context of model selection, expressed as the complexity-accuracy trade-off appearing in most PAC-Bayesian results. In Section 4, we extend PAC-Bayes theorems to regression problems with unbounded loss, adapted to the negative log-likelihood loss function. Finally, we study Bayesian model selection from a PAC-Bayesian perspective (Section 5), and illustrate our findings on classical Bayesian regression tasks (Section 6).\n\n
Footnote 1: Some existing connections [3, 6, 14, 19, 29, 30, 36] are discussed in Appendix A.1.\n\n
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n
2 PAC-Bayesian Theory\n\n
We denote the learning sample $(X, Y) = \\{(x_i, y_i)\\}_{i=1}^n \\in (\\mathcal{X}\\times\\mathcal{Y})^n$, which contains $n$ input-output pairs. The main assumption of frequentist learning theories, including PAC-Bayes, is that $(X, Y)$ is randomly sampled from a data generating distribution that we denote $\\mathcal{D}$. Thus, we denote by $(X, Y)\\sim\\mathcal{D}^n$ the i.i.d. observation of $n$ elements. From a frequentist perspective, we consider in this work loss functions $\\ell : \\mathcal{F}\\times\\mathcal{X}\\times\\mathcal{Y} \\to \\mathbb{R}$, where $\\mathcal{F}$ is a (discrete or continuous) set of predictors $f : \\mathcal{X} \\to \\mathcal{Y}$, and we write the empirical risk on the sample $(X, Y)$ and the generalization error on distribution $\\mathcal{D}$ as\n\n
$$\\widehat{L}^{\\ell}_{X,Y}(f) = \\frac{1}{n}\\sum_{i=1}^{n} \\ell(f, x_i, y_i)\\,; \\qquad L^{\\ell}_{\\mathcal{D}}(f) = \\mathbf{E}_{(x,y)\\sim\\mathcal{D}}\\, \\ell(f, x, y)\\,.$$\n\n
The PAC-Bayesian theory [24, 25] studies an averaging of the above losses according to a posterior distribution $\\hat\\rho$ over $\\mathcal{F}$. That is, it provides probably approximately correct generalization bounds on the (unknown) quantity $\\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell}_{\\mathcal{D}}(f) = \\mathbf{E}_{f\\sim\\hat\\rho} \\mathbf{E}_{(x,y)\\sim\\mathcal{D}}\\, \\ell(f, x, y)$, given the empirical estimate $\\mathbf{E}_{f\\sim\\hat\\rho} \\widehat{L}^{\\ell}_{X,Y}(f)$ and some other parameters. 
Among these, most PAC-Bayesian theorems rely on the Kullback-Leibler divergence $\\mathrm{KL}(\\hat\\rho\\|\\pi) = \\mathbf{E}_{f\\sim\\hat\\rho} \\ln[\\hat\\rho(f)/\\pi(f)]$ between a prior distribution $\\pi$ over $\\mathcal{F}$, specified before seeing the learning sample $(X, Y)$, and the posterior $\\hat\\rho$, typically obtained by feeding a learning process with $(X, Y)$.\n\n
Two appealing aspects of PAC-Bayesian theorems are that they provide data-driven generalization bounds that are computed on the training sample (i.e., they do not rely on a testing sample), and that they are uniformly valid for all $\\hat\\rho$ over $\\mathcal{F}$. This explains why many works study them as model selection criteria or as an inspiration for the design of learning algorithms. Theorem 1, due to Catoni [8], has been used to derive or study learning algorithms [10, 17, 26, 27].\n\n
Theorem 1 (Catoni [8]). Given a distribution $\\mathcal{D}$ over $\\mathcal{X}\\times\\mathcal{Y}$, a hypothesis set $\\mathcal{F}$, a loss function $\\ell' : \\mathcal{F}\\times\\mathcal{X}\\times\\mathcal{Y} \\to [0, 1]$, a prior distribution $\\pi$ over $\\mathcal{F}$, a real number $\\delta \\in (0, 1]$, and a real number $\\beta > 0$, with probability at least $1-\\delta$ over the choice of $(X, Y) \\sim \\mathcal{D}^n$, we have\n\n
$$\\forall \\hat\\rho \\text{ on } \\mathcal{F}: \\quad \\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell'}_{\\mathcal{D}}(f) \\le \\frac{1}{1-e^{-\\beta}} \\left[ 1 - \\exp\\left( -\\beta\\, \\mathbf{E}_{f\\sim\\hat\\rho} \\widehat{L}^{\\ell'}_{X,Y}(f) - \\frac{1}{n}\\left( \\mathrm{KL}(\\hat\\rho\\|\\pi) + \\ln\\frac{1}{\\delta} \\right) \\right) \\right]. \\tag{1}$$\n\n
Theorem 1 is limited to loss functions mapping to the range $[0, 1]$. Through a straightforward rescaling we can extend it to any bounded loss, i.e., $\\ell : \\mathcal{F}\\times\\mathcal{X}\\times\\mathcal{Y} \\to [a, b]$, where $[a, b] \\subset \\mathbb{R}$. This is done by using $\\beta := b - a$ and the rescaled loss function $\\ell'(f, x, y) := (\\ell(f, x, y) - a)/(b - a) \\in [0, 1]$. After a few arithmetic manipulations, we can rewrite Equation (1) as\n\n
$$\\forall \\hat\\rho \\text{ on } \\mathcal{F}: \\quad \\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell}_{\\mathcal{D}}(f) \\le a + \\frac{b-a}{1-e^{a-b}} \\left[ 1 - \\exp\\left( -\\mathbf{E}_{f\\sim\\hat\\rho} \\widehat{L}^{\\ell}_{X,Y}(f) + a - \\frac{1}{n}\\left( \\mathrm{KL}(\\hat\\rho\\|\\pi) + \\ln\\frac{1}{\\delta} \\right) \\right) \\right]. \\tag{2}$$\n\n
From an algorithm design perspective, Equation (2) suggests optimizing a trade-off between the empirical expected loss and the Kullback-Leibler divergence. Indeed, for fixed $\\pi$, $X$, $Y$, $n$, and $\\delta$, minimizing Equation (2) is equivalent to finding the distribution $\\hat\\rho$ that minimizes\n\n
$$n\\, \\mathbf{E}_{f\\sim\\hat\\rho} \\widehat{L}^{\\ell}_{X,Y}(f) + \\mathrm{KL}(\\hat\\rho\\|\\pi). \\tag{3}$$\n\n
It is well known [1, 8, 10, 21] that the optimal Gibbs posterior $\\hat\\rho^*$ is given by\n\n
$$\\hat\\rho^*(f) = \\frac{1}{Z_{X,Y}}\\, \\pi(f)\\, e^{-n \\widehat{L}^{\\ell}_{X,Y}(f)}, \\tag{4}$$\n\n
where $Z_{X,Y}$ is a normalization term. Notice that the constant $\\beta$ of Equation (1) is now absorbed in the loss function as the rescaling factor setting the trade-off between the expected empirical loss and $\\mathrm{KL}(\\hat\\rho\\|\\pi)$.\n\n
3 Bridging Bayes and PAC-Bayes\n\n
In this section, we show that by choosing the negative log-likelihood loss function, minimizing the PAC-Bayes bound is equivalent to maximizing the Bayesian marginal likelihood. To obtain this result, we first consider the Bayesian approach that starts by defining a prior $p(\\theta)$ over the set of possible model parameters $\\Theta$. This induces a set of probabilistic estimators $f_\\theta \\in \\mathcal{F}$, mapping $x$ to a probability distribution over $\\mathcal{Y}$. Then, we can estimate the likelihood of observing $y$ given $x$ and $\\theta$, i.e., $p(y|x, \\theta) \\equiv f_\\theta(y|x)$ (see footnote 2). Using Bayes\u2019 rule, we obtain the posterior $p(\\theta|X, Y)$:\n\n
$$p(\\theta|X, Y) = \\frac{p(\\theta)\\, p(Y|X, \\theta)}{p(Y|X)} \\propto p(\\theta)\\, p(Y|X, \\theta), \\tag{5}$$\n\n
where $p(Y|X, \\theta) = \\prod_{i=1}^n p(y_i|x_i, \\theta)$ and $p(Y|X) = \\int_\\Theta p(\\theta)\\, p(Y|X, \\theta)\\, d\\theta$.\n\n
Footnote 2: To stay aligned with the PAC-Bayesian setup, we only consider the discriminative case in this paper. One can extend to the generative setup by considering the likelihood of the form $p(y, x|\\theta)$ instead.\n\n
To bridge the Bayesian approach with the PAC-Bayesian framework, we consider the negative log-likelihood loss function [3], denoted $\\ell_{\\mathrm{nll}}$ and defined by\n\n
$$\\ell_{\\mathrm{nll}}(f_\\theta, x, y) \\equiv -\\ln p(y|x, \\theta). \\tag{6}$$\n\n
Then, we can relate the empirical loss $\\widehat{L}^{\\ell}_{X,Y}$ of a predictor to its likelihood:\n\n
$$\\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta) = \\frac{1}{n}\\sum_{i=1}^n \\ell_{\\mathrm{nll}}(\\theta, x_i, y_i) = -\\frac{1}{n}\\sum_{i=1}^n \\ln p(y_i|x_i, \\theta) = -\\frac{1}{n} \\ln p(Y|X, \\theta),$$\n\n
or, the other way around,\n\n
$$p(Y|X, \\theta) = e^{-n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)}. \\tag{7}$$\n\n
Unfortunately, existing PAC-Bayesian theorems work with bounded loss functions or in very specific contexts [e.g., 9, 36], and $\\ell_{\\mathrm{nll}}$ spans the whole real axis in its general form. In Section 4, we explore PAC-Bayes bounds for unbounded losses. Meanwhile, we consider priors with bounded likelihood. This can be done by assigning a prior of zero to any $\\theta$ yielding $\\ln\\frac{1}{p(y|x,\\theta)} \\notin [a, b]$.\n\n
Now, using Equation (7), the optimal posterior (Equation 4) simplifies to\n\n
$$\\hat\\rho^*(\\theta) = \\frac{\\pi(\\theta)\\, e^{-n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)}}{Z_{X,Y}} = \\frac{p(\\theta)\\, p(Y|X, \\theta)}{p(Y|X)} = p(\\theta|X, Y), \\tag{8}$$\n\n
where the normalization constant $Z_{X,Y}$ corresponds to the Bayesian marginal likelihood:\n\n
$$Z_{X,Y} \\equiv p(Y|X) = \\int_\\Theta \\pi(\\theta)\\, e^{-n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)}\\, d\\theta. \\tag{9}$$\n\n
This shows that the optimal PAC-Bayes posterior given by the generalization bound of Theorem 1 coincides with the Bayesian posterior, when one chooses $\\ell_{\\mathrm{nll}}$ as loss function and $\\beta := b - a$ (as in Equation 2).\n\n
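The coincidence between the Gibbs posterior and the Bayesian posterior can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' code: it assumes a hypothetical finite grid of candidate means for a unit-variance Gaussian likelihood, a uniform prior over that grid, and made-up observations. It builds the posterior once from Equation (4) with the negative log-likelihood loss and once from Bayes' rule (Equation 5), then evaluates the objective of Equation (3) at that posterior.

```python
import math

def nll_empirical_risk(theta, ys, sigma=1.0):
    # Average negative log-likelihood of N(y | theta, sigma^2) over the sample.
    const = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const + (y - theta) ** 2 / (2.0 * sigma ** 2) for y in ys) / len(ys)

def gibbs_posterior(thetas, prior, ys):
    # Equation (4) with the nll loss: rho*(theta) = prior(theta) * exp(-n * risk) / Z.
    n = len(ys)
    weights = [p * math.exp(-n * nll_empirical_risk(t, ys)) for t, p in zip(thetas, prior)]
    z = sum(weights)  # discrete analogue of Equation (9)
    return [w / z for w in weights], z

def bayes_posterior(thetas, prior, ys, sigma=1.0):
    # Bayes' rule (Equation 5): prior times likelihood, normalized by the evidence.
    def likelihood(theta):
        out = 1.0
        for y in ys:
            out *= math.exp(-(y - theta) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)
        return out
    joint = [p * likelihood(t) for t, p in zip(thetas, prior)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

def objective(rho, thetas, prior, ys):
    # Equation (3): n * E_rho[empirical risk] + KL(rho || prior).
    n = len(ys)
    risk = sum(r * nll_empirical_risk(t, ys) for r, t in zip(rho, thetas))
    kl = sum(r * math.log(r / p) for r, p in zip(rho, prior) if r > 0.0)
    return n * risk + kl

thetas = [-1.0, 0.0, 0.5, 1.0, 2.0]   # hypothetical parameter grid
prior = [0.2] * 5                     # uniform prior over the grid
ys = [0.3, 0.9, 0.4, 1.2, 0.7, 0.1]   # made-up observations

rho_star, z = gibbs_posterior(thetas, prior, ys)
posterior, evidence = bayes_posterior(thetas, prior, ys)
# Largest gap between the two posteriors (zero up to rounding).
print(max(abs(a - b) for a, b in zip(rho_star, posterior)))
# Objective of Equation (3) at rho*, next to -ln Z.
print(objective(rho_star, thetas, prior, ys), -math.log(z))
```

On this grid the two posteriors agree up to rounding, and the objective evaluated at the Gibbs posterior matches the negative log of the normalization constant. Any other distribution over the grid, for instance the prior itself, gives a larger objective value.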
Moreover, using the posterior of Equation (8) inside Equation (3), we obtain\n\n
$$\\begin{aligned} n\\, \\mathbf{E}_{\\theta\\sim\\hat\\rho^*} \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta) + \\mathrm{KL}(\\hat\\rho^*\\|\\pi) &= n \\int_\\Theta \\frac{\\pi(\\theta)\\, e^{-n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)}}{Z_{X,Y}}\\, \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)\\, d\\theta + \\int_\\Theta \\frac{\\pi(\\theta)\\, e^{-n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)}}{Z_{X,Y}} \\ln\\left[ \\frac{\\pi(\\theta)\\, e^{-n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)}}{\\pi(\\theta)\\, Z_{X,Y}} \\right] d\\theta \\\\ &= \\int_\\Theta \\frac{\\pi(\\theta)\\, e^{-n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)}}{Z_{X,Y}} \\left[ \\ln\\frac{1}{Z_{X,Y}} \\right] d\\theta = \\frac{Z_{X,Y}}{Z_{X,Y}} \\ln\\frac{1}{Z_{X,Y}} = -\\ln Z_{X,Y}. \\end{aligned} \\tag{10}$$\n\n
In other words, minimizing the PAC-Bayes bound is equivalent to maximizing the marginal likelihood. Thus, from the PAC-Bayesian standpoint, the latter encodes a trade-off between the averaged negative log-likelihood loss function and the prior-posterior Kullback-Leibler divergence. Note that Equation (10) has been mentioned by Gr\u00fcnwald [14], based on an earlier observation of Zhang [36]. However, the PAC-Bayesian theorems proposed by the latter do not bound the generalization loss directly, unlike the \u201cclassical\u201d PAC-Bayesian results [8, 24, 29] that we extend to regression in Section 4 (see the corresponding remarks in Appendix A.1).\n\n
We conclude this section by proposing a compact form of Theorem 1, expressed in terms of the marginal likelihood, as a direct consequence of Equation (10).\n\n
Corollary 2. Given a data distribution $\\mathcal{D}$, a parameter set $\\Theta$, a prior distribution $\\pi$ over $\\Theta$, a $\\delta \\in (0, 1]$, if $\\ell_{\\mathrm{nll}}$ lies in $[a, b]$, we have, with probability at least $1-\\delta$ over the choice of $(X, Y) \\sim \\mathcal{D}^n$,\n\n
$$\\mathbf{E}_{\\theta\\sim\\hat\\rho^*} L^{\\ell_{\\mathrm{nll}}}_{\\mathcal{D}}(\\theta) \\le a + \\frac{b-a}{1-e^{a-b}} \\left[ 1 - e^{a} \\sqrt[n]{Z_{X,Y}\\, \\delta} \\right],$$\n\n
where $\\hat\\rho^*$ is the Gibbs optimal posterior (Eq. 8) and $Z_{X,Y}$ is the marginal likelihood (Eq. 
9).\n\n
In Section 5, we exploit the link between PAC-Bayesian bounds and Bayesian marginal likelihood to expose similarities between both frameworks in the context of model selection. Beforehand, Section 4 extends the PAC-Bayesian generalization guarantees to unbounded loss functions. This is mandatory to make our study fully valid, as the negative log-likelihood loss function is in general unbounded (as are other common regression losses).\n\n
4 PAC-Bayesian Bounds for Regression\n\n
This section aims to extend the PAC-Bayesian results of Section 3 to real-valued unbounded losses. These results are used in forthcoming sections to study $\\ell_{\\mathrm{nll}}$, but they are valid for broader classes of loss functions. Importantly, our new results are focused on regression problems, as opposed to the usual PAC-Bayesian classification framework.\n\n
The new bounds are obtained through a recent theorem of Alquier et al. [1], stated below (we provide a proof in Appendix A.2 for completeness).\n\n
Theorem 3 (Alquier et al. [1]). Given a distribution $\\mathcal{D}$ over $\\mathcal{X}\\times\\mathcal{Y}$, a hypothesis set $\\mathcal{F}$, a loss function $\\ell : \\mathcal{F}\\times\\mathcal{X}\\times\\mathcal{Y} \\to \\mathbb{R}$, a prior distribution $\\pi$ over $\\mathcal{F}$, a $\\delta \\in (0, 1]$, and a real number $\\lambda > 0$, with probability at least $1-\\delta$ over the choice of $(X, Y) \\sim \\mathcal{D}^n$, we have\n\n
$$\\forall \\hat\\rho \\text{ on } \\mathcal{F}: \\quad \\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell}_{\\mathcal{D}}(f) \\le \\mathbf{E}_{f\\sim\\hat\\rho} \\widehat{L}^{\\ell}_{X,Y}(f) + \\frac{1}{\\lambda} \\left[ \\mathrm{KL}(\\hat\\rho\\|\\pi) + \\ln\\frac{1}{\\delta} + \\Psi_{\\ell,\\pi,\\mathcal{D}}(\\lambda, n) \\right], \\tag{11}$$\n\n
$$\\text{where} \\quad \\Psi_{\\ell,\\pi,\\mathcal{D}}(\\lambda, n) = \\ln \\mathbf{E}_{f\\sim\\pi}\\, \\mathbf{E}_{X',Y'\\sim\\mathcal{D}^n} \\exp\\left[ \\lambda \\left( L^{\\ell}_{\\mathcal{D}}(f) - \\widehat{L}^{\\ell}_{X',Y'}(f) \\right) \\right]. \\tag{12}$$\n\n
Alquier et al. used Theorem 3 to design a learning algorithm for $\\{0, 1\\}$-valued classification losses. Indeed, a bounded loss function $\\ell : \\mathcal{F}\\times\\mathcal{X}\\times\\mathcal{Y} \\to [a, b]$ can be used along with Theorem 3 by applying Hoeffding\u2019s lemma to Equation (12), which gives $\\Psi_{\\ell,\\pi,\\mathcal{D}}(\\lambda, n) \\le \\lambda^2 (b-a)^2 / (2n)$. More specifically, with $\\lambda := n$, we obtain the following bound:\n\n
$$\\forall \\hat\\rho \\text{ on } \\mathcal{F}: \\quad \\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell}_{\\mathcal{D}}(f) \\le \\mathbf{E}_{f\\sim\\hat\\rho} \\widehat{L}^{\\ell}_{X,Y}(f) + \\frac{1}{n} \\left[ \\mathrm{KL}(\\hat\\rho\\|\\pi) + \\ln\\frac{1}{\\delta} \\right] + \\frac{1}{2}(b-a)^2. \\tag{13}$$\n\n
Note that the latter bound leads to the same trade-off as Theorem 1 (expressed by Equation 3). However, the choice $\\lambda := n$ has the inconvenience that the bound value is at least $\\frac{1}{2}(b-a)^2$, even at the limit $n \\to \\infty$. With $\\lambda := \\sqrt{n}$ the bound converges (a result similar to Equation (14) is also formulated by Pentina and Lampert [28]):\n\n
$$\\forall \\hat\\rho \\text{ on } \\mathcal{F}: \\quad \\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell}_{\\mathcal{D}}(f) \\le \\mathbf{E}_{f\\sim\\hat\\rho} \\widehat{L}^{\\ell}_{X,Y}(f) + \\frac{1}{\\sqrt{n}} \\left[ \\mathrm{KL}(\\hat\\rho\\|\\pi) + \\ln\\frac{1}{\\delta} + \\frac{1}{2}(b-a)^2 \\right]. \\tag{14}$$\n\n
Sub-Gaussian losses. In a regression context, it may be restrictive to consider strictly bounded loss functions. Therefore, we extend Theorem 3 to sub-Gaussian losses. We say that a loss function $\\ell$ is sub-Gaussian with variance factor $s^2$ under a prior $\\pi$ and a data distribution $\\mathcal{D}$ if it can be described by a sub-Gaussian random variable $V = L^{\\ell}_{\\mathcal{D}}(f) - \\ell(f, x, y)$, i.e., its moment generating function is upper bounded by the one of a normal distribution of variance $s^2$ (see Boucheron et al. [7, Section 2.3]):\n\n
$$\\psi_V(\\lambda) = \\ln \\mathbf{E}\\, e^{\\lambda V} = \\ln \\mathbf{E}_{f\\sim\\pi}\\, \\mathbf{E}_{(x,y)\\sim\\mathcal{D}} \\exp\\left[ \\lambda \\left( L^{\\ell}_{\\mathcal{D}}(f) - \\ell(f, x, y) \\right) \\right] \\le \\frac{\\lambda^2 s^2}{2}, \\quad \\forall \\lambda \\in \\mathbb{R}. \\tag{15}$$\n\n
The above sub-Gaussian assumption corresponds to the Hoeffding assumption of Alquier et al. [1], and allows us to obtain the following result.\n\n
Corollary 4. Given $\\mathcal{D}$, $\\mathcal{F}$, $\\ell$, $\\pi$ and $\\delta$ defined in the statement of Theorem 3, if the loss is sub-Gaussian with variance factor $s^2$, we have, with probability at least $1-\\delta$ over the choice of $(X, Y) \\sim \\mathcal{D}^n$,\n\n
$$\\forall \\hat\\rho \\text{ on } \\mathcal{F}: \\quad \\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell}_{\\mathcal{D}}(f) \\le \\mathbf{E}_{f\\sim\\hat\\rho} \\widehat{L}^{\\ell}_{X,Y}(f) + \\frac{1}{n} \\left[ \\mathrm{KL}(\\hat\\rho\\|\\pi) + \\ln\\frac{1}{\\delta} \\right] + \\frac{1}{2} s^2.$$\n\n
Proof. For $i = 1, \\ldots, n$, we denote by $\\ell_i$ an i.i.d. realization of the random variable $L^{\\ell}_{\\mathcal{D}}(f) - \\ell(f, x, y)$. Then\n\n
$$\\Psi_{\\ell,\\pi,\\mathcal{D}}(\\lambda, n) = \\ln \\mathbf{E} \\exp\\left[ \\frac{\\lambda}{n} \\sum_{i=1}^n \\ell_i \\right] = \\ln \\prod_{i=1}^n \\mathbf{E} \\exp\\left[ \\frac{\\lambda}{n} \\ell_i \\right] = \\sum_{i=1}^n \\psi_{\\ell_i}\\left( \\frac{\\lambda}{n} \\right) \\le n\\, \\frac{\\lambda^2 s^2}{2 n^2} = \\frac{\\lambda^2 s^2}{2n},$$\n\n
where the inequality comes from the sub-Gaussian loss assumption (Equation 15). The result is then obtained from Theorem 3, with $\\lambda := n$.\n\n
Sub-gamma losses. We say that an unbounded loss function $\\ell$ is sub-gamma with a variance factor $s^2$ and scale parameter $c$, under a prior $\\pi$ and a data distribution $\\mathcal{D}$, if it can be described by a sub-gamma random variable $V$ (see Boucheron et al. [7, Section 2.4]), that is\n\n
$$\\psi_V(\\lambda) \\le \\frac{s^2}{c^2} \\left( -\\ln(1 - \\lambda c) - \\lambda c \\right) \\le \\frac{\\lambda^2 s^2}{2(1 - c\\lambda)}, \\quad \\forall \\lambda \\in (0, \\tfrac{1}{c}). \\tag{16}$$\n\n
Under this sub-gamma assumption, we obtain the following new result, which is necessary to study linear regression in the next sections.\n\n
Corollary 5. Given $\\mathcal{D}$, $\\mathcal{F}$, $\\ell$, $\\pi$ and $\\delta$ defined in the statement of Theorem 3, if the loss is sub-gamma with variance factor $s^2$ and scale $c < 1$, we have, with probability at least $1-\\delta$ over $(X, Y) \\sim \\mathcal{D}^n$,\n\n
$$\\forall \\hat\\rho \\text{ on } \\mathcal{F}: \\quad \\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell}_{\\mathcal{D}}(f) \\le \\mathbf{E}_{f\\sim\\hat\\rho} \\widehat{L}^{\\ell}_{X,Y}(f) + \\frac{1}{n} \\left[ \\mathrm{KL}(\\hat\\rho\\|\\pi) + \\ln\\frac{1}{\\delta} \\right] + \\frac{1}{2(1-c)} s^2. \\tag{17}$$\n\n
As a special case, with $\\ell := \\ell_{\\mathrm{nll}}$ and $\\hat\\rho := \\hat\\rho^*$ (Equation 8), we have\n\n
$$\\mathbf{E}_{\\theta\\sim\\hat\\rho^*} L^{\\ell_{\\mathrm{nll}}}_{\\mathcal{D}}(\\theta) \\le \\frac{s^2}{2(1-c)} - \\frac{1}{n} \\ln\\left( Z_{X,Y}\\, \\delta \\right). \\tag{18}$$\n\n
Proof. Following the same path as in the proof of Corollary 4 (with $\\lambda := n$), we have\n\n
$$\\Psi_{\\ell,\\pi,\\mathcal{D}}(n, n) = \\ln \\mathbf{E} \\exp\\left[ \\sum_{i=1}^n \\ell_i \\right] = \\ln \\prod_{i=1}^n \\mathbf{E} \\exp\\left[ \\ell_i \\right] = \\sum_{i=1}^n \\psi_{\\ell_i}(1) \\le n\\, \\frac{s^2}{2(1-c)},$$\n\n
where the inequality comes from the sub-gamma loss assumption, with $1 \\in (0, \\tfrac{1}{c})$.\n\n
Squared loss. The parameters $s$ and $c$ of Corollary 5 rely on the chosen loss function and prior, and the assumptions concerning the data distribution. As an example, consider a regression problem where $\\mathcal{X}\\times\\mathcal{Y} \\subset \\mathbb{R}^d\\times\\mathbb{R}$, a family of linear predictors $f_w(x) = w \\cdot x$, with $w \\in \\mathbb{R}^d$, and a Gaussian prior $\\mathcal{N}(0, \\sigma_\\pi^2 I)$. Let us assume that the input examples are generated by $x \\sim \\mathcal{N}(0, \\sigma_x^2 I)$ with label $y = w^* \\cdot x + \\epsilon$, where $w^* \\in \\mathbb{R}^d$ and $\\epsilon \\sim \\mathcal{N}(0, \\sigma_\\epsilon^2)$ is a Gaussian noise. Under the squared loss function\n\n
$$\\ell_{\\mathrm{sqr}}(w, x, y) = (w \\cdot x - y)^2, \\tag{19}$$\n\n
we show in Appendix A.4 that Corollary 5 is valid with $s^2 \\ge 2\\left[ \\sigma_x^2 (\\sigma_\\pi^2 d + \\|w^*\\|^2) + \\sigma_\\epsilon^2 (1 - c) \\right]$ and $c \\ge 2 \\sigma_x^2 \\sigma_\\pi^2$. As expected, the bound degrades when the noise increases.\n\n
Regression versus classification. The classical PAC-Bayesian theorems are stated in a classification context and bound the generalization error/loss of the stochastic Gibbs predictor $G_{\\hat\\rho}$. In order to predict the label of an example $x \\in \\mathcal{X}$, the Gibbs predictor first draws a hypothesis $h \\in \\mathcal{F}$ according to $\\hat\\rho$, and then returns $h(x)$. Maurer [23] shows that we can generalize PAC-Bayesian bounds on the generalization risk of the Gibbs classifier to any loss function with output between zero and one. Provided that $y \\in \\{-1, 1\\}$ and $h(x) \\in [-1, 1]$, a common choice is to use the linear loss function $\\ell'_{01}(h, x, y) = \\frac{1}{2} - \\frac{1}{2}\\, y\\, h(x)$. The Gibbs generalization loss is then given by $R_{\\mathcal{D}}(G_{\\hat\\rho}) = \\mathbf{E}_{(x,y)\\sim\\mathcal{D}}\\, \\mathbf{E}_{h\\sim\\hat\\rho}\\, \\ell'_{01}(h, x, y)$. Many PAC-Bayesian works use $R_{\\mathcal{D}}(G_{\\hat\\rho})$ as a surrogate loss to study the zero-one classification loss of the majority vote classifier $R_{\\mathcal{D}}(B_{\\hat\\rho})$:\n\n
$$R_{\\mathcal{D}}(B_{\\hat\\rho}) = \\Pr_{(x,y)\\sim\\mathcal{D}} \\left( y\\, \\mathbf{E}_{h\\sim\\hat\\rho} h(x) < 0 \\right) = \\mathbf{E}_{(x,y)\\sim\\mathcal{D}}\\, I\\left[ y\\, \\mathbf{E}_{h\\sim\\hat\\rho} h(x) < 0 \\right], \\tag{20}$$\n\n
where $I[\\cdot]$ is the indicator function.\n\n
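The two classification risks just defined can be computed side by side on a small example. This is a minimal sketch under stated assumptions: the three decision stumps, the posterior weights, and the labeled points are all hypothetical, and both risks are evaluated on a finite sample rather than under the data distribution.

```python
def linear_loss(h_x, y):
    # Linear loss 1/2 - 1/2 * y * h(x), for y in {-1, 1} and h(x) in [-1, 1].
    return 0.5 - 0.5 * y * h_x

def gibbs_risk(voters, rho, sample):
    # Sample average of the rho-weighted linear loss of the individual voters.
    total = 0.0
    for x, y in sample:
        total += sum(r * linear_loss(h(x), y) for h, r in zip(voters, rho))
    return total / len(sample)

def majority_vote_risk(voters, rho, sample):
    # Fraction of examples on which the rho-weighted vote has the wrong sign.
    errors = 0
    for x, y in sample:
        vote = sum(r * h(x) for h, r in zip(voters, rho))
        errors += 1 if y * vote < 0 else 0
    return errors / len(sample)

# Hypothetical voters: three decision stumps on a scalar input.
voters = [
    lambda x: 1.0 if x > 0.0 else -1.0,
    lambda x: 1.0 if x > 1.0 else -1.0,
    lambda x: 1.0 if x > -1.0 else -1.0,
]
rho = [0.5, 0.25, 0.25]   # posterior weights over the voters
sample = [(-2.0, -1), (-0.5, -1), (0.5, 1), (1.5, 1), (0.2, -1), (-0.3, 1)]

r_gibbs = gibbs_risk(voters, rho, sample)
r_vote = majority_vote_risk(voters, rho, sample)
print(r_gibbs, r_vote)
```

Because the linear loss is affine in the vote, the Gibbs risk only depends on the weighted vote of each example, whereas the majority vote risk only depends on its sign. That difference is why the two quantities can be far apart.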
Given a distribution $\\hat\\rho$, an upper bound on the Gibbs risk is converted to an upper bound on the majority vote risk by $R_{\\mathcal{D}}(B_{\\hat\\rho}) \\le 2 R_{\\mathcal{D}}(G_{\\hat\\rho})$ [20]. In some situations, this factor of two may be reached, i.e., $R_{\\mathcal{D}}(B_{\\hat\\rho}) \\simeq 2 R_{\\mathcal{D}}(G_{\\hat\\rho})$. In other situations, we may have $R_{\\mathcal{D}}(B_{\\hat\\rho}) = 0$ even if $R_{\\mathcal{D}}(G_{\\hat\\rho}) = \\frac{1}{2} - \\epsilon$ (see Germain et al. [11] for an extensive study). Indeed, the bounds obtained via the Gibbs risk are liable to be loose and/or unrepresentative of the majority vote generalization error (see footnote 3).\n\n
In the current work, we study regression losses instead of classification ones. That is, the provided results express upper bounds on $\\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell}_{\\mathcal{D}}(f)$ for any (bounded, sub-Gaussian, or sub-gamma) losses. Of course, one may want to bound the regression loss of the averaged regressor $F_{\\hat\\rho}(x) = \\mathbf{E}_{f\\sim\\hat\\rho} f(x)$. In this case, if the loss function $\\ell$ is convex (as is the squared loss), Jensen\u2019s inequality gives $L^{\\ell}_{\\mathcal{D}}(F_{\\hat\\rho}) \\le \\mathbf{E}_{f\\sim\\hat\\rho} L^{\\ell}_{\\mathcal{D}}(f)$. Note that a strict inequality replaces the factor two mentioned above for the classification case, due to the non-convex indicator function of Equation (20).\n\n
Now that we have generalization bounds for real-valued loss functions, we can continue our study linking PAC-Bayesian results to Bayesian inference. In the next section, we focus on model selection.\n\n
Footnote 3: It is noteworthy that the best PAC-Bayesian empirical bound values are so far obtained by considering a majority vote of linear classifiers, where the prior and posterior are Gaussian [2, 10, 20], similarly to the Bayesian linear regression analyzed in Section 6.\n\n
5 Analysis of Model Selection\n\n
We consider $L$ distinct models $\\{\\mathcal{M}_i\\}_{i=1}^L$, each one defined by a set of parameters $\\Theta_i$. 
The PAC-Bayesian theorems naturally suggest selecting the model that is best adapted for the given task by evaluating the bound for each model $\\{\\mathcal{M}_i\\}_{i=1}^L$ and selecting the one with the lowest bound [2, 25, 36]. This is closely linked with the Bayesian model selection procedure, as we showed in Section 3 that minimizing the PAC-Bayes bound amounts to maximizing the marginal likelihood. Indeed, given a collection of $L$ optimal Gibbs posteriors (one for each model) given by Equation (8),\n\n
$$p(\\theta|X, Y, \\mathcal{M}_i) \\equiv \\hat\\rho^*_i(\\theta) = \\frac{1}{Z_{X,Y,i}}\\, \\pi_i(\\theta)\\, e^{-n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)}, \\quad \\text{for } \\theta \\in \\Theta_i, \\tag{21}$$\n\n
the Bayesian Occam\u2019s razor criterion [18, 22] chooses the one with the highest model evidence\n\n
$$p(Y|X, \\mathcal{M}_i) \\equiv Z_{X,Y,i} = \\int_{\\Theta_i} \\pi_i(\\theta)\\, e^{-n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)}\\, d\\theta. \\tag{22}$$\n\n
Corollary 6 below formally links the PAC-Bayesian and the Bayesian model selection. To obtain this result, we simply use the bound of Corollary 5 $L$ times, together with $\\ell_{\\mathrm{nll}}$ and Equation (10). From the union bound (a.k.a. Bonferroni inequality), it is mandatory to compute each bound with a confidence parameter of $\\delta/L$, to ensure that the final conclusion is valid with probability at least $1-\\delta$.\n\n
Corollary 6. Given a data distribution $\\mathcal{D}$, a family of model parameters $\\{\\Theta_i\\}_{i=1}^L$ and associated priors $\\{\\pi_i\\}_{i=1}^L$ (where $\\pi_i$ is defined over $\\Theta_i$), a $\\delta \\in (0, 1]$, if the loss is sub-gamma with parameters $s^2$ and $c < 1$, then, with probability at least $1-\\delta$ over $(X, Y) \\sim \\mathcal{D}^n$,\n\n
$$\\forall i \\in \\{1, \\ldots, L\\}: \\quad \\mathbf{E}_{\\theta\\sim\\hat\\rho^*_i} L^{\\ell_{\\mathrm{nll}}}_{\\mathcal{D}}(\\theta) \\le \\frac{1}{2(1-c)} s^2 - \\frac{1}{n} \\ln\\left( Z_{X,Y,i}\\, \\frac{\\delta}{L} \\right),$$\n\n
where $\\hat\\rho^*_i$ is the Gibbs optimal posterior (Eq. 21) and $Z_{X,Y,i}$ is the marginal likelihood (Eq. 22).\n\n
Hence, under the uniform prior over the $L$ models, choosing the one with the best model evidence is equivalent to choosing the one with the lowest PAC-Bayesian bound.\n\n
Hierarchical Bayes. To perform proper inference on hyperparameters, we have to rely on the Hierarchical Bayes approach. This is done by considering a hyperprior $p(\\eta)$ over the set of hyperparameters $\\mathcal{H}$. Then, the prior $p(\\theta|\\eta)$ can be conditioned on a choice of hyperparameter $\\eta$. The Bayes rule of Equation (5) becomes $p(\\theta, \\eta|X, Y) = \\frac{p(\\eta)\\, p(\\theta|\\eta)\\, p(Y|X, \\theta)}{p(Y|X)}$.\n\n
Under the negative log-likelihood loss function, we can rewrite the results of Corollary 5 as a generalization bound on $\\mathbf{E}_{\\eta\\sim\\hat\\rho_0} \\mathbf{E}_{\\theta\\sim\\hat\\rho^*_\\eta} L^{\\ell_{\\mathrm{nll}}}_{\\mathcal{D}}(\\theta)$, where $\\hat\\rho_0(\\eta) \\propto \\pi_0(\\eta)\\, Z_{X,Y,\\eta}$ is the hyperposterior on $\\mathcal{H}$ and $\\pi_0$ the hyperprior. Indeed, Equation (18) becomes\n\n
$$\\mathbf{E}_{\\theta\\sim\\hat\\rho^*} L^{\\ell_{\\mathrm{nll}}}_{\\mathcal{D}}(\\theta) = \\mathbf{E}_{\\eta\\sim\\hat\\rho^*_0}\\, \\mathbf{E}_{\\theta\\sim\\hat\\rho^*_\\eta} L^{\\ell_{\\mathrm{nll}}}_{\\mathcal{D}}(\\theta) \\le \\frac{1}{2(1-c)} s^2 - \\frac{1}{n} \\ln\\left( \\mathbf{E}_{\\eta\\sim\\pi_0} Z_{X,Y,\\eta}\\, \\delta \\right). \\tag{23}$$\n\n
To relate to the bound obtained in Corollary 6, we consider the case of a discrete hyperparameter set $\\mathcal{H} = \\{\\eta_i\\}_{i=1}^L$, with a uniform prior $\\pi_0(\\eta_i) = \\frac{1}{L}$ (from now on, we regard each hyperparameter $\\eta_i$ as the specification of a model $\\Theta_i$). Then, Equation (23) becomes\n\n
$$\\mathbf{E}_{\\theta\\sim\\hat\\rho^*} L^{\\ell_{\\mathrm{nll}}}_{\\mathcal{D}}(\\theta) = \\mathbf{E}_{\\eta\\sim\\hat\\rho^*_0}\\, \\mathbf{E}_{\\theta\\sim\\hat\\rho^*_\\eta} L^{\\ell_{\\mathrm{nll}}}_{\\mathcal{D}}(\\theta) \\le \\frac{1}{2(1-c)} s^2 - \\frac{1}{n} \\ln\\left( \\sum_{i=1}^L Z_{X,Y,\\eta_i}\\, \\frac{\\delta}{L} \\right).$$\n\n
This bound is now a function of $\\sum_{i=1}^L Z_{X,Y,\\eta_i}$ instead of $\\max_i Z_{X,Y,\\eta_i}$ as in the bound given by the \u201cbest\u201d model in Corollary 6. This yields a tighter bound, corroborating the Bayesian wisdom that model averaging performs best. Conversely, when selecting a single hyperparameter $\\eta^* \\in \\mathcal{H}$, the hierarchical representation is equivalent to choosing a deterministic hyperposterior, satisfying $\\hat\\rho_0(\\eta^*) = 1$ and $0$ for every other value. We then have\n\n
$$\\mathrm{KL}(\\hat\\rho\\|\\pi) = \\mathrm{KL}(\\hat\\rho_0\\|\\pi_0) + \\mathbf{E}_{\\eta\\sim\\hat\\rho_0} \\mathrm{KL}(\\hat\\rho_\\eta\\|\\pi_\\eta) = \\ln(L) + \\mathrm{KL}(\\hat\\rho_{\\eta^*}\\|\\pi_{\\eta^*}).$$\n\n
With the optimal posterior for the selected $\\eta^*$, we have\n\n
$$\\begin{aligned} n\\, \\mathbf{E}_{\\theta\\sim\\hat\\rho} \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta) + \\mathrm{KL}(\\hat\\rho\\|\\pi) &= n\\, \\mathbf{E}_{\\theta\\sim\\hat\\rho^*_{\\eta^*}} \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta) + \\mathrm{KL}(\\hat\\rho^*_{\\eta^*}\\|\\pi_{\\eta^*}) + \\ln(L) \\\\ &= -\\ln(Z_{X,Y,\\eta^*}) + \\ln(L) = -\\ln\\left( \\frac{Z_{X,Y,\\eta^*}}{L} \\right). \\end{aligned}$$\n\n
Inserting this result into Equation (17), we fall back on the bound obtained in Corollary 6. Hence, by comparing the values of the bounds, one can get an estimate of the consequence of performing model selection instead of model averaging.\n\n
6 Linear Regression\n\n
In this section, we perform Bayesian linear regression using the parameterization of Bishop [5]. The output space is $\\mathcal{Y} := \\mathbb{R}$ and, for an arbitrary input space $\\mathcal{X}$, we use a mapping function $\\phi : \\mathcal{X} \\to \\mathbb{R}^d$.\n\n
The model. Given $(x, y) \\in \\mathcal{X}\\times\\mathcal{Y}$ and model parameters $\\theta := \\langle w, \\sigma \\rangle \\in \\mathbb{R}^d \\times \\mathbb{R}^+$, we consider the likelihood $p(y|x, \\langle w, \\sigma \\rangle) = \\mathcal{N}(y \\,|\\, w \\cdot \\phi(x), \\sigma^2)$. Thus, the negative log-likelihood loss is\n\n
$$\\ell_{\\mathrm{nll}}(\\langle w, \\sigma \\rangle, x, y) = -\\ln p(y|x, \\langle w, \\sigma \\rangle) = \\frac{1}{2} \\ln(2\\pi\\sigma^2) + \\frac{1}{2\\sigma^2} (y - w \\cdot \\phi(x))^2. \\tag{24}$$\n\n
For a fixed $\\sigma^2$, minimizing Equation (24) is equivalent to minimizing the squared loss function of Equation (19). We also consider an isotropic Gaussian prior of mean $0$ and variance $\\sigma_\\pi^2$: $p(w|\\sigma_\\pi) = \\mathcal{N}(w \\,|\\, 0, \\sigma_\\pi^2 I)$. For the sake of simplicity, we consider fixed parameters $\\sigma^2$ and $\\sigma_\\pi^2$. The Gibbs optimal posterior (see Equation 8) is then given by\n\n
$$\\hat\\rho^*(w) \\equiv p(w|X, Y, \\sigma, \\sigma_\\pi) = \\frac{p(w|\\sigma_\\pi)\\, p(Y|X, w, \\sigma)}{p(Y|X, \\sigma, \\sigma_\\pi)} = \\mathcal{N}(w \\,|\\, \\widehat{w}, A^{-1}), \\tag{25}$$\n\n
where $A := \\frac{1}{\\sigma^2} \\Phi^T \\Phi + \\frac{1}{\\sigma_\\pi^2} I$; $\\widehat{w} := \\frac{1}{\\sigma^2} A^{-1} \\Phi^T y$; $\\Phi$ is an $n \\times d$ matrix whose $i$-th row is $\\phi(x_i)$; $y := [y_1, \\ldots, y_n]$ is the vector of labels; and the negative log marginal likelihood is\n\n
$$\\begin{aligned} -\\ln p(Y|X, \\sigma, \\sigma_\\pi) &= \\frac{1}{2\\sigma^2} \\|y - \\Phi\\widehat{w}\\|^2 + \\frac{n}{2} \\ln(2\\pi\\sigma^2) + \\frac{1}{2\\sigma_\\pi^2} \\|\\widehat{w}\\|^2 + \\frac{1}{2} \\log|A| + d \\ln \\sigma_\\pi \\\\ &= \\underbrace{n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\widehat{w}) + \\frac{1}{2\\sigma^2} \\mathrm{tr}(\\Phi^T \\Phi A^{-1})}_{n\\, \\mathbf{E}_{w\\sim\\hat\\rho^*} \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(w)} + \\underbrace{\\frac{1}{2\\sigma_\\pi^2} \\mathrm{tr}(A^{-1}) - \\frac{d}{2} + \\frac{1}{2\\sigma_\\pi^2} \\|\\widehat{w}\\|^2 + \\frac{1}{2} \\log|A| + d \\ln \\sigma_\\pi}_{\\mathrm{KL}\\left( \\mathcal{N}(\\widehat{w}, A^{-1}) \\,\\|\\, \\mathcal{N}(0, \\sigma_\\pi^2 I) \\right)}. \\end{aligned}$$\n\n
To obtain the second equality, we substitute $\\frac{1}{2\\sigma^2} \\|y - \\Phi\\widehat{w}\\|^2 + \\frac{n}{2} \\ln(2\\pi\\sigma^2) = n \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\widehat{w})$ and insert\n\n
$$\\frac{1}{2\\sigma^2} \\mathrm{tr}(\\Phi^T \\Phi A^{-1}) + \\frac{1}{2\\sigma_\\pi^2} \\mathrm{tr}(A^{-1}) = \\frac{1}{2} \\mathrm{tr}\\left( \\frac{1}{\\sigma^2} \\Phi^T \\Phi A^{-1} + \\frac{1}{\\sigma_\\pi^2} A^{-1} \\right) = \\frac{1}{2} \\mathrm{tr}(A^{-1} A) = \\frac{d}{2}.$$\n\n
This exhibits how the Bayesian regression optimization problem is related to the minimization of a PAC-Bayesian bound, expressed by a trade-off between $\\mathbf{E}_{w\\sim\\hat\\rho^*} \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(w)$ and $\\mathrm{KL}\\left( \\mathcal{N}(\\widehat{w}, A^{-1}) \\,\\|\\, \\mathcal{N}(0, \\sigma_\\pi^2 I) \\right)$. See Appendix A.5 for detailed calculations.\n\n
Model selection experiment. To produce Figures 1a and 1b, we reimplemented the toy experiment of Bishop [5, Section 3.5.1]. That is, we generated a learning sample of 15 data points according to $y = \\sin(x) + \\epsilon$, where $x$ is uniformly sampled in the interval $[0, 2\\pi]$ and $\\epsilon \\sim \\mathcal{N}(0, \\frac{1}{4})$ is a Gaussian noise. We then learn seven different polynomial models applying Equation (25). More precisely, for a polynomial model of degree $d$, we map input $x \\in \\mathbb{R}$ to a vector $\\phi(x) = [1, x^1, x^2, \\ldots, x^d] \\in \\mathbb{R}^{d+1}$, and we fix parameters $\\sigma_\\pi^2 = \\frac{1}{0.005}$ and $\\sigma^2 = \\frac{1}{2}$.\n\n
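The decomposition of the negative log marginal likelihood into an expected empirical loss and a KL term can be checked numerically. The following is a minimal sketch, not the authors' implementation: to stay dependency-free it assumes a single scalar feature ($d = 1$, identity feature map) and made-up data, so the matrix $A$ reduces to a scalar `a`. The evidence is computed through an independent route, using the fact that the labels are jointly Gaussian once the weight is integrated out.

```python
import math

def posterior_1d(xs, ys, sigma2, sigma2_pi):
    # Scalar version of Equation (25): the posterior on w is N(w_hat, 1 / a).
    a = sum(x * x for x in xs) / sigma2 + 1.0 / sigma2_pi
    w_hat = sum(x * y for x, y in zip(xs, ys)) / (sigma2 * a)
    return w_hat, a

def expected_empirical_nll(xs, ys, sigma2, sigma2_pi):
    # n times the posterior expectation of the empirical loss of Equation (24).
    w_hat, a = posterior_1d(xs, ys, sigma2, sigma2_pi)
    n = len(xs)
    residual = sum((y - w_hat * x) ** 2 for x, y in zip(xs, ys))
    spread = sum(x * x for x in xs) / a   # tr(Phi^T Phi A^-1) when d = 1
    return 0.5 * n * math.log(2.0 * math.pi * sigma2) + (residual + spread) / (2.0 * sigma2)

def kl_posterior_prior(xs, ys, sigma2, sigma2_pi):
    # KL( N(w_hat, 1/a) || N(0, sigma2_pi) ) for scalar Gaussians.
    w_hat, a = posterior_1d(xs, ys, sigma2, sigma2_pi)
    return (1.0 / a + w_hat ** 2) / (2.0 * sigma2_pi) - 0.5 + 0.5 * math.log(a * sigma2_pi)

def neg_log_evidence(xs, ys, sigma2, sigma2_pi):
    # Independent route: Y | X ~ N(0, sigma2 * I + sigma2_pi * x x^T), evaluated
    # with the matrix determinant lemma and the Sherman-Morrison formula.
    n = len(xs)
    s = sum(x * x for x in xs)
    t = sum(x * y for x, y in zip(xs, ys))
    quad = (sum(y * y for y in ys) - sigma2_pi * t * t / (sigma2 + sigma2_pi * s)) / sigma2
    log_det = n * math.log(sigma2) + math.log(1.0 + sigma2_pi * s / sigma2)
    return 0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

xs = [0.5, 1.0, 1.5, 2.0, 2.5]    # made-up inputs
ys = [0.4, 1.1, 1.3, 2.2, 2.4]    # made-up targets
sigma2, sigma2_pi = 0.5, 200.0    # sigma^2 = 1/2 and sigma_pi^2 = 1/0.005
lhs = neg_log_evidence(xs, ys, sigma2, sigma2_pi)
rhs = expected_empirical_nll(xs, ys, sigma2, sigma2_pi) + kl_posterior_prior(xs, ys, sigma2, sigma2_pi)
print(lhs, rhs)
```

The two printed values agree up to rounding. Extending the sketch to polynomial features would require a matrix inverse and a log-determinant, for which a linear algebra library is the natural choice.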
Figure 1a illustrates the seven learned models. Figure 1b shows the negative log marginal likelihood computed for each polynomial model, and is designed to reproduce Bishop [5, Figure 3.14], where it is explained that the marginal likelihood correctly indicates that the polynomial model of degree $d = 3$ is \u201cthe simplest model which gives a good explanation for the observed data\u201d. We show that this claim is well quantified by the trade-off intrinsic to our PAC-Bayesian approach: the complexity KL term keeps increasing with the parameter $d \\in \\{1, 2, \\ldots, 7\\}$, while the empirical risk drastically decreases from $d = 2$ to $d = 3$, and only slightly afterward. Moreover, we show that the generalization risk (computed on a test sample of size 1000) tends to increase with complex models (for $d \\ge 4$).\n\n
Figure 1: Model selection experiment (a-b); and comparison of bounds values (c). (a) Predicted models. Black dots are the 15 training samples. (b) Decomposition of the marginal likelihood into the empirical loss and KL-divergence. (c) Bound values on a synthetic dataset according to the number of training samples.\n\n
Empirical comparison of bound values. Figure 1c compares the values of the PAC-Bayesian bounds presented in this paper on a synthetic dataset, where each input $x \\in \\mathbb{R}^{20}$ is generated by a Gaussian $x \\sim \\mathcal{N}(0, I)$. The associated output $y \\in \\mathbb{R}$ is given by $y = w^* \\cdot x + \\epsilon$, with $\\|w^*\\| = \\frac{1}{2}$, $\\epsilon \\sim \\mathcal{N}(0, \\sigma_\\epsilon^2)$, and $\\sigma_\\epsilon^2 = \\frac{1}{9}$. We perform Bayesian linear regression in the input space, i.e., $\\phi(x) = x$, fixing $\\sigma_\\pi^2 = \\frac{1}{100}$ and $\\sigma^2 = 2$. That is, we compute the posterior of Equation (25) for training samples of sizes from $10$ to $10^6$. For each learned model, we compute the empirical negative log-likelihood loss of Equation (24), and the three PAC-Bayes bounds, with a confidence parameter of $\\delta = \\frac{1}{20}$. Note that this loss function is an affine transformation of the squared loss studied in Section 4 (Equation 19), i.e., $\\ell_{\\mathrm{nll}}(\\langle w, \\sigma \\rangle, x, y) = \\frac{1}{2} \\ln(2\\pi\\sigma^2) + \\frac{1}{2\\sigma^2} \\ell_{\\mathrm{sqr}}(w, x, y)$. It turns out that $\\ell_{\\mathrm{nll}}$ is sub-gamma with parameters $s^2 \\ge \\frac{1}{\\sigma^2} \\left[ \\sigma_x^2 (\\sigma_\\pi^2 d + \\|w^*\\|^2) + \\sigma_\\epsilon^2 (1 - c) \\right]$ and $c \\ge \\frac{1}{\\sigma^2} (\\sigma_x^2 \\sigma_\\pi^2)$, as shown in Appendix A.6. The bounds of Corollary 5 are computed using the above mentioned values of $\\|w^*\\|$, $d$, $\\sigma$, $\\sigma_x$, $\\sigma_\\epsilon$, $\\sigma_\\pi$, leading to $s^2 \\simeq 0.280$ and $c \\simeq 0.005$. As the two other bounds of Figure 1c are not suited for unbounded loss, we compute their value using a cropped loss $[a, b] = [1, 4]$. Different parameter values could have been chosen, sometimes leading to another picture: a large value of $s$ degrades our sub-gamma bound, as a larger $[a, b]$ interval does for the other bounds.\n\n
In the studied setting, the bound of Corollary 5, which we have developed for (unbounded) sub-gamma losses, gives tighter guarantees than the two results for $[a, b]$-bounded losses (up to $n = 10^6$). However, our new bound always maintains a gap of $\\frac{1}{2(1-c)} s^2$ between its value and the generalization loss. The result of Corollary 2 (adapted from Catoni [8]) for bounded losses suffers from a similar gap, while having higher values than our sub-gamma result. Finally, the result of Theorem 3 (Alquier et al. [1]), combined with $\\lambda = \\sqrt{n}$ (Eq. 14), converges to the expected loss, but it provides good guarantees only for large training samples ($n \\gtrsim 10^5$). Note that the latter bound is not directly minimized by our \u201coptimal posterior\u201d, as opposed to the one with $\\lambda = n$ (Eq. 
13), for which we\nobserve values between 5.8 (for $n = 10^6$) and 6.4 (for $n = 10$), not displayed in Figure 1c.\n\n7 Conclusion\n\nThe first contribution of this paper is to bridge the concepts underlying the Bayesian and the\nPAC-Bayesian approaches; under proper parameterization, the minimization of the PAC-Bayesian bound\nmaximizes the marginal likelihood. This study motivates the second contribution of this paper, which\nis to prove PAC-Bayesian generalization bounds for regression with unbounded sub-gamma loss\nfunctions, including the squared loss used in regression tasks.\nIn this work, we studied model selection techniques. On a broader perspective, we would like to\nsuggest that both Bayesian and PAC-Bayesian frameworks may have more to learn from each other\nthan what has been done lately (even if other works paved the way [e.g., 6, 14, 30]). Predictors\nlearned from the Bayes rule can benefit from strong PAC-Bayesian frequentist guarantees (under the\ni.i.d. assumption). Also, the rich Bayesian toolbox may be incorporated in PAC-Bayesian driven\nalgorithms and risk bounding techniques.\n\nAcknowledgments\nWe thank Gabriel Dub\u00e9 and Maxime Tremblay for having proofread the paper and supplemental.\n\n[Figure 1 plots. (a) Models d = 1 to d = 7 and sin(x), as functions of x. (b) For each model degree d: $-\\ln Z_{X,Y}$, $\\mathrm{KL}(\\hat\\rho^* \\| \\pi)$, $n \\, \\mathbf{E}_{\\theta \\sim \\hat\\rho^*} \\widehat{L}^{\\ell_{\\mathrm{nll}}}_{X,Y}(\\theta)$, and $n \\, \\mathbf{E}_{\\theta \\sim \\hat\\rho^*} L^{\\ell_{\\mathrm{nll}}}_{D}(\\theta)$. (c) As functions of n: Alquier et al\u2019s [a, b] bound (Theorem 3 + Eq 14), Catoni\u2019s [a, b] bound (Corollary 2), sub-gamma bound (Corollary 5), test loss, and train loss.]\n\nReferences\n[1] Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. 
JMLR, 17(239):1\u201341, 2016.\n[2] Amiran Ambroladze, Emilio Parrado-Hern\u00e1ndez, and John Shawe-Taylor. Tighter PAC-Bayes bounds. In NIPS, 2006.\n[3] Arindam Banerjee. On Bayesian bounds. In ICML, pages 81\u201388, 2006.\n[4] Luc B\u00e9gin, Pascal Germain, Fran\u00e7ois Laviolette, and Jean-Francis Roy. PAC-Bayesian theory for transductive learning. In AISTATS, pages 105\u2013113, 2014.\n[5] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.\n[6] P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.\n[7] St\u00e9phane Boucheron, G\u00e1bor Lugosi, and Pascal Massart. Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5.\n[8] Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56. Inst. of Mathematical Statistics, 2007.\n[9] Arnak S. Dalalyan and Alexandre B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1-2):39\u201361, 2008.\n[10] Pascal Germain, Alexandre Lacasse, Fran\u00e7ois Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In ICML, pages 353\u2013360, 2009.\n[11] Pascal Germain, Alexandre Lacasse, Fran\u00e7ois Laviolette, Mario Marchand, and Jean-Francis Roy. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. JMLR, 16, 2015.\n[12] Pascal Germain, Amaury Habrard, Fran\u00e7ois Laviolette, and Emilie Morvant. A new PAC-Bayesian perspective on domain adaptation. In ICML, pages 859\u2013868, 2016.\n[13] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. 
Nature, 521:452\u2013459, 2015.\n[14] Peter Gr\u00fcnwald. The safe Bayesian - learning the learning rate via the mixability gap. In ALT, 2012.\n[15] Peter D. Gr\u00fcnwald and Nishant A. Mehta. Fast rates with unbounded losses. CoRR, abs/1605.00252, 2016.\n[16] Isabelle Guyon, Amir Saffari, Gideon Dror, and Gavin C. Cawley. Model selection: Beyond the Bayesian/frequentist divide. JMLR, 11:61\u201387, 2010.\n[17] Tamir Hazan, Subhransu Maji, Joseph Keshet, and Tommi S. Jaakkola. Learning efficient random maximum a-posteriori predictors with non-decomposable loss functions. In NIPS, pages 1887\u20131895, 2013.\n[18] William H. Jeffreys and James O. Berger. Ockham\u2019s razor and Bayesian analysis. American Scientist, 1992.\n[19] Alexandre Lacoste. Agnostic Bayes. PhD thesis, Universit\u00e9 Laval, 2015.\n[20] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In NIPS, pages 423\u2013430, 2002.\n[21] Guy Lever, Fran\u00e7ois Laviolette, and John Shawe-Taylor. Tighter PAC-Bayes bounds through distribution-dependent priors. Theor. Comput. Sci., 473:4\u201328, 2013.\n[22] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415\u2013447, 1992.\n[23] Andreas Maurer. A note on the PAC-Bayesian theorem. CoRR, cs.LG/0411099, 2004.\n[24] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355\u2013363, 1999.\n[25] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5\u201321, 2003.\n[26] David McAllester and Joseph Keshet. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, pages 2205\u20132212, 2011.\n[27] Asaf Noy and Koby Crammer. Robust forward algorithms via PAC-Bayes and Laplace distributions. In AISTATS, 2014.\n[28] Anastasia Pentina and Christoph H. Lampert. A PAC-Bayesian bound for lifelong learning. In ICML, 2014.\n[29] Matthias Seeger. PAC-Bayesian generalization bounds for Gaussian processes. 
JMLR, 3:233\u2013269, 2002.\n[30] Matthias Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh, 2003.\n[31] Yevgeny Seldin and Naftali Tishby. PAC-Bayesian analysis of co-clustering and beyond. JMLR, 11, 2010.\n[32] Yevgeny Seldin, Peter Auer, Fran\u00e7ois Laviolette, John Shawe-Taylor, and Ronald Ortner. PAC-Bayesian analysis of contextual bandits. In NIPS, pages 1683\u20131691, 2011.\n[33] Yevgeny Seldin, Fran\u00e7ois Laviolette, Nicol\u00f2 Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. In UAI, 2012.\n[34] John Shawe-Taylor and Robert C. Williamson. A PAC analysis of a Bayesian estimator. In COLT, 1997.\n[35] Ilya O. Tolstikhin and Yevgeny Seldin. PAC-Bayes-empirical-Bernstein inequality. In NIPS, 2013.\n[36] Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Trans. Information Theory, 52(4):1307\u20131321, 2006.\n", "award": [], "sourceid": 1029, "authors": [{"given_name": "Pascal", "family_name": "Germain", "institution": "Laval University"}, {"given_name": "Francis", "family_name": "Bach", "institution": "INRIA - Ecole Normale Superieure"}, {"given_name": "Alexandre", "family_name": "Lacoste", "institution": "Universite de Montreal"}, {"given_name": "Simon", "family_name": "Lacoste-Julien", "institution": "INRIA"}]}