{"title": "Learning Bounds for a Generalized Family of Bayesian Posterior Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 1149, "page_last": 1156, "abstract": "", "full_text": "Learning Bounds for a Generalized Family of\n\nBayesian Posterior Distributions\n\nIBM T.J. Watson Research Center\n\nYorktown Heights, NY 10598\n\nTong Zhang\n\ntzhang@watson.ibm.com\n\nAbstract\n\nIn this paper we obtain convergence bounds for the concentration of\nBayesian posterior distributions (around the true distribution) using a\nnovel method that simpli\ufb01es and enhances previous results. Based on the\nanalysis, we also introduce a generalized family of Bayesian posteriors,\nand show that the convergence behavior of these generalized posteriors is\ncompletely determined by the local prior structure around the true distri-\nbution. This important and surprising robustness property does not hold\nfor the standard Bayesian posterior in that it may not concentrate when\nthere exist \u201cbad\u201d prior structures even at places far away from the true\ndistribution.\n\n1 Introduction\nConsider a sample space X and a measure \u03bb on X (with respect to some \u03c3-\ufb01eld).\nIn\nstatistical inference, the nature picks a probability measure Q on X which is unknown. We\nassume that Q has a density q with respect to \u03bb. In the Bayesian paradigm, the statistician\nconsiders a set of probability densities p(\u00b7|\u03b8) (with respect to \u03bb on X ) indexed by \u03b8 \u2208 \u0393, and\nmakes an assumption1 that the true density q can be represented as p(\u00b7|\u03b8) with \u03b8 randomly\npicked from \u0393 according to a prior distribution \u03c0 on \u0393. Throughout the paper, all quantities\nappearing in the derivations are assumed to be measurable.\nGiven a set of samples X = {X1, . . . , Xn} \u2208 X n, where each Xi independently drawn\nfrom (the unknown distribution) Q, the optimal Bayesian method can be derived as the\noptimal inference with respect to the posterior distribution. Although a Bayesian procedure\nis optimal only when the nature picks the same prior as the statistician (which is very\nunlikely), it is known that procedures with desirable properties from the frequentist point\nof view (such minimaxity and admissibility) are often Bayesian [6]. From a theoretical\npoint of view, it is necessary to understand the behavior of Bayesian methods without\nthe assumption that the nature picks the same prior as the statistician.\nIn this respect,\nthe most fundamental issue in Bayesian analysis is whether the Bayesian inference based\non the posterior distribution will converge to the corresponding inference of the true (but\n\n1In this paper, we view the Bayesian paradigm as a method to generate statistical inferencing\nprocedures, and thus don\u2019t assume that the Bayesian prior assumption has to be true. In particular,\nwe do not even assume that q \u2208 {p(\u00b7|\u03b8) : \u03b8 \u2208 \u0393}.\n\n\funknown) distribution when the number of observations approach in\ufb01nity.\n\nA more general question is whether the Bayesian posterior distribution will be concen-\ntrated around the true underlying distribution when the sample size is large. This is often\nreferred to as the consistency of Bayesian posterior distribution, which is certainly the most\nfundamental issue for understanding the behavior of Bayesian methods. This problem has\ndrawn considerable attention in statistics. 
The classical results include average consistency results such as Doob's consistency theorem, and asymptotic convergence results such as the Bernstein-von Mises theorem for parametric problems. For infinite-dimensional problems, one has to choose the prior very carefully, or the Bayesian posterior may fail to concentrate around the true underlying distribution, leading to inconsistency [1, 2]. In [1], the authors also gave conditions that guarantee the consistency of Bayesian posterior distributions, although convergence rates were not obtained. Convergence rates were studied in two recent works [3, 8] using heavy machinery from empirical process theory.

The purpose of this paper is to develop finite-sample convergence bounds for Bayesian posterior distributions using a novel approach that not only simplifies the analysis given in [3, 8], but also leads to tighter bounds. At the heart of our approach are some new posterior averaging bounds that are related to the PAC-Bayes analysis that has appeared in some recent machine learning works. These new bounds are of independent interest (though we cannot fully explore their consequences here) since they can be used to obtain correct convergence rates for other statistical estimation problems such as least squares regression. Motivated by our learning bounds, we introduce a generalized family of Bayesian methods, and show that their convergence behavior relies only on the prior mass in a small neighborhood around the true distribution. This is rather surprising when we consider the example given in [1], which shows that for the (standard) Bayesian method, even if one puts a positive prior mass around the true distribution, one may still get an inconsistent posterior when there exist undesirable prior structures far away from the true distribution.

2 The regularization formulation of the Bayesian posterior measure

Assume we observe n samples X = {X_1, ..., X_n} ∈ X^n, independently drawn from the true underlying distribution Q. We shall call any probability density ŵ_X(θ) with respect to π that depends on the observation X (and is measurable on X^n × Γ) a posterior distribution. For all α ∈ (0, 1], we define a generalized Bayesian posterior π_α(·|X) with respect to π as:

    π_α(θ|X) = ∏_{i=1}^n p^α(X_i|θ) / ∫_Γ ∏_{i=1}^n p^α(X_i|θ) dπ(θ).    (1)

We call π_α the α-Bayesian posterior. The standard Bayesian posterior is denoted as π(·|X) = π_1(·|X). Given a probability density w(·) on Γ with respect to π, we define the KL-divergence KL(w dπ||dπ) as:

    KL(w dπ||dπ) = ∫_Γ w(θ) ln w(θ) dπ(θ).

Given a real-valued function f(θ) on Γ, we denote by E_π f(θ) the expectation of f(·) with respect to π. Similarly, given a real-valued function ℓ(x) on X, we denote by E_q ℓ(x) the expectation of ℓ(·) with respect to the true underlying distribution q. We also use E_X to denote the expectation with respect to the observation X.
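As a concrete illustration of the α-posterior in (1), the following minimal sketch computes π_α(·|X) over a finite grid of parameter values. The Bernoulli likelihood, the uniform grid prior, and the synthetic data are illustrative assumptions made only for this example; they are not part of the paper's setup.

import numpy as np

def alpha_posterior(data, thetas, prior, alpha):
    # log-likelihood of the sample at each grid point (illustrative Bernoulli model)
    log_lik = np.array([np.sum(data * np.log(t) + (1.0 - data) * np.log(1.0 - t))
                        for t in thetas])
    # unnormalized alpha-posterior as in (1): prior weight times likelihood to the power alpha
    log_post = np.log(prior) + alpha * log_lik
    log_post -= log_post.max()               # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()                 # normalize, matching the denominator of (1)

# Illustrative usage with synthetic data and a uniform grid prior.
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=50)          # 50 coin flips, true parameter 0.3
thetas = np.linspace(0.01, 0.99, 99)          # finite grid standing in for Gamma
prior = np.full(len(thetas), 1.0 / len(thetas))
standard = alpha_posterior(data, thetas, prior, alpha=1.0)   # standard Bayesian posterior
tempered = alpha_posterior(data, thetas, prior, alpha=0.5)   # an alpha-Bayesian posterior

Setting alpha below one simply flattens the likelihood term relative to the prior; alpha = 1 recovers the usual Bayes rule on the grid.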
The key starting point of our analysis is the following simple observation, which relates the Bayesian posterior to the solution of an entropy-regularized density estimation problem (with densities taken with respect to π). Under this formulation, techniques for analyzing regularized risk minimization problems, such as those recently investigated by the author, can be applied to obtain sample complexity bounds for Bayesian posterior distributions. The proof of the following regularization formulation is straightforward, and we skip it due to space limitations.

Proposition 2.1 For any density w on Γ with respect to π, let

    R̂^α_X(w) = α (1/n) Σ_{i=1}^n E_π w(θ) ln(q(X_i)/p(X_i|θ)) + (1/n) KL(w dπ||dπ).

Then R̂^α_X(π_α(·|X)) = inf_w R̂^α_X(w).

The above proposition indicates that the generalized Bayesian posterior minimizes the regularized empirical risk R̂^α_X(w) among all possible densities w with respect to the prior π. We thus only need to study the behavior of this regularized empirical risk minimization problem. One may define the true risk of w by replacing the empirical expectation with the expectation with respect to the true underlying distribution q:

    R^α_q(w) = α E_π w(θ) KL(q||p(·|θ)) + (1/n) KL(w dπ||dπ),    (2)

where KL(q||p) = E_q ln(q(x)/p(x)) is the KL-divergence between q and p, which is always non-negative. This quantity is widely used to measure the closeness of two distributions p and q. Clearly the Bayesian posterior is an approximate solution to (2) using the empirical expectation. The first term of R^α_q(w) measures the average KL-divergence between q and p under the w-density. Since both the first term and the second term are non-negative, we know immediately that if R^α_q(w) ≈ 0, then the distribution w is concentrated around q.

Using empirical process techniques, one would typically expect to bound R^α_q(w) in terms of R̂^α_X(w). Unfortunately, this does not work in our case since KL(q||p) is not well-defined for all p. This implies that as long as w has non-zero concentration around a density p with KL(q||p) = +∞, then R^α_q(w) = +∞. Therefore we may have R^α_q(π(·|X)) = +∞ with non-zero probability even when the sample size approaches infinity.

A remedy is to consider a distance function that is always well-defined. In statistics, one often considers the ρ-divergence for ρ ∈ (0, 1), which is defined as:

    D_ρ(q||p) = [1 / (ρ(1 − ρ))] E_q [1 − (p(x)/q(x))^ρ].    (3)

This divergence is always well-defined, and KL(q||p) = lim_{ρ→0} D_ρ(q||p). In the statistical literature, convergence results were often specified under the Hellinger distance (ρ = 0.5). We would also like to mention that the learning bound derived later becomes trivial when ρ → 0. This is consistent with the above discussion since R^α_q (corresponding to ρ = 0) may not converge at all. However, under additional assumptions, such as the boundedness of q/p, KL(q||p) exists and can be bounded using the ρ-divergence D_ρ(q||p).

3 Posterior averaging bounds under entropy regularization

The following inequality follows directly from a well-known convex duality.
See [5, 7], for example, for an explanation.

Proposition 3.1 Assume that f(θ) is a measurable real-valued function on Γ and that w(θ) is a density with respect to π. Then

    E_π w(θ) f(θ) ≤ KL(w dπ||dπ) + ln E_π exp(f(θ)).

The main technical result, which forms the basis of the paper, is given by the following lemma, where we assume that ŵ_X(θ) is a posterior (a density with respect to π that depends on X and is measurable on X^n × Γ).

Lemma 3.1 Consider any posterior ŵ_X(θ). The following inequality holds for all measurable real-valued functions L_X(θ) on X^n × Γ:

    E_X exp[ E_π ŵ_X(θ)(L_X(θ) − ln E_X e^{L_X(θ)}) − KL(ŵ_X dπ||dπ) ] ≤ 1,

where E_X is the expectation with respect to the observation X.

Proof. From Proposition 3.1, we obtain

    L̂(X) = E_π ŵ_X(θ)(L_X(θ) − ln E_X e^{L_X(θ)}) − KL(ŵ_X dπ||dπ)
          ≤ ln E_π exp(L_X(θ) − ln E_X e^{L_X(θ)}).

Now applying Fubini's theorem to interchange the order of integration, we have:

    E_X e^{L̂(X)} ≤ E_X E_π e^{L_X(θ) − ln E_X exp(L_X(θ))} = E_π E_X e^{L_X(θ) − ln E_X exp(L_X(θ))} = 1.  □

The following corollary is a straightforward consequence of Lemma 3.1. Note that for the Bayesian method, the loss ℓ_θ(x) has the form ℓ(p(x|θ)).

Theorem 3.1 (Posterior Averaging Bounds) Under the notation of Lemma 3.1, let X = {X_1, ..., X_n} be n samples independently drawn from q, and consider a measurable function ℓ_θ(x): Γ × X → R. Then for all t > 0 and any real number ρ, the following event holds with probability at least 1 − exp(−t):

    −E_π ŵ_X(θ) ln E_q exp(−ρ ℓ_θ(x)) ≤ [ρ Σ_{i=1}^n E_π ŵ_X(θ) ℓ_θ(X_i) + KL(ŵ_X dπ||dπ) + t] / n.

Moreover, we have the following expected risk bound:

    −E_X E_π ŵ_X(θ) ln E_q exp(−ρ ℓ_θ(x)) ≤ E_X [ρ Σ_{i=1}^n E_π ŵ_X(θ) ℓ_θ(X_i) + KL(ŵ_X dπ||dπ)] / n.

Proof Sketch. The first bound is a direct consequence of the Markov inequality. The second bound can be obtained by using the fact E_X exp(Δ_X) ≥ exp(E_X Δ_X), which follows from Jensen's inequality.  □

The above bounds are immediately applicable to Bayesian posterior distributions. The first leads to an exponential tail inequality, and the second leads to an expected risk bound.

Before analyzing Bayesian methods in detail in the next section, we shall briefly compare the above results to the so-called PAC-Bayes bounds, which can be obtained by estimating the left-hand side using Hoeffding's inequality with an appropriately chosen ρ. However, in the following we estimate the left-hand side using a Bernstein-style bound, which is much more useful for general statistical estimation problems:

Corollary 3.1 Under the notation of Theorem 3.1, assume that sup_{θ,x_1,x_2} |ℓ_θ(x_1) − ℓ_θ(x_2)| ≤ 1.
Then for all t, ρ > 0, with probability at least 1 − exp(−t):

    E_π ŵ_X(θ) E_q ℓ_θ(x) − ρφ(ρ) E_π ŵ_X(θ) Var_q ℓ_θ(x) ≤ (1/n) Σ_{i=1}^n E_π ŵ_X(θ) ℓ_θ(X_i) + [KL(ŵ_X dπ||dπ) + t] / (ρn),

where φ(x) = (exp(x) − x − 1)/x^2 and Var_q ℓ_θ(x) = E_q(ℓ_θ(x) − E_q ℓ_θ(x))^2.

Proof Sketch. We follow one of the standard derivations of the Bernstein inequality: it is well known that φ(x) is non-decreasing in x, which in turn implies that

    ln E_q exp(−ρ ℓ_θ(x)) ≤ −ρ E_q ℓ_θ(x) + ρ^2 φ(ρ) E_q(ℓ_θ(x) − E_q ℓ_θ(x))^2.

Applying this bound to the left-hand side of Theorem 3.1 finishes the proof.  □

One may use the simple bound Var_q ℓ_θ(x) ≤ 1/4 and obtain[2]

    E_π ŵ_X(θ) E_q ℓ_θ(x) ≤ E_π ŵ_X(θ) [(1/n) Σ_{i=1}^n ℓ_θ(X_i) + ρφ(ρ)/4] + [KL(ŵ_X dπ||dπ) + t] / (ρn).    (4)

[2] In this case, slightly tighter results can be obtained by applying Hoeffding's exponential inequality directly to the left-hand side of Theorem 3.1, instead of the method used in Corollary 3.1.

This inequality holds for any data-independent choice of ρ. However, one may easily turn it into a bound that allows ρ to depend on the data using well-known techniques (see [5], for example). After we optimize ρ, the resulting bound becomes similar to the PAC-Bayes bound [4]. Typically the optimal ρ is of order sqrt(KL(ŵ_X dπ||dπ)/n), and hence the rate of convergence given on the right-hand side is no better than O(sqrt(1/n)). The more interesting case, however, is when there exists a constant b ≥ 0 such that

    E_q(ℓ_θ(x) − E_q ℓ_θ(x))^2 ≤ b E_q ℓ_θ(x).    (5)

This condition appears in the theoretical analysis of many statistical estimation problems, such as least squares regression, and whenever the loss function is non-negative (such as classification). It also appears in some analyses of maximum-likelihood estimation (log-loss), though as we shall see, log-loss can be handled much more directly in our framework using Theorem 3.1. A modified version of this condition also occurs in some recent analyses of classification problems even when the problem is not separable. We shall now assume that (5) holds. It follows from Corollary 3.1 that for all ρ > 0 such that ρφ(ρ) ≤ 1/b, we have

    E_π ŵ_X(θ) E_q ℓ_θ(x) ≤ [ρ E_π ŵ_X(θ) Σ_{i=1}^n ℓ_θ(X_i) + KL(ŵ_X dπ||dπ) + t] / [ρ(1 − bρφ(ρ)) n].    (6)

Again the above inequality holds for any data-independent ρ, but we can easily turn it into a bound that allows ρ to depend on X using standard techniques. We shall not list the final result here since this is not the purpose of the paper. The parameter ρ can be optimized, and it is not hard to check that the resulting bound is significantly better than (4) when E_π ŵ_X(θ) (1/n) Σ_{i=1}^n ℓ_θ(X_i) ≈ 0. The "self-bounding" condition (5) holds in the theoretical analysis of many statistical estimation problems. To obtain the correct convergence behavior in such cases (including the Bayesian method in which we are interested here), inequality (4) is inadequate, and it is essential to use a Bernstein-type bound such as (6).
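To make the rate comparison concrete, here is a short worked sketch of optimizing ρ in (4). It uses the crude bound φ(ρ) ≤ φ(1) = e − 2 for ρ ∈ (0, 1] and assumes the optimizing ρ stays in this range, so it is only a heuristic calculation and not part of the formal development. Writing K = KL(ŵ_X dπ||dπ) + t, the ρ-dependent slack on the right-hand side of (4) is

    g(ρ) = ρφ(ρ)/4 + K/(ρn) ≤ (e − 2)ρ/4 + K/(ρn).

Minimizing the upper bound over ρ gives ρ* = sqrt(4K/((e − 2)n)) and

    g(ρ*) ≤ sqrt((e − 2)K/n),

so both the optimal ρ and the resulting slack are of order sqrt(K/n), which is the O(sqrt(1/n)) behavior discussed above. By contrast, under (5) with a constant ρ chosen so that bρφ(ρ) ≤ 1/2, the right-hand side of (6) is at most 2[ρ E_π ŵ_X(θ) Σ_{i=1}^n ℓ_θ(X_i) + K]/(ρn), which is O(1/n) whenever the empirical term is close to zero.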
It is also useful to point out that to analyze such problems, one actually only needs (6) with an appropriately chosen data-independent ρ, which will lead to the correct (minimax) rate of convergence. Note that if we choose ρ to be a constant, then it is possible to achieve a bound that converges as fast as O(1/n). We point out that in [7], a KL-divergence version of the PAC-Bayes bound was developed for the 0-1 loss using related techniques, which can lead to a rate as fast as O(ln n/n) if we make near-zero errors. However, the Bernstein-style bound given here is more generally applicable and is necessary for more complicated statistical estimation problems such as least squares regression.

4 Convergence bounds for Bayesian posterior distributions

We shall now analyze the finite-sample convergence behavior of Bayesian posterior distributions using Theorem 3.1. Although the exponential tail inequality provides more detailed information, our discussion will be based on the expected risk bound for simplicity.

To analyze the Bayesian method, we let ℓ_θ(x) = ln(q(x)/p(x|θ)) in Theorem 3.1. Consider ρ ∈ (0, 1), and let ŵ_X(θ) be the Bayesian posterior π_α(θ|X) with parameter α ∈ [ρ, 1] defined in (1). Consider an arbitrary data-independent density w(θ) with respect to π; using (3), we can obtain from Theorem 3.1 the following chain of inequalities:

    E_X E_π π_α(θ|X) ln [1 / (1 − ρ(1 − ρ) D_ρ(q||p(·|θ)))]
      = −E_X E_π π_α(θ|X) ln E_q exp(−ρ ln(q(x)/p(x|θ)))
      ≤ E_X [ρ E_π π_α(θ|X) (1/n) Σ_{i=1}^n ln(q(X_i)/p(X_i|θ)) + (1/n) KL(π_α(·|X) dπ||dπ)]
      ≤ E_X [α E_π w(θ) (1/n) Σ_{i=1}^n ln(q(X_i)/p(X_i|θ)) + (1/n) KL(w dπ||dπ)] + ((α − ρ)/n) E_X sup_θ Σ_{i=1}^n ln(p(X_i|θ)/q(X_i))
      = R^α_q(w) + ((α − ρ)/n) E_X sup_θ Σ_{i=1}^n ln(p(X_i|θ)/q(X_i)),

where R^α_q(w) is defined in (2). The first equality uses E_q exp(−ρ ln(q(x)/p(x|θ))) = 1 − ρ(1 − ρ) D_ρ(q||p(·|θ)), and the first inequality follows from Theorem 3.1. The second inequality follows from Proposition 2.1, after writing ρ ln(q/p) = α ln(q/p) + (α − ρ) ln(p/q) and bounding the last term by its supremum over θ. The empirical process bound in the second term can be improved using a more precise bounding method, but we skip it here due to lack of space. It is not difficult to see (also from Proposition 2.1 and Proposition 3.1) that (we skip the derivation due to space limitations):

    inf_w R^α_q(w) = −(1/n) ln E_π exp(−αn KL(q||p(·|θ))).

Using the fact −ln(1 − x) ≥ x to simplify the left-hand side, we thus obtain:

    E_X E_π π_α(θ|X) D_ρ(q||p(·|θ)) ≤ [−ln E_π e^{−αn KL(q||p(·|θ))} + (α − ρ) E_X sup_θ Σ_{i=1}^n ln(p(X_i|θ)/q(X_i))] / (ρ(1 − ρ) n).    (7)

In the following, we compare our analysis with previous results.
To be consistent with the concepts used in these previous studies, we consider the following quantity:

    m^{α,ρ}_π(X, ε) = E_π π_α(θ|X) 1(D_ρ(q||p(·|θ)) ≥ ε),

where 1 is the set indicator function. Intuitively, m^{α,ρ}_π(X, ε) is the probability mass of the α-Bayesian posterior π_α(·|X) in the region of densities p(·|θ) that are at least ε away from q in D_ρ-divergence. Using the Markov inequality, we immediately obtain from (7) the following bound:

    E_X m^{α,ρ}_π(X, ε) ≤ [−ln E_π e^{−αn KL(q||p(·|θ))} + (α − ρ) E_X sup_θ Σ_{i=1}^n ln(p(X_i|θ)/q(X_i))] / (ρ(1 − ρ) n ε).    (8)

Next we would like to estimate the right-hand side of (8). Due to space limitations, we only consider a simple truncation estimate, which leads to the correct convergence rate for non-parametric problems but yields an unnecessary ln n factor for parametric problems (which can be handled correctly with a more precise estimate). We introduce the following notation, which is essentially the prior measure of an ε-radius KL-ball around q:

    M^{KL}_π(ε) = π(KL(q||p(·|θ)) ≤ ε) = E_π 1(KL(q||p(·|θ)) ≤ ε).

Using this definition, we have E_π e^{−αn KL(q||p(·|θ))} ≥ M^{KL}_π(ε) e^{−αnε}. In addition, we define the ε-upper bracketing of Γ (introduced in [1]), denoted by N(Γ, ε), as the minimum number of non-negative functions {f_i} on X with respect to λ such that E_q(f_i/q) = 1 + ε and, for every θ ∈ Γ, there exists i such that p(x|θ) ≤ f_i(x) a.e. [λ]. We have

    (1/n) E_X sup_θ Σ_{i=1}^n ln(p(X_i|θ)/q(X_i)) ≤ (1/n) E_X ln Σ_{j=1}^{N(Γ,ε)} e^{Σ_{i=1}^n ln(f_j(X_i)/q(X_i))}
      ≤ (1/n) ln Σ_{j=1}^{N(Γ,ε)} E_X e^{Σ_{i=1}^n ln(f_j(X_i)/q(X_i))} = ln N(Γ, ε)/n + ln(1 + ε).

Therefore we obtain from (8) that for all s > 0:

    ρ(1 − ρ) s E_X m^{α,ρ}_π(X, sε) ≤ α − (1/(nε)) ln M^{KL}_π(ε) + (α − ρ) [ln N(Γ, ε) + nε] / (nε).

The above bound immediately implies the following consistency and convergence rate theorem for Bayesian posterior distributions:

Theorem 4.1 Consider a sequence of Bayesian prior distributions π_n on a parameter space Γ_n, which may be different for different sample sizes. Consider a sequence of positive numbers {ε_n} such that

    sup_n [−(n ε_n)^{-1} ln M^{KL}_{π_n}(ε_n)] < ∞;    (9)

then for all s_n > 0 such that s_n → ∞, and for all α ∈ (0, 1), m^{α,α}_{π_n}(X, s_n ε_n) → 0 in probability. Moreover, if

    sup_n [ln N(Γ_n, ε_n) / (n ε_n)] < ∞,    (10)

then for all s_n > 0 such that s_n → ∞, and for all ρ ∈ (0, 1), m^{1,ρ}_{π_n}(X, s_n ε_n) → 0 in probability.

The first claim implies that for all α < 1, the α-Bayesian posterior π_α is concentrated in an ε_n-ball around q in D_α divergence, and the rate of convergence is O_p(ε_n). Note that ε_n is determined only by the local property of π_n around the true distribution q. It also immediately implies that as long as M^{KL}_{π_n}(ε) > 0 for all ε > 0, the α-Bayesian method with α < 1 is consistent.
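As a concrete illustration of the quantity m^{α,ρ}_π(X, ε) controlled by Theorem 4.1, the following sketch evaluates it for the Bernoulli-grid example from the earlier illustration. The closed-form ρ-divergence for Bernoulli densities and the names thetas and tempered are assumptions carried over from that illustrative code, not part of the paper.

import numpy as np

def rho_divergence_bernoulli(q1, p1, rho):
    # D_rho(q||p) from (3), in closed form for Bernoulli(q1) versus Bernoulli(p1):
    # E_q[(p/q)^rho] = (1-q1)^(1-rho) (1-p1)^rho + q1^(1-rho) p1^rho
    moment = (1.0 - q1) ** (1.0 - rho) * (1.0 - p1) ** rho + q1 ** (1.0 - rho) * p1 ** rho
    return (1.0 - moment) / (rho * (1.0 - rho))

def posterior_tail_mass(post, thetas, q1, rho, eps):
    # m^{alpha,rho}_pi(X, eps): posterior mass on grid points whose rho-divergence from q is >= eps
    div = np.array([rho_divergence_bernoulli(q1, t, rho) for t in thetas])
    return post[div >= eps].sum()

# e.g. with tempered and thetas from the earlier sketch and true parameter 0.3:
# posterior_tail_mass(tempered, thetas, q1=0.3, rho=0.5, eps=0.05)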
The second claim applies to the standard Bayesian method. Its consistency requires the additional assumption (10), which depends on global properties of the prior π_n. This may seem somewhat surprising at first, but the condition is necessary. In fact, the counter-example given in [1] shows that the standard Bayesian method can be inconsistent even under the condition M^{KL}_{π_n}(ε) > 0 for all ε > 0. Therefore a standard Bayesian procedure can be ill-behaved even if we put a sufficient amount of prior mass around the true distribution.

The consistency theorem given in [1] also relies on the upper bracketing number N(Γ, ε); however, no convergence rates were established there. Here we obtained a rate of convergence result for the standard Bayesian method using their covering definition. Other definitions of covering (e.g. Hellinger covering) were used in more recent works to obtain rates of convergence for non-parametric Bayesian methods [3, 8]. Although it is possible to derive bounds using those different covering definitions in our analysis, we shall not work out the details here. We do point out, however, that these works made assumptions that are not completely necessary. For example, in [3], the definition of M^{KL}_π(ε) requires the additional assumption that E_q (ln(q/p(·|θ)))^2 ≤ ε^2. This stronger condition is not needed in our analysis. Finally, we mention that bounds of the form in Theorem 4.1 are known to produce optimal convergence rates for non-parametric problems (see [3, 8] for examples).

5 Conclusion

In this paper, we formulated an extended family of Bayesian algorithms as empirical log-risk minimization under entropy regularization. We then derived general posterior averaging bounds under entropy regularization that are suitable for analyzing Bayesian methods. These new bounds are of independent interest since they lead to Bernstein-style exponential inequalities, which are crucial for obtaining the correct convergence behavior for many statistical estimation problems such as least squares regression.

Using the posterior averaging bounds, we obtained new convergence results for a generalized family of Bayesian posterior distributions. Our results imply that the α-Bayesian method with α < 1 is more robust than the standard Bayesian method, since its convergence behavior is completely determined by the local prior density around the true distribution. Although the standard Bayesian method is "optimal" in a certain averaging sense, its behavior depends heavily on the global regularity of the prior distribution. What happens is that the standard Bayesian method can put too much emphasis on the difficult part of the prior distribution, which degrades the estimation quality in the easier parts in which we are actually more interested. Therefore, even if one is able to guess the true distribution by putting a large prior mass around its neighborhood, the Bayesian method can still behave badly if one accidentally makes bad choices elsewhere. It is thus difficult to design good Bayesian priors.
The new theoretical insights obtained here imply that unless one completely understands the impact of the prior, it is much safer to use an α-Bayesian method.

Acknowledgments

The author would like to thank Andrew Barron, Ron Meir, and Matthias Seeger for helpful discussions and comments.

References

[1] Andrew Barron, Mark J. Schervish, and Larry Wasserman. The consistency of posterior distributions in nonparametric problems. Ann. Statist., 27(2):536–561, 1999.

[2] Persi Diaconis and David Freedman. On the consistency of Bayes estimates. Ann. Statist., 14(1):1–67, 1986. With a discussion and a rejoinder by the authors.

[3] Subhashis Ghosal, Jayanta K. Ghosh, and Aad W. van der Vaart. Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531, 2000.

[4] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.

[5] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839–860, 2003.

[6] C. P. Robert. The Bayesian Choice: A Decision Theoretic Motivation. Springer Verlag, New York, 1994.

[7] M. Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.

[8] Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior distributions. Ann. Statist., 29(3):687–714, 2001.
", "award": [], "sourceid": 2439, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}