{"title": "Data-Dependent Bounds for Bayesian Mixture Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 335, "page_last": 342, "abstract": null, "full_text": "Data-Dependent Bounds for Bayesian Mixture Methods

Ron Meir
Department of Electrical Engineering
Technion, Haifa 32000, Israel
rmeir@ee.technion.ac.il

Tong Zhang
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598, USA
tzhang@watson.ibm.com

Abstract

We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting.

1 Introduction and Motivation

The standard approach to Computational Learning Theory is usually formulated within the so-called frequentist approach to Statistics. Within this paradigm one is interested in constructing an estimator, based on a finite sample, which possesses a small loss (generalization error). While many algorithms have been constructed and analyzed within this context, it is not clear how these approaches relate to standard optimality criteria within the frequentist framework.
Two classic optimality criteria within the latter approach are the minimax and admissibility criteria, which characterize optimality of estimators in a rigorous and precise fashion [9]. Except in some special cases [12], it is not known whether any of the approaches used within the Learning community lead to optimality in either of the above senses of the word. On the other hand, it is known that under certain regularity conditions, Bayesian estimators lead to either minimax or admissible estimators, and thus to well-defined optimality in the classical (frequentist) sense. In fact, it can be shown that Bayes estimators are essentially the only estimators which can achieve optimality in the above senses [9]. This optimality feature provides strong motivation for the study of Bayesian approaches in a frequentist setting.

While Bayesian approaches have been widely studied, there have not been generally applicable bounds in the frequentist framework. Recently, several approaches have attempted to address this problem. In this paper we establish finite-sample data-dependent bounds for Bayesian mixture methods, which together with the above optimality properties suggest that these approaches should become more widely used.

Consider the problem of supervised learning, where we attempt to construct an estimator based on a finite sample of pairs of examples S = {(x_1, y_1), ..., (x_n, y_n)}, each drawn independently according to an unknown distribution μ(x, y). Let A be a learning algorithm which, based on the sample S, constructs a hypothesis (estimator) h from some set of hypotheses H. Denoting by ℓ(y, h(x)) the instantaneous loss of the hypothesis h, we wish to assess the true loss L(h) = E_μ ℓ(y, h(x)), where the expectation is taken with respect to μ. In particular, the objective is to provide data-dependent bounds of the following form.
For any h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ,

    L(h) ≤ Λ(h, S) + Δ(h, S, δ),    (1)

where Λ(h, S) is some empirical assessment of the true loss, and Δ(h, S, δ) is a complexity term. For example, in the classic Vapnik-Chervonenkis framework, Λ(h, S) is the empirical error (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i)), and Δ(h, S, δ) depends on the VC-dimension of H but is independent of both the hypothesis h and the sample S. By algorithm- and data-dependent bounds we mean bounds where the complexity term depends on both the hypothesis (chosen by the algorithm A) and the sample S.

2 A Decision Theoretic Bayesian Framework

Consider a decision theoretic setting where we define the sample-dependent loss of an algorithm A by R(μ, A, S) = E_μ ℓ(y, A(x, S)). Let f_μ be the optimal predictor for y, namely the function minimizing E_μ{ℓ(y, f(x))} over f. It is clear that the best algorithm A (Bayes algorithm) is the one that always returns f_μ, assuming μ is known. We are interested in the expected loss of an algorithm averaged over samples S:

    R(μ, A) = E_S R(μ, A, S) = ∫ R(μ, A, S) dμ(S),

where the expectation is taken with respect to the sample S drawn i.i.d. from the probability measure μ. If we consider a family of measures μ, which possesses some underlying prior distribution π(μ), then we can construct the averaged risk function with respect to the prior as,

    r(π, A) = E_π R(μ, A) = ∫ dμ(S) dπ(μ) ∫ R(μ, A, S) dπ(μ|S),

where dπ(μ|S) = dμ(S) dπ(μ) / ∫_μ dμ(S) dπ(μ) is the posterior distribution on the μ family, which induces a posterior distribution on the sample space as π_S = E_{π(μ|S)} μ.
An algorithm minimizing the Bayes risk r(π, A) is referred to as a Bayes algorithm. In fact, for a given prior and a given sample S, the optimal algorithm should return the Bayes optimal predictor with respect to the posterior measure π_S.

For many important practical problems, the optimal Bayes predictor is a linear functional of the underlying probability measure. For example, if the loss function is quadratic, namely ℓ(y, A(x)) = (y − A(x))^2, then the optimal Bayes predictor f_μ(x) is the conditional mean of y, namely E_μ[y|x]. For binary classification problems, we can let the predictor be the conditional probability f_μ(x) = μ(y = 1|x) (the optimal classification decision rule then corresponds to a test of whether f_μ(x) > 0.5), which is also a linear functional of μ. Clearly, if the Bayes predictor is a linear functional of the probability measure, then the optimal Bayes algorithm with respect to the prior π is given by

    A_B(x, S) = ∫_μ f_μ(x) dπ(μ|S) = ∫_μ f_μ(x) dμ(S) dπ(μ) / ∫_μ dμ(S) dπ(μ).    (2)

In this case, an optimal Bayesian algorithm can be regarded as the predictor constructed by averaging over all predictors with respect to a data-dependent posterior π(μ|S). We refer to such methods as Bayesian mixture methods. While the Bayes estimator A_B(x, S) is optimal with respect to the Bayes risk r(π, A), it can be shown that, under appropriate conditions (and an appropriate prior), it is also a minimax and admissible estimator [9].

In general, f_μ is unknown. Rather, we may have some prior information about possible models for f_μ. In view of (2) we consider a hypothesis space H, and an algorithm based on a mixture of hypotheses h ∈ H.
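For a finite family of candidate models, the posterior-averaged predictor in (2) can be sketched in a few lines; the helper names and the toy likelihoods below are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Sketch of the Bayes mixture predictor (2) for a finite model family.
# Helper names and toy numbers are illustrative assumptions.

def posterior_weights(log_likelihoods, log_prior):
    # pi(mu|S) is proportional to mu(S) * pi(mu); normalize in log space.
    log_post = np.asarray(log_likelihoods) + np.asarray(log_prior)
    log_post = log_post - log_post.max()   # numerical stability
    w = np.exp(log_post)
    return w / w.sum()

def bayes_mixture_predict(predictors, weights, x):
    # A_B(x, S) = sum over models of f_mu(x) * pi(mu|S):
    # a posterior-weighted average of the individual predictors.
    return sum(w * f(x) for w, f in zip(weights, predictors))

# Two candidate predictors whose data likelihoods are 0.2 and 0.8 under a
# uniform prior; the mixture leans toward the better-supported model.
w = posterior_weights(np.log([0.2, 0.8]), np.log([0.5, 0.5]))
pred = bayes_mixture_predict([lambda x: 0.0, lambda x: 1.0], w, x=0.3)
```

Note that the prediction is a convex combination of the individual predictions, exactly the linear-functional structure exploited above.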
This should be contrasted with classical approaches, where an algorithm selects a single hypothesis h from H. For simplicity, we consider a countable hypothesis space H = {h_1, h_2, ...}; the general case will be deferred to the full paper. Let q = {q_j}_{j=1}^∞ be a probability vector, namely q_j ≥ 0 and Σ_j q_j = 1, and construct the composite predictor by f_q(x) = Σ_j q_j h_j(x). Observe that in general f_q(x) may be a great deal more complex than any single hypothesis h_j. For example, if the h_j(x) are non-polynomial ridge functions, the composite predictor f_q corresponds to a two-layer neural network with universal approximation power. We denote by Q the probability distribution defined by q, namely Σ_j q_j h_j = E_{h∼Q} h.

A main feature of this work is the establishment of data-dependent bounds on L(E_{h∼Q} h), the loss of the Bayes mixture algorithm. There has been a flurry of recent activity concerning data-dependent bounds (a non-exhaustive list includes [2, 3, 5, 11, 13]). In a related vein, McAllester [7] provided a data-dependent bound for the so-called Gibbs algorithm, which selects a hypothesis at random from H based on the posterior distribution π(h|S). Essentially, this result provides a bound on the average error E_{h∼Q} L(h) rather than a bound on the error of the averaged hypothesis. Later, Langford et al. [6] extended this result to a mixture of classifiers using a margin-based loss function. A more general result can also be obtained using the covering number approach described in [14]. Finally, Herbrich and Graepel [4] showed that under certain conditions the bounds for the Gibbs classifier can be extended to a Bayesian mixture classifier. However, their bound contained an explicit dependence on the dimension (see Thm.
3 in [4]).

Although the approach pioneered by McAllester came to be known as PAC-Bayes, this term is somewhat misleading, since an optimal Bayesian method (in the decision theoretic framework outlined above) does not average over loss functions but rather over hypotheses. In this regard, the learning behavior of a true Bayesian method is not addressed in the PAC-Bayes analysis. In this paper, we would like to narrow the discrepancy by analyzing Bayesian mixture methods, where we consider a predictor that is the average of a family of predictors with respect to a data-dependent posterior distribution. Bayesian mixtures can often be regarded as a good approximation to a true optimal Bayesian method. In fact, we have shown above that they are equivalent for many important practical problems.

Therefore, the main contribution of the present work is the extension of the above-mentioned results in PAC-Bayes analysis to a rather unified setting for Bayesian mixture methods, where different regularization criteria may be incorporated and their effect on the performance easily assessed. Furthermore, it is also essential that the bounds obtained are dimension-independent, since otherwise they yield useless results when applied to kernel-based methods, which often map the input space into a space of very high dimensionality. Similar results can also be obtained using the covering number analysis in [14]. However, the approach presented in the current paper, which relies on the direct computation of the Rademacher complexity, is more direct and gives better bounds. The analysis is also easier to generalize than the corresponding covering number approach. Moreover, our analysis applies directly to other non-Bayesian mixture approaches such as Bagging and Boosting.

Before moving to the derivation of our bounds, we formalize our approach.
Consider a countable hypothesis space H = {h_j}_{j=1}^∞, and a probability distribution {q_j} over H. Introduce the vector notation Σ_{k=1}^∞ q_k h_k(x) = q^T h(x). A learning algorithm within the Bayesian mixture framework uses the sample S to select a distribution Q over H and then constructs a mixture hypothesis f_q(x) = q^T h(x). In order to constrain the class of mixtures used in constructing the mixture q^T h, we impose constraints on the mixture vector q. Let g(q) be a non-negative convex function of q and define, for any positive A,

    Ω_A = {q ∈ S : g(q) ≤ A} ;  F_A = {f_q : f_q(x) = q^T h(x), q ∈ Ω_A},    (3)

where S denotes the probability simplex. In subsequent sections we will consider different choices for g(q), which essentially acts as a regularization term. Finally, for any mixture q^T h we define the loss by L(q^T h) = E_μ ℓ(y, (q^T h)(x)) and the empirical loss incurred on the sample by L̂(q^T h) = (1/n) Σ_{i=1}^n ℓ(y_i, (q^T h)(x_i)).

3 A Mixture Algorithm with an Entropic Constraint

In this section we consider an entropic constraint, which penalizes weights deviating significantly from some prior probability distribution ν = {ν_j}_{j=1}^∞, which may incorporate our prior information about the problem. The weights q themselves are chosen by the algorithm based on the data. In particular, in this section we set g(q) to be the Kullback-Leibler divergence of q from ν,

    g(q) = D(q‖ν) ;  D(q‖ν) = Σ_j q_j log(q_j/ν_j).

Let F be a class of real-valued functions, and denote by σ_i independent Bernoulli random variables assuming the values ±1 with equal probability. We define the data-dependent Rademacher complexity of F as

    R̂_n(F) = E_σ [ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(x_i) | S ].

The expectation of R̂_n(F) with respect to S will be denoted by R_n(F). We note that R̂_n(F) is concentrated around its mean value R_n(F) (e.g., Thm. 8 in [1]).
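Both quantities just introduced are easy to compute numerically for a small finite class; a minimal sketch, in which the class, sample size, and number of sign draws are illustrative assumptions rather than anything from the paper:

```python
import numpy as np

# Monte Carlo sketch of the empirical Rademacher complexity
#     R_hat_n(F) = E_sigma sup_{f in F} (1/n) sum_i sigma_i f(x_i)
# for a small finite class F, together with the entropic regularizer
# g(q) = D(q||nu).  Class, sample size and draw count are toy choices.

def empirical_rademacher(values, n_draws=2000, seed=0):
    # values[j, i] holds h_j(x_i); the sup runs over the rows.
    rng = np.random.default_rng(seed)
    num_f, n = values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # fair +/-1 signs
        total += np.max(values @ sigma) / n
    return total / n_draws

def kl_divergence(q, nu):
    # D(q||nu) = sum_j q_j log(q_j / nu_j), with the convention 0 log 0 = 0.
    q, nu = np.asarray(q, float), np.asarray(nu, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / nu[mask])))

# For the two-function class {h, -h} with h = 1 and n = 4 points,
# the sup equals |sum_i sigma_i| / n, whose exact expectation is 0.375.
r = empirical_rademacher(np.array([[1.0, 1.0, 1.0, 1.0],
                                   [-1.0, -1.0, -1.0, -1.0]]))
```

The Monte Carlo estimate should land close to the exact value 0.375 for this toy class.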
We quote a slightly adapted result from [5].

Theorem 1 (Adapted from Theorem 1 in [5]) Let {x_1, x_2, ..., x_n} ∈ X be a sequence of points generated independently at random according to a probability distribution P, and let F be a class of measurable functions from X to R. Furthermore, let ℓ be a non-negative Lipschitz function with Lipschitz constant κ, such that ℓ∘f is uniformly bounded by a constant M. Then for all f ∈ F, with probability at least 1 − δ,

    E ℓ(f(x)) − (1/n) Σ_{i=1}^n ℓ(f(x_i)) ≤ 4κ R_n(F) + M √(log(1/δ) / (2n)).

An immediate consequence of Theorem 1 is the following.

Lemma 3.1 Let the loss function ℓ be bounded by M, and assume that it is Lipschitz with constant κ. Then for all q ∈ Ω_A, with probability at least 1 − δ,

    L(q^T h) ≤ L̂(q^T h) + 4κ R_n(F_A) + M √(log(1/δ) / (2n)).

Next, we bound the empirical Rademacher average of F_A using g(q) = D(q‖ν).

Lemma 3.2 The empirical Rademacher complexity of F_A is upper bounded as follows:

    R̂_n(F_A) ≤ √(2A/n) · sup_j √( (1/n) Σ_{i=1}^n h_j(x_i)^2 ).

Proof: We first recall a few facts from the theory of convex duality [10]. Let p(u) be a convex function over a domain U, and set its dual s(z) = sup_{u∈U} ( u^T z − p(u) ). It is known that s(z) is also convex. Setting u = q and p(q) = Σ_j q_j log(q_j/ν_j), we find that s(z) = log Σ_j ν_j e^{z_j}.
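The conjugate pair of the Kullback-Leibler divergence and the log-sum-exp function yields, via the Fenchel-Young inequality, q^T z ≤ D(q‖ν) + log Σ_j ν_j e^{z_j} for any q in the simplex; a small numerical sanity check (the random test vectors are illustrative only):

```python
import numpy as np

# Numerical check of the Fenchel-Young inequality for the conjugate pair
# p(q) = D(q||nu) and s(z) = log sum_j nu_j exp(z_j):
#     q.z <= sum_j q_j log(q_j / nu_j) + log sum_j nu_j exp(z_j).
# Random test vectors below are illustrative, not from the paper.

rng = np.random.default_rng(1)
for _ in range(1000):
    q = rng.dirichlet(np.ones(5))    # random point in the simplex
    nu = rng.dirichlet(np.ones(5))   # reference (prior) distribution
    z = rng.normal(size=5)
    lhs = q @ z
    rhs = np.sum(q * np.log(q / nu)) + np.log(np.sum(nu * np.exp(z)))
    assert lhs <= rhs + 1e-9
all_hold = True
```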
From the definition of s(z) it follows that for any q ∈ S,

    q^T z ≤ Σ_j q_j log(q_j/ν_j) + log Σ_j ν_j e^{z_j}.

Since z is arbitrary, we set z = (λ/n) Σ_i σ_i h(x_i) and conclude that for q ∈ Ω_A and any λ > 0,

    sup_{q∈Ω_A} { (1/n) Σ_{i=1}^n σ_i q^T h(x_i) } ≤ (1/λ) { A + log Σ_j ν_j exp[ (λ/n) Σ_i σ_i h_j(x_i) ] }.

Taking the expectation with respect to σ, and using the Chernoff bound E_σ {exp(Σ_i σ_i a_i)} ≤ exp(Σ_i a_i^2 / 2), we have that

    R̂_n(F_A) ≤ (1/λ) { A + E_σ log Σ_j ν_j exp[ (λ/n) Σ_i σ_i h_j(x_i) ] }
             ≤ (1/λ) ( A + sup_j log E_σ exp[ (λ/n) Σ_i σ_i h_j(x_i) ] )    (Jensen)
             ≤ (1/λ) ( A + sup_j log exp[ (λ^2/(2n^2)) Σ_i h_j(x_i)^2 ] )    (Chernoff)
             = A/λ + (λ/(2n^2)) sup_j Σ_i h_j(x_i)^2.

Minimizing the r.h.s. with respect to λ, we obtain the desired result. □

Combining Lemmas 3.1 and 3.2 yields our basic bound, where κ and M are defined in Lemma 3.1.

Theorem 2 Let S = {(x_1, y_1), ..., (x_n, y_n)} be a sample of i.i.d. points each drawn according to a distribution μ(x, y). Let H be a countable hypothesis class, and set F_A to be the class defined in (3) with g(q) = D(q‖ν). Set Δ_H = [ (1/n) E_μ sup_j Σ_{i=1}^n h_j(x_i)^2 ]^{1/2}. Then for any q ∈ Ω_A, with probability at least 1 − δ,

    L(q^T h) ≤ L̂(q^T h) + 4κ Δ_H √(2A/n) + M √(log(1/δ) / (2n)).

Note that if the h_j are uniformly bounded, |h_j| ≤ c, then Δ_H ≤ c. Theorem 2 holds for a fixed value of A. Using the so-called multiple testing Lemma (e.g. [11]) we obtain:

Corollary 3.1 Let the assumptions of Theorem 2 hold, and let {A_i, p_i} be a set of positive numbers such that Σ_i p_i = 1.
Then for all A_i and q ∈ Ω_{A_i}, with probability at least 1 − δ,

    L(q^T h) ≤ L̂(q^T h) + 4κ Δ_H √(2A_i/n) + M √(log(1/(p_i δ)) / (2n)).

Note that the only distinction from Theorem 2 is the extra log(1/p_i) term, which is the price paid for the uniformity of the bound.

Finally, we present a data-dependent bound of the form (1).

Theorem 3 Let the assumptions of Theorem 2 hold. Then for all q ∈ S, with probability at least 1 − δ,

    L(q^T h) ≤ L̂(q^T h) + max(κ Δ_H, M) × √( (130 D(q‖ν) + log(1/δ)) / n ).    (4)

Proof sketch: Pick A_i = 2^i and p_i = 1/(i(i+1)), i = 1, 2, ... (note that Σ_i p_i = 1). For each q, let i(q) be the smallest index for which A_{i(q)} ≥ D(q‖ν), implying that log(1/p_{i(q)}) ≤ 2 log log_2(4 D(q‖ν)). A few lines of algebra, to be presented in the full paper, yield the desired result. □

The results of Theorem 3 can be compared to those derived by McAllester [8] for the randomized Gibbs procedure. In the latter case, the first term on the r.h.s. is E_{h∼Q} L̂(h), namely the average empirical error of the base classifiers h. In our case the corresponding term is L̂(E_{h∼Q} h), namely the empirical error of the average hypothesis. Since E_{h∼Q} h is potentially much more complex than any single h ∈ H, we expect that the empirical term in (4) is much smaller than the corresponding term in [8]. Moreover, the complexity term we obtain is in fact tighter than the corresponding term in [8] by a logarithmic factor in n (although the logarithmic factor in [8] could probably be eliminated). We thus expect that the Bayesian mixture approach advocated here leads to better performance guarantees.

Finally, we comment that Theorem 3 can be used to obtain so-called oracle inequalities. In particular, let q* be the optimal distribution minimizing L(q^T h), which can only be computed if the underlying distribution μ(x, y) is known.
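For a convex loss, Jensen's inequality guarantees the empirical-term comparison made above: the mixture's empirical error L̂(E_{h∼Q} h) is never larger than the Gibbs average E_{h∼Q} L̂(h). A minimal numerical illustration, where the sample, hypotheses, and mixture weights are all toy assumptions:

```python
import numpy as np

# Jensen's inequality for a convex loss (here squared loss): the empirical
# error of the averaged hypothesis, L_hat(E_{h~Q} h), is at most the Gibbs
# average E_{h~Q} L_hat(h).  Sample, hypotheses and weights are toy choices.

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = np.sin(x)
H = np.stack([x, x ** 2, np.ones(n), np.abs(x)])  # rows hold h_j(x_i)
q = rng.dirichlet(np.ones(H.shape[0]))            # mixture weights

def emp_loss(pred):
    # empirical squared loss, convex in the prediction
    return float(np.mean((y - pred) ** 2))

mixture_loss = emp_loss(q @ H)                              # L_hat(q^T h)
gibbs_loss = float(q @ np.array([emp_loss(h) for h in H]))  # E_{h~Q} L_hat(h)
```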
Consider an algorithm which, based only on the data, selects a distribution q̂ by minimizing the r.h.s. of (4), with the implicit constants appropriately specified. Then, using standard approaches (e.g. [2]) we can obtain a bound on L(q̂^T h) − L(q*^T h). For lack of space, we defer the derivation of the precise bound to the full paper.

4 General Data-Dependent Bounds for Bayesian Mixtures

The Kullback-Leibler divergence is but one way to incorporate prior information. In this section we extend the results to general convex regularization functions g(q). Some possible choices for g(q) besides the Kullback-Leibler divergence are the standard L_p norms ‖q‖_p.

In order to proceed along the lines of Section 3, we let s(z) be the convex function associated with g(q), namely s(z) = sup_{q∈Ω_A} { q^T z − g(q) }. Repeating the arguments of Section 3, we have for any λ > 0 that

    (1/n) Σ_{i=1}^n σ_i q^T h(x_i) ≤ (1/λ) { A + s( (λ/n) Σ_i σ_i h(x_i) ) },

which implies that

    R̂_n(F_A) ≤ inf_{λ≥0} (1/λ) { A + E_σ s( (λ/n) Σ_i σ_i h(x_i) ) }.    (5)

Assume that s(z) is second-order differentiable, and that for any h = Σ_{i=1}^n σ_i h(x_i) and any Δh, (1/2)( s(h + Δh) + s(h − Δh) ) − s(h) ≤ u(Δh). Then, assuming that s(0) = 0, it is easy to show by induction that

    E_σ s( (λ/n) Σ_{i=1}^n σ_i h(x_i) ) ≤ Σ_{i=1}^n u( (λ/n) h(x_i) ).    (6)

In the remainder of the section we focus on the case of regularization based on the L_p norm. Consider p and q such that 1/q + 1/p = 1, p ∈ (1, ∞), and let p' = max(p, 2) and q' = min(q, 2). Note that if p ≤ 2 then q ≥ 2 and q' = p' = 2, while if p > 2 then q < 2, q' = q, p' = p. Consider p-norm regularization g(q) = (1/p') ‖q‖_p^{p'}, in which case s(z) = (1/q') ‖z‖_q^{q'}.
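For p > 2 (so p' = p and q' = q), the conjugacy between g(q) = (1/p)‖q‖_p^p and s(z) = (1/q)‖z‖_q^q is an instance of Young's inequality, u^T z ≤ ‖u‖_p^p/p + ‖z‖_q^q/q, which can be checked numerically; the exponent and test vectors below are illustrative choices:

```python
import numpy as np

# Check of Young's inequality  u.z <= ||u||_p^p / p + ||z||_q^q / q  for
# conjugate exponents 1/p + 1/q = 1 with p > 2 (so p' = p, q' = q), the
# duality pair behind the L_p regularization bound.  Test vectors are toy.

rng = np.random.default_rng(2)
p = 3.0
q_exp = p / (p - 1.0)   # conjugate exponent of p
for _ in range(1000):
    u = rng.normal(size=6)
    z = rng.normal(size=6)
    lhs = float(u @ z)
    rhs = float(np.sum(np.abs(u) ** p) / p
                + np.sum(np.abs(z) ** q_exp) / q_exp)
    assert lhs <= rhs + 1e-9
young_holds = True
```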
The Rademacher averaging result for p-norm regularization is known in the geometric theory of Banach spaces (type structure of the Banach space), and it also follows from Khintchine's inequality. We show that it can be easily obtained in our framework.

In this case, it is easy to see that s(z) = (1/q') ‖z‖_q^{q'} implies u(h(x)) ≤ ((q − 1)/q') ‖h(x)‖_q^{q'}. Substituting in (5) we have

    R̂_n(F_A) ≤ inf_{λ≥0} (1/λ) ( A + ((q − 1)/q') (λ/n)^{q'} Σ_{i=1}^n ‖h(x_i)‖_q^{q'} )
             = (C_q / n^{1/p'}) A^{1/p'} ( (1/n) Σ_{i=1}^n ‖h(x_i)‖_q^{q'} )^{1/q'},

where C_q = ((q − 1)/q')^{1/q'}.

Combining this result with the methods described in Section 3, we establish a bound for regularization based on the L_p norm. Assume that ‖h(x_i)‖_q is finite for all i, and set Δ_{H,q} = ( E{ (1/n) Σ_{i=1}^n ‖h(x_i)‖_q^{q'} } )^{1/q'}.

Theorem 4 Let the conditions of Theorem 3 hold and set g(q) = (1/p') ‖q‖_p^{p'}, p ∈ (1, ∞). Then for all q ∈ S, with probability at least 1 − δ,

    L(q^T h) ≤ L̂(q^T h) + max(κ Δ_{H,q}, M) × O( ‖q‖_p / n^{1/p'} + √( (log log(‖q‖_p + 3) + log(1/δ)) / n ) ),

where O(·) hides a universal constant that depends only on p.

5 Discussion

We have introduced and analyzed a class of regularized Bayesian mixture approaches, which construct complex composite estimators by combining hypotheses from some underlying hypothesis class using data-dependent weights. Such weighted averaging approaches have been used extensively within the Bayesian framework, as well as in more recent approaches such as Bagging and Boosting. While Bayesian methods are known, under favorable conditions, to lead to optimal estimators in a frequentist setting, their performance in agnostic settings, where no reliable assumptions can be made concerning the data generating mechanism, has not been well understood.
Our data-dependent bounds allow the utilization of Bayesian mixture models in general settings, while at the same time taking advantage of the benefits of the Bayesian approach in terms of incorporation of prior knowledge. The bounds established, being independent of the cardinality of the underlying hypothesis space, can be directly applied to kernel-based methods.

Acknowledgments: We thank Shimon Benjo for helpful discussions. The research of R.M. is partially supported by the fund for promotion of research at the Technion and by the Ollendorff foundation of the Electrical Engineering department at the Technion.

References

[1] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 224-240, 2001.
[2] P.L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85-113, 2002.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.
[4] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: why SVMs work. In Advances in Neural Information Processing Systems 13, pages 224-230, Cambridge, MA, 2001. MIT Press.
[5] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1), 2002.
[6] J. Langford, M. Seeger, and N. Megiddo. An improved predictive accuracy bound for averaging classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 290-297, 2001.
[7] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230-234, New York, 1998. ACM Press.
[8] D. A. McAllester.
PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, New York, 1999. ACM Press.
[9] C. P. Robert. The Bayesian Choice: A Decision Theoretic Motivation. Springer Verlag, New York, 1994.
[10] R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[11] J. Shawe-Taylor, P. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44:1926-1940, 1998.
[12] Y. Yang. Minimax nonparametric classification - part I: rates of convergence. IEEE Transactions on Information Theory, 45(7):2271-2284, 1999.
[13] T. Zhang. Generalization performance of some learning problems in Hilbert functional space. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2001. MIT Press.
[14] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.
", "award": [], "sourceid": 2267, "authors": [{"given_name": "Ron", "family_name": "Meir", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": null}]}