{"title": "Posterior Consistency of the Silverman g-prior in Bayesian Model Choice", "book": "Advances in Neural Information Processing Systems", "page_first": 1969, "page_last": 1976, "abstract": "Kernel supervised learning methods can be unified by utilizing the tools from regularization theory. The duality between regularization and prior leads to interpreting regularization methods in terms of maximum a posteriori estimation and has motivated Bayesian interpretations of kernel methods. In this paper we pursue a Bayesian interpretation of sparsity in the kernel setting by making use of a mixture of a point-mass distribution and prior that we refer to as ``Silverman's g-prior.'' We provide a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior. We also establish the asymptotic relationship between this procedure and the Bayesian information criterion.", "full_text": "Posterior Consistency of the Silverman g-prior in\n\nBayesian Model Choice\n\nZhihua Zhang\n\nSchool of Computer Science & Technology\n\nZhejiang University, Hangzhou, China\n\nMichael I. Jordan\n\nDepartments of EECS and Statistics\n\nUniversity of California, Berkeley, CA, USA\n\nDit-Yan Yeung\n\nDepartment of Computer Science & Engineering\n\nHKUST, Hong Kong, China\n\nAbstract\n\nKernel supervised learning methods can be uni\ufb01ed by utilizing the tools from\nregularization theory. The duality between regularization and prior leads to inter-\npreting regularization methods in terms of maximum a posteriori estimation and\nhas motivated Bayesian interpretations of kernel methods. In this paper we pursue\na Bayesian interpretation of sparsity in the kernel setting by making use of a mix-\nture of a point-mass distribution and prior that we refer to as \u201cSilverman\u2019s g-prior.\u201d\nWe provide a theoretical analysis of the posterior consistency of a Bayesian model\nchoice procedure based on this prior. 
We also establish the asymptotic relationship between this procedure and the Bayesian information criterion.\n\n1 Introduction\nWe address a supervised learning problem over a set of training data {x_i, y_i}_{i=1}^n where x_i \u2208 X \u2282 R^p is a p-dimensional input vector and y_i is a univariate response. Using the theory of reproducing kernels, we seek to find a predictive function f(x) from the training data.\nSuppose f = u + h \u2208 ({1} + H_K) where H_K is a reproducing kernel Hilbert space (RKHS). The estimation of f(x) is then formulated as a regularization problem of the form\n\nmin_{f \u2208 H_K} (1/n) \sum_{i=1}^n L(y_i, f(x_i)) + (g/2) ||h||^2_{H_K},   (1)\n\nwhere L(y, f(x)) is a loss function, ||h||^2_{H_K} is the RKHS norm and g > 0 is the regularization parameter. By the representer theorem [7], the solution for (1) is of the form\n\nf(x) = u + \sum_{j=1}^n \u03b2_j K(x, x_j),   (2)\n\nwhere K(\u00b7,\u00b7) is the kernel function. Noticing that ||h||^2_{H_K} = \sum_{i,j=1}^n K(x_i, x_j)\u03b2_i\u03b2_j and substituting (2) into (1), we obtain the minimization problem with respect to (w.r.t.) the \u03b2_i as\n\nmin_{u,\u03b2} (1/n) \sum_{i=1}^n L(y_i, f(x_i)) + (g/2) \u03b2'K\u03b2,   (3)\n\nwhere K = [K(x_i, x_j)] is the n\u00d7n kernel matrix and \u03b2 = (\u03b2_1, . . . , \u03b2_n)' is the vector of regression coefficients.\n\nFrom the Bayesian standpoint, the role of the regularization term (g/2)\u03b2'K\u03b2 can be captured by assigning a design-dependent prior N_n(0, g^{-1}K^{-1}) to the regression vector \u03b2. The prior N_n(0, K^{-1}) for \u03b2 was first proposed by [5] in his Bayesian formulation of spline smoothing. 
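For the special case of the squared loss, problem (3) admits a closed-form solution, which the following sketch illustrates numerically. The RBF kernel, the synthetic data, and the treatment of the intercept u as the sample mean are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

# Sketch of problem (3) with squared loss (kernel ridge regression):
#   minimize (1/n) * sum_i (y_i - u - (K beta)_i)^2 + (g/2) * beta' K beta.
# With u fixed (here, the mean of y) and K nonsingular, setting the gradient
# to zero gives K @ ((2/n) K beta + g beta - (2/n) yc) = 0, hence
#   beta = (K + (n g / 2) I)^{-1} yc,   yc = y - u 1_n.

def rbf_kernel(X, Z, bandwidth=1.0):
    """Gaussian RBF kernel K[i, j] = exp(-||x_i - z_j||^2 / (2 h^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def fit_kernel_ridge(X, y, g):
    n = len(y)
    K = rbf_kernel(X, X)
    u = y.mean()                       # intercept handled as the sample mean
    beta = np.linalg.solve(K + (n * g / 2.0) * np.eye(n), y - u)
    return u, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
u, beta = fit_kernel_ridge(X, y, g=1.0 / 40)   # g = 1/n, as suggested later
yhat = u + rbf_kernel(X, X) @ beta
print(np.mean((y - yhat) ** 2))        # small in-sample error
```

The n\u00d7n linear solve mirrors the representer-theorem reduction: the infinite-dimensional problem (1) collapses to the finite problem (3) in the coefficients \u03b2.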
Here we refer to the prior \u03b2 \u223c N_n(0, g^{-1}K^{-1}) as the Silverman g-prior by analogy to the Zellner g-prior [9]. When K is singular, by analogy to the generalized singular g-prior (gsg-prior) [8], we call N_n(0, g^{-1}K^{-}), with K^{-} a generalized inverse of K, a generalized Silverman g-prior.\nGiven the high dimensionality generally associated with RKHS methods, sparseness has emerged as a significant theme, particularly when computational concerns are taken into account. For example, the number of support vectors in the support vector machine (SVM) is equal to the number of nonzero components of \u03b2. That is, if \u03b2_j = 0, the jth input vector is excluded from the basis expansion in (2); otherwise the jth input vector is a support vector. We are thus interested in a prior for \u03b2 which allows some components of \u03b2 to be zero. To specify such a prior we first introduce an indicator vector \u03b3 = (\u03b3_1, . . . , \u03b3_n)' such that \u03b3_j = 1 if x_j is a support vector and \u03b3_j = 0 if it is not. Let n_\u03b3 = \sum_{j=1}^n \u03b3_j be the number of support vectors, let K_\u03b3 be the n\u00d7n_\u03b3 submatrix of K consisting of those columns of K for which \u03b3_j = 1, and let \u03b2_\u03b3 be the corresponding subvector of \u03b2. Accordingly, we let \u03b2_\u03b3 \u223c N_{n_\u03b3}(0, g^{-1}K_{\u03b3\u03b3}^{-1}), where K_{\u03b3\u03b3} is the n_\u03b3\u00d7n_\u03b3 submatrix of K_\u03b3 consisting of those rows of K_\u03b3 for which \u03b3_j = 1.\nWe thus have a Bayesian model choice problem in which a family of models is indexed by an indicator vector \u03b3. Within the Bayesian framework we can use Bayes factors to choose among these models [3]. 
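As a concrete illustration of the indexing by \u03b3, the submatrices K_\u03b3 and K_{\u03b3\u03b3} and a draw of \u03b2_\u03b3 from the Silverman g-prior can be formed as follows. A synthetic positive definite matrix stands in for the kernel matrix K; none of the numerical values below come from the paper.

```python
import numpy as np

# Build K_gamma (columns of K with gamma_j = 1), K_gammagamma (the matching
# rows of K_gamma), and sample beta_gamma ~ N(0, g^{-1} K_gammagamma^{-1}).

rng = np.random.default_rng(1)
n = 8
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)            # synthetic positive definite "kernel" matrix

gamma = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # which inputs act as support vectors
idx = np.flatnonzero(gamma)
n_gamma = len(idx)

K_gamma = K[:, idx]                    # n x n_gamma submatrix
K_gg = K[np.ix_(idx, idx)]             # n_gamma x n_gamma submatrix

g = 1.0 / n                            # the choice g = 1/n satisfies (11) below
cov = np.linalg.inv(g * K_gg)          # covariance of the Silverman g-prior
beta_gamma = rng.multivariate_normal(np.zeros(n_gamma), cov)

f_prior = K_gamma @ beta_gamma         # induced prior mean function at the inputs
print(K_gamma.shape, K_gg.shape)       # (8, 4) (4, 4)
```

Setting \u03b2_j = 0 for all j with \u03b3_j = 0 then recovers the sparse basis expansion in (2).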
In this paper we provide a frequentist theoretical analysis of this Bayesian procedure. In particular, motivated by the work of [1] on the consistency of the Zellner g-prior, we investigate the consistency for model choice of the Silverman g-prior for sparse kernel-based regression.\n\n2 Main Results\nOur analysis is based on the following regression model M_\u03b3:\n\ny = u1_n + K_\u03b3\u03b2_\u03b3 + \u03b5,  \u03b5 \u223c N_n(0, \u03c3^2 I_n),  \u03b2_\u03b3|\u03c3 \u223c N_{n_\u03b3}(0, \u03c3^2(g_\u03b3 K_{\u03b3\u03b3})^{-1}),   (4)\n\nwhere y = (y_1, . . . , y_n)'. Here and later, 1_m denotes the m\u00d71 vector of ones and I_m denotes the m\u00d7m identity matrix. We compare each model M_\u03b3 with the null model M_0, formulating the model choice problem via the hypotheses H_0 : \u03b2 = 0 and H_\u03b3 : \u03b2_\u03b3 \u2208 R^{n_\u03b3}.\nThroughout this paper, n_\u03b3 is always assumed to take a finite value even though n \u2192 \u221e. Let \tilde{K}_\u03b3 = [1_n, K_\u03b3]. The following condition is also assumed:\n\nFor a fixed n_\u03b3 < n, \tilde{K}_\u03b3'\tilde{K}_\u03b3 is positive definite and (1/n)\tilde{K}_\u03b3'\tilde{K}_\u03b3 converges to a positive definite matrix as n \u2192 \u221e.   (5)\n\nSuppose that the sample y is generated by model M_\u03bd with parameter values u, \u03b2_\u03bd and \u03c3. We formalize the problem of consistency for model choice as follows [1]:\n\nplim_{n\u2192\u221e} p(M_\u03bd|y) = 1 and plim_{n\u2192\u221e} p(M_\u03b3|y) = 0 for all M_\u03b3 \u2260 M_\u03bd,   (6)\n\nwhere \u201cplim\u201d denotes convergence in probability and the limit is taken w.r.t. the sampling distribution under the true model M_\u03bd.\n\n2.1 A Noninformative Prior for (u, \u03c3^2)\n\nWe first consider the case when (u, \u03c3^2) is assigned the following noninformative prior:\n\np(u, \u03c3^2) \u221d 1/\u03c3^2.   (7)\n\nMoreover, we assume 1_n'K = 0. In this case, we have 1_n'K_\u03b3 = 0 so that the intercept u may be regarded as a common parameter for both M_\u03b3 and M_0.\nAfter some calculations the marginal likelihood is found to be\n\np(y|M_\u03b3) = \u0393((n-1)/2)/(\u221an \u03c0^{(n-1)/2}) ||y - \u00afy1_n||^{-(n-1)} |Q_\u03b3|^{-1/2} (1 - F_\u03b3^2)^{-(n-1)/2},   (8)\n\nwhere \u00afy = (1/n)\sum_{i=1}^n y_i, Q_\u03b3 = I_n + g_\u03b3^{-1} K_\u03b3 K_{\u03b3\u03b3}^{-1} K_\u03b3' and\n\nF_\u03b3^2 = y'K_\u03b3(g_\u03b3 K_{\u03b3\u03b3} + K_\u03b3'K_\u03b3)^{-1}K_\u03b3'y / ||y - \u00afy1_n||^2.\n\nLet RSS_\u03b3 = (1 - R_\u03b3^2)||y - \u00afy1_n||^2 be the residual sum of squares. Here,\n\nR_\u03b3^2 = (y - \u00afy1_n)'K_\u03b3(K_\u03b3'K_\u03b3)^{-1}K_\u03b3'(y - \u00afy1_n) / ||y - \u00afy1_n||^2 = y'K_\u03b3(K_\u03b3'K_\u03b3)^{-1}K_\u03b3'y / ||y - \u00afy1_n||^2,\n\nand RSS_\u03b3 = y'(I_n - \tilde{H}_\u03b3)y where \tilde{H}_\u03b3 = \tilde{K}_\u03b3(\tilde{K}_\u03b3'\tilde{K}_\u03b3)^{-1}\tilde{K}_\u03b3'. It is easily proven that for fixed n, lim_{g_\u03b3\u21920} F_\u03b3^2 = R_\u03b3^2 and lim_{g_\u03b3\u21920} (1 - F_\u03b3^2)||y - \u00afy1_n||^2 = RSS_\u03b3. As a special case of (8), it is also immediate to obtain the marginal distribution of the null model as\n\np(y|M_0) = \u0393((n-1)/2)/(\u221an \u03c0^{(n-1)/2}) ||y - \u00afy1_n||^{-(n-1)}.\n\nThen the Bayes factor for M_\u03b3 versus M_0 is\n\nBF_{\u03b30} = |Q_\u03b3|^{-1/2}(1 - F_\u03b3^2)^{-(n-1)/2}.\n\nIn the limiting case when g_\u03b3 \u2192 0 and both n and n_\u03b3 are fixed, BF_{\u03b30} tends to 0. This implies that a large spread of the prior forces the Bayes factor to favor the null model. Thus, as in the case of the Zellner g-prior [4], Bartlett\u2019s paradox arises for the Silverman g-prior.\nThe Bayes factor for M_\u03b3 versus M_\u03ba is given by\n\nBF_{\u03b3\u03ba} = BF_{\u03b30}/BF_{\u03ba0} = |Q_\u03b3|^{-1/2}(1 - F_\u03b3^2)^{-(n-1)/2} / (|Q_\u03ba|^{-1/2}(1 - F_\u03ba^2)^{-(n-1)/2}).   (9)\n\nBased on the Bayes factor, we now explore the consistency of the Silverman g-prior. Suppose that the sample y is generated by model M_\u03bd with parameter values u, \u03b2_\u03bd and \u03c3^2. Then the consistency property (6) is equivalent to\n\nplim_{n\u2192\u221e} BF_{\u03b3\u03bd} = 0, for all M_\u03b3 \u2260 M_\u03bd.\n\nAssume that under any model M_\u03b3 that does not contain M_\u03bd, i.e., M_\u03b3 \u2289 M_\u03bd,\n\nlim_{n\u2192\u221e} (1/n) \tilde{\u03b2}_\u03bd'\tilde{K}_\u03bd'(I_n - \tilde{H}_\u03b3)\tilde{K}_\u03bd\tilde{\u03b2}_\u03bd = c_\u03b3 \u2208 (0, \u221e),   (10)\n\nwhere \tilde{\u03b2}_\u03bd = (u, \u03b2_\u03bd')'. Note that I_n - \tilde{H}_\u03b3 is a symmetric idempotent matrix which projects onto the subspace of R^n orthogonal to the span of \tilde{K}_\u03b3. Given that (I_n - \tilde{H}_\u03b3)1_n = 0 and 1_n'K_\u03bd = 0, condition (10) reduces to\n\nlim_{n\u2192\u221e} (1/n) \u03b2_\u03bd'K_\u03bd'(I_n - H_\u03b3)K_\u03bd\u03b2_\u03bd = c_\u03b3 \u2208 (0, \u221e),\n\nwhere H_\u03b3 = K_\u03b3(K_\u03b3'K_\u03b3)^{-1}K_\u03b3'. We now have the following theorem whose proof is given in Sec. 3.\n\nTheorem 1 Consider the regression model (4) with the noninformative prior for (u, \u03c3^2) in (7). Assume that conditions (5) and (10) are satisfied and assume that g_\u03b3 can be written in the form\n\ng_\u03b3 = w_1(n_\u03b3)/w_2(n) with lim_{n\u2192\u221e} w_2(n) = \u221e and lim_{n\u2192\u221e} w_2'(n)/w_2(n) = 0   (11)\n\nfor particular choices of functions w_1 and w_2, where w_2 is differentiable and w_2'(n) is the first derivative w.r.t. n. When the true model M_\u03bd is not the null model, i.e., M_\u03bd \u2260 M_0, the posterior probabilities are consistent for model choice.\n\nTheorem 1 can provide an empirical methodology for setting g. For example, it is clear that g = 1/n, where w_1(n_\u03b3) = 1 and w_2(n) = n, satisfies condition (11).\nIt is interesting to consider the (asymptotic) relationship between the Bayes factor and the Bayesian information (or Schwarz) criterion (BIC) in our setting. Given two models M_\u03b3 and M_\u03ba, the difference between the BICs of these two models is given by\n\nS_{\u03b3\u03ba} = (n/2) ln(RSS_\u03ba/RSS_\u03b3) + ((n_\u03ba - n_\u03b3)/2) ln(n).\n\nWe thus obtain the following asymptotic relationship (the proof is given in Sec. 3):\n\nTheorem 2 Under the regression model and the conditions in Theorem 1, we have\n\nplim_{n\u2192\u221e} ln BF_{\u03b3\u03bd} / [S_{\u03b3\u03bd} + ((n_\u03bd - n_\u03b3)/2) ln w_2(n)] = 1.\n\nFurthermore, if M_\u03bd is not nested within M_\u03b3, then plim_{n\u2192\u221e} ln BF_{\u03b3\u03bd}/S_{\u03b3\u03bd} = 1. Here the probability limits are taken w.r.t. the model M_\u03bd.\n\n2.2 A Natural Conjugate Prior for (u, \u03c3^2)\n\nIn this section, we analyze consistency for model choice under a different prior for (u, \u03c3^2), namely the standard conjugate prior:\n\np(u, \u03c3^2) = N(u|0, \u03c3^2\u03b7^{-1}) Ga(\u03c3^{-2}|a_\u03c3/2, b_\u03c3/2),   (12)\n\nwhere Ga(u|a, b) is the Gamma distribution:\n\np(u) = (b^a/\u0393(a)) u^{a-1} exp(-bu), a > 0, b > 0.   (13)\n\nWe further assume that u and \u03b2_\u03b3 are independent. Then\n\n\tilde{\u03b2}_\u03b3 \u223c N_{n_\u03b3+1}(0, \u03c3^2\u03a3_\u03b3^{-1}) with \u03a3_\u03b3 = [\u03b7, 0; 0, g_\u03b3 K_{\u03b3\u03b3}].   (14)\n\nThe marginal likelihood of model M_\u03b3 is thus\n\np(y|M_\u03b3) = b_\u03c3^{a_\u03c3/2} \u0393((n+a_\u03c3)/2) / (\u03c0^{n/2}\u0393(a_\u03c3/2)) |M_\u03b3|^{-1/2} [b_\u03c3 + y'M_\u03b3^{-1}y]^{-(a_\u03c3+n)/2},\n\nwhere M_\u03b3 = I_n + \tilde{K}_\u03b3\u03a3_\u03b3^{-1}\tilde{K}_\u03b3'. The Bayes factor for M_\u03b3 versus M_\u03ba is given by\n\nBF_{\u03b3\u03ba} = [|M_\u03ba|/|M_\u03b3|]^{1/2} [(b_\u03c3 + y'M_\u03ba^{-1}y)/(b_\u03c3 + y'M_\u03b3^{-1}y)]^{(a_\u03c3+n)/2}.\n\nBecause M_\u03b3^{-1} = I_n - \tilde{K}_\u03b3\u0398_\u03b3^{-1}\tilde{K}_\u03b3' and |M_\u03b3| = |\u0398_\u03b3||\u03a3_\u03b3|^{-1} = \u03b7^{-1} g_\u03b3^{-n_\u03b3} |K_{\u03b3\u03b3}|^{-1}|\u0398_\u03b3|, where \u0398_\u03b3 = \tilde{K}_\u03b3'\tilde{K}_\u03b3 + \u03a3_\u03b3, we have\n\nBF_{\u03b3\u03ba} = (g_\u03b3^{n_\u03b3/2}/g_\u03ba^{n_\u03ba/2}) [|K_{\u03b3\u03b3}||\u0398_\u03ba| / (|K_{\u03ba\u03ba}||\u0398_\u03b3|)]^{1/2} [(b_\u03c3 + y'(I_n - \tilde{K}_\u03ba\u0398_\u03ba^{-1}\tilde{K}_\u03ba')y) / (b_\u03c3 + y'(I_n - \tilde{K}_\u03b3\u0398_\u03b3^{-1}\tilde{K}_\u03b3')y)]^{(a_\u03c3+n)/2}.\n\nTheorem 3 Consider the regression model (4) with the conjugate prior for (u, \u03c3^2) in (12). Assume that conditions (5) and (10) are satisfied and that g_\u03b3 takes the form in (11) with w_1(n_\u03b3) being a decreasing function. When the true model M_\u03bd is not the null model, i.e., M_\u03bd \u2260 M_0, the posterior probabilities are consistent for model choice.\n\nNote the difference between Theorem 1 and Theorem 3: in the latter theorem w_1(n_\u03b3) is required to be a decreasing function of n_\u03b3. Thanks to the fact that g_\u03b3 = w_1(n_\u03b3)/w_2(n), such a condition is equivalent to assuming that g_\u03b3 is a decreasing function of n_\u03b3. Again, g_\u03b3 = 1/n satisfies these conditions. Similarly to Theorem 2, we also have\n\nTheorem 4 Under the regression model and the conditions in Theorem 3, we have\n\nplim_{n\u2192\u221e} ln BF_{\u03b3\u03bd} / [S_{\u03b3\u03bd} + ((n_\u03bd - n_\u03b3)/2) ln w_2(n)] = 1.\n\nFurthermore, if M_\u03bd is not nested within M_\u03b3, then plim_{n\u2192\u221e} ln BF_{\u03b3\u03bd}/S_{\u03b3\u03bd} = 1. Here the probability limits are taken w.r.t. the model M_\u03bd.\n\n3 Proofs\n\nIn order to prove these theorems, we first give the following lemmas.\n\nLemma 1 Let A = [A_{11}, A_{12}; A_{21}, A_{22}] be symmetric and positive definite, and let B = [A_{11}^{-1}, 0; 0, 0] have the same size as A. 
Then A^{-1} - B is positive semidefinite.\n\nProof The proof follows readily once we express A^{-1} and B as\n\nA^{-1} = [I, -A_{11}^{-1}A_{12}; 0, I] [A_{11}^{-1}, 0; 0, A_{22\u00b71}^{-1}] [I, 0; -A_{21}A_{11}^{-1}, I],\n\nB = [I, -A_{11}^{-1}A_{12}; 0, I] [A_{11}^{-1}, 0; 0, 0] [I, 0; -A_{21}A_{11}^{-1}, I],\n\nwhere A_{22\u00b71} = A_{22} - A_{21}A_{11}^{-1}A_{12} is also positive definite.\n\nThe following two lemmas were presented by [1].\n\nLemma 2 Under the sampling model M_\u03bd: (i) if M_\u03bd is nested within or equal to a model M_\u03b3, i.e., M_\u03bd \u2286 M_\u03b3, then\n\nplim_{n\u2192\u221e} RSS_\u03b3/n = \u03c3^2;\n\nand (ii) for any model M_\u03b3 that does not contain M_\u03bd, if (10) is satisfied, then\n\nplim_{n\u2192\u221e} RSS_\u03b3/n = \u03c3^2 + c_\u03b3.\n\nLemma 3 Under the sampling model M_\u03bd, if M_\u03bd is nested within a model M_\u03b3, i.e., M_\u03bd \u2282 M_\u03b3, then n ln(RSS_\u03bd/RSS_\u03b3) \u2192_d \u03c7^2_{n_\u03b3-n_\u03bd} as n \u2192 \u221e, where \u2192_d denotes convergence in distribution.\n\nLemma 4 Under the regression model (4), if lim_{n\u2192\u221e} g_\u03b3(n) = 0 and condition (5) is satisfied, then\n\nplim_{n\u2192\u221e} [(1 - F_\u03b3^2)||y - \u00afy1_n||^2 - RSS_\u03b3] = 0.\n\nProof It is easy to compute\n\n[(1 - F_\u03b3^2)||y - \u00afy1_n||^2 - RSS_\u03b3]/\u03c3^2 = y'K_\u03b3[(K_\u03b3'K_\u03b3)^{-1} - (K_\u03b3'K_\u03b3 + g_\u03b3(n)K_{\u03b3\u03b3})^{-1}]K_\u03b3'y / \u03c3^2.\n\nSince both K_\u03b3'K_\u03b3/n and K_{\u03b3\u03b3} are positive definite, there exists an n_\u03b3\u00d7n_\u03b3 nonsingular matrix A_n and an n_\u03b3\u00d7n_\u03b3 positive diagonal matrix \u039b_{n_\u03b3} such that K_\u03b3'K_\u03b3/n = A_n'\u039b_{n_\u03b3}A_n and K_{\u03b3\u03b3} = A_n'A_n. Letting z = \u03c3^{-1}(n\u039b_{n_\u03b3})^{-1/2}(A_n')^{-1}K_\u03b3'y, we have\n\nz \u223c N_{n_\u03b3}(\u03c3^{-1}(n\u039b_{n_\u03b3})^{1/2}A_n\u03b2, I_{n_\u03b3})\n\nand\n\nf(z) \u225c [(1 - F_\u03b3^2)||y - \u00afy1_n||^2 - RSS_\u03b3]/\u03c3^2 = z'z - z'n\u039b_{n_\u03b3}[n\u039b_{n_\u03b3} + g_\u03b3(n)I_{n_\u03b3}]^{-1}z = \sum_{j=1}^{n_\u03b3} g_\u03b3(n)/(n\u03bb_j(n) + g_\u03b3(n)) z_j^2.\n\nNote that z_j^2 follows a noncentral chi-square distribution, \u03c7^2(1, v_j), with v_j = n\u03bb_j(n)(a_j(n)'\u03b2)^2/\u03c3^2, where \u03bb_j(n) > 0 is the jth diagonal element of \u039b_{n_\u03b3} and a_j(n) is the jth column of A_n. We thus have E(z_j^2) = 1 + v_j and Var(z_j^2) = 2(1 + 2v_j). It follows from condition (5) that\n\nlim_{n\u2192\u221e} K_\u03b3'K_\u03b3/n = lim_{n\u2192\u221e} A_n'\u039b_{n_\u03b3}A_n = A'\u039b_\u03b3 A,\n\nwhere A is nonsingular and \u039b_\u03b3 is a diagonal matrix with positive diagonal elements, and both are independent of n. Hence,\n\nlim_{n\u2192\u221e} E[g_\u03b3(n)/(n\u03bb_j(n) + g_\u03b3(n)) z_j^2] = 0 and lim_{n\u2192\u221e} Var[g_\u03b3(n)/(n\u03bb_j(n) + g_\u03b3(n)) z_j^2] = 0.\n\nWe thus have plim_{n\u2192\u221e} f(z) = 0. The proof is completed.\n\nLemma 5 Assume that M_\u03ba is nested within M_\u03b3 and g_\u03b3 is a decreasing function of n_\u03b3. Then\n\ny'(I_n - \tilde{K}_\u03ba\u0398_\u03ba^{-1}\tilde{K}_\u03ba')y \u2265 y'(I_n - \tilde{K}_\u03b3\u0398_\u03b3^{-1}\tilde{K}_\u03b3')y.\n\nProof Since M_\u03ba is nested within M_\u03b3, we express \tilde{K}_\u03b3 = [\tilde{K}_\u03ba, K_2] without loss of generality. We now write \u03a3_\u03b3 = [\u03a3_\u03b3^{11}, \u03a3_\u03b3^{12}; \u03a3_\u03b3^{21}, \u03a3_\u03b3^{22}], where \u03a3_\u03b3^{11} is of size n_\u03ba\u00d7n_\u03ba. Hence, we have\n\n\u0398_\u03b3 = [\tilde{K}_\u03ba'\tilde{K}_\u03ba + \u03a3_\u03b3^{11}, \tilde{K}_\u03ba'K_2 + \u03a3_\u03b3^{12}; K_2'\tilde{K}_\u03ba + \u03a3_\u03b3^{21}, K_2'K_2 + \u03a3_\u03b3^{22}].\n\nBecause 0 < g_\u03b3 \u2264 g_\u03ba, \u03a3_\u03ba - \u03a3_\u03b3^{11} = [0, 0; 0, (g_\u03ba - g_\u03b3)K_{\u03ba\u03ba}] is positive semidefinite, and hence (\tilde{K}_\u03ba'\tilde{K}_\u03ba + \u03a3_\u03b3^{11})^{-1} - (\tilde{K}_\u03ba'\tilde{K}_\u03ba + \u03a3_\u03ba)^{-1} is positive semidefinite. It follows from Lemma 1 that \u0398_\u03b3^{-1} - [(\tilde{K}_\u03ba'\tilde{K}_\u03ba + \u03a3_\u03b3^{11})^{-1}, 0; 0, 0] is positive semidefinite. Consequently, \u0398_\u03b3^{-1} - [(\tilde{K}_\u03ba'\tilde{K}_\u03ba + \u03a3_\u03ba)^{-1}, 0; 0, 0] is also positive semidefinite. We thus have\n\ny'(I_n - \tilde{K}_\u03ba\u0398_\u03ba^{-1}\tilde{K}_\u03ba')y - y'(I_n - \tilde{K}_\u03b3\u0398_\u03b3^{-1}\tilde{K}_\u03b3')y = y'\tilde{K}_\u03b3(\u0398_\u03b3^{-1} - [(\tilde{K}_\u03ba'\tilde{K}_\u03ba + \u03a3_\u03ba)^{-1}, 0; 0, 0])\tilde{K}_\u03b3'y \u2265 0.\n\n3.1 Proof of Theorem 1\n\nWe now prove Theorem 1. Consider that\n\nln BF_{\u03b3\u03bd} = (1/2) ln(|Q_\u03bd|/|Q_\u03b3|) + ((n-1)/2) ln[(1 - F_\u03bd^2)/(1 - F_\u03b3^2)].\n\nBecause\n\n|Q_\u03b3|^{-1/2} = g_\u03b3^{n_\u03b3/2}|K_{\u03b3\u03b3}|^{1/2} / |g_\u03b3 K_{\u03b3\u03b3} + K_\u03b3'K_\u03b3|^{1/2},\n\nwe have\n\nln(|Q_\u03bd|/|Q_\u03b3|) = ln[w_1(n_\u03b3)^{n_\u03b3}/w_1(n_\u03bd)^{n_\u03bd}] + ln(|K_{\u03b3\u03b3}|/|K_{\u03bd\u03bd}|) + ln[|w_1(n_\u03bd)/(nw_2(n)) K_{\u03bd\u03bd} + (1/n)K_\u03bd'K_\u03bd| / |w_1(n_\u03b3)/(nw_2(n)) K_{\u03b3\u03b3} + (1/n)K_\u03b3'K_\u03b3|] + (n_\u03bd - n_\u03b3) ln(nw_2(n)).\n\nBecause\n\n\u03b1 = lim_{n\u2192\u221e} ln[|w_1(n_\u03bd)/(nw_2(n)) K_{\u03bd\u03bd} + (1/n)K_\u03bd'K_\u03bd| / |w_1(n_\u03b3)/(nw_2(n)) K_{\u03b3\u03b3} + (1/n)K_\u03b3'K_\u03b3|] = lim_{n\u2192\u221e} ln[|(1/n)K_\u03bd'K_\u03bd| / |(1/n)K_\u03b3'K_\u03b3|] \u2208 (-\u221e, \u221e),\n\nit is easily proven that\n\nlim_{n\u2192\u221e} (1/2) ln(|Q_\u03bd|/|Q_\u03b3|) = { \u221e if n_\u03b3 < n_\u03bd; -\u221e if n_\u03b3 > n_\u03bd; const if n_\u03b3 = n_\u03bd },\n\nwhere const = \u03b1/2 + (1/2) ln(|K_{\u03b3\u03b3}|/|K_{\u03bd\u03bd}|). According to Lemma 4, we also have\n\nplim_{n\u2192\u221e} ((n-1)/2) ln[(1 - F_\u03bd^2)/(1 - F_\u03b3^2)] = plim_{n\u2192\u221e} ((n-1)/2) ln[(1 - F_\u03bd^2)||y - \u00afy1_n||^2 / ((1 - F_\u03b3^2)||y - \u00afy1_n||^2)] = plim_{n\u2192\u221e} ((n-1)/2) ln(RSS_\u03bd/RSS_\u03b3).   (15)\n\nNow consider the following two cases:\n(a) M_\u03bd is not nested within M_\u03b3: From Lemma 2, we obtain\n\nplim_{n\u2192\u221e} ln(RSS_\u03bd/RSS_\u03b3) = plim_{n\u2192\u221e} ln[(RSS_\u03bd/n)/(RSS_\u03b3/n)] = ln[\u03c3^2/(\u03c3^2 + c_\u03b3)].\n\nMoreover, we have the following limit\n\nlim_{n\u2192\u221e} ((n-1)/2) [ln(\u03c3^2/(\u03c3^2 + c_\u03b3)) + (n_\u03bd - n_\u03b3)/(n-1) ln(nw_2(n))] = -\u221e\n\ndue to lim_{n\u2192\u221e} (n_\u03bd - n_\u03b3)/(n-1) ln(nw_2(n)) = lim_{n\u2192\u221e} (n_\u03bd - n_\u03b3)[w_2(n) + nw_2'(n)]/(nw_2(n)) = 0 and \u03c3^2/(\u03c3^2 + c_\u03b3) < 1. This implies that lim_{n\u2192\u221e} ln BF_{\u03b3\u03bd} = -\u221e. Thus we obtain lim_{n\u2192\u221e} BF_{\u03b3\u03bd} = 0.\n(b) M_\u03bd is nested within M_\u03b3: We always have n_\u03b3 > n_\u03bd. By Lemma 3, we have (n-1) ln(RSS_\u03bd/RSS_\u03b3) \u2192_d \u03c7^2_{n_\u03b3-n_\u03bd}. Hence, (RSS_\u03bd/RSS_\u03b3)^{(n-1)/2} \u2192_d exp(\u03c7^2_{n_\u03b3-n_\u03bd}/2). Combining this result with (15) leads to a zero limit for BF_{\u03b3\u03bd}.\n\n3.2 Proof of Theorem 2\n\nUsing the same notations as those in Theorem 1, we have\n\nC_{\u03b3\u03bd} = ln BF_{\u03b3\u03bd} / [S_{\u03b3\u03bd} + ((n_\u03bd - n_\u03b3)/2) ln w_2(n)] = [((n-1)/n) ln((1 - F_\u03bd^2)/(1 - F_\u03b3^2)) + ((n_\u03bd - n_\u03b3)/n) ln(nw_2(n)) + (2/n) const] / [ln(RSS_\u03bd/RSS_\u03b3) + ((n_\u03bd - n_\u03b3)/n) ln(nw_2(n))].\n\n(a) M_\u03bd is not nested within M_\u03b3: From Lemma 4, we obtain\n\nplim_{n\u2192\u221e} C_{\u03b3\u03bd} = lim_{n\u2192\u221e} [ln(\u03c3^2/(\u03c3^2 + c_\u03b3)) + ((n_\u03bd - n_\u03b3)/n) ln(nw_2(n))] / [ln(\u03c3^2/(\u03c3^2 + c_\u03b3)) + ((n_\u03bd - n_\u03b3)/n) ln(nw_2(n))] = 1.\n\nIn this case, we also have\n\nplim_{n\u2192\u221e} ln BF_{\u03b3\u03bd}/S_{\u03b3\u03bd} = lim_{n\u2192\u221e} [ln(\u03c3^2/(\u03c3^2 + c_\u03b3)) + ((n_\u03bd - n_\u03b3)/n) ln(nw_2(n))] / [ln(\u03c3^2/(\u03c3^2 + c_\u03b3)) + ((n_\u03bd - n_\u03b3)/n) ln(n)] = 1.\n\n(b) M_\u03bd is nested within M_\u03b3: We obtain\n\nplim_{n\u2192\u221e} C_{\u03b3\u03bd} = plim_{n\u2192\u221e} [(n-1) ln((1 - F_\u03bd^2)/(1 - F_\u03b3^2)) + (n_\u03bd - n_\u03b3) ln(nw_2(n)) + 2\u00b7const] / [n ln(RSS_\u03bd/RSS_\u03b3) + (n_\u03bd - n_\u03b3) ln(nw_2(n))] = 1\n\ndue to n_\u03b3 > n_\u03bd and n ln(RSS_\u03bd/RSS_\u03b3) \u2192_d \u03c7^2_{n_\u03b3-n_\u03bd}.\n\n3.3 Proof of Theorem 3\n\nWe now sketch the proof of Theorem 3. For the case that M_\u03bd is not nested within M_\u03b3, the proof is similar to that of Theorem 1. When M_\u03bd is nested within M_\u03b3, Lemma 5 shows the following relationship:\n\nln[(b_\u03c3 + y'(I_n - \tilde{K}_\u03bd\u0398_\u03bd^{-1}\tilde{K}_\u03bd')y) / (b_\u03c3 + y'(I_n - \tilde{K}_\u03b3\u0398_\u03b3^{-1}\tilde{K}_\u03b3')y)] \u2265 0.\n\nWe thus have\n\n0 \u2264 plim_{n\u2192\u221e} ((a_\u03c3+n)/2) ln[(b_\u03c3 + y'(I_n - \tilde{K}_\u03bd\u0398_\u03bd^{-1}\tilde{K}_\u03bd')y) / (b_\u03c3 + y'(I_n - \tilde{K}_\u03b3\u0398_\u03b3^{-1}\tilde{K}_\u03b3')y)] \u2264 plim_{n\u2192\u221e} ((a_\u03c3+n)/2) ln[(y'(I_n - \tilde{K}_\u03bd\u0398_\u03bd^{-1}\tilde{K}_\u03bd')y) / (y'(I_n - \tilde{K}_\u03b3\u0398_\u03b3^{-1}\tilde{K}_\u03b3')y)] = plim_{n\u2192\u221e} ((a_\u03c3+n)/2) ln[(y'(I_n - \tilde{H}_\u03bd)y) / (y'(I_n - \tilde{H}_\u03b3)y)] \u2208 (0, \u221e).\n\nFrom this result the proof follows readily.\n\n4 Conclusions\n\nIn this paper we have presented a frequentist analysis of a Bayesian model choice procedure for sparse regression. 
We have captured sparsity by a particular choice of prior distribution which we have referred to as a \u201cSilverman g-prior.\u201d This prior emerges naturally from the RKHS perspective. It is similar in spirit to the Zellner g-prior, which has been widely used for Bayesian variable selection and Bayesian model selection due to its computational tractability in the evaluation of marginal likelihoods [6, 2]. Our analysis provides a theoretical foundation for the Silverman g-prior and suggests that it can play a similarly wide-ranging role in the development of fully Bayesian kernel methods.\n\nReferences\n[1] C. Fern\u00e1ndez, E. Ley, and M. F. J. Steel. Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100:381\u2013427, 2001.\n[2] E. I. George and R. E. McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, 7:339\u2013374, 1997.\n[3] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773\u2013795, 1995.\n[4] F. Liang, R. Paulo, G. Molina, M. A. Clyde, and J. O. Berger. Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481):410\u2013423, 2008.\n[5] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with discussion). Journal of the Royal Statistical Society, B, 47(1):1\u201352, 1985.\n[6] M. Smith and R. Kohn. Nonparametric regression using Bayesian variable selection. Journal of Econometrics, 75:317\u2013344, 1996.\n[7] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.\n[8] M. West. Bayesian factor regression models in the \u201clarge p, small n\u201d paradigm. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, editors, Bayesian Statistics 7, pages 723\u2013732. Oxford University Press, 2003.\n[9] A. Zellner. 
On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In P. K. Goel and A. Zellner, editors, Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, pages 233\u2013243. North-Holland, Amsterdam, 1986.\n", "award": [], "sourceid": 300, "authors": [{"given_name": "Zhihua", "family_name": "Zhang", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Dit-Yan", "family_name": "Yeung", "institution": null}]}