{"title": "Integrated Non-Factorized Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2481, "page_last": 2489, "abstract": "We present a non-factorized variational method for full posterior inference in Bayesian hierarchical models, with the goal of capturing the posterior variable dependencies via efficient and possibly parallel computation. Our approach unifies the integrated nested Laplace approximation (INLA) under the variational framework. The proposed method is applicable in more challenging scenarios than typically assumed by INLA, such as Bayesian Lasso, which is characterized by the non-differentiability of the $\\ell_{1}$ norm arising from independent Laplace priors. We derive an upper bound for the Kullback-Leibler divergence, which yields a fast closed-form solution via decoupled optimization. Our method is a reliable analytic alternative to Markov chain Monte Carlo (MCMC), and it results in a tighter evidence lower bound than that of mean-field variational Bayes (VB) method.", "full_text": "Integrated Non-Factorized Variational Inference\n\nShaobo Han\n\nshaobo.han@duke.edu\n\nDuke University\n\nDurham, NC 27708\n\nXuejun Liao\nDuke University\n\nDurham, NC 27708\nxjliao@duke.edu\n\nLawrence Carin\nDuke University\n\nDurham, NC 27708\nlcarin@duke.edu\n\nAbstract\n\nWe present a non-factorized variational method for full posterior inference in\nBayesian hierarchical models, with the goal of capturing the posterior variable de-\npendencies via ef\ufb01cient and possibly parallel computation. Our approach uni\ufb01es\nthe integrated nested Laplace approximation (INLA) under the variational frame-\nwork. The proposed method is applicable in more challenging scenarios than typ-\nically assumed by INLA, such as Bayesian Lasso, which is characterized by the\nnon-differentiability of the (cid:96)1 norm arising from independent Laplace priors. 
We derive an upper bound for the Kullback-Leibler divergence, which yields a fast closed-form solution via decoupled optimization. Our method is a reliable analytic alternative to Markov chain Monte Carlo (MCMC), and it results in a tighter evidence lower bound than that of the mean-field variational Bayes (VB) method.\n\n1 Introduction\n\nMarkov chain Monte Carlo (MCMC) methods [1] have been dominant tools for posterior analysis in Bayesian inference. Although MCMC can provide numerical representations of the exact posterior, it usually requires intensive runs and is therefore time consuming. Moreover, assessment of a chain's convergence is a well-known challenge [2]. There have been many efforts dedicated to developing deterministic alternatives, including the Laplace approximation [3], variational methods [4], and expectation propagation (EP) [5]. These methods each have their merits and drawbacks [6]. More recently, the integrated nested Laplace approximation (INLA) [7] has emerged as an encouraging method for full posterior inference, which achieves computational accuracy and speed by taking advantage of a (typically) low-dimensional hyperparameter space to perform efficient numerical integration and parallel computation on a discrete grid. However, the Gaussian assumption on the latent process prevents INLA from being applied to more general models outside the family of latent Gaussian models (LGMs).\n\nIn the machine learning community, variational inference has received significant use as an efficient alternative to MCMC. It is also attractive because it provides a closed-form lower bound to the model evidence. 
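The closed-form lower bound on the model evidence can be made concrete on a toy conjugate model. The following sketch is our own illustration (the model and all numbers are assumptions, not material from the paper): it verifies numerically that a Gaussian variational bound never exceeds the log-evidence and that the gap closes at the exact posterior.

```python
import numpy as np

# Toy conjugate model (our illustration): y ~ N(x, 1) with prior x ~ N(0, 1),
# so the exact evidence is p(y) = N(y; 0, 2) and the exact posterior is
# p(x|y) = N(y/2, 1/2). For a Gaussian q(x) = N(m, s2), the evidence lower
# bound of ln p(y) is available in closed form.
def elbo(y, m, s2):
    e_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + s2)  # E_q[ln p(y|x)]
    e_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s2)      # E_q[ln p(x)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)                 # -E_q[ln q(x)]
    return e_lik + e_prior + entropy

y = 1.3
log_evidence = -0.5 * np.log(4 * np.pi) - y ** 2 / 4.0
# The bound never exceeds ln p(y), and the gap (the KL divergence) vanishes
# when q equals the exact posterior.
assert elbo(y, 0.1, 0.3) <= log_evidence
assert abs(elbo(y, y / 2.0, 0.5) - log_evidence) < 1e-12
```

Here every expectation is Gaussian, so the bound is exact in closed form; in the models treated below the bound must instead be optimized over the variational parameters.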
An active area of research focuses on developing more efficient and accurate variational inference algorithms, for example, collapsed inference [8, 9], non-conjugate models [10, 11], multimodal posteriors [12], and fast-converging methods [13, 14].\n\nThe goal of this paper is to develop a reliable and efficient deterministic inference method that both achieves the accuracy of MCMC and retains its inferential flexibility. We present a promising variational inference method that does not require the widely used factorized approximation to the posterior. Inspired by INLA, we propose a hybrid continuous-discrete variational approximation, which enables us to preserve full posterior dependencies and is therefore more accurate than the mean-field variational Bayes (VB) method [15]. The continuous variational approximation is flexible enough for various kinds of latent fields, which makes our method applicable to more general settings than assumed by INLA. The discretization of the low-dimensional hyperparameter space can overcome the potential non-conjugacy and multimodal posterior problems in variational inference.\n\n2 Integrated Non-Factorized Variational Bayesian Inference\n\nConsider a general Bayesian hierarchical model with observations y, latent variables x, and hyperparameters θ. The exact joint posterior p(x, θ|y) = p(y, x, θ)/p(y) can be difficult to evaluate, since usually the normalization p(y) = ∫∫ p(y, x, θ) dx dθ is intractable and numerical integration over x is too expensive. To address this problem, we find a variational approximation to the exact posterior by minimizing the Kullback-Leibler (KL) divergence KL(q(x, θ|y) || p(x, θ|y)). 
Applying Jensen's inequality to the log-marginal data likelihood, one obtains\n\nln p(y) = ln ∫∫ q(x, θ|y) [p(y, x, θ)/q(x, θ|y)] dx dθ ≥ ∫∫ q(x, θ|y) ln [p(y, x, θ)/q(x, θ|y)] dx dθ := L,  (1)\n\nwhich holds for any proposed approximating distribution q(x, θ|y). L is termed the evidence lower bound (ELBO) [4]. The gap in Jensen's inequality is exactly the KL divergence; minimizing the KL divergence is therefore equivalent to maximizing the ELBO.\n\nTo make the variational problem tractable, the variational distribution q(x, θ|y) is commonly required to take a restricted form. For example, the mean-field variational Bayes (VB) method assumes that the distribution factorizes into a product of marginals [15], q(x, θ|y) = q(x)q(θ), which ignores the posterior dependencies among the latent variables (including hyperparameters) and therefore impairs the accuracy of the approximate posterior distribution.\n\n2.1 Hybrid Continuous and Discrete Variational Approximations\n\nWe consider a non-factorized approximation to the posterior, q(x, θ|y) = q(x|y, θ)q(θ|y), to preserve the posterior dependency structure. Unfortunately, this generally leads to a nontrivial optimization problem,\n\nq⋆(x, θ|y) = argmin_{q(x,θ|y)} KL(q(x, θ|y) || p(x, θ|y))\n= argmin_{q(x,θ|y)} ∫∫ q(x, θ|y) ln [q(x, θ|y)/p(x, θ|y)] dx dθ\n= argmin_{q(x|y,θ), q(θ|y)} ∫ q(θ|y) [∫ q(x|θ, y) ln (q(x|θ, y)/p(x, θ|y)) dx + ln q(θ|y)] dθ.  (2)\n\nWe propose a hybrid continuous-discrete variational distribution q(x|y, θ)q_d(θ|y), where q_d(θ|y) is a finite mixture of Dirac-delta distributions, q_d(θ|y) = Σ_k ω_k δ_{θ_k}(θ), with ω_k = q_d(θ_k|y) and Σ_k ω_k = 1. 
Clearly, q_d(θ|y) is an approximation of q(θ|y) obtained by discretizing the continuous (typically low-dimensional) parameter space of θ using a grid G with finitely many grid points1. One can always reduce the discretization error by increasing the number of points in G. To obtain a useful discretization with a manageable number of grid points, the dimension of θ cannot be too large; the same assumption is made in INLA [7], but here we remove INLA's Gaussian prior assumption on the latent effects x.\n\nThe hybrid variational approximation is found by minimizing the KL divergence, i.e.,\n\nKL(q(x, θ|y) || p(x, θ|y)) = Σ_k q_d(θ_k|y) [∫ q(x|θ_k, y) ln (q(x|y, θ_k)/p(x, θ_k|y)) dx + ln q_d(θ_k|y)],  (3)\n\nwhich leads to the approximate marginal posterior\n\nq(x|y) = Σ_k q(x|y, θ_k) q_d(θ_k|y).  (4)\n\nAs will become clear shortly, the problem in (3) can be much easier to solve than that in (2). We give the name integrated non-factorized variational Bayes (INF-VB) to the method of approximating p(x, θ|y) with q(x|y, θ)q_d(θ|y) by solving the optimization problem in (3). The use of q_d(θ) is equivalent to numerical integration, which is a key idea of INLA [7]; see Section 2.3 for details. It has also been used in sampling methods when samples are not easy to obtain directly [16]. 
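As a concrete sketch of the hybrid distribution (our own toy illustration; the grid values, weights, and conditional moments below are placeholders, not quantities from the paper), the approximate marginal (4) is simply a finite Gaussian mixture over the grid:

```python
import numpy as np
from scipy import stats

# Hybrid approximation sketch: each grid point theta_k carries a Gaussian
# conditional q(x|y, theta_k) and a weight w_k = q_d(theta_k|y); the
# approximate marginal q(x|y) is the finite mixture (4).
theta_grid = np.linspace(0.1, 2.0, 50)       # grid G over a 1-D hyperparameter
w = np.exp(-(theta_grid - 1.0) ** 2)         # unnormalized weights (placeholder values)
w /= w.sum()                                 # enforce sum_k w_k = 1
mu_k = 0.5 / theta_grid                      # per-grid-point conditional means (placeholder)
sd_k = 1.0 / np.sqrt(1.0 + theta_grid)       # per-grid-point conditional std. deviations

def q_marginal(x):
    """q(x|y) = sum_k w_k N(x; mu_k, sd_k^2): a finite mixture of Gaussians."""
    return float(np.sum(w * stats.norm.pdf(x, loc=mu_k, scale=sd_k)))

# The mixture is a proper density: it integrates to one (quadrature check).
xs = np.linspace(-10.0, 10.0, 4001)
vals = np.array([q_marginal(x) for x in xs])
mass = vals.sum() * (xs[1] - xs[0])
assert abs(mass - 1.0) < 1e-4
```

Because the mixture is available analytically, marginal moments and credible intervals of x follow directly from the per-grid-point Gaussians and weights.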
Here we use this idea in variational inference to overcome the potential non-conjugacy and multimodal posterior problems in θ.\n\n2.2 Variational Optimization\n\nThe proposed INF-VB method consists of two algorithmic steps:\n\n1The grid points need not be uniformly spaced; one may place more grid points in regions of potentially high mass if credible prior information is available.\n\n• Step 1: Solve multiple independent optimization problems, one for each grid point in G, to obtain the optimal q(x|y, θ_k), ∀θ_k ∈ G, i.e.,\n\nq⋆(x|y, θ_k) = argmin_{q(x|y,θ_k)} Σ_k q_d(θ_k|y) [∫ q(x|θ_k, y) ln (q(x|y, θ_k)/p(x, θ_k|y)) dx + ln q_d(θ_k|y)]\n= argmin_{q(x|y,θ_k)} ∫ q(x|θ_k, y) ln (q(x|y, θ_k)/p(x|y, θ_k)) dx\n= argmin_{q(x|y,θ_k)} KL(q(x|y, θ_k) || p(x|y, θ_k)).  (5)\n\nThe optimal variational distribution q⋆(x|y, θ_k) is the exact posterior p(x|y, θ_k). 
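For intuition, Step 1 can be carried out exactly in a linear-Gaussian toy model, where p(x|y, θ_k) is Gaussian in closed form. This is an assumption made purely for illustration (the Laplace-prior case of Section 3 instead requires the variational Gaussian approximation); all names and sizes below are ours.

```python
import numpy as np

# Step 1 sketch for a case where p(x|y, theta_k) is exactly Gaussian:
# y = Phi x + e, e ~ N(0, sigma2 I), x ~ N(0, tau2 I), theta_k = (sigma2, tau2).
# Then q*(x|y, theta_k) = p(x|y, theta_k) = N(m_k, S_k) in closed form.
rng = np.random.default_rng(0)
n, p = 30, 5
Phi = rng.standard_normal((n, p))
y = Phi @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def conditional_posterior(sigma2, tau2):
    S_inv = Phi.T @ Phi / sigma2 + np.eye(p) / tau2   # posterior precision
    S = np.linalg.inv(S_inv)                          # posterior covariance
    m = S @ (Phi.T @ y) / sigma2                      # posterior mean
    return m, S

# The grid-point problems are independent and can be solved in parallel.
grid = [(0.05, 1.0), (0.1, 1.0), (0.1, 2.0)]
posteriors = [conditional_posterior(s2, t2) for s2, t2 in grid]
assert all(np.all(np.linalg.eigvalsh(S) > 0) for _, S in posteriors)
```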
In case it is not available, we may further constrain q(x|y, θ_k) to a parametric form; examples include: (i) a multivariate Gaussian [17], if posterior asymptotic normality holds; (ii) skew-normal densities [6, 18]; or (iii) an inducing factorization assumption (see Ch. 10.2.5 in [19]), if the latent variables x are conditionally independent or their dependencies are negligible.\n\n• Step 2: Given {q⋆(x|y, θ_k) : θ_k ∈ G} obtained in Step 1, one solves\n\n{q⋆_d(θ_k|y)} = argmin_{q_d(θ_k|y)} Σ_k q_d(θ_k|y) [∫ q⋆(x|θ_k, y) ln (q⋆(x|y, θ_k)/p(x, θ_k|y)) dx + ln q_d(θ_k|y)],\n\nwhere the summand defines l(q_d(θ_k|y)) = l(ω_k). Setting ∂l(ω_k)/∂ω_k = 0 (with ∂²l(ω_k)/∂ω_k² > 0), one obtains\n\nq⋆_d(θ_k|y) ∝ exp(∫ q⋆(x|y, θ_k) ln [p(x, θ_k|y)/q⋆(x|y, θ_k)] dx).  (6)\n\nNote that since q_d(θ|y) is evaluated at a grid of points θ_k ∈ G, it needs to be known only up to a multiplicative constant, which can be identified from the normalization constraint Σ_k q⋆_d(θ_k|y) = 1. The integral in (6) can be evaluated analytically in the application considered in Section 3.\n\n2.3 Links between INF-VB and INLA\n\nINF-VB is a variational extension of the integrated nested Laplace approximation (INLA) [7], a deterministic Bayesian inference method for latent Gaussian models (LGMs), to the case where p(x|θ) exhibits strong non-Gaussianity and hence p(θ|y) may not be approximated accurately by Laplace's method of integration [20]. To see the connection, we briefly review the three computation steps of INLA and compare them with INF-VB below:\n\n1. 
Based on the Laplace approximation [3], INLA seeks a Gaussian distribution q_G(x|y, θ_k) = N(x; x⋆(θ_k), H(x⋆(θ_k))⁻¹), ∀θ_k ∈ G, that captures most of the probabilistic mass locally, where x⋆(θ_k) = argmax_x p(x|y, θ_k) is the posterior mode and H(x⋆(θ_k)) is the Hessian matrix of the log posterior evaluated at the mode. By contrast, INF-VB with the Gaussian parametric constraint on q⋆(x|y, θ_k) provides a global variational Gaussian approximation q_VG(x|y, θ_k), in the sense that the conditions of the Laplace approximation hold on average [17]. As we will see next, the averaging operator plays a crucial role in handling the non-differentiable ℓ1 norm arising from the double-exponential priors.\n\n2. INLA computes the marginal posteriors of θ based on Laplace's method of integration [20],\n\nq_LA(θ|y) = [p(x, θ|y)/q(x|y, θ)] |_{x=x⋆(θ)}.  (7)\n\nThe quality of this approximation depends on the accuracy of q(x|y, θ). When q(x|y, θ) = p(x|y, θ), q_LA(θ|y) equals p(θ|y), according to Bayes' rule. It has been shown in [7] that (7) is accurate enough for latent Gaussian models with q_G(x|y, θ). Alternatively, the variationally optimal posterior q⋆_d(θ|y) of INF-VB (6) can be derived as a lower bound of the true posterior p(θ|y) by Jensen's inequality:\n\nln p(θ|y) = ln [∫ (p(x, θ|y)/q(x|y, θ)) q(x|y, θ) dx] ≥ ∫ ln [p(x, θ|y)/q(x|y, θ)] q(x|y, θ) dx = ln q⋆_d(θ|y).  (8)\n\nIts optimality justification in Section 2.2 also explains the often-observed empirical success of hyperparameter selection based on the ELBO of ln p(y|θ) [13], when the first level of Bayesian inference is performed, i.e., when only the conditional posterior q(x|y, θ) with fixed θ is of interest. 
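In practice, the per-grid-point integrals entering (6) behave like ELBO values known only up to a common additive constant, so the discrete weights are best normalized in log space. A hedged sketch follows (the numeric scores are invented for illustration and are not results from the paper):

```python
import numpy as np

# Step 2 sketch: normalize q_d(theta_k|y) ∝ exp(L_k) stably via log-sum-exp,
# where L_k is the ELBO-like score of grid point theta_k.
def grid_weights(log_scores):
    log_scores = np.asarray(log_scores, dtype=float)
    shifted = log_scores - log_scores.max()   # guard against overflow/underflow
    w = np.exp(shifted)
    return w / w.sum()

L_k = np.array([-1050.2, -1047.9, -1048.4, -1120.0])   # illustrative placeholders
w = grid_weights(L_k)
assert abs(w.sum() - 1.0) < 1e-12
assert w.argmax() == 1                        # the largest score dominates
# A common additive constant (e.g., the unknown ln p(y)) leaves weights unchanged.
assert np.allclose(w, grid_weights(L_k + 123.4))
```

The invariance to a common additive constant is exactly why the weights only need to be known up to a multiplicative constant, as noted after (6).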
In Section 4 we compare the accuracies of both (6) and (7) for hyperparameter learning.\n\n3. INLA obtains the marginal distributions of interest, e.g., q(x|y), by numerically integrating out θ: q(x|y) = Σ_k q(x|y, θ_k) q(θ_k|y) ∆_k, with area weights ∆_k. In INF-VB, we have q_d(θ|y) = Σ_k ω_k δ_{θ_k}(θ). Letting ω_k = q(θ_k|y)∆_k, we immediately have\n\nq(x|y) = ∫ q(x|y, θ) q_d(θ|y) dθ = Σ_k q(x|y, θ_k) q_d(θ_k|y) = Σ_k q(x|y, θ_k) q(θ_k|y) ∆_k.  (9)\n\nThis Dirac-delta mixture interpretation of numerical integration also enables us to quantify the accuracy of the INLA approximation q_G(x|y, θ)q_LA(θ|y) using the KL divergence to p(x, θ|y) under the variational framework.\n\nIn contrast to INLA, INF-VB provides q(x|y, θ) and q_d(θ|y) that are both optimal in the sense of minimum Kullback-Leibler divergence within the proposed hybrid distribution family. In this paper we focus on full posterior inference for Bayesian Lasso [21], where the local Laplace approximation in INLA cannot be applied, as the non-differentiability of the ℓ1 norm prevents one from computing the Hessian matrix. Moreover, if we do not exploit the scale mixture of normals representation [22] of Laplace priors (i.e., no data augmentation), we are actually dealing with a non-conjugate variational inference problem in Bayesian Lasso.\n\n3 Application to Bayesian Lasso\n\nConsider the Bayesian Lasso regression model [21], y = Φx + e, where Φ ∈ R^{n×p} is the design matrix containing predictors, y ∈ R^n are responses2, and e ∈ R^n contains independent zero-mean Gaussian noise, e ∼ N(e; 0, σ²I_n). 
Following [21] we assume3\n\nx_j | σ², λ² ∼ (λ/(2√σ²)) exp(−(λ/√σ²)||x_j||₁),  σ² ∼ InvGamma(σ²; a, b),  λ² ∼ Gamma(λ²; r, s).\n\nWhile the Lasso estimates [23] provide only the posterior modes of the regression parameters x ∈ R^p, Bayesian Lasso [21] provides the complete posterior distribution p(x, θ|y), from which one may obtain whatever statistical properties of x and θ are desired, including the posterior mode, mean, median, and credible intervals.\n\nSince in our approach the variational Gaussian approximation is performed separately (see Section 3.1) for each hyperparameter combination {λ, σ²} considered, the efficiency of approximating p(x|y, θ) is particularly important. The upper bound on the KL divergence derived in Section 3.2 provides an approximate closed-form solution that is often accurate enough, or requires only a small number of gradient iterations to converge to optimality. 
The tightness of the upper bound is analyzed using spectral-norm bounds (see Section 3.3), which also provide insight into the connection between the deterministic Lasso [23] and the Bayesian Lasso [21].\n\n3.1 Variational Gaussian Approximation\n\nThe conditional distribution of y and x given θ is\n\np(y, x|θ) = [λ^p / ((2σ)^p √((2πσ²)^n))] exp{−||y − Φx||²/(2σ²) − (λ/σ)||x||₁}.  (10)\n\nThe postulated approximation, q(x|θ, y) = N(x; μ, D), is a multivariate Gaussian density (dropping the dependencies of the variational parameters (μ, D) on (θ, y) for brevity), whose parameters (μ, D) are found by minimizing the KL divergence to p(x|θ, y),\n\ng(μ, D) ≝ KL(q(x; μ, D) || p(x|y, θ)) = ∫ q(x; μ, D) ln [q(x; μ, D)/p(x|y, θ)] dx\n= ∫ q(x; μ, D) ln [q(x; μ, D)/p(y, x|θ)] dx + ln p(y|θ)\n= −(1/2) ln|D| + [||y − Φμ||² + tr(Φ′ΦD)]/(2σ²) + (λ/σ) E_q(||x||₁) + ln p(y|θ) − ln ψ(σ², λ),  (11)\n\nE_q(||x||₁) = Σ_{j=1}^p [μ_j − 2μ_j Ψ(h_j) + 2√d_j ψ(h_j)],  h_j = −μ_j/√d_j,  d_j = D_jj,\n\nwhere ψ(σ², λ) = (4πeλ²σ⁻²)^{p/2}(2πσ²)^{−n/2}, and Ψ(·) and ψ(·) correspond to the standard normal cumulative distribution function and probability density function, respectively. The expectation is taken with respect to q(x; μ, D). Define D = CC′, where C is the Cholesky factor of the covariance matrix D. 
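The expectation E_q(||x||₁) is the only non-standard term in (11); the following sanity check (our own test, not part of the paper) confirms the closed form against Monte Carlo sampling:

```python
import numpy as np
from scipy import stats

# Closed-form E_q(||x||_1) from (11): with d_j = D_jj and h_j = -mu_j/sqrt(d_j),
#   E_q|x_j| = mu_j - 2 mu_j Psi(h_j) + 2 sqrt(d_j) psi(h_j),
# where Psi/psi are the standard normal CDF/PDF.
def expected_l1(mu, d):
    h = -mu / np.sqrt(d)
    return float(np.sum(mu - 2 * mu * stats.norm.cdf(h) + 2 * np.sqrt(d) * stats.norm.pdf(h)))

rng = np.random.default_rng(1)
mu = np.array([0.8, -0.3, 0.0])          # illustrative variational means
d = np.array([0.5, 1.2, 0.1])            # illustrative diagonal of D
samples = rng.normal(mu, np.sqrt(d), size=(200000, 3))
mc_estimate = np.abs(samples).sum(axis=1).mean()
assert abs(expected_l1(mu, d) - mc_estimate) < 0.02
```

This smoothing of the ℓ1 norm under the Gaussian expectation is what restores differentiability around zero, as discussed in Section 3.3.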
Since g(μ, D) is convex in the parameter space (μ, C), a globally optimal variational Gaussian approximation q⋆(x|y, θ) is guaranteed, which achieves the minimum KL divergence to p(x|θ, y) within the specified family of multivariate Gaussian densities [13]4.\n\n2We assume that both y and the columns of Φ have been mean-centered to remove the intercept term.\n\n3[21] suggested using scaled double-exponential priors, under which they showed that p(x, σ²|y, λ) is unimodal; further, the unimodality helps to accelerate convergence of the data-augmentation Gibbs sampler and makes the posterior mode more meaningful. A Gamma prior is put on λ² for conjugacy.\n\n4Code for the variational Gaussian approximation is available at mloss.org/software/view/308\n\nAs a first step, one finds q⋆(x|y, θ) using gradient-based procedures independently for each hyperparameter combination {λ, σ²}. Second, q⋆(θ|y) can be evaluated analytically using either (6) or (7); both yield a finite mixture of Gaussian distributions for the marginal posterior q(x|y) via numerical integration, which is highly efficient since there are only two hyperparameters in Bayesian Lasso. Finally, the evidence lower bound (ELBO) in (1) can also be evaluated analytically after simple algebra. In Section 4.3 we show a comparison with the mean-field variational Bayes (VB) approach, derived from a scale mixture of normals representation [22] of the Laplace prior.\n\n3.2 Upper Bounds of the KL Divergence\n\nWe provide an approximate solution (μ̂, D̂) by minimizing an upper bound of the KL divergence (11). This solution solves a Lasso problem in μ and has a closed-form expression for D, making it computationally efficient. 
In practice, it could serve as an initialization for gradient procedures.\n\nLemma 3.1. (Triangle Inequality) E_q||x||₁ ≤ E_q||x − μ||₁ + ||μ||₁, where E_q||x − μ||₁ = √(2/π) Σ_{j=1}^p √d_j, with the expectation taken with respect to q(x; μ, D).\n\nLemma 3.2. For any {d_j ≥ 0}_{j=1}^p, it holds that √(Σ_{j=1}^p d_j²) ≤ Σ_{j=1}^p d_j ≤ √(p Σ_{j=1}^p d_j²).\n\nLemma 3.3. [24] For any A ∈ S^p_{++}, √(tr(A²)) ≤ tr(A) ≤ √(p tr(A²)).\n\nTheorem 3.1. (Upper and Lower Bound) For any A, D ∈ S^p_{++} with A = √D5 and d_j = D_jj, it holds that (1/√p) tr(A) ≤ Σ_{j=1}^p √d_j ≤ √p tr(A).\n\nApplying Lemma 3.1 and Theorem 3.1 in (11), one obtains an upper bound on the KL divergence,\n\nf(μ, D) = [||y − Φμ||²/(2σ²) + (λ/σ)||μ||₁]_{f1(μ)} + [−(1/2) ln|D| + tr(Φ′ΦD)/(2σ²) + (λ/σ)√(2p/π) tr(√D)]_{f2(D)} + ln [p(y|θ)/ψ(σ², λ)] ≥ g(μ, D) = KL(q(x; μ, D) || p(x|y, θ)).  (12)\n\nIn the problem of minimizing the KL divergence g(μ, CC′), one needs to iteratively update μ and C, since they are coupled. However, the upper bound f(μ, D) decouples into two additive terms: f1 is a function of μ while f2 is a function of D, which greatly simplifies the minimization.\n\n• The minimization of f1(μ) is a convex Lasso problem. 
Using path-following algorithms (e.g., a modified least angle regression (LARS) algorithm [25]), one can efficiently compute the entire solution path of Lasso estimates as a function of λ0 = 2λσ in one shot. Globally optimal solutions for μ̂(θ_k) at each grid point θ_k ∈ G can then be recovered using the piecewise-linear property of the path.\n\n• The function f2(D) is convex in the parameter space A = √D, and its minimizer is available in closed form by setting the gradient to zero and solving the resulting equation,\n\n∇_A f2 = −A⁻¹ + Φ′ΦA/σ² + (λ/σ)√(2p/π) I = 0,  Â = [√(λ²p/(2πσ²)) I + (λ²p/(2πσ²) I + Φ′Φ/σ²)^{1/2}]⁻¹.  (13)\n\nWe then have D̂ = Â², which is guaranteed to be a positive definite matrix. Note that the global optima D̂(θ_k) for the grid points θ_k ∈ G share the eigenvectors of the Gram matrix Φ′Φ and differ only in their eigenvalues. For j = 1, . . . , p, denote the eigenvalues of D and Φ′Φ by α_j and β_j, respectively. By (13), α_j = [λ√(p/(2πσ²)) + √(λ²p/(2πσ²) + β_j/σ²)]⁻². Therefore, one can pre-compute the eigenvectors once and only update the eigenvalues as a function of θ_k, which makes the computation efficient in both time and memory.\n\nThe solution (μ̂, D̂) minimizing the KL upper bound f(μ̂, D̂) in (12) achieves the global optimum of the bound. Meanwhile, it is also accurate in the sense of the KL divergence g(μ̂, D̂) in (11), as we will show next. 
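A compact numerical sketch of the closed-form covariance update, under our reading of (13) (the per-eigenvalue solution below is our own derivation sketch, with invented problem sizes, and should be checked against the paper's supplement):

```python
import numpy as np

# Closed-form minimizer of f2, as we read (13): with c = (lambda/sigma)*sqrt(2p/pi),
# the stationarity condition -A^{-1} + Phi'Phi A / sigma^2 + c I = 0 is solved
# per eigenvalue of the Gram matrix Phi'Phi, so the eigenvectors are computed
# once and only the eigenvalues change across grid points (lambda, sigma2).
rng = np.random.default_rng(2)
n, p = 40, 6
Phi = rng.standard_normal((n, p))
beta, V = np.linalg.eigh(Phi.T @ Phi)      # shared eigendecomposition

def sqrt_D_hat(lam, sigma2):
    half_c = lam * np.sqrt(p / (2 * np.pi * sigma2))
    alpha_A = 1.0 / (half_c + np.sqrt(half_c ** 2 + beta / sigma2))  # eigenvalues of A
    return (V * alpha_A) @ V.T             # A = sqrt(D); D = A @ A is positive definite

lam, sigma2 = 0.5, 0.5
A = sqrt_D_hat(lam, sigma2)
c = (lam / np.sqrt(sigma2)) * np.sqrt(2 * p / np.pi)
grad = -np.linalg.inv(A) + Phi.T @ Phi @ A / sigma2 + c * np.eye(p)
assert np.allclose(grad, 0.0, atol=1e-8)   # stationarity of f2 holds at A
assert np.all(np.linalg.eigvalsh(A @ A) > 0)
```

Because the eigenvectors V are shared across grid points, only p scalar eigenvalues are recomputed per (λ, σ²), which is what makes the grid sweep cheap in both time and memory.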
Tightness analysis of the upper bound is also provided, using trace-norm bounds.\n\n5Since D is positive definite, it has a unique symmetric square root A = √D, which can be obtained from D by taking square roots of the eigenvalues.\n\n3.3 Theoretical Analysis\n\nTheorem 3.2. (KL Divergence Upper Bound) Let (μ̂, D̂) be the minimizer of the KL upper bound (12), i.e., μ̂ solves the Lasso and D̂ is given in (13). Then\n\ng(μ̂, D̂) ≤ min_{μ,D} f(μ, D) = f1(μ̂) + f2(D̂) + ln [p(y|θ)/ψ(σ², λ)],  (14)\n\nwhere f1(μ̂) = min_μ [||y − Φμ||²/(2σ²) + (λ/σ)||μ||₁] and f2(D̂) = Σ_j [−(1/2) ln α_j + β_j α_j/(2σ²) + (λ/σ)√(2p/π) √α_j]. Thus the KL divergence at (μ̂, D̂) is upper bounded by the minimum achievable ℓ1-penalized least-squares error ε1 = f1(μ̂) plus terms in f2(D̂) that are ultimately related to the eigenvalues {β_j} (j = 1, . . . , p) of the Gram matrix Φ′Φ.\n\nLet (μ∗, D∗) be the minimizer of the original KL divergence g(μ, D), and let g1(μ|D) collect the terms of g(μ, D) that are related to μ. 
Then the Bayesian posterior mean obtained via VG, i.e.,\n\nμ∗ = argmin_μ g1(μ|D∗) = argmin_μ E_{q(x|y,θ)}(||y − Φx||² + 2λσ||x||₁),  (15)\n\nis a counterpart of the deterministic Lasso [23], which appears naturally in the upper bound,\n\nμ̂ = argmin_μ f1(μ) = argmin_μ (||y − Φμ||² + 2λσ||μ||₁).  (16)\n\nNote that the Lasso solution cannot be found by gradient methods due to non-differentiability. By taking the expectation, the objective function is smoothed around 0 and thus becomes differentiable. This connection indicates that in VG for Bayesian Lasso, the conditions of the deterministic Lasso hold on average, with respect to the variational distribution q(x|y, θ), in the parameter space of μ.\n\nThe following theorem (proof sketches are in the Supplementary Material) provides quantitative measures of the closeness of the upper bounds f1(μ) and f(μ, D) to their respective true counterparts.\n\nTheorem 3.3. The tightness of f1(μ) and f(μ, D) is given by\n\ng1(μ|D) − f1(μ) ≤ tr(Φ′ΦD)/(2σ²) + λ√(2p/(σ²π)) tr(√D),  f(μ, D) − g(μ, D) ≤ 2λ√(2p/(σ²π)) tr(√D),  (17)\n\nwhich holds for any (μ, D) ∈ R^p × S^p_{++}. Further assuming g(μ∗, D∗) = ε2 (the minimum achievable KL divergence, or information gap), we have\n\nf1(μ∗) ≤ g1(μ∗) ≤ g1(μ̂) ≤ ε1 + tr(Φ′ΦD∗)/(2σ²) + λ√(2p/(σ²π)) tr(√D∗),  (18a)\ng(μ̂, D̂) ≤ f(μ̂, D̂) ≤ f(μ∗, D∗) ≤ ε2 + 2λ√(2p/(σ²π)) tr(√D∗).  (18b)\n\n4 Experiments\n\nWe consider long runs of MCMC6 as reference solutions, and consider two types of INF-VB: INF-VB-1 calculates hyperparameter posteriors using (6), while INF-VB-2 uses (7) and evaluates it at the posterior mode of p(x|y, θ). We also compare INF-VB-1 and INF-VB-2 to VB, a mean-field variational Bayes solution (see the Supplementary Material for update equations). The results show that the INF-VB method is more accurate than VB, and is a promising alternative to MCMC for Bayesian Lasso.\n\n4.1 Synthetic Dataset\n\nWe compare the proposed INF-VB methods with VB and intensive MCMC runs, in terms of the joint posterior q(λ², σ²|y), the marginal posteriors of the hyperparameters q(σ²|y) and q(λ²|y), and the marginal posteriors of the regression coefficients q(x_j|y) (see Figure 1). The observations are generated from y_i = φ′_i x + ε_i, i = 1, . . . , 600, where the φ_ij are drawn from an i.i.d. normal distribution7 with the pairwise correlation between the jth and kth columns of Φ equal to 0.5^{|j−k|}; we draw ε_i ∼ N(0, σ²), x_j|λ, σ ∼ Laplace(λ/σ), j = 1, . . . , 300, and set σ² = 0.5, λ = 0.5.\n\n6In all experiments shown here, we take intensive MCMC runs as the gold standard (with 5 × 10³ burn-in iterations and 5 × 10⁵ samples collected). We use the data-augmentation Gibbs sampler introduced in [21]. Ground truth for the latent variables and hyperparameters is also compared to whenever possible. The hyperparameters of the Gamma distributions are set to a = b = r = s = 0.001 throughout these experiments. 
If not mentioned otherwise, the grid size is 50 × 50, created uniformly around the ordinary least squares (OLS) estimates of the hyperparameters.\n\n7The responses y and the columns of Φ are centered; the columns of Φ are also scaled to have unit variance.\n\nFigure 1: Contour plots of the joint posterior of the hyperparameters q(σ², λ²|y): (a)-(d); marginal posteriors of hyperparameters and coefficients: (e) q(σ²|y), (f) q(λ²|y), (g) q(x1|y), (h) q(x2|y).\n\nAs seen in Figure 1(a)-(d), both MCMC and INF-VB preserve the strong posterior dependence among the hyperparameters, while mean-field VB cannot. While mean-field VB approximates the posterior mode well, the posterior variance can be (sometimes severely) underestimated; see Figure 1(e), (f).\n\nSince we have analytically approximated p(x|y) by a finite mixture of normal distributions q(x|y, θ) with mixing weights q(θ|y), the posterior marginals of the latent variables, q(x_j|y), are easily accessible from this analytical representation. Perhaps surprisingly, both INF-VB and mean-field VB provide quite accurate marginal distributions q(x_j|y); see Figure 1(g)-(h) for examples. The differences in the tails of q(θ|y) between INF-VB and mean-field VB yield negligible differences in the marginal distributions q(x_j|y) when θ is integrated out.\n\n4.2 Diabetes Dataset\n\nWe consider the benchmark diabetes dataset [25] frequently used in previous studies of Bayesian Lasso; see [21, 26], for example. The goal of this diagnostic study, as suggested in [25], is to construct a linear regression model (n = 442, p = 10) that reveals the important determinants of the response and provides interpretable results to guide disease progression. In Figure 2, we show accurate marginal posteriors of the hyperparameters q(σ²|y) and q(λ²|y), as well as marginals of the coefficients q(x_j|y), j = 1, . . 
, 10, which indicate the relevance of each predictor. We also compare them to the ordinary least squares (OLS) estimates.\n\nFigure 2: Posterior marginals of the hyperparameters: (a) q(σ²|y) and (b) q(λ²|y); posterior marginals of the coefficients: (c)-(l) q(x_j|y) (j = 1, . . . , 10).\n\n4.3 Comparison: Accuracy and Speed\n\nWe quantitatively measure the quality of the approximate joint posterior q(x, θ|y) provided by our non-factorized variational methods, and compare them to VB under factorization assumptions. The KL divergence KL(q(x, θ|y)||p(x, θ|y)) is not directly available; instead, we compare the negative evidence lower bound (1), which can be evaluated analytically in our case and differs from the KL divergence only up to a constant. We also measure the computational time of the different algorithms by elapsed time (seconds). In INF-VB, different grids of size m × m are considered, where m = 1, 5, 10, 30, 50. We consider two real-world datasets: the Diabetes dataset above and the Prostate cancer dataset [27]. Here, INF-VB-3 and INF-VB-4 refer to the methods that use the approximate solution of Section 3.2 with no gradient steps for q(x|y, θ), and use (6) or (7), respectively, for q(θ|y).\n\nFigure 3: Negative evidence lower bound (ELBO) and elapsed time vs. grid size; (a), (b) for the Diabetes dataset (n = 442, p = 10); (c), (d) for the Prostate cancer dataset (n = 97, p = 8).\n\nThe quality of variational methods depends on the flexibility of the variational distributions. In INF-VB for Bayesian Lasso, we constrain q(x|y, θ) to be parametric while q(θ|y) remains in free form. As seen in Figure 3, the accuracy of INF-VB with a 1×1 grid is worse than that of mean-field VB; this setting corresponds to partial Bayesian learning of q(x|y, θ) with a fixed θ. 
As the grid size increases, the accuracy of INF-VB (even the variants without gradient steps) also increases, and is in general better than that of mean-field VB in the sense of negative ELBO (KL divergence up to a constant). The computational complexities of INF-VB, mean-field VB, and MCMC are proportional to the grid size, the number of iterations to a local optimum, and the number of runs, respectively. Since the computations on the grid are independent, INF-VB is highly parallelizable, an important feature as more multiprocessor computational power becomes available. Moreover, one may further reduce its computational load by choosing grid points more economically, which we will pursue in future work. Even the small datasets shown here for illustration enjoy good speed-ups; a significant speed-up for INF-VB can be achieved via parallel computing.

5 Discussion
We have provided a flexible framework for approximate inference of the full posterior p(x, θ|y) based on a hybrid continuous-discrete variational distribution, which is optimal in the sense of the KL divergence. As a reliable and efficient alternative to MCMC, our method generalizes INLA to non-Gaussian priors and VB to non-factorized settings. While we have used Bayesian Lasso as an example, our inference method is generically applicable. One can also approximate p(x|y, θ) using other methods, such as scalable variational methods [28] or improved EP [29].
The posterior p(θ|y), which is analyzed based on a grid approximation, enables users to do both model averaging and model selection, depending on the specific purpose.
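For concreteness, the two uses of the grid can be sketched as follows. This is a minimal NumPy illustration with synthetic grid weights and per-grid-point Gaussian marginals; all variable names are hypothetical and not taken from our implementation. Model averaging mixes the conditional marginals q(xj|y, θk) with weights q(θk|y), recovering the finite mixture of normals used above; model selection instead keeps only the MAP grid point of q(θ|y).

```python
import numpy as np

# Hypothetical inputs: a discrete grid of hyperparameter values theta_k with
# weights w_k approximating q(theta_k | y), and for each grid point a Gaussian
# approximation of the marginal of one coefficient x_j under q(x | y, theta_k).
rng = np.random.default_rng(0)
K = 9                                  # number of grid points
log_w = rng.normal(size=K)             # unnormalized log-weights on the grid
w = np.exp(log_w - log_w.max())
w /= w.sum()                           # normalized weights, sum to one
mu = rng.normal(size=K)                # mean of x_j under q(x | y, theta_k)
sd = 0.5 + rng.random(K)               # std of x_j under q(x | y, theta_k)

def gauss_pdf(x, m, s):
    """Univariate normal density N(x; m, s^2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def marginal(x):
    """Model averaging: q(x_j | y) = sum_k w_k N(x_j; mu_k, sd_k^2)."""
    return sum(w_k * gauss_pdf(x, m_k, s_k)
               for w_k, m_k, s_k in zip(w, mu, sd))

# Model selection: retain only the MAP grid point of q(theta | y).
k_map = int(np.argmax(w))
map_density = lambda x: gauss_pdf(x, mu[k_map], sd[k_map])
```

Replacing w with a one-hot vector at k_map collapses the mixture to the model-selection density, so both modes of use share the same per-grid-point computations.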
The discretized approximation of p(θ|y) overcomes potential non-conjugacy and multimodality issues in the θ space in variational inference, and it also allows a parallel implementation of the hybrid continuous-discrete variational approximation, with the dominant computational load (approximating the continuous high-dimensional q(x|y, θ)) distributed across the grid points; this is particularly important when applying INF-VB to large-scale Bayesian inference. INF-VB has limitations: the number of hyperparameters θ should be no more than 5 to 6, which is the same fundamental limitation as INLA.

Acknowledgments

The work reported here was supported in part by grants from ARO, DARPA, DOE, NGA and ONR.

References
[1] D. Gamerman and H. F. Lopes. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman & Hall Texts in Statistical Science Series. Taylor & Francis, 2006.
[2] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, 2005.
[3] R. E. Kass and D. Steffey. Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). J. Am. Statist. Assoc., 84(407):717–726, 1989.
[4] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models.
In Learning in Graphical Models, pages 105–161. MIT Press, Cambridge, MA, 1999.
[5] T. P. Minka. Expectation propagation for approximate Bayesian inference. In J. S. Breese and D. Koller, editors, Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 362–369, 2001.
[6] J. T. Ormerod. Skew-normal variational approximations for Bayesian inference. Technical report, School of Mathematics and Statistics, University of Sydney, 2011.
[7] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71(2):319–392, 2009.
[8] J. Hensman, M. Rattray, and N. D. Lawrence. Fast variational inference in the conjugate exponential family. In Advances in Neural Information Processing Systems, 2012.
[9] J. Foulds, L. Boyles, C. Dubois, P. Smyth, and M. Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2013.
[10] J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.
[11] C. Wang and D. M. Blei. Truncation-free online variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems, 2012.
[12] S. J. Gershman, M. D. Hoffman, and D. M. Blei. Nonparametric variational inference. In International Conference on Machine Learning, 2012.
[13] E. Challis and D. Barber. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. Journal of Machine Learning Research - Proceedings Track, 15:199–207, 2011.
[14] M. E. Khan, S. Mohamed, and K. P. Murphy. Fast Bayesian inference for non-conjugate Gaussian process regression.
In Advances in Neural Information Processing Systems, 2012.
[15] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[16] C. Ritter and M. A. Tanner. Facilitating the Gibbs sampler: The Gibbs stopper and the griddy-Gibbs sampler. J. Am. Statist. Assoc., 87(419):861–868, 1992.
[17] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Comput., 21(3):786–792, 2009.
[18] E. Challis and D. Barber. Affine independent variational inference. In Advances in Neural Information Processing Systems, 2012.
[19] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag, New York, 2006.
[20] L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal densities. J. Am. Statist. Assoc., 81:82–86, 1986.
[21] T. Park and G. Casella. The Bayesian Lasso. J. Am. Statist. Assoc., 103(482):681–686, 2008.
[22] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36(1):99–102, 1974.
[23] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
[24] G. H. Golub and C. F. Van Loan. Matrix Computations (3rd edition). Johns Hopkins University Press, 1996.
[25] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
[26] C. Hans. Bayesian Lasso regression. Biometrika, 96(4):835–845, 2009.
[27] T. Stamey, J. Kabalin, J. McNeal, I. Johnstone, F. Freiha, E. Redwine, and N. Yang. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. Journal of Urology, 16:
1076–1083, 1989.
[28] M. W. Seeger and H. Nickisch. Large scale Bayesian inference and experimental design for sparse linear models. SIAM J. Imaging Sciences, 4(1):166–199, 2011.
[29] B. Cseke and T. Heskes. Approximate marginals in latent Gaussian models. J. Mach. Learn. Res., 12:417–454, 2011.