{"title": "Generalized Beta Mixtures of Gaussians", "book": "Advances in Neural Information Processing Systems", "page_first": 523, "page_last": 531, "abstract": "In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems.  In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases.  This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely.", "full_text": "Generalized Beta Mixtures of Gaussians\n\nArtin Armagan\nDept. of Statistical Science\nDuke University\nDurham, NC 27708\nartin@stat.duke.edu\n\nDavid B. Dunson\nDept. of Statistical Science\nDuke University\nDurham, NC 27708\ndunson@stat.duke.edu\n\nMerlise Clyde\nDept. of Statistical Science\nDuke University\nDurham, NC 27708\nclyde@stat.duke.edu\n\nAbstract\n\nIn recent years, a rich variety of shrinkage priors have been proposed that have\ngreat promise in addressing massive regression problems. 
In general, these new\npriors can be expressed as scale mixtures of normals, but have more complex\nforms and better properties than traditional Cauchy and double exponential priors.\nWe first propose a new class of normal scale mixtures through a novel generalized\nbeta distribution that encompasses many interesting priors as special cases.\nThis encompassing framework should prove useful in comparing competing priors,\nconsidering properties and revealing close connections. We then develop a\nclass of variational Bayes approximations through the new hierarchy presented\nthat will scale more efficiently to the types of truly massive data sets that are now\nencountered routinely.\n\n1 Introduction\n\nPenalized likelihood estimation has evolved into a major area of research, with \u21131 [22] and other\nregularization penalties now used routinely in a rich variety of domains. Often minimizing a loss\nfunction subject to a regularization penalty leads to an estimator that has a Bayesian interpretation as\nthe mode of a posterior distribution [8, 11, 1, 2], with different prior distributions inducing different\npenalties. For example, it is well known that Gaussian priors induce \u21132 penalties, while double\nexponential priors induce \u21131 penalties [8, 19, 13, 1]. Viewing massive-dimensional parameter learning\nand prediction problems from a Bayesian perspective naturally leads one to design new priors that\nhave substantial advantages over the simple normal or double exponential choices and that induce\nrich new families of penalties. For example, in high-dimensional settings it is often appealing to\nhave a prior that is concentrated at zero, favoring strong shrinkage of small signals and potentially a\nsparse estimator, while having heavy tails to avoid over-shrinkage of the larger signals. 
The Gaussian\nand double exponential priors are insufficiently flexible in having a single scale parameter and\nrelatively light tails; in order to shrink many small signals strongly towards zero, the double exponential\nmust be concentrated near zero and hence will over-shrink signals not close to zero. This\nphenomenon has motivated a rich variety of new priors such as the normal-exponential-gamma, the\nhorseshoe and the generalized double Pareto [11, 14, 1, 6, 20, 7, 12, 2].\nAn alternative and widely applied Bayesian framework relies on variable selection priors and\nBayesian model selection/averaging [18, 9, 16, 15]. Under such approaches the prior is a mixture\nof a mass at zero, corresponding to the coefficients to be set equal to zero and hence excluded\nfrom the model, and a continuous distribution, providing a prior for the size of the non-zero signals.\nThis paradigm is very appealing in fully accounting for uncertainty in parameter learning and the\nunknown sparsity structure through a probabilistic framework. One obtains a posterior distribution\nover the model space corresponding to all possible subsets of predictors, and one can use this posterior\nfor model-averaged predictions that take into account uncertainty in subset selection and to\nobtain marginal inclusion probabilities for each predictor providing a weight of evidence that a specific\nsignal is non-zero allowing for uncertainty in the other signals to be included. Unfortunately,\nthe computational complexity is exponential in the number of candidate predictors (2^p with p the\nnumber of predictors).\nSome recently proposed continuous shrinkage priors may be considered competitors to the conventional\nmixture priors [15, 6, 7, 12] yielding computationally attractive alternatives to Bayesian\nmodel averaging. Continuous shrinkage priors lead to several advantages. 
The ones represented as\nscale mixtures of Gaussians allow conjugate block updating of the regression coef\ufb01cients in linear\nmodels and hence lead to substantial improvements in Markov chain Monte Carlo (MCMC) ef\ufb01-\nciency through more rapid mixing and convergence rates. Under certain conditions these will also\nyield sparse estimates, if desired, via maximum a posteriori (MAP) estimation and approximate\ninferences via variational approaches [17, 24, 5, 8, 11, 1, 2].\nThe class of priors that we consider in this paper encompasses many interesting priors as special\ncases and reveals interesting connections among different hierarchical formulations. Exploiting\nan equivalent conjugate hierarchy of this class of priors, we develop a class of variational Bayes\napproximations that can scale up to truly massive data sets. This conjugate hierarchy also allows\nfor conjugate modeling of some previously proposed priors which have some rather complex yet\nadvantageous forms and facilitates straightforward computation via Gibbs sampling. We also argue\nintuitively that by adjusting a global shrinkage parameter that controls the overall sparsity level, we\nmay control the number of non-zero parameters to be estimated, enhancing results, if there is an\nunderlying sparse structure. This global shrinkage parameter is inherent to the structure of the priors\nwe discuss as in [6, 7] with close connections to the conventional variable selection priors.\n\n2 Background\n\nWe provide a brief background on shrinkage priors focusing primarily on the priors studied by [6, 7]\nand [11, 12] as well as the Strawderman-Berger (SB) prior [7]. These priors possess some very\nappealing properties in contrast to the double exponential prior which leads to the Bayesian lasso [19,\n13]. They may be much heavier-tailed, biasing large signals less drastically while shrinking noise-\nlike signals heavily towards zero. 
In particular, the priors by [6, 7], along with the Strawderman-Berger\nprior [7], have a very interesting and intuitive representation later given in (2), yet are not\nformed in a conjugate manner, potentially leading to analytical and computational complexity.\n[6, 7] propose a useful class of priors for the estimation of multiple means. Suppose a p-dimensional\nvector y|\u03b8 \u223c N(\u03b8, I) is observed. The independent hierarchical prior for \u03b8j is given by\n\n\u03b8_j | \u03c4_j \u223c N(0, \u03c4_j),   \u03c4_j^{1/2} \u223c C+(0, \u03c6^{1/2}),   (1)\n\nfor j = 1, . . . , p, where N(\u00b5, \u03bd) denotes a normal distribution with mean \u00b5 and variance \u03bd and\nC+(0, s) denotes a half-Cauchy distribution on \u211c^+ with scale parameter s. With an appropriate\ntransformation \u03c1j = 1/(1 + \u03c4j), this hierarchy also can be represented as\n\n\u03b8_j | \u03c1_j \u223c N(0, 1/\u03c1_j \u2212 1),   \u03c0(\u03c1_j | \u03c6) \u221d \u03c1_j^{\u22121/2} (1 \u2212 \u03c1_j)^{\u22121/2} / {1 + (\u03c6 \u2212 1)\u03c1_j}.   (2)\n\nA special case where \u03c6 = 1 leads to \u03c1j \u223c B(1/2, 1/2) (beta distribution), from which the name of the prior,\nhorseshoe (HS), arises [6, 7]. Here the \u03c1j are referred to as the shrinkage coefficients as they determine\nthe magnitude with which the \u03b8j are pulled toward zero. A prior of the form \u03c1j \u223c B(1/2, 1/2) is\nnatural to consider in the estimation of a signal \u03b8j as this yields a very desirable behavior both at\nthe tails and in the neighborhood of zero. That is, the resulting prior has heavy tails as well as being\nunbounded at zero which creates a strong pull towards zero for those values close to zero. [7] further\ndiscuss priors of the form \u03c1j \u223c B(a, b) for a > 0, b > 0 to elaborate more on their focus on the\nchoice a = b = 1/2. A similar formulation dates back to [21]. 
[7] refer to the prior of the form\n\u03c1j \u223c B(1, 1/2) as the Strawderman-Berger prior due to [21] and [4]. The same hierarchical prior\nis also referred to as the quasi-Cauchy prior in [16]. Hence, the tail behavior of the Strawderman-Berger\nprior remains similar to the horseshoe (when \u03c6 = 1), while the behavior around the origin\nchanges. The hierarchy in (2) is much more intuitive than the one in (1) as it explicitly reveals the\nbehavior of the resulting marginal prior on \u03b8j. This intuitive representation makes these hierarchical\npriors interesting despite their relatively complex forms. On the other hand, what the prior in (1) or\n(2) lacks is a more trivial hierarchy that yields recognizable conditional posteriors in linear models.\n\n[11, 12] consider the normal-exponential-gamma (NEG) and normal-gamma (NG) priors respectively\nwhich are formed in a conjugate manner yet lack the intuition the Strawderman-Berger and\nhorseshoe priors provide in terms of the behavior of the density around the origin and at the tails.\nHence the implementation of these priors may be more user-friendly but they are very implicit in\nhow they behave. In what follows we will see that these two forms are not far from one another.\nIn fact, we may unite these two distinct hierarchical formulations under the same class of priors\nthrough a generalized beta distribution and the proposed equivalence of hierarchies in the following\nsection. This is rather important to be able to compare the behavior of priors emerging from different\nhierarchical formulations. 
Furthermore, this equivalence in the hierarchies will allow for a straightforward\nGibbs sampling update in posterior inference, as well as making variational approximations\npossible in linear models.\n\n3 Equivalence of Hierarchies via a Generalized Beta Distribution\n\nIn this section we propose a generalization of the beta distribution to form a flexible class of scale\nmixtures of normals with very appealing behavior. We then formulate our hierarchical prior in a\nconjugate manner and reveal similarities and connections to the priors given in [16, 11, 12, 6, 7].\nAs the name generalized beta has previously been used, we refer to our generalization as the three-parameter\nbeta (TPB) distribution.\nIn the forthcoming text \u0393(.) denotes the gamma function, G(\u00b5, \u03bd) denotes a gamma distribution\nwith shape and rate parameters \u00b5 and \u03bd, W(\u03bd, S) denotes a Wishart distribution with\n\u03bd degrees of freedom and scale matrix S, U(\u03b11, \u03b12) denotes a uniform distribution over\n(\u03b11, \u03b12), GIG(\u00b5, \u03bd, \u03be) denotes a generalized inverse Gaussian distribution with density function\n(\u03bd/\u03be)^{\u00b5/2} {2K_\u00b5(\u221a(\u03bd\u03be))}^{\u22121} x^{\u00b5\u22121} exp{\u2212(\u03bdx + \u03be/x)/2}, and K_\u00b5(.) is a modified Bessel function of the\nsecond kind.\nDefinition 1. 
The three-parameter beta (TPB) distribution for a random variable X is defined by\nthe density function\n\nf(x; a, b, \u03c6) = [\u0393(a + b) / {\u0393(a)\u0393(b)}] \u03c6^b x^{b\u22121} (1 \u2212 x)^{a\u22121} {1 + (\u03c6 \u2212 1)x}^{\u2212(a+b)},   (3)\n\nfor 0 < x < 1, a > 0, b > 0 and \u03c6 > 0 and is denoted by TPB(a, b, \u03c6).\n\nIt can be easily shown by a change of variable x = 1/(y + 1) that the above density integrates to 1.\nThe kth moment of the TPB distribution is given by\n\nE(X^k) = [\u0393(a + b)\u0393(b + k) / {\u0393(b)\u0393(a + b + k)}] _2F_1(a + b, b + k; a + b + k; 1 \u2212 \u03c6),   (4)\n\nwhere _2F_1 denotes the hypergeometric function. In fact it can be shown that TPB is a subclass of\nthe Gauss hypergeometric (GH) distribution proposed in [3] and the compound confluent hypergeometric\n(CCH) distribution proposed in [10].\nThe density functions of GH and CCH distributions are given by\n\nf_GH(x; a, b, r, \u03b6) = x^{b\u22121} (1 \u2212 x)^{a\u22121} (1 + \u03b6x)^{\u2212r} / {B(b, a) _2F_1(r, b; a + b; \u2212\u03b6)},   (5)\n\nf_CCH(x; a, b, r, s, \u03bd, \u03b8) = \u03bd^b x^{b\u22121} (1 \u2212 x)^{a\u22121} (\u03b8 + (1 \u2212 \u03b8)\u03bdx)^{\u2212r} / {B(b, a) exp(\u2212s/\u03bd) \u03a6_1(a, r, a + b, s/\u03bd, 1 \u2212 \u03b8)},   (6)\n\nfor 0 < x < 1 and 0 < x < 1/\u03bd, respectively, where B(b, a) = \u0393(a)\u0393(b)/\u0393(a + b) denotes the beta\nfunction and \u03a6_1 is the degenerate hypergeometric function of two variables [10]. Letting \u03b6 = \u03c6 \u2212 1,\nr = a + b and noting that _2F_1(a + b, b; a + b; 1 \u2212 \u03c6) = \u03c6^{\u2212b}, (5) becomes a TPB density. Also note\nthat (6) becomes (5) for s = 1, \u03bd = 1 and \u03b6 = (1 \u2212 \u03b8)/\u03b8 [10].\n[20] considered an alternative special case of the CCH distribution for the shrinkage coefficients,\n\u03c1j, by letting \u03bd = r = 1 in (6). [20] refer to this special case as the hypergeometric-beta (HB)\ndistribution. 
TPB and HB generalize the beta distribution in two distinct directions, with one practical\nadvantage of the TPB being that it allows for a straightforward conjugate hierarchy leading to\npotentially substantial analytical and computational gains.\n\nNow we move on to the hierarchical modeling of a flexible class of shrinkage priors for the estimation\nof a potentially sparse p-vector. Suppose a p-dimensional vector y|\u03b8 \u223c N(\u03b8, I) is observed where\n\u03b8 = (\u03b81, . . . , \u03b8p)\u2032 is of interest. Now we define a shrinkage prior that is obtained by mixing a normal\ndistribution over its scale parameter with the TPB distribution.\nDefinition 2. The TPB normal scale mixture representation for the distribution of random variable\n\u03b8j is given by\n\n\u03b8_j | \u03c1_j \u223c N(0, 1/\u03c1_j \u2212 1),   \u03c1_j \u223c TPB(a, b, \u03c6),   (7)\n\nwhere a > 0, b > 0 and \u03c6 > 0. The resulting marginal distribution on \u03b8j is denoted by\nTPBN(a, b, \u03c6).\nFigure 1 illustrates the density on \u03c1j for varying values of a, b and \u03c6. Note that the special case for\na = b = 1/2 in Figure 1(a) gives the horseshoe prior. Also when a = \u03c6 = 1 and b = 1/2, this\nrepresentation yields the Strawderman-Berger prior. For a fixed value of \u03c6, smaller a values yield\na density on \u03b8j that is more peaked at zero, while smaller values of b yield a density on \u03b8j that is\nheavier tailed. For fixed values of a and b, decreasing \u03c6 shifts the mass of the density on \u03c1j from\nleft to right, suggesting more support for stronger shrinkage. That said, the density assigned in the\nneighborhood of \u03b8j = 0 increases while making the overall density lighter-tailed. We next propose\nthe equivalence of three hierarchical representations revealing a wide class of priors encompassing\nmany of those mentioned earlier.\nProposition 1. 
If \u03b8j \u223c TPBN(a, b, \u03c6), then\n1) \u03b8j \u223c N(0, \u03c4j), \u03c4j \u223c G(a, \u03bbj) and \u03bbj \u223c G(b, \u03c6).\n2) \u03b8j \u223c N(0, \u03c4j), \u03c0(\u03c4j) = [\u0393(a + b) / {\u0393(a)\u0393(b)}] \u03c6^{\u2212a} \u03c4j^{a\u22121} (1 + \u03c4j/\u03c6)^{\u2212(a+b)}, which implies that \u03c4j/\u03c6 \u223c \u03b2\u2032(a, b),\nthe inverted beta distribution with parameters a and b.\n\nThe equivalence given in Proposition 1 is significant as it makes the work in Section 4 possible\nunder the TPB normal scale mixtures as well as further revealing connections among previously\nproposed shrinkage priors. It provides a rich class of priors leading to great flexibility in terms of\nthe induced shrinkage and makes it clear that this new class of priors can be considered simultaneous\nextensions to the work by [11, 12] and [6, 7]. It is worth mentioning that the hierarchical prior(s)\ngiven in Proposition 1 differ from the approach taken by [12] in how we handle the mixing.\nIn particular, the first hierarchy presented in Proposition 1 is identical to the NG prior up to the first\nstage mixing. While fixing the values of a and b, we further mix over \u03bbj (rather than a global \u03bb)\nand further over \u03c6 if desired as will be discussed later. \u03c6 acts as a global shrinkage parameter in\nthe hierarchy. On the other hand, [12] choose to further mix over a and a global \u03bb while fixing the\nvalues of b and \u03c6. By doing so, they forfeit a complete conjugate structure and an explicit control\nover the tail behavior of \u03c0(\u03b8j).\nAs a direct corollary to Proposition 1, we observe a possible equivalence between the SB and the\nNEG priors.\nCorollary 1. If a = 1 in Proposition 1, then TPBN \u2261 NEG. 
If (a, b, \u03c6) = (1, 1/2, 1) in Proposition\n1, then TPBN \u2261 SB \u2261 NEG.\nAn interesting, yet expected, observation on Proposition 1 is that a half-Cauchy prior can be represented\nas a scale mixture of gamma distributions, i.e. if \u03c4j \u223c G(1/2, \u03bbj) and \u03bbj \u223c G(1/2, \u03c6), then\n\u03c4j^{1/2} \u223c C+(0, \u03c6^{1/2}). This makes sense as \u03c4j^{1/2}|\u03bbj has a half-Normal distribution and the mixing\ndistribution on the precision parameter is gamma with shape parameter 1/2.\n[7] further place a half-Cauchy prior on \u03c6^{1/2} to complete the hierarchy. The aforementioned observation\nhelps us formulate the complete hierarchy proposed in [7] in a conjugate manner. This should\nbring analytical and computational advantages as well as making the application of the procedure\nmuch easier for the average user without the need for a relatively more complex sampling scheme.\nCorollary 2. If \u03b8j \u223c N(0, \u03c4j), \u03c4j^{1/2} \u223c C+(0, \u03c6^{1/2}) and \u03c6^{1/2} \u223c C+(0, 1), then \u03b8j \u223c\nTPBN(1/2, 1/2, \u03c6), \u03c6 \u223c G(1/2, \u03c9) and \u03c9 \u223c G(1/2, 1).\nHence disregarding the different treatments of the higher-level hyper-parameters, we have shown\nthat the class of priors given in Definition 1 unites the priors in [16, 11, 12, 6, 7] under one family\nand reveals their close connections through the equivalence of hierarchies given in Proposition 1.\nThe first hierarchy in Proposition 1 makes much of the work possible in the following sections.\n\nFigure 1: (a, b) = {(1/2, 1/2), (1, 1/2), (1, 1), (1/2, 2), (2, 2), (5, 2)} for (a)-(f) respectively. \u03c6 =\n{1/10, 1/9, 1/8, 1/7, 1/6, 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10} considered for all pairs of a\nand b. 
The line corresponding to the lowest value of \u03c6 is drawn with a dashed line.\n\n4 Estimation and Posterior Inference in Regression Models\n\n4.1 Fully Bayes and Approximate Inference\n\nConsider the linear regression model, y = X\u03b2 + \u03b5, where y is an n-dimensional vector of responses,\nX is the n \u00d7 p design matrix and \u03b5 is an n-dimensional vector of independent residuals which are\nnormally distributed, N(0, \u03c3^2 I_n) with variance \u03c3^2.\nWe place the hierarchical prior given in Proposition 1 on each \u03b2j, i.e. \u03b2j \u223c N(0, \u03c3^2 \u03c4j),\n\u03c4j \u223c G(a, \u03bbj), \u03bbj \u223c G(b, \u03c6). \u03c6 is used as a global shrinkage parameter common to all \u03b2j, and may\nbe inferred using the data. Thus we follow the hierarchy by letting \u03c6 \u223c G(1/2, \u03c9), \u03c9 \u223c G(1/2, 1)\nwhich implies \u03c6^{1/2} \u223c C+(0, 1) that is identical to what was used in [7] at this level of the hierarchy.\nHowever, we do not believe at this level in the hierarchy the choice of the prior will have a huge\nimpact on the results. Although treating \u03c6 as unknown may be reasonable, when there exists some\nprior knowledge, it is appropriate to fix a \u03c6 value to reflect our prior belief in terms of underlying\nsparsity of the coefficient vector. This sounds rather natural as soon as one starts seeing \u03c6 as a parameter\nthat governs the multiplicity adjustment as discussed in [7]. Note also that here we form the\ndependence on the error variance at a lower level of hierarchy rather than forming it in the prior of\n\u03c6 as done in [7]. If we let a = b = 1/2, we will have formulated the hierarchical prior given in [7]\nin a completely conjugate manner. We also let \u03c3^{\u22122} \u223c G(c0/2, d0/2). Under a normal likelihood,\nan efficient Gibbs sampler may be obtained as the fully conditional posteriors can be extracted:\n\u03b2 | y, X, \u03c3^2, \u03c41, . . . , \u03c4p \u223c N(\u00b5_\u03b2, V_\u03b2),\n\u03c3^{\u22122} | y, X, \u03b2, \u03c41, . . . , \u03c4p \u223c G(c\u2217, d\u2217),\n\u03c4j | \u03b2j, \u03c3^2, \u03bbj \u223c GIG(a \u2212 1/2, 2\u03bbj, \u03b2j^2/\u03c3^2),\n\u03bbj | \u03c4j, \u03c6 \u223c G(a + b, \u03c4j + \u03c6),\n\u03c6 | \u03bbj, \u03c9 \u223c G(pb + 1/2, \u2211_{j=1}^p \u03bbj + \u03c9),\n\u03c9 | \u03c6 \u223c G(1, \u03c6 + 1),\nwhere \u00b5_\u03b2 = (X\u2032X + T^{\u22121})^{\u22121} X\u2032y, V_\u03b2 = \u03c3^2 (X\u2032X + T^{\u22121})^{\u22121}, c\u2217 = (n + p + c0)/2,\nd\u2217 = {(y \u2212 X\u03b2)\u2032(y \u2212 X\u03b2) + \u03b2\u2032T^{\u22121}\u03b2 + d0}/2, T = diag(\u03c41, . . . , \u03c4p).\n\n
As an alternative to MCMC and Laplace approximations [23], a lower-bound on marginal likelihoods\nmay be obtained via variational methods [17] yielding approximate posterior distributions\non the model parameters. Using a similar approach to [5, 1], the approximate marginal\nposterior distributions of the parameters are given by\n\u03b2 \u223c N(\u00b5_\u03b2, V_\u03b2),\n\u03c3^{\u22122} \u223c G(c\u2217, d\u2217),\n\u03c4j \u223c GIG(a \u2212 1/2, 2\u27e8\u03bbj\u27e9, \u27e8\u03c3^{\u22122}\u27e9\u27e8\u03b2j^2\u27e9),\n\u03bbj \u223c G(a + b, \u27e8\u03c4j\u27e9 + \u27e8\u03c6\u27e9),\n\u03c6 \u223c G(pb + 1/2, \u27e8\u03c9\u27e9 + \u2211_{j=1}^p \u27e8\u03bbj\u27e9),\n\u03c9 \u223c G(1, \u27e8\u03c6\u27e9 + 1),\nwhere \u00b5_\u03b2 = \u27e8\u03b2\u27e9 = (X\u2032X + T^{\u22121})^{\u22121} X\u2032y, V_\u03b2 = \u27e8\u03c3^{\u22122}\u27e9^{\u22121} (X\u2032X + T^{\u22121})^{\u22121},\nT^{\u22121} = diag(\u27e8\u03c41^{\u22121}\u27e9, . . . , \u27e8\u03c4p^{\u22121}\u27e9), c\u2217 = (n + p + c0)/2,\nd\u2217 = (y\u2032y \u2212 2y\u2032X\u27e8\u03b2\u27e9 + \u2211_{i=1}^n x_i\u2032\u27e8\u03b2\u03b2\u2032\u27e9x_i + \u2211_{j=1}^p \u27e8\u03b2j^2\u27e9\u27e8\u03c4j^{\u22121}\u27e9 + d0)/2,\n\u27e8\u03b2\u03b2\u2032\u27e9 = V_\u03b2 + \u27e8\u03b2\u27e9\u27e8\u03b2\u2032\u27e9, \u27e8\u03c3^{\u22122}\u27e9 = c\u2217/d\u2217, \u27e8\u03bbj\u27e9 = (a + b)/(\u27e8\u03c4j\u27e9 + \u27e8\u03c6\u27e9),\n\u27e8\u03c6\u27e9 = (pb + 1/2)/(\u27e8\u03c9\u27e9 + \u2211_{j=1}^p \u27e8\u03bbj\u27e9), \u27e8\u03c9\u27e9 = 1/(\u27e8\u03c6\u27e9 + 1) and\n\n\u27e8\u03c4j\u27e9 = (\u27e8\u03c3^{\u22122}\u27e9\u27e8\u03b2j^2\u27e9)^{1/2} K_{a+1/2}{(2\u27e8\u03bbj\u27e9\u27e8\u03c3^{\u22122}\u27e9\u27e8\u03b2j^2\u27e9)^{1/2}} / [(2\u27e8\u03bbj\u27e9)^{1/2} K_{a\u22121/2}{(2\u27e8\u03bbj\u27e9\u27e8\u03c3^{\u22122}\u27e9\u27e8\u03b2j^2\u27e9)^{1/2}}],\n\n\u27e8\u03c4j^{\u22121}\u27e9 = (2\u27e8\u03bbj\u27e9)^{1/2} K_{3/2\u2212a}{(2\u27e8\u03bbj\u27e9\u27e8\u03c3^{\u22122}\u27e9\u27e8\u03b2j^2\u27e9)^{1/2}} / [(\u27e8\u03c3^{\u22122}\u27e9\u27e8\u03b2j^2\u27e9)^{1/2} K_{1/2\u2212a}{(2\u27e8\u03bbj\u27e9\u27e8\u03c3^{\u22122}\u27e9\u27e8\u03b2j^2\u27e9)^{1/2}}].\n\n
This procedure consists of initializing the moments and iterating through them until some convergence\ncriterion is reached. 
The deterministic nature of these approximations makes them attractive\nas a quick alternative to MCMC.\nThis conjugate modeling approach we have taken allows for a very straightforward implementation\nof Strawderman-Berger and horseshoe priors or, more generally, TPB normal scale mixture priors in\nregression models without the need for a more sophisticated sampling scheme, which may ultimately\nattract a wider audience towards the use of these more flexible and carefully defined normal scale\nmixture priors.\n\n4.2 Sparse Maximum a Posteriori Estimation\n\nAlthough not our main focus, many readers are interested in sparse solutions, hence we give the\nfollowing brief discussion. Given a, b and \u03c6, maximum a posteriori (MAP) estimation is rather\nstraightforward via a simple expectation-maximization (EM) procedure. This is accomplished in a\nsimilar manner to [8] by obtaining the joint MAP estimates of the error variance and the regression\ncoefficients having taken the expectation with respect to the conditional posterior distribution of \u03c4j^{\u22121}\nusing the second hierarchy given in Proposition 1. The kth expectation step then would consist of\ncalculating\n\n\u27e8\u03c4j^{\u22121}\u27e9^{(k)} = [\u222b_0^\u221e \u03c4j^{a\u22121/2} (1 + \u03c4j/\u03c6)^{\u2212(a+b)} exp{\u2212\u03b2j^{2(k\u22121)} / (2\u03c3^2_{(k\u22121)} \u03c4j)} d\u03c4j^{\u22121}] / [\u222b_0^\u221e \u03c4j^{1/2+a} (1 + \u03c4j/\u03c6)^{\u2212(a+b)} exp{\u2212\u03b2j^{2(k\u22121)} / (2\u03c3^2_{(k\u22121)} \u03c4j)} d\u03c4j^{\u22121}],   (8)\n\nwhere \u03b2j^{2(k\u22121)} and \u03c3^2_{(k\u22121)} denote the modal estimates of the jth component of \u03b2 and the error\nvariance \u03c3^2 at iteration (k \u2212 1). The solution to (8) may be expressed in terms of some special\nfunction(s) for changing values of a, b and \u03c6. b < 1 is a good choice as it will keep the tails of the\nmarginal density on \u03b2j heavy. 
A careful choice of a, on the other hand, is essential to sparse estimation.\nAdmissible values of a for sparse estimation are apparent from the representation in Definition\n2, noting that for any a > 1, \u03c0(\u03c1j = 1) = 0, i.e. \u03b2j may never be shrunk exactly to zero. Hence for\nsparse estimation, it is essential that 0 < a \u2264 1. Figure 2 (a) and (b) give the prior densities on \u03c1j\nfor b = 1/2, \u03c6 = 1 and a = {1/2, 1, 3/2} and the resulting marginal prior densities on \u03b2j. These\nmarginal densities are given by\n\n\u03c0(\u03b2j) =\n  {1/(\u221a2 \u03c0^{3/2})} e^{\u03b2j^2/2} \u0393(0, \u03b2j^2/2),   a = 1/2,\n  1/\u221a(2\u03c0) \u2212 (|\u03b2j|/2) e^{\u03b2j^2/2} + (\u03b2j/2) e^{\u03b2j^2/2} Erf(\u03b2j/\u221a2),   a = 1,\n  (\u221a2/\u03c0^{3/2}) {1 \u2212 (1/2) e^{\u03b2j^2/2} \u03b2j^2 \u0393(0, \u03b2j^2/2)},   a = 3/2,\n\nwhere Erf(.) denotes the error function and \u0393(s, z) = \u222b_z^\u221e t^{s\u22121} e^{\u2212t} dt is the incomplete gamma\nfunction. Figure 2 clearly illustrates that while all three cases have very similar tail behavior, their\nbehavior around the origin differs drastically.\n\nFigure 2: Prior densities of (a) \u03c1j and (b) \u03b2j for a = 1/2 (solid), a = 1 (dashed) and a = 3/2 (long\ndash).\n\n5 Experiments\n\nThroughout this section we use the Jeffreys\u2019 prior on the error precision by setting c0 = d0 = 0. We\ngenerate data for two cases, (n, p) = {(50, 20), (250, 100)}, from y_i = x_i\u2032\u03b2\u2217 + \u03b5_i, for i = 1, . . . , n\nwhere \u03b2\u2217 is a p-vector that on average contains 20q non-zero elements which are indexed by the\nset A = {j : \u03b2\u2217_j \u2260 0} for some random q \u2208 (0, 1). 
We randomize the procedure in the following\nmanner: (i) C \u223c W(p, I_{p\u00d7p}), (ii) x_i \u223c N(0, C), (iii) q \u223c B(1, 1) for the first and q \u223c B(1, 4) for\nthe second cases, (iv) I(j \u2208 A) \u223c Bernoulli(q) for j = 1, . . . , p where I(.) denotes the indicator\nfunction, (v) for j \u2208 A, \u03b2j \u223c U(0, 6) and for j \u2209 A, \u03b2j = 0 and finally (vi) \u03b5_i \u223c N(0, \u03c3^2) where\n\u03c3 \u223c U(0, 6). We generated 1000 data sets for each case resulting in median signal-to-noise ratios\nof approximately 3.3 and 4.5, respectively. We obtain the estimate of the regression coefficients, \u03b2\u0302, using the\nvariational Bayes procedure and measure the performance by model error which is calculated as\n(\u03b2\u2217 \u2212 \u03b2\u0302)\u2032C(\u03b2\u2217 \u2212 \u03b2\u0302). Figure 3(a) and (b) display the median relative model error (RME) values\n(with their distributions obtained via bootstrapping) which is obtained by dividing the model error\nobserved from our procedures by that of \u21131 regularization (lasso) tuned by 10-fold cross-validation.\nThe boxplots in Figure 3(a) and (b) correspond to different (a, b, \u03c6) values where C+ signifies that \u03c6\nis treated as unknown with a half-Cauchy prior as given earlier in Section 4.1. It is worth mentioning\nthat we attain a clearly superior performance compared to the lasso, particularly in the second case,\ndespite the fact that the estimator resulting from the variational Bayes procedure is not a thresholding\nrule. Note that the choice b = 1 leads to much better performance under Case 2 than Case 1. This is\ndue to the fact that Case 2 involves a much sparser underlying setup on average than Case 1 and that\nthe lighter tails attained by setting b = 1 lead to stronger shrinkage.\nTo give a high dimensional example, we also generate a data set from the model y_i = x_i\u2032\u03b2\u2217 + \u03b5_i,\nfor i = 1, . . . , 100, where \u03b2\u2217 is a 10000-dimensional very sparse vector with 10 randomly chosen\ncomponents set to be 3, \u03b5_i \u223c N(0, 3^2) and x_ij \u223c N(0, 1) for j = 1, . . . , p. This \u03b2\u2217 choice leads\nto a signal-to-noise ratio of 3.16. For the particular data set we generated, the randomly chosen\ncomponents of \u03b2\u2217 to be non-zero were indexed by 1263, 2199, 2421, 4809, 5530, 7483, 7638, 7741,\n7891 and 8187. We set (a, b, \u03c6) = (1, 1/2, 10^{\u22124}) which implies that a priori P(\u03c1j > 0.5) = 0.99\nplacing much more density in the neighborhood of \u03c1j = 1 (total shrinkage). This choice is due to the\nfact that n/p = 0.01 and to roughly reflect that we do not want any more than 100 predictors in the\nresulting model. Hence \u03c6 is used, a priori, to limit the number of predictors in the model in relation\nto the sample size. Also note that with a = 1, the conditional posterior distribution of \u03c4j^{\u22121} is reduced\nto an inverse Gaussian. Since we are adjusting the global shrinkage parameter, \u03c6, a priori, and it is\nchosen such that P(\u03c1j > 0.5) = 0.99, whether a = 1/2 or a = 1 should not matter. We first run\nthe Gibbs sampler for 100000 iterations (2.4 hours on a computer with a 2.8 GHz CPU and 12 Gb\nof RAM using Matlab), discard the first 20000, thin the rest by picking every 5th sample to obtain\nthe posteriors of the parameters. We observed that the chain converged by the 10000th iteration.\nFor comparison purposes, we also ran the variational Bayes procedure using the values from the\nconverged chain as the initial points (80 seconds). Figure 4 gives the posterior means attained by\nsampling and the variational approximation. 
The estimates corresponding to the zero elements of\n\u03b2\u2217 are plotted with smaller shapes to prevent clutter. We see that in both cases the procedure is able\nto pick up the larger signals and shrink a large portion of the rest towards zero. The\napproximate inference results are in accordance with the results from the Gibbs sampler. It should\nbe noted that using a well-informed guess for \u03c6, rather than treating it as an unknown in this high\ndimensional setting, improves the performance drastically.\n\nFigure 3: Relative ME at different (a, b, \u03c6) values for (a) Case 1 and (b) Case 2.\n\nFigure 4: Posterior mean of \u03b2 by sampling (square) and by approximate inference (circle).\n\n6 Discussion\n\nWe conclude that the proposed hierarchical prior formulation constitutes a useful encompassing\nframework for understanding the behavior of different scale mixtures of normals and connecting\nthem under a broader family of hierarchical priors. While \u21131 regularization (the lasso),\nwhich arises from a double exponential prior in the Bayesian framework, yields certain computational\nadvantages, it demonstrates much inferior estimation performance relative to the more carefully\nformulated scale mixtures of normals. The proposed equivalence of the hierarchies in Proposition 1\nmakes computation much easier for the TPB scale mixtures of normals. As for the\nchoice of hyper-parameters, we recommend that a \u2208 (0, 1] and b \u2208 (0, 1); in particular\n(a, b) \u2208 {(1/2, 1/2), (1, 1/2)}. These choices guarantee that the resulting prior has a kink at\nzero, which is essential for sparse estimation, and lead to heavy tails to avoid unnecessary bias\nin large signals (recall that a choice of b = 1/2 will yield Cauchy-like tails). 
In problems where\noracle knowledge on sparsity exists or when p \u226b n, we recommend that \u03c6 be fixed at a reasonable\nvalue to reflect an appropriate sparsity constraint as mentioned in Section 5.\n\nAcknowledgments\n\nThis work was supported by Award Number R01ES017436 from the National Institute of Environmental\nHealth Sciences. The content is solely the responsibility of the authors and does not\nnecessarily represent the official views of the National Institute of Environmental Health Sciences\nor the National Institutes of Health.\n\nReferences\n\n[1] A. Armagan. Variational bridge regression. JMLR: W&CP, 5:17\u201324, 2009.\n\n[2] A. Armagan, D. B. Dunson, and J. Lee. Generalized double Pareto shrinkage.\narXiv:1104.0861v2, 2011.\n\n[3] C. Armero and M. J. Bayarri. Prior assessments for prediction in queues. The Statistician,\n43(1):139\u2013153, 1994.\n\n[4] J. Berger. A robust generalized Bayes estimator and confidence region for a multivariate normal\nmean. The Annals of Statistics, 8(4):716\u2013761, 1980.\n\n[5] C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In UAI \u201900: Proceedings\nof the 16th Conference on Uncertainty in Artificial Intelligence, pages 46\u201353, San\nFrancisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.\n\n[6] C. M. Carvalho, N. G. Polson, and J. G. Scott. Handling sparsity via the horseshoe. JMLR:\nW&CP, 5, 2009.\n\n[7] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals.\nBiometrika, 97(2):465\u2013480, 2010.\n\n
[8] M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on\nPattern Analysis and Machine Intelligence, 25:1150\u20131159, 2003.\n\n[9] E. I. George and R. E. McCulloch. Variable selection via Gibbs sampling. Journal of the\nAmerican Statistical Association, 88, 1993.\n\n[10] M. Gordy. A generalization of generalized beta distributions. Finance and Economics Discussion\nSeries 1998-18, Board of Governors of the Federal Reserve System (U.S.), 1998.\n\n[11] J. E. Griffin and P. J. Brown. Bayesian adaptive lassos with non-convex penalization. Technical\nReport, 2007.\n\n[12] J. E. Griffin and P. J. Brown. Inference with normal-gamma prior distributions in regression\nproblems. Bayesian Analysis, 5(1):171\u2013188, 2010.\n\n[13] C. Hans. Bayesian lasso regression. Biometrika, 96:835\u2013845, 2009.\n\n[14] C. J. Hoggart, J. C. Whittaker, M. De Iorio, and D. J. Balding. Simultaneous analysis of all\nSNPs in genome-wide and re-sequencing association studies. PLoS Genetics, 4(7), 2008.\n\n[15] H. Ishwaran and J. S. Rao. Spike and slab variable selection: Frequentist and Bayesian strategies.\nThe Annals of Statistics, 33(2):730\u2013773, 2005.\n\n[16] I. M. Johnstone and B. W. Silverman. Needles and straw in haystacks: Empirical Bayes estimates\nof possibly sparse sequences. Annals of Statistics, 32(4):1594\u20131649, 2004.\n\n[17] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational\nmethods for graphical models. MIT Press, Cambridge, MA, USA, 1999.\n\n[18] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal\nof the American Statistical Association, 83(404):1023\u20131032, 1988.\n\n[19] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association,\n103:681\u2013686, 2008.\n\n[20] N. G. Polson and J. G. Scott. Alternative global-local shrinkage rules using hypergeometric-\nbeta mixtures. 
Discussion Paper 2009-14, Department of Statistical Science, Duke University,\n2009.\n\n[21] W. E. Strawderman. Proper Bayes minimax estimators of the multivariate normal mean. The\nAnnals of Mathematical Statistics, 42(1):385\u2013388, 1971.\n\n[22] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical\nSociety. Series B (Methodological), 58(1):267\u2013288, 1996.\n\n[23] L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal\ndensities. Journal of the American Statistical Association, 81(393):82\u201386, 1986.\n\n[24] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine\nLearning Research, 1, 2001.\n", "award": [], "sourceid": 374, "authors": [{"given_name": "Artin", "family_name": "Armagan", "institution": null}, {"given_name": "Merlise", "family_name": "Clyde", "institution": null}, {"given_name": "David", "family_name": "Dunson", "institution": null}]}