{"title": "Bayesian Inference for Structured Spike and Slab Priors", "book": "Advances in Neural Information Processing Systems", "page_first": 1745, "page_last": 1753, "abstract": "Sparse signal recovery addresses the problem of solving underdetermined linear inverse problems subject to a sparsity constraint. We propose a novel prior formulation, the structured spike and slab prior, which allows to incorporate a priori knowledge of the sparsity pattern by imposing a spatial Gaussian process on the spike and slab probabilities. Thus, prior information on the structure of the sparsity pattern can be encoded using generic covariance functions. Furthermore, we provide a Bayesian inference scheme for the proposed model based on the expectation propagation framework. Using numerical experiments on synthetic data, we demonstrate the benefits of the model.", "full_text": "Bayesian Inference for Structured Spike and\n\nSlab Priors\n\nMichael Riis Andersen, Ole Winther & Lars Kai Hansen\n\nDTU Compute, Technical University of Denmark\n\nDK-2800 Kgs. Lyngby, Denmark\n\n{miri, olwi, lkh}@dtu.dk\n\nAbstract\n\nSparse signal recovery addresses the problem of solving underdetermined\nlinear inverse problems subject to a sparsity constraint. We propose a novel\nprior formulation, the structured spike and slab prior, which allows to in-\ncorporate a priori knowledge of the sparsity pattern by imposing a spatial\nGaussian process on the spike and slab probabilities. Thus, prior informa-\ntion on the structure of the sparsity pattern can be encoded using generic\ncovariance functions. 
Furthermore, we provide a Bayesian inference scheme for the proposed model based on the expectation propagation framework. Using numerical experiments on synthetic data, we demonstrate the benefits of the model.\n\n1 Introduction\n\nConsider a linear inverse problem of the form:\n\ny = Ax + e,   (1)\n\nwhere A \in R^{N \times D} is the forward matrix, y \in R^N is the measurement vector, x \in R^D is the desired solution and e \in R^N is a vector of corruptive noise. The field of sparse signal recovery deals with the task of reconstructing the sparse solution x from (A, y) in the ill-posed regime where N < D. In many applications it is beneficial to encourage a structured sparsity pattern rather than independent sparsity. In this paper we consider a model for exploiting a priori information on the sparsity pattern, which has applications in many different fields, e.g., structured sparse PCA [1], background subtraction [2] and neuroimaging [3].\n\nIn the framework of probabilistic modelling, sparsity can be enforced using so-called sparsity promoting priors, which conventionally have the following form:\n\np(x \mid \lambda) = \prod_{i=1}^{D} p(x_i \mid \lambda),   (2)\n\nwhere p(x_i \mid \lambda) is the marginal prior on x_i and \lambda is a fixed hyperparameter controlling the degree of sparsity. Examples of such sparsity promoting priors include the Laplace prior (LASSO [4]) and the Bernoulli-Gaussian prior (the spike and slab model [5]). The main advantage of this formulation is that the inference schemes become relatively simple due to the fact that the prior factorizes over the variables x_i. 
However, this fact also implies that the models cannot encode any prior knowledge of the structure of the sparsity pattern.\n\nOne approach to modelling a richer sparsity structure is the so-called group sparsity approach, where the set of variables x has been partitioned into groups beforehand. This approach has been extensively developed within the \ell_1 minimization community, e.g., group LASSO, sparse group LASSO [6] and graph LASSO [7]. Suppose the set of variables is partitioned into G groups. A Bayesian equivalent of group sparsity is the group spike and slab model [8], which takes the form\n\np(x \mid z) = \prod_{g=1}^{G} \left[ (1 - z_g) \delta(x_g) + z_g N(x_g \mid 0, \tau I_g) \right],   p(z \mid \lambda) = \prod_{g=1}^{G} Bernoulli(z_g \mid \lambda_g),   (3)\n\nwhere z \in \{0, 1\}^G are binary support variables indicating whether the variables in the different groups are active or not. Other relevant work includes [9, 10, 11] and [12]. Another, more flexible, approach is to use a Markov random field (MRF) as a prior for the binary variables [2].\n\nRelated to the MRF formulation, we propose a novel model called the structured spike and slab model. This model allows us to encode a priori information about the sparsity pattern using generic covariance functions rather than through clique potentials as in the MRF formulation [2]. Furthermore, we provide a Bayesian inference scheme based on expectation propagation for the proposed model.\n\n2 The structured spike and slab prior\n\nWe propose a hierarchical prior of the following form:\n\np(x \mid \gamma) = \prod_{i=1}^{D} p(x_i \mid g(\gamma_i)),   p(\gamma) = N(\gamma \mid \mu_0, \Sigma_0),   (4)\n\nwhere g : R \to R is a suitable injective transformation. 
That is, we impose a Gaussian process [13] as a prior on the parameters \gamma_i. Using this parametrization, prior knowledge of the structure of the sparsity pattern can be encoded using \mu_0 and \Sigma_0: the mean value \mu_0 controls the prior belief about the support, and the covariance matrix determines the prior correlation of the support. In the remainder of this paper we restrict p(x_i \mid g(\gamma_i)) to be a spike and slab model, i.e.\n\np(x_i \mid z_i) = (1 - z_i)\delta(x_i) + z_i N(x_i \mid 0, \tau_0),   z_i \sim Ber(g(\gamma_i)).   (5)\n\nThis formulation clearly fits into eq. (4) when z_i is marginalized out. Furthermore, we will assume that g is the standard normal CDF, i.e. g(x) = \phi(x). Using this formulation, the marginal prior probability of the i'th weight being active is given by:\n\np(z_i = 1) = \int p(z_i = 1 \mid \gamma_i) p(\gamma_i) \, d\gamma_i = \int \phi(\gamma_i) N(\gamma_i \mid \mu_i, \Sigma_{ii}) \, d\gamma_i = \phi\!\left( \frac{\mu_i}{\sqrt{1 + \Sigma_{ii}}} \right).   (6)\n\nThis implies that the probability of z_i = 1 is 0.5 when \mu_i = 0, as expected. In contrast to the \ell_1-based methods and the MRF priors, the Gaussian process formulation makes it easy to generate samples from the model. Figures 1(a) and 1(b) each show three realizations of the support from the prior using a squared exponential kernel of the form \Sigma_{ij} = 50 \exp(-(i - j)^2 / 2s^2), with \mu_i fixed such that the expected level of sparsity is 10%. 
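As a concrete illustration, a draw of the support z from this prior can be sketched in a few lines (a minimal sketch, assuming NumPy; the constant mean is chosen so that the marginal activation probability in eq. (6) equals 10%, and the kernel parameters follow the text above):

```python
import numpy as np
from math import erf

def ncdf(t):
    """Standard normal CDF phi(t), applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(t) / np.sqrt(2.0)))

def sample_support(mu0, Sigma0, rng):
    """One draw of z: gamma ~ N(mu0, Sigma0), z_i ~ Bernoulli(phi(gamma_i))."""
    gamma = rng.multivariate_normal(mu0, Sigma0)
    return (rng.random(len(mu0)) < ncdf(gamma)).astype(int)

D, s = 250, 5.0
idx = np.arange(D)
# squared exponential kernel Sigma_ij = 50 * exp(-(i - j)^2 / (2 s^2))
Sigma0 = 50.0 * np.exp(-(idx[:, None] - idx[None, :]) ** 2 / (2.0 * s ** 2))
# pick a constant mu_i so that phi(mu_i / sqrt(1 + Sigma_ii)) = 0.10, cf. eq. (6)
mu0 = np.full(D, -1.2816 * np.sqrt(1.0 + 50.0))
rng = np.random.default_rng(0)
z = sample_support(mu0, Sigma0, rng)
```

Larger scales s yield the same expected sparsity but increasingly contiguous supports, which is exactly the effect shown in Figures 1(a) and 1(b).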
It is seen that when the scale, s, is small, the support consists of scattered spikes. As the scale increases, the support of the signals becomes more contiguous and clustered, and the sizes of the clusters increase with the scale.\n\nFigure 1: (a, b) Realizations of the support z from the prior distribution using a squared exponential covariance function for \gamma, i.e. \Sigma_{ij} = 50 \exp(-(i - j)^2 / 2s^2), with \mu fixed to match an expected sparsity rate K/D of 10%; panel (a) uses scale s = 0.1 and panel (b) uses s = 5. (c) Correlation of z_1 and z_2 as a function of \rho for different values of \kappa, obtained by sampling. The prior mean is fixed at \mu_i = 0 for all i.\n\nTo gain insight into the relationship between \gamma and z, we consider the two-dimensional system with \mu_i = 0 and the following covariance structure:\n\n\Sigma_0 = \kappa \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix},   \kappa > 0.   (7)\n\nThe correlation between z_1 and z_2 is then computed as a function of \rho and \kappa by sampling. The resulting curves in Figure 1(c) show that the desired correlation is an increasing function of \rho, as expected. However, the figure also reveals that \rho = 1, i.e. 100% correlation between the \gamma parameters, does not imply 100% correlation of the support variables z. This is due to the fact that there are two levels of uncertainty in the prior distribution of the support. 
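The Monte Carlo estimate behind Figure 1(c) can be reproduced with a short sketch (assuming NumPy; the sample size and parameter values are illustrative):

```python
import numpy as np
from math import erf

def ncdf(t):
    """Standard normal CDF phi(t), elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(t) / np.sqrt(2.0)))

def support_correlation(rho, kappa, n=100_000, seed=0):
    """Monte Carlo estimate of corr(z1, z2) under eq. (7):
    gamma ~ N(0, kappa * [[1, rho], [rho, 1]]), z_i ~ Ber(phi(gamma_i))."""
    rng = np.random.default_rng(seed)
    cov = kappa * np.array([[1.0, rho], [rho, 1.0]])
    gamma = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    z = (rng.random((n, 2)) < ncdf(gamma)).astype(float)
    return np.corrcoef(z[:, 0], z[:, 1])[0, 1]

c_weak = support_correlation(0.1, kappa=10.0)
c_strong = support_correlation(1.0, kappa=10.0)
```

Even at rho = 1 the estimated support correlation stays strictly below one, which is the two-levels-of-uncertainty effect discussed above.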
That is, first we sample \gamma, and then we sample the support z conditioned on \gamma.\n\nThe proposed prior formulation extends easily to the multiple measurement vector (MMV) formulation [14, 15, 16], in which multiple linear inverse problems are solved simultaneously. The most straightforward way is to assume that all problem instances share the same support variable, commonly known as joint sparsity [16]:\n\np(X \mid z) = \prod_{t=1}^{T} \prod_{i=1}^{D} \left[ (1 - z_i)\delta(x_i^t) + z_i N(x_i^t \mid 0, \tau) \right],   (8)\n\np(z_i \mid \gamma_i) = Ber(z_i \mid \phi(\gamma_i)),   (9)\n\np(\gamma) = N(\gamma \mid \mu_0, \Sigma_0),   (10)\n\nwhere X = [x^1 ... x^T] \in R^{D \times T}. The model can also be extended to problems where the sparsity pattern changes in time:\n\np(X \mid z^1, ..., z^T) = \prod_{t=1}^{T} \prod_{i=1}^{D} \left[ (1 - z_i^t)\delta(x_i^t) + z_i^t N(x_i^t \mid 0, \tau) \right],   (11)\n\np(z_i^t \mid \gamma_i^t) = Ber(z_i^t \mid \phi(\gamma_i^t)),   (12)\n\np(\gamma^1, ..., \gamma^T) = N(\gamma^1 \mid \mu_0, \Sigma_0) \prod_{t=2}^{T} N(\gamma^t \mid (1 - \alpha)\mu_0 + \alpha\gamma^{t-1}, \beta\Sigma_0),   (13)\n\nwhere the parameters 0 \le \alpha \le 1 and \beta \ge 0 control the temporal dynamics of the support.\n\n3 Bayesian inference using expectation propagation\n\nIn this section we combine the structured spike and slab prior as given in eq. (5) with an isotropic Gaussian noise model and derive an inference algorithm based on expectation propagation. 
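Before turning to inference, the time-varying prior in eqs. (11)-(13) can be illustrated with a short sampler (a sketch, assuming NumPy; the dimensions, kernel parameters and mean value are illustrative):

```python
import numpy as np
from math import erf

def ncdf(t):
    """Standard normal CDF phi(t), elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(t) / np.sqrt(2.0)))

def sample_dynamic_support(mu0, Sigma0, alpha, beta, T, rng):
    """Sample z^1..z^T from eqs. (12)-(13): an AR(1)-style Gaussian process chain
    gamma^t ~ N((1 - alpha) * mu0 + alpha * gamma^{t-1}, beta * Sigma0)."""
    D = len(mu0)
    Z = np.empty((T, D), dtype=int)
    gamma = rng.multivariate_normal(mu0, Sigma0)
    for t in range(T):
        Z[t] = rng.random(D) < ncdf(gamma)
        gamma = rng.multivariate_normal((1.0 - alpha) * mu0 + alpha * gamma,
                                        beta * Sigma0)
    return Z

D = 60
idx = np.arange(D)
Sigma0 = 50.0 * np.exp(-(idx[:, None] - idx[None, :]) ** 2 / (2.0 * 5.0 ** 2))
mu0 = np.full(D, -9.0)  # negative mean -> sparse support (illustrative value)
Z = sample_dynamic_support(mu0, Sigma0, alpha=0.95, beta=0.1, T=10,
                           rng=np.random.default_rng(1))
```

With alpha close to one and a small beta the sampled supports change slowly from one time step to the next, as intended by eq. (13).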
The likelihood function is p(y \mid x) = N(y \mid Ax, \sigma_0^2 I), and the joint posterior distribution of interest thus becomes\n\np(x, z, \gamma \mid y) = \frac{1}{Z} p(y \mid x) p(x \mid z) p(z \mid \gamma) p(\gamma) = \frac{1}{Z} \underbrace{N(y \mid Ax, \sigma_0^2 I)}_{f_1} \underbrace{\prod_{i=1}^{D} \left[ (1 - z_i)\delta(x_i) + z_i N(x_i \mid 0, \tau_0) \right]}_{f_2} \underbrace{\prod_{i=1}^{D} Ber(z_i \mid \phi(\gamma_i))}_{f_3} \underbrace{N(\gamma \mid \mu_0, \Sigma_0)}_{f_4},   (14)\n\nwhere Z is the normalization constant independent of x, z and \gamma. Unfortunately, the true posterior is intractable and therefore we have to settle for an approximation. In particular, we apply the framework of expectation propagation (EP) [17, 18], which is an iterative deterministic framework for approximating probability distributions using distributions from the exponential family. The algorithm proposed here can be seen as an extension of the work in [8].\n\nAs shown in eq. (14), the true posterior is a composition of 4 factors, i.e. 
f_a for a = 1, ..., 4. The terms f_2 and f_3 are further decomposed into D conditionally independent factors:\n\nf_2(x, z) = \prod_{i=1}^{D} f_{2,i}(x_i, z_i) = \prod_{i=1}^{D} \left[ (1 - z_i)\delta(x_i) + z_i N(x_i \mid 0, \tau_0) \right],   (15)\n\nf_3(z, \gamma) = \prod_{i=1}^{D} f_{3,i}(z_i, \gamma_i) = \prod_{i=1}^{D} Ber(z_i \mid \phi(\gamma_i)).   (16)\n\nThe idea is then to approximate each term in the true posterior density, i.e. f_a, by simpler terms \tilde{f}_a for a = 1, ..., 4. The resulting approximation Q(x, z, \gamma) then becomes\n\nQ(x, z, \gamma) = \frac{1}{Z_{EP}} \prod_{a=1}^{4} \tilde{f}_a(x, z, \gamma).   (17)\n\nThe terms \tilde{f}_1 and \tilde{f}_4 can be computed exactly. In fact, \tilde{f}_4 is simply equal to the prior over \gamma, and \tilde{f}_1 is a multivariate Gaussian distribution with mean \tilde{m}_1 and covariance matrix \tilde{V}_1 determined by \tilde{V}_1^{-1} \tilde{m}_1 = \sigma^{-2} A^T y and \tilde{V}_1^{-1} = \sigma^{-2} A^T A. Therefore, we only have to approximate the factors \tilde{f}_2 and \tilde{f}_3 using EP. Note that the exact term f_1 is a distribution over y conditioned on x, whereas the approximate term \tilde{f}_1 is a function of x that depends on y through \tilde{m}_1 and \tilde{V}_1, etc. In order to take full advantage of the structure of the true posterior distribution, we will further assume that the terms \tilde{f}_2 and \tilde{f}_3 also decompose into D independent factors.\n\nThe EP scheme provides great flexibility in the choice of the approximating factors. This choice is a trade-off between analytical tractability and sufficient flexibility for capturing the important characteristics of the true density. 
Due to the product over the binary support variables z_i for i = 1, ..., D, the true density is highly multimodal. Moreover, f_2 couples the variables x and z, while f_3 couples the variables z and \gamma. Based on these observations, we choose \tilde{f}_2 and \tilde{f}_3 to have the following forms:\n\n\tilde{f}_2(x, z) \propto \prod_{i=1}^{D} N(x_i \mid \tilde{m}_{2,i}, \tilde{v}_{2,i}) \, Ber(z_i \mid \phi(\tilde{\gamma}_{2,i})) = N(x \mid \tilde{m}_2, \tilde{V}_2) \prod_{i=1}^{D} Ber(z_i \mid \phi(\tilde{\gamma}_{2,i})),\n\n\tilde{f}_3(z, \gamma) \propto \prod_{i=1}^{D} N(\gamma_i \mid \tilde{\mu}_{3,i}, \tilde{\sigma}_{3,i}) \, Ber(z_i \mid \phi(\tilde{\gamma}_{3,i})) = N(\gamma \mid \tilde{\mu}_3, \tilde{\Sigma}_3) \prod_{i=1}^{D} Ber(z_i \mid \phi(\tilde{\gamma}_{3,i})),   (18)\n\nwhere \tilde{m}_2 = [\tilde{m}_{2,1}, ..., \tilde{m}_{2,D}]^T, \tilde{V}_2 = diag(\tilde{v}_{2,1}, ..., \tilde{v}_{2,D}), and analogously for \tilde{\mu}_3 and \tilde{\Sigma}_3. These choices lead to a joint variational approximation Q(x, z, \gamma) of the form\n\nQ(x, z, \gamma) = N(x \mid \tilde{m}, \tilde{V}) \prod_{i=1}^{D} Ber(z_i \mid \phi(\tilde{\gamma}_i)) \, N(\gamma \mid \tilde{\mu}, \tilde{\Sigma}),\n\nwhere the joint parameters are given by\n\n\tilde{V} = \left( \tilde{V}_1^{-1} + \tilde{V}_2^{-1} \right)^{-1},   \tilde{m} = \tilde{V} \left( \tilde{V}_1^{-1} \tilde{m}_1 + \tilde{V}_2^{-1} \tilde{m}_2 \right),   (19)\n\n\tilde{\Sigma} = \left( \tilde{\Sigma}_3^{-1} + \tilde{\Sigma}_4^{-1} \right)^{-1},   \tilde{\mu} = \tilde{\Sigma} \left( \tilde{\Sigma}_3^{-1} \tilde{\mu}_3 + \tilde{\Sigma}_4^{-1} \tilde{\mu}_4 \right),   (20)\n\n\tilde{\gamma}_j = \phi^{-1}\!\left( \left[ \frac{(1 - \phi(\tilde{\gamma}_{2,j}))(1 - \phi(\tilde{\gamma}_{3,j}))}{\phi(\tilde{\gamma}_{2,j}) \phi(\tilde{\gamma}_{3,j})} + 1 \right]^{-1} \right)   \forall j \in \{1, ..., D\},   (21)\n\nwhere \phi^{-1}(x) is the probit function. The function in eq. (21) amounts to computing the product of two Bernoulli densities parametrized using \phi(\cdot).\n\nFigure 2: Proposed algorithm for approximating the joint posterior distribution over x, z and \gamma.\n\n• Initialize approximation terms \tilde{f}_a for a = 1, 2, 3, 4 and Q\n• Repeat until stopping criterion:\n– For each \tilde{f}_{2,i}:\n∗ Compute the cavity distribution: Q^{\setminus 2,i} \propto Q / \tilde{f}_{2,i}\n∗ Minimize KL(f_{2,i} Q^{\setminus 2,i} || Q_{2,new}) w.r.t. Q_{2,new}\n∗ Compute \tilde{f}_{2,i} \propto Q_{2,new} / Q^{\setminus 2,i} to update the parameters \tilde{m}_{2,i}, \tilde{v}_{2,i} and \tilde{\gamma}_{2,i}\n– Update the joint approximation parameters \tilde{m}, \tilde{V} and \tilde{\gamma}\n– For each \tilde{f}_{3,i}:\n∗ Compute the cavity distribution: Q^{\setminus 3,i} \propto Q / \tilde{f}_{3,i}\n∗ Minimize KL(f_{3,i} Q^{\setminus 3,i} || Q_{3,new}) w.r.t. Q_{3,new}\n∗ Compute \tilde{f}_{3,i} \propto Q_{3,new} / Q^{\setminus 3,i} to update the parameters \tilde{\mu}_{3,i}, \tilde{\sigma}_{3,i} and \tilde{\gamma}_{3,i}\n– Update the joint approximation parameters \tilde{\mu}, \tilde{\Sigma} and \tilde{\gamma}\n\n3.1 The EP algorithm\n\nConsider the update of the term \tilde{f}_{a,i} for a given a and a given i, where \tilde{f}_a = \prod_i \tilde{f}_{a,i}. This update is performed by first removing the contribution of \tilde{f}_{a,i} from the joint approximation by forming the so-called cavity distribution\n\nQ^{\setminus a,i} \propto Q / \tilde{f}_{a,i},   (22)\n\nfollowed by minimization of the Kullback-Leibler [19] divergence between f_{a,i} Q^{\setminus a,i} and Q_{a,new} w.r.t. Q_{a,new}. For distributions within the exponential family, minimizing this form of KL divergence amounts to matching moments between f_{a,i} Q^{\setminus a,i} and Q_{a,new} [17]. 
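The moment matching above has closed forms for this model: the required zeroth, first and second moments of tilted densities of the form \phi(\gamma) N(\gamma \mid \mu, \sigma^2) are given analytically in [13]. The following sketch (assuming NumPy) evaluates those closed-form expressions and checks them against brute-force quadrature:

```python
import numpy as np
from math import erf, sqrt, pi

def ncdf(t):
    """Standard normal CDF phi(t), elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(t) / np.sqrt(2.0)))

def probit_gauss_moments(mu, var):
    """Normalizer, mean and variance of the tilted density phi(g) * N(g | mu, var),
    using the closed forms from Rasmussen & Williams (2006), Sec. 3.9."""
    u = mu / sqrt(1.0 + var)
    Z = float(ncdf(u))                                      # matches eq. (6)
    r = float(np.exp(-0.5 * u * u)) / (sqrt(2.0 * pi) * Z)  # N(u) / Phi(u)
    mean = mu + var * r / sqrt(1.0 + var)
    variance = var - var ** 2 * r * (u + r) / (1.0 + var)
    return Z, mean, variance

# brute-force check by trapezoidal quadrature on a wide, dense grid
mu, var = 0.7, 2.0
g = np.linspace(-20.0, 20.0, 400_001)
dx = g[1] - g[0]
w = ncdf(g) * np.exp(-(g - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * pi * var)
Z_num = w.sum() * dx
m_num = (g * w).sum() * dx / Z_num
v_num = ((g - m_num) ** 2 * w).sum() * dx / Z_num
Z, m, v = probit_gauss_moments(mu, var)
```

The closed forms avoid any numerical integration inside the EP loop, which is what makes the per-site updates cheap.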
Finally, the new update of \tilde{f}_{a,i} is given by\n\n\tilde{f}_{a,i} \propto \frac{Q_{a,new}}{Q^{\setminus a,i}}.   (23)\n\nAfter all the individual approximation terms \tilde{f}_{a,i} for a = 2, 3 and i = 1, ..., D have been updated, the joint approximation is updated using eqs. (19)-(21). To minimize the computational load, we use parallel updates of \tilde{f}_{2,i} [8] followed by parallel updates of \tilde{f}_{3,i}, rather than the conventional sequential update scheme. Furthermore, due to the fact that \tilde{f}_2 and \tilde{f}_3 factorize, we only need the marginals of the cavity distributions Q^{\setminus a,i} and the marginals of the updated joint distributions Q_{a,new} for a = 2, 3.\n\nComputing the cavity distributions and matching the moments is tedious but straightforward. The moments of f_{a,i} Q^{\setminus a,i} require evaluation of the zeroth, first and second order moments of distributions of the form \phi(\gamma_i) N(\gamma_i \mid \mu_i, \Sigma_{ii}). Derivations of analytical expressions for these moments can be found in [13]. See the supplementary material for more details. The proposed algorithm is summarized in Figure 2. Note that the EP framework also provides an approximation of the marginal likelihood [13], which can be useful for learning the hyperparameters of the model. Furthermore, the proposed inference scheme can easily be extended to the MMV formulation in eqs. (8)-(10) by introducing a term \tilde{f}^t_{2,i} for each time step t = 1, ..., T.\n\n3.2 Computational details\n\nMost linear inverse problems of practical interest are high dimensional, i.e. D is large. It is therefore of interest to reduce the computational complexity of the algorithm as much as possible. The dominating operations in this algorithm are the inversions of the two D \times D covariance matrices in eq. (19) and eq. 
(20), and therefore the algorithm scales as O(D^3). But \tilde{V}_1 has low rank and \tilde{V}_2 is diagonal, and therefore we can apply the Woodbury matrix identity [20] to eq. (19) to get\n\n\tilde{V} = \tilde{V}_2 - \tilde{V}_2 A^T \left( \sigma_0^2 I + A \tilde{V}_2 A^T \right)^{-1} A \tilde{V}_2.   (24)\n\nFor N < D, this scales as O(N D^2), where N is the number of observations. Unfortunately, we cannot apply the same identity to the inversion in eq. (20), since \tilde{\Sigma}_4 has full rank and is non-diagonal in general. However, the eigenvalue spectra of many prior covariance structures of interest, e.g. simple neighbourhoods, decay relatively fast. Therefore, we can approximate \Sigma_0 with a low rank approximation \Sigma_0 \approx P \Lambda P^T, where \Lambda \in R^{R \times R} is a diagonal matrix of the R largest eigenvalues and P \in R^{D \times R} holds the corresponding eigenvectors. Using the rank-R approximation, we can now invoke the Woodbury matrix identity again to get:\n\n\tilde{\Sigma} = \tilde{\Sigma}_3 - \tilde{\Sigma}_3 P \left( \Lambda + P^T \tilde{\Sigma}_3 P \right)^{-1} P^T \tilde{\Sigma}_3.   (25)\n\nSimilarly, for R < D, this scales as O(R D^2). Another approach, which better preserves the total variance, would be to use probabilistic PCA [21] to approximate \Sigma_0. A third alternative is to consider other structures for \Sigma_0 that facilitate fast matrix inversion, such as block structures and Toeplitz structures. 
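The Woodbury rewrite in eq. (24) can be verified numerically; the sketch below (assuming NumPy; the sizes and noise variance are illustrative) compares the direct D x D inversion with the Woodbury form, which only solves an N x N system:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma2 = 20, 80, 0.5
A = rng.standard_normal((N, D))
v2 = rng.uniform(0.5, 2.0, size=D)   # diagonal of V2 (site variances)
V2 = np.diag(v2)

# direct inversion of eq. (19): (V1^{-1} + V2^{-1})^{-1} with V1^{-1} = A^T A / sigma2, O(D^3)
V_direct = np.linalg.inv(A.T @ A / sigma2 + np.diag(1.0 / v2))

# Woodbury form, eq. (24): only an N x N solve, O(N D^2)
S = sigma2 * np.eye(N) + A @ V2 @ A.T
V_woodbury = V2 - V2 @ A.T @ np.linalg.solve(S, A @ V2)

err = np.max(np.abs(V_direct - V_woodbury))
```

For N much smaller than D the Woodbury route dominates, which is the regime of interest in underdetermined inverse problems.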
Numerical issues can arise in EP implementations, and in order to avoid them we use the same precautions as described in [8].\n\n4 Numerical experiments\n\nThis section describes a series of numerical experiments that have been designed and conducted in order to investigate the properties of the proposed algorithm.\n\n4.1 Experiment 1\n\nThe first experiment compares the proposed method to the LARS algorithm [22] and to BG-AMP [23], an approximate message passing-based method for the spike and slab model. We also compare the method to an “oracle least squares estimator” that knows the true support of the solutions. We generate 100 problem instances from y = Ax_0 + e, where the solution vectors have been sampled from the proposed prior using the kernel \Sigma_{ij} = 50 \exp(-\|i - j\|_2^2 / (2 \cdot 10^2)), but constrained to have a fixed sparsity level of K/D = 0.25. That is, each solution x_0 has the same number of non-zero entries, but a different sparsity pattern. We vary the degree of undersampling from N/D = 0.05 to N/D = 0.95. The elements of A \in R^{N \times 250} are i.i.d. Gaussian and the columns of A have been scaled to unit \ell_2-norm. The SNR is fixed at 20 dB. We apply the four methods to each of the 100 problems, and for each solution we compute the normalized mean square error (NMSE) between the true signal x_0 and the estimated signal \hat{x}, as well as the F-measure:\n\nNMSE = \frac{\|x_0 - \hat{x}\|_2^2}{\|x_0\|_2^2},   F = 2 \, \frac{precision \cdot recall}{precision + recall},   (26)\n\nwhere precision and recall are computed using a MAP estimate of the support. For the structured spike and slab method, we consider three different covariance structures: \Sigma_{ij} = \kappa \, \delta(i - j), \Sigma_{ij} = \kappa \exp(-\|i - j\|_2 / s) and \Sigma_{ij} = \kappa \exp(-\|i - j\|_2^2 / (2s^2)), with parameters \kappa = 50 and s = 10. 
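A minimal sketch of the two metrics in eq. (26) (assuming NumPy; the example vectors and the magnitude threshold standing in for the MAP support estimate are illustrative):

```python
import numpy as np

def nmse(x_true, x_hat):
    """Normalized mean square error ||x0 - xhat||^2 / ||x0||^2, cf. eq. (26)."""
    return np.sum((x_true - x_hat) ** 2) / np.sum(x_true ** 2)

def f_measure(support_true, support_hat):
    """F-measure from the precision and recall of an estimated support, cf. eq. (26)."""
    tp = np.sum(support_true & support_hat)
    if tp == 0:
        return 0.0
    precision = tp / np.sum(support_hat)
    recall = tp / np.sum(support_true)
    return 2.0 * precision * recall / (precision + recall)

x0 = np.array([0.0, 1.5, 0.0, -2.0, 0.0])
xh = np.array([0.0, 1.4, 0.1, -1.8, 0.0])
s0 = x0 != 0
sh = np.abs(xh) > 0.05   # simple threshold in place of the MAP support estimate
nmse_val = nmse(x0, xh)
f_val = f_measure(s0, sh)
```

NMSE penalizes amplitude errors while the F-measure only scores the recovered support, which is why the two curves in Figure 3 can disagree.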
In each case, we use a rank R = 50 approximation of \Sigma_0. The average results are shown in Figures 3(a)-(f): figure 3(a) shows an example of one of the sampled vectors x_0 and figure 3(b) shows the three covariance functions.\n\nFigure 3: Illustration of the benefit of modelling the additional structure of the sparsity pattern: (a) example signal, (b) covariance functions, (c) NMSE, (d) F-measure, (e) run times, (f) iterations. 100 problem instances are generated using the linear measurement model y = Ax + e, where the elements of A \in R^{N \times 250} are i.i.d. Gaussian and the columns are scaled to unit \ell_2-norm. The solutions x_0 are sampled from the prior in eq. (5) with hyperparameters \Sigma_{ij} = 50 \exp(-\|i - j\|_2^2 / (2 \cdot 10^2)) and a fixed level of sparsity of K/D = 0.25. For the EP methods, the \Sigma_0 matrix is approximated using a rank 50 matrix. SNR is fixed at 20 dB.\n\nFrom figures 3(c)-(d), it is seen that the two EP methods with neighbour correlation are able to improve the phase transition point. That is, in order to obtain a reconstruction of the signal such that F \approx 0.8, EP with a diagonal covariance and BG-AMP need an undersampling ratio of N/D \approx 0.55, while the EP methods with neighbour correlation only need N/D \approx 0.35 to achieve F \approx 0.8. For this specific problem, this means that utilizing the neighbourhood structure allows us to reconstruct the signal with 50 fewer observations. Note that the reconstruction using the exponential covariance function also improves the result, even though the true underlying covariance structure corresponds to a squared exponential function. Furthermore, we see similar performance of BG-AMP and EP with a diagonal covariance matrix. This is expected for problems where A_{ij} is drawn i.i.d., as assumed in BG-AMP. However, the price of the improved phase transition is clear from figure 3(e). 
The proposed algorithm has significantly higher computational complexity than BG-AMP and LARS. Figure 4(a) shows the posterior mean of z for the signal shown in figure 3(a). Here it is seen that the two models with neighbour correlation provide a better approximation to the posterior activation probabilities. Figure 4(b) shows the posterior mean of \gamma for the model with the squared exponential kernel, along with \pm one standard deviation.\n\nFigure 4: (a) Marginal posterior means over z obtained using the structured spike and slab model for the signal in figure 3(a). The experimental set-up is the same as described in figure 3, except that the undersampling ratio is fixed at N/D = 0.5. (b) The posterior mean of \gamma superimposed with \pm one standard deviation. The green dots indicate the true support.\n\n4.2 Experiment 2\n\nIn this experiment we consider an application of the MMV formulation as given in eqs. (8)-(10), namely EEG source localization with synthetic sources [24]. Here we are interested in localizing the active sources within a specific region of interest on the cortical surface (the grey area in figure 5(a)). To do this, we generate a problem instance of Y = A_{EEG} X_0 + E using the procedure described in experiment 1, where A_{EEG} \in R^{128 \times 800} is a submatrix of a real EEG forward matrix corresponding to the grey area in the figure. The condition number of A_{EEG} is approximately 8 \cdot 10^{15}. The true sources X_0 \in R^{800 \times 20} are sampled from the structured spike and slab prior in eq. (8) using a squared exponential kernel with parameters \kappa = 50, s = 10 and T = 20. The number of active sources is 46, i.e. X_0 has 46 non-zero rows. SNR is fixed at 20 dB. The true sources are shown in figure 5(a).\n\nFigure 5: Source localization using synthetic sources. A_{EEG} \in R^{128 \times 800} is a submatrix (grey area) of a real EEG forward matrix. (a) True sources. (b) Reconstruction using the true prior, F_sq = 0.78. (c) Reconstruction using a diagonal covariance matrix, F_diag = 0.34.\n\nWe now use the EP algorithm to recover the sources using the true prior, i.e. the squared exponential kernel, and the results are shown in figure 5(b). We see that the algorithm detects most of the sources correctly, even the small blob on the right hand side. However, it also introduces a small number of false positives in the neighbourhood of the true active sources. The resulting F-measure is F_sq = 0.78. Figure 5(c) shows the result of reconstructing the sources using a diagonal covariance matrix, where F_diag = 0.34. Here the BG-AMP algorithm is expected to perform poorly due to the heavy violation of the assumption of A_{ij} being Gaussian i.i.d.\n\n4.3 Experiment 3\n\nWe have also recreated the Shepp-Logan phantom experiment from [2] with D = 10^4 unknowns, K = 1723 non-zero weights, N = 2K observations and SNR = 10 dB (see the supplementary material for more details). The EP method yields F_sq = 0.994 and NMSE_sq = 0.336 for this experiment, whereas BG-AMP yields F = 0.624 and NMSE = 0.717. 
For reference, the oracle estimator yields NMSE = 0.326.\n\n5 Conclusion and outlook\n\nWe introduced the structured spike and slab model, which allows incorporation of a priori knowledge of the sparsity pattern. We developed an expectation propagation-based algorithm for Bayesian inference under the proposed model. Future work includes developing a scheme for learning the structure of the sparsity pattern and extending the algorithm to the multiple measurement vector formulation with slowly changing support.\n\nReferences\n\n[1] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In AISTATS, pages 366-373, 2010.\n\n[2] V. Cevher, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In NIPS, Vancouver, B.C., Canada, 8-11 December 2008.\n\n[3] M. Pontil, L. Baldassarre, and J. Mourão-Miranda. Structured sparsity models for brain decoding from fMRI data. In Proceedings of the 2nd International Workshop on Pattern Recognition in NeuroImaging (PRNI 2012), pages 5-8, 2012.\n\n[4] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996.\n\n[5] T. J. Mitchell and J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023-1032, 1988.\n\n[6] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231-245, 2013.\n\n[7] G. Obozinski, J. P. Vert, and L. Jacob. Group lasso with overlap and graph lasso. 
ACM International Conference Proceeding Series, vol. 382, 2009.\n\n[8] D. Hernández-Lobato, J. Hernández-Lobato, and P. Dupont. Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, 14:1891-1945, 2013.\n\n[9] L. Yu, H. Sun, J. P. Barbot, and G. Zheng. Bayesian compressive sensing for cluster structured sparse signals. Signal Processing, 92(1):259-269, 2012.\n\n[10] M. Van Gerven, B. Cseke, R. Oostenveld, and T. Heskes. Bayesian source localization with the multivariate Laplace prior. In Advances in Neural Information Processing Systems 22, pages 1901-1909. Curran Associates, Inc., 2009.\n\n[11] J. M. Hernández-Lobato, D. Hernández-Lobato, and A. Suárez. Network-based sparse Bayesian classification. Pattern Recognition, 44(4):886-900, 2011.\n\n[12] D. Hernández-Lobato and J. M. Hernández-Lobato. Learning feature selection dependencies in multi-task learning. In Advances in Neural Information Processing Systems 26, pages 746-754. Curran Associates, Inc., 2013.\n\n[13] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[14] S. F. Cotter, B. D. Rao, K. Engan, and K. Kreutz-Delgado. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Transactions on Signal Processing, pages 2477-2488, 2005.\n\n[15] D. P. Wipf and B. D. Rao. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE Transactions on Signal Processing, 55(7):3704-3716, 2007.\n\n[16] J. Ziniel and P. Schniter. Dynamic compressive sensing of time-varying signals via approximate message passing. 
IEEE Transactions on Signal Processing, 61(21):5270-5284, 2013.\n\n[17] T. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Annual Conference on Uncertainty in Artificial Intelligence (UAI 2001), pages 362-369. Morgan Kaufmann, San Francisco, CA, 2001.\n\n[18] M. Opper and O. Winther. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12(11):2655-2684, 2000.\n\n[19] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n\n[20] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. 2012.\n\n[21] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611-622, 1999.\n\n[22] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407-499, 2004.\n\n[23] P. Schniter and J. Vila. Expectation-maximization Gaussian-mixture approximate message passing. In 2012 46th Annual Conference on Information Sciences and Systems (CISS), 2012.\n\n[24] S. Baillet, J. C. Mosher, and R. M. Leahy. Electromagnetic brain mapping. IEEE Signal Processing Magazine, 18(6):14-30, 2001.\n", "award": [], "sourceid": 914, "authors": [{"given_name": "Michael", "family_name": "Andersen", "institution": "Technical University of Denmark"}, {"given_name": "Ole", "family_name": "Winther", "institution": "Technical University of Denmark"}, {"given_name": "Lars", "family_name": "Hansen", "institution": "Technical University of Denmark"}]}