{"title": "Sparse Bayesian structure learning with \u201cdependent relevance determination\u201d priors", "book": "Advances in Neural Information Processing Systems", "page_first": 1628, "page_last": 1636, "abstract": "In many problem settings, parameter vectors are not merely sparse, but dependent in such a way that non-zero coefficients tend to cluster together. We refer to this form of dependency as \u201cregion sparsity\u201d. Classical sparse regression methods, such as the lasso and automatic relevance determination (ARD), model parameters as independent a priori, and therefore do not exploit such dependencies. Here we introduce a hierarchical model for smooth, region-sparse weight vectors and tensors in a linear regression setting. Our approach represents a hierarchical extension of the relevance determination framework, where we add a transformed Gaussian process to model the dependencies between the prior variances of regression weights. We combine this with a structured model of the prior variances of Fourier coefficients, which eliminates unnecessary high frequencies. The resulting prior encourages weights to be region-sparse in two different bases simultaneously. We develop efficient approximate inference methods and show substantial improvements over comparable methods (e.g., group lasso and smooth RVM) for both simulated and real datasets from brain imaging.", "full_text": "Sparse Bayesian structure learning with dependent\n\nrelevance determination prior\n\nAnqi Wu1\n\nMijung Park2\n\nOluwasanmi Koyejo3\n\nJonathan W. Pillow4\n\n1,4 Princeton Neuroscience Institute, Princeton University,\n\n{anqiw, pillow}@princeton.edu\n\n2 The Gatsby Unit, University College London, mijung@gatsby.ucl.ac.uk\n\n3 Department of Psychology, Stanford University, sanmi@stanford.edu\n\nAbstract\n\nIn many problem settings, parameter vectors are not merely sparse, but depen-\ndent in such a way that non-zero coef\ufb01cients tend to cluster together. 
We refer to this form of dependency as “region sparsity”. Classical sparse regression methods, such as the lasso and automatic relevance determination (ARD), model parameters as independent a priori, and therefore do not exploit such dependencies. Here we introduce a hierarchical model for smooth, region-sparse weight vectors and tensors in a linear regression setting. Our approach represents a hierarchical extension of the relevance determination framework, where we add a transformed Gaussian process to model the dependencies between the prior variances of regression weights. We combine this with a structured model of the prior variances of Fourier coefficients, which eliminates unnecessary high frequencies. The resulting prior encourages weights to be region-sparse in two different bases simultaneously. We develop efficient approximate inference methods and show substantial improvements over comparable methods (e.g., group lasso and smooth RVM) for both simulated and real datasets from brain imaging.

1 Introduction

Recent work in statistics has focused on high-dimensional inference problems where the number of parameters p equals or exceeds the number of samples n. Although ill-posed in general, such problems are made tractable when the parameters have special structure, such as sparsity in a particular basis. A large literature has provided theoretical guarantees about the solutions to sparse regression problems and introduced a suite of practical methods for solving them efficiently [1–7].
The Bayesian interpretation of standard “shrinkage” based methods for sparse regression problems involves maximum a posteriori (MAP) inference under a sparse, independent prior on the regression coefficients [8–15]. 
Under such priors, the posterior has high concentration near the axes, so the posterior maximum is at zero for many weights unless it is pulled strongly away by the likelihood. However, these independent priors neglect a statistical feature of many real-world regression problems, which is that non-zero weights tend to arise in clusters, and are therefore not independent a priori. In many settings, regression weights have an explicit topographic relationship, as when they index regressors in time or space (e.g., time series regression, or spatio-temporal neural receptive field regression). In such settings, nearby weights exhibit dependencies that are not captured by independent priors, which results in sub-optimal performance.
Recent literature has explored a variety of techniques for improving sparse inference methods by incorporating different types of prior dependencies, which we review here briefly. The smooth relevance vector machine (s-RVM) extends the RVM to incorporate a smoothness prior defined in a kernel space, so that weights are smooth as well as sparse in a particular basis [16]. The group lasso captures the tendency for groups of coefficients to remain in or drop out of a model in a coordinated manner by using an l1 penalty on the l2 norms of pre-defined groups of coefficients [17]. A method described in [18] uses a multivariate Laplace distribution to impose spatio-temporal coupling between the prior variances of regression coefficients, which imposes group sparsity while leaving coefficients marginally uncorrelated. The literature includes many related methods [19–24], although most require a priori knowledge of the dependency structure, which may be unavailable in many applications of interest.
Here we introduce a novel, flexible method for capturing dependencies in sparse regression problems, which we call dependent relevance determination (DRD). 
Our approach uses a Gaussian process to model dependencies between latent variables governing the prior variance of regression weights. (See [25], which independently proposed a similar idea.) We simultaneously impose smoothness by using a structured model of the prior variance of the weights' Fourier coefficients. The resulting model captures sparse, local structure in two different bases simultaneously, yielding estimates that are sparse as well as smooth. Our method extends previous work on automatic locality determination (ALD) [26] and Bayesian structure learning (BSL) [27], both of which described hierarchical models for capturing sparsity, locality, and smoothness. Unlike these methods, DRD can tractably recover region-sparse estimates with multiple regions of non-zero coefficients, without pre-defining the number of regions. We argue that DRD can substantially improve structure recovery and predictive performance in real-world applications.
This paper is organized as follows: Sec. 2 describes the basic sparse regression problem; Sec. 3 introduces the DRD model; Sec. 4 and Sec. 5 describe the approximate methods we use for inference; in Sec. 6, we show applications to simulated data and neuroimaging data.

2 Problem setup

2.1 Observation model

We consider a scalar response yi ∈ R linked to an input vector xi ∈ R^p via the linear model:

yi = xi^⊤ w + εi,    i = 1, 2, · · · , n,    (1)

with observation noise εi ∼ N(0, σ²). The regression (linear weight) vector w ∈ R^p is the quantity of interest. We denote the design matrix by X ∈ R^{n×p}, where each row of X is the ith input vector xi^⊤, and the observation vector by y = [y1, · · · , yn]^⊤ ∈ R^n. 
The likelihood can be written:

y|X, w, σ² ∼ N(y|Xw, σ²I).    (2)

2.2 Prior on regression vector

We impose a zero-mean multivariate normal prior on w:

w|θ ∼ N(0, C(θ)),    (3)

where the prior covariance matrix C(θ) is a function of hyperparameters θ. One can specify C(θ) based on prior knowledge about the regression vector, e.g. sparsity [28–30], smoothness [16, 31], or both [26]. Ridge regression assumes C(θ) = θ⁻¹I, where θ is a scalar precision. Automatic relevance determination (ARD) uses a diagonal prior covariance matrix with a distinct hyperparameter θi for each element of the diagonal, thus Cii = θi⁻¹. Automatic smoothness determination (ASD) assumes a non-diagonal prior covariance, given by a Gaussian kernel, Cij = exp(−ρ − ∆ij/2δ²), where ∆ij is the squared distance between the filter coefficients wi and wj in pixel space and θ = {ρ, δ²}. Automatic locality determination (ALD) parametrizes the local region with a Gaussian form, so that the prior variance of each filter coefficient is determined by its Mahalanobis distance (in coordinate space) from some mean location ν under a symmetric positive semi-definite matrix Ψ. The diagonal prior covariance matrix is given by Cii = exp(−(1/2)(χi − ν)^⊤ Ψ⁻¹ (χi − ν)), where χi is the space-time location (i.e., filter coordinates) of the ith filter coefficient wi and θ = {ν, Ψ}.

3 Dependent relevance determination (DRD) priors

We formulate prior covariances that capture region-dependent sparsity in the regression vector as follows.

Sparsity inducing covariance

We first parameterise the prior covariance to capture region sparsity in w:

Cs = diag[exp(u)],    (4)

where the parameters are u ∈ R^p. 
We impose a Gaussian process (GP) hyperprior on u:

u ∼ N(b1, K),    (5)

where the GP hyperprior is controlled by the mean parameter b ∈ R and the squared exponential kernel parameters: an overall scale ρ ∈ R and a length scale l ∈ R. We denote the hyperparameters by θs = {b, ρ, l}. We refer to the prior distribution associated with the covariance Cs as the dependent relevance determination (DRD) prior.
Note that this hyperprior induces dependencies between the ARD precisions; that is, the prior variance changes slowly between neighboring coefficients. If the ith coefficient of u is large, then the (i + 1)th and (i − 1)th coefficients are probably large as well.

Smoothness inducing covariance

We then formulate the smoothness-inducing covariance in the frequency domain. Smoothness is captured by a low-pass filter that passes only lower frequencies. Therefore, we define a zero-mean Gaussian prior over the Fourier-transformed weights w using a diagonal covariance matrix Cf with diagonal

Cf,ii = exp(−χi² / 2δ²),    (6)

where χi is the ith frequency-domain location of the regression weights w and δ² is the Gaussian variance. We denote the hyperparameters by θf = δ². This formulation encourages neighboring weights to have similar levels of Fourier power.
As with automatic locality determination in frequency coordinates (ALDf) [26], this way of formulating the covariance requires taking the discrete Fourier transform of the input vectors to construct the prior in the frequency domain, which incurs significant computation and memory costs, especially when p is large. 
To avoid this expense, we do not use the frequency-only covariance Cf on its own, but combine it with Cs to form Csf, which induces both sparsity and smoothness, as follows.

Smoothness and region sparsity inducing covariance

Finally, to capture both region sparsity and smoothness in w, we combine Cs and Cf as

Csf = Cs^{1/2} B^⊤ Cf B Cs^{1/2},    (7)

where B is the Fourier transform matrix, which can be very large when p is large. Our implementation exploits the speed of the FFT to apply B implicitly. This formulation implies that the sparse regions captured by Cs are pruned out, while the variances of the remaining weight entries are correlated by Cf. We refer to the prior distribution associated with the covariance Csf as the smooth dependent relevance determination (sDRD) prior.
Unlike Cs, the covariance Csf is no longer diagonal. To reduce computational complexity and storage requirements, we store only those values of the full Csf that correspond to non-zero portions of the diagonals of Cs and Cf.

Figure 1: Generative model for locally smooth and globally sparse Bayesian structure learning. The ith response yi is linked to an input vector xi and a weight vector w for each i. The weight vector w is governed by u and θf. The hyper-prior p(u|θs) imposes correlated sparsity in w, and the hyperparameters θf impose smoothness in w.

4 Posterior inference for w

First, we denote the overall hyperparameter set by θ = {σ², θs, θf} = {σ², b, ρ, l, δ²}. We compute the maximum likelihood estimate for θ (denoted by ˆθ) and then compute the conditional MAP estimate for the weights w given ˆθ (available in closed form); this is the empirical Bayes procedure equipped with a hyper-prior. Our goal is to infer w. 
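As a concrete illustration of eq. 7, applying Csf to a vector without ever forming the p × p matrix can be sketched as follows. This is a minimal NumPy sketch, not the authors' code; the integer frequency grid and the use of the inverse FFT in place of B^⊤ are our own assumptions.

```python
import numpy as np

def csf_matvec(v, u, delta):
    """Apply C_sf = C_s^{1/2} B^T C_f B C_s^{1/2} to a vector v implicitly.
    C_s = diag(exp(u)); C_f is a diagonal Gaussian low-pass filter in the
    frequency domain; the Fourier matrix B is applied via the FFT."""
    p = v.shape[0]
    cs_half = np.exp(u / 2.0)                    # diagonal of C_s^{1/2}
    freqs = np.fft.fftfreq(p) * p                # integer frequency coordinates
    cf = np.exp(-freqs**2 / (2.0 * delta**2))    # diagonal of C_f (low-pass)
    t = cs_half * v                              # C_s^{1/2} v
    t = np.fft.ifft(cf * np.fft.fft(t)).real     # smooth in frequency, O(p log p)
    return cs_half * t                           # C_s^{1/2} (...)
```

Because cf is real and symmetric in frequency, the induced operator is real, symmetric and positive semi-definite, as a prior covariance must be.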
The posterior distribution over w is obtained by

p(w|X, y) = ∫∫ p(w, u, θ|X, y) du dθ,    (8)

which is analytically intractable. Instead, we approximate the marginal posterior with the conditional distribution given the MAP estimate of u, denoted µu, and the maximum likelihood estimates of σ², θs, θf, denoted ˆσ², ˆθs, ˆθf:

p(w|X, y) ≈ p(w|X, y, µu, ˆσ², ˆθs, ˆθf).    (9)

The approximate posterior over w is multivariate normal, with mean and covariance given by

p(w|X, y, µu, ˆσ², ˆθs, ˆθf) = N(µw, Λw),    (10)
Λw = ( (1/ˆσ²) X^⊤X + C(µu, ˆθs, ˆθf)⁻¹ )⁻¹,    (11)
µw = (1/ˆσ²) Λw X^⊤ y.    (12)

5 Inference for hyperparameters

The MAP inference of w derived in the previous section depends on the values of ˆθ = {ˆσ², ˆθs, ˆθf}. To estimate ˆθ, we maximize the marginal likelihood (the evidence):

ˆθ = arg max_θ log p(y|X, θ),    (13)

where

p(y|X, θ) = ∫∫ p(y|X, w, σ²) p(w|u, θf) p(u|θs) dw du.    (14)

Unfortunately, computing the double integral is intractable. 
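For concreteness, the conditional Gaussian posterior over w given a fixed prior covariance and noise variance (eqs. 10-12) can be computed directly. This is a minimal NumPy sketch under our own variable naming, not the authors' implementation.

```python
import numpy as np

def conditional_posterior(X, y, C, sigma2):
    """Conditional posterior N(mu_w, Lam_w) over the weights w, given the
    prior covariance C = C(mu_u, theta_s, theta_f) and noise variance sigma2:
        Lam_w = (X^T X / sigma2 + C^{-1})^{-1}
        mu_w  = Lam_w X^T y / sigma2
    """
    Lam_inv = X.T @ X / sigma2 + np.linalg.inv(C)
    Lam_w = np.linalg.inv(Lam_inv)
    mu_w = Lam_w @ X.T @ y / sigma2
    return mu_w, Lam_w
```

In practice one would solve the linear system rather than form the inverse explicitly; the direct inversion is kept here to mirror the equations.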
In the following, we compute the integral approximately using the Laplace approximation.

Laplace approximation to the posterior over u

To approximate the marginal likelihood, we can rewrite Bayes' rule to express the marginal likelihood as the likelihood times the prior divided by the posterior,

p(y|X, θ) = p(y|X, u) p(u|θ) / p(u|y, X, θ).    (15)

Laplace's method allows us to approximate p(u|y, X, θ), the posterior over the latent u given the data {X, y} and hyper-parameters θ, using a Gaussian centered at the mode of the distribution, with inverse covariance given by the Hessian of the negative log-posterior. Let µu = arg max_u p(u|y, X, θ) and Λu = −(∂²/∂u∂u^⊤ log p(u|y, X, θ))⁻¹ denote the mean and covariance of this Gaussian, respectively. Although the right-hand side can be evaluated at any value of u, a common approach is to use the mode u = µu, since this is where the Laplace approximation is most accurate.

Figure 2: Comparison of estimators for the 1D simulated example. First column: true filter weight, maximum likelihood (linear regression) estimate, and empirical Bayes ridge regression (L2-penalized) estimate. Second column: ARD estimate with distinct, independent prior covariance hyperparameters, lasso regression with L1 regularization, and group lasso with a group size of 5. Third column: ALD methods in the space-time domain, the frequency domain, and the combination of both, respectively. Fourth column: DRD in the space-time domain only, its smooth version sDRD imposing both sparsity (space-time) and smoothness (frequency), and the smooth RVM initialized with the elastic net estimate.

This leads to the following expression for the log marginal likelihood:

log p(y|X, θ) ≈ log p(y|X, µu) + log p(µu|θ) + (1/2) log |2πΛu|.    (16)

Then, by optimizing log p(y|X, θ) with respect to θ, we obtain ˆθ for a fixed µu, denoted ˆθ_µu. Following an iterative optimization procedure, we obtain the updating rule µu^t = arg max_u p(u|y, X, ˆθ_{µu^{t−1}}) at the tth iteration. The algorithm stops when u and θ converge. More information and details about the formulation and derivation are given in the appendix.

6 Experiment and Results

6.1 One Dimensional Simulated Data

Beginning with simulated data, we compare the performance of various regression estimators. One-dimensional data are generated from the generative model with d = 200 dimensions. First, to generate a Gaussian process, a covariance kernel matrix K is built from a squared exponential kernel with the spatial locations of the regression weights as inputs. A scalar b is then set as the mean function to determine the scale of the prior covariance. Given the Gaussian process, we generate a multivariate vector u and take its exponential to obtain the diagonal of the prior covariance Cs in the space-time domain. To induce smoothness, eq. 7 is used to obtain the covariance Csf. A weight vector w is then sampled from a Gaussian distribution with zero mean and covariance Csf. Finally, we obtain the response y given stimulus x and w, plus Gaussian noise ε. In our case, ε should be large enough that the data and response do not impose a strong likelihood relative to the prior; the introduced prior thus largely dominates the estimate. 
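The sampling procedure just described can be sketched as follows. This is a minimal NumPy sketch; the hyperparameter values are illustrative choices of ours, not the paper's, and the dense construction of Csf is for clarity rather than efficiency.

```python
import numpy as np

def sample_drd_data(p=200, n=100, b=-1.0, rho=2.0, ell=10.0, delta=8.0,
                    noise_sd=5.0, seed=0):
    """Sample (X, y, w) from the DRD generative model:
    u ~ GP(b, K)  ->  C_s = diag(exp(u))  ->  C_sf (eq. 7)  ->  w ~ N(0, C_sf)."""
    rng = np.random.default_rng(seed)
    coords = np.arange(p)
    # squared exponential kernel over the weight locations (plus jitter)
    K = rho * np.exp(-(coords[:, None] - coords[None, :])**2 / (2 * ell**2))
    u = rng.multivariate_normal(b * np.ones(p), K + 1e-6 * np.eye(p))
    cs_half = np.exp(u / 2.0)                       # diagonal of C_s^{1/2}
    freqs = np.fft.fftfreq(p) * p
    cf = np.exp(-freqs**2 / (2 * delta**2))         # low-pass diagonal of C_f
    F = np.fft.fft(np.eye(p))                       # DFT matrix (B, unnormalized)
    smooth = np.real(F.conj().T @ (cf[:, None] * F)) / p   # frequency smoothing
    Csf = cs_half[:, None] * smooth * cs_half[None, :]     # eq. 7
    w = rng.multivariate_normal(np.zeros(p), Csf + 1e-8 * np.eye(p))
    X = rng.normal(size=(n, p))
    y = X @ w + noise_sd * rng.normal(size=n)       # observation model, eq. 1
    return X, y, w
```

The resulting w is zero outside the regions where u is large, and smooth inside them, which is the "region sparsity" the estimators are compared on.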
Three local regions are constructed: one positive, one negative, and one half-positive/half-negative, with sufficient zeros between the bumps to keep them clearly separated. The top left subfigure of Figure 2 shows the underlying weight vector w.

Figure 3: Estimated filter weights and prior covariances. The upper row shows the true filter (dotted black) and the estimates (red); the bottom row shows the underlying prior covariance matrix.

Traditional methods like maximum likelihood, without any prior, are significantly overwhelmed by the large noise in the data. Weak priors such as ridge, ARD, and lasso fit the true weights better, with different levels of sparsity imposed, but are still not sparse enough and not at all smooth. Group lasso enforces stronger sparsity than lasso by assuming block sparsity, making the result locally smoother. ALD-based methods perform better than the traditional ones at explicitly identifying one big bump. ALDs is restricted by its assumption of a unimodal Gaussian, and is therefore only able to find one dominating local region. ALDf targets localities in the frequency domain, making the estimate smoother, but discovers no spatially local regions. ALDsf combines the effects of ALDs and ALDf, and thus possesses smoothness, but again only one region is found. The smooth relevance vector machine (sRVM) smooths the curve by incorporating a flexible noise-dependent smoothness prior into the RVM, but does not draw much information from the data likelihood. Our DRD can impose distinct local sparsity via the Gaussian process prior, and sDRD can additionally induce smoothness by bounding the frequencies. 
For all baseline models, we perform model selection via cross-validation over a wide range of the parameter space, which guarantees fair comparisons.
To further illustrate the benefits and principles of DRD, we show the covariances estimated by ARD, ALDsf and sDRD in Figure 3. ARD can detect multiple localities, since its priors are purely independent scalars that are easily influenced by data with a strong likelihood, but at the cost of losing dependency and smoothness. ALDsf can only detect one locality, due to its deterministic Gaussian form, when the likelihood is not sufficiently strong; however, with Fourier components in the prior, it exhibits smoothness. sDRD captures multiple local sparse regions while also imposing smoothness. The underlying Gaussian process allows multiple non-zero regions in the prior covariance, yielding multiple local sparsities in the weight tensor. Smoothness is introduced by a Gaussian-type function controlling the frequency bandwidth and direction.
In addition, we examine the convergence properties of the various estimators as a function of the amount of collected data and report the average relative errors of each method in Figure 4. Responses are simulated from the same filter as above with large Gaussian white noise, which weakens the data likelihood and thus guarantees a significant effect of the prior relative to the likelihood. The results show that the sDRD estimate achieves the smallest MSE (mean squared error), regardless of the number of training samples. The MSE, here and in the following paragraphs, refers to the error relative to the underlying w; the test error, mentioned later, refers to the error relative to the true y. The left plot in Figure 4 shows that the other methods require at least 1-2 times more data than sDRD to achieve the same error rate. 
The right panel shows the ratio of the MSE of each estimate to that of the sDRD estimate; the next best method (ALDsf) exhibits an error nearly twice that of sDRD.

6.2 Two Dimensional Simulated Data

To better illustrate the performance of DRD and lay the groundwork for the real-data experiment, we present a 2-dimensional synthetic experiment. The data are generated to match characteristics of real fMRI data, as outlined in the next section. Using a generation procedure similar to the 1-dimensional experiment, a 2-dimensional w is generated with properties analogous to the regression weights in fMRI data. The analogy is based on reasonable speculation and knowledge accumulated from repeated trials and experiments. Two comparative studies are conducted to investigate the influence of sample size on the recovery accuracy of w and on predictive ability, both with dimension = 1600 (the same as the fMRI data). To demonstrate structural sparsity recovery, we compare our DRD method only with ARD, lasso, elastic net (elnet), and group lasso (glasso).
Lasso and elnet show similar performance and impose stronger sparsity than ARD, which indicates that ARD fails to impose strong sparsity in this synthetic case, due to its complete independence among dimensions, when data are insufficient and noisy. When n = 800, DRD still gives the best prediction. Group lasso falls behind, since a block-wise penalty can capture group information but misses the subtlety when finer details matter. ARD moves up to second place because, when the data likelihood is strong enough, the posterior of w is not greatly influenced by the noise but instead follows the likelihood and the prior. Additionally, since ARD's prior is more flexible and independent than those of lasso and elnet, its posterior approximates the underlying w more closely.

6.3 fMRI Data

We analyzed functional MRI data from the Human Connectome Project 1 collected from 215 healthy adult participants on a relational reasoning task. We used contrast images for the comparison of relational reasoning and matching tasks. Data were processed using the HCP minimal preprocessing pipelines [32], down-sampled to 63 × 76 × 63 voxels using the flirt applyXfm tool [33], then trimmed further to 40 × 76 × 40 by deleting zero-signal regions outside the brain. We analyzed 215 samples, each an average over Z-slices 37 to 39 of the 3D structure, based on recommendations by domain experts. As the dependent variable in the regression, we selected the number of correct responses on the Penn Matrix Test, which is a measure of fluid intelligence that should be related to relational reasoning performance.
In each run, we randomly split the fMRI data into five sets for five-fold cross-validation, and took the average of test errors across the five folds. 
Hyperparameters were chosen by five-fold cross-validation within the training set, and the optimal hyperparameter set was used for computing test performance. Fig. 7 shows the regions of positive (red) and negative (blue) support for the regression weights obtained using the different sparse regression methods. The rightmost panel quantifies performance using mean squared error on held-out test data.

1 http://www.humanconnectomeproject.org/.

Figure 6: Surface plot of estimated w from each method using 2-dimensional simulated data when n = 215.

Figure 7: Positive (red) and negative (blue) supports of the estimated weights from each method using real fMRI data and the corresponding test errors.

Both the predictive performance and the estimated patterns are similar to the n = 215 results in the 2D synthetic experiment. ARD returns a quite noisy estimate due to its complete independence assumptions and the weak likelihood. The elastic net estimate improves slightly over lasso and is significantly better than ARD, which indicates that lasso-type regularizations impose stronger sparsity than ARD in this case. Group lasso is slightly better because of its block-wise regularization, but more noisy blocks pop up, hurting predictive ability. DRD reveals strong sparsity as well as clustered local regions. It also achieves the smallest test error, indicating the best predictive ability. Given that local group information most likely gathers around a few pixels in fMRI data, induced smoothness is less valuable here. This is why sDRD does not distinctly outperform DRD, and why we omit the smoothness-imposing comparative experiment for the fMRI data. In addition, we also tested the StructOMP [24] method on both the 2D simulated data and the fMRI data, but it did not show satisfactory estimation or predictive ability on 2D data with our data's intrinsic properties. 
Therefore we chose not to show it for comparison in this study.

7 Conclusion

We proposed DRD, a hierarchical model for smooth and region-sparse weight tensors, which uses a Gaussian process to model spatial dependencies in prior variances, an extension of the relevance determination framework. To impose smoothness, we also employed a structured model of the prior variances of Fourier coefficients, which allows for pruning of high frequencies. Due to the intractability of the marginal likelihood integral, we developed an efficient approximate inference method based on the Laplace approximation, and showed substantial improvements over comparable methods on both simulated and real fMRI datasets. Our method yielded more interpretable weights and indeed discovered multiple sparse regions that were not detected by other methods. We have shown that DRD can gracefully incorporate structured dependencies to recover smooth, region-sparse weights without any specification of groups or regions, and we believe it will be useful for other kinds of high-dimensional datasets from biology and neuroscience.

Acknowledgments

This work was supported by the McKnight Foundation (JP), NSF CAREER Award IIS-1150186 (JP), NIMH grant MH099611 (JP) and the Gatsby Charitable Foundation (MP).

References
[1] R. Tibshirani. Journal of the Royal Statistical Society. Series B, pages 267–288, 1996.
[2] H. Lee, A. Battle, R. Raina, and A. Ng. In NIPS, pages 801–808, 2006.
[3] H. Zou and T. Hastie. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
[4] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[6] G. Yuan, K. Chang, C. 
Hsieh, and C. Lin. JMLR, 11:3183–3234, 2010.
[7] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–53, 2011.
[8] R. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
[9] M. Tipping. Sparse Bayesian learning and the relevance vector machine. JMLR, 1:211–244, 2001.
[10] D. MacKay. Bayesian non-linear modeling for the prediction competition. In Maximum Entropy and Bayesian Methods, pages 221–234. Springer, 1996.
[11] T. Mitchell and J. Beauchamp. Bayesian variable selection in linear regression. JASA, 83(404):1023–1032, 1988.
[12] E. George and R. McCulloch. Variable selection via Gibbs sampling. JASA, 88(423):881–889, 1993.
[13] C. Carvalho, N. Polson, and J. Scott. Handling sparsity via the horseshoe. In International Conference on Artificial Intelligence and Statistics, pages 73–80, 2009.
[14] C. Hans. Bayesian lasso regression. Biometrika, 96(4):835–845, 2009.
[15] A. Bhattacharya, D. Pati, N. Pillai, and D. Dunson. Bayesian shrinkage. December 2012.
[16] A. Schmolck. Smooth Relevance Vector Machines. PhD thesis, University of Exeter, 2008.
[17] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[18] M. Van Gerven, B. Cseke, F. De Lange, and T. Heskes. Efficient Bayesian multivariate fMRI analysis using a sparsifying spatio-temporal prior. NeuroImage, 50(1):150–161, 2010.
[19] J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736, 2010.
[20] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440. 
ACM, 2009.
[21] H. Liu, L. Wasserman, and J. Lafferty. Nonparametric regression and classification with joint sparsity constraints. In NIPS, pages 969–976, 2009.
[22] R. Jenatton, J. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. JMLR, 12:2777–2824, 2011.
[23] S. Kim and E. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8):e1000587, 2009.
[24] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. JMLR, 12:3371–3412, 2011.
[25] B. Engelhardt and R. Adams. Bayesian structured sparsity from Gaussian fields. arXiv preprint arXiv:1407.2235, 2014.
[26] M. Park and J. Pillow. Receptive field inference with localized priors. PLoS Computational Biology, 7(10):e1002219, 2011.
[27] M. Park, O. Koyejo, J. Ghosh, R. Poldrack, and J. Pillow. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 489–497, 2013.
[28] M. Tipping. Sparse Bayesian learning and the relevance vector machine. JMLR, 1:211–244, 2001.
[29] A. Tipping and A. Faul. Analysis of sparse Bayesian learning. NIPS, 14:383–389, 2002.
[30] D. Wipf and S. Nagarajan. A new view of automatic relevance determination. In NIPS, 2007.
[31] M. Sahani and J. Linden. Evidence optimization techniques for estimating stimulus-response functions. NIPS, pages 317–324, 2003.
[32] M. Glasser, S. Sotiropoulos, A. Wilson, T. Coalson, B. Fischl, J. Andersson, J. Xu, S. Jbabdi, M. Webster, J. Polimeni, et al. NeuroImage, 2013.
[33] N.M. Alpert, D. Berdichevsky, Z. Levin, E.D. Morris, and A.J. Fischman. Improved methods for image registration. 
NeuroImage, 3(1):10–18, 1996.", "award": [], "sourceid": 861, "authors": [{"given_name": "Anqi", "family_name": "Wu", "institution": "Ut austin"}, {"given_name": "Mijung", "family_name": "Park", "institution": "University College London"}, {"given_name": "Oluwasanmi", "family_name": "Koyejo", "institution": "Stanford University"}, {"given_name": "Jonathan", "family_name": "Pillow", "institution": "UT Austin"}]}