{"title": "EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2663, "page_last": 2671, "abstract": "For many real-world applications, we often need to select correlated variables---such as genetic variations and imaging features associated with Alzheimer's disease---in a high dimensional space. The correlation between variables presents a challenge to classical variable selection methods. To address this challenge, the elastic net has been developed and successfully applied to many applications. Despite its great success, the elastic net does not exploit the correlation information embedded in the data to select correlated variables. To overcome this limitation, we present a novel hybrid model, EigenNet, that uses the eigenstructures of data to guide variable selection. Specifically, it integrates a sparse conditional classification model with a generative model capturing variable correlations in a principled Bayesian framework. We develop an efficient active-set algorithm to estimate the model via evidence maximization. Experiments on synthetic data and imaging genetics data demonstrated the superior predictive performance of the EigenNet over the lasso, the elastic net, and the automatic relevance determination.", "full_text": "EigenNet: A Bayesian hybrid of generative and\n\nconditional models for sparse learning\n\nYuan Qi\n\nComputer Science and Statistics Depts.\n\nPurdue University\n\nWest Lafayette, IN 47907, USA\n\nFeng Yan\n\nComputer Science Dept.\n\nPurdue University\n\nWest Lafayette, IN 47907, USA\n\nAbstract\n\nFor many real-world applications, we often need to select correlated variables\u2014\nsuch as genetic variations and imaging features associated with Alzheimer\u2019s\ndisease\u2014in a high dimensional space. The correlation between variables presents\na challenge to classical variable selection methods. 
To address this challenge, the elastic net has been developed and successfully applied to many applications. Despite its great success, the elastic net does not exploit the correlation information embedded in the data to select correlated variables. To overcome this limitation, we present a novel hybrid model, EigenNet, that uses the eigenstructures of data to guide variable selection. Specifically, it integrates a sparse conditional classification model with a generative model capturing variable correlations in a principled Bayesian framework. We develop an efficient active-set algorithm to estimate the model via evidence maximization. Experimental results on synthetic data and imaging genetics data demonstrate the superior predictive performance of the EigenNet over the lasso, the elastic net, and automatic relevance determination.

1 Introduction

In this paper we consider the problem of selecting correlated variables in a high dimensional space. Among many variable selection methods, the lasso and the elastic net are two popular choices (Tibshirani, 1994; Zou and Hastie, 2005). The lasso uses an l1 regularizer on model parameters. This regularizer shrinks the parameters towards zero, removing irrelevant variables and yielding a sparse model (Tibshirani, 1994). However, the l1 penalty may lead to over-sparsification: given many correlated variables, the lasso often selects only a few of them. This not only degrades its prediction accuracy but also hurts the interpretability of the estimated model. For example, based on high-throughput biological data such as gene expression and RNA-seq data, it is highly desirable to select multiple correlated genes associated with a phenotype, since doing so may reveal underlying biological pathways. Due to its over-sparsification, the lasso may not be suitable for this task.
To address this issue, the elastic net has been developed to encourage a grouping effect, where strongly correlated variables tend to be in or out of the model together (Zou and Hastie, 2005). However, the grouping effect is merely a by-product of its composite l1 and l2 regularizer; the elastic net does not explicitly incorporate correlation information among variables in its model.

In this paper, we propose a new sparse Bayesian hybrid model that utilizes the eigen-information extracted from data for the selection of correlated variables. Specifically, it integrates a sparse conditional classification model with a generative model in a principled Bayesian framework (Lasserre et al., 2006): the conditional model achieves sparsity via automatic relevance determination (ARD) (MacKay, 1991), an empirical Bayesian approach for model sparsification; and the generative model is a latent variable model in which the observations are the eigenvectors of the unlabeled data, capturing correlations between variables. By integrating these two models, the hybrid model enables identification of groups of correlated variables guided by the eigenstructures. At the same time, the model passes information from its conditional part to its generative part, selecting informative eigenvectors for the classification task. Furthermore, using the Bayesian hybrid model, we can automate the estimation of the model hyperparameters.

From the regularization perspective, the new hybrid model naturally generalizes the elastic net using a composite regularizer that adapts to the data eigenstructures. It contains a sparsity regularizer and a directional regularizer that encourages selecting variables associated with eigenvectors chosen by the model.
When the variables are independent of each other, the eigenvectors are parallel to the axes and this composite regularizer reduces to the combination of the ARD and an l2 regularizer (similar to the composite regularizer of the elastic net). But when some of the input variables are strongly correlated, the regularizer encourages the classifier to align with the eigenvectors selected by the model. On one hand, our model is like the elastic net in that it retains 'all the big fish'. On the other hand, our model differs from the elastic net through the guidance it receives from the eigen-information. Hence the name EigenNet.

Experiments on synthetic data are presented in Section 5. Our results demonstrate that the EigenNet significantly outperforms the lasso and the elastic net in terms of prediction accuracy. We applied this new approach to two tasks in imaging genetics: i) predicting cognitive function of healthy subjects and AD patients based on brain imaging markers, and ii) classifying healthy and AD subjects based on single-nucleotide polymorphism (SNP) data. Compared to the lasso, the elastic net, and the ARD, our approach achieves improved prediction accuracy.

2 Background: lasso and elastic net

We denote n independent and identically distributed samples as D = {(x1, y1), . . . , (xn, yn)}, where xi is a p-dimensional input feature vector (i.e., explanatory variables) and yi is a scalar label (i.e., response). We also denote [x1, . . . , xn] by X and (y1, . . . , yn) by y.
Although our presentation focuses on the binary classification problem (yi ∈ {−1, 1}), our approach can be readily applied to other problems, such as regression and survival analysis, by choosing appropriate likelihood functions.

For classification, we use a probit model as the data likelihood:

p(y|X, w) = ∏_{i=1}^n σ(yi w^T xi)   (1)

where σ(z) is the Gaussian cumulative distribution function and w denotes the classifier.

To identify relevant variables for high dimensional problems, the lasso (Tibshirani, 1994) uses an l1 penalty, effectively shrinking w towards zero and pruning irrelevant variables. In a probabilistic framework this penalty corresponds to a Laplace prior distribution:

p(w) = ∏_j λ exp(−λ|wj|)   (2)

where λ is a hyperparameter that controls the sparsity of the estimated model: the larger the hyperparameter λ, the sparser the model.

As described in Section 1, the lasso may over-penalize relevant variables and hurt its predictive performance, especially when there are strongly correlated variables. To address this issue, the elastic net (Zou and Hastie, 2005) combines l1 and l2 regularizers to avoid the over-penalization. The combined regularizer corresponds to the prior distribution p(w) ∝ ∏_j exp(−λ1|wj| − λ2 wj^2), where λ1 and λ2 are hyperparameters. While it is well known that the elastic net tends to select strongly correlated variables together, it does not use correlation information embedded in the unlabeled data. The selection of correlated variables is merely the result of a less aggressive regularizer for sparsity.

Besides the elastic net, there are many variants of (and extensions to) the lasso, such as the bridge (Frank and Friedman, 1993) and the smoothly clipped absolute deviation (Fan and Li, 2001).
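As a rough illustration of the probit likelihood (1) combined with the l1 and l2 penalties above, the following sketch evaluates the corresponding negative log posterior. This is a minimal, hypothetical implementation for intuition only; the function and variable names are illustrative and not from the paper.

```python
import numpy as np
from math import erf, sqrt

def probit_cdf(z):
    # Gaussian CDF, i.e. sigma(z) in equation (1)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def neg_log_posterior(w, X, y, lam1, lam2=0.0):
    """Negative log posterior: probit likelihood (1) plus a Laplace prior (2);
    setting lam2 > 0 adds the l2 term of the elastic-net prior."""
    margins = y * (X @ w)                              # y_i * w^T x_i
    log_lik = sum(np.log(probit_cdf(m)) for m in margins)
    log_prior = -lam1 * np.sum(np.abs(w)) - lam2 * np.sum(w ** 2)
    return -(log_lik + log_prior)

# toy check: 3 samples, 2 features
X = np.array([[1.0, 0.0], [0.5, 1.0], [-1.0, 0.2]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, -0.5])
val = neg_log_posterior(w, X, y, lam1=0.1, lam2=0.05)
```

Under either prior, estimation amounts to minimizing this objective over w; the lasso corresponds to lam2 = 0 and the elastic net to lam1, lam2 > 0.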
These variants modify the l1 penalty to improve variable selection, but do not explicitly use the correlation information embedded in data.

(a) Independent variables   (b) Correlated variables

Figure 1: Toy examples. (a) When the variables x1 and x2 are independent of each other, both the lasso and the EigenNet select only x1. (b) When the variables x1 and x2 are correlated, the lasso selects only one variable. By contrast, guided by the major eigenvector of the data, the EigenNet selects both variables.

3 EigenNet: eigenstructure-guided variable selection

In this section, we propose to use the covariance structure in data to guide the sparse estimation of model parameters. First, let us consider the following toy examples.

3.1 Toy examples

Figure 1(a) shows samples from two classes. Clearly the variables x1 and x2 are not correlated. The lasso or the elastic net can successfully select the relevant variable x1 to classify the data. For the samples in Figure 1(b), the variables x1 and x2 are strongly correlated. Despite the strong correlation, the lasso would select only x1 and ignore x2. The elastic net may select both x1 and x2 if the regularization weight λ1 is small and λ2 is large, so that the elastic net behaves like an l2-regularized classifier. The elastic net, however, does not exploit the fact that x1 and x2 are correlated.

Since the eigenstructure of the data covariance matrix captures correlation information between variables, we propose to not only regularize the classifier to be sparse, but also encourage it to be aligned with certain eigenvector(s) that are helpful for the classification task.
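The intuition behind the Figure 1(b) toy example can be reproduced numerically: for two strongly correlated variables, the principal eigenvector of the sample covariance puts large weight on both of them. The sketch below is illustrative only; the data-generating numbers are assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# two strongly correlated variables, as in the Figure 1(b) toy example
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
v_major = eigvecs[:, -1]                 # principal eigenvector

# v_major has substantial weight on BOTH variables; this is the direction
# the EigenNet uses to guide the selection of correlated variables
```

A sparse classifier aligned with v_major keeps both x1 and x2, whereas an l1 penalty alone is free to drop either one.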
Note that although the classical Fisher linear discriminant also uses the data covariance matrix to learn the classifier, it generally does not provide a sparse solution, and is thus not suitable for the task of selecting correlated variables and removing irrelevant ones.

For the data in Figure 1(a), since the two eigenvectors are parallel to the horizontal and vertical axes, the EigenNet essentially reduces to the elastic net and selects x1. For the data in Figure 1(b), the principal eigenvector can guide the EigenNet to select both x1 and x2. The minor eigenvector is, however, not useful for the classification task (in general, we need to select which eigenvectors are relevant to classification). We use a Bayesian framework to materialize the above ideas, as described in the following section.

3.2 Bayesian hybrid of conditional and generative models

The EigenNet is a hybrid of conditional and generative models. The conditional component allows us to learn the classifier via "discriminative" training; the generative component captures the correlations between variables; and these two models are glued together via a joint prior distribution, so that the correlation information is used to guide the estimation of the classifier and the classification task is used to choose or scale relevant eigenvectors. Our approach is based on the general Bayesian framework proposed by Lasserre et al. (2006), which allows one to combine conditional and generative models in an elegant, principled way.

Specifically, for the conditional model we have the same likelihood as (1), p(y|X, w) = ∏_i σ(yi w^T xi). For the classifier w, we use a Gaussian prior: p(w) = ∏_{j=1}^p N(wj|0, βj^{−1}). We will describe later how to efficiently learn the precision parameters βj from the data to obtain a sparse classifier.

To encourage the classifier to align with certain eigenvectors, we introduce w̃—a latent vector (tightly) linked to the classifier w—in the generative model:

p(V|s, w̃) ∝ ∏_{j=1}^m N(vj|sj w̃, (λv ηj)^{−1} I)   (3)

where vj and ηj are the j-th eigenvector and eigenvalue of the data covariance matrix, λv is a hyperparameter, and s = [s1, . . . , sm] are scaling factors for the parameter w̃. To combat overfitting, we assign a Gamma prior Gam(λv|c0, d0) over λv. Note that this generative model encourages w̃ to align with the major eigenvectors, those with bigger eigenvalues. However, eigenvectors are noisy and not all of them are relevant to the classification task—we need to select the relevant eigenvectors (i.e., the relevant sub-eigenspace) and remove irrelevant ones.

Figure 2: The graphical model of the EigenNet.

To enable the selection of the relevant eigenvectors, we assign a Laplace prior on sj:

p(s) ∝ ∏_{j=1}^m λs exp(−λs|sj|)   (4)

where λs is a hyperparameter.

Finally, to link the conditional and generative models together, we use a prior for w̃ conditional on w:

p(w̃|w) ∝ N(w̃|w, rI)   (5)

Note that the variance parameter r controls how similar w and w̃ are in our joint model. For simplicity, we set r = 0 here, so that p(w̃|w) = δ(w̃ − w), where δ(a) = 1 if a = 0 and δ(a) = 0 otherwise.
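To see how the generative term (3) acts as a directional regularizer, the following hedged sketch evaluates its unnormalized log density: it is highest when w̃ points along the eigenvectors that the scaling factors s have selected. Names, the toy eigenvectors, and all numbers are illustrative assumptions.

```python
import numpy as np

def log_eigen_prior(w_tilde, V, s, eta, lam_v):
    """Unnormalized log of the generative term (3):
    sum_j log N(v_j | s_j * w_tilde, (lam_v * eta_j)^{-1} I), up to constants.
    V: (p, m) matrix whose columns are the eigenvectors v_j;
    eta: eigenvalues; s: scaling factors."""
    total = 0.0
    for j in range(V.shape[1]):
        resid = V[:, j] - s[j] * w_tilde
        total += -0.5 * lam_v * eta[j] * np.dot(resid, resid)
    return total

p = 4
V = np.linalg.qr(np.ones((p, p)) + np.eye(p))[0]   # some orthonormal directions
eta = np.array([3.0, 1.0, 0.5, 0.1])               # toy eigenvalues
s = np.array([1.0, 0.0, 0.0, 0.0])                 # only eigenvector 0 selected

aligned = log_eigen_prior(V[:, 0], V, s, eta, lam_v=2.0)
orthogonal = log_eigen_prior(V[:, 1], V, s, eta, lam_v=2.0)
# a w_tilde aligned with the selected eigenvector scores strictly higher
```

Because r = 0 ties w̃ to w, this same term pulls the classifier itself towards the selected eigen-directions.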
The graphical model representation of the EigenNet is given in Figure 2.

3.3 Model estimation

In this section we present how to estimate the model based on an empirical Bayesian approach. Specifically, we will use expectation propagation (EP) (Minka, 2001) to estimate the posterior of the classifier w (and w̃) and optimize the marginal likelihood of the joint model over the scaling variables s and the precision parameters β.

First, given the hyperparameter λv and the latent variable s, the posterior distribution of w is

p(w|y, X, s, λv) ∝ N(w|0, diag(β)^{−1}) (∏_j N(vj|sj w, (λv ηj)^{−1} I)) ∏_i σ(yi w^T xi)   (6)
∝ N(w|mp, Vp) ∏_i σ(yi w^T xi)   (7)

where Vp = (diag(β) + λv ∑_j ηj sj^2 I)^{−1} and mp = λv Vp ∑_j ηj sj vj. Then we initialize the EP updates with p(w) = N(w|mp, Vp) and iteratively approximate each likelihood factor σ(yi w^T xi) by a factor of Gaussian form: N(ti|xi^T w, hi^{−1}). In other words, EP maps each nonlinear non-Gaussian factor to a Gaussian factor with the virtual observation ti and the noise variance hi^{−1}. After the convergence of EP, we obtain both the mean mw and the covariance Vw.

Given the approximate posterior q(w), we maximize the variational lower bound over λv:

L(λv) = E_q(w)[∑_j ln N(vj|sj w, (λv ηj)^{−1} I) + ln Gam(λv|c0, d0)]   (8)
      = (pm/2) ln λv − (F/2) λv + (c0 − 1) ln λv − d0 λv + constant

Algorithm 1 The empirical Bayesian estimation algorithm
1. Initialize the model to contain a small fraction of features and initialize the parameters: s = 0, λv = 1, t = 0, h = ∞.
2. 
Run EP to obtain the initial mean mw and covariance Vw.
3. Loop until convergence or until reaching the maximum number of iterations
4. Loop over the j-th active set
   a. Update β via (12) and (13).
   b. If uj^2 < rj, remove the features in the j-th active set from the model.
   c. Update the posterior mean mw and the covariance Vw based on EP.
   d. Optimize the precision parameter λv via (9).
   e. Optimize the scaling factors s via (11).

where F = ∑_j ηj − 2(∑_j ηj sj vj)^T mw + (∑_j ηj sj^2) ∑_i ((mw)_i^2 + (Vw)_{i,i}). As a result, we have

λv = (c0 − 1 + pm/2) / (d0 + F/2).   (9)

Similarly, we maximize the variational lower bound over s:

L(s) = ∑_j (E_q(w)[ln N(vj|sj w, (λv ηj)^{−1} I)] − λs|sj|) + constant.   (10)

Consequently, we have for each j:

if |vj^T mw| ≤ λs/(ηj λv), sj = 0; otherwise, sj = Sign(vj^T mw) (|vj^T mw| − λs/(ηj λv)) / ∑_i ((mw)_i^2 + (Vw)_{i,i}).   (11)

To estimate β, we develop an active-set method that iteratively maximizes the model marginal likelihood over the elements of β. In particular, we use a strategy similar to Tipping and Faul (2003)'s approach: given the approximate factors N(t|X^T w, diag(h)^{−1}), the distribution over eigenvectors N(vj|sj w, (λv ηj)^{−1} I), and the prior distribution N(w|0, diag(β)^{−1}), we can compute and decompose the log marginal likelihood L(β) = log p(y|X, s, λv) into two parts: L(βj) and L(β\j), where j and \j index the elements of β in the active set and the remaining elements, respectively.
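The closed-form updates for λv and the scaling factors s can be sketched as follows. This is a hedged reading of equations (9) and (11) as printed, not the authors' implementation; all function and variable names are illustrative.

```python
import numpy as np

def update_lambda_v(F, c0, d0, p, m):
    # update in the spirit of (9): Gamma-regularized precision of the
    # eigenvector model, given the expected squared residual F
    return (c0 - 1.0 + p * m / 2.0) / (d0 + F / 2.0)

def update_s(V, eta, m_w, V_w, lam_v, lam_s):
    """Soft-threshold update for the scaling factors, in the spirit of (11):
    eigenvectors with a small projection |v_j^T m_w| are set exactly to
    zero (deselected); larger ones are shrunk and rescaled."""
    G = np.sum(m_w ** 2) + np.trace(V_w)      # E[||w||^2] under q(w)
    s = np.zeros(V.shape[1])
    for j in range(V.shape[1]):
        proj = V[:, j] @ m_w
        thresh = lam_s / (eta[j] * lam_v)
        if abs(proj) > thresh:
            s[j] = np.sign(proj) * (abs(proj) - thresh) / G
    return s

V = np.eye(3)                                 # toy eigenvectors
eta = np.array([2.0, 1.0, 0.5])
m_w = np.array([1.0, 0.05, 0.0])              # posterior mean from EP
V_w = 0.01 * np.eye(3)                        # posterior covariance from EP
s = update_s(V, eta, m_w, V_w, lam_v=1.0, lam_s=0.2)
# only the eigenvector with a large projection onto m_w survives
```

The Laplace prior on s thus produces exact zeros, so irrelevant eigenvectors drop out of the directional regularizer entirely.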
Note that because the effective prior over w becomes N(w|mp, Vp) as in (7), instead of the zero-mean prior N(w|0, diag(β)^{−1}), we cannot directly apply the algorithm proposed by Tipping and Faul (2003). Instead, we decompose L(β) into L(βj) and L(β\j) as follows.

First let us define

Uj = t^T diag(h) xj + λv ∑_{k=1}^m ηk sk vj^k − b^T mw,   Rj = (xj)^T diag(h) xj + λv ∑_{k=1}^m ηk sk^2 − b^T Vw b,
uj = βj Uj / (βj − Rj),   rj = βj Rj / (βj − Rj),   (12)

where b = (xj)^T diag(h) Xa + λv (∑_{k=1}^m ηk sk^2) ej^a, xj is the j-th column of the data matrix X, vj^k is the j-th element of the vector vk, Xa are the columns of X associated with the currently selected features (indexed by a), and ej^a are the a-th elements of the j-th row of the identity matrix.

Then we have L(β) = L(β\j) + (1/2)(ln βj − ln(βj + rj) + uj^2/(βj + rj)), where L(β\j) does not depend on βj. Therefore, we can directly optimize over βj without updating β\j.

Setting the gradient of L(β) over βj to zero, we obtain the following optimality condition:

if uj^2 ≥ rj, βj = rj^2 / (uj^2 − rj);   if uj^2 < rj, βj = ∞.   (13)

In the latter case we remove the j-th feature if it is currently in the model.

(a) Lasso   (b) Elastic net   (c) EigenNet   (d) True

Figure 3: Visualization of the lasso, the elastic net, the EigenNet, and the true classifier weights. We used 80 training samples with 40 features. The test error rates of the lasso, the elastic net, and the EigenNet on 2000 test samples are 0.297, 0.245, and 0.137, respectively.

The above active-set updates are very efficient, because during each iteration we only deal with a reduced model defined on the currently selected features.
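The pruning rule in (13) is simple enough to state directly in code. The sketch below is an illustrative reading of that single update, with hypothetical inputs uj and rj assumed to have been computed via (12).

```python
import numpy as np

def update_beta(u_j, r_j):
    """Active-set update in the spirit of (13): a feature keeps a finite
    precision beta_j only if u_j^2 >= r_j; otherwise beta_j -> infinity,
    which prunes the feature from the model."""
    if u_j ** 2 > r_j:
        return r_j ** 2 / (u_j ** 2 - r_j)
    return np.inf

kept = update_beta(u_j=3.0, r_j=2.0)     # 9 >= 2: finite precision, feature kept
pruned = update_beta(u_j=0.5, r_j=2.0)   # 0.25 < 2: infinite precision, pruned
```

Because a pruned feature contributes a zero-variance, zero-mean prior, it can be dropped from the working set, which is what keeps each iteration restricted to the currently selected features.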
This approach significantly reduces the computational cost of EP from O(np^2) to O(nl^2), where l is the largest model size during the active-set iterations. The empirical Bayesian estimation algorithm of the EigenNet is summarized in Algorithm 1.

4 Related work

The EigenNet is related to the classical eigenface approaches (Turk and Pentland, 1991; Sirovich and Kirby, 1987). The eigenface approach learns a model in the subspace spanned by the major eigenvectors of the data covariance matrix. The EigenNet also uses the eigensubspace to guide the model estimation. However, unlike the eigenface approach, the EigenNet adaptively selects eigenvectors and learns a sparse classifier.

There are Bayesian versions of the lasso and the elastic net. The Bayesian lasso (Park and Casella, 2008) puts a hyper-prior on the regularization coefficient and uses a Gibbs sampler to jointly sample both the regression weights and the regularization coefficient. Using a similar treatment, the Bayesian elastic net (Li and Lin, 2010) samples the two regularization coefficients simultaneously, potentially avoiding the "double shrinkage" problem described in the original elastic net paper (Zou and Hastie, 2005). Like the EigenNet, these methods are grounded in a Bayesian framework, sharing the benefit of obtaining posterior distributions for handling estimation uncertainty. However, the Bayesian lasso and the Bayesian elastic net were presented for regression problems (though they can certainly be generalized to classification problems) and do not use the eigen-information embedded in data. The EigenNet, by contrast, selects the eigen-subspace and uses it to guide classification.

Group lasso (Jacob et al., 2009) enforces sparsity on groups of predictors—an entire group of correlated predictors may be retained or pruned off.
However, applying the idea of group lasso to the EigenNet faces several difficulties. First, this approach would not give (approximately) sparse classifiers unless we truncate the eigenvectors; if we use truncation, we need to decide what threshold to use for each eigenvector, which is itself a difficult task. Second, it would be hard to tune the regularization coefficients associated with all the major eigenvectors; cross-validation would not suffice. By contrast, our classifier is sparse because of the ARD effect. More importantly, the latent variables sj in our model are automatically estimated from data, deciding how important each eigenvector is for the classification task in a principled Bayesian framework.

5 Experimental results

We evaluated the new sparse Bayesian model, the EigenNet, on both synthetic and real data and compared it with three representative variable selection methods: the lasso, the elastic net, and an ARD approach (Qi et al., 2004). For the lasso and the elastic net, we used the Glmnet software package, which uses cyclical coordinate descent in a pathwise fashion.1 Like the EigenNet, the ARD approach also uses EP to approximate the model marginal likelihood. For the lasso and the elastic net, we used cross-validation to tune the hyperparameters; for the EigenNet, we estimated λv from data and tuned λs by cross-validation.

1http://www-stat.stanford.edu/~tibs/glmnet-matlab/

(a) independent features   (b) correlated features   (c) independent features   (d) correlated features

Figure 4: Predictive performance on synthetic datasets. (a) and (b): classification; (c) and (d): regression. The results were averaged over 10 runs.
For the data with independent features, the EigenNet outperforms the alternative methods when the number of training samples is small; for data with correlated features, the EigenNet outperforms the alternative methods consistently.

5.1 Visualization of estimated classifiers

First, we tested these methods on synthetic data that contain correlated features. We sampled 40-dimensional data points, each of which contains two groups of correlated variables. The correlation coefficient between variables in each group is 0.81 and there are 4 variables in each group. We set the values of the classifier weights in one group to 5 and in the other group to -5. We also generated the bias term randomly from a standard Gaussian distribution. We set the number of training points to 80. Figure 3 shows the estimated classifiers and the true classifier we used to produce the data labels. Unlike the lasso and the elastic net, the EigenNet clearly identifies the two groups of correlated variables, very close to the ground truth. As a result, on 2000 test points, the EigenNet achieves the lowest prediction error rate, 0.137, while the test error rates of the lasso and the elastic net are 0.297 and 0.245, respectively.

5.2 Experiments on synthetic data

Next, we systematically compared these methods for classification and regression on synthetic datasets containing correlated features and on datasets containing independent features. (Although the presentation so far has focused on classification, we can easily implement the EigenNet for regression; since we can compute the marginal likelihood exactly, the EP approximation is not needed for regression.) To generate data with correlated variables we used a procedure similar to that of the visualization example: we sampled 40-dimensional data points, each of which contains two groups of correlated variables.
The correlation coefficient between variables in each group is 0.81 and there are 4 variables in each group. However, unlike the previous example, where the classifier weights are the same for the correlated variables, we now set the weights within the same group to have the same sign but different random values. We varied the number of training points from 10 to 80 and tested all of these methods. For the datasets with independent features, we followed the same procedure except that the features were independently sampled. We ran the experiments 10 times. Figure 4 shows the results averaged over the 10 runs. We do not report the standard errors since they are very small.

For the datasets with independent features, the EigenNet outperforms the alternative methods when the number of training examples is small (probably because in this case the eigenspace has a smaller dimension than that of the classifier, effectively controlling the model flexibility); with more training examples, it is not surprising to see all of these methods perform quite similarly. For the data with correlated features, although the results of the elastic net appear to overlap with those of the lasso in the figure, the elastic net often outperforms the lasso by a small margin; moreover, the EigenNet consistently and significantly outperforms both the lasso and the elastic net. The improved predictive performance of the EigenNet reflects the benefit of using the valuable correlation information to help the model estimation.

5.3 Application to imaging genetics

Imaging genetics is an emerging research area where imaging markers and genetic variations (e.g., SNPs) are used to study neurodegenerative diseases, in particular Alzheimer's disease (AD).
(a) Regression of ADAS-Cog score   (b) Classification of healthy & AD subjects

Figure 5: Imaging genetics applications: (a) prediction of the ADAS-Cog score based on 14 imaging features and (b) AD classification based on 2000 SNPs. The error bars represent the standard errors.

We applied the EigenNet to two critical problems in imaging genetics and compared its performance with that of the alternative sparse learning methods.

First, we considered a regression problem where the predictors are imaging features, which were generated by Holland et al. (2009) for ADNI and include volumes measured in 14 brain regions of interest (ROIs), including the whole brain, the ventricles, the hippocampus, etc. We used these imaging features to predict the ADAS-Cog score, which is widely used to assess the cognitive function of AD patients. It is hypothesized that the brain ROI volumes are associated with the ADAS-Cog score, but this association has not been rigorously studied by statistical learning methods. After removing missing entries, we obtained data for 726 subjects, including healthy people, people with mild cognitive impairment (MCI), and AD patients. We then applied the lasso, the elastic net, and the EigenNet to this prediction task. We randomly selected 508 training samples and 218 test samples 50 times. The results are shown in Figure 5(a).

Second, we used SNP data to classify a subject into the healthy group or the AD group. We chose the top 2000 SNPs associated with AD based on a simple statistical test. There are 374 subjects in total (roughly the same number in each class).
We compared the EigenNet with the lasso and the elastic net, as well as with the ARD approach, since the ARD corresponds to the EigenNet's conditional component. We randomly split the dataset into 262 training and 112 test samples 10 times. The results are summarized in Figure 5(b). As the figure shows, for both the regression and classification problems the EigenNet significantly outperforms the alternative methods.

6 Conclusions

In this paper, we have presented a novel sparse Bayesian hybrid model to select correlated variables for regression and classification. It integrates the sparse conditional ARD model with a latent variable model for eigenvectors.

For this hybrid model, we could explore other latent variable models, such as sparse projection methods (Guan and Dy, 2009; Archambeau and Bach, 2009); these models can better deal with noise in the unlabeled data and improve the selection of interdependent features (i.e., predictors). Furthermore, if we have prior knowledge about the interdependence between features, such as linkage disequilibrium between SNPs, we could easily incorporate it into our model. Thus, our model provides an elegant framework for integrating complex data generation processes and domain knowledge in sparse learning.

7 Acknowledgments

The authors thank the anonymous reviewers and T. S. Jaakkola for constructive suggestions. This work was supported by NSF IIS-0916443, NSF CAREER award IIS-1054903, and the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.

References

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1994.

Hui Zou and Trevor Hastie.
Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301-320, 2005.

Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled hybrids of generative and discriminative models. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 87-94, 2006.

David J. C. MacKay. Bayesian interpolation. Neural Computation, 4:415-447, 1991.

Ildiko E. Frank and Jerome H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109-135, 1993.

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.

Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 362-369, 2001.

Michael E. Tipping and Anita C. Faul. Fast marginal likelihood maximisation for sparse Bayesian models. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.

Matthew Turk and Alex Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3:71-86, 1991.

L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of human faces. J. Opt. Soc. Am. A, 4(3):519-524, 1987.

Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681-686, 2008.

Qing Li and Nan Lin. The Bayesian elastic net. Bayesian Analysis, 5(1):151-170, 2010.

Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, 2009.

Yuan Qi, Thomas P. Minka, Rosalind W. Picard, and Zoubin Ghahramani.
Predictive automatic relevance determination by expectation propagation. In Proceedings of the Twenty-first International Conference on Machine Learning, pages 671-678, 2004.

Dominic Holland, James B. Brewer, Donald J. Hagler, Christine Fennema-Notestine, and Anders M. Dale. Subregional neuroanatomical change as a biomarker for Alzheimer's disease. Proceedings of the National Academy of Sciences, 106(49):20954-20959, 2009.

Yue Guan and Jennifer Dy. Sparse probabilistic principal component analysis. JMLR W&CP: AISTATS, 5, 2009.

Cédric Archambeau and Francis Bach. Sparse probabilistic projections. In Advances in Neural Information Processing Systems 21, 2009.
", "award": [], "sourceid": 1450, "authors": [{"given_name": "Feng", "family_name": "Yan", "institution": null}, {"given_name": "Yuan", "family_name": "Qi", "institution": null}]}