{"title": "Learning Feature Selection Dependencies in Multi-task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 746, "page_last": 754, "abstract": "A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data.", "full_text": "Learning Feature Selection Dependencies in\n\nMulti-task Learning\n\nDaniel Hern\u00b4andez-Lobato\nComputer Science Department\n\nUniversidad Aut\u00b4onoma de Madrid\ndaniel.hernandez@uam.es\n\nJos\u00b4e Miguel Hern\u00b4andez-Lobato\n\nDepartment of Engineering\nUniversity of Cambridge\njmh233@cam.ac.uk\n\nAbstract\n\nA probabilistic model based on the horseshoe prior is proposed for learning de-\npendencies in the process of identifying relevant features for prediction. Exact\ninference is intractable in this model. However, expectation propagation offers\nan approximate alternative. 
Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: the different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data.\n\n1 Introduction\n\nMany linear regression problems are characterized by a large number d of features or explaining attributes and by a reduced number n of training instances. In this large d but small n scenario there is an infinite number of potential model coefficients that explain the training data perfectly well. To avoid over-fitting problems and to obtain estimates with good generalization properties, a typical regularization is to assume that the model coefficients are sparse, i.e., most coefficients are equal to zero [1]. This is equivalent to considering that only a subset of the features or attributes are relevant for prediction. The sparsity assumption can be introduced by carrying out Bayesian inference under a sparsity enforcing prior for the model coefficients [2, 3], or by minimizing a loss function penalized by some sparse regularizer [4, 5]. Among the priors that enforce sparsity, the horseshoe has some attractive properties that are very convenient for the scenario described [3]. 
In particular, this prior has heavy tails, to model coefficients that significantly differ from zero, and an infinitely tall spike at the origin, to favor coefficients that take negligible values.\n\nThe estimation of the coefficients under the sparsity assumption can be improved by introducing dependencies in the process of determining which coefficients are zero [6, 7]. An extreme case of these dependencies appears in group feature selection methods, in which groups of coefficients are considered to be jointly equal to or different from zero [8, 9]. However, a practical limitation is that the dependency structure (the groups) is often assumed to be given. Here, we propose a model based on the horseshoe prior that induces the dependencies in the feature selection process from the training data. These dependencies are expressed by a correlation matrix that is specified by O(d) parameters. Unfortunately, the estimation of these parameters from the training data is difficult since we consider n < d instances only. Thus, over-fitting problems are likely to appear. To improve the estimation process we assume a multi-task learning setting, where several learning tasks share feature selection dependencies. The method proposed can be adapted to such a scenario with few modifications.\n\nTraditionally, methods for multi-task learning under the sparsity assumption have considered common relevant and irrelevant features among tasks [8, 10, 11, 12, 13, 14]. Nevertheless, recent research cautions against this assumption when the supports and values of the coefficients for each task can vary widely [15]. The model proposed here limits the impact of this problem because it has fewer restrictions. The tasks used for induction can have, besides different model coefficients, different relevant features. 
They must share only the dependency structure for the selection process.\n\nThe model described here is most closely related to the method for sparse coding introduced in [16], where spike-and-slab priors [2] are considered for multi-task linear regression under the sparsity assumption and dependencies in the feature selection process are specified by a Boltzmann machine. Fitting the parameters of a Boltzmann machine exactly to the observed data has exponential cost in the number of dimensions of the learning problem. Thus, when compared to the proposed model, the model considered in [16] is particularly difficult to train. For this reason, an approximate algorithm based on block-coordinate optimization has been described in [17]. The algorithm alternates between greedy MAP estimation of the sparsity patterns of each task and maximum pseudo-likelihood estimation of the Boltzmann parameters. Nevertheless, this algorithm lacks a proof of convergence and we have observed that it is prone to get trapped in sub-optimal solutions.\n\nOur experiments with real and synthetic data show the better performance of the proposed model when compared to other methods that try to overcome the problem of different supports among tasks. These methods include the model described in [16] and the model for dirty data proposed in [15]. These experiments also illustrate the benefits of the proposed model for inducing dependencies in the feature selection process. Specifically, the dependencies obtained are suitable for the multi-task learning problems considered. Finally, a difficulty of the model proposed is that exact Bayesian inference is intractable. Therefore, expectation propagation (EP) is employed for efficient approximate inference. 
In our model EP has a cost that is O(Kn²d), where K is the number of learning tasks, n is the number of samples of each task, and d is the dimensionality of the data.\n\nThe rest of the paper is organized as follows: Section 2 describes the proposed model for learning feature selection dependencies. Section 3 shows how to use expectation propagation to approximate the quantities required for induction. Section 4 compares this model with others from the literature on synthetic and real data regression problems. Finally, Section 5 gives the conclusions of the paper and some ideas for future work.\n\n2 A Model for Learning Feature Selection Dependencies\n\nWe describe a linear regression model that can be used for learning dependencies in the process of identifying relevant features or attributes for prediction. For simplicity, we first deal with the case of a single learning task. Then, we show how this model can be extended to address multi-task learning problems. In the single task scenario we consider some training data in the form of n d-dimensional vectors summarized in a design matrix X = (x_1, …, x_n)ᵀ and associated targets y = (y_1, …, y_n)ᵀ, with y_i ∈ R. A linear predictive rule is assumed for y given X. Namely, y = Xw + ε, where w is a vector of latent coefficients and ε is a vector of independent Gaussian noise with variance σ², i.e., ε ∼ N(0, σ²I). Given X and y, the likelihood for w is:\n\np(y|X, w) = ∏_{i=1}^n p(y_i|x_i, w) = ∏_{i=1}^n N(y_i|wᵀx_i, σ²) = N(y|Xw, σ²I).   (1)\n\nConsider the under-determined scenario n < d. In this case, the likelihood is not strictly concave and infinitely many values of w fit the training data perfectly well. A strong regularization technique that is often used in this context is to assume that only some features are relevant for prediction [1]. This is equivalent to assuming that w is sparse with many zeros. 
This inductive bias can be naturally incorporated into the model using a horseshoe sparsity enforcing prior for w [3].\n\nThe horseshoe prior lacks a closed form but can be defined as a scale mixture of Gaussians:\n\np(w|τ) = ∏_{j=1}^d p(w_j|τ),   p(w_j|τ) = ∫ N(w_j|0, λ_j²τ²) C⁺(λ_j|0, 1) dλ_j,   (2)\n\nwhere λ_j is a latent scale for coefficient w_j, C⁺(·|0, 1) is a half-Cauchy distribution with zero location and unit scale, and τ > 0 is a global shrinkage parameter that controls the level of sparsity. The smaller the value of τ, the sparser the prior, and vice-versa. Figure 1 (left) and (middle) show a comparison of the horseshoe with other priors from the literature. The horseshoe has an infinitely tall spike at the origin, which favors coefficients with small values, and has heavy tails, which favor coefficients that take values that significantly differ from zero. Furthermore, assume that τ = σ² = 1 and that X = I, and define κ_j = 1/(1 + λ_j²). Then, the posterior mean for w_j is (1 − κ_j)y_j, where κ_j is a random shrinkage coefficient that can be interpreted as the amount of weight placed at the origin [3]. Figure 1 (right) shows the prior density for κ_j that results from the horseshoe. It is from the shape of this figure that the horseshoe takes its name. We note that one expects to see two things under this prior: relevant coefficients (κ_j ≈ 0, no shrinkage), and zeros (κ_j ≈ 1, total shrinkage). The horseshoe is therefore very convenient for the sparsity inducing scenario described before.\n\nFigure 1: (left) Density of different priors, horseshoe, Gaussian, Student-t and Laplace near the origin. Note the infinitely tall spike of the horseshoe. (middle) Tails of the different priors considered before. 
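As a minimal sketch of the scale-mixture representation in (2), the following Python snippet draws samples from the horseshoe prior by first sampling half-Cauchy local scales and then Gaussian coefficients. The function name and the choice τ = 0.1 are ours for illustration; the paper's own implementation (in R) is in its supplementary material.

```python
import numpy as np

def sample_horseshoe(d, tau, rng):
    """Draw a d-dimensional coefficient vector from the horseshoe prior.

    Uses the scale mixture of (2): lambda_j ~ C+(0, 1) (half-Cauchy),
    then w_j ~ N(0, lambda_j**2 * tau**2).
    """
    lam = np.abs(rng.standard_cauchy(d))   # half-Cauchy local scales
    return rng.normal(0.0, lam * tau)      # per-coordinate Gaussian draws

rng = np.random.default_rng(0)
w = sample_horseshoe(1000, tau=0.1, rng=rng)
# Spike-and-heavy-tail behavior: most draws are tiny, a few are large.
```

With a small τ most samples cluster tightly around zero while the Cauchy tails still allow occasional large coefficients, which is exactly the behavior Figure 1 illustrates.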
(right) Prior density of the shrinkage parameter κ_j for the horseshoe prior.\n\nA limitation of the horseshoe is that it does not consider dependencies in the feature selection process. Specifically, the fact that one feature is actually relevant for prediction has no impact at all on the prior relevancy or irrelevancy of other features. We now describe how to introduce these dependencies in the horseshoe. Consider the definition of a Cauchy distribution as the ratio of two independent standard Gaussian random variables [18]. An equivalent representation of the prior is:\n\np(w|ρ², γ²) = ∫ ∏_{j=1}^d N(w_j|0, u_j²/v_j²) N(u_j|0, ρ²) N(v_j|0, γ²) du_j dv_j,   (3)\n\nwhere u_j and v_j are latent variables introduced for each dimension j. In particular, λ_j = u_jγ/(v_jρ). Furthermore, τ has been incorporated into the prior for u_j and v_j using τ² = ρ²/γ². The latent variables u_j and v_j can be interpreted as indicators of the relevance or irrelevance of feature j. The larger u_j², the more relevant the feature. Conversely, the larger v_j², the more irrelevant.\n\nA simple way of introducing dependencies in the feature selection process is to consider correlations among the variables u_j and v_j, with j = 1, …, d. These correlations can be introduced in (3) as follows:\n\np(w|ρ², γ², C) = ∫ [∏_{j=1}^d N(w_j|0, u_j²/v_j²)] N(u|0, ρ²C) N(v|0, γ²C) du dv,   (4)\n\nwhere u = (u_1, …, u_d)ᵀ, v = (v_1, …, v_d)ᵀ, C is a correlation matrix that specifies the dependencies in the feature selection process, and ρ² and γ² act as regularization parameters that control the level of sparsity. When C = I, (4) factorizes and gives the same prior as the one in (2) and (3). In practice, however, C has to be estimated from the data. 
This can be problematic since it will involve the estimation of O(d²) free parameters, which can lead to over-fitting. To alleviate this problem, and also to allow for efficient approximate inference, we consider a special form for C:\n\nC = ΔMΔ,   M = D + PPᵀ,   Δ = diag(1/√M_11, …, 1/√M_dd),   (5)\n\nwhere diag(a_1, …, a_d) denotes a diagonal matrix with entries a_1, …, a_d; D is a diagonal matrix whose entries are all equal to some small positive constant (this matrix guarantees that C⁻¹ exists); the products by Δ ensure that the entries of C are in the range (−1, 1); and P is a d × m matrix of real entries which specifies the correlation structure of C. Thus, C is fully determined by P and will only have O(md) free parameters, with m < d. The value of m is a regularization parameter that limits the complexity of C. The larger its value, the more expressive C is. For computational reasons described later on, in our experiments we will set m equal to n, the number of data instances.\n\n2.1 Inference, Prediction and Learning Feature Selection Dependencies\n\nDenote by z = (wᵀ, uᵀ, vᵀ)ᵀ the vector of latent variables of the model described above. Based on the formulation of the previous section, the joint probability distribution of y and z is:\n\np(y, z|X, σ², ρ², γ², C) = N(y|Xw, σ²I) N(u|0, ρ²C) N(v|0, γ²C) ∏_{j=1}^d N(w_j|0, u_j²/v_j²).   (6)\n\nFigure 2 shows the factor graph corresponding to this joint probability distribution. This graph summarizes the interactions between the random variables in the model. 
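The low-rank construction of C in (5) can be sketched in a few lines. This is an illustrative implementation under our own naming (`correlation_matrix`, `eps` for the constant on the diagonal of D); the paper does not prescribe a specific value for that constant.

```python
import numpy as np

def correlation_matrix(P, eps=1e-2):
    """Build C = Delta M Delta with M = D + P P^T, as in (5).

    D = eps * I (eps > 0) makes M, and hence C, invertible; the diagonal
    scaling Delta normalises C to unit diagonal, so every off-diagonal
    entry lies strictly inside (-1, 1).
    """
    d = P.shape[0]
    M = eps * np.eye(d) + P @ P.T
    delta = 1.0 / np.sqrt(np.diag(M))          # entries of Delta
    return delta[:, None] * M * delta[None, :]  # Delta M Delta

rng = np.random.default_rng(1)
P = rng.normal(size=(5, 2))   # d = 5 features, m = 2 columns
C = correlation_matrix(P)
```

Because M has rank at most m plus the diagonal, downstream computations can exploit this structure; this is what later allows the O(n²d) cost of EP when m = n.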
All the factors in (6) are Gaussian, except the ones corresponding to the prior for w_j given u_j and v_j, N(w_j|0, u_j²/v_j²). Given the observed targets y, one is typically interested in inferring the latent variables z of the model. For this, Bayes' theorem can be used:\n\np(z|X, y, σ², ρ², γ², C) = p(y, z|X, σ², ρ², γ², C) / p(y|X, σ², ρ², γ², C),   (7)\n\nwhere the numerator in the r.h.s. of (7) is the joint distribution (6) and the denominator is simply a normalization constant (the model evidence), which can be used for Bayesian model selection [19]. The posterior distribution in (7) is useful to compute a predictive distribution for the target y_new associated to a new unseen data instance x_new:\n\np(y_new|x_new, X, y, σ², ρ², γ², C) = ∫ p(y_new|x_new, w) p(z|X, y, σ², ρ², γ², C) dz.   (8)\n\nSimilarly, one can marginalize (7) with respect to w to obtain a posterior distribution for u and v, which can be useful to identify the most relevant or irrelevant features.\n\nIdeally, however, one should also infer C, the correlation matrix that describes the dependencies in the feature selection process, and compute a posterior distribution for it. This can be complicated, even for approximate inference methods. Denote by Z the model evidence, i.e., the denominator in the r.h.s. of (7). A simpler alternative is to use gradient ascent to maximize log Z (and therefore Z) with respect to P, the matrix that completely specifies C. This corresponds to type-II maximum likelihood (ML) estimation and allows us to determine P from the training data alone, without resorting to cross-validation [19]. The gradient of log Z with respect to P, i.e., ∂ log Z/∂P, can be used for this task. 
The other hyper-parameters of the model, σ², ρ² and γ², can be found following a similar approach.\n\nUnfortunately, neither (7), (8) nor the model evidence can be computed in closed form. Specifically, it is not possible to compute the required integrals analytically. Thus, one has to resort to approximate inference. For this, we use expectation propagation [20]. See Section 3 for details.\n\nFigure 2: Factor graph of the probabilistic model. The factor f(·) corresponds to the likelihood N(y|Xw, σ²I), and each g_j(·) to the prior for w_j given u_j and v_j, N(w_j|0, u_j²/v_j²). Finally, h_u(·) and h_v(·) correspond to N(u|0, ρ²C) and N(v|0, γ²C), respectively. Only the targets y are observed; the other variables are latent.\n\n2.2 Extension to the Multi-Task Learning Setting\n\nIn the single-task learning setting, maximizing the model evidence with respect to P is not expected to be effective at improving the prediction accuracy. The reason is the difficulty of obtaining an accurate estimate of P. This matrix has m × d free parameters and these have to be induced from a small number n < d of training instances. The estimation process is hence likely to be affected by over-fitting. One way to mitigate over-fitting problems is to consider additional data for the estimation process. These additional data may come from a multi-task learning setting, where there are K related but different tasks available for induction. A simple assumption is that all these tasks share a common dependency structure C for the feature selection process, although the model coefficients and the actual relevant features may differ between tasks. This assumption is less restrictive than assuming jointly relevant and irrelevant features across tasks and can be incorporated into the learning process using the described model with few modifications. 
By using the data from the K tasks for the estimation of P we expect to obtain better estimates and to improve the prediction accuracy.\n\nAssume there are K learning tasks available for induction and that each task k = 1, …, K consists of a design matrix X_k with n_k d-dimensional data instances and target values y_k. As in (1), a linear predictive rule with additive Gaussian noise of variance σ_k² is considered for each task. Let w_k be the model coefficients of task k. Assume for the model coefficients of each task a horseshoe prior as the one specified in (4), with a shared correlation matrix C but with task specific hyper-parameters ρ_k² and γ_k². Denote by u_k and v_k the vectors of latent Gaussian variables of the prior for task k. Similarly, let z_k = (w_kᵀ, u_kᵀ, v_kᵀ)ᵀ be the vector of latent variables of task k. Then, the joint posterior distribution of the latent variables of the different tasks factorizes as follows:\n\np({z_k}_{k=1}^K | {X_k, y_k, σ_k², ρ_k², γ_k²}_{k=1}^K, C) = ∏_{k=1}^K p(y_k, z_k|X_k, σ_k², ρ_k², γ_k², C) / p(y_k|X_k, σ_k², ρ_k², γ_k², C),   (9)\n\nwhere each factor in the r.h.s. of (9) is given by (7). This indicates that the K models for each task can be learnt independently given C and σ_k², ρ_k², γ_k² for all k. Denote by Z_MT the denominator in the r.h.s. of (9), i.e., Z_MT = ∏_{k=1}^K p(y_k|X_k, σ_k², ρ_k², γ_k², C) = ∏_{k=1}^K Z_k, with Z_k the evidence for task k. Then, Z_MT is the model evidence for the multi-task setting. As in single-task learning, specific values for the hyper-parameters of each task and C can be found by a type-II maximum likelihood (ML) approach. For this, log Z_MT is maximized using gradient ascent. Specifically, the gradient of log Z_MT with respect to σ_k², ρ_k², γ_k² and P can be easily computed in terms of the gradient of each log Z_k. 
In summary, if there is a method to approximate the required quantities for learning a single task using the model proposed, implementing a multi-task learning method that assumes shared feature selection dependencies but task dependent hyper-parameters is straightforward.\n\n3 Approximate Inference\n\nExpectation propagation (EP) [20] is used to approximate the posterior distribution and the evidence of the model described in Section 2. For clarity of presentation we focus on the model for a single learning task. The multi-task extension of Section 2.2 is straightforward. Consider the posterior distribution of z in (7). Up to a normalization constant this distribution can be written as\n\np(z|X, y, σ², ρ², γ²) ∝ f(w) h_u(u) h_v(v) ∏_{j=1}^d g_j(z),   (10)\n\nwhere the factors in the r.h.s. of (10) are displayed in Figure 2. Note that all factors except the g_j's are Gaussian. EP approximates (10) by a distribution q(z) ∝ f(w) h_u(u) h_v(v) ∏_{j=1}^d g̃_j(z), which is obtained by replacing each non-Gaussian factor g_j in (10) with an approximate factor g̃_j that is Gaussian but need not be normalized. Since the Gaussian distribution belongs to the exponential family of distributions, which is closed under the product and division operations [21], q is Gaussian with natural parameters equal to the sum of the natural parameters of each factor.\n\nEP iteratively updates each g̃_j until convergence by first computing the cavity distribution q\\j ∝ q/g̃_j and then minimizing the Kullback-Leibler (KL) divergence between g_j q\\j and q_new, KL(g_j q\\j || q_new), with respect to q_new. The new approximate factor is obtained as g̃_j^new = s_j q_new/q\\j, where s_j is the normalization constant of g_j q\\j. This update rule ensures that g̃_j looks similar to g_j in regions of high posterior probability in terms of q\\j [20]. 
Minimizing the KL divergence is a convex problem whose optimum is found by matching the means and the covariance matrices of g_j q\\j and q_new. These expectations can be readily obtained from the derivatives of log s_j with respect to the natural parameters of q\\j [21]. Unfortunately, the computation of s_j is intractable under the horseshoe. As a practical alternative, our EP implementation employs numerical quadrature to evaluate s_j and its derivatives. Importantly, g_j, and therefore g̃_j, depend only on w_j, u_j and v_j, so a three-dimensional quadrature will suffice. However, using similar arguments to those in [7], more efficient alternatives exist. Assume that q\\j(w_j, u_j, v_j) = N(w_j|m_j, η_j) N(u_j|0, ν_j) N(v_j|0, ξ_j), i.e., q\\j factorizes with respect to w_j, u_j and v_j, and the means of u_j and v_j are zero. Since g_j is symmetric with respect to u_j and v_j, then E[u_j] = E[v_j] = E[u_j v_j] = E[u_j w_j] = E[v_j w_j] = 0 under g_j q\\j. Thus, if the initial approximate factors g̃_j factorize with respect to w_j, u_j and v_j, and have zero mean with respect to u_j and v_j, any updated factor will also satisfy these properties and q\\j will have the assumed form. The crucial point here is that the dependencies introduced by g_j do not lead to correlations that need to be tracked under a Gaussian approximation. In this situation, the integral of g_j q\\j with respect to w_j is given by the convolution of two Gaussians, and the integral of the result with respect to u_j and v_j can be simplified using arguments similar to those employed to obtain (3). Namely,\n\ns_j = ∫ N(m_j | 0, (ν_j/ξ_j) λ_j² + η_j) C⁺(λ_j|0, 1) dλ_j,   (11)\n\nwhere m_j, η_j, ν_j and ξ_j are the parameters of q\\j. The derivatives of log s_j with respect to the natural parameters of q\\j can also be evaluated using a one-dimensional quadrature. 
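The one-dimensional quadrature for s_j in (11) can be sketched as follows. This is our own illustrative code, not the paper's R implementation; the cavity parameter values passed at the end are hypothetical, and the half-Cauchy density 2/(π(1 + λ²)) on λ > 0 is written out explicitly.

```python
import numpy as np
from scipy.integrate import quad

def log_sj(m, eta, nu, xi):
    """Approximate log s_j in (11) with a one-dimensional quadrature.

    s_j = int_0^inf N(m | 0, (nu/xi) * lam^2 + eta) * C+(lam | 0, 1) dlam,
    where C+(.|0, 1) has density 2 / (pi * (1 + lam^2)) for lam > 0.
    """
    def integrand(lam):
        var = (nu / xi) * lam**2 + eta          # marginal variance of w_j
        gauss = np.exp(-0.5 * m**2 / var) / np.sqrt(2.0 * np.pi * var)
        half_cauchy = 2.0 / (np.pi * (1.0 + lam**2))
        return gauss * half_cauchy

    val, _ = quad(integrand, 0.0, np.inf)
    return np.log(val)

# Hypothetical cavity parameters (m_j, eta_j, nu_j, xi_j) for one coefficient:
log_s = log_sj(m=0.5, eta=1.0, nu=1.0, xi=1.0)
```

The derivatives of log s_j needed for the moment matching can be obtained the same way, by differentiating the integrand with respect to the cavity's natural parameters and running one quadrature per derivative, which matches the five quadratures per update mentioned next.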
Therefore, each update of g̃_j requires five quadratures: one to evaluate s_j and four to evaluate its derivatives. Instead of sequentially updating each g̃_j, we follow [7] and update these factors in parallel. For this, we compute all the q\\j at the same time and then update each g̃_j. Only the marginals of q are strictly required for this task. These can be efficiently obtained using the low rank representation of the covariance matrix of q that results from the fact that all the g̃_j's are factorizing univariate Gaussians and from the assumed form for C in (5). Specifically, if m (the number of columns of P) is equal to n, the cost of this operation (and hence the cost of EP) is O(n²d). Lastly, we damp the update of each g̃_j as follows: g̃_j = (g̃_j^new)^α (g̃_j^old)^(1−α), where g̃_j^new and g̃_j^old respectively denote the new and the old g̃_j, and α ∈ [0, 1] is a parameter that controls the amount of damping. Damping significantly improves the convergence of EP and leaves the fixed points of the algorithm invariant [22].\n\nAfter EP has converged, q can be used instead of the exact posterior in (8) to make predictions. Similarly, the model evidence in (7) can be approximated by Z̃, the normalization constant of q:\n\nZ̃ = ∫ f(w) h_u(u) h_v(v) ∏_{j=1}^d g̃_j(z) dw du dv.   (12)\n\nSince all the factors in (12) are Gaussian, log Z̃ can be readily computed and maximized with respect to σ², ρ², γ² and P to find good values for these hyper-parameters. Specifically, once EP has converged, the gradient of the natural parameters of the g̃_j's with respect to these hyper-parameters is zero [21]. Thus, the gradient of log Z̃ with respect to σ², ρ², γ² and P can be computed in terms of the gradient of the exact factors. 
The derivations are long and tedious and hence omitted here, but by careful consideration of the covariance structure of q it is possible to limit the complexity of these computations to O(n²d) if m is equal to n. Therefore, to fit a model that maximizes log Z̃, we alternate between running EP to obtain the estimate of log Z̃ and its gradient, and doing a gradient ascent step to maximize this estimate with respect to σ², ρ², γ² and P. The derivation details of the EP algorithm and an R-code implementation of it can be found in the supplementary material.\n\n4 Experiments\n\nWe carry out experiments to evaluate the performance of the model described in Section 2. We refer to this model as HSDep. Other methods from the literature are also evaluated. The first one, HSST, is a particular case of HSDep that is obtained when each task is learnt independently and correlations in the feature selection process are ignored (i.e., C = I). A multi-task learning model, HSMT, which assumes common relevant and irrelevant features among tasks, is also considered. The details of this model are omitted, but it follows [10] closely. It assumes a horseshoe prior in which the scale parameters λ_j in (2) are shared among tasks, i.e., each feature is either relevant or irrelevant in all tasks. A variant of HSMT, SSMT, is also evaluated. SSMT considers a spike-and-slab prior for joint feature selection across all tasks, instead of a horseshoe prior. The details about the prior of SSMT are given in [10]. EP is used for approximate inference in both HSMT and SSMT. The dirty model, DM, described in [15] is also considered. This model assumes shared relevant and irrelevant features among tasks. However, some tasks are allowed to have specific relevant features. For this, a loss function is minimized via combined ℓ1 and ℓ1/ℓ∞ block regularization. 
Particular cases of DM are the lasso [4] and the group lasso [8]. Finally, we evaluate the model introduced in [16]. This model, BM, uses spike-and-slab priors for feature selection and specifies dependencies in this process using a Boltzmann machine. BM is trained using the approximate block-coordinate algorithm described in [17]. All models considered assume Gaussian additive noise around the targets.\n\n4.1 Experiments with Synthetic Data\n\nA first batch of experiments is carried out using synthetic data. We generate K = 64 different tasks of n = 64 samples and d = 128 features. In each task, the entries of X_k are sampled from a standard Gaussian distribution and the model coefficients, w_k, are all set to zero except for the i-th group of 8 consecutive coefficients, with i chosen randomly for each task from the set {1, 2, …, 16}. The values of these 8 non-zero coefficients are uniformly distributed in the interval [−1, 1]. Thus, in each task there are only 8 relevant features for prediction. Given each X_k and each w_k, the targets y_k are obtained using (1) with σ_k² = 0.5 for all k. The hyper-parameters of each method are set as follows: in HSST, ρ_k² and γ_k² are found by type-II ML. In HSMT, ρ² and γ² are set to the average values found by HSST for ρ_k² and γ_k², respectively. In SSMT, the parameters of the spike-and-slab prior are found by type-II ML. In HSDep, m = n. Furthermore, ρ_k² and γ_k² take the values found by HSST, while P is obtained using type-II ML. In all models we set the variance of the noise for task k, σ_k², equal to 0.5. Finally, in DM we try different hyper-parameters and report the best results observed. After training each model on the data, we measure the average reconstruction error of w_k. 
Denote by ŵ_k the estimate of the model coefficients for task k (this is the posterior mean, except in BM and DM). The reconstruction error is measured as ||ŵ_k − w_k||₂ / ||w_k||₂, where ||·||₂ is the ℓ2-norm and w_k are the exact coefficients of task k.\n\nFigure 3 (top) shows the average reconstruction error of each method over 50 repetitions of the experiments described. HSDep obtains the lowest error. The observed differences in performance are significant according to a Student's t-test (p-value < 5%). BM performs worse than HSDep because the greedy MAP estimation of the sparsity patterns of each task is sometimes trapped in sub-optimal solutions. The poor results of HSMT, SSMT and DM are due to the assumption made by these models that all tasks share relevant features, which is not satisfied. Figure 3 (bottom) shows the average entries, in absolute value, of the correlation matrix C estimated by HSDep. The matrix has a block diagonal form, with blocks of size 8 × 8 (8 is the number of relevant coefficients in each task). Thus, within each block the corresponding latent variables u_j and v_j are strongly correlated, indicating jointly relevant or irrelevant features. This is the expected estimation for the scenario considered.\n\nFigure 3: (top) Average reconstruction error of each method: HSST 0.29±0.01, HSMT 0.38±0.03, SSMT 0.77±0.01, DM 0.37±0.01, BM 0.24±0.02, HSDep 0.21±0.01. (bottom) Average absolute value of the entries of the matrix C estimated by HSDep in gray scale (white = 0 and black = 1). Black squares are groups of jointly relevant / irrelevant features.\n\n4.2 Reconstruction of Images of Hand-written Digits from MNIST\n\nA second batch of experiments considers the reconstruction of images of hand-written digits extracted from the MNIST data set [23]. These images are in gray scale with pixel values between 0 and 255. 
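The synthetic-task generation of Section 4.1 and the reconstruction-error metric can be sketched as follows. The function names are ours, and the ridge estimator at the end is only a simple stand-in baseline (none of the paper's models, HSDep included, is reimplemented here).

```python
import numpy as np

def make_task(n=64, d=128, group=0, noise_var=0.5, rng=None):
    """One synthetic task as in Section 4.1: 8 consecutive relevant
    coefficients (positions group*8 .. group*8+7), uniform in [-1, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    X = rng.normal(size=(n, d))
    w = np.zeros(d)
    w[group * 8:(group + 1) * 8] = rng.uniform(-1.0, 1.0, size=8)
    y = X @ w + rng.normal(0.0, np.sqrt(noise_var), size=n)
    return X, y, w

def reconstruction_error(w_hat, w):
    """|| w_hat - w ||_2 / || w ||_2, the metric used in the experiments."""
    return np.linalg.norm(w_hat - w) / np.linalg.norm(w)

rng = np.random.default_rng(3)
X, y, w = make_task(group=4, rng=rng)
# A (non-sparse) ridge estimate as a simple baseline estimator:
w_hat = X.T @ np.linalg.solve(X @ X.T + 0.5 * np.eye(len(y)), y)
err = reconstruction_error(w_hat, w)
```

Averaging this error over the K tasks and over 50 repetitions yields numbers comparable in kind (though not in value, given the baseline estimator) to those reported for Figure 3.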
Most pixels are inactive and equal to 0. Thus, the images are sparse and suitable to be reconstructed using the model proposed. The images are reduced to size 10 x 10 pixels and the pixel intensities are normalized to lie in the interval [0, 1]. Then, K = 100 tasks of n = 75 samples each are generated. For this, we randomly choose 50 images corresponding to the digit 3 and 50 images corresponding to the digit 5 (these digits are chosen because they differ significantly). Similar results (not shown) to the ones reported here are obtained for other pairs of digits. For each task, the entries of X_k are sampled from a standard Gaussian. The model coefficients, w_k, are simply the pixel values of each image (i.e., d = 100). Importantly, unlike in the previous experiments, the model coefficients are not synthetically generated but correspond to actual images. Furthermore, since the tasks contain images of different digits, they are expected to have different relevant features. Given X_k and w_k, the targets y_k are generated using (1) with σ_k^2 = 0.1 ∀k. The objective is to reconstruct w_k from X_k and y_k for each task k. The hyper-parameters are set as in Section 4.1, with σ_k^2 = 0.1 ∀k. The reconstruction error is also measured as in that section.

Figure 4 (top) shows the average reconstruction error of each method over 50 repetitions of the experiments described. Again, HSDep performs best. Furthermore, the differences in performance are also statistically significant. The second best result corresponds to HSMT, probably due to background pixels which are irrelevant in all the tasks and to the heavy tails of the horseshoe prior. HSST, SSMT, BM and DM perform significantly worse. DM performs poorly probably because of the inferior shrinkage properties of the l1 norm compared to the horseshoe [3]. The poor results of SSMT are due to the lack of heavy tails in the spike-and-slab prior.
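The construction of one such image-reconstruction task can be sketched as follows. This is a minimal illustration under stated assumptions: a random sparse 10 x 10 array stands in for a downsampled, normalized MNIST digit (loading actual MNIST images is omitted), and the ridge estimate at the end is only a simple non-sparse baseline, not one of the methods compared in this section:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 75, 100  # samples per task; 10 x 10 = 100 pixel coefficients

# A random sparse "image" stands in for a downsampled MNIST digit:
# most pixels are zero, the rest have intensities in [0, 1].
image = np.zeros((10, 10))
image[3:7, 2:8] = rng.uniform(0.0, 1.0, (4, 6))

w = image.ravel()                             # coefficients = pixel values (d = 100)
X = rng.standard_normal((n, d))               # standard Gaussian design matrix
y = X @ w + rng.normal(0.0, np.sqrt(0.1), n)  # model (1) with sigma_k^2 = 0.1

# Recovering w from (X, y) is under-determined (n = 75 < d = 100);
# a ridge estimate provides a simple baseline reconstruction.
w_hat = np.linalg.solve(X.T @ X + 0.1 * np.eye(d), X.T @ y)
error = np.linalg.norm(w_hat - w) / np.linalg.norm(w)
```

Because n < d, some prior structure (sparsity, and in HSDep the learned selection dependencies) is what makes accurate reconstruction possible.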
In BM we have observed that the greedy MAP estimation of the task supports is more frequently trapped in sub-optimal solutions. Furthermore, the algorithm described in [17] fails to converge most times in this scenario. Figure 4 (right, bottom) shows a representative subset of the images reconstructed by each method. The best reconstructions correspond to HSDep. Finally, Figure 4 (left, bottom) shows in gray scale the average correlations in absolute value induced by HSDep for the selection process of each pixel of the image with respect to the selection of a particular pixel, which is displayed in green. Correlations are high to avoid the selection of background pixels and to select pixels that actually correspond to the digits 3 and 5. The correlations induced are hence appropriate for the multi-task problem considered.

Method   Error
HSST     0.36±0.02
HSMT     0.25±0.02
SSMT     0.39±0.01
DM       0.37±0.01
BM       0.52±0.03
HSDep    0.20±0.01

Figure 4: (top) Average reconstruction error of each method. (left, bottom) Average absolute value correlation in gray scale (white = 0 and black = 1) between the latent variables u_j and v_j corresponding to the pixel displayed in green and the variables u_j and v_j corresponding to all the other pixels of the image. (right, bottom) Examples of actual and reconstructed images by each method. The best reconstruction results correspond to HSDep.

5 Conclusions and Future Work

We have described a linear sparse model for learning dependencies in the feature selection process. The model can be used in a multi-task learning setting with several tasks available for induction that need not share relevant features, but only dependencies in the feature selection process. Exact inference is intractable in such a model.
However, expectation propagation provides an efficient approximate alternative with a cost in O(Kn^2 d), where K is the number of tasks, n is the number of samples of each task, and d is the dimensionality of the data. Experiments with real and synthetic data illustrate the benefits of the proposed method. Specifically, this model performs better than other multi-task alternatives from the literature. Our experiments also show that the proposed model is able to induce relevant feature selection dependencies from the training data alone. Future paths of research include the evaluation of this model in practical problems of sparse coding, i.e., when all tasks share a common design matrix X that has to be induced from the data alongside the model coefficients, with potential applications to image denoising and image inpainting [24].

Acknowledgment: Daniel Hernández-Lobato is supported by the Spanish MCyT (Ref. TIN2010-21575-C02-02). José Miguel Hernández-Lobato is supported by Infosys Labs, Infosys Limited.

References

[1] I. M. Johnstone and D. M. Titterington. Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4237, 2009.

[2] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023-1032, 1988.

[3] C. M. Carvalho, N. G. Polson, and J. G. Scott. Handling sparsity via the horseshoe. Journal of Machine Learning Research W&CP, 5:73-80, 2009.

[4] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996.

[5] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine.
Journal of Machine Learning Research, 1:211-244, 2001.

[6] J. M. Hernández-Lobato, D. Hernández-Lobato, and A. Suárez. Network-based sparse Bayesian classification. Pattern Recognition, 44:886-900, 2011.

[7] M. Van Gerven, B. Cseke, R. Oostenveld, and T. Heskes. Bayesian source localization with the multivariate Laplace prior. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1901-1909, 2009.

[8] J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In M. Goesele et al., editors, 32nd Annual Symposium of the German Association for Pattern Recognition, volume 6376, pages 252-261. Springer, 2010.

[9] Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica, 16(2):375, 2006.

[10] D. Hernández-Lobato, J. M. Hernández-Lobato, T. Helleputte, and P. Dupont. Expectation propagation for Bayesian multi-task feature selection. In J. L. Balcázar, F. Bonchi, A. Gionis, and M. Sebag, editors, Proceedings of the European Conference on Machine Learning, volume 6321, pages 522-537. Springer, 2010.

[11] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, pages 1-22, 2009.

[12] T. Xiong, J. Bi, B. Rao, and V. Cherkassky. Probabilistic joint feature selection for multi-task learning. In Proceedings of the Seventh SIAM International Conference on Data Mining, pages 332-342. SIAM, 2007.

[13] T. Jebara. Multi-task feature and kernel selection for SVMs. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 55-62. ACM, 2004.

[14] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In B.
Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 41-48. MIT Press, Cambridge, MA, 2007.

[15] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 964-972, 2010.

[16] P. Garrigues and B. Olshausen. Learning horizontal connections in a sparse coding model of natural images. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 505-512. MIT Press, Cambridge, MA, 2008.

[17] T. Peleg, Y. C. Eldar, and M. Elad. Exploiting statistical dependencies in sparse representations for signal recovery. IEEE Transactions on Signal Processing, 60(5):2286-2303, 2012.

[18] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 1984.

[19] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, August 2006.

[20] T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.

[21] M. W. Seeger. Expectation propagation for exponential families. Technical report, Department of EECS, University of California, Berkeley, 2006.

[22] T. Minka. Power EP. Technical report, Carnegie Mellon University, Department of Statistics, 2004.

[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[24] J. Mairal, F. Bach, J. Ponce, and G. Sapiro.
Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19-60, 2010.