{"title": "Adaptive Multi-Task Lasso: with Application to eQTL Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 1306, "page_last": 1314, "abstract": "To understand the relationship between genomic variations among population and complex diseases, it is essential to detect eQTLs which are associated with phenotypic effects. However, detecting eQTLs remains a challenge due to complex underlying mechanisms and the very large number of genetic loci involved compared to the number of samples. Thus, to address the problem, it is desirable to take advantage of the structure of the data and prior information about genomic locations such as conservation scores and transcription factor binding sites. In this paper, we propose a novel regularized regression approach for detecting eQTLs which takes into account related traits simultaneously while incorporating many regulatory features. We first present a Bayesian network for a multi-task learning problem that includes priors on SNPs, making it possible to estimate the significance of each covariate adaptively. Then we find the maximum a posteriori (MAP) estimation of regression coefficients and estimate weights of covariates jointly. This optimization procedure is efficient since it can be achieved by using convex optimization and a coordinate descent procedure iteratively. Experimental results on simulated and real yeast datasets confirm that our model outperforms previous methods for finding eQTLs.", "full_text": "Adaptive Multi-Task Lasso: with Application to\n\neQTL Detection\n\nSeunghak Lee, Jun Zhu and Eric P. Xing\n\nSchool of Computer Science, Carnegie Mellon University\n\n{seunghak,junzhu,epxing}@cs.cmu.edu\n\nAbstract\n\nTo understand the relationship between genomic variations among population and\ncomplex diseases, it is essential to detect eQTLs which are associated with phe-\nnotypic effects. 
However, detecting eQTLs remains a challenge due to complex\nunderlying mechanisms and the very large number of genetic loci involved com-\npared to the number of samples. Thus, to address the problem, it is desirable to\ntake advantage of the structure of the data and prior information about genomic\nlocations such as conservation scores and transcription factor binding sites.\nIn this paper, we propose a novel regularized regression approach for detecting\neQTLs which takes into account related traits simultaneously while incorporating\nmany regulatory features. We \ufb01rst present a Bayesian network for a multi-task\nlearning problem that includes priors on SNPs, making it possible to estimate the\nsigni\ufb01cance of each covariate adaptively. Then we \ufb01nd the maximum a posteriori\n(MAP) estimation of regression coef\ufb01cients and estimate weights of covariates\njointly. This optimization procedure is ef\ufb01cient since it can be achieved by us-\ning a projected gradient descent and a coordinate descent procedure iteratively.\nExperimental results on simulated and real yeast datasets con\ufb01rm that our model\noutperforms previous methods for \ufb01nding eQTLs.\n\n1 Introduction\n\nOne of the fundamental problems in computational biology is to understand associations between\ngenomic variations and phenotypic effects. The most common genetic variations are single nu-\ncleotide polymorphisms (SNPs), and many association studies have been conducted to \ufb01nd SNPs\nthat cause phenotypic variations such as diseases or gene-expression traits [1]. However, association\nmapping of causal QTLs or eQTLs remains challenging as the variation of complex traits is a result\nof contributions of many genomic variations. In this paper, we focus on two important problems to\ndetect eQTLs. 
First, we need to \ufb01nd methods to take advantage of the structure of data for \ufb01nding\nassociation SNPs from high dimensional eQTL datasets when p \u226b N , where p is the number of\nSNPs and N is the sample size. Second, we need techniques to take advantage of prior biological\nknowledge to improve the performance of detecting eQTLs.\n\nTo address the \ufb01rst problem, Lasso is a widely used technique for high-dimensional association\nmapping problems, which can yield a sparse and easily interpretable solution via an \u21131 regularization\n[2]. However, despite the success of Lasso, it is limited to considering each trait separately. If we\nhave multiple related traits it would be bene\ufb01cial to estimate eQTLs jointly since we can share\ninformation among related traits. For the second problem, Fig. 1 shows some prior knowledge on\nSNPs in a genome including transcription factor binding sites (TFBS), 5\u2019 UTR and exon, which play\nimportant roles for the regulation of genes. For example, TFBS controls the transcription of DNA\nsequences to mRNAs. Intuitively, if SNPs are located on these regions, they are more likely to be\ntrue eQTLs compared to those on regions without such annotations since they are related to genes or\ngene regulations. Thus, it would be desirable to penalize regression coef\ufb01cients less corresponding\n\n1\n\n\fTranscription factor binding site\n\n5\u2019 UTR\n\nExon \n\nSNPs\n\nChromosome\n\nAnnotation\n\nFigure 1: Examples of prior knowledge on SNPs including transcription factor binding sites, 5\u2019 UTR and\nexon. Arrows represent SNPs and we indicate three genomic annotations on the chromosome. Here association\nSNPs are denoted by red arrows (best viewed in color), showing that SNPs on regions with regulatory features\nare more likely to be associated with traits.\n\nto SNPs having signi\ufb01cant annotations such as TFBS in a regularized regression model. 
Again, the\nwidely used Lasso is limited to treating all SNPs equally.\n\nThis paper presents a novel regularized regression approach, called adaptive multi-task Lasso, to\neffectively incorporate both the relatedness among multiple gene-expression traits and useful prior\nknowledge for challenging eQTL detection. Although some methods have been developed for either\nadaptive or multi-task learning, to the best of our knowledge, adaptive multi-task Lasso is the \ufb01rst\nmethod that can consider prior information on SNPs and multi-task learning simultaneously in one\nsingle framework. For example, Lirnet uses prior knowledge on SNPs such as conservation scores,\nnon-synonymous coding and UTR regions for a better search of association mappings [3]. However,\nLirnet considers the average effects of SNPs on gene modules by assuming that association SNPs are\nshared in a module. This approach is different from multi-task learning where association SNPs are\nfound for each trait while considering group effects over multiple traits. To \ufb01nd genetic markers that\naffect correlated traits jointly, the graph-guided fused Lasso [4] was proposed to consider networks\nover multiple traits within an association analysis. However, graph-guided fused Lasso does not\nincorporate prior knowledge of genomic locations.\n\nUnlike other methods, we de\ufb01ne the adaptive multi-task Lasso as \ufb01nding a MAP estimate of a\nBayesian network, which provides an elegant Bayesian interpretation of our approach; the resultant\noptimization problem is ef\ufb01ciently solved with an alternating minimization procedure. Finally, we\npresent empirical results on both simulated and real yeast eQTL datasets, which demonstrates the\nadvantages of adaptive multi-task Lasso over many other competitors.\n\n2 Problem De\ufb01nition: Adaptive Multi-task Lasso\n\nLet Xij \u2208 {0, 1, 2} denote the number of minor alleles at the j-th SNP of i-th individual for\ni = 1, . . . , N and j = 1, . . . , p. 
We have K related gene traits, and Y_i^k represents the gene expression level of the k-th gene of the i-th individual for k = 1, . . . , K. In our setting, we assume that the K traits are related to each other, and we explore this relatedness in a multi-task learning framework. To achieve relatedness among tasks via grouping effects [5], we can use any clustering algorithm such as spectral clustering or hierarchical clustering. In association mapping problems, these clusters can be viewed as clusters of genes which constitute regulatory networks or pathways [4]. We treat the problem of detecting eQTLs as a linear regression problem. The general setting includes one design matrix X and multiple tasks (genes) for k = 1, . . . , K,

Y^k = Xβ^k + ε,    (1)

where ε is standard Gaussian noise. We further assume that the X_ij are standardized such that Σ_i X_ij / N = 0 and Σ_i X_ij² / N = 1, and consider a model without an intercept.

Now, the open question is how we can devise an appropriate objective function over β that effectively considers the desirable group effects over multiple traits and incorporates useful prior knowledge, as we have stated. To explain the motivation of our work and provide a useful baseline that grounds the proposed approach, we first briefly review the standard Lasso and multi-task Lasso.

2.1 Lasso and Multi-task Lasso

Lasso [2] is a technique for estimating the regression coefficients β and has been widely used for association mapping problems. Mathematically, it solves the ℓ1-regularized least squares problem,

β̂ = argmin_β (1/2) ‖Y − Xβ‖₂² + λ Σ_{j=1}^p δ_j |β_j|,    (2)

where λ determines the degree of regularization of nonzero β_j. The scaling parameters δ_j ∈ [0, 1] are usually fixed (e.g., unit ones) or set by cross-validation, which can be very difficult when p is large.
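For concreteness, a weighted Lasso problem of the form (2) can be solved by cyclic coordinate descent with soft-thresholding. The sketch below is illustrative only; the function names and the fixed iteration count are our own, not from the paper:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of t * |.|."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, delta=None, n_iter=100):
    """Cyclic coordinate descent for
    0.5 * ||y - X b||^2 + lam * sum_j delta_j * |b_j|.

    Assumes the columns of X are standardized; delta holds the
    per-coefficient scaling parameters (unit ones by default)."""
    N, p = X.shape
    delta = np.ones(p) if delta is None else delta
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # ||X_j||^2 per column
    r = y.copy()                       # residual y - X beta
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]     # remove j-th contribution
            rho = X[:, j] @ r          # partial correlation with residual
            beta[j] = soft_threshold(rho, lam * delta[j]) / col_sq[j]
            r -= X[:, j] * beta[j]
    return beta
```

With an orthonormal design the update reduces to one pass of soft-thresholding, which makes the sparsity-inducing effect of the ℓ1 penalty easy to see.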
Due to the singularity at the origin, the ℓ1 regularization (Lasso penalty) can yield a stable and sparse solution, which is desirable for association mapping problems because in most cases we have p ≫ N and there exists only a small number of eQTLs. It is worth mentioning that Lasso estimates are posterior mode estimates under a multivariate independent Laplace prior for β [2].

As we can see from problem (2), the standard Lasso does not distinguish the inputs and regression coefficients from different tasks. In order to capture desirable properties (e.g., shared structures or sparse patterns) among multiple related tasks, the multi-task Lasso was proposed [5], which solves the problem

min_β (1/2) Σ_{k=1}^K ‖Y^k − Xβ^k‖₂² + λ Σ_{j=1}^p δ_j ‖β_j‖₂,    (3)

where ‖β_j‖₂ = √(Σ_k (β_j^k)²) is the ℓ2-norm. This model encourages group-wise sparsity across related tasks via the ℓ1/ℓ2 regularization. Again, the solution of Eq. (3) can be interpreted as a MAP estimate under appropriate priors with fixed scaling parameters.

Multi-task Lasso has been applied (with some extensions) to perform association analysis [4]. However, as we have stated, the limitation of current approaches is that they do not incorporate useful prior knowledge. The proposed adaptive multi-task Lasso is an extension of the multi-task Lasso that performs joint group-wise and within-group feature selection and incorporates useful prior knowledge for effective association analysis.

2.2 Adaptive Multi-task Lasso

Now, we formally introduce the adaptive multi-task Lasso. For clarity, we first define the sparse multi-task Lasso with fixed scaling parameters, which will be a sub-problem of the adaptive multi-task Lasso, as we shall see.
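The two ingredients of Eq. (3), the blockwise objective and the group-wise shrinkage it induces, can be made concrete as follows. This is a hedged sketch with illustrative names (`multitask_objective`, `group_soft_threshold`), not the paper's implementation:

```python
import numpy as np

def multitask_objective(X, Y, B, lam, delta):
    """Objective of Eq. (3):
    0.5 * sum_k ||Y^k - X B^k||^2 + lam * sum_j delta_j * ||B_j||_2.

    X: (N, p) design; Y: (N, K) traits; B: (p, K) coefficients, so
    row B_j collects the coefficients of SNP j across all K tasks."""
    fit = 0.5 * np.sum((Y - X @ B) ** 2)
    group = lam * np.sum(delta * np.linalg.norm(B, axis=1))
    return fit + group

def group_soft_threshold(b, t):
    """Proximal map of t * ||.||_2: shrinks the whole group toward the
    origin and sets it exactly to zero when its norm is at most t."""
    nrm = np.linalg.norm(b)
    return np.zeros_like(b) if nrm <= t else (1.0 - t / nrm) * b
```

The group prox either keeps or kills an entire row B_j, which is precisely the "SNP j is relevant for all related traits or for none" behavior that the ℓ1/ℓ2 penalty encodes.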
Specifically, the sparse multi-task Lasso solves the problem

min_β (1/2) Σ_{k=1}^K ‖Y^k − Xβ^k‖₂² + λ1 Σ_{j=1}^p θ_j Σ_{k=1}^K |β_j^k| + λ2 Σ_{j=1}^p ρ_j ‖β_j‖₂,    (4)

where θ and ρ are the scaling parameters for the ℓ1 and ℓ1/ℓ2 norms, respectively. The regularization parameters λ1 and λ2 can be determined by cross or holdout validation. Obviously, this model subsumes the standard Lasso and the multi-task Lasso, and it has three advantages over previous models. First, unlike the multi-task Lasso, which contains the ℓ1/ℓ2-norm only to achieve group-wise sparsity, the ℓ1-norm in Eq. (4) can achieve sparsity among SNPs within a group. This property is useful when the K tasks are not perfectly related and we need additional sparsity within each block ‖β_j‖₂. In Section 4, we demonstrate the usefulness of this blended regularization. The hierarchical penalization [6] can achieve a smooth shrinkage effect for variables within a group, but it cannot achieve within-group sparsity. Second, unlike Lasso, we induce group sparsity across multiple related traits. Finally, unlike Lasso and multi-task Lasso, which treat each β_j equally or with a fixed scaling parameter, we can adaptively penalize each β_j according to prior knowledge on covariates, in such a way that SNPs having desirable features are penalized less (see Fig. 1 for details of prior knowledge on SNPs).

To incorporate the prior knowledge as we have stated, we propose to automatically learn the scaling parameters (θ, ρ) from data. To that end, we define θ and ρ as mixtures of features on the j-th SNP, i.e.,

θ_j = Σ_t ω_t f_t^j  and  ρ_j = Σ_t ν_t f_t^j,    (5)

where f_t^j is the t-th feature of the j-th SNP. For example, f_t^j can be the conservation score of the j-th SNP, or one if the SNP is located on a TFBS and zero otherwise.
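Eq. (5) is simply a feature-weighted mixture per SNP, which in matrix form is a product of the p×T feature matrix with the two weight vectors. A minimal sketch, where `F`, `omega` and `nu` are assumed names for the feature matrix and simplex-constrained weights:

```python
import numpy as np

def scaling_params(F, omega, nu):
    """Eq. (5): theta_j = sum_t omega_t * f^j_t and
    rho_j = sum_t nu_t * f^j_t.

    F: (p, T) matrix of per-SNP features (one column per feature type);
    omega, nu: nonnegative weights summing to one (on the simplex)."""
    theta = F @ omega
    rho = F @ nu
    return theta, rho
```

Because ω and ν lie on the simplex, each θ_j and ρ_j is a convex combination of that SNP's feature values, so the learned weights are directly interpretable as relative feature importances.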
To avoid scaling issues, we assume each feature is standardized, i.e., Σ_j f_t^j = 1, ∀t. Since we are interested in the relative contributions from different features, we further add the constraints Σ_t ω_t = 1 and Σ_t ν_t = 1. These constraints can be interpreted as a regularization on the feature weights ω ≥ 0 and ν ≥ 0.

Although plugging the definitions (5) into problem (4) and jointly estimating β and the feature weights (ω, ν) would give a solution to adaptive multi-task learning, the resultant method would lack an elegant Bayesian interpretation, which is a desirable property that can make the framework more flexible and easily extensible. Recall that the Lasso estimates can be interpreted as MAP estimates under Laplace priors. Similarly, to achieve a framework that enjoys an elegant Bayesian interpretation, we define a Bayesian network and treat the adaptive multi-task learning problem as finding its MAP estimate. Specifically, we build a Bayesian network as shown in Fig. 2 in order to compute the MAP estimate of β under adaptive scaling parameters {θ, ρ}.

Figure 2: Graphical model representation of adaptive multi-task Lasso.

We define the conditional probability of β given the scaling parameters as

P(β | θ, ρ) = (1 / Z(θ, ρ)) Π_{j=1}^p Π_{k=1}^K exp(−θ_j |β_j^k|) × Π_{j=1}^p exp(−ρ_j ‖β_j‖₂),

where Z(θ, ρ) is a normalization factor, and P(Y | X, β) ∼ N(Xβ, Σ), where Σ is the identity matrix.
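The MAP correspondence can be checked numerically: −log P(β | θ, ρ) equals the two penalty terms of problem (4) up to the normalizer log Z(θ, ρ). The following is our own minimal sketch of that identity, not code from the paper:

```python
import numpy as np

def neg_log_prior(B, theta, rho):
    """-log P(beta | theta, rho) up to the additive constant log Z:
    sum_j theta_j * sum_k |B_jk|  +  sum_j rho_j * ||B_j||_2.

    B: (p, K) coefficient matrix; theta, rho: per-SNP scaling params."""
    l1 = np.sum(theta * np.abs(B).sum(axis=1))       # weighted l1 part
    l12 = np.sum(rho * np.linalg.norm(B, axis=1))    # weighted l1/l2 part
    return l1 + l12
```

Adding the Gaussian negative log-likelihood of Y given Xβ to this quantity recovers (up to constants) the objective that the MAP estimate minimizes.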
Although in principle we can treat θ and ρ as random variables and define a fully Bayesian approach, for simplicity we define θ and ρ as deterministic functions of ω and ν as in Eq. (5). Extension to a fully Bayesian approach is left as future work.

Now we define the adaptive multi-task Lasso as finding the MAP estimate of β while simultaneously estimating the feature weights (ω, ν), which is equivalent to solving the optimization problem

min_{β,ω,ν} (1/2) Σ_{k=1}^K ‖Y^k − Xβ^k‖₂² + λ1 Σ_{j=1}^p θ_j Σ_{k=1}^K |β_j^k| + λ2 Σ_{j=1}^p ρ_j ‖β_j‖₂ + log Z(θ, ρ),    (6)

where ω and ν are related to θ and ρ through Eq. (5) and are subject to the constraints defined above.

Remark 1 Although we can interpret problem (4) as a MAP estimate of β under appropriate priors when the scaling parameters (θ, ρ) are fixed, it does not enjoy an elegant Bayesian interpretation if we perform joint estimation of β and the scaling parameters (ω, ν), because doing so ignores the normalization factors of the corresponding priors. Lee et al. [3] used this approach, where a regularized regression model is optimized over the scaling parameters and β jointly; therefore their method does not have an elegant Bayesian interpretation. Moreover, as we have stated, Lee et al. [3] did not consider grouping effects over multiple traits.

Remark 2 Our method also differs from the adaptive Lasso [7], transfer learning with meta-priors [8], and the Bayesian Lasso [9]. First, although both the adaptive Lasso and our method use adaptive parameters for penalizing regression coefficients, we learn the adaptive parameters from prior knowledge on covariates in a multi-task setting, while the adaptive Lasso uses ordinary least squares solutions for its adaptive parameters in a single-task setting.
Second, the method of transfer learning with meta-priors [8] is similar to ours in the sense that both use prior knowledge with multiple related tasks. However, we couple related tasks via the ℓ1/ℓ2 penalty, while they couple tasks by transferring hyper-parameters among them. Thus we obtain group sparsity across tasks as well as sparsity within each group, whereas they cannot induce group sparsity across different tasks. Finally, the Bayesian Lasso [9] does not have grouping effects over multiple traits, and the priors used there usually do not incorporate domain knowledge.

3 Optimization: an Alternating Minimization Approach

Now, we solve the adaptive multi-task Lasso problem (6). First, since the normalization factor Z is hard to compute, we use its upper bound, given by

Z ≤ Π_{j=1}^p [∫_{R^K} exp(−ρ_j ‖β_j‖₂) dβ_j] Π_{j=1}^p (2/θ_j)^K = Π_{j=1}^p [π^{(K−1)/2} Γ((K+1)/2) 2^K / ρ_j^K] Π_{j=1}^p (2/θ_j)^K.    (7)

This integral result follows from the normalization constant of the K-dimensional multivariate Laplace distribution [10, 11]. Using this upper bound, the learning problem is to minimize an upper bound of the objective function in problem (6), which will be denoted by L(β, ω, ν) henceforth. Although L is not jointly convex over β, ω and ν, it is convex over β given {ω, ν} and convex over {ω, ν} given β. We use an alternating optimization procedure which (1) minimizes the upper bound L of problem (6) over {ω, ν} with β fixed; and (2) minimizes L over β with {ω, ν} fixed, iterating until convergence [12].
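The alternating scheme just described can be written as a generic skeleton in which the two sub-problem solvers are swappable. In the sketch below, `update_weights` and `update_beta` are placeholders standing in for the projected gradient step over (ω, ν) and the coordinate descent step over β; this is our own illustration, not the authors' implementation:

```python
import numpy as np

def alternating_minimization(update_weights, update_beta, beta0,
                             n_outer=50, tol=1e-6):
    """Generic alternating scheme for a bi-convex objective:
    (1) fit (omega, nu) with beta fixed, (2) fit beta with (omega, nu)
    fixed, repeating until beta stabilizes.

    update_weights(beta) -> (omega, nu); update_beta(omega, nu) -> beta."""
    beta = beta0
    for _ in range(n_outer):
        omega, nu = update_weights(beta)        # first convex sub-problem
        new_beta = update_beta(omega, nu)       # second convex sub-problem
        if np.max(np.abs(new_beta - beta)) < tol:
            beta = new_beta
            break
        beta = new_beta
    return beta, omega, nu
```

Since each sub-problem is convex and decreases the shared upper bound L, the outer loop produces a monotonically non-increasing objective sequence, though only convergence to a stationary point of the non-jointly-convex problem is guaranteed.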
Both sub-problems are convex and can be solved efficiently via a projected gradient descent method and a coordinate descent method, respectively.

For the first step of optimizing L over ω and ν, the sub-problem is to solve

min_{ω∈P_ω, ν∈P_ν} Σ_j Σ_k (−log θ_j + θ_j |β_j^k|) + Σ_j (−K log ρ_j + ρ_j ‖β_j‖₂),

where P_ω ≜ {ω : Σ_t ω_t = 1, ω_t ≥ 0, ∀t} is a simplex over ω, and likewise for P_ν; θ and ρ are functions of ω and ν as defined in Eq. (5). This constrained problem is convex and can be solved by using a gradient descent algorithm combined with a projection onto the simplex, which can be done efficiently [13]. Since ω and ν are not coupled, we can learn each of them separately.

The second sub-problem, which optimizes L over β given fixed feature weights (ω, ν), is exactly the optimization problem (4). We can solve it using a coordinate descent procedure, which has been used to optimize the sparse group Lasso [14]. Our problem differs from the sparse group Lasso in the sense that the sparse group Lasso imposes a group penalty over multiple covariates for a single trait, while the adaptive multi-task Lasso considers group effects over multiple traits. Here we solve problem (4) using a modified version of the algorithm proposed for the sparse group Lasso.

As summarized in Algorithm 1, the general optimization procedure is as follows: for each j, we check the group sparsity condition β_j = 0. If it holds, no update is needed for β_j. Otherwise, we check whether β_j^k = 0 for each k. If β_j^k = 0, no update is needed for β_j^k; otherwise, we optimize problem (4) over β_j^k with all other coefficients fixed.
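The projection onto the simplex P_ω used in the first sub-problem can be computed exactly by a standard sorting-based algorithm in the spirit of [13]. The following is a common textbook variant, not necessarily the exact routine used in the paper:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}.

    Sort-based O(T log T) algorithm: find the threshold tau such that
    max(v - tau, 0) sums to one, then clip."""
    u = np.sort(v)[::-1]                  # sorted descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    # largest k with u_k + (1 - cumsum_k)/k > 0
    cond = u + (1.0 - css) / ks > 0
    k = ks[cond][-1]
    tau = (css[k - 1] - 1.0) / k
    return np.maximum(v - tau, 0.0)
```

In a projected gradient loop, each gradient step on ω (or ν) is followed by `project_simplex`, which keeps the weights nonnegative and summing to one.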
This one-dimensional optimization problem can be solved efficiently using a standard optimization method. This procedure continues until a convergence condition is met.

More specifically, we first obtain the optimality conditions for problem (4) by computing the subgradient of its objective function with respect to β_j^k and setting it to zero:

−X_j^T (Y^k − Xβ^k) + λ2 ρ_j g_j^k + λ1 θ_j h_j^k = 0,    (8)

where g and h are subgradients of the ℓ1/ℓ2-norm and the ℓ1-norm, respectively. Note that g_j^k = β_j^k / ‖β_j‖₂ if β_j ≠ 0, and otherwise ‖g_j‖₂ ≤ 1; and h_j^k = sign(β_j^k) if β_j^k ≠ 0, and otherwise h_j^k ∈ [−1, 1].

Then, we check the group sparsity condition β_j = 0. To do so, we set β_j = 0 in Eq. (8) and obtain

X_j^T Y^k − X_j^T Σ_{r≠j} X_r β_r^k = λ2 ρ_j g_j^k + λ1 θ_j h_j^k,  and  ‖g_j‖₂² = (1/(λ2² ρ_j²)) Σ_{k=1}^K (X_j^T Y^k − X_j^T Σ_{r≠j} X_r β_r^k − λ1 θ_j h_j^k)².

According to the subgradient conditions, we need a g_j that satisfies the strict inequality ‖g_j‖₂² < 1; otherwise, β_j will be non-zero. Since g_j is a function of h_j, it suffices to check whether the minimal squared ℓ2-norm of g_j is less than 1. Therefore, we minimize ‖g_j‖₂² with respect to h_j, which gives the optimal h_j as

h_j^k = c_j^k / (λ1 θ_j)  if |c_j^k / (λ1 θ_j)| ≤ 1,  and  h_j^k = sign(c_j^k / (λ1 θ_j))  otherwise,    (9)

where c_j^k = X_j^T Y^k − X_j^T Σ_{r≠j} X_r β_r^k. If the minimal ‖g_j‖₂² is less than 1, then β_j is zero and no update is needed; otherwise, we continue to the next step of checking whether β_j^k = 0, ∀k, as follows.

Again, we start by assuming β_j^k is zero. Setting β_j^k = 0 in Eq. (8), we have

X_j^T Y^k − X_j^T Σ_{r≠j} X_r β_r^k = λ1 θ_j h_j^k,  and  h_j^k = (1/(λ1 θ_j)) (X_j^T Y^k − X_j^T Σ_{r≠j} X_r β_r^k).

According to the definition of the subgradient h_j^k, it needs to satisfy the condition |h_j^k| < 1; otherwise, β_j^k will be non-zero. This check is easily done. After the check, if β_j^k ≠ 0, problem (4) becomes a one-dimensional optimization problem with respect to β_j^k, and the solution can be obtained using existing optimization algorithms (e.g., the optimize function in R). We used a majorize-minimize algorithm with gradient descent [15].

With the above two steps, we iteratively optimize (ω, ν) with β fixed and optimize β with the feature weights fixed, until convergence. Note that the parameters λ1 and λ2 in Eq. (4), which determine the sparsity levels, are determined by cross or hold-out validation.

Algorithm 1: Optimization algorithm for Eq. (4) with fixed scaling parameters.

Input: X ∈ R^{N×p}; Y ∈ R^{N×K}; θ ∈ R^p; ρ ∈ R^p; and β_init ∈ R^{p×K}
Output: β ∈ R^{p×K}
β ← β_init;
Iterate the following until convergence:
for j ← 1 to p do
    m ← (1/(λ2² ρ_j²)) Σ_{k=1}^K (c_j^k − λ1 θ_j h_j^k)², where c_j^k and h_j^k are computed as in Eq. (9);
    if m < 1 then β_j^k = 0 for all k = 1, . . . , K;
    else for k ← 1 to K do
        q ← (1/(λ1 θ_j)) |X_j^T (Y^k − Xβ^k) + X_j^T X_j β_j^k|;
        if q < 1 then β_j^k = 0;
        else solve the one-dimensional optimization problem
            β_j^k ← argmin_{β_j^k} (1/2) ‖Y^k − Xβ^k‖₂² + λ1 θ_j |β_j^k| + λ2 ρ_j ‖β_j‖₂;
    end
end

4 Simulation Study

To confirm the behavior of our model, we run the adaptive multi-task Lasso and other methods on a simulated dataset (p = 100, K = 10). We first randomly select 100 SNPs from 114 yeast genotypes from the yeast eQTL dataset [16]. Following the simulation study in Kim et al. [4], we assume that some SNPs affect biological networks including multiple traits, and true causal SNPs are selected by the following procedure. Three sets of four randomly selected SNPs are associated with the three trait clusters (1–3), (4–6) and (7–10), respectively. One SNP is associated with the two clusters (1–3) and (4–6), and one causal SNP is associated with all traits (1–10). For all association SNPs we set an identical association strength, ranging from 0.3 to 1. Traits are generated by Y^k = Xβ^k + ε for all k = 1, . . . , 10, where ε follows the standard normal distribution. We make 10 features (f1–f10), of which six are continuous and four are discrete. For the first three continuous features (f1–f3), the feature value is drawn from s(N(2, 1)) if a SNP is associated with any traits, and from s(N(1, 1)) otherwise, where s(x) = 1/(1 + exp(x)) is the sigmoid function. For the other three continuous features (f4–f6), the value is drawn from s(N(2, 0.5)) if a SNP is associated with any traits, and from s(N(1, 0.5)) otherwise.
Finally, for the discrete features (f7–f10), the value is set to s(2) with probability 0.8 if a SNP is associated with any traits, and to s(1) otherwise. We standardize all the features.

Figure 3: Results of the β matrix estimated by different methods (True β, AML, SML, A+ℓ1/ℓ2, Single SNP, Lasso, ℓ1/ℓ∞). For visualization, we present normalized absolute values of regression coefficients, and darker colors imply stronger association with traits. For each matrix, the X-axis represents traits (1–10) and the Y-axis represents SNPs (1–100). The true β is shown on the left.

Fig. 3 shows the β matrix estimated by various methods, including AML (adaptive multi-task Lasso), SML (sparse multi-task Lasso, i.e., AML without adaptive weights), A+ℓ1/ℓ2 (AML without the Lasso penalty), Single SNP [17], Lasso, and ℓ1/ℓ∞ (multi-task learning with the ℓ1/ℓ∞ norm). In this figure, the X-axis represents traits (1–10) and the Y-axis represents SNPs (1–100). Note that the regularization parameters (e.g., λ1 and λ2 for AML) were determined by holdout validation, and we set the association strength to 0.3.
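The trait-generation step above (Y^k = Xβ^k + ε with block-structured β) can be sketched as follows. The `assoc` encoding of causal SNPs is our own illustrative convention, not the paper's code:

```python
import numpy as np

def simulate_traits(X, assoc, strength, rng, K=10):
    """Generate traits Y^k = X beta^k + eps with eps ~ N(0, 1).

    assoc: list of (snp_index, trait_indices) pairs; every causal SNP
    receives the same association strength, as in the simulation above."""
    N, p = X.shape
    beta = np.zeros((p, K))
    for j, traits in assoc:
        beta[j, traits] = strength          # block-structured true beta
    Y = X @ beta + rng.standard_normal((N, K))
    return Y, beta
```

For example, one SNP tied to trait cluster (1–3) would be encoded as `(j, [0, 1, 2])` in zero-based indexing.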
We also used hierarchical clustering with cutoff criterion 0.8 prior to running AML, SML, A+ℓ1/ℓ2 and ℓ1/ℓ∞, while Single SNP and Lasso were applied to each trait separately.

We investigate the effect of the Lasso penalty in our model by comparing the results of AML and A+ℓ1/ℓ2. While AML is slightly more efficient than A+ℓ1/ℓ2 in finding association SNPs, both work very well for this task. This is not surprising, since hierarchical clustering reproduced the true trait clusters and the true β could be detected without considering single-SNP-level sparsity in each group. To further validate the effectiveness of the Lasso penalty, we run AML and A+ℓ1/ℓ2 without the a priori clustering step. Interestingly, AML could pick the correct SNP–trait associations thanks to the Lasso penalty, whereas A+ℓ1/ℓ2 failed to do so (see Fig. 5c,d for a comparison of performance). While the Lasso penalty did not contribute significantly for this task when we generated a priori clusters, it is advisable to include it when the quality of a clustering is not guaranteed. Comparing the results of AML and SML in Fig. 3, we observe that adaptive weights improve the performance significantly. Adaptive weights help not only reduce false positives but also increase true positives.

Fig. 4 shows the learned feature weights ω (ν is almost identical to ω and is not shown here). The results are based on 100 simulations for each association strength 0.3, 0.5, 0.8 and 1, and half of an error bar represents one standard deviation from the mean. We observe that the discrete features f7–f10 have the highest weights, while the lowest weights are assigned to f1–f3. These weights are reasonable because f1–f3 are drawn from a Gaussian with large standard deviation (STD 1) compared to that of features f4–f6 (STD 0.5).
Also, the discrete features are the most important, since they discriminate true association SNPs with a high probability of 0.8.

Figure 4: Learned feature weights ω (features f1–f10 on the X-axis, ω_t on the Y-axis).

Figure 5: ROC curves (sensitivity vs. 1 − specificity) of AML, SML, A+ℓ1/ℓ2, ℓ1/ℓ∞, Lasso and Single SNP as the association strength varies: (a) 0.3 and (b) 0.5 on clustered data, (c) 0.3 and (d) 0.5 on the input dataset. (a,b) Results on clustered data, where correct groups of gene traits are found using hierarchical clustering (cutoff = 0.8). (c,d) Results on the input dataset without using a clustering algorithm.

We compare the sensitivity and specificity of our model with those of other methods. In Fig. 5, we generated ROC curves for association strengths of 0.3 and 0.5. Fig. 5a,b show the results with a priori hierarchical clustering, and Fig. 5c,d the results without such a preprocessing step. Using hierarchical clustering, we could correctly find three clusters of gene traits at cutoff 0.8. In Fig. 5, when the association strength is small (i.e., 0.3), AML and A+ℓ1/ℓ2 significantly outperformed the other methods.
As the association strength increased, the performance of the multi-task learning methods improved quickly, while methods based on a single trait, such as Lasso and Single SNP, showed only a gradual increase in performance.

We computed test errors on 100 simulated datasets using 30 samples for testing and 84 samples for training. On average, AML achieved the best test error rate of 0.9427; the other methods, in order of test error, are A+ℓ1/ℓ2 (0.9506), SML (1.0436), ℓ1/ℓ∞ (1.0578) and Lasso (1.1080).

5 Yeast eQTL dataset

We analyze the yeast eQTL dataset [16], which contains expression levels of 5,637 genes and 2,956 SNPs. The genotype data include genetic variants of 114 yeast strains that are progenies of the standard laboratory strain (BY) and a wild strain (RM). We used the 141 modules given by Lee et al. [3] as groups of gene traits, and extracted 1,260 unique SNPs from the 2,956 SNPs for our analysis. As prior biological knowledge on SNPs for the adaptive multi-task Lasso, we downloaded 12 features from the Saccharomyces Genome Database (http://www.yeastgenome.org), including 11 discrete features and 1 continuous feature (conservation score). For a discrete feature, we set its value as f_t^j = s(2) if the feature is found on the j-th SNP, and f_t^j = s(1) otherwise. For the conservation score, we set f_t^j = s(score). All the features are then standardized.

Fig. 6 shows ω learned from the yeast eQTL dataset (ν is almost identical to ω). The features are ncRNA (f1), noncoding exon (f2), snRNA (f3), tRNA (f4), intron (f5), binding site (f6), 5' UTR intron (f7), LTR retrotransposon (f8), ARS (f9), snoRNA (f10), transposable element gene (f11) and conservation score (f12). Five discrete features turn out to be important, namely ncRNA, snRNA, binding site, 5' UTR intron and snoRNA, as well as the one continuous feature, the conservation score.
These results agree with biological insights. For example, ncRNA, snRNA and snoRNA are potentially important for gene regulation, since they are functional RNA molecules with a variety of roles such as transcriptional regulation [18]. Also, the conservation score would be significant, since a mutation in a conserved region is more likely to result in phenotypic effects.

[Figure 6: bar plot of the learned weights ω_t (vertical axis) for features f1–f12 (horizontal axis).]
Figure 6: Learned weights of ω on the yeast eQTL dataset.

[Figure 7: number of associated traits per SNP (blue bars, β) for SNPs 1–121, overlaid with the ncRNA, snRNA, binding sites, 5' UTR intron and conservation score priors.]
Figure 7: Plot of 121 SNPs on chromosomes 1 and 2 vs. the number of genes affected by each SNP in the yeast eQTL analysis (blue bars). Five significant priors on SNPs are overlaid on the plot. For the four discrete priors (ncRNA, snRNA, binding site, 5' UTR intron), we set the value to 1 if annotated and 0 otherwise. Binding sites and regions with no associated traits are denoted by long green and short blue arrows, respectively.

Fig. 7 shows the number of associated genes for SNPs on chromosomes 1 and 2, superimposed on the 5 significant features. We see that the association mapping results were affected by both the priors and the data. For example, the genomic region indicated by the blue arrow shows weak association with traits; there, the conservation score is low and no other annotations exist. We can also see that three SNPs located on binding sites affect a larger number of gene traits (see green arrows). As an example of biological analysis, we investigate these three association SNPs.
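Per-SNP counts such as those in Fig. 7 can be read off a fitted coefficient matrix: a SNP counts as associated with a trait when its regression coefficient is (numerically) nonzero. A minimal sketch, assuming B is a (SNPs × traits) coefficient matrix produced by the sparse regression (the variable names and toy values are illustrative):

```python
import numpy as np

def traits_per_snp(B, tol=1e-8):
    """For each SNP (row of B), count traits whose |coefficient| exceeds tol."""
    return (np.abs(np.asarray(B)) > tol).sum(axis=1)

# Toy 3-SNP x 3-trait coefficient matrix.
B = np.array([[0.0, 0.7, 0.0],
              [0.2, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
print(traits_per_snp(B))  # [1 2 0]
```

SNPs whose counts are far above the background, like the three on binding sites discussed next, are the natural candidates for eQTL hotspots.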
The three SNPs are located in telomeres (chr1:483, chr1:229090, chr2:9425; chromosome:coordinate), and these genomic locations are in cis to Abf1p (autonomously replicating sequence binding factor-1) binding sites. Abf1p is known to act as a global transcriptional regulator in yeast [19]. Thus, these telomeric genomic regions would be good candidates for novel putative eQTL hotspots that regulate the expression levels of many genes. They were not reported as eQTL hotspots by Yvert et al. [20].

6 Conclusions

In this paper, we proposed a novel regularized regression model, referred to as adaptive multi-task Lasso, which takes multiple traits into account simultaneously while the weights of different covariates are learned adaptively from prior knowledge and data. Our simulation results show that our model outperforms other methods based on ℓ1 and ℓ1/ℓ2 penalties over multiple related genes; in particular, the adaptively learned regularization significantly improved performance. In our experiments on the yeast eQTL dataset, we identified three putative eQTL hotspots, supported by biological evidence, where SNPs are associated with a large number of genes.

Acknowledgments

This work was supported by NIH 1 R01 GM087694-01, NIH 1RC2HL101487-01 (ARRA), AFOSR FA9550010247, ONR N0001140910758, NSF Career DBI-0546594, NSF IIS-0713379 and an Alfred P. Sloan Fellowship awarded to E.X.

References
[1] R. Sladek, G. Rocheleau, J. Rung, C. Dina, L. Shen, D. Serre, P. Boutin, D. Vincent, A. Belisle, S. Hadjadj, et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445(7130):881–885, 2007.
[2] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
[3] S.I. Lee, A.M. Dudley, D. Drubin, P.A. Silver, N.J. Krogan, D. Pe’er, and D. Koller.
Learning a prior on regulatory potential from eQTL data. PLoS Genetics, 5(1):e1000358, 2009.
[4] S. Kim and E. P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8):e1000587, 2009.
[5] G. Obozinski, B. Taskar, and M. Jordan. Multi-task feature selection. Technical Report, Department of Statistics, University of California, Berkeley, 2006.
[6] M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 20:1457–1464, 2007.
[7] H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
[8] S.I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th International Conference on Machine Learning, pages 489–496, 2007.
[9] T. Park and G. Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
[10] B. M. Marlin, M. Schmidt, and K. P. Murphy. Group sparse priors for covariance estimation. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 383–392, 2009.
[11] E. Gómez, M. A. Gómez-Villegas, and J. M. Marín. A multivariate generalization of the power exponential family of distributions. Communications in Statistics - Theory and Methods, 27(3):589–600, 1998.
[12] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19:801–808, 2007.
[13] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
[14] J. Friedman, T. Hastie, and R.
Tibshirani. A note on the group Lasso and a sparse group Lasso. arXiv:1001.0736v1 [math.ST], 2010.
[15] T. T. Wu and K. Lange. Coordinate descent algorithms for Lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008.
[16] R. B. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America, 102(5):1572–1577, 2005.
[17] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A. R. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. W. De Bakker, M. J. Daly, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3):559–575, 2007.
[18] G. Storz. An expanding universe of noncoding RNAs. Science, 296(5571):1260–1263, 2002.
[19] T. Miyake, J. Reese, C. M. Loch, D. T. Auble, and R. Li. Genome-wide analysis of ARS (autonomously replicating sequence) binding factor 1 (Abf1p)-mediated transcriptional regulation in Saccharomyces cerevisiae. Journal of Biological Chemistry, 279(33):34865–34872, 2004.
[20] G. Yvert, R. B. Brem, J. Whittle, J. M. Akey, E. Foss, E. N. Smith, R. Mackelprang, L. Kruglyak, et al. Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nature Genetics, 35(1):57–64, 2003.