{"title": "On Multiplicative Multitask Feature Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2411, "page_last": 2419, "abstract": "We investigate a general framework of multiplicative multitask feature learning which decomposes each task's model parameters into a multiplication of two components. One of the components is used across all tasks and the other component is task-specific. Several previous methods have been proposed as special cases of our framework. We study the theoretical properties of this framework when different regularization conditions are applied to the two decomposed components. We prove that this framework is mathematically equivalent to the widely used multitask feature learning methods that are based on a joint regularization of all model parameters, but with a more general form of regularizers. Further, an analytical formula is derived for the across-task component as related to the task-specific component for all these regularizers, leading to a better understanding of the shrinkage effect. Study of this framework motivates new multitask learning algorithms. We propose two new learning formulations by varying the parameters in the proposed framework. Empirical studies have revealed the relative advantages of the two new formulations by comparing with the state of the art, which provides instructive insights into the feature learning problem with multiple tasks.", "full_text": "On Multiplicative Multitask Feature Learning\n\nXin Wang\u2020, Jinbo Bi\u2020, Shipeng Yu\u2021, Jiangwen Sun\u2020\n\n\u2020Dept. 
of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269
‡Health Services Innovation Center, Siemens Healthcare, Malvern, PA 19355
wangxin,jinbo,javon@engr.uconn.edu    shipeng.yu@siemens.com

Abstract

We investigate a general framework of multiplicative multitask feature learning which decomposes each task's model parameters into a multiplication of two components. One of the components is used across all tasks and the other component is task-specific. Several previous methods have been proposed as special cases of our framework. We study the theoretical properties of this framework when different regularization conditions are applied to the two decomposed components. We prove that this framework is mathematically equivalent to the widely used multitask feature learning methods that are based on a joint regularization of all model parameters, but with a more general form of regularizers. Further, an analytical formula is derived for the across-task component as related to the task-specific component for all these regularizers, leading to a better understanding of the shrinkage effect. Study of this framework motivates new multitask learning algorithms. We propose two new learning formulations by varying the parameters in the proposed framework. Empirical studies have revealed the relative advantages of the two new formulations by comparing with the state of the art, which provides instructive insights into the feature learning problem with multiple tasks.

1 Introduction

Multitask learning (MTL) captures and exploits the relationship among multiple related tasks and has been empirically and theoretically shown to be more effective than learning each task independently. Multitask feature learning (MTFL) investigates a basic assumption that different tasks may share a common representation in the feature space.
Either the task parameters can be projected to explore the latent common substructure [18], or a shared low-dimensional representation of data can be formed by feature learning [10]. Recent methods either explore the latent basis that is used to develop the entire set of tasks, learn how to group the tasks [16, 11], or identify whether certain tasks are outliers to other tasks [6].

A widely used MTFL strategy is to impose a blockwise joint regularization of all task parameters to shrink the effects of features for the tasks. These methods employ a regularizer based on the so-called $\ell_{1,p}$ matrix norm [12, 13, 15, 22, 24], which is the sum of the $\ell_p$ norms of the rows in a matrix. Regularizers based on the $\ell_{1,p}$ norms encourage row sparsity. If rows represent features and columns represent tasks, they shrink entire rows of the matrix to zero. Typical choices for $p$ are 2 [15, 4] and $\infty$ [20], which were used in the very early MTFL methods. Effective algorithms have since been developed for the $\ell_{1,2}$ [13] and $\ell_{1,\infty}$ [17] regularization. Later, the $\ell_{1,p}$ norm was generalized to include $1 < p \le \infty$ with a probabilistic interpretation that the resultant MTFL method solves a relaxed optimization problem with a generalized normal prior for all tasks [22]. Recent research applies the capped $\ell_{1,1}$ norm as a nonconvex joint regularizer [5]. The major limitation of joint regularized MTFL is that it either selects a feature as relevant to all tasks or excludes it from all models, which is very restrictive in practice, where tasks may share some features but also have their own specific features.

To overcome this limitation, one of the most effective strategies is to decompose the model parameters into either a summation [9, 3, 6] or a multiplication [21, 2, 14] of two components, with different regularizers applied to the two components.
One regularizer is used to take care of cross-task similarities and the other of cross-feature sparsity. Specifically, for the methods that decompose the parameter matrix into a summation of two matrices, the dirty model in [9] applies $\ell_{1,1}$ and $\ell_{1,\infty}$ regularizers to the two components. A robust MTFL method in [3] uses the trace norm on one component for mining a low-rank structure shared by tasks and a column-wise $\ell_{1,2}$-norm on the other component for identifying task outliers. Another method applies the $\ell_{1,2}$-norm both row-wise to one component and column-wise to the other [6].

For the methods that work with multiplicative decompositions, the parameter vector of each task is decomposed into an element-wise product of two vectors, where one is used across tasks and the other is task-specific. These methods either use the $\ell_2$-norm penalty on both of the component vectors [2], or the sparse $\ell_1$-norm on the two components (i.e., multi-level LASSO) [14]. The multi-level LASSO method has been analytically compared to the dirty model [14], showing that the multiplicative decomposition creates better shrinkage of the global and task-specific parameters. The across-task component can screen out the features irrelevant to all tasks. To exclude a feature from a task, the additive decomposition requires the corresponding entries in both components to be zero, whereas the multiplicative decomposition only requires one of the components to have a zero entry. Although there are different ways to regularize the two components in the product, no systematic work has been done to analyze the algorithmic and statistical properties of the different regularizers.
It is insightful to answer questions such as how these learning formulations differ from the early methods based on blockwise joint regularization, what the optimal solutions of the two components look like, and how the resultant solutions compare with those of other methods that also learn both shared and task-specific features.

In this paper, we investigate a general framework of the multiplicative decomposition that enables a variety of regularizers to be applied. This general form includes all early methods that represent model parameters as a product of two components [2, 14]. Our theoretical analysis reveals that this family of methods is actually equivalent to the joint regularization based approach, but with a more general form of regularizers, including those that do not correspond to a matrix norm. The optimal solution of the across-task component can be computed analytically by a formula of the optimal task-specific parameters, showing the different shrinkage effects. Statistical justification and efficient algorithms are derived for this family of formulations. Motivated by the analysis, we propose two new MTFL formulations. Unlike the existing methods [2, 14], where the same kind of vector norm is applied to both components, the shrinkage of the global and task-specific parameters differs in the new formulations. Hence, one component is regularized by the $\ell_2$-norm and the other by the $\ell_1$-norm, which aims to reflect the degree of sparsity of the task-specific parameters relative to the sparsity of the across-task parameters. In empirical experiments, simulations were designed to examine the various feature sharing patterns in which a specific choice of regularizer may be preferred.
Empirical results on benchmark data are also discussed.

2 The Proposed Multiplicative MTFL

Given $T$ tasks in total, for each task $t$, $t \in \{1,\cdots,T\}$, we have a sample set $(X_t \in \mathbb{R}^{\ell_t \times d}, y_t \in \mathbb{R}^{\ell_t})$. The data set $X_t$ has $\ell_t$ examples, where the $i$-th row corresponds to the $i$-th example $x_i^t$ of task $t$, $i \in \{1,\cdots,\ell_t\}$, and each column represents a feature. The vector $y_t$ contains $y_i^t$, the label of the $i$-th example of task $t$. We consider functions of the linear form $X_t\alpha_t$ where $\alpha_t \in \mathbb{R}^d$. We define the parameter matrix or weight matrix $A = [\alpha_1,\cdots,\alpha_T]$, whose rows are denoted $\alpha^j$, $j \in \{1,\cdots,d\}$.

A family of multiplicative MTFL methods can be derived by rewriting $\alpha_t = \mathrm{diag}(c)\beta_t$, where $\mathrm{diag}(c)$ is a diagonal matrix whose diagonal elements compose a vector $c$. The vector $c$ is used across all tasks, indicating if a feature is useful for any of the tasks, and the vector $\beta_t$ is only for task $t$. Let $j$ index the entries in these vectors, so that $\alpha_j^t = c_j\beta_j^t$. Typically $c$ comprises binary entries equal to 0 or 1, but the integer constraint is often relaxed to require just non-negativity.
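As a small illustration (a sketch, not the authors' code; all values are made up), the element-wise decomposition and its feature-removal behavior can be written out directly:

```python
# Sketch of the multiplicative decomposition alpha_t = diag(c) * beta_t.
# A feature j is removed from EVERY task when c[j] = 0, and from a single
# task t when beta_t[j] = 0.

def compose(c, beta_t):
    """Element-wise product: alpha_t[j] = c[j] * beta_t[j]."""
    return [cj * bj for cj, bj in zip(c, beta_t)]

c = [1.0, 0.0, 0.5]          # across-task vector: feature 2 off for all tasks
beta_1 = [2.0, 3.0, 0.0]     # task 1: feature 3 ruled out by beta_1
beta_2 = [1.0, 4.0, 2.0]     # task 2

alpha_1 = compose(c, beta_1)  # [2.0, 0.0, 0.0]
alpha_2 = compose(c, beta_2)  # [1.0, 0.0, 1.0]
```

Note how the additive decomposition would instead need matching zeros in both components to exclude a feature, which is the shrinkage difference discussed above.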
We minimize a regularized loss function as follows for the best $c$ and $\beta_{t:t=1,\cdots,T}$:

$$\min_{\beta_t,\, c \ge 0} \sum_{t=1}^T L(c, \beta_t, X_t, y_t) + \gamma_1 \sum_{t=1}^T \|\beta_t\|_p^p + \gamma_2 \|c\|_k^k, \qquad (1)$$

where $L(\cdot)$ is a loss function, e.g., the least squares loss for regression problems or the logistic loss for classification problems, $\|\beta_t\|_p^p = \sum_{j=1}^d |\beta_j^t|^p$ and $\|c\|_k^k = \sum_{j=1}^d (c_j)^k$, which are the $\ell_p$-norm of $\beta_t$ to the power of $p$ and the $\ell_k$-norm of $c$ to the power of $k$ if $p$ and $k$ are positive integers. The tuning parameters $\gamma_1, \gamma_2$ balance the empirical loss and the regularizers. At optimality, if $c_j = 0$, the $j$-th variable is removed for all tasks and the corresponding row vector $\alpha^j = 0$; otherwise the $j$-th variable is selected for use in at least one of the $\alpha$'s. A specific $\beta_t$ can then rule out the $j$-th variable from task $t$ if $\beta_j^t = 0$.

In particular, if $p = k = 2$, Problem (1) becomes the formulation in [2], and if $p = k = 1$, Problem (1) becomes the formulation in [14]. Any other choice of $p$ and $k$ yields a new formulation for MTFL. We first examine the theoretical properties of this entire family of methods, and then empirically study two new formulations obtained by varying $p$ and $k$.

3 Theoretical Analysis

The joint $\ell_{1,p}$ regularized MTFL method minimizes $\sum_{t=1}^T L(\alpha_t, X_t, y_t) + \lambda \sum_{j=1}^d \|\alpha^j\|_p$ for the best $\alpha_{t:t=1,\cdots,T}$, where $\lambda$ is a tuning parameter. We now extend this formulation to allow more choices of regularizers. We introduce a new notation, an operator applied to a vector such as $\alpha^j$: the operator $\|\alpha^j\|_{p/q} = \sqrt[q]{\sum_{t=1}^T |\alpha_j^t|^p}$, $p, q \ge 0$, which corresponds to the $\ell_p$ norm if $p = q$ and both are positive integers.
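As a concrete sketch of the quantities above (least-squares loss assumed for illustration; all names and values are ours, not the paper's), the $\|\cdot\|_{p/q}$ operator and the objective of Problem (1) can be evaluated as follows:

```python
# lpq implements the operator ||alpha^j||_{p/q} = (sum_t |alpha_j^t|^p)^(1/q);
# objective1 evaluates Problem (1) with a least-squares loss.

def lpq(row, p, q):
    return sum(abs(a) ** p for a in row) ** (1.0 / q)

def objective1(Xs, ys, c, betas, gamma1, gamma2, p, k):
    total = 0.0
    for X, y, beta in zip(Xs, ys, betas):
        alpha = [cj * bj for cj, bj in zip(c, beta)]  # diag(c) * beta_t
        total += sum((sum(a * v for a, v in zip(alpha, x)) - yi) ** 2
                     for x, yi in zip(X, y))
    total += gamma1 * sum(abs(b) ** p for beta in betas for b in beta)
    total += gamma2 * sum(cj ** k for cj in c)        # c >= 0 by constraint
    return total

# ||(3, 4)||_{2/2} is the usual l2 norm:
norm = lpq([3.0, 4.0], p=2, q=2)  # 5.0

# One task, two features, perfect fit: only the regularizers contribute.
val = objective1([[[1.0, 0.0], [0.0, 1.0]]], [[1.0, 2.0]],
                 c=[1.0, 1.0], betas=[[1.0, 2.0]],
                 gamma1=0.1, gamma2=0.1, p=2, k=2)
# loss = 0, gamma1*(1+4) = 0.5, gamma2*(1+1) = 0.2  ->  0.7
```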
A joint regularized MTFL approach can then solve the following optimization problem with pre-specified values of $p$, $q$ and $\lambda$, for the best parameters $\alpha_{t:t=1,\cdots,T}$:

$$\min_{\alpha_t} \sum_{t=1}^T L(\alpha_t, X_t, y_t) + \lambda \sum_{j=1}^d \sqrt{\|\alpha^j\|_{p/q}}. \qquad (2)$$

Our main result of this paper is (i) a theorem that establishes the equivalence between the models derived from solving Problem (1) and Problem (2) for properly chosen values of $\lambda$, $q$, $k$, $\gamma_1$ and $\gamma_2$; and (ii) an analytical solution of Problem (1) for $c$, which shows how the sparsity of the across-task component relates to the sparsity of the task-specific components.

Theorem 1 Let $\hat\alpha_t$ be the optimal solution to Problem (2) and $(\hat\beta_t, \hat c)$ be the optimal solution to Problem (1). Then $\hat\alpha_t = \mathrm{diag}(\hat c)\hat\beta_t$ when $\lambda = 2\gamma_1^{1-\frac{p}{2kq}}\gamma_2^{\frac{p}{2kq}}$ and $q = \frac{k+p}{2k}$ (or $k = \frac{p}{2q-1}$).

Proof. The theorem holds by proving the following two lemmas. The first lemma proves that the solution $\hat\alpha_t$ of Problem (2) also minimizes the following optimization problem:

$$\min_{\alpha_t,\, \sigma \ge 0} \sum_{t=1}^T L(\alpha_t, X_t, y_t) + \mu_1 \sum_{j=1}^d \sigma_j^{-1}\|\alpha^j\|_{p/q} + \mu_2 \sum_{j=1}^d \sigma_j, \qquad (3)$$

and that the optimal solution of Problem (3) also minimizes Problem (2) when proper values of $\lambda$, $\mu_1$ and $\mu_2$ are chosen. The second lemma connects Problem (3) to our formulation (1). We show that the optimal $\hat\sigma_j$ is equal to $(\hat c_j)^k$, and then the optimal $\hat\beta$ can be computed from the optimal $\hat\alpha$.

Lemma 1 The solution sets of Problem (2) and Problem (3) are identical when $\lambda = 2\sqrt{\mu_1\mu_2}$.

Proof.
First, we show that when $\lambda = 2\sqrt{\mu_1\mu_2}$, the optimal solution $\hat\alpha_j^t$ of Problem (2) minimizes Problem (3), with the optimal $\hat\sigma_j = \mu_1^{\frac12}\mu_2^{-\frac12}\sqrt{\|\hat\alpha^j\|_{p/q}}$. By the Cauchy-Schwarz inequality, the following inequality holds:

$$\mu_1\sum_{j=1}^d \sigma_j^{-1}\|\alpha^j\|_{p/q} + \mu_2\sum_{j=1}^d \sigma_j \;\ge\; 2\sqrt{\mu_1\mu_2}\sum_{j=1}^d\sqrt{\|\alpha^j\|_{p/q}},$$

where the equality holds if and only if $\sigma_j = \mu_1^{\frac12}\mu_2^{-\frac12}\sqrt{\|\alpha^j\|_{p/q}}$. Since Problems (3) and (2) use exactly the same loss function, when we set $\hat\sigma_j = \mu_1^{\frac12}\mu_2^{-\frac12}\sqrt{\|\hat\alpha^j\|_{p/q}}$, Problems (3) and (2) have identical objective functions if $\lambda = 2\sqrt{\mu_1\mu_2}$. Hence the pair $(\hat A = (\hat\alpha_j^t)_{jt},\, \hat\sigma = (\hat\sigma_j)_{j=1,\cdots,d})$ minimizes Problem (3), as it makes the objective function reach its lower bound.

Second, it can be proved by contradiction that if the pair $(\hat A, \hat\sigma)$ minimizes Problem (3), then $\hat A$ also minimizes Problem (2). Suppose that $\hat A$ does not minimize Problem (2), which means there exists $\tilde\alpha^j$ ($\ne \hat\alpha^j$ for some $j$) that is an optimal solution to Problem (2) and achieves a lower objective value than $\hat\alpha^j$. We set $\tilde\sigma_j = \mu_1^{\frac12}\mu_2^{-\frac12}\sqrt{\|\tilde\alpha^j\|_{p/q}}$; as proved in the first paragraph, the resulting pair would bring the objective function of Problem (3) to a lower value than that of $(\hat A, \hat\sigma)$, contradicting the assumption that $(\hat A, \hat\sigma)$ is optimal to Problem (3). Hence, we have proved that Problems (3) and (2) have identical solutions when $\lambda = 2\sqrt{\mu_1\mu_2}$.
From the proof of Lemma 1, we also see that the optimal objective value of Problem (2) gives a lower bound to the objective of Problem (3). Let $\sigma_j = (c_j)^k$, $k \in \mathbb{R}$, $k \ne 0$, and $\alpha_j^t = c_j\beta_j^t$; an equivalent objective function of Problem (3) can then be derived.

Lemma 2 The optimal solution $(\hat A, \hat\sigma)$ of Problem (3) is equivalent to the optimal solution $(\hat B, \hat c)$ of Problem (1), where $\hat\alpha_j^t = \hat c_j\hat\beta_j^t$ and $\hat\sigma_j = (\hat c_j)^k$, when $\gamma_1 = \mu_1^{\frac{kq}{2kq-p}}\mu_2^{\frac{kq-p}{2kq-p}}$, $\gamma_2 = \mu_2$, and $k = \frac{p}{2q-1}$.

Proof. First, by proof of contradiction, we show that if $\hat\alpha_j^t$ and $\hat\sigma_j$ optimize Problem (3), then $\hat c_j = \sqrt[k]{\hat\sigma_j}$ and $\hat\beta_j^t = \hat\alpha_j^t/\hat c_j$ optimize Problem (1). Denote the objectives of (1) and (3) by $J^{(1)}$ and $J^{(3)}$. Substituting $\hat\beta_j^t, \hat c_j$ for $\hat\alpha_j^t, \hat\sigma_j$ in $J^{(3)}$ yields the objective function $L(\hat c, \hat\beta_t, X_t, y_t) + \mu_1\sum_{j=1}^d \|\hat\beta^j\|_{p/q}\,\hat c_j^{(p-kq)/q} + \mu_2\sum_{j=1}^d(\hat c_j)^k$. By the proof of Lemma 1, $\hat\sigma_j = \mu_1^{\frac12}\mu_2^{-\frac12}\sqrt{\|\hat\alpha^j\|_{p/q}}$. Hence, $\hat c_j = \big(\mu_1\mu_2^{-1}\|\hat\beta^j\|_{p/q}\big)^{\frac{q}{2kq-p}}$. Applying this formula of $\hat c_j$ and substituting $\mu_1$ and $\mu_2$ by $\gamma_1$ and $\gamma_2$ yields an objective identical to $J^{(1)}$. Suppose $\exists(\tilde\beta_j^t, \tilde c_j)$ ($\ne (\hat\beta_j^t, \hat c_j)$) that minimizes (1); let $\tilde\alpha_j^t = \tilde c_j\tilde\beta_j^t$ and substitute $\tilde\beta_j^t$ by $\tilde\alpha_j^t/\tilde c_j$ in $J^{(1)}$. By the Cauchy-Schwarz inequality, we similarly have $\tilde c_j = \big(\gamma_1\gamma_2^{-1}\sum_{t=1}^T(\tilde\alpha_j^t)^p\big)^{\frac{1}{p+k}}$. Thus, $J^{(1)}(\tilde\beta_j^t, \tilde c_j) < J^{(1)}(\hat\beta_j^t, \hat c_j)$. Let $\tilde\sigma_j = (\tilde c_j)^k$; then $J^{(3)}(\tilde\alpha_j^t, \tilde\sigma_j) < J^{(3)}(\hat\alpha_j^t, \hat\sigma_j)$, which contradicts the optimality of $(\hat\alpha_j^t, \hat\sigma_j)$.
Second, we similarly prove that if $\hat\beta_j^t$ and $\hat c_j$ optimize Problem (1), then $\hat\alpha_j^t = \hat c_j\hat\beta_j^t$ and $\hat\sigma_j = (\hat c_j)^k$ optimize Problem (3).

Now, combining the results from the two lemmas, we derive that when $\lambda = 2\gamma_1^{1-\frac{p}{2kq}}\gamma_2^{\frac{p}{2kq}}$ and $q = \frac{k+p}{2k}$, the optimal solutions to Problems (1) and (2) are equivalent. Solving Problem (1) will yield an optimal solution $\hat\alpha$ to Problem (2) and vice versa.

Theorem 2 Let $\hat\beta_t$, $t = 1,\cdots,T$, be the optimal solutions of Problem (1), let $\hat B = [\hat\beta_1,\cdots,\hat\beta_T]$, and let $\hat\beta^j$ denote the $j$-th row of the matrix $\hat B$. Then

$$\hat c_j = (\gamma_1/\gamma_2)^{\frac{1}{k}}\,\|\hat\beta^j\|_p^{\frac{p}{2kq-p}}, \qquad (4)$$

for all $j = 1,\cdots,d$, is optimal to Problem (1).

Proof. This analytical formula can be directly derived from Lemma 1 and Lemma 2. When we set $\hat\sigma_j = (\hat c_j)^k$ and $\hat\alpha_j^t = \hat c_j\hat\beta_j^t$ in Problem (3), we obtain $\hat c_j = \big(\mu_1\mu_2^{-1}\|\hat\beta^j\|_{p/q}\big)^{\frac{q}{2kq-p}}$. In the proof of Lemma 2, we obtained that $\mu_1 = \gamma_1^{\frac{2kq-p}{kq}}\gamma_2^{\frac{p-kq}{kq}}$ and $\mu_2 = \gamma_2$.
Substituting these formulas into the formula of $c$ yields formula (4).

Based on this derivation, for each pair of $\{p, q\}$ and $\lambda$ in Problem (2), there exists an equivalent problem (1) with determined values of $k$, $\gamma_1$ and $\gamma_2$, and vice versa. Note that if $q = p/2$, the regularization term on $\alpha^j$ in Problem (2) becomes the standard $\ell_p$-norm. In particular, if $\{p, q\} = \{2, 1\}$ in Problem (2), as used in the methods of [15] and [1], the $\ell_2$-norm regularizer is applied to $\alpha^j$. This problem is then equivalent to Problem (1) with $k = 2$ and $\lambda = 2\sqrt{\gamma_1\gamma_2}$, the same formulation as in [2]. If $\{p, q\} = \{1, 1\}$, the square root of the $\ell_1$-norm regularizer is applied to $\alpha^j$. Our Theorem 1 shows that this problem is equivalent to the multi-level LASSO MTFL formulation [14], which is Problem (1) with $k = 1$ and $\lambda = 2\sqrt{\gamma_1\gamma_2}$.

4 Probabilistic Interpretation

In this section we show that the proposed multiplicative formalism is related to the maximum a posteriori (MAP) solution of a probabilistic model. Let $p(A|\Delta)$ be the prior distribution of the weight matrix $A = [\alpha_1, \ldots, \alpha_T] = [\alpha^{1\top}, \ldots, \alpha^{d\top}]^\top \in \mathbb{R}^{d\times T}$, where $\Delta$ denotes the parameters of the prior. The a posteriori distribution of $A$ can then be calculated via Bayes' rule as $p(A|X, y, \Delta) \propto p(A|\Delta)\prod_{t=1}^T p(y_t|X_t, \alpha_t)$. Denote by $z \sim GN(\mu, \rho, q)$ the univariate generalized normal distribution, with density function $p(z) = \frac{1}{2\rho\Gamma(1+1/q)}\exp\big(-\frac{|z-\mu|^q}{\rho^q}\big)$, in which $\rho > 0$, $q > 0$, and $\Gamma(\cdot)$ is the Gamma function [7].
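As a quick numerical sanity check (ours, not from the paper), the generalized normal density above can be verified to integrate to one for different $q$; $q = 1$ recovers the Laplace density, whose negative log gives an $\ell_1$ penalty, and $q = 2$ recovers a Gaussian:

```python
import math

# Generalized normal density p(z) = exp(-|z - mu|^q / rho^q) / (2*rho*Gamma(1 + 1/q)).
def gn_pdf(z, mu=0.0, rho=1.0, q=2.0):
    norm = 2.0 * rho * math.gamma(1.0 + 1.0 / q)
    return math.exp(-abs(z - mu) ** q / rho ** q) / norm

# Trapezoidal integration over a wide interval should give ~1 for any q > 0.
def integrate(q, lo=-20.0, hi=20.0, n=20000):
    h = (hi - lo) / n
    s = 0.5 * (gn_pdf(lo, q=q) + gn_pdf(hi, q=q))
    s += sum(gn_pdf(lo + i * h, q=q) for i in range(1, n))
    return s * h

mass_q1 = integrate(q=1.0)   # ~1.0 (Laplace case)
mass_q2 = integrate(q=2.0)   # ~1.0 (Gaussian case)
```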
Now let each element of $A$, $\alpha_j^t$, follow a generalized normal prior, $\alpha_j^t \sim GN(0, \delta_j, q)$. Then, with the i.i.d. assumption, the prior takes the form (also refer to [22] for a similar treatment)

$$p(A|\Delta) \propto \prod_{j=1}^d\prod_{t=1}^T \frac{1}{\delta_j}\exp\Big(-\frac{|\alpha_j^t|^q}{\delta_j^q}\Big) = \prod_{j=1}^d \frac{1}{\delta_j^T}\exp\Big(-\frac{\|\alpha^j\|_q^q}{\delta_j^q}\Big), \qquad (5)$$

where $\|\cdot\|_q$ denotes the vector $q$-norm. With an appropriately chosen likelihood function $p(y_t|X_t, \alpha_t) \propto \exp(-L(\alpha_t, X_t, y_t))$, finding the MAP solution is equivalent to solving the following problem:

$$\min_{A,\Delta}\; J = \sum_{t=1}^T L(\alpha_t, X_t, y_t) + \sum_{j=1}^d\Big(\frac{\|\alpha^j\|_q^q}{\delta_j^q} + T\ln\delta_j\Big). \qquad (6)$$

By setting the derivative of $J$ with respect to $\delta_j$ to zero, we obtain $\delta_j^q = \frac{q}{T}\|\alpha^j\|_q^q$; substituting it back and dropping additive constants gives

$$\min_A\; J = \sum_{t=1}^T L(\alpha_t, X_t, y_t) + T\sum_{j=1}^d\ln\|\alpha^j\|_q.$$

Now let us look at the multiplicative nature of $\alpha_j^t$ with different $q \in [1,\infty]$. When $q = 1$, we have:

$$\sum_{j=1}^d\ln\|\alpha^j\|_1 = \sum_{j=1}^d\ln\Big(\sum_{t=1}^T|c_j\beta_j^t|\Big) = \sum_{j=1}^d\ln|c_j| + \sum_{j=1}^d\ln\Big(\sum_{t=1}^T|\beta_j^t|\Big). \qquad (7)$$

Because $\ln z \le z - 1$ for any $z > 0$, we can optimize an upper bound of $J$ in (6). We then have $\min_A J_1 = \sum_{t=1}^T L(\alpha_t, X_t, y_t) + T\sum_{j=1}^d|c_j| + T\sum_{j=1}^d\sum_{t=1}^T|\beta_j^t|$, which is equivalent to the multiplicative formulation (1) where $\{p, k\} = \{1, 1\}$.
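The $q = 1$ log-decomposition in (7) can be checked numerically; the following sketch compares the two sides on random data (dimensions, seed and values are illustrative assumptions):

```python
import math
import random

# Check: sum_j ln ||alpha^j||_1 = sum_j ln c_j + sum_j ln(sum_t |beta_j^t|)
# for alpha_j^t = c_j * beta_j^t with c_j > 0.

random.seed(1)
d, T = 5, 3
c = [random.uniform(0.1, 2.0) for _ in range(d)]
beta = [[random.uniform(-1.0, 1.0) for _ in range(T)] for _ in range(d)]  # beta[j][t]

lhs = sum(math.log(sum(abs(c[j] * beta[j][t]) for t in range(T)))
          for j in range(d))
rhs = (sum(math.log(c[j]) for j in range(d))
       + sum(math.log(sum(abs(beta[j][t]) for t in range(T))) for j in range(d)))
# lhs and rhs agree up to floating-point error
```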
For $q > 1$, we have:

$$\sum_{j=1}^d\ln\|\alpha^j\|_q = \frac{1}{q}\sum_{j=1}^d\ln\Big(\sum_{t=1}^T|c_j\beta_j^t|^q\Big) \le \frac{1}{q}\sum_{j=1}^d\ln\Big(\max\{|c_1|,\ldots,|c_d|\}^q\cdot\sum_{t=1}^T|\beta_j^t|^q\Big)$$
$$= d\ln\|c\|_\infty + \frac{1}{q}\sum_{j=1}^d\ln\Big(\sum_{t=1}^T|\beta_j^t|^q\Big) \le d\|c\|_\infty + \frac{1}{q}\sum_{t=1}^T\|\beta_t\|_q^q - \Big(d+\frac{d}{q}\Big). \qquad (8)$$

Since vector norms satisfy $\|z\|_\infty \le \|z\|_k$ for any vector $z$ and $k \ge 1$, these inequalities lead to an upper bound of $J$ in (6), i.e.,

$$\min_A\; J_{q,k} = \sum_{t=1}^T L(\alpha_t, X_t, y_t) + Td\|c\|_k + \frac{T}{q}\sum_{t=1}^T\|\beta_t\|_q^q, \qquad (9)$$

which is equivalent to the general multiplicative formulation in (1).

5 Optimization Algorithm

Alternating optimization algorithms have been used in both of the early methods [2, 14] to solve Problem (1); they alternate between two subproblems: solve for $\beta_t$ with fixed $c$, and solve for $c$ with fixed $\beta_t$. The convergence of such an alternating algorithm has been analyzed in [2]: it converges to a local minimizer. However, both subproblems in the existing methods can only be solved using iterative algorithms such as gradient descent or linear and quadratic program solvers. We design a new alternating optimization algorithm that exploits the fact that Problems (1) and (2) are both equivalent to Problem (3) used in our proof, and we derive a closed-form solution for $c$ in the second subproblem.
The following theorem characterizes this result.

Theorem 3 For any given values of $\alpha_{t:t=1,\cdots,T}$, the optimal $\sigma$ of Problem (3) with $\alpha_{t:t=1,\cdots,T}$ fixed to the given values can be computed by $\sigma_j = \gamma_1^{1-\frac{p}{2kq}}\gamma_2^{\frac{p}{2kq}-1}\sqrt[2q]{\sum_{t=1}^T|\alpha_j^t|^p}$, $j = 1,\cdots,d$.

Proof. By the Cauchy-Schwarz inequality and the same argument used in the proof of Lemma 1, we obtain that the best $\sigma$ for a given set of $\alpha_{t:t=1,\cdots,T}$ is $\sigma_j = \mu_1^{\frac12}\mu_2^{-\frac12}\sqrt{\|\alpha^j\|_{p/q}}$. We also know that $\mu_1$ and $\mu_2$ are chosen in such a way that $\gamma_1 = \mu_1^{\frac{kq}{2kq-p}}\mu_2^{\frac{kq-p}{2kq-p}}$ and $\gamma_2 = \mu_2$. This is equivalent to $\mu_1 = \gamma_1^{\frac{2kq-p}{kq}}\gamma_2^{\frac{p-kq}{kq}}$ and $\mu_2 = \gamma_2$. Substituting them into the formula of $\sigma$ yields the result.

Now, in the algorithm to solve Problem (1), we solve the first subproblem to obtain a new iterate $\beta_t^{new}$, then we use the current value of $c$, $c^{old}$, to compute $\alpha_t^{new} = \mathrm{diag}(c^{old})\beta_t^{new}$, which is then used to compute $\sigma_j$ according to the formula in Theorem 3. Then, $c$ is computed as $c_j = \sqrt[k]{\sigma_j}$, $j = 1,\cdots,d$. The overall procedure is summarized in Algorithm 1.

Algorithm 1 Alternating optimization for multiplicative MTFL

Input: $X_t, y_t$, $t = 1,\cdots,T$, as well as $\gamma_1$, $\gamma_2$, $p$ and $k$
Initialize: $c_j = 1$, $\forall j = 1,\cdots,d$
repeat
  1. Convert $X_t\,\mathrm{diag}(c^{(s-1)}) \to \tilde X_t$, $\forall t = 1,\cdots,T$
     for $t = 1,\cdots,T$ do
       Solve $\min_{\beta_t} L(\beta_t, \tilde X_t, y_t) + \gamma_1\|\beta_t\|_p^p$ for $\beta_t^s$
     end for
  2.
Compute $\alpha_t^s = \mathrm{diag}(c^{(s-1)})\beta_t^s$, and compute $c^s$ as $c_j^s = \sqrt[k]{\sigma_j}$, where $\sigma_j$ is computed according to the formula in Theorem 3.
until $\max_{j,t}|(\alpha_j^t)^s - (\alpha_j^t)^{s-1}| < \epsilon$
Output: $\alpha_t$, $c$ and $\beta_t$, $t = 1,\cdots,T$

Algorithm 1 can be used to solve the entire family of methods characterized by Problem (1). The first subproblem involves convex optimization if a convex loss function is chosen and $p \ge 1$, and can be solved separately for individual tasks using single task learning. The second subproblem is solved analytically by a formula that guarantees that Problem (1) reaches a lower bound for the current $\alpha_t$. In this paper, the least squares and logistic regression losses are used, both of which are convex and differentiable. When convex and differentiable losses are used, the theoretical results in [19] can be used to prove the convergence of the proposed algorithm. We choose to monitor the maximum norm of the $A$ matrix to terminate the process, but this can be replaced by any other suitable termination criterion. Initialization can be important for this algorithm, and we suggest starting with $c = \mathbf{1}$, which considers all features initially in the learning process.

6 Two New Formulations

The two existing methods discussed in [2, 14] use $p = k$ in their formulations, which gives $\beta_j^t$ and $c_j$ the same amount of shrinkage. To explore other feature sharing patterns among tasks, we propose two new formulations where $p \ne k$. For the common choices of $p$ and $k$, the relation between the optimal $c$ and $\beta$ can be computed according to Theorem 2, and is summarized in Table 1.

1. When the majority of the features are not relevant to any of the tasks, a sparsity-inducing norm on $c$ is required.
However, within the relevant features, many features are shared between tasks. In other words, the features used in each task are not sparse relative to the features selected by $c$, which calls for a non-sparsity-inducing norm on $\beta$. Hence, in Formulation 1 we use the $\ell_1$ norm on $c$ and the $\ell_2$ norm on all $\beta$'s. This formulation is equivalent to the joint regularization method $\min_{\alpha_t}\sum_{t=1}^T L(\alpha_t, X_t, y_t) + \lambda\sum_{j=1}^d\sqrt[3]{\sum_{t=1}^T(\alpha_j^t)^2}$, where $\lambda = 2\gamma_1^{\frac13}\gamma_2^{\frac23}$.

2. When many or all features are relevant to the given tasks, the $\ell_2$ norm penalty on $c$ may be preferred. However, only a limited number of features are shared between tasks, i.e., the features used by individual tasks are sparse with respect to the features selected as useful across tasks by $c$. We can then impose the $\ell_1$ norm penalty on $\beta$. This formulation is equivalent to the joint regularization method $\min_{\alpha_t}\sum_{t=1}^T L(\alpha_t, X_t, y_t) + \lambda\sum_{j=1}^d\sqrt[3]{\big(\sum_{t=1}^T|\alpha_j^t|\big)^2}$, where $\lambda = 2\gamma_1^{\frac23}\gamma_2^{\frac13}$.

Table 1: The shrinkage effect of $c$ with respect to $\beta$ for four common choices of $p$ and $k$.

p = 2, k = 2:  $\hat c_j = (\gamma_1\gamma_2^{-1})^{1/2}\,\big(\sum_{t=1}^T(\hat\beta_j^t)^2\big)^{1/2}$
p = 1, k = 1:  $\hat c_j = \gamma_1\gamma_2^{-1}\sum_{t=1}^T|\hat\beta_j^t|$
p = 2, k = 1:  $\hat c_j = \gamma_1\gamma_2^{-1}\sum_{t=1}^T(\hat\beta_j^t)^2$
p = 1, k = 2:  $\hat c_j = \big(\gamma_1\gamma_2^{-1}\sum_{t=1}^T|\hat\beta_j^t|\big)^{1/2}$

7 Experiments

In this section, we empirically evaluate the performance of the proposed multiplicative MTFL with the four parameter settings listed in Table 1, on synthetic and real-world data, for both classification and
regression problems. The first two settings, $(p, k) = (2, 2)$ and $(1, 1)$, give the methods in [2] and [14], respectively, and the last two settings correspond to our new formulations. The least squares and logistic regression losses are used, respectively, for regression and classification problems. We focus on understanding the shrinkage effects created by the different choices of regularizers in multiplicative MTFL. These methods are referred to as MMTFL and are compared with the dirty model (DMTL) [9] and robust MTFL (rMTFL) [6], which use the additive decomposition.

The first subproblem of Algorithm 1 was solved using CPLEX solvers, and single task learning on the initial first subproblem served as the baseline. We used 25%, 33% and 50% of the available data in each data set for training and the rest for testing. We repeated the random split 15 times and report the averaged performance. For each split, the regularization parameters of each method were tuned by 3-fold cross validation within the training data. Regression performance was measured by the coefficient of determination, denoted $R^2$, computed as 1 minus the ratio of the sum of squared residuals to the total sum of squares. Classification performance was measured by the F1 score, the harmonic mean of precision and recall.

Synthetic Data. We created two synthetic data sets, which included 10 and 20 tasks, respectively. For each task, we created 200 examples using 100 features with pre-defined combination weights $\alpha$. Each feature was generated following the $N(0, 1)$ distribution. We added noise and computed $y_t = X_t\alpha_t + \epsilon_t$ for each task $t$, where the noise $\epsilon$ followed the distribution $N(0, 1)$. We put the different tasks' $\alpha$'s together as rows in Figure 1.
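The synthetic-data recipe just described can be sketched as follows (the seed and the particular zeroed-weight pattern are illustrative assumptions, not the paper's exact design):

```python
import random

# T tasks, 200 examples each, d = 100 N(0,1) features, y_t = X_t alpha_t + eps
# with N(0,1) noise. Here a fraction of the weights is zeroed for every task,
# so those features are irrelevant to all tasks.

random.seed(0)
T, n, d = 10, 200, 100
n_zero = int(0.4 * d)  # illustrative: 40% of features irrelevant to all tasks

alphas = [[0.0] * n_zero + [random.gauss(0, 1) for _ in range(d - n_zero)]
          for _ in range(T)]

def make_task(alpha):
    X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [sum(a * v for a, v in zip(alpha, x)) + random.gauss(0, 1) for x in X]
    return X, y

tasks = [make_task(a) for a in alphas]  # list of (X_t, y_t) pairs
```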
The values of α's were specified in such a way as to explore how the structure of feature sharing influences the multitask learning models when various regularizers are used. In particular, we illustrate the cases where the two newly proposed formulations outperformed the other methods.

(a) Synthetic data D1    (b) Synthetic data D2

Figure 1: Parameter matrix learned by different methods (darker color indicates greater values).

Synthetic Data 1 (D1). As shown in Figure 1a, 40% of the components in all α's were set to 0, and these features were irrelevant to all tasks. The remaining features were used in every task's model, and hence these models were sparse with respect to all of the features but not sparse with respect to the selected features. This was the assumption under which the early joint regularized methods work. To learn this feature sharing structure, however, we observed that the amount of shrinkage needed would be different for c and β. This case might be in favor of the ℓ1 norm penalty on c.

Synthetic Data 2 (D2). The designed parameter matrix is shown in Figure 1b, where tasks were split into 6 groups. Five features were irrelevant to all tasks, 10 features were used by all tasks, and each of the remaining 85 features was used by only 1 or 2 groups. The neighboring groups of tasks in Figure 1b shared only 7 features besides those 10 common features. Non-neighboring tasks did not share additional features. We expected c to be non-sparse. However, each task only used very few features with respect to all available features, and hence each β should be sparse.

Figure 1 shows the parameter matrices (with columns representing features for illustrative convenience) learned by different methods using 33% of the available examples in each data set. We can clearly see that MMTFL(2,1) performs the best for Synthetic data D1.
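The different shrinkage behaviors discussed above follow the closed forms in Table 1, which can be sanity-checked numerically: holding each product α_j^t = c_j β_j^t fixed and minimizing the penalty γ1 Σ_t |β_j^t|^p + γ2 c_j^k over the free rescaling (c_j → a·c_j, β_j^t → β_j^t/a) yields ĉ_j^k proportional to Σ_t |β̂_j^t|^p. A minimal sketch follows; the grid search, NumPy usage, and this particular penalty weighting are our own illustrative choices, so the prefactor matches Table 1 only up to constant factors.

```python
import numpy as np

def shrinkage_relation(beta, gamma1, gamma2, p, k):
    """Rescale (c, beta) -> (a*c, beta/a), which keeps alpha = c*beta fixed,
    and minimize gamma1*sum_t|beta_t/a|**p + gamma2*(a*c)**k over a > 0."""
    c0 = 1.0
    S = np.sum(np.abs(beta) ** p)
    a = np.exp(np.linspace(-5.0, 5.0, 200001))       # dense grid of candidate scalings
    penalty = gamma1 * S * a ** (-p) + gamma2 * (a * c0) ** k
    a_star = a[np.argmin(penalty)]
    c_hat, beta_hat = a_star * c0, beta / a_star
    # At the optimum: c_hat**k = (p*gamma1)/(k*gamma2) * sum_t |beta_hat_t|**p,
    # i.e. Table 1's shrinkage formula up to constant factors.
    lhs = c_hat ** k
    rhs = (p * gamma1) / (k * gamma2) * np.sum(np.abs(beta_hat) ** p)
    return lhs, rhs

# The relation holds for all four (p, k) settings of Table 1.
for p, k in [(2, 2), (1, 1), (2, 1), (1, 2)]:
    lhs, rhs = shrinkage_relation(np.array([1.0, -2.0, 0.5]), 0.3, 0.7, p, k)
    print((p, k), round(lhs, 4), round(rhs, 4))
```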
This result suggests that the classic choices of the ℓ2 or ℓ1 penalty on both c and β (corresponding to the early joint regularized methods) might not always be optimal. MMTFL(1,2) is superior for Synthetic data D2, where each model shows strong feature sparsity but few features can be removed if all tasks are considered. Table 2 summarizes the performance comparison, where the best performance is highlighted in bold font. Note that the feature sharing patterns may not be revealed by the recent methods on clustered multitask learning that cluster tasks into groups [10, 8, 23], because no cluster structure is present in Figure 1b, for instance. Rather, the sharing pattern in Figure 1b is in the shape of a staircase.

Table 2: Comparison of the performance between various multitask learning models

Data set     Split  STL        DMTL       rMTFL      MMTFL(2,2)  MMTFL(1,1)  MMTFL(2,1)  MMTFL(1,2)
D1 (R²)      25%    0.40±0.02  0.60±0.02  0.58±0.02  0.64±0.02   0.54±0.03   0.73±0.02   0.42±0.04
             33%    0.55±0.03  0.73±0.01  0.61±0.02  0.65±0.03   0.76±0.01   0.86±0.01   0.79±0.02
             50%    0.60±0.02  0.75±0.01  0.66±0.01  0.84±0.01   0.88±0.01   0.90±0.01   0.86±0.01
D2 (R²)      25%    0.28±0.02  0.36±0.01  0.46±0.01  0.49±0.02   0.35±0.05   0.46±0.02   0.45±0.01
             33%    0.35±0.01  0.42±0.02  0.63±0.03  0.69±0.02   0.75±0.01   0.67±0.03   0.83±0.02
             50%    0.75±0.01  0.81±0.01  0.83±0.01  0.91±0      0.95±0      0.92±0.01   0.97±0
SARCOS (R²)  25%    0.78±0.02  0.90±0     0.90±0     0.89±0      0.89±0      0.90±0.01   0.87±0.01
             33%    0.78±0.02  0.88±0.11  0.89±0.1   0.90±0      0.90±0      0.91±0.01   0.89±0.01
             50%    0.83±0.06  0.87±0.1   0.89±0.1   0.91±0      0.90±0.01   0.91±0.01   0.89±0.01
USPS (F1)    25%    0.83±0.01  0.89±0.01  0.91±0.01  0.90±0.01   0.90±0.01   0.90±0.01   0.91±0.01
             33%    0.84±0.02  0.90±0.01  0.90±0.01  0.89±0.01   0.90±0.01   0.90±0.01   0.91±0.01
             50%    0.87±0.02  0.91±0.01  0.92±0.01  0.92±0.01   0.92±0.01   0.92±0.01   0.93±0.01

Real-world Data. Two benchmark data sets, the SARCOS [1] and USPS [10] data sets, were used for the regression and classification tests, respectively. The SARCOS data set has 48,933 observations, and each observation (example) has 21 features. Each task is to map from the 21 features to one of the 7 consecutive torques of the SARCOS robot arm. We randomly selected 2,000 examples for use in each task. The USPS handwritten digits data set has 2,000 examples and 10 classes, the digits from 0 to 9. We first used principal component analysis to reduce the feature dimension to 87. To create binary classification tasks, for each digit class we randomly chose images from the other 9 classes to be the negative examples. Table 2 provides the performance of the different methods on these two data sets, which shows the effectiveness of MMTFL(2,1) and MMTFL(1,2).

8 Conclusion

In this paper, we study a general framework of multiplicative multitask feature learning. By decomposing the model parameter of each task into a product of two components, the across-task feature indicator and the task-specific parameters, and applying different regularizers to the two components, we can select features for individual tasks and also search for the shared features among tasks.
We have studied the theoretical properties of this framework when different regularizers are applied and found that this family of methods creates models equivalent to those of the joint regularized MTL methods, but with a more general form of regularization. Further, an analytical formula is derived for the across-task component as related to the task-specific component, which sheds light on the different shrinkage effects of the various regularizers. An efficient algorithm is derived to solve the entire family of methods and is tested in our experiments. Empirical results on synthetic data clearly show that there may not be a particular choice of regularizers that is universally better than the others. We empirically identify a few feature sharing patterns that favor the two newly-proposed choices of regularizers, which is confirmed on both synthetic and real-world data sets.

Acknowledgements

Jinbo Bi and her students Xin Wang and Jiangwen Sun were supported by NSF grants IIS-1320586, DBI-1356655, IIS-1407205, and IIS-1447711.

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Proceedings of NIPS'07, pages 41–48, 2007.
[2] J. Bi, T. Xiong, S. Yu, M. Dundar, and R. B. Rao. An improved multi-task learning approach with applications in medical diagnosis. In Proceedings of ECML'08, pages 117–132, 2008.
[3] J. Chen, J. Zhou, and J. Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of KDD'11, pages 42–50, 2011.
[4] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of KDD'04, pages 109–117, 2004.
[5] P. Gong, J. Ye, and C. Zhang. Multi-stage multi-task feature learning. In Proceedings of NIPS'12, pages 1997–2005, 2012.
[6] P. Gong, J. Ye, and C. Zhang. Robust multi-task feature learning.
In Proceedings of KDD'12, pages 895–903, 2012.
[7] I. R. Goodman and S. Kotz. Multivariate θ-generalized normal distributions. Journal of Multivariate Analysis, 3(2):204–219, 1973.
[8] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: a convex formulation. In Proceedings of NIPS'08, 2008.
[9] A. Jalali, S. Sanghavi, C. Ruan, and P. K. Ravikumar. A dirty model for multi-task learning. In Proceedings of NIPS'10, pages 964–972, 2010.
[10] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In Proceedings of ICML'11, pages 521–528, 2011.
[11] A. Kumar and H. Daume III. Learning task grouping and overlap in multi-task learning. In Proceedings of ICML'12, 2012.
[12] S. Lee, J. Zhu, and E. Xing. Adaptive multi-task lasso: with application to eQTL detection. In Proceedings of NIPS'10, pages 1306–1314, 2010.
[13] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Proceedings of UAI'09, pages 339–348, 2009.
[14] A. Lozano and G. Swirszcz. Multi-level lasso for sparse multi-task regression. In Proceedings of ICML'12, pages 361–368, 2012.
[15] G. Obozinski and B. Taskar. Multi-task feature selection. Technical report, Statistics Department, UC Berkeley, 2006.
[16] A. Passos, P. Rai, J. Wainer, and H. Daume III. Flexible modeling of latent task structures in multitask learning. In Proceedings of ICML'12, pages 1103–1110, 2012.
[17] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for ℓ1,∞ regularization. In Proceedings of ICML'09, pages 108–115, 2009.
[18] P. Rai and H. Daume. Infinite predictor subspace models for multitask learning. In Proceedings of AISTATS'10, pages 613–620, 2010.
[19] P. Tseng.
Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[20] B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
[21] T. Xiong, J. Bi, B. Rao, and V. Cherkassky. Probabilistic joint feature selection for multi-task learning. In Proceedings of SDM'06, pages 69–76, 2006.
[22] Y. Zhang, D.-Y. Yeung, and Q. Xu. Probabilistic multi-task feature selection. In Proceedings of NIPS'10, pages 2559–2567, 2010.
[23] J. Zhou, J. Chen, and J. Ye. Clustered multi-task learning via alternating structure optimization. In Proceedings of NIPS'11, pages 702–710, 2011.
[24] Y. Zhou, R. Jin, and S. C. Hoi. Exclusive lasso for multi-task feature selection. In Proceedings of UAI'10, pages 988–995, 2010.