{"title": "Clustered Multi-Task Learning: A Convex Formulation", "book": "Advances in Neural Information Processing Systems", "page_first": 745, "page_last": 752, "abstract": "In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the iedb MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non convex methods dedicated to the same problem.", "full_text": "Clustered Multi-Task Learning: a Convex Formulation\n\nLaurent Jacob\nMines ParisTech \u2013 CBIO\nINSERM U900, Institut Curie\n35, rue Saint Honor\u00e9, 77300 Fontainebleau, France\nlaurent.jacob@mines-paristech.fr\n\nFrancis Bach\nINRIA \u2013 Willow Project\nEcole Normale Sup\u00e9rieure,\n45, rue d\u2019Ulm, 75230 Paris, France\nfrancis.bach@mines.org\n\nJean-Philippe Vert\nMines ParisTech \u2013 CBIO\nINSERM U900, Institut Curie\n35, rue Saint Honor\u00e9, 77300 Fontainebleau, France\njean-philippe.vert@mines-paristech.fr\n\nAbstract\n\nIn multi-task learning several related tasks are considered simultaneously, with\nthe hope that by an appropriate sharing of information across tasks, each task may\nbenefit from the others.\n
In the context of learning linear functions for supervised\nclassification or regression, this can be achieved by including a priori information\nabout the weight vectors associated with the tasks, and how they are expected\nto be related to each other. In this paper, we assume that tasks are clustered into\ngroups, which are unknown beforehand, and that tasks within a group have similar\nweight vectors. We design a new spectral norm that encodes this a priori assumption,\nwithout the prior knowledge of the partition of tasks into groups, resulting\nin a new convex optimization formulation for multi-task learning. We show in\nsimulations on synthetic examples and on the IEDB MHC-I binding dataset, that\nour approach outperforms well-known convex methods for multi-task learning, as\nwell as related non-convex methods dedicated to the same problem.\n\n1 Introduction\n\nRegularization has emerged as a dominant theme in machine learning and statistics, providing an\nintuitive and principled tool for learning from high-dimensional data. In particular, regularization\nby squared Euclidean norms or squared Hilbert norms has been thoroughly studied in various settings,\nleading to efficient practical algorithms based on linear algebra, and to very good theoretical\nunderstanding (see, e.g., [1, 2]). In recent years, regularization by non-Hilbert norms, such as \u2113_p\nnorms with p \u2260 2, has also generated considerable interest for the inference of linear functions in\nsupervised classification or regression.\n
Indeed, such norms can sometimes both make the problem\nstatistically and numerically better-behaved, and impose various prior knowledge on the problem.\nFor example, the \u2113_1-norm (the sum of absolute values) forces some of the components to be equal\nto zero and is widely used to estimate sparse functions [3], while various combinations of \u2113_p norms\ncan be defined to impose various sparsity patterns.\nWhile most recent work has focused on studying the properties of simple well-known norms, we\ntake the opposite approach in this paper. That is, assuming a given prior knowledge, how can we\ndesign a norm that will enforce it?\nMore precisely, we consider the problem of multi-task learning, which has recently emerged as a\nvery promising research direction for various applications [4]. In multi-task learning several related\ninference tasks are considered simultaneously, with the hope that by an appropriate sharing\nof information across tasks, each one may benefit from the others. When linear functions are estimated,\neach task is associated with a weight vector, and a common strategy to design multi-task\nlearning algorithms is to translate some prior hypothesis about how the tasks are related to each other\ninto constraints on the different weight vectors.\n
For example, such constraints are typically that the\nweight vectors of the different tasks belong (a) to a Euclidean ball centered at the origin [5], which\nimplies no sharing of information between tasks apart from the size of the different vectors, i.e., the\namount of regularization, (b) to a ball of unknown center [5], which enforces a similarity between\nthe different weight vectors, or (c) to an unknown low-dimensional subspace [6, 7].\nIn this paper, we consider a different prior hypothesis that we believe could be more relevant in some\napplications: the hypothesis that the different tasks are in fact clustered into different groups, and that\nthe weight vectors of tasks within a group are similar to each other. A key difference with [5], where\na similar hypothesis is studied, is that we don\u2019t assume that the groups are known a priori, and in a\nsense our goal is both to identify the clusters and to use them for multi-task learning. An important\nsituation that motivates this hypothesis is the case where most of the tasks are indeed related to each\nother, but a few \u201coutlier\u201d tasks are very different, in which case it may be better to impose similarity\nor low-dimensional constraints only to a subset of the tasks (thus forming a cluster) rather than to\nall tasks. Another situation of interest is when one can expect a natural organization of the tasks\ninto clusters, such as when one wants to model the preferences of customers and believes that there\nare a few general types of customers with similar preferences within each type, although one does\nnot know beforehand which customers belong to which types. 
Besides an improved performance if\nthe hypothesis turns out to be correct, we also expect this approach to be able to identify the cluster\nstructure among the tasks as a by-product of the inference step, e.g., to identify outliers or groups of\ncustomers, which can be of interest for further understanding of the structure of the problem.\nIn order to translate this hypothesis into a working algorithm, we follow the general strategy mentioned\nabove which is to design a norm or a penalty over the set of weights which can be used as\nregularization in classical inference algorithms. We construct such a penalty by first assuming that\nthe partition of the tasks into clusters is known, similarly to [5]. We then attempt to optimize the\nobjective function of the inference algorithm over the set of partitions, a strategy that has proved\nuseful in other contexts such as multiple kernel learning [8]. This optimization problem over the\nset of partitions being computationally challenging, we propose a convex relaxation of the problem\nwhich results in an efficient algorithm.\n\n2 Multi-task learning with clustered tasks\n\nWe consider m related inference tasks that attempt to learn linear functions over X = R^d from a\ntraining set of input/output pairs (x_i, y_i)_{i=1,...,n}, where x_i \u2208 X and y_i \u2208 Y. In the case of binary\nclassification we usually take Y = {\u22121, +1}, while in the case of regression we take Y = R. Each\ntraining example (x_i, y_i) is associated with a particular task t \u2208 [1, m], and we denote by I(t) \u2282 [1, n]\nthe set of indices of training examples associated with the task t. Our goal is to infer m linear functions\nf_t(x) = w_t^\u22a4 x, for t = 1, . . . , m, associated with the different tasks. We denote by W = (w_1 . . . w_m)\nthe d \u00d7 m matrix whose columns are the successive vectors we want to estimate.\nWe fix a loss function l : R \u00d7 Y \u2192 R that quantifies by l(f(x), y) the cost of predicting f(x)\nfor the input x when the correct output is y. Typical loss functions include the square error in\nregression l(u, y) = (1/2)(u \u2212 y)^2 or the hinge loss in binary classification l(u, y) = max(0, 1 \u2212 uy)\nwith y \u2208 {\u22121, 1}. The empirical risk of a set of linear classifiers given in the matrix W is then\ndefined as the average loss over the training set:\n\n$$\\ell(W) = \\frac{1}{n} \\sum_{t=1}^{m} \\sum_{i \\in I(t)} l(w_t^\\top x_i, y_i) . \\qquad (1)$$\n\nIn the sequel, we will often use the m \u00d7 1 vector 1 composed of ones, the m \u00d7 m projection matrix\nU = 11^\u22a4/m whose entries are all equal to 1/m, as well as the projection matrix \u03a0 = I \u2212 U.\nIn order to learn simultaneously the m tasks, we follow the now well-established approach which\nlooks for a set of weight vectors W that minimizes the empirical risk regularized by a penalty\nfunctional, i.e., we consider the problem:\n\n$$\\min_{W \\in \\mathbb{R}^{d \\times m}} \\ell(W) + \\lambda \\Omega(W) , \\qquad (2)$$\n\nwhere \u03a9(W) can be designed from prior knowledge to constrain some sharing of information between\ntasks. For example, [5] suggests penalizing both the norms of the w_i\u2019s and their variance,\ni.e., to consider a function of the form:\n\n$$\\Omega_{\\mathrm{variance}}(W) = \\|\\bar{w}\\|^2 + \\frac{\\beta}{m} \\sum_{i=1}^{m} \\|w_i - \\bar{w}\\|^2 , \\qquad (3)$$\n\nwhere $\\bar{w} = (\\sum_{i=1}^{m} w_i)/m$ is the mean weight vector. This penalty enforces a clustering of the w_i\u2019s\ntowards their mean when \u03b2 increases. Alternatively, [7] propose to penalize the trace norm of W:\n\n$$\\Omega_{\\mathrm{trace}}(W) = \\sum_{i=1}^{\\min(d,m)} \\sigma_i(W) , \\qquad (4)$$\n\nwhere \u03c3_1(W), . . . , \u03c3_{min(d,m)}(W) are the successive singular values of W.\n
This enforces a low-rank\nsolution in W, i.e., constrains the different w_i\u2019s to live in a low-dimensional subspace.\nHere we would like to define a penalty function \u03a9(W) that encodes as prior knowledge that tasks\nare clustered into r < m groups. To do so, let us first assume that we know beforehand the clusters,\ni.e., we have a partition of the set of tasks into r groups. In that case we can follow an approach\nproposed by [5] which for clarity we rephrase with our notations and slightly generalize now. For a\ngiven cluster c \u2208 [1, r], let us denote J(c) \u2282 [1, m] the set of tasks in c, m_c = |J(c)| the number\nof tasks in the cluster c, and E the m \u00d7 r binary matrix which describes the cluster assignment\nfor the m tasks, i.e., E_{ij} = 1 if task i is in cluster j, 0 otherwise. Let us further denote by\n$\\bar{w}_c = (\\sum_{i \\in J(c)} w_i)/m_c$ the average weight vector for the tasks in c, and recall that $\\bar{w} = (\\sum_{i=1}^{m} w_i)/m$\ndenotes the average weight vector over all tasks. Finally it will be convenient to introduce the matrix\nM = E(E^\u22a4E)^{\u22121}E^\u22a4.\n
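\nAs a small illustration (a hypothetical toy case, not taken from the experiments of this paper), take m = 4 tasks and r = 2 clusters, with tasks 1 and 2 in the first cluster and tasks 3 and 4 in the second. Under that assumption, the definitions of E and M = E(E^\u22a4E)^{\u22121}E^\u22a4 give:\n\n$$E = \\begin{pmatrix} 1 & 0 \\\\ 1 & 0 \\\\ 0 & 1 \\\\ 0 & 1 \\end{pmatrix}, \\quad E^\\top E = \\begin{pmatrix} 2 & 0 \\\\ 0 & 2 \\end{pmatrix}, \\quad M = E (E^\\top E)^{-1} E^\\top = \\frac{1}{2} \\begin{pmatrix} 1 & 1 & 0 & 0 \\\\ 1 & 1 & 0 & 0 \\\\ 0 & 0 & 1 & 1 \\\\ 0 & 0 & 1 & 1 \\end{pmatrix} .$$\n\nIn this sketch each column of W M is the mean weight vector of the corresponding cluster, so right-multiplying W by M replaces every w_i by its cluster average, and tr M = 2 = r.\n\n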
M can also be written I \u2212 L, where L is the normalized Laplacian of the\ngraph G whose nodes are the tasks connected by an edge if and only if they are in the same cluster.\nThen we can define three semi-norms of interest on W that quantify different orthogonal aspects:\n\n\u2022 A global penalty, which measures on average how large the weight vectors are:\n\n$$\\Omega_{\\mathrm{mean}}(W) = m \\|\\bar{w}\\|^2 = \\operatorname{tr} W U W^\\top .$$\n\n\u2022 A measure of between-cluster variance, which quantifies how close to each other the different\nclusters are:\n\n$$\\Omega_{\\mathrm{between}}(W) = \\sum_{c=1}^{r} m_c \\|\\bar{w}_c - \\bar{w}\\|^2 = \\operatorname{tr} W (M - U) W^\\top .$$\n\n\u2022 A measure of within-cluster variance, which quantifies the compactness of the clusters:\n\n$$\\Omega_{\\mathrm{within}}(W) = \\sum_{c=1}^{r} \\Big\\{ \\sum_{i \\in J(c)} \\|w_i - \\bar{w}_c\\|^2 \\Big\\} = \\operatorname{tr} W (I - M) W^\\top .$$\n\nWe note that both \u03a9_between(W) and \u03a9_within(W) depend on the particular choice of clusters E, or\nequivalently of M. We now propose to consider the following general penalty function:\n\n$$\\Omega(W) = \\varepsilon_M \\Omega_{\\mathrm{mean}}(W) + \\varepsilon_B \\Omega_{\\mathrm{between}}(W) + \\varepsilon_W \\Omega_{\\mathrm{within}}(W) , \\qquad (5)$$\n\nwhere \u03b5_M, \u03b5_B and \u03b5_W are non-negative parameters that can balance the importance of the components\nof the penalty. Plugging this quadratic penalty into (2) leads to the general problem:\n\n$$\\min_{W \\in \\mathbb{R}^{d \\times m}} \\ell(W) + \\lambda \\operatorname{tr} W \\Sigma(M)^{-1} W^\\top , \\qquad (6)$$\n\nwhere\n\n$$\\Sigma(M)^{-1} = \\varepsilon_M U + \\varepsilon_B (M - U) + \\varepsilon_W (I - M) . \\qquad (7)$$\n\nHere we use the notation \u03a3(M) to insist on the fact that this quadratic penalty depends on the cluster\nstructure through the matrix M. Observing that the matrices U, M \u2212 U and I \u2212 M are orthogonal\nprojections onto orthogonal supplementary subspaces, we easily get from (7):\n\n$$\\Sigma(M) = \\varepsilon_M^{-1} U + \\varepsilon_B^{-1} (M - U) + \\varepsilon_W^{-1} (I - M) = \\varepsilon_W^{-1} I + (\\varepsilon_M^{-1} - \\varepsilon_B^{-1}) U + (\\varepsilon_B^{-1} - \\varepsilon_W^{-1}) M . \\qquad (8)$$\n\nBy choosing particular values for \u03b5_M, \u03b5_B and \u03b5_W we can recover several situations. In particular:\n\n\u2022 For \u03b5_W = \u03b5_B = \u03b5_M = \u03b5, we simply recover the Frobenius norm of W, which does not put\nany constraint on the relationship between the different tasks:\n\n$$\\Omega(W) = \\varepsilon \\operatorname{tr} W W^\\top = \\varepsilon \\sum_{i=1}^{m} \\|w_i\\|^2 .$$\n\n\u2022 For \u03b5_W = \u03b5_B > \u03b5_M, we recover the penalty of [5] without clusters:\n\n$$\\Omega(W) = \\operatorname{tr} W (\\varepsilon_M U + \\varepsilon_B (I - U)) W^\\top = \\varepsilon_M m \\|\\bar{w}\\|^2 + \\varepsilon_B \\sum_{i=1}^{m} \\|w_i - \\bar{w}\\|^2 .$$\n\nIn that case, a global similarity between tasks is enforced, in addition to the general constraint\non their mean. The structure in clusters plays no role since the sum of the between-\nand within-cluster variance is independent of the particular choice of clusters.\n\n\u2022 For \u03b5_W > \u03b5_B = \u03b5_M we recover the penalty of [5] with clusters:\n\n$$\\Omega(W) = \\operatorname{tr} W (\\varepsilon_M M + \\varepsilon_W (I - M)) W^\\top = \\varepsilon_M \\sum_{c=1}^{r} \\Big\\{ m_c \\|\\bar{w}_c\\|^2 + \\frac{\\varepsilon_W}{\\varepsilon_M} \\sum_{i \\in J(c)} \\|w_i - \\bar{w}_c\\|^2 \\Big\\} .$$\n\nIn order to enforce a cluster hypothesis on the tasks, we therefore see that a natural choice is to\ntake \u03b5_W > \u03b5_B > \u03b5_M in (5). This would have the effect of penalizing more the within-cluster\nvariance than the between-cluster variance, hence promoting compact clusters.\n
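\nThe step from (7) to (8) can be checked directly; the following short verification is only a sketch, and it relies on nothing beyond the projection properties stated above. Writing P_1 = U, P_2 = M \u2212 U and P_3 = I \u2212 M (names introduced here only for this check), we have P_i P_j = 0 for i \u2260 j, P_i^2 = P_i and P_1 + P_2 + P_3 = I, hence\n\n$$\\big( \\varepsilon_M P_1 + \\varepsilon_B P_2 + \\varepsilon_W P_3 \\big) \\big( \\varepsilon_M^{-1} P_1 + \\varepsilon_B^{-1} P_2 + \\varepsilon_W^{-1} P_3 \\big) = P_1 + P_2 + P_3 = I ,$$\n\nwhich presupposes that \u03b5_M, \u03b5_B and \u03b5_W are strictly positive so that the inverses exist.\n\n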
Of course, a major\nlimitation at this point is that we assumed the cluster structure known a priori (through the matrix\nE, or equivalently M). In many cases of interest, we would like instead to learn the cluster structure\nitself from the data. We propose to learn the cluster structure in our framework by optimizing our\nobjective function (6) both in W and M, i.e., to consider the problem:\n\n$$\\min_{W \\in \\mathbb{R}^{d \\times m},\\, M \\in \\mathcal{M}_r} \\ell(W) + \\lambda \\operatorname{tr} W \\Sigma(M)^{-1} W^\\top , \\qquad (9)$$\n\nwhere M_r denotes the set of matrices M = E(E^\u22a4E)^{\u22121}E^\u22a4 defined by a clustering of the m tasks\ninto r clusters and \u03a3(M) is defined in (8). Denoting by S_r = {\u03a3(M) : M \u2208 M_r} the corresponding\nset of positive semidefinite matrices, we can equivalently rewrite the problem as:\n\n$$\\min_{W \\in \\mathbb{R}^{d \\times m},\\, \\Sigma \\in \\mathcal{S}_r} \\ell(W) + \\lambda \\operatorname{tr} W \\Sigma^{-1} W^\\top . \\qquad (10)$$\n\nThe objective function in (10) is jointly convex in W \u2208 R^{d\u00d7m} and \u03a3 \u2208 S_+^m, the set of m \u00d7 m positive\nsemidefinite matrices; however, the (finite) set S_r is not convex, making this problem intractable. We\nare now going to propose a convex relaxation of (10) by optimizing over a convex set of positive\nsemidefinite matrices that contains S_r.\n\n3 Convex relaxation\n\nIn order to formulate a convex relaxation of (10), we observe that in the penalty term (5) the cluster\nstructure only contributes to the second and third terms \u03a9_between(W) and \u03a9_within(W), and that\nthese penalties only depend on the centered version of W. In terms of matrices, only the last two\nterms of \u03a3(M)^{\u22121} in (7) depend on M, i.e., on the clustering, and these terms can be re-written as:\n\n$$\\varepsilon_B (M - U) + \\varepsilon_W (I - M) = \\Pi (\\varepsilon_B M + \\varepsilon_W (I - M)) \\Pi . \\qquad (11)$$\n\nIndeed, it is easy to check that M \u2212 U = M\u03a0 = \u03a0M\u03a0, and that I \u2212 M = I \u2212 U \u2212 (M \u2212 U) =\n\u03a0 \u2212 \u03a0M\u03a0 = \u03a0(I \u2212 M)\u03a0. Intuitively, multiplying by \u03a0 on the right (resp. on the left) centers the\nrows (resp. the columns) of a matrix, and both M \u2212 U and I \u2212 M are row- and column-centered.\nTo simplify notations, let us introduce $\\widetilde{M} = \\Pi M \\Pi$. Plugging (11) in (7) and (9), we get the penalty\n\n$$\\operatorname{tr} W \\Sigma(M)^{-1} W^\\top = \\varepsilon_M \\operatorname{tr} (W^\\top W U) + \\operatorname{tr} (W \\Pi) (\\varepsilon_B \\widetilde{M} + \\varepsilon_W (I - \\widetilde{M})) (W \\Pi)^\\top , \\qquad (12)$$\n\nin which, again, only the second part needs to be optimized with respect to the clustering M. Denoting\n$\\Sigma_c^{-1}(M) = \\varepsilon_B \\widetilde{M} + \\varepsilon_W (I - \\widetilde{M})$, one can express \u03a3_c(M), using the fact that $\\widetilde{M}$ is a projection:\n\n$$\\Sigma_c(M) = (\\varepsilon_B^{-1} - \\varepsilon_W^{-1}) \\widetilde{M} + \\varepsilon_W^{-1} I . \\qquad (13)$$\n\n\u03a3_c is characterized by $\\widetilde{M} = \\Pi M \\Pi$, which is discrete by construction, hence the non-convexity of S_r.\nWe have the natural constraints M \u2265 0 (i.e., $\\widetilde{M} \\geq -U$), $0 \\preceq M \\preceq I$ (i.e., $0 \\preceq \\widetilde{M} \\preceq \\Pi$) and\ntr M = r (i.e., $\\operatorname{tr} \\widetilde{M} = r - 1$). A possible convex relaxation of the discrete set of matrices $\\widetilde{M}$ is\ntherefore $\\{\\widetilde{M} : 0 \\preceq \\widetilde{M} \\preceq I, \\operatorname{tr} \\widetilde{M} = r - 1\\}$. This gives an equivalent convex set S_c for \u03a3_c, namely:\n\n$$\\mathcal{S}_c = \\{ \\Sigma_c \\in \\mathcal{S}_+^m : \\alpha I \\preceq \\Sigma_c \\preceq \\beta I, \\operatorname{tr} \\Sigma_c = \\gamma \\} , \\qquad (14)$$\n\nwith $\\alpha = \\varepsilon_W^{-1}$, $\\beta = \\varepsilon_B^{-1}$ and $\\gamma = (m - r + 1) \\varepsilon_W^{-1} + (r - 1) \\varepsilon_B^{-1}$. Incorporating the first part of the\npenalty (12) into the empirical risk term by defining $\\ell_c(W) = \\lambda \\ell(W) + \\varepsilon_M \\operatorname{tr} (W^\\top W U)$, we are\nnow ready to state our relaxation of (10):\n\n$$\\min_{W \\in \\mathbb{R}^{d \\times m},\\, \\Sigma_c \\in \\mathcal{S}_c} \\ell_c(W) + \\lambda \\operatorname{tr} W \\Pi \\Sigma_c^{-1} (W \\Pi)^\\top . \\qquad (15)$$\n\n3.1 Reinterpretation in terms of norms\n\nWe denote $\\|W\\|_c^2 = \\min_{\\Sigma_c \\in \\mathcal{S}_c} \\operatorname{tr} W \\Sigma_c^{-1} W^\\top$ the cluster norm (CN). For any convex set S_c, we obtain\na norm on W (that we apply here to its centered version). By putting some different constraints\non the set S_c, we obtain different norms on W, and in fact all previous multi-task formulations may\nbe cast in this way, i.e., by choosing a specific set of positive matrices S_c (e.g., trace constraint for\nthe trace norm, and simply a singleton for the Frobenius norm). Thus, designing norms for multi-task\nlearning is equivalent to designing a set of positive matrices. In this paper, we have investigated\na specific set adapted for clustered tasks, but other sets could be designed in other situations.\nNote that we have selected a simple spectral convex set S_c in order to make the optimization simpler\nin Section 3.3, but we could also add some additional constraints that encode the point-wise\npositivity of the matrix M.\n
Finally, when r = 1 (one cluster) and r = m (one cluster per task), we\nget back the formulation of [5].\n\n3.2 Reinterpretation as a convex relaxation of K-means\n\nIn this section we show that the semi-norm $\\|W \\Pi\\|_c^2$ that we have designed earlier can be interpreted\nas a convex relaxation of K-means on the tasks [9]. Indeed, given W \u2208 R^{d\u00d7m}, K-means aims\nto decompose it in the form W = \u00b5E^\u22a4 where \u00b5 \u2208 R^{d\u00d7r} are cluster centers and E represents\na partition. Given E, \u00b5 is found by minimizing $\\min_\\mu \\|W^\\top - E \\mu^\\top\\|_F^2$. Thus, a natural strategy\noutlined by [9] is to alternate between optimizing \u00b5, the partition E and the weight vectors W. We\nnow show that our convex norm is obtained when minimizing in closed form with respect to \u00b5 and\nrelaxing.\nBy translation invariance, this is equivalent to minimizing $\\min_\\mu \\|\\Pi W^\\top - \\Pi E \\mu^\\top\\|_F^2$. If we add a\npenalization on \u00b5 of the form \u03bb tr E^\u22a4E\u00b5\u00b5^\u22a4, then a short calculation shows that the minimum with\nrespect to \u00b5 (i.e., after optimization of the cluster centers) is equal to\n\n$$\\operatorname{tr} \\Pi W^\\top W \\Pi (\\Pi E (E^\\top E)^{-1} E^\\top \\Pi / \\lambda + I)^{-1} = \\operatorname{tr} \\Pi W^\\top W \\Pi (\\Pi M \\Pi / \\lambda + I)^{-1} .$$\n\nBy comparing with Eq. (13), we see that our formulation is indeed a convex relaxation of K-means.\n\n3.3 Primal optimization\n\nLet us now show in more detail how (15) can be solved efficiently. Whereas a dual formulation\ncould be easily derived following [8], a direct approach is to rewrite (15) as\n\n$$\\min_{W \\in \\mathbb{R}^{d \\times m}} \\Big( \\ell_c(W) + \\min_{\\Sigma_c \\in \\mathcal{S}_c} \\operatorname{tr} W \\Pi \\Sigma_c^{-1} (W \\Pi)^\\top \\Big) \\qquad (16)$$\n\nwhich, if \u2113_c is differentiable, can be directly optimized by gradient-based methods on W since\n$\\|W \\Pi\\|_c^2 = \\min_{\\Sigma_c \\in \\mathcal{S}_c} \\operatorname{tr} W \\Pi \\Sigma_c^{-1} (W \\Pi)^\\top$ is a quadratic semi-norm of W\u03a0. This regularization\nterm $\\operatorname{tr} W \\Pi \\Sigma_c^{-1} (W \\Pi)^\\top$ can be computed efficiently using a semi-closed form. Indeed, since S_c as\ndefined in (14) is a spectral set (i.e., it depends only on eigenvalues of covariance matrices), we\nobtain a function of the singular values of W\u03a0 (or equivalently the eigenvalues of W\u03a0W^\u22a4):\n\n$$\\min_{\\Sigma_c \\in \\mathcal{S}_c} \\operatorname{tr} W \\Pi \\Sigma_c^{-1} (W \\Pi)^\\top = \\min_{\\lambda \\in \\mathbb{R}^m,\\, \\alpha \\leq \\lambda_i \\leq \\beta,\\, \\lambda^\\top \\mathbf{1} = \\gamma,\\, V \\in \\mathcal{O}^m} \\operatorname{tr} W \\Pi V \\operatorname{diag}(\\lambda)^{-1} V^\\top (W \\Pi)^\\top ,$$\n\nwhere O^m is the set of orthogonal matrices in R^{m\u00d7m}. The optimal V is the matrix of the eigenvectors\nof W\u03a0W^\u22a4, and we obtain the value of the objective function at the optimum:\n\n$$\\min_{\\Sigma \\in \\mathcal{S}} \\operatorname{tr} W \\Pi \\Sigma^{-1} (W \\Pi)^\\top = \\min_{\\lambda \\in \\mathbb{R}^m,\\, \\alpha \\leq \\lambda_i \\leq \\beta,\\, \\lambda^\\top \\mathbf{1} = \\gamma} \\sum_{i=1}^{m} \\frac{\\sigma_i^2}{\\lambda_i} ,$$\n\nwhere \u03c3 and \u03bb are the vectors containing the singular values of W\u03a0 and \u03a3 respectively. Now, we\nsimply need to be able to compute this function of the singular values.\nThe only coupling in this formulation comes from the trace constraint. The Lagrangian corresponding\nto this constraint is:\n\n$$L(\\lambda, \\nu) = \\sum_{i=1}^{m} \\frac{\\sigma_i^2}{\\lambda_i} + \\nu \\Big( \\sum_{i=1}^{m} \\lambda_i - \\gamma \\Big) . \\qquad (17)$$\n\nFor \u03bd \u2264 0, this is a decreasing function of \u03bb_i, so the minimum on \u03bb_i \u2208 [\u03b1, \u03b2] is reached for \u03bb_i = \u03b2.\nThe dual function is then a linear non-decreasing function of \u03bd (since \u03b1 \u2264 \u03b3/m \u2264 \u03b2 from the\ndefinition of \u03b1, \u03b2, \u03b3 in (14)), which reaches its maximum value (on \u03bd \u2264 0) at \u03bd = 0. Let us therefore\nnow consider the dual for \u03bd \u2265 0. (17) is then a convex function of \u03bb_i. Canceling its derivative with\nrespect to \u03bb_i gives that the minimum in \u03bb \u2208 R is reached for \u03bb_i = \u03c3_i/\u221a\u03bd. Now this may not be\nin the constraint set (\u03b1, \u03b2), so if \u03c3_i < \u03b1\u221a\u03bd then the minimum in \u03bb_i \u2208 [\u03b1, \u03b2] of (17) is reached\nfor \u03bb_i = \u03b1, and if \u03c3_i > \u03b2\u221a\u03bd it is reached for \u03bb_i = \u03b2. Otherwise, it is reached for \u03bb_i = \u03c3_i/\u221a\u03bd.\nReporting this in (17), the dual problem is therefore\n\n$$\\max_{\\nu \\geq 0} \\sum_{i,\\, \\alpha \\sqrt{\\nu} \\leq \\sigma_i \\leq \\beta \\sqrt{\\nu}} 2 \\sigma_i \\sqrt{\\nu} + \\sum_{i,\\, \\sigma_i < \\alpha \\sqrt{\\nu}} \\Big( \\frac{\\sigma_i^2}{\\alpha} + \\nu \\alpha \\Big) + \\sum_{i,\\, \\beta \\sqrt{\\nu} < \\sigma_i} \\Big( \\frac{\\sigma_i^2}{\\beta} + \\nu \\beta \\Big) - \\nu \\gamma . \\qquad (18)$$\n\nAlgorithm 1 Computing $\\|A\\|_c^2$\nRequire: A, \u03b1, \u03b2, \u03b3.\nEnsure: $\\|A\\|_c^2$, \u03bb*.\nCompute the singular values \u03c3_i of A.\nOrder the \u03c3_i^2/\u03b1^2, \u03c3_i^2/\u03b2^2 in a vector I (with an additional 0 at the beginning).\nfor all intervals (a, b) of I do\n  if \u2202L(\u03bb*, \u03bd)/\u2202\u03bd is canceled on \u03bd \u2208 (a, b) then\n    Replace \u03bd* in the dual function L(\u03bb*, \u03bd) to get $\\|A\\|_c^2$, compute \u03bb* on (a, b).\n    return $\\|A\\|_c^2$, \u03bb*.\n  end if\nend for\n\nSince a closed form for this expression is known for each fixed value of \u03bd, one can obtain $\\|W \\Pi\\|_c^2$\n(and the eigenvalues of \u03a3*) by Algorithm 1. The cancellation condition in Algorithm 1 is that the\nvalue canceling the derivative belongs to (a, b), i.e.,\n\n$$\\nu = \\Big( \\frac{\\sum_{i,\\, \\alpha \\sqrt{\\nu} \\leq \\sigma_i \\leq \\beta \\sqrt{\\nu}} \\sigma_i}{\\gamma - (\\alpha n_- + \\beta n_+)} \\Big)^2 \\in (a, b) ,$$\n\nwhere n_\u2212 and n_+ are the number of \u03c3_i < \u03b1\u221a\u03bd and \u03c3_i > \u03b2\u221a\u03bd respectively.\n
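\nThe expression for \u03bd above follows from differentiating the dual objective in (18) on an interval where the three index sets do not change. As a sketch of that step, with $\\mathcal{I}_0$, $\\mathcal{I}_-$ and $\\mathcal{I}_+$ (names introduced here for convenience only) indexing the \u03c3_i lying in [\u03b1\u221a\u03bd, \u03b2\u221a\u03bd], below \u03b1\u221a\u03bd and above \u03b2\u221a\u03bd respectively:\n\n$$\\frac{\\partial}{\\partial \\nu} \\Big[ \\sum_{i \\in \\mathcal{I}_0} 2 \\sigma_i \\sqrt{\\nu} + \\sum_{i \\in \\mathcal{I}_-} \\Big( \\frac{\\sigma_i^2}{\\alpha} + \\nu \\alpha \\Big) + \\sum_{i \\in \\mathcal{I}_+} \\Big( \\frac{\\sigma_i^2}{\\beta} + \\nu \\beta \\Big) - \\nu \\gamma \\Big] = \\frac{1}{\\sqrt{\\nu}} \\sum_{i \\in \\mathcal{I}_0} \\sigma_i + \\alpha n_- + \\beta n_+ - \\gamma = 0 .$$\n\nSolving for \u221a\u03bd gives $\\sqrt{\\nu} = \\big( \\sum_{i \\in \\mathcal{I}_0} \\sigma_i \\big) / \\big( \\gamma - (\\alpha n_- + \\beta n_+) \\big)$, which is the stated value after squaring.\n\n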
Denoting $\\|A\\|_c^2 = F(A, \\Sigma^*(A))$, the gradient $\\nabla_A F = \\partial_A F + \\partial_\\Sigma F \\, \\partial_A \\Sigma$ cannot be computed because of the non-differentiable\nconstraints on \u03a3 for F. We followed an alternative direction, using only the $\\partial_A F$ part.\n\n4 Experiments\n\n4.1 Artificial data\n\nWe generated synthetic data consisting of two clusters of two tasks. The tasks are vectors of R^d, d =\n30. For each cluster, a center $\\bar{w}_c$ was generated in R^{d\u22122}, so that the two clusters are orthogonal. More\nprecisely, each $\\bar{w}_c$ had (d \u2212 2)/2 random features randomly drawn from N(0, \u03c3_r^2), \u03c3_r^2 = 900, and\n(d \u2212 2)/2 zero features. Then, each task t was computed as $w_t + \\bar{w}_{c(t)}$, where c(t) was the cluster\nof t. w_t had the same zero features as its cluster center, and the other features were drawn from\nN(0, \u03c3_c^2), \u03c3_c^2 = 16. The last two features were non-zero for all the tasks and drawn from N(0, \u03c3_c^2).\nFor each task, 2000 points were generated and a normal noise of variance \u03c3_n^2 = 150 was added.\nIn a first experiment, we compared our cluster norm $\\|\\cdot\\|_c^2$ with the single-task learning given by the\nFrobenius norm, and with the trace norm, which corresponds to the assumption that the tasks live in a\nlow-dimensional space. The multi-task kernel approach being a special case of CN, its performance\nwill always be between the performance of the single task and the performance of CN.\nIn a second setting, we compare CN to alternative methods that differ in the way they learn \u03a3:\n\n\u2022 The True metric approach, which simply plugs the actual clustering in E and optimizes W\nusing this fixed metric. This requires knowing the true clustering a priori, and can be\nthought of as a gold standard.\n\n\u2022 The k-means approach, which alternates between optimizing the tasks in W given the metric\n\u03a3 and re-learning \u03a3 by clustering the tasks w_i [9].\n
The clustering is done by a k-means run\n3 times. This is a non-convex approach, and different initializations of k-means may result\nin different local minima.\n\nWe also tried one run of CN followed by a run of True metric using the learned \u03a3 reprojected\nin S_r by rounding, i.e., by performing k-means on the eigenvectors of the learned \u03a3 (Reprojected\napproach), and a run of k-means starting from the relaxed solution (CNinit approach).\nOnly the first method requires knowing the true clustering a priori; all the other methods can be run\nwithout any knowledge of the clustering structure of the tasks.\nEach method was run with different numbers of training points. The training points were equally\nseparated between the two clusters and for each cluster, 5/6th of the points were used for the first\ntask and 1/6th for the second, in order to simulate a natural setting where some tasks have fewer data.\nWe used the 2000 points of each task to build 3 training folds, and the remaining points were used\nfor testing. We used the mean RMSE across the tasks as a criterion, and a quadratic loss for \u2113(W).\nThe results of the first experiment are shown on Figure 1 (left). As expected, both multi-task approaches\nperform better than the approach that learns each task independently. CN penalization on\nthe other hand always gives better testing error than the trace norm penalization, with a stronger advantage\nwhen very few training points are available. When more training points become available,\nall the methods give more and more similar performances.\n
In particular, with large samples, it is not\nuseful anymore to use a multi-task approach.\n\n[Figure 1 plots: RMSE (y-axis) versus number of training points, log scale (x-axis). Left panel legend: Frob, Trace, CN. Right panel legend: CN, KM, True, Repr.]\n\nFigure 1: RMSE versus number of training points for the tested methods.\n\nFigure 2: Recovered \u03a3 with CN (upper line) and k-means (lower line) for 28, 50 and 100 points.\n\nFigure 1 (right) shows the results of the second experiment. Using the true metric always gives the\nbest results. For 28 training points, no method recovers the correct clustering structure, as displayed\non Figure 2, although CN performs slightly better than the k-means approach since the metric it\nlearns is more diffuse. For 50 training points, CN performs much better than the k-means approach,\nwhich completely fails to recover the clustering structure as illustrated by the \u03a3 learned for 28 and\n50 training points on Figure 2. In the latter setting, CN partially recovers the clusters. When more\ntraining points become available, the k-means approach perfectly recovers the clustering structure\nand outperforms the relaxed approach. The reprojected approach, on the other hand, always performs\nas well as the best of the two other methods. The CNinit approach results are not displayed\nsince they are the same as for the reprojected method.\n\n4.2 MHC-I binding data\n\nWe also applied our method to the IEDB MHC-I peptide binding benchmark proposed in [10]. This\ndatabase contains binding affinities of various peptides, i.e., short amino-acid sequences, with different\nMHC-I molecules. This binding process is central in the immune system, and predicting it is\ncrucial, for example to design vaccines.\n
The affinities are thresholded to give a prediction problem.\nEach MHC-I molecule is considered as a task, and the goal is to predict whether a peptide binds a\nmolecule. We used an orthogonal coding of the amino acids to represent the peptides and balanced\nthe data by keeping only one negative example for each positive point, resulting in 15236 points\ninvolving 35 different molecules. We chose a logistic loss for \u2113(W).\n\nTable 1: Prediction error for the 10 molecules with fewer than 200 training peptides in IEDB.\n\nMethod              Test error\nPooling             26.53% \u00b1 2.0\nFrobenius norm      11.62% \u00b1 1.4\nMulti-task kernel   10.10% \u00b1 1.4\nTrace norm          9.20% \u00b1 1.3\nCluster norm        8.71% \u00b1 1.5\n\nMulti-task learning approaches have already proved useful for this problem, see for example [11,\n12]. Besides, it is well known in the vaccine design community that some molecules can be grouped\ninto empirically defined supertypes known to have similar binding behaviors.\n[12] showed in particular that the multi-task approaches were very useful for molecules with few\nknown binders. Following this observation, we consider the mean error on the 10 molecules with\nfewer than 200 known ligands, and report the results in Table 1. We did not select the parameters by\ninternal cross validation, but chose them among a small set of values in order to avoid overfitting.\nMore accurate results could arise from such a cross validation, in particular concerning the number\nof clusters (here we limited the choice to 2 or 10 clusters).\nThe pooling approach simply considers one global prediction problem by pooling together the data\navailable for all molecules. The results illustrate that it is better to consider individual models than\none unique pooled model. On the other hand, all the multi-task approaches improve the accuracy, the\ncluster norm giving the best performance.\n
The learned \u03a3, however, did not recover the known supertypes,\nalthough it may contain some relevant information on the binding behavior of the molecules.\n\n5 Conclusion\n\nWe have presented a convex approach to clustered multi-task learning, based on the design of a\ndedicated norm. Promising results were presented on synthetic examples and on the IEDB dataset.\nWe are currently investigating more refined convex relaxations and the natural extension to non-linear\nmulti-task learning as well as the inclusion of specific features on the tasks, which has been shown\nto improve performance in other settings [6].\n\nReferences\n\n[1] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.\n\n[2] F. Girosi, M. Jones, and T. Poggio. Regularization Theory and Neural Networks Architectures. Neural Comput., 7(2):219\u2013269, 1995.\n\n[3] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Stat. Soc. B., 58:267\u2013288, 1996.\n\n[4] B. Bakker and T. Heskes. Task clustering and gating for bayesian multitask learning. J. Mach. Learn. Res., 4:83\u201399, 2003.\n\n[5] T. Evgeniou, C. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. J. Mach. Learn. Res., 6:615\u2013637, 2005.\n\n[6] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report cs/0611124, arXiv, 2006.\n\n[7] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In B. Sch\u00f6lkopf, J. Platt, and T. Hoffman, editors, Adv. NIPS 19, pages 41\u201348, Cambridge, MA, 2007. MIT Press.\n\n[8] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the Kernel Matrix with Semidefinite Programming. J. Mach. Learn. Res., 5:27\u201372, 2004.\n\n[9] M. Deodhar and J. Ghosh. A framework for simultaneous co-clustering and learning from complex data. In KDD \u201907, pages 250\u2013259, New York, NY, USA, 2007. ACM.\n\n[10] B. Peters, H.-H Bui, S. Frankild, M. Nielson, C. Lundegaard, E. Kostem, D. Basch, K. Lamberth, M. Harndahl, W. Fleri, S. S Wilson, J. Sidney, O. Lund, S. Buus, and A. Sette. A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput Biol, 2(6):e65, 2006.\n\n[11] D. Heckerman, D. Kadie, and J. Listgarten. Leveraging information across HLA alleles/supertypes improves epitope prediction. J. Comput. Biol., 14(6):736\u2013746, 2007.\n\n[12] L. Jacob and J.-P. Vert. Efficient peptide-MHC-I binding prediction for alleles with few known binders. Bioinformatics, 24(3):358\u2013366, Feb 2008.\n", "award": [], "sourceid": 680, "authors": [{"given_name": "Laurent", "family_name": "Jacob", "institution": null}, {"given_name": "Jean-philippe", "family_name": "Vert", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}]}