{"title": "Multi-Task Feature Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 41, "page_last": 48, "abstract": null, "full_text": "Multi-Task Feature Learning\n\nTheodoros Evgeniou\n\nTechnology Management and Decision Sciences,\n\nINSEAD,\n\nAndreas Argyriou\n\nDepartment of Computer Science\n\nUniversity College London\n\nGower Street, London WC1E 6BT, UK\n\nBd de Constance, Fontainebleau 77300, France\n\na.argyriou@cs.ucl.ac.uk\n\ntheodoros.evgeniou@insead.edu\n\nMassimiliano Pontil\n\nDepartment of Computer Science\n\nUniversity College London\n\nGower Street, London WC1E 6BT, UK\n\nm.pontil@cs.ucl.ac.uk\n\nAbstract\n\nWe present a method for learning a low-dimensional representation which is\nshared across a set of multiple related tasks. The method builds upon the well-\nknown 1-norm regularization problem using a new regularizer which controls the\nnumber of learned features common for all the tasks. We show that this problem\nis equivalent to a convex optimization problem and develop an iterative algorithm\nfor solving it. The algorithm has a simple interpretation: it alternately performs a\nsupervised and an unsupervised step, where in the latter step we learn common-\nacross-tasks representations and in the former step we learn task-speci\ufb01c functions\nusing these representations. We report experiments on a simulated and a real data\nset which demonstrate that the proposed method dramatically improves the per-\nformance relative to learning each task independently. Our algorithm can also be\nused, as a special case, to simply select \u2013 not learn \u2013 a few common features across\nthe tasks.\n\n1 Introduction\n\nLearning multiple related tasks simultaneously has been empirically [2, 3, 8, 9, 12, 18, 19, 20] as\nwell as theoretically [2, 4, 5] shown to often signi\ufb01cantly improve performance relative to learning\neach task independently. 
This is the case, for example, when only a few data per task are available, so that there is an advantage in “pooling” together data across many related tasks.\n\nTasks can be related in various ways. For example, task relatedness has been modeled by assuming that all the learned functions are close to each other in some norm [3, 8, 15, 19]. This may be the case for functions capturing preferences in user modeling problems [9, 13]. Tasks may also be related in that they all share a common underlying representation [4, 5, 6]. For example, in object recognition, it is well known that the human visual system is organized in a way that all objects (1) are represented – at the earlier stages of the visual system – using a common set of learned features, e.g. local filters similar to wavelets [16]. In modeling users' preferences/choices, it may also be the case that people make product choices (e.g. of books, music CDs, etc.) using a common set of features describing these products.\n\n(1) We consider each object recognition problem within each object category, e.g. recognizing a face among faces, or a car among cars, to be a different task.\n\nIn this paper, we explore the latter type of task relatedness; that is, we wish to learn a low-dimensional representation which is shared across multiple related tasks. Inspired by the fact that the well-known 1-norm regularization problem provides such a sparse representation for the single-task case, in Section 2 we generalize this formulation to the multiple-task case. Our method learns a few features common across the tasks by regularizing within the tasks while keeping them coupled to each other. Moreover, the method can be used, as a special case, to select (not learn) a few features from a prescribed set. 
Since the extended problem is nonconvex, we develop an equivalent convex optimization problem in Section 3 and present an algorithm for solving it in Section 4. A similar algorithm was investigated in [9] from the perspective of conjoint analysis. Here we provide a theoretical justification of the algorithm in connection with 1-norm regularization.\n\nThe learning algorithm simultaneously learns both the features and the task functions through two alternating steps. The first step consists of independently learning the parameters of the tasks' regression or classification functions. The second step consists of learning, in an unsupervised way, a low-dimensional representation for these task parameters, which we show to be equivalent to learning common features across the tasks. The number of common features learned is controlled, as we empirically show, by the regularization parameter, much like sparsity is controlled in the case of single-task 1-norm regularization.\n\nIn Section 5, we report experiments on a simulated and a real data set which demonstrate that the proposed method learns a few common features across the tasks while also improving the performance relative to learning each task independently. Finally, in Section 6 we briefly compare our approach with other related multi-task learning methods and draw our conclusions.\n\n2 Learning sparse multi-task representations\n\nWe begin by introducing our notation. We let $\mathbb{R}$ be the set of real numbers and $\mathbb{R}_+$ ($\mathbb{R}_{++}$) the subset of non-negative (positive) ones. Let $T$ be the number of tasks and define $\mathbb{N}_T := \{1, \dots, T\}$. For each task $t \in \mathbb{N}_T$, we are given $m$ input/output examples $(x_{t1}, y_{t1}), \dots, (x_{tm}, y_{tm}) \in \mathbb{R}^d \times \mathbb{R}$. Based on this data, we wish to estimate $T$ functions $f_t : \mathbb{R}^d \to \mathbb{R}$, $t \in \mathbb{N}_T$, which approximate well the data and are statistically predictive, see e.g. [11].\n\nIf $w, u \in \mathbb{R}^d$, we define $\langle w, u \rangle := \sum_{i=1}^d w_i u_i$, the standard inner product in $\mathbb{R}^d$. For every $p \geq 1$, we define the $p$-norm of vector $w$ as $\|w\|_p := (\sum_{i=1}^d |w_i|^p)^{1/p}$. If $A$ is a $d \times T$ matrix we denote by $a^i \in \mathbb{R}^T$ and $a_j \in \mathbb{R}^d$ the $i$-th row and the $j$-th column of $A$ respectively. For every $r, p \geq 1$ we define the $(r, p)$-norm of $A$ as $\|A\|_{r,p} := (\sum_{i=1}^d \|a^i\|_r^p)^{1/p}$.\n\nWe denote by $S^d$ the set of $d \times d$ real symmetric matrices and by $S^d_+$ the subset of positive semidefinite ones. If $D$ is a $d \times d$ matrix, we define $\mathrm{trace}(D) := \sum_{i=1}^d D_{ii}$. If $X$ is a $p \times q$ real matrix, $\mathrm{range}(X)$ denotes the set $\{x \in \mathbb{R}^p : x = Xz \text{ for some } z \in \mathbb{R}^q\}$. We let $O^d$ be the set of $d \times d$ orthogonal matrices. Finally, $D^+$ denotes the pseudoinverse of a matrix $D$.\n\n2.1 Problem formulation\n\nThe underlying assumption in this paper is that the functions $f_t$ are related so that they all share a small set of features. Formally, our hypothesis is that the functions $f_t$ can be represented as\n\n$f_t(x) = \sum_{i=1}^d a_{it} h_i(x), \quad t \in \mathbb{N}_T,$   (2.1)\n\nwhere $h_i : \mathbb{R}^d \to \mathbb{R}$ are the features and $a_{it} \in \mathbb{R}$ are the regression parameters. Our main assumption is that all the features but a few have zero coefficients across all the tasks.\n\nFor simplicity, we focus on linear features, that is, $h_i(x) = \langle u_i, x \rangle$, where $u_i \in \mathbb{R}^d$. In addition, we assume that the vectors $u_i$ are orthonormal. Thus, if $U$ denotes the $d \times d$ matrix with columns the vectors $u_i$, then $U \in O^d$. The functions $f_t$ are linear as well, that is $f_t(x) = \langle w_t, x \rangle$, where $w_t = \sum_i a_{it} u_i$. Extensions to nonlinear functions may be obtained, for example, by using kernels along the lines of [8, 15]; since this is not central to the present paper, we postpone its discussion to a future occasion.\n\nLet us denote by $W$ the $d \times T$ matrix whose columns are the vectors $w_t$ and by $A$ the $d \times T$ matrix with entries $a_{it}$. We then have that $W = UA$. 
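To make the $(r, p)$-norm definition concrete, here is a small numpy sketch (our own illustration, not part of the paper; the helper name `rp_norm` is ours). With $r = 2$, $p = 1$ it computes exactly the $\|A\|_{2,1}$ regularizer used below: the 1-norm of the vector of row-wise 2-norms.

```python
import numpy as np

def rp_norm(A, r=2, p=1):
    """(r, p)-norm of A: the p-norm of the vector of r-norms of the rows of A."""
    row_norms = np.linalg.norm(A, ord=r, axis=1)
    return np.linalg.norm(row_norms, ord=p)

A = np.array([[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]])
# the rows have 2-norms 5, 0 and 13, so the (2, 1)-norm is their sum
print(rp_norm(A))  # 18.0
```

Note that a row of zeros contributes nothing to the norm, which is precisely why penalizing it encourages entire features to be dropped across all tasks.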
Our assumption that the tasks share a “small” set of features means that the matrix $A$ has “many” rows which are identically equal to zero and, so, the corresponding features (columns of matrix $U$) will not be used to represent the task parameters (columns of matrix $W$). In other words, matrix $W$ is a low-rank matrix. We note that the problem of learning a low-rank matrix factorization which approximates a given partially observed target matrix has been considered in [1], [17] and references therein. We briefly discuss its connection to our current work in Section 4.\n\nIn the following, we describe our approach to computing the feature vectors $u_i$ and the parameters $a_{it}$. We first consider the case that there is only one task (say task $t$) and the features $u_i$ are fixed. To learn the parameter vector $a_t \in \mathbb{R}^d$ from data $\{(x_{ti}, y_{ti})\}_{i=1}^m$ we would like to minimize the empirical error $\sum_{i=1}^m L(y_{ti}, \langle a_t, U^\top x_{ti} \rangle)$ subject to an upper bound on the number of nonzero components of $a_t$, where $L : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is a prescribed loss function which we assume to be convex in its second argument. This problem is intractable and is often relaxed by requiring an upper bound on the 1-norm of $a_t$. That is, we consider the problem $\min\{\sum_{i=1}^m L(y_{ti}, \langle a_t, U^\top x_{ti} \rangle) : \|a_t\|_1^2 \leq \alpha^2\}$, or equivalently the unconstrained problem\n\n$\min\{\sum_{i=1}^m L(y_{ti}, \langle a_t, U^\top x_{ti} \rangle) + \gamma \|a_t\|_1^2 : a_t \in \mathbb{R}^d\},$   (2.2)\n\nwhere $\gamma > 0$ is the regularization parameter. It is well known that using the 1-norm leads to sparse solutions, that is, many components of the learned vector $a_t$ are zero; see [7] and references therein. Moreover, the number of nonzero components of a solution to problem (2.2) is “typically” a non-increasing function of $\gamma$ [14].\n\nWe now generalize problem (2.2) to the multi-task case. 
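The sparsifying effect of 1-norm regularization appealed to here can be demonstrated with a few lines of proximal gradient descent (ISTA). This is our own sketch: it uses the square loss with $U = I$ and the standard penalty $\gamma \|a\|_1$ rather than the squared form $\gamma \|a\|_1^2$ of (2.2), and the names `ista` and `soft_threshold` are ours.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (componentwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, gamma, iters=2000):
    """Proximal gradient for min_a 0.5 * ||X a - y||^2 + gamma * ||a||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / L, L = Lipschitz const. of the gradient
    a = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ a - y)
        a = soft_threshold(a - step * grad, step * gamma)
    return a

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
a_true = np.zeros(10)
a_true[:2] = [2.0, -3.0]          # only two relevant components
y = X @ a_true                    # noiseless targets
a_hat = ista(X, y, gamma=5.0)
# coefficients of the irrelevant components are driven to (near) zero
print(a_hat)
```

Raising `gamma` prunes more components, mirroring the claim that the number of nonzero components is typically non-increasing in the regularization parameter.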
For this purpose, we introduce the regularization error function\n\n$E(A, U) = \sum_{t=1}^T \sum_{i=1}^m L(y_{ti}, \langle a_t, U^\top x_{ti} \rangle) + \gamma \|A\|_{2,1}^2.$   (2.3)\n\nThe first term in (2.3) is the average of the empirical error across the tasks while the second one is a regularization term which penalizes the $(2,1)$-norm of the matrix $A$. This norm is obtained by first computing the 2-norms of the (across-the-tasks) rows $a^i$ (corresponding to feature $i$) of matrix $A$ and then the 1-norm of the vector $b(A) = (\|a^1\|_2, \dots, \|a^d\|_2)$. It combines the tasks and ensures that common features will be selected across them.\n\nIndeed, if the features $U$ are prescribed and $\hat{A}$ minimizes the function $E$ over $A$, the number of nonzero components of the vector $b(\hat{A})$ will typically be non-increasing with $\gamma$, as in the case of 1-norm single-task regularization. Moreover, the components of the vector $b(\hat{A})$ indicate how important each feature is and favor uniformity across the tasks for each feature.\n\nSince we do not simply want to select the features but also learn them, we further minimize the function $E$ over $U$; that is, we consider the optimization problem\n\n$\min\{E(A, U) : U \in O^d, A \in \mathbb{R}^{d \times T}\}.$   (2.4)\n\nThis method learns a low-dimensional representation which is shared across the tasks. As in the single-task case, the number of features will typically be non-increasing with the regularization parameter – we shall present experimental evidence of this fact in Section 5 (see Figure 1 therein). We note that when the matrix $U$ is not learned and we set $U = I_{d \times d}$, problem (2.4) computes a common set of variables across the tasks. 
That is, we have the following convex optimization problem\n\n$\min\{\sum_{t=1}^T \sum_{i=1}^m L(y_{ti}, \langle a_t, x_{ti} \rangle) + \gamma \|A\|_{2,1}^2 : A \in \mathbb{R}^{d \times T}\}.$   (2.5)\n\nWe shall return to problem (2.5) in Section 4, where we present an algorithm for solving it.\n\n3 Equivalent convex optimization formulation\n\nSolving problem (2.4) is challenging for two main reasons. First, it is a non-convex problem, although it is separately convex in each of the variables $A$ and $U$. Second, the norm $\|A\|_{2,1}$ is nonsmooth, which makes it more difficult to optimize.\n\nA main result of this paper is that problem (2.4) can be transformed into an equivalent convex problem. To this end, for every $W \in \mathbb{R}^{d \times T}$ and $D \in S^d_+$, we define the function\n\n$R(W, D) = \sum_{t=1}^T \sum_{i=1}^m L(y_{ti}, \langle w_t, x_{ti} \rangle) + \gamma \sum_{t=1}^T \langle w_t, D^+ w_t \rangle.$   (3.1)\n\nTheorem 3.1. Problem (2.4) is equivalent to the problem\n\n$\min\{R(W, D) : W \in \mathbb{R}^{d \times T}, D \in S^d_+, \mathrm{trace}(D) \leq 1, \mathrm{range}(W) \subseteq \mathrm{range}(D)\}.$   (3.2)\n\nThat is, $(\hat{A}, \hat{U})$ is an optimal solution of (2.4) if and only if $(\hat{W}, \hat{D}) = (\hat{U}\hat{A}, \hat{U}\,\mathrm{Diag}(\hat{\lambda})\,\hat{U}^\top)$ is an optimal solution of (3.2), where\n\n$\hat{\lambda}_i := \frac{\|\hat{a}^i\|_2}{\|\hat{A}\|_{2,1}}.$   (3.3)\n\nProof. Let $W = UA$ and $D = U\,\mathrm{Diag}\big(\frac{\|a^i\|_2}{\|A\|_{2,1}}\big)\,U^\top$. Then $\|a^i\|_2 = \|W^\top u_i\|_2$ and hence\n\n$\sum_{t=1}^T \langle w_t, D^+ w_t \rangle = \mathrm{trace}(W^\top D^+ W) = \|A\|_{2,1}\,\mathrm{trace}\big(W^\top U\,\mathrm{Diag}((\|W^\top u_i\|_2)^+)\,U^\top W\big) = \|A\|_{2,1}\,\mathrm{trace}\big(\sum_{i=1}^d (\|W^\top u_i\|_2)^+\, W^\top u_i u_i^\top W\big) = \|A\|_{2,1} \sum_{i=1}^d \|W^\top u_i\|_2 = \|A\|_{2,1}^2.$\n\nTherefore, $\min_{W,D} R(W, D) \leq \min_{A,U} E(A, U)$. Conversely, let $D = U\,\mathrm{Diag}(\lambda)\,U^\top$. Then\n\n$\sum_{t=1}^T \langle w_t, D^+ w_t \rangle = \mathrm{trace}(W^\top U\,\mathrm{Diag}(\lambda_i^+)\,U^\top W) = \mathrm{trace}(\mathrm{Diag}(\lambda_i^+)\,A A^\top) \geq \|A\|_{2,1}^2,$\n\nby Lemma 4.2. Note that the range constraint ensures that $W$ is a multiple of the submatrix of $U$ which corresponds to the nonzero eigenvalues of $D$, and hence if $\lambda_i = 0$ then $a^i = 0$ as well. Therefore, $\min_{A,U} E(A, U) \leq \min_{W,D} R(W, D)$.\n\nIn problem (3.2) we have constrained the trace of $D$; otherwise the optimal solution would be to let the eigenvalues of $D$ grow unboundedly, driving the penalty to zero, and only minimize the empirical error term in (3.1). Similarly, we have imposed the range constraint to ensure that the penalty term is bounded below and away from zero. Indeed, without this constraint, it may be possible that $DW = 0$ when $W$ does not have full rank, in which case there is a matrix $D$ for which $\sum_{t=1}^T \langle w_t, D^+ w_t \rangle = \mathrm{trace}(W^\top D^+ W) = 0$.\n\nWe note that the rank of matrix $D$ indicates how many common relevant features the tasks share. Indeed, it is clear from equation (3.3) that the rank of matrix $D$ equals the number of nonzero rows of matrix $A$.\n\nWe now show that the function $R$ in equation (3.1) is jointly convex in $W$ and $D$. For this purpose, we define the function $f(w, D) = w^\top D^+ w$ if $D \in S^d_+$ and $w \in \mathrm{range}(D)$, and $f(w, D) = +\infty$ otherwise. Clearly, $R$ is convex provided $f$ is convex. 
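The identity at the heart of the proof of Theorem 3.1, namely that with $D = U\,\mathrm{Diag}(\|a^i\|_2 / \|A\|_{2,1})\,U^\top$ the penalty $\sum_t \langle w_t, D^+ w_t \rangle$ equals $\|A\|_{2,1}^2$, can be checked numerically. A verification sketch in numpy (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 6, 4
A = rng.standard_normal((d, T))
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # a random orthogonal matrix
W = U @ A

row_norms = np.linalg.norm(A, axis=1)        # ||a^i||_2 for each row i
norm_21 = row_norms.sum()                    # ||A||_{2,1}
D = U @ np.diag(row_norms / norm_21) @ U.T   # the matrix D constructed in the proof

penalty = np.trace(W.T @ np.linalg.pinv(D) @ W)
print(penalty, norm_21 ** 2)  # the two quantities agree
```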
The latter is true since a direct computation expresses $f$ as the supremum of a family of convex functions; namely, we have that $f(w, D) = \sup\{w^\top v + \mathrm{trace}(ED) : E \in S^d, v \in \mathbb{R}^d, 4E + vv^\top \preceq 0\}$.\n\n4 Learning algorithm\n\nWe solve problem (3.2) by alternately minimizing the function $R$ with respect to $D$ and the $w_t$ (recall that $w_t$ is the $t$-th column of matrix $W$). When we keep $D$ fixed, the minimization over $w_t$ simply consists of learning the parameters $w_t$ independently by a regularization method, for example by an SVM or ridge regression type method (2). For a fixed value of the vectors $w_t$, we learn $D$ by solving the minimization problem\n\n$\min\{\sum_{t=1}^T \langle w_t, D^+ w_t \rangle : D \in S^d_+, \mathrm{trace}(D) \leq 1, \mathrm{range}(W) \subseteq \mathrm{range}(D)\}.$   (4.1)\n\nThe following theorem characterizes the optimal solution of problem (4.1).\n\n(2) As noted in the introduction, other multi-task learning methods can be used. For example, we can also penalize the variance of the $w_t$'s – “forcing” them to be close to each other – as in [8]. This would only slightly change the overall method.\n\nAlgorithm 1 (Multi-Task Feature Learning)\n\nInput: training sets $\{(x_{ti}, y_{ti})\}_{i=1}^m$, $t \in \mathbb{N}_T$\nParameters: regularization parameter $\gamma$\nOutput: $d \times d$ matrix $D$, $d \times T$ regression matrix $W = [w_1, \dots, w_T]$\nInitialization: set $D = \frac{I_{d \times d}}{d}$\nwhile convergence condition is not true do\n  for $t = 1, \dots, T$ do\n    compute $w_t = \mathrm{argmin}\{\sum_{i=1}^m L(y_{ti}, \langle w, x_{ti} \rangle) + \gamma \langle w, D^+ w \rangle : w \in \mathbb{R}^d, w \in \mathrm{range}(D)\}$\n  end for\n  set $D = \frac{(WW^\top)^{1/2}}{\mathrm{trace}(WW^\top)^{1/2}}$\nend while\n\nTheorem 4.1. Let $C = WW^\top$. The optimal solution of problem (4.1) is\n\n$D = \frac{C^{1/2}}{\mathrm{trace}\,C^{1/2}}$   (4.2)\n\nand the optimal value equals $(\mathrm{trace}\,C^{1/2})^2$.\n\nWe first introduce the following lemma, which is useful in our analysis.\n\nLemma 4.2. For any $b = (b_1, \dots, b_d) \in \mathbb{R}^d$, we have that\n\n$\inf\{\sum_{i=1}^d \frac{b_i^2}{\lambda_i} : \lambda_i > 0, \sum_{i=1}^d \lambda_i \leq 1\} = \|b\|_1^2$   (4.3)\n\nand any minimizing sequence converges to $\hat{\lambda}_i = \frac{|b_i|}{\|b\|_1}$, $i \in \mathbb{N}_d$.\n\nProof. From the Cauchy-Schwarz inequality we have that $\|b\|_1 = \sum_{b_i \neq 0} \lambda_i^{1/2} \lambda_i^{-1/2} |b_i| \leq (\sum_{b_i \neq 0} \lambda_i)^{1/2} (\sum_{b_i \neq 0} \lambda_i^{-1} b_i^2)^{1/2} \leq (\sum_{i=1}^d \lambda_i^{-1} b_i^2)^{1/2}$. Convergence to the infimum is obtained when $\sum_{i=1}^d \lambda_i \to 1$ and $\frac{|b_i|}{\lambda_i} - \frac{|b_j|}{\lambda_j} \to 0$ for all $i, j \in \mathbb{N}_d$ such that $b_i, b_j \neq 0$. Hence $\lambda_i \to \frac{|b_i|}{\|b\|_1}$. The infimum is attained when $b_i \neq 0$ for all $i \in \mathbb{N}_d$.\n\nProof of Theorem 4.1. We write $D = U\,\mathrm{Diag}(\lambda)\,U^\top$, with $U \in O^d$ and $\lambda \in \mathbb{R}^d_+$. We first minimize over $\lambda$. For this purpose, we use Lemma 4.2 to obtain that\n\n$\inf\{\mathrm{trace}(W^\top U\,\mathrm{Diag}(\lambda)^{-1}\,U^\top W) : \lambda \in \mathbb{R}^d_{++}, \sum_{i=1}^d \lambda_i \leq 1\} = \|U^\top W\|_{2,1}^2 = \big(\sum_{i=1}^d \|W^\top u_i\|_2\big)^2.$\n\nNext we show that\n\n$\min\{\|U^\top W\|_{2,1}^2 : U \in O^d\} = (\mathrm{trace}\,C^{1/2})^2$\n\nand a minimizing $U$ is a system of eigenvectors of $C$. To see this, note that\n\n$\mathrm{trace}(WW^\top u_i u_i^\top) = \mathrm{trace}(C^{1/2} u_i u_i^\top u_i u_i^\top C^{1/2}) \geq (\mathrm{trace}(C^{1/2} u_i u_i^\top))^2 = (u_i^\top C^{1/2} u_i)^2,$\n\nwhere we used $u_i u_i^\top u_i u_i^\top = u_i u_i^\top$ and the Cauchy-Schwarz inequality for the trace inner product together with $\mathrm{trace}(u_i u_i^\top) = 1$. Hence $\|W^\top u_i\|_2 \geq u_i^\top C^{1/2} u_i$ and, summing over $i$, $\|U^\top W\|_{2,1} \geq \mathrm{trace}(C^{1/2})$. Equality is verified if and only if $C^{1/2} u_i u_i^\top = a\,u_i u_i^\top$ for some $a \in \mathbb{R}$, which implies that $C^{1/2} u_i = a u_i$, that is, $u_i$ is an eigenvector of $C$.\n\nThe expression $\mathrm{trace}(WW^\top)^{1/2}$ in (4.2) is simply the sum of the singular values of $W$ and is sometimes called the trace norm. 
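Theorem 4.1 can be sanity-checked numerically: the closed form $D = C^{1/2} / \mathrm{trace}\,C^{1/2}$ attains the value $(\mathrm{trace}\,C^{1/2})^2$, and any other feasible $D$ does no better. A numpy sketch (our own; $C^{1/2}$ is formed via an eigendecomposition, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 5, 8
W = rng.standard_normal((d, T))
C = W @ W.T

evals, V = np.linalg.eigh(C)                # C is symmetric positive definite here
C_half = V @ np.diag(np.sqrt(evals)) @ V.T  # matrix square root of C
D_star = C_half / np.trace(C_half)          # the minimizer from Theorem 4.1

def penalty(D):
    """The objective of problem (4.1): trace(W^T D^+ W)."""
    return np.trace(W.T @ np.linalg.pinv(D) @ W)

opt = penalty(D_star)
print(np.isclose(opt, np.trace(C_half) ** 2))  # optimal value is (trace C^{1/2})^2

# any other feasible D (positive semidefinite, trace <= 1) does no better
M = rng.standard_normal((d, d))
D_rand = M @ M.T
D_rand /= np.trace(D_rand)
print(penalty(D_rand) >= opt)
```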
As shown in [10], the trace norm is the convex envelope of $\mathrm{rank}(W)$ in the unit ball, which gives another interpretation of the relationship between the rank and $\gamma$ in our experiments. Using the trace norm, problem (3.2) becomes a regularization problem which depends only on $W$.\n\nFigure 1: Number of features learned versus the regularization parameter $\gamma$ (see text for description).\n\nHowever, since the trace norm is nonsmooth, we have opted for the above alternating minimization strategy, which is simple to implement and has a natural interpretation. Indeed, Algorithm 1 alternately performs a supervised and an unsupervised step: in the latter step we learn common representations across the tasks and in the former step we learn task-specific functions using these representations.\n\nWe conclude this section by noting that when matrix $D$ in problem (3.2) is additionally constrained to be diagonal, problem (3.2) reduces to problem (2.5). Formally, we have the following corollary.\n\nCorollary 4.3. Problem (2.5) is equivalent to the problem\n\n$\min\{R(W, \mathrm{Diag}(\lambda)) : W \in \mathbb{R}^{d \times T}, \lambda \in \mathbb{R}^d_+, \sum_{i=1}^d \lambda_i \leq 1, \lambda_i \neq 0 \text{ when } w^i \neq 0\}$   (4.4)\n\nand the optimal $\lambda$ is given by\n\n$\lambda_i = \frac{\|w^i\|_2}{\|W\|_{2,1}}, \quad i \in \mathbb{N}_d.$   (4.5)\n\nUsing this corollary we can make a simple modification to Algorithm 1 in order to use it for variable selection. That is, we modify the computation of the matrix $D$ (penultimate line in Algorithm 1) as $D = \mathrm{Diag}(\lambda)$, where the vector $\lambda = (\lambda_1, \dots, \lambda_d)$ is computed using equation (4.5).\n\n5 Experiments\n\nIn this section, we present experiments on a synthetic and a real data set. 
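With the square loss, Algorithm 1 can be sketched compactly: the supervised step is a ridge-type solve per task and the unsupervised step is the closed-form update of Theorem 4.1. The following is our own illustration, not the authors' code: we add a small `eps * I` to `D` (a common smoothing) so that $D^+$ and the range constraint can be replaced by an ordinary inverse, and the function name `mtfl` and the toy data are ours.

```python
import numpy as np

def mtfl(X, Y, gamma, iters=50, eps=1e-6):
    """Alternating minimization for multi-task feature learning (square loss).

    X: (T, m, d) inputs, Y: (T, m) outputs. A small eps*I is added to D for
    numerical stability (a standard smoothing, not part of the original
    algorithm statement).
    """
    T, m, d = X.shape
    D = np.eye(d) / d
    W = np.zeros((d, T))
    for _ in range(iters):
        D_inv = np.linalg.inv(D + eps * np.eye(d))
        for t in range(T):  # supervised step: one ridge-type problem per task
            W[:, t] = np.linalg.solve(X[t].T @ X[t] + gamma * D_inv, X[t].T @ Y[t])
        C = W @ W.T         # unsupervised step: D = C^{1/2} / trace(C^{1/2})
        evals, V = np.linalg.eigh(C)
        C_half = V @ np.diag(np.sqrt(np.maximum(evals, 0.0))) @ V.T
        D = C_half / max(np.trace(C_half), eps)
    return W, D

# toy data: 20 tasks sharing a single relevant direction in 10 dimensions
rng = np.random.default_rng(3)
T, m, d = 20, 15, 10
u = np.zeros(d)
u[0] = 1.0
X = rng.standard_normal((T, m, d))
Y = np.stack([X[t] @ (rng.standard_normal() * u) for t in range(T)])
W, D = mtfl(X, Y, gamma=1.0)
evals = np.linalg.eigvalsh(D)
print(evals[-1] / evals.sum())  # the spectrum of D concentrates on one feature
```

The eigenvalue ratio printed at the end plays the role of the feature-significance plots in Section 5: a spectrum concentrated on few eigenvalues means few learned common features.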
In all of our experiments, we used the square loss function and automatically tuned the regularization parameter $\gamma$ with leave-one-out cross validation.\n\nSynthetic experiments. We created synthetic data sets by generating $T = 200$ task parameters $w_t$ from a 5-dimensional Gaussian distribution with zero mean and covariance equal to $\mathrm{Diag}(1, 0.25, 0.1, 0.05, 0.01)$. These are the relevant dimensions we wish to learn. To these we kept adding up to 20 irrelevant dimensions which are exactly zero. The training and test sets were selected randomly from $[0, 1]^{25}$ and contained 5 and 10 examples per task, respectively. The outputs $y_{ti}$ were computed from the $w_t$ and $x_{ti}$ as $y_{ti} = \langle w_t, x_{ti} \rangle + \nu$, where $\nu$ is zero-mean Gaussian noise with standard deviation equal to 0.1.\n\nWe first present, in Figure 1, the number of features learned by our algorithm, as measured by $\mathrm{rank}(D)$. The plot on the left corresponds to a data set of 200 tasks with 25 input dimensions and that on the right to a real data set of 180 tasks described in the next subsection. As expected, the number of features decreases with $\gamma$.\n\nFigure 2 depicts the performance of our algorithm for $T = 10, 25, 100$ and $200$ tasks, along with the performance of 200 independent standard ridge regressions on the data. For $T = 10, 25$ and $100$, we averaged the performance metrics over runs on all the data so that our estimates have comparable variance. In agreement with past empirical and theoretical evidence (see e.g. [4]), learning multiple tasks together significantly improves on learning the tasks independently. Moreover, the performance of the algorithm improves when more tasks are available. 
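The synthetic data just described can be generated in a few lines (our own sketch of the stated protocol; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
T, m, d_rel, d_irr = 200, 5, 5, 20
d = d_rel + d_irr
cov = np.diag([1.0, 0.25, 0.1, 0.05, 0.01])  # covariance of the relevant dimensions

w_rel = rng.multivariate_normal(np.zeros(d_rel), cov, size=T)  # (T, 5)
W = np.hstack([w_rel, np.zeros((T, d_irr))])  # rows are the task vectors w_t;
                                              # irrelevant dimensions are exactly zero
X = rng.uniform(0.0, 1.0, size=(T, m, d))     # 5 training examples per task from [0, 1]^25
noise = 0.1 * rng.standard_normal((T, m))
Y = np.einsum('tmd,td->tm', X, W) + noise     # y_ti = <w_t, x_ti> + nu
print(X.shape, Y.shape)  # (200, 5, 25) (200, 5)
```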
This improvement is moderate for low dimensionalities but increases as the number of irrelevant dimensions increases.\n\nFigure 2: Test error (left) and residual of learned features (right) vs. dimensionality of the input.\n\nFigure 3: Test error vs. number of tasks (left) for the computer survey data set. Significance of features (middle) and attributes learned by the most important feature (right).\n\nOn the right of Figure 2, we have plotted a residual measure of how well the learned features approximate the actual ones used to generate the data. More specifically, we depict the Frobenius norm of the difference of the learned and actual $D$'s versus the input dimensionality. We observe that adding more tasks leads to better estimates of the underlying features.\n\nConjoint analysis experiment. We then tested the method using a real data set about people's ratings of products from [13]. The data was taken from a survey of 180 persons who rated the likelihood of purchasing one of 20 different personal computers. Here the persons correspond to tasks and the PC models to examples. The input is represented by the following 13 binary attributes: telephone hot line (TE), amount of memory (RAM), screen size (SC), CPU speed (CPU), hard disk (HD), CD-ROM/multimedia (CD), cache (CA), color (CO), availability (AV), warranty (WA), software (SW), guarantee (GU) and price (PR). 
We also added an input component accounting for the bias term. The output is an integer rating on the scale 0-10. Following [13], we used 4 examples per task as the test data and 8 examples per task as the training data.\n\nAs shown in Figure 3, the performance of our algorithm improves with the number of tasks. It also performs much better than independent ridge regressions, whose test error is equal to 16.53. In this particular problem, it is also important to investigate which features are significant to all consumers and how they weight the 13 computer attributes. We demonstrate the results in the two adjacent plots of Figure 3, which were obtained with the data for all 180 tasks. In the middle plot, the distribution of the eigenvalues of $D$ is depicted, indicating that there is a single most important feature which is shared by all persons. The plot on the right shows the weight of each input dimension in this most important feature. This feature seems to weight the technical characteristics of a computer (RAM, CPU and CD-ROM) against its price. Therefore, in this application our algorithm is able to discern interesting patterns in people's decision process.\n\nSchool data. Preliminary experiments with the school data used in [3] achieved explained variance of 37.1%, compared to 29.5% in that paper. These results will be reported in future work.\n\n6 Conclusion\n\nWe have presented an algorithm which learns common sparse function representations across a pool of related tasks. To our knowledge, our approach provides the first convex optimization formulation for multi-task feature learning. 
Although convex optimization methods have been derived for the simpler problem of feature selection [12], prior work on multi-task feature learning has been based on more complex optimization problems which are not convex [2, 4, 6] and, so, are at best only guaranteed to converge to a local minimum.\n\nOur algorithm shares some similarities with recent work in [2], where the task parameters and the features are also updated alternately. Two main differences are that their formulation is not convex and that, in our formulation, the number of learned features is not a parameter but is controlled by the regularization parameter.\n\nThis work may be extended in different directions. For example, it would be interesting to explore whether our formulation can be extended to more general models for the structure across the tasks, as in [20] where ICA-type features are learned, or to hierarchical feature models as in [18].\n\nAcknowledgments\n\nWe wish to thank Yiming Ying and Raphael Hauser for observations on the convexity of (3.2), Charles Micchelli for valuable suggestions and the anonymous reviewers for their useful comments. This work was supported by EPSRC Grants GR/T18707/01 and EP/D071542/1, and by the IST Programme of the European Commission, under the PASCAL Network of Excellence IST-2002-506778.\n\nReferences\n\n[1] J. Abernethy, F. Bach, T. Evgeniou and J-P. Vert. Low-rank matrix factorization with attributes. Technical report N24/06/MM, Ecole des Mines de Paris, 2006.\n[2] R.K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. J. of Machine Learning Research, 6: 1817-1853, 2005.\n[3] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multi-task learning. J. of Machine Learning Research, 4: 83-99, 2003.\n[4] J. Baxter. A model for inductive bias learning. J. of Artificial Intelligence Research, 12: 149-198, 2000.\n[5] S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. Proceedings of Computational Learning Theory (COLT), 2003.\n[6] R. Caruana. Multi-task learning. Machine Learning, 28: 41-75, 1997.\n[7] D. Donoho. For most large underdetermined systems of linear equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Preprint, Dept. of Statistics, Stanford University, 2004.\n[8] T. Evgeniou, C.A. Micchelli and M. Pontil. Learning multiple tasks with kernel methods. J. of Machine Learning Research, 6: 615-637, 2005.\n[9] T. Evgeniou, M. Pontil and O. Toubia. A convex optimization approach to modeling consumer heterogeneity in conjoint estimation. INSEAD N 2006/62/TOM/DS.\n[10] M. Fazel, H. Hindi and S.P. Boyd. A rank minimization heuristic with application to minimum order system approximation. Proceedings, American Control Conference, 6, 2001.\n[11] T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag Series in Statistics, New York, 2001.\n[12] T. Jebara. Multi-task feature and kernel selection for SVMs. Proc. of ICML, 2004.\n[13] P.J. Lenk, W.S. DeSarbo, P.E. Green and M.R. Young. Hierarchical Bayes conjoint analysis: recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2): 173-191, 1996.\n[14] C.A. Micchelli and A. Pinkus. Variational problems arising from balancing several error criteria. Rendiconti di Matematica, Serie VII, 14: 37-86, 1994.\n[15] C.A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17: 177-204, 2005.\n[16] T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman and T. Poggio. Theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex. AI Memo No. 2005-036, MIT, Cambridge, MA, October 2005.\n[17] N. Srebro, J.D.M. Rennie and T.S. Jaakkola. Maximum-margin matrix factorization. NIPS, 2004.\n[18] A. Torralba, K.P. Murphy and W.T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. Proc. of CVPR'04, pages 762-769, 2004.\n[19] K. Yu, V. Tresp and A. Schwaighofer. Learning Gaussian processes from multiple tasks. Proc. of ICML, 2005.\n[20] J. Zhang, Z. Ghahramani and Y. Yang. Learning multiple related tasks using latent independent component analysis. NIPS, 2006.\n", "award": [], "sourceid": 3143, "authors": [{"given_name": "Andreas", "family_name": "Argyriou", "institution": null}, {"given_name": "Theodoros", "family_name": "Evgeniou", "institution": null}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": null}]}