{"title": "A Spectral Regularization Framework for Multi-Task Structure Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 25, "page_last": 32, "abstract": null, "full_text": "A Spectral Regularization Framework for\n\nMulti-Task Structure Learning\n\nAndreas Argyriou\n\nDepartment of Computer Science\n\nUniversity College London\n\nGower Street, London WC1E 6BT, UK\n\na.argyriou@cs.ucl.ac.uk\n\nMassimiliano Pontil\n\nDepartment of Computer Science\n\nUniversity College London\n\nGower Street, London WC1E 6BT, UK\n\nm.pontil@cs.ucl.ac.uk\n\nCharles A. Micchelli\n\nDepartment of Mathematics and Statistics\n\nSUNY Albany\n\n1400 Washington Avenue\nAlbany, NY, 12222, USA\n\nYiming Ying\n\nDepartment of Engineering Mathematics\n\nUniversity of Bristol\n\nUniversity Walk, Bristol, BS8 1TR, UK\n\nenxyy@bristol.ac.uk\n\nAbstract\n\nLearning the common structure shared by a set of supervised tasks is an important\npractical and theoretical problem. Knowledge of this structure may lead to bet-\nter generalization performance on the tasks and may also facilitate learning new\ntasks. We propose a framework for solving this problem, which is based on reg-\nularization with spectral functions of matrices. This class of regularization prob-\nlems exhibits appealing computational properties and can be optimized ef(cid:2)ciently\nby an alternating minimization algorithm. In addition, we provide a necessary\nand suf(cid:2)cient condition for convexity of the regularizer. We analyze concrete ex-\namples of the framework, which are equivalent to regularization with Lp matrix\nnorms. Experiments on two real data sets indicate that the algorithm scales well\nwith the number of tasks and improves on state of the art statistical performance.\n\n1 Introduction\n\nRecently, there has been renewed interest in the problem of multi-task learning, see [2, 4, 5, 14,\n16, 19] and references therein. 
This problem is important in a variety of applications, ranging from conjoint analysis [12], to object detection in computer vision [18], to multiple microarray data set integration in computational biology [8] – to mention just a few. A key objective in many multi-task learning algorithms is to implement mechanisms for learning the possible structure underlying the tasks. Finding this common structure is important because it allows pooling information across the tasks, a property which is particularly appealing when there are many tasks but only few data per task. Moreover, knowledge of the common structure may facilitate learning new tasks (transfer learning), see [6] and references therein.

In this paper, we extend the formulation of [4], where the structure shared by the tasks is described by a positive definite matrix. In Section 2, we propose a framework in which the task parameters and the structure matrix are jointly computed by minimizing a regularization function. This function has the following appealing property. When the structure matrix is fixed, the function decomposes across the tasks, which can hence be learned independently with standard methods such as SVMs. When the task parameters are fixed, the optimal structure matrix is a spectral function of the covariance of the tasks and can often be computed explicitly. As we shall see, spectral functions are of particular interest in this context because they lead to an efficient alternating minimization algorithm.

The contribution of this paper is threefold. First, in Section 3 we provide a necessary and sufficient condition for convexity of the optimization problem. Second, in Section 4 we characterize the spectral functions which relate to Schatten Lp regularization and present the alternating minimization algorithm. 
Third, in Section 5 we discuss the connection between our framework and the convex optimization method for learning the kernel [11, 15], which leads to a much simpler proof of convexity in the kernel than the one given in [15]. Finally, in Section 6 we present experiments on two real data sets. The experiments indicate that the alternating algorithm runs significantly faster than gradient descent and that our method improves on state of the art statistical performance on these data sets. They also highlight that our approach can be used for transfer learning.

2 Modelling Tasks' Structure

In this section, we introduce our multi-task learning framework. We denote by S^d the set of d × d symmetric matrices, by S^d_+ (S^d_++) the subset of positive semidefinite (definite) ones and by O^d the set of d × d orthogonal matrices. For every positive integer n, we define IN_n = {1, …, n}. We let T be the number of tasks which we want to learn simultaneously. We assume for simplicity that each task t ∈ IN_T is well described by a linear function defined, for every x ∈ IR^d, as w_t^T x, where w_t is a fixed vector of coefficients. For each task t ∈ IN_T, there are m data examples {(x_tj, y_tj) : j ∈ IN_m} ⊂ IR^d × IR available. In practice, the number of examples per task may vary, but we have kept it constant for simplicity of notation.

Our goal is to learn the vectors w_1, …, w_T, as well as the common structure underlying the tasks, from the data examples. In this paper we follow the formulation in [4], where the tasks' structure is summarized by a positive definite matrix D which is linked to the covariance matrix between the tasks, W W^T. Here, W denotes the d × T matrix whose t-th column is given by the vector w_t (we have assumed for simplicity that the mean task is zero). 
Specifically, we learn W and D by minimizing the function

  Reg(W, D) := Err(W) + γ Penalty(W, D),    (2.1)

where γ is a positive parameter which balances the importance of the error and the penalty. The former may be any convex function, bounded from below, evaluated at the values w_t^T x_tj, t ∈ IN_T, j ∈ IN_m. Typically, it will be the average error on the tasks, namely Err(W) = Σ_{t∈IN_T} L_t(w_t), where L_t(w_t) = Σ_{j∈IN_m} ℓ(y_tj, w_t^T x_tj) and ℓ : IR × IR → [0, ∞) is a prescribed loss function (e.g. quadratic, SVM, logistic etc.). We shall assume that the loss ℓ is convex in its second argument, which ensures that the function Err is also convex. The latter term favors the tasks sharing some common structure and is given by

  Penalty(W, D) = tr(F(D) W W^T) = Σ_{t=1}^T w_t^T F(D) w_t,    (2.2)

where F : S^d_++ → S^d_++ is a prescribed spectral matrix function. This is to say that F is induced by applying a function f : (0, ∞) → (0, ∞) to the eigenvalues of its argument. That is, for every D ∈ S^d_++ we write D = U Λ U^T, where U ∈ O^d, Λ = Diag(λ_1, …, λ_d), and define

  F(D) = U F(Λ) U^T,  F(Λ) = Diag(f(λ_1), …, f(λ_d)).    (2.3)

In the rest of the paper, we will always use F to denote a spectral matrix function and f to denote the associated real function, as above.

Minimization of the function Reg allows us to learn the tasks and, at the same time, a good representation for them, which is summarized by the eigenvectors and eigenvalues of the matrix D. Different choices of the function f reflect different properties which we would like the tasks to share. In the special case that f is a constant, the tasks are totally independent and the regularizer (2.2) is a sum of T independent L2 regularizers. 
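In code, the spectral map (2.3) and the penalty (2.2) amount to an eigendecomposition followed by a trace; a minimal numpy sketch (the function names and the example choice of f are ours, for illustration):

```python
import numpy as np

def apply_spectral(f, D):
    """Apply a real function f to the eigenvalues of a symmetric
    positive definite matrix D, as in equation (2.3)."""
    lam, U = np.linalg.eigh(D)              # D = U diag(lam) U^T
    return U @ np.diag(f(lam)) @ U.T

def penalty(W, D, f):
    """The regularizer (2.2): tr(F(D) W W^T) = sum_t w_t^T F(D) w_t."""
    F = apply_spectral(f, D)
    return np.trace(F @ W @ W.T)

# Illustration with f(lam) = lam**(-1/2), a member of the
# negative-power family discussed in Section 4.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))             # d = 5 features, T = 3 tasks
A = rng.standard_normal((5, 5))
D = A @ A.T + np.eye(5)                     # some positive definite D
f = lambda lam: lam ** -0.5
p1 = penalty(W, D, f)
# Equivalent task-by-task form of (2.2):
F = apply_spectral(f, D)
p2 = sum(W[:, t] @ F @ W[:, t] for t in range(3))
assert np.isclose(p1, p2)
```

The two forms of (2.2) agree because the trace of F(D) W W^T is exactly the sum of the quadratic forms w_t^T F(D) w_t over the columns of W.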
In the case f(λ) = λ^{−1}, which is considered in [4], the regularizer favors a sparse representation, in the sense that the tasks share a small common set of features. More generally, functions of the form f(λ) = λ^{−α}, α ≥ 0, allow for combining shared features and task-specific features to a degree tuned by the exponent α. Moreover, the regularizer (2.2) ensures that the optimal representation (optimal D) is a function of the tasks' covariance W W^T. Thus, we propose to solve the minimization problem

  inf{ Reg(W, D) : W ∈ IR^{d×T}, D ∈ S^d_++, tr D ≤ 1 }    (2.4)

for functions f belonging to an appropriate class. As we shall see in Section 4, the upper bound on the trace of D in (2.4) prevents the infimum from being zero, which would lead to overfitting. Moreover, even though the infimum above is not attained in general, the problem in W resulting after partial minimization over D admits a minimizer.

Since the first term in (2.1) is independent of D, we can first optimize the second term with respect to D. That is, we can compute the infimum

  Ω_f(W) := inf{ tr(F(D) W W^T) : D ∈ S^d_++, tr D ≤ 1 }.    (2.5)

In this way we could end up with an optimization problem in W only. However, in general this would be a complex matrix optimization problem. It may require sophisticated optimization tools such as semidefinite programming, which may not scale well with the size of W. Fortunately, as we shall show, problem (2.4) can be efficiently solved by alternately minimizing over D and W. In particular, in Section 4 we shall show that Ω_f is a function of the singular values of W only. 
Hence, the only matrix operation required by alternate minimization is a singular value decomposition; the rest are merely vector problems. Finally, we note that the ideas above may be extended naturally to a reproducing kernel Hilbert space setting [3].

3 Joint Convexity via Matrix Concave Functions

In this section, we address the issue of convexity of the regularization function (2.1). Our main result characterizes the class of spectral functions F for which the term w^T F(D) w is jointly convex in (w, D), which in turn implies that (2.4) is a convex optimization problem.

To state our result, we require the matrix analytic concept of concavity, see, for example, [7]. We say that the real-valued function g : (0, ∞) → IR is matrix concave of order d if

  λ G(A) + (1 − λ) G(B) ⪯ G(λA + (1 − λ)B)  for all A, B ∈ S^d_++ and λ ∈ [0, 1],

where G is defined as in (2.3). The notation ⪯ denotes the Loewner partial order on S^d: C ⪯ D if and only if D − C is positive semidefinite. If g is a matrix concave function of order d for every d ∈ IN, we simply say that g is matrix concave. We also say that g is matrix convex (of order d) if −g is matrix concave (of order d). Clearly, matrix concavity implies matrix concavity of smaller orders (and hence standard concavity).

Theorem 3.1. Let F : S^d_++ → S^d_++ be a spectral function. Then the function ρ : IR^d × S^d_++ → [0, ∞) defined as ρ(w, D) = w^T F(D) w is jointly convex if and only if 1/f is matrix concave of order d.

Proof. By definition, ρ is convex if and only if, for any w_1, w_2 ∈ IR^d, D_1, D_2 ∈ S^d_++ and λ ∈ (0, 1), it holds that

  ρ(λw_1 + (1−λ)w_2, λD_1 + (1−λ)D_2) ≤ λ ρ(w_1, D_1) + (1−λ) ρ(w_2, D_2).

Let C := F(λD_1 + (1−λ)D_2), A := F(D_1)/λ, B := F(D_2)/(1−λ), w := λw_1 + (1−λ)w_2 and z := λw_1. Using this notation, the above inequality can be rewritten as

  w^T C w ≤ z^T A z + (w − z)^T B (w − z)  for all w, z ∈ IR^d.    (3.1)

The right-hand side in (3.1) is minimized for z = (A + B)^{−1} B w and hence (3.1) is equivalent to

  w^T C w ≤ w^T [ B(A+B)^{−1} A (A+B)^{−1} B + (I − (A+B)^{−1} B)^T B (I − (A+B)^{−1} B) ] w  for all w ∈ IR^d,

or to

  C ⪯ B(A+B)^{−1} A (A+B)^{−1} B + (I − (A+B)^{−1} B)^T B (I − (A+B)^{−1} B)
    = B(A+B)^{−1} A (A+B)^{−1} B + B − 2B(A+B)^{−1}B + B(A+B)^{−1} B (A+B)^{−1} B
    = B − B(A+B)^{−1}B = (A^{−1} + B^{−1})^{−1},

where the last equality follows from the matrix inversion lemma [10, Sec. 0.7]. The above inequality is identical to (see e.g. [10, Sec. 7.7])

  A^{−1} + B^{−1} ⪯ C^{−1},

or, using the initial notation,

  λ (F(D_1))^{−1} + (1−λ) (F(D_2))^{−1} ⪯ (F(λD_1 + (1−λ)D_2))^{−1}.

By definition, this inequality holds for any D_1, D_2 ∈ S^d_++ and λ ∈ (0, 1) if and only if 1/f is matrix concave of order d.

Examples of matrix concave functions on (0, ∞) are log(x + 1) and the functions x^s for s ∈ [0, 1] – see [7] for other examples and theoretical results. We conclude with the remark that, whenever 1/f is matrix concave of order d, the function Ω_f in (2.5) is convex, because it is the partial infimum of a jointly convex function [9, Sec. 
IV.2.4].

4 Regularization with Schatten Lp Prenorms

4.1 Partial Minimization of the Penalty Term

In this section, we focus on the family of negative power functions f and obtain that the function Ω_f in (2.5) relates to the Schatten Lp prenorms. We start by showing that problem (2.5) reduces to a minimization problem in IR^d, by application of a useful matrix inequality. In the following, we let B take the place of W W^T for brevity.

Lemma 4.1. Let F : S^d → S^d be a spectral function, B ∈ S^d and β_i, i ∈ IN_d, the eigenvalues of B. Then

  inf{ tr(F(D)B) : D ∈ S^d_++, tr D ≤ 1 } = inf{ Σ_{i∈IN_d} f(δ_i) β_i : δ_i > 0, i ∈ IN_d, Σ_{i∈IN_d} δ_i ≤ 1 }.

Moreover, for the infimum on the left to be attained, F(D) has to share a set of eigenvectors with B so that the corresponding eigenvalues are in the reverse order as the β_i.

Proof. We use an inequality of von Neumann [13, Sec. H.1.h] to obtain, for all X, Y ∈ S^d, that

  tr(XY) ≥ Σ_{i∈IN_d} λ_i μ_i,

where λ_i and μ_i are the eigenvalues of X and Y in nonincreasing and nondecreasing order, respectively. The equality is attained whenever X = U Diag(λ) U^T, Y = U Diag(μ) U^T for some U ∈ O^d. Applying this inequality for X = F(D), Y = B and denoting f(δ_i) = λ_i, i ∈ IN_d, the result follows.

Using this lemma, we can now derive the solution of problem (2.5) in the case that f is a negative power function.

Proposition 4.2. Let B ∈ S^d_+ and s ∈ (0, 1]. Then we have that

  (tr B^s)^{1/s} = inf{ tr(D^{(s−1)/s} B) : D ∈ S^d_++, tr D ≤ 1 }.

Moreover, if B ∈ S^d_++ the infimum is attained and the minimizer is given by D = B^s / tr B^s.

Proof. By Lemma 4.1, it suffices to show the analogous statement for vectors, namely that

  (Σ_{i∈IN_d} β_i^s)^{1/s} = inf{ Σ_{i∈IN_d} δ_i^{(s−1)/s} β_i : δ_i > 0, i ∈ IN_d, Σ_{i∈IN_d} δ_i ≤ 1 },

where β_i ≥ 0, i ∈ IN_d. To this end, we apply Hölder's inequality with p = 1/s and q = 1/(1−s):

  Σ_{i∈IN_d} β_i^s = Σ_{i∈IN_d} (δ_i^{(s−1)/s} β_i)^s δ_i^{1−s} ≤ (Σ_{i∈IN_d} δ_i^{(s−1)/s} β_i)^s (Σ_{i∈IN_d} δ_i)^{1−s} ≤ (Σ_{i∈IN_d} δ_i^{(s−1)/s} β_i)^s.

When β_i > 0, i ∈ IN_d, the equality is attained for δ_i = β_i^s / Σ_{j∈IN_d} β_j^s, i ∈ IN_d. To show that the inequality is sharp in all other cases, we replace β_i by β_{i,ε} := β_i + ε, i ∈ IN_d, ε > 0, define δ_{i,ε} = β_{i,ε}^s / (Σ_j β_{j,ε}^s) and take the limits as ε → 0.

The above result implies that the regularization problem (2.4) is conceptually equivalent to regularization with a Schatten Lp prenorm of W, when the coupling function f takes the form f(x) = x^{1−2/p} with p ∈ (0, 2], p = 2s. The Schatten Lp prenorm is the Lp prenorm of the singular values of a matrix. In particular, trace norm regularization (see [1, 17]) corresponds to the case p = 1. We also note that generalization error bounds for Schatten Lp norm regularization can be derived along the lines of [14].

4.2 Learning Algorithm

Lemma 4.1 demonstrates that optimization problems such as (2.4) with spectral regularizers of the form (2.2) are computationally appealing, since they decompose into vector problems in d variables along with a singular value decomposition of the matrix W. In particular, for the Schatten Lp prenorm with p ∈ (0, 2], the proof of Proposition 4.2 suggests a way to solve problem (2.4). 
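Proposition 4.2 can also be checked numerically: for positive definite B, the infimum of tr(D^{(s−1)/s} B) over {D ≻ 0, tr D ≤ 1} equals (tr B^s)^{1/s} and is attained at D = B^s / tr B^s. A small numpy sketch (our own illustration, not part of the paper):

```python
import numpy as np

def mat_power(B, s):
    """B^s for symmetric positive definite B, via eigendecomposition."""
    lam, U = np.linalg.eigh(B)
    return U @ np.diag(lam ** s) @ U.T

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = A @ A.T + np.eye(4)                  # positive definite B
s = 0.5                                  # s in (0, 1]; corresponds to p = 2s

lam = np.linalg.eigvalsh(B)
lhs = np.sum(lam ** s) ** (1.0 / s)      # (tr B^s)^{1/s}

D = mat_power(B, s) / np.sum(lam ** s)   # claimed minimizer, tr D = 1
rhs = np.trace(mat_power(D, (s - 1.0) / s) @ B)
assert np.isclose(lhs, rhs)

# Any other feasible D gives a value no smaller:
C = rng.standard_normal((4, 4))
D2 = C @ C.T + 0.1 * np.eye(4)
D2 /= np.trace(D2)                       # positive definite, trace one
assert np.trace(mat_power(D2, (s - 1.0) / s) @ B) >= rhs - 1e-9
```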
We modify the penalty term (2.2) as

  Penalty_ε(W, D) = tr(F(D)(W W^T + εI)),    (4.1)

where ε > 0, and let Reg_ε(W, D) = Err(W) + γ Penalty_ε(W, D) be the corresponding regularization function. By Proposition 4.2, for a fixed W ∈ IR^{d×T} there is a unique minimizer of Penalty_ε (under the constraints in (2.5)), given by the formula

  D_ε(W) = (W W^T + εI)^{p/2} / tr (W W^T + εI)^{p/2}.    (4.2)

Moreover, there exists a minimizer of problem (2.4), which is unique if p ∈ (1, 2].

Therefore, we can solve problem (2.4) using an alternating minimization algorithm, which is an extension of the one presented in [4] for the special case F(D) = D^{−1}. Each iteration of the algorithm consists of two steps. In the first step, we keep D fixed and minimize over W. This consists in solving the problem

  min{ Σ_{t∈IN_T} L_t(w_t) + γ Σ_{t∈IN_T} w_t^T F(D) w_t : W ∈ IR^{d×T} }.

This minimization can be carried out independently for each task, since the regularizer decouples when D is fixed. Specifically, introducing new variables for (F(D))^{1/2} w_t yields a standard L2 regularization problem for each task with the same kernel K(x, z) = x^T (F(D))^{−1} z, x, z ∈ IR^d. In other words, we simply learn the parameters w_t – the columns of matrix W – independently by a regularization method, for example by an SVM or ridge regression method, for which there are well developed tool boxes. In the second step, we keep matrix W fixed and minimize over D using equation (4.2).

Space limitations prevent us from providing a convergence proof of the algorithm. We only note that, following the proof detailed in [3] for the case p = 1, one can show that the sequence produced by the algorithm converges to the unique minimizer of Reg_ε if p ∈ [1, 2], or to a local minimizer if p ∈ (0, 1). Moreover, by [3, Thm. 3], as ε goes to zero the algorithm converges to a solution of problem (2.4), if p ∈ [1, 2]. In theory, an algorithm without ε-perturbation does not converge to a minimizer, since the columns of W and D always remain in the initial column space. In practice, however, we have observed that even such an algorithm converges to an optimal solution, because of round-off effects.

5 Relation to Learning the Kernel

In this section, we discuss the connection between the multi-task framework (2.1)–(2.4) and the framework for learning the kernel, see [11, 15] and references therein. To this end, we define the kernel K_{F(D)}(x, z) = x^T (F(D))^{−1} z, x, z ∈ IR^d, the set of kernels K_f = {K_{F(D)} : D ∈ S^d_++, tr D ≤ 1} and, for every kernel K, the task kernel matrix K_t = (K(x_ti, x_tj) : i, j ∈ IN_m), t ∈ IN_T. It is easy to prove, using Weyl's monotonicity theorem [10, Sec. 4.3] and [7, Thm. V.2.5], that the set K_f is convex if and only if 1/f is matrix concave. By the well-known representer theorem (see e.g. [11]), problem (2.4) is equivalent to minimizing the function

  Σ_{t∈IN_T} ( Σ_{i∈IN_m} ℓ(y_ti, (K_t c_t)_i) + γ c_t^T K_t c_t )    (5.1)

over c_t ∈ IR^m (for t ∈ IN_T) and K ∈ K_f. It is apparent that the function (5.1) is not jointly convex in c_t and K. However, minimizing each term over the vector c_t gives a convex function of K.

Proposition 5.1. Let K be the set of all reproducing kernels on IR^d. If ℓ(y, ·) is convex for any y ∈ IR then the function E_t : K → [0, ∞), defined for every K ∈ K as

  E_t(K) = min{ Σ_{i∈IN_m} ℓ(y_ti, (K_t c)_i) + γ c^T K_t c : c ∈ IR^m },

is convex.

Proof. Without loss of generality, we can assume as in [15] that the matrices K_t are invertible for all t ∈ IN_T. For every a ∈ IR^m and K ∈ K, we define the function G_t(a, K) = Σ_{i∈IN_m} ℓ(y_ti, a_i) + γ a^T K_t^{−1} a, which is jointly convex by Theorem 3.1. Clearly, E_t(K) = min{G_t(a, K) : a ∈ IR^m}. Recalling 
that the partial minimum of a jointly convex function is convex [9, Sec. IV.2.4], we obtain the convexity of E_t.

The fact that the function E_t is convex has already been proved in [15], using minimax theorems and Fenchel duality. Here, we were able to simplify the proof of this result by appealing to the joint convexity property stated in Theorem 3.1.

6 Experiments

In this section, we first report a comparison of the computational cost of the alternating minimization algorithm and the gradient descent algorithm. We then study how performance varies for different Lp regularizers, compare our approach with other multi-task learning methods and report experiments on transfer learning.

We used two data sets in our experiments. The first one is the computer survey data from [12]. It was taken from a survey of 180 persons who rated the likelihood of purchasing one of 20 different personal computers. Here the persons correspond to tasks and the computer models to examples. The input represents 13 different computer characteristics (price, CPU, RAM etc.) while the output is an integer rating on the scale 0–10. Following [12], we used the first 8 examples per task as the training data and the last 4 examples per task as the test data. We measured the root mean square error of the predicted from the actual ratings for the test data, averaged across people.

The second data set is the school data set from the Inner London Education Authority (see http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-support/datasets.shtml). It consists of examination scores of 15362 students from 139 secondary schools in London. Thus, there are 139 tasks, corresponding to predicting student performance in each school. The input consists of the year of the examination, 4 school-specific and 3 student-specific attributes. 
Following [5], we replaced categorical attributes with binary ones, to obtain 27 attributes in total. We generated the training and test sets by 10 random splits of the data, so that 75% of the examples from each school (task) belong to the training set and 25% to the test set. Here, in order to compare our results with those in [5], we used the measure of percentage explained variance, which is defined as one minus the mean squared test error over the variance of the test data, and indicates the percentage of variance explained by the prediction model. Finally, we note that in both data sets we used the square loss, tuned the regularization parameter γ with 5-fold cross-validation, and added an additional input component accounting for the bias term.

In the first experiment, we study the computational cost of the alternating minimization algorithm against the gradient descent algorithm, both implemented in Matlab, for the Schatten L1.5 norm. The left plot in Figure 1 shows the value of the objective function (2.1) versus the number of iterations, on the computer survey data. Curves for different learning rates η are shown; for rates greater than 0.05 gradient descent diverges. The alternating algorithm curve for ε = 10^{−16} is also shown. We further note that for both data sets our algorithm typically needed fewer than 30 iterations to converge. The right plot depicts the CPU time (in seconds) needed to reach a value of the objective function which is less than 10^{−5} away from the minimum, versus the number of tasks. It is clear that our algorithm is at least an order of magnitude faster than gradient descent with the optimal learning rate and scales better with the number of tasks. We note that the computational cost of our method is mainly due to the T ridge regressions in the supervised step (learning W) and the singular value decomposition in the unsupervised step (learning D). 
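These two steps can be sketched as follows; a minimal numpy illustration with the square loss (so the supervised step is a ridge regression per task) and the closed-form update (4.2), not the authors' Matlab implementation:

```python
import numpy as np

def alternating_min(X, Y, p=1.0, gamma=0.1, eps=1e-8, iters=30):
    """Alternating minimization for (2.4) with the square loss.
    X: list of (m, d) task design matrices; Y: list of (m,) targets.
    Supervised step: one ridge regression per task with penalty
    w^T F(D) w, where F(D) = D^{(p-2)/p} for the Schatten Lp prenorm.
    Unsupervised step: the closed-form update (4.2)."""
    d, T = X[0].shape[1], len(X)
    D = np.eye(d) / d                       # feasible start, tr D = 1
    W = np.zeros((d, T))
    for _ in range(iters):
        lam, U = np.linalg.eigh(D)
        F = U @ np.diag(lam ** ((p - 2.0) / p)) @ U.T     # F(D)
        for t in range(T):                  # supervised step
            W[:, t] = np.linalg.solve(X[t].T @ X[t] + gamma * F,
                                      X[t].T @ Y[t])
        C = W @ W.T + eps * np.eye(d)       # unsupervised step, (4.2)
        lam, U = np.linalg.eigh(C)
        Cp = U @ np.diag(lam ** (p / 2.0)) @ U.T
        D = Cp / np.trace(Cp)
    return W, D

# Toy usage: three closely related linear tasks.
rng = np.random.default_rng(2)
X = [rng.standard_normal((8, 5)) for _ in range(3)]
w0 = rng.standard_normal(5)
Y = [x @ w0 for x in X]
W, D = alternating_min(X, Y, p=1.0, gamma=0.1)
```

The per-iteration cost matches the description above: T linear solves (the ridge regressions) plus one eigendecomposition of the d × d matrix W W^T + εI.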
A singular value decomposition is also needed in gradient descent, for computing the gradient of the Schatten Lp norm. We have observed that the cost per iteration is smaller for gradient descent but the number of iterations is at least an order of magnitude larger, leading to the large difference in time cost.

[Plots: objective value Reg versus iterations (left) and CPU time in seconds versus number of tasks (right), for gradient descent with η = 0.05, 0.03, 0.01 and for the alternating algorithm.]

Figure 1: Comparison between the alternating algorithm and the gradient descent algorithm.

[Plots: RMSE versus p on the computer survey data (left) and explained variance versus p on the school data (right).]

Figure 2: Performance versus p for the computer survey data (left) and the school data (right).

Table 1: Comparison of different methods on the computer survey data (left) and school data (right).

  Computer survey data:
    Method                   RMSE
    p = 2                    3.88
    p = 1                    1.93
    p = 0.7                  1.86
    Hierarchical Bayes [12]  1.90

  School data:
    Method                   Explained variance
    p = 2                    23.5 ± 2.0%
    p = 1                    26.7 ± 2.0%
    Hierarchical Bayes [5]   29.5 ± 0.4%

In the second experiment, we study the statistical performance of our method as the spectral function changes. Specifically, we choose functions giving rise to Schatten Lp prenorms, as discussed in Section 4. The results, shown in Figure 2, indicate that the trace norm (p = 1) is the best norm on these data sets. However, on the computer survey data a value of p less than one gives the best result overall. From this we speculate that our method can even approximate well the solutions of certain non-convex problems. 
In contrast, on the school data the trace norm gives almost the best result.

Next, in Table 1, we compare our algorithm with the hierarchical Bayes (HB) method described in [5, 12]. This method also learns a matrix D, using Bayesian inference. Our method improves on the HB method on the computer survey data and is competitive on the school data (even though our regularizer is simpler than HB and the data splits of [5] are not available).

Finally, we present preliminary results on transfer learning. On the computer survey data, we trained our method with p = 1 on 150 randomly selected tasks and then used the learned structure matrix D for training 30 ridge regressions on the remaining tasks. We obtained an RMSE of 1.98 on these 30 "new" tasks, which is not much worse than the RMSE of 1.88 on the 150 tasks. In comparison, when using the raw data (D = I/d) on the 30 tasks we obtained an RMSE of 3.83. A similar experiment was performed on the school data, first training on a random subset of 110 schools and then transferring D to the remaining 29 schools. We obtained an explained variance of 19.2% on the new tasks. This was worse than the explained variance of 24.8% on the 110 tasks, but still better than the explained variance of 13.9% with the raw representation.

7 Conclusion

We have presented a spectral regularization framework for learning the structure shared by many supervised tasks. This structure is summarized by a positive definite matrix which is a spectral function of the tasks' covariance matrix. The framework is appealing both theoretically and practically. Theoretically, it brings to bear the rich class of spectral functions, which is well studied in matrix analysis. 
Practically, we have argued, via the concrete example of negative power spectral functions, that the tasks' parameters and the structure matrix can be efficiently computed using an alternating minimization algorithm, improving upon state of the art statistical performance on two real data sets. A natural question is to what extent the framework can be generalized to allow for more complex task sharing mechanisms, in which the structure parameters depend on higher order statistical properties of the tasks.

Acknowledgements

This work was supported by EPSRC Grant EP/D052807/1, NSF Grant DMS 0712827 and by the IST Programme of the European Commission, PASCAL Network of Excellence IST-2002-506778.

References

[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.
[2] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 2007. In press.
[4] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41–48, 2007.
[5] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multi-task learning. Journal of Machine Learning Research, 4:83–99, 2003.
[6] J. Baxter. A model for inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
[7] R. Bhatia. Matrix Analysis. Graduate Texts in Mathematics. Springer, 1997.
[8] R. Chari, W. W. Lockwood, B. P. Coe, et al. SIGMA: a system for integrative genomic microarray analysis of cancer genomes. BMC Genomics, 7:324, 2006.
[9] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms. 
Springer, 1996.
[10] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2005.
[12] P. J. Lenk, W. S. DeSarbo, P. E. Green, and M. R. Young. Hierarchical Bayes conjoint analysis: recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2):173–191, 1996.
[13] A. W. Marshall and I. Olkin. Inequalities: Theory of Majorization and its Applications. Academic Press, 1979.
[14] A. Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, 7:117–139, 2006.
[15] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.
[16] R. Raina, A. Y. Ng, and D. Koller. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[17] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pages 1329–1336, 2005.
[18] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 2:762–769, 2004.
[19] J. Zhang, Z. Ghahramani, and Y. Yang. Learning multiple related tasks using latent independent component analysis. In Advances in Neural Information Processing Systems 18, pages 1585–1592, 
2006.", "award": [], "sourceid": 186, "authors": [{"given_name": "Andreas", "family_name": "Argyriou", "institution": null}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": null}, {"given_name": "Yiming", "family_name": "Ying", "institution": null}, {"given_name": "Charles", "family_name": "Micchelli", "institution": null}]}