{"title": "Learning Multiple Tasks using Manifold Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 46, "page_last": 54, "abstract": "We present a novel method for multitask learning (MTL) based on {\\it manifold regularization}: assume that all task parameters lie on a manifold. This is the generalization of a common assumption made in the existing literature: task parameters share a common {\\it linear} subspace. Our proposed method uses the projection distance from the manifold to regularize the task parameters. The manifold structure and the task parameters are learned using an alternating optimization framework. When the manifold structure is fixed, our method decomposes across tasks which can be learnt independently. An approximation of the manifold regularization scheme is presented that preserves the convexity of the single task learning problem, and makes the proposed MTL framework efficient and easy to implement. We show the efficacy of our method on several datasets.", "full_text": "Learning Multiple Tasks using Manifold Regularization\n\nArvind Agarwal\u2217, Hal Daum\u00e9 III\u2217\nDepartment of Computer Science\nUniversity of Maryland\nCollege Park, MD 20740\narvinda@cs.umd.edu\nhal@umiacs.umd.edu\n\nSamuel Gerber\nScientific Computing and Imaging Institute\nUniversity of Utah\nSalt Lake City, Utah 84112\nsgerber@cs.utah.edu\n\nAbstract\n\nWe present a novel method for multitask learning (MTL) based on manifold regularization: assume that all task parameters lie on a manifold. This is the generalization of a common assumption made in the existing literature: task parameters share a common linear subspace. Our proposed method uses the projection distance from the manifold to regularize the task parameters. 
The manifold structure and the task parameters are learned using an alternating optimization framework. When the manifold structure is fixed, our method decomposes across tasks which can be learnt independently. An approximation of the manifold regularization scheme is presented that preserves the convexity of the single task learning problem, and makes the proposed MTL framework efficient and easy to implement. We show the efficacy of our method on several datasets.\n\n1 Introduction\n\nRecently, it has been shown that learning multiple tasks together helps learning [8, 19, 9] when the tasks are related, and one is able to use an appropriate notion of task relatedness. There are many ways by which one can enforce the relatedness of the tasks. One way to do so is to assume that two tasks are related if their parameters are \u201cclose\u201d. This notion of relatedness is usually incorporated in the form of a regularizer [4, 16, 13] or a prior [15, 22, 21].\nIn this work we present a novel approach for multitask learning (MTL) that considers a notion of relatedness based on ideas from manifold regularization1. Our approach is based on the assumption that the parameters of related tasks cannot vary arbitrarily but rather lie on a low dimensional manifold. A similar idea underlies the standard manifold learning problems: the data does not change arbitrarily, but instead follows a manifold structure. Our assumption is also a generalization of the assumption made in [1], which assumes that all tasks share a linear subspace, and whose learning framework consists of learning this linear subspace and the task parameters simultaneously. We remove the linear constraint from this problem, and assume that the tasks instead share a non-linear subspace.\nIn our proposed approach we learn the task parameters and the task-manifold alternately, learning one while keeping the other fixed, similar to [4]. 
First, we learn all task parameters using a single task learning (STL) method, and then use these task parameters to learn the initial task manifold. The task-manifold is then used to relearn the task parameters using manifold regularization. Learning of manifold and task parameters is repeated until convergence. We emphasize that when we learn the task parameters (keeping the manifold structure fixed), the MTL framework decomposes across the tasks, which can be learned independently using standard methods such as SVMs. Note that unlike most manifold learning algorithms, our framework learns an explicit representation of the manifold and naturally extends to new tasks. Whenever a new task arrives, one can simply use the existing manifold to learn the parameters of the new task. For a new task, our MTL model is very efficient as it does not require relearning all tasks.\nAs shown later in the examples, our method is simple, and can be implemented with only a small change to the existing STL algorithms. Given a black box for manifold learning, STL algorithms can be adapted to the proposed MTL setting. To make the proposed framework even simpler, we provide an approximation which preserves the convexity of the STL problem. We emphasize that this approximation works very well in practice. All the experimental results used this approximation.\n\n\u2217This work was done at School of Computing, University of Utah, Salt Lake City, Utah\n1It is not to be confused with the manifold regularization presented in [7]. We use the projection distance for regularization while Belkin et al. use the graph structure (graph Laplacian).\n\n2 Related Work\n\nIn MTL, task relatedness is a fundamental question and models differ in the ways they answer this question. 
Like our method, most of the existing methods first assume a structure that defines the task relatedness, and then incorporate this structure in the MTL framework in the form of a regularizer [4, 16, 13].\nOne plausible approach is to assume that all task parameters lie in a subspace [1]. The tasks are learned by forcing the parameters to lie in a common linear subspace, thereby exploiting the assumed relatedness in the model. Argyriou et al. [4] later generalized this work by using a function F to model the shared structure. In this work, the relatedness structure is enforced by applying a function F on a covariance matrix D, which yields a regularization of the form tr(F(D) W W^T) on the parameters W. Here, the function F can model different kinds of relatedness structures among tasks, including the linear subspace structure [1]. Given a function F, this framework learns both the relatedness matrix D and the task parameters W. One of the limitations of this approach is the dependency on F, which has to be provided externally. In an informal way, F introduces the non-linearity, and it is not clear what the right choice of F is. Our framework generalizes the linear framework by introducing the nonlinearity through the manifold structure learned automatically from the data, and thus avoids the need for any external function. Argyriou et al. extend their work [4] in [2, 3], where non-linearity is introduced by considering a kernel function on the input data, and then learning the linear subspace in the Hilbert space. This method in spirit is very similar to our method except that we learn an explicit manifold, therefore our method is naturally extensible to new tasks.\nAnother work that models the task relatedness in the form of proximity of the parameters is [16], which assumes that the task parameter wt for each task is close to some common task w0 with some variance vt. 
These vt and w0 are learned by minimizing the Euclidean norm, which is again equivalent to working in the linear space. This idea is later generalized by [13], where tasks are clustered, and regularized with respect to the cluster they belong to. The task parameters are learned under this cluster assumption by minimizing a combination of different penalty functions.\nThere is another line of work [10], where task relatedness is modeled in terms of a matrix B which needs to be provided externally. There is also a large body of work on multitask learning that finds the shared structure in the tasks using Bayesian inference [23, 24, 9], which in spirit is similar to the above approaches, but done in a Bayesian way. It is to be noted that all of the above methods either work in a linear setting or require an external function/matrix to enforce the nonlinearity. In our method, we work in the non-linear setting without using any external function.\n\n3 Multitask Learning using Manifold\n\nIn this section we describe the proposed MTL framework. As mentioned earlier, our framework assumes that the task parameters lie on a manifold, which is a step further than the assumption made in [1], i.e., the task parameters lie on a linear subspace or share a common set of features. Similar to the linear subspace algorithm [1] that learns the task parameters (and the shared subspace) by regularizing the STL framework with the orthogonal projections of the task parameters onto the subspace, we propose to learn the task parameters (and non-linear subspace i.e., task-manifold) by regularizing the STL with the projection distance of the task parameters from this task-manifold (see Figure 1).\nWe begin with some notations. Let T be the total number of tasks, and for each task t, let Xt = {x1, . . . x_{n_t}} be the set of examples and Yt = {y1, . . . y_{n_t}} be the corresponding labels. Each example xi \u2208 R^d is a d dimensional vector, and yi is a label; yi \u2208 {+1,\u22121} in case of a classification problem, and a real value yi \u2208 R in case of a regression problem. nt is the number of examples in task t. For the simplicity of the notations, we assume that all tasks have the same number of examples i.e. n1 = . . . = nT = n, though in practice they may vary. Now for each task t, let \u03b8t be the parameter vector, referred to as the task parameter.\nGiven the example-label pairs set (Xt, Yt) for task t, a learning problem would be to find a function ft that for any future example x, predicts the correct value of y i.e. y = ft(x). A standard way to learn this function is to minimize the loss between the value predicted by the function and the true value. Let L be such a loss function. Let k be a kernel defined on the input examples k : R^d \u00d7 R^d \u2192 R and Hk be the reproducing kernel Hilbert space (RKHS) associated with the kernel k. Restricting ft to the functions in the RKHS and denoting it by f(x; \u03b8t) = \u27e8\u03b8t, \u03c6(x)\u27e9, single task learning solves the following optimization problem:\n\n$$\\theta_t^* = \\arg\\min_{\\theta_t} \\sum_{x \\in X_t} L(f(x; \\theta_t), y) + \\lambda \\|f_t\\|^2_{H_k}, \\tag{1}$$\n\nhere \u03bb is a regularization parameter. Note that the kernel is assumed to be common for all tasks hence does not have the subscript t. This is equivalent to saying that all tasks belong to the same RKHS.\n\nFigure 1: Projection of the estimated parameters w of the task in hand on the manifold learned from all tasks parameters. w\u2217 is the optimal parameter.\n\nNow one can extend the above STL framework to the multitask setting. In MTL, tasks are related, and this notion of relatedness is incorporated through a regularizer. Let u be such a regularizer, then MTL solves:\n\n$$(\\theta_1^*, \\ldots, \\theta_T^*) = \\arg\\min_{(\\theta_1, \\ldots, \\theta_T)} \\sum_{t=1}^{T} \\Big( \\sum_{x \\in X_t} L(f(x; \\theta_t), y) + \\lambda \\|f_t\\|^2_{H_k} \\Big) + \\gamma u(\\theta_1, \\ldots, \\theta_T), \\tag{2}$$\n\nwhere \u03b3 is a trade off parameter similar to \u03bb that trades off the amount of MTL regularization. As mentioned in Section 2, there are many ways in which this regularizer can be implemented. For example, for the assumption that the task parameters are close to a common task \u03b80, the regularizer would just be \u2016\u03b8t \u2212 \u03b80\u2016^2. In our approach, we split the regularizer u(\u03b81, . . . , \u03b8T) into T different regularizers u(\u03b8t, M) such that u(\u03b8t, M) regularizes the parameter of task t while considering the effect of other tasks through the manifold M. The optimization problem under such a regularizer can be written as:\n\n$$(\\theta_1^*, \\ldots, \\theta_T^*) = \\arg\\min_{(\\theta_1, \\ldots, \\theta_T), M} \\sum_{t=1}^{T} \\Big( \\sum_{x \\in X_t} L(f(x; \\theta_t), y) + \\lambda \\|f_t\\|^2_{H_k} + \\gamma u(\\theta_t, M) \\Big). \\tag{3}$$\n\nNote that optimization is now performed over both task parameters and the manifold. If the manifold structure M is fixed then the above optimization problem decomposes into T independent optimization problems. In our approach, the regularizer depends on the structure of the manifold constructed from the task parameters {\u03b81, . . . \u03b8T}. Let M be such a manifold, and PM(\u03b8t) be the projection distance of \u03b8t from the manifold. Now one can use this projection distance as a regularizer u(\u03b8t, M) in the cost function since all task parameters are assumed to lie on the task manifold M. The cost function is now given by:\n\n$$C_P = \\sum_{t=1}^{T} \\Big( \\sum_{x \\in X_t} L(f(x; \\theta_t), y) + \\lambda \\|f_t\\|^2_{H_k} + \\gamma P_M(\\theta_t) \\Big). \\tag{4}$$\n\nSince the manifold structure is not known, the cost function (4) needs to be optimized simultaneously for the task parameters (\u03b81 . . . \u03b8T) and for the task-manifold M. Optimizing for \u03b8 and M jointly is a hard optimization problem, therefore we resort to alternating optimization. We first fix the task parameters and learn the manifold. Next, we fix the manifold M, and learn the task parameters by minimizing (4). In order to minimize (4) for the task parameters, we need an expression for PM i.e. an expression for computing the projection distance of task parameters from the manifold. More precisely, we only need the gradient of PM, not the function itself, since we will solve this problem using gradient descent.\n\n3.1 Manifold Regularization\n\nOur approach relies heavily on the capability to learn a manifold, and to be able to compute the gradient of the projection distances onto the manifold. Much recent work in manifold learning focused on uncovering low dimensional representation [18, 6, 17, 20] of the data. These approaches do not provide the tools crucial to this work i.e., the gradient of the projection distance. Recent work [11] addresses this issue and proposes a manifold learning algorithm, based on the idea of principal surfaces [12]. It explicitly represents the manifold in the ambient space as a parametric surface which can be used to compute the projection distance and its gradient.\nFor the sake of completeness, we briefly describe this method (for details refer to [11]). The method is based on minimizing the expected reconstruction error E[g(h(\u03b8)) \u2212 \u03b8] of the task parameter \u03b8 onto the manifold M. 
Here h is the mapping from the manifold to the lower dimensional Euclidean space and g is the mapping from the lower dimensional Euclidean space to the manifold. Thus, the composition g \u25e6 h maps a point belonging to the manifold to the manifold, using the mapping to the Euclidean space as an intermediate step. Note that \u03b8 and g(h(\u03b8)) are usually not the same. These mappings g and h can be formulated in terms of kernel regressions over the data points:\n\n$$h(\\theta) = \\sum_{j=1}^{T} \\frac{K_\\theta(\\theta - \\theta_j)}{\\sum_{l=1}^{T} K_\\theta(\\theta - \\theta_l)} z_j \\tag{5}$$\n\nwith K\u03b8 a kernel function and zj a set of parameters to be estimated in the manifold learning process. Similarly\n\n$$g(r) = \\sum_{j=1}^{T} \\frac{K_r(r - h(\\theta_j))}{\\sum_{l=1}^{T} K_r(r - h(\\theta_l))} \\theta_j \\tag{6}$$\n\nagain with Kr a kernel function.\nNote that in the limit, the kernel regression converges to the conditional expectation g(r) = E[(\u03b81, . . . , \u03b8T)|r] where the expectation is taken with respect to the probability distribution p(\u03b8) that the parameters are assumed to be sampled from.\nIf h is an orthogonal projection, this yields a principal surface [12], i.e. informally g passes through the middle of the density. In [11] it is shown that in the limit, as the number of samples to learn from increases, h indeed yields an orthogonal projection onto g. Under this orthogonal projection, the estimation of the parameters zi, i.e. the manifold learning, can be done through gradient descent on the sample mean of the projection distance \\frac{1}{T} \\sum_{i=1}^{T} g(h(\\theta_i)) - \\theta_i using a global manifold learning approach for initialization. Once h is estimated, the projection distance is immediate by\n\n$$P_M = \\|\\theta - g(h(\\theta))\\|^2 = \\|\\theta - \\theta^M\\|^2. \\tag{7}$$\n\nFor the optimization of (4) we need the gradient of the projection distance which is\n\n$$\\frac{dP_M(\\theta)}{d\\theta} = 2(g(h(\\theta)) - \\theta) \\frac{dg(r)}{dr}\\Big|_{r=h(\\theta)} \\frac{dh(\\theta)}{d\\theta}. \\tag{8}$$\n\nThe projection distance computation for a single task parameter is O(n) due to the definition of h and g as kernel regressions, which show up in the projection distance gradient in dg(r)/dr|_{r=h(\u03b8)} and dh(\u03b8)/d\u03b8. This is fairly expensive, therefore we propose an approximation to the exact projection gradient, justified by the convergence of h to an orthogonal projection. For an orthogonal projection the term dg(r)/dr|_{r=h(\u03b8)} dh(\u03b8)/d\u03b8 vanishes (dh(\u03b8)/d\u03b8 is orthogonal to the tangent plane dg(r)/dr|_{r=h(\u03b8)} of the projected point) and the gradient simplifies to\n\n$$\\frac{dP_M(\\theta)}{d\\theta} = 2(g(h(\\theta)) - \\theta), \\tag{9}$$\n\nwhich is exactly the gradient of (7) assuming that the projection of \u03b8 onto the manifold is fixed. A further advantage of this approximation, besides a computational speedup, is that no non-convexities are introduced due to the regularization.\n\nAlgorithm 1 MTL using Manifold Regularization\nInput: {xi, yi}_{i=1}^{n} for t = 1 . . . T.\nOutput: \u03b81, . . . \u03b8T.\nInitialize: Learn \u03b81, . . . \u03b8T independently.\nLearn the task-manifold using \u03b81, . . . \u03b8T.\nwhile it < numIter do\n  for t = 1 to T do\n    Learn \u03b8t using (4) with (7) or (10).\n  end for\n  Relearn the task-manifold using \u03b81, . . . \u03b8T.\nend while\n\nThe proposed manifold regularization approximation allows one to use any STL method without much change in the optimization of the STL problem. The proposed method for MTL pipelines manifold learning with the STL. 
Using (7) one can write (4) as:\n\n$$C_P = \\sum_{t=1}^{T} \\Big( \\sum_{x \\in X_t} L(f(x; \\theta_t), y) + \\lambda \\|\\theta_t\\|^2 + \\gamma \\|\\theta_t - \\tilde{\\theta}_t^M\\|^2 \\Big) \\tag{10}$$\n\nhere \u03b8\u0303_t^M is the fixed projection of \u03b8 on the manifold. Note that in the proposed approximation of the above expression, \u03b8\u0303_t^M is fixed while computing the gradient i.e., one does not have to worry about moving the projection of the point on the manifold during the gradient step. Although in the following example we will solve (10) for a linear kernel, the extension to non-linear kernels is straightforward under the proposed approximation. This approximation allows one to treat the manifold regularizer similarly to the RKHS regularizer \u2016\u03b8t\u2016^2 and solve the generalized learning problem (4) with non-linear kernels. Note that \u2016\u03b8t \u2212 \u03b8\u0303_t^M\u2016^2 is a monotonic function of \u03b8 so it does not violate the representer theorem.\n\n3.2 Example: Linear Regression\n\nIn this section, we solve the optimization problem (4) for the linear regression model. This is the model we have used in all of our experiments. In the learning framework (4), the loss function is L(x, y, wt) = (y \u2212 \u27e8wt, x\u27e9)^2 with linear kernel k(x, y) = \u27e8x, y\u27e9. We have changed the notation for the parameters from \u03b8 to w to differentiate the linear regression from the general framework. The cost function for linear regression can now be written as:\n\n$$C_P = \\sum_{t=1}^{T} \\Big( \\sum_{x \\in X_t} (y - \\langle w_t, x \\rangle)^2 + \\frac{\\lambda}{2} \\|w_t\\|^2 + \\gamma P_M(w_t) \\Big) \\tag{11}$$\n\nThis cost function may be convex or non-convex depending upon the manifold terms PM(wt). The first two terms are convex. 
If one uses the approximation (10), this problem becomes convex and has a form similar to STL. The solution under this approximation is given by:\n\n$$w_t = \\big( (\\lambda + \\gamma) I + \\langle X_t, X_t^T \\rangle \\big)^{-1} \\big( X_t Y_t^T + \\gamma \\tilde{w}_t^M \\big) \\tag{12}$$\n\nwhere I is a d \u00d7 d identity matrix, Xt is a d \u00d7 n example matrix, and Yt is a row vector of corresponding labels. w\u0303_t^M is the orthogonal projection of w on the manifold.\n\n3.3 Algorithm Description\n\nThe algorithm for MTL with manifold regularization is straightforward and shown in Algorithm 1. The algorithm begins with the STL setting i.e., each task parameter is learned independently. These learned task parameters are then used to estimate the task-manifold. Keeping the manifold structure fixed, we relearn all task parameters using manifold regularization. Equation (9) is used to compute the gradient of the projection distance used in relearning the parameters. This step gives us the explicit representation of the projection in the case of a linear kernel while a set of weights in the case of a non-linear kernel. Current code available for computing the projection [11] only handles points in the Euclidean space (RKHS with linear kernel), not in a general RKHS, though in theory, it is possible to extend the current code to general RKHS. Once the parameters for all tasks are learned, the manifold is re-estimated based on the updated task parameters. This process is repeated for a fixed number of iterations (in our experiments we use 5 iterations).\n\n4 Experiments\n\nIn this section, we consider the regression task and show the experimental results of our method. We evaluate our method on both synthetic and real datasets.\n\n4.1 Synthetic Dataset\n\nFirst, we evaluate our method on synthetic data. This data is generated from the task parameters sampled from a known manifold (swiss roll). 
The data is generated by first sampling the points from the 3-dimensional swiss roll, and then using these points as the task parameters to generate the examples using the linear regression model. We sample 100 tasks, and for each task we generate 2 examples. The number of examples per task is kept low for two reasons. First, the task at hand (this is linear) is a relatively easy task, and a larger number of examples gives a nearly perfect regression model with the STL method itself, leaving almost no room for improvement. Second, MTL in the real world makes sense only when the number of examples per task is low. In all of our experiments, we compare our approach with the approach presented in [4] for two reasons. First, this is the approach most closely related to our approach (it makes a linear assumption while we make a non-linear assumption), and second, code is available online2.\nIn all our experiments we report the root mean square error (RMSE) [4]. For a set of 100 tasks, taskwise results for the synthetic data are shown in Figure 2(a). In this figure, the x-axis represents the RMSE of the STL model while the y-axis is the RMSE of the MTL model. Figure 2(a) shows the performance of the MTL model relative to the STL model. Each point (x, y) in the figure represents the (STL,MTL) pair. Blue dots denote the MTL performance of our method while green crosses denote the performance of the baseline method [4]. The red line denotes the points where MTL and STL performed equally. Any point above the red line shows that the RMSE of MTL is higher (bad case) while points below denote that the RMSE of MTL is lower (good case). It is clear from Figure 2(a) that our method is able to use the manifold information and therefore outperforms both STL and MTL-baseline methods. We improve the performance of almost all tasks with respect to STL, while MTL-baseline improves the performance of only a few tasks. 
Note the mean performance improvement (reduction in RMSE i.e. RMSE of (STL-MTL)) of all tasks in our method and in the baseline-MTL. We get an improvement of +0.0131 while the baseline has a negative performance improvement of \u22120.0204. For statistical significance, reported numbers are averaged over 10 runs. Hyperparameters of both models (baseline and ours (\u03bb and \u03b3)) were tuned on a small dataset chosen randomly.\n\n4.2 Real Regression Dataset\n\nWe now evaluate our method on two real datasets, the school dataset and the computer survey dataset [14], the same datasets as used in the baseline model [4]. Moreover they have also been used in previous MTL studies, for example, the school dataset in [5, 10] and the computer dataset in [14].\n\nComputer This dataset is a survey of 190 students who rated the likelihood of purchasing one of 20 different personal computers. Here students correspond to the tasks and computers correspond to the examples. Each student rated all of the 20 computers on a scale of 0-10, therefore giving 20 labeled examples per task. Each computer (input example) is represented by 13 different computer characteristics (RAM, cache, CPU, price etc.). Training and test sets were obtained by splitting the dataset into 75% and 25%, thus giving 15 examples for training and 5 examples for testing.\n\nSchool This dataset3 is from the Inner London Education Authority and consists of the examination scores of 15362 students from 139 schools in London. Here, each school corresponds to a task, thus a total of 139 tasks. The input consists of the year of the examination, 4 school-specific and 3 student-specific attributes. 
Following [5, 4], each categorical feature is replaced with binary features, giving us a total of 26 features. We again split the dataset into 75% training and 25% testing.\n\n2For a fair comparison, we use the code provided by the author, available at http://ttic.uchicago.edu/~argyriou/code/mtl_feat/mtl_feat.tar.\n3Available at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-support/datasets.shtml\n\nFigure 2: (a) Taskwise performance on the synthetic dataset. The red line marks where STL and MTL perform equally. Any points above it represent the tasks whose RMSE increased through the MTL framework while those below showed performance improvement (reduced RMSE). Green crosses are the baseline method and blue dots are the manifold method. Avg{Manifold,Baseline} in the title is the mean performance improvement of all tasks over STL. (b) Average RMSE vs number of examples for school dataset\n\nFigure 3: Taskwise performance on (a) computer and (b) school datasets.\n\nSimilar to the synthetic dataset, hyperparameters of the baseline method and manifold method (\u03b3 and \u03bb) were tuned on a small validation dataset picked randomly from the training set.\nIn the experiments, whenever we are required to use a smaller number of examples, examples were chosen randomly. In such experiments, reported numbers were averaged over 10 runs for statistical significance. Note that the fewer the examples, the higher the variance because of randomness. In order to see if learning tasks simultaneously helps, we did not consider the zero value while tuning the hyperparameters of MTL to avoid the reduction of the MTL method to the STL ones.\nFigure 3(a) and Figure 3(b) show the taskwise performance on the computer and school datasets respectively. We note that for the computer dataset, we perform significantly better than both STL and the baseline methods. 
The baseline method performs worse than the STL method, therefore giving a negative average performance improvement of \u22120.9121. We believe that this is because the tasks are related non-linearly. For the school dataset, we perform better than both STL and the baseline method though the relative performance improvement is not as significant as in the computer dataset. On the school dataset, the baseline method has a mixed behavior relative to the STL method, performing well on some tasks while performing worse on others. In both of these datasets, we observe that our method does not cause negative transfer i.e. causing a task to perform worse than the STL. Although we have not used anything in our problem formulation to avoid negative transfer, this observation is interesting. Note that almost all of the existing MTL methods suffer from the negative transfer phenomenon. We emphasize that the baseline method has two parameters that are very important, the regularization parameter and the parameter P. In our experiments we found that the baseline method is very sensitive to both of these parameters. In order to have a fair and competitive comparison, we used the best value of these parameters, tuned on a small validation dataset picked randomly from the training set.\n\n[Figure 2(a) panel title: n=2, T=100, AvgManifold=0.0131, AvgBaseline=\u22120.0204. Figure 2(b) axes: Number of examples per task vs. Avg RMSE; curves: STL, MTL-Manifold, MTL-Baseline. Figure 3(a) panel title: n=15, T=190, AvgManifold=0.2302, AvgBaseline=\u22120.9121. Figure 3(b) panel title: n=10, T=139, AvgManifold=0.1458, AvgBaseline=0.1563. Axes of the taskwise panels: RMSE (STL) vs. RMSE (MTL).]\n\nFigure 4: RMSE vs number of tasks for (a) computer dataset (b) school dataset\n\nNow we show the performance variation with respect to the number of training examples. 
Figure 2(b) shows the relative performance of the STL, MTL-baseline and MTL-Manifold for the school dataset. We outperform the STL method significantly while we perform comparably to the baseline. Note that when the number of examples is relatively low, the baseline method outperforms our method because we do not have enough examples to estimate the parameters of the task which are used for the manifold construction. But as we increase the number of examples, we get a better estimate of the parameters, hence better manifold regularization. For n > 100 we outperform the baseline method by a small amount. Variation of the performance with n is not shown for the computer dataset because the computer dataset has only 20 examples per task.\nPerformance variation with respect to the number of tasks for the school and computer datasets is shown in Figure 4. We outperform the STL method and the baseline method for the computer dataset while we perform better/equal on the school dataset. These two plots indicate how the tasks are related in these two datasets. It suggests that tasks in the school dataset are related linearly (Manifold and baseline methods have the same performance4) while tasks in the computer dataset are related non-linearly, which is why the baseline method performs poorly compared to the STL method. The two datasets exhibit different behavior as we increase the number of tasks, though the behavior relative to the STL method remains constant. This suggests that after a certain number of tasks, performance is not affected by adding more tasks. 
This is especially true for the computer dataset since it only has 13 features and\nonly a few tasks are required to learn the task relatedness structure.\nIn summary, our method improves the performance over STL in all of these datasets (no negative\ntransfer), while baseline method performs comparatively on the school dataset and performs worse\non the computer dataset.\n\n5 Conclusion\n\nWe have presented a novel method for multitask learning based on a natural and intuitive assumption\nabout the task relatedness. We have used the manifold assumption to enforce the task relatedness\nwhich is a generalization of the previous notions of relatedness. Unlike many other previous ap-\nproaches, our method does not require any other external information e.g. function/matrix other\nthan the manifold assumption. We have performed experiments on synthetic and real datasets, and\ncompared our results with the state-of-the-art method. We have shown that we outperform the base-\nline method in nearly all cases. We emphasize that unlike the baseline method, we improve over\nsingle task learning in almost all cases and do not encounter the negative transfer.\n\n4In the ideal case, the non-linear method should be able to discover the linear structure. But in practice,\nthey might differ, especially when there are fewer number of tasks. This is the reason we perform equal on the\nschool dataset when the number of tasks is high.\n\n8\n\n0501001502000.511.522.53Number of tasksAvg RMSE STLMTL\u2212ManifoldMTL\u2212Baseline0501001501.61.71.81.922.12.22.3Number of tasksAvg RMSE STLMTL\u2212ManifoldMTL\u2212Baseline\fReferences\n[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS \u201906, 2006.\n[2] A. Argyriou, T. Evgeniou, M. Pontil, A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-\n\ntask feature learning. In Machine Learning. press, 2007.\n\n[3] A. Argyriou, C. A. Micchelli, and M. Pontil. When is there a representer theorem? 
vector versus matrix regularizers. J. Mach. Learn. Res., 10:2507\u20132529, 2009.\n[4] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In NIPS \u201908, 2008.\n[5] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. J. Mach. Learn. Res., 4:2003, 2003.\n[6] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373\u20131396, 2002.\n[7] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399\u20132434, 2006.\n[8] R. Caruana. Multitask learning. In Machine Learning, pages 41\u201375, 1997.\n[9] H. Daum\u00e9 III. Bayesian multitask learning with latent hierarchies. In Conference on Uncertainty in Artificial Intelligence \u201909, Montreal, Canada, 2009.\n[10] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. J. Mach. Learn. Res., 6:615\u2013637, 2005.\n[11] S. Gerber, T. Tasdizen, and R. Whitaker. Dimensionality reduction and principal surfaces via kernel map manifolds. In Proceedings of the 2009 International Conference on Computer Vision (ICCV), 2009.\n[12] T. Hastie. Principal curves and surfaces. PhD thesis, Stanford University, 1984.\n[13] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. In NIPS \u201908, 2008.\n[14] P. J. Lenk, W. S. DeSarbo, P. E. Green, and M. R. Young. Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 1996.\n[15] Q. Liu, X. Liao, H. L. Carin, J. R. Stack, and L. Carin. Semisupervised multitask learning. IEEE 2009, 2009.\n[16] C. A. Micchelli and M. Pontil. Regularized multi-task learning. In KDD 2004, pages 109\u2013117, 2004.\n[17] S. T. Roweis and L. K. Saul. 
Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323\u20132326, December 2000.\n[18] J. B. Tenenbaum, V. Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319\u20132323, December 2000.\n[19] S. Thrun and L. Pratt, editors. Learning to learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998.\n[20] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In ICML 2004, pages 839\u2013846. ACM Press, 2004.\n[21] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. J. Mach. Learn. Res., 8:35\u201363, 2007.\n[22] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In ICML \u201905, 2005.\n[23] J. Zhang, Z. Ghahramani, and Y. Yang. Flexible latent variable models for multi-task learning. Mach. Learn., 73(3):221\u2013242, 2008.\n[24] J. Zhang, Z. Ghahramani, and Y. Yang. Learning multiple related tasks using latent independent component analysis. In NIPS \u201905, 2005.\n", "award": [], "sourceid": 1110, "authors": [{"given_name": "Arvind", "family_name": "Agarwal", "institution": null}, {"given_name": "Samuel", "family_name": "Gerber", "institution": null}, {"given_name": "Hal", "family_name": "Daume", "institution": null}]}