{"title": "Multi-task Vector Field Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 287, "page_last": 295, "abstract": "Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL), which learns the prediction functions and the vector fields simultaneously. MTVFL has the following key properties: (1) the vector fields we learn are close to the gradient fields of the prediction functions; (2) within each task, the vector field is required to be as parallel as possible, which is expected to span a low dimensional subspace; (3) the vector fields from all tasks share a low dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. 
The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach.", "full_text": "Multi-task Vector Field Learning\n\nBinbin Lin1, Sen Yang2, Chiyuan Zhang1, Jieping Ye2, Xiaofei He1\n\n1State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China\n{binbinlinzju, chiyuan.zhang.zju, xiaofeihe}@gmail.com\n\n2The Biodesign Institute, Arizona State University, Tempe, AZ 85287\n{senyang, jieping.ye}@asu.edu\n\nAbstract\n\nMulti-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL), which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible, which is expected to span a low dimensional subspace. (3) The vector fields from all tasks share a low dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. 
The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach.\n\n1 Introduction\n\nIn many applications, labeled data are expensive and time consuming to obtain, while unlabeled data are abundant. The problem of using unlabeled data to improve generalization performance is often referred to as semi-supervised learning (SSL). It is well known that, in order to make semi-supervised learning work, some assumptions on the dependency between the predictor function and the marginal distribution of the data are needed. The manifold assumption [15, 5], which has been widely adopted in the last decade, states that the predictor function lives on a low dimensional manifold of the marginal distribution.\n\nMulti-task learning was proposed to enhance generalization performance by learning multiple related tasks simultaneously. The abundant literature on multi-task learning demonstrates that the learning performance indeed improves when the tasks are related [4, 6, 7]. The key step in MTL is to find the shared information among tasks. Evgeniou et al. [12] proposed a regularization MTL framework which assumes all tasks are related and close to each other. Ando and Zhang [2] proposed a structural learning framework, which assumed that the predictors for different tasks share a common structure on the underlying predictor space. An alternating structure optimization (ASO) method was proposed for linear predictors, which assumed that the task parameters share a low dimensional subspace. Agarwal et al. [1] generalized the idea of sharing a subspace by assuming that all task parameters lie on a manifold.\n\n(a) A parallel field on R^2\n\n(b) A parallel field on the Swiss roll\n\nFigure 1: Examples of parallel fields. 
The parallel field on R^2 spans a one-dimensional subspace and the parallel field on the Swiss roll spans a two-dimensional subspace.\n\nIn this paper, we consider semi-supervised multi-task learning (SSMTL). Although many SSL methods have been proposed in the literature [10], these methods are often not directly amenable to MTL extensions [18]. Liu et al. [18] proposed an SSMTL framework which encourages related models to have similar parameters; however, it requires that related tasks share similar representations [9]. Wang et al. [19] proposed another SSMTL method under the assumption that the tasks are clustered [4, 14]. The cluster structure is characterized by the task parameters of linear predictor functions. For linear predictors, the task parameters they use are actually the constant gradients of the predictor functions, which form a first order differential structure. For general nonlinear predictor functions, we show that it is more natural to capture the shared differential structure using vector fields.\n\nIn this paper, we propose a novel SSMTL formulation using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. In this way, a vector field naturally characterizes the differential structure of functions while also providing a natural way to exploit the geometric structure of data; these are the two most important aspects for SSMTL. Based on this idea, we develop the multi-task vector field learning (MTVFL) method, which learns the prediction functions and the vector fields simultaneously. The vector fields we learn are forced to be close to the gradient fields of the predictor functions. In each task, the vector field is required to be as parallel as possible. We say that a vector field is parallel if the vectors are parallel along the geodesics on the manifold. 
In the extreme case when the manifold is a linear (or affine) space, the geodesics of the manifold are straight lines, and the space spanned by these parallel vectors is simply a one-dimensional subspace. Thus, when the manifold is flat (i.e., with zero curvature) or the curvature is small, it is expected that these parallel vectors concentrate on a low dimensional subspace. As an example, we can see from Fig. 1 that the parallel field on the plane spans a one-dimensional subspace and the parallel field on the Swiss roll spans a two-dimensional subspace. For the multi-task case, we further assume that the vector fields from all tasks share a low dimensional subspace. In essence, we use a first-order differential structure to characterize the shared structure of tasks and use a second-order differential structure to characterize the specific parts of tasks. We formalize our idea in a regularization framework and provide a convex relaxation method to solve the original non-convex problem. We have performed experiments using both synthetic and real data; the results demonstrate the effectiveness of our proposed approach.\n\n2 Multi-task Learning: A Vector Field Approach\n\nIn this section, we first introduce vector fields and then present multi-task learning via exploring shared structure using vector fields.\n\n2.1 Multi-task Learning Setting and Vector Fields\n\nWe first introduce notation and symbols. We are given m tasks, with n_l samples x_i^l, i = 1, ..., n_l, for the l-th task. The total number of samples is n = ∑_l n_l. 
For the l-th task, we assume the data {x_i^l} lie on a d_l-dimensional manifold M_l. All of these data manifolds are embedded in the same D-dimensional ambient space R^D. It is worth noting that the dimensions of different data manifolds are not required to be the same. Without loss of generality, we assume the first n'_l (n'_l < n_l) samples are labeled, with y_j^l ∈ R for regression and y_j^l ∈ {−1, 1} for classification, j = 1, ..., n'_l. The total number of labeled samples is n' = ∑_l n'_l. For the l-th task, we denote the regression or classification function by f*_l. The goal of semi-supervised multi-task learning is to learn the function values on the unlabeled data, i.e., f*_l(x_i^l) for n'_l + 1 ≤ i ≤ n_l.\n\nGiven the l-th task, we first construct a nearest neighbor graph by either ε-neighborhoods or k nearest neighbors. Let x_i^l ∼ x_j^l denote that x_i^l and x_j^l are neighbors, and let w_ij^l denote the weight which measures the similarity between x_i^l and x_j^l; it can be approximated by the heat kernel weight or the simple 0-1 weight. For each point x_i^l, we estimate its tangent space T_{x_i^l}M_l by performing PCA on its neighborhood. We choose the largest d_l eigenvectors as the bases, since the tangent space T_{x_i^l}M_l has the same dimension as the manifold M_l. Let T_i^l ∈ R^{D×d_l} be the matrix whose columns constitute an orthonormal basis for T_{x_i^l}M_l. It is easy to show that P_i^l = T_i^l (T_i^l)ᵀ is the unique orthogonal projection from R^D onto the tangent space T_{x_i^l}M_l [13]; that is, for any vector a ∈ R^D, we have P_i^l a ∈ T_{x_i^l}M_l and (a − P_i^l a) ⊥ P_i^l a.
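The local PCA step described above can be sketched as follows (an illustrative sketch with our own variable names; the neighborhood here is a toy stand-in rather than a real k-nearest-neighbor graph):

```python
import numpy as np

def tangent_basis(X, nbr_idx, i, d):
    """Estimate an orthonormal basis T_i of the tangent space at X[i]
    by PCA on its neighborhood: the top-d principal directions."""
    nbrs = X[nbr_idx[i]]                    # neighbors of point i, shape (k, D)
    centered = nbrs - nbrs.mean(axis=0)     # center the local neighborhood
    U, _, _ = np.linalg.svd(centered.T, full_matrices=False)
    return U[:, :d]                         # D x d matrix with orthonormal columns

# toy data: points scattered near the xy-plane in R^3
rng = np.random.default_rng(0)
X = np.c_[rng.normal(size=(20, 2)), 1e-3 * rng.normal(size=(20, 1))]
nbr_idx = [list(range(20))] * 20            # toy stand-in for a k-NN neighborhood
T_i = tangent_basis(X, nbr_idx, 0, 2)
P_i = T_i @ T_i.T                           # orthogonal projection onto the tangent space
a = rng.normal(size=3)
assert np.allclose(T_i.T @ T_i, np.eye(2))      # orthonormal basis
assert np.allclose(P_i @ P_i, P_i)              # projection is idempotent
assert abs((a - P_i @ a) @ (P_i @ a)) < 1e-6    # (a - P a) is orthogonal to P a
```

The assertions check exactly the projection properties stated in the text: T_i has orthonormal columns and P_i = T_i T_iᵀ projects any ambient vector onto the estimated tangent space.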
We now formally define the vector field and show how to represent it in the discrete case.\n\nDefinition 2.1 ([16]). A vector field X on the manifold M is a continuous map X : M → TM, where TM is the set of tangent spaces, written as p ↦ X_p, with the property that for each p ∈ M, X_p is an element of T_pM.\n\nWe can think of a vector field on the manifold in the same way as we think of a vector field in Euclidean space: an arrow with a given magnitude and direction attached to each point on the manifold, chosen to be tangent to the manifold. A vector field V on the manifold is called a gradient field if there exists a function f on the manifold such that ∇f = V, where ∇ is the covariant derivative on the manifold. Gradient fields are thus one kind of vector field; they play a critical role in connecting vector fields and functions.\n\nLet V_l be a vector field on the manifold M_l. For each point x_i^l, let V_{x_i^l} denote the value of the vector field V_l at x_i^l. By the definition of a vector field, V_{x_i^l} should be a vector in the tangent space T_{x_i^l}M_l. Therefore, we can represent it in the coordinates of the tangent space as V_{x_i^l} = T_i^l v_i^l, where v_i^l ∈ R^{d_l} is the local representation of V_{x_i^l} with respect to T_i^l. Let f_l be a function on the manifold M_l. Abusing notation without confusion, we also use f_l to denote the vector f_l = (f_l(x_1^l), ..., f_l(x_{n_l}^l))ᵀ, and use V_l to denote the vector V_l = ((v_1^l)ᵀ, ..., (v_{n_l}^l)ᵀ)ᵀ ∈ R^{d_l n_l}; that is, V_l is a d_l n_l-dimensional column vector which concatenates all the v_i^l's for a fixed l. For each task, we aim to compute the vectors f_l and V_l.\n\n2.2 Multi-task Vector Field Learning\n\nIn this section, we introduce multi-task vector field learning (MTVFL). Many existing MTL methods capture the task relatedness by sharing task parameters. 
For linear predictors, the task parameters they use are actually the constant gradient vectors of the predictor functions. For general nonlinear predictor functions, we show that it is natural to capture the shared differential structure using vector fields. Let f denote the vector (f_1ᵀ, ..., f_mᵀ)ᵀ and V denote the vector (V_1ᵀ, ..., V_mᵀ)ᵀ = ((v_1^1)ᵀ, ..., (v_{n_m}^m)ᵀ)ᵀ. We propose to learn f and V simultaneously:\n\n• The vector field V_l should be close to the gradient field ∇f_l of f_l, which can be formalized as follows:\n\nmin_{f,V} R_1(f, V) = ∑_{l=1}^m R_1(f_l, V_l) := ∑_{l=1}^m ∫_{M_l} ‖∇f_l − V_l‖².   (1)\n\n• The vector field V_l should be as parallel as possible:\n\nmin_V R_2(V) = ∑_{l=1}^m R_2(V_l) := ∑_{l=1}^m ∫_{M_l} ‖∇V_l‖²_HS,   (2)\n\nwhere ∇ is the covariant derivative on the manifold and ‖·‖_HS denotes the Hilbert-Schmidt tensor norm [11]. ∇V_l measures the change of the vector field, so minimizing ∫_{M_l} ‖∇V_l‖²_HS enforces the vector field V_l to be parallel.\n\n• All vector fields share an h-dimensional subspace, where h is a predefined parameter:\n\nT_i^l v_i^l = u_i^l + Θᵀ w_i^l,  s.t. ΘΘᵀ = I_{h×h}.   (3)\n\nSince these vector fields are assumed to share a low dimensional subspace, it is expected that the residual vector u_i^l is small. 
We define another term R_3 to control the complexity as follows:\n\nR_3(v_i^l, w_i^l, Θ) = ∑_{l=1}^m ∑_{i=1}^{n_l} ( α‖u_i^l‖² + β‖T_i^l v_i^l‖² )   (4)\n\n= ∑_{l=1}^m ∑_{i=1}^{n_l} ( α‖T_i^l v_i^l − Θᵀ w_i^l‖² + β‖T_i^l v_i^l‖² ).   (5)\n\nNote that α and β are pre-specified coefficients indicating the importance of the corresponding regularization components. Since we would like the vector field to be parallel, the vector norm is not expected to be too small; moreover, since we assume the vector fields share a low dimensional subspace, the residual vector u_i^l is expected to be small. In practice, we suggest using a small β and a large α. By setting β = 0, R_3 reduces to the regularization term proposed in ASO if we also replace the tangent vectors by the task parameters; therefore, this formulation is a generalization of ASO.\n\nIt can be verified that w_i^{l*} = arg min_{w_i^l} R_3(v_i^l, w_i^l, Θ) = Θ T_i^l v_i^l. 
Thus we have u_i^l = T_i^l v_i^l − Θᵀ w_i^{l*} = (I − ΘᵀΘ) T_i^l v_i^l. Therefore, we can rewrite R_3 as follows:\n\nR_3(V, Θ) = ∑_{l=1}^m ∑_{i=1}^{n_l} ( α‖u_i^l‖² + β‖T_i^l v_i^l‖² ) = ∑_{l=1}^m ∑_{i=1}^{n_l} ( α‖(I − ΘᵀΘ) T_i^l v_i^l‖² + β‖T_i^l v_i^l‖² )   (6)\n\n= α Vᵀ A_Θ V + β Vᵀ H V,\n\nwhere H is a block diagonal matrix with diagonal blocks (T_i^l)ᵀ T_i^l, and A_Θ is another block diagonal matrix with diagonal blocks (T_i^l)ᵀ (I − ΘᵀΘ)ᵀ (I − ΘᵀΘ) T_i^l.\n\nTherefore, the proposed formulation solves the following optimization problem:\n\narg min_{f,V,Θ} E(f, V, Θ) = R_0(f) + λ_1 R_1(f, V) + λ_2 R_2(V) + λ_3 R_3(V, Θ),  s.t. ΘΘᵀ = I_{h×h},   (7)\n\nwhere R_0(f) is the loss function. For simplicity, we use the quadratic loss R_0(f) = (1/n') ∑_{l=1}^m ∑_{i=1}^{n'_l} (f_l(x_i^l) − y_i^l)².\n\n2.3 Objective Function in the Matrix Form\n\nTo simplify Eq. (7), in this section we rewrite our objective function in matrix form. Using the discrete methods in [17], we have the following discrete form equations:\n\nR_1(f_l, V_l) = ∑_{i∼j} w_ij^l ( (x_j^l − x_i^l)ᵀ T_i^l v_i^l − f_j^l + f_i^l )²,   (8)\n\nR_2(V_l) = ∑_{i∼j} w_ij^l ‖ P_i^l T_j^l v_j^l − T_i^l v_i^l ‖².   (9)\n\nInterestingly, with some algebraic transformations, we have the following matrix forms for the objective functions:\n\nR_1(f_l, V_l) = 2 f_lᵀ L_l f_l + V_lᵀ G_l V_l − 2 V_lᵀ C_l f_l,   (10)\n\nwhere L_l is the graph Laplacian matrix, G_l is a d_l n_l × d_l n_l block diagonal matrix, and C_l = [(C_1^l)ᵀ, ..., (C_{n_l}^l)ᵀ]ᵀ is a d_l n_l × n_l block matrix. 
Denote the i-th d_l × d_l diagonal block of G_l by G_ii^l and the i-th d_l × n_l block of C_l by C_i^l; we have\n\nG_ii^l = ∑_{j∼i} w_ij^l (T_i^l)ᵀ (x_j^l − x_i^l)(x_j^l − x_i^l)ᵀ T_i^l,  C_i^l = ∑_{j∼i} w_ij^l (T_i^l)ᵀ (x_j^l − x_i^l)(s_ij^l)ᵀ,   (11)\n\nwhere s_ij^l ∈ R^{n_l} is a selection vector whose elements are all zero except for the i-th element being −1 and the j-th element being 1. And R_2 becomes\n\nR_2(V_l) = V_lᵀ B_l V_l,   (12)\n\nwhere B_l is a d_l n_l × d_l n_l sparse block matrix. If we index each d_l × d_l block by B_ij^l, then we have\n\nB_ii^l = ∑_{j∼i} w_ij^l ( Q_ij^l (Q_ij^l)ᵀ + I ),   (13)\n\nB_ij^l = −2 w_ij^l Q_ij^l if x_i^l ∼ x_j^l, and 0 otherwise,   (14)\n\nwhere Q_ij^l = (T_i^l)ᵀ T_j^l. It is worth noting that both R_1 and R_2 depend on the tangent spaces T_i^l.\n\nThus we can further write R_1(f, V) and R_2(V) as follows:\n\nR_1(f, V) = ∑_{l=1}^m R_1(f_l, V_l) = 2 fᵀ L f + Vᵀ G V − 2 Vᵀ C f,   (15)\n\nR_2(V) = ∑_{l=1}^m R_2(V_l) = Vᵀ B V,   (16)\n\nwhere L, G and B are block diagonal matrices whose l-th diagonal blocks are L_l, G_l and B_l, respectively, and C is a column block matrix whose l-th block is C_l.\n\nLet I denote the n × n diagonal matrix with I_ii = 1 if the i-th data point is labeled and I_ii = 0 otherwise, and let y ∈ R^n be the column vector whose i-th element is the label of the i-th data point if it is labeled and 0 otherwise. Then R_0(f) = (1/n') (f − y)ᵀ I (f − y).
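The equivalence between the summation forms (8)-(9) and the matrix forms (10)-(14) can be checked numerically for a single task. The sketch below uses a dense synthetic graph, random orthonormal tangent bases, and our own variable names; it builds both forms and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, d = 6, 3, 2
X = rng.normal(size=(n, D))                     # toy points in ambient space
W = rng.uniform(0.1, 1.0, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)                        # symmetric weights, fully connected toy graph
T = [np.linalg.qr(rng.normal(size=(D, d)))[0] for _ in range(n)]  # orthonormal bases T_i
f = rng.normal(size=n)
V = rng.normal(size=n * d)
v = [V[i*d:(i+1)*d] for i in range(n)]

# summation forms, Eqs. (8)-(9), summed over ordered pairs i ~ j
R1_sum = sum(W[i, j] * ((X[j] - X[i]) @ T[i] @ v[i] - f[j] + f[i]) ** 2
             for i in range(n) for j in range(n))
R2_sum = sum(W[i, j] * np.linalg.norm(T[i] @ T[i].T @ T[j] @ v[j] - T[i] @ v[i]) ** 2
             for i in range(n) for j in range(n))

# matrix forms, Eqs. (10)-(14)
Lap = np.diag(W.sum(axis=1)) - W                # graph Laplacian
G = np.zeros((n*d, n*d)); C = np.zeros((n*d, n)); B = np.zeros((n*d, n*d))
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        e = (X[j] - X[i])[:, None]              # edge vector, D x 1
        G[i*d:(i+1)*d, i*d:(i+1)*d] += W[i, j] * T[i].T @ e @ e.T @ T[i]
        s = np.zeros(n); s[i], s[j] = -1.0, 1.0     # selection vector s_ij
        C[i*d:(i+1)*d, :] += W[i, j] * (T[i].T @ e) @ s[None, :]
        Q = T[i].T @ T[j]                           # Q_ij = T_i^T T_j
        B[i*d:(i+1)*d, i*d:(i+1)*d] += W[i, j] * (Q @ Q.T + np.eye(d))
        B[i*d:(i+1)*d, j*d:(j+1)*d] += -2 * W[i, j] * Q
R1_mat = 2 * f @ Lap @ f + V @ G @ V - 2 * V @ C @ f
R2_mat = V @ B @ V
assert np.isclose(R1_sum, R1_mat)               # Eq. (8) equals Eq. (10)
assert np.isclose(R2_sum, R2_mat)               # Eq. (9) equals Eq. (12)
```

The check relies on the weights being symmetric, which is the case for both the heat kernel and the 0-1 weights mentioned in Section 2.1.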
Finally, we get the following matrix form for our objective function in Eq. (7), with the constraint ΘΘᵀ = I_{h×h}:\n\nE(f, V, Θ) = R_0(f) + λ_1 R_1(f, V) + λ_2 R_2(V) + λ_3 R_3(V, Θ)\n\n= (1/n')(f − y)ᵀ I (f − y) + λ_1 ( 2 fᵀ L f + Vᵀ G V − 2 Vᵀ C f ) + λ_2 Vᵀ B V + λ_3 Vᵀ (α A_Θ + β H) V\n\n= (1/n')(f − y)ᵀ I (f − y) + 2 λ_1 fᵀ L f + Vᵀ ( λ_1 G + λ_2 B + λ_3 (α A_Θ + β H) ) V − 2 λ_1 Vᵀ C f.\n\nIt is worth noting that the matrices L, G, B and C depend only on the data; only the matrix A_Θ depends on Θ.\n\n3 Optimization\n\nIn this section, we discuss how to solve the following optimization problem:\n\narg min_{f,V,Θ} E(f, V, Θ),  s.t. ΘΘᵀ = I_{h×h}.   (17)\n\nWe use alternating optimization to solve this problem.\n\n• Optimization of f and V. For a fixed Θ, the optimal f and V can be obtained by solving\n\narg min_{f,V} E(f, V, Θ).   (18)\n\n• Optimization of Θ. For a fixed V, the optimal Θ can be obtained by solving\n\narg min_Θ R_3(V, Θ),  s.t. ΘΘᵀ = I_{h×h}.   (19)\n\n3.1 Optimization of f and V for a Given Θ\n\nWhen Θ is fixed, the objective function is similar to that of the single task case. However, there are some differences we would like to mention. Firstly, when constructing the nearest neighbor graph, data points from different tasks are disconnected; therefore, when estimating tangent spaces, data points from different tasks are independent. 
Secondly, we do not require the dimensions of the tangent spaces from different tasks to be the same.\n\nWe note that\n\n∂E/∂f = 2 ( (1/n') I + 2 λ_1 L ) f − 2 λ_1 Cᵀ V − (2/n') y,   (20)\n\n∂E/∂V = −2 λ_1 C f + 2 ( λ_1 G + λ_2 B + λ_3 (α A_Θ + β H) ) V.   (21)\n\nRequiring the derivatives to vanish, we get the following linear system:\n\n[ (1/n') I + 2 λ_1 L ,  −λ_1 Cᵀ ; −λ_1 C ,  λ_1 G + λ_2 B + λ_3 (α A_Θ + β H) ] [ f ; V ] = [ (1/n') y ; 0 ].   (22)\n\nExcept for the matrix A_Θ, all of the matrices can be computed in advance and do not change during the iterative process.\n\n3.2 Optimization of Θ for a Given V\n\nSince the functions R_0(f), R_1(f, V) and R_2(V) do not involve the variable Θ, we only need to optimize R_3(V, Θ) subject to ΘΘᵀ = I_{h×h}. Recalling Eq. (6), we rewrite R_3(V, Θ) as follows:\n\nΘ̂ = arg min_Θ ∑_{l=1}^m ∑_{i=1}^{n_l} ( α‖(I − ΘᵀΘ) T_i^l v_i^l‖² + β‖T_i^l v_i^l‖² )\n\n= arg min_Θ α tr( 𝒱ᵀ ( (1 + β/α) I − ΘᵀΘ ) 𝒱 )\n\n= arg max_Θ tr( Θ 𝒱 𝒱ᵀ Θᵀ ),   (23)\n\nwhere 𝒱 = (T_1^1 v_1^1, ..., T_{n_m}^m v_{n_m}^m) is a D × n matrix whose columns are the tangent vectors. The optimal Θ̂ can be obtained by singular value decomposition (SVD): let 𝒱 = Z_1 Σ Z_2ᵀ be the SVD of 𝒱, with the singular values in Σ in decreasing order; then the rows of Θ̂ are given by the first h columns of Z_1.
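Putting Sections 3.1 and 3.2 together, one alternating round can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the paper's implementation: the data-dependent matrices L, G, B and C are random positive definite stand-ins rather than matrices assembled from a real neighbor graph, every point is treated as labeled (n' = n), and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(7)
n, D, d, h = 8, 4, 2, 2            # points, ambient dim, manifold dim, shared-subspace dim
lam1, lam2, lam3, alpha, beta = 1.0, 0.1, 0.5, 1.0, 0.01

def spd(m):                        # random symmetric positive definite stand-in
    A = rng.normal(size=(m, m))
    return A @ A.T + m * np.eye(m)

L, G, B = spd(n), spd(n * d), spd(n * d)   # stand-ins for the graph-dependent matrices
C = rng.normal(size=(n * d, n))
T = [np.linalg.qr(rng.normal(size=(D, d)))[0] for _ in range(n)]  # tangent bases T_i
y = rng.normal(size=n)                      # labels; every point labeled here, n' = n
Theta = np.linalg.qr(rng.normal(size=(D, h)))[0].T                # initial h x D basis

for _ in range(3):                 # a few alternating rounds
    # A_Theta and H are block diagonal, Eq. (6): blocks
    # T_i^T (I - Theta^T Theta)^T (I - Theta^T Theta) T_i and T_i^T T_i
    P = np.eye(D) - Theta.T @ Theta
    A_Th = np.zeros((n * d, n * d))
    H = np.zeros((n * d, n * d))
    for i in range(n):
        A_Th[i*d:(i+1)*d, i*d:(i+1)*d] = T[i].T @ P.T @ P @ T[i]
        H[i*d:(i+1)*d, i*d:(i+1)*d] = T[i].T @ T[i]
    # step 1 (Sec. 3.1): solve the linear system (22) for f and V with Theta fixed
    top = np.hstack([np.eye(n) / n + 2 * lam1 * L, -lam1 * C.T])
    bot = np.hstack([-lam1 * C, lam1 * G + lam2 * B + lam3 * (alpha * A_Th + beta * H)])
    z = np.linalg.solve(np.vstack([top, bot]), np.concatenate([y / n, np.zeros(n * d)]))
    f, V = z[:n], z[n:]
    # step 2 (Sec. 3.2): with V fixed, update Theta from the SVD of the D x n
    # tangent-vector matrix whose columns are T_i v_i
    Vmat = np.column_stack([T[i] @ V[i*d:(i+1)*d] for i in range(n)])
    Z1, _, _ = np.linalg.svd(Vmat, full_matrices=False)
    Theta = Z1[:, :h].T

assert np.allclose(Theta @ Theta.T, np.eye(h))  # constraint Theta Theta^T = I holds
```

By construction the SVD update always returns a Θ with orthonormal rows, so the constraint in Eq. (17) is satisfied after every round.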
3.3 Convex Relaxation\n\nThe orthogonality constraint in Eq. (23) is non-convex. Next, we convert Eq. (23) into a convex formulation by relaxing its feasible domain into a convex set.\n\nLet η = β/α. It can be verified that the following equality holds: (1 + η) I − ΘᵀΘ = η(1 + η)(η I + ΘᵀΘ)⁻¹. Then we can rewrite R_3(V, Θ) as R_3(V, Θ) = α η (1 + η) tr( 𝒱ᵀ (η I + ΘᵀΘ)⁻¹ 𝒱 ).\n\nLet M_e be defined as M_e = {M : M = ΘᵀΘ, ΘΘᵀ = I, Θ ∈ R^{h×D}}. The convex hull [8] of M_e can be expressed as the convex set M_c given by M_c = {M : tr(M) = h, M ⪯ I, M ∈ S₊^D}, and each element of M_e is an extreme point of M_c.\n\nTo convert the non-convex problem in Eq. (23) into a convex formulation, we replace ΘᵀΘ with M and relax the feasible domain into a convex set based on the relationship between M_e and M_c presented above; this results in the optimization problem\n\narg min_M R_3(𝒱, M),  s.t. tr(M) = h, M ⪯ I, M ∈ S₊^D,   (24)\n\nwhere R_3(𝒱, M) is defined as R_3(𝒱, M) = α η (1 + η) tr( 𝒱ᵀ (η I + M)⁻¹ 𝒱 ). It follows from [3, Theorem 3.1] that the relaxed R_3 is jointly convex in 𝒱 and M. After we obtain the optimal M, the optimal Θ can be approximated using the first h eigenvectors (corresponding to the largest h eigenvalues) of the optimal M.\n\n4 Experiments\n\nIn this section, we evaluate our method on one synthetic data set and one real data set. We compare the proposed Multi-Task Vector Field Learning (MTVFL) algorithm against the following methods: (a) Single Task Vector Field Learning (STVFL, or PFR), (b) Alternating Structure Optimization (ASO) and (c) its nonlinear version, Kernelized Alternating Structure Optimization (KASO). The kernel constructed in KASO uses both labeled data and unlabeled data. 
Thus it can be viewed as a semi-supervised MTL method.\n\n4.1 Synthetic Data\n\n(a) MSE\n\n(b) Singular value distribution\n\nFigure 2: (a) Performance of MTVFL and STVFL; (b) The singular value distribution.\n\nWe first construct synthetic data to evaluate our method in comparison with the semi-supervised single task learning method (STVFL). We generate two data sets, a Swiss roll and a Swiss roll with a hole, embedded in 3-dimensional Euclidean space. The Swiss roll is generated by the equations x = t_1 cos t_1, y = t_2, z = t_1 sin t_1, where t_1 ∈ [3π/2, 9π/2] and t_2 ∈ [0, 21]. The Swiss roll with a hole excludes the points with t_1 ∈ [9, 12] and t_2 ∈ [9, 14]. The ground truth function is f(x, y, z) = t_1. This test is a semi-supervised multi-task regression problem: we randomly select a number of labeled data in each task and try to predict the values on the remaining unlabeled data.\n\nEach data set has 400 points. We construct a nearest neighbor graph for each task. The number of nearest neighbors is set to 5 and the manifold dimension is set to 2, as both data sets are 2-dimensional manifolds. The shared subspace dimension is set to 2. The regularization parameters are chosen via cross-validation. We perform 100 independent trials with randomly selected labeled sets. The performance is measured by the mean squared error (MSE). We also tried ASO and KASO; however, they perform poorly since the data are highly nonlinear. The MSE averaged over the two tasks is presented in Fig. 2. We can observe that MTVFL consistently outperforms STVFL, which demonstrates the effectiveness of SSMTL.
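The synthetic data above can be generated directly from the stated equations (a sketch; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
t1 = rng.uniform(3 * np.pi / 2, 9 * np.pi / 2, size=n)   # t1 in [3*pi/2, 9*pi/2]
t2 = rng.uniform(0.0, 21.0, size=n)                      # t2 in [0, 21]
X = np.c_[t1 * np.cos(t1), t2, t1 * np.sin(t1)]          # Swiss roll embedded in R^3
f_true = t1                                              # ground truth f(x, y, z) = t1

# the second task, the Swiss roll with a hole, excludes a rectangle in parameter space
keep = ~((t1 > 9) & (t1 < 12) & (t2 > 9) & (t2 < 14))
X_hole, f_hole = X[keep], f_true[keep]

# sanity check: on the roll, sqrt(x^2 + z^2) recovers the target t1
assert np.allclose(np.hypot(X[:, 0], X[:, 2]), f_true)
assert X.shape == (400, 3) and X_hole.shape[0] < 400
```

The final assertion just confirms the parameterization: since x = t_1 cos t_1 and z = t_1 sin t_1 with t_1 > 0, the radius √(x² + z²) equals the regression target t_1.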
We also show the singular value distribution of the ground truth gradient fields. Given the ground truth f, we can compute the gradient field V by taking the derivative of R_1(f, V) with respect to V; requiring the derivative to vanish yields the equation G V = C f. After obtaining V, the gradient vector at each point can be obtained as V_{x_i^l} = T_i^l v_i^l. We then perform PCA on these vectors; the singular values of the covariance matrix of the V_{x_i^l} are shown in Fig. 2 (b). As can be seen from Fig. 2 (b), the number of dominant singular values is 2, which indicates that the ground truth gradient fields concentrate on a 2-dimensional subspace.\n\n4.2 Landmine Detection\n\nWe use the landmine data set studied in [20].1 There are in total 29 data sets collected from various real landmine fields. Each data example is represented by a 9-dimensional feature vector with a binary label, which is either 1 for landmine or 0 for clutter. The problem of landmine detection is to predict the labels of the unlabeled objects. Among the 29 data sets, sets 1-15 correspond to relatively highly foliated regions and sets 16-29 correspond to bare earth or desert regions. Following [20], we choose data sets 1-10 and 16-24 to form 19 tasks.\n\n1 The data set is available at http://www.ee.duke.edu/~lcarin/LandmineData.zip.\n\n(a) Averaged AUC\n\n(b) Singular value distribution\n\nFigure 3: (a) Performance of various MTL algorithms; (b) The singular value distribution.\n\nThe basic setup of all the algorithms is as follows. First, we construct a nearest neighbor graph for each task. The number of nearest neighbors is set to 10 and the manifold dimension is set to 4 empirically; these two parameters are the same for all 19 tasks. The shared subspace dimension is set to 5 for both MTVFL and ASO, and the shared subspace dimension of KASO is set to 10. All the regularization parameters for the four algorithms are chosen via cross-validation. Note that KASO needs to construct a kernel matrix. 
We use a Gaussian kernel in KASO, and the Gaussian width is tuned by searching within [0.01, 10].\n\nWe perform 100 independent trials with randomly selected labeled sets. We measure the performance by the AUC, i.e., the area under the Receiver Operating Characteristic (ROC) curve; a larger AUC value indicates better classification performance. Since the data have severely unbalanced labels, following [20], we use a setting that ensures there is at least one “1” and one “0” labeled sample in the training set of each task. The AUC averaged over the 19 tasks is presented in Fig. 3 (a). As can be seen, MTVFL consistently outperforms the other three algorithms. When the number of labeled data increases, KASO outperforms STVFL. ASO does not improve much as the amount of labeled data increases, probably because the data have severely unbalanced labels and the ground truth predictor function is nonlinear. We also show the singular value distribution of the ground truth gradient fields in Fig. 3 (b); the computation of the singular values is the same as in Section 4.1. As can be seen from Fig. 3 (b), the number of dominant singular values is 5, and the first 5 singular values account for 91.34% of the total sum, which indicates that the ground truth gradient fields concentrate on a 5-dimensional subspace.\n\n5 Conclusion\n\nIn this paper, we propose a new semi-supervised multi-task learning formulation using vector fields. We show that vector fields can naturally capture the shared differential structure among tasks as well as the structure of the data manifolds. Our experimental results on synthetic and real data demonstrate the effectiveness of the proposed method. There are several interesting directions suggested in this work. 
One is the relation between learning on task parameters and learning on vector fields; ultimately, both of them are learning functions. Another is to apply other assumptions made in the multi-task learning community to vector field learning, e.g., the cluster assumption.\n\nAcknowledgments\n\nThis work was supported by the National Natural Science Foundation of China under Grants 61125203, 61233011 and 90920303, the National Basic Research Program of China (973 Program) under Grant 2012CB316404, the Fundamental Research Funds for the Central Universities under Grant 2011FZA5022, NIH (R01 LM010730) and NSF (IIS-0953662, CCF-1025177).\n\nReferences\n\n[1] A. Agarwal, H. Daumé III, and S. Gerber. Learning multiple tasks using manifold regularization. In Advances in Neural Information Processing Systems 23, pages 46–54, 2010.\n\n[2] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.\n\n[3] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems 20, pages 25–32, 2008.\n\n[4] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83–99, 2003.\n\n[5] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, December 2006.\n\n[6] S. Ben-David, J. Gehrke, and R. Schuller. A theoretical framework for learning from a pool of disparate data sources. 
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 443–449, 2002.\n\n[7] S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. In Conference on Learning Theory, pages 567–580, 2003.\n\n[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[9] A. Carlson, J. Betteridge, R. C. Wang, E. R. Hruschka, Jr., and T. M. Mitchell. Coupled semi-supervised learning for information extraction. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 101–110, 2010.\n\n[10] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.\n\n[11] A. Defant and K. Floret. Tensor Norms and Operator Ideals. North-Holland Mathematics Studies, North-Holland, Amsterdam, 1993.\n\n[12] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.\n\n[13] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.\n\n[14] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems 21, pages 745–752, 2009.\n\n[15] J. Lafferty and L. Wasserman. Statistical analysis of semi-supervised regression. In Advances in Neural Information Processing Systems 20, pages 801–808, 2007.\n\n[16] J. M. Lee. Introduction to Smooth Manifolds. Springer Verlag, New York, 2nd edition, 2003.\n\n[17] B. Lin, C. Zhang, and X. He. Semi-supervised regression via parallel field regularization. In Advances in Neural Information Processing Systems 24, pages 433–441, 2011.\n\n[18] Q. Liu, X. Liao, and L. Carin. Semi-supervised multitask learning. In Advances in Neural Information Processing Systems 20, pages 937–944, 2008.\n\n[19] F. Wang, X. 
Wang, and T. Li. Semi-supervised multi-task learning with task regularizations. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, pages 562–568. IEEE Computer Society, 2009.\n\n[20] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35–63, 2007.", "award": [], "sourceid": 146, "authors": [{"given_name": "Binbin", "family_name": "Lin", "institution": null}, {"given_name": "Sen", "family_name": "Yang", "institution": null}, {"given_name": "Chiyuan", "family_name": "Zhang", "institution": null}, {"given_name": "Jieping", "family_name": "Ye", "institution": null}, {"given_name": "Xiaofei", "family_name": "He", "institution": null}]}