{"title": "Multi-Task Learning via Conic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 737, "page_last": 744, "abstract": "When we have several related tasks, solving them simultaneously is shown to be more effective than solving them individually. This approach is called multi-task learning (MTL) and has been studied extensively. Existing approaches to MTL often treat all the tasks as uniformly related to each other and the relatedness of the tasks is controlled globally. For this reason, the existing methods can lead to undesired solutions when some tasks are not highly related to each other, and some pairs of related tasks can have significantly different solutions. In this paper, we propose a novel MTL algorithm that can overcome these problems. Our method makes use of a task network, which describes the relation structure among tasks. This allows us to deal with intricate relation structures in a systematic way. Furthermore, we control the relatedness of the tasks locally, so all pairs of related tasks are guaranteed to have similar solutions. We apply the above idea to support vector machines (SVMs) and show that the optimization problem can be cast as a second order cone program, which is convex and can be solved efficiently. 
The usefulness of our approach is demonstrated through simulations with protein super-family classification and ordinal regression problems.", "full_text": "Multi-Task Learning via Conic Programming\n\nTsuyoshi Kato, Hisashi Kashima, Masashi Sugiyama, Kiyoshi Asai\n\nGraduate School of Frontier Sciences, The University of Tokyo;\nInstitute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST);\nTokyo Research Laboratory, IBM Research;\nDepartment of Computer Science, Tokyo Institute of Technology;\nAIST Computational Biology Research Center\n\nkato-tsuyoshi@cb.k.u-tokyo.ac.jp, kashi pong@yahoo.co.jp, sugi@cs.titech.ac.jp, asai@cbrc.jp\n\nAbstract\n\nWhen we have several related tasks, solving them simultaneously is shown to be more effective than solving them individually. This approach is called multi-task learning (MTL) and has been studied extensively. Existing approaches to MTL often treat all the tasks as uniformly related to each other and the relatedness of the tasks is controlled globally. For this reason, the existing methods can lead to undesired solutions when some tasks are not highly related to each other, and some pairs of related tasks can have significantly different solutions. In this paper, we propose a novel MTL algorithm that can overcome these problems. Our method makes use of a task network, which describes the relation structure among tasks. This allows us to deal with intricate relation structures in a systematic way. Furthermore, we control the relatedness of the tasks locally, so all pairs of related tasks are guaranteed to have similar solutions. 
We apply the above idea to support vector machines (SVMs) and show that the optimization problem can be cast as a second order cone program, which is convex and can be solved efficiently. The usefulness of our approach is demonstrated through simulations with protein super-family classification and ordinal regression problems.\n\n1 Introduction\n\nIn many practical situations, a classification task can often be divided into related sub-tasks. Since the related sub-tasks tend to share common factors, solving them together is expected to be more advantageous than solving them independently. This approach is called multi-task learning (MTL, a.k.a. inductive transfer or learning to learn) and has been theoretically and experimentally shown to be useful [4, 5, 8].\n\nTypically, the \u2018relatedness\u2019 among tasks is implemented as imposing the solutions of related tasks to be similar (e.g. [5]). However, the MTL methods developed so far have several limitations. First, it is often assumed that all sub-tasks are related to each other [5]. However, this may not always be true in practice\u2014some are related but others may not be. The second problem is that the related tasks are often imposed to be close in the sense that the sum of the distances between solutions over all pairs of related tasks is upper-bounded [8] (which is often referred to as the global constraint [10]). This implies that all the solutions of related tasks are not necessarily close, but some can be quite different.\n\nIn this paper, we propose a new MTL method which overcomes the above limitations. We settle the first issue by making use of a task network that describes the relation structure among tasks. This enables us to deal with intricate relation structures in a systematic way. 
We solve the second problem by directly upper-bounding each distance between the solutions of related task pairs (which we call local constraints).\n\nWe apply this idea in the framework of support vector machines (SVMs) and show that linear SVMs can be trained via a second order cone program (SOCP) [3] in the primal. An SOCP is a convex problem and the global solution can be computed efficiently. We further show that the kernelized version of the proposed method can be formulated as a matrix-fractional program (MFP) [3] in the dual, which can again be cast as an SOCP; thus the optimization problem of the kernelized variant is still convex and the global solution can be computed efficiently. Through experiments with artificial and real-world protein super-family classification data sets, we show that the proposed MTL method compares favorably with existing MTL methods.\n\nWe further test the performance of the proposed approach in ordinal regression scenarios [9], where the goal is to predict ordinal class labels such as users\u2019 preferences (\u2018like\u2019/\u2018neutral\u2019/\u2018dislike\u2019) or students\u2019 grades (from \u2018A\u2019 to \u2018F\u2019). The ordinal regression problems can be formulated as a set of one-versus-one classification problems, e.g., \u2018like\u2019 vs. \u2018neutral\u2019 and \u2018neutral\u2019 vs. \u2018dislike\u2019. In ordinal regression, the relatedness among tasks is highly structured. That is, the solutions (decision boundaries) of adjacent problems are expected to be similar, but others may not be related, e.g., \u2018A\u2019 vs. \u2018B\u2019 and \u2018B\u2019 vs. \u2018C\u2019 would be related, but \u2018A\u2019 vs. \u2018B\u2019 and \u2018E\u2019 vs. \u2018F\u2019 may not be. 
Our experiments demonstrate that the proposed method is also useful in the ordinal regression scenarios and tends to outperform existing approaches [9, 8].\n\n2 Problem Setting\n\nIn this section, we formulate the MTL problem. Let us consider M binary classification tasks, which all share the common input-output space X \u00d7 {\u00b11}. For the time being, we assume X \u2282 R^d for simplicity; later in Section 4, we extend it to reproducing kernel Hilbert spaces. Let {(x_t, y_t)}_{t=1}^{\u2113} be the training set, where x_t \u2208 X and y_t \u2208 {\u00b11} for t = 1, . . . , \u2113. Each data sample (x_t, y_t) has its target task; we denote the set of sample indices of the i-th task by I_i. We assume that each sample belongs only to a single task, i.e., the index sets are exclusive: \u2211_{i=1}^{M} |I_i| = \u2113 and I_i \u2229 I_j = \u2205, \u2200i \u2260 j. The goal is to learn the score function of each classification task: f_i(x; w_i, b_i) = w_i^T x + b_i, for i = 1, . . . , M, where w_i \u2208 R^d and b_i \u2208 R are the model parameters of the i-th task. We assume that a task network is available. The task network describes the relationships among tasks, where each node represents a task and two nodes are connected by an edge if they are related to each other.^1 We denote the edge set by E \u2261 {(i_k, j_k)}_{k=1}^{K}.\n\n3 Local MTL with Task Network: Linear Version\n\nIn this section, we propose a new MTL method.\n\n3.1 Basic Idea\n\nWhen the relation among tasks is not available, we may just solve M penalized fitting problems individually:\n\nmin_{w_i, b_i} (1/2)||w_i||^2 + C_\u03b1 \u2211_{t \u2208 I_i} Hinge(f_i(x_t; w_i, b_i), y_t), for i = 1, . . . , M,  (1)\n\nwhere C_\u03b1 \u2208 R_+ is a regularization constant and Hinge(\u00b7,\u00b7) is the hinge loss function: Hinge(f, y) \u2261 max(1 \u2212 f y, 0). 
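The hinge loss and the per-task objective (1) are simple enough to evaluate directly. A minimal numpy sketch (function names are ours, not from the paper):

```python
import numpy as np

def hinge(f, y):
    """Hinge loss from the text: Hinge(f, y) = max(1 - f*y, 0), elementwise."""
    return np.maximum(1.0 - f * y, 0.0)

def individual_objective(w, b, X, y, C_alpha):
    """Objective (1) for a single task: 0.5*||w||^2 plus C_alpha times the
    sum of hinge losses, with the linear score f(x; w, b) = w^T x + b."""
    f = X @ w + b
    return 0.5 * float(w @ w) + C_alpha * float(hinge(f, y).sum())
```

With w = 0 and b = 0 every sample incurs a hinge loss of 1, so the objective equals C_alpha times the number of samples, which makes the sketch easy to sanity-check.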
This individual approach tends to perform poorly if the number of training samples in each task is limited\u2014the performance is expected to be improved if more training samples are available. Here, we can exploit the information of the task network. A naive idea would be to use the training samples of neighboring tasks in the task network for solving the target fitting problem. However, this does not fully make use of the network structure since there are many other indirectly connected tasks via some paths on the network.\n\n^1 More generally, the tasks can be related in an inhomogeneous way, i.e., the strength of the relationship among tasks can be dependent on tasks. This general setting can be similarly formulated by a weighted network, where edges are weighted according to the strength of the connections. All the discussions in this paper can be easily extended to weighted networks, but for simplicity we focus on unweighted networks.\n\nTo cope with this problem, we take another approach here, which is based on the expectation that the solutions of related tasks are close to each other. More specifically, we impose the following constraint on the optimization problem (1):\n\n(1/2)||w_{i_k} \u2212 w_{j_k}||^2 \u2264 \u03c1, for \u2200k = 1, . . . , K.  (2)\n\nNamely, we upper-bound each difference between the solutions of related task pairs by a positive scalar \u03c1 \u2208 R_+. We refer to this constraint as the local constraint following [10]. Note that we do not impose a constraint on the bias parameter b_i since the bias could be significantly different even among related tasks. 
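Constraint (2) is easy to check for a candidate solution. A small sketch (the helper name is ours) that lists the edges of the task network whose pairs violate the local bound:

```python
import numpy as np

def local_constraint_violations(W, edges, rho):
    """Return the edges (i, j) for which the local constraint (2),
    0.5 * ||w_i - w_j||^2 <= rho, fails. W is an (M, d) array whose
    rows are the task weight vectors; edges is a list of index pairs."""
    return [(i, j) for (i, j) in edges
            if 0.5 * float(np.sum((W[i] - W[j]) ** 2)) > rho]
```

Every related pair is bounded individually, so a single distant pair is reported even when all the others are close.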
The constraint (2) allows us to implicitly increase the number of training samples over the task network in a systematic way through the solutions of related tasks.\n\nFollowing the convention [8], we blend Eqs.(1) and (2) as\n\n(1/(2M)) \u2211_{i=1}^{M} ||w_i||^2 + C_\u03b1 \u2211_{i=1}^{M} \u2211_{t \u2208 I_i} Hinge(f_i(x_t; \u03b8), y_t) + C_\u03c1 \u03c1,  (3)\n\nwhere C_\u03c1 is a positive trade-off parameter. Then our optimization problem is summarized as follows:\n\nProblem 1.\n\nmin (1/(2M)) \u2211_{i=1}^{M} ||w_i||^2 + C_\u03b1 ||\u03be^\u03b1||_1 + C_\u03c1 \u03c1, wrt. w \u2208 R^{Md}, b \u2208 R^M, \u03be^\u03b1 \u2208 R^\u2113_+, and \u03c1 \u2208 R_+,\nsubj. to (1/2)||w_{i_k} \u2212 w_{j_k}||^2 \u2264 \u03c1, \u2200k, and y_t (w_i^T x_t + b_i) \u2265 1 \u2212 \u03be^\u03b1_t, \u2200t \u2208 I_i, \u2200i,  (4)\nwhere w \u2261 [w_1^T, . . . , w_M^T]^T and \u03be^\u03b1 \u2261 [\u03be^\u03b1_1, . . . , \u03be^\u03b1_\u2113]^T.\n\n3.2 Primal MTL Learning by SOCP\n\nThe second order cone program (SOCP) is a class of convex programs of minimizing a linear function over an intersection of second-order cones [3]:^2\n\nProblem 2.\n\nmin f^T z, wrt. z \u2208 R^n, subj. to ||A_i z + b_i|| \u2264 c_i^T z + d_i, for i = 1, . . . , N,  (5)\nwhere f \u2208 R^n, A_i \u2208 R^{(n_i\u22121)\u00d7n}, b_i \u2208 R^{n_i\u22121}, c_i \u2208 R^n, d_i \u2208 R.\n\nLinear programs, quadratic programs, and quadratically-constrained quadratic programs are actually special cases of SOCPs. SOCPs are a sub-class of semidefinite programs (SDPs) [3], but SOCPs can be solved more efficiently than SDPs. Successful optimization algorithms for both SDP and SOCP are interior-point algorithms. The SDP solvers (e.g. 
[2]) consume O(n^2 \u2211_i n_i^2) time complexity for solving Problem 2, but the SOCP-specialized solvers that directly solve Problem 2 take only O(n^2 \u2211_i n_i) computation [7]. Thus, SOCPs can be solved more efficiently than SDPs.\n\nWe can show that Problem 1 is cast as an SOCP using hyperbolic constraints [3].\n\nTheorem 1. Problem 1 can be reduced to an SOCP and it can be solved with O((Md + \u2113)^2 (Kd + \u2113)) computation.\n\n4 Local MTL with Task Network: Kernelization\n\nThe previous section showed that a linear version of the proposed MTL method can be cast as an SOCP. In this section, we show how the kernel trick could be employed for obtaining a non-linear variant.\n\n^2 More generally, an SOCP can include linear equality constraints, but they can be eliminated, for example, by some projection method.\n\n4.1 Dual Formulation\n\nLet K^fea be a positive semidefinite matrix with the (s, t)-th element being the inner-product of feature vectors x_s and x_t: K^fea_{s,t} \u2261 \u27e8x_s, x_t\u27e9. This is a kernel matrix of feature vectors. We also introduce a kernel among tasks. Using a new K-dimensional non-negative parameter vector \u03bb \u2208 R^K_+, we define the kernel matrix of tasks by\n\nK^net(\u03bb) \u2261 ((1/M) I_M + U_\u03bb)^{\u22121},\n\nwhere U_\u03bb \u2261 \u2211_{k=1}^{K} \u03bb_k U_k, U_k \u2261 E^{(i_k,i_k)} + E^{(j_k,j_k)} \u2212 E^{(i_k,j_k)} \u2212 E^{(j_k,i_k)}, and E^{(i,j)} \u2208 R^{M\u00d7M} is the sparse matrix whose (i, j)-th element is one and all the others are zero. Note that this is the graph Laplacian kernel [11], where the k-th edge is weighted according to \u03bb_k. Let Z \u2208 N^{M\u00d7\u2113} be the indicator of a task and a sample such that Z_{i,t} = 1 if t \u2208 I_i and Z_{i,t} = 0 otherwise. Then the information about the tasks is expressed by the \u2113 \u00d7 \u2113 kernel matrix Z^T K^net(\u03bb) Z. 
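The task kernel can be built directly from the edge list. A numpy sketch (function name ours) of Knet(lam) = (I_M/M + U_lam)^{-1}:

```python
import numpy as np

def task_kernel(M, edges, lam):
    """Knet(lam) = (I_M/M + U_lam)^{-1}, where U_lam = sum_k lam_k * U_k is
    the graph Laplacian of the task network with edge weights lam_k."""
    U = np.zeros((M, M))
    for (i, j), l in zip(edges, lam):
        U[i, i] += l
        U[j, j] += l
        U[i, j] -= l
        U[j, i] -= l
    return np.linalg.inv(np.eye(M) / M + U)
```

With all weights zero the kernel reduces to M times the identity; for any non-negative weights I_M/M + U_lam is positive definite (Laplacian plus a scaled identity), so the inverse exists and is itself symmetric positive definite.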
We integrate the two kernel matrices K^fea and Z^T K^net(\u03bb) Z by\n\nK^int(\u03bb) \u2261 K^fea \u25e6 (Z^T K^net(\u03bb) Z),  (6)\n\nwhere \u25e6 denotes the Hadamard product (a.k.a. element-wise product). This parameterized matrix K^int(\u03bb) is guaranteed to be positive semidefinite [6]. Based on the above notations, the dual formulation of Problem 1 can be expressed using the parameterized integrated kernel matrix K^int(\u03bb) as follows:\n\nProblem 3.\n\nmin (1/2) \u03b1^T diag(y) K^int(\u03bb) diag(y) \u03b1 \u2212 ||\u03b1||_1, wrt. \u03b1 \u2208 R^\u2113_+ and \u03bb \u2208 R^K_+, subj. to \u03b1 \u2264 C_\u03b1 1_\u2113, Z diag(y) \u03b1 = 0_M, ||\u03bb||_1 \u2264 C_\u03c1.  (7)\n\nWe note that the solutions \u03b1 and \u03bb tend to be sparse due to the \u2113_1 norm. Changing the definition of K^fea from the linear kernel to an arbitrary kernel, we can extend the proposed linear MTL method to non-linear domains. Furthermore, we can also deal with non-vectorial structured data by employing a suitable kernel such as the string kernel and the Fisher kernel. In the test stage, a new sample x in the j-th task is classified by\n\nf_j(x) = \u2211_{t=1}^{\u2113} \u2211_{i=1}^{M} \u03b1_t y_t k^fea(x_t, x) k^net(i, j) Z_{i,t} + b_j,  (8)\n\nwhere k^fea(\u00b7,\u00b7) and k^net(\u00b7,\u00b7) are the kernel functions over features and tasks, respectively.\n\n4.2 Dual MTL Learning by SOCP\n\nHere, we show that the above dual problem can also be reduced to an SOCP. To this end, we first introduce a matrix-fractional program (MFP) [7]:\n\nProblem 4.\n\nmin (F z + g)^T P(z)^{\u22121} (F z + g), wrt. z \u2208 R^p_+, subj. to P(z) \u2261 P_0 + \u2211_{i=1}^{p} z_i P_i \u2208 S^n_{++},\nwhere P_i \u2208 S^n_+, F \u2208 R^{n\u00d7p}, and g \u2208 R^n. Here S^n_+ and S^n_{++} denote the positive semidefinite cone and the strictly positive definite cone of n \u00d7 n matrices, respectively.\n\nLet us re-define d as the rank of the feature kernel matrix K^fea. We introduce a matrix V^fea \u2208 R^{\u2113\u00d7d} which decomposes the feature kernel matrix as K^fea = V^fea (V^fea)^T. Define the \u2113-dimensional vectors f_h \u2208 R^\u2113 of the h-th feature as V^fea \u2261 [f_1, . . . , f_d] \u2208 R^{\u2113\u00d7d} and the matrices F_h \u2261 Z diag(f_h \u25e6 y), for h = 1, . . . , d. Using those variables, the objective function in Problem 3 can be rewritten as\n\nJ_D = (1/2) \u2211_{h=1}^{d} \u03b1^T F_h^T ((1/M) I_M + U_\u03bb)^{\u22121} F_h \u03b1 \u2212 \u03b1^T 1_\u2113.  (9)\n\nThis implies that Problem 3 can be transformed into the combination of a linear program and d MFPs. Let us further introduce the vector v_k \u2208 R^M for each edge: v_k = e_{i_k} \u2212 e_{j_k}, where e_{i_k} is a unit vector with the i_k-th element being one. Let V^lap be a matrix defined by V^lap = [v_1, . . . , v_K] \u2208 R^{M\u00d7K}. Then we can re-express the graph Laplacian matrix of tasks as U_\u03bb = V^lap diag(\u03bb) (V^lap)^T. Given the fact that an MFP can be reduced to an SOCP [7], we can reduce Problem 3 to the following SOCP:\n\nProblem 5.\n\nmin \u22121_\u2113^T \u03b1 + (1/2) \u2211_{h=1}^{d} (s_{0,h} + s_{1,h} + \u00b7\u00b7\u00b7 + s_{K,h}),  (10)\nwrt. s_{0,h} \u2208 R, s_{k,h} \u2208 R, u_{0,h} \u2208 R^M, u_h = [u_{1,h}, . . . , u_{K,h}]^T \u2208 R^K, \u03bb \u2208 R^K_+, \u03b1 \u2208 R^\u2113_+,  (11)\nsubj. to \u03b1 \u2264 C_\u03b1 1_\u2113, Z diag(y) \u03b1 = 0_M,  (12)\n1_K^T \u03bb \u2264 C_\u03c1,  (13)\nM^{\u22121/2} u_{0,h} + V^lap u_h = F_h \u03b1, \u2200h,  (14)\n|| [2u_{k,h}; s_{k,h} \u2212 \u03bb_k] || \u2264 s_{k,h} + \u03bb_k, \u2200k, \u2200h, and || [2u_{0,h}; s_{0,h} \u2212 1] || \u2264 s_{0,h} + 1, \u2200h.  (15)\n\nConsequently, we obtain the following result:\n\nTheorem 2. The dual problem of CoNs learning (Problem 3) can be reduced to the SOCP in Problem 5 and it can be solved with O((Kd + \u2113)^2 ((M + K)d + \u2113)) computation.\n\n5 Discussion\n\nIn this section, we discuss the properties of the proposed MTL method and the relation to existing methods.\n\nMTL with Common Bias A possible variant of the proposed MTL method would be to share the common bias parameter with all tasks (i.e. b_1 = b_2 = \u00b7\u00b7\u00b7 = b_M). The idea is expected to be useful particularly when the number of samples in each task is very small. We can also apply the common bias idea in the kernelized version just by replacing the constraint Z diag(y) \u03b1 = 0_M in Problem 3 by y^T \u03b1 = 0.\n\nGlobal vs. Local Constraints Micchelli and Pontil [8] have proposed a related MTL method which upper-bounds the sum of the differences of the K related task pairs, i.e., (1/2) \u2211_{k=1}^{K} ||w_{i_k} \u2212 w_{j_k}||^2 \u2264 \u03c1. We call it the global constraint. 
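The contrast between the two constraint types can be made concrete with a toy numerical check (the distance values are invented for illustration):

```python
import numpy as np

# Distances d_k = 0.5 * ||w_ik - w_jk||^2 over K = 4 related task pairs
# (values invented for illustration).
d = np.array([0.05, 0.05, 0.05, 1.85])
rho = 2.0

global_ok = bool(d.sum() <= rho)     # global constraint: only the sum is bounded
local_ok = bool(np.all(d <= 0.5))    # local constraint (2) with a per-edge bound of 0.5

# The global constraint accepts this solution although one related pair is
# far apart; the per-edge local constraint rejects it.
```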
This global constraint can also have a similar effect to our local constraint (2), i.e., the related task pairs tend to have close solutions. However, the global constraint can allow some of the distances to be large since only the sum is upper-bounded. This actually causes a significant performance degradation in practice, which will be experimentally demonstrated in Section 6. We note that the idea of local constraints is also used in the kernel learning problem [10].\n\nRelation to Standard SVMs By construction, the proposed MTL method includes the standard SVM learning algorithm as a special case. Indeed, when the number of tasks is one, Problem 3 is reduced to the standard SVM optimization problem. Thus, the proposed method may be regarded as a natural extension of SVMs.\n\nOrdinal Regression As we mentioned in Section 1, MTL approaches are useful in ordinal regression problems. Ordinal regression is a task of learning multiple quantiles, which can be formulated as a set of one-versus-one classification problems. A naive approach to ordinal regression is to individually train M SVMs with score functions f_i(x) = \u27e8w_i, x\u27e9 + b_i, i = 1, . . . , M. 
Shashua and Levin [9] proposed an ordinal regression method called the support vector ordinal regression (SVOR), where the weight vectors are shared by all SVMs (i.e. w_1 = w_2 = \u00b7\u00b7\u00b7 = w_M) and only the bias parameter is learned individually.\n\n[Figure 1: Toy multi classification tasks. Each subfigure contains the 10-th, 30-th, 50-th, 70-th, and 90-th tasks in the top row and the 110-th, 130-th, 150-th, 170-th, and 190-th tasks in the bottom row. Panels: (a) True classification boundaries; (b) IL-SVMs; (c) MTL-SVM(global/full); (d) MTL-SVM(local/network).]\n\nThe proposed MTL method can be naturally employed in ordinal regression by constraining the weight vectors as ||w_i \u2212 w_{i+1}||^2 \u2264 \u03c1, i = 1, . . . , M \u2212 1, i.e., the task network only has an edge between consecutive tasks. This method actually includes the above two ordinal regression approaches as special cases\u2014C_\u03c1 = 0 (i.e., ignoring the task network) yields the independent training of SVMs and C_\u03c1 = \u221e (i.e., the weight vectors of all SVMs agree) is reduced to SVOR. 
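For M one-versus-one subtasks, the ordinal-regression task network is simply a chain. A sketch (helper name ours, 0-based indices):

```python
def chain_task_network(M):
    """Edges of the ordinal-regression task network: only consecutive
    decision boundaries are related, giving the constraints
    ||w_i - w_{i+1}||^2 <= rho for neighboring subtasks."""
    return [(i, i + 1) for i in range(M - 1)]
```

Setting C_rho = 0 makes these edges inactive (independent SVMs), while C_rho = infinity forces all weight vectors to agree, recovering SVOR.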
Thus, in the context of ordinal regression, the proposed method smoothly bridges two extremes and allows us to control the belief of task constraints.\n\n6 Experiments\n\nIn this section, we show the usefulness of the proposed method through experiments.\n\n6.1 Toy Multiple Classification Tasks\n\nFirst, we illustrate how the proposed method behaves using a 2-dimensional toy data set, which includes 200 tasks (see Figure 1(a)). Each task possesses a circular-shaped classification boundary with different centers and a fixed radius 0.5. The location of the center in the i-th task is (\u22121 + 0.02(i \u2212 1), 0) for 1 \u2264 i \u2264 100 and (0, \u22121 + 0.02(i \u2212 101)) for 101 \u2264 i \u2264 200. For each task, only two positive and two negative samples are generated following the uniform distribution. We construct a task network where consecutive tasks are connected in a circular manner, i.e., (1, 2), (2, 3), . . ., (99, 100), and (100, 1) for the first 100 tasks and (101, 102), (102, 103), . . ., (199, 200), and (200, 1) for the last 100 tasks; we further add (50, 150), which connects the clusters of the first 100 and the last 100 nodes.\n\nWe compare the following methods: a naive method where 200 SVMs are trained individually (individually learned SVM, \u2018IL-SVM\u2019), the MTL-SVM algorithm where the global constraint and the fully connected task network are used [5] (\u2018MTL-SVM(global/full)\u2019), and the proposed method which uses local constraints and the properly defined task network (\u2018MTL-SVM(local/network)\u2019).\n\nThe results are exhibited in Figure 1, showing that IL-SVM cannot capture the circular shape due to the small sample size in each task. MTL-SVM(global/full) can successfully capture closed-loop boundaries by making use of the information from other tasks. However, the result is still not so reliable since non-consecutive unrelated tasks heavily damage the solutions. 
On the other hand, MTL-SVM(local/network) nicely captures the circular boundaries and the results are highly reliable. Thus, given an appropriate task network, the proposed MTL-SVM(local/network) can effectively exploit information of the related tasks.\n\nTable 1: The accuracy of each method in the protein super-family classification task.\n\nDataset | IL-SVM | One-SVM | MTL-SVM(global/full) | MTL-SVM(global/network) | MTL-SVM(local/network)\nd-f | 0.908 (0.023) | 0.941 (0.015) | 0.945 (0.013) | 0.933 (0.017) | 0.952 (0.015)\nd-s | 0.638 (0.067) | 0.722 (0.030) | 0.698 (0.036) | 0.695 (0.032) | 0.747 (0.020)\nd-o | 0.725 (0.032) | 0.747 (0.017) | 0.748 (0.021) | 0.749 (0.023) | 0.764 (0.028)\nf-s | 0.891 (0.036) | 0.886 (0.021) | 0.918 (0.020) | 0.911 (0.022) | 0.918 (0.025)\nf-o | 0.792 (0.046) | 0.819 (0.029) | 0.834 (0.021) | 0.828 (0.015) | 0.838 (0.018)\ns-o | 0.663 (0.034) | 0.695 (0.034) | 0.692 (0.050) | 0.663 (0.068) | 0.703 (0.036)\n\n6.2 Protein Super-Family Classification\n\nNext, we test the performance of the proposed method with real-world protein super-family classification problems.\n\nThe input data are amino acid sequences from the SCOP database [1] (not SOCP). We counted 2-mers for extraction of feature vectors. There are 20 kinds of amino acids. Hence, the number of features is 20^2 = 400. We use RBF kernels, where the kernel width \u03c3^2_rbf is set to the average of the squared distances to the fifth nearest neighbors. Each data set consists of two folds. Each fold is divided into several super-families. We here consider the classification problem into the super-families. A positive class is chosen from one fold, and a negative class is chosen from the other fold. We perform multi-task learning from all the possible combinations. For example, three super-families are in DNA/RNA binding, and two in SH3. The number of combinations is 3 \u00b7 2 = 6. So the data set d-s has the six binary classification tasks. 
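The 2-mer feature extraction described above is straightforward to reproduce; a sketch using the standard 20-letter amino acid alphabet:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard amino acids
KMERS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
INDEX = {k: i for i, k in enumerate(KMERS)}   # 20^2 = 400 features

def two_mer_features(seq):
    """Count vector of overlapping 2-mers of an amino acid sequence."""
    counts = [0] * len(KMERS)
    for a, b in zip(seq, seq[1:]):
        counts[INDEX[a + b]] += 1
    return counts
```

A sequence of length n contributes n - 1 overlapping 2-mers, and the resulting 400-dimensional count vector is what the RBF kernel above would be computed on.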
We used four folds: DNA/RNA binding, Flavodoxin, OB-fold and SH3. From these folds, we generate six data sets: d-f, d-s, d-o, f-s, f-o, and s-o, where the fold names are abbreviated to d, f, o, and s, respectively.\n\nThe task networks are constructed as follows: if the positive super-family or the negative super-family is common to two tasks, the two tasks are regarded as a related task pair and connected by an edge. We compare the proposed MTL-SVM(local/network) with IL-SVM, \u2018One-SVM\u2019, MTL-SVM(global/full), and MTL-SVM(global/network). One-SVM regards the multiple tasks as one big task and learns the big task once by a standard SVM. We set C_\u03b1 = 1 for all the approaches. The value of the parameter C_\u03c1 for the three MTL-SVM approaches is determined by cross-validation over the training set. We randomly pick ten training sequences from each super-family, and use them for training. We compute the classification accuracies on the remaining test sequences. We repeat this procedure 10 times and take the average of the accuracies.\n\nThe results are described in Table 1, showing that the proposed MTL-SVM(local/network) compares favorably with the other methods. In this simulation, the task network is constructed rather heuristically. Even so, the proposed MTL-SVM(local/network) is shown to significantly outperform MTL-SVM(global/full), which does not use the network structure. This implies that the proposed method still works well even when the task network contains small errors. It is interesting to note that MTL-SVM(global/network) actually does not work well in this simulation, implying that the task relatedness is not properly controlled by the global constraint. Thus the use of the local constraints would be effective in MTL scenarios.\n\n6.3 Ordinal Regression\n\nAs discussed in Section 5, MTL methods are useful in ordinal regression. 
Here we create five ordinal regression data sets described in Table 2, where all the data sets are originally regression and the output values are divided into five quantiles. Therefore, the overall task can be divided into four isolated classification tasks, each of which estimates a quantile. We compare MTL-SVM(local/network) with IL-SVM, SVOR [9] (see Section 5), MTL-SVM(global/full) and MTL-SVM(global/network). The value of the parameter C_\u03c1 for the three MTL-SVM approaches is determined by cross-validation over the training set. We set C_\u03b1 = 1 for all the approaches. We use RBF kernels, where the parameter \u03c3^2_rbf is set to the average of the squared distances to the fifth nearest neighbors. We randomly picked 200 samples for training. The remaining samples are used for evaluating the classification accuracies.\n\nTable 2: The accuracy of each method in ordinal regression tasks.\n\nData set | IL-SVM | SVOR | MTL-SVM(global/full) | MTL-SVM(global/network) | MTL-SVM(local/network)\npumadyn | 0.643 (0.007) | 0.661 (0.006) | 0.629 (0.025) | 0.645 (0.018) | 0.661 (0.007)\nstock | 0.894 (0.012) | 0.878 (0.011) | 0.872 (0.010) | 0.888 (0.010) | 0.902 (0.007)\nbank-8fh | 0.781 (0.003) | 0.777 (0.006) | 0.772 (0.006) | 0.773 (0.006) | 0.779 (0.002)\nbank-8fm | 0.854 (0.004) | 0.845 (0.010) | 0.832 (0.013) | 0.847 (0.009) | 0.854 (0.009)\ncalihouse | 0.648 (0.003) | 0.642 (0.008) | 0.640 (0.005) | 0.646 (0.007) | 0.650 (0.004)\n\nThe averaged performance over five runs is described in Table 2, showing that the proposed MTL-SVM(local/network) is also promising in ordinal regression scenarios.\n\n7 Conclusions\n\nIn this paper, we proposed a new multi-task learning method, which overcomes the limitation of existing approaches by making use of a task network and local constraints. 
We demonstrated through simulations that the proposed method is useful in multi-task learning scenarios; moreover, it also works excellently in ordinal regression scenarios.\n\nThe standard SVMs have a variety of extensions and have been combined with various techniques, e.g., one-class SVMs, SV regression, and the \u03bd-trick. We expect that such extensions and techniques can also be applied similarly to the proposed method. Other possible future works include the elucidation of the entire regularization path and the application to learning from multiple networks; developing algorithms for learning probabilistic models with a task network is also a promising direction to be explored.\n\nAcknowledgments\n\nThis work was partially supported by a Grant-in-Aid for Young Scientists (B), number 18700287, from the Ministry of Education, Culture, Sports, Science and Technology, Japan.\n\nReferences\n\n[1] A. Andreeva, D. Howorth, S. E. Brenner, T. J. P. Hubbard, C. Chothia, and A. G. Murzin. SCOP database in 2004: refinements integrate structure and sequence family data. Nucl. Acid Res., 32:D226\u2013D229, 2004.\n[2] B. Borchers. CSDP, a C library for semidefinite programming. Optimization Methods and Software, 11(1):613\u2013623, 1999.\n[3] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[4] R. Caruana. Multitask learning. Machine Learning, 28(1):41\u201375, 1997.\n[5] T. Evgeniou and M. Pontil. Regularized multitask learning. In Proc. of 17-th SIGKDD Conf. on Knowledge Discovery and Data Mining, 2004.\n[6] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, July 1999.\n[7] M. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284:193\u2013228, 1998.\n[8] C. A. Micchelli and M. Pontil. Kernels for multi-task learning. In Lawrence K. 
Saul, Yair Weiss, and L\u00e9on Bottou, editors, Advances in Neural Information Processing Systems 17, pages 921\u2013928, Cambridge, MA, 2005. MIT Press.\n[9] A. Shashua and A. Levin. Ranking with large margin principle: two approaches. In Advances in Neural Information Processing Systems 15, pages 937\u2013944, Cambridge, MA, 2003. MIT Press.\n[10] K. Tsuda and W.S. Noble. Learning kernels from biological networks by maximizing entropy. Bioinformatics, 20(Suppl. 1):i326\u2013i333, 2004.\n[11] X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In Lawrence K. Saul, Yair Weiss, and L\u00e9on Bottou, editors, Advances in Neural Information Processing Systems 17, Cambridge, MA, 2004. MIT Press.", "award": [], "sourceid": 445, "authors": [{"given_name": "Tsuyoshi", "family_name": "Kato", "institution": null}, {"given_name": "Hisashi", "family_name": "Kashima", "institution": null}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": null}, {"given_name": "Kiyoshi", "family_name": "Asai", "institution": null}]}