{"title": "Clustered Multi-Task Learning Via Alternating Structure Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 702, "page_last": 710, "abstract": "Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i.e., tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. 
We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms.", "full_text": "Clustered Multi-Task Learning Via Alternating Structure Optimization

Jiayu Zhou, Jianhui Chen, Jieping Ye
Computer Science and Engineering
Arizona State University
Tempe, AZ 85287
{jiayu.zhou, jianhui.chen, jieping.ye}@asu.edu

Abstract

Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real-world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i.e., tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient, especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms.

1 Introduction

Many real-world problems involve multiple related classification/regression tasks. 
A naive approach is to apply single task learning (STL), where each task is solved independently and the task relatedness is thus not exploited. Recently, there has been growing interest in multi-task learning (MTL), where we learn multiple related tasks simultaneously by extracting appropriate shared information across tasks. In MTL, multiple tasks are expected to benefit from each other, resulting in improved generalization performance. The effectiveness of MTL has been demonstrated empirically [1, 2, 3] and theoretically [4, 5, 6]. MTL has been applied in many applications including biomedical informatics [7], marketing [1], natural language processing [2], and computer vision [3].

Many different MTL approaches have been proposed in the past; they differ in how the relatedness among different tasks is modeled. Evgeniou et al. [8] proposed the regularized MTL, which constrained the models of all tasks to be close to each other. The task relatedness can also be modeled by constraining multiple tasks to share a common underlying structure [4, 6, 9, 10]. Ando and Zhang [5] proposed a structural learning formulation, which assumed that the predictors for different tasks share a common structure on the underlying predictor space. For linear predictors, they proposed alternating structure optimization (ASO), which simultaneously performs inference on multiple tasks and discovers the shared low-dimensional predictive structure. ASO has been shown to be effective in many practical applications [2, 11, 12]. One limitation of the original ASO formulation is that it involves a non-convex optimization problem, and a globally optimal solution is not guaranteed. A convex relaxation of ASO, called CASO, was proposed and analyzed in [13].

Many existing MTL formulations are based on the assumption that all tasks are related. 
In practical applications, the tasks may exhibit a more sophisticated group structure, where the models of tasks from the same group are closer to each other than those from a different group. There has been much prior work along this line of research, known as clustered multi-task learning (CMTL). In [14], the mutual relatedness of tasks was estimated, and knowledge of one task could be transferred to other tasks in the same cluster. Bakker and Heskes [15] used clustered multi-task learning in a Bayesian setting by considering a mixture of Gaussians instead of single Gaussian priors. Evgeniou et al. [8] proposed the task clustering regularization and showed how cluster information could be encoded in MTL; however, the group structure was required to be known a priori. Xue et al. [16] introduced a Dirichlet process prior which automatically identified subgroups of related tasks. In [17], a clustered MTL framework was proposed that simultaneously identified clusters and performed multi-task inference. Because the formulation is non-convex, they also proposed a convex relaxation to obtain a global optimum [17]. Wang et al. [18] used a similar idea to consider clustered tasks by introducing an inter-task regularization.

The objective in CMTL differs from that of many MTL formulations (e.g., ASO, which aims to identify a shared low-dimensional predictive structure for all tasks) that are based on the standard assumption that each task can learn equally well from any other task. In this paper, we study the inherent relationship between these two seemingly different MTL formulations. 
Specifically, we establish the equivalence relationship between ASO and a specific formulation of CMTL, which performs simultaneous multi-task learning and task clustering. First, we show that CMTL performs clustering on the tasks, while ASO performs projection on the features to find a shared low-rank structure. Next, we show that the spectral relaxation of the clustering (on tasks) in CMTL and the projection (on the features) in ASO lead to an identical regularization, related to the negative Ky Fan k-norm of the weight matrix involving all task models, thus establishing their equivalence relationship. The presented analysis provides significant new insights into ASO and CMTL as well as their inherent relationship. To the best of our knowledge, the clustering view of ASO has not been explored before.

One major limitation of the ASO/CMTL formulation is that it involves a non-convex optimization problem, as the negative Ky Fan k-norm is concave. We propose a convex relaxation of CMTL and establish the equivalence relationship between the proposed convex relaxation of CMTL and the convex ASO formulation proposed in [13]. We show that the proposed convex CMTL formulation is significantly more efficient, especially for high-dimensional data. We further develop three algorithms for solving the convex CMTL formulation based on block coordinate descent, accelerated projected gradient, and gradient descent, respectively. We have conducted experiments on benchmark datasets including School and Sarcos; our results demonstrate the efficiency of the proposed algorithms.

Notation: Throughout this paper, R^d denotes the d-dimensional Euclidean space. I denotes the identity matrix of a proper size. N denotes the set of natural numbers. S^m_+ denotes the set of symmetric positive semi-definite matrices of size m by m. A ⪯ B means that B − A is positive semi-definite. 
tr(X) denotes the trace of X.

2 Multi-Task Learning: ASO and CMTL

Assume we are given a multi-task learning problem with m tasks; each task i ∈ N_m is associated with a set of training data {(x_1^i, y_1^i), ..., (x_{n_i}^i, y_{n_i}^i)} ⊂ R^d × R and a linear predictive function f_i: f_i(x_j^i) = w_i^T x_j^i, where w_i is the weight vector of the i-th task, d is the data dimensionality, and n_i is the number of samples of the i-th task. We denote W = [w_1, ..., w_m] ∈ R^{d×m} as the weight matrix to be estimated. Given a loss function ℓ(·,·), the empirical risk is given by:

L(W) = Σ_{i=1}^{m} (1/n_i) Σ_{j=1}^{n_i} ℓ(w_i^T x_j^i, y_j^i).

We study the following multi-task learning formulation: min_W L(W) + Ω(W), where Ω encodes our prior knowledge about the m tasks. Next, we review ASO and CMTL and explore their inherent relationship.

2.1 Alternating structure optimization

In ASO [5], all tasks are assumed to share a common feature space Θ ∈ R^{h×d}, where h ≤ min(m, d) is the dimensionality of the shared feature space and Θ has orthonormal rows, i.e., ΘΘ^T = I_h. The predictive function of ASO is: f_i(x_j^i) = w_i^T x_j^i = u_i^T x_j^i + v_i^T Θ x_j^i, where the weight w_i = u_i + Θ^T v_i consists of two components: the weight u_i for the high-dimensional feature space and the weight v_i for the low-dimensional space based on Θ. ASO minimizes the following objective function: L(W) + α Σ_{i=1}^{m} ||u_i||_2^2, subject to ΘΘ^T = I_h, where α is the regularization parameter for task relatedness. We can further improve the formulation by including a penalty, β Σ_{i=1}^{m} ||w_i||_2^2, to improve the generalization performance as in traditional supervised learning. 
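As a concrete illustration, the empirical risk L(W) defined above can be computed in a few lines of NumPy (a minimal sketch assuming the squared loss; the data layout and names are our own, not part of the original formulation):

```python
import numpy as np

def empirical_risk(W, X_list, y_list):
    """L(W) = sum_i (1/n_i) sum_j loss(w_i^T x_ij, y_ij), here with squared loss."""
    risk = 0.0
    for i, (X, y) in enumerate(zip(X_list, y_list)):
        residual = X @ W[:, i] - y          # predictions minus targets for task i
        risk += np.mean(residual ** 2)      # (1/n_i) * sum_j squared loss
    return risk

# two toy tasks with d = 3 features
rng = np.random.default_rng(0)
X_list = [rng.standard_normal((5, 3)), rng.standard_normal((4, 3))]
y_list = [np.ones(5), np.ones(4)]
W = np.zeros((3, 2))                        # all-zero weights predict 0 everywhere
print(empirical_risk(W, X_list, y_list))    # -> 2.0 (each task contributes mean((0-1)^2) = 1)
```

Any smooth loss could be substituted for the squared loss without changing the structure of the computation.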
Since u_i = w_i − Θ^T v_i, we obtain the following ASO formulation:

min_{W, {v_i}, Θ: ΘΘ^T = I_h}  L(W) + Σ_{i=1}^{m} ( α||w_i − Θ^T v_i||_2^2 + β||w_i||_2^2 ).    (1)

2.2 Clustered multi-task learning

In CMTL, we assume that the tasks are clustered into k < m clusters, and the index set of the j-th cluster is defined as I_j = {v | v ∈ cluster j}. We denote the mean of the j-th cluster by w̄_j = (1/n_j) Σ_{v∈I_j} w_v. For a given W = [w_1, ..., w_m], the sum-of-square error (SSE) function in K-means clustering is given by [19, 20]:

Σ_{j=1}^{k} Σ_{v∈I_j} ||w_v − w̄_j||_2^2 = tr(W^T W) − tr(F^T W^T W F),    (2)

where the matrix F ∈ R^{m×k} is an orthogonal cluster indicator matrix with F_{i,j} = 1/√n_j if i ∈ I_j and F_{i,j} = 0 otherwise. If we ignore the special structure of F and keep only the orthogonality requirement, the relaxed SSE minimization problem is:

min_{F: F^T F = I_k}  tr(W^T W) − tr(F^T W^T W F),    (3)

resulting in the following penalty function for CMTL:

Ω_CMTL0(W, F) = α ( tr(W^T W) − tr(F^T W^T W F) ) + β tr(W^T W),    (4)

where the first term is derived from the K-means clustering objective and the second term improves the generalization performance. Combining Eq. (4) with the empirical error term L(W), we obtain the following CMTL formulation:

min_{W, F: F^T F = I_k}  L(W) + Ω_CMTL0(W, F).    (5)

2.3 Equivalence of ASO and CMTL

In the ASO formulation in Eq. (1), it is clear that the optimal v_i is given by v_i* = Θ w_i. 
Thus, the penalty in ASO has the following equivalent form:

Ω_ASO(W, Θ) = Σ_{i=1}^{m} ( α||w_i − Θ^T Θ w_i||_2^2 + β||w_i||_2^2 )
            = α ( tr(W^T W) − tr(W^T Θ^T Θ W) ) + β tr(W^T W),    (6)

resulting in the following equivalent ASO formulation:

min_{W, Θ: ΘΘ^T = I_h}  L(W) + Ω_ASO(W, Θ).    (7)

The penalty of the ASO formulation in Eq. (7) looks very similar to the penalty of the CMTL formulation in Eq. (5); however, the operations involved are fundamentally different. In the CMTL formulation in Eq. (5), the matrix F operates on the task dimension, as it is derived from the K-means clustering on the tasks; in the ASO formulation in Eq. (7), the matrix Θ operates on the feature dimension, as it aims to identify a shared low-dimensional predictive structure for all tasks. Although the mathematical formulations are different, we show in the following theorem that the objectives of CMTL and ASO are equivalent.

Theorem 2.1. The objectives of CMTL in Eq. (5) and ASO in Eq. (7) are equivalent if the cluster number, k, in K-means equals the size, h, of the shared low-dimensional feature space.

Proof. Denote Q(W) = L(W) + (α + β) tr(W^T W), with α, β > 0. Then CMTL and ASO solve the following optimization problems, respectively:

min_{W, F: F^T F = I_k}  Q(W) − α tr(W F F^T W^T),    min_{W, Θ: ΘΘ^T = I_h}  Q(W) − α tr(W^T Θ^T Θ W).

Note that in both CMTL and ASO, the first term Q is independent of F or Θ, for a given W. 
Thus, the optimal F and Θ for these two optimization problems are given by solving:

[CMTL]  max_{F: F^T F = I_k}  tr(W F F^T W^T),    [ASO]  max_{Θ: ΘΘ^T = I_k}  tr(W^T Θ^T Θ W).

Since W W^T and W^T W share the same set of nonzero eigenvalues, by the Ky Fan Theorem [21], both problems above achieve exactly the same maximum objective value: ||W^T W||_(k) = Σ_{i=1}^{k} λ_i(W^T W), where λ_i(W^T W) denotes the i-th largest eigenvalue of W^T W and ||W^T W||_(k) is known as the Ky Fan k-norm of the matrix W^T W. Plugging the results back into the original objective, the optimization problem for both CMTL and ASO becomes min_W Q(W) − α||W^T W||_(k). This completes the proof of this theorem.

3 Convex Relaxation of CMTL

The formulation in Eq. (5) is non-convex. A natural approach is to perform a convex relaxation on CMTL. We first reformulate the penalty in Eq. (5) as follows:

Ω_CMTL0(W, F) = α tr(W ((1 + η)I − F F^T) W^T),    (8)

where η is defined as η = β/α > 0. Since F^T F = I_k, the following holds:

(1 + η)I − F F^T = η(1 + η)(ηI + F F^T)^{−1}.    (9)

Thus, we can reformulate Ω_CMTL0 in Eq. (8) into the following equivalent form:

Ω_CMTL1(W, F) = αη(1 + η) tr(W (ηI + F F^T)^{−1} W^T),

resulting in the following equivalent CMTL formulation:

min_{W, F: F^T F = I_k}  L(W) + Ω_CMTL1(W, F).    (10)

Following [13, 17], we obtain the following convex relaxation of Eq. (10), called cCMTL:

min_{W, M}  L(W) + Ω_cCMTL(W, M)  s.t.  tr(M) = k,  M ⪯ I,  M ∈ S^m_+,    (11)

where Ω_cCMTL(W, M) is defined as:

Ω_cCMTL(W, M) = αη(1 + η) tr(W (ηI + M)^{−1} W^T).    (12)

The optimization problem in Eq. 
(11) is jointly convex with respect to W and M [9].

3.1 Equivalence of cASO and cCMTL

A convex relaxation (cASO) of the ASO formulation in Eq. (7) has been proposed in [13]:

min_{W, S}  L(W) + Ω_cASO(W, S)  s.t.  tr(S) = h,  S ⪯ I,  S ∈ S^d_+,    (13)

where Ω_cASO is defined as:

Ω_cASO(W, S) = αη(1 + η) tr(W^T (ηI + S)^{−1} W).    (14)

The cASO formulation in Eq. (13) and the cCMTL formulation in Eq. (11) differ in the regularization components: the respective Hessians of the regularization with respect to W are different. Similar to Theorem 2.1, our analysis shows that cASO and cCMTL are equivalent.

Theorem 3.1. The objectives of the cCMTL formulation in Eq. (11) and the cASO formulation in Eq. (13) are equivalent if the cluster number, k, in K-means equals the size, h, of the shared low-dimensional feature space.

Proof. Define the following two convex functions of W:

g_cCMTL(W) = min_M  tr(W (ηI + M)^{−1} W^T),  s.t.  tr(M) = k,  M ⪯ I,  M ∈ S^m_+,    (15)

and

g_cASO(W) = min_S  tr(W^T (ηI + S)^{−1} W),  s.t.  tr(S) = h,  S ⪯ I,  S ∈ S^d_+.    (16)

The cCMTL and cASO formulations can be expressed as unconstrained optimization problems w.r.t. W:

[cCMTL]  min_W  L(W) + c · g_cCMTL(W),    [cASO]  min_W  L(W) + c · g_cASO(W),

where c = αη(1 + η). Let h = k ≤ min(d, m). Next, we show that for a given W, g_cCMTL(W) = g_cASO(W) holds. Let W = Q_1 Σ Q_2, M = P_1 Λ_1 P_1^T, and S = P_2 Λ_2 P_2^T be the SVD of W, M, and S (M and S are symmetric positive semi-definite), respectively, where Σ = diag{σ_1, σ_2, ..., σ_m}, Λ_1 = diag{λ_1^(1), λ_2^(1), ..., λ_m^(1)}, and Λ_2 = diag{λ_1^(2), λ_2^(2), ..., λ_m^(2)}. Let q < k be the rank of Σ. It follows from the basic properties of the trace that:

tr(W (ηI + M)^{−1} W^T) = tr((ηI + Λ_1)^{−1} P_1^T Q_2 Σ^2 Q_2^T P_1).

The problem in Eq. (15) is thus equivalent to:

min_{P_1, Λ_1}  tr((ηI + Λ_1)^{−1} P_1^T Q_2 Σ^2 Q_2^T P_1),  s.t.  P_1 P_1^T = I,  P_1^T P_1 = I,  Σ_i λ_i^(1) = k.    (17)

It can be shown that the optimal P_1* is given by P_1* = Q_2, and the optimal Λ_1* is given by solving the following simple (convex) optimization problem [13]:

Λ_1* = argmin_{Λ_1}  Σ_{i=1}^{q} σ_i^2 / (η + λ_i^(1)),  s.t.  Σ_i λ_i^(1) = k,  0 ≤ λ_i^(1) ≤ 1.    (18)

It follows that g_cCMTL(W) = tr((ηI + Λ_1*)^{−1} Σ^2). Similarly, we can show that g_cASO(W) = tr((ηI + Λ_2*)^{−1} Σ^2), where

Λ_2* = argmin_{Λ_2}  Σ_{i=1}^{q} σ_i^2 / (η + λ_i^(2)),  s.t.  Σ_i λ_i^(2) = h,  0 ≤ λ_i^(2) ≤ 1.

It is clear that when h = k, Λ_1* = Λ_2* holds. Therefore, we have g_cCMTL(W) = g_cASO(W). This completes the proof.

Remark 3.2. In the functional of cASO in Eq. (16) the variable to be optimized is S ∈ S^d_+, while in the functional of cCMTL in Eq. (15) the optimization variable is M ∈ S^m_+. In many practical MTL problems the data dimensionality d is much larger than the task number m, and in such cases cCMTL is significantly more efficient in terms of both time and space. 
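The equality g_cCMTL(W) = g_cASO(W) underlying the proof can also be checked numerically: placing the same feasible spectrum on the right singular directions of W (to form M) and on the left singular directions (to form S) yields identical trace values (a sketch; the sizes, η, and the chosen spectrum are arbitrary illustrative values):

```python
import numpy as np

# Check tr(W (eta I + M)^-1 W^T) == tr(W^T (eta I + S)^-1 W) when M and S
# carry the same spectrum on the right/left singular directions of W.
rng = np.random.default_rng(1)
d, m, eta = 8, 4, 0.5
W = rng.standard_normal((d, m))

U, sig, Vt = np.linalg.svd(W)               # W = U diag(sig) Vt; U: d x d, Vt: m x m
lam = np.array([1.0, 0.6, 0.4, 0.0])        # feasible spectrum: 0 <= lam <= 1, sum = k = 2

M = Vt.T @ np.diag(lam) @ Vt                            # m x m variable of cCMTL
S = U @ np.diag(np.r_[lam, np.zeros(d - m)]) @ U.T      # d x d variable of cASO

g_cmtl = np.trace(W @ np.linalg.inv(eta * np.eye(m) + M) @ W.T)
g_aso = np.trace(W.T @ np.linalg.inv(eta * np.eye(d) + S) @ W)
print(abs(g_cmtl - g_aso) < 1e-10)          # -> True
```

Both traces reduce to Σ_i σ_i^2 / (η + λ_i), which is why the minimizing spectra (and hence the two functionals) coincide when h = k.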
Our equivalence relationship established in Theorem 3.1 provides an (equivalent) efficient implementation of cASO, especially for high-dimensional problems.

4 Optimization Algorithms

In this section, we propose to employ three different methods, i.e., the Alternating Optimization Method (altCMTL), the Accelerated Projected Gradient Method (apgCMTL), and the Direct Gradient Descent Method (graCMTL), for solving the convex relaxation in Eq. (11). Note that we focus on smooth loss functions in this paper.

4.1 Alternating Optimization Method

The Alternating Optimization Method (altCMTL) is similar to the Block Coordinate Descent (BCD) method [22], in which each variable is optimized alternately with the other variables fixed. The pseudo-code of altCMTL is provided in the supplemental material. Note that using techniques similar to the ones in [23], we can show that altCMTL finds the globally optimal solution to Eq. (11). The altCMTL algorithm involves the following two steps in each iteration:

Optimization of W: For a fixed M, the optimal W can be obtained via solving:

min_W  L(W) + c tr(W (ηI + M)^{−1} W^T).    (19)

The problem above is smooth and convex. It can be solved using gradient-type methods [22]. In the special case of a least-squares loss function, the problem in Eq. (19) admits an analytic solution.

Optimization of M: For a fixed W, the optimal M can be obtained via solving:

min_M  tr(W (ηI + M)^{−1} W^T),  s.t.  tr(M) = k,  M ⪯ I,  M ∈ S^m_+.    (20)

From Theorem 3.1, the optimal M to Eq. (20) is given by M = Q Λ* Q^T, where Λ* is obtained from Eq. (18). The problem in Eq. (18) can be efficiently solved using techniques similar to those in [17].

4.2 Accelerated Projected Gradient Method

The accelerated projected gradient method (APG) has been applied to solve many machine learning formulations [24]. 
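The two alternating steps of altCMTL can be sketched as follows (a minimal NumPy illustration assuming a least-squares loss; we solve the W-step by plain gradient descent rather than the analytic solution, and solve Eq. (18) by bisection on its KKT multiplier — these implementation choices and all names are ours, not the authors'):

```python
import numpy as np

def m_step(W, k, eta):
    """Optimal M for fixed W (Theorem 3.1): M = Q diag(lam) Q^T, where lam solves
    min sum_i sig_i^2/(eta + lam_i)  s.t.  sum(lam) = k, 0 <= lam_i <= 1.
    KKT conditions give lam_i = clip(t * sig_i - eta, 0, 1); bisect on t.
    Assumes rank(W) >= k so the trace constraint is attainable."""
    _, sig, Vt = np.linalg.svd(W, full_matrices=False)   # sig sorted descending
    lo, hi = 0.0, (1.0 + eta) / max(sig[k - 1], 1e-12)   # at t = hi the top-k lam reach 1
    for _ in range(100):
        t = 0.5 * (lo + hi)
        lam = np.clip(t * sig - eta, 0.0, 1.0)
        lo, hi = (t, hi) if lam.sum() < k else (lo, t)
    lam = np.clip(0.5 * (lo + hi) * sig - eta, 0.0, 1.0)
    return Vt.T @ np.diag(lam) @ Vt

def w_step(W, M, X_list, y_list, c, eta, lr=0.01, iters=500):
    """Gradient descent on the smooth, convex W-subproblem (least-squares loss)."""
    A = np.linalg.inv(eta * np.eye(W.shape[1]) + M)
    for _ in range(iters):
        G = 2.0 * c * W @ A                  # gradient of c * tr(W (eta I + M)^-1 W^T)
        for i, (X, y) in enumerate(zip(X_list, y_list)):
            G[:, i] += (2.0 / len(y)) * X.T @ (X @ W[:, i] - y)
        W = W - lr * G
    return W

# alternate the two steps on a toy problem with m = 4 tasks
rng = np.random.default_rng(0)
d, m, k, alpha, beta = 5, 4, 2, 0.1, 0.1
eta = beta / alpha
c = alpha * eta * (1.0 + eta)
X_list = [rng.standard_normal((20, d)) for _ in range(m)]
y_list = [X @ rng.standard_normal(d) for X in X_list]
W, M = np.zeros((d, m)), (k / m) * np.eye(m)   # feasible start: tr(M) = k, M <= I
for _ in range(10):
    W = w_step(W, M, X_list, y_list, c, eta)
    M = m_step(W, k, eta)
```

After the loop, M stays on the feasible set (tr(M) = k, eigenvalues in [0, 1]) and the empirical risk is much smaller than at W = 0; a production implementation would instead use the analytic W-step and the projection-based solver for Eq. (18) referenced in the text.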
We apply APG to solve the cCMTL formulation in Eq. (11); the resulting algorithm is called apgCMTL. The key component of apgCMTL is the computation of a proximal operator of the following form:

min_{W_Z, M_Z}  ||W_Z − Ŵ_S||_F^2 + ||M_Z − M̂_S||_F^2,  s.t.  tr(M_Z) = k,  M_Z ⪯ I,  M_Z ∈ S^m_+,    (21)

where the details of the construction of Ŵ_S and M̂_S can be found in [24]. The optimization problem in Eq. (21) is involved in each iteration of apgCMTL, and hence its computation is critical for the practical efficiency of apgCMTL. We show below that the optimal W_Z and M_Z of Eq. (21) can be computed efficiently.

Computation of W_Z: The optimal W_Z to Eq. (21) can be obtained by solving:

min_{W_Z}  ||W_Z − Ŵ_S||_F^2.    (22)

Clearly the optimal W_Z to Eq. (22) is equal to Ŵ_S.

Computation of M_Z: The optimal M_Z to Eq. (21) can be obtained by solving:

min_{M_Z}  ||M_Z − M̂_S||_F^2,  s.t.  tr(M_Z) = k,  M_Z ⪯ I,  M_Z ∈ S^m_+,    (23)

where M̂_S is not guaranteed to be positive semi-definite. Our analysis shows that the optimization problem in Eq. (23) admits an analytical solution via solving a simple convex projection problem. The main result and the pseudo-code of apgCMTL are provided in the supplemental material.

4.3 Direct Gradient Descent Method

In the Direct Gradient Descent Method (graCMTL), as used in [17], the cCMTL problem in Eq. (11) is reformulated as an optimization problem in the single variable W:

min_W  L(W) + c · g_cCMTL(W),    (24)

where g_cCMTL(W) is the functional of W defined in Eq. 
(15).

Given the intermediate solution W_{k−1} from the (k−1)-th iteration of graCMTL, we compute the gradient of g_cCMTL(W) and then apply the general gradient descent scheme [25] to obtain W_k. Note that at each iterative step of the line search, we need to solve an optimization problem of the form of Eq. (20). The gradient of g_cCMTL(·) at W_{k−1} is given by [26, 27]: ∇_W g_cCMTL(W_{k−1}) = 2 W_{k−1} (ηI + M̂)^{−1}, where M̂ is obtained by solving Eq. (20) at W = W_{k−1}. The pseudo-code of graCMTL is provided in the supplemental material.

Figure 1: The correlation matrices of the ground truth model and the models learned by RidgeSTL, RegMTL, and cCMTL (panels: Truth, RidgeSTL, RegMTL, cCMTL). Darker color indicates higher correlation. In the ground truth there are 100 tasks clustered into 5 groups. Each task has 200 dimensions. 95 training samples and 5 testing samples are used in each task. The test errors (in terms of nMSE) for RidgeSTL, RegMTL, and cCMTL are 0.8077, 0.6830, and 0.0354, respectively.

5 Experiments

In this section, we empirically evaluate the effectiveness and the efficiency of the proposed algorithms on synthetic and real-world data sets. The normalized mean square error (nMSE) and the averaged mean square error (aMSE) are used as the performance measures [23]. Note that in this paper we have not developed new MTL formulations; instead, our main focus is the theoretical understanding of the inherent relationship between ASO and CMTL. Thus, an extensive comparative study of various MTL algorithms is out of the scope of this paper. As an illustration, in the following experiments we only compare cCMTL with two baseline techniques: ridge regression STL (RidgeSTL) and regularized MTL (RegMTL) [28].

Simulation Study: We apply the proposed cCMTL formulation in Eq. (11) on a synthetic data set (with a pre-defined cluster structure). 
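One hypothetical way to generate such a clustered synthetic data set (cluster centers shared within groups, small per-task perturbations; all parameter values and names here are illustrative, not the construction actually used in the experiments):

```python
import numpy as np

def make_clustered_tasks(n_clusters=5, tasks_per_cluster=20, d=300,
                         n_samples=100, center_scale=10.0, task_scale=1.0,
                         noise=0.1, seed=0):
    """Each cluster shares a center weight vector; each task perturbs it slightly."""
    rng = np.random.default_rng(seed)
    X_list, y_list, weights, labels = [], [], [], []
    for c in range(n_clusters):
        center = center_scale * rng.standard_normal(d)        # shared cluster model
        for _ in range(tasks_per_cluster):
            w = center + task_scale * rng.standard_normal(d)  # task-specific shift
            X = rng.standard_normal((n_samples, d))
            y = X @ w + noise * rng.standard_normal(n_samples)
            X_list.append(X); y_list.append(y)
            weights.append(w); labels.append(c)
    return X_list, y_list, np.array(weights).T, np.array(labels)

X_list, y_list, W_true, labels = make_clustered_tasks()
print(W_true.shape)   # -> (300, 100)
```

With center_scale much larger than task_scale, weight vectors within a cluster are highly correlated while those across clusters are nearly uncorrelated, which is the block structure visible in Figure 1.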
We use 5-fold cross-validation to determine the regularization parameters for all methods. We construct the synthetic data set following a procedure similar to the one in [17]: the constructed synthetic data set consists of 5 clusters, where each cluster includes 20 (regression) tasks and each task is represented by a weight vector of length d = 300. Details of the construction are provided in the supplemental material. We apply RidgeSTL, RegMTL, and cCMTL to the constructed synthetic data. The correlation coefficient matrices of the obtained weight vectors are presented in Figure 1. From the result we can observe that (1) cCMTL is able to capture the cluster structure among tasks and achieves a small test error; (2) RegMTL is better than RidgeSTL in terms of test error, but it introduces unnecessary correlation among tasks, possibly due to the assumption that all tasks are related; and (3) in cCMTL we also notice some 'noisy' correlation, which may be due to the spectral relaxation.

Table 1: Performance comparison on the School data in terms of nMSE and aMSE. Smaller nMSE and aMSE indicate better performance. All regularization parameters are tuned using 5-fold cross validation. 
The mean and standard deviation are calculated based on 10 random repetitions.

Measure  Ratio  RidgeSTL         RegMTL           cCMTL
nMSE     10%    1.3954 ± 0.0596  1.0988 ± 0.0178  1.0850 ± 0.0206
nMSE     15%    1.1370 ± 0.0146  1.0636 ± 0.0170  0.9708 ± 0.0145
nMSE     20%    1.0290 ± 0.0309  1.0349 ± 0.0091  0.8864 ± 0.0094
nMSE     25%    0.8649 ± 0.0123  1.0139 ± 0.0057  0.8243 ± 0.0031
nMSE     30%    0.8367 ± 0.0102  1.0042 ± 0.0066  0.8006 ± 0.0081
aMSE     10%    0.3664 ± 0.0160  0.2865 ± 0.0054  0.2831 ± 0.0050
aMSE     15%    0.2972 ± 0.0034  0.2771 ± 0.0045  0.2525 ± 0.0048
aMSE     20%    0.2717 ± 0.0083  0.2709 ± 0.0027  0.2322 ± 0.0022
aMSE     25%    0.2261 ± 0.0033  0.2650 ± 0.0027  0.2154 ± 0.0020
aMSE     30%    0.2196 ± 0.0035  0.2632 ± 0.0028  0.2101 ± 0.0016

Figure 2: Sensitivity study of altCMTL, apgCMTL, graCMTL in terms of the computation cost (in seconds) with respect to feature dimensionality (left), sample size (middle), and task number (right).

Effectiveness Comparison: Next, we empirically evaluate the effectiveness of the cCMTL formulation in comparison with RidgeSTL and RegMTL using real-world benchmark datasets including the School data (http://www.cs.ucl.ac.uk/staff/A.Argyriou/code/) and the Sarcos data (http://gaussianprocess.org/gpml/data/). The regularization parameters for all algorithms are determined via 5-fold cross validation; the reported experimental results are averaged over 10 random repetitions. 
The School data consists of the exam scores of 15362 students from 139 secondary schools, where each student is described by 27 attributes. We vary the training ratio in the set 5 × {1, 2, ..., 6}% and record the respective performance. The experimental results are presented in Table 1. We observe that cCMTL performs the best in all settings. Experimental results on the Sarcos dataset are available in the supplemental material.

Efficiency Comparison: We compare the efficiency of the three algorithms, altCMTL, apgCMTL, and graCMTL, for solving the cCMTL formulation in Eq. (11). For the following experiments, we set α = 1, β = 1, and k = 2 in cCMTL; we observe a similar trend in other settings. Specifically, we study how the feature dimensionality, the sample size, and the task number affect the computation cost (in seconds) required for convergence. The experimental setup is as follows: we terminate apgCMTL when the change of objective values in two successive steps is smaller than 10^{−5} and record the obtained objective value; we then use this value as the stopping criterion for graCMTL and altCMTL, that is, we stop graCMTL or altCMTL when it attains an objective value equal to or smaller than the one attained by apgCMTL. We use the Yahoo Arts data for the first two experiments. Because the task number in the Yahoo data is very small, we construct a synthetic data set for the third experiment.

In the first experiment, we vary the feature dimensionality in the set [500 : 500 : 2500] with the sample size fixed at 4000 and the task number fixed at 17. The experimental result is presented in the left plot of Figure 2. In the second experiment, we vary the sample size in the set [3000 : 1000 : 9000] with the dimensionality fixed at 500 and the task number fixed at 17. The experimental result is presented in the middle plot of Figure 2. 
From the first two experiments, we observe that larger feature dimensionality or larger sample size leads to higher computation cost. In the third experiment, we vary the task number in the set [10 : 10 : 190] with the feature dimensionality fixed at 600 and the sample size fixed at 2000. The employed synthetic data set is constructed as follows: for each task, we generate the entries of the data matrix X_i from N(0, 1) and the entries of the weight vector w_i from N(0, 1); the response vector y_i is computed as y_i = X_i w_i + ξ, where ξ ∼ N(0, 0.01) represents the noise vector. The experimental result is presented in the right plot of Figure 2. We observe that altCMTL is more efficient than the other two algorithms.

6 Conclusion

In this paper, we establish the equivalence relationship between two multi-task learning techniques: alternating structure optimization (ASO) and clustered multi-task learning (CMTL). We further establish the equivalence relationship between our proposed convex relaxation of CMTL and an existing convex relaxation of ASO. In addition, we propose three algorithms for solving the convex CMTL formulation and demonstrate their effectiveness and efficiency on benchmark datasets. The proposed algorithms involve the computation of the SVD. In the case of a very large task number, the SVD computation will be expensive. We plan to further improve the efficiency of the algorithms by employing approximation methods. In addition, we plan to apply the proposed algorithms to other real-world applications involving multiple (clustered) tasks.

Acknowledgments

This work was supported in part by NSF IIS-0812551, IIS-0953662, MCB-1026710, CCF-1025177, and NIH R01 LM010730.

References

[1] T. Evgeniou, M. Pontil, and O. Toubia. A convex optimization approach to modeling consumer heterogeneity in conjoint estimation. 