{"title": "Generalized Block-Diagonal Structure Pursuit: Learning Soft Latent Task Assignment against Negative Transfer", "book": "Advances in Neural Information Processing Systems", "page_first": 5854, "page_last": 5865, "abstract": "In multi-task learning, a major challenge springs from a notorious issue known as negative transfer, which refers to the phenomenon that sharing the knowledge with dissimilar and hard tasks often results in a worsened performance. To circumvent this issue, we propose a novel multi-task learning method, which simultaneously learns latent task representations and a block-diagonal Latent Task Assignment Matrix (LTAM). Different from most of the previous work, pursuing the Block-Diagonal structure of LTAM (assigning latent tasks to output tasks) alleviates negative transfer via collaboratively grouping latent tasks and output tasks such that inter-group knowledge transfer and sharing is suppressed. This goal is challenging, since 1) our notion of Block-Diagonal Property extends the traditional notion for square matrices where the $i$-th column and the $i$-th column represents the same concept; 2) marginal constraints on rows and columns are also required for avoiding isolated latent/output tasks. Facing such challenges, we propose  a novel regularizer by means of an equivalent spectral condition realizing this generalized block-diagonal property. Practically, we provide a relaxation scheme which improves the flexibility of the model. With the objective function given, we then propose an alternating optimization method, which not only tells how negative transfer is alleviated in our method but also reveals an interesting connection between our method and the optimal transport problem.  Finally, the method is demonstrated on a simulation dataset, three real-world benchmark datasets and further applied to personalized attribute predictions.", "full_text": "Generalized Block-Diagonal Structure Pursuit:\nLearning Soft Latent Task Assignment against\n\nNegative Transfer\n\nZhiyong Yang1,2 Qianqian Xu3\n\nXiaochun Cao1,2,6\n\nYangbangyan Jiang1,2\n\nQingming Huang3,4,5,6\u2217\n\n1State Key Laboratory of Information Security, Institute of Information Engineering, CAS\n\n2School of Cyber Security, University of Chinese Academy of Sciences\n\n3Key Lab. of Intelligent Information Processing, Institute of Computing Technology, CAS\n\n4School of Computer Science and Tech., University of Chinese Academy of Sciences\n\n5Key Laboratory of Big Data Mining and Knowledge Management, CAS\n\n6Peng Cheng Laboratory\n\nyangzhiyong@iie.ac.cn, xuqianqian@ict.ac.cn, jiangyangbangyan@iie.ac.cn\n\ncaoxiaochun@iie.ac.cn, qmhuang@ucas.ac.cn\n\nAbstract\n\nIn multi-task learning, a major challenge springs from a notorious issue known as\nnegative transfer, which refers to the phenomenon that sharing the knowledge with\ndissimilar and hard tasks often results in a worsened performance. To circumvent\nthis issue, we propose a novel multi-task learning method, which simultaneously\nlearns latent task representations and a block-diagonal Latent Task Assignment\nMatrix (LTAM). Different from most of the previous work, pursuing the Block-\nDiagonal structure of LTAM (assigning latent tasks to output tasks) alleviates\nnegative transfer via punishing inter-group knowledge transfer and sharing. This\ngoal is challenging since our notion of Block-Diagonal Property extends the tradi-\ntional notion for homogeneous and square matrices. In this paper, we propose a\nspectral regularizer which is proven to leverage the expected structure. Practically,\nwe provide a relaxation scheme which improves the \ufb02exibility of the model. With\nthe objective function given, we then propose an alternating optimization method,\nwhich reveals an interesting connection between our method and the optimal trans-\nport problem. Finally, the method is demonstrated on a simulation dataset, three\nreal-world benchmark datasets and further applied to two personalized attribute\nlearning datasets.\n\n1\n\nIntroduction\n\nMulti-Task Learning (MTL) is a learning paradigm whose aim is to leverage useful information\ncontained in multiple related tasks to help improve the generalization performance of all the tasks\n[Caruana, 1997]. Nowadays, MTL has emerged as a fundamental building block for a wide range of\napplications ranging from scene parsing [Xu et al., 2018], attribute learning [Cao et al., 2018, Yang\net al., 2019a, 2018], text classi\ufb01cation [Liu et al., 2017], sequence labeling [Lin et al., 2018], to travel\ntime estimation [Li et al., 2018], etc.\nThe fundamental belief of MTL lies in that sharing knowledge among multiple tasks often results\nin an improvement in generalization performance, which is especially of great signi\ufb01cance in the\n\n\u2217Corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fpresence of insuf\ufb01cient data annotations [Heskes, 1998]. Based on the belief, a great number of\nstudies have been carried out to explore the problem of how to share valuable knowledge across\ndifferent tasks. The early studies of MTL (e.g.[Argyriou et al., 2008a]) hold that all the tasks share\ncommon and sparse features. However, [Kang et al., 2011] later points out that if not all the tasks\nare indeed related, then sharing common features with dissimilar and hard tasks often results in\nperformance degradation, which is termed as negative transfer.\nTo address this issue, recent studies in the odyssey against negative transfer fall in two major\ndirections. One line of the researches leverages the grouping effect based on the latent-task-agnostic\nidea which develops structural regularizers where only the original per-task parameters are utilized.\n[Kang et al., 2011, Kshirsagar et al., 2017] directly formulate the tasking grouping as a mixed integer\nprogramming (or a relaxation [Frecon et al., 2018]), which simultaneously learns the group index\nand the model parameters. [Argyriou et al., 2008b, Zhou et al., 2011a, Jacob et al., 2009, Lee et al.,\n2016, Liu and Pan, 2017, McDonald et al., 2014] leverage the tasking grouping via enforcing a\nspeci\ufb01c structure, hopefully block-diagonal, on the task correlation matrix. As an extension of this\nformulation [Zhong and Kwok, 2012] resorts to feature-speci\ufb01c task clustering. The other line of\nresearches formulates the MTL based on the latent task, where the model parameter is represented as\na linear combination of latent task basis. [Kumar and III, 2012] gives an early trial of this formulation\nin search of a more \ufb02exible MTL model. Similarly, in the work of [Maurer et al., 2013], a sparse\ncoding model is proposed for MTL, where the dictionary is set as the latent task basis and the code is\nset as the linear combination coef\ufb01cients of such basis. Recently, [Lee et al., 2018] also provides an\nasymmetric learning framework based on the latent task representation where transferring knowledge\nfrom unreliable tasks to reliable tasks is explicitly punished.\nThe two aforementioned directions, i.e., learning grouped model structure and latent task represen-\ntation provide complementary functions in a sense that the former one avoids inter-group negative\ntransfer, while the latter one focuses on learning a more \ufb02exible model. However, the related studies\non how to bridge the efforts of these two directions are sparse. To the best of our knowledge, the only\ntwo studies along this direction are [Crammer and Mansour, 2012, Barzilai and Crammer, 2015].\nHowever, both studies adopt a strong assumption that each group of tasks is only assigned with one\nlatent task basis.\nTo leverage a \ufb02exible grouping structure with latent task representations, we should allow each task\ncluster to have multiple latent tasks. Motivated by this, we study the structural learning problem of\nhow to learn a block-diagonal Latent Task Assignment Matrix (LATM). With the block-diagonal\nstructure, tasks within each group share a subset (not necessarily one) of the latent task basis. Since\nLATM is not a squared matrix and marginal constraints are also necessary to avoid isolated tasks/latent\ntasks, our notion of block-diagonality generalizes the one adopted in the self-expression scenario\n[Lu et al., 2019, Lee et al., 2016, Liu and Pan, 2017] , which makes traditional structural regularizers\nnot available to solve our problem. Our \ufb01rst contribution then comes as an equivalent spectral\ncondition that realizes our pursuit of the generalized block-diagonal structure. Then we propose a\nnew MTL method named Generalized Block-Diagonal Structural Pursuit (GBDSP), which utilizes\nthe spectral condition as a novel regularizer with a relaxation scheme. In our optimization method,\nthe intermediate solution produced provides new insights into how negative transfer is alleviated in\nour model. Theoretical studies show how the proposed regularizer guarantees the expected structure.\nFinally, empirical studies demonstrate the effectiveness of the proposed method.\n\n2 Generalized Block-Diagonal Structure Pursuit\n\nNotations The notations adopted in this paper are enumerated as follows. Sm denotes the set of\nall symmetric matrices in Rm\u00d7m. The eigenvalues of a symmetric matrix A \u2208 Sm are denoted as\n\u03bb1(A),\u00b7\u00b7\u00b7 , \u03bbm(A) such that \u03bb1(A) \u2265 \u03bb2(A) \u2265 \u00b7\u00b7\u00b7 \u2265 \u03bbm(A). (cid:104)\u00b7,\u00b7(cid:105) denotes the inner product for\ntwo matrices or two vectors. Given two vectors a and b, a \u2295 b denotes the outter sum a1(cid:62) + 1b(cid:62).\nGiven two matrices A and B, A\u2295B denotes the direct sum of two matrices, i.e., A\u2295B =\n,\n0 B\nand we say A (cid:23) B, if A \u2212 B is positive semi-de\ufb01nite. For distributions, N (\u00b5, \u03c32) denotes the\nnormal distribution. Pm denotes the set of all permutation matrices in Rm\u00d7m. For two matrices\nA and B having the same size, d(A, B) = (cid:107)A \u2212 B(cid:107)2\nF . Given an event A, \u03b4(A) denotes the\ncorresponding indicator function. Moreover, let us note two notations in our paper that are prone\n\n(cid:20)A 0\n\n(cid:21)\n\n2\n\n\fto be confused. k denotes the dimension of the latent task representation. K(K \u2264 k) denotes the\nnumber of given groups.\n\n2.1 Model Formulation\n\nj\n\nj=1 (cid:96)(Y (i)\n\nj\n\n, \u02c6Y (i)\n\nj\n\n1 ,\u00b7\u00b7\u00b7 , X (i)\n\nni ](cid:62), where X (i)\n\n1 ,\u00b7\u00b7\u00b7 , Y (i)\n\nni ](cid:62) \u2208 Rni\u00d71, where Y (i)\n\nfor the i-th task is de\ufb01ned as J (i) =(cid:80)ni\n\nas:(cid:8)(X (1), Y (1)),\u00b7\u00b7\u00b7 , (X (T ), Y (T ))(cid:9). Here X (i) = [X (i)\n\nBefore entering our new method, we \ufb01rst provide a brief introduction of the multi-task learning\nsetting we adopted in this paper. Here we adopt the latent task representation framework proposed\nin [Kumar and III, 2012]. Given T tasks to be learned simultaneously, we denote the training data\nj \u2208 Rd\u00d71 is\nthe input feature for the j-th sample of the i-th task, ni denotes the number of instances and d\nrepresents the feature dimension. Similarly Y (i) = [Y (i)\nis the\ncorresponding response for the j-th sample of the i-th task. Following the standard multi-task learning\nparadigm, we learn a model \u02c6Y (i)(x) = W (i)(cid:62)\nx to estimate the output response for each task i. Here\nwe call W = [W (1),\u00b7\u00b7\u00b7 , W (T )] \u2208 Rd\u00d7T the per-task parameter matrix. Furthermore, to model\nthe relationship among the tasks, we assume that the per-task parameters lie in a low dimensional\nsubspace. To this end, we introduce a set of latent task basis L \u2208 Rd\u00d7k, where k < T . For each given\ntask i, its parameter is then represented as a linear combination of the basis by letting W (i) = LS(i),\nwhere S(i) \u2208 Rk\u00d71 is the combination coef\ufb01cients. Given a loss function (cid:96)(y, \u02c6y), the empirical risk\n(cid:80)T\n). Given proper regularizers \u2126(L), \u2126(S),\ni=1 J (i) + \u03b11\u00b7 \u2126(L) + \u03b12\u00b7 \u2126(S).\n[Kumar and III, 2012] learns L, S from the problem argminL,S\nIn this paper, we adopt the F -norm penalty for L, i.e., we set \u2126(L) = (cid:107)L(cid:107)2\nF . And we seek new\nsolutions against negative transfer from \u2126(S). In this setting, we must deal with both the latent task\nrepresentations and the true tasks. To differentiate the two, we refer the former ones to latent tasks\n(l1,\u00b7\u00b7\u00b7 , lk) and the latter ones to output tasks (o1,\u00b7\u00b7\u00b7 , oT ).\nWith the latent task formulation W = LS, S then captures the importance of the latent tasks to the\noutput tasks. In a natural sense, we regard Si,j as P(l = i|o = j), namely the possibility of choosing\nli to represent oj. In this probabilistic view, LS(i) now becomes El|o=i(L), i.e., the expectation of\nthe latent tasks representations assigned to task oi. We then call S the Latent Task Assignment Matrix\n(LTAM), since the conditional possibility could be considered as a soft assignment score. Before\ndeveloping a proper regularizer, we must \ufb01rst answer the question that can we choose S arbitrarily?\nUnfortunately, we will immediately see that the answer is negative. Let us denote S\u2021 \u2208 Rk\u00d7T by\ni,j = P(l = i, o = j), the joint distribution of l and o. Note that S\u20211 and S\u2021(cid:62)\n\u2021\n1 are marginal\nS\ndistributions on l and o, we come to two extreme cases that must be ruled out from consideration.\nIf (S\u20211T )i = 0 then li becomes an isolated latent task which is irrelevant with all the output task.\nSimilarly, if (S\u2021(cid:62)\n1k)j = 0 then oj becomes isolated with no latent tasks assigned to it. To remove\nextreme cases of such kinds, we then pose normalization constraints on S\u2021 for each row and column\nin the form: S\u20211T = a > 0k, S\u2021(cid:62)\n1k = b > 0T . To maintain fairness, we do not expect to introduce\nextra bias from the choice of marginal distribution. Such a spirit guides us to put out a = 1k/k,\nb = 1T /T . Moreover, this also simpli\ufb01es the relation between S and S\u2021 with S = T S\u2021. From all\nabove, we adopt the transportation polytopes \u03a0(a, b) =\nas\nthe feasible set for the parameter S\u2021.\nSo far we have known that S must satisfy the marginal constraints to make the solution non-trivial.\nNow let us step further to seek out what else we should pose on S to suppress inter-group transfer.\nIn this paper, we adopt a basic assumption that the latent tasks and output tasks could be clustered\ninto K independent groups. In order to avoid negative transfer, we hope the possibility to assign\nli to oj is nonzero if and only if (i, j) belongs to the same group. This leads to a block-diagonal\nstructure of S\u2021 up to row and column permutations. Next, we give a formal de\ufb01nition of the desired\nblock-diagonal structure with S\u2021 \u2208 \u03a0(a, b), based on the following simple idea. If the columns and\nrows of S\u2021 could be partitioned into K groups, S\u2021 could then be expressed as a direct sum of K\nblocks up to proper column and row permutations. The maximum of such K then implies the number\nof groups in the matrix.1 This motivates the following de\ufb01nition of the grouping structure, which is\ntermed as the Generalized K Block Diagonal Property (GKBDP) in our paper.\n\n: S\u20211 = a, S\u2021(cid:62)\n\nS\u2021 \u2208 Rk\u00d7T\n\n(cid:111)\n\n1 = b\n\n(cid:110)\n\n+\n\n1If K is not the maximum of such numbers, we can always \ufb01nd out more disjoint blocks.\n\n3\n\n\fsuch that: PrS\u2021Pc = (cid:76)K\n\nDe\ufb01nition 1 (Generalized K Block Diagonal Property). Given a matrix S\u2021 \u2208 \u03a0(a, b), if there\nexists a permutation matrix over rows Pr \u2208 Pk and a permutation matrix over columns Pc \u2208 PT\ni Ti = T ,\nthen we de\ufb01ne \u03c7S\u2021(Pr, Pc) = K. Moreover, we say S\u2021 is Generalized K Block Diagonal if\n\u03c7S\u2021 =\n\n\u02c6S(i) where \u02c6S(i) (cid:54)= 0, \u02c6S(i) \u2208 Rki\u00d7Ti,(cid:80)\n\ni ki = k, (cid:80)\n\n\u03c7S\u2021(Pr, Pc) = K.\n\nmax\n\ni=1\n\nPr\u2208Pd,Pc\u2208PT\n\nc\n\n(cid:21)\n\nS\u2021(cid:62)\n\nS\u2021\n0\n\n(cid:20) 0\n\nNote that GKBDP extends the notion of block-diagonal property deployed in self-expression [Lee\net al., 2016, Liu and Pan, 2017, Lu et al., 2019, Jiang et al., 2018, 2019, Yang et al., 2019b] which is\nonly available for square matrices. Furthermore, the traditional self-expression-based block-diagonal\nproperty requires Pr = P (cid:62)\n[Lu et al., 2019], i.e., a simultaneous permutation on columns and rows\n(the i-th row and the i-th column represent the same object) . However, this is not the case in our\nnotion of GKBDP since the columns and rows here represent heterogeneous concepts (the i-th row\nis for li, i-th column is for oi). Moreover we also consider the marginal constraints S\u20211 = a, and\nS\u2021(cid:62)\n1 = b to avoid isolated l or o. Based on the aforementioned facts, traditional regularization\nschemes are thus not directly applicable to leverage the GKBDP.\nNext, we derive an equivalent condition for GKBDP, which directly leads to the formulation of\nour method. First, we de\ufb01ne an auxiliary bipartite graph Gl\u222ao = (Vl\u222ao,El\u222ao, Al\u222ao). The vertices\nof Gl\u222ao include all l and o. Denote Vo as the set of all output tasks and Vl as the set of all latent\ntasks, the vertex set Vl\u222ao is then de\ufb01ned as Vl\u222ao = Vl \u222a V o. To de\ufb01ne the edge set, we \ufb01rst de\ufb01ne\n, where the (i, j) \u2208 El\u222ao if and only if\nan af\ufb01nity matrix Al\u222ao in the form Al\u222ao =\nAl\u222aoij (cid:54)= 0. Then the well-known graph Laplacian follows as \u2206(S\u2021) = diag(Al\u222ao1) \u2212 Al\u222ao.\nWith the de\ufb01nition of Gl\u222ao, we could derive the following theorem.\nTheorem 1. If S\u2021 \u2208 \u03a0(a, b), \u03c7S = K holds if and only if dim(N ull(\u2206(S\u2021))) = K, i.e, the 0\neigenvalue of \u2206(S\u2021) has multiplicity K. Moreover, denote A(i) as the set of latent and output tasks\nbelonging to the i-th block of S, the eigenspace of 0 is spanned by \u03b9A(1), \u03b9A(2),\u00b7\u00b7\u00b7 , \u03b9A(K), where\n\u03b9A(i) \u2208 R(k+T )\u00d71, [\u03b9A(i) ]j = 1 if j \u2208 A(i), otherwise [\u03b9A(i)]j = 0.\nThe proof can be found in Appendix B.1. With the theorem, we can now step further to seek a suitable\nregularizer realizing GKBDP. It becomes straightforward that leveraging GKBDP requires the sum of\nbottom K eigenvalues to be as small as possible. Let N = k + T denote the total number of nodes in\nN\u2212K+1 \u03bbi(\u2206(S\u2021)) with the constraint S\u2021 \u2208 \u03a0(a, b). Following\nthe variational characterization of eigenvalues, we could reformulate eigenvalue calculation as an\noptimization problem with the following theorem.\nTheorem 2. Let M = {U : U \u2208 SN , I (cid:23) U (cid:23) 0, tr(U ) = K}, then \u2200A \u2208 SN :\nK , where\n\nGl\u222ao, we then need to minimize(cid:80)N\n(cid:80)N\nN\u2212K+1 \u03bbi(A) = minU\u2208M (cid:104)A, U(cid:105) , with an optimal value reached at U = VKV (cid:62)\nThm.2 provides a regularizer as \u2126(S\u2021) = inf(cid:8)(cid:10)\u2206(S\u2021), U(cid:11) : U \u2208 M(cid:9). Denote (cid:101)J =(cid:80)T\n\nThe theorem slightly extends the results in [Overton and Womersley, 1992b], which considers top-K\neigenvalues. A proof for the theorem could be found in Appendix B.2. Back to our practical problem,\ni=1 J (i),\n\u21261 = \u03b11 \u00b7 ||L||2\nframework:\n\nF /2, \u21262 = \u03b13 \u00b7(cid:10)\u2206(S\u2021), U(cid:11), we then reach an MTL model based on the latent task\n\nVK represents the eigenvectors of the smallest K eigenvalues of A.\n\n(cid:101)J + \u21261 + \u21262, s.t. S\u2021 \u2208 \u03a0(a, b), U \u2208 M, S = T S\u2021.\n\nmin\n\nL,S,S\u2021,U\n\n(Obj0)\n\nThis model exactly realizes GKBDP. However, this exact model is impractical in the following\nsense. First, it is hard to solve (Obj0) directly since multiple constraints are wrapped together on S.\nMoreover, it encourages a strict structural control of S, prohibiting overlapped subspaces even when\nit bene\ufb01ts the performance. These problems lead us to a relaxed implementation of (Obj0) , bringing\nus possibilities to embrace a practical and a more \ufb02exible solution.\nTo avoid directly controlling the structure of S, we relax the constraint S = T S\u2021 as a distance\npenalty2 \u21263 = \u03b12 \u00b7 d(S, T S\u2021)/2. This brings us to the \ufb01nal optimization problem:\n\n2Let us note that, in the rest of the paper, S = T S\u2021 no longer holds\n\n4\n\n\fJ = (cid:101)J + \u21261 + \u21262 + \u21263, s.t. S\u2021 \u2208 \u03a0(a, b), U \u2208 M.\n\nmin\n\nL,S,S\u2021,U\n\n(Obj)\n\n(cid:104)\n(cid:104)\n\nwhere\n\nThe relaxation scheme improves the \ufb02exibility of our model via leveraging a partial structural control,\nwhich decomposes S into a structural component T S\u2021 and a dense component as the residual\nS \u2212 T S\u2021. The new dense component allows S to slightly (controlled by the magnitude of \u03b12) violate\nthe structural constraint, in search of a better performance.\nAlternatively, we could also interpret (Obj) as a way to leverage hierarchical priors on the model.\nThis is speci\ufb01ed by the generative model in the following:\n\n(cid:17)\n\n(cid:19)\n\n(cid:104)\n(cid:104)\n\nX (i)LS(i), I\n\n,\n\nT S\u2021(i),\n\nvec(L) | \u03b11\n\nS(i) | T S\u2021(i), \u03b12\n\nY (i) | X (i), L, S(i)(cid:105) \u223c N(cid:16)\n(cid:18)\n(cid:105) \u223c N\n\n(cid:105) \u223c N\n(cid:105) \u223c g, U \u223c h,\ng \u221d exp(cid:0)\u2212\u03b13 \u00b7(cid:10)\u2206(S\u2021), U(cid:11)(cid:1) \u00b7 \u03b4(cid:0)S\u2021 \u2208 \u03a0(a, b)(cid:1) , h \u221d \u03b4(cid:0)U \u2208 M(cid:1).\nJ = \u2212 log(cid:0)P(L, S, S\u2021, U |X, Y , \u03b11, \u03b12, \u03b13)(cid:1) + const.\n\nS\u2021 | U , \u03b13\n\n1\n\u03b12\n\nI\n\n,\n\nHere g speci\ufb01es an exponential distribution restricted on the set \u03a0(a, b), and h speci\ufb01es a uniform\ndistribution on the set M. With this process, our objective function is equivalent to a Maximum A\nPosterior (MAP) formulation in the following sense:\n\n(cid:18)\n\n(cid:19)\n\n0,\n\nI\n\n,\n\n1\n\u03b11\n\nThis fact gives us an alternative perspective on the relationship between S and S\u2021. With the relaxation\nscheme, the constraints are moved to the mean of the prior distribution of S. This provides S with a\npossibility to activate the overlapping off-diagonal block elements with a moderate variance \u03b12.\n\n2.2 Optimization\nIt is easy to see that J in (Obj) is not a jointly convex function. But fortunately, it is easy to show\nthat the four subproblems with respective to L, S, S\u2021, U are all convex. Instead of directly solving\nthe overall non-convex problem, this fact motivates us to adopt an alternating optimization scheme\nwhere only one of the four parameters is updated each time and the others are \ufb01xed as constants. Now\nwe elaborate the four subproblems, respectively.\nL and S subroutine: Theoretically, both subroutines solve a strongly convex unconstrained quadratic\nprogramming and enjoy a closed-form solution. However, calculating the closed form of the L\nsubproblem suffers from a heavy computational complexity. Instead of adopting the closed form\ndirectly for the L subproblem, we adopt a gradient-based optimizer in our paper. Please see Appendix\nC for more details.\nU subroutine: According to Thm.2, U could be solved from: U = VKV (cid:62)\nK , where VK denotes\neigenvectors associated with the smallest K eigenvalues of \u2206(S\u2021). Denote VK = [f1,\u00b7\u00b7\u00b7 , fk+T ](cid:62),\naccording to Thm.1, when \u03c7S\u2021 = K, up to some orthogonal transformation, fi \u2208 RK\u00d71 becomes an\nindicator vector with only one non-zero entry where [fi]j = 1 only if the corresponding latent/output\ntask i is in group \u03b9A(j). In this way, we see that fi is a strong group indicator. Consequently, we\nname fi as the embedding vector for the latent (output) task.\nS\u2021 subroutine: With U updated with U = VKV (cid:62)\nthis subproblem:\nProposition 1. The S\u2021 subproblem could be reformulated as:\n\nK , the following proposition shows a way to solve\n\nmin\n\nS\u2021\u2208\u03a0(a,b)\n\n\u03d1\n2\n\n(cid:107)S\u2021 \u2212 \u00afS(cid:107)2\n\nF +(cid:10)D, S\u2021(cid:11) ,\n\n(P rimal)\n\nwhere \u03d1 =\n\n\u03b12 \u00b7 T 2\n\n\u03b13\n\n, \u00afS =\n\nS\nT\n\nand Dij = (cid:107)fi \u2212 fk+j(cid:107)2.\n\n5\n\n\fThe proof could be found in Appendix A.1. From Prop.1, we see that the subproblem recovers a\nsmoothed Optimal Transport (OT) [Peyr\u00e9 et al., 2019] problem. More speci\ufb01cally, the calculation of\nthe Wasserstein-2 distance between l and o, based on the spectral embedding.\nSimilar to the recent results [Blondel et al., 2018, Peyr\u00e9 et al., 2019], we can show that the regularized\nOT problem has a close connection with the original OT problem.\nProposition 2. The following properties hold true:\n(a) Denote S\u2021\n\n\u03d10 as the solution of problem (P rimal) when \u03d1 = \u03d10. Then we have :\n\n(cid:40)\n\n(cid:10)D, S\u2021(cid:11)(cid:41)\n\n.\n\nS\u2021\n\n\u03d10\n\n\u03d10\u21920\u2192 argmin\n\nS\u2021\n\nd(S\u2021, \u00afS) : S\u2021 \u2208 argmin\nS\u2021\u2208\u03a0(a,b)\n\n(b) Denote\n\nJOT = min\n\nS\u2021\u2208\u03a0(a,b)\n\n(cid:10)D, S\u2021(cid:11) , JREG = min\n\nS\u2021\u2208\u03a0(a,b)\n\n(\u03d1/2) \u00b7 d(S, \u00afS) +(cid:10)D, S\u2021(cid:11) ,\n\nwe have:\n\n2,(cid:107)b(cid:107)2\n\n\u03d1 \u00b7 max(cid:8)d( \u00afS1 \u2212 a)/T, d( \u00afS(cid:62)1, b)/k(cid:9) \u2264 JREG \u2212 JOT \u2264 \u03d1 \u00b7 (min{(cid:107)a(cid:107)2\n\nOT problem with a small \u03d1. Moreover, if the regularizer (cid:10)\u2206(S\u2021), U(cid:11) is suf\ufb01ciently small, f\n\n2} + (cid:107) \u00afS(cid:107)2\nF ).\nThe proof can be found in Appendix A.2. Prop.2-(a) shows that asymptotically, when \u03d1 \u2192 0, the\nsolution of the regularized OT problem approaches a speci\ufb01c solution of the original OT problem.\nMore speci\ufb01cally, it will pick out an optimal coupling from the OT solution set with the smallest\nregularization term d(S\u2021, \u00afS). From a non-asymptotic perspective, Prop.2-(b) shows how fast this\napproximation will take place. Consequently, we will get a reasonable approximation of the original\napproaches the grouping indicator of a K-connected bipartite graph. At the same time, Dij, the\ndistance between the embedding vectors, approaches zero when li and oj belong to the same group\nindicated by f. Under this circumstance, the transportation cost Di,j is small only if li and oj\nbelong to the same group. By contrast, the inter-group negative transfer is suppressed with a large\ntransportation cost. Moreover, with \u03b12 \u2192 +\u221e, LS(i) \u2192 El|o=i(L). This indicates that the\nconditional expectation also embraces the idea of barycenter projection mapping [Seguy et al., 2018]\nin the sense El|o=i(L) = argminz\nEl|o=i(d(L(i), z)). Under this condition, the task parameter of\noi becomes a barycenter of the latent task embeddings. Finally, we show that this subproblem could\nbe solved ef\ufb01ciently from the dual formulation.\nProposition 3. The dual problem of (P rimal) could be solved from:\n\n(h(cid:63), g(cid:63)) = argmin\n\nh, g\n\n1\n2\u03d1\n\nand the primal solution is given by S\u2021(cid:63)\n\n=\n\n\u00b7(cid:13)(cid:13)(h \u2295 g \u2212 D + \u03d1 \u00afS)+\n(cid:20) h(cid:63) \u2295 g(cid:63) \u2212 D\n\n(cid:13)(cid:13)2\n\n\u03d1\n\n+ \u00afS\n\n.\n\n+\n\nF \u2212 (cid:104)h, a(cid:105) \u2212 (cid:104)g, b(cid:105) ,\n(cid:21)\n\n(Dual)\n\nThe proof can be found in Appendix A.3. From Prop.3, we can recover the primal solution from\n(Dual), which only involves O(k + T ) parameters instead of O(kT ). In this spirit, we \ufb01rst solve\nh(cid:63), g(cid:63) from (Dual) with the L-BFGS [Zhu et al., 1997] method as the optimizer, and then recover\nS\u2021(cid:63) from the dual parameters.\nSummary Our optimization procedure then alternatively solves the four subproblems until a conver-\ngence condition is reached, with irrelevant variables \ufb01xed as their latest version. Moreover, since\nall the subproblems are convex, it is easy to see that the iteration over subproblems then keeps the\noverall loss function non-increasing.\n\n2.3 Theoretical Analysis\n\nIn this section, we present theoretical analysis shedding light on how the hyperparameters \u03b11, \u03b12, \u03b13\naffect our proposed model. Let us start with de\ufb01ning a proper hypothesis space H that covers the\nsolution returned by the optimization algorithm. Recall that (Obj) is non-increasing during the\n\n6\n\n\fF \u2264 2J0/\u03b11, d(S, T S\u2021) \u2264 2J0/\u03b12,(cid:10)\u2206(S\u2021), U(cid:11) \u2264 2J0/\u03b13, for all outputs from the\n\noptimization procedure. This means that if we choose a feasible candidate L0, S0, S\u2021\n0, U0 as the\ninitialization of the algorithm, denote by J0 the corresponding objective function value, we will\nhave (cid:107)L(cid:107)2\noptimization algorithm. This naturally de\ufb01nes a hypothesis class H = H(L, S, S\u2021, U ):\n\n(cid:26)(cid:110) \u02c6Y (i)(X (t)\n(cid:27)\nd(S, T S\u2021) \u2264 \u03be2, (cid:10)\u2206(S\u2021), U(cid:11) \u2264 \u03be3, S\u2021 \u2208 \u03a0(a, b), U \u2208 M\n\ni ) = (LS(i))(cid:62)X (t)\n\nF \u2264 \u03be1,\n\n: (cid:107)L(cid:107)2\n\n(cid:111)\n\nti\n\ni\n\n,\n\nH(L, S, S\u2021, U ) =\n\nas: \u02c6R(L, S) =(cid:80)T\n\ni=1 J (i)/T,R(L, S) =(cid:80)T\n\nwhere \u03be1 = 2J0/\u03b11, \u03be2 = 2J0/\u03b12, \u03be3 = 2J0/\u03b13. Now we are ready to represent the theoretical\nresults based on H.\nAs the \ufb01rst step, we explore how well a model learned from H generalizes to the overall population.\nThe empirical risk \u02c6R(L, S) over the observed dataset and the task-averaged risk R(L, S) are given\n, Y (i)\nare sampled from \u00b5i. We then bound the term \u2206 = R(L, S) \u2212 \u02c6R(L, S). Following the spirit of\n[Maurer et al., 2016], we have the following bound for the hypothesis space:\nTheorem 3. Suppose that n1 = n2 \u00b7\u00b7\u00b7 = nT = n, the loss function (cid:96)(y,\u00b7) : \u02c6y (cid:55)\u2192 [0, M ], (cid:96)(y,\u00b7) is\nM \u03c6-Lipschitz continuous, and \u2200(L, S) are chosen from H, the following bound holds with possibility\nat least 1 \u2212 \u03b4 :\n\n(cid:2)J (i)(cid:3)/T, where the training data X (i)\n\nE\u00b5i\n\ni=1\n\nj\n\nj\n\n(cid:32)\n\n\u2264 \u03ba1\u03c6\u2135\n\n\u03be1k(cid:12)(cid:12)(cid:12)(cid:12)COV (X)(cid:12)(cid:12)(cid:12)(cid:12)1\n\n(cid:33)1/2\nde\ufb01ned as (cid:104)COV (X)u, v(cid:105) = (1/nT ) \u00b7(cid:80)\n\n\u2206\nM\nwhere \u03ba1 and \u03ba2 are two universal constants, \u2135 =\n\n+ 2\u03ba2\u03c6\u2135 \u00b7\n\n(cid:68)\n\nnT\n\n(cid:32)\n\n\u221a\n\nu, X (t)\n\ni\n\nti\n\n\u03be1\n\n(cid:12)(cid:12)(cid:12)(cid:12)COV (X)(cid:12)(cid:12)(cid:12)(cid:12)\u221e\n(cid:69)\n(cid:69)(cid:68)\n\nn\n\nX (t)\n\ni\n\n, v\n\n.\n\n(cid:33)1/2\n\n(cid:32)\n\n+\n\n9 ln (2/\u03b4)\n\n2nT\n\n(cid:33)1/2\n\n,\n\n\u03be2 + 1. COV (X) is the covariance operator\n\nThe proof can be found in Appendix B.3. With the sample complexity given, we narrow our focus to\nthe problem that how \u03be2, \u03be3 bene\ufb01t the hypothesis space. The following theorem shows that \u03be2, \u03be3\ncontrol the spectral properties of S and S\u2021.\nTheorem 4 (Spectral Properties of S). Let k \u2264 T , de\ufb01ne the SVD of S as S = P \u039bQ(cid:62), where P =\n[p1,\u00b7\u00b7\u00b7 , pk], Q = [q1,\u00b7\u00b7\u00b7 , qk] are left and right singular vectors respectively, \u039b = diag(\u03c3i (S))\nwith \u03c31 (S) \u2265 \u03c32 (S)\u00b7\u00b7\u00b7 \u2265 \u03c3k (S) \u2265 0. The following properties hold for all S \u2208 H:\n(cid:80)N\n(a) The bottom K eigenvalues of\ni=N\u2212K+1 \u03bbi(\u2206(S)) \u2264 T \u03be3 +\n\n\u221a\nthe graph Laplacian induced by S is bounded by :\nT ), where \u2206(S) is obtained from\n\n\u221a\n\u03be2K(\n\nk +\n\n2 +\n\n\u221a\n\n\u221a\n\nreplacing S\u2021 in \u2206(S\u2021) with S .\n\n(b) De\ufb01ne M(P ) = Span{p1,\u00b7\u00b7\u00b7 pK} and M(Q) = Span{q1,\u00b7\u00b7\u00b7 qK}, if \u03be3 +\n\nand rank(S) \u2265 K, then we have:\n\n\u03c31 (S)\n\u03c3K (S)\n\n=\n\nmax\n\nx,y\u2208M(P )\n\n(cid:107)x(cid:107)2=1,(cid:107)y(cid:107)2=1\n\n(cid:107)Sx(cid:107)2\n(cid:107)Sy(cid:107)2\n\n=\n\nmax\n\nx,y\u2208M(Q)\n\n(cid:107)x(cid:107)2=1,(cid:107)y(cid:107)2=1\n\n(cid:107)S(cid:62)x(cid:107)2\n(cid:107)S(cid:62)y(cid:107)2\n\n\u2264 1\nk\n\n\u00b7\n\n\u221a\n\n\u03be2/T < 1/T\n\u221a\n1 \u2212 T \u03be3 \u2212 \u221a\n\nT + k\n\n\u03be2\n\n\u03be2\n\n.\n\nThe proof can be found in Appendix B.4. Thm.4.(a) implies that, with a small \u03be2 and \u03be3, the grouping\n(cid:80)N\nstructure of S could also be controlled, even though the structural penalty is not directly exhibited\non S. More speci\ufb01cally, if we pick \u03be3 = O(T \u22123/2) and \u03be2 = O(1/T 2), we can reach a small\ni=N\u2212K+1 \u03bbi(\u2206(S)) with O(T \u22121/2) if T >> k. Thm.4.(b) states that shrinking \u03be2, \u03be3 helps to\nremain a smaller numerical perturbation of Sx (S(cid:62)x) over the principle subspaces, i.e., the subspaces\nspanned by principle left/right singular vectors. Besides numerical bene\ufb01ts, the following theorem\nshows that \u03be3 guarantees good structure recovery in a non-asymptotic manner.\nTheorem 5. Assume that k \u2264 T , and that the ground-truth grouping is indicated by G = {(i, j) :\n1 \u2264 i \u2264 k, 1 \u2264 j \u2264 T, li and oi are in the same group} with K disjoint groups. Moreover, for a\nmatrix W , denote supp(W ) as {(i, j) : Wi,j (cid:54)= 0}. For all S\u2021 obtained from the space H such\n\n7\n\n\fthat \u03bbK+1(\u2206(S\u2021)) > \u03bbK(\u2206(S\u2021)) > 0 and inf S(cid:63)\u2208\u03a0(a,b),Supp(S(cid:63))=G ||\u2206(S\u2021) \u2212 \u2206(S(cid:63))||F \u2264 \u0001, we\nhave:\n\n(cid:107)S\u2021suppc(cid:107)1 \u2264 1\n2\n\nk + 6\n\nT \u00b7 \u0001\n\u03bbK+1(\u2206(S\u2021))\n\n\u03be3 +\n\n\u00b7 \u03be3 +\n\n\u221a\n\n4\n\nkT \u03bbK+1(\u2206(S\u2021))\n\n(cid:113) 2\n\n\u00b7(cid:16)\n\n(cid:17) \u2264 1\n\n2\n\nS\u2021suppc\ndiagonal structure in the sense that S\u2021suppc\notherwise.\n\ndenotes the projection of S\u2021 onto the complement of the support set of the expected block-\n\ni,j = 0 if i and j belong to the same group, S\u2021suppc\n\ni,j = S\u2021\n\ni,j\n\nThe proof can be found in Appendix B.5. Under the assumptions of Thm.5, a smaller \u03be3 embraces a\nbetter recovery of the true block-diagonal structure. More speci\ufb01cally, it shrinks (cid:107)S\u2021suppc(cid:107)1, i.e., the\noverall magnitude of the elements that violate the true grouping structure. Picking \u03be3 = O(T \u22121/4) and\nT ), if \u03bbK+1(\u2206(S\u2021)) = O(1/k), under the worst case, we have (cid:107)S\u2021suppc(cid:107)1 = O(T \u22121/4).\nk = O(\n\n\u221a\n\n3 Empirical Study\n\n3.1 Experiment Settings\n\nFor all the experiments, hyper-parameters are tuned based on the training and validation set, and the\nresults on the test set are recorded. The experiments are done with 5 repetitions for each involved\nalgorithm. Except for the Simulated Dataset, the train/valid/test ratio is \ufb01xed as 70%/15%/15%.\nFor regression datasets (Simulated dataset and School), we adopt the overall rmse on all samples\nas the evaluation metric. For classi\ufb01cation datasets, we adopt the average of task-wise AUC as\nthe evaluation metric. For regression problem, J (\u00b7) in GBDSP is chosen as the square-loss. For\nclassi\ufb01cation problem, J (\u00b7) in GBDSP is chosen as the squared surrogate loss for AUC [Gao et al.,\n2016]. All the experiments are run with MATLAB 2016b and a Ubuntu 16.04 system. In the\nnext subsection, we show our experimental results on a simulated dataset. More experiments for\nreal-world datasets could be found in Appendix D.\n\n3.2 Simulated Dataset\n\nL2 \u2208 R300\u00d720, L3 \u2208 R300\u00d710, L4 \u2208 R300\u00d730, L5 \u2208 R300\u00d720, and that S = (cid:76)5\n\nTo test the effectiveness of GBDSP we generate a simple simulated annotation dataset with T = 150\nsimulated tasks, where the dataset is produced according to the assumption in our model. For\neach task, 500 samples are generated with d = 300 features such that X (i) \u2208 R500\u00d7300 and\nk \u223c N (0, I80). Speci\ufb01cally, we generate latent task representations with k = 100 basis. This\nx(i)\nyields an L \u2208 R300\u00d7100 and an S \u2208 R100\u00d7150. To leverage the group structure, we split the\nlatent tasks and output tasks into 5 groups, in a way that L = [L1,\u00b7\u00b7\u00b7 , L5], where L1 \u2208 R300\u00d720,\ni=1 Si where\nS1 \u2208 R20\u00d730, S2 \u2208 R20\u00d730, S3 \u2208 R10\u00d715, S4 \u2208 R30\u00d745, S5 \u2208 R20\u00d730. For the i-th group, the\nelements in Li is sampled i.i.d from N (mi, 0.01), where mi = 5i. Si is generated as Si = si1,\ni.e., every element in Si shares the same value. Moreover, si is calculated from the constraint that\nS \u2208 \u03a0(a, b). Then the task parameter is generated as W = LS. For each task, the outputs are\ngenerated as Y (i) = X (i)(W (i) + \u0001(i)), where \u0001(i) \u2208 R200\u00d71, and \u0001(i) \u223c N (0, 0.12I500). Based on\nthis setting, we compare GBDSP with GOMTL in the simulation dataset to see how the block-diagonal\nstructure bene\ufb01ts the latent task representation based MTL.\nFirst, we show how well could GOMTL and GBDSP recover the block-diagonal structure. We\ncompare S obtained from GOMTL and S\u2021 obtained from GBDSP, with the initial value of \u02c6L set as\n\u02c6L = L + N (0, 0.05I). As shown in Fig.1(a)- Fig.1(c), GBDSP recovers a much clearer structure\nthan GOMTL. Moreover, we provide a closer look at the embedding vectors in GBDSP. To do this,\nwe visualize the spectral embeddings f in a 3d space with t-SNE [Maaten and Hinton, 2008], which\nis shown in Fig.2(c). In this \ufb01gure, the points with different colors represent latent/output tasks in\ndifferent groups. Clearly, we see that the clusters are well-separated in the spectral embedding space,\nwhich again veri\ufb01es the grouping power of the proposed method.\nNext, we check whether GBDSP could improve the performance with a structural LATM.\nIn Fig.2, we plot the performance of GOMTL and GBDSP with different training set ratio\n\n8\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 1: Visualizations over the Simulated Dataset. (a)-(c) provide structural comparisons over the\nLATM: (a) shows The true LATM; (b) shows the LATM recovered by GOMTL (c) shows the LATM\nrecovered by GBDSP.\n\n(a)\n\n(b)\n\n(c)\n\nFigure 2: (a-b) Performance curve with different training data ratio: (a) GOMTL (b) GBDSP(c)\nshows the spectral group embedding of o and l in GBDSP.\n\n(0.35, 0.45, 0.55, 0.65, 0.75). The corresponding results show that GBDSP consistently provides\na better performance with a smaller variance, which supports the idea that learning a block-diagonal\nLATM structure improves the performance.\n\n4 Conclusion\n\nTo simultaneously leverage a latent task representation and alleviate the inter-group negative transfer\nissue, we develop a novel MTL method GBDSP, which simultaneously separates the latent tasks and\nout tasks into a given number of groups. Moreover, we adopt an optimization method to solve the\nmodel parameters, which gives an alternative update scheme for our multi-convex objective function.\nThe solution produced by the optimization method shows a close connection between our method\nand the optimal transport problem, which brings new insight into how negative transfer could be\nprevented across latent tasks and output tasks. Furthermore, we provide theoretical analysis on the\nspectral properties of the model parameters. Empirical results on the simulated dataset show that\nGBDSP could roughly recover the correct grouping structure with good performance, and results\non the real-world datasets further verify the effectiveness of our proposed model on the problem of\npersonalized attribute prediction.\n\n5 Acknowledgements\n\nThis work was supported in part by National Natural Science Foundation of China: 61620106009,\nU1636214, 61836002, 61861166002, 61672514 and 61976202, in part by National Basic Research\nProgram of China (973 Program): 2015CB351800, in part by Key Research Program of Frontier\nSciences, CAS: QYZDJ-SSW-SYS013, in part by the Strategic Priority Research Program of Chinese\nAcademy of Sciences, Grant No. XDB28000000, in part by the Science and Technology Development\nFund of Macau SAR (File no. 0001/2018/AFJ) Joint Scienti\ufb01c Research Project, in part by Beijing\nNatural Science Foundation (No. 61971016, L182057, and 4182079), in part by Peng Cheng\n\n9\n\n0.360.370.380.390.4040506070ratiormse0.10250.10500.10750.11000.112540506070ratiormse\fLaboratory Project of Guangdong Province PCL2018KP004, and in part by Youth Innovation\nPromotion Association CAS.\n\nReferences\nF. Alizadeh. Interior point methods in semide\ufb01nite programming with applications to combinatorial\n\noptimization. SIAM journal on Optimization, 5(1):13\u201351, 1995.\n\nG. P. andJames Hays. SUN attribute database: Discovering, annotating, and recognizing scene\n\nattributes. In CVPR, pages 2751\u20132758, 2012.\n\nA. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73\n\n(3):243\u2013272, 2008a.\n\nA. Argyriou, M. Pontil, Y. Ying, and C. A. Micchelli. A spectral regularization framework for\n\nmulti-task structure learning. In Neurips, pages 25\u201332, 2008b.\n\nA. Barzilai and K. Crammer. Convex multi-task learning by clustering. In Arti\ufb01cial Intelligence and\n\nStatistics, pages 65\u201373, 2015.\n\nM. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In AISTATS, pages\n\n880\u2013889, 2018.\n\nJ. Cao, Y. Li, and Z. Zhang. Partially shared multi-task convolutional neural network with local\n\nconstraint for face attribute learning. In CVPR, pages 4290\u20134299, 2018.\n\nR. Caruana. Multitask learnings. Machine Learning, 28(1):41\u201375, 1997.\n\nK. Crammer and Y. Mansour. Learning multiple tasks using shared hypotheses. In Neurips, pages\n\n1475\u20131483, 2012.\n\nJ. Frecon, S. Salzo, and M. Pontil. Bilevel learning of the group lasso structure. In Neurips, pages\n\n8301\u20138311, 2018.\n\nW. Gao, L. Wang, R. Jin, S. Zhu, and Z. Zhou. One-pass AUC optimization. AI, 236:1\u201329, 2016.\n\nG. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU press, 2012.\n\nL. Han and Y. Zhang. Multi-stage multi-task learning with reduced rank. In AAAI, pages 1638\u20131644,\n\n2016.\n\nT. Heskes. Solving a huge number of similar tasks: A combination of multi-task learning and a\n\nhierarchical bayesian approach. In ICML, pages 233\u2013241, 1998.\n\nR. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012.\n\nL. Jacob, J.-p. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Neurips,\n\npages 745\u2013752, 2009.\n\nJ.-Y. Jeong and C.-H. Jun. Variable selection and task grouping for multi-task learning. In KDD,\n\npages 1589\u20131598, 2018.\n\nY. Jiang, Z. Yang, Q. Xu, X. Cao, and Q. Huang. When to learn what: Deep cognitive subspace\n\nclustering. In ACM MM, pages 718\u2013726, 2018.\n\nY. Jiang, Q. Xu, Z. Yang, X. Cao, and Q. Huang. Duet robust deep subspace clustering. In ACM MM,\n\n2019.\n\nZ. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In\n\nICML, pages 521\u2013528, 2011.\n\nA. Kovashka and K. Grauman. Discovering attribute shades of meaning with the crowd. IJCV, 114\n\n(1):56\u201373, 2015.\n\n10\n\n\fA. Kovashka, D. Parikh, and K. Grauman. Whittlesearch: Image search with relative attribute\n\nfeedback. In CVPR, pages 2973\u20132980, 2012.\n\nM. Kshirsagar, E. Yang, and A. C. Lozano. Learning task clusters via sparsity grouped multitask\n\nlearning. In ECML PKDD, pages 673\u2013689, 2017.\n\nA. Kumar and H. D. III. Learning task grouping and overlap in multi-task learning. In ICML, pages\n\n1723\u20131730, 2012.\n\nG. Lee, E. Yang, and S. Hwang. Asymmetric multi-task learning based on task relatedness and loss.\n\nIn ICML, pages 230\u2013238, 2016.\n\nH. Lee, E. Yang, and S. J. Hwang. Deep asymmetric multi-task feature learning. In ICML, pages\n\n2962\u20132970, 2018.\n\nY. Li, K. Fu, Z. Wang, C. Shahabi, J. Ye, and Y. Liu. Multi-task representation learning for travel\n\ntime estimation. In KDD, pages 1695\u20131704, 2018.\n\nY. Lin, S. Yang, V. Stoyanov, and H. Ji. A multi-lingual multi-task architecture for low-resource\n\nsequence labeling. In ACL, pages 799\u2013809, 2018.\n\nP. Liu, X. Qiu, and X. Huang. Adversarial multi-task learning for text classi\ufb01cation. In ACL, pages\n\n1\u201310, 2017.\n\nS. Liu and S. J. Pan. Adaptive group sparse multi-task learning via trace lasso. In IJCAI, pages\n\n2358\u20132364, 2017.\n\nC. Lu, J. Feng, Z. Lin, T. Mei, and S. Yan. Subspace clustering by block diagonal representation.\n\nIEEE transactions on pattern analysis and machine intelligence, 41(2):487\u2013501, 2019.\n\nL. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9\n\n(Nov):2579\u20132605, 2008.\n\nA. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In\n\nICML, pages 343\u2013351, 2013.\n\nA. Maurer, M. Pontil, and B. Romera-Paredes. The bene\ufb01t of multitask representation learning. The\n\nJournal of Machine Learning Research, 17(1):2853\u20132884, 2016.\n\nA. M. McDonald, M. Pontil, and D. Stamos. Spectral k-support norm regularization. In Neurips,\n\npages 3644\u20133652, 2014.\n\nF. Nie, Z. Hu, and X. Li. Calibrated multi-task learning. In KDD, pages 2012\u20132021, 2018.\n\nM. L. Overton and R. S. Womersley. On the sum of the largest eigenvalues of a symmetric matrix.\n\nSIAM Journal on Matrix Analysis and Applications, 13(1):41\u201345, 1992a.\n\nM. L. Overton and R. S. Womersley. On the sum of the largest eigenvalues of a symmetric matrix.\n\nSIAM Journal on Matrix Analysis and Applications, 13(1):41\u201345, 1992b.\n\nG. Peyr\u00e9, M. Cuturi, et al. Computational optimal transport. Foundations and Trends R(cid:13) in Machine\n\nLearning, 11(5-6):355\u2013607, 2019.\n\nV. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large-scale optimal\n\ntransport and mapping estimation. In (ICLR), 2018.\n\nC. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture\n\nfor computer vision. In CVPR, pages 2818\u20132826, 2016.\n\nU. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395\u2013416, 2007.\n\nY. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning-a comprehensive evaluation of\nthe good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence,\n2018.\n\n11\n\n\fD. Xu, W. Ouyang, X. Wang, and N. Sebe. Pad-net: Multi-tasks guided prediction-and-distillation\n\nnetwork for simultaneous depth estimation and scene parsing. In CVPR, pages 675\u2013684, 2018.\n\nL. Xu, A. Huang, J. Chen, and E. Chen. Exploiting task-feature co-clusters in multi-task learning. In\n\nAAAI, pages 1931\u20131937, 2015.\n\nZ. Yang, Q. Xu, X. Cao, and Q. Huang. From common to special: When multi-attribute learning\n\nmeets personalized opinions. In AAAI, pages 515\u2013522, 2018.\n\nZ. Yang, Q. Xu, X. Cao, and Q. Huang. Learning personalized attribute preference via multi-task\n\nAUC optimization. In AAAI, pages 5660\u20135667, 2019a.\n\nZ. Yang, Q. Xu, W. Zhang, X. Cao, and Q. Huang. Split multiplicative multi-view subspace clustering.\n\nIEEE Transactions on Image Processing, 2019b.\n\nZ. Yin and Y. Shen. On the dimensionality of word embedding. In NIPS, pages 887\u2013898, 2018.\n\nY. Yu, T. Wang, and R. J. Samworth. A useful variant of the davis\u2013kahan theorem for statisticians.\n\nBiometrika, 102(2):315\u2013323, 2014.\n\nW. Zhong and J. T. Kwok. Convex multitask learning with \ufb02exible task clusters. In ICML, pages\n\n483\u2013490, 2012.\n\nJ. Zhou, J. Chen, and J. Ye. Clustered multi-task learning via alternating structure optimization. In\n\nNeurips, pages 702\u2013710, 2011a.\n\nJ. Zhou, J. Chen, and J. Ye. Malsar: Multi-task learning via structural regularization. Arizona State\n\nUniversity, 21, 2011b.\n\nC. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-bfgs-b: Fortran subroutines for large-\nscale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23\n(4):550\u2013560, 1997.\n\n12\n\n\f", "award": [], "sourceid": 3136, "authors": [{"given_name": "Zhiyong", "family_name": "Yang", "institution": "SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences; SCS, University of Chinese Academy of Sciences"}, {"given_name": "Qianqian", "family_name": "Xu", "institution": "Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences"}, {"given_name": "Yangbangyan", "family_name": "Jiang", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"given_name": "Xiaochun", "family_name": "Cao", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"given_name": "Qingming", "family_name": "Huang", "institution": "University of Chinese Academy of Sciences"}]}