{"title": "Learning to Multitask", "book": "Advances in Neural Information Processing Systems", "page_first": 5771, "page_last": 5782, "abstract": "Multitask learning has shown promising performance in many applications and many multitask models have been proposed. In order to identify an effective multitask model for a given multitask problem, we propose a learning framework called Learning to MultiTask (L2MT). To achieve the goal, L2MT exploits historical multitask experience which is organized as a training set consisting of several tuples, each of which contains a multitask problem with multiple tasks, a multitask model, and the relative test error. Based on such training set, L2MT first uses a proposed layerwise graph neural network to learn task embeddings for all the tasks in a multitask problem and then learns an estimation function to estimate the relative test error based on task embeddings and the representation of the multitask model based on a unified formulation. Given a new multitask problem, the estimation function is used to identify a suitable multitask model. Experiments on benchmark datasets show the effectiveness of the proposed L2MT framework.", "full_text": "Learning to Multitask\n\nYu Zhang1, Ying Wei2, Qiang Yang1\n\nyu.zhang.ust@gmail.com judywei@tencent.com qyang@cse.ust.hk\n\n1HKUST\n\n2Tencent AI Lab\n\nAbstract\n\nMultitask learning has shown promising performance in many applications and\nmany multitask models have been proposed. In order to identify an effective multi-\ntask model for a given multitask problem, we propose a learning framework called\nLearning to MultiTask (L2MT). To achieve the goal, L2MT exploits historical mul-\ntitask experience which is organized as a training set consisting of several tuples,\neach of which contains a multitask problem with multiple tasks, a multitask model,\nand the relative test error. 
Based on such training set, L2MT \ufb01rst uses a proposed\nlayerwise graph neural network to learn task embeddings for all the tasks in a\nmultitask problem and then learns an estimation function to estimate the relative\ntest error based on task embeddings and the representation of the multitask model\nbased on a uni\ufb01ed formulation. Given a new multitask problem, the estimation\nfunction is used to identify a suitable multitask model. Experiments on benchmark\ndatasets show the effectiveness of the proposed L2MT framework.\n\n1\n\nIntroduction\n\nMultitask learning [9] aims to leverage useful information contained in multiple tasks to help improve\nthe generalization performance of those tasks. In the past decades, many multitask models have\nbeen proposed. According to a recent survey [38], these models can be classi\ufb01ed into two main\ncategories: feature-based approach and parameter-based approach. The feature-based approach\nuses data features as the media to share knowledge among tasks and it usually learns a common\nfeature representation for all the tasks. This approach can be further divided into two categories:\nshallow approach [9, 2, 43] and deep approach [25]. Different from the feature-based approach, the\nparameter-based approach links different tasks by placing regularizers or Bayesian priors on model\nparameters to achieve knowledge transfer among tasks. This approach can be further classi\ufb01ed into\nfour categories: low-rank approach [1, 28, 17], task clustering approach [18, 20, 15], task relation\nlearning approach [39, 40, 36, 35, 41, 42, 21, 37], and decomposition approach [10, 19, 44, 16].\nGiven so many multitask models, one important issue is how to choose a good model among them\nfor a given multitask problem. One solution is to do model selection, that is, using cross validation or\nits variants. 
One limitation of this solution is that it is computationally heavy, since each candidate model needs to be trained multiple times.
In this paper, we propose a framework called Learning to MultiTask (L2MT) to solve this issue in a learning-based manner. The main idea of L2MT is to exploit historical multitask experience to learn how to choose a suitable multitask model for a new multitask problem. To achieve that, the historical multitask experience is represented as a training set consisting of tuples, each of which has three entries: a multitask problem, a multitask model, and the relative test error, which equals the ratio of the average test error of the multitask model on the multitask problem to that of the single-task learning model. Based on this training set, we propose an end-to-end approach to learn the mapping from both the multitask problem and the multitask model to the relative test error, where we need to determine the representations of the multitask problem and the multitask model. First, a Layerwise Graph Neural Network (LGNN) is proposed to learn the task embedding as the representation of each task in a multitask problem; by aggregating all the task embeddings, the task embedding matrix is used as the representation of the multitask problem. For multitask models which admit a unified formulation, task covariance matrices are used as their representations since they play an important role in revealing pairwise task relations. Then both representations of the multitask problem and the multitask model are encoded in an estimation function to estimate the relative test error.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
For a new multitask problem, we can obtain the task embedding matrix via the LGNN learned on the training set, and then, in order to achieve a low relative test error, we minimize the estimation function to learn the task covariance matrix as well as the corresponding multitask model. Experiments on benchmark datasets show the effectiveness of the proposed L2MT framework.

2 A Unified Formulation for Multitask Learning

Before presenting the L2MT framework, in this section we give a unified formulation for multitask learning by extending that proposed in the survey [38].
Suppose that we are given a multitask problem consisting of m tasks {T_i}_{i=1}^m. For task T_i, its training dataset contains n_i data points {x_{i,j}}_{j=1}^{n_i} as well as their labels {y_{i,j}}_{j=1}^{n_i}, where x_{i,j} ∈ R^d and y_{i,j} ∈ {−1, 1} for classification problems. The learning function for task T_i is defined as f_i(x) = w_i^T x + b_i. A regularized formulation to learn task relations, which can unify several representative models [14, 13, 18, 28, 36, 39, 29, 42, 37], is formulated as

\min_{\mathbf{W},\mathbf{b},\Omega\succeq 0}\ \sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i} l(\mathbf{w}_i^T\mathbf{x}_{i,j}+b_i,\ y_{i,j}) + \frac{\lambda_1}{2}\mathrm{tr}(\mathbf{W}\Omega^{-1}\mathbf{W}^T) + \lambda_2\, g(\Omega),   (1)

where W = (w_1, . . . , w_m), b = (b_1, . . . , b_m)^T, l(·,·) denotes a loss function such as the cross-entropy loss, Ω ⪰ 0 means that Ω is positive semidefinite (PSD), tr(·) denotes the trace of a square matrix, Ω^{−1} denotes the inverse or pseudoinverse of a square matrix, and λ_1, λ_2 are regularization hyperparameters to control the trade-off among the three terms in problem (1). The first term in problem (1) measures the empirical loss. The second term is a regularizer on W based on Ω. According to [39], Ω, the task covariance matrix, is used to describe pairwise task relations.
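As a concrete illustration of problem (1), the following is a minimal numpy sketch that evaluates the objective for given parameters, assuming the cross-entropy (logistic) loss for labels in {−1, +1}; the function names (`logistic_loss`, `multitask_objective`) are illustrative and not from the paper.

```python
import numpy as np

def logistic_loss(z, y):
    """Cross-entropy (logistic) loss for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * z))

def multitask_objective(W, b, Omega, tasks, lam1, lam2, g):
    """Objective of problem (1).
    W: d x m parameter matrix, b: length-m biases, Omega: m x m PSD task covariance,
    tasks: list of (X_i, y_i) with X_i of shape (n_i, d), g: regularizer on Omega."""
    emp = 0.0
    for i, (X, y) in enumerate(tasks):
        emp += logistic_loss(X @ W[:, i] + b[i], y).mean()  # (1/n_i) sum_j l(...)
    reg = 0.5 * lam1 * np.trace(W @ np.linalg.pinv(Omega) @ W.T)
    return emp + reg + lam2 * g(Omega)
```

For instance, passing `g=np.trace` corresponds to the choice g(Ω) = tr(Ω) discussed below (Theorem 1 with r = 1).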
The function g(·) in problem (1) can be considered as a regularizer on Ω to characterize its structure. The survey [38] has shown that the models proposed in [14, 13, 18, 28, 36, 39, 29, 42, 37] can be formulated as problem (1) with different g(·)'s, where the detailed connections between these works and problem (1) are put in the supplementary material for completeness. In the following, we propose two main extensions to enrich problem (1).
Firstly, in Theorem 1, we prove that the Schatten norm regularization is an instance of problem (1). As its special case, the trace norm is widely used in multitask learning [28] as a regularizer to capture the low-rank structure in W. Here we generalize it to the Schatten a-norm denoted by |||·|||_a for a > 0, where |||·|||_1 is just the trace norm. To see the relation between the Schatten norm regularization and problem (1), we prove the following theorem with the proof in the supplementary material.

Theorem 1 When g(Ω) = tr(Ω^r) for any given positive scalar r, by defining \hat{r} = 2r/(r+1) and \lambda_r = (1 + 1/r)\left(\lambda_1^r \lambda_2\, r / 2^r\right)^{1/(r+1)}, problem (1) reduces to the following problem:

\min_{\mathbf{W},\mathbf{b}}\ \sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i} l(\mathbf{w}_i^T\mathbf{x}_{i,j}+b_i,\ y_{i,j}) + \lambda_r |||\mathbf{W}|||_{\hat{r}}^{\hat{r}}.   (2)

When r = 1, Theorem 1 implies that problem (1) with g(Ω) = tr(Ω) is equivalent to the trace norm regularization.
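The Schatten a-norm used above can be computed from singular values. A small sketch (not from the paper's code; the name `schatten_norm` is illustrative):

```python
import numpy as np

def schatten_norm(W, a):
    """|||W|||_a = (sum_i sigma_i(W)^a)^(1/a) over the singular values of W.
    a = 1 gives the trace (nuclear) norm; a = 2 gives the Frobenius norm."""
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** a).sum() ** (1.0 / a)
```

This makes the r = 1 case of Theorem 1 easy to check numerically: `schatten_norm(W, 1)` coincides with the nuclear norm.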
Even though r can be any positive scalar, problem (2) corresponds to the Schatten \hat{r}-norm regularization with \hat{r} = 2r/(r+1) < 2, and \hat{r} ≥ 1 when r ≥ 1.
Secondly, in the following theorem, we prove that the squared Schatten norm regularization is an instance of problem (1).

Theorem 2 By defining

g(\Omega) = \begin{cases} 0 & \text{if } \mathrm{tr}(\Omega^r) \le 1 \\ +\infty & \text{otherwise,} \end{cases}

which is an extended real-valued function and corresponds to a constraint on Ω, for any given positive scalar r and \hat{r} = 2r/(r+1), problem (1) is equivalent to the following problem:

\min_{\mathbf{W},\mathbf{b}}\ \sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i} l(\mathbf{w}_i^T\mathbf{x}_{i,j}+b_i,\ y_{i,j}) + \lambda_1 |||\mathbf{W}|||_{\hat{r}}^{2}.

The aforementioned multitask models with different instantiations of g(·) are summarized in Table 1 in the supplementary material. Based on the above discussion, we can see that problem (1) can embrace many or even infinitely many multitask models, as r in the (squared) Schatten norm regularization can take an infinite number of values. Given a multitask problem and so many candidate models, the top priority is to decide which model to use. One solution is to try all possible models to find the best one, but it is computationally heavy. In the following section, we will give our solution: L2MT.

3 Learning to Multitask

In this section, we present the proposed L2MT framework and its associated solution.

3.1 The Framework

Recall that the aim of the proposed L2MT framework shown in Figure 1 is to determine a suitable multitask model for a test multitask problem by exploiting historical multitask experience. To achieve this, as a representation of historical multitask experience, the training set of the L2MT framework consists of q tuples {(S_i, M_i, o_i)}_{i=1}^q. S denotes the space of multitask problems and S_i ∈ S denotes a multitask problem. Each multitask problem S_i consists of m_i learning tasks, each of which is associated with a training dataset, a validation dataset, and a test dataset. As we will see later, the jth task in S_i is represented as a task embedding e_j^i ∈ R^{\hat{d}} obtained by applying the proposed LGNN model to its training dataset; by aggregating the task embeddings of all the tasks, the task embedding matrix E_i = (e_1^i, . . . , e_{m_i}^i) is treated as the representation of the multitask problem S_i. M denotes the space of multitask models and M_i ∈ M denotes a specific multitask model which is trained on the training datasets in S_i. M_i can be a discrete index over candidate multitask models or a continuous representation based on model parameters. In the sequel, based on the unified formulation presented in the previous section, M_i is represented by the task covariance matrix Ω_i and hence M is continuous. One reason to choose the task covariance matrix as the representation of a multitask model is that the task covariance matrix is core to problem (1): once it has been determined, the model parameters W and b can easily be obtained. One benefit of choosing a continuous M is that we can learn a new model beyond all the candidate models. o_i ∈ R denotes the relative test error ε_MTL/ε_STL, where ε_MTL denotes the average test error of the multitask model M_i on the test datasets of the multiple tasks in S_i and ε_STL denotes the average test error of a single-task learning (STL) model which is trained on each task independently. Hence, the training process of the L2MT framework is to learn an estimation function f(·,·) to map from {(S_i, M_i)}_{i=1}^q, or concretely {(E_i, Ω_i)}_{i=1}^q, to {υ(o_i)}_{i=1}^q, where υ(·), a link function, transforms o_i to make the estimation easier and will be introduced later.
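The target o_i of each training tuple can be computed directly from per-task test errors. A minimal sketch (the helper name is an assumption, not the authors' code):

```python
def relative_test_error(mtl_errors, stl_errors):
    """o_i = (average MTL test error over the tasks of S_i) /
             (average STL test error over the same tasks)."""
    eps_mtl = sum(mtl_errors) / len(mtl_errors)
    eps_stl = sum(stl_errors) / len(stl_errors)
    return eps_mtl / eps_stl
```

A value below 1 indicates that the multitask model M_i beats single-task learning on S_i.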
Moreover, based on problem (1), we can see \u2126 is a function of\nhyperparameters \u03bb1 and \u03bb2 and so is the relative test error. Here we make an assumption that \u2126 is\nsuf\ufb01cient to estimate the relative test error. This assumption empirically works very well and it can\nsimplify the design of the estimation function. Moreover, under this assumption, we do not need to\n\ufb01nd the best hyperparameters for each training tuple, which can save a lot of computational cost.\nIn the test process, suppose that we are given a test multitask problem \u02dcS which is not in the training\nset. Each task in \u02dcS also has a training dataset, a validation dataset and a test dataset. To obtain\nthe relative test error \u02dco as low as possible, we resort to minimizing \u03b31f ( \u02dcE, \u2126) with respect to \u2126 to\n\ufb01nd the optimal task covariance matrix \u02dc\u2126, where \u02dcE denotes the task embedding matrix for the test\nmultitask problem and \u03b31 is a parameter in the link function \u03c5(\u00b7) to control its monotonic property,\nand then by incorporating \u02dc\u2126 into problem (1) without manually specifying g(\u00b7), we can learn the\noptimal \u02dcW and \u02dcb which are used to make prediction on the test datasets.\nThere are some learning paradigms related to the L2MT framework, including multitask learning,\ntransfer learning [27], lifelong learning [12], and learning to transfer [33]. However, there exist\nsigni\ufb01cant differences between the L2MT framework and these related paradigms. In multitask\nlearning, the training set contains only one multitask problem, i.e., S1, and its goal is to learn model\nparameters in a given multitask model. The difference between transfer learning and L2MT is\nsimilar to that between multitask learning and L2MT. Lifelong learning can be viewed as online\ntransfer/multitask learning and hence it is different from L2MT. 
Different from learning to transfer, which is designed for transfer learning and relies on handcrafted task features, the proposed L2MT is end-to-end, based on neural networks, and designed for multitask learning.

Figure 1: An illustration of the L2MT framework consisting of two stages. The training stage is to learn the estimation function f(·,·) to approximate the relative test error based on multitask problems and multitask models, and the test stage is to learn the task covariance matrix by minimizing the relative test error, or approximately γ_1 f(Ẽ, Ω), with respect to Ω. D_j^i denotes the training dataset for the jth task in the ith multitask problem S_i and D̃_i denotes the training dataset for the ith task in the test multitask problem S̃. The LGNN, which receives a training dataset as input and is learned in the training process, is shared by all the tasks in the training and test multitask problems; we plot multiple copies for clear presentation.

3.2 Task Embedding

In order to learn the estimation function in the training process, the first thing we need to do is to determine the representation of the multitask problems {S_i}. Usually each multitask problem is associated with multiple training datasets, each of which corresponds to a task. So we can reduce representing a multitask problem to representing the training dataset of each task in the multitask problem, which is called the task embedding. In the following, we propose a method to compute the task embedding based on neural networks with their powerful capacity.
For ease of presentation, the training dataset of a task in a multitask problem consists of n data-label pairs {(x_j, y_j)}_{j=1}^n, omitting the task index, where x_j is assumed to have a vectorized representation.
Due to the varying nature of training datasets in different tasks (e.g., their sizes and the relations among training data points), it is difficult to use conventional neural networks such as convolutional neural networks or recurrent neural networks to represent a dataset. A dataset can usually be represented as a graph where each vertex corresponds to a data point and an edge between vertices encodes the relation between the corresponding data points. Based on this graph representation, we propose the LGNN to obtain the task embedding. Specifically, the input to the LGNN is a data matrix X = (x_1, . . . , x_n). By using ReLU as the activation function, the output of the first hidden layer in the LGNN is

\mathbf{H}_1 = \mathrm{ReLU}(\mathbf{L}_1^T\mathbf{X} + \boldsymbol{\beta}_1\mathbf{1}^T),   (3)

where \hat{d} denotes the dimension of the hidden representations, L_1 ∈ R^{d×\hat{d}} and β_1 ∈ R^{\hat{d}} denote the transformation matrix and bias, and 1 denotes a vector or matrix of all ones with the size depending on the context. According to Eq. (3), H_1 contains the hidden representations of all the training data points in this task. With an adjacency matrix G ∈ R^{n×n} to model the relations between each pair of training data points, the output of the ith hidden layer (2 ≤ i ≤ s) in the LGNN is defined as

\mathbf{H}_i = \mathrm{ReLU}(\mathbf{L}_i^T\mathbf{X} + \mathbf{H}_{i-1}\mathbf{G} + \boldsymbol{\beta}_i\mathbf{1}^T),   (4)

where L_i ∈ R^{d×\hat{d}} and β_i ∈ R^{\hat{d}} are the transformation matrix and bias, and s denotes the total number of hidden layers. According to Eq. (4), the hidden representations of all the data points at the ith layer (i.e., H_i) rely on those in the previous layer (i.e., H_{i−1}), and if x_i and x_j are correlated according to G (i.e., g_{ij} ≠ 0), their hidden representations are correlated. The term L_i^T X in Eq.
(4) not only preserves the comprehensive information encoded in the original representation but also alleviates the gradient vanishing issue by realizing a skip connection, as in the highway network [32], when s is large. The task embedding of this task, as a result, is the average of the last hidden layer H_s over all data points, i.e., e = H_s 1/n. One advantage of the mean function used here is that it can handle datasets of varying sizes. In the LGNN, {L_i} and {β_i} are learnable parameters optimized with the objective function presented in the next section.
The graph G plays an important role in the LGNN. Here we use the label information in the training dataset to construct it. For example, when each learning task is a classification problem, g_{ij}, the (i, j)th entry of G, is set to 1 when y_i equals y_j; to −1 when y_i ≠ y_j and x_i is one of the k nearest neighbors of x_j or x_j is one of the k nearest neighbors of x_i; and to 0 otherwise. Based on the definition of g_{ij} and Eq. (4), when two data points are in the same class, their hidden representations have positive effects on each other. When two data points are in different classes and they are nearby (i.e., in the neighborhood), their hidden representations have negative effects on each other.
The original graph neural network [30] needs to solve for the fixed point of a recursive equation, which restricts the functional form of the activation function.
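The graph construction and the layerwise updates of Eqs. (3)-(4) can be sketched in a few lines of numpy; layer sizes, k, and the helper names (`build_graph`, `lgnn_embed`) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def build_graph(X, y, k=3):
    """g_ij = 1 if y_i == y_j; -1 if labels differ and x_i, x_j are (one-sided)
    k-nearest neighbors of each other; 0 otherwise. X is a d x n data matrix."""
    n = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)     # pairwise squared distances
    knn = [set(np.argsort(d2[i])[1:k + 1]) for i in range(n)]   # k-NN of each point
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if y[i] == y[j]:
                G[i, j] = 1.0
            elif j in knn[i] or i in knn[j]:
                G[i, j] = -1.0
    return G

def lgnn_embed(X, y, Ls, betas, k=3):
    """Ls[i]: d x d_hat matrices, betas[i]: length-d_hat vectors.
    H_1 = ReLU(L_1^T X + beta_1 1^T), H_i = ReLU(L_i^T X + H_{i-1} G + beta_i 1^T),
    then mean-pool over data points: e = H_s 1 / n."""
    relu = lambda Z: np.maximum(Z, 0.0)
    G = build_graph(X, y, k)
    H = relu(Ls[0].T @ X + betas[0][:, None])
    for L, beta in zip(Ls[1:], betas[1:]):
        H = relu(L.T @ X + H @ G + beta[:, None])
    return H.mean(axis=1)   # handles varying dataset sizes n
```

Mean pooling at the end is what makes the embedding dimension independent of the number of data points in the task.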
Graph convolutional neural networks [8, 26, 4] focus on how to select neighboring data points for the convolution operation, while the LGNN aggregates all the neighborhood information in a layerwise manner.
Given a multitask problem consisting of m tasks, we construct an LGNN shared by all the tasks with the same parameters. The task embedding matrix E = (e_1, . . . , e_m), where e_i denotes the task embedding of the ith task, is then treated as the representation of the entire multitask problem. In the next section, we show how to learn the estimation function based on this representation.

3.3 Training Process

Recall that the training set in L2MT contains q tuples {(S_i, M_i, o_i)}_{i=1}^q. Applying the LGNN in the previous section, we represent S_i with m_i tasks as a task embedding matrix E_i ∈ R^{\hat{d}×m_i}. Based on the unified formulation in Section 2, M_i is represented by the task covariance matrix Ω_i ∈ R^{m_i×m_i}. In the training process, we aim to learn an estimation function mapping from both the task embedding matrix and the task covariance matrix to the relative test error, i.e., f(E_i, Ω_i) ≈ υ(o_i) for i = 1, . . . , q, where υ(·) is a link function used to transform the target. Considering the difficulty of designing a numerically stable f(·,·) to fit all positive o_i's, we introduce the link function υ(·), which transforms o_i to a real scalar that can be positive or negative. Different Ω_i's may have different scales as they are produced by different multitask models with different g(·)'s. To make their scales comparable, we impose the restriction that tr(Ω_i) equals 1. If some Ω_i does not satisfy this requirement, we simply preprocess it via Ω_i/tr(Ω_i). Note that different E_i's can have different sizes as m_i is not fixed.
By taking this into consideration, we design an estimation function in which the number of parameters is independent of m_i:

f(\mathbf{E}_i, \Omega_i) = \alpha_1\mathrm{tr}(\mathbf{E}_i^T\mathbf{E}_i\Omega_i) + \alpha_2\mathrm{tr}(\mathbf{K}_i\Omega_i) + \alpha_4\mathrm{tr}(\Omega_i^2),   (5)

where e_j^i is the jth column of E_i, K_i is an m_i × m_i matrix with its (j, k)th entry equal to exp{−‖α_3(e_j^i − e_k^i)‖_2^2}, and α = (α_1, α_2, α_3, α_4) contains four real parameters to be optimized in the estimation function. On the right-hand side of Eq. (5), E_i^T E_i and K_i are linear and RBF kernel matrices, respectively, which define task similarities based on task embeddings. The first two terms in f(·,·) define the consistency between the kernel matrices and Ω_i, with α_1 and α_2 controlling the positive/negative magnitude to estimate o_i. The resultant kernel matrices, which have the same size as Ω_i, are also the key to empowering the estimation function to accommodate Ω_i's of different sizes.
The link function takes the form υ(o) = tanh(γ_1 o + γ_2), where tanh(·) denotes the hyperbolic tangent function, which transforms a positive o into the range (−1, 1), and γ = (γ_1, γ_2) contains two learnable parameters.
The objective function in the training process is formulated as

\min_{\Theta}\ \frac{1}{q}\sum_{i=1}^{q}\left|f(\mathbf{E}_i,\Omega_i)-\upsilon(o_i)\right| + \lambda\sum_{i=1}^{s}\|\mathbf{L}_i\|_F^2,   (6)

where Θ = {{L_i}, {β_i}, α, γ} denotes the set of parameters to be optimized. Here we use the absolute loss as it is robust to outliers. Problem (6) indicates that the proposed method is end-to-end, from the training datasets of a multitask problem to its relative test error. We optimize problem (6) via the Adam optimizer in the TensorFlow package. In each batch, we randomly choose a tuple (e.g., the kth tuple) and optimize problem (6) by replacing the first term with |f(E_k, Ω_k) − υ(o_k)| as an approximation.
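Eq. (5) and the link function are cheap to evaluate given an embedding matrix and a task covariance matrix. A minimal sketch, with illustrative default parameter values:

```python
import numpy as np

def estimation_function(E, Omega, alpha):
    """E: d_hat x m task embedding matrix, Omega: m x m task covariance,
    alpha = (a1, a2, a3, a4). Implements Eq. (5)."""
    a1, a2, a3, a4 = alpha
    d2 = ((E[:, :, None] - E[:, None, :]) ** 2).sum(axis=0)  # ||e_j - e_k||^2
    K = np.exp(-(a3 ** 2) * d2)                              # RBF kernel on embeddings
    return (a1 * np.trace(E.T @ E @ Omega)
            + a2 * np.trace(K @ Omega)
            + a4 * np.trace(Omega @ Omega))

def link(o, gamma1=1.0, gamma2=0.0):
    """Link function v(o) = tanh(gamma1 * o + gamma2)."""
    return np.tanh(gamma1 * o + gamma2)
```

Note that ‖α_3(e_j − e_k)‖² = α_3²‖e_j − e_k‖², which is how the RBF kernel is computed above.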
The left part of Figure 1 illustrates the training process.

3.4 Test Process

In the test process, suppose that we are given a new test multitask problem S̃ consisting of m̃ tasks, each of which is associated with a training dataset, a validation dataset, and a test dataset. The goal here is to learn the optimal Ω̃ automatically via the estimation function and the training datasets, without manually specifying the form of g(·) in problem (1). With Ω̃ injected, the validation datasets of all the tasks can be used to identify the regularization hyperparameter λ_1 in problem (1), and the test datasets are used to evaluate the performance of L2MT as usual.
For the training datasets of the m̃ tasks, we first apply the LGNN learned in the training process to obtain their task embedding matrix Ẽ ∈ R^{\hat{d}×m̃}. Here the task covariance matrix is unknown, and what we need to do is to estimate it by minimizing the relative test error, which, however, is difficult to measure based on the training datasets. Recall that the estimation function approximates the relative test error transformed by the link function, so we resort to optimizing the estimation function instead. Due to the monotonically increasing property of the hyperbolic tangent function used in the link function υ(·), minimizing the relative test error can be approximated by minimizing/maximizing the estimation function when γ_1 is positive/negative,1 leading to the minimization of γ_1 f(Ẽ, Ω) with respect to Ω, which based on Eq. (5) can be simplified as

\min_{\Omega}\ \rho\,\mathrm{tr}(\Omega^2) + \mathrm{tr}(\Phi\Omega) \quad \mathrm{s.t.}\ \Omega \succeq 0,\ \mathrm{tr}(\Omega) = 1,   (7)

where ρ = γ_1 α_4, ẽ_i denotes the ith column of Ẽ, K̃ is an m̃ × m̃ matrix with its (i, j)th entry equal to exp{−‖α_3(ẽ_i − ẽ_j)‖_2^2}, and Φ = γ_1(α_1 Ẽ^T Ẽ + α_2 K̃). The constraints in problem (7) are due to the requirement that the trace of the PSD task covariance matrix equals 1, as enforced by the preprocessing in the training process. It is easy to see that problem (7) is convex when ρ ≥ 0 and non-convex otherwise. Even though the convex/non-convex nature of problem (7) varies with ρ, we can always find its efficient solutions, summarized in the following theorem.

Theorem 3 Define the eigendecomposition of Φ as Φ = UΛU^T, where Λ = diag(κ) denotes the diagonal eigenvalue matrix with κ = (κ_1, . . . , κ_m̃)^T (κ_1 ≥ . . . ≥ κ_m̃), U = (u_1, . . . , u_m̃) denotes the eigenvector matrix, and the multiplicity of κ_m̃ is assumed to be t (t ≥ 1). When ρ = 0, the optimal solution Ω̃ of problem (7) lies in the convex hull of u_{m̃−t+1}u_{m̃−t+1}^T, . . . , u_m̃ u_m̃^T. When ρ < 0, optimal solutions of problem (7) are in the set {u_{m̃−t+1}u_{m̃−t+1}^T, . . . , u_m̃ u_m̃^T}. When ρ > 0, the optimal solution is Ω̃ = U diag(µ) U^T, where µ is the solution of the following problem:

\min_{\boldsymbol{\mu}}\ \rho\|\boldsymbol{\mu}\|_2^2 + \boldsymbol{\mu}^T\boldsymbol{\kappa} \quad \mathrm{s.t.}\ \boldsymbol{\mu} \ge 0,\ \boldsymbol{\mu}^T\mathbf{1} = 1.   (8)

According to Theorem 3, we need to solve problem (8) when ρ > 0. Based on the Lagrange multiplier method, we design an efficient algorithm with O(m̃) complexity in the supplementary material.
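For ρ > 0, problem (8) is equivalent to a Euclidean projection onto the probability simplex: ρ‖µ‖² + µᵀκ = ρ‖µ + κ/(2ρ)‖² + const, so the minimizer is the projection of −κ/(2ρ). The paper's supplementary material gives an O(m̃) Lagrange-multiplier algorithm; the sketch below instead uses the standard O(m̃ log m̃) sorting-based projection as an illustrative substitute (function names are assumptions).

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {mu : mu >= 0, sum(mu) = 1}
    via the standard sorting-based algorithm."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    r = np.nonzero(u - css / idx > 0)[0][-1]   # last index with positive gap
    theta = css[r] / (r + 1.0)
    return np.maximum(v - theta, 0.0)

def solve_problem8(kappa, rho):
    """Minimize rho * ||mu||^2 + mu.T kappa over the simplex (rho > 0)."""
    assert rho > 0
    return project_simplex(-kappa / (2.0 * rho))
```

As ρ → 0⁺, the solution concentrates its mass on the entries with the smallest κ values, consistent with the ρ = 0 case of Theorem 3.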
After learning Ω̃ according to Theorem 3, we can plug Ω̃ into problem (1) and learn the optimal W̃ and b̃ for the m̃ tasks involved in the test multitask problem. The right part of Figure 1 illustrates the test process.

3.5 Analysis

The training process of L2MT induces a new learning problem in which each multitask problem is used to predict the relative test error and the task embedding matrix contains meta features describing the multitask problem. In this section, we study the generalization bound for this learning problem.
For ease of presentation, we assume each multitask problem contains the same number of tasks. Following [7], tasks originate in a common environment η, which is by definition a probability measure on learning tasks. In L2MT, the absolute loss is used, and here we generalize it to the case where the loss function l̄ : R × R → [0, 1] is assumed to be 1-Lipschitz in its first argument. In the test process, the task covariance matrix is a function of the task embedding matrix, and inspired by that, we assume that there is some function representing the task covariance matrix in terms of the task embedding matrix. Under this assumption, the estimation function is denoted by f̄(E) ≡ f(E, Ω). The expected loss is then defined as E = E[l̄(f̄(E), υ(o))], where the expectation E is over the space of multitask problems and relative test errors, and E denotes the task embedding matrix induced by the corresponding multitask problem. The training loss is

1We do not consider the trivial case γ_1 = 0, where the estimation function approximates a constant.
defined as \hat{\mathcal{E}} = \frac{1}{q}\sum_{i=1}^{q} \bar{l}(\bar{f}(\mathbf{E}_i), \upsilon(o_i)). Based on the Gaussian average [6, 22], we can bound E in terms of Ê as follows.

Theorem 4 Let F̄ be a real-valued function class on the space of task embeddings whose members take values in [0, 1], and let H denote the space of transformation functions in the LGNN. With probability greater than 1 − δ, for any f̄ ∈ F̄ and any h ∈ H, we have

\mathcal{E} \le \hat{\mathcal{E}} + \frac{c_1 L\, G(\{\mathbf{E}_i\})}{q} + \frac{c_2 Q \sup_{h\in\mathcal{H}} \|\mathbf{E}\|_F}{q} + \sqrt{\frac{9\ln(2/\delta)}{2q}},

where c_1, c_2 are universal constants, functions in F̄ are assumed to have a Lipschitz constant at most L, G(Y) = E sup_{y∈Y} ⟨σ, y⟩ denotes the Gaussian average with σ a generic vector or matrix of independent standard normal variables, min_E G(F(E)) is assumed to be 0 by following [23], ‖·‖_F denotes the Frobenius norm, and Q = \sup_{\mathbf{E}\neq\mathbf{E}'} \frac{E \sup_{\bar{f}\in\bar{\mathcal{F}}} \langle\sigma, \bar{f}(\mathbf{E})-\bar{f}(\mathbf{E}')\rangle}{\|\mathbf{E}-\mathbf{E}'\|_F}.

According to Theorem 4, the expected loss can be upper-bounded by the sum of the training loss, a model-complexity term based on the task embedding matrices, and a confidence term, with a rate of convergence O(q^{−1/2}). The Gaussian average of the task embedding matrices induced by the LGNN can be estimated via the chain rule [22].

4 Experiments

Four datasets are used in the experiments: the MIT-Indoor-Scene, Caltech256, 20newsgroup, and RCV1 datasets. The MIT-Indoor-Scene and Caltech256 datasets are for image classification, while the 20newsgroup and RCV1 datasets are for text classification. For the two image datasets, we use the FC8 layer of the VGG-19 network [31], pretrained on the ImageNet dataset, as the feature extractor.
The two text datasets are represented using bag-of-words features and hence lie in high-dimensional spaces. To reduce the resulting heavy computational cost, we preprocess these two datasets to reduce the dimension to 1,000 by following [34], which utilizes ridge regression to select important features. The RCV1 dataset is highly imbalanced, as the number of data points per class varies from 5 to 130,426. To reduce the effect of imbalanced classes on multitask learning, we keep the categories whose numbers of data samples are between 400 and 5,000.
Based on each aforementioned dataset, we construct the training set for L2MT in the following two steps: 1) We first construct a multitask problem where each task is a binary classification task. The number of tasks in a multitask problem is uniformly distributed between 4 and 8, as the number of tasks in real applications is limited. For a multitask problem with m tasks, we randomly sample m pairs of classes along with their data, where each task is to distinguish between one pair of classes. 2) We sample q multitask problems to constitute the training set for L2MT.
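The two sampling steps above can be sketched as follows; this is an illustrative reconstruction (the function names and the use of `random.Random` are assumptions, not the authors' code).

```python
import random

def sample_multitask_problem(classes, rng):
    """One multitask problem: m ~ Uniform{4, ..., 8} binary tasks, each
    distinguishing a randomly sampled pair of classes."""
    m = rng.randint(4, 8)                          # step 1): number of tasks
    return [tuple(rng.sample(classes, 2)) for _ in range(m)]

def build_training_set(classes, q, seed=0):
    """Step 2): sample q multitask problems."""
    rng = random.Random(seed)
    return [sample_multitask_problem(classes, rng) for _ in range(q)]
```

Each `(pos, neg)` pair defines one binary task; the data of the two classes would then be split into training/validation/test portions as described in the text.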
The test set for L2MT is constructed similarly, and it is kept disjoint from the training set.
Baseline methods in the comparison consist of a single-task learner (STL), which trains on each task independently with the cross-entropy loss, and all the instantiation models of problem (1), including regularized multitask learning (RMTL) [14], Schatten norm regularization with r = 1 (SNR1), which is the trace norm regularization [28], Schatten norm regularization with r = 2 (SNR2), which is equivalent to the Schatten 4/3-norm regularization according to Theorem 1, the MTRL method [39, 42], squared Schatten norm regularization with r = 2 (SSNR2), which is equivalent to the squared Schatten 4/3-norm regularization according to Theorem 2, clustered multitask learning (CMTL) [18], multitask learning with graphical Lasso (glMTL) [36, 29], asymmetric multitask learning (AMTL) [21], and SPATS [37]. In total there are 10 baseline methods. Moreover, to ensure a fair comparison, we also allow each baseline method to access the training datasets of all training multitask problems in addition to the training datasets of the test multitask problem at hand. Consequently, for each baseline method we report the better of its two results: learning on the test multitask problem only, and learning on all the training and test multitask problems together.
Collecting a training set with many multitask problems is time-consuming, so in the experiments we collect only 100 multitask problems for training, where 30% of the data in each task form the training dataset. To ease training on such a small training set and to control model complexity, the different Li's and βi's (2 ≤ i ≤ s) are constrained to be identical in Eq. (4), i.e., L2 = . . . = Ls and β2 = . . . = βs.
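As a concrete reference for the Schatten norm regularizers behind SNR1 and SNR2, the Schatten p-norm of a matrix W is the l_p norm of its singular values, \|W\|_{S_p} = (\sum_i \sigma_i^p)^{1/p}; p = 1 gives the trace (nuclear) norm. A brief numpy sketch (the function name is our own):

```python
import numpy as np

def schatten_norm(W, p):
    """Schatten p-norm: the l_p norm of the singular values of W.
    p = 1 is the trace (nuclear) norm used by SNR1; the Schatten
    4/3-norm is the one SNR2 is equivalent to via Theorem 1."""
    sv = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(sv ** p) ** (1.0 / p))

W = np.diag([3.0, 4.0])       # singular values 4 and 3
trace_norm = schatten_norm(W, 1)   # 3 + 4 = 7
spectral_l2 = schatten_norm(W, 2)  # sqrt(9 + 16) = 5 (Frobenius norm)
```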
There are 50 test multitask problems in the test set.
Each entry in {Li} is initialized from a normal distribution with zero mean and variance 1/100, and the biases {βi} are initialized to zero. α in the estimation function is initialized to [1, 1, 1, 0.1]^T and γ in the link function is initialized to [1, 0]^T. The learning rate decays linearly from 0.01 with respect to the number of epochs.
To investigate the effect of the size of the training dataset on the performance, we vary the size of training data from 30% to 50% at an interval of 10%, with the validation proportion fixed to 30% in the test process, and plot the average relative test errors of different methods over STL in Figure 2, where \tilde{\epsilon}_{MTL} = \frac{1}{\tilde{q}} \sum_{i=1}^{\tilde{q}} \frac{1}{\tilde{m}_i} \sum_{j=1}^{\tilde{m}_i} \tilde{\epsilon}^{MTL}_{i,j} denotes the average test error of a multitask model over all the tasks in all the test multitask problems, \tilde{\epsilon}_{STL} has a similar definition for STL, and the average relative test error is defined as \tilde{\epsilon}_{MTL} / \tilde{\epsilon}_{STL}. All the relative test errors of STL are equal to 1, and the performance of RMTL is not very good, as its assumption that all the tasks are equally similar to each other is usually violated in real applications. Hence we omit these two methods in Figure 2 for clearer presentation. According to Figure 2, we can see that some multitask models perform worse than STL, with relative test errors larger than 1, which can be explained by a mismatch between the data and the model assumptions imposed on the task covariance.
By learning the task covariance directly from data without explicit assumptions, the proposed L2MT performs better than all the baseline methods under different settings, which demonstrates its effectiveness.

[Figure 2: Results of different models on the four datasets when varying the size of training data. Panels: (a) MIT-Indoor-Scene, (b) Caltech256, (c) 20newsgroup, (d) RCV1.]

[Figure 3: Sensitivity analysis of L2MT on the 20newsgroup dataset when using 30% data for training. Panels: (a) s, (b) λ, (c) ˆd, (d) k, (e) q.]

In Figure 3, we conduct a sensitivity analysis on the 20newsgroup dataset with respect to the hyperparameters in L2MT, including the number of layers s, the regularization hyperparameter λ, the latent dimension ˆd, and the number of neighbors k in LGNN, to see their effects on the performance. According to Figure 3(a), the performance when s equals 2 or 3 is better than that with s = 1, which demonstrates the usefulness of the graph information used in LGNN to learn the task embeddings. Yet the performance degrades when s increases further; one reason is that L2MT is likely to overfit given a limited number of training multitask problems. As implied by Figures 3(b) and 3(d), when λ is in [0.01, 0.5] and k is in [5, 10], the performance is not very sensitive, which makes these choices easy; hence in the experiments we always set λ to 0.1 and k to 6. According to Figure 3(c), when ˆd is not very large, the performance is better than that corresponding to a larger ˆd, where overfitting is likely to occur. Based on this observation, ˆd is set to 50.
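The average relative test error plotted in Figure 2 is straightforward to compute from per-task test errors; a minimal sketch, assuming the errors are stored as a list of per-problem lists (the data layout and function names are ours):

```python
def average_test_error(errors):
    """errors[i][j]: test error of task j in test multitask problem i.
    Returns (1/q~) * sum_i (1/m~_i) * sum_j errors[i][j]: average first
    over the tasks of each problem, then over the problems."""
    return sum(sum(tasks) / len(tasks) for tasks in errors) / len(errors)

def relative_test_error(mtl_errors, stl_errors):
    """Average relative test error of a multitask model over STL."""
    return average_test_error(mtl_errors) / average_test_error(stl_errors)

# two test problems: one with two tasks, one with a single task
rel = relative_test_error([[0.1, 0.2], [0.3]], [[0.2, 0.4], [0.6]])
```

A value below 1 means the multitask model beats STL on average, which is how Figure 2 should be read.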
Moreover, in Figure 3(e) we test the performance of L2MT by varying q, the number of training multitask problems, and we can see that the test error of L2MT decreases as q increases, which matches the generalization bound in Theorem 4.
In the previous experiments, the VGG-19 network is used as the feature extractor. It is worth noting that L2MT can even be used to update the VGG-19 network. On the Caltech256 and MIT-Indoor-Scene datasets, we use problem (6) as the objective function to fine-tune the parameters in the fully connected layers of the VGG-19 network. After fine-tuning, the average test errors of L2MT are reduced by about 5% compared with L2MT without fine-tuning, which demonstrates the effectiveness of L2MT not only in improving the performance on multitask problems but also in learning good features.
We also study other formulations for the estimation and link functions. For example, another choice for the estimation function is f(E, Ω) = α_1 tr(ReLU(\hat{E}^T \hat{E}) Ω) + α_2 tr(Ω^2), where \hat{E} = LE + β1^T with parameters L and β, and another choice for the link function is υ(o) = ln(exp{γ_1} o + exp{γ_2}), where ln(·) denotes the natural logarithm.
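These alternative estimation and link functions can be written down directly; a numpy sketch (the matrix shapes, vectorized layout, and function names are our assumptions):

```python
import numpy as np

def estimate(E, Omega, L, beta, alpha):
    """Alternative estimation function
    f(E, Omega) = alpha1 * tr(ReLU(E_hat^T E_hat) Omega) + alpha2 * tr(Omega^2),
    with E_hat = L E + beta 1^T."""
    E_hat = L @ E + np.outer(beta, np.ones(E.shape[1]))
    relu = np.maximum(E_hat.T @ E_hat, 0.0)  # elementwise ReLU
    return alpha[0] * np.trace(relu @ Omega) + alpha[1] * np.trace(Omega @ Omega)

def link(o, gamma):
    """Alternative link function v(o) = ln(exp(gamma1) * o + exp(gamma2))."""
    return np.log(np.exp(gamma[0]) * o + np.exp(gamma[1]))
```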
Compared with the estimation and link functions proposed in Section 3.3, these new functions lead to slightly worse performance (about a 2% relative increase in the test error), which demonstrates the effectiveness of the proposed functions over the newly defined ones.
To assess the quality of the task covariance matrices learned by different models, we conduct a case study by constructing a multitask problem consisting of three tasks from the Caltech256 dataset. The first task classifies between classes 'Bat' and 'Clutter', the second distinguishes between classes 'Bear' and 'Clutter', and the last classifies between classes 'Dog' and 'Clutter'. The task correlation matrices, which can be computed from the learned task covariance matrices, for SNR1, MTRL, and L2MT are, respectively,

\[ \begin{pmatrix} 1.0000 & -0.0067 & 0.0553 \\ -0.0067 & 1.0000 & 0.0473 \\ 0.0553 & 0.0473 & 1.0000 \end{pmatrix}, \quad \begin{pmatrix} 1.0000 & 0.0028 & 0.0789 \\ 0.0028 & 1.0000 & 0.0633 \\ 0.0789 & 0.0633 & 1.0000 \end{pmatrix}, \quad \begin{pmatrix} 1.0000 & 0.0057 & 0.0052 \\ 0.0057 & 1.0000 & -0.9782 \\ 0.0052 & -0.9782 & 1.0000 \end{pmatrix}. \]

From the three task correlation matrices, we can see that the correlations between the first and second tasks are close to 0 in all three models, which matches the intuition that bats and bears are almost irrelevant to each other as they belong to different species. The same observation holds for the first and third tasks. The difference among the three methods lies in the correlation between the second and third tasks. Specifically, in SNR1 and MTRL those correlations are close to 0, indicating that these two tasks are nearly uncorrelated, and hence the knowledge shared among the three tasks is very limited for SNR1 and MTRL.
On the contrary, in L2MT the second and third tasks have a strongly negative correlation, so substantial knowledge can be leveraged between these two tasks, which may be one reason why L2MT outperforms SNR1 and MTRL.

5 Conclusions

In this paper, we propose L2MT to identify a good multitask model for a given multitask problem based on historical multitask problems. To achieve this, we propose an end-to-end procedure that employs the LGNN to learn task embedding matrices for multitask problems and then uses the estimation function to approximate the relative test error. In the test process, given a new multitask problem, minimizing the estimation function leads to the identification of the task covariance matrix. As revealed in the survey [38], there is another representative formulation for the feature-based approach [2, 3, 11] in multitask learning. In future research, we will extend the proposed L2MT method to learn good feature covariances for multitask problems based on this formulation. Moreover, the proposed L2MT method can be extended to meta learning, where the LGNN can be used to learn hidden representations for datasets.

Acknowledgments

This research has been supported by NSFC 61673202, National Grant Fundamental Research (973 Program) of China under Project 2014CB340304, and Hong Kong CERG projects 16211214/16209715/16244616.

References

[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.

[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41–48, 2006.

[3] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems 20, pages 25–32, 2007.

[4] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems 29, pages 1993–2001, 2016.

[5] O. Banerjee, L. E. Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the Twenty-Third International Conference on Machine Learning, pages 89–96, 2006.

[6] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[7] J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

[8] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.

[9] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[10] J. Chen, J. Liu, and J. Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1179–1188, 2010.

[11] J. Chen, L. Tang, J. Liu, and J. Ye. A convex formulation for learning shared structures from multiple tasks. In Proceedings of the 26th International Conference on Machine Learning, pages 137–144, 2009.

[12] Z. Chen and B. Liu. Lifelong Machine Learning. Morgan & Claypool, 2016.

[13] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

[14] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117, 2004.

[15] L. Han and Y. Zhang. Learning multi-level task groups in multi-task learning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.

[16] L. Han and Y. Zhang. Learning tree structure in multi-task learning. In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2015.

[17] L. Han and Y. Zhang. Multi-stage multi-task learning with reduced rank. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.

[18] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems 21, pages 745–752, 2008.

[19] A. Jalali, P. D. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems 23, pages 964–972, 2010.

[20] A. Kumar and H. Daumé III. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.

[21] G. Lee, E. Yang, and S. J. Hwang. Asymmetric multi-task learning based on task relatedness and loss. In Proceedings of the 33rd International Conference on Machine Learning, pages 230–238, 2016.

[22] A. Maurer. A chain rule for the expected suprema of Gaussian processes. In Proceedings of the 25th International Conference on Algorithmic Learning Theory, pages 245–259, 2014.

[23] A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17:1–32, 2016.

[24] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.

[25] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.

[26] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning, pages 2014–2023, 2016.

[27] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[28] T. K. Pong, P. Tseng, S. Ji, and J. Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465–3489, 2010.

[29] P. Rai, A. Kumar, and H. Daume. Simultaneously leveraging output and task structures for multiple-output regression. In Advances in Neural Information Processing Systems 25, pages 3185–3193, 2012.

[30] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[32] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.

[33] Y. Wei, Y. Zhang, J. Huang, and Q. Yang. Transfer learning via learning to transfer. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[34] S. Yang, L. Yuan, Y.-C. Lai, X. Shen, P. Wonka, and J. Ye. Feature grouping and selection over an undirected graph. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2012.

[35] Y. Zhang. Heterogeneous-neighborhood-based multi-task local learning algorithms. In Advances in Neural Information Processing Systems 26, 2013.

[36] Y. Zhang and J. G. Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems 23, pages 2550–2558, 2010.

[37] Y. Zhang and Q. Yang. Learning sparse task relations in multi-task learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017.

[38] Y. Zhang and Q. Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114v2, 2017.

[39] Y. Zhang and D.-Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 733–742, 2010.

[40] Y. Zhang and D.-Y. Yeung. Multi-task learning using generalized t process. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 964–971, 2010.

[41] Y. Zhang and D.-Y. Yeung. Learning high-order task relationships in multi-task learning. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013.

[42] Y. Zhang and D.-Y. Yeung. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data, 8(3):article 12, 2014.

[43] Y. Zhang, D.-Y. Yeung, and Q. Xu. Probabilistic multi-task feature selection. In Advances in Neural Information Processing Systems 23, 2010.

[44] A. Zweig and D. Weinshall. Hierarchical regularization cascade for joint learning. In Proceedings of the 30th International Conference on Machine Learning, pages 37–45, 2013.