{"title": "Learning Multiple Tasks with Multilinear Relationship Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1594, "page_last": 1603, "abstract": "Deep networks trained on large-scale data can learn transferable features to promote learning multiple tasks. Since deep features eventually transition from general to specific along deep networks, a fundamental problem of multi-task learning is how to exploit the task relatedness underlying parameter tensors and improve feature transferability in the multiple task-specific layers. This paper presents Multilinear Relationship Networks (MRN) that discover the task relationships based on novel tensor normal priors over parameter tensors of multiple task-specific layers in deep convolutional networks. By jointly learning transferable features and multilinear relationships of tasks and features, MRN is able to alleviate the dilemma of negative-transfer in the feature layers and under-transfer in the classifier layer. Experiments show that MRN yields state-of-the-art results on three multi-task learning datasets.", "full_text": "Learning Multiple Tasks with Multilinear\n\nRelationship Networks\n\nMingsheng Long, Zhangjie Cao, Jianmin Wang, Philip S. Yu\nSchool of Software, Tsinghua University, Beijing 100084, China\n\npsyu@uic.edu\n\n{mingsheng,jimwang}@tsinghua.edu.cn\n\ncaozhangjie14@gmail.com\n\nAbstract\n\nDeep networks trained on large-scale data can learn transferable features to promote\nlearning multiple tasks. Since deep features eventually transition from general to\nspeci\ufb01c along deep networks, a fundamental problem of multi-task learning is how\nto exploit the task relatedness underlying parameter tensors and improve feature\ntransferability in the multiple task-speci\ufb01c layers. 
This paper presents Multilinear\nRelationship Networks (MRN) that discover the task relationships based on novel\ntensor normal priors over parameter tensors of multiple task-speci\ufb01c layers in deep\nconvolutional networks. By jointly learning transferable features and multilinear\nrelationships of tasks and features, MRN is able to alleviate the dilemma of negative-\ntransfer in the feature layers and under-transfer in the classi\ufb01er layer. Experiments\nshow that MRN yields state-of-the-art results on three multi-task learning datasets.\n\n1\n\nIntroduction\n\nSupervised learning machines trained with limited labeled samples are prone to over\ufb01tting, while\nmanual labeling of suf\ufb01cient training data for new domains is often prohibitive. Thus it is imperative\nto design versatile algorithms for reducing the labeling consumption, typically by leveraging off-the-\nshelf labeled data from relevant tasks. Multi-task learning is based on the idea that the performance\nof one task can be improved using related tasks as inductive bias [4]. Knowing the task relationship\nshould enable the transfer of shared knowledge from relevant tasks such that only task-speci\ufb01c features\nneed to be learned. This fundamental idea of task relatedness has motivated a variety of methods,\nincluding multi-task feature learning that learns a shared feature representation [1, 2, 6, 5, 23], and\nmulti-task relationship learning that models inherent task relationship [10, 14, 29, 31, 15, 17, 8].\nLearning inherent task relatedness is a hard problem, since the training data of different tasks may be\nsampled from different distributions and \ufb01tted by different models. Without prior knowledge on the\ntask relatedness, the distribution shift may pose a major dif\ufb01culty in transferring knowledge across\ndifferent tasks. Unfortunately, if cross-task knowledge transfer is impossible, then we will over\ufb01t\neach task due to limited amount of labeled data. 
One way to circumvent this dilemma is to use an\nexternal data source, e.g. ImageNet, to learn transferable features through which the shift in the\ninductive biases can be reduced such that different tasks can be correlated more effectively. This idea\nhas motivated some latest deep learning methods for learning multiple tasks [25, 22, 7, 27], which\nlearn a shared representation in feature layers and multiple independent classi\ufb01ers in classi\ufb01er layer.\nHowever, these deep multi-task learning methods do not explicitly model the task relationships.\nThis may result in under-transfer in the classi\ufb01er layer as knowledge can not be transferred across\ndifferent classi\ufb01ers. Recent research also reveals that deep features eventually transition from general\nto speci\ufb01c along the network, and feature transferability drops signi\ufb01cantly in higher layers with\nincreasing task dissimilarity [28], hence the sharing of all feature layers may be risky to negative-\ntransfer. 
Therefore, it remains an open problem how to exploit the task relationship across different deep networks while improving the feature transferability in task-specific layers of the deep networks.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

This paper presents Multilinear Relationship Network (MRN) for multi-task learning, which discovers the task relationships based on multiple task-specific layers of deep convolutional neural networks. Since the parameters of deep networks are natively tensors, the tensor normal distribution [21] is explored for multi-task learning, which is imposed as the prior distribution over the network parameters of all task-specific layers to learn fine-grained multilinear relationships of tasks, classes and features. By jointly learning transferable features and multilinear relationships, MRN is able to circumvent the dilemma of negative-transfer in the feature layers and under-transfer in the classifier layer. Experiments show that MRN learns fine-grained relationships and yields state-of-the-art results on standard benchmarks.

2 Related Work

Multi-task learning is a learning paradigm that learns multiple tasks jointly by exploiting the shared structures to improve generalization performance [4, 19] and mitigate the cost of manual labeling. There are generally two categories of approaches: (1) multi-task feature learning, which learns a shared feature representation such that the distribution shift across different tasks can be reduced [1, 2, 6, 5, 23]; (2) multi-task relationship learning, which explicitly models the task relationship in the forms of task grouping [14, 15, 17] or task covariance [10, 29, 31, 8]. 
While these methods have achieved improved performance, they may be restricted by their shallow learning paradigm, which cannot model task relationships while suppressing the task-specific variations in transferable features.

Deep networks learn abstract representations that disentangle the explanatory factors of variation behind data [3, 16]. Deep representations manifest invariant factors underlying different populations and are transferable across similar tasks [28]. Thus deep networks have been successfully explored for domain adaptation [11, 18] and multi-task learning [25, 22, 32, 7, 20, 27], where significant performance gains have been witnessed. Most multi-task deep learning methods [22, 32, 7] learn a shared representation in the feature layers and multiple independent classifiers in the classifier layer without inferring the task relationships. However, this may result in under-transfer in the classifier layer, as knowledge cannot be adaptively propagated across different classifiers, while the sharing of all feature layers may still be vulnerable to negative-transfer in the feature layers, as the higher layers of deep networks are tailored to fit task-specific structures and may not be safely transferable [28].

This paper presents a multilinear relationship network based on novel tensor normal priors to learn transferable features and task relationships that mitigate both under-transfer and negative-transfer. Our work contrasts with prior relationship learning [29, 31] and multi-task deep learning [22, 32, 7, 27] methods in two key aspects. (1) Tensor normal prior: our work is the first to explore the tensor normal distribution as a prior over network parameters in different layers to learn multilinear task relationships in deep networks. 
Since the network parameters of multiple tasks natively stack into high-order tensors, the previous matrix normal distribution [13] cannot be used as a prior over network parameters to learn task relationships. (2) Deep task relationship: we define the tensor normal prior on multiple task-specific layers, while previous deep learning methods do not learn the task relationships. To our knowledge, [27] is the first work to tackle multi-task deep learning by tensor factorization, which learns a shared feature subspace from multilayer parameter tensors; in contrast, our work learns multilinear task relationships from multilayer parameter tensors.

3 Tensor Normal Distribution

3.1 Probability Density Function

The tensor normal distribution is a natural extension of the multivariate normal distribution and the matrix-variate normal distribution [13] to tensor-variate distributions. The multivariate normal distribution is the order-1 tensor normal distribution, and the matrix-variate normal distribution is the order-2 tensor normal distribution. Before defining the tensor normal distribution, we first introduce the notations and operations of order-K tensors. An order-K tensor is an element of the tensor product of K vector spaces, each of which has its own coordinate system. A vector x ∈ R^{d_1} is an order-1 tensor with dimension d_1. A matrix X ∈ R^{d_1×d_2} is an order-2 tensor with dimensions (d_1, d_2). An order-K tensor X ∈ R^{d_1×...×d_K} with dimensions (d_1, . . . , d_K) has elements {x_{i_1...i_K} : i_k = 1, . . . , d_k}. The vectorization of X unfolds the tensor into a vector, denoted by vec(X). The matricization of X is a generalization of vectorization, reordering the elements of X into a matrix. 
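As a concrete illustration of these operations, the following minimal NumPy sketch (not from the paper; the ordering of the remaining modes after unfolding is an assumption, since conventions differ across the literature) unfolds an order-3 tensor along each mode:

```python
import numpy as np

def mode_k_matricization(X, k):
    """Unfold tensor X along mode k: row i of the result collects all
    elements of X whose k-th index equals i (one common unfolding
    convention; orderings of the remaining modes vary in the literature)."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

# An order-3 tensor with dimensions (d1, d2, d3) = (2, 3, 4).
X = np.arange(24).reshape(2, 3, 4)
assert mode_k_matricization(X, 0).shape == (2, 12)
assert mode_k_matricization(X, 1).shape == (3, 8)
assert mode_k_matricization(X, 2).shape == (4, 6)

# Every matricization is just a reordering: it contains exactly the
# same elements as the vectorization vec(X).
assert np.array_equal(np.sort(mode_k_matricization(X, 2).ravel()), np.arange(24))
```

Each mode-k matrix has d_k rows and ∏_{k'≠k} d_{k'} columns, matching the dimensions used below.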
In this paper, to simplify the notations and describe the tensor relationships, we use the mode-k matricization and denote by X_(k) the mode-k matrix of tensor X, where row i of X_(k) contains all elements of X having the k-th index equal to i.

Consider an order-K tensor X ∈ R^{d_1×...×d_K}. Since we can vectorize X to a (∏_{k=1}^K d_k) × 1 vector, the normal distribution on a tensor X can be considered as a multivariate normal distribution on the vector vec(X) of dimension ∏_{k=1}^K d_k. However, such an ordinary multivariate normal distribution ignores the special structure of X as a d_1 × ... × d_K tensor, and as a result, the covariance characterizing the correlations across elements of X is of size (∏_{k=1}^K d_k) × (∏_{k=1}^K d_k), which is often prohibitively large for modeling and estimation. To exploit the structure of X, tensor normal distributions assume that the (∏_{k=1}^K d_k) × (∏_{k=1}^K d_k) covariance matrix Σ_{1:K} can be decomposed into the Kronecker product Σ_{1:K} = Σ_1 ⊗ ... ⊗ Σ_K, and elements of X (in vectorization) follow the normal distribution

vec(X) ∼ N(vec(M), Σ_1 ⊗ ... ⊗ Σ_K),    (1)

where ⊗ is the Kronecker product, Σ_k ∈ R^{d_k×d_k} is a positive definite matrix indicating the covariance between the d_k rows of the mode-k matricization X_(k) of dimension d_k × (∏_{k'≠k} d_{k'}), and M is a mean tensor containing the expectation of each element of X. Due to the decomposition of the covariance as the Kronecker product, the tensor normal distribution of an order-K tensor X, parameterized by mean tensor M and covariance matrices Σ_1, ..., Σ_K, can define the probability density function as [21]

p(x) = (2π)^{−d/2} (∏_{k=1}^K |Σ_k|^{−d/(2 d_k)}) × exp(−(1/2) (x − μ)^T Σ_{1:K}^{−1} (x − μ)),    (2)

where |·| is the determinant of a square matrix, and x = vec(X), μ = vec(M), Σ_{1:K} = Σ_1 ⊗ ... ⊗ Σ_K, d = ∏_{k=1}^K d_k. The tensor normal distribution corresponds to the multivariate normal distribution with Kronecker decomposable covariance structure. X following the tensor normal distribution, i.e. vec(X) following the normal distribution with Kronecker decomposable covariance, is denoted by

X ∼ TN_{d_1×...×d_K}(M, Σ_1, ..., Σ_K).    (3)

3.2 Maximum Likelihood Estimation

Consider a set of n samples {X_i}_{i=1}^n where each X_i is an order-3 tensor generated by a tensor normal distribution as in Equation (2). The maximum likelihood estimation (MLE) of the mean tensor M is

M̂ = (1/n) ∑_{i=1}^n X_i.    (4)

The MLEs of the covariance matrices Σ̂_1, ..., Σ̂_3 are computed by iteratively updating these equations:

Σ̂_1 = (1/(n d_2 d_3)) ∑_{i=1}^n (X_i − M)_(1) (Σ̂_3 ⊗ Σ̂_2)^{−1} (X_i − M)_(1)^T,
Σ̂_2 = (1/(n d_1 d_3)) ∑_{i=1}^n (X_i − M)_(2) (Σ̂_3 ⊗ Σ̂_1)^{−1} (X_i − M)_(2)^T,    (5)
Σ̂_3 = (1/(n d_1 d_2)) ∑_{i=1}^n (X_i − M)_(3) (Σ̂_2 ⊗ Σ̂_1)^{−1} (X_i − M)_(3)^T.

This flip-flop algorithm [21] is efficient to solve by simple matrix manipulations and its convergence is guaranteed. The covariance matrices Σ̂_1, ..., Σ̂_3 are not identifiable and the solutions to maximizing density function (2) are not unique, while only the Kronecker product Σ_1 ⊗ ... ⊗ Σ_K in (1) is identifiable.

4 Multilinear Relationship Networks

This work models multiple tasks by jointly learning transferable representations and task relationships. Given T tasks with training data {X_t, Y_t}_{t=1}^T, where X_t = {x_1^t, ..., x_{N_t}^t} and Y_t = {y_1^t, ..., y_{N_t}^t} are the N_t training examples and associated labels of the t-th task, respectively drawn from a D-dimensional feature space and a C-cardinality label space, i.e. each training example x_n^t ∈ R^D and y_n^t ∈ {1, ..., C}. Our goal is to build a deep network for multiple tasks y_n^t = f_t(x_n^t) which learns transferable features and adaptive task relationships to bridge different tasks effectively and robustly.

Figure 1: Multilinear relationship network (MRN) for multi-task learning: (1) convolutional layers conv1–conv5 and fully-connected layer fc6 learn transferable features, so their parameters are shared across tasks; (2) fully-connected layers fc7–fc8 fit task-specific structures, so their parameters are modeled by tensor normal priors for learning multilinear relationships of features, classes and tasks.

4.1 Model

We start with deep convolutional neural networks (CNNs) [16], a family of models to learn transferable features that are well adaptive to multiple tasks [32, 28, 18, 27]. The main challenge is that in multi-task learning, each task is provided with a limited amount of labeled data, which is insufficient to build reliable classifiers without overfitting. 
In this sense, it is vital to model the task relationships through which each pair of tasks can help each other to enable knowledge transfer if they are related, and can remain independent to mitigate negative transfer if they are unrelated. With this idea, we design a Multilinear Relationship Network (MRN) that exploits both feature transferability and task relationships to establish effective and robust multi-task learning. Figure 1 shows the architecture of the proposed MRN model based on AlexNet [16], while other deep networks are also applicable.

We build the proposed MRN model upon AlexNet [16], which is comprised of convolutional layers (conv1–conv5) and fully-connected layers (fc6–fc8). The ℓ-th fc layer learns a nonlinear mapping h_n^{t,ℓ} = a^ℓ(W^{t,ℓ} h_n^{t,ℓ−1} + b^{t,ℓ}) for task t, where h_n^{t,ℓ} is the hidden representation of each point x_n^t, W^{t,ℓ} and b^{t,ℓ} are the weight and bias parameters, and a^ℓ is the activation function, taken as the ReLU a^ℓ(x) = max(0, x) for hidden layers or softmax units a^ℓ(x) = e^x / ∑_{j=1}^{|x|} e^{x_j} for the output layer. Denote by y = f_t(x) the CNN classifier of the t-th task; the empirical error of the CNN on {X_t, Y_t} is

min_{f_t} ∑_{n=1}^{N_t} J(f_t(x_n^t), y_n^t),    (6)

where J is the cross-entropy loss function, and f_t(x_n^t) is the conditional probability that the CNN assigns x_n^t to label y_n^t. We will not describe how to compute the convolutional layers since these layers can learn transferable features in general [28, 18], and we will simply share the network parameters of these layers across different tasks, without explicitly modeling the relationships of features and tasks in these layers. 
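As an illustration only (not the authors' code), the empirical-error term of Equation (6) summed over tasks can be sketched in NumPy, assuming pre-computed shared-layer features and hypothetical per-task softmax classifiers W[t], b[t]:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multitask_empirical_error(H, Y, W, b):
    """Sum over tasks of the cross-entropy losses J(f_t(x), y) in Equation (6).
    H[t]: (N_t, D) shared-layer features of task t; Y[t]: (N_t,) integer labels;
    W[t]: (D, C) and b[t]: (C,) are hypothetical task-specific classifier params."""
    total = 0.0
    for Ht, Yt, Wt, bt in zip(H, Y, W, b):
        P = softmax(Ht @ Wt + bt)  # conditional class probabilities f_t(x)
        total += -np.log(P[np.arange(len(Yt)), Yt]).sum()
    return total

rng = np.random.default_rng(0)
T, D, C = 3, 8, 5
H = [rng.normal(size=(10, D)) for _ in range(T)]
Y = [rng.integers(0, C, size=10) for _ in range(T)]
W = [np.zeros((D, C)) for _ in range(T)]
b = [np.zeros(C) for _ in range(T)]
# With all-zero parameters every class is equally likely, so the loss
# reduces to sum_t N_t * ln C.
loss = multitask_empirical_error(H, Y, W, b)
assert np.isclose(loss, 3 * 10 * np.log(C))
```

In MRN the classifiers are the fc8 parameters of each task; the toy shapes here are assumptions for the sketch.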
To benefit from pre-training and fine-tuning as most deep learning work, we copy these layers from a model pre-trained on ImageNet 2012 [28], and fine-tune all conv1–conv5 layers.

As revealed by recent literature findings [28], the deep features in standard CNNs must eventually transition from general to specific along the network, and the feature transferability decreases while the task discrepancy increases, making the features in higher layers fc7–fc8 unsafely transferable across different tasks. In other words, the fc layers are tailored to their original task at the expense of degraded performance on the target task, which may deteriorate multi-task learning based on deep neural networks. Most previous methods generally assume that the multiple tasks can be well correlated given the shared representation learned by the feature layers conv1–fc7 of deep networks [25, 22, 32, 27]. However, this may be vulnerable if different tasks are not well correlated under deep features, which is common as higher layers are not safely transferable and tasks may be dissimilar. Moreover, existing multi-task learning methods are natively designed for binary classification tasks, which are not good choices as deep networks mainly adopt multi-class softmax regression. It remains an open problem to explore the task relationships of multi-class classification for multi-task learning.

In this work, we jointly learn transferable features and multilinear relationships of features and tasks for multiple task-specific layers L in a Bayesian framework. Based on the transferability of deep networks discussed above, the task-specific layers L are set to {fc7, fc8}. Denote by X = {X_t}_{t=1}^T, Y = {Y_t}_{t=1}^T the complete training data of T tasks, and by W^{t,ℓ} ∈ R^{D_1^ℓ×D_2^ℓ} the network parameters of the t-th task in the ℓ-th layer, where D_1^ℓ and D_2^ℓ are the numbers of rows and columns of matrix W^{t,ℓ}. In order to capture the task relationship in the network parameters of all T tasks, we construct the ℓ-th layer parameter tensor as W^ℓ = [W^{1,ℓ}; ...; W^{T,ℓ}] ∈ R^{D_1^ℓ×D_2^ℓ×T}. Denote by W = {W^ℓ : ℓ ∈ L} the set of parameter tensors of all the task-specific layers L = {fc7, fc8}. The Maximum a Posteriori (MAP) estimation of the network parameters W given training data {X, Y} for learning multiple tasks is

p(W|X, Y) ∝ p(W) · p(Y|X, W) = ∏_{ℓ∈L} p(W^ℓ) · ∏_{t=1}^T ∏_{n=1}^{N_t} p(y_n^t | x_n^t, W^ℓ),    (7)

where we assume that for the prior p(W), the parameter tensor of each layer W^ℓ is independent of the parameter tensors of the other layers W^{ℓ'≠ℓ}, which is a common assumption made by most feed-forward neural network methods [3]. Finally, we assume that, when the network parameter is sampled from the prior, all tasks are independent. These independence assumptions lead to the factorization of the posterior in Equation (7), which makes the final MAP estimation in deep networks easy to solve.

The maximum likelihood estimation (MLE) part p(Y|X, W) in Equation (7) is modeled by the deep CNN in Equation (6), which can learn transferable features in lower layers for multi-task learning. We opt to share the network parameters of all these layers (conv1–fc6). This parameter sharing strategy is a relaxation of existing deep multi-task learning methods [22, 32, 7], which share all the feature layers except for the classifier layer. 
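To make the parameter-tensor construction concrete, here is a small NumPy sketch (an illustration under assumed toy dimensions, not the authors' implementation; the flattening convention is chosen to match the Kronecker factor order) that stacks per-task weight matrices into the order-3 tensor W^ℓ and evaluates a Kronecker-covariance Gaussian log-prior of the kind imposed on it:

```python
import numpy as np

def stack_parameter_tensor(W_list):
    """Stack the T task parameter matrices W^{t,l} (each D1 x D2) into the
    order-3 layer parameter tensor W^l of shape (D1, D2, T)."""
    return np.stack(W_list, axis=2)

def tensor_normal_log_prior(Wl, S1, S2, S3):
    """Unnormalized zero-mean log-prior -1/2 vec(W)^T (S1 (x) S2 (x) S3)^{-1} vec(W),
    forming the Kronecker product explicitly (fine for toy dims; a real
    implementation would exploit the Kronecker structure instead).
    Convention assumed: C-order flattening, so the mode-3 (task) index varies
    fastest, matching the factor order S1 (x) S2 (x) S3."""
    w = Wl.reshape(-1)  # C-order: last (task) index fastest
    Sigma = np.kron(np.kron(S1, S2), S3)
    return -0.5 * w @ np.linalg.solve(Sigma, w)

D1, D2, T = 4, 3, 2
rng = np.random.default_rng(1)
Wl = stack_parameter_tensor([rng.normal(size=(D1, D2)) for _ in range(T)])
# With identity covariances the prior reduces to -1/2 ||W||_F^2,
# i.e. ordinary weight decay on the stacked parameters.
lp = tensor_normal_log_prior(Wl, np.eye(D1), np.eye(D2), np.eye(T))
assert np.isclose(lp, -0.5 * (Wl ** 2).sum())
```

Learning non-identity mode covariances is what lets the prior couple features, classes, and tasks instead of decaying them independently.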
We do not share the task-specific layers (the last feature layer fc7 and the classifier layer fc8), with the expectation to potentially mitigate negative-transfer [28].

The prior part p(W) in Equation (7) is the key to enabling multi-task deep learning, since this prior part should be able to model the multilinear relationship across parameter tensors. This paper, for the first time, defines the prior for the ℓ-th layer parameter tensor by the tensor normal distribution [21] as

p(W^ℓ) = TN_{D_1^ℓ×D_2^ℓ×T}(O, Σ_1^ℓ, Σ_2^ℓ, Σ_3^ℓ),    (8)

where Σ_1^ℓ ∈ R^{D_1^ℓ×D_1^ℓ}, Σ_2^ℓ ∈ R^{D_2^ℓ×D_2^ℓ}, and Σ_3^ℓ ∈ R^{T×T} are the mode-1, mode-2, and mode-3 covariance matrices, respectively. Specifically, in the tensor normal prior, the row covariance matrix Σ_1^ℓ models the relationships between features (feature covariance), the column covariance matrix Σ_2^ℓ models the relationships between classes (class covariance), and the mode-3 covariance matrix Σ_3^ℓ models the relationships between tasks in the ℓ-th layer network parameters {W^{1,ℓ}, ..., W^{T,ℓ}}. A common strategy used by previous methods is to use identity covariance for the feature covariance [31, 8] and the class covariance [2], which implicitly assumes independent features and classes and cannot capture the dependencies between them. This work learns the feature covariance, class covariance, task covariance and all network parameters from data to build robust multilinear task relationships.

We integrate the CNN error functional (6) and the tensor normal prior (8) into the MAP estimation (7) and take the negative logarithm, which leads to the MAP estimation of the network parameters W, a regularized optimization problem for the Multilinear Relationship Network (MRN) formally written as

min_{f_t|_{t=1}^T, Σ_k^ℓ|_{k=1}^K} ∑_{t=1}^T ∑_{n=1}^{N_t} J(f_t(x_n^t), y_n^t) + (1/2) ∑_{ℓ∈L} ( vec(W^ℓ)^T (Σ_{1:K}^ℓ)^{−1} vec(W^ℓ) − ∑_{k=1}^K (D^ℓ/D_k^ℓ) ln(|Σ_k^ℓ|) ),    (9)

where D^ℓ = ∏_{k=1}^K D_k^ℓ and K = 3 is the number of modes in the parameter tensor W^ℓ, which could be K = 4 for convolutional layers (width, height, number of feature maps, and number of tasks); Σ_{1:3}^ℓ = Σ_1^ℓ ⊗ Σ_2^ℓ ⊗ Σ_3^ℓ is the Kronecker product of the feature covariance Σ_1^ℓ, class covariance Σ_2^ℓ, and task covariance Σ_3^ℓ. Moreover, we can assume a shared task relationship across different layers as Σ_3^ℓ = Σ_3, which enhances the connection between the task relationships on features fc7 and classifiers fc8.

4.2 Algorithm

The optimization problem (9) is jointly non-convex with respect to the parameter tensors W as well as the feature covariance Σ_1^ℓ, class covariance Σ_2^ℓ, and task covariance Σ_3^ℓ. 
Thus, we alternatively optimize one set of variables with the others fixed. We first update W^{t,ℓ}, the parameter of task t in layer ℓ. When training the deep CNN by back-propagation, we only require the gradient of the objective function (denoted by O) in Equation (9) w.r.t. W^{t,ℓ} on each data point (x_n^t, y_n^t), which can be computed as

∂O(x_n^t, y_n^t)/∂W^{t,ℓ} = ∂J(f_t(x_n^t), y_n^t)/∂W^{t,ℓ} + [(Σ_{1:3}^ℓ)^{−1} vec(W^ℓ)]_{··t},    (10)

where [(Σ_{1:3}^ℓ)^{−1} vec(W^ℓ)]_{··t} is the (:, :, t) slice of a tensor folded from the elements of (Σ_{1:3}^ℓ)^{−1} vec(W^ℓ) that correspond to parameter matrix W^{t,ℓ}. Since training a deep CNN requires a large amount of labeled data, which is prohibitive for many multi-task learning problems, we fine-tune from an AlexNet model pre-trained on ImageNet as in [28]. In each epoch, after updating W, we can update the feature covariance Σ_1^ℓ, class covariance Σ_2^ℓ, and task covariance Σ_3^ℓ by the flip-flop algorithm as

Σ_1^ℓ = (1/(D_2^ℓ T)) (W^ℓ)_(1) (Σ_3^ℓ ⊗ Σ_2^ℓ)^{−1} (W^ℓ)_(1)^T + εI_{D_1^ℓ},
Σ_2^ℓ = (1/(D_1^ℓ T)) (W^ℓ)_(2) (Σ_3^ℓ ⊗ Σ_1^ℓ)^{−1} (W^ℓ)_(2)^T + εI_{D_2^ℓ},    (11)
Σ_3^ℓ = (1/(D_1^ℓ D_2^ℓ)) (W^ℓ)_(3) (Σ_2^ℓ ⊗ Σ_1^ℓ)^{−1} (W^ℓ)_(3)^T + εI_T,

where the last term of each update equation is a small penalty traded off by ε for numerical stability. However, the above updating equations (11) are computationally prohibitive, due to the dimension explosion of the Kronecker product, e.g. Σ_2^ℓ ⊗ Σ_1^ℓ is of dimension D_1^ℓ D_2^ℓ × D_1^ℓ D_2^ℓ. To speed up computation, we will use the following rules of the Kronecker product: (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1} and (B^T ⊗ A) vec(X) = vec(AXB). Taking the computation of Σ_3^ℓ ∈ R^{T×T} as an example, we have

(Σ_3^ℓ)_{ij} = (1/(D_1^ℓ D_2^ℓ)) (W^ℓ)_{(3),i·} (Σ_2^ℓ ⊗ Σ_1^ℓ)^{−1} (W^ℓ)_{(3),j·}^T + εI_{ij} = (1/(D_1^ℓ D_2^ℓ)) (W^ℓ)_{(3),i·} vec((Σ_1^ℓ)^{−1} W^ℓ_{··j} (Σ_2^ℓ)^{−1}) + εI_{ij},    (12)

where (W^ℓ)_{(3),i·} denotes the i-th row of the mode-3 matricization of tensor W^ℓ, and W^ℓ_{··j} denotes the (:, :, j) slice of tensor W^ℓ. We can derive that updating Σ_3^ℓ has a computational complexity of O(T^2 D_1^ℓ D_2^ℓ (D_1^ℓ + D_2^ℓ)), and similarly for Σ_1^ℓ and Σ_2^ℓ. The total computational complexity of updating the covariance matrices Σ_k^ℓ|_{k=1}^3 will be O(D_1^ℓ D_2^ℓ T (D_1^ℓ D_2^ℓ + D_1^ℓ T + D_2^ℓ T)), which is still expensive.

A key to computation speedup is that the covariance matrices Σ_k^ℓ|_{k=1}^3 should be low-rank, since the features and tasks are enforced to be correlated for multi-task learning. Thus, the inverses of Σ_k^ℓ|_{k=1}^3 do not exist in general and we have to compute the generalized inverses using eigendecomposition. We perform eigendecomposition for each Σ_k^ℓ and maintain all eigenvectors with eigenvalues greater than zero. The rank r of the eigen-reconstructed covariance matrices satisfies r ≤ min(D_1^ℓ, D_2^ℓ, T). Thus, the total computational complexity for Σ_k^ℓ|_{k=1}^3 is reduced to O(r D_1^ℓ D_2^ℓ T (D_1^ℓ + D_2^ℓ + T)) given the generalized inverses (Σ_k^ℓ)^{−1}|_{k=1}^3. It is straightforward to see that the computational complexity of updating the parameter tensor W is the cost of back-propagation in standard CNNs plus the cost of computing the gradient of the regularization term by Equation (10), which is O(r D_1^ℓ D_2^ℓ T (D_1^ℓ + D_2^ℓ + T)).

4.3 Discussion

The proposed Multilinear Relationship Network (MRN) is very flexible and can be easily configured to deal with different network architectures and multi-task learning scenarios. For example, replacing the network backbone from AlexNet to VGGnet [24] boils down to configuring the task-specific layers L = {fc7, fc8}, where fc7 is the last feature layer while fc8 is the classifier layer in the VGGnet. The architecture of MRN in Figure 1 can readily cope with homogeneous multi-task learning, where all tasks share the same output space. It can cope with heterogeneous multi-task learning, where different tasks have different output spaces, by setting L = {fc7}, i.e. by only considering the feature layers.

The multilinear relationship learning in Equation (9) is a general framework that readily subsumes many classical multi-task learning methods as special cases. 
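Returning to the covariance updates of Section 4.2, one flip-flop step of the task covariance of Equation (11) can be sketched as follows (a toy NumPy illustration, not the authors' code; pseudo-inverses stand in for the generalized inverses, and the trace identity replaces the explicit Kronecker product):

```python
import numpy as np

def update_task_covariance(Wl, S1, S2, eps=1e-6):
    """One flip-flop update of the task covariance Sigma_3 as in Equation (11),
    using tr(A^T Sigma1^{-1} B Sigma2^{-1}) per entry instead of forming the
    (D1*D2) x (D1*D2) Kronecker product.  Wl has shape (D1, D2, T);
    pinv stands in for the inverse since the covariances may be low-rank."""
    D1, D2, T = Wl.shape
    S1inv, S2inv = np.linalg.pinv(S1), np.linalg.pinv(S2)
    S3 = np.empty((T, T))
    for i in range(T):
        for j in range(T):
            S3[i, j] = np.trace(Wl[:, :, i].T @ S1inv @ Wl[:, :, j] @ S2inv)
    return S3 / (D1 * D2) + eps * np.eye(T)

rng = np.random.default_rng(2)
D1, D2, T, eps = 5, 4, 3, 1e-6
Wl = rng.normal(size=(D1, D2, T))
S3 = update_task_covariance(Wl, np.eye(D1), np.eye(D2), eps)
# With identity feature/class covariances this is the Gram matrix of the
# vectorized task slices, scaled by 1/(D1*D2), plus the eps ridge.
G = np.einsum('abi,abj->ij', Wl, Wl) / (D1 * D2) + eps * np.eye(T)
assert np.allclose(S3, G)
```

Entry (i, j) measures the (scaled) inner product between the whitened parameter slices of tasks i and j, which is how the update exposes positive and negative task relations.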
Many regularized multi-task algorithms can be classified into two main categories: learning with feature covariances [1, 2, 6, 5] and learning with task relations [10, 14, 29, 31, 15, 17, 8]. Learning with feature covariances can be viewed as a representative formulation of feature-based methods, while learning with task relations is representative of parameter-based methods [30]. More specifically, previous multi-task feature learning methods [1, 2] can be viewed as a special case of Equation (9) by setting all covariance matrices but the feature covariance to the identity matrix, i.e. Σ_k = I for k = 2, ..., K; and previous multi-task relationship learning methods [31, 8] can be viewed as a special case of Equation (9) by setting all covariance matrices but the task covariance to the identity matrix, i.e. Σ_k = I for k = 1, ..., K−1. The proposed MRN is more general from the architecture perspective in dealing with parameter tensors in multiple layers of deep neural networks.

It is noteworthy to highlight a concurrent work on multi-task deep learning using tensor decomposition [27], which is a feature-based method that explicitly learns the low-rank shared parameter subspace. The proposed multilinear relationship across parameter tensors can be viewed as a strong alternative to tensor decomposition, with the advantage of explicitly modeling the positive and negative relations across features and tasks. In defense of [27], tensor decomposition can extract finer-grained feature relations (what to share and how much to share) than the proposed multilinear relationships.

5 Experiments

We compare MRN with state-of-the-art multi-task and deep learning methods to verify the efficacy of learning transferable features and multilinear task relationships. 
Code and datasets will be released.

5.1 Setup

Figure 2: Examples of the Office-Home dataset.

Office-Caltech [12] This dataset is a standard benchmark for multi-task learning and transfer learning. The Office part consists of 4,652 images in 31 categories collected from three distinct domains (tasks): Amazon (A), containing images downloaded from amazon.com, and Webcam (W) and DSLR (D), containing images taken by a Web camera and a digital SLR camera under different environmental variations. The dataset is organized by selecting the 10 common categories shared by the Office dataset and the Caltech-256 (C) dataset [12]; hence it yields four multi-class learning tasks.

Office-Home^1 [26] This dataset is designed to evaluate transfer learning algorithms using deep learning. It consists of images from 4 different domains: Artistic images (A), Clip Art (C), Product images (P), and Real-World images (R). For each domain, the dataset contains images of 65 object categories collected in office and home settings.

ImageCLEF-DA^2 This dataset is the benchmark for the ImageCLEF domain adaptation challenge, organized by selecting the 12 common categories shared by the following four public datasets (tasks): Caltech-256 (C), ImageNet ILSVRC 2012 (I), Pascal VOC 2012 (P), and Bing (B). All three datasets are evaluated using DeCAF7 [9] features for shallow methods and original images for deep methods.

We compare MRN with standard and state-of-the-art methods: Single-Task Learning (STL), Multi-Task Feature Learning (MTFL) [2], Multi-Task Relationship Learning (MTRL) [31], Robust Multi-Task Learning (RMTL) [5], and Deep Multi-Task Learning with Tensor Factorization (DMTL-TF) [27]. STL performs per-task classification in separate deep networks without knowledge transfer. MTFL extracts low-rank shared feature representations by learning the feature covariance.
RMTL extends MTFL to further capture the task relationships using a low-rank structure and to identify outlier tasks using a group-sparse structure. MTRL captures the task relationships using the task covariance of a matrix normal distribution. DMTL-TF tackles multi-task deep learning by tensor factorization, which learns a shared feature subspace instead of multilinear task relationships in multilayer parameter tensors. To examine the efficacy of jointly learning transferable features and multilinear task relationships, we evaluate two MRN variants: (1) MRN8, which uses only network layer fc8 for multilinear relationship learning; (2) MRNt, which uses only the task covariance Σ3 for single-relationship learning.

The proposed MRN model can natively deal with multi-class problems using the parameter tensors. However, most shallow multi-task learning methods such as MTFL, RMTL and MTRL are formulated only for binary-class problems, due to the difficulty of dealing with order-3 parameter tensors for multi-class problems.

^1 http://hemanthdv.org/OfficeHome-Dataset
^2 http://imageclef.org/2014/adaptation

Table 1: Classification accuracy on Office-Caltech with standard evaluation protocol (AlexNet).

5% labeled data:
Method           A     W     D     C     Avg
STL (AlexNet)    88.9  73.0  80.4  88.7  82.8
MTFL [2]         90.0  78.9  90.2  86.9  86.5
RMTL [6]         91.3  82.3  88.8  89.1  87.9
MTRL [31]        86.4  83.0  95.1  89.1  88.4
DMTL-TF [27]     91.2  88.3  92.5  85.6  89.4
MRN8             91.7  96.4  96.9  86.5  92.9
MRNt             91.1  96.3  97.4  86.1  92.7
MRN (full)       92.5  97.5  97.9  87.5  93.8

10% labeled data:
Method           A     W     D     C     Avg
STL (AlexNet)    92.2  80.9  88.2  88.9  87.6
MTFL [2]         92.4  85.3  89.5  89.2  89.1
RMTL [6]         92.6  85.2  93.3  87.2  89.6
MTRL [31]        91.1  87.1  97.0  87.6  90.7
DMTL-TF [27]     92.2  91.9  97.4  86.8  92.0
MRN8             92.7  97.1  97.3  86.6  93.4
MRNt             92.5  97.7  96.6  86.7  93.4
MRN (full)       93.6  98.6  98.6  87.3  94.5

20% labeled data:
Method           A     W     D     C     Avg
STL (AlexNet)    91.3  83.3  93.7  94.9  90.8
MTFL [2]         93.5  89.0  95.2  92.6  92.6
RMTL [6]         94.3  87.0  96.7  93.4  92.4
MTRL [31]        90.0  88.8  99.2  94.3  93.1
DMTL-TF [27]     92.6  97.6  94.5  88.4  93.3
MRN8             93.2  96.9  99.4  82.8  94.4
MRNt             91.9  96.6  95.9  90.0  93.6
MRN (full)       94.4  98.3  99.9  89.1  95.5

Table 2: Classification accuracy on Office-Home with standard evaluation protocol (VGGnet).

5% labeled data:
Method           A     C     P     R     Avg
STL (VGGnet)     35.8  31.2  67.8  62.5  49.3
MTFL [2]         40.1  30.4  61.5  59.5  47.9
RMTL [6]         42.3  32.8  62.3  60.6  49.5
MTRL [31]        42.7  33.3  62.9  61.3  50.1
DMTL-TF [27]     49.2  34.5  67.1  62.9  53.4
MRN8             52.7  34.7  70.1  67.6  56.3
MRNt             52.0  34.0  69.9  66.8  55.7
MRN (full)       53.3  36.4  70.5  67.7  57.0

10% labeled data:
Method           A     C     P     R     Avg
STL (VGGnet)     51.0  40.7  75.0  68.8  58.9
MTFL [2]         50.3  35.0  66.3  65.0  54.2
RMTL [6]         49.7  34.6  65.9  64.6  53.7
MTRL [31]        51.6  36.3  67.7  66.3  55.5
DMTL-TF [27]     57.2  42.3  73.6  69.9  60.8
MRN8             59.1  42.7  75.1  72.8  62.4
MRNt             58.6  42.6  74.9  72.4  62.1
MRN (full)       59.9  42.7  76.3  73.0  63.0

20% labeled data:
Method           A     C     P     R     Avg
STL (VGGnet)     56.1  54.6  80.4  71.8  65.7
MTFL [2]         55.2  38.8  69.1  70.0  58.3
RMTL [6]         55.2  39.2  69.6  70.5  58.6
MTRL [31]        55.8  39.9  70.2  71.2  59.3
DMTL-TF [27]     58.3  56.1  79.3  72.1  66.5
MRN8             58.4  55.6  80.4  72.4  66.7
MRNt             57.7  54.8  80.2  71.6  66.1
MRN (full)       58.5  55.6  80.7  72.8  66.9
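A one-vs-rest reduction lets such binary formulations handle multi-class data: one binary scorer is trained per class, and the highest-scoring class wins. The sketch below is a minimal illustration with a toy ridge least-squares learner standing in for the actual shallow methods; all names are illustrative:

```python
import numpy as np

def fit_binary(X, y, reg=1e-3):
    """Ridge least-squares binary scorer; y takes values in {-1, +1}."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

def fit_one_vs_rest(X, labels, n_classes):
    """Train one binary scorer per class: class c vs. the rest."""
    return [fit_binary(X, np.where(labels == c, 1.0, -1.0))
            for c in range(n_classes)]

def predict_one_vs_rest(models, X):
    """Predict the class whose scorer outputs the highest score."""
    scores = np.stack([X @ w for w in models], axis=1)  # shape (n, n_classes)
    return scores.argmax(axis=1)
```

Any binary multi-task learner can be plugged into `fit_binary` in place of the least-squares stand-in; the reduction itself is unchanged.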
We adopt a one-vs-rest strategy to enable them to work on multi-class datasets. We follow the standard evaluation protocol [31, 5] for multi-task learning: we randomly select 5%, 10%, and 20% of the samples from each task as the training set and use the rest as the test set. We compare the average classification accuracy over all tasks based on five random experiments; standard errors are generally less than ±0.5%, and are thus omitted for space. We conduct model selection for all methods using five-fold cross-validation on the training set. For deep learning methods, we adopt AlexNet [16] and VGGnet [24], fix convolutional layers conv1–conv5, fine-tune fully-connected layers fc6–fc7, and train the classifier layer fc8 via back-propagation. As the classifier layer is trained from scratch, we set its learning rate to 10 times that of the other layers. We use mini-batch stochastic gradient descent (SGD) with 0.9 momentum and a learning-rate decay strategy, and select the learning rate between 10^-5 and 10^-2 with a multiplicative step of 10^(1/2).

5.2 Results

The multi-task classification results on the Office-Caltech, Office-Home and ImageCLEF-DA datasets based on 5%, 10%, and 20% sampled training data are shown in Tables 1, 2 and 3, respectively. We observe that the proposed MRN model significantly outperforms the comparison methods on most multi-task problems. The substantial accuracy improvement validates that MRN, through multilayer and multilinear relationship learning, is able to learn both transferable features and adaptive task relationships, which enables effective and robust multi-task deep learning.

We can make the following observations from the results.
(1) Shallow multi-task learning methods MTFL, RMTL, and MTRL outperform the single-task deep learning method STL in most cases, which confirms the efficacy of learning multiple tasks by exploiting shared structures. Among the shallow multi-task methods, MTRL gives the best accuracies, showing that exploiting task relationships may be more effective than extracting a shared feature subspace for multi-task learning. It is worth noting that, although STL cannot learn from knowledge transfer, it can be fine-tuned on each task to improve performance; thus, when the number of training samples is large enough and different tasks are dissimilar enough (e.g. the Office-Home dataset), STL may outperform shallow multi-task learning methods, as evidenced by the results in Table 2. (2) The deep multi-task learning method DMTL-TF outperforms shallow multi-task learning methods with deep features as input, which confirms the importance of learning deep transferable features to enable knowledge transfer across tasks. However, DMTL-TF only learns the shared feature subspace based on tensor factorization of the network parameters, while the task relationships in multiple network layers are not captured.
This may result in negative-transfer in the feature layers [28] and under-transfer in the classifier layers. Negative-transfer can be witnessed by comparing multi-task methods with single-task methods: if multi-task learning methods yield lower accuracy on some of the tasks, then negative-transfer arises.

Table 3: Classification accuracy on ImageCLEF-DA with standard evaluation protocol (AlexNet).

5% labeled data:
Method           C     I     P     B     Avg
STL (AlexNet)    77.4  60.3  48.0  45.0  57.7
MTFL [2]         79.9  68.6  43.4  41.5  58.3
RMTL [6]         81.1  71.3  52.4  40.9  61.4
MTRL [31]        80.8  68.4  51.9  42.9  61.0
DMTL-TF [27]     87.9  70.0  58.1  34.1  62.5
MRN8             87.0  74.4  61.8  47.6  67.7
MRNt             88.5  73.5  63.3  51.1  69.1
MRN (full)       89.6  76.9  65.4  49.4  70.3

10% labeled data:
Method           C     I     P     B     Avg
STL (AlexNet)    78.9  70.5  48.1  41.8  59.8
MTFL [2]         82.9  71.4  56.7  41.7  63.2
RMTL [6]         81.5  71.7  55.6  45.3  63.5
MTRL [31]        83.1  72.7  54.5  45.5  63.9
DMTL-TF [27]     89.1  82.1  58.7  48.0  69.5
MRN8             89.1  82.2  64.4  49.3  71.2
MRNt             88.0  83.1  67.4  54.8  73.3
MRN (full)       88.1  84.6  68.7  55.6  74.3

20% labeled data:
Method           C     I     P     B     Avg
STL (AlexNet)    83.3  74.9  49.2  47.1  63.6
MTFL [2]         83.1  72.2  54.5  52.5  65.6
RMTL [6]         83.3  73.3  53.7  49.2  64.9
MTRL [31]        83.7  75.5  57.5  49.4  66.5
DMTL-TF [27]     91.7  80.0  63.2  54.1  72.2
MRN8             91.1  84.1  65.7  54.1  73.7
MRNt             91.1  83.5  65.7  55.7  74.0
MRN (full)       92.8  83.3  67.4  57.8  75.3

We go deeper into MRN by reporting the results of the two MRN variants, MRN8 and MRNt: both significantly outperform the comparison methods but generally underperform MRN (full), which verifies our motivation that jointly learning transferable features and multilinear task relationships can bridge multiple tasks more effectively. (1) The disadvantage of MRN8 is that it does not learn the task relationships in the lower layer fc7, whose features are not safely transferable and may result in negative transfer [28].
(2) The shortcoming of MRNt is that it does not learn the multilinear relationships of features, classes and tasks; hence the learned relationships may only capture the task covariance without capturing the feature covariance and class covariance, which may lose some intrinsic relations.

(a) MTRL Relationship  (b) MRN Relationship  (c) DMTL-TF Features  (d) MRN Features

Figure 3: Hinton diagrams of task relationships (a)(b) and t-SNE embeddings of deep features (c)(d).

5.3 Visualization Analysis

We show that MRN learns more reasonable task relationships with deep features than MTRL does with shallow features, by visualizing the Hinton diagrams of the task covariances learned by MTRL and by MRN (the task covariance Σ3 of layer fc8) in Figures 3(a) and 3(b), respectively. Prior knowledge on task similarity in the Office-Caltech dataset [12] indicates that tasks A, W and D are more similar to each other, while all are relatively dissimilar to task C. MRN successfully captures this prior task relationship and enhances the task correlation across dissimilar tasks, which enables stronger transferability for multi-task learning. Furthermore, all tasks are positively correlated (green color) in MRN, implying that all tasks can better reinforce each other. However, some of the tasks (D and C) are still negatively correlated (red color) in MTRL, implying these tasks should be drawn far apart and cannot improve each other.

We illustrate the feature transferability by visualizing in Figures 3(c) and 3(d) the t-SNE embeddings [18] of the images in the Office-Caltech dataset with DMTL-TF features and MRN features, respectively. Compared with DMTL-TF features, the data points with MRN features are discriminated better across different categories, i.e., each category has small intra-class variance and large inter-class margin; the data points are also aligned better across different tasks, i.e.
the embeddings of different tasks overlap well, implying that different tasks reinforce each other effectively. This verifies that with multilinear relationship learning, MRN can learn more transferable features for multi-task learning.

6 Conclusion

This paper presented multilinear relationship networks (MRN) that integrate deep neural networks with tensor normal priors over the network parameters of all task-specific layers, which model the task relatedness through the covariance structures over tasks, classes and features to enable transfer across related tasks. An effective learning algorithm was devised to jointly learn transferable features and multilinear relationships. Experiments testify that MRN yields superior results on standard datasets.

Acknowledgments

This work was supported by the National Key R&D Program of China (2016YFB1000701), National Natural Science Foundation of China (61772299, 61325008, 61502265, 61672313) and TNList Fund.

References

[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[3] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[4] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[5] J. Chen, L. Tang, J. Liu, and J. Ye. A convex formulation for learning a shared predictive structure from multiple tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5):1025–1038, 2013.
[6] J. Chen, J. Zhou, and J. Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In KDD, 2011.
[7] X. Chu, W. Ouyang, W. Yang, and X. Wang. Multi-task recurrent neural network for immediacy prediction. In ICCV, 2015.
[8] C. Ciliberto, Y. Mroueh, T. Poggio, and L. Rosasco. Convex learning of multiple tasks and their structure. In ICML, 2015.
[9] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[10] T. Evgeniou and M. Pontil. Regularized multi-task learning. In KDD, 2004.
[11] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.
[12] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[13] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall, 2000.
[14] L. Jacob, J.-P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In NIPS, 2009.
[15] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In ICML, 2011.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[17] A. Kumar and H. Daume III. Learning task grouping and overlap in multi-task learning. In ICML, 2012.
[18] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
[19] A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(1):2853–2884, 2016.
[20] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.
[21] M. Ohlson, M. R. Ahmad, and D. Von Rosen. The multilinear normal distribution: Introduction and some basic properties. Journal of Multivariate Analysis, 113:37–47, 2013.
[22] W. Ouyang, X. Chu, and X. Wang. Multisource deep learning for human pose estimation. In CVPR, 2014.
[23] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil. Multilinear multitask learning. In ICML, 2013.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[25] N. Srivastava and R. Salakhutdinov. Discriminative transfer learning with tree-based priors. In NIPS, 2013.
[26] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
[27] Y. Yang and T. Hospedales. Deep multi-task representation learning: A tensor factorisation approach. In ICLR, 2017.
[28] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
[29] Y. Zhang and J. Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In NIPS, 2010.
[30] Y. Zhang and Q. Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
[31] Y. Zhang and D.-Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In UAI, 2010.
[32] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.