{"title": "Multi-Task Learning as Multi-Objective Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 527, "page_last": 538, "abstract": "In multi-task learning, multiple tasks are solved jointly, sharing inductive bias between them. Multi-task learning is inherently a multi-objective problem because different tasks may conflict, necessitating a trade-off. A common compromise is to optimize a proxy objective that minimizes a weighted linear combination of per-task losses. However, this workaround is only valid when the tasks do not compete, which is rarely the case. In this paper, we explicitly cast multi-task learning as multi-objective optimization, with the overall objective of finding a Pareto optimal solution. To this end, we use algorithms developed in the gradient-based multi-objective optimization literature. These algorithms are not directly applicable to large-scale learning problems since they scale poorly with the dimensionality of the gradients and the number of tasks. We therefore propose an upper bound for the multi-objective loss and show that it can be optimized efficiently. We further prove that optimizing this upper bound yields a Pareto optimal solution under realistic assumptions. We apply our method to a variety of multi-task deep learning problems including digit classification, scene understanding (joint semantic segmentation, instance segmentation, and depth estimation), and multi-label classification. Our method produces higher-performing models than recent multi-task learning formulations or per-task training.", "full_text": "Multi-Task Learning as Multi-Objective Optimization\n\nOzan Sener\nIntel Labs\n\nVladlen Koltun\n\nIntel Labs\n\nAbstract\n\nIn multi-task learning, multiple tasks are solved jointly, sharing inductive bias\nbetween them. Multi-task learning is inherently a multi-objective problem because\ndifferent tasks may con\ufb02ict, necessitating a trade-off. 
A common compromise is to\noptimize a proxy objective that minimizes a weighted linear combination of per-\ntask losses. However, this workaround is only valid when the tasks do not compete,\nwhich is rarely the case. In this paper, we explicitly cast multi-task learning as\nmulti-objective optimization, with the overall objective of \ufb01nding a Pareto optimal\nsolution. To this end, we use algorithms developed in the gradient-based multi-\nobjective optimization literature. These algorithms are not directly applicable to\nlarge-scale learning problems since they scale poorly with the dimensionality of\nthe gradients and the number of tasks. We therefore propose an upper bound\nfor the multi-objective loss and show that it can be optimized ef\ufb01ciently. We\nfurther prove that optimizing this upper bound yields a Pareto optimal solution\nunder realistic assumptions. We apply our method to a variety of multi-task\ndeep learning problems including digit classi\ufb01cation, scene understanding (joint\nsemantic segmentation, instance segmentation, and depth estimation), and multi-\nlabel classi\ufb01cation. Our method produces higher-performing models than recent\nmulti-task learning formulations or per-task training.\n\n1\n\nIntroduction\n\nOne of the most surprising results in statistics is Stein\u2019s paradox. Stein (1956) showed that it is better\nto estimate the means of three or more Gaussian random variables using samples from all of them\nrather than estimating them separately, even when the Gaussians are independent. Stein\u2019s paradox\nwas an early motivation for multi-task learning (MTL) (Caruana, 1997), a learning paradigm in which\ndata from multiple tasks is used with the hope to obtain superior performance over learning each task\nindependently. 
Potential advantages of MTL go beyond the direct implications of Stein\u2019s paradox,\nsince even seemingly unrelated real world tasks have strong dependencies due to the shared processes\nthat give rise to the data. For example, although autonomous driving and object manipulation are\nseemingly unrelated, the underlying data is governed by the same laws of optics, material properties,\nand dynamics. This motivates the use of multiple tasks as an inductive bias in learning systems.\nA typical MTL system is given a collection of input points and sets of targets for various tasks per\npoint. A common way to set up the inductive bias across tasks is to design a parametrized hypothesis\nclass that shares some parameters across tasks. Typically, these parameters are learned by solving an\noptimization problem that minimizes a weighted sum of the empirical risk for each task. However,\nthe linear-combination formulation is only sensible when there is a parameter set that is effective\nacross all tasks. In other words, minimization of a weighted sum of empirical risk is only valid if\ntasks are not competing, which is rarely the case. MTL with con\ufb02icting objectives requires modeling\nof the trade-off between tasks, which is beyond what a linear combination achieves.\nAn alternative objective for MTL is \ufb01nding solutions that are not dominated by any others. Such\nsolutions are said to be Pareto optimal. In this paper, we cast the objective of MTL in terms of \ufb01nding\nPareto optimal solutions.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe problem of \ufb01nding Pareto optimal solutions given multiple criteria is called multi-objective\noptimization. A variety of algorithms for multi-objective optimization exist. One such approach\nis the multiple-gradient descent algorithm (MGDA), which uses gradient-based optimization and\nprovably converges to a point on the Pareto set (D\u00e9sid\u00e9ri, 2012). 
MGDA is well-suited for multi-task\nlearning with deep networks. It can use the gradients of each task and solve an optimization problem\nto decide on an update over the shared parameters. However, there are two technical problems that\nhinder the applicability of MGDA on a large scale. (i) The underlying optimization problem does\nnot scale gracefully to high-dimensional gradients, which arise naturally in deep networks. (ii) The\nalgorithm requires explicit computation of gradients per task, which results in linear scaling of the\nnumber of backward passes and roughly multiplies the training time by the number of tasks.\nIn this paper, we develop a Frank-Wolfe-based optimizer that scales to high-dimensional problems.\nFurthermore, we provide an upper bound for the MGDA optimization objective and show that it can\nbe computed via a single backward pass without explicit task-speci\ufb01c gradients, thus making the\ncomputational overhead of the method negligible. We prove that using our upper bound yields a Pareto\noptimal solution under realistic assumptions. The result is an exact algorithm for multi-objective\noptimization of deep networks with negligible computational overhead.\nWe empirically evaluate the presented method on three different problems. First, we perform an\nextensive evaluation on multi-digit classi\ufb01cation with MultiMNIST (Sabour et al., 2017). Second, we\ncast multi-label classi\ufb01cation as MTL and conduct experiments with the CelebA dataset (Liu et al.,\n2015b). Lastly, we apply the presented method to scene understanding; speci\ufb01cally, we perform\njoint semantic segmentation, instance segmentation, and depth estimation on the Cityscapes dataset\n(Cordts et al., 2016). The number of tasks in our evaluation varies from 2 to 40. Our method clearly\noutperforms all baselines.\n\n2 Related Work\n\nMulti-task learning. 
We summarize the work most closely related to ours and refer the interested\nreader to reviews by Ruder (2017) and Zhou et al. (2011b) for additional background. Multi-task\nlearning (MTL) is typically conducted via hard or soft parameter sharing. In hard parameter sharing,\na subset of the parameters is shared between tasks while other parameters are task-speci\ufb01c. In soft\nparameter sharing, all parameters are task-speci\ufb01c but they are jointly constrained via Bayesian\npriors (Xue et al., 2007; Bakker and Heskes, 2003) or a joint dictionary (Argyriou et al., 2007; Long\nand Wang, 2015; Yang and Hospedales, 2016; Ruder, 2017). We focus on hard parameter sharing\nwith gradient-based optimization, following the success of deep MTL in computer vision (Bilen and\nVedaldi, 2016; Misra et al., 2016; Rudd et al., 2016; Yang and Hospedales, 2016; Kokkinos, 2017;\nZamir et al., 2018), natural language processing (Collobert and Weston, 2008; Dong et al., 2015; Liu\net al., 2015a; Luong et al., 2015; Hashimoto et al., 2017), speech processing (Huang et al., 2013;\nSeltzer and Droppo, 2013; Huang et al., 2015), and even seemingly unrelated domains over multiple\nmodalities (Kaiser et al., 2017).\nBaxter (2000) theoretically analyze the MTL problem as interaction between individual learners and\na meta-algorithm. Each learner is responsible for one task and a meta-algorithm decides how the\nshared parameters are updated. All aforementioned MTL algorithms use weighted summation as the\nmeta-algorithm. Meta-algorithms that go beyond weighted summation have also been explored. Li\net al. (2014) consider the case where each individual learner is based on kernel learning and utilize\nmulti-objective optimization. Zhang and Yeung (2010) consider the case where each learner is a\nlinear model and use a task af\ufb01nity matrix. Zhou et al. (2011a) and Bagherjeiran et al. 
(2005) use\nthe assumption that tasks share a dictionary and develop an expectation-maximization-like meta-\nalgorithm. de Miranda et al. (2012) and Zhou et al. (2017b) use swarm optimization. None of these\nmethods apply to gradient-based learning of high-capacity models such as modern deep networks.\nKendall et al. (2018) and Chen et al. (2018) propose heuristics based on uncertainty and gradient\nmagnitudes, respectively, and apply their methods to convolutional neural networks. Another recent\nwork uses multi-agent reinforcement learning (Rosenbaum et al., 2017).\nMulti-objective optimization. Multi-objective optimization addresses the problem of optimizing a\nset of possibly contrasting objectives. We recommend Miettinen (1998) and Ehrgott (2005) for surveys\nof this \ufb01eld. Of particular relevance to our work is gradient-based multi-objective optimization, as\ndeveloped by Fliege and Svaiter (2000), Sch\u00e4f\ufb02er et al. (2002), and D\u00e9sid\u00e9ri (2012). These methods\n\n2\n\n\fuse multi-objective Karush-Kuhn-Tucker (KKT) conditions (Kuhn and Tucker, 1951) and \ufb01nd a\ndescent direction that decreases all objectives. This approach was extended to stochastic gradient\ndescent by Peitz and Dellnitz (2018) and Poirion et al. (2017). In machine learning, these methods\nhave been applied to multi-agent learning (Ghosh et al., 2013; Pirotta and Restelli, 2016; Parisi\net al., 2014), kernel learning (Li et al., 2014), sequential decision making (Roijers et al., 2013), and\nBayesian optimization (Shah and Ghahramani, 2016; Hern\u00e1ndez-Lobato et al., 2016). Our work\napplies gradient-based multi-objective optimization to multi-task learning.\n\n3 Multi-Task Learning as Multi-Objective Optimization\nConsider a multi-task learning (MTL) problem over an input space X and a collection of task spaces\n{Y t}t2[T ], such that a large dataset of i.i.d. 
data points {(x_i, y_i^1, …, y_i^T)}_{i∈[N]} is given, where T is the number of tasks, N is the number of data points, and y_i^t is the label of the t-th task for the i-th data point.¹ We further consider a parametric hypothesis class per task, f^t(x; θ^sh, θ^t) : X → Y^t, such that some parameters (θ^sh) are shared between tasks and some (θ^t) are task-specific. We also consider task-specific loss functions L^t(·, ·) : Y^t × Y^t → ℝ⁺.

Although many hypothesis classes and loss functions have been proposed in the MTL literature, they generally yield the following empirical risk minimization formulation:

min_{θ^sh, θ^1, …, θ^T}  ∑_{t=1}^T c^t L̂^t(θ^sh, θ^t)    (1)

for some static or dynamically computed weights c^t per task, where L̂^t(θ^sh, θ^t) is the empirical loss of task t, defined as L̂^t(θ^sh, θ^t) ≜ (1/N) ∑_i L(f^t(x_i; θ^sh, θ^t), y_i^t).

Although the weighted summation formulation (1) is intuitively appealing, it typically either requires an expensive grid search over various scalings or the use of a heuristic (Kendall et al., 2018; Chen et al., 2018). A basic justification for scaling is that it is not possible to define global optimality in the MTL setting. Consider two sets of solutions θ and θ̄ such that L̂^{t1}(θ^sh, θ^{t1}) < L̂^{t1}(θ̄^sh, θ̄^{t1}) and L̂^{t2}(θ^sh, θ^{t2}) > L̂^{t2}(θ̄^sh, θ̄^{t2}) for some tasks t1 and t2. In other words, solution θ is better for task t1 whereas θ̄ is better for t2. It is not possible to compare these two solutions without a pairwise importance of tasks, which is typically not available.

Alternatively, MTL can be formulated as multi-objective optimization: optimizing a collection of possibly conflicting objectives. This is the approach we take.
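The incomparability of the two solutions above is the notion of Pareto dominance. A minimal illustrative sketch (our own helper, not part of the paper's implementation) that compares candidate solutions by their per-task empirical losses:

```python
import numpy as np

def dominates(losses_a, losses_b):
    """True if solution a is no worse than b on every task and strictly
    better on at least one, i.e. a dominates b."""
    a = np.asarray(losses_a, dtype=float)
    b = np.asarray(losses_b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))

# Two solutions that trade off task 1 against task 2 are incomparable:
theta, theta_bar = [0.2, 0.9], [0.7, 0.3]
assert not dominates(theta, theta_bar) and not dominates(theta_bar, theta)
```

Neither loss vector dominates the other, so a scalar weighting is needed to rank them, which is exactly the pairwise importance that is typically unavailable.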
We specify the multi-objective optimization formulation of MTL using a vector-valued loss L:

min_{θ^sh, θ^1, …, θ^T}  L(θ^sh, θ^1, …, θ^T) = min_{θ^sh, θ^1, …, θ^T}  (L̂^1(θ^sh, θ^1), …, L̂^T(θ^sh, θ^T))ᵀ.    (2)

The goal of multi-objective optimization is achieving Pareto optimality.

Definition 1 (Pareto optimality for MTL)
(a) A solution θ dominates a solution θ̄ if L̂^t(θ^sh, θ^t) ≤ L̂^t(θ̄^sh, θ̄^t) for all tasks t and L(θ^sh, θ^1, …, θ^T) ≠ L(θ̄^sh, θ̄^1, …, θ̄^T).
(b) A solution θ* is called Pareto optimal if there exists no solution θ that dominates θ*.

The set of Pareto optimal solutions is called the Pareto set (P_θ) and its image is called the Pareto front (P_L = {L(θ)}_{θ∈P_θ}). In this paper, we focus on gradient-based multi-objective optimization due to its direct relevance to gradient-based MTL.

In the rest of this section, we first summarize in Section 3.1 how multi-objective optimization can be performed with gradient descent. Then, we suggest in Section 3.2 a practical algorithm for performing multi-objective optimization over very large parameter spaces. Finally, in Section 3.3 we propose an efficient solution for multi-objective optimization designed directly for high-capacity deep networks. Our method scales to very large models and a high number of tasks with negligible overhead.

1 This definition can be extended to the partially-labelled case by extending Y^t with a null label.

3.1 Multiple Gradient Descent Algorithm

As in the single-objective case, multi-objective optimization can be solved to local optimality via gradient descent. In this section, we summarize one such approach, called the multiple gradient descent algorithm (MGDA) (Désidéri, 2012).
MGDA leverages the Karush-Kuhn-Tucker (KKT) conditions, which are necessary for optimality (Fliege and Svaiter, 2000; Schäffler et al., 2002; Désidéri, 2012). We now state the KKT conditions for both task-specific and shared parameters:

• There exist α^1, …, α^T ≥ 0 such that ∑_{t=1}^T α^t = 1 and ∑_{t=1}^T α^t ∇_{θ^sh} L̂^t(θ^sh, θ^t) = 0.
• For all tasks t, ∇_{θ^t} L̂^t(θ^sh, θ^t) = 0.

Any solution that satisfies these conditions is called a Pareto stationary point. Although every Pareto optimal point is Pareto stationary, the reverse may not be true. Consider the optimization problem

min_{α^1, …, α^T}  { ‖ ∑_{t=1}^T α^t ∇_{θ^sh} L̂^t(θ^sh, θ^t) ‖₂²  |  ∑_{t=1}^T α^t = 1,  α^t ≥ 0  ∀t }.    (3)

Désidéri (2012) showed that either the solution to this optimization problem is 0 and the resulting point satisfies the KKT conditions, or the solution gives a descent direction that improves all tasks. Hence, the resulting MTL algorithm would be gradient descent on the task-specific parameters followed by solving (3) and applying the solution (∑_{t=1}^T α^t ∇_{θ^sh}) as a gradient update to shared parameters. We discuss how to solve (3) for an arbitrary model in Section 3.2 and present an efficient solution when the underlying model is an encoder-decoder in Section 3.3.

3.2 Solving the Optimization Problem

The optimization problem defined in (3) is equivalent to finding a minimum-norm point in the convex hull of the set of input points. This problem arises naturally in computational geometry: it is equivalent to finding the closest point within a convex hull to a given query point. It has been studied extensively (Makimoto et al., 1994; Wolfe, 1976; Sekitani and Yamamoto, 1993). Although many algorithms have been proposed, they do not apply in our setting because the assumptions they make do not hold.
Algorithms proposed in the computational geometry literature address the problem of finding minimum-norm points in the convex hull of a large number of points in a low-dimensional space (typically of dimensionality 2 or 3). In our setting, the number of points is the number of tasks and is typically low; in contrast, the dimensionality is the number of shared parameters and can be in the millions. We therefore use a different approach based on convex optimization, since (3) is a convex quadratic problem with linear constraints.

Before we tackle the general case, let us consider the case of two tasks. The optimization problem can be defined as min_{α∈[0,1]} ‖α ∇_{θ^sh} L̂^1(θ^sh, θ^1) + (1 − α) ∇_{θ^sh} L̂^2(θ^sh, θ^2)‖₂², which is a one-dimensional quadratic function of α with an analytical solution:

α̂ = [ ( (∇_{θ^sh} L̂^2(θ^sh, θ^2) − ∇_{θ^sh} L̂^1(θ^sh, θ^1))ᵀ ∇_{θ^sh} L̂^2(θ^sh, θ^2) ) / ‖ ∇_{θ^sh} L̂^1(θ^sh, θ^1) − ∇_{θ^sh} L̂^2(θ^sh, θ^2) ‖₂² ]_{+,1},    (4)

where [·]_{+,1} = max(min(·, 1), 0) represents clipping to [0, 1]. We further visualize this solution in Figure 1. Although this is only applicable when T = 2, it enables efficient application of the Frank-Wolfe algorithm (Jaggi, 2013) since the line search can be solved analytically. Hence, we use Frank-Wolfe to solve the constrained optimization problem, using (4) as a subroutine for the line search.
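The closed-form line search (4) is only a few lines of code. A minimal sketch (the function and variable names are ours, not the authors'):

```python
import numpy as np

def min_norm_coeff(g1, g2):
    """Solve min over a in [0,1] of ||a*g1 + (1-a)*g2||^2 in closed form,
    as in Eq. (4): compute the unconstrained optimum, then clip to [0, 1]."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:              # identical gradients: every a is optimal
        return 0.5
    a = float((g2 - g1) @ g2) / denom
    return min(max(a, 0.0), 1.0)  # the [.]_{+,1} clipping

# Orthogonal task gradients are balanced equally:
a = min_norm_coeff(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # a = 0.5
```

When one gradient is a shorter multiple of the other, the clipping places all weight on the shorter one, which is the min-norm point of the segment.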
We give all the update equations for the Frank-Wolfe solver in Algorithm 2.

Algorithm 1  min_{γ∈[0,1]} ‖γθ + (1 − γ)θ̄‖₂²
1: if θᵀθ̄ ≥ θᵀθ then
2:   γ = 1
3: else if θᵀθ̄ ≥ θ̄ᵀθ̄ then
4:   γ = 0
5: else
6:   γ = ( (θ̄ − θ)ᵀθ̄ ) / ‖θ − θ̄‖₂²
7: end if

Figure 1: Visualisation of the min-norm point in the convex hull of two points (min_{γ∈[0,1]} ‖γθ + (1 − γ)θ̄‖₂²). As the geometry suggests, the solution is either an edge case or a perpendicular vector.

Algorithm 2  Update Equations for MTL
1: for t = 1 to T do
2:   θ^t = θ^t − η ∇_{θ^t} L̂^t(θ^sh, θ^t)    ▷ Gradient descent on task-specific parameters
3: end for
4: α^1, …, α^T = FRANKWOLFESOLVER(θ)    ▷ Solve (3) to find a common descent direction
5: θ^sh = θ^sh − η ∑_{t=1}^T α^t ∇_{θ^sh} L̂^t(θ^sh, θ^t)    ▷ Gradient descent on shared parameters

6: procedure FRANKWOLFESOLVER(θ)
7:   Initialize α = (α^1, …, α^T) = (1/T, …, 1/T)
8:   Precompute M s.t. M_{i,j} = ( ∇_{θ^sh} L̂^i(θ^sh, θ^i) )ᵀ ( ∇_{θ^sh} L̂^j(θ^sh, θ^j) )
9:   repeat
10:    t̂ = argmin_r ∑_t α^t M_{rt}
11:    γ̂ = argmin_γ ( (1 − γ)α + γ e_{t̂} )ᵀ M ( (1 − γ)α + γ e_{t̂} )    ▷ Using Algorithm 1
12:    α = (1 − γ̂)α + γ̂ e_{t̂}
13:  until γ̂ ~ 0 or Number of Iterations Limit
14:  return α^1, …, α^T
15: end procedure

3.3 Efficient Optimization for Encoder-Decoder Architectures

The MTL update described in Algorithm 2 is applicable to any problem that uses optimization based on gradient descent. Our experiments also suggest that the Frank-Wolfe solver is efficient and accurate, as it typically converges in a modest number of iterations with negligible effect on training time.
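The solver portion of Algorithms 1 and 2 can be sketched in a few lines of NumPy. This is a re-implementation under our own naming, taking explicit gradient vectors for clarity; like the paper's version, it only ever touches the precomputed Gram matrix M:

```python
import numpy as np

def line_search(aa, ab, bb):
    """Algorithm 1: argmin over gamma in [0,1] of ||(1-gamma)*u + gamma*v||^2,
    given only the dot products aa = u.u, ab = u.v, bb = v.v."""
    if ab >= aa:
        return 0.0               # optimum at gamma = 0 (keep u)
    if ab >= bb:
        return 1.0               # optimum at gamma = 1 (jump to v)
    return float((aa - ab) / (aa - 2.0 * ab + bb))

def frank_wolfe_min_norm(grads, max_iter=250, tol=1e-7):
    """Frank-Wolfe solver for (3): min-norm convex combination of the
    gradient vectors in `grads`, working from the Gram matrix M."""
    T = len(grads)
    M = np.array([[float(gi @ gj) for gj in grads] for gi in grads])
    alpha = np.full(T, 1.0 / T)
    for _ in range(max_iter):
        Ma = M @ alpha
        t_hat = int(np.argmin(Ma))      # vertex e_{t_hat} of the simplex
        aa = float(alpha @ Ma)          # squared norm of the current point
        gamma = line_search(aa, float(Ma[t_hat]), M[t_hat, t_hat])
        if gamma < tol:
            break
        alpha = (1.0 - gamma) * alpha
        alpha[t_hat] += gamma
    return alpha
```

For two orthogonal gradients of equal norm the solver returns α = (0.5, 0.5), and for two collinear gradients it puts all weight on the shorter one, matching the geometry in Figure 1.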
However, the algorithm we described needs to compute ∇_{θ^sh} L̂^t(θ^sh, θ^t) for each task t, which requires a backward pass over the shared parameters for each task. Hence, the resulting gradient computation would be the forward pass followed by T backward passes. Considering the fact that computation of the backward pass is typically more expensive than the forward pass, this results in linear scaling of the training time and can be prohibitive for problems with more than a few tasks.

We now propose an efficient method that optimizes an upper bound of the objective and requires only a single backward pass. We further show that optimizing this upper bound yields a Pareto optimal solution under realistic assumptions. The architectures we address conjoin a shared representation function with task-specific decision functions. This class of architectures covers most of the existing deep MTL models and can be formally defined by constraining the hypothesis class as

f^t(x; θ^sh, θ^t) = (f^t(·; θ^t) ∘ g(·; θ^sh))(x) = f^t(g(x; θ^sh); θ^t),    (5)

where g is the representation function shared by all tasks and f^t are the task-specific functions that take this representation as input. If we denote the representations as Z = (z_1, …, z_N), where z_i = g(x_i; θ^sh), we can state the following upper bound as a direct consequence of the chain rule:

‖ ∑_{t=1}^T α^t ∇_{θ^sh} L̂^t(θ^sh, θ^t) ‖₂²  ≤  ‖ ∂Z/∂θ^sh ‖₂² ‖ ∑_{t=1}^T α^t ∇_Z L̂^t(θ^sh, θ^t) ‖₂²,    (6)

where ‖∂Z/∂θ^sh‖₂ is the matrix norm of the Jacobian of Z with respect to θ^sh. Two desirable properties of this upper bound are that (i) ∇_Z L̂^t(θ^sh, θ^t) can be computed in a single backward pass for all tasks and (ii) ‖∂Z/∂θ^sh‖₂² is not a function of α^1, …, α^T, hence it can be removed when it is used as an optimization objective. We replace the ‖∑_{t=1}^T α^t ∇_{θ^sh} L̂^t(θ^sh, θ^t)‖₂² term with the upper bound we have just derived in order to obtain the approximate optimization problem, and drop the ‖∂Z/∂θ^sh‖₂² term since it does not affect the optimization. The resulting optimization problem is

min_{α^1, …, α^T}  { ‖ ∑_{t=1}^T α^t ∇_Z L̂^t(θ^sh, θ^t) ‖₂²  |  ∑_{t=1}^T α^t = 1,  α^t ≥ 0  ∀t }.    (MGDA-UB)

We refer to this problem as MGDA-UB (Multiple Gradient Descent Algorithm – Upper Bound). In practice, MGDA-UB corresponds to using the gradients of the task losses with respect to the representations instead of the shared parameters. We use Algorithm 2 with only this change as the final method.

Although MGDA-UB is an approximation of the original optimization problem, we now state a theorem that shows that our method produces a Pareto optimal solution under mild assumptions. The proof is given in the supplement.

Theorem 1  Assume ∂Z/∂θ^sh is full-rank. If α^1, …, α^T is the solution of MGDA-UB, one of the following is true:
(a) ∑_{t=1}^T α^t ∇_{θ^sh} L̂^t(θ^sh, θ^t) = 0 and the current parameters are Pareto stationary.
(b) ∑_{t=1}^T α^t ∇_{θ^sh} L̂^t(θ^sh, θ^t) is a descent direction that decreases all objectives.

This result follows from the fact that as long as ∂Z/∂θ^sh is full rank, optimizing the upper bound corresponds to minimizing the norm of the convex combination of the gradients using the Mahalanobis norm defined by (∂Z/∂θ^sh)ᵀ(∂Z/∂θ^sh). The non-singularity assumption is reasonable, as singularity implies that tasks are linearly related and a trade-off is not necessary.
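The chain-rule inequality behind MGDA-UB is easy to sanity-check numerically. In the sketch below, a random matrix J stands in for the Jacobian ∂Z/∂θ^sh and random vectors stand in for the per-task gradients with respect to the representation; all shapes and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_theta, dim_z, T = 64, 8, 3

J = rng.standard_normal((dim_z, dim_theta))   # stand-in for dZ/d(theta_sh)
g_z = rng.standard_normal((T, dim_z))         # per-task gradients w.r.t. Z
alpha = rng.dirichlet(np.ones(T))             # a convex combination over tasks

# Chain rule: dL_t/d(theta_sh) = J^T (dL_t/dZ), so the combined
# shared-parameter gradient is J^T applied to the combined Z-gradient.
combined_z = alpha @ g_z
lhs = np.linalg.norm(J.T @ combined_z) ** 2                       # left side of (6)
rhs = np.linalg.norm(J, 2) ** 2 * np.linalg.norm(combined_z) ** 2 # right side of (6)
assert lhs <= rhs + 1e-9                                          # the bound holds
```

Here `np.linalg.norm(J, 2)` is the spectral norm, which is what makes ‖Jᵀv‖₂ ≤ ‖J‖₂‖v‖₂ hold for every v.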
In summary, our method provably finds a Pareto stationary point with negligible computational overhead and can be applied to any deep multi-objective problem with an encoder-decoder model.

4 Experiments

We evaluate the presented MTL method on a number of problems. First, we use MultiMNIST (Sabour et al., 2017), an MTL adaptation of MNIST (LeCun et al., 1998). Next, we tackle multi-label classification on the CelebA dataset (Liu et al., 2015b) by considering each label as a distinct binary classification task. These problems include both classification and regression, with the number of tasks ranging from 2 to 40. Finally, we experiment with scene understanding, jointly tackling the tasks of semantic segmentation, instance segmentation, and depth estimation on the Cityscapes dataset (Cordts et al., 2016). We discuss each experiment separately in the following subsections.

The baselines we consider are (i) uniform scaling: minimizing a uniformly weighted sum of loss functions, (1/T) ∑_t L^t; (ii) single task: solving tasks independently; (iii) grid search: exhaustively trying various values from {c^t ∈ [0, 1] | ∑_t c^t = 1} and optimizing for (1/T) ∑_t c^t L^t; (iv) Kendall et al. (2018): using the uncertainty weighting proposed by Kendall et al. (2018); and (v) GradNorm: using the normalization proposed by Chen et al. (2018).

4.1 MultiMNIST

Our initial experiments are on MultiMNIST, an MTL version of the MNIST dataset (Sabour et al., 2017). In order to convert digit classification into a multi-task problem, Sabour et al. (2017) overlaid multiple images together. We use a similar construction. For each image, a different one is chosen uniformly at random. Then one of these images is placed at the top-left and the other at the bottom-right. The resulting tasks are: classifying the digit on the top-left (task-L) and classifying the digit on the bottom-right (task-R).
We use 60K examples and directly apply existing single-task MNIST models. The MultiMNIST dataset is illustrated in the supplement.

We use the LeNet architecture (LeCun et al., 1998). We treat all layers except the last as the representation function g and put two fully-connected layers as task-specific functions (see the supplement for details). We visualize the performance profile as a scatter plot of accuracies on task-L and task-R in Figure 3, and list the results in Table 3.

In this setup, any static scaling results in lower accuracy than solving each task separately (the single-task baseline). The two tasks appear to compete for model capacity, since an increase in the accuracy of one task results in a decrease in the accuracy of the other. Uncertainty weighting (Kendall et al., 2018) and GradNorm (Chen et al., 2018) find solutions that are slightly better than grid search but distinctly worse than the single-task baseline. In contrast, our method finds a solution that efficiently utilizes the model capacity and yields accuracies that are as good as the single-task solutions. This experiment demonstrates the effectiveness of our method as well as the necessity of treating MTL as multi-objective optimization. Even after a large hyper-parameter search, no scaling of tasks approaches the effectiveness of our method.

4.2 Multi-Label Classification

Next, we tackle multi-label classification. Given a set of attributes, multi-label classification calls for deciding whether each attribute holds for the input. We use the CelebA dataset (Liu et al., 2015b), which includes 200K face images annotated with 40 attributes.

Figure 2: Radar charts of percentage error per attribute on CelebA (Liu et al., 2015b). Lower is better. We divide attributes into two sets for legibility: easy on the left, hard on the right. Zoom in for details.
Each attribute gives rise to a binary classification task, and we cast this as a 40-way MTL problem. We use ResNet-18 (He et al., 2016) without the final layer as a shared representation function, and attach a linear layer for each attribute (see the supplement for further details).

We plot the resulting error for each binary classification task as a radar chart in Figure 2. The average over them is listed in Table 1. We skip grid search since it is not feasible over 40 tasks. Although uniform scaling is the norm in the multi-label classification literature, single-task performance is significantly better. Our method outperforms the baselines for a significant majority of tasks and achieves comparable performance on the rest. This experiment also shows that our method remains effective when the number of tasks is high.

Table 1: Mean of error per category of MTL algorithms in multi-label classification on CelebA (Liu et al., 2015b).

                     Average error
Single task          8.77
Uniform scaling      9.62
Kendall et al. 2018  9.53
GradNorm             8.44
Ours                 8.25

4.3 Scene Understanding

Table 2: Effect of the MGDA-UB approximation. We report the final accuracies as well as training times for our method with and without the approximation.

                     Scene understanding (3 tasks)                                  Multi-label (40 tasks)
                     Training     Segmentation  Instance    Disparity    Training     Average
                     time (hour)  mIoU [%]      error [px]  error [px]   time (hour)  error
Ours (w/o approx.)   38.6         66.13         10.28       2.59         429.9        8.33
Ours                 23.3         66.63         10.25       2.54         16.1         8.25

To evaluate our method in a more realistic setting, we use scene understanding. Given an RGB image, we solve three tasks: semantic segmentation (assigning pixel-level class labels), instance segmentation (assigning pixel-level instance labels), and monocular depth estimation (estimating continuous disparity per pixel). We follow the experimental procedure of Kendall et al. (2018) and use an encoder-decoder architecture. The encoder is based on ResNet-50 (He et al., 2016) and is shared by all three tasks. The decoders are task-specific and are based on the pyramid pooling module (Zhao et al., 2017) (see the supplement for further implementation details).

Since the output space of instance segmentation is unconstrained (the number of instances is not known in advance), we use a proxy problem as in Kendall et al. (2018). For each pixel, we estimate the location of the center of mass of the instance that encompasses the pixel. These center votes can then be clustered to extract the instances. In our experiments, we directly report the MSE in the proxy task. Figure 4 shows the performance profile for each pair of tasks, although we perform all experiments on all three tasks jointly. The pairwise performance profiles shown in Figure 4 are simply 2D projections of the three-dimensional profile, presented this way for legibility. The results are also listed in Table 4.

MTL outperforms single-task accuracy, indicating that the tasks cooperate and help each other.
Our method outperforms all baselines on all tasks.

4.4 Role of the Approximation

In order to understand the role of the approximation proposed in Section 3.3, we compare the final performance and training time of our algorithm with and without the presented approximation in Table 2 (runtime measured on a single Titan Xp GPU). For a small number of tasks (3 for scene understanding), training time is reduced by 40%. For the multi-label classification experiment (40 tasks), the presented approximation accelerates learning by a factor of 25.

On the accuracy side, we expect both methods to perform similarly as long as the full-rank assumption is satisfied. As expected, the accuracy of both methods is very similar. Somewhat surprisingly, our approximation results in slightly improved accuracy in all experiments. While counter-intuitive at first, we hypothesize that this is related to the use of SGD in the learning algorithm. Stability analysis in convex optimization suggests that if gradients are computed with an error, ∇̂_θ L^t = ∇_θ L^t + e^t (θ corresponds to θ^sh in (3), as opposed to Z in the approximate problem (MGDA-UB)), the error in the solution is bounded as ‖α̂ − α‖₂ ≤ O(max_t ‖e^t‖₂). Considering the fact that the gradients are computed over the full parameter set (millions of dimensions) for the original problem and over a smaller space for the approximation (batch size times representation size, which is in the thousands), the dimension of the error vector is significantly higher in the original problem.
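The dimension dependence invoked here can be illustrated directly. The following toy check uses hypothetical dimensions and i.i.d. Gaussian noise as a stand-in for the gradient error, not the paper's actual parameter counts:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_error_norm(dim, trials=10, scale=1e-3):
    """Average l2 norm of i.i.d. N(0, scale^2) error vectors of a given
    dimension; this concentrates around scale * sqrt(dim)."""
    norms = [np.linalg.norm(scale * rng.standard_normal(dim))
             for _ in range(trials)]
    return float(np.mean(norms))

small = mean_error_norm(1_000)      # stand-in for the representation-space problem
large = mean_error_norm(1_000_000)  # stand-in for the full parameter space
assert large > 10 * small           # the norm grows roughly as sqrt(dim)
```

With a thousand-fold increase in dimension, the expected error norm grows by roughly √1000 ≈ 32×, which is consistent with the stability argument above.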
We expect the $\ell_2$ norm of such a random vector to depend on the dimension.

In summary, our quantitative analysis of the approximation suggests that (i) the approximation does not cause an accuracy drop and (ii) by solving an equivalent problem in a lower-dimensional space, our method achieves both better computational efficiency and higher stability.

5 Conclusion

We described an approach to multi-task learning. Our approach is based on multi-objective optimization. In order to apply multi-objective optimization to MTL, we described an efficient algorithm as well as specific approximations that yielded a deep MTL algorithm with almost no computational overhead. Our experiments indicate that the resulting algorithm is effective for a wide range of multi-task scenarios.

Figure 3: MultiMNIST accuracy profile. We plot the obtained accuracy in detecting the left and right digits for all baselines. The grid-search results suggest that the tasks compete for model capacity. Our method is the only one that finds a solution that is as good as training a dedicated model for each task. Top-right is better.

Table 3: Performance of MTL algorithms on MultiMNIST. Single-task baselines solve tasks separately, with dedicated models, but are shown in the same row for clarity.

                     Left digit    Right digit
                     accuracy [%]  accuracy [%]
Single task          97.23         95.90
Uniform scaling      96.46         94.99
Kendall et al. 2018  96.47         95.29
GradNorm             96.27         94.84
Ours                 97.26         95.90

Table 4: Performance of MTL algorithms in joint semantic segmentation, instance segmentation, and depth estimation on Cityscapes. Single-task baselines solve tasks separately but are shown in the same row for clarity.

                     Segmentation  Instance    Disparity
                     mIoU [%]      error [px]  error [px]
Single task          60.68         11.34       2.78
Uniform scaling      54.59         10.38       2.96
Kendall et al. 2018  64.21         11.54       2.65
GradNorm             64.81         11.31       2.57
Ours                 66.63         10.25       2.54

Figure 4: Cityscapes performance profile. We plot the performance of all baselines for the tasks of semantic segmentation, instance segmentation, and depth estimation. We use mIoU for semantic segmentation, error of per-pixel regression (normalized to image size) for instance segmentation, and disparity error for depth estimation. To convert errors to performance measures, we use 1 − instance error and 1/disparity error. We plot 2D projections of the performance profile for each pair of tasks. Although we plot pairwise projections for visualization, each point in the plots solves all tasks. Top-right is better.

References

A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2007.

A. Bagherjeiran, R. Vilalta, and C. F. Eick. Content-based image retrieval through a multi-agent meta-learning framework. In International Conference on Tools with Artificial Intelligence, 2005.

B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. JMLR, 4:83–99, 2003.

J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

H. Bilen and A. Vedaldi. Integrated perception with recurrent multi-task neural networks.
In NIPS, 2016.

R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 2018.

R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

P. B. C. de Miranda, R. B. C. Prudêncio, A. C. P. L. F. de Carvalho, and C. Soares. Combining a multi-objective optimization approach with meta-learning for SVM parameter selection. In International Conference on Systems, Man, and Cybernetics, 2012.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

J.-A. Désidéri. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5):313–318, 2012.

D. Dong, H. Wu, W. He, D. Yu, and H. Wang. Multi-task learning for multiple language translation. In ACL, 2015.

M. Ehrgott. Multicriteria Optimization (2. ed.). Springer, 2005.

J. Fliege and B. F. Svaiter. Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research, 51(3):479–494, 2000.

S. Ghosh, C. Lovell, and S. R. Gunn. Towards Pareto descent directions in sampling experts for multiple tasks in an on-line learning paradigm. In AAAI Spring Symposium: Lifelong Machine Learning, 2013.

K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher. A joint many-task model: Growing a neural network for multiple NLP tasks. In EMNLP, 2017.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

D. Hernández-Lobato, J. M. Hernández-Lobato, A. Shah, and R. P. Adams. Predictive entropy search for multi-objective Bayesian optimization. In ICML, 2016.

J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In ICASSP, 2013.

Z. Huang, J. Li, S. M. Siniscalchi, I.-F. Chen, J. Wu, and C.-H. Lee. Rapid adaptation for deep neural networks through multi-task learning. In Interspeech, 2015.

M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.

L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit. One model to learn them all. arXiv:1706.05137, 2017.

A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.

I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.

H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, Calif., 1951. University of California Press.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

C. Li, M. Georgiopoulos, and G. C. Anagnostopoulos. Pareto-path multi-task multiple kernel learning. arXiv:1404.3190, 2014.

X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y.-Y. Wang. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In NAACL HLT, 2015a.

Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015b.

M. Long and J. Wang. Learning multiple tasks with deep relationship networks. arXiv:1506.02117, 2015.

M.-T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser. Multi-task sequence to sequence learning. arXiv:1511.06114, 2015.

N. Makimoto, I. Nakagawa, and A. Tamura. An efficient algorithm for finding the minimum norm point in the convex hull of a finite point set in the plane. Operations Research Letters, 16(1):33–40, 1994.

K. Miettinen. Nonlinear Multiobjective Optimization. Springer, 1998.

I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.

S. Parisi, M. Pirotta, N. Smacchia, L. Bascetta, and M. Restelli. Policy gradient approaches for multi-objective sequential decision making. In IJCNN, 2014.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Workshops, 2017.

S. Peitz and M. Dellnitz. Gradient-based multiobjective optimization with uncertainties. In NEO, 2018.

M. Pirotta and M. Restelli. Inverse reinforcement learning through policy gradient minimization. In AAAI, 2016.

F. Poirion, Q. Mercier, and J. Désidéri. Descent algorithm for nonsmooth stochastic multiobjective optimization. Computational Optimization and Applications, 68(2):317–331, 2017.

D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.

C. Rosenbaum, T. Klinger, and M. Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv:1711.01239, 2017.

E. M. Rudd, M. Günther, and T. E. Boult. MOON: A mixed objective optimization network for the recognition of facial attributes. In ECCV, 2016.

S. Ruder. An overview of multi-task learning in deep neural networks. arXiv:1706.05098, 2017.

S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NIPS, 2017.

S. Schäffler, R. Schultz, and K. Weinzierl. Stochastic method for the solution of unconstrained vector optimization problems. Journal of Optimization Theory and Applications, 114(1):209–222, 2002.

K. Sekitani and Y. Yamamoto. A recursive algorithm for finding the minimum norm point in a polytope and a pair of closest points in two polytopes. Mathematical Programming, 61(1-3):233–249, 1993.

M. L. Seltzer and J. Droppo. Multi-task learning in deep neural networks for improved phoneme recognition. In ICASSP, 2013.

A. Shah and Z. Ghahramani. Pareto frontier learning with expensive correlated objectives. In ICML, 2016.

C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Technical report, Stanford University, US, 1956.

P. Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.

Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. JMLR, 8:35–63, 2007.

Y. Yang and T. M. Hospedales. Trace norm regularised deep multi-task learning. arXiv:1606.04038, 2016.

A. R. Zamir, A. Sax, W. B. Shen, L. J. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.

Y. Zhang and D. Yeung. A convex formulation for learning task relationships in multi-task learning. In UAI, 2010.

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.

B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017a.

D. Zhou, J. Wang, B. Jiang, H. Guo, and Y. Li. Multi-task multi-view learning based on cooperative multi-objective optimization. IEEE Access, 2017b.

J. Zhou, J. Chen, and J. Ye. Clustered multi-task learning via alternating structure optimization. In NIPS, 2011a.

J. Zhou, J. Chen, and J. Ye. MALSAR: Multi-task learning via structural regularization. Arizona State University, 2011b.