{"title": "Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL", "book": "Advances in Neural Information Processing Systems", "page_first": 2150, "page_last": 2158, "abstract": "Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks' empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging.", "full_text": "Minimax Multi-Task Learning and a Generalized\n\nLoss-Compositional Paradigm for MTL\n\nNishant A. Mehta\u2020, Dongryeol Lee\u2217, Alexander G. Gray\u2020\n\n\u2020 College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA\n\n\u2217 GE Global Research, Niskayuna, NY 12309, USA\n\nniche@cc.gatech.edu, drselee@gmail.com, agray@cc.gatech.edu\n\nAbstract\n\nSince its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks\u2019 empirical risks. 
Via a certain re-\nlaxation of minimax MTL, we obtain a continuum of MTL formulations spanning\nminimax MTL and classical MTL. The full paradigm itself is loss-compositional,\noperating on the vector of empirical risks. It incorporates minimax MTL, its relax-\nations, and many new MTL formulations as special cases. We show theoretically\nthat minimax MTL tends to avoid worst case outcomes on newly drawn test tasks\nin the learning to learn (LTL) test setting. The results of several MTL formulations\non synthetic and real problems in the MTL and LTL test settings are encouraging.\n\n1\n\nIntroduction\n\nThe essence of machine learning is to exploit what we observe in order to form accurate predictors\nof what we cannot. A multi-task learning (MTL) algorithm learns an inductive bias to learn several\ntasks together. MTL is incredibly pervasive in machine learning: it has natural connections to ran-\ndom effects models [15]; user preference prediction (including collaborative \ufb01ltering) can be framed\nas MTL [16]; multi-class classi\ufb01cation admits the popular one-vs-all and all-pairs MTL reductions;\nand MTL admits provably good learning in settings where single-task learning is hopeless [4, 12].\nBut if we see examples from a random set of tasks today, which of these tasks will matter tomorrow?\nNot knowing in the present what challenges nature has in store for the future, a sensible strategy is\nto mitigate the worst case by ensuring some minimum pro\ufb01ciency on each task.\nConsider a simple learning scenario: A music preference prediction company is in the business of\npredicting what 5-star ratings different users would assign to songs. At training time, the com-\npany learns a shared representation for predicting the users\u2019 song ratings by pooling together the\ncompany\u2019s limited data on each user\u2019s preferences. Given this learned representation, a separate\npredictor for each user can be trained very quickly. 
At test time, the environment draws a user\naccording to some (possibly randomized) rule and solicits from the company a prediction of that\nuser\u2019s preference for a particular song. The environment may also ask for predictions about new\nusers, described by a few ratings each, and so the company must leverage its existing representation\nto rapidly learn new predictors and produce ratings for these new users.\nClassically, multi-task learning has sought to minimize the (regularized) sum of the empirical risks\nover a set of tasks. In this way, classical MTL implicitly assumes that once the learner has been\ntrained, it will be tested on test tasks drawn uniformly at random from the empirical task distribution\nof the training tasks. Notably, there are several reasons why classical MTL may not be ideal:\n\n\u2217Work completed while at Georgia Institute of Technology\n\n1\n\n\f\u2022 While at training time the usual \ufb02avor of MTL commits to a \ufb01xed distribution over users (typi-\ncally either uniform or proportional to the number of ratings available for each user), at test time\nthere is no guarantee what user distribution we will encounter. 
In fact, there may not exist any fixed user distribution: the sequence of users for which ratings are elicited could be adversarial.\n\u2022 Even in the case when the distribution over tasks is not adversarial, it may be in the interest of the music preference prediction company to guarantee some minimum level of accuracy per user in order to minimize negative feedback and a potential loss of business, rather than to maximize the mean level of accuracy over all users.\n\n\u2022 Whereas minimizing the average prediction error is very much a teleological endeavor, typically at the expense of some locally egregious outcomes, minimizing the worst-case prediction error respects a notion of fairness to all tasks (or people).\n\nThis work introduces minimax multi-task learning as a response to the above scenario.1 In addition, we cast MTL as a spectrum of formulations. At one end of the spectrum lies minimax MTL, and moving away from this point progressively relaxes the \u201chardness\u201d of the maximum until, at full relaxation, the spectrum reaches its second endpoint and recovers classical MTL. We further sculpt a generalized loss-compositional paradigm for MTL which includes this spectrum and several other new MTL formulations. This paradigm equally applies to the problem of learning to learn (LTL), in which the goal is to learn a hypothesis space from a set of training tasks such that this representation admits good hypotheses on future tasks. In truth, MTL and LTL typically are handled equivalently at training time \u2014 this work will be no exception \u2014 and they diverge only in their test settings and hence in the learning-theoretic inquiries they inspire.\n\nContributions. The first contribution of this work is to introduce minimax MTL and a continuum of relaxations. 
Second, we introduce a generalized loss-compositional paradigm for MTL which admits a number of new MTL formulations and also includes classical MTL as a special case. Third, we empirically evaluate the performance of several MTL formulations from this paradigm in the multi-task learning and learning to learn settings, under the task-wise maximum test risk and task-wise mean test risk criteria, on four datasets (one synthetic, three real). Finally, Theorem 1 is the core theoretical contribution of this work and shows the following: if it is possible to obtain maximum empirical risk across a set of training tasks below some level $\gamma$, then it is likely that the maximum true risk obtained by the learner on a new task is bounded by roughly $\gamma$. Hence, if the goal is to minimize the worst case outcome over new tasks, the theory suggests minimizing the maximum of the empirical risks across the training tasks rather than their mean.\nIn the next section, we recall the settings of multi-task learning and learning to learn, formally introduce minimax MTL, and motivate it theoretically. In Section 3, we introduce a continuously parametrized family of minimax MTL relaxations and the new generalized loss-compositional paradigm. Section 4 presents an empirical evaluation of various MTL/LTL formulations with different models on four datasets. Finally, we close with a discussion.\n\n2 Minimax multi-task learning\n\nWe begin with a promenade through the basic MTL and LTL setups, with an effort to abide by the notation introduced by Baxter [4]. Throughout the rest of the paper, each labeled example $(x, y)$ will live in $\mathcal{X} \times \mathcal{Y}$ for input instance $x$ and label $y$. Typical choices of $\mathcal{X}$ include $\mathbb{R}^n$ or a compact subset thereof, while $\mathcal{Y}$ typically is a compact subset of $\mathbb{R}$ or the binary $\{-1, 1\}$. In addition, define a loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$. 
For simplicity, this work considers $\ell_2$ loss (squared loss) $\ell(y', y) = (y' - y)^2$ for regression and hinge loss $\ell(y', y) = \max\{0, 1 - y'y\}$ for classification.\nMTL and LTL often are framed as applying an inductive bias to learn a common hypothesis space, selected from a fixed family of hypothesis spaces, and thereafter learning from this hypothesis space a hypothesis for each task observed at training time. It will be useful to formalize the various sets and elements present in the preceding statement. Let $\mathbb{H}$ be a family of hypothesis spaces. Any hypothesis space $\mathcal{H} \in \mathbb{H}$ itself is a set of hypotheses; each hypothesis $h \in \mathcal{H}$ is a map $h : \mathcal{X} \to \mathbb{R}$.\n\n1Note that minimax MTL does not refer to the minimax estimators of statistical decision theory.\n\n2\n\n\fLearning to learn. In learning to learn, the goal is to achieve inductive transfer to learn the best $\mathcal{H}$ from $\mathbb{H}$. Unlike in MTL, there is a notion of an environment of tasks: an unknown probability measure $Q$ over a space of task probability measures $\mathcal{P}$. The goal is to find the optimal representation via the objective\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \mathbb{E}_{P \sim Q} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P} \, \ell(y, h(x))$.  (1)\n\nIn practice, $T$ (unobservable) training task probability measures $P_1, \ldots, P_T \in \mathcal{P}$ are drawn iid from $Q$, and from each task $t$ a set of $m$ examples is drawn iid from $P_t$.\n\nMulti-task learning. Whereas in learning to learn there is a distribution over tasks, in multi-task learning there is a fixed, finite set of tasks indexed by $[T] := \{1, \ldots, T\}$. Each task $t \in [T]$ is coupled with a fixed but unknown probability measure $P_t$. 
Classically, the goal of MTL is to minimize the expected loss at test time under the uniform distribution on $[T]$:\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \frac{1}{T} \sum_{t \in [T]} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P_t} \, \ell(y, h(x))$.  (2)\n\nNotably, this objective is equivalent to (1) when $Q$ is the uniform distribution on $\{P_1, \ldots, P_T\}$. In terms of the data generation model, MTL differs from LTL since the tasks are fixed; however, just as in LTL, from each task $t$ a set of $m$ examples is drawn iid from $P_t$.\n\n2.1 Minimax MTL\n\nA natural generalization of classical MTL results by introducing a prior distribution $\pi$ over the index set of tasks $[T]$. Given $\pi$, the (idealized) objective of this generalized MTL is\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \mathbb{E}_{t \sim \pi} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P_t} \, \ell(y, h(x))$,  (3)\n\ngiven only the training data $\{(x_{t,1}, y_{t,1}), \ldots, (x_{t,m}, y_{t,m})\}_{t \in [T]}$. The classical MTL objective (2) equals (3) when $\pi$ is taken to be the uniform prior over $[T]$. We argue that in many instances, that which is most relevant to minimize is not the expected error under a uniform distribution over tasks, or even any pre-specified $\pi$, but rather the expected error for the worst $\pi$. We propose to minimize the maximum error over tasks under an adversarial choice of $\pi$, yielding the objective\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \sup_{\pi} \mathbb{E}_{t \sim \pi} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P_t} \, \ell(y, h(x))$,\n\nwhere the supremum is taken over the $T$-dimensional simplex. 
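To make the adversarial objective concrete, here is a minimal numerical sketch of its empirical form, under assumptions made only for illustration (squared loss, a single shared linear predictor in place of the structured models considered later, and plain subgradient descent; the paper instead solves its formulations exactly as convex programs): at each iteration, a task currently attaining the maximum empirical risk supplies a subgradient.

```python
import numpy as np

def minimax_mtl_subgradient(Xs, ys, lam=0.1, steps=400, lr=0.02):
    """Subgradient descent on a simplified empirical minimax objective,
        min_w  max_t (1/m) sum_i (y_{t,i} - <w, x_{t,i}>)^2  +  lam ||w||^2,
    with one shared linear predictor w for all tasks (an illustrative
    simplification).  A subgradient of the max of the task risks is the
    risk gradient of any task attaining the max."""
    w = np.zeros(Xs[0].shape[1])
    for _ in range(steps):
        risks = [np.mean((y - X @ w) ** 2) for X, y in zip(Xs, ys)]
        t = int(np.argmax(risks))  # a task attaining the maximum risk
        X, y = Xs[t], ys[t]
        grad = -2.0 * X.T @ (y - X @ w) / len(y) + 2.0 * lam * w
        w = w - lr * grad
    return w

# Two tasks that pull the shared predictor in different directions.
rng = np.random.default_rng(0)
Xs = [rng.standard_normal((50, 2)) for _ in range(2)]
ys = [Xs[0][:, 0], Xs[1][:, 1]]  # task 1 prefers w = e_1, task 2 prefers w = e_2
w = minimax_mtl_subgradient(Xs, ys)
worst_before = max(float(np.mean(y ** 2)) for y in ys)  # risk of w = 0
worst_after = max(float(np.mean((y - X @ w) ** 2)) for X, y in zip(Xs, ys))
```

Under the maximum, neither task's risk can be traded away to improve the other's, so the learned predictor balances the two tasks.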
As the supremum (assuming it is attained) is attained at an extreme point of the simplex, this objective is equivalent to\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \max_{t \in [T]} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P_t} \, \ell(y, h(x))$.\n\nIn practice, we approximate the true objective via a regularized form of the empirical objective\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \max_{t \in [T]} \inf_{h \in \mathcal{H}} \sum_{i=1}^{m} \ell(y_{t,i}, h(x_{t,i}))$.\n\nIn the next section, we motivate minimax MTL theoretically by showing that the worst-case performance on future tasks likely will not be much higher than the maximum of the empirical risks for the training tasks. In this short paper, we restrict attention to the case of finite $\mathbb{H}$.\n\n2.2 A learning to learn bound for the maximum risk\n\nIn this subsection, we use the following notation. Let $P^{(1)}, \ldots, P^{(T)}$ be probability measures drawn iid from $Q$, and for $t \in [T]$ let $z^{(t)}$ be an $m$-sample (a sample of $m$ points) from $P^{(t)}$ with corresponding empirical measure $P^{(t)}_m$. Also, if $P$ is a probability measure then $P\ell \circ h := \mathbb{E} \, \ell(y, h(x))$; similarly, if $P_m$ is an empirical measure, then $P_m \ell \circ h := \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, h(x_i))$.\nOur focus is the learning to learn setting with a minimax lens: when one learns a representation $\mathcal{H} \in \mathbb{H}$ from multiple training tasks and observes maximum empirical risk $\gamma$, we would like to\n\n3\n\n\f
Using H\u2019s expected true risk and Markov\u2019s inequality, Baxter [4, the\ndisplay prior to (25) ] showed that the probability that H\u2019s true risk on a newly drawn test task is\nabove some level \u03b3 decays as the expected true risk over \u03b3:\n\n(cid:26)\n\n(cid:27)\n\n(cid:80)\n\nPr\n\nh\u2208H P (cid:96) \u25e6 h \u2265 \u03b3\ninf\n\n\u2264 1\n\nT\n\nm (cid:96) \u25e6 ht + \u03b5\nt\u2208[T ] P (t)\n\u03b3\n\n(4)\n\nwhere the size of \u03b5 is controlled by T , m, and the complexities of certain spaces.\nThe expected true risk is not of primary interest for controlling the tail of the (random) true risk,\nand a more direct approach yields a much better bound. In this short paper we restrict the space of\nrepresentations H to be \ufb01nite with cardinality C; in this case, the analysis is particularly simple and\nilluminates the idea for proving the general case. The next theorem is the main result of this section:\nTheorem 1. Let |H| = C, and let the loss (cid:96) be L-Lipschitz in its second argument and bounded by\nB. Suppose T tasks P (1), . . . , P (T ) are drawn iid from Q and from each task P (t) an iid m-sample\nz(t) is drawn. Suppose there exists H \u2208 H such that all t \u2208 [T ] satisfy minh\u2208H P (t)\nm (cid:96) \u25e6 h \u2264 \u03b3. Let\nP be newly drawn probability measure from Q. Let \u02c6h be the empirical risk minimizer over the test\nm-sample. With probability at least 1 \u2212 \u03b4 with respect to the random draw of the T tasks and their\nT corresponding m-samples:\n\n1\nT\n\n+ 2L maxH\u2208H Rm(H) +\n\n8 log 4\n\u03b4\n\nm\n\n\u03b4 + log(cid:100)B(cid:101) + log(T + 1)\n\n.\n\n(5)\n\nT\n\n\uf8f1\uf8f2\uf8f3P (cid:96) \u25e6 \u02c6h > \u03b3 +\n\nPr\n\n(cid:115)\n\n\uf8fc\uf8fd\uf8fe \u2264 log 2C\n\nIn the above, Rm(H) is the Rademacher complexity of H (cf. [3]). 
Critically, in (5) the probability of observing a task with high true risk decays with $T$, whereas in (4) the decay is independent of $T$. Hence, when the goal is to minimize the probability of bad performance on future tasks uniformly, this theorem motivates minimizing the maximum of the empirical risks as opposed to their mean.\nFor the proof of Theorem 1, first consider the singleton case $\mathbb{H} = \{\mathcal{H}_1\}$. Suppose that for $\gamma$ fixed a priori, the maximum of the empirical risks is bounded by $\gamma$, i.e. $\max_{t \in [T]} \min_{h \in \mathcal{H}_1} P^{(t)}_m \ell \circ h \leq \gamma$. Let a new probability measure $P$ drawn from $Q$ correspond to a new test task. Suppose the probability of the event $[\min_{h \in \mathcal{H}_1} P_m \ell \circ h > \gamma]$ is at least $\varepsilon$. Then the probability that $\gamma$ bounds all $T$ empirical risks is at most $(1 - \varepsilon)^T \leq e^{-T\varepsilon}$. Hence, with probability at least $1 - e^{-T\varepsilon}$:\n\n$\Pr\{\min_{h \in \mathcal{H}_1} P_m \ell \circ h > \gamma\} \leq \varepsilon$.\n\nTaking $\varepsilon = \log(2C/\delta)/T$, so that $C e^{-T\varepsilon} = \delta/2$, a simple application of the union bound extends this result to finite $\mathbb{H}$:\nLemma 1. Under the same conditions as Theorem 1, with probability at least $1 - \delta/2$ with respect to the random draw of the $T$ tasks and their $T$ corresponding $m$-samples:\n\n$\Pr\left\{ \min_{h \in \mathcal{H}} P_m \ell \circ h > \gamma \right\} \leq \frac{\log \frac{2C}{\delta}}{T}$.\n\nThe bound in the lemma states a $1/T$ rate of decay for the probability that the empirical risk obtained by $\mathcal{H}$ on a new task exceeds $\gamma$. Next, we relate this empirical risk to the true risk obtained by the empirical risk minimizer. Note that at test time $\mathcal{H}$ is fixed and hence independent of any test $m$-sample. Then, by now-standard learning theory results of Bartlett and Mendelson [3]:\nLemma 2. Take loss $\ell$ as in Theorem 1. 
With probability at least $1 - \delta/2$, for all $h \in \mathcal{H}$ uniformly:\n\n$P\ell \circ h \leq P_m \ell \circ h + 2L R_m(\mathcal{H}) + \sqrt{(8 \log(4/\delta))/m}$.\n\nIn particular, with high probability the true risk of the empirical risk minimizer is not much larger than its empirical risk. Theorem 1 now follows from Lemmas 1 and 2 and a union bound over $\gamma \in \Gamma := \{0, 1/T, 2/T, \ldots, \lceil B \rceil\}$; note that mapping the observed maximum empirical risk $\gamma$ to $\min\{\gamma' \in \Gamma \mid \gamma \leq \gamma'\}$ picks up the additional $\frac{1}{T}$ term in (5).\nIn the next section, we introduce a loss-compositional paradigm for multi-task learning which includes as special cases minimax MTL and classical MTL.\n\n4\n\n\f3 A generalized loss-compositional paradigm for MTL\n\nThe paradigm can benefit from a bit of notation. Given a set of $T$ tasks, we represent the empirical risk for hypothesis $h_t \in \mathcal{H}$ ($\in \mathbb{H}$) on task $t \in [T]$ as $\hat{\ell}_t(h_t) := \sum_{i=1}^{m} \ell(y_{t,i}, h_t(x_{t,i}))$. Additionally define a set of hypotheses for multiple tasks $\mathbf{h} := (h_1, \ldots, h_T) \in \mathcal{H}^T$ and the vector of empirical risks $\hat{\boldsymbol{\ell}}(\mathbf{h}) := (\hat{\ell}_1(h_1), \ldots, \hat{\ell}_T(h_T))$.\nWith this notation set, the proposed loss-compositional paradigm encompasses any regularized minimization of a (typically convex) function $\phi : \mathbb{R}^T_+ \to \mathbb{R}_+$ of the empirical risks:\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \phi\big(\hat{\boldsymbol{\ell}}(\mathbf{h})\big) + \Omega\big((\mathcal{H}, \mathbf{h})\big)$,  (6)\n\nwhere $\Omega(\cdot) : \mathbb{H} \times \cup_{\mathcal{H} \in \mathbb{H}} \mathcal{H}^T \to \mathbb{R}_+$ is a regularizer.\n\n$\ell_p$ MTL. One notable specialization that is still quite general is the case when $\phi$ is an $\ell_p$-norm, yielding $\ell_p$ MTL. 
This subfamily encompasses classical MTL and many new MTL formulations:\n\n\u2022 Classical MTL as $\ell_1$ MTL:\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \frac{1}{T} \sum_{t \in [T]} \hat{\ell}_t(h_t) + \Omega\big((\mathcal{H}, \mathbf{h})\big) \;\equiv\; \inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \frac{1}{T} \|\hat{\boldsymbol{\ell}}(\mathbf{h})\|_1 + \Omega\big((\mathcal{H}, \mathbf{h})\big)$.\n\n\u2022 Minimax MTL as $\ell_\infty$ MTL:\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \max_{t \in [T]} \hat{\ell}_t(h_t) + \Omega\big((\mathcal{H}, \mathbf{h})\big) \;\equiv\; \inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \|\hat{\boldsymbol{\ell}}(\mathbf{h})\|_\infty + \Omega\big((\mathcal{H}, \mathbf{h})\big)$.\n\n\u2022 A new formulation, $\ell_2$ MTL:\n\n$\inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \Big(\frac{1}{T} \sum_{t \in [T]} \big(\hat{\ell}_t(h_t)\big)^2\Big)^{1/2} + \Omega\big((\mathcal{H}, \mathbf{h})\big) \;\equiv\; \inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \frac{1}{\sqrt{T}} \|\hat{\boldsymbol{\ell}}(\mathbf{h})\|_2 + \Omega\big((\mathcal{H}, \mathbf{h})\big)$.\n\nA natural question is why one might consider minimizing $\ell_p$-norms of the empirical risks vector for $1 < p < \infty$, as in $\ell_2$ MTL. The contour of the $\ell_1$-norm of the empirical risks evenly trades off empirical risks between different tasks; however, it has been observed that overfitting often happens near the end of learning, rather than the beginning [14]. More precisely, when the empirical risk is high, the gradient of the empirical risk (taken with respect to the parameter $(\mathcal{H}, \mathbf{h})$) is likely to have positive inner product with the gradient of the true risk. Therefore, given a candidate solution with a corresponding vector of empirical risks, a sensible strategy is to take a step in solution space which places more emphasis on tasks with higher empirical risk. This strategy is particularly appropriate when the class of learners has high capacity relative to the amount of available data. 
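The $\ell_p$ specializations differ only in the norm applied to the vector of empirical risks. A small sketch of the loss-compositional term alone (the regularizer and the inner minimizations over hypotheses are omitted; the scalings follow the displayed equivalences):

```python
import numpy as np

def lp_mtl_objective(risks, p):
    """Loss-compositional term phi for lp MTL applied to a vector of
    task-wise empirical risks:
      p = 1   -> (1/T) * ||risks||_1        (classical MTL: the mean risk)
      p = 2   -> (1/sqrt(T)) * ||risks||_2  (l2 MTL)
      p = inf -> ||risks||_inf              (minimax MTL: the max risk)
    The regularizer Omega is omitted."""
    risks = np.asarray(risks, dtype=float)
    T = len(risks)
    if np.isinf(p):
        return float(risks.max())
    return float(np.linalg.norm(risks, ord=p) / T ** (1.0 / p))

risks = [0.2, 0.5, 0.3, 1.4, 0.4]  # one noticeably harder task
vals = {p: lp_mtl_objective(risks, p) for p in (1, 2, np.inf)}
# vals[1] ~ 0.56 (mean), vals[2] ~ 0.707, vals[inf] = 1.4 (max)
```

With these scalings each composition is the $p$-th power mean of the risks, so the values are nondecreasing in $p$: larger $p$ places progressively more weight on the hardest tasks, with $\ell_2$ sitting between the mean and the max.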
This observation sets the foundation for an approach that minimizes norms of the empirical risks.\nIn this work, we also discuss an interesting subset of the loss-compositional paradigm which does not fit into $\ell_p$ MTL; this subfamily embodies a continuum of relaxations of minimax MTL.\n\n$\alpha$-minimax MTL. In some cases, minimizing the maximum loss can exhibit certain disadvantages because the maximum loss is not robust to situations when a small fraction of the tasks are fundamentally harder than the remaining tasks. Consider the case when the empirical risk for each task in this small fraction cannot be reduced below a level $u$. Rather than rigidly minimizing the maximum loss, a more robust alternative is to minimize the maximum loss in a soft way. Intuitively, the idea is to ensure that most tasks have low empirical risk, but a small fraction of tasks are permitted to have higher loss. We formalize this as $\alpha$-minimax MTL, via the relaxed objective:\n\n$\min_{\mathcal{H} \in \mathbb{H}, \, \mathbf{h} \in \mathcal{H}^T} \; \min_{b \geq 0} \; \Big\{ b + \frac{1}{\alpha} \sum_{t \in [T]} \max\{0, \hat{\ell}_t(h_t) - b\} \Big\} + \Omega\big((\mathcal{H}, \mathbf{h})\big)$.\n\nIn the above, $\phi$ from the loss-compositional paradigm (6) is a variational function of the empirical risks vector. The above optimization problem is equivalent to the perhaps more intuitive problem:\n\n$\min_{\mathcal{H} \in \mathbb{H}, \, \mathbf{h} \in \mathcal{H}^T, \, b \geq 0, \, \xi \geq 0} \; b + \frac{1}{\alpha} \sum_{t \in [T]} \xi_t + \Omega\big((\mathcal{H}, \mathbf{h})\big)$ subject to $\hat{\ell}_t(h_t) \leq b + \xi_t$, $t \in [T]$.\n\n5\n\n\fHere, $b$ plays the role of the relaxed maximum, and each $\xi_t$'s deviation from zero indicates the deviation from the (loosely enforced) maximum. We expect $\xi$ to be sparse.\nTo help understand how $\alpha$ affects the learning problem, let us consider a few cases:\n\n(1) When $\alpha > T$, the optimal value of $b$ is zero, and the problem is equivalent to classical MTL. 
To see this, note that for a given candidate solution with $b > 0$ the objective always can be reduced by reducing $b$ by some $\varepsilon$ and increasing each $\xi_t$ by the same $\varepsilon$.\n\n(2) Suppose one task is much harder than all the other tasks (e.g. an outlier task), and its empirical risk is separated from the maximum empirical risk of the other tasks by $\rho$. Let $1 < \alpha < 2$; now, at the optimal hard maximum solution (where $\xi = 0$), the objective can be reduced by increasing one of the $\xi_t$'s by $\rho$ and decreasing $b$ by $\rho$. Thus, the objective can focus on minimizing the maximum risk of the set of $T - 1$ easier tasks. In this special setting, this argument can be extended to the more general case $k < \alpha < k + 1$ and $k$ outlier tasks, for $k \in [T]$.\n\n(3) As $\alpha$ approaches 0, we recover the hard maximum case of minimax MTL.\nThis work focuses on $\alpha$-minimax MTL with $\alpha = 2/(\lceil 0.1T + 0.5 \rceil^{-1} + \lceil 0.1T + 1.5 \rceil^{-1})$, i.e. the harmonic mean of $\lceil 0.1T + 0.5 \rceil$ and $\lceil 0.1T + 1.5 \rceil$. The reason for this choice is that in the idealized case (2) above, for large $T$ this setting of $\alpha$ makes the relaxed maximum consider all but the hardest 10% of the tasks. We also try the 20% level (i.e. $0.2T$ replacing $0.1T$ in the above).\n\nModels. We now provide examples of how specific models fit into this framework. We consider two convex multi-task learning formulations: Evgeniou and Pontil's regularized multi-task learning (the EP model) [5] and Argyriou, Evgeniou, and Pontil's convex multi-task feature learning (the AEP model) [1]. The EP model is a linear model with a shared parameter $v_0 \in \mathbb{R}^d$ and task-specific parameters $v_t \in \mathbb{R}^d$ (for $t \in [T]$). 
Evgeniou and Pontil presented this model as\n\n$\min_{v_0, \{v_t\}_{t \in [T]}} \sum_{t \in [T]} \sum_{i=1}^{m} \ell(y_{t,i}, \langle v_0 + v_t, x_{t,i} \rangle) + \lambda_0 \|v_0\|^2 + \lambda_1 \sum_{t \in [T]} \|v_t\|^2$,\n\nfor $\ell$ the hinge loss or squared loss. This can be set in the new paradigm via $\mathbb{H} = \{\mathcal{H}_{v_0} \mid v_0 \in \mathbb{R}^d\}$, $\mathcal{H}_{v_0} = \{h : x \mapsto \langle v_0 + v_t, x \rangle \mid v_t \in \mathbb{R}^d\}$, and $\hat{\ell}_t(h_t) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(y_{t,i}, \langle v_0 + v_t, x_{t,i} \rangle\big)$.\nThe AEP model minimizes the task-wise average loss with the trace norm (nuclear norm) penalty:\n\n$\min_{W} \sum_{t \in [T]} \sum_{i=1}^{m} \ell(y_{t,i}, \langle W_t, x_{t,i} \rangle) + \lambda \|W\|_{\mathrm{tr}}$,\n\nwhere $\|\cdot\|_{\mathrm{tr}} : W \mapsto \sum_i \sigma_i(W)$ is the trace norm. In the new paradigm, $\mathbb{H}$ is a set where each element is a $k$-dimensional subspace of linear estimators (for $k \ll d$). Each $h_t = W_t$ in some $\mathcal{H} \in \mathbb{H}$ lives in $\mathcal{H}$'s corresponding low-dimensional subspace. Also, $\hat{\ell}_t(h_t) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(y_{t,i}, \langle h_t, x_{t,i} \rangle\big)$.\nFor easy empirical comparison between the various MTL formulations from the paradigm, at times it will be convenient to use constrained formulations of the EP and AEP model. If the regularized forms are used, a fair comparison of the methods warrants plotting results according to the size of the optimal parameter found (i.e. $\|W\|_{\mathrm{tr}}$ for AEP). 
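As a concrete instance, the regularized EP objective with squared loss can be minimized by plain gradient descent on $(v_0, \{v_t\})$; a self-contained illustrative sketch (synthetic data and arbitrary hyperparameters chosen here for demonstration; the paper instead solves these convex programs exactly with CVX):

```python
import numpy as np

def ep_fit(Xs, ys, lam0=0.1, lam1=0.1, steps=500, lr=0.002):
    """Gradient descent on the regularized EP objective with squared loss:
        sum_t sum_i (y_{t,i} - <v0 + v_t, x_{t,i}>)^2
          + lam0 * ||v0||^2 + lam1 * sum_t ||v_t||^2.
    The shared parameter v0 couples the tasks; each v_t is task-specific."""
    T, d = len(Xs), Xs[0].shape[1]
    v0, vs = np.zeros(d), np.zeros((T, d))
    for _ in range(steps):
        g0 = 2.0 * lam0 * v0
        for t in range(T):
            resid = ys[t] - Xs[t] @ (v0 + vs[t])
            g = -2.0 * Xs[t].T @ resid            # data-fit part of the gradient
            vs[t] = vs[t] - lr * (g + 2.0 * lam1 * vs[t])
            g0 = g0 + g                           # v0 sees every task's gradient
        v0 = v0 - lr * g0
    return v0, vs

# Two related tasks: similar first coordinate, opposite second coordinate.
rng = np.random.default_rng(1)
Xs = [rng.standard_normal((20, 2)) for _ in range(2)]
true_ws = [np.array([1.0, 0.3]), np.array([0.8, -0.3])]
ys = [X @ w for X, w in zip(Xs, true_ws)]
v0, vs = ep_fit(Xs, ys)
fit_err = sum(float(np.sum((ys[t] - Xs[t] @ (v0 + vs[t])) ** 2)) for t in range(2))
base_err = sum(float(np.sum(y ** 2)) for y in ys)
```

The shared $v_0$ tends to absorb the component common to the tasks while the small per-task $v_t$ capture the differences; the constrained forms instead cap $\|v_0\|$ and $\|v_t\|$ directly.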
For EP, the constrained form is:\n\n$\min_{v_0, \{v_t\}_{t \in [T]}} \sum_{t \in [T]} \sum_{i=1}^{m} \ell(y_{t,i}, \langle v_0 + v_t, x_{t,i} \rangle)$ subject to $\|v_0\| \leq \tau_0$, $\|v_t\| \leq \tau_1$ for $t \in [T]$.\n\nFor AEP, the constrained form is:\n\n$\min_{W} \sum_{t \in [T]} \sum_{i=1}^{m} \ell(y_{t,i}, \langle W_t, x_{t,i} \rangle)$ subject to $\|W\|_{\mathrm{tr}} \leq r$.\n\n4 Empirical evaluation\n\nWe consider four learning problems; the first three involve regression (MTL model in parentheses):\n\n\u2022 A synthetic dataset composed from two modes of tasks (EP model),\n\u2022 The school dataset from the Inner London Education Authority (EP model),\n\u2022 The conjoint analysis personal computer ratings dataset2 [11] (AEP model).\n\nThe fourth problem is multi-class classification from the MNIST digits dataset [10] with a reduction to multi-task learning using a tournament of pairwise (binary) classifiers. We use the AEP model.\nGiven data, each problem involved a choice of MTL formulation (e.g. minimax MTL), model (EP or AEP), and choice of regularized versus constrained. All the problems were solved with just a few lines of code using CVX [9, 8]. In this work, we considered convex multi-task learning formulations in order to make clear statements about the optimal solutions attained for various learning problems.\n2This data, collected at the University of Michigan MBA program, generously was provided by Peter Lenk.\n\n6\n\n\fFigure 1: Max $\ell_2$-risk (Top two lines) and mean $\ell_2$-risk (Bottom two lines). At Left and Center: $\ell_2$-risk vs noise level, for $\sigma_{\mathrm{task}} = 0.1$ and $\sigma_{\mathrm{task}} = 0.5$ respectively. At Right: $\ell_2$-risk vs task variation, for $\sigma_{\mathrm{noise}} = 0.1$. Dashed red is $\ell_1$, dashed blue is minimax. Error bars indicate one standard deviation. 
MTL results (not shown) were similar to LTL results (shown), with MTL-LTL relative difference below 6.8% for all points plotted.\n\nTwo modes. The two modes regression problem consists of 50 linear prediction tasks for the first type of task and 5 linear prediction tasks for the second task type. The true parameter for the first task type is a vector $\mu$ drawn uniformly from the sphere of radius 5; the true parameter for the second task type is $-2\mu$. Each task is drawn from an isotropic Gaussian with mean taken from the task type and the standard deviation of all dimensions set to $\sigma_{\mathrm{task}}$. Each data point for each task is drawn from a product of 10 standard normals (so $x_{t,i} \in \mathbb{R}^{10}$). The targets are generated according to $\langle W_t, x_{t,i} \rangle + \varepsilon_t$, where the $\varepsilon_t$'s are iid univariate centered normals with standard deviation $\sigma_{\mathrm{noise}}$.\nWe fixed $\tau_0$ to a large value (in this case, $\tau_0 = 10$ is sufficient since the mean for the largest task fits into a ball of radius 10) and $\tau_1$ to a small value ($\tau_1 = 2$). We compute the average mean and maximum test error over 100 instances of the 55-task multi-task problem. Each task's training set and test set are 5 and 15 points respectively. The average maximum (mean) test error is the 100-experiment-average of the task-wise maximum (mean) of the $\ell_2$ risks. For each LTL experiment, 55 new test tasks were drawn using the same $\mu$ as from the training tasks.\nFigure 1 shows a tradeoff: when each task group is fairly homogeneous (left and center plots), minimax is better at minimizing the maximum of the test risks while $\ell_1$ is better at minimizing the 
As task homogeneity decreases (right plot), the gap in performance closes\nwith respect to the maximum of the test risks and remains roughly the same with respect to the mean.\n\nFigure 2: Maximum RMSE (Left) and normalized mean RMSE (Right) versus task-speci\ufb01c parameter bound\n\u03c41, for shared parameter bound \u03c40 \ufb01xed. In each \ufb01gure, Left section is \u03c40 is 0.2 and Right section is \u03c40 = 0.6.\nSolid red (cid:7) is (cid:96)1, solid blue \u2022 is minimax, dashed green (cid:78) is (0.1T )-minimax, dashed black (cid:72) is (0.2T )-\nminimax. The results for (cid:96)2 MTL were visually identical to (cid:96)1 MTL and hence were not plotted.\nSchool. The school dataset has appeared in many previous works [7, 2, 6]. For brevity we just say\nthe goal is to predict student test scores using certain student-level features. Each school is treated as\na separate task. We report both the task-wise maximum of the root mean square error (RMSE) and\nthe taskwise-mean of the RMSE (normalized by number of points per task, as in previous works).\nThe results (see Figure 2) demonstrate that when the learner has moderate shared capacity \u03c40 and\nhigh task-speci\ufb01c capacity \u03c41, minimax MTL outperforms (cid:96)1 MTL for the max objective; addition-\nally, for the max objective in almost all parameter settings (0.1T )-minimax and (0.2T )-minimax\nMTL outperform (cid:96)1 MTL, and they also outperform minimax MTL when the task-speci\ufb01c capacity\n\u03c41 is not too large. We hypothesize that minimax MTL performs the best in the high\u2212\u03c41 regime be-\ncause stopping learning once the maximum of the empirical risks cannot be improved invokes early\nstopping and its built-in regularization properties (see e.g. [13]). 
Interestingly, for the normalized\nmean RMSE objective, both minimax relaxations are competitive with (cid:96)1 MTL; however, when the\n\n7\n\n\u2212101234050100150200250\u03c3noisesquared\u2212loss risk\u2212101234050100150200250\u03c3noisesquared\u2212loss risk\u22120.500.511.522.5050100150200250300350400\u03c3tasksquared\u2212loss risk00.20.40.600.20.40.61.31.351.41.451.5\u03c4100.20.40.600.20.40.60.780.80.820.840.860.880.9\u03c41\fshared capacity \u03c40 is high (right section, right plot), (cid:96)1 MTL performs the best. For high task-speci\ufb01c\ncapacity \u03c41, minimax MTL and its relaxations again seem to resist over\ufb01tting compared to (cid:96)1 MTL.\nPersonal computer. The personal\ncomputer dataset is composed of 189\nhuman subjects each of which rated\non a 0-10 scale the same 20 comput-\ners (16 training, 4 test). Each com-\nputer has 13 binary features (amount\nof memory, screen size, price, etc.).\nThe results are shown in Figure 3. In\nthe MTL setting, for both the maxi-\nmum RMSE objective and the mean\nRMSE objective, (cid:96)1 MTL appears to\nperform the best. When the trace\nnorm of W is high, minimax MTL\ndisplays resistance to over\ufb01tting and\nobtains the lowest mean RMSE. In\nthe LTL setting for the maximum\nRMSE objective, (cid:96)2, minimax, and\n(0.1T )-minimax MTL all outperform\n(cid:96)1 MTL. For the mean RMSE, (cid:96)1\nMTL obtains the lowest risk for al-\nmost all parameter settings.\n\nFigure 3: MTL (Top) and LTL (Bottom). Maximum (cid:96)2 risk (Left)\nand Mean (cid:96)2 risk (Right) vs bound on (cid:107)W(cid:107)tr. LTL used 10-fold\ncross-validation (10% of tasks left out in each fold). Solid red (cid:7) is\n(cid:96)1, solid blue \u2022 is minimax, dashed green (cid:78) is (0.1T )-minimax,\ndashed black (cid:72) is (0.2T )-minimax, solid gold (cid:4) is (cid:96)2.\n\nMNIST. The MNIST task is a 10-class problem; we ap-\nproach it via a reduction to a tournament of 45 binary clas-\nsi\ufb01ers trained via the AEP model. 
The dimensionality was reduced to 50 using principal component analysis (computed on the full training set), and only the first 2% of each class's training points were used for training.
Intuitively, the performance of the tournament tree of binary classifiers can only be as accurate as its paths, and the accuracy of each path depends on the accuracy of its nodes. Hence, our hypothesis is that minimax MTL should outperform ℓ1 MTL. The results in Figure 4 confirm our hypothesis. Minimax MTL outperforms ℓ1 MTL when the capacity ‖W‖tr is somewhat limited, with the gap widening as the capacity decreases. Furthermore, at every capacity minimax MTL is competitive with ℓ1 MTL.

Figure 4: Test multiclass 0-1 loss vs. ‖W‖tr. Solid red is ℓ1 MTL, solid blue is minimax, dashed green is (0.1T)-minimax, dashed black is (0.2T)-minimax. Regularized AEP was used for speed, and the trace norm of W was computed afterward, so the sample points differ per curve.

5 Discussion

We have established a continuum of formulations for MTL which recovers as special cases classical MTL and the newly formulated minimax MTL. In between these extreme points lies a continuum of relaxed minimax MTL formulations. More generally, we introduced a loss-compositional paradigm that operates on the vector of empirical risks, inducing the additional ℓp MTL paradigms.
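To make the paradigm concrete, the following minimal sketch (Python with NumPy; the helper name is ours, not from the paper) aggregates a vector of per-task empirical risks under the compositions discussed above. The "alpha-minimax" branch averages the worst ⌈αT⌉ task risks, which is one illustrative reading of the α-relaxation rather than the paper's exact optimization; the ℓp branch is normalized by T so that p = 1 recovers the classical mean.

```python
import numpy as np

def loss_compositional_objective(risks, mode="mean", p=2, alpha=0.1):
    """Aggregate a vector of per-task empirical risks.

    risks : nonnegative empirical risks, one per task (length T).
    mode  : 'mean'          -- classical (l1) MTL: average risk
            'max'           -- minimax MTL: worst-case task risk
            'lp'            -- l_p MTL: normalized l_p norm of the risks
            'alpha-minimax' -- mean of the worst ceil(alpha * T) risks
                               (an illustrative reading of the
                               alpha-relaxation, not the paper's
                               exact formulation)
    """
    risks = np.asarray(risks, dtype=float)
    T = risks.shape[0]
    if mode == "mean":
        return risks.mean()
    if mode == "max":
        return risks.max()
    if mode == "lp":
        # Normalized so that p = 1 recovers the task-wise mean.
        return (np.sum(risks ** p) / T) ** (1.0 / p)
    if mode == "alpha-minimax":
        k = max(1, int(np.ceil(alpha * T)))
        worst = np.sort(risks)[-k:]  # the k largest task risks
        return worst.mean()
    raise ValueError(f"unknown mode: {mode}")
```

A training procedure would then minimize this aggregate over the model parameters; as α → 1 the alpha-minimax composition approaches the classical mean, and as α → 1/T it approaches the pure maximum.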
The empirical evaluations indicate that α-minimax MTL at either the 10% or 20% level often outperforms ℓ1 MTL in terms of the maximum test risk objective, and sometimes even in the mean test risk objective. All of the minimax and α-minimax MTL formulations exhibit a built-in safeguard against overfitting when learning with a model that is very complex relative to the available data.
Although efficient algorithms may make the various new MTL formulations practical for large problems, a proper effort to develop fast algorithms in this setting would have detracted from the main point of this first study. A good direction for the future is to obtain efficient algorithms for minimax and α-minimax MTL. In fact, such algorithms might have applications beyond MTL and even machine learning. Another area ripe for exploration is to establish more general learning bounds for minimax MTL and to extend these bounds to α-minimax MTL.

References

[1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.

[2] Bart Bakker and Tom Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83-99, 2003.

[3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

[4] Jonathan Baxter.
A model of inductive bias learning. Journal of Artificial Intelligence Research, 12(1):149-198, 2000.

[5] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109-117. ACM, 2004.

[6] Theodoros Evgeniou, Massimiliano Pontil, and Olivier Toubia. A convex optimization approach to modeling consumer heterogeneity in conjoint estimation. Marketing Science, 26(6):805-818, 2007.

[7] Harvey Goldstein. Multilevel modelling of survey data. Journal of the Royal Statistical Society, Series D (The Statistician), 40(2):235-244, 1991.

[8] Michael C. Grant and Stephen P. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95-110. Springer-Verlag Limited, 2008.

[9] Michael C. Grant and Stephen P. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21, April 2011.

[10] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[11] Peter J. Lenk, Wayne S. DeSarbo, Paul E. Green, and Martin R. Young. Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, pages 173-191, 1996.

[12] Andreas Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327-350, 2009.

[13] Noboru Murata and Shun-ichi Amari. Statistical analysis of learning dynamics. Signal Processing, 74(1):3-28, 1999.

[14] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In J.C. Platt, D. Koller, Y. Singer, and S.
Roweis, editors, Advances in Neural Information Processing Systems 20, pages 849-856. MIT Press, Cambridge, MA, 2008.

[15] Kai Yu, John Lafferty, Shenghuo Zhu, and Yihong Gong. Large-scale collaborative prediction using a nonparametric random effects model. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1185-1192. ACM, 2009.

[16] Liang Zhang, Deepak Agarwal, and Bee-Chung Chen. Generalizing matrix factorization through flexible regression priors. In Proceedings of the Fifth ACM Conference on Recommender Systems, pages 13-20. ACM, 2011.