{"title": "Generalization in multitask deep neural classifiers: a statistical physics approach", "book": "Advances in Neural Information Processing Systems", "page_first": 15862, "page_last": 15871, "abstract": "A proper understanding of the striking generalization abilities of deep neural networks presents an enduring puzzle. Recently, there has been a growing body of numerically-grounded theoretical work that has contributed important insights to the theory of learning in deep neural nets. There has also been a recent interest in extending these analyses to understanding how multitask learning can further improve the generalization capacity of deep neural nets. These studies deal almost exclusively with regression tasks which are amenable to existing analytical techniques. We develop an analytic theory of the nonlinear dynamics of generalization of deep neural networks trained to solve classification tasks using softmax outputs and cross-entropy loss, addressing both single task and multitask settings. We do so by adapting techniques from the statistical physics of disordered systems, accounting for both finite size datasets and correlated outputs induced by the training dynamics. We discuss the validity of our theoretical results in comparison to a comprehensive suite of numerical experiments. Our analysis provides theoretical support for the intuition that the performance of multitask learning is determined by the noisiness of the tasks and how well their input features align with each other. 
Highly related, clean tasks benefit each other, whereas unrelated, clean tasks can be detrimental to individual task performance.", "full_text": "Generalization in multitask deep neural classi\ufb01ers: a\n\nstatistical physics approach\n\nTyler Lee\nIntel AI Lab\n\ntyler.p.lee@intel.com\n\nAnthony Ndirango\n\nIntel AI Lab\n\nanthony.ndirango@intel.com\n\nAbstract\n\nA proper understanding of the striking generalization abilities of deep neural net-\nworks presents an enduring puzzle. Recently, there has been a growing body of\nnumerically-grounded theoretical work that has contributed important insights to\nthe theory of learning in deep neural nets. There has also been a recent interest\nin extending these analyses to understanding how multitask learning can further\nimprove the generalization capacity of deep neural nets. These studies deal almost\nexclusively with regression tasks which are amenable to existing analytical tech-\nniques. We develop an analytic theory of the nonlinear dynamics of generalization\nof deep neural networks trained to solve classi\ufb01cation tasks using softmax outputs\nand cross-entropy loss, addressing both single task and multitask settings. We do\nso by adapting techniques from the statistical physics of disordered systems, ac-\ncounting for both \ufb01nite size datasets and correlated outputs induced by the training\ndynamics. We discuss the validity of our theoretical results in comparison to a\ncomprehensive suite of numerical experiments. 
Our analysis provides theoretical\nsupport for the intuition that the performance of multitask learning is determined\nby the noisiness of the tasks and how well their input features align with each other.\nHighly related, clean tasks bene\ufb01t each other, whereas unrelated, clean tasks can\nbe detrimental to individual task performance.\n\n1\n\nIntroduction\n\nDespite the remarkable string of successful results demonstrated by deep learning practitioners, we\nstill do not have a clear understanding of how these models manage to generalize so well, effectively\nevading many of the intuitions expected from statistical learning theory. The enigma is further\nheightened when one considers multitask learning, especially in regimes where labeled data is scarce.\nIn order to make speci\ufb01c assertions about the effective transfer of knowledge across tasks, one\nneeds a predictive framework to address generalization in a multitask setting. There has been a\nnoticeable uptick in recent efforts to build a rigorous theoretical foundation for deep learning (see,\ne.g. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] for a sampling of this trend). To the best of our knowledge (with one\nexception, described below), none of the existing analytical work deals with multitask learning.\nMultitask learning holds promise for training more generalized and intelligent learning systems\n[11]. It comprises a broad set of strategies loosely de\ufb01ned by the presence of multiple objective\nfunctions and a set of shared parameters optimized for those objective functions. The most prevalent\nformulation of multitask learning in the literature is the addition of supervised auxiliary task(s) to\nassist in training a network to better perform a target task of interest (main task)[12, 13, 14, 15].\nIn this framework the only purpose of the auxiliary task(s) is to produce improved generalization\nperformance on the main task. 
This bene\ufb01t is thought to arise from an inductive bias placed on the\nlearning of the main task towards learning more general features [11]. Since the features learned\nthrough multitask learning blend the optimal features for all of the optimized tasks, there is an\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fassumed dependence of the multitask bene\ufb01t on the relatedness of the auxiliary tasks to the main\ntask (e.g. if the optimal features for the auxiliary task are orthogonal to those of the main task, then\nthe main task will be best optimized by ignoring the auxiliary task entirely). How exactly to de\ufb01ne\n\"relatedness\" in the context of multitask learning in deep neural networks remains unknown. The\nmost explicit de\ufb01nition to date, to our knowledge, comes from [16], where it is described as the\nangles between the singular vectors of the implicit input-output function learned by the network.\nWhile this de\ufb01nition is narrow, it lends a nice starting point for a theoretical analysis in the multitask\nsetting. Outside of the work done in [16] on multitask learning in linear regression networks, the\ntheory of multitask learning in neural networks remains unexplored. In this work we hope to further\nthe theoretical understanding of multitask bene\ufb01ts to multiclass classi\ufb01cation problems, a much more\ncommon class of problems in modern machine learning.\nTo narrow the scope of this study, we have chosen to focus on the formulation of multitask learning\nwhere the neural network is de\ufb01ned as having a single shared trunk and multiple task-speci\ufb01c heads.\nMany recent studies have sought to explore alternative methods of parameter sharing, though these do\nnot usually lend themselves as easily to this form of theoretical analysis [17, 18]. 
Further, multitask\nlearning also provides an interesting strategy for learning a single universal representation for many\ntasks possibly across multiple domains [19, 20, 21]. In this strategy there is often no clear \"main\"\ntask and it is not clear that the bene\ufb01t to be gained is even improved generalization performance on\nany of the trained tasks. Instead the bene\ufb01t could be seen as improved performance over a set of\nproblems given a \ufb01xed parameter budget or improved transfer learning to unseen tasks [22]. While\nthese are certainly exciting research directions and could bene\ufb01t from careful theoretical scrutiny, we\nleave them for future work.\nThis manuscript is structured as follows: in section 2 we describe the theory behind single task\nlearning in classi\ufb01cation networks. In section 3 we describe, both analytically and empirically, the\ntraining dynamics of such networks. In section 4 we extend this work to account for multitask learning\nof simple classi\ufb01cation tasks. Finally, in section 5 discuss interesting leads and future directions.\n\n2 Theoretical Underpinnings\n\nA convenient framework for analyzing multitask problems was introduced in [16], addressing\nregression problems in deep linear neural networks. Given the success of that approach, could the\ntechniques in [16] be generalized to deep neural net classi\ufb01ers with softmax outputs? Our analysis\nprovides an af\ufb01rmative answer to this question, albeit at considerable technical cost: despite a strong\nconceptual similarity between analyzing regression and softmax classi\ufb01cation problems, the structure\nof the solutions to the classi\ufb01cation problem differ markedly from those obtained in the regression\ncase. 
On the other hand, and perhaps unsurprisingly, the intuition gleaned from [16] about the conditions required for effective multitask learning carries over to classification problems, in spite of the technical differences between the analysis of classification and regression tasks.\nWe adopt the student-teacher setup popularized several decades ago in early attempts to theoretically understand the generalization abilities of neural networks (see, e.g. [23]) and recently revisited in [16]. We will attempt to closely follow the notational conventions in [16] with the hope of establishing a common language for analyzing these sorts of problems. The key insight behind the analysis of softmax classifiers is the uncanny resemblance of the training dynamics of deep neural nets to the physical dynamics of disordered systems. In particular, we take advantage of a formal similarity between deep neural softmax classifiers and a generalized version of Derrida\u2019s Random Energy Model (REM) [24]. A generalization of the REM is required because the outputs of a deep neural network are correlated random variables, in contrast to the i.i.d. conditions that render the original REM solvable. Furthermore, deep learning practitioners do not work with infinite size models, so we also have to take into account finite size effects.\n\n2.1 Teacher Network\n\nFollowing [16], we consider low rank teacher networks which serve to provide a training signal to arbitrary student networks. We begin with a 3-layer teacher network defined by $\\overline{N}_\\ell$ units in layer $\\ell$ and weight matrices $\\overline{W}^{21} \\in \\mathbb{R}^{\\overline{N}_2 \\times \\overline{N}_1}$ between the input and hidden layer and $\\overline{W}^{32} \\in \\mathbb{R}^{\\overline{N}_3 \\times \\overline{N}_2}$ between the hidden layer and an argmax output layer. 
We also define $\\overline{W} \\equiv \\overline{W}^{32}\\, \\overline{W}^{21} \\in \\mathbb{R}^{\\overline{N}_3 \\times \\overline{N}_1}$ for the teacher\u2019s composite weight.\nWe consider teachers that produce noisy outputs using a noise-perturbed composite weight matrix $\\hat{\\Sigma} \\equiv \\overline{W} + \\xi$, where $\\xi \\in \\mathbb{R}^{\\overline{N}_3 \\times \\overline{N}_1}$ has i.i.d. elements.\nDuring training, the teacher network takes in an input data matrix $X \\in \\mathbb{R}^{\\overline{N}_1 \\times N_{\\mathrm{data}}}$ and produces noisy vector outputs $\\hat{y} \\equiv \\operatorname{argmax}_{\\text{over rows}}\\, \\hat{\\Sigma} X \\in \\mathbb{R}^{N_{\\mathrm{data}}}$, thereby furnishing a rule for producing (noisy) labels $\\hat{y}$ from inputs $X$. At test time, the student is tested against noise-free labels generated via $y \\equiv \\operatorname{argmax}_{\\text{over rows}}\\, \\overline{W} X \\in \\mathbb{R}^{N_{\\mathrm{data}}}$.\nAt this point, we take a slight departure from the setup in [16]: in their setup, the data matrix is taken to be orthonormal, whereas we take $X$ to have entries drawn independently from a standard Gaussian distribution. Similarly, the elements of the noise matrix $\\xi$ are i.i.d. centered normal variables with variance $\\hat{\\sigma}^2/\\overline{N}_1$. The scale of $\\hat{\\sigma}$ is chosen in such a way that there is a non-zero probability for label-flipping, i.e. $\\mathrm{Prob}(\\hat{y} \\neq y) > 0$.\n\n2.2 Student Network\n\nWe first consider a 3-layer student network. In general, the student network has the same number of input and output units as the teacher since these are defined by the specifics of the task at hand. However, the student has no knowledge of the teacher\u2019s internal architecture. Thus, the number of hidden units in the student\u2019s network will almost surely be different from the teacher\u2019s. Writing $N_2$ for the student\u2019s number of hidden units, we have student weight matrices $W^{21} \\in \\mathbb{R}^{N_2 \\times \\overline{N}_1}$ between the input and hidden layer and $W^{32} \\in \\mathbb{R}^{\\overline{N}_3 \\times N_2}$ between the hidden layer and the softmax output layer. 
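
The teacher labeling rule of Section 2.1 can be illustrated with a short sketch in plain Python. This is only a sketch under stated assumptions: the helper names (matmul, argmax_over_rows, make_teacher, teacher_labels), the Gaussian draw of the teacher weights, and the toy sizes are hypothetical choices made for illustration and are not taken from the paper.

```python
import random


def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def argmax_over_rows(m):
    # For each column (one data point), return the row index of the largest entry.
    return [max(range(len(m)), key=lambda c: m[c][mu]) for mu in range(len(m[0]))]


def make_teacher(n_in, n_hidden, n_out, rng):
    # Low rank composite weight: hidden-to-output times input-to-hidden.
    # Drawing both factors from a standard Gaussian is an assumption of this sketch.
    w21 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w32 = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    return matmul(w32, w21)


def teacher_labels(w_bar, x, sigma_hat, rng):
    n_in = len(w_bar[0])
    std = sigma_hat / n_in ** 0.5  # noise variance is sigma_hat^2 / n_in
    noisy = [[w + rng.gauss(0, std) for w in row] for row in w_bar]
    y_hat = argmax_over_rows(matmul(noisy, x))  # noisy labels used for training
    y = argmax_over_rows(matmul(w_bar, x))      # noise-free labels used for testing
    return y_hat, y


rng = random.Random(0)
n_in, n_hidden, n_out, n_data = 8, 2, 3, 50
w_bar = make_teacher(n_in, n_hidden, n_out, rng)
x = [[rng.gauss(0, 1) for _ in range(n_data)] for _ in range(n_in)]  # Gaussian inputs
y_hat, y = teacher_labels(w_bar, x, sigma_hat=1.0, rng=rng)
flip_rate = sum(a != b for a, b in zip(y_hat, y)) / n_data  # fraction of flipped labels
```

With sigma_hat set to zero the noisy and noise-free labels coincide, so flip_rate is a direct empirical handle on the label-flipping probability mentioned above.
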
We also define $W \\equiv W^{32} W^{21} \\in \\mathbb{R}^{\\overline{N}_3 \\times \\overline{N}_1}$ for the student\u2019s composite weight.\nGiven an input data matrix $X \\in \\mathbb{R}^{\\overline{N}_1 \\times N_{\\mathrm{data}}}$, the student computes a matrix output\n\n$$Y(WX) = \\operatorname{softmax}(WX).$$\n\nNote that $Y \\in \\mathbb{R}^{\\overline{N}_3 \\times N_{\\mathrm{data}}}$ is a matrix with elements\n\n$$Y_{c\\mu}(WX) = \\operatorname{softmax}\\left( \\sum_{k=1}^{\\overline{N}_1} W_{ck} X_{k\\mu} \\right), \\qquad 1 \\le c \\le \\overline{N}_3, \\quad 1 \\le \\mu \\le N_{\\mathrm{data}},$$\n\nwhich is interpreted as the probability that the student assigns a class label $c$ given an input $x_\\mu$ drawn from the $\\mu$th column of $X$.\nThe student is trained by minimizing a cross-entropy loss\n\n$$L_{\\mathrm{train}} = -\\frac{1}{N_{\\mathrm{data}}} \\sum_{\\mu=1}^{N_{\\mathrm{data}}} \\sum_{c=1}^{\\overline{N}_3} \\delta_{c,\\hat{y}_\\mu(X)} \\ln Y_{c\\mu}(WX), \\qquad (1)$$\n\nwhere $\\delta$ is the Kronecker delta.\n\n3 Training Dynamics: Theory v/s Experiment\n\nWe use vanilla SGD to train the student network. A detailed derivation of the dynamics of training is presented in appendix A. The relevant equations are given by\n\n$$\\tau \\frac{d}{dt} W^{32} = \\left( G(\\hat{\\Sigma})\\, \\hat{\\Sigma} - G(W)\\, W \\right) {W^{21}}^T, \\qquad \\tau \\frac{d}{dt} W^{21} = {W^{32}}^T \\left( G(\\hat{\\Sigma})\\, \\hat{\\Sigma} - G(W)\\, W \\right), \\qquad (2)$$\n\nwhere $1/\\tau$ is the SGD learning rate, and $G : \\mathbb{R}^{\\overline{N}_3 \\times \\overline{N}_1} \\to \\mathbb{R}^{\\overline{N}_3 \\times \\overline{N}_3}$ is a non-linear, positive semi-definite matrix-valued function which captures the gradient of the softmax function averaged over the training data (see appendix A:13 for a precise definition). The solutions to (2) are very different from those obtained for the regression case in [16].\nFurther insight into the dynamics (2) is provided by considering the so-called training aligned (TA) case as defined in [16], where one initializes the student\u2019s weights such that the initial value of the student\u2019s composite weight is $W_0 = \\hat{U} S_0 \\hat{V}^T$ given the noisy teacher\u2019s SVD $\\hat{\\Sigma} = \\hat{U} \\hat{S} \\hat{V}^T$, where $S_0$ is the student\u2019s initial singular value matrix.\nA detailed analysis of the TA dynamics is presented in full generality in appendix B. 
For a rank one teacher in the TA case, i.e. if the noisy teacher\u2019s SVD is $\\hat{\\Sigma} = \\hat{s}\\, \\hat{u} \\hat{v}^T$, equation (2) simplifies further to an equation for the student\u2019s largest singular value, with all the other singular values exponentially suppressed in time. Explicitly, writing $s \\equiv \\max S$ for the student\u2019s largest singular value, equation (2) becomes\n\n$$\\tau \\frac{d}{dt} s = 2 s\\, \\hat{u} \\cdot \\left( \\hat{s}\\, G(\\hat{s}\\, \\hat{u} \\hat{v}^T) - s\\, G(s\\, \\hat{u} \\hat{v}^T) \\right) \\hat{u}. \\qquad (3)$$\n\nNumerically integrating equation (3) yields the graphs shown in Figure 1. The figure reveals excellent agreement between theory and experiment over a wide range of initial conditions.\n\n4 Multitask Generalization Dynamics: Theory v/s Experiment\n\n4.1 Teacher Networks\n\nIn the multitask setting, we have two teacher networks represented by $\\overline{N}_3 \\times \\overline{N}_1$ weight matrices $\\overline{W}_A$ and $\\overline{W}_B$ with ranks $\\overline{N}_2^A$ and $\\overline{N}_2^B$ respectively. Their noise-perturbed versions, $\\hat{\\Sigma}_A$ and $\\hat{\\Sigma}_B$, are defined as before, so that the teachers produce noisy labels $\\hat{y}_{A/B} \\equiv \\operatorname{argmax}_{\\text{over rows}}\\, \\hat{\\Sigma}_{A/B} X$ and noise-free labels $y_{A/B} \\equiv \\operatorname{argmax}_{\\text{over rows}}\\, \\overline{W}_{A/B} X$.\n\n4.2 Student Network\n\nIn the multitask setting, a composite student network is designed to learn multiple tasks jointly from the teachers. In general, the student network will consist of a trunk composed of a stack of hidden layers shared across tasks, augmented by a set of specialized heads specific to individual tasks. This setup is identical to the one used in [16].\nFor three-layer students, we continue to denote the trunk\u2019s composite weight matrix by $W^{21}$ and write $W^{32}_A$, $W^{32}_B$ for the weights in the heads, and $W_A \\equiv W^{32}_A W^{21}$, $W_B \\equiv W^{32}_B W^{21}$ for the corresponding composite weights. 
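
The shared-trunk construction and the per-task cross-entropy of equation (1) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names (matmul, softmax_columns, cross_entropy), the toy weight values, the inputs, and the labels are all hypothetical.

```python
import math


def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_columns(z):
    # Softmax over classes (rows), separately for each data point (column).
    cols = []
    for col in zip(*z):
        top = max(col)
        exps = [math.exp(v - top) for v in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return [list(row) for row in zip(*cols)]


def cross_entropy(w, x, labels):
    # Equation (1): mean negative log probability assigned to the given labels.
    y = softmax_columns(matmul(w, x))
    return -sum(math.log(y[c][mu]) for mu, c in enumerate(labels)) / len(labels)


# Hypothetical toy sizes: 3 inputs, 2 hidden units, 2 classes, 2 data points.
w21 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # trunk, shared by both tasks
w32_a = [[1.0, 0.0], [0.0, 1.0]]            # head for task A
w32_b = [[0.5, 0.5], [1.0, -1.0]]           # head for task B
w_a = matmul(w32_a, w21)                    # composite weight for task A
w_b = matmul(w32_b, w21)                    # composite weight for task B

x = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]   # columns are data points
loss_a = cross_entropy(w_a, x, labels=[0, 1])
loss_b = cross_entropy(w_b, x, labels=[1, 0])
```

Because both composite weights are built from the same w21, any update to the trunk changes the outputs of both heads at once, which is the coupling between tasks that the analysis relies on.
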
Note that, crucially, both students share the trunk weights $W^{21}$.\nThe students are trained to minimize a weighted sum of the cross-entropy losses pertaining to each task, i.e. $L = \\alpha_A L_A + \\alpha_B L_B$. In general, the weighting coefficients $\\alpha_A, \\alpha_B$ can be chosen via some optimization method or even learned as part of the model\u2019s training procedure. However, we will only consider the simplest case where $\\alpha_A = \\alpha_B = 1$.\nWe arbitrarily pick task A as the main task that we\u2019re interested in, and consider task B as an auxiliary task whose sole purpose is to improve the performance of task A. We are thus interested in finding out what properties of task B are required in order to improve the student\u2019s learning of task A. This naturally leads to the idea of task-relatedness, a well-known, though loosely-defined, concept in the literature on multitask learning [11].\n\nFigure 1: Comparing the theoretical predictions in (3) to empirical results. $1/\\tau = 10^{-3}$ is the learning rate, so the figure shows training for 5k steps (chosen as the minimum of the validation error). The empirical results are obtained using 10 different random seeds. The results shown are for a 2-class and 20-class classification task using 100 training data points to highlight the fact that the theory agrees with experiment over a wide range of class sizes.\n\n4.3 Task Relatedness\n\nAs noted in the introduction, we currently lack a precise definition of task-relatedness in the context of multitask learning in deep neural networks. The authors of [16] propose defining task-relatedness as a function of the angles between the singular vectors of the implicit input-output function learned by the network. 
As it turns out, as a direct consequence of the SGD dynamics in (2), the same definition appears naturally in the student-teacher framework for multitask classifiers.\nGiven two tasks A and B defined by two teachers with weight matrices $\\overline{W}_A$ and $\\overline{W}_B$ respectively, we denote their SVDs by $\\overline{W}_{A/B} = \\overline{U}_{A/B}\\, \\overline{S}_{A/B}\\, \\overline{V}_{A/B}^T$. We define the relatedness $r_{AB}$ between tasks A and B as\n\n$$r_{AB} := \\overline{V}_B^T\\, \\overline{V}_A \\qquad (4)$$\n\n4.4 Multitask Benefit\n\nTable 1: Key takeaways from multitask analysis ($\\nearrow$: increasing, $\\searrow$: decreasing). The columns $r_{AB}$, $s_B$ and $N_{\\mathrm{data}}$ are the independent variables.\n\nrow | $r_{AB}$ | $s_B$ | $N_{\\mathrm{data}}$ | effect on $MT_{A \\leftarrow B}$\n(a) | $0$ | any | any | $0$\n(b) | $> 0$ | $\\nearrow$ | any | $\\nearrow$\n(c) | $\\nearrow$ ($0 < r_{AB} \\ll 1$) | any | limited | $\\nearrow$\n(d) | any | any | abundant | small\n\nAnalytical explanations listed in the table: appendix C.1, eqn. (36); $s_A = \\tilde{s}_A$; $(s_A - \\tilde{s}_A) \\searrow$ as $s_B \\nearrow$; $\\tilde{s}_A g(\\tilde{s}_A) \\to \\overline{s}_A g(\\overline{s}_A)$.\n\nFor the purposes of quantifying any gains in performance from multitask learning relative to models trained on a single task, we introduce the notion of a multitask benefit. We arrive at our multitask benefit by comparing the optimal performance of the multitask model on the main task, say A, to the optimal performance of a baseline model trained only on task A.\nGiven the multitask generalization loss $L_{AB} = L_A + L_B$, we define $L_{A|B} := L_{AB} - L_B$ as the generalization loss on task A when task A is trained jointly with task B. This quantity is to be compared to the generalization loss $\\tilde{L}_A$, defined as the loss when task A is trained on its own.\nFollowing [16], we define the multitask benefit conferred on task A by task B via\n\n$$MT_{A \\leftarrow B} \\equiv \\min_t \\{ \\tilde{L}_A(t) \\} - \\min_t \\{ L_{A|B}(t) \\}$$\n\nRemarkably, one can place a tight bound on the multitask benefit using a relatively simple argument based on the concavity of the logarithm function. We present here the result for the simpler case of a TA model with rank one teachers and relegate the general case to appendix C. 
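
For rank one teachers, the relatedness in equation (4) and the multitask benefit defined above both reduce to very small computations, sketched here in plain Python. The function names and the numbers are hypothetical; in particular the two loss curves are made-up values used only to show the arithmetic, not results from the paper.

```python
def relatedness_rank_one(v_a, v_b):
    # For rank one teachers, V_B^T V_A reduces to the inner product of the
    # two unit-norm right singular vectors.
    return sum(b * a for a, b in zip(v_a, v_b))


def multitask_benefit(baseline_loss, joint_loss_on_a):
    # Minimum over training time of the single task loss on A, minus the
    # minimum over training time of the loss on A under joint training with B.
    return min(baseline_loss) - min(joint_loss_on_a)


v_a = [1.0, 0.0]
v_b = [0.6, 0.8]
r_ab = relatedness_rank_one(v_a, v_b)

# Made-up generalization loss curves, one value per recorded training step.
baseline = [1.00, 0.70, 0.55, 0.60, 0.72]
joint = [1.00, 0.65, 0.50, 0.48, 0.58]
benefit = multitask_benefit(baseline, joint)
```

A positive value means that the best point reached during joint training beats the best point reached by the single task baseline; the two minima are taken independently, so they need not occur at the same training step.
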
For a TA model with rank one teachers with SVD $\\overline{W}_A = \\overline{s}_A\\, \\overline{u}_A \\overline{v}_A^T$, we abbreviate $g(s) := \\overline{u}_A \\cdot G(s\\, \\overline{u}_A \\overline{v}_A^T)\\, \\overline{u}_A \\geq 0$, with $G$ as featured in the training dynamics in equation (2) and defined in appendix A:13. The key takeaways of this analysis are summarized in Table 1 and described more fully below.\nAs derived in Appendix C (cf. equations C:24 and C:25), the bound on the multitask benefit is\n\n$$(s_A - \\tilde{s}_A) \\left( \\overline{s}_A g(\\overline{s}_A) - s_A g(s_A) \\right) \\le MT_{A \\leftarrow B} \\le (s_A - \\tilde{s}_A) \\left( \\overline{s}_A g(\\overline{s}_A) - \\tilde{s}_A g(\\tilde{s}_A) \\right) \\qquad (5)$$\n\nHere, as with $\\tilde{L}_A$, a tilde marks quantities of the baseline single task model and an overbar marks noise-free teacher quantities. Notice that the factor $\\overline{s}_A g(\\overline{s}_A) - \\tilde{s}_A g(\\tilde{s}_A)$ on the RHS of equation (5) depends only on quantities pertaining to the baseline single task case, and hence is entirely independent of the training dynamics of the multitask case.\nIn contrast, the sign of $(s_A - \\tilde{s}_A)$ depends on the multitask teachers\u2019 singular values for tasks A and B, their corresponding SNRs, and the relatedness $r_{AB}$ between tasks A and B (see the discussion surrounding equations 28-37 in Appendix C.1). For unrelated tasks, viz. $r_{AB} = 0$, one obtains $s_A = \\tilde{s}_A$ (cf. C.1:28) and so the multitask benefit vanishes. For \u201cweakly related\u201d tasks, viz. $0 < r_{AB} \\ll 1$, (C.1:35) shows that high SNR auxiliary tasks have a deleterious effect on $MT_{A \\leftarrow B}$.\nIn the high SNR regime, the noisy teacher\u2019s singular values are larger than in the noise-free case. Since the student\u2019s dynamics is driven by the noisy teacher, $s_A \\to \\hat{s}_A \\geq \\overline{s}_A$ in the high SNR regime. Under these conditions, equation (C.1:31) implies that $MT_{A \\leftarrow B} \\geq 0$.\nIn the low SNR regime, the noisy teacher\u2019s singular values lie in the bulk of the Marchenko-Pastur (MP) sea [25]. In this case, the student\u2019s dynamics is driven by noise, so that $s_A \\to$ 
$\\hat{s}_A < \\overline{s}_A$ for low SNRs. Under these conditions, a positive $MT_{A \\leftarrow B}$ occurs only if the constraints on $r_{AB}$ and $s_B$ leading to equation (C.1:33) are satisfied.\nIn regimes where labeled training data is abundant, the factor $\\overline{s}_A g(\\overline{s}_A) - \\tilde{s}_A g(\\tilde{s}_A) \\to 0$, in which case $MT_{A \\leftarrow B} \\to 0$, regardless of the relatedness between tasks (cf. equation C.1:37).\nTo summarize, the TA model predicts that multitask learning will have the largest impact under conditions mimicking scarce labeled data such that the baseline model underperforms on the main task, as long as the auxiliary tasks have some relatedness to the main task. Thus, coming up with auxiliary tasks that have a high degree of relatedness to the main task will be crucial to observing a positive multitask benefit.\nWhile the results in this section have only been demonstrated for the special case of TA models, we will shortly see that the predictions are realized empirically in a wide variety of scenarios.\n\n4.5 Data vs model uncertainty\n\nUsing the framework described above, we set out to describe the relationship between multitask benefit and several key factors that influence training of both the single task baseline (the amount and quality of the main task data) and multitask training (the amount, quality and relatedness of auxiliary task data). We systematically varied\u00b9 these factors and computed the multitask benefit for 5 different training datasets, the results of which are summarized in Figure 2. To ensure that we had roughly class-balanced training datasets, we fixed $\\overline{N}_3 = \\overline{N}_2$, and set both to 10 for the experiments here. Other values for the rank showed similar results and data for rank 3 teacher networks can be found in Figure A2. 
The signal-to-noise ratio (SNR) of the data in each dataset is directly proportional to the singular value of the teacher network that generated each task\u2019s data.\nWe kept all singular values for a given teacher network the same and varied this value from .01 to 100. Similarly, we fixed the relatedness of teacher network B to $\\overline{V}_B^T\\, \\overline{V}_A = r_{AB} I$, such that each singular vector in $\\overline{V}_B$ had a constant inner product $r_{AB}$ with the corresponding singular vector in $\\overline{V}_A$ and was orthogonal to the others. We varied this value from 0 to 1.\nThis work demonstrates several interesting dependencies:\n\n1. Multitask benefit increases with increasing task relatedness and SNR of the auxiliary data. This mirrors the finding from row (b) of Table 1.\n\n2. Unrelated, high SNR auxiliary tasks are actually destructive to the learning process of the main task. Our theoretical framework provides an explanation for this observation in C.1:35.\n\nFootnote 1: Code supporting this paper is available upon request.\n\nFigure 2: (Left) Summary of multitask benefits gained when the student network was trained with increasing signal-to-noise ratio (SNR). With constant noise levels, the SNR increases as the singular values for teacher A, $S_A$, are increased from .01 to 10 (alternating stripes, left-to-right). For each value of $S_A$ (x-axis), the average multitask benefit was computed for low SNR auxiliary tasks ($S_B$) and high SNR auxiliary tasks (each line segment, left-to-right) across 5 levels of task relatedness ($r_{AB}$). Data is plotted for 800 training points. This demonstrates that multitask benefit is correlated with task relatedness and SNR for related tasks, yet negatively correlated with SNR for unrelated tasks. (Right) Summary of multitask benefits with increasing amount of training data (alternating stripes, left-to-right). At 100 training points the network still struggles to train and does not gain a generalization benefit from auxiliary data. 
For > 200 training points, the network begins to leverage\nthe related auxiliary data to improve performance. When the dataset is very large, performance nearly\nreaches its ceiling and the auxiliary data has little effect. See Figure A1 for the complete set of\ninteractions among these variables.\n\nFigure 3: (Left) Summary of multitask bene\ufb01ts gained when the student network was trained with\nincreasing amounts of auxiliary task data . For each quantity of auxiliary task data (x-axis), the\naverage multitask bene\ufb01t was computed for low SNR aux tasks and high SNR aux tasks (each\nline segment, left-to-right) across 5 levels of task relatedness. All the data shown is for high SNR\nmain tasks, and demonstrates that increasing relatedness and auxiliary task data give large multitask\nbene\ufb01ts. For more details see Figure A3. (Right) Summary of multitask bene\ufb01ts gained for nonlinear\nstudent networks of increasing depth (x-axis). Deeper nonlinear networks show similar trends to\nshallow linear networks. For more details see Figure A4.\n\nIn contrast, unrelated, noisy auxiliary tasks are readily ignored. This mirrors the \ufb01ndings\nfrom rows (a) and (c) of Table 1.\n\n3. The main task must have a certain level of base performance either from clean data or\nlarger amounts of data before multitask learning can help. This holds up to the point where\nsingle task performance nears optimal performance on the main task, as is the case when the\namount of training data supplied is large. 
These statements mirror the findings from rows (c) and (d) of Table 1.\n\n4.6 Auxiliary task data efficiency\n\nMultitask learning is a popular strategy for extending the utility of a limited amount of main task data. This is often an interesting choice when auxiliary task data is easy to come by but main task data is expensive. To gauge the value of additional auxiliary task data while holding main task data fixed, we trained multitask student networks on 100 main task data points and up to 800 auxiliary task data points. These results are summarized in Figure 3 (left) and full results can be found in Figure A3. As auxiliary task data quantities increase we see similar trade-offs to those above, where related, high quality data provides a large multitask benefit, while unrelated, high quality data proves increasingly detrimental.\n\n4.7 Multitask learning in deeper, nonlinear student networks\n\nTo ensure that our results can generalize to nonlinear and deeper networks, we varied the number of hidden layers in the student network and included a ReLU nonlinearity between each hidden layer. While this situation does not lend itself to clean theoretical analysis, we found that these networks behave qualitatively similarly to the linear networks described above. These results are summarized in Figure 3 (right) and full results can be found in Figure A4. Again, multitask benefit is strongly correlated with relatedness and the SNR of both datasets. 
Interestingly, there is a general\nshift downwards in multitask bene\ufb01t, suggesting that nonlinear networks require more highly related\ntasks in order to generate a signi\ufb01cant performance increase.\n\n5 Discussion and future directions\n\nHere we demonstrate that, for linear classi\ufb01er networks with a softmax output nonlinearity, general-\nization performance can be computed analytically. We extend the analysis in [16] to classi\ufb01cation\nproblems and show both theoretically and empirically that improvements from multitask learning\nare heavily related to training set size, task relatedness, and the noise levels inherent in the data.\nNetworks given suf\ufb01cient data to train well show improved performance when supplemented with\nrelated, high signal-to-noise ratio auxiliary tasks. Unrelated auxiliary tasks show little bene\ufb01t and can\nbe actively detrimental if they provide a strong enough training signal.\nThe problem of increasing the range of parameters from which one gets a multitask bene\ufb01t and\ndecreasing potential harms has received increasing interest in recent years, often through clever loss\nor gradient weighting strategies [26, 27, 28]. A careful interrogation of (5) should provide some\ninsight on methods for maximizing the possible multitask bene\ufb01t, a direction we leave for future\nwork. Additionally, we have shown that our results generalize to deeper, more nonlinear student\nnetworks, though these networks are still quite different from networks used in practice. We expect\nthe insights gained in this work, especially with regard to the critical properties of main and auxiliary\ntask datasets will generalize well to more complex networks. Generalizing our results regarding task\nrelatedness poses an interesting challenge for future research.\n\nAcknowledgments\nWe would like to thank Cory Stephenson, Gokce Keskin, Oguz Elibol, Suchismita Padhy, and Ting\nGong for many fruitful discussions regarding this work. 
We must also acknowledge Nicholas Sapp\nfor his work in establishing the compute infrastructure that made the empirical portions of this work\npossible.\n\nReferences\n[1] Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in\n\nneural networks. CoRR, abs/1710.03667, 2017.\n\n[2] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear\ndynamics of learning in deep linear neural networks. In 2nd International Conference on\nLearning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track\nProceedings, 2014.\n\n[3] St\u00e9phane Mallat. Understanding deep convolutional networks. CoRR, abs/1601.04920, 2016.\n\n8\n\n\f[4] Henry W. Lin and Max Tegmark. Why does deep and cheap learning work so well? CoRR,\n\nabs/1608.08225, 2016.\n\n[5] Felix Dr\u00e4xler, Kambis Veschgini, Manfred Salmhofer, and Fred A. Hamprecht. Essentially\nno barriers in neural network energy landscape.\nIn Proceedings of the 35th International\nConference on Machine Learning, ICML 2018, Stockholmsm\u00e4ssan, Stockholm, Sweden, July\n10-15, 2018, pages 1308\u20131317, 2018.\n\n[6] Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, G\u00e9rard Ben Arous, Chiara\nCammarota, Yann LeCun, Matthieu Wyart, and Giulio Biroli. Comparing dynamics: Deep\nneural networks versus glassy systems. In Proceedings of the 35th International Conference\non Machine Learning, ICML 2018, Stockholmsm\u00e4ssan, Stockholm, Sweden, July 10-15, 2018,\npages 324\u2013333, 2018.\n\n[7] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian\nBorgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient\ndescent into wide valleys. In 5th International Conference on Learning Representations, ICLR\n2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.\n\n[8] Sebastian Goldt, Madhu S. Advani, Andrew M. 
Saxe, Florent Krzakala, and Lenka Zdeborov\u00e1.\nGeneralisation dynamics of online learning in over-parameterised neural networks. CoRR,\nabs/1901.09085, 2019.\n\n[9] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and\nJeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient\ndescent. CoRR, abs/1902.06720, 2019.\n\n[10] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang.\nOn exact computation with an infinitely wide neural net. CoRR, abs/1904.11955, 2019.\n\n[11] Rich Caruana. Multitask learning. Machine Learning, 28(1):41\u201375, Jul 1997.\n\n[12] Y. Qian, M. Yin, Y. You, and K. Yu. Multi-task joint-learning of deep neural networks for\nrobust speech recognition. In 2015 IEEE Workshop on Automatic Speech Recognition and\nUnderstanding (ASRU), pages 310\u2013316, Dec 2015.\n\n[13] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task\nSequence to Sequence Learning. arXiv e-prints, page arXiv:1511.06114, Nov 2015.\n\n[14] Suyoun Kim, Takaaki Hori, and Shinji Watanabe. Joint CTC-attention based end-to-end speech\nrecognition using multi-task learning. ICASSP, IEEE International Conference on Acoustics,\nSpeech and Signal Processing - Proceedings, pages 4835\u20134839, 2017.\n\n[15] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks\nfor natural language understanding. CoRR, abs/1901.11504, 2019.\n\n[16] Andrew Kyle Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and\ntransfer learning in deep linear networks. CoRR, abs/1809.10374, 2018.\n\n[17] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks\nfor multi-task learning. CoRR, abs/1604.03539, 2016.\n\n[18] Elliot Meyerson and Risto Miikkulainen. Beyond shared hierarchies: Deep multitask learning\nthrough soft layer ordering. 
CoRR, abs/1711.00108, 2017.\n\n[19] Hakan Bilen and Andrea Vedaldi. Universal representations: The missing link between faces,\ntext, planktons, and cat breeds. CoRR, abs/1701.07275, 2017.\n\n[20] Jeremy Howard and Sebastian Ruder. Fine-tuned language models for text classification. CoRR,\nabs/1801.06146, 2018.\n\n[21] Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones,\nand Jakob Uszkoreit. One model to learn them all. CoRR, abs/1706.05137, 2017.\n\n[22] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. CoRR,\nabs/1708.07860, 2017.\n\n[23] S. B\u00f6s, W. Kinzel, and M. Opper. Generalization ability of perceptrons with continuous outputs.\nPhys. Rev. E, 47:1384\u20131391, Feb 1993.\n\n[24] Bernard Derrida. Random-energy model: An exactly solvable model of disordered systems.\nPhys. Rev. B, 24:2613\u20132626, Sep 1981.\n\n[25] Florent Benaych-Georges and Raj Rao Nadakuditi. The singular values and vectors of low\nrank perturbations of large rectangular random matrices. Journal of Multivariate Analysis,\n111:120\u2013135, 2012.\n\n[26] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh\nlosses for scene geometry and semantics. CoRR, abs/1705.07115, 2017.\n\n[27] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. CoRR,\nabs/1810.04650, 2018.\n\n[28] Yunshu Du, Wojciech M. Czarnecki, Siddhant M. Jayakumar, Razvan Pascanu, and Balaji\nLakshminarayanan. Adapting Auxiliary Losses Using Gradient Similarity. arXiv e-prints, page\narXiv:1812.02224, Dec 2018.\n\n[29] G. H. Hardy, J. E. Littlewood, and G. P\u00f3lya. Inequalities. Cambridge University Press, UK, 1934.", "award": [], "sourceid": 9289, "authors": [{"given_name": "Anthony", "family_name": "Ndirango", "institution": "Intel AI Lab"}, {"given_name": "Tyler", "family_name": "Lee", "institution": "Intel AI Lab"}]}