{"title": "AutoAssist: A Framework to Accelerate Training of Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5998, "page_last": 6008, "abstract": "Deep neural networks have yielded superior performance in many contemporary\napplications. However, the gradient computation in a deep model with millions of\ninstances leads to a lengthy training process even with modern\nGPU/TPU hardware acceleration. In this paper, we propose AutoAssist, a\nsimple framework to accelerate training of a deep neural network.\nTypically, as the training procedure evolves, the amount of improvement by a\nstochastic gradient update varies dynamically with the choice of instances in the mini-batch.\nIn AutoAssist, we utilize this fact and design an instance shrinking operation that is\nused to filter out instances with relatively low marginal improvement to the\ncurrent model; thus the computationally intensive gradient computations are\nperformed on informative instances as much as possible.\nSpecifically, we train\na very lightweight Assistant model jointly with the original deep network, which we refer to as Boss.\nThe Assistant model is designed to gauge the importance of a given\ninstance with respect to the current Boss such that the shrinking operation can\nbe applied in the batch generator. With careful design, we train the Boss and\nAssistant in a nonblocking and asynchronous fashion such that\noverhead is minimal.\nTo demonstrate the effectiveness of AutoAssist, we conduct experiments on two\ncontemporary applications: image classification using ResNets with varied number\nof layers, and neural machine translation using LSTMs, ConvS2S and\nTransformer models. 
For each application, we verify that AutoAssist leads to significant reduction in training time; in particular, 30% to 40% of the total operation count can be reduced, which leads to faster convergence and a corresponding decrease in training time.", "full_text": "AutoAssist: A Framework to Accelerate Training of Deep Neural Networks\n\nJiong Zhang* zhangjiong724@utexas.edu\n\nHsiang-Fu Yu† rofu.yu@gmail.com\n\nInderjit S. Dhillon*† inderjit@cs.utexas.edu\n\nAbstract\n\nDeep Neural Networks (DNNs) have yielded superior performance in many contemporary applications. However, the gradient computation in a deep model with millions of instances leads to a lengthy training process even with modern GPU/TPU hardware acceleration. In this paper, we propose AutoAssist, a simple framework to accelerate training of a deep neural network. Typically, as the training procedure evolves, the amount of improvement by a stochastic gradient update varies dynamically with the choice of instances in the mini-batch. In AutoAssist, we utilize this fact and design an instance shrinking operation that is used to filter out instances with relatively low marginal improvement to the current model; thus the computationally intensive gradient computations are performed on informative instances as much as possible. Specifically, we train a very lightweight Assistant model jointly with the original deep network, which we refer to as the Boss. The Assistant model is designed to gauge the importance of a given instance with respect to the current Boss model such that the shrinking operation can be applied in the batch generator. With careful design, we train the Boss and Assistant in a non-blocking and asynchronous fashion such that overhead is minimal. 
To demonstrate the effectiveness of AutoAssist, we conduct experiments on two contemporary applications: image classification using ResNets with a varied number of layers, and neural machine translation using LSTMs, ConvS2S and Transformer models. For each application, we verify that AutoAssist leads to significant reduction in training time; in particular, 30% to 40% of the total operation count can be reduced, which leads to faster convergence and a corresponding decrease in training time.\n\n1 Introduction\n\nDeep Neural Networks (DNNs) trained on a large number of instances have been successfully applied to many real-world applications, such as [6, 11] and [20]. Due to the increasing number of training instances and the increasing complexity of deep models, variants of (mini-batch) stochastic gradient descent (SGD) are still the most widely used optimization methods because of their simplicity and flexibility. In a typical SGD implementation, a batch of instances is generated either in a randomly permuted order or by a uniform sampler. Due to the complexity of deep models, the gradient calculation is usually extremely computationally intensive and requires powerful hardware (such as a GPU or TPU) to perform the entire training in a reasonable time frame. At any given time in the training process, each instance has its own utility in terms of improving the current model. As a result, performing SGD updates on a batch of instances which are sampled/generated uniformly can be suboptimal in terms of maximizing the return-on-investment (ROI) of GPU/TPU cycles. 
In this paper, we propose AutoAssist, a simple framework to accelerate the training of deep models with an Assistant that generates instances in a sequence that attempts to improve the ROI.\n\n*The University of Texas at Austin\n†Amazon\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThere have been earlier similar attempts to improve the training speed of deep learning. In [3], curriculum learning (CL) was shown to be beneficial for convergence; however, prior knowledge of the training set is required to sort the instances by their difficulty. Self-paced learning (SPL) [19] is another attempt that infers the “difficulty” of instances based on the corresponding loss value during training and decreases the training weights of these difficult instances. [13] combined the above two ideas and proposed Self-Paced Curriculum Learning (SPCL), which utilizes both prior knowledge and the loss values as the learning progresses. However, SPCL relies on a manually chosen scheme function and introduces a considerable overhead in terms of both time and space complexity.\n\nIn our proposed AutoAssist framework, the main model, referred to as the Boss, is trained with batches generated by a lightweight Assistant which is designed to adapt to changes in the Boss dynamically and asynchronously. 
Our contributions in this paper are as follows.\n\n• We propose AutoAssist, a simple framework to accelerate training of deep neural networks via a carefully designed Assistant which is able to shrink less informative instances and generate smart batches in an ROI-aware sequence for the Boss to perform SGD updates.\n\n• We also propose a concurrent computation mechanism to simultaneously utilize both CPUs and GPUs such that the learning of the Boss and the Assistant is conducted asynchronously, which minimizes the overhead introduced by the Assistant.\n\n• We conduct extensive experiments to show that AutoAssist is effective in accelerating the training of various types of DNN models, including image classification using ResNets with a varied number of layers, and neural machine translation using LSTMs, ConvS2S and Transformers.\n\n2 Related Work\n\nConsiderable research has been conducted to optimize the way data is presented to the optimizer for deep learning. For example, curriculum learning (CL) [3], which presents easier instances to the model before hard ones, was shown to be beneficial to the overall convergence; however, prior knowledge of the training set is required to decide the curriculum. To avoid this, [28] proposes to learn the curriculum with Bayesian optimization. Self-paced learning (SPL) [19] infers the difficulty of instances from the corresponding loss value and then decreases the sample weight of difficult instances. Self-paced Convolutional Networks (SPCN) [22] combines the SPL algorithm with the training of Convolutional Neural Networks to get rid of noisy data. SPL-type methods generally require a user-specified “pace rate”, and the learning algorithm gradually incorporates more data after every epoch until the whole dataset is incorporated into the curriculum. These methods have been proven useful in a wide range of applications, including image recognition and natural language processing. 
Similar ideas have been developed for optimizing other machine learning models. For example, in classical SVM models, methods have been proposed to ignore trivial instances by dimension shrinking in dual coordinate descent [12].\n\nImportance sampling is another type of method that has been proposed to accelerate SGD convergence. In importance sampling methods, instances are sampled according to their importance weights. [31] proposed Iprox-SGD, which uses importance sampling to achieve variance reduction. The optimal importance weight distribution to reduce the variance of the stochastic gradient is proved to be proportional to the gradient norm of each sample; see [24, 31, 1]. Despite the variance reduction effect, importance sampling methods tend to introduce large computational overhead: before each stochastic step, the importance weights need to be updated for all instances, which makes importance sampling methods infeasible for large datasets. [16] proposed an importance sampling scheme for deep learning models; however, in order to reduce the computation cost of evaluating importance scores, the proposed algorithm applied a sub-sampling technique, thus leading to reliance on outdated importance scores during training. The online batch selection method [23] samples instances with probability exponential in their last known loss value.\n\nThere are also several recent methods that propose to train an attached network with the original one. ScreenerNet [17] trains an attached neural network to learn a scalar weight for each training instance, while MentorNet [14] learns a data-driven curriculum that prevents the main network from overfitting. Learning to teach [9] uses a student and a teacher model and optimizes the framework with reinforcement learning. 
Since the additional model is another deep neural network, the above methods introduce substantial computational and memory overhead to the original training process.\n\n3 A Motivating Example: SGD with Instance Shrinking for Linear SVMs\n\nIn this section, we present and analyze the theoretical and empirical properties of SGD with instance shrinking for linear Support Vector Machines (SVMs). Although the analysis and observations apply to convex problems such as linear SVMs, the lessons from this section are the inspirational cornerstone for the design of our AutoAssist framework to accelerate training of non-convex deep learning models in Section 4.\n\nThe shrinking strategy is a key technique widely used in many popular ML software libraries to accelerate training for large-scale SVMs, such as SVMLight [15, Section 11.4], LIBSVM [4, Section 5.1] and LIBLINEAR [8, Section E.5]. The main idea behind the shrinking strategy is to identify variables that are unlikely to change in upcoming iterations and temporarily remove them from consideration to form a smaller/simpler optimization problem. Although in most existing literature the shrinking strategy is applied to accelerate coordinate descent based methods for the dual SVM problem, it is natural to ask if we can extend the shrinking strategy to accelerate stochastic gradient descent based methods. Due to the primal-dual relationship for SVM problems, there is a direct connection between a coordinate update in the dual SVM formulation and the stochastic gradient update with respect to the corresponding instance in the primal SVM formulation (see, for example, [12, Section 4.2]). 
After a careful examination of the relationship and the shrinking strategy adopted for dual SVMs, it can be seen that when a dual variable meets the shrinking criterion during training, the corresponding instance is not only correctly classified by the current model but is also relatively far away from the decision boundary. Furthermore, as the decision boundary changes dynamically during training, there is a mechanism for each shrunk dual variable to become active again in all the aforementioned approaches, thus guaranteeing convergence. We now discuss a simple Instance Shrinking strategy designed for SGD on SVM-like convex functions.\n\nGiven a dataset $\{(x_i, y_i) : i = 1, \ldots, N\}$, we consider an objective function parametrized as follows:\n\n$\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{N} \sum_{i=1}^{N} f_i(w \mid x_i, y_i),$ (1)\n\nwhere $f_i(\cdot)$ is a loss function for the i-th instance. In a typical SGD method, at the k-th step, an instance $(x_i, y_i)$ is uniformly sampled to perform the following update:\n\n$w_{k+1} \leftarrow w_k - \eta_k \nabla f_i(w_k \mid x_i, y_i),$ (2)\n\nwhere $\eta_k$ is the learning rate at the k-th step. Motivated by the above primal-dual connection for linear SVMs, in order to extend the shrinking strategy to SGD, it is intuitive to introduce the concept of the utility of each instance for the current model $w_k$, denoted by $\mathrm{utility}(x_i, y_i \mid w_k)$, which is used to estimate the marginal improvement from this instance using the current model. As a result, we can apply the following utility-aware instance shrinking strategy to SGD:\n\n$w_{k+1} \leftarrow \begin{cases} w_k - \eta_k \nabla f_i(w_k \mid x_i, y_i) & \text{if } \mathrm{utility}(x_i, y_i \mid w_k) \ge T_k, \\ w_k & \text{otherwise,} \end{cases}$ (3)\n\nwhere $T_k$ is a threshold used to control the aggressiveness of our instance shrinking strategy.\n\n3.1 Choice of the Utility Function\n\nThe effectiveness of the instance shrinking strategy for SGD depends on the choice of the utility function. 
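The update rule (3) fits in a few lines. The following is a minimal illustrative sketch, not the authors' implementation, using a linear SVM hinge loss so that the per-instance loss itself serves as the utility function:

```python
import numpy as np

def shrinking_sgd_step(w, x, y, loss_grad, utility, T_k, lr):
    """One utility-aware SGD step, following update rule (3): the gradient
    is computed and applied only when the instance's utility clears the
    threshold T_k; otherwise the instance is shrunk at no gradient cost."""
    if utility(w, x, y) >= T_k:
        return w - lr * loss_grad(w, x, y)
    return w

# Hinge loss of a linear SVM; the per-instance loss doubles as the utility.
def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * float(w @ x))

def hinge_grad(w, x, y):
    # Subgradient of the hinge loss: -y*x inside the margin, zero outside.
    return -y * x if hinge_loss(w, x, y) > 0.0 else np.zeros_like(w)
```

Starting from $w = 0$, an instance inside the margin has utility 1 and triggers an update, while an instance already classified with a large margin has utility 0 and is skipped without any gradient computation.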
There are two considerations:\n\n• The utility function should be designed such that it accurately approximates the exact marginal improvement of the instance $(x_i, y_i)$ for the current model, which can be defined as $F(w_k - \eta_k \nabla f_i(w_k \mid x_i, y_i)) - F(w_k)$.\n\n• The utility function should also be simple to compute so that its overhead is minimal. As a result, the exact marginal improvement cannot be an effective utility function due to its high $O(Nd)$ computational overhead.\n\nObviously, the balance between both considerations is the key to designing an effective utility function. Inspired by the existing shrinking strategy used in SVM optimization, there are two simple candidates for the utility function: 1) the norm of the gradient: $\mathrm{utility}(x_i, y_i \mid w_k) = \lVert \nabla f_i(w_k \mid x_i, y_i) \rVert$, and 2) the loss: $\mathrm{utility}(x_i, y_i \mid w_k) = f_i(w_k \mid x_i, y_i)$. First, both choices rely only on local information $(x_i, y_i)$, i.e., no other instances are involved in the computation; thus the overhead is small. Next, both choices are a good proxy: the gradient norm measures the magnitude of the change in the SGD update, while the loss directly measures the performance of the current model on this instance. Experimental results on SVMs indicate that both the gradient norm and loss utilities achieve faster convergence. As stated below, the choice of gradient norm can be shown to be theoretically sound.\n\nFigure 1: Comparison of Instance Shrinking with loss (Shrinking-loss) or gradient (Shrinking-grad) norm as utility and Importance Sampling for SGD on the linear SVM problem on the public news20 and rcv1 datasets. In terms of the number of parameter updates, instance shrinking and importance sampling strategies show faster convergence than plain SGD. Because of large computational overhead, the importance sampling strategy is not effective in terms of reducing training time.\n\nTheorem 1. Given a $\mu$-strongly convex function $F(w) := \frac{1}{N} \sum_{i=1}^{N} f_i(w)$ that satisfies Property 1 in Appendix A, when $\lVert \nabla f_i(w_k \mid x_i, y_i) \rVert$ is used as the utility function, there exists a threshold $T_k$ such that SGD with the instance shrinking update rule (3) converges as follows: $\mathbb{E}\left[\lVert w_k - w^* \rVert^2\right] \le \frac{L}{k}$ for some constant $L$, where $w^*$ is the optimal solution of (1).\n\nThe proof of Theorem 1 can be found in the Appendix. From Theorem 1, we can see that shrinking of instances with low utility during the training process does not hinder the theoretical convergence of SGD for strongly convex problems.\n\n3.2 Practical Consideration: Computational Overhead\n\nAs mentioned earlier, computational overhead is one of the major considerations for a shrinking strategy to be effective in terms of acceleration of the training process. In Figure 1, we show the results of various acceleration techniques for SGD on the linear SVM problem with two datasets: news20 and rcv1. To further demonstrate the consequence of overhead in practical effectiveness, we also include the importance sampling strategy in our comparison. The theoretical benefit of importance sampling for SGD has been extensively studied in the literature [24, 31, 1]. In particular, it is well known that the optimal distribution for the importance sampling strategy is that each instance be sampled with a probability proportional to the norm of the gradient of this instance given the current model $w_k$. To have a fair comparison, we implement Pegasos [27] as our plain SGD algorithm, the shrinking strategy with the loss and gradient norm as the utility function, and the importance sampling strategy with the exact optimal distribution in C++. From the top part of Figure 1, we can see that instance shrinking with both utility choices and importance sampling yield faster convergence than plain SGD in terms of the number of updates. 
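Both candidate utilities touch only the single instance $(x_i, y_i)$ and cost $O(d)$ per evaluation. As a hedged illustration (using the logistic loss as the convex example rather than the paper's SVM experiments), the two candidates might look like:

```python
import numpy as np

# Two candidate utilities from Section 3.1, written for the logistic loss
# f_i(w) = log(1 + exp(-y_i * w^T x_i)) with labels y in {-1, +1}.

def loss_utility(w, x, y):
    """Candidate 2: the per-instance loss f_i(w | x_i, y_i) itself."""
    return float(np.log1p(np.exp(-y * (w @ x))))

def grad_norm_utility(w, x, y):
    """Candidate 1: ||grad f_i(w | x_i, y_i)||. For the logistic loss the
    gradient is -sigma(-y w^T x) * y * x, so its norm is sigma * ||x||."""
    s = 1.0 / (1.0 + np.exp(y * (w @ x)))
    return s * float(np.linalg.norm(x))
```

Neither function looks at any other instance, which is exactly why their overhead stays small relative to a full pass over the dataset.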
However, in terms of the actual training time, from the bottom part of Figure 1 we can see that the importance sampling strategy is significantly slower than even plain SGD, due to the huge computational overhead of maintaining the exact sampling distribution that leads to the optimal theoretical convergence. On the other hand, our shrinking strategies, with a very lightweight extra overhead, show improvement over plain SGD in terms of training time.\n\nIt is not hard to see why the improvement in Figure 1 is almost negligible for the shrinking strategies. Due to the simplicity of the linear SVM and the choice of utility function (loss in this case), the time saved by the shrinking strategy is almost the same as the overhead introduced by the computation of the utility function. Based on these observations, we see that the opportunity for a shrinking strategy to be effective in accelerating the training of complicated DNN models lies in designing a utility function whose overhead is significantly lower than the computation involved in a single SGD update.\n\n4 AutoAssist: Training DNNs with an Automatic Assistant\n\nInspired by the observations in Section 3, given that a single SGD update for a DNN model is very time-consuming, we believe that there is an opportunity for a properly designed shrinking strategy to accelerate the training of a DNN model effectively. Note that in a typical SGD training process for a DNN model, there are three major components: a batch generator, which collects a batch of instances to perform a (mini-)batch stochastic gradient update; a forward pass (FP) on the DNN model to evaluate the loss values on a given batch of instances; and a backward pass (BP) on the DNN model to compute the aggregated gradient for this batch so that an SGD update can be performed. The major computation cost comes from the FP and BP phases, which usually require powerful hardware such as a GPU to perform the computation efficiently. 
This indicates that if we can skip the FP and BP computations for instances with relatively lower utility with respect to the current model, a significant amount of computation time can be saved. Thus, in AutoAssist, we propose to design an Assistant to accelerate the training of a DNN model, which we refer to as the Boss model from now on. The Assistant is a special batch generator which implements a utility-aware instance shrinking strategy.\n\n4.1 A Lightweight Assistant Grows with the Boss Dynamically\n\nTo design an effective Assistant, we need to take the same two considerations of the shrinking strategy into account: on one hand, the Assistant should be aware of the latest capability of the Boss model so that an accurate shrinking strategy can be used; on the other hand, the Assistant should be lightweight so that the overhead is as low as possible. Due to the fact that extracting the per-instance loss in most modern implementations of batch SGD is significantly easier than the per-instance gradient, we consider using only the loss to gauge the utility of a given instance for the current Boss. However, unlike the simple linear SVM model, even the forward pass in DNN training to compute the loss is very time consuming. Thus, instead of exact loss computation, we should design a lightweight Assistant model to estimate instance utility. For most applications where DNNs are applied, there exist many traditional, simpler ML models which still yield reasonable performance. These “shallow counterparts” of the DNN model are good candidates to approximate the chosen utility function in an efficient and accurate manner. 
In Section 5, we will see that even with a simple linear model our Assistant is able to reduce training time significantly for real-world applications such as image classification and neural machine translation.\n\nIn AutoAssist, we use $\theta$ to denote the parameters of the shallow Assistant model, and use $g(\cdot \mid \theta)$ to denote the approximate instance utility for the current Boss model. In particular, $g(\cdot)$ is designed to model the following probability:\n\n$g(\phi(x_i, y_i) \mid \theta) \approx P[\mathrm{utility}(x_i, y_i \mid w_k) \ge T_k],$ (4)\n\nwhere $\phi(x_i, y_i)$ is the feature vector used in the shallow model, $\mathrm{utility}(x_i, y_i \mid w_k)$ is the loss of the i-th instance under the current Boss model $w_k$, and $T_k$ is a threshold used to determine whether the marginal utility of the i-th instance is large enough to include it in the mini-batch. Note that $T_k$ also changes during the entire training phase to adapt to the dynamically changing Boss model. In particular, we propose $T_k$ to be an exponential moving average of the loss of instances updated recently by the Boss.\n\nAlgorithm 1 Assistant: Utility-Aware Batch Generator\n1: Input: dataset $\{x_i, y_i\}_{i=1}^{N}$, $\rho_k$\n2: Output: batch $B_k \subset \{1, \ldots, N\}$\n3: Initialize: $B_k \leftarrow \{\}$\n4: while $|B_k| <$ batch_size do\n5:   $i \sim \mathrm{uniformInt}(N)$\n6:   $r_1 \sim \mathrm{uniform}(0, 1)$\n7:   if $r_1 < 1 - \rho_k$ then\n8:     $B_k \leftarrow B_k \cup \{i\}$\n9:   else\n10:     $r_2 \sim \mathrm{uniform}(0, 1)$\n11:     if $r_2 < g(\phi_i \mid \theta)$ then\n12:       $B_k \leftarrow B_k \cup \{i\}$\nreturn $B_k$\n\nFigure 2: The sequential training scheme on the left wastes CPU/GPU cycles, while the accelerated training process with AutoAssist, shown on the right, asynchronously trains both the Boss and the Assistant, leading to a more efficient use of computational resources. (AQ denotes AssistantQueue, while BQ denotes BossQueue.)\n\n
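Algorithm 1 is essentially a short rejection loop. A faithful sketch in plain Python (the shrinking predicate `g` and the pass rate `rho_k` are supplied by the caller; this is an illustration, not the released implementation) might be:

```python
import random

def assistant_batch(N, rho_k, g, batch_size):
    """Sketch of Algorithm 1: draw uniform candidates; with probability
    1 - rho_k keep a candidate unconditionally (the safeguard path), and
    otherwise keep it with the Assistant's predicted probability g(i)
    (the stochastic shrinking path)."""
    batch = []
    while len(batch) < batch_size:
        i = random.randrange(N)           # line 5: uniform candidate
        if random.random() < 1.0 - rho_k:
            batch.append(i)               # lines 7-8: safeguard path
        elif random.random() < g(i):
            batch.append(i)               # lines 10-12: stochastic shrinking
    return batch
```

With `rho_k = 0` the generator degenerates to a plain uniform sampler, and as `rho_k` grows toward 1 the Assistant's probabilities take over.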
There are three reasons why we choose to model the probability of the binary event instead of the instance loss directly: a) the range of the instance loss varies depending on many factors, such as the choice of the loss function and training instance outliers; thus, to increase the robustness of the Assistant, we choose to model (4) instead; b) shallow models usually have a limited capacity to approximate the exact loss of a given instance for a DNN model; c) with the probability output of $g(\cdot)$, the Assistant can perform a “stochastic” instance shrinking strategy to avoid the situation that instances which always have a lower predicted probability are never seen by the Boss.\n\nIn order for the Assistant to know the latest capability of the Boss model, we propose a very lightweight approach to collect the latest information about the Boss. In particular, after the forward pass of a batch $B$, we collect the actual loss value (i.e., our utility function) of each instance $(x_i, y_i)$ to form a binary classification dataset $\{(\phi_i, z_i) : i \in B\}$, where $\phi_i = \phi(x_i, y_i)$ is the feature vector, and $z_i := \mathbb{I}[\mathrm{utility}(x_i, y_i \mid w_k) > T_k]$, where $\mathbb{I}[\cdot]$ is the indicator function, is the supervision providing the latest information about the current Boss model. To keep the Assistant up-to-date with the Boss, we update the parameters of the Assistant model:\n\n$\theta \leftarrow (1 - \lambda)\theta - \eta \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell_{CE}(z_i, g(\phi_i \mid \theta)),$ (5)\n\nwhere $\ell_{CE}$ is the cross entropy loss, $\eta$ is a fixed learning rate, and $\lambda$ is the weight decay factor.\n\nTo handle the situation where the Assistant model has not yet learned the capability of the Boss, we propose a simple mechanism to control the rate of instances, denoted by $\rho_k \in [0, 1]$, to be passed to the stochastic instance shrinker defined by the Assistant model. In particular, in the early stage of training, we set $\rho_0 = 0$ so that the Assistant includes all instances in the batch without any shrinking operation. 
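Putting the supervision $z_i$, the update (5), and the moving-average threshold together, one plausible NumPy sketch for a logistic-regression Assistant is the following (the learning rate, decay, and EMA constants are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def update_assistant(theta, feats, losses, T_k, lr=0.1, decay=1e-4, ema=0.99):
    """Sketch of the Assistant update (5): after the Boss's forward pass on a
    batch, per-instance losses become binary labels z_i = I[loss_i > T_k],
    and theta takes one weight-decayed cross-entropy SGD step. T_k itself is
    maintained as an exponential moving average of recently observed losses."""
    z = (losses > T_k).astype(float)           # supervision from the Boss
    p = 1.0 / (1.0 + np.exp(-feats @ theta))   # g(phi_i | theta), sigmoid
    grad = feats.T @ (p - z) / len(z)          # mean cross-entropy gradient
    theta = (1.0 - decay) * theta - lr * grad  # update rule (5)
    T_k = ema * T_k + (1.0 - ema) * float(losses.mean())
    return theta, T_k
```

One step with a high-loss and a low-loss instance pushes the predicted probabilities toward 1 and 0 respectively, while the threshold drifts toward the batch's mean loss.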
For the Assistant, $\rho_k$ acts as a safeguard which reflects the confidence of the current model $g(\cdot \mid \theta)$ in predicting a correct shrinking probability. The better the Assistant model performs, the higher the value $\rho_k$ is set to. In particular, $\rho_k$ is dynamically set to an exponential running average over the observed empirical accuracy of the Assistant model. In Algorithm 1, we describe the utility-aware batch generation performed by our Assistant.\n\nConnections to Existing Curriculum-based Approaches. The concept of the utility of an instance in AutoAssist could be viewed as a machine-learned curriculum for the current Boss. In contrast to our approach, existing curriculum-based approaches are not capable of evolving with the Boss model in a timely manner. For example, Self-Paced Learning (SPL) only updates its curriculum (or self-pace) after one full epoch over the dataset. ScreenerNet [17] is another attempt to dynamically learn a curriculum with an auxiliary deep neural network, which requires additional GPU cycles to train the auxiliary DNN, causing significant overhead. We compare against these two methods in Section 5.\n\n4.2 An Asynchronous Computational Scheme for Joint Learning of Boss and Assistant\n\nIn traditional batch SGD training for a DNN model, as depicted in the left part of Figure 2, there is an interleaving of batch generation done on a CPU and FP/BP done on a GPU. Due to the simple logic of most existing batch generators, batch generation takes a minimal number of CPU cycles, which causes a lengthy idle period for the CPU. The Assistant in our AutoAssist framework needs to perform instance shrinking in addition to updating the shrinking model to keep pace with the Boss. To reduce the overhead, we propose an asynchronous computational scheme to fully utilize the available CPU and GPU. In particular, we maintain two concurrent queues to store the batches required for the Boss and the Assistant respectively:\n\n$\mathrm{BossQueue} = \{\ldots, B^s, \ldots\}$ and $\mathrm{AssistantQueue} = \{\ldots, M^t, \ldots\},$ (6)\n\nwhere each $B^s$ is a batch of instance indices and each $M^t = \{(i, \mathrm{utility}(x_i, y_i \mid w_k)) : i \in B^s\}$ is a batch of pairs of an instance index and the corresponding loss value (the utility function chosen in AutoAssist) evaluated from the forward pass of the recent Boss model on a batch $B^s$. With the help of these two concurrent queues, we can design the following computational scheme. Each GPU worker performing Boss updates first dequeues a batch $B^s$ from the BossQueue, performs the forward computation, enqueues the $M^t$ containing the loss values along with the instance indices to the AssistantQueue, and then conducts the backward computation to perform the SGD update on the parameters of the Boss model. On the other hand, each CPU worker performing Assistant updates proceeds as follows: whenever one of the queues is empty, it generates and enqueues a new batch to the BossQueue; otherwise, it dequeues an $M^t$ from the AssistantQueue and forms the binary dataset from $M^t$ to perform the update (5). An illustration of the scheme is shown in the right part of Figure 2. It is not hard to see that both the CPU and GPU are utilized in our scheme, and the only overhead introduced is the step of collecting the loss values and enqueuing the corresponding $M^t$, which is an operation with minimal computational cost.\n\n5 Experimental Results\n\nTo demonstrate the effectiveness of AutoAssist in training large-scale DNNs, we conduct experiments³ on two applications where DNNs have been successful: image classification and neural machine translation.\n\n5.1 Image Classification\n\nDatasets and DNN models. We consider the MNIST [21], rotated MNIST⁴, CIFAR10 [18] and raw ImageNet [7] datasets. The dataset statistics are presented in Table 1 of the Appendix. For the DNN models, we consider the popular ResNets with a varied number of layers (18, 34, and 101 layers).\n\nExperimental Setting. 
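The two-queue scheme can be mimicked with ordinary thread-safe queues. The sketch below is a schematic stand-in, where `boss_step`, `make_batch`, and `assistant_step` are hypothetical placeholders for the real FP/BP, Algorithm 1, and update (5); it is not the Fairseq-based implementation:

```python
import queue
import threading

boss_q = queue.Queue(maxsize=8)       # BossQueue: batches waiting for the GPU
assistant_q = queue.Queue(maxsize=8)  # AssistantQueue: (index, loss) feedback

def gpu_worker(boss_step, num_batches):
    """Boss loop: dequeue a batch, run FP/BP (boss_step returns per-instance
    losses from the forward pass), and enqueue (index, loss) feedback."""
    for _ in range(num_batches):
        batch = boss_q.get()
        losses = boss_step(batch)
        assistant_q.put(list(zip(batch, losses)))

def cpu_worker(make_batch, assistant_step, stop):
    """Assistant loop: keep the BossQueue full; otherwise consume loss
    feedback to refresh the shrinking model via update (5)."""
    while not stop.is_set():
        if not boss_q.full():
            boss_q.put(make_batch())
        else:
            try:
                assistant_step(assistant_q.get(timeout=0.01))
            except queue.Empty:
                pass
```

The GPU worker never waits on the Assistant: its only extra work is the enqueue of the loss feedback, which matches the minimal-overhead claim above.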
Following [11], we use SGD with momentum as the optimizer. The detailed parameter settings are listed in the Appendix. Three acceleration methods and the SGD baseline are included in our comparison:\n\n• AutoAssist: L2-regularized logistic regression as our Assistant model, where the stacked pixel values of the raw image are used as the feature vector, except for ImageNet, where low-resolution images are used as the feature vector.\n\n• SGD baseline: vanilla stochastic gradient descent with momentum.\n\n• Self-Paced Learning (SPL): we implement the same self-paced scheme as in [22].\n\n• ScreenerNet: we use the same screener structure and settings described in [17].\n\nExperimental results, which include the training loss versus training time for each model and dataset combination, are shown in Figure 3. As a sanity check, we also verified that AutoAssist reaches the same accuracy as the SGD baseline. It can be observed that AutoAssist outperforms the other competing approaches in all (model, dataset) combinations in terms of training time. The Assistant model is able to reach 80% to 90% shrinking accuracy even with a simple linear logistic regression model, as shown in Figure 6 in the Appendix. It can be seen that AutoAssist yields effective SGD acceleration compared to other approaches.\n\n5.2 Neural Machine Translation\n\nDatasets and DNN models. We consider the widely used WMT14 English-to-German dataset, which contains 4M sentence pairs. We constructed source and target vocabularies of size 42k and 40k. In\n\n³The code is available at https://github.com/zhangjiong724/autoassist-exp\n⁴Constructed by randomly rotating each image in the original MNIST within ±90 degrees.\n\nFigure 3: Comparison of various training schemes on image classification. The X-axis is the training time in seconds, while the Y-axis is the training loss. ResNets with a varied number of layers ranging from {18, 34, 101} are considered. 
Each column of figures shows the ResNet results with a specific number of layers, while each row of figures shows results on a specific dataset.\n\nFigure 4: Comparison of various training schemes on neural machine translation. The X-axis is the training time, while the Y-axis is the training perplexity. Four common sequence-to-sequence models are considered: LSTM, ConvS2S, and Transformer (base and big). We use 8 CPUs for the Assistant and 8 GPUs for the Boss.\n\nterms of the DNN models for NMT, we consider four popular deep sequence models: LSTM [30], ConvS2S [10], and the Transformer base and big models [29].\n\nExperimental Setting. We implement AutoAssist with the asynchronous update mechanism described in Section 4.2 under the Fairseq [25] codebase. In particular, we enable multiple CPUs for multiple Assistant updates and multiple GPUs for Boss training. We use 8 Nvidia V100 GPUs for Boss training and stop after training on 6 billion tokens. As the ScreenerNet described in [17] cannot be trivially extended to the NMT task, we exclude ScreenerNet from our comparison.\n\n• AutoAssist: L2-regularized logistic regression is used as our Assistant model, where the term-frequency / inverse-document-frequency [26] of each source-target pair is used as the feature vector. The TF/IDF features are computed during preprocessing to reduce overhead. 
Note that although the Boss model is trained in a data-parallel fashion (gradients are synchronized after each back-propagation), the Assistant is updated in an asynchronous manner as described in Section 4.2. In order to generate batches to train the Boss model with p GPUs, p Assistant models are created so that batch generation can be done asynchronously for each GPU worker.
• SGD baseline: vanilla stochastic gradient descent with momentum.
• Self-Paced Learning (SPL): we implement the same self-paced scheme as in [22].
The experimental results are shown in Figure 4, which plots the number of tokens / training time versus training perplexity for each DNN model. As with image classification, we observe that AutoAssist outperforms the other training schemes in all our experiments on neural machine translation. In particular, AutoAssist is able to save around 40% of the tokens per epoch and achieves better final BLEU scores than the baseline. Training statistics and the final BLEU scores are listed in Table 2 in the Appendix.

6 Conclusions

In this paper, we propose a training framework to accelerate deep learning model training. The proposed AutoAssist framework jointly trains a batch generator (the Assistant) along with the main deep learning model (the Boss). The Assistant model performs instance shrinking to discard trivial instances during training and can automatically adjust its criteria based on the ability of the Boss. We further propose a method to reduce the computational overhead by training the Assistant and the Boss asynchronously on CPU/GPU, and extend this framework to multi-GPU/CPU settings.
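As a concrete illustration, the Assistant's role can be sketched with a plain logistic-regression scorer that learns to predict which instances still incur a large Boss loss, and drops the rest from the mini-batch. This is a minimal synchronous NumPy sketch, not the paper's implementation; the class name, hyperparameters, and the `keep_floor` safeguard are all illustrative assumptions:

```python
import numpy as np

class Assistant:
    """Minimal L2-regularized logistic-regression Assistant (illustrative).

    It scores each instance's probability of being "informative" (i.e. its
    current Boss loss exceeds a threshold); trivial instances are then
    dropped from the mini-batch with high probability.
    """

    def __init__(self, dim, lr=0.1, l2=1e-4, keep_floor=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr, self.l2 = lr, l2
        self.keep_floor = keep_floor  # always keep a small random fraction

    def score(self, X):
        # Predicted probability that each instance is still informative.
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def shrink(self, X, rng):
        # Keep instance i with probability max(score_i, keep_floor);
        # return the indices of the surviving instances.
        keep = rng.random(len(X)) < np.maximum(self.score(X), self.keep_floor)
        return np.flatnonzero(keep)

    def update(self, X, losses, threshold):
        # One SGD step toward the binary target 1[loss > threshold],
        # fit on the Boss losses observed for these instances.
        y = (losses > threshold).astype(float)
        g = self.score(X) - y  # gradient of the logistic loss w.r.t. logits
        self.w -= self.lr * (X.T @ g / len(X) + self.l2 * self.w)
        self.b -= self.lr * g.mean()
```

In the full framework the `update` calls run asynchronously on CPU while the Boss trains on GPU; here they would simply be interleaved with Boss steps for clarity.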
Experimental results demonstrate that both convergence speed and training time are improved by our Assistant model.

Acknowledgement This research was supported by NSF grants IIS-1546452 and CCF-1564000, and the AWS Cloud Credits for Research program.

References

[1] Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.

[2] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.

[3] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.

[4] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[5] Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, pages 1002–1012, 2017.

[6] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[8] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification.
Journal of Machine Learning Research, 9:1871–1874, 2008.

[9] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. arXiv preprint arXiv:1805.03643, 2018.

[10] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1243–1252. JMLR.org, 2017.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[12] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408–415. ACM, 2008.

[13] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann. Self-paced curriculum learning. In AAAI, 2015.

[14] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.

[15] Thorsten Joachims. Making large-scale SVM learning practical. In Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184, Cambridge, MA, 1998. MIT Press.

[16] Angelos Katharopoulos and François Fleuret. Biased importance sampling for deep neural network training. arXiv preprint arXiv:1706.00043, 2017.

[17] Tae-Hoon Kim and Jonghyun Choi. ScreenerNet: Learning curriculum for neural networks. arXiv preprint arXiv:1801.00904, 2018.

[18] Alex Krizhevsky. Learning multiple layers of features from tiny images.
Technical report, Citeseer, 2009.

[19] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.

[20] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42:11–24, 2014.

[21] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[22] Hao Li and Maoguo Gong. Self-paced convolutional neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence, 2017.

[23] Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.

[24] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.

[25] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. Fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

[26] Stephen Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.

[27] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

[28] Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Brian MacWhinney, and Chris Dyer. Learning the curriculum with Bayesian optimization for task-specific word representation learning.
arXiv preprint arXiv:1605.03852, 2016.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[30] Sam Wiseman and Alexander M Rush. Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960, 2016.

[31] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pages 1–9, 2015.