{"title": "E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings", "book": "Advances in Neural Information Processing Systems", "page_first": 5138, "page_last": 5150, "abstract": "Convolutional neural networks (CNNs) have been increasingly deployed to edge devices. Hence, many efforts have been made towards efficient CNN inference on resource-constrained platforms. This paper attempts to explore an orthogonal direction: how to conduct more energy-efficient training of CNNs, so as to enable on-device training? We strive to reduce the energy cost during training, by dropping unnecessary computations, from three complementary levels: stochastic mini-batch dropping on the data level; selective layer update on the model level; and sign prediction for low-cost, low-precision back-propagation, on the algorithm level. Extensive simulations and ablation studies, with real energy measurements from an FPGA board, confirm the superiority of our proposed strategies and demonstrate remarkable energy savings for training. For example, when training ResNet-74 on CIFAR-10, we achieve aggressive energy savings of >90% and >60%, while incurring a top-1 accuracy loss of only about 2% and 1.2%, respectively. When training ResNet-110 on CIFAR-100, an over 84% training energy saving is achieved without degrading inference accuracy.", "full_text": "E2-Train: Training State-of-the-art CNNs with Over\n\n80% Less Energy\n\nYue Wang(cid:5),\u2217, Ziyu Jiang\u2020,\u2217, Xiaohan Chen\u2020,\u2217, Pengfei Xu(cid:5), Yang Zhao(cid:5),\n\nYingyan Lin(cid:5) and Zhangyang Wang\u2020\n\n\u2020Department of Computer Science and Engineering, Texas A&M University\n\n(cid:5)Department of Electrical and Computer Engineering, Rice University\n\n\u2020{jiangziyu, chernxh, atlaswang}@tamu.edu\n(cid:5){yw68, px5, zy34, yingyan.lin}@rice.edu\nhttps://rtml.eiclab.net/?page_id=120\n\nAbstract\n\nConvolutional neural networks (CNNs) have been increasingly deployed to edge\ndevices. 
Hence, many efforts have been made towards efficient CNN inference on resource-constrained platforms. This paper attempts to explore an orthogonal direction: how to conduct more energy-efficient training of CNNs, so as to enable on-device training? We strive to reduce the energy cost during training, by dropping unnecessary computations, from three complementary levels: stochastic mini-batch dropping on the data level; selective layer update on the model level; and sign prediction for low-cost, low-precision back-propagation, on the algorithm level. Extensive simulations and ablation studies, with real energy measurements from an FPGA board, confirm the superiority of our proposed strategies and demonstrate remarkable energy savings for training. For example, when training ResNet-74 on CIFAR-10, we achieve aggressive energy savings of >90% and >60%, while incurring a top-1 accuracy loss of only about 2% and 1.2%, respectively. When training ResNet-110 on CIFAR-100, an over 84% training energy saving is achieved without degrading inference accuracy.

1 Introduction
The increasing penetration of intelligent sensors has revolutionized how the Internet of Things (IoT) works. For visual data analytics, we have witnessed the record-breaking predictive performance achieved by convolutional neural networks (CNNs) [1, 2, 3]. Although such high-performance CNN models are initially learned in data centers and then deployed to IoT devices, there is an increasing need for the models to continue learning and updating themselves in situ, for example for personalization to different users, or for incremental/lifelong learning. Ideally, this learning/retraining process should take place on device. 
Compared to cloud-based retraining, training locally avoids transferring data back and forth between data centers and IoT devices, reduces communication cost/latency, and enhances privacy.
However, training on IoT devices is non-trivial: it is far more resource-consuming, yet much less explored, than inference. IoT devices, such as smartphones and wearables, have limited computation and energy resources, which are stringent even for inference. Training a CNN requires orders of magnitude more computation than a single inference. For example, training ResNet-50 on just one 224 × 224 image can take up to 12 GFLOPs (vs. 4 GFLOPs for inference), which can easily drain a mobile phone battery when training with batches of images [4]. This mismatch between the limited resources of IoT devices and the high complexity of CNNs is only getting worse, because network structures are becoming more complex as they are designed to solve harder and larger-scale tasks [5].
*The first three authors (Yue Wang, Ziyu Jiang, Xiaohan Chen) contributed equally.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This paper considers the most standard CNN training setting, assuming both the model structure and the dataset to be given. This "basic" training setting is not usually the realistic IoT case, but we address it as a starting point (with familiar benchmarks), and an opening door towards a toolbox that may later be extended to online/transfer learning as well (see Section 5). Our goal is to reduce the total energy cost of training, which is complicated by a myriad of factors: from per-sample (mini-batch) complexity (both feed-forward and backward computations), to the empirical convergence rate (how many epochs it takes to converge), and, more broadly, hardware/architecture factors such as data access and movement [6, 7]. 
While a handful of works address efficient, accelerated CNN training [8, 9, 10, 11, 12], they mostly focus on reducing the total training time in resource-rich settings, such as distributed training on large-scale GPU clusters. In contrast, our focus is to trim down the total energy cost for in-situ, resource-constrained training. It represents an orthogonal (and less studied) direction to [8, 9, 10, 11, 12, 13, 14], although the two can certainly be combined.
To unleash the potential of more energy-efficient in-situ training, we look closely at the full CNN training lifecycle. With the goal of "squeezing out" unnecessary costs, we raise three curious questions:
• Q1: Are all samples always required throughout training: is it necessary to use all training samples in all epochs?
• Q2: Are all parts of the entire model equally important during training: does every layer or filter have to be updated every time?
• Q3: Are precise gradients indispensable for training: can we efficiently compute and update the model with approximate gradients?
The above three questions only represent our "first stab" ideas for exploring energy-efficient training, whose full scope is much more profound. By no means do these questions cover all possible directions. We envision that many other recipes can be blended in too, such as training at lower bit precision or input resolution [15, 16]. We also recognize that energy-efficient CNN training should be jointly considered with hardware/architecture co-design [17, 18], which is beyond the current work.
Motivated by the above questions, this paper proposes a novel energy-efficient CNN training framework dubbed E2-Train. It consists of three complementary efforts to trim down unnecessary training computations and data movements, each addressing one of the above questions:
• Data-Level: Stochastic mini-batch dropping (SMD). 
We show that CNN training can be accelerated by a "frustratingly easy" strategy: randomly skipping mini-batches with probability 0.5 throughout training. This can be interpreted as data sampling with (limited) replacement, and is found to incur minimal accuracy loss (sometimes even a gain).

• Model-Level: Input-dependent selective layer update (SLU). For each mini-batch, we select a different subset of CNN layers to be updated. The input-adaptive selection is based on a low-cost gating function jointly learned during training. While similar ideas were explored for efficient inference [19], this is the first time the idea is applied to and evaluated for training.

• Algorithm-Level: Predictive sign gradient descent (PSG). We explore the usage of an extremely low-precision gradient descent algorithm called SignSGD, which has recently found both theoretical and experimental grounds [20]. The original algorithm still requires the full gradient computation and therefore does not save energy. We create a novel "predictive" variant that obtains the sign without computing the full gradient, via low-cost, bit-level prediction. Combined with a mixed-precision design, it decreases both computation and data-movement costs.

Besides mainly experimental explorations, we find E2-Train has many interesting links to recent CNN training theories, e.g., [21, 22, 23, 24]. We evaluate E2-Train in comparison with its closest state-of-the-art competitors. To measure its actual performance, E2-Train is also implemented and evaluated on an FPGA board. The results show that a CNN model trained with E2-Train consistently achieves higher training energy efficiency with marginal accuracy drops.

2 Related Work
Accelerated CNN training. 
A number of works have been devoted to accelerating training in a resource-rich setting, by utilizing communication-efficient distributed optimization and larger mini-batch sizes [8, 9, 10, 11]. The latest work [12] combined distributed training with a mixed-precision framework, training AlexNet within 4 minutes. However, their goals and settings are distinct from ours: while distributed training can reduce wall-clock time, it actually incurs more total energy overhead, and is clearly not applicable to on-device, resource-constrained training.

Low-precision training. It is well known that CNN training can be performed at substantially lower precision [15, 14, 13], rather than using full-precision floats. Specifically, training with quantized gradients has been well studied in distributed learning, where the main motivation is to reduce the communication cost of gradient aggregation between workers [25, 26, 27, 28, 29, 20]. A few works considered transmitting only the coordinates with large magnitudes [30, 31, 32]. Recently, the SignSGD algorithm [25, 20] even showed the feasibility of using one-bit gradients (signs) during training, without notably hampering the convergence rate or final result. However, most of these algorithms are optimized for distributed communication efficiency, rather than for reducing training energy costs. Many of them, including [20], need to first compute full-precision gradients and then quantize them.

Efficient CNN inference: Static and Dynamic. Compressing CNNs and speeding up their inference have attracted major research interest in recent years. 
Representative methods include weight pruning, weight sharing, layer factorization, and bit quantization, to name just a few [33, 34, 35, 36, 37]. While model compression presents "static" solutions for improving inference efficiency, a more interesting recent trend looks at dynamic inference [19, 38, 39, 40, 41] to reduce latency, i.e., selectively executing subsets of layers in the network conditioned on each input. This sequential decision-making process is usually controlled by low-cost gating or policy networks. The mechanism has also been applied to improve inference energy efficiency [42, 43].
In [44], a unique bit-level prediction framework called PredictiveNet was presented to accelerate CNN inference at a lower level. Since CNN layer-wise activations are usually highly sparse, the authors proposed to predict the zero locations using low-cost bit predictors, thereby bypassing a large fraction of energy-dominant convolutions without modifying the CNN structure.
Energy-efficient training is different from, and more complicated than, its inference counterpart. However, many insights gained from the latter can be lent to the former. For example, the recent work [45] showed that performing active channel pruning during training can accelerate empirical convergence. Our proposed model-level SLU is inspired by [19]. The algorithm-level PSG likewise inherits the idea of bit-level low-cost prediction from [44].

3 The Proposed Framework

3.1 Data-Level: Stochastic mini-batch dropping (SMD)

We first adopt a straightforward, seemingly naive, yet surprisingly effective stochastic mini-batch dropping (SMD) strategy (see Fig. 1) to aggressively reduce the training cost by letting training see fewer mini-batches. At each epoch, SMD simply skips every mini-batch with a default probability of 0.5. All other training protocols, such as the learning rate schedule, remain unchanged. 
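As a minimal sketch of this strategy (where `run_step` and `loader` are hypothetical placeholders for one SGD update and the mini-batch stream, not names from the paper), the whole scheme amounts to a single coin flip per mini-batch:

```python
import random

def train_with_smd(run_step, loader, epochs, drop_prob=0.5, seed=0):
    """Stochastic mini-batch dropping (SMD): skip each mini-batch with
    probability `drop_prob`; everything else (learning-rate schedule,
    optimizer, etc.) is left unchanged."""
    rng = random.Random(seed)
    steps = 0
    for _ in range(epochs):
        for batch in loader:
            if rng.random() < drop_prob:
                continue  # skipped: no forward pass, no backward pass
            run_step(batch)  # one ordinary SGD update on this mini-batch
            steps += 1
    return steps  # number of mini-batches actually processed
```

With drop_prob = 0.5, roughly half of the mini-batches are processed, which is why SMD directly halves the per-epoch training cost.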
Compared to normal training, SMD directly halves the training cost if both are run for the same number of epochs. Yet, surprisingly, we observe in our experiments that SMD usually leads to a negligible accuracy decrease, and sometimes even an increase (see Sec. 4). Why? We discuss possible explanations below.
SMD can be interpreted as sampling with limited replacement. To understand this, think of combining two consecutive SMD-enforced epochs into one; it then has the same number of mini-batches as one full epoch, but within it each training sample now has probability 0.25, 0.5, and 0.25 of being sampled 2, 1, and 0 times, respectively. The conventional wisdom is that for stochastic gradient descent (SGD), in each epoch, the mini-batches are sampled i.i.d. from the data without replacement (i.e., each sample occurs exactly once per epoch) [46, 47, 48, 49, 50]. However, [21] proved that sampling mini-batches with replacement has a larger variance than sampling without replacement, and consequently SGD may have better regularization properties.
Alternatively, SMD can also be viewed as a special form of data augmentation that injects more sampling noise to perturb the training distribution every epoch. Past works [51, 52, 53] have shown that specific kinds of random noise aid convergence by helping escape saddle points or poorly generalizing minima. The structured sampling noise caused by SMD might aid this exploration. Besides, [22, 54, 55] also showed that an importance sampling scheme that focuses training on more "informative" examples leads to faster convergence under resource budgets. 
They implied that mini-batch dropping could be made selective, based on some information criterion, rather than stochastic. We use SMD because it has zero overhead, but more effective dropping options might be available if low-cost indicators of mini-batch importance can be identified: we leave this as future work.

Figure 1: Illustration of the proposed framework. SLU: each blue circle G indicates an RNN gate and each blue square under G indicates one block of layers in the base model. Green arrows denote backward propagation. To reduce the training cost, the RNN gates generate strategies to select which layers to train for each input. In this specific example, the second and fourth blocks are "skipped" for both feed-forward and backward computations; only the first and third blocks are updated. SMD and PSG: details are described in the main text.

3.2 Model-Level: Input-dependent selective layer update (SLU)
[19] proposed to dynamically skip a subset of layers for different inputs, in order to adaptively accelerate feed-forward inference. However, [19] required a post-processing step after supervised training, i.e., refining the dynamic skipping policy via reinforcement learning, thus causing undesired extra training overhead. We propose to extend the idea of dynamic inference to the training stage, i.e., to dynamically skip a subset of layers during both feed-forward and back-propagation.
Crucially, we show that by adding an auxiliary regularization, such dynamic skipping can be learned from scratch with satisfactory performance: no post-refinement or extra training iterations are required. That is critical for dynamic layer skipping to be useful for energy-efficient training; we term this extended scheme input-dependent selective layer update (SLU).
As depicted in Fig. 1, given a base CNN to be trained, we follow [19] and add a light-weight RNN gating network per layer block. 
Each gate takes the same input as its corresponding layer and outputs a soft gating value in [0, 1] for that layer, which is then used as the selection probability: the higher the value, the more likely that layer will be selected. Therefore, each layer is adaptively selected or skipped, depending on the input, and only the layers activated by their gates are updated. These RNN gates cost less than 0.04% of the base models' feed-forward FLOPs; hence their energy overheads are negligible. More details can be found in the supplementary material.
[19] first trained the gates in a supervised way together with the base model. Observing that such learned routing policies were often not sufficiently efficient, they then used reinforcement-learning post-processing to learn more aggressive skipping. While this is fine for the end goal of dynamic inference, we hope to get rid of the post-processing overhead. To overcome this hurdle, we incorporate a computational-complexity regularization into the objective function, defined as

min_{W,G} L(W, G) + αC(W, G),    (1)

where α is the weighting coefficient of the computational-complexity regularization, and W and G denote the parameters of the base model and the gating network, respectively. L(W, G) denotes the prediction loss, and C(W, G) is calculated by accumulating the computational cost (FLOPs) of the layers that are selected. The regularization explicitly encourages learning more "parsimonious" selections throughout training. We find that such SLU-regularized training takes almost the same number of epochs to converge as standard training, i.e., SLU does not sacrifice empirical convergence speed. As a side effect, SLU naturally yields CNNs with dynamic inference capability. 
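A minimal sketch of the regularized objective in Eq. (1), with hypothetical per-block gate probabilities and FLOP counts (the exact form of C and the value of α here are illustrative assumptions, not prescribed by the text):

```python
def slu_objective(pred_loss, gate_probs, block_flops, alpha=1e-9):
    """Eq. (1) sketch: prediction loss plus a computational-cost regularizer.

    gate_probs[i]  -- the gate's soft selection probability for block i
    block_flops[i] -- the FLOP count of block i
    The expected cost C = sum_i gate_probs[i] * block_flops[i] penalizes
    selecting expensive blocks, encouraging parsimonious selections.
    """
    cost = sum(p * f for p, f in zip(gate_probs, block_flops))
    return pred_loss + alpha * cost
```

Minimizing this jointly over W (through the prediction loss) and G (through both terms) trades accuracy against the number of blocks updated per input.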
Though not the focus of this paper, we find that a CNN trained with SLU reaches an accuracy-efficiency trade-off comparable to one trained with the approach in [19].
The practice of SLU seems to align with several recent theories on CNN training. In [56], the authors suggested that "not all layers are created equal" for training. Specifically, some layers are critical to be intensively updated for improving final predictions, while others are insensitive throughout training. There exist "non-critical" layers that barely change their weights throughout training: even resetting those layers in a trained model to their initial values has few negative consequences. The more recent work [24] further confirmed this phenomenon, though how to identify the non-critical model parts at an early training stage remains unclear. [57, 58] also observed that different samples might activate different sub-models. These inspiring theories, combined with the dynamic inference practice, motivate us to propose SLU for more efficient training.
3.3 Algorithm-Level: Predictive sign gradient descent (PSG)
It is well recognized that low-precision fixed-point implementation is a very effective knob for achieving energy-efficient CNNs, because both CNNs' computational and data-movement costs are approximately quadratic functions of the employed precision. For example, a state-of-the-art design [59] shows that adopting 8-bit precision for a multiplication, an addition, and a data movement can reduce the energy cost by 95%, 97%, and 75%, respectively, compared to a 32-bit floating-point design evaluated in a commercial 45nm CMOS technology.
The successful adoption of extremely low-precision (binary) gradients in SignSGD [20] is appealing, as it may reduce both weight-update computation and data movement. 
However, directly applying the original SignSGD algorithm for training will not save energy, because it actually computes the full-precision gradient first before taking the signs. We propose a novel predictive sign gradient descent (PSG) algorithm, which predicts the sign of the gradients using low-cost bit-level predictors, thereby completely bypassing the costly full-gradient computation.
We next introduce how the gradients of the weights are updated in PSG. Assume the following notations: the full-precision bit-widths of the input x and the gradient of the output g_y are denoted as B_x and B_gy, and the most-significant-bit (MSB) parts adopted by PSG's low-cost predictors as B_x^msb and B_gy^msb, where the corresponding input and output gradient used by PSG's predictors are denoted as x^msb and g_y^msb, respectively. As such, the quantization noises for the input and the output gradient are q_x = x − x^msb and q_gy = g_y − g_y^msb, respectively. Similarly, after back-propagation, we denote the full-precision and low-precision (i.e., MSB-based) gradients of the weights as g_w and g_w^msb, respectively, the latter of which is computed using x^msb and g_y^msb. Then, with an empirically pre-selected threshold τ, PSG updates the i-th weight gradient as follows:

g̃_w[i] = { sgn(g_w^msb[i]),  if |g_w^msb[i]| ≥ τ
         { sgn(g_w[i]),      otherwise           (2)

Note that in a hardware implementation, the computation to obtain g_w^msb is embedded within that of g_w. Therefore, PSG's predictors do not incur energy overhead.
PSG for energy-efficient training. Recent work [15] has shown that most of the training process is robust to reduced precision (e.g., 8 bits instead of 32 bits), except for the weight gradient calculations and updates. Following their findings, we similarly adopt a higher precision for the gradients than for the inputs and weights, i.e., B_gy > B_x = B_w. Specifically, when training with PSG, we first compute the predictors using B_x^msb (e.g., B_x^msb = 4) and B_gy^msb (e.g., B_gy^msb = 10), and then update the weights' gradients following Eq. (2). The further energy savings of training with PSG over fixed-point training [15] result from the fact that the predictors, computed using x^msb and g_y^msb, require exponentially less computational and data-movement energy.
Prediction guarantee of PSG. We analyze the probability of PSG's prediction failure to discuss its performance guarantee. Specifically, denoting a sign prediction failure produced by Eq. (2) as H, it can be proved that this probability is upper-bounded as follows:

P(H) ≤ Δ_x² E_1 + Δ_gy² E_2,    (3)

where Δ_x = 2^−(B_x^msb − 1) and Δ_gy = 2^−(B_gy^msb − 1) are the quantization noise step sizes of x^msb and g_y^msb, respectively. E_1 and E_2 are given in the Appendix along with the proof of Eq. (3). Eq. (3) shows that the prediction failure probability of PSG is upper-bounded by a term that decays exponentially with the precision assigned to the predictors, indicating that this failure probability can be very small if the predictors are designed properly.
Adaptive threshold. Training with PSG might lead to sign flips in the weight gradients as compared to those of floating-point training, which only occur when the latter have small magnitudes and thus the quantization noise of the predictors causes the sign flips. Therefore, it is important to properly select a threshold (i.e., τ in Eq. (2)) that can optimally balance this sign-flip probability and the achieved energy savings. 
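To make the role of the threshold concrete, here is a minimal per-coordinate sketch of the update rule in Eq. (2); `g_msb` and `full_grad` are hypothetical stand-ins for the MSB-based predictor g_w^msb and the full-precision gradient computation:

```python
def psg_signs(g_msb, full_grad, tau):
    """Eq. (2) sketch: predict the sign of each weight gradient.

    g_msb[i]     -- low-cost predictor g_w^msb[i], computed from the MSB
                    parts x^msb and g_y^msb
    full_grad(i) -- full-precision gradient g_w[i], only evaluated when the
                    predictor's magnitude falls below the threshold tau
    """
    signs = []
    for i, g in enumerate(g_msb):
        if abs(g) >= tau:
            signs.append(1 if g >= 0 else -1)  # trust the cheap predictor
        else:
            gf = full_grad(i)                  # fall back to full precision
            signs.append(1 if gf >= 0 else -1)
    return signs
```

The energy saving depends on how often the first branch fires; Sec. 4.4 reports that the predictor branch is taken at least about 60% of the time throughout training.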
We adopt an adaptive threshold selection strategy because the dynamic range of the gradients differs significantly from layer to layer: instead of using a fixed number, we tune a ratio β ∈ (0, 1), which yields the adaptive threshold τ̃ = β max_i{g_w^msb[i]}.

4 Experiments
4.1 Experiment setup
Datasets: We evaluate our proposed techniques on two datasets: CIFAR-10 and CIFAR-100. Common data augmentation methods (e.g., mirroring/shifting) are adopted, and data are normalized as in [60].
Models: Three popular backbones, ResNet-74, ResNet-110 [61], and MobileNetV2 [62], are considered. For evaluating each of the three proposed techniques (i.e., SMD, SLU, and PSG), we consider various experimental settings using ResNet-74 and the CIFAR-10 dataset for ablation studies, as described in Sections 4.2-4.5. ResNet-110 and MobileNetV2 results are reported in Section 4.6. Top-1 accuracies are measured for CIFAR-10, and both top-1 and top-5 accuracies for CIFAR-100.
Training settings: We adopt the training settings in [61] for the baseline default configurations. Specifically, we use SGD with a momentum of 0.9 and a weight decay factor of 0.0001, and the initialization introduced in [63]. Models are trained for 64k iterations. For experiments where PSG is used, the initial learning rate is adjusted to 0.03, as SignSGD [20] suggests small learning rates benefit convergence. For the others, the learning rate is initially set to 0.1 and then decayed by 10× at the 32k and 48k iterations, respectively. 
We also employ the stochastic weight averaging (SWA) technique [64] when PSG is adopted, which we found to notably stabilize training.
Real energy measurements using FPGA: The energy cost of CNN inference/training consists of both computational and data-movement costs; the latter is often dominant but cannot be captured by commonly used metrics such as the number of FLOPs [6]. We therefore evaluate the proposed techniques against the baselines in terms of accuracy and real measured energy consumption. Specifically, unless otherwise specified, all energies or energy savings are obtained through real measurements, by training the corresponding models and datasets on a state-of-the-art FPGA [65], a Digilent ZedBoard Zynq-7000 ARM/FPGA SoC development board. Fig. 2 shows our FPGA measurement setup, in which the FPGA board is connected to a laptop through a serial port and to a power meter. The training settings are downloaded from the laptop to the FPGA board, and the real measured energy consumption is obtained via the power meter for the whole training process and then sent back to the laptop.

Figure 2: The energy measurement setup with (from left to right) a Mac Air laptop, a Xilinx FPGA board [65], and a power meter.

4.2 Evaluating stochastic mini-batch dropping
We first validate the energy saving achieved by SMD against a few "off-the-shelf" options: (1) can we train with the standard algorithm, using fewer iterations and an otherwise identical training protocol? (2) can we train with the standard algorithm, using fewer iterations but properly increased learning rates? Two sets of carefully designed experiments are presented below to address these questions.
Training with SMD vs. standard mini-batch (SMB): We first evaluate SMD against standard mini-batch (SMB) training, which uses all (vs. 50% in SMD) mini-batch samples. 
As shown in Fig. 3a, when the energy ratio is 1 (i.e., training with SMB + 64k iterations vs. SMD + 128k iterations), the proposed SMD technique boosts the inference accuracy by 0.39% over the standard way.
We next "naively" suppress the energy cost of SMB by reducing the training iterations. Specifically, we reduce the SMB training iterations to {6/12, 7/12, 8/12, 9/12, 10/12, 11/12, 1} of the original count; the learning rate schedule (e.g., when to reduce the learning rate) is scaled proportionally with the total iteration number. For comparison, we conduct experiments of training with SMD when the number of equivalent training iterations is the same as in the SMB cases. Fig. 3a shows that training with SMD consistently achieves higher inference accuracy than SMB, with margins ranging from 0.39% to 0.86%. Furthermore, training with SMD reduces the training energy cost by 33% while boosting the inference accuracy by 0.2% compared to SMB (see SMD at an energy ratio of 0.67 vs. SMB at an energy ratio of 1 in Fig. 3a). We adopt SMD at this energy ratio of 0.67 in all the remaining experiments.

Figure 3: The top-1 accuracy of CIFAR-10 testing using the ResNet-74 model when: (a) training with SMD versus the standard mini-batch (SMB) method, with the two's training energy ratio ranging from 0.5 to 1 (marker sizes are drawn proportionally to the measured training energy cost); and (b) training with SMD, and SMB with different increased learning rates, all under the same training energy budget.

We repeated training ResNet-74 on CIFAR-10 using SMD 10 times with different random initializations. The accuracy standard deviation is only 0.132%, showing high stability. We also conducted more experiments with different backbones and datasets. As shown in Tab. 
1, SMD is consistently better than SMB.
Training with SMD vs. SMB + increased learning rates: We further compare with SMB using tuned/larger learning rates, conjecturing that this could accelerate convergence by reducing the needed training epochs. Results are summarized in Fig. 3b. Specifically, when the number of iterations is reduced by 1/3, we do a grid search over learning rates with a step size of 0.02 within [0.1, 0.2]. All compared methods are given the same training energy budget. Fig. 3b demonstrates that while increasing the learning rate does improve SMB's energy efficiency over sticking to the original protocol, our proposed SMD still maintains a clear advantage of at least 0.22%.

Figure 4: Inference accuracy vs. energy ratios, where the energy ratios are obtained by normalizing the corresponding energy over that of the original setting (SMB + 64k iterations).

Table 1: The accuracy of SMD on other datasets and backbones (energy ratio 0.67).
Dataset     Backbone     SMB      SMD
CIFAR-10    ResNet-110   92.75%   93.05%
CIFAR-100   ResNet-74    71.11%   71.37%

4.3 Evaluating selective layer update
Our current SLU experiments are based on CNNs with residual connections, partially because they dominate state-of-the-art CNNs; we will extend SLU to other model structures in future work. We evaluate the proposed SLU by comparing it with stochastic depth (SD) [66], a technique originally developed for training very deep networks effectively by updating only a random subset of layers at each mini-batch. It can be viewed as a "random" version of SLU (which uses learned layer selection). We follow all suggested settings in [66]. For a fair comparison, we adjust the hyper-parameter pL [66] so that the SD dropping ratio is always the same as SLU's.
From Fig. 4, training with SLU consistently achieves higher inference accuracy than SD when their training energy costs are the same. 
It is further encouraging to observe that training with SLU can sometimes even achieve higher accuracy in addition to saving energy. For example, comparing training with SLU at an energy ratio of 0.3 (i.e., 70% energy saving) against SD at an energy ratio of 0.5, the proposed SLU technique reduces the training energy cost by 20% while boosting the inference accuracy by 0.86%. These results endorse the usage of data-driven gates instead of random dropping in the context of energy-efficient training. Training with SLU + SMD combined further boosts the accuracy while reducing the energy cost. Furthermore, 20 trials of SLU experiments with ResNet-38 on CIFAR-10 conclude that, at the 95% confidence level, the confidence intervals for the mean top-1 accuracy and the energy saving are [92.47%, 92.58%] (baseline: 92.50%) and [39.55%, 40.52%], respectively, verifying SLU's effectiveness.

4.4 Evaluating predictive sign gradient descent
We evaluate PSG against two alternatives: (1) the 8-bit fixed-point training proposed in [15]; and (2) the original SignSGD [20]. For all experiments in Sections 4.4 and 4.5, we adopt 8-bit precision for the activations/weights and 16-bit for the gradients. The corresponding predictor precisions are 4-bit and 10-bit, respectively. We use an adaptive threshold (see Section 3.3) with β = 0.05. 
More experimental details are given in the Appendix.

Table 2: Inference accuracy and achieved energy savings (over 32-bit floating-point training) when training with SGD, 8-bit fixed point [15], SignSGD [20], and PSG, using ResNet-74 on CIFAR-10.

Method           32-bit SGD   8-bit [15]   SignSGD [20]   PSG
Accuracy         93.52%       93.24%       92.54%         92.59%
Energy savings   -            38.62%       -              63.28%

Table 3: Inference accuracy and energy savings (over 32-bit floating-point training) of the proposed E2-Train under different (averaged) SLU skipping ratios and adaptive thresholds (i.e., β in Section 3.3), using ResNet-74 on CIFAR-10.

Skipping ratio           20%      40%      60%
Accuracy (β = 0.05)      92.12%   91.84%   91.36%
Accuracy (β = 0.1)       92.15%   91.72%   90.94%
Computational savings    80.27%   85.20%   90.13%
Energy savings           84.64%   88.72%   92.81%

As shown in Table 2, the 8-bit fixed-point training in [15] saves about 39% training energy (going from 32-bit to 8-bit generally yields about 80% energy saving, which is compromised here by its 32-bit gradients) with a marginal accuracy loss of 0.28% compared to 32-bit SGD. The proposed PSG almost doubles the training energy savings (63% vs. 39% for [15]) with an additional accuracy loss of only 0.65% relative to [15] (92.59% vs. 93.24%). Interestingly, compared to SignSGD [20], PSG slightly boosts the inference accuracy by 0.05% while saving 63% energy, i.e., 3x better training energy efficiency with slightly better inference accuracy. Besides, we observed that the ratio of using g_w^msb for sign prediction typically remains at least 60% throughout the training process, given the adaptive threshold β = 0.05.

4.5 Evaluating E2-Train: SMD + SLU + PSG
We now evaluate the proposed E2-Train framework, which combines the SMD, SLU, and PSG techniques.
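For reference, the data-level component being combined here, SMD, amounts to a training loop that skips each mini-batch independently with a fixed probability. A minimal sketch follows; the `update_fn` callback, standing in for one forward/backward/update step, is a hypothetical placeholder.

```python
import random

def smd_train(batches, update_fn, drop_prob=0.5, rng=random):
    """Stochastic mini-batch dropping (sketch): each incoming
    mini-batch is skipped independently with probability drop_prob,
    cutting per-epoch computation by that fraction in expectation
    while the kept batches remain an unbiased sample of the data.
    Returns the number of mini-batches actually used."""
    used = 0
    for batch in batches:
        if rng.random() < drop_prob:
            continue              # dropped: no forward/backward pass
        update_fn(batch)          # one ordinary SGD step on the kept batch
        used += 1
    return used
```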
As shown in Table 3, E2-Train: (1) further boosts performance compared to training with SMD+SLU (e.g., at 80% energy savings, E2-Train achieves a 0.5% higher accuracy, 92.1% vs. 91.6% for SMD+SLU; see Fig. 4 at an energy ratio of 0.2); and (2) achieves extremely aggressive energy savings of >90% and >60% while incurring an accuracy loss of only about 2.0% and 1.2%, respectively, compared to 32-bit floating-point SGD (see Table 2), i.e., up to 9x better training energy efficiency with a small accuracy loss.

Impact on empirical convergence speed. We plot the training convergence curves of different methods in Fig. 5, with the x-axis expressed as the training energy cost consumed up to the current iteration. We observe that E2-Train does not slow down the empirical convergence; in fact, it even makes the training loss decrease faster in the early stage.

Figure 5: Inference accuracy vs. energy cost at different stages of training (i.e., the empirical convergence curves), when training with SMB, SD, SLU only, SLU + SMD, and E2-Train on CIFAR-10 with ResNet-74.

Experiments on adapting a pre-trained model. We perform a proof-of-concept experiment on CNN fine-tuning by splitting the CIFAR-10 training set in half, with each class split i.i.d. and evenly. We first pre-train ResNet-74 on the first half, then fine-tune it on the second half. During fine-tuning, we compare two energy-efficient options: (1) fine-tuning only the last FC layer using standard training; (2) fine-tuning all layers using E2-Train.
With all hyperparameters tuned to our best effort, the two fine-tuning methods improve the pre-trained model's top-1 accuracy by 0.30% and 1.37%, respectively, while (2) saves 61.58% more energy (FPGA-measured) than (1). This shows that E2-Train is the preferred option: higher accuracy and more energy savings.
Table 4 evaluates E2-Train and its ablation baselines on more models and datasets. The conclusions align with the ResNet-74 cases. Remarkably, on CIFAR-10 with ResNet-110, E2-Train saves over 83% energy with only a 0.56% accuracy loss. When saving over 91% (i.e., more than 10x), the accuracy drop is still less than 2%. On CIFAR-100 with ResNet-110, E2-Train can even surpass the baseline on both top-1 and top-5 accuracy while saving over 84% energy. More notably, E2-Train is also effective for compact networks: it saves about 90% of the energy cost while achieving comparable accuracy when adopted for training MobileNetV2.

Table 4: Experiment results with ResNet-110 and MobileNetV2 on CIFAR-10/CIFAR-100.

Dataset    Method                   Backbone           Comp. Savings   Energy Savings   Accuracy (top-1)   Accuracy (top-5)
CIFAR-10   SMB (original)           ResNet-110         -               -                93.57%             -
CIFAR-10   SD [66]                  ResNet-110         50%             46.03%           91.51%             -
CIFAR-10   SMB (original)           MobileNetV2 [67]   -               -                92.47%             -
CIFAR-10   E2-Train (SMD+SLU+PSG)   ResNet-110         80.27%          83.40%           93.01%             -
CIFAR-10   E2-Train (SMD+SLU+PSG)   ResNet-110         85.20%          87.42%           91.74%             -
CIFAR-10   E2-Train (SMD+SLU+PSG)   ResNet-110         90.13%          91.34%           91.68%             -
CIFAR-10   E2-Train (SMD+SLU+PSG)   MobileNetV2 [67]   75.34%          88.73%           92.06%             -
CIFAR-100  SMB (original)           ResNet-110         -               -                71.60%             91.50%
CIFAR-100  SD [66]                  ResNet-110         50%             48.34%           70.40%             92.58%
CIFAR-100  SMB (original)           MobileNetV2 [67]   -               -                71.91%             -
CIFAR-100  E2-Train (SMD+SLU+PSG)   ResNet-110         80.27%          84.17%           71.63%             91.72%
CIFAR-100  E2-Train (SMD+SLU+PSG)   ResNet-110         85.20%          88.72%           68.61%             89.84%
CIFAR-100  E2-Train (SMD+SLU+PSG)   ResNet-110         90.13%          92.90%           67.94%             89.06%
CIFAR-100  E2-Train (SMD+SLU+PSG)   MobileNetV2 [67]   75.34%          88.17%           71.61%             -

5 Discussion of Limitations and Future Work
We propose the E2-Train framework to achieve energy-efficient CNN
training in resource-constrained settings. Three complementary efforts to trim down training costs, at the data, model, and algorithm levels, respectively, are carefully designed, justified, and integrated. Experiments in both simulation and on a real FPGA demonstrate the promise of E2-Train. Despite this preliminary success, we are aware of several limitations of E2-Train, which also point us to a future road map. For example, E2-Train is currently designed and evaluated for standard off-line CNN training, with all training data presented in batch, for simplicity. This is not scalable to many real-world IoT scenarios, where new training data arrives sequentially as a stream, with limited or no data buffer/storage, leading to the open challenge of "on-the-fly" CNN training [68]. In this case, while both SLU and PSG remain applicable, SMD needs to be modified, e.g., via one-pass active selection of streamed-in data samples. Besides, SLU does not yet extend straightforwardly to plain CNNs without residual connections; we expect finer-grained selective model updates, such as online channel pruning [45], to be useful alternatives here. We also plan to optimize E2-Train for continuous adaptation or lifelong learning.

Acknowledgments

The work is in part supported by the NSF RTML grant (1937592, 1937588). The authors would like to thank all anonymous reviewers for their tremendously useful comments to help improve our work.

References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[2] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

[3] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.

[4] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.

[5] Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. CoRR, abs/1706.05137, 2017.

[6] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, Jan 2017.

[7] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann. An always-on 3.8 µJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28-nm CMOS. IEEE Journal of Solid-State Circuits, 54(1):158–172, Jan 2019.

[8] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.

[9] Minsik Cho, Ulrich Finkler, Sameer Kumar, David S. Kung, Vaibhav Saxena, and Dheeraj Sreedhar. Powerai ddl. CoRR, abs/1708.02188, 2017.

[10] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. Proceedings of the 47th International Conference on Parallel Processing - ICPP 2018, 2018.

[11] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes.
CoRR, abs/1711.04325, 2017.

[12] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018.

[13] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.

[14] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, pages 7675–7684, 2018.

[15] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pages 5145–5153, 2018.

[16] Ting-Wu Chin, Ruizhou Ding, and Diana Marculescu. Adascale: Towards real-time video object detection using adaptive scaling. arXiv preprint arXiv:1902.02910, 2019.

[17] Shuang Wu, Guoqi Li, Lei Deng, Liu Liu, Dong Wu, Yuan Xie, and Luping Shi. L1-norm batch normalization for efficient training of deep neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2018.

[18] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, pages 2160–2170, 2018.

[19] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 409–424, 2018.

[20] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar.
signSGD: Compressed Optimisation for Non-Convex Problems. In International Conference on Machine Learning (ICML-18), 2018.

[21] Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE, 2018.

[22] Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling, 2018.

[23] Chiyuan Zhang, Samy Bengio, and Yoram Singer. Are all layers created equal? arXiv preprint arXiv:1902.01996, 2019.

[24] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR 2019, 2019.

[25] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[26] Dan Alistarh, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Randomized quantization for communication-optimal stochastic gradient descent. arXiv preprint arXiv:1610.02132, 2016.

[27] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 4035–4043. JMLR.org, 2017.

[28] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In ACM SIGARCH Computer Architecture News, volume 45, pages 561–574. ACM, 2017.

[29] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1509–1519. Curran Associates, Inc., 2017.

[30] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.

[31] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.

[32] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.

[33] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.

[34] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1398–1406. IEEE, 2017.

[35] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In ACM SIGARCH Computer Architecture News, volume 45, pages 548–560. ACM, 2017.

[36] Junru Wu, Yue Wang, Zhenyu Wu, Zhangyang Wang, Ashok Veeraraghavan, and Yingyan Lin. Deep k-means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5363–5372, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018.
PMLR.

[37] Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, and Xiaoning Qian. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8924–8933, 2019.

[38] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.

[39] Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. Lecture Notes in Computer Science, pages 3–18, 2018.

[40] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2181–2191. Curran Associates, Inc., 2017.

[41] Zhourong Chen, Yang Li, Samy Bengio, and Si Si. Gaternet: Dynamic filter selection in convolutional neural network via a dedicated global gating network, 2018.

[42] Yue Wang, Tan Nguyen, Yang Zhao, Zhangyang Wang, Yingyan Lin, and Richard Baraniuk. Energynet: Energy-efficient dynamic inference. NeurIPS workshop, 2018.

[43] Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard Baraniuk, Zhangyang Wang, and Yingyan Lin. Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference. arXiv preprint arXiv:1907.04523, 2019.

[44] Yingyan Lin, Charbel Sakr, Yongjune Kim, and Naresh Shanbhag. Predictivenet: An energy-efficient convolutional neural network via zero prediction. In Proceedings of ISCAS, 2017.

[45] Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez.
Prunetrain: Fast neural network training by dynamic sparse model reconfiguration, 2019.

[46] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.

[47] Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46–54, 2016.

[48] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pages 437–478. Springer, 2012.

[49] Benjamin Recht and Christopher Re. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. Technical report, University of Wisconsin-Madison, 2012.

[50] Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo Parrilo. Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015.

[51] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

[52] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[53] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.

[54] Tyler B Johnson and Carlos Guestrin. Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems, pages 7265–7275, 2018.

[55] Cody Coleman, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia.
Select via proxy: Efficient data selection for training deep networks, 2019.

[56] Chiyuan Zhang, Samy Bengio, and Yoram Singer. Are all layers created equal? CoRR, abs/1902.01996, 2019.

[57] Andreas Veit, Michael Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks, 2016.

[58] Klaus Greff, Rupesh K Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. ICLR, 2017.

[59] Mark Horowitz. 1.1 computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14. IEEE, 2014.

[60] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[61] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[62] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks, 2018.

[63] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.

[64] Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, and Christopher De Sa. Swalp: Stochastic weight averaging in low-precision training. arXiv preprint arXiv:1904.11943, 2019.

[65] Xilinx Inc. Digilent ZedBoard Zynq-7000 ARM/FPGA SoC Development Board. https://www.xilinx.com/products/boards-and-kits/1-elhabt.html/, 2019. [Online; accessed 20-May-2019].

[66] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger.
Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.

[67] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[68] Doyen Sahoo, Quang Pham, Jing Lu, and Steven CH Hoi. Online deep learning: Learning deep neural networks on the fly. arXiv preprint arXiv:1711.03705, 2017.