{"title": "On the Ineffectiveness of Variance Reduced Optimization for Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1755, "page_last": 1765, "abstract": "The application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard non-convex optimization problems encountered during training of modern deep neural networks is an open problem. We show that naive application of the SVRG technique and related approaches fail, and explore why.", "full_text": "On the Ineffectiveness of Variance Reduced Optimization for Deep Learning\n\nAaron Defazio & Léon Bottou\nFacebook AI Research New York\n\nAbstract\n\nThe application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard non-convex optimization problems encountered during training of modern deep neural networks is an open problem. We show that naive application of the SVRG technique and related approaches fail, and explore why.\n\n1 Introduction\n\nStochastic variance reduction (SVR) consists of a collection of techniques for the minimization of finite-sum problems:\n\nf(w) = (1/n) Σ_{i=1}^{n} f_i(w),\n\nsuch as those encountered in empirical risk minimization, where each f_i is the loss on a single training data point. Principal techniques include SVRG [Johnson and Zhang, 2013], SAGA [Defazio et al., 2014a], and their variants. SVR methods use control variates to reduce the variance of the traditional stochastic gradient descent (SGD) estimate f'_i(w) of the full gradient f'(w). Control variates are a classical technique for reducing the variance of a stochastic quantity without introducing bias. Say we have some random variable X. 
Although we could use X as an estimate of E[X] = X̄, we can often do better through the use of a control variate Y. If Y is a random variable correlated with X (i.e. Cov[X, Y] > 0), then we can estimate X̄ with the quantity\n\nZ = X − Y + E[Y].\n\nThis estimate is unbiased since −Y cancels with E[Y] when taking expectations, leaving E[Z] = E[X]. As long as Var[Y] ≤ 2 Cov[X, Y], the variance of Z is lower than that of X.\nRemarkably, these methods are able to achieve linear convergence rates for smooth strongly-convex optimization problems, a significant improvement on the sub-linear rate of SGD. SVR methods are part of a larger class of methods that explicitly exploit finite-sum structures, either by dual (SDCA, Shalev-Shwartz and Zhang, 2013; MISO, Mairal, 2014; Finito, Defazio et al., 2014b) or primal (SAG, Schmidt et al., 2017) approaches.\nRecent work has seen the fusion of acceleration with variance reduction (Shalev-Shwartz and Zhang [2014], Lin et al. [2015], Defazio [2016], Allen-Zhu [2017]), and the extension of SVR approaches to general non-convex [Allen-Zhu and Hazan, 2016, Reddi et al., 2016] as well as saddle point problems [Balamurugan and Bach, 2016].\nIn this work we study the behavior of variance reduction methods on a prototypical non-convex problem in machine learning: a deep convolutional neural network designed for image classification. We discuss in Section 2 how standard training and modeling techniques significantly complicate the application of variance reduction methods in practice, and how to overcome some of these issues. In Sections 3 & 5 we study empirically the amount of variance reduction seen in practice on modern CNN architectures, and we quantify the properties of the network that affect the amount of variance reduction. 
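The control-variate construction described at the start of this section can be illustrated with a small Monte Carlo sketch. The toy quantities below (X = exp(U) with control variate Y = 1 + U) are hypothetical illustrations, not part of the paper's experiments:

```python
import math
import random
import statistics

random.seed(0)

# Toy illustration: estimate E[X] for X = exp(U), U ~ Uniform(0, 1),
# using the control variate Y = 1 + U, whose mean E[Y] = 3/2 is known
# in closed form and which is strongly correlated with X.
N = 100_000
xs, zs = [], []
for _ in range(N):
    u = random.random()
    x = math.exp(u)              # X
    y = 1.0 + u                  # Y, with E[Y] = 1.5
    xs.append(x)
    zs.append(x - y + 1.5)       # Z = X - Y + E[Y], unbiased

# Z has the same mean as X (e - 1) but several times lower variance,
# since Var[Y] = 1/12 is well below 2 Cov[X, Y] here.
print(statistics.mean(zs), statistics.variance(xs) / statistics.variance(zs))
```

The printed variance ratio is well above one, confirming that the condition Var[Y] ≤ 2 Cov[X, Y] is satisfied with room to spare for this pair.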
In Sections 6 & 8 we show that streaming variants of SVRG do not improve over regular SVRG despite their theoretical ability to handle data augmentation. Code to reproduce the experiments performed is provided on the first author's website.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nStandard SVR approach\n\nThe SVRG method is the simplest of the variance reduction approaches to apply for large-scale problems, so we will focus our initial discussion on it. In SVRG, training epochs are interlaced with snapshot points where a full gradient evaluation is performed. The iterate at the snapshot point w̃ is stored, along with the full gradient f'(w̃). Snapshots can occur at any interval, although once per epoch is the most common frequency used in practice. The SGD step w_{k+1} = w_k − γ f'_i(w_k), using the randomly sampled data-point loss f_i with step size γ, is augmented with the snapshot gradient using the control variate technique to form the SVRG step:\n\nw_{k+1} = w_k − γ [f'_i(w_k) − f'_i(w̃) + f'(w̃)].   (1)\n\nThe single-data-point gradient f'_i(w̃) may be stored during the snapshot pass and retrieved, or recomputed when needed. The preference for recomputation or storage depends a lot on the computer architecture and its bottlenecks, although recomputation is typically the most practical approach.\nNotice that following the control variate approach, the expected step, conditioning on w_k, is just a gradient step. So like SGD, it is an unbiased step. Unbiasedness is not necessary for the fast rates obtainable by SVR methods; both SAG [Schmidt et al., 2017] and Point-SAGA [Defazio, 2016] use biased steps. However, biased methods are harder to analyze. 
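The update in Equation 1 can be sketched on a toy finite sum. The 1-D quadratic problem below is a hypothetical illustration, not the paper's experimental setup:

```python
import random

random.seed(0)

# Minimal sketch of the SVRG update (Equation 1) on a toy finite sum
# f(w) = (1/n) * sum_i (w - a_i)^2 / 2, where f_i'(w) = w - a_i and the
# minimizer is mean(a). Hypothetical problem, not the paper's setup.
a = [random.gauss(0.0, 1.0) for _ in range(100)]
n = len(a)

def grad_i(w, i):                 # per-example gradient f_i'(w)
    return w - a[i]

def full_grad(w):                 # full gradient f'(w): one pass over the data
    return sum(grad_i(w, i) for i in range(n)) / n

w, gamma = 5.0, 0.1
for epoch in range(20):
    w_snap = w                    # snapshot iterate ~w
    g_snap = full_grad(w_snap)    # snapshot full gradient f'(~w)
    for _ in range(n):            # inner loop: one epoch of steps
        i = random.randrange(n)
        # Equation 1: the two per-example terms form the control variate
        v = grad_i(w, i) - grad_i(w_snap, i) + g_snap
        w -= gamma * v

# For this quadratic toy the per-example gradients differ only by constants,
# so the correction cancels the sampling noise exactly and w -> mean(a).
print(w - sum(a) / n)
```

Note that this toy is an idealized case: because the f_i differ only by constant shifts, the control variate removes all of the gradient noise, which is exactly the behavior that fails to materialize on the deep networks studied below.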
Note also that successive step directions are highly correlated, as the f'(w̃) term appears in every consecutive step between snapshots. This kind of step correlation is also seen in momentum methods, and is considered a contributing factor to their effectiveness [Kidambi et al., 2018].\n\n2 Complications in practice\n\nModern approaches to training deep neural networks deviate significantly from the assumptions that SVR methods are traditionally analyzed under. In this section we discuss the major ways in which practice deviates from theory and how to mitigate any complications that arise.\n\nData augmentation\n\nIn order to achieve state-of-the-art results in most domains, data augmentation is essential. The standard approach is to form a class of transform functions T; for an image domain typical transforms include cropping, rotation, flipping and compositions thereof. Before the gradient calculation for a data-point x_i, a transform T_i is sampled and the gradient is evaluated on its image T_i(x_i).\nWhen applying standard SVRG using gradient recomputation, the use of random transforms can destroy the prospects of any variance reduction if different transforms are used for a data-point during the snapshot pass compared to the following steps. Using a different transform is unfortunately the most natural implementation when using standard libraries (PyTorch¹; TensorFlow, Abadi et al. [2015]), as the transform is applied automatically as part of the data pipeline. We propose the use of transform locking, where the transform used during the snapshot pass is cached and reused during the following epochs.\nThis performance difference is illustrated in Figure 1, where the variance of the SVRG step is compared with and without transform locking during a single epoch of training of a LeNet model. 
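Transform locking can be sketched by caching one RNG seed per example during the snapshot pass and replaying it afterwards. The helper class and toy "flip" transform below are hypothetical, not the released implementation:

```python
import random

# Sketch of transform locking: cache a seed per example during the snapshot
# pass, replay it in the following epoch so the augmentation matches at both
# gradient evaluations. Hypothetical helper, not the paper's released code.
class LockedTransform:
    def __init__(self, transform):
        self.transform = transform     # transform(rng, x) -> augmented x
        self.seeds = {}

    def snapshot(self, index, x):
        # Snapshot pass: draw and cache a fresh seed for this example.
        self.seeds[index] = random.getrandbits(32)
        return self.locked(index, x)

    def locked(self, index, x):
        # Following epoch: replay the cached seed, reproducing the
        # augmentation seen at the snapshot.
        return self.transform(random.Random(self.seeds[index]), x)

# Toy transform: a random horizontal "flip" of a pixel list.
def flip(rng, x):
    return x[::-1] if rng.random() < 0.5 else x

lock = LockedTransform(flip)
snap_view = lock.snapshot(7, [1, 2, 3, 4])
epoch_view = lock.locked(7, [1, 2, 3, 4])
print(snap_view == epoch_view)  # the two augmented views are identical
```

Storing only seeds keeps the memory cost negligible, in contrast to storing the transformed inputs themselves.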
Data augmentation consisted of random horizontal flips and random cropping to 32x32, after padding by 4 pixels on each side (following standard practice).\nFor SVRG with transform locking, the variance of the step is initially zero at the very beginning of the epoch, increasing over the course of the epoch. This is the behavior expected of SVRG on finite sum problems. In contrast, without transform locking the variance is non-zero at the beginning of the epoch, and uniformly worse.\n\nFigure 1: Variance within epoch two during LeNet training on CIFAR10.\n\n¹ http://pytorch.org/\n\nThe handling of data augmentation in finite-sum methods has been previously considered for the MISO method [Bietti and Mairal, 2017], which is one of the family of gradient table methods (as with the storage variant of SVRG). The stored gradients are updated with an exponential moving average instead of overwriting, which averages over multiple past transformed-data-point gradients. As we show in Section 5, stored gradients can quickly become too stale to provide useful information when training large models.\n\nBatch normalization\n\nBatch normalization [Ioffe and Szegedy, 2015] is another technique that breaks the finite-sum structure assumption. In batch normalization, mean and variance statistics are calculated within a mini-batch, for the activations of each layer (typically before application of a nonlinearity). These statistics are used to normalize the activations. The finite sum structure no longer applies since the loss on a datapoint i depends on the statistics of the mini-batch it is sampled in.\nThe interaction of BN with SVRG depends on whether storage or recomputation of gradients is used. When recomputation is used naively, catastrophic divergence occurs in standard frameworks. 
The problem is a subtle interaction with the internal computation of running means and variances, for use at test time.\nIn order to apply batch normalization at test time, where data may not be mini-batched or may not have the same distribution as training data, it is necessary to store mean and variance information at training time for later use. The standard approach is to keep track of an exponential moving average of the means and variances computed at each training step. For instance, PyTorch by default will update the moving average m_EMA using the mini-batch mean m as:\n\nm_EMA = (9/10) m_EMA + (1/10) m.\n\nDuring test time, the network is switched to evaluation mode using model.eval(), and the stored running means and variances are then used instead of the internal mini-batch statistics for normalization. The complication with SVRG is that during training the gradient evaluations occur both at the current iterate w_k and the snapshot iterate w̃. If the network is in train mode for both, the EMA will average over activation statistics between two different points, resulting in poor results and divergence.\nSwitching the network to evaluation mode mid-step is the obvious solution; however, computing the gradient using the two different sets of normalizations results in additional introduced variance. We recommend a BN reset approach, where the normalization statistics are temporarily stored before the w̃ gradient evaluation, and the stored statistics are used to undo the updated statistics by overwriting afterwards. This avoids having to modify the batch normalization library code. 
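The BN reset can be sketched with a toy batch-norm layer that keeps a PyTorch-style 0.9/0.1 moving average of mini-batch means. The classes below are hypothetical stand-ins, not a real framework API:

```python
import copy

# Sketch of the BN reset approach with a toy batch-norm layer that keeps an
# exponential moving average of mini-batch means (the 0.9/0.1 update shown
# above). Hypothetical stand-in, not a real framework API.
class ToyBatchNorm:
    def __init__(self):
        self.running_mean = 0.0

    def forward(self, batch):           # train-mode forward pass
        m = sum(batch) / len(batch)
        # The EMA update happens on every train-mode evaluation
        self.running_mean = 0.9 * self.running_mean + 0.1 * m
        return [x - m for x in batch]

bn = ToyBatchNorm()
bn.forward([1.0, 3.0])                  # regular step at the current iterate

saved = copy.deepcopy(bn.running_mean)  # BN reset: stash the statistics ...
bn.forward([100.0, 102.0])              # ... snapshot-pass evaluation at ~w
bn.running_mean = saved                 # ... then undo the EMA pollution

print(bn.running_mean)  # statistics reflect only the current-iterate passes
```

The same stash-and-restore pattern applies per BN layer (and to the running variances) in a real network, leaving the library code untouched.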
It is important to use train mode during the snapshot pass as well, so that the mini-batch statistics match between the two evaluations.\n\nDropout\n\nDropout [Srivastava et al., 2014] is another popular technique that affects the finite-sum assumption. When dropout is in use, a random fraction, usually 50%, of the activations will be zero at each step. This is extremely problematic when used in conjunction with variance reduction, since the sparsity pattern will be different for the snapshot evaluation of a datapoint compared to its evaluation during the epoch, resulting in much lower correlation and hence lower variance reduction.\nThe same dropout pattern can be used at both points, as with the transform locking approach proposed above. The seed used for each data-point's sparsity pattern should be stored during the snapshot pass, and reused during the following epoch when that data-point is encountered. Storing the sparsity patterns directly is not practical, as they would be many times larger than memory even for simple models.\nResidual connection architectures benefit very little from dropout when batch-norm is used [He et al., 2016, Ioffe and Szegedy, 2015], and because of this we don't use dropout in the experiments detailed in this work, following standard practice.\n\nIterate averaging\n\nAlthough it is common practice to use the last iterate of an epoch as the snapshot point for the next epoch, standard SVRG theory requires computing the snapshot at either an average iterate or a randomly chosen iterate from the epoch instead. Averaging is also needed for SGD when applied to non-convex problems. We tested both SVRG and SGD using averaging of 100%, 50% or 10% of the tail of each epoch as the starting point of the next epoch. Using a 10% tail average did result in faster initial convergence for both methods before the first step size reduction on the CIFAR10 test problem (detailed in the next section). 
However, this did not lead to faster convergence after the first step size reduction, and final test error was consistently worse than without averaging. For this reason we did not use iterate averaging in the experiments presented in this work.\n\n3 Measuring variance reduction\n\nTo illustrate the degree of variance reduction achieved by SVRG on practical problems, we directly computed the variance of the SVRG gradient estimate, comparing it to the variance of the stochastic gradient used by SGD. To minimize noise the variance was estimated using a pass over the full dataset, although some noise remains due to the use of data augmentation. The transform locking and batch norm reset techniques described above were used in order to get the most favorable performance out of SVRG.\nRatios below one indicate that variance reduction is occurring, whereas ratios around two indicate that the control variate is uncorrelated with the stochastic gradient, leading to an increase in variance. For SVRG to be effective we need a ratio below 1/3 to offset the additional computational costs of the method. We plot the variance ratio at multiple points within each epoch, as it changes significantly during each epoch. An initial step size of 0.1 was used, with 10-fold decreases at 150 and 220 epochs. A batch size of 128 with momentum 0.9 and weight decay 0.0001 was used for all methods. Without-replacement data sampling was used.\nTo highlight differences introduced by model complexity, we compared four models:\n\n1. The classical LeNet-5 model [Lecun et al., 1998], modified to use batch-norm and ReLUs, with approximately 62 thousand parameters².\n\n2. A ResNet-18 model [He et al., 2016], scaled down to match the model size of the LeNet model by halving the number of feature planes at each layer. It has approximately 69 thousand parameters.\n\n3. A ResNet-110 model with 1.7m parameters, as used by He et al. [2016].\n\n4. 
A wide DenseNet model [Huang et al., 2017] with growth rate 36 and depth 40. It has approximately 1.5 million parameters and achieves below 5% test error.\n\nFigure 2 shows how this variance ratio depends dramatically on the model used. For the LeNet model, the SVRG step has consistently lower variance, from 4x to 2x depending on the position within the epoch, during the initial phase of convergence.\nIn contrast, the results for the DenseNet-40-36 model as well as the ResNet-110 model show an increase in variance for the majority of each epoch, up until the first step size reduction at epoch 150. Indeed, even at only 2% progress through each epoch, the variance reduction is only a factor of 2, so computing the snapshot pass more often than once an epoch cannot help during the initial phase of optimization.\nThe small ResNet model sits between these two extremes, showing some variance reduction mid-epoch at the early stages of optimization. Compared to the LeNet model of similar size, the modern architecture with its greater ability to fit the data also benefits less from the use of SVRG.\n\n² Connections between max pooling layers and convolutions are complete, as the symmetry-breaking approach taken in the original network is not implemented in modern frameworks.\n\n(a) LeNet (b) DenseNet-40-36 (c) Small ResNet (d) ResNet-110\n\nFigure 2: The SVRG to SGD gradient variance ratio during a run of SVRG. The shaded region indicates a variance increase, where the SVRG variance is worse than the SGD baseline. Dotted lines indicate when the step size was reduced. The variance ratio is shown at different points within each epoch, so that the 2% dots (for instance) indicate the variance at 1,000 data-points into the 50,000 data-points constituting the epoch. Multiple percentages within the same run are shown at equally spaced epochs. SVRG fails to show a variance reduction for the majority of each epoch when applied to modern high-capacity networks, whereas some variance reduction is seen for smaller networks.\n\n4 Snapshot intervals\n\nThe number of stochastic steps between snapshots has a significant effect on the practical performance of SVRG. In the classical convex theory the interval should be proportional to the condition number [Johnson and Zhang, 2013], but in practice an interval of one epoch is commonly used, and that is what we used in the experiment above. A careful examination of our results from Figure 2 shows that no adjustment to the snapshot interval can salvage the method. The SVRG variance can be kept reasonable (i.e. below the SGD variance) by reducing the duration between snapshots; however, for the ResNet-110 and DenseNet models, even at 11% into an epoch, the SVRG step variance is already larger than that of SGD, at least during the crucial epochs 10-150. If we were to perform snapshots at this frequency the wall-clock cost of the SVRG method would go up by an order of magnitude compared to SGD, while still under-performing on a per-epoch basis.\nSimilarly, we can consider performing snapshots at less frequent intervals. Our plots show that the variance of the SVRG gradient estimate would then be approximately 2x the variance of the SGD estimate on the harder two problems (during epochs 10-150), which certainly will not result in faster convergence. 
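The variance-ratio diagnostic used in Section 3 can be sketched on a toy problem whose per-example gradients decorrelate from the snapshot as the iterate moves away. The sinusoidal gradients below are a hypothetical construction, not the networks studied here:

```python
import math
import random

random.seed(1)

# Sketch of the variance-ratio diagnostic: the empirical variance of the
# SVRG estimate g_i(w) - g_i(w_snap) + g(w_snap), divided by that of the
# plain SGD estimate g_i(w). The toy per-example gradients
# g_i(w) = sin(b_i * w + a_i) are hypothetical, chosen so that their
# correlation with the snapshot decays with iterate distance.
n = 1000
a = [random.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
b = [random.uniform(0.5, 1.5) for _ in range(n)]

def g(w, i):
    return math.sin(b[i] * w + a[i])

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def svrg_to_sgd_ratio(w, w_snap):
    g_full = sum(g(w_snap, i) for i in range(n)) / n
    sgd = [g(w, i) for i in range(n)]
    svrg = [g(w, i) - g(w_snap, i) + g_full for i in range(n)]
    return variance(svrg) / variance(sgd)

# Near the snapshot the ratio is far below one (useful variance reduction);
# far away the control variate decorrelates and the ratio approaches two.
print(svrg_to_sgd_ratio(0.1, 0.0), svrg_to_sgd_ratio(50.0, 0.0))
```

This reproduces, in miniature, the two regimes discussed above: ratios well below one when the snapshot is fresh, and ratios near two once the control variate has become uncorrelated with the stochastic gradient.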
This is because the correction factor in Equation 1 becomes so out-of-date that it is effectively uncorrelated with the stochastic gradient, and since its magnitude is comparable (the gradient norm decays relatively slowly during optimization for these networks), adding it to the stochastic gradient results in a doubling of the variance.\n\n4.1 Variance reduction and optimization speed\n\nFor sufficiently well-behaved objective functions (such as smooth and strongly convex ones), we can expect that an increase of the learning rate results in an increase of the convergence rate, up until the learning rate approaches a limit defined by the curvature (≈ 1/L for L-Lipschitz-smooth functions). This holds also in the stochastic case for small learning rates; however, there is an additional ceiling that occurs as the learning rate is increased, where the variance of the gradient estimate begins to slow convergence. Which ceiling comes into effect first determines whether a possible variance reduction (such as from SVRG) can allow for larger learning rates and thus faster convergence. Although clearly a simplified view of the non-differentiable non-convex optimization problem we are considering, it still offers some insight.\n\nFigure 3: Distance moved from the snapshot point, and curvature relative to the snapshot point, at epoch 50.\n\nEmpirically, deep residual networks are known to be constrained by the curvature for a few initial epochs, and afterwards are constrained by the variance. For example, Goyal et al. 
[2017] show that decreasing the variance by increasing the batch-size allows them to proportionally increase the learning rate, for variance reduction factors up to 30-fold. This is strong evidence that an SVR technique that results in significant variance reduction can potentially improve convergence in practice.\n\n5 Why variance reduction fails\n\nFigure 2 clearly illustrates that for the DenseNet model, SVRG gives no actual variance reduction for the majority of the optimization run. This also holds for larger ResNet models (plot omitted). The variance of the SVRG estimator is directly dependent on how similar the gradient is between the snapshot point w̃ and the current iterate w_k. Two phenomena may explain the differences seen here. If the w_k iterate moves too quickly through the optimization landscape, the snapshot point will be too out-of-date to provide meaningful variance reduction. Alternatively, the gradient may just change more rapidly in the larger model.\nFigure 3 sheds further light on this. The left plot shows how rapidly the current iterate moves within the same epoch for the LeNet and DenseNet models when training using SVRG. The distance moved from the snapshot point increases significantly faster for the DenseNet model compared to the LeNet model.\nIn contrast, the right plot shows the curvature change during an epoch, which we estimated as:\n\n‖ (1/|S_i|) Σ_{j∈S_i} [f'_j(w_k) − f'_j(w̃)] ‖ / ‖ w_k − w̃ ‖,\n\nwhere S_i is a sampled mini-batch. This can be seen as an empirical measure of the Lipschitz smoothness constant. Surprisingly, the measured curvature is very similar for the two models, which supports the idea that iterate distance is the dominating factor in the lack of variance reduction. 
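The empirical curvature measure above can be sketched in one dimension, where the norms reduce to absolute values. The per-example gradients below are a hypothetical 1-Lipschitz toy:

```python
import math
import random

random.seed(2)

# Sketch of the empirical curvature measure: the norm of the mini-batch
# averaged gradient difference between w_k and the snapshot ~w, divided by
# the iterate distance. A hypothetical 1-D toy with per-example gradients
# f_j'(w) = sin(w + a_j), each of which is 1-Lipschitz.
n = 256
a = [random.uniform(0.0, 2.0 * math.pi) for _ in range(n)]

def grad_j(w, j):
    return math.sin(w + a[j])

def curvature(w, w_snap, batch):
    diffs = [grad_j(w, j) - grad_j(w_snap, j) for j in batch]
    mean_diff = sum(diffs) / len(diffs)
    return abs(mean_diff) / abs(w - w_snap)   # norms are |.| in 1-D

batch = random.sample(range(n), 32)
# The estimate never exceeds the true Lipschitz constant (here, 1).
print(curvature(0.5, 0.0, batch))
```

As in the paper's measurements, the estimate is a lower bound on the smoothness constant along the particular direction the iterate has moved.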
The curvature is highest at the beginning of an epoch because of the lack of smoothness of the objective (the Lipschitz smoothness is potentially unbounded for non-smooth functions).\nSeveral papers have shown encouraging results when using SVRG variants on small MNIST training problems [Johnson and Zhang, 2013, Lei et al., 2017]. Our failure to show any improvement when using SVRG on larger problems should not be seen as a refutation of their results. Instead, we believe it shows a fundamental problem with MNIST as a baseline for optimization comparisons. Particularly with small neural network architectures, it is not representative of harder deep learning training problems.\n\n5.1 Smoothness\n\nSince known theoretical results for SVRG apply only to smooth objectives, we also computed the variance when using the ELU activation function [Clevert et al., 2016], a popular smooth activation that can be used as a drop-in replacement for ReLU. We did see a small improvement in the degree of variance reduction when using the ELU. There was still no significant variance reduction on the DenseNet model.\n\n6 Streaming SVRG variants\n\nIn Section 3, we saw that the amount of variance reduction quickly diminished as the optimization procedure moved away from the snapshot point. One potential fix is to perform snapshots at finer intervals. To avoid incurring the cost of a full gradient evaluation at each snapshot, the class of streaming SVRG [Frostig et al., 2015, Lei et al., 2017] methods instead use a mega-batch to compute the snapshot point. A mega-batch is typically 10-32 times larger than a regular mini-batch. To be precise, let the mini-batch size be b and the mega-batch size be B. 
Streaming SVRG alternates between computing a snapshot mega-batch gradient g̃ at w̃ = w_k, and taking a sequence of SVRG inner loop steps where a mini-batch S_k is sampled, then a step is taken:\n\nw_{k+1} = w_k − γ [ (1/|S_k|) Σ_{i∈S_k} (f'_i(w_k) − f'_i(w̃)) + g̃ ].   (2)\n\nAlthough the theory suggests taking a random number of these steps, often a fixed m steps is used in practice, and we follow this procedure as well.\nIn this formulation the data-points from the mega-batch and the subsequent m steps are independent. Some further variance reduction is potentially possible by sampling the mini-batches for the inner steps from the mega-batch, but at the cost of some bias. This approach has been explored as the Stochastically Controlled Stochastic Gradient (SCSG) method [Lei and Jordan, 2017].\nTo investigate the effectiveness of streaming SVRG methods we produced variance-over-time plots. We look at the variance of each individual step after the computation of a mega-batch, where our mega-batches were taken as 10x larger than our mini-batch size of 128 CIFAR10 instances, and 10 inner steps were taken per snapshot. The data augmentation and batch norm reset techniques from Section 2 were used to get the lowest variance possible. The variance is estimated using the full dataset at each point.\n\nFigure 4: Streaming SVRG variance at epoch 50.\n\nFigure 4 shows the results at the beginning of the 50th epoch. In both cases the variance is reduced by 10x for the first step, as the two mini-batch terms cancel in Equation 2, resulting in just the mega-batch gradient being used. The variance quickly rises thereafter. These results are similar to those for the non-streaming SVRG method, as we see that much greater variance reduction is possible for LeNet. 
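The alternation in Equation 2 can be sketched on a toy finite sum. The problem and the B, b, m values below are hypothetical illustrations, not the CIFAR10 configuration:

```python
import random

random.seed(3)

# Minimal sketch of the streaming SVRG step (Equation 2) on a hypothetical
# toy finite sum f_i(w) = (w - a_i)^2 / 2: a mega-batch of size B estimates
# the snapshot gradient ~g, then m mini-batch inner steps are taken.
a = [random.gauss(0.0, 1.0) for _ in range(5000)]
B, batch_size, m, gamma = 1280, 128, 10, 0.1

def minibatch_grad(w, idx):          # averaged gradient over indices idx
    return sum(w - a[i] for i in idx) / len(idx)

w = 5.0
for _ in range(40):
    w_snap = w
    mega = random.sample(range(len(a)), B)
    g_snap = minibatch_grad(w_snap, mega)          # mega-batch estimate ~g
    for _ in range(m):
        S = random.sample(range(len(a)), batch_size)
        # Equation 2: mini-batch control variate plus the mega-batch gradient
        v = minibatch_grad(w, S) - minibatch_grad(w_snap, S) + g_snap
        w -= gamma * v

print(w - sum(a) / len(a))  # w settles near the minimizer mean(a)
```

Because the snapshot gradient is only a mega-batch estimate, the iterate fluctuates around the minimizer at a level set by B, rather than converging exactly as full-gradient SVRG would on this toy.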
Recall that the amortized cost of each step is three times that of SGD, so for the DenseNet model the amount of variance reduction is not compelling.\n\n7 Other methods for non-convex variance-reduced optimization\n\nAlthough dozens of methods based upon the non-streaming variance reduction framework have been developed, they can generally be characterized into one of several classes: SAGA-like [Defazio et al., 2014a], SVRG-like [Johnson and Zhang, 2013], Dual [Shalev-Shwartz and Zhang, 2013], Catalyst [Lin et al., 2015] or SARAH-like [Nguyen et al., 2017]. Each of these classes has the same issues as those described for basic SVRG, with some additional subtleties. SAGA-like methods have lower computational costs than SVRG, but they have similar convergence rates on a per-epoch basis, both empirically and theoretically. As we show in the next section, even on a per-epoch basis and ignoring additional costs, SVRG doesn't improve over SGD for large models, so we would not expect SAGA to show improvement either. On such large models, SAGA is also impractical due to its gradient storage requirements.\nDual methods require the storage of dual iterates, resulting in similar storage costs to SAGA, and are not generally applicable in the non-convex setting. Most accelerated methods for the convex case fall within the dual setup.\nCatalyst methods involve using a secondary variance-reduction method to solve subproblems, which provides acceleration in the convex case. Catalyst methods do not match the best theoretical rates in the general non-convex case [Paquette et al., 2018], and are not well-suited to non-smooth models such as the ReLU-based neural networks used in this work.\nThe SARAH approach is quite different from the other approaches described above, but it suffers from the same high per-epoch computational cost as SVRG, which limits its effectiveness, as it also uses two minibatch evaluations each step together with a snapshot full gradient evaluation. The SARAH++ variant [Nguyen et al., 2019] has the best theoretical convergence rate among the methods considered for non-convex problems. However, we were not able to achieve reliable convergence with SARAH-style methods on our test problems, which we attribute to an accumulation of error in the inner loop.\n\n(a) LeNet on CIFAR10 (b) DenseNet on CIFAR10 (c) ResNet-110 on CIFAR10 (d) ResNet-18 on ImageNet\n\nFigure 5: Test error comparison between SGD, SVRG and SCSG. For the CIFAR10 comparison a moving average (window size 10) of 10 runs is shown with 1 SE overlay, as results varied significantly between runs.\n\n8 Convergence rate comparisons\n\nTogether with the direct measures of variance reduction in Section 3, we also directly compared the convergence rates of SGD, SVRG and the streaming method SCSG. The results are shown in Figure 5. For our CIFAR10 experiment, an average of 10 runs is shown for each method, using the same momentum (0.9) and learning rate (0.1) parameters for each, with a 10-fold reduction in learning rate at epochs 150 and 225. We were not able to see any improvement from using alternative hyper-parameters for each method. A comparison was also performed on ImageNet using a ResNet-18 architecture and a single run for each method. Run-to-run variability is much lower for ImageNet.\nThe variance reduction seen in SVRG comes at the cost of the introduction of heavy correlation between consecutive steps. This is why the reduction in variance does not have the direct impact that increasing batch size or decreasing learning rate has on the convergence rate, and why convergence theory for VR methods requires careful proof techniques. It is for this reason that the amount of variance reduction in Figure 4 doesn't necessarily manifest as a direct improvement in convergence rate in practice. On the LeNet problem we see that SVRG converges slightly faster than SGD, whereas on the larger problems, including ResNet on ImageNet (Figure 5d) and DenseNet on CIFAR10, it is a little slower than SGD. This is consistent with the differences in the amount of variance reduction observed in the two cases in Figure 2, and with our hypothesis that SVRG performs worse for larger models. The SCSG variant performs the worst in each comparison.\n\n(a) ResNet-50 (b) DenseNet-169\n\nFigure 6: Fine-tuning on ImageNet with SVRG. Final test errors, ResNet-50: from epoch 0, 28.60%; from 20, 26.32%; from 40, 24.71%; from 60, 23.80%; from 80, 23.65%; SGD-only baseline, 23.61%. DenseNet-169: from 60, 23.38%; from 80, 23.30%; baseline, 23.22%.\n\n9 Fine-tuning with SVRG\n\nAs we have shown that SVRG appears to only introduce a benefit late in training, we performed experiments where we turned on SVRG after a fixed number of epochs into training. Using the standard ResNet-50 architecture on ImageNet, we considered training using SVRG with momentum from epoch 0, 20, 40, 60 or 80, with SGD with momentum used in the prior epochs. Figure 6 shows that the fine-tuning process did not lead to improved test accuracy at any interval compared to the SGD-only baseline. For further validation we evaluated a DenseNet-169 model, which we only fine-tuned from 60 and 80 epochs out to a total of 90 epochs, due to the much slower model training. This model also showed no improvement from the fine-tuning procedure.\n\nConclusion\n\nThe negative results presented here are disheartening; however, we don't believe that they rule out the use of stochastic variance reduction on deep learning problems. Rather, they suggest avenues for further research. 
For instance, SVR could be applied adaptively, or at a meta level to learning rates or scaling matrices, and could potentially be combined with methods like Adagrad [Duchi et al., 2011] and Adam [Kingma and Ba, 2014] to yield hybrid methods.\n\nReferences\n\nMartín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/.\n\nZeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, 2017.\n\nZeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of The 33rd International Conference on Machine Learning, 2016.\n\nP. Balamurugan and Francis Bach. Stochastic variance reduction methods for saddle-point problems. Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016.\n\nAlberto Bietti and Julien Mairal. Stochastic optimization with variance reduction for infinite datasets with finite sum structure. 
In Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR 2016), 2016.

Aaron Defazio. A simple practical accelerated method for finite sums. Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014a.

Aaron Defazio, Tiberio Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. The 31st International Conference on Machine Learning (ICML 2014), 2014b.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In Proceedings of The 28th Conference on Learning Theory, 2015.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Sergey Ioffe and Christian Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems 26 (NIPS 2013), 2013.

Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham M. Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. International Conference on Learning Representations (ICLR), 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Lihua Lei and Michael Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30, 2017.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28, 2015.

Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, INRIA Grenoble Rhône-Alpes / LJK Laboratoire Jean Kuntzmann, 2014.

Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.

Lam M. Nguyen, Marten van Dijk, Dzung T.
Phan, Phuong Ha Nguyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization with SARAH. 2019.

Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, and Zaid Harchaoui. Catalyst for gradient-based nonconvex optimization. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 613–622, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/paquette18a.html.

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczós, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016.

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2017.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14, 2013.

Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Proceedings of the 31st International Conference on Machine Learning, 2014.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
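For readers unfamiliar with the estimator these methods are built on, the SVRG update of Johnson and Zhang [2013] can be sketched in a few lines. The following is a minimal illustrative implementation on a toy one-dimensional least-squares finite sum; the problem, step size, and loop lengths are arbitrary choices for the sketch, not the experimental setup used in this paper.

```python
import random

# Toy finite sum: f(w) = (1/n) * sum_i 0.5 * (a_i * w - b_i)^2,
# minimized at w = w_true. Illustrative only.
random.seed(0)
n = 50
a = [random.uniform(0.5, 1.5) for _ in range(n)]
w_true = 2.0
b = [ai * w_true for ai in a]

def grad_i(w, i):
    # Gradient of the i-th component f_i at w.
    return (a[i] * w - b[i]) * a[i]

def full_grad(w):
    return sum(grad_i(w, i) for i in range(n)) / n

def loss(w):
    return sum(0.5 * (a[i] * w - b[i]) ** 2 for i in range(n)) / n

def svrg(w, lr=0.2, outer=10, m=None):
    m = m or n  # inner-loop length, commonly on the order of n
    for _ in range(outer):
        w_snap = w                  # snapshot point
        g_snap = full_grad(w_snap)  # full gradient at the snapshot
        for _ in range(m):
            i = random.randrange(n)
            # Control-variate estimate: unbiased for f'(w), with
            # variance shrinking as w and w_snap approach the optimum.
            g = grad_i(w, i) - grad_i(w_snap, i) + g_snap
            w -= lr * g
    return w

w0 = 0.0
w = svrg(w0)
```

Note that each outer iteration costs a full gradient evaluation on top of the inner stochastic steps; this snapshot cost, together with the data-augmentation and batch-normalization issues discussed in the paper, is what complicates transferring the estimator to deep network training.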