{"title": "Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs", "book": "Advances in Neural Information Processing Systems", "page_first": 8789, "page_last": 8798, "abstract": "The loss functions of deep neural networks are complex and their geometric properties are not well understood.  We show that the optima of these complex loss functions are in fact connected by simple curves, over which training and test accuracy are nearly constant.  We introduce a training procedure to discover these high-accuracy pathways between modes.  Inspired by this new geometric insight, we also propose a new ensembling method entitled Fast Geometric Ensembling (FGE). Using FGE we can train high-performing ensembles in the time required to train a single model.  We achieve improved performance compared to the recent state-of-the-art Snapshot Ensembles, on  CIFAR-10, CIFAR-100, and ImageNet.", "full_text": "Loss Surfaces, Mode Connectivity,\n\nand Fast Ensembling of DNNs\n\nTimur Garipov\u22171,2 Pavel Izmailov\u22173 Dmitrii Podoprikhin\u22174\n\nDmitry Vetrov5 Andrew Gordon Wilson3\n\n1Samsung AI Center in Moscow, 2Skolkovo Institute of Science and Technology,\n\n3Cornell University,\n\n4Samsung-HSE Laboratory, National Research University Higher School of Economics,\n\n5National Research University Higher School of Economics\n\nAbstract\n\nThe loss functions of deep neural networks are complex and their geometric\nproperties are not well understood. We show that the optima of these complex\nloss functions are in fact connected by simple curves over which training and test\naccuracy are nearly constant. We introduce a training procedure to discover these\nhigh-accuracy pathways between modes. Inspired by this new geometric insight,\nwe also propose a new ensembling method entitled Fast Geometric Ensembling\n(FGE). Using FGE we can train high-performing ensembles in the time required to\ntrain a single model. 
We achieve improved performance compared to the recent state-of-the-art Snapshot Ensembles, on CIFAR-10, CIFAR-100, and ImageNet.

1 Introduction

The loss surfaces of deep neural networks (DNNs) are highly non-convex and can depend on millions of parameters. The geometric properties of these loss surfaces are not well understood. Even for simple networks, the number of local optima and saddle points is large and can grow exponentially in the number of parameters [1, 2, 3]. Moreover, the loss is high along a line segment connecting two optima [e.g., 7, 13]. These two observations suggest that the local optima are isolated.

In this paper, we provide a new training procedure which can in fact find paths of near-constant accuracy between the modes of large deep neural networks. Furthermore, we show that for a wide range of architectures we can find these paths in the form of a simple polygonal chain of two line segments. Consider, for example, Figure 1, which illustrates the ResNet-164 ℓ2-regularized cross-entropy train loss on CIFAR-100, through three different planes. We form each two-dimensional plane by all affine combinations of three weight vectors.1 The left panel shows a plane defined by three independently trained networks. In this plane, all optima are isolated, which corresponds to the standard intuition. However, the middle and right panels show two different paths of near-constant loss between the modes in weight space, discovered by our proposed training procedure. The endpoints of these paths are the two independently trained DNNs corresponding to the two lower modes on the left panel.

* Equal contribution.
1 Suppose we have three weight vectors w1, w2, w3. We set u = (w2 − w1), v = (w3 − w1) − ⟨w3 − w1, w2 − w1⟩/‖w2 − w1‖² · (w2 − w1).
Then the normalized vectors û = u/‖u‖, v̂ = v/‖v‖ form an orthonormal basis in the plane containing w1, w2, w3. To visualize the loss in this plane, we define a Cartesian grid in the basis û, v̂ and evaluate the networks corresponding to each of the points in the grid. A point P with coordinates (x, y) in the plane would then be given by P = w1 + x · û + y · v̂.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: The ℓ2-regularized cross-entropy train loss surface of a ResNet-164 on CIFAR-100, as a function of network weights in a two-dimensional subspace. In each panel, the horizontal axis is fixed and is attached to the optima of two independently trained networks. The vertical axis changes between panels as we change planes (defined in the main text). Left: Three optima for independently trained networks. Middle and Right: A quadratic Bezier curve, and a polygonal chain with one bend, connecting the lower two optima on the left panel along a path of near-constant loss. Notice that in each panel a direct linear path between each mode would incur high loss.

We believe that this geometric discovery has major implications for research into multilayer networks, including (1) improving the efficiency, reliability, and accuracy of training, (2) creating better ensembles, and (3) deriving more effective posterior approximation families in Bayesian deep learning.
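As a concrete illustration, the plane construction of footnote 1 can be sketched as follows. This is a minimal NumPy sketch with our own function names (not from the paper's released code): Gram–Schmidt on u = w2 − w1 and w3 − w1, followed by mapping grid coordinates (x, y) back to weight vectors.

```python
import numpy as np

def plane_basis(w1, w2, w3):
    """Orthonormal basis (u_hat, v_hat) of the plane through the weight
    vectors w1, w2, w3, via Gram-Schmidt on u = w2 - w1 and w3 - w1."""
    u = w2 - w1
    v = (w3 - w1) - np.dot(w3 - w1, u) / np.dot(u, u) * u
    return u / np.linalg.norm(u), v / np.linalg.norm(v)

def point_in_plane(w1, u_hat, v_hat, x, y):
    """Weight vector at Cartesian coordinates (x, y) in the basis (u_hat, v_hat).
    Evaluating the network at these weights gives one point of the loss-surface plot."""
    return w1 + x * u_hat + y * v_hat
```

Evaluating the loss at `point_in_plane` over a Cartesian grid of (x, y) values produces surface plots like those in Figure 1.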
Indeed, in this paper we are inspired by this geometric insight to propose a new ensembling procedure that can efficiently discover multiple high-performing but diverse deep neural networks. In particular, our contributions include:

• The discovery that the local optima for modern deep neural networks are connected by very simple curves, such as a polygonal chain with only one bend.
• A new method that finds such paths between two local optima, such that the train loss and test error remain low along these paths.
• Using the proposed method we demonstrate that such mode connectivity holds for a wide range of modern deep neural networks, on key benchmarks such as CIFAR-100. We show that these paths correspond to meaningfully different representations that can be efficiently ensembled for increased accuracy.
• Inspired by these observations, we propose Fast Geometric Ensembling (FGE), which outperforms the recent state-of-the-art Snapshot Ensembles [11], on CIFAR-10 and CIFAR-100, using powerful deep neural networks such as VGG-16, Wide ResNet-28-10, and ResNet-164. On ImageNet we achieve a 0.56% top-1 error-rate improvement for a pretrained ResNet-50 model by running FGE for only 5 epochs.
• We release the code for reproducing the results in this paper at https://github.com/timgaripov/dnn-mode-connectivity

The rest of the paper is organized as follows. Section 2 discusses existing literature on DNN loss geometry and ensembling techniques. Section 3 introduces the proposed method to find the curves with low train loss and test error between local optima, which we investigate empirically in Section 4. Section 5 then introduces our proposed ensembling technique, FGE, which we empirically compare to the alternatives in Section 6.
Finally, in Section 7 we discuss connections to other fields and directions for future work.

Note that we interleave two sections where we make methodological proposals (Sections 3, 5) with two sections where we perform experiments (Sections 4, 6). Our key methodological proposal for ensembling, FGE, is in Section 5.

2 Related Work

Despite the success of deep learning across many application domains, the loss surfaces of deep neural networks are not well understood. These loss surfaces are an active area of research, which falls into two distinct categories.

The first category explores the local structure of minima found by SGD and its modifications. Researchers typically distinguish sharp and wide local minima, which are respectively found by using large and small mini-batch sizes during training. Hochreiter and Schmidhuber [10] and Keskar et al. [13], for example, claim that flat minima lead to strong generalization, while sharp minima deliver poor results on the test dataset. However, recently Dinh et al. [4] argue that most existing notions of flatness cannot directly explain generalization. To better understand the local structure of DNN loss minima, Li et al. [16] proposed a new visualization method for the loss surface near the minima found by SGD. Applying the method to a variety of different architectures, they showed that the loss surfaces of modern residual networks are seemingly smoother than those of VGG-like models.

The other major category of research considers global loss structure. One of the main questions in this area is how neural networks are able to overcome poor local optima.
Choromanska et al. [2] investigated the link between the loss function of a simple fully-connected network and the Hamiltonian of the spherical spin-glass model. Under strong simplifying assumptions they showed that the values of the investigated loss function at local optima are within a well-defined bound. In other research, Lee et al. [14] showed that under mild conditions gradient descent almost surely converges to a local minimizer and not a saddle point, starting from a random initialization.

In recent work, Freeman and Bruna [6] theoretically show that local minima of a neural network with one hidden layer and ReLU activations can be connected with a curve along which the loss is upper-bounded by a constant that depends on the number of parameters of the network and the "smoothness of the data". Their theoretical results do not readily generalize to multilayer networks. Using a dynamic programming approach they empirically construct a polygonal chain for a CNN on MNIST and an RNN on PTB next-word prediction. However, in more difficult settings such as AlexNet on CIFAR-10 their approach struggles to achieve even the modest test accuracy of 80%. Moreover, they do not consider ensembling.

By contrast, we propose a much simpler training procedure that can find near-constant accuracy polygonal chains with only one bend between optima, even on a range of modern state-of-the-art architectures. Inspired by properties of the loss function discovered by our procedure, we also propose a new state-of-the-art ensembling method that can be trained in the time required to train a single DNN, with compelling performance on many key benchmarks (e.g., 96.4% accuracy on CIFAR-10).

Draxler et al. [5] simultaneously and independently discovered the existence of curves connecting local optima in DNN loss landscapes.
To find these curves they used a different approach, inspired by the Nudged Elastic Band method [12] from quantum chemistry.

Xie et al. [21] proposed a related ensembling approach that gathers outputs of neural networks from different epochs at the end of training to stabilize final predictions. More recently, Huang et al. [11] proposed snapshot ensembles, which use a cosine cyclical learning rate [17] to save "snapshots" of the model during training at times when the learning rate achieves its minimum. In our experiments, we compare our geometrically inspired approach to Huang et al. [11], showing improved performance.

3 Finding Paths between Modes

We describe a new method to minimize the training error along a path that connects two points in the space of DNN weights. Section 3.1 introduces this general procedure for arbitrary parametric curves, and Section 3.2 describes polygonal chains and Bezier curves as two example parametrizations of such curves. In the supplementary material, we discuss the computational complexity of the proposed approach and how to apply batch normalization at test time to points on these curves. We note that after the curve finding experiments in Section 4, we make our key methodological proposal for ensembling in Section 5.

3.1 Connection Procedure

Let ŵ1 and ŵ2 in R^|net| be two sets of weights corresponding to two neural networks independently trained by minimizing any user-specified loss L(w), such as the cross-entropy loss. Here, |net| is the number of weights of the DNN.
Moreover, let φθ : [0, 1] → R^|net| be a continuous piecewise-smooth parametric curve, with parameters θ, such that φθ(0) = ŵ1, φθ(1) = ŵ2.

To find a path of high accuracy between ŵ1 and ŵ2, we propose to find the parameters θ that minimize the expectation over a uniform distribution on the curve, ℓ̂(θ):

ℓ̂(θ) = ∫ L(φθ) dφθ / ∫ dφθ = ∫₀¹ L(φθ(t)) ‖φ′θ(t)‖ dt / ∫₀¹ ‖φ′θ(t)‖ dt = ∫₀¹ L(φθ(t)) qθ(t) dt = E_{t∼qθ(t)} [L(φθ(t))],   (1)

where the distribution qθ(t) on t ∈ [0, 1] is defined as qθ(t) = ‖φ′θ(t)‖ · (∫₀¹ ‖φ′θ(t)‖ dt)⁻¹. The numerator of (1) is the line integral of the loss L on the curve, and the denominator ∫₀¹ ‖φ′θ(t)‖ dt is the normalizing constant of the uniform distribution on the curve defined by φθ(·). Stochastic gradients of ℓ̂(θ) in Eq. (1) are generally intractable, since qθ(t) depends on θ. Therefore we also propose a more computationally tractable loss

ℓ(θ) = ∫₀¹ L(φθ(t)) dt = E_{t∼U(0,1)} L(φθ(t)),   (2)

where U(0, 1) is the uniform distribution on [0, 1]. The difference between (1) and (2) is that the latter is an expectation of the loss L(φθ(t)) with respect to a uniform distribution on t ∈ [0, 1], while (1) is an expectation with respect to a uniform distribution on the curve.
The two losses coincide, for example, when φθ(·) defines a polygonal chain with two line segments of equal length and the parametrization of each of the two segments is linear in t.

To minimize (2), at each iteration we sample t̃ from the uniform distribution U(0, 1) and make a gradient step for θ with respect to the loss L(φθ(t̃)). In this way we obtain unbiased estimates of the gradients of ℓ(θ), as

∇θ L(φθ(t̃)) ≈ E_{t∼U(0,1)} ∇θ L(φθ(t)) = ∇θ E_{t∼U(0,1)} L(φθ(t)) = ∇θ ℓ(θ).

We repeat these updates until convergence.

3.2 Example Parametrizations

Polygonal chain. The simplest parametric curve we consider is the polygonal chain (see Figure 1, right). The trained networks ŵ1 and ŵ2 serve as the endpoints of the chain, and the bends of the chain are the parameters θ of the curve parametrization. Consider the simplest case of a chain with one bend θ. Then

φθ(t) = 2 (t θ + (0.5 − t) ŵ1),   0 ≤ t ≤ 0.5,
φθ(t) = 2 ((t − 0.5) ŵ2 + (1 − t) θ),   0.5 ≤ t ≤ 1.

Bezier curve. A Bezier curve (see Figure 1, middle) provides a convenient parametrization of smooth paths with given endpoints. A quadratic Bezier curve φθ(t) with endpoints ŵ1 and ŵ2 is given by

φθ(t) = (1 − t)² ŵ1 + 2t(1 − t) θ + t² ŵ2,   0 ≤ t ≤ 1.

These formulas naturally generalize to n bends θ = {w1, w2, . . . , wn} (see supplement).

4 Curve Finding Experiments

We show that the proposed training procedure in Section 3 does indeed find high-accuracy paths connecting different modes, across a range of architectures and datasets.
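For concreteness, the two parametrizations of Section 3.2 and a single stochastic update of the bend can be sketched as follows. This is a minimal NumPy sketch with our own helper names; it exploits the fact that, for these curves, φθ is linear in θ, so ∇θ L(φθ(t)) is the weight-space gradient scaled by the scalar dφθ/dθ.

```python
import numpy as np

def polychain(t, w1, w2, theta):
    """Polygonal chain with one bend theta: phi(0) = w1, phi(0.5) = theta, phi(1) = w2."""
    if t <= 0.5:
        return 2 * (t * theta + (0.5 - t) * w1)
    return 2 * ((t - 0.5) * w2 + (1 - t) * theta)

def bezier(t, w1, w2, theta):
    """Quadratic Bezier curve with endpoints w1, w2 and control point theta."""
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

def polychain_step(theta, w1, w2, grad_loss, lr, rng):
    """One update for loss (2): sample t ~ U(0, 1), then step on L(phi(t)).
    For the chain, d phi / d theta is the scalar 2t (t <= 0.5) or 2(1 - t)."""
    t = rng.uniform()
    k = 2 * t if t <= 0.5 else 2 * (1 - t)
    return theta - lr * k * grad_loss(polychain(t, w1, w2, theta))
```

In practice `grad_loss` would be the backpropagated gradient of the training loss at the sampled point on the curve; repeating `polychain_step` until convergence implements the procedure of Section 3.1 for this parametrization.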
Moreover, we further investigate the properties of these curves, showing that they correspond to meaningfully different representations that can be ensembled for improved accuracy. We use these insights to propose an improved ensembling procedure in Section 5, which we empirically validate in Section 6.

In particular, we test VGG-16 [19], a 28-layer Wide ResNet with widening factor 10 [22] and a 158-layer ResNet [9] on CIFAR-10, and VGG-16 and a 164-layer ResNet-bottleneck [9] on CIFAR-100. For CIFAR-10 and CIFAR-100 we use the same standard data augmentation as Huang et al. [11].

Figure 2: The ℓ2-regularized cross-entropy train loss (left) and test error (middle) as a function of the point on the curves φθ(t) found by the proposed method (ResNet-164 on CIFAR-100). Right: Error of the two-network ensemble consisting of the endpoint φθ(0) of the curve and the point φθ(t) on the curve (CIFAR-100, ResNet-164). "Segment" is a line segment connecting two modes found by SGD. "Polychain" is a polygonal chain connecting the same endpoints.

We provide additional results, including detailed experiments for fully connected and recurrent networks, in the supplement.

For each model and dataset we train two networks with different random initializations to find two modes. Then we use the proposed algorithm of Section 3 to find a path connecting these two modes in the weight space with a quadratic Bezier curve and a polygonal chain with one bend. We also connect the two modes with a line segment for comparison. In all experiments we optimize the loss (2), as for Bezier curves the gradient of the loss (1) is intractable, and for polygonal chains we found the loss (2) to be more stable.

Figures 1 and 2 show the results of the proposed mode connecting procedure for ResNet-164 on CIFAR-100. Here loss refers to the ℓ2-regularized cross-entropy loss.
For both the Bezier curve and the polygonal chain, the train loss (Figure 2, left) and test error (Figure 2, middle) are indeed nearly constant. In addition, we provide plots of train error and test loss in the supplementary material. In the supplement, we also include a comprehensive table summarizing all path-finding experiments on CIFAR-10 and CIFAR-100 for VGGs, ResNets and Wide ResNets, as well as fully connected networks and recurrent neural networks, which follow the same general trends. In the supplementary material we also show that the connecting curves can be found consistently as we vary the number of parameters in the network, although the ratio of the arclength of the curves to the length of the line segment connecting the same endpoints decreases with increasing parametrization. In the supplement, we also measure the losses (1) and (2) for all the curves we constructed, and find that the values of the two losses are very close, suggesting that the loss (2) is a good practical approximation to the loss (1).

The constant-error curves connecting two given networks discovered by the proposed method are not unique. We trained two different polygonal chains with the same endpoints and different random seeds using VGG-16 on CIFAR-10. We then measured the Euclidean distance between the turning points of these curves. For VGG-16 on CIFAR-10 this distance is equal to 29.6, and the distance between the endpoints is 50, showing that the curves are not unique.
In this instance, we expect the distance between turning points to be less than the distance between endpoints, since the locations of the turning points were initialized to the same value (the center of the line segment connecting the endpoints).

Although high-accuracy connecting curves can often be very simple, such as a polygonal chain with only one bend, we note that line segments directly connecting two modes generally incur high error. For VGG-16 on CIFAR-10 the test error goes up to 90% in the center of the segment. For ResNet-158 and Wide ResNet-28-10 the worst errors along direct line segments are still high, but relatively less so, at 80% and 66%, respectively. This finding suggests that the loss surfaces of state-of-the-art residual networks are indeed more regular than those of classical models like VGG, in accordance with the observations of Li et al. [16].

In this paper we focus on connecting pairs of networks trained using the same hyperparameters, but from different random initializations. Building upon our work, Gotmare et al. [8] have recently shown that our mode connectivity approach applies to pairs of networks trained with different batch sizes, optimizers, data augmentation strategies, weight decays, and learning rate schemes.

To motivate the ensembling procedure proposed in the next section, we now examine how far we need to move along a connecting curve to find a point that produces substantially different, but still useful, predictions. Let ŵ1 and ŵ2 be two distinct sets of weights corresponding to optima obtained by independently training a DNN two times. We have shown that there exists a path connecting ŵ1 and ŵ2 with high test accuracy.
Let φθ(t), t ∈ [0, 1], parametrize this path, with φθ(0) = ŵ1 and φθ(1) = ŵ2. We investigate the performance of an ensemble of two networks: the endpoint φθ(0) of the curve and a point φθ(t) on the curve corresponding to t ∈ [0, 1]. Figure 2 (right) shows the test error of this ensemble as a function of t, for a ResNet-164 on CIFAR-100. The test error starts decreasing at t ≈ 0.1, and for t ≥ 0.4 the error of the ensemble is already as low as the error of an ensemble of the two independently trained networks used as the endpoints of the curve. Thus even by moving away from the endpoint by a relatively small distance along the curve we can find a network that produces meaningfully different predictions from the network at the endpoint. This result also demonstrates that these curves do not exist only due to degenerate parametrizations of the network (such as rescaling on either side of a ReLU); instead, points along the curve correspond to meaningfully different representations of the data that can be ensembled for improved performance. In the supplementary material we show how to create trivially connecting curves that do not have this property.

5 Fast Geometric Ensembling

In this section, we introduce a practical ensembling procedure, Fast Geometric Ensembling (FGE), motivated by our observations about mode connectivity.

Figure 3: Left: Plot of the learning rate (Top), test error (Middle) and distance from the initial value ŵ (Bottom) as a function of iteration for FGE with Preactivation-ResNet-164 on CIFAR-100. Circles indicate the times when we save models for ensembling. Right: Ensemble performance of FGE and SSE (Snapshot Ensembles) as a function of training time, using ResNet-164 on CIFAR-100 (B = 150 epochs).
Crosses represent the performance of separate "snapshot" models, and diamonds show the performance of the ensembles constructed of all models available by the given time.

In the previous section, we considered ensembling along mode connecting curves. Suppose now we instead only have one set of weights ŵ corresponding to a mode of the loss. We cannot explicitly construct a path φθ(·) as before, but we know that multiple paths passing through ŵ exist, and it is thus possible to move away from ŵ in the weight space without increasing the loss. Further, we know that we can find diverse networks providing meaningfully different predictions by making relatively small steps in the weight space (see Figure 2, right).

Inspired by these observations, we propose the Fast Geometric Ensembling (FGE) method, which aims to find diverse networks with relatively small steps in the weight space, without leaving a region that corresponds to low test error. While inspired by mode connectivity, FGE does not rely on explicitly finding a connecting curve, and thus does not require pre-trained endpoints, and so can be trained in the time required to train a single network.

Let us describe Fast Geometric Ensembling. First, we initialize a copy of the network with weights w set equal to the weights of the trained network ŵ. Now, to force w to move away from ŵ without substantially decreasing the prediction accuracy, we adopt a cyclical learning rate schedule α(·) (see Figure 3, left), with the learning rate at iteration i = 1, 2, . . .
defined as

α(i) = (1 − 2t(i)) α1 + 2t(i) α2,   0 < t(i) ≤ 1/2,
α(i) = (2 − 2t(i)) α2 + (2t(i) − 1) α1,   1/2 < t(i) ≤ 1,

where t(i) = (mod(i − 1, c) + 1)/c, the learning rates satisfy α1 > α2, and the number of iterations in one cycle is given by the even number c. Here, by iteration we mean the processing of one mini-batch of data. We can train the network w using the standard ℓ2-regularized cross-entropy loss function (or any other loss that can be used for DNN training) with the proposed learning rate schedule for n iterations. In the middle of each learning rate cycle, when the learning rate reaches its minimum value α(i) = α2 (which corresponds to mod(i − 1, c) + 1 = c/2, t(i) = 1/2), we collect the checkpoints of the weights w. When the training is finished we ensemble the collected models. An outline of the algorithm is provided in the supplement.

Figure 3 (left) illustrates the adopted learning rate schedule. During the periods when the learning rate is large (close to α1), w explores the weight space, taking larger steps but sacrificing test error. When the learning rate is small (close to α2), w is in the exploitation phase, in which the steps become smaller and the test error goes down.
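The schedule and checkpoint rule above can be sketched directly (a minimal sketch with our own helper names; iterations are indexed from 1 as in the text, and c is the even cycle length in iterations):

```python
def fge_lr(i, c, alpha1, alpha2):
    """Triangular cyclical learning rate: alpha1 at the cycle boundaries,
    alpha2 in the middle of each cycle of (even) length c iterations."""
    t = ((i - 1) % c + 1) / c
    if t <= 0.5:
        return (1 - 2 * t) * alpha1 + 2 * t * alpha2
    return (2 - 2 * t) * alpha2 + (2 * t - 1) * alpha1

def is_checkpoint(i, c):
    """True at the middle of a cycle, when the learning rate bottoms out at alpha2
    and a model is saved for the ensemble."""
    return (i - 1) % c + 1 == c // 2
```

During training, one would set the optimizer's learning rate to `fge_lr(i, c, alpha1, alpha2)` before processing mini-batch i, and save a copy of the weights whenever `is_checkpoint(i, c)` holds.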
The cycle length is usually about 2 to 4 epochs, so that the method efficiently balances exploration and exploitation with relatively small steps in the weight space that are still sufficient to gather diverse and meaningful networks for the ensemble.

To find a good initialization ŵ for the proposed procedure, we first train the network with the standard learning rate schedule (the schedule used to train single DNN models) for about 80% of the time required to train a single model. After this pre-training is finished, we initialize FGE with ŵ and run the proposed fast ensembling algorithm for the remaining computational budget. In order to get more diverse samples, one can run the algorithm described above several times for a smaller number of iterations, initializing from different checkpoints saved during the training of ŵ, and then ensemble all of the models gathered across these runs.

Cyclical learning rates have also recently been considered in Smith and Topin [20] and Huang et al. [11]. Our proposed method is perhaps most closely related to Snapshot Ensembles [11], but has several distinctive features, inspired by our geometric insights. In particular, Snapshot Ensembles adopt cyclical learning rates with a cycle length on the scale of 20 to 40 epochs from the beginning of training, as they are trying to take large steps in the weight space. However, according to our analysis of the curves, it is sufficient to take relatively small steps in the weight space to obtain diverse networks, so we only employ cyclical learning rates with a small cycle length on the scale of 2 to 4 epochs, in the last stage of training. As illustrated in Figure 3 (left), the step sizes made by FGE between saving two models (that is, the Euclidean distance between the sets of weights of the corresponding models in the weight space) are on the scale of 7 for Preactivation-ResNet-164 on CIFAR-100.
For Snapshot Ensembles for the same model the distance between two snapshots is on the scale of 40. We also use a piecewise-linear cyclical learning rate schedule following Smith and Topin [20], as opposed to the cosine schedule in Snapshot Ensembles.

6 Fast Geometric Ensembling Experiments

Table 1: Error rates (%) on the CIFAR-100 and CIFAR-10 datasets for different ensembling techniques and training budgets. The best results for each dataset, architecture, and budget are bolded.

DNN (Budget)       method |        CIFAR-100          |        CIFAR-10
                          |  1B          2B     3B    |  1B           2B    3B
VGG-16 (200)       Ind    |  27.4 ± 0.1  25.28  24.45 |  6.75 ± 0.16  5.89  5.9
                   SSE    |  26.4 ± 0.1  25.16  24.69 |  6.57 ± 0.12  6.19  5.95
                   FGE    |  25.7 ± 0.1  24.11  23.54 |  6.48 ± 0.09  5.82  5.66
ResNet-164 (150)   Ind    |  21.5 ± 0.4  19.04  18.59 |  4.72 ± 0.1   4.1   3.77
                   SSE    |  20.9 ± 0.2  19.28  18.91 |  4.66 ± 0.02  4.37  4.3
                   FGE    |  20.2 ± 0.1  18.67  18.21 |  4.54 ± 0.05  4.21  3.98
WRN-28-10 (200)    Ind    |  19.2 ± 0.2  17.48  17.01 |  3.82 ± 0.1   3.4   3.31
                   SSE    |  17.9 ± 0.2  17.3   16.97 |  3.73 ± 0.04  3.54  3.55
                   FGE    |  17.7 ± 0.2  16.95  16.88 |  3.65 ± 0.1   3.38  3.52

In this section we compare the proposed Fast Geometric Ensembling (FGE) technique against ensembles of independently trained networks (Ind) and Snapshot Ensembles (SSE) [11], a recent state-of-the-art fast ensembling approach.

For the ensembling experiments we use a 164-layer Preactivation-ResNet in addition to the VGG-16 and Wide ResNet-28-10 models. Links to implementations of these models can be found in the supplement.

We compare the accuracy of each method as a function of computational budget. For each network architecture and dataset we denote the number of epochs required to train a single model as B.
For a kB budget, we run each of Ind, FGE and SSE k times from random initializations and ensemble the models gathered from the k runs. In our experiments we set B = 200 for the VGG-16 and Wide ResNet-28-10 (WRN-28-10) models, and B = 150 for ResNet-164, since 150 epochs is typically sufficient to train this model. We note that the runtime per epoch for FGE, SSE, and Ind is the same, and so the total computation associated with kB budgets is the same for all ensembling approaches.

For Ind, we use an initial learning rate of 0.1 for ResNet and Wide ResNet, and 0.05 for VGG. For FGE, with VGG we use cycle length c = 2 epochs, and a total of 22 models in the final ensemble. With ResNet and Wide ResNet we use c = 4 epochs, and the total number of models in the final ensemble is 12 for Wide ResNets and 6 for ResNets. For VGG we set the learning rates to α1 = 10⁻², α2 = 5 · 10⁻⁴; for ResNet and Wide ResNet models we set α1 = 5 · 10⁻², α2 = 5 · 10⁻⁴. For SSE, we followed Huang et al. [11] and varied the initial learning rate α0 and the number of snapshots per run M. We report the best results we achieved, which corresponded to α0 = 0.1, M = 4 for ResNet, α0 = 0.1, M = 5 for Wide ResNet, and α0 = 0.05, M = 5 for VGG. The total number of models in the FGE ensemble is constrained by network choice and computational budget. Further experimental details are in the supplement.

Table 1 summarizes the results of the experiments. In all conducted experiments FGE outperforms SSE, particularly as we increase the computational budget. The performance improvement against Ind is most noticeable for CIFAR-100. With a large number of classes, any two models are less likely to make the same predictions.
Moreover, there will be greater uncertainty over which representation one should use on CIFAR-100, since the number of classes is increased tenfold from CIFAR-10 but the number of training examples is held constant. Thus smart ensembling strategies will be especially important on this dataset. Indeed, in all experiments on CIFAR-100, FGE outperformed all other methods. On CIFAR-10, FGE consistently improved upon SSE for all budgets and architectures. FGE also improved upon Ind for all training budgets with VGG, but is closer in performance to Ind on CIFAR-10 when using ResNets.

Figure 3 (right) illustrates the results for Preactivation-ResNet-164 on CIFAR-100 for one and two training budgets. The training budget B is 150 epochs. Snapshot Ensembles use a cyclical learning rate from the beginning of training and gather models for the ensemble throughout training. To find a good initialization for FGE, we instead run standard independent training for the first 125 epochs. The whole ensemble is then gathered over the following 22 epochs (126–147) to fit in the budget of each of the two runs. During these 22 epochs FGE is able to gather networks diverse enough to outperform Snapshot Ensembles for both the 1B and 2B budgets.

Diversity of the predictions of the individual networks is crucial for ensembling performance [e.g., 15]. We note that the diversity of the networks averaged by FGE is lower than that of completely independently trained networks. Specifically, two independently trained ResNet-164 networks on CIFAR-100 make different predictions on 19.97% of test objects, while two networks from the same FGE run make different predictions on 14.57% of test objects. Further, the performance of the individual networks averaged by FGE is slightly lower than that of fully trained networks (e.g., 78.0% against 78.5% on CIFAR-100 for ResNet-164).
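The diversity figures above are pairwise disagreement rates between models. Assuming each model's test predictions are stored as an array of predicted class labels (the function name below is our own), such a rate can be computed as:

```python
import numpy as np

def disagreement_rate(labels_a, labels_b):
    """Fraction of test inputs on which two models predict different classes."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    # elementwise comparison yields a boolean array; its mean is the rate
    return float(np.mean(labels_a != labels_b))
```

For example, two label vectors differing on half of their entries give a rate of 0.5; on the test set this quantity is 0.1997 for the independently trained pair and 0.1457 for the FGE pair quoted above.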
However, for a given computational budget FGE can produce many more high-performing networks than independent training, leading to better ensembling performance (see Table 1).

6.1 ImageNet

ImageNet ILSVRC-2012 [18] is a large-scale dataset containing 1.2 million training images and 50,000 validation images divided into 1000 classes.

CIFAR-100 is the primary focus of our ensemble experiments. However, we also include ImageNet results for the proposed FGE procedure, using a ResNet-50 architecture. We used a pretrained model with a top-1 test error of 23.87% to initialize the FGE procedure. We then ran FGE for 5 epochs with a cycle length of 2 epochs and learning rates α1 = 10^-3, α2 = 10^-5. The top-1 test error of the final ensemble was 23.31%. Thus, in just 5 epochs we improved the accuracy of the model by 0.56% using FGE. The final ensemble contains 4 models (including the pretrained one). Despite the harder setting of only 5 epochs to construct an ensemble, FGE performs comparably to the best result reported by Huang et al. [11] on ImageNet, 23.33% error, which was also achieved using a ResNet-50.

7 Discussion and Future Work

We have shown that the optima of deep neural networks are connected by simple pathways, such as a polygonal chain with a single bend, with near-constant accuracy. We introduced a training procedure to find these pathways, for a user-specified choice of curve. Inspired by these insights, we proposed a practical new ensembling approach, Fast Geometric Ensembling, which achieves state-of-the-art results on CIFAR-10, CIFAR-100, and ImageNet.

There are many exciting future directions for this research. At a high level, we have shown that even though the loss surfaces of deep neural networks are very complex, there is relatively simple structure connecting different optima.
Indeed, we can now move towards thinking about valleys of low loss, rather than isolated modes.

These valleys could inspire new directions for approximate Bayesian inference, such as stochastic MCMC approaches that could jump along these bridges between modes, rather than getting stuck exploring a single mode. One could similarly derive new proposal distributions for variational inference, exploiting the flatness of these pathways. These geometric insights could also be used to accelerate the convergence, stability, and accuracy of optimization procedures such as SGD, by helping us understand the trajectories along which the optimizer moves, and by making it possible to develop procedures that search in more structured spaces of high accuracy. One could also use these paths to construct methods that are more robust to adversarial attacks: given an arbitrary collection of diverse models described by a high-accuracy curve, one can return the predictions of a different model for each query from an adversary. We can also use this new property to create better visualizations of DNN loss surfaces. Indeed, using the proposed training procedure, we were able to produce new types of visualizations showing the connectivity of modes, which are normally depicted as isolated. We could also continue to build on the training procedure proposed here, to find curves with particularly desirable properties, such as diversity of networks. Indeed, we could start to use entirely new loss functions, such as line and surface integrals of cross-entropy across structured regions of weight space.

Acknowledgements. Timur Garipov was supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001). Timur Garipov and Dmitrii Podoprikhin were supported by Samsung Research, Samsung Electronics.
Andrew Gordon Wilson and Pavel Izmailov were supported by Facebook Research and NSF IIS-1563887.

References

[1] Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems, pages 316–322, 1996.

[2] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

[3] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.

[4] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1019–1028, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/dinh17b.html.

[5] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1309–1318, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/draxler18a.html.

[6] C. Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. International Conference on Learning Representations, 2017.

[7] Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems.
International Conference on Learning Representations, 2015.

[8] Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. Using mode connectivity for loss landscape analysis. arXiv preprint arXiv:1806.06977, 2018.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[10] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

[11] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. International Conference on Learning Representations, 2017.

[12] Hannes Jónsson, Greg Mills, and Karsten W. Jacobsen. Nudged elastic band method for finding minimum energy paths of transitions. In Classical and Quantum Dynamics in Condensed Phase Simulations, 1998.

[13] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations, 2017.

[14] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.

[15] Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David Crandall, and Dhruv Batra. Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems, pages 2119–2127, 2016.

[16] Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.

[17] Ilya Loshchilov and Frank Hutter.
SGDR: Stochastic gradient descent with warm restarts. International Conference on Learning Representations, 2017.

[18] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[20] Leslie N. Smith and Nicholay Topin. Exploring loss function topology with cyclical learning rates. arXiv preprint arXiv:1702.04283, 2017.

[21] Jingjing Xie, Bing Xu, and Zhang Chuang. Horizontal and vertical ensemble with deep representation for classification. arXiv preprint arXiv:1306.2759, 2013.

[22] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.