{"title": "A Simple Baseline for Bayesian Uncertainty in Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 13153, "page_last": 13164, "abstract": "We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including variational inference, MC dropout, KFAC Laplace, and temperature scaling.", "full_text": "A Simple Baseline for Bayesian Uncertainty\n\nin Deep Learning\n\nWesley J. Maddox\u22171 Timur Garipov\u22172\n\nPavel Izmailov\u22171\n\nDmitry Vetrov2,3 Andrew Gordon Wilson1\n\n3 Samsung-HSE Laboratory, National Research University Higher School of Economics\n\n1 New York University\n\n2 Samsung AI Center Moscow\n\nAbstract\n\nWe propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose\napproach for uncertainty representation and calibration in deep learning. 
Stochastic\nWeight Averaging (SWA), which computes the \ufb01rst moment of stochastic gradient\ndescent (SGD) iterates with a modi\ufb01ed learning rate schedule, has recently been\nshown to improve generalization in deep learning. With SWAG, we \ufb01t a Gaussian\nusing the SWA solution as the \ufb01rst moment and a low rank plus diagonal covariance\nalso derived from the SGD iterates, forming an approximate posterior distribution\nover neural network weights; we then sample from this Gaussian distribution to\nperform Bayesian model averaging. We empirically \ufb01nd that SWAG approximates\nthe shape of the true posterior, in accordance with results describing the stationary\ndistribution of SGD iterates. Moreover, we demonstrate that SWAG performs\nwell on a wide variety of tasks, including out of sample detection, calibration,\nand transfer learning, in comparison to many popular alternatives including MC\ndropout, KFAC Laplace, SGLD, and temperature scaling.\n\n1\n\nIntroduction\n\nUltimately, machine learning models are used to make decisions. Representing uncertainty is crucial\nfor decision making. For example, in medical diagnoses and autonomous vehicles we want to protect\nagainst rare but costly mistakes. Deep learning models typically lack a representation of uncertainty,\nand provide overcon\ufb01dent and miscalibrated predictions [e.g., 21, 12].\nBayesian methods provide a natural probabilistic representation of uncertainty in deep learning [e.g.,\n3, 24, 5], and previously had been a gold standard for inference with neural networks [38]. 
However, existing approaches are often highly sensitive to hyperparameter choices, and hard to scale to modern datasets and architectures, which limits their general applicability in modern deep learning.\n\nIn this paper we propose a different approach to Bayesian deep learning: we use the information contained in the SGD trajectory to efficiently approximate the posterior distribution over the weights of the neural network. We find that a Gaussian distribution fitted to the first two moments of SGD iterates, with a modified learning rate schedule, captures the local geometry of the posterior surprisingly well. Using this Gaussian distribution we are able to obtain convenient, efficient, accurate and well-calibrated predictions in a broad range of tasks in computer vision. In particular, our contributions are the following:\n\n• In this work we propose SWAG (SWA-Gaussian), a scalable approximate Bayesian inference technique for deep learning. SWAG builds on Stochastic Weight Averaging [20], which computes an average of SGD iterates with a high constant learning rate schedule to provide improved generalization in deep learning, and on the interpretation of SGD as approximate Bayesian inference [34]. SWAG additionally computes a low-rank plus diagonal approximation to the covariance of the iterates, which is used together with the SWA mean to define a Gaussian posterior approximation over neural network weights.\n\n∗Equal contribution. Correspondence to wjm363 AT nyu.edu\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n• SWAG is motivated by the theoretical analysis of the stationary distribution of SGD iterates [e.g., 34, 6], which suggests that the SGD trajectory contains useful information about the geometry of the posterior. In Appendix 2 we show that the assumptions of Mandt et al.
[34]\ndo not hold for deep neural networks, due to non-convexity and over-parameterization (with\nfurther analysis in the supplementary material). However, we \ufb01nd in Section 4 that in the\nlow-dimensional subspace spanned by SGD iterates the shape of the posterior distribution is\napproximately Gaussian within a basin of attraction. Further, SWAG is able to capture the\ngeometry of this posterior remarkably well.\n\n\u2022 In an exhaustive empirical evaluation we show that SWAG can provide well-calibrated\nuncertainty estimates for neural networks across many settings in computer vision. In partic-\nular SWAG achieves higher test likelihood compared to many state-of-the-art approaches,\nincluding MC-Dropout [9], temperature scaling [12], SGLD [46], KFAC-Laplace [43] and\nSWA [20] on CIFAR-10, CIFAR-100 and ImageNet, on a range of architectures. We also\ndemonstrate the effectiveness of SWAG for out-of-domain detection, and transfer learning.\nWhile we primarily focus on image classi\ufb01cation, we show that SWAG can signi\ufb01cantly im-\nprove test perplexities of LSTM networks on language modeling problems, and in Appendix\n7 we also compare SWAG with Probabilistic Back-propagation (PBP) [16], Deterministic\nVariational Inference (DVI) [47], and Deep Gaussian Processes [4] on regression problems.\n\u2022 We release PyTorch code at https://github.com/wjmaddox/swa_gaussian.\n\n2 Related Work\n\n2.1 Bayesian Methods\n\nBayesian approaches represent uncertainty by placing a distribution over model parameters, and then\nmarginalizing these parameters to form a whole predictive distribution, in a procedure known as\nBayesian model averaging. In the late 1990s, Bayesian methods were the state-of-the-art approach to\nlearning with neural networks, through the seminal works of Neal [38] and MacKay [32]. 
However,\nmodern neural networks often contain millions of parameters, the posterior over these parameters\n(and thus the loss surface) is highly non-convex, and mini-batch approaches are often needed to\nmove to a space of good solutions [22]. For these reasons, Bayesian approaches have largely been\nintractable for modern neural networks. Here, we review several modern approaches to Bayesian\ndeep learning.\n\nMarkov chain Monte Carlo (MCMC) was at one time a gold standard for inference with neural\nnetworks, through the Hamiltonian Monte Carlo (HMC) work of Neal [38]. However, HMC requires\nfull gradients, which is computationally intractable for modern neural networks. To extend the HMC\nframework, stochastic gradient HMC (SGHMC) was introduced by Chen et al. [5] and allows for\nstochastic gradients to be used in Bayesian inference, crucial for both scalability and exploring a space\nof solutions that provide good generalization. Alternatively, stochastic gradient Langevin dynamics\n(SGLD) [46] uses \ufb01rst order Langevin dynamics in the stochastic gradient setting. Theoretically,\nboth SGHMC and SGLD asymptotically sample from the posterior in the limit of in\ufb01nitely small\nstep sizes. In practice, using \ufb01nite learning rates introduces approximation errors (see e.g. [34]), and\ntuning stochastic gradient MCMC methods can be quite dif\ufb01cult.\n\nVariational Inference: Graves [11] suggested \ufb01tting a Gaussian variational posterior approxima-\ntion over the weights of neural networks. This technique was generalized by Kingma and Welling\n[26] which proposed the reparameterization trick for training deep latent variable models; multiple\nvariational inference methods based on the reparameterization trick were proposed for DNNs [e.g.,\n25, 3, 36, 31]. 
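The reparameterization trick at the heart of these methods can be illustrated on a toy problem. The sketch below is not from the paper: it uses a two-parameter Gaussian likelihood with an isotropic Gaussian prior (so the exact posterior is known in closed form), and all names are hypothetical. It fits a mean-field Gaussian variational posterior with stochastic reparameterization gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([2.0, -1.0])      # toy "data": likelihood N(target | w, I), prior p(w) = N(0, I)

# Variational posterior q(w) = N(mu, diag(sigma^2)), parameterized by (mu, log_sigma).
mu, log_sigma = np.zeros(2), np.zeros(2)
lr, n_mc = 0.02, 8
for step in range(4000):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((n_mc, 2))
    w = mu + sigma * eps            # reparameterization: w is a differentiable function of (mu, log_sigma)
    g = w - target                  # d(-log likelihood)/dw for this Gaussian model
    # Reparameterization gradients of E_q[-log p(data | w)] plus closed-form KL(q || p) gradients.
    grad_mu = g.mean(axis=0) + mu                                  # dKL/dmu = mu
    grad_ls = (g * eps).mean(axis=0) * sigma + sigma**2 - 1.0      # dKL/dlog_sigma = sigma^2 - 1
    mu -= lr * grad_mu
    log_sigma -= lr * grad_ls
# The exact posterior is N(target/2, I/2): mu approaches [1.0, -0.5] and sigma^2 approaches 0.5.
```

For DNNs the same few-sample estimator is applied to mini-batch log likelihoods over millions of parameters, which is where the training difficulties noted next arise.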
While variational methods achieve strong performance for moderately sized networks, they are empirically noted to be difficult to train on larger architectures such as deep residual networks [15]; Blier and Ollivier [2] argue that this difficulty is explained by variational methods providing insufficient data compression for DNNs, despite being designed for data compression (minimum description length). Recent key advances [31, 47] in variational inference for deep learning typically focus on smaller-scale datasets and architectures. An alternative line of work re-interprets noisy versions of optimization algorithms, for example noisy Adam [23] and noisy KFAC [50], as approximate variational inference.\n\nDropout Variational Inference: Gal and Ghahramani [9] used a spike and slab variational distribution to view dropout at test time as approximate variational Bayesian inference. Concrete dropout [10] extends this idea to optimize the dropout probabilities as well. From a practical perspective, these approaches are quite appealing as they only require ensembling dropout predictions at test time, and they were successfully applied to several downstream tasks [21, 37].\n\nLaplace Approximations assume a Gaussian posterior, N(θ∗, I(θ∗)⁻¹), where θ∗ is a MAP estimate and I(θ∗)⁻¹ is the inverse of the Fisher information matrix (the expected value of the Hessian evaluated at θ∗). It was notably used for Bayesian neural networks in MacKay [33], where a diagonal approximation to the inverse of the Hessian was utilized for computational reasons. More recently, Kirkpatrick et al. [27] proposed using diagonal Laplace approximations to overcome catastrophic forgetting in deep learning. Ritter et al.
[43] proposed the use of either a diagonal or block Kronecker\nfactored (KFAC) approximation to the Hessian matrix for Laplace approximations, and Ritter et al.\n[42] successfully applied the KFAC approach to online learning scenarios.\n\n2.2 SGD Based Approximations\n\nMandt et al. [34] proposed to use the iterates of averaged SGD as an MCMC sampler, after analyzing\nthe dynamics of SGD using tools from stochastic calculus. From a frequentist perspective, Chen et al.\n[6] showed that under certain conditions a batch means estimator of the sample covariance matrix of\nthe SGD iterates converges to A = H(\u03b8)\u22121C(\u03b8)H(\u03b8)\u22121, where H(\u03b8)\u22121 is the inverse of the Hessian\nof the log likelihood and C(\u03b8) = E(\u2207 log p(\u03b8)\u2207 log p(\u03b8)T ) is the covariance of the gradients of the\nlog likelihood. Chen et al. [6] then show that using A and the sample average of the iterates for a\nGaussian approximation produces well calibrated con\ufb01dence intervals of the parameters and that the\nvariance of these estimators achieves the Cramer Rao lower bound (the minimum possible variance).\nA description of the asymptotic covariance of the SGD iterates dates back to Ruppert [44] and Polyak\nand Juditsky [41], who show asymptotic convergence of Polyak-Ruppert averaging.\n\n2.3 Methods for Calibration of DNNs\n\nLakshminarayanan et al. [29] proposed using ensembles of several networks for enhanced calibration,\nand incorporated an adversarial loss function to be used when possible as well. Outside of probabilistic\nneural networks, Guo et al. [12] proposed temperature scaling, a procedure which uses a validation set\nand a single hyperparameter to rescale the logits of DNN outputs for enhanced calibration. Kuleshov\net al. 
[28] propose calibrated regression using a similar rescaling technique.\n\n3 SWA-Gaussian for Bayesian Deep Learning\n\nIn this section we propose SWA-Gaussian (SWAG) for Bayesian model averaging and uncertainty estimation. In Section 3.2, we review stochastic weight averaging (SWA) [20], which we view as estimating the mean of the stationary distribution of SGD iterates. We then propose SWA-Gaussian in Sections 3.3 and 3.4 to estimate the covariance of the stationary distribution, forming a Gaussian approximation to the posterior over weight parameters. With SWAG, uncertainty in weight space is captured with minimal modifications to the SWA training procedure. We then present further theoretical and empirical analysis for SWAG in Section 4.\n\n3.1 Stochastic Gradient Descent (SGD)\n\nStandard training of deep neural networks (DNNs) proceeds by applying stochastic gradient descent on the model weights θ with the following update rule:\n\nΔθ_t = −η_t ( (1/B) Σ_{i=1}^{B} ∇_θ log p(y_i | f_θ(x_i)) − (1/N) ∇_θ log p(θ) ),\n\nwhere the learning rate is η, the i-th input (e.g. image) and label are {x_i, y_i}, the size of the whole training set is N, the size of the batch is B, and the DNN f has weight parameters θ.² The loss function is a negative log likelihood −Σ_i log p(y_i | f_θ(x_i)), combined with a regularizer log p(θ). This type of maximum likelihood training does not represent uncertainty in the predictions or parameters θ.\n\n3.2 Stochastic Weight Averaging (SWA)\n\nThe main idea of SWA [20] is to run SGD with a constant learning rate schedule starting from a pre-trained solution, and to average the weights of the models it traverses. Denoting the weights of the network obtained after epoch i of SWA training θ_i, the SWA solution after T epochs is given by θ_SWA = (1/T) Σ_{i=1}^{T} θ_i.
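The running average θ_SWA can be computed incrementally, one snapshot at a time, without storing the whole trajectory; a minimal sketch (hypothetical names, with small numpy arrays standing in for flattened network weights):

```python
import numpy as np

def swa_average(iterates):
    """Incremental SWA mean over a sequence of weight snapshots."""
    theta_swa = np.zeros_like(iterates[0])
    for n, theta in enumerate(iterates):            # n snapshots already averaged
        theta_swa = (n * theta_swa + theta) / (n + 1)
    return theta_swa

iterates = [np.array([1.0, 0.0]), np.array([3.0, 2.0]), np.array([2.0, 4.0])]
theta_swa = swa_average(iterates)                   # identical to (1/T) * sum_i theta_i
```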
A high constant learning rate schedule ensures that SGD explores the set of possible solutions instead of simply converging to a single point in the weight space. Izmailov et al. [20] argue that conventional SGD training converges to the boundary of the set of high-performing solutions; SWA on the other hand is able to find a more centered solution that is robust to the shift between train and test distributions, leading to improved generalization performance. SWA and related ideas have been successfully applied to a wide range of applications [see e.g. 1, 48, 49, 40]. A related but different procedure is Polyak-Ruppert averaging [41, 44] in stochastic convex optimization, which uses a learning rate decaying to zero. Mandt et al. [34] interpret Polyak-Ruppert averaging as a sampling procedure, with convergence occurring to the true posterior under certain strong conditions. Additionally, they explore the theoretical feasibility of SGD (and averaged SGD) as an approximate Bayesian inference scheme; we test their assumptions in Appendix 1.\n\n3.3 SWAG-Diagonal\n\nWe first consider a simple diagonal format for the covariance matrix. In order to fit a diagonal covariance approximation, we maintain a running average of the second uncentered moment for each weight, and then compute the covariance using the following standard identity at the end of training: θ̄² = (1/T) Σ_{i=1}^{T} θ_i², Σ_diag = diag(θ̄² − θ_SWA²); here the squares in θ_SWA² and θ_i² are applied elementwise. The resulting approximate posterior distribution is then N(θ_SWA, Σ_diag). In our experiments, we term this method SWAG-Diagonal.\n\nConstructing the SWAG-Diagonal posterior approximation requires storing two additional copies of DNN weights: θ_SWA and θ̄². Note that these models do not have to be stored on the GPU.
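The moment updates and the covariance identity described in this section can be sketched as follows (illustrative only, not the released PyTorch implementation; names are hypothetical):

```python
import numpy as np

class SwagDiagonal:
    """Running first and second uncentered moments for a SWAG-Diagonal posterior (sketch)."""
    def __init__(self, dim):
        self.mean = np.zeros(dim)      # running average of theta
        self.sq_mean = np.zeros(dim)   # running average of theta**2 (uncentered second moment)
        self.n = 0
    def update(self, theta):
        self.mean = (self.n * self.mean + theta) / (self.n + 1)
        self.sq_mean = (self.n * self.sq_mean + theta**2) / (self.n + 1)
        self.n += 1
    def sigma_diag(self):
        # standard identity Var = E[theta^2] - E[theta]^2, clamped at zero for numerical safety
        return np.clip(self.sq_mean - self.mean**2, 0.0, None)

post = SwagDiagonal(2)
for theta in [np.array([1.0, 2.0]), np.array([3.0, 2.0])]:
    post.update(theta)
# mean = [2, 2]; sq_mean = [5, 4]; diagonal variance = [1, 0]
```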
The additional computational complexity of constructing SWAG-Diagonal compared to standard training is negligible, as it only requires updating the running averages of weights once per epoch.\n\n3.4 SWAG: Low Rank plus Diagonal Covariance Structure\n\nWe now describe the full SWAG algorithm. While the diagonal covariance approximation is standard in Bayesian deep learning [3, 27], it can be too restrictive. We extend the idea of diagonal covariance approximations to utilize a more flexible low-rank plus diagonal posterior approximation. SWAG approximates the sample covariance Σ of the SGD iterates along with the mean θ_SWA.³\n\nNote that the sample covariance matrix of the SGD iterates can be written as a sum of outer products, Σ = (1/(T−1)) Σ_{i=1}^{T} (θ_i − θ_SWA)(θ_i − θ_SWA)⊤, and is of rank T. As we do not have access to the value of θ_SWA during training, we approximate the sample covariance with Σ ≈ (1/(T−1)) Σ_{i=1}^{T} (θ_i − θ̄_i)(θ_i − θ̄_i)⊤ = (1/(T−1)) D D⊤, where D is the deviation matrix comprised of columns D_i = (θ_i − θ̄_i), and θ̄_i is the running estimate of the parameters' mean obtained from the first i samples. To limit the rank of the estimated covariance matrix we only use the last K of the D_i vectors, corresponding to the last K\n\n²We ignore momentum for simplicity in this update; however we utilized momentum in the experiments and it is covered theoretically [34].\n\n³We note that stochastic gradient Monte Carlo methods [5, 46] also use the SGD trajectory to construct samples from the approximate posterior.
However, these methods are principally different from SWAG in that they (1) require adding Gaussian noise to the gradients, (2) decay the learning rate to zero, and (3) do not construct a closed-form approximation to the posterior distribution, which for instance enables SWAG to draw new samples with minimal overhead. We include comparisons to SGLD [46] in the Appendix.\n\nepochs of training. Here K is the rank of the resulting approximation and is a hyperparameter of the method. We define D̂ to be the matrix with columns equal to D_i for i = T − K + 1, . . . , T.\n\nWe then combine the resulting low-rank approximation Σ_low-rank = (1/(K−1)) D̂D̂⊤ with the diagonal approximation Σ_diag of Section 3.3. The resulting approximate posterior distribution is a Gaussian with the SWA mean θ_SWA and summed covariance: N(θ_SWA, (1/2)(Σ_diag + Σ_low-rank)).⁴ In our experiments, we term this method SWAG. Computing this approximate posterior distribution requires storing K vectors D_i of the same size as the model, as well as the vectors θ_SWA and θ̄². These models do not have to be stored on a GPU.\n\nTo sample from SWAG we use the following identity:\n\nθ̃ = θ_SWA + (1/√2) Σ_diag^{1/2} z_1 + (1/√(2(K−1))) D̂ z_2, where z_1 ∼ N(0, I_d), z_2 ∼ N(0, I_K).   (1)\n\nHere d is the number of parameters in the network. Note that Σ_diag is diagonal, and the product Σ_diag^{1/2} z_1 can be computed in O(d) time. The product D̂ z_2 can be computed in O(Kd) time.\n\nRelated methods for estimating the covariance of SGD iterates were considered in Mandt et al. [34] and Chen et al. [6], but these store the full-rank covariance Σ and thus scale quadratically in the number of parameters, which is prohibitively expensive for deep learning applications.
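Sampling via the identity above can be sketched as follows (hypothetical names; the empirical covariance of many samples can be checked against (1/2)Σ_diag + D̂D̂⊤/(2(K−1))):

```python
import numpy as np

def sample_swag(theta_swa, sigma_diag, D_hat, rng):
    """Draw one weight sample from the SWAG Gaussian via Eq. (1) (sketch)."""
    d = theta_swa.shape[0]
    K = D_hat.shape[1]
    z1 = rng.standard_normal(d)          # diagonal part: O(d)
    z2 = rng.standard_normal(K)          # low-rank part: O(Kd)
    return (theta_swa
            + np.sqrt(sigma_diag) * z1 / np.sqrt(2.0)
            + D_hat @ z2 / np.sqrt(2.0 * (K - 1)))

rng = np.random.default_rng(0)
theta_swa = np.zeros(4)
sigma_diag = np.ones(4)
D_hat = rng.standard_normal((4, 3))      # K = 3 stored deviation columns, d = 4 parameters
samples = np.stack([sample_swag(theta_swa, sigma_diag, D_hat, rng) for _ in range(50000)])
# empirical covariance approaches 0.5 * diag(sigma_diag) + D_hat @ D_hat.T / (2 * (K - 1))
```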
We additionally note that using the deviation matrix for online covariance matrix estimation comes from viewing the online updates used in Dasgupta and Hsu [8] in matrix fashion.\n\nThe full Bayesian model averaging procedure is given in Algorithm 1. As in Izmailov et al. [20] (SWA), we update the batch normalization statistics after sampling weights for models that use batch normalization [18]; we investigate the necessity of this update in Appendix 4.4.\n\nAlgorithm 1 Bayesian Model Averaging with SWAG\n\nInput: θ_0: pretrained weights; η: learning rate; T: number of steps; c: moment update frequency; K: maximum number of columns in deviation matrix; S: number of samples in Bayesian model averaging\n\nTrain SWAG:\n  θ̄ ← θ_0, θ̄² ← θ_0²  {Initialize moments}\n  for i ← 1, 2, . . . , T do\n    θ_i ← θ_{i−1} − η∇_θL(θ_{i−1})  {Perform SGD update}\n    if MOD(i, c) = 0 then\n      n ← i/c  {Number of models}\n      θ̄ ← (nθ̄ + θ_i)/(n + 1), θ̄² ← (nθ̄² + θ_i²)/(n + 1)  {Moments}\n      if NUM_COLS(D̂) = K then REMOVE_COL(D̂[:, 1])\n      APPEND_COL(D̂, θ_i − θ̄)  {Store deviation}\n  return θ_SWA = θ̄, Σ_diag = θ̄² − (θ̄)², D̂\n\nTest Bayesian Model Averaging:\n  for i ← 1, 2, . . . , S do\n    Draw θ̃_i ∼ N(θ_SWA, (1/2)Σ_diag + D̂D̂⊤/(2(K − 1)))  (Eq. 1)\n    Update batch norm statistics with new sample\n    p(y∗|Data) += (1/S) p(y∗|θ̃_i)\n  return p(y∗|Data)\n\n3.5 Bayesian Model Averaging with SWAG\n\nMaximum a-posteriori (MAP) optimization is a procedure whereby one maximizes the (log) posterior with respect to parameters θ: log p(θ|D) = log p(D|θ) + log p(θ) + const. Here, the prior p(θ) is viewed as a regularizer in optimization.
However, MAP is not Bayesian inference, since one only considers a single setting of the parameters θ̂_MAP = argmax_θ p(θ|D) in making predictions, forming p(y∗|θ̂_MAP, x∗), where x∗ and y∗ are test inputs and outputs.\n\nA Bayesian procedure instead marginalizes the posterior distribution over θ, in a Bayesian model average, for the unconditional predictive distribution: p(y∗|D, x∗) = ∫ p(y∗|θ, x∗) p(θ|D) dθ. In practice, this integral is computed through a Monte Carlo sampling procedure:\n\np(y∗|D, x∗) ≈ (1/T) Σ_{t=1}^{T} p(y∗|θ_t, x∗), θ_t ∼ p(θ|D).\n\nWe emphasize that in this paper we are approximating fully Bayesian inference, rather than MAP optimization. We develop a Gaussian approximation to the posterior from SGD iterates, p(θ|D) ≈ N(θ; μ, Σ), and then sample from this posterior distribution to perform a Bayesian model average. In our procedure, optimization with different regularizers, to characterize the Gaussian posterior approximation, corresponds to approximate Bayesian inference with different priors p(θ).\n\n⁴We use one half as the scale here because both the diagonal and low rank terms include the variance of the weights. We tested several other scales in Appendix 4.\n\nFigure 1: Left: Posterior joint density cross-sections along the rays corresponding to different eigenvectors of the SWAG covariance matrix. Middle: Posterior joint density surface in the plane spanned by the eigenvectors of the SWAG covariance matrix corresponding to the first and second largest eigenvalues and (Right:) the third and fourth largest eigenvalues. All plots are produced using PreResNet-164 on CIFAR-100. The SWAG distribution projected onto these directions fits the geometry of the posterior density remarkably well.\n\nPrior Choice: Typically, weight decay is used to regularize DNNs, corresponding to explicit L2 regularization when SGD without momentum is used to train the model. When SGD is used with momentum, as is typically the case, implicit regularization still occurs, producing a vague prior on the weights of the DNN in our procedure. This regularizer can be given an explicit Gaussian-like form (see Proposition 3 of Loshchilov and Hutter [30]), corresponding to a prior distribution on the weights.\n\nThus, SWAG is an approximate Bayesian inference algorithm in our experiments (see Section 5), and can be applied to most DNNs without any modifications of the training procedure (as long as SGD is used with weight decay or explicit L2 regularization). Alternative regularization techniques could also be used, producing different priors on the weights.
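The Monte Carlo model average of Section 3.5 can be sketched as follows (a toy stand-in for a network; all names are hypothetical):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bma_predict(weight_samples, logits_fn, x):
    """Monte Carlo Bayesian model average: average predictive *probabilities* over weight samples."""
    return np.mean([softmax(logits_fn(w, x)) for w in weight_samples], axis=0)

# Toy 3-class "network": logits are a linear function of the sampled weights.
logits_fn = lambda w, x: w * x
weight_samples = [np.array([1.0, 0.0, -1.0]), np.array([0.0, 1.0, 0.0])]
probs = bma_predict(weight_samples, logits_fn, x=2.0)   # a valid distribution blending both hypotheses
```

Averaging probabilities rather than logits is what the integral above prescribes; averaging logits instead would correspond to a different (geometric) ensemble.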
It may also be possible to similarly utilize Adam and other stochastic first-order methods, which we view as a promising direction for future work.\n\n4 Does the SGD Trajectory Capture Loss Geometry?\n\nTo analyze the quality of the SWAG approximation, we study the posterior density along the directions corresponding to the eigenvectors of the SWAG covariance matrix for PreResNet-164 on CIFAR-100. In order to find these eigenvectors we use randomized SVD [14].⁵ In the left panel of Figure 1 we visualize the ℓ2-regularized cross-entropy loss L(·) (equivalent, up to a constant, to the negative log joint density of the weights and the data with a Gaussian prior) as a function of the distance t from the SWA solution θ_SWA along the i-th eigenvector v_i of the SWAG covariance: φ(t) = L(θ_SWA + t · v_i/‖v_i‖). Figure 1 (left) shows a clear correlation between the variance of the SWAG approximation and the width of the posterior along the directions v_i. The SGD iterates indeed contain useful information about the shape of the posterior distribution, and SWAG is able to capture this information. We repeated the same experiment for SWAG-Diagonal, finding that there was almost no variance in these eigen-directions. Next, in Figure 1 (middle) we plot the posterior density surface in the two-dimensional plane in the weight space spanned by the top two eigenvectors v_1 and v_2 of the SWAG covariance: ψ(t_1, t_2) = L(θ_SWA + t_1 · v_1/‖v_1‖ + t_2 · v_2/‖v_2‖). Again, SWAG is able to capture the geometry of the posterior. The contours of constant posterior density appear remarkably well aligned with the eigenvectors of the SWAG covariance. We also present the analogous plot for the third and fourth top eigenvectors in Figure 1 (right).
In Appendix 3, we additionally present similar results for PreResNet-164 on CIFAR-10 and VGG-16 on CIFAR-100.\n\nAs we can see, SWAG is able to capture the geometry of the posterior in the subspace spanned by SGD iterates. However, the dimensionality of this subspace is very low compared to the dimensionality of the weight space, and we cannot guarantee that SWAG variance estimates are adequate along all directions in weight space. In particular, we would expect SWAG to under-estimate the variances along random directions, as the SGD trajectory is in a low-dimensional subspace of the weight space, and a random vector has a close-to-zero projection on this subspace with high probability. In Appendix 1 we visualize the trajectory of SGD applied to a quadratic function, and further discuss the relation between the geometry of the objective and the SGD trajectory.\n\n⁵From sklearn.decomposition.TruncatedSVD.\n\nFigure 2: Negative log likelihoods for SWAG and baselines. Mean and standard deviation (shown with error bars) over 3 runs are reported for each experiment on CIFAR datasets. SWAG (blue star) consistently outperforms alternatives, with lower negative log likelihood, with the largest improvements on transfer learning. Temperature scaling applied on top of SWA (SWA-Temp) often performs nearly as well on the non-transfer learning tasks, but requires a validation set.
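The eigenvectors used in this analysis can be obtained without ever forming the d×d covariance matrix, e.g. via an SVD of the deviation matrix; a sketch (hypothetical names, consistent with the truncated SVD mentioned in the footnote):

```python
import numpy as np

def swag_cov_eigh(D_hat):
    """Eigenpairs of the low-rank covariance D_hat @ D_hat.T / (K - 1)
    without forming the d x d matrix (sketch)."""
    K = D_hat.shape[1]
    U, s, _ = np.linalg.svd(D_hat, full_matrices=False)   # D_hat = U s V^T
    eigvals = s**2 / (K - 1)          # eigenvalues of D D^T / (K - 1) are s_i^2 / (K - 1)
    return eigvals, U                 # columns of U are the eigenvectors v_i, largest first

rng = np.random.default_rng(0)
D_hat = rng.standard_normal((1000, 20))                   # d = 1000 parameters, K = 20 columns
eigvals, eigvecs = swag_cov_eigh(D_hat)
cov_times_v = D_hat @ (D_hat.T @ eigvecs[:, 0]) / 19      # apply the covariance in O(Kd)
# cov_times_v matches eigvals[0] * eigvecs[:, 0]
```

Applying the covariance as D̂(D̂⊤v)/(K−1) keeps every operation O(Kd), so the analysis scales to large networks.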
In Appendices 1 and 2, we also\nempirically test the assumptions behind theory relating the SGD stationary distribution to the true\nposterior for neural networks.\n\n5 Experiments\n\nWe conduct a thorough empirical evaluation of SWAG, comparing to a range of high performing\nbaselines, including MC dropout [9], temperature scaling [12], SGLD [46], Laplace approximations\n[43], deep ensembles [29], and ensembles of SGD iterates that were used to construct the SWAG\napproximation. In Section 5.1 we evaluate SWAG predictions and uncertainty estimates on image\nclassi\ufb01cation tasks. We also evaluate SWAG for transfer learning and out-of-domain data detection.\nWe investigate the effect of hyperparameter choices and practical limitations in SWAG, such as the\neffect of learning rate on the scale of uncertainty, in Appendix 4.\n\n5.1 Calibration and Uncertainty Estimation on Image Classi\ufb01cation Tasks\n\nIn this section we evaluate the quality of uncertainty estimates as well as predictive accuracy for\nSWAG and SWAG-Diagonal on CIFAR-10, CIFAR-100 and ImageNet ILSVRC-2012 [45].\nFor all methods we analyze test negative log-likelihood, which re\ufb02ects both the accuracy and the\nquality of predictive uncertainty. Following Guo et al. [12] we also consider a variant of reliability\ndiagrams to evaluate the calibration of uncertainty estimates (see Figure 3) and to show the difference\nbetween a method\u2019s con\ufb01dence in its predictions and its accuracy. To produce this plot for a given\nmethod we split the test data into 20 bins uniformly based on the con\ufb01dence of a method (maximum\npredicted probability). We then evaluate the accuracy and mean con\ufb01dence of the method on the\nimages from each bin, and plot the difference between con\ufb01dence and accuracy. For a well-calibrated\nmodel, this difference should be close to zero for each bin. 
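The binning procedure just described can be sketched as follows (a sketch reading "uniformly" as equal-count bins; names are hypothetical). A simulated perfectly calibrated model should produce near-zero gaps:

```python
import numpy as np

def reliability_gaps(confidences, correct, n_bins=20):
    """Per-bin (mean confidence - accuracy), splitting test points into equal-count bins by confidence."""
    order = np.argsort(confidences)
    gaps = []
    for b in np.array_split(order, n_bins):
        gaps.append(confidences[b].mean() - correct[b].mean())
    return np.array(gaps)             # close to zero in every bin for a calibrated model

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=20000)                    # simulated max softmax outputs
correct = (rng.uniform(size=20000) < conf).astype(float)    # accuracy matches confidence by construction
gaps = reliability_gaps(conf, correct)
```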
We found that this procedure gives a more effective visualization of the actual confidence distribution of DNN predictions than the standard reliability diagrams used in Guo et al. [12] and Niculescu-Mizil and Caruana [39].\n\nWe provide tables containing the test accuracy, negative log likelihood and expected calibration error for all methods and datasets in Appendix 5.3.\n\nCIFAR datasets: On CIFAR datasets we run experiments with VGG-16, PreResNet-164 and WideResNet-28x10 networks. In order to compare SWAG with existing alternatives we report the results for standard SGD and SWA [20] solutions (single models), MC-Dropout [9], temperature scaling [12] applied to SWA and SGD solutions, SGLD [46], and K-FAC Laplace [43] methods. For all the methods we use our implementations in PyTorch (see Appendix 8). We train all networks for 300 epochs, starting to collect models for SWA and SWAG approximations once per epoch after epoch 160. For SWAG, K-FAC Laplace, and Dropout we use 30 samples at test time.\n\nFigure 3: Reliability diagrams for WideResNet28x10 on CIFAR-100 and transfer task; ResNet-152 and DenseNet-161 on ImageNet. Confidence is the value of the max softmax output. A perfectly calibrated network has no difference between confidence and accuracy, represented by a dashed black line.
Points below this line correspond to under-confident predictions, whereas points above the line are overconfident predictions. SWAG is able to substantially improve calibration over standard training (SGD), as well as SWA. Additionally, SWAG significantly outperforms temperature scaling for transfer learning (CIFAR-10 to STL-10), where the target data are not from the same distribution as the training data.

ImageNet  On ImageNet we report our results for SWAG, SWAG-Diagonal, SWA and SGD. We run experiments with DenseNet-161 [17] and ResNet-152 [15]. For each model we start from a pre-trained model available in the torchvision package, and run SGD with a constant learning rate for 10 epochs. We collect models for the SWAG versions and SWA 4 times per epoch. For SWAG we use 30 samples from the posterior over network weights at test time, and use a randomly sampled 10% of the training data to update the batch-normalization statistics for each of the samples. For SGD with temperature scaling, we use the results reported in Guo et al. [12].

Transfer from CIFAR-10 to STL-10  We use the models trained on CIFAR-10 and evaluate them on STL-10 [7]. STL-10 has a similar set of classes to CIFAR-10, but the image distribution is different, so adapting a model from CIFAR-10 to STL-10 is a commonly used transfer learning benchmark. We provide further details on the architectures and hyperparameters in Appendix 8.

Results  We visualize the negative log-likelihood for all methods and datasets in Figure 2. On all considered tasks SWAG and SWAG-Diagonal perform comparably to or better than all the considered alternatives, with SWAG best overall. We note that the combination of SWA and temperature scaling presents a competitive baseline.
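Temperature scaling [12] fits a single scalar T > 0 on a validation set and divides the logits by it before the softmax; a minimal sketch under our own naming (the paper uses Guo et al.'s reported results rather than this exact code, and we use a simple grid search in place of their optimizer):

```python
import numpy as np

def nll(logits, labels, T):
    """Average negative log-likelihood of labels under softmax(logits / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)       # subtract row max for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    """Grid-search the single scalar T > 0 minimizing validation NLL."""
    grid = np.linspace(0.1, 10.0, 200)
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))
```

For an overconfident network the fitted T exceeds 1, softening the predictive distribution; note that the fit requires a labeled held-out set, and a single global T cannot adapt when the test distribution shifts.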
However, unlike SWAG it requires using a validation set to tune the temperature; further, temperature scaling is not effective when the test data distribution differs from the training distribution, as we observe in experiments on transfer learning from CIFAR-10 to STL-10.
Next, we analyze the calibration of the uncertainty estimates provided by different methods. In Figure 3 we present reliability plots for WideResNet on CIFAR-100, and for DenseNet-161 and ResNet-152 on ImageNet. The reliability diagrams for all other datasets and architectures are presented in Appendix 5.1. As we can see, SWAG and SWAG-Diagonal both achieve good calibration across the board. The low-rank plus diagonal version of SWAG is generally better calibrated than SWAG-Diagonal. We also present the expected calibration error for each of the methods, architectures and datasets in Tables A.2 and A.3. Finally, in Tables A.8 and A.9 we present the predictive accuracy for all of the methods, where SWAG is comparable with SWA and generally outperforms the other approaches.

5.2 Comparison to ensembling SGD solutions

We evaluated ensembles of independently trained SGD solutions (Deep Ensembles, [29]) with PreResNet-164 on CIFAR-100. We found that an ensemble of 3 SGD solutions has high accuracy (82.1%), but only achieves an NLL of 0.6922, which is worse than a single SWAG solution (0.6595 NLL). While the accuracy of this ensemble is high, SWAG solutions are much better calibrated. An ensemble of 5 SGD solutions achieves an NLL of 0.6478, which is competitive with a single SWAG solution, which requires 5× less computation to train. Moreover, we can similarly ensemble independently trained SWAG models; an ensemble of 3 SWAG models achieves an NLL of 0.6178.
We also evaluated ensembles of the SGD iterates that were used to construct the SWAG approximation (SGD-Ens) for all of our CIFAR models.
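Ensembling in these comparisons means averaging the members' predictive (softmax) distributions before scoring; a minimal sketch with hypothetical inputs (the same averaging applies whether the members are independent SGD solutions, SWAG posterior samples, or SGD iterates):

```python
import numpy as np

def ensemble_nll(member_probs, labels):
    """NLL of an ensemble's averaged predictive distribution.

    member_probs: (M, N, C) class probabilities from M ensemble members.
    labels: (N,) true class indices.
    """
    avg_probs = member_probs.mean(axis=0)      # mixture of the members' predictions
    return -np.log(avg_probs[np.arange(len(labels)), labels]).mean()
```

Averaging probabilities (rather than logits or weights) is what makes a confidently wrong member only partially penalized, which is why ensemble NLL can beat every individual member.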
SWAG has higher NLL than SGD-Ens on VGG-16, but much lower NLL on the larger PreResNet-164 and WideResNet28x10; the results for accuracy and ECE are analogous.

5.3 Out-of-Domain Image Detection

To evaluate SWAG on out-of-domain data detection we train a WideResNet as described in Section 5.1 on the data from five classes of the CIFAR-10 dataset, and then analyze the predictions of the SWAG variants along with the baselines on the full test set. We expect the predicted class probabilities for objects belonging to classes that were not present in the training data to have high entropy, reflecting the model's high uncertainty in its predictions, and considerably lower entropy on images that are similar to those on which the network was trained. We plot the histograms of predictive entropies on the in-domain and out-of-domain data in Figure A.7 for a qualitative comparison, and report the symmetrized KL divergence between the binned in-sample and out-of-sample distributions in Table 1, finding that SWAG and Dropout perform best on this measure. Additional details are in Appendix 5.2.

5.4 Language Modeling with LSTMs

We next apply SWAG to an LSTM network on language modeling tasks on the Penn Treebank and WikiText-2 datasets.
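Perplexity, the metric reported for these experiments, is the exponentiated average per-token negative log-likelihood; for reference (our own sketch):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from the natural-log probability of each test token.

    Equals exp of the average per-token negative log-likelihood; lower is better.
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that assigns every token probability 1/V has perplexity exactly V, so perplexity can be read as an effective vocabulary size of the model's uncertainty.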
In Appendix 6 we demonstrate that SWAG easily outperforms both SWA and NT-ASGD [35], a strong baseline for LSTM training, in terms of test and validation perplexities.
The main difference between SWA and NT-ASGD, which is also based on weight averaging, is that NT-ASGD starts weight averaging much earlier than SWA: NT-ASGD typically switches to ASGD (averaged SGD) around epoch 100, while with SWA we start averaging after pre-training for 500 epochs. We report test and validation perplexities for the different methods and datasets in Table 1.
As we can see, SWA substantially improves perplexities over NT-ASGD on both datasets. Further, we observe that SWAG is able to substantially improve test perplexities over the SWA solution.

Table 1: Validation and test perplexities for NT-ASGD, SWA and SWAG on the Penn Treebank and WikiText-2 datasets.

Method    PTB val   PTB test   WikiText-2 val   WikiText-2 test
NT-ASGD   61.2      58.8       68.7             65.6
SWA       59.1      56.7       68.1             65.0
SWAG      58.6      56.26      67.2             64.1

5.5 Regression

Finally, while the empirical focus of our paper is classification calibration, we also compare to additional approximate BNN inference methods which perform well on smaller architectures, including deterministic variational inference (DVI) [47], single-layer deep GPs (DGP) with expectation propagation [4], SGLD [46], and re-parameterization VI [26], on a set of UCI regression tasks. We report test log-likelihoods, RMSEs and test calibration results in Appendix Tables 11 and 12, where SWAG is competitive with these methods. Additional details are in Appendix 7.

6 Discussion

In this paper we developed SWA-Gaussian (SWAG) for approximate Bayesian inference in deep learning.
There has been a great desire to apply Bayesian methods in deep learning due to their theoretical properties and past success with small neural networks. We view SWAG as a step towards practical, scalable, and accurate Bayesian deep learning for large modern neural networks.
A key geometric observation in this paper is that the posterior distribution over neural network parameters is close to Gaussian in the subspace spanned by the trajectory of SGD. Our work shows Bayesian model averaging within this subspace can improve predictions over SGD or SWA solutions. Furthermore, Gur-Ari et al. [13] argue that the SGD trajectory lies in the subspace spanned by the eigenvectors of the Hessian corresponding to the top eigenvalues, implying that the SGD trajectory subspace corresponds to directions of rapid change in predictions. In recent work, Izmailov et al. [19] show promising results from directly constructing subspaces for Bayesian inference.

Acknowledgements

WM, PI, and AGW were supported by an Amazon Research Award, Facebook Research, NSF IIS-1563887, and NSF IIS-1910266. WM was additionally supported by an NSF Graduate Research Fellowship under Grant No. DGE-1650441. DV was supported by the Russian Science Foundation grant no. 19-71-30020. We would like to thank Jacob Gardner and Polina Kirichenko for helpful discussions.

References

[1] Athiwaratkun, B., Finzi, M., Izmailov, P., and Wilson, A. G. (2019). There are many consistent explanations for unlabeled data: why you should average. In International Conference on Learning Representations. arXiv: 1806.05594.

[2] Blier, L. and Ollivier, Y. (2018). The Description Length of Deep Learning Models. In Advances in Neural Information Processing Systems, page 11.

[3] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight Uncertainty in Neural Networks. In International Conference on Machine Learning.
arXiv: 1505.05424.

[4] Bui, T., Hernández-Lobato, D., Hernández-Lobato, J., Li, Y., and Turner, R. (2016). Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pages 1472–1481.

[5] Chen, T., Fox, E. B., and Guestrin, C. (2014). Stochastic Gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning. arXiv: 1402.4102.

[6] Chen, X., Lee, J. D., Tong, X. T., and Zhang, Y. (2016). Statistical Inference for Model Parameters in Stochastic Gradient Descent. arXiv: 1610.08637.

[7] Coates, A., Ng, A., and Lee, H. (2011). An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223.

[8] Dasgupta, S. and Hsu, D. (2007). On-Line Estimation with the Multivariate Gaussian Distribution. In Bshouty, N. H. and Gentile, C., editors, Twentieth Annual Conference on Learning Theory, volume 4539, pages 278–292, Berlin, Heidelberg. Springer Berlin Heidelberg.

[9] Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian Approximation. In International Conference on Machine Learning.

[10] Gal, Y., Hron, J., and Kendall, A. (2017). Concrete Dropout. In Advances in Neural Information Processing Systems. arXiv: 1705.07832.

[11] Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356.

[12] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. In International Conference on Machine Learning. arXiv: 1706.04599.

[13] Gur-Ari, G., Roberts, D. A., and Dyer, E. (2019). Gradient descent happens in a tiny subspace.

[14] Halko, N., Martinsson, P.-G., and Tropp, J. A. (2011).
Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.

[15] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In CVPR. arXiv: 1512.03385.

[16] Hernández-Lobato, J. M. and Adams, R. (2015). Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Advances in Neural Information Processing Systems.

[17] Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. (2017). Densely Connected Convolutional Networks. In CVPR. arXiv: 1608.06993.

[18] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

[19] Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. (2019). Subspace inference for Bayesian deep learning. arXiv preprint arXiv:1907.07504.

[20] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. Uncertainty in Artificial Intelligence (UAI).

[21] Kendall, A. and Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In Advances in Neural Information Processing Systems, Long Beach.

[22] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In International Conference on Learning Representations. arXiv: 1609.04836.

[23] Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. (2018). Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam. In International Conference on Machine Learning. arXiv: 1806.04854.

[24] Kingma, D. P., Salimans, T., and Welling, M. (2015a).
Variational Dropout and the Local Reparameterization Trick. arXiv preprint arXiv:1506.02557.

[25] Kingma, D. P., Salimans, T., and Welling, M. (2015b). Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583.

[26] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. In International Conference on Learning Representations.

[27] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, page 201611835.

[28] Kuleshov, V., Fenner, N., and Ermon, S. (2018). Accurate Uncertainties for Deep Learning Using Calibrated Regression. In International Conference on Machine Learning, page 9.

[29] Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Advances in Neural Information Processing Systems.

[30] Loshchilov, I. and Hutter, F. (2019). Decoupled Weight Decay Regularization. In International Conference on Learning Representations. arXiv: 1711.05101.

[31] Louizos, C. and Welling, M. (2017). Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning.

[32] MacKay, D. J. C. (1992a). Bayesian Interpolation. Neural Computation.

[33] MacKay, D. J. C. (1992b). A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4(3):448–472.

[34] Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic Gradient Descent as Approximate Bayesian Inference. JMLR, 18:1–35.

[35] Merity, S., Keskar, N. S., and Socher, R. (2017). Regularizing and optimizing LSTM language models.
arXiv preprint arXiv:1708.02182.

[36] Molchanov, D., Ashukha, A., and Vetrov, D. (2017). Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369.

[37] Mukhoti, J. and Gal, Y. (2018). Evaluating Bayesian Deep Learning Methods for Semantic Segmentation.

[38] Neal, R. M. (1996). Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer New York, New York, NY.

[39] Niculescu-Mizil, A. and Caruana, R. (2005). Predicting good probabilities with supervised learning. In International Conference on Machine Learning, pages 625–632, Bonn, Germany. ACM Press.

[40] Nikishin, E., Izmailov, P., Athiwaratkun, B., Podoprikhin, D., Garipov, T., Shvechikov, P., Vetrov, D., and Wilson, A. G. (2018). Improving stability in deep reinforcement learning with weight averaging.

[41] Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

[42] Ritter, H., Botev, A., and Barber, D. (2018a). Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting. In Advances in Neural Information Processing Systems. arXiv: 1805.07810.

[43] Ritter, H., Botev, A., and Barber, D. (2018b). A Scalable Laplace Approximation for Neural Networks. In International Conference on Learning Representations.

[44] Ruppert, D. (1988). Efficient Estimators from a Slowly Convergent Robbins-Monro Process. Technical Report 781, Cornell University, School of Operations Research and Industrial Engineering.

[45] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252. arXiv: 1409.0575.

[46] Welling, M. and Teh, Y. W. (2011).
Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.

[47] Wu, A., Nowozin, S., Meeds, E., Turner, R. E., Hernández-Lobato, J. M., and Gaunt, A. L. (2019). Fixing variational Bayes: Deterministic variational inference for Bayesian neural networks. In International Conference on Learning Representations. arXiv preprint arXiv:1810.03958.

[48] Yang, G., Zhang, T., Kirichenko, P., Bai, J., Wilson, A. G., and De Sa, C. (2019). SWALP: Stochastic weight averaging in low precision training. In International Conference on Machine Learning, pages 7015–7024.

[49] Yazici, Y., Foo, C.-S., Winkler, S., Yap, K.-H., Piliouras, G., and Chandrasekhar, V. (2019). The Unusual Effectiveness of Averaging in GAN Training. In International Conference on Learning Representations. arXiv: 1806.04498.

[50] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. (2017). Noisy Natural Gradient as Variational Inference. arXiv preprint arXiv:1712.02390.