{"title": "The promises and pitfalls of Stochastic Gradient Langevin Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 8268, "page_last": 8278, "abstract": "Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated spectacular successes in machine learning tasks. The current practice is to set the step size inversely proportional to N where N is the number of training samples. As N becomes large, we show that the SGLD algorithm has an invariant probability measure which significantly departs from the target posterior and behaves like as Stochastic Gradient Descent (SGD). This difference is inherently due to the high variance of the stochastic gradients. Several strategies have been suggested to reduce this effect; among them, SGLD Fixed Point (SGLDFP) uses carefully designed control variates to reduce the variance of the stochastic gradients. We show that SGLDFP gives approximate samples from the posterior distribution, with an accuracy comparable to the Langevin Monte Carlo (LMC) algorithm for a computational cost sublinear in the number of data points. We provide a detailed analysis of the Wasserstein distances between LMC, SGLD, SGLDFP and SGD and explicit expressions of the means and covariance matrices of their invariant distributions. Our findings are supported by limited numerical experiments.", "full_text": "The promises and pitfalls of Stochastic Gradient\n\nLangevin Dynamics\n\nNicolas Brosse, \u00c9ric Moulines\n\nCentre de Math\u00e9matiques Appliqu\u00e9es, UMR 7641,\n\nEcole Polytechnique, Palaiseau, France.\n\nnicolas.brosse@polytechnique.edu, eric.moulines@polytechnique.edu\n\nAlain Durmus\n\nEcole Normale Sup\u00e9rieure CMLA,\n\n61 Av. 
du Président Wilson, 94235 Cachan Cedex, France.
alain.durmus@cmla.ens-cachan.fr

Abstract

Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated successes in machine learning tasks. The current practice is to set the step size inversely proportional to N where N is the number of training samples. As N becomes large, we show that the SGLD algorithm has an invariant probability measure which significantly departs from the target posterior and behaves like Stochastic Gradient Descent (SGD). This difference is inherently due to the high variance of the stochastic gradients. Several strategies have been suggested to reduce this effect; among them, SGLD Fixed Point (SGLDFP) uses carefully designed control variates to reduce the variance of the stochastic gradients. We show that SGLDFP gives approximate samples from the posterior distribution, with an accuracy comparable to the Langevin Monte Carlo (LMC) algorithm for a computational cost sublinear in the number of data points. We provide a detailed analysis of the Wasserstein distances between LMC, SGLD, SGLDFP and SGD and explicit expressions of the means and covariance matrices of their invariant distributions. Our findings are supported by numerical experiments.

1 Introduction

Most MCMC algorithms have not been designed to process huge sample sizes, a typical setting in machine learning. As a result, many classical MCMC methods fail in this context, because the mixing time becomes prohibitively long and the cost per iteration increases proportionally to the number of training samples N. 
The computational cost of the standard Metropolis-Hastings algorithm comes from 1) the computation of the proposals and 2) the acceptance/rejection step. Several approaches to address these issues have recently been proposed in machine learning and computational statistics.

Among them, the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, introduced in [33], is a popular choice. This method is based on the Langevin Monte Carlo (LMC) algorithm proposed in [16, 17]. Standard versions of LMC require computing the gradient of the log-posterior at the current fit of the parameter, but avoid the accept/reject step. The LMC algorithm is a discretization of a continuous-time process, the overdamped Langevin diffusion, which leaves the target distribution π invariant. To further reduce the computational cost, SGLD uses unbiased estimators of the gradient of the log-posterior based on subsampling. This method has triggered a large number of works, among others [1, 21, 2, 6, 8, 12, 24, 13, 4], and has been successfully applied to a range of state-of-the-art machine learning problems [27, 23].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The properties of SGLD with decreasing step sizes have been studied in [31]. The two key findings of this work are that 1) the SGLD algorithm converges weakly to the target distribution π, and 2) the optimal rate of convergence to equilibrium scales as n−1/3, where n is the number of iterations, see [31, Section 5]. However, in most applications, constant rather than decreasing step sizes are used, see [1, 8, 18, 22, 30, 32]. A natural question for the practical design of SGLD is the choice of the minibatch size: it controls on the one hand the computational complexity of the algorithm per iteration, and on the other hand the variance of the gradient estimator. 
Non-asymptotic bounds in Wasserstein distance between the marginal distribution of the SGLD iterates and the target distribution π have been established in [10, 11]. These results highlight the cost of using stochastic gradients and show that, for a given precision ε in Wasserstein distance, the computational cost of the plain SGLD algorithm does not improve over the LMC algorithm; Nagapetyan et al. [25] also report similar results on the mean square error.

It has been suggested to use control variates to reduce the high variance of the stochastic gradients. For strongly log-concave models, Nagapetyan et al. [25] and Baker et al. [3] use the mode of the posterior distribution as a reference point and introduce the SGLDFP (Stochastic Gradient Langevin Dynamics Fixed Point) algorithm. They provide upper bounds on the mean square error and on the Wasserstein distance between the marginal distribution of the iterates of SGLDFP and the posterior distribution. In addition, they show that the overall cost remains sublinear in the number of individual data points, up to a preprocessing step. Other control-variate methodologies are provided for non-concave models in the form of SAGA-Langevin Dynamics and SVRG-Langevin Dynamics [13, 7], although a detailed analysis in Wasserstein distance of these algorithms is only available for strongly log-concave models [5].

In this paper, we provide further insights into the links between SGLD, SGLDFP, LMC and SGD (Stochastic Gradient Descent). In our analysis, the algorithms are used with a constant step size and the parameters are set to the standard values used in practice [1, 8, 18, 22, 30, 32]. The LMC, SGLD and SGLDFP algorithms define homogeneous Markov chains, each of which admits a unique stationary distribution used as a hopefully close proxy of π. 
The main contribution of this paper is to show that, while the invariant distributions of LMC and SGLDFP become closer to π as the number of data points increases, the invariant measure of SGLD, by contrast, never comes close to the target distribution π and is in fact very similar to the invariant measure of SGD.

In Section 3.1, we give upper bounds in Wasserstein distance of order 2 between the marginal distributions of the iterates of LMC and the Langevin diffusion, of SGLDFP and LMC, and of SGLD and SGD. We also provide a lower bound on the Wasserstein distance between the marginal distributions of the iterates of SGLDFP and SGLD. In Section 3.2, we compare the means and covariance matrices of the invariant distributions of LMC, SGLDFP and SGLD with those of the target distribution π. Our claims are supported by numerical experiments in Section 4.

2 Preliminaries

Denote by z = {z_i}_{i=1}^N the observations. We are interested in situations where the target distribution π arises as the posterior in a Bayesian inference problem with prior density π_0(θ) and a large number N ≫ 1 of i.i.d. observations z_i with likelihoods p(z_i|θ). In this case, π(θ) ∝ π_0(θ) ∏_{i=1}^N p(z_i|θ). We denote U_i(θ) = −log p(z_i|θ) for i ∈ {1, . . . , N}, U_0(θ) = −log π_0(θ) and U = Σ_{i=0}^N U_i.

Under mild conditions, π is the unique invariant probability measure of the Langevin Stochastic Differential Equation (SDE):

dθ_t = −∇U(θ_t) dt + √2 dB_t ,  (1)

where (B_t)_{t≥0} is a d-dimensional Brownian motion. 
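The invariance of π under the diffusion (1) can be checked empirically by simulating an Euler scheme of the SDE. A minimal sketch for a one-dimensional Gaussian target (illustrative parameters, not code from the paper):

```python
import numpy as np

def langevin_em(grad_U, theta0, gamma, n_steps, rng):
    """Euler-Maruyama discretization of d theta_t = -grad U(theta_t) dt + sqrt(2) dB_t."""
    theta = theta0
    traj = np.empty(n_steps)
    for k in range(n_steps):
        theta = theta - gamma * grad_U(theta) + np.sqrt(2.0 * gamma) * rng.standard_normal()
        traj[k] = theta
    return traj

# Gaussian target pi = N(0, sigma2): U(theta) = theta^2 / (2 sigma2)
sigma2 = 2.0
grad_U = lambda theta: theta / sigma2

rng = np.random.default_rng(0)
traj = langevin_em(grad_U, theta0=0.0, gamma=0.01, n_steps=200_000, rng=rng)
burnin = len(traj) // 10          # discard 10% as burn-in, as in the experiments of Section 4
samples = traj[burnin:]
# the empirical variance is close to sigma2, up to an O(gamma) discretization bias
```

For small γ the empirical variance of the trajectory matches the target variance up to a discretization bias, which is the effect quantified precisely by the Wasserstein bounds of Section 3.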
Based on this observation, Langevin Monte Carlo (LMC) is an MCMC algorithm that enables to sample (approximately) from π using an Euler discretization of the Langevin SDE:

θ_{k+1} = θ_k − γ∇U(θ_k) + √(2γ) Z_{k+1} ,  (2)

where γ > 0 is a constant step size and (Z_k)_{k≥1} is a sequence of i.i.d. standard d-dimensional Gaussian vectors. Discovered and popularised in the seminal works [16, 17, 29], LMC has recently received renewed attention [9, 15, 14, 11]. However, the cost of one iteration is Nd, which is prohibitively large for massive datasets. In order to scale up to the big data setting, Welling and Teh [33] suggested to replace ∇U with an unbiased estimate ∇U_0 + (N/p) Σ_{i∈S} ∇U_i, where S is a minibatch of {1, . . . , N} with replacement of size p. A single update of SGLD is then given for k ∈ N by

θ_{k+1} = θ_k − γ( ∇U_0(θ_k) + (N/p) Σ_{i∈S_{k+1}} ∇U_i(θ_k) ) + √(2γ) Z_{k+1} .  (3)

The idea of using only a fraction of the data points to compute an unbiased estimate of the gradient at each iteration comes from Stochastic Gradient Descent (SGD), a popular algorithm to minimize the potential U. SGD is very similar to SGLD: it is characterised by the same recursion as SGLD but without Gaussian noise:

θ_{k+1} = θ_k − γ( ∇U_0(θ_k) + (N/p) Σ_{i∈S_{k+1}} ∇U_i(θ_k) ) .  (4)

Assuming for simplicity that U has a minimizer θ⋆, we can define a control variates version of SGLD, SGLDFP, see [13, 7], given for k ∈ N by

θ_{k+1} = θ_k − γ( ∇U_0(θ_k) − ∇U_0(θ⋆) + (N/p) Σ_{i∈S_{k+1}} {∇U_i(θ_k) − ∇U_i(θ⋆)} ) + √(2γ) Z_{k+1} .  (5)

It is worth mentioning that the objectives of the different algorithms presented so far are distinct. On the one hand, LMC, SGLD and SGLDFP are MCMC methods used to obtain approximate samples from the posterior distribution π. On the other hand, SGD is a stochastic optimization algorithm used to find an estimate of the mode θ⋆ of the posterior distribution. In this paper, we focus on the fixed step-size SGLD algorithm and assess its ability to reliably sample from π. For that purpose, and to quantify precisely the relation between LMC, SGLD, SGLDFP and SGD, we make for simplicity the following assumptions on U.

H1. For all i ∈ {0, . . . , N}, U_i is four times continuously differentiable and for all j ∈ {2, 3, 4}, sup_{θ∈R^d} ‖D^j U_i(θ)‖ ≤ L̃. In particular, for all i ∈ {0, . . . , N}, U_i is L̃-gradient Lipschitz, i.e. for all θ_1, θ_2 ∈ R^d, ‖∇U_i(θ_1) − ∇U_i(θ_2)‖ ≤ L̃‖θ_1 − θ_2‖.

H2. U is m-strongly convex, i.e. for all θ_1, θ_2 ∈ R^d, ⟨∇U(θ_1) − ∇U(θ_2), θ_1 − θ_2⟩ ≥ m‖θ_1 − θ_2‖².

H3. For all i ∈ {0, . . . , N}, U_i is convex.

Note that under H1, U is four times continuously differentiable and for all j ∈ {2, 3, 4}, sup_{θ∈R^d} ‖D^j U(θ)‖ ≤ L, with L = (N + 1)L̃ and where ‖D^j U(θ)‖ = sup_{‖u_1‖≤1,...,‖u_j‖≤1} D^j U(θ)[u_1, . . . , u_j]. In particular, U is L-gradient Lipschitz. Furthermore, under H2, U has a unique minimizer θ⋆. In this paper, we focus on the asymptotic N → +∞. We assume that lim inf_{N→+∞} N^{−1} m > 0, which is a common assumption for the analysis of SGLD and SGLDFP [3, 5]. 
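The four recursions (2)–(5) differ only in the gradient estimate used for the drift and in the presence of the Gaussian noise. A schematic side-by-side implementation (a sketch; `grad_Ui(i, theta)`, returning ∇U_i(θ) with index 0 the prior term, is an assumed helper):

```python
import numpy as np

def sgld_family_step(theta, grad_Ui, N, p, gamma, rng,
                     algo="SGLD", theta_star=None):
    """One update of LMC / SGLD / SGD / SGLDFP for a potential
    U = sum_{i=0}^{N} U_i; grad_Ui(i, theta) returns grad U_i(theta)."""
    if algo == "LMC":                       # full gradient, eq. (2)
        drift = sum(grad_Ui(i, theta) for i in range(N + 1))
    else:
        S = rng.integers(1, N + 1, size=p)  # minibatch with replacement
        drift = grad_Ui(0, theta) + (N / p) * sum(grad_Ui(i, theta) for i in S)
        if algo == "SGLDFP":                # control variates centered at theta_star, eq. (5)
            drift -= grad_Ui(0, theta_star) + (N / p) * sum(
                grad_Ui(i, theta_star) for i in S)
    theta = theta - gamma * drift
    if algo != "SGD":                       # SGD, eq. (4), omits the Gaussian noise
        theta = theta + np.sqrt(2.0 * gamma) * rng.standard_normal(theta.shape)
    return theta

# tiny sanity check: quadratic potentials U_i(theta) = ||theta||^2 / (2(N+1)),
# so that the full gradient is exactly theta
rng = np.random.default_rng(1)
N, p = 10, 3
grad = lambda i, th: th / (N + 1)
new_theta = sgld_family_step(np.ones(2), grad, N, p, gamma=0.1, rng=rng,
                             algo="SGLDFP", theta_star=np.zeros(2))
```

Note that SGLDFP reuses the same minibatch S_{k+1} for the gradient at θ_k and at θ⋆, which is what makes the additive part of the noise cancel.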
In practice [1, 8, 18, 22, 30, 32], γ is of order 1/N, and we adopt this convention in this article.

For a practical implementation of SGLDFP, an estimator θ̂ of θ⋆ is necessary. The theoretical analysis and the bounds remain unchanged if, instead of considering SGLDFP centered w.r.t. θ⋆, we study SGLDFP centered w.r.t. θ̂ satisfying E[‖θ̂ − θ⋆‖²] = O(1/N). Such an estimator θ̂ can be computed using for example SGD with decreasing step sizes, see [26, eq. (2.8)] and [3, Section 3.4], for a computational cost linear in N.

3 Results

3.1 Analysis in Wasserstein distance

Before presenting the results, some notations and elements of Markov chain theory have to be introduced. Denote by P2(R^d) the set of probability measures with finite second moment and by B(R^d) the Borel σ-algebra of R^d. For λ, ν ∈ P2(R^d), define the Wasserstein distance of order 2 by

W2(λ, ν) = inf_{ξ∈Π(λ,ν)} ( ∫_{R^d×R^d} ‖θ − ϑ‖² ξ(dθ, dϑ) )^{1/2} ,

where Π(λ, ν) is the set of probability measures ξ on B(R^d) ⊗ B(R^d) satisfying, for all A ∈ B(R^d), ξ(A × R^d) = λ(A) and ξ(R^d × A) = ν(A).

A Markov kernel R on R^d × B(R^d) is a mapping R : R^d × B(R^d) → [0, 1] satisfying the following conditions: (i) for every θ ∈ R^d, R(θ, ·) : A ↦ R(θ, A) is a probability measure on B(R^d); (ii) for every A ∈ B(R^d), R(·, A) : θ ↦ R(θ, A) is a measurable function. 
For any probability measure λ on B(R^d), we define λR for all A ∈ B(R^d) by λR(A) = ∫_{R^d} λ(dθ) R(θ, A). For all k ∈ N⋆, we define the Markov kernel R^k recursively by R¹ = R and, for all θ ∈ R^d and A ∈ B(R^d), R^{k+1}(θ, A) = ∫_{R^d} R^k(θ, dϑ) R(ϑ, A). A probability measure π̄ is invariant for R if π̄R = π̄.

The LMC, SGLD, SGD and SGLDFP algorithms defined respectively by (2), (3), (4) and (5) are homogeneous Markov chains with Markov kernels denoted R_LMC, R_SGLD, R_SGD and R_FP. To avoid overloading the notations, the dependence on γ and N is implicit.

Lemma 1. Assume H1, H2 and H3. For any step size γ ∈ (0, 2/L), R_SGLD (respectively R_LMC, R_SGD, R_FP) has a unique invariant measure π_SGLD ∈ P2(R^d) (respectively π_LMC, π_SGD, π_FP). In addition, for all γ ∈ (0, 1/L], θ ∈ R^d and k ∈ N,

W2²(R^k_SGLD(θ, ·), π_SGLD) ≤ (1 − mγ)^k ∫_{R^d} ‖θ − ϑ‖² π_SGLD(dϑ) ,

and the same inequality holds for LMC, SGD and SGLDFP.

Proof. The proof is postponed to Section 1.1 in the supplementary document.

Under H1, (1) has a unique strong solution (θ_t)_{t≥0} for every initial condition θ_0 ∈ R^d [20, Chapter 5, Theorems 2.5 and 2.9]. Denote by (P_t)_{t≥0} the semigroup of the Langevin diffusion, defined for all θ_0 ∈ R^d and A ∈ B(R^d) by P_t(θ_0, A) = P(θ_t ∈ A).

Theorem 2. Assume H1, H2 and H3. 
For all γ ∈ (0, 1/L], λ, µ ∈ P2(R^d) and n ∈ N, we have the following upper bounds in Wasserstein distance between

i) LMC and SGLDFP,

W2²(λR^n_LMC, µR^n_FP) ≤ (1 − mγ)^n W2²(λ, µ) + (2L²γd)/(pm²) (13/6 + L/m) + (L²γ²/p) n(1 − mγ)^{n−1} (3 + L/m) ∫_{R^d} ‖ϑ − θ⋆‖² µ(dϑ) ,

ii) the Langevin diffusion and LMC,

W2²(λR^n_LMC, µP_{nγ}) ≤ 2(1 − mLγ/(m + L))^n W2²(λ, µ) + dγ (m + L)/(2m) + n e^{−(m/2)γ(n−1)} L³γ³ (1 + (m + L)/(2m)) ∫_{R^d} ‖ϑ − θ⋆‖² µ(dϑ) ,

iii) SGLD and SGD,

W2²(λR^n_SGLD, µR^n_SGD) ≤ (1 − mγ)^n W2²(λ, µ) + 2d/m .

Proof. The proof is postponed to Section 1.2 in the supplementary document.

Corollary 3. Assume H1, H2 and H3. Set γ = η/N with η ∈ (0, 1/(2L̃)] and assume that lim inf_{N→∞} mN^{−1} > 0. Then,

i) for all n ∈ N, we get W2(R^n_LMC(θ⋆, ·), R^n_FP(θ⋆, ·)) = √(dη) O(N^{−1/2}) and W2(π_LMC, π_FP) = √(dη) O(N^{−1/2}), W2(π_LMC, π) = √d O(N^{−1/2}) ;

ii) for all n ∈ N, we get W2(R^n_SGLD(θ⋆, ·), R^n_SGD(θ⋆, ·)) = √d O(N^{−1/2}) and W2(π_SGLD, π_SGD) = √d O(N^{−1/2}).

Theorem 2 implies that the number of iterations necessary to obtain a sample ε-close from π in Wasserstein distance is the same for LMC and SGLDFP. However, for LMC, the cost of one iteration is Nd, which is larger than pd, the cost of one iteration for SGLDFP. 
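These rates can be probed numerically. In dimension one, the order-2 Wasserstein distance between two empirical measures with the same number of atoms is attained by the monotone coupling that pairs sorted samples, which gives a cheap estimator (an illustrative sketch, not part of the paper):

```python
import numpy as np

def w2_empirical_1d(x, y):
    """Order-2 Wasserstein distance between two 1-D empirical measures with
    equally many atoms: the optimal coupling matches sorted samples."""
    assert len(x) == len(y)
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

# sanity check on two Gaussians of equal variance: W2(N(m1, s^2), N(m2, s^2)) = |m1 - m2|
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=50_000)
b = rng.normal(0.5, 1.0, size=50_000)
w2 = w2_empirical_1d(a, b)   # close to |m1 - m2| = 0.5
```

Applied to samples from, say, an SGLD chain and an SGLDFP chain in the one-dimensional model of Theorem 4 below (in the original numbering), such an estimator makes the Ω(1) gap directly visible.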
In other words, to obtain an approximate sample from the target distribution at an accuracy O(1/√N) in 2-Wasserstein distance, LMC requires Θ(N) operations, in contrast with SGLDFP, which needs only Θ(1) operations.

We show in the sequel that W2(π_FP, π_SGLD) = Ω(1) when N → +∞ in the case of a Bayesian linear regression, where for two sequences (u_N)_{N≥1}, (v_N)_{N≥1}, u_N = Ω(v_N) if lim inf_{N→+∞} u_N/v_N > 0. The dataset is z = {(y_i, x_i)}_{i=1}^N where y_i ∈ R is the response variable and x_i ∈ R^d are the covariates. Set y = (y_1, . . . , y_N) ∈ R^N and X ∈ R^{N×d} the matrix of covariates such that the ith row of X is x_i. Let σ²_y, σ²_θ > 0. For i ∈ {1, . . . , N}, the conditional distribution of y_i given x_i is Gaussian with mean x_i^T θ and variance σ²_y. The prior π_0(θ) is a normal distribution of mean 0 and covariance matrix σ²_θ Id. The posterior distribution π is then proportional to π(θ) ∝ exp(−(1/2)(θ − θ⋆)^T Σ(θ − θ⋆)), where

Σ = Id/σ²_θ + X^T X/σ²_y  and  θ⋆ = Σ^{−1}(X^T y)/σ²_y .

We assume that X^T X ⪰ m Id, with lim inf_{N→+∞} m/N > 0. Let S be a minibatch of {1, . . . , N} with replacement of size p. Define

∇U_0(θ) + (N/p) Σ_{i∈S} ∇U_i(θ) = Σ(θ − θ⋆) + ρ(S)(θ − θ⋆) + ξ(S) ,

where

ρ(S) = Id/σ²_θ + (N/(pσ²_y)) Σ_{i∈S} x_i x_i^T − Σ ,  ξ(S) = θ⋆/σ²_θ + (N/(pσ²_y)) Σ_{i∈S} (x_i^T θ⋆ − y_i) x_i .  (6)

ρ(S)(θ − θ⋆) is the multiplicative part of the noise in the stochastic gradient, and ξ(S) the additive part, which does not depend on θ. 
The additive part of the stochastic gradient for SGLDFP disappears since

∇U_0(θ) − ∇U_0(θ⋆) + (N/p) Σ_{i∈S} {∇U_i(θ) − ∇U_i(θ⋆)} = Σ(θ − θ⋆) + ρ(S)(θ − θ⋆) .

In this setting, the following theorem shows that the Wasserstein distances between the marginal distributions of the iterates of SGLD and SGLDFP, and between π_SGLD and π, are of order Ω(1) when N → +∞. This is in sharp contrast with the results of Corollary 3, where the Wasserstein distances tend to 0 as N → +∞ at a rate N^{−1/2}. For simplicity, we state the result for d = 1.

Theorem 4. Consider the case of the Bayesian linear regression in dimension 1.

i) For all γ ∈ (0, Σ^{−1}{1 + N/(p Σ_{i=1}^N x_i²)}^{−1}] and n ∈ N⋆,

W2(R^n_SGLD(θ⋆, ·), R^n_FP(θ⋆, ·)) ≥ ((1 − µ^n)/(1 − µ))^{1/2} { 2γ + (γ²N/p) Σ_{i=1}^N ( (x_i θ⋆ − y_i)x_i/σ²_y + θ⋆/(N σ²_θ) )² }^{1/2} − √(2γ) ,

where µ ∈ (0, 1 − γΣ].

ii) Set γ = η/N with η ∈ (0, lim inf_{N→+∞} N Σ^{−1}{1 + N/(p Σ_{i=1}^N x_i²)}^{−1}] and assume that lim inf_{N→+∞} N^{−1} Σ_{i=1}^N x_i² > 0. We have W2(π_SGLD, π) = Ω(1).

Proof. The proof is postponed to Section 1.3 in the supplementary document.

The study in Wasserstein distance emphasizes the different behaviors of the LMC, SGLDFP, SGLD and SGD algorithms. 
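In this linear-regression setting, the role of the additive noise ξ(S) of (6) and of the control variates can be checked directly: at θ = θ⋆, the SGLD gradient estimate is unbiased but still fluctuates through ξ(S), whereas the SGLDFP estimate, recentred at θ⋆, vanishes identically. A sketch on synthetic data (d = 1, made-up parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10_000, 32
sigma2_y = sigma2_theta = 1.0
x = rng.normal(size=N)
y = 0.3 * x + rng.normal(size=N) * np.sqrt(sigma2_y)

# posterior precision and mode (d = 1), as in the text
Sigma = 1.0 / sigma2_theta + x @ x / sigma2_y
theta_star = (x @ y / sigma2_y) / Sigma

def sgld_grad(theta, S):
    """Minibatch estimate of grad U(theta) used by SGLD/SGD (with replacement)."""
    return theta / sigma2_theta + (N / p) * np.sum((x[S] * theta - y[S]) * x[S]) / sigma2_y

def sgldfp_grad(theta, S):
    """SGLDFP estimate: the same minibatch gradient recentred at theta_star (eq. (5))."""
    return sgld_grad(theta, S) - sgld_grad(theta_star, S)

# at theta_star the SGLD estimate has mean ~ 0 but a large spread, driven by xi(S),
# while the SGLDFP estimate is exactly 0 for every minibatch
draws = np.array([sgld_grad(theta_star, rng.integers(0, N, size=p))
                  for _ in range(2_000)])
S = rng.integers(0, N, size=p)
residual = sgldfp_grad(theta_star, S)   # exactly 0.0
```

The spread of `draws` grows like N/√p, which is the Ω(1) effect (after the 1/N step-size scaling) behind Theorem 4.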
When N → +∞ and lim inf_{N→+∞} m/N > 0, the marginal distributions of the kth iterates of the LMC and SGLDFP algorithms are very close to the Langevin diffusion, and their invariant probability measures π_LMC and π_FP are similar to the posterior distribution of interest π. In contrast, the marginal distributions of the kth iterates of SGLD and SGD are analogous, and their invariant probability measures π_SGLD and π_SGD are very different from π when N → +∞.

Note that to fix the asymptotic bias of SGLD, other strategies can be considered: choosing a step size γ ∝ N^{−β} where β > 1, and/or increasing the batch size p ∝ N^α where α ∈ [0, 1]. Using the Wasserstein (of order 2) bounds of SGLD w.r.t. the target distribution π, see e.g. [11, Theorem 3], α + β should be equal to 2 to guarantee the ε-accuracy in Wasserstein distance of SGLD for a cost proportional to N (up to logarithmic terms), independently of the choice of α and β.

3.2 Mean and covariance matrix of π_LMC, π_FP, π_SGLD

We now establish an expansion of the means and second moments of π_LMC, π_FP, π_SGLD and π_SGD as N → +∞, and compare them. We first give an expansion of the mean and second moments of π as N → +∞.

Proposition 5. Assume H1 and H2 and that lim inf_{N→+∞} N^{−1} m > 0. Then,

∫_{R^d} (θ − θ⋆)^{⊗2} π(dθ) = ∇²U(θ⋆)^{−1} + O_{N→+∞}(N^{−3/2}) ,

∫_{R^d} θ π(dθ) − θ⋆ = −(1/2) ∇²U(θ⋆)^{−1} D³U(θ⋆)[∇²U(θ⋆)^{−1}] + O_{N→+∞}(N^{−3/2}) .

Proof. 
The proof is postponed to Section 2.1 in the supplementary document.

Contrary to the Bayesian linear regression, where the covariance matrices can be explicitly computed, see Section 3 in the supplementary document, only approximate expressions are available in the general case. For that purpose, we consider two types of asymptotics. For LMC and SGLDFP, we assume that lim inf_{N→+∞} m/N > 0 and γ = η/N for η > 0, and we develop an asymptotic expansion as N → +∞. Combining Proposition 5 and Theorem 6, we show that the biases and covariance matrices of π_LMC and π_FP are of order Θ(1/N) with remainder terms of the form O(N^{−3/2}), where for two sequences (u_N)_{N≥1}, (v_N)_{N≥1}, u = Θ(v) if 0 < lim inf_{N→+∞} u_N/v_N ≤ lim sup_{N→+∞} u_N/v_N < +∞.

Regarding SGD and SGLD, we do not have such concentration properties when N → +∞, because of the high variance of the stochastic gradients. The biases and covariance matrices of SGLD and SGD are of order Θ(1) when N → +∞. To obtain approximate expressions of these quantities, we set γ = η/N where η > 0 is the step size for the gradient descent over the normalized potential U/N. Assuming that m is proportional to N and N ≥ 1/η, we show by combining Proposition 5 and Theorem 7 that the biases and covariance matrices of SGLD and SGD are of order Θ(η) with remainder terms of the form O(η^{3/2}) when η → 0.

Before giving the results associated to π_LMC, π_FP, π_SGLD and π_SGD, we need to introduce some notations. For any matrices A_1, A_2 ∈ R^{d×d}, we denote by A_1 ⊗ A_2 the Kronecker product defined on R^{d×d} by A_1 ⊗ A_2 : Q ↦ A_1 Q A_2, and A^{⊗2} = A ⊗ A. Besides, for all θ_1, θ_2 ∈ R^d, we denote by θ_1 ⊗ θ_2 ∈ R^{d×d} the tensor product of θ_1 and θ_2. 
For any matrix A ∈ R^{d×d}, Tr(A) is the trace of A.

Define K : R^{d×d} → R^{d×d} for all A ∈ R^{d×d} by

K(A) = (N/p) Σ_{i=1}^N ( ∇²U_i(θ⋆) − (1/N) Σ_{j=1}^N ∇²U_j(θ⋆) )^{⊗2} A ,  (7)

and H and G : R^{d×d} → R^{d×d} by

H = ∇²U(θ⋆) ⊗ Id + Id ⊗ ∇²U(θ⋆) − γ ∇²U(θ⋆) ⊗ ∇²U(θ⋆) ,  (8)

G = ∇²U(θ⋆) ⊗ Id + Id ⊗ ∇²U(θ⋆) − γ (∇²U(θ⋆) ⊗ ∇²U(θ⋆) + K) .  (9)

K, H and G can be interpreted as perturbations of ∇²U(θ⋆)^{⊗2} and ∇²U(θ⋆), respectively, due to the noise of the stochastic gradients. It can be shown, see Section 2.2 in the supplementary document, that for γ small enough, H and G are invertible.

Theorem 6. Assume H1, H2 and H3. Set γ = η/N and assume that lim inf_{N→+∞} N^{−1} m > 0. There exists an (explicit) η_0 independent of N such that for all η ∈ (0, η_0),

∫_{R^d} (θ − θ⋆)^{⊗2} π_LMC(dθ) = H^{−1}(2 Id) + O_{N→+∞}(N^{−3/2}) ,  (10)

∫_{R^d} (θ − θ⋆)^{⊗2} π_FP(dθ) = G^{−1}(2 Id) + O_{N→+∞}(N^{−3/2}) ,  (11)

and

∫_{R^d} θ π_LMC(dθ) − θ⋆ = −∇²U(θ⋆)^{−1} D³U(θ⋆)[H^{−1} Id] + O_{N→+∞}(N^{−3/2}) ,

∫_{R^d} θ π_FP(dθ) − θ⋆ = −∇²U(θ⋆)^{−1} D³U(θ⋆)[G^{−1} Id] + O_{N→+∞}(N^{−3/2}) .

Proof. The proof is postponed to Section 2.2.2 in the supplementary document.

Theorem 7. 
Assume H1, H2 and H3. Set γ = η/N and assume that lim inf_{N→+∞} N^{−1} m > 0. There exists an (explicit) η_0 independent of N such that for all η ∈ (0, η_0) and N ≥ 1/η,

∫_{R^d} (θ − θ⋆)^{⊗2} π_SGLD(dθ) = G^{−1} {2 Id + (η/p) M} + O_{η→0}(η^{3/2}) ,  (12)

∫_{R^d} (θ − θ⋆)^{⊗2} π_SGD(dθ) = (η/p) G^{−1} M + O_{η→0}(η^{3/2}) ,  (13)

and

∫_{R^d} θ π_SGLD(dθ) − θ⋆ = −(1/2) ∇²U(θ⋆)^{−1} D³U(θ⋆)[G^{−1} {2 Id + (η/p) M}] + O_{η→0}(η^{3/2}) ,

∫_{R^d} θ π_SGD(dθ) − θ⋆ = −(η/(2p)) ∇²U(θ⋆)^{−1} D³U(θ⋆)[G^{−1} M] + O_{η→0}(η^{3/2}) ,

where

M = Σ_{i=1}^N ( ∇U_i(θ⋆) − (1/N) Σ_{j=1}^N ∇U_j(θ⋆) )^{⊗2} ,  (14)

and G is defined in (9).

Proof. The proof is postponed to Section 2.2.2 in the supplementary document.

Note that this result implies that the means and the covariance matrices of π_SGLD and π_SGD stay lower bounded by a positive constant for any η > 0 as N → +∞. In Section 4 of the supplementary document, a figure illustrates the results of Theorem 6 and Theorem 7 in the asymptotic N → +∞.

4 Numerical experiments

Simulated data. For illustrative purposes, we consider a Bayesian logistic regression in dimension d = 2. We simulate N = 10⁵ covariates {x_i}_{i=1}^N drawn from a standard 2-dimensional Gaussian distribution, and we denote by X ∈ R^{N×d} the matrix of covariates such that the ith row of X is x_i. Our Bayesian regression model is specified by a Gaussian prior of mean 0 and covariance matrix the identity, and a likelihood given for y_i ∈ {0, 1} by p(y_i|x_i, θ) = (1 + e^{−x_i^T θ})^{y_i−1} (1 + e^{x_i^T θ})^{−y_i}. We 
We\nsimulate N observations {yi}N\ni=1 under this model. In this setting, H1 and H3 are satis\ufb01ed, and H2\nholds if the state space is compact.\nTo illustrate the results of Section 3.2, we consider 10 regularly spaced values of N between 102\nand 105 and we truncate the dataset accordingly. We compute an estimator \u02c6\u03b8 of \u03b8(cid:63) using SGD [28]\n\ni \u03b8)\u2212yi(1 + exT\n\n7\n\n\fFigure 1: Distance to \u03b8(cid:63), (cid:13)(cid:13)\u00af\u03b8n \u2212 \u03b8(cid:63)(cid:13)(cid:13) for LMC, SGLDFP, SGLD and SGD, function of N, in\n\nlogarithmic scale.\n\nk=0 \u03b8k and {1/(n \u2212 1)}(cid:80)n\u22121\n\ncombined with the BFGS algorithm [19]. For the LMC, SGLDFP, SGLD and SGD algorithms,\nthe step size \u03b3 is set equal to (1 + \u03b4/4)\u22121 where \u03b4 is the largest eigenvalue of XTX. We start the\nalgorithms at \u03b80 = \u02c6\u03b8 and run n = 1/\u03b3 iterations where the \ufb01rst 10% samples are discarded as a\nburn-in period.\nWe estimate the means and covariance matrices of \u03c0LMC, \u03c0FP, \u03c0SGLD and \u03c0SGD by their empirical\nk=0 (\u03b8k \u2212 \u00af\u03b8n)\u22972. We plot the mean and the trace\nof the covariance matrices for the different algorithms, averaged over 100 independent trajectories, in\nFigure 1 and Figure 2 in logarithmic scale.\n\naverages \u00af\u03b8n = (1/n)(cid:80)n\u22121\nThe slope for LMC and SGLDFP is \u22121 which con\ufb01rms the convergence of(cid:13)(cid:13)\u00af\u03b8n \u2212 \u03b8(cid:63)(cid:13)(cid:13) to 0 at a rate\nN\u22121. On the other hand, we can observe that(cid:13)(cid:13)\u00af\u03b8n \u2212 \u03b8(cid:63)(cid:13)(cid:13) converges to a constant for SGD and SGLD.\nnot included in the simulations. We truncate the training dataset at N \u2208(cid:8)103, 104, 105(cid:9). For all\n\nCovertype dataset We then illustrate our results on the covertype dataset1 with a Bayesian logistic\nregression model. The prior is a standard multivariate Gaussian distribution. 
Given the size of\nthe dataset and the dimension of the problem, LMC requires high computational resources and is\n\nalgorithms, the step size \u03b3 is set equal to 1/N and the trajectories are started at \u02c6\u03b8, an estimator of \u03b8(cid:63),\ncomputed using SGD combined with the BFGS algorithm.\nWe empirically check that the variance of the stochastic gradients scale as N 2 for SGD and SGLD,\nand as N for SGLDFP. We compute the empirical variance estimator of the gradients, take the mean\nover the dimension and display the result in a logarithmic plot in Figure 3. The slopes are 2 for SGD\nand SGLD, and 1 for SGLDFP.\nOn the test dataset, we also evaluate the negative loglikelihood of the three algorithms for different\n\nvalues of N \u2208(cid:8)103, 104, 105(cid:9), as a function of the number of iterations. The plots are shown in\n\nFigure 4. We note that for large N, SGLD and SGD give very similar results that are below the\nperformance of SGLDFP.\n\n1https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.\n\nbinary.scale.bz2\n\n8\n\n102103104105103102101nLMC102103104105103102101SGLDFP102103104105N2\u00d7101nSGLD102103104105N2\u00d7101SGD\fFigure 2: Trace of the covariance matrices for LMC, SGLDFP, SGLD and SGD, function of N, in\nlogarithmic scale.\n\nFigure 3: Variance of the stochastic gradients of SGLD, SGLDFP and SGD function of N, in\nlogarithmic scale.\n\nnumber of iterations for different values of N \u2208(cid:8)103, 104, 105(cid:9).\n\nFigure 4: Negative loglikelihood on the test dataset for SGLD, SGLDFP and SGD function of the\n\n9\n\n102103104105103102101Tr(cov(n))LMC102103104105103102101SGLDFP102103104105N6\u00d71017\u00d7101Tr(cov(n))SGLD102103104105N5\u00d7101SGD103104105N103104105106Variance of the gradientssgd103104105N103104105106sgld103104105N102103104sgldfp020000400006000080000100000iterations0.5340.5360.5380.5400.5420.544Negative 
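For concreteness, the SGLD and SGLDFP updates compared in these experiments can be sketched as follows. This is our own minimal Python illustration, not the authors' code: the function name `langevin_step`, the generic `grad_U_i` callable, and the argument `grads_at_hat_sum` are our choices.

```python
import numpy as np

def langevin_step(theta, gamma, minibatch, grad_U_i, N, rng,
                  theta_hat=None, grads_at_hat_sum=None):
    """One iteration of SGLD, or of SGLDFP when a fixed point theta_hat and the
    precomputed full gradient at theta_hat are supplied (control variates)."""
    p = len(minibatch)
    if theta_hat is None:
        # SGLD: unbiased but high-variance minibatch estimate of grad U(theta).
        g = (N / p) * sum(grad_U_i(theta, i) for i in minibatch)
    else:
        # SGLDFP: recentre each per-sample gradient at theta_hat; the estimate
        # stays unbiased and its variance shrinks near theta_hat.
        g = grads_at_hat_sum + (N / p) * sum(
            grad_U_i(theta, i) - grad_U_i(theta_hat, i) for i in minibatch)
    # Dropping the Gaussian noise below turns either variant into plain SGD.
    return theta - gamma * g + np.sqrt(2.0 * gamma) * rng.standard_normal(theta.shape)
```

With γ = η/N as in the analysis above, the SGLD branch sums p terms of order N/p each, so its gradient estimate has variance of order N^2, while the recentred SGLDFP branch has variance of order N near θ̂, consistent with the slopes observed in Figure 3.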
References

[1] S. Ahn, A. K. Balan, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.

[2] S. Ahn, B. Shahbaba, and M. Welling. Distributed stochastic gradient MCMC. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1044–1052, Beijing, China, 22–24 Jun 2014. PMLR.

[3] J. Baker, P. Fearnhead, E. B. Fox, and C. Nemeth. Control variates for stochastic gradient MCMC. ArXiv e-prints 1706.05439, June 2017.

[4] R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18(47):1–43, 2017.

[5] N. S. Chatterji, N. Flammarion, Y.-A. Ma, P. L. Bartlett, and M. I. Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. ArXiv e-prints 1802.05431, Feb. 2018.

[6] C. Chen, N. Ding, and L. Carin. On the convergence of Stochastic Gradient MCMC algorithms with high-order integrators. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2278–2286. Curran Associates, Inc., 2015.

[7] C. Chen, W. Wang, Y. Zhang, Q. Su, and L. Carin. A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. ArXiv e-prints 1709.01180, Sept. 2017.

[8] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, pages 1683–1691, 2014.

[9] A. Dalalyan.
Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.

[10] A. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In S. Kale and O. Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 678–689, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.

[11] A. S. Dalalyan and A. G. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. ArXiv e-prints 1710.00095, Sept. 2017.

[12] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3203–3211, Cambridge, MA, USA, 2014. MIT Press.

[13] K. A. Dubey, S. J. Reddi, S. A. Williamson, B. Poczos, A. J. Smola, and E. P. Xing. Variance reduction in stochastic gradient Langevin dynamics. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1154–1162. Curran Associates, Inc., 2016.

[14] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. ArXiv e-prints 1605.01559, May 2016.

[15] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 06 2017. doi: 10.1214/16-AAP1238.

[16] U. Grenander. Tutorial in pattern theory. Division of Applied Mathematics, Brown University, Providence, 1983.

[17] U. Grenander and M. I. Miller. Representations of knowledge in complex systems. J. Roy. Statist. Soc. Ser. B, 56(4):549–603, 1994. ISSN 0035-9246.
With discussion and a reply by the authors.

[18] L. Hasenclever, S. Webb, T. Lienart, S. Vollmer, B. Lakshminarayanan, C. Blundell, and Y. W. Teh. Distributed Bayesian learning with stochastic natural gradient expectation propagation and the posterior server. Journal of Machine Learning Research, 18(106):1–37, 2017.

[19] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001.

[20] I. Karatzas and S. Shreve. Brownian motion and stochastic calculus. Graduate Texts in Mathematics. Springer New York, 1991. ISBN 9780387976556.

[21] A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14, pages I-181–I-189. JMLR.org, 2014.

[22] C. Li, C. Chen, D. Carlson, and L. Carin. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 1788–1794. AAAI Press, 2016.

[23] W. Li, S. Ahn, and M. Welling. Scalable MCMC for mixed membership stochastic blockmodels. In Artificial Intelligence and Statistics, pages 723–731, 2016.

[24] Y.-A. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient MCMC. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2917–2925. Curran Associates, Inc., 2015.

[25] T. Nagapetyan, A. B. Duncan, L. Hasenclever, S. J. Vollmer, L. Szpruch, and K. Zygalakis. The true cost of stochastic gradient Langevin dynamics. ArXiv e-prints 1706.02692, June 2017.

[26] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
doi: 10.1137/070704277.

[27] S. Patterson and Y. W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3102–3110. Curran Associates, Inc., 2013.

[28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[29] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996. ISSN 1350-7265. doi: 10.2307/3318418.

[30] I. Sato and H. Nakagawa. Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 982–990, Beijing, China, 22–24 Jun 2014. PMLR.

[31] Y. W. Teh, A. H. Thiery, and S. J. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):193–225, 2016.

[32] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.

[33] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, pages 681–688, USA, 2011. Omnipress.
ISBN 978-1-4503-0619-5.