{"title": "Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 2047, "page_last": 2055, "abstract": "Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) algorithms have become increasingly popular for Bayesian inference in large-scale applications. Even though these methods have proved useful in several scenarios, their performance is often limited by their bias. In this study, we propose a novel sampling algorithm that aims to reduce the bias of SG-MCMC while keeping the variance at a reasonable level. Our approach is based on a numerical sequence acceleration method, namely the Richardson-Romberg extrapolation, which simply boils down to running almost the same SG-MCMC algorithm twice in parallel with different step sizes. We illustrate our framework on the popular Stochastic Gradient Langevin Dynamics (SGLD) algorithm and propose a novel SG-MCMC algorithm referred to as Stochastic Gradient Richardson-Romberg Langevin Dynamics (SGRRLD). We provide formal theoretical analysis and show that SGRRLD is asymptotically consistent, satisfies a central limit theorem, and its non-asymptotic bias and the mean squared-error can be bounded. Our results show that SGRRLD attains higher rates of convergence than SGLD in both finite-time and asymptotically, and it achieves the theoretical accuracy of the methods that are based on higher-order integrators. 
We support our findings using both synthetic and real data experiments.", "full_text": "Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo\n\nAlain Durmus1, Umut Şimşekli1, Éric Moulines2, Roland Badeau1, Gaël Richard1\n\n1: LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France\n2: Centre de Mathématiques Appliquées, UMR 7641, École Polytechnique, France\n\nAbstract\n\nStochastic Gradient Markov Chain Monte Carlo (SG-MCMC) algorithms have become increasingly popular for Bayesian inference in large-scale applications. Even though these methods have proved useful in several scenarios, their performance is often limited by their bias. In this study, we propose a novel sampling algorithm that aims to reduce the bias of SG-MCMC while keeping the variance at a reasonable level. Our approach is based on a numerical sequence acceleration method, namely the Richardson-Romberg extrapolation, which simply boils down to running almost the same SG-MCMC algorithm twice in parallel with different step sizes. We illustrate our framework on the popular Stochastic Gradient Langevin Dynamics (SGLD) algorithm and propose a novel SG-MCMC algorithm referred to as Stochastic Gradient Richardson-Romberg Langevin Dynamics (SGRRLD). We provide a formal theoretical analysis and show that SGRRLD is asymptotically consistent, satisfies a central limit theorem, and that its non-asymptotic bias and mean squared error can be bounded. Our results show that SGRRLD attains higher rates of convergence than SGLD both in finite time and asymptotically, and that it achieves the theoretical accuracy of methods that are based on higher-order integrators. We support our findings using both synthetic and real data experiments.\n\n1 Introduction\n\nMarkov Chain Monte Carlo (MCMC) techniques are one of the most popular families of algorithms in Bayesian machine learning.
Recently, novel MCMC schemes based on stochastic optimization have been proposed for scaling up Bayesian inference to large-scale applications. These so-called Stochastic Gradient MCMC (SG-MCMC) methods provide a fruitful framework for Bayesian inference, well adapted to massively parallel and distributed architectures. In this domain, a first and important attempt was made by Welling and Teh [1], where the authors combined ideas from the Unadjusted Langevin Algorithm (ULA) [2] and Stochastic Gradient Descent (SGD) [3]. They proposed a scalable MCMC framework referred to as Stochastic Gradient Langevin Dynamics (SGLD). Unlike conventional batch MCMC methods, SGLD uses subsamples of the data per iteration, similar to SGD. Several extensions of SGLD have been proposed [4-12]. Recently, in [10] it has been shown that, under certain assumptions and with a sufficiently large number of iterations, the bias and the mean squared error (MSE) of a general class of SG-MCMC methods can be bounded as O(γ) and O(γ^2), respectively, where γ is the step size of the Euler-Maruyama integrator. The authors have also shown that these bounds can be improved by making use of higher-order integrators.\nIn this paper, we propose a novel SG-MCMC algorithm, called Stochastic Gradient Richardson-Romberg Langevin Dynamics (SGRRLD), that aims to reduce the bias of SGLD by applying a numerical sequence acceleration method, namely the Richardson-Romberg (RR) extrapolation, which requires running two chains with different step sizes in parallel. While reducing the bias, SGRRLD also keeps the variance of the estimator at a reasonable level by using correlated Brownian motions.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe show that the asymptotic bias and variance of SGRRLD can be bounded as O(γ^2) and O(γ^4), respectively.
We also show that after K iterations, our algorithm achieves a rate of convergence for the MSE of order O(K^{-4/5}), whereas this rate for SGLD and its extensions with first-order integrators is of order O(K^{-2/3}).\nOur results show that by only using a first-order numerical integrator, the proposed approach can achieve the theoretical accuracy of methods that are based on higher-order integrators, such as the ones given in [10]. This accuracy can be improved even further by applying the RR extrapolation multiple times in a recursive manner [13]. On the other hand, since the two chains required by the RR extrapolation can be generated independently, the SGRRLD algorithm is well adapted to parallel and distributed architectures. It is also worth noting that our technique is quite generic and can be applied to virtually all current SG-MCMC algorithms besides SGLD, provided that they satisfy rather technical weak error and ergodicity conditions.\nIn order to assess the performance of the proposed method, we conduct several experiments on both synthetic and real datasets. We first apply our method on a rather simple Gaussian model whose posterior distribution is analytically available and compare the performance of SGLD and SGRRLD. In this setting, we also illustrate the generality of our technique by applying the RR extrapolation on Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [6]. Then, we apply our method on a large-scale matrix factorization problem for a movie recommendation task. Numerical experiments support our theoretical results: our approach achieves improved accuracy over SGLD and SGHMC.\n\n2 Preliminaries\n\n2.1 Stochastic Gradient Langevin Dynamics\n\nIn MCMC, one aims at generating samples from a target probability measure π that is known up to a multiplicative constant. Assume that π has a density with respect to the Lebesgue measure that is still denoted by π and given by π : θ ↦ e^{-U(θ)} / ∫_{R^d} e^{-U(θ̃)} dθ̃, where U : R^d → R is called the potential energy function. In practice, directly generating samples from π turns out to be intractable except for very few special cases, therefore one often needs to resort to approximate methods. A popular way to approximately generate samples from π is based on discretizations of a stochastic differential equation (SDE) that has π as an invariant distribution [14]. A common choice is the over-damped Langevin equation associated with π, that is, the SDE given by\n\ndϑ_t = -∇U(ϑ_t) dt + √2 dB_t , (1)\n\nwhere (B_t)_{t≥0} is the standard d-dimensional Brownian motion. Under mild assumptions on U (cf. [2]), (ϑ_t)_{t≥0} is a well-defined Markov process which is geometrically ergodic with respect to π. Therefore, if continuous sample paths from (ϑ_t)_{t≥0} could be generated, they could be used as approximate samples from π. However, this is not possible, and in practice we need to use a discretization of (1). The most common discretization is the Euler-Maruyama scheme, which boils down to applying the following update equation iteratively: θ_{k+1} = θ_k - γ_{k+1} ∇U(θ_k) + √(2 γ_{k+1}) Z_{k+1}, for k ≥ 0 with initial state θ_0. Here, (γ_k)_{k≥1} is a sequence of non-increasing step sizes and (Z_k)_{k≥1} is a sequence of independent and identically distributed (i.i.d.) d-dimensional standard normal random variables. This scheme is called the Unadjusted Langevin Algorithm (ULA) [2]. When the sequence of step sizes (γ_k)_{k≥1} goes to 0 as k goes to infinity, it has been shown in [15] and [16] that the empirical distribution of (θ_k)_{k≥0} weakly converges to π under certain assumptions.
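As a side illustration (ours, not the paper's), the ULA recursion above can be sketched in a few lines of Python; the standard Gaussian target, the step-size schedule and the iteration count below are arbitrary illustrative choices:

```python
import numpy as np

def ula(grad_U, theta0, gammas, rng):
    # theta_{k+1} = theta_k - gamma_{k+1} * grad_U(theta_k) + sqrt(2 gamma_{k+1}) * Z_{k+1}
    theta = np.asarray(theta0, dtype=float)
    path = np.empty((len(gammas),) + theta.shape)
    for k, gamma in enumerate(gammas):
        noise = np.sqrt(2.0 * gamma) * rng.standard_normal(theta.shape)
        theta = theta - gamma * grad_U(theta) + noise
        path[k] = theta
    return path

# Standard Gaussian target: U(theta) = ||theta||^2 / 2, so grad_U is the identity.
rng = np.random.default_rng(0)
gammas = 0.05 * (1.0 + np.arange(100000)) ** -0.2   # decreasing step sizes
path = ula(lambda th: th, np.zeros(1), gammas, rng)
print(path[50000:].var())   # close to the target variance 1
```

The empirical variance of the later iterates approaches the target's variance, in line with the weak convergence result cited above.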
A central limit theorem for additive functionals has also been obtained in [17] and [16].\nIn Bayesian machine learning, π is often chosen as the Bayesian posterior, which imposes the following form on the potential energy: U(θ) = -(∑_{n=1}^N log p(x_n | θ) + log p(θ)) for all θ ∈ R^d, where x ≡ {x_n}_{n=1}^N is a set of observed i.i.d. data points, belonging to R^m, for m ≥ 1, p(x_n | ·) : R^d → R*_+ is the likelihood function, and p(θ) : R^d → R*_+ is the prior distribution. In large-scale settings, N becomes very large and therefore computing ∇U can be computationally very demanding, limiting the applicability of ULA. Inspired by stochastic optimization techniques, in [1] the authors have proposed replacing the exact gradient ∇U with an unbiased estimator and presented the SGLD algorithm that iteratively applies the following update equation:\n\nθ_{k+1} = θ_k - γ_{k+1} ∇Ũ_{k+1}(θ_k) + √(2 γ_{k+1}) Z_{k+1} , (2)\n\nwhere (∇Ũ_k)_{k≥1} is a sequence of i.i.d. unbiased estimators of ∇U. In the following, the common distribution of (∇Ũ_k)_{k≥1} will be denoted by L. A typical choice for the sequence of estimators (∇Ũ_k)_{k≥1} of ∇U is to randomly draw an i.i.d. sequence of data subsamples (R_k)_{k≥1} with R_k ⊂ [N] = {1, . . . , N} having a fixed number of elements |R_k| = B for all k ≥ 1. Then, set for all θ ∈ R^d, k ≥ 1,\n\n∇Ũ_k(θ) = -[∇ log p(θ) + (N/B) ∑_{i∈R_k} ∇ log p(x_i | θ)] . (3)\n\nConvergence analysis of SGLD has been studied in [18, 19], and it has been shown in [20] that for constant step sizes γ_k = γ > 0 for all k ≥ 1, the bias and the MSE of SGLD are of order O(γ + 1/(γK)) and O(γ^2 + 1/(γK)), respectively.
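To make the update (2) and the estimator (3) concrete, here is a minimal sketch of ours; the conjugate toy model used in the sanity check is a hypothetical choice, not the paper's:

```python
import numpy as np

def stoch_grad_U(theta, x, batch_idx, grad_log_prior, grad_log_lik):
    # Unbiased estimator (3): the minibatch log-likelihood gradient is rescaled by N/B.
    N, B = len(x), len(batch_idx)
    g = sum(grad_log_lik(x[i], theta) for i in batch_idx)
    return -(grad_log_prior(theta) + (N / B) * g)

def sgld_step(theta, gamma, grad_est, rng):
    # Update (2): gradient step on the stochastic potential plus sqrt(2 gamma) Gaussian noise.
    return theta - gamma * grad_est + np.sqrt(2.0 * gamma) * rng.standard_normal(np.shape(theta))

# Toy conjugate model (hypothetical): x_n | theta ~ N(theta, 1), prior theta ~ N(0, 1).
glp = lambda th: -th            # grad log p(theta)
gll = lambda xi, th: xi - th    # grad log p(x_i | theta)
x = np.array([0.5, 1.5, 2.0, 1.0])
# With the full batch (B = N), the estimator reduces to the exact gradient of U.
full = stoch_grad_U(1.0, x, range(len(x)), glp, gll)
```

Averaging `stoch_grad_U` over random minibatches recovers `full` in expectation, which is exactly what makes (2) a stochastic-gradient analogue of ULA and what the bounds quoted above rely on.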
Recently, it has been shown that these bounds are also valid for a more general family of SG-MCMC methods [10].\n\n2.2 Richardson-Romberg Extrapolation for SDEs\n\nRichardson-Romberg extrapolation is a well-known method in numerical analysis, which aims to improve the rate of convergence of a sequence. Talay and Tubaro [21] showed that the rate of convergence of Monte Carlo estimates on certain SDEs can be radically improved by using an RR extrapolation, which can be described as follows. Let us consider the SDE in (1) and its Euler discretization with exact gradients and fixed step size, i.e. γ_k = γ > 0 for all k ≥ 1. Under mild assumptions on U (cf. [22]), the homogeneous Markov chain (θ_k)_{k≥0} is ergodic with a unique invariant distribution π_γ, which is different from the target distribution π. However, [21] showed that for f sufficiently smooth with polynomial growth, there exists a constant C, which only depends on π and f, such that π_γ(f) = π(f) + Cγ + O(γ^2), where π(f) = ∫_{R^d} f(x) π(dx). By exploiting this result, RR extrapolation suggests considering two different discretizations of the same SDE with two different step sizes γ and γ/2. Then, instead of π_γ(f), if we consider 2π_{γ/2}(f) - π_γ(f) as the estimator, we obtain π(f) - (2π_{γ/2}(f) - π_γ(f)) = O(γ^2). In the case where the sequence (γ_k)_{k≥1} goes to 0 as k → +∞, it has been observed in [23] that the estimator defined by RR extrapolation satisfies a CLT. The applications of RR extrapolation to SG-MCMC have not yet been explored.\n\n3 Stochastic Gradient Richardson-Romberg Langevin Dynamics\n\nIn this study, we explore the use of RR extrapolation in SG-MCMC algorithms for improving their rates of convergence.
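As motivation, the bias cancellation of Section 2.2 can be verified in closed form on a small worked example of ours: the Ornstein-Uhlenbeck case of (1) with U(θ) = θ^2/2, for which the Euler chain's stationary variance is known analytically:

```python
# Euler chain for dtheta = -theta dt + sqrt(2) dB_t:
#   theta_{k+1} = (1 - gamma) theta_k + sqrt(2 gamma) Z_{k+1},
# so its stationary variance v solves v = (1 - gamma)^2 v + 2 gamma:
def euler_var(gamma):
    return 2.0 * gamma / (1.0 - (1.0 - gamma) ** 2)   # = 1 / (1 - gamma / 2)

true_var = 1.0   # stationary variance of the exact diffusion
for gamma in (0.2, 0.1):
    plain = euler_var(gamma) - true_var                               # O(gamma) bias
    rr = 2.0 * euler_var(gamma / 2.0) - euler_var(gamma) - true_var   # O(gamma^2) bias
    print(gamma, plain, rr)
```

Halving γ roughly halves the plain Euler bias but divides the extrapolated bias by about four, i.e. O(γ) versus O(γ^2), which is the effect SGRRLD transfers to the stochastic-gradient setting.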
In particular, we focus on the application of RR extrapolation to the SGLD estimator and present a novel SG-MCMC algorithm referred to as Stochastic Gradient Richardson-Romberg Langevin Dynamics (SGRRLD).\nThe proposed algorithm applies RR extrapolation on SGLD by considering two SGLD chains applied to the SDE (1), with two different sequences of step sizes satisfying the following relation. For the first chain, we consider a sequence of non-increasing step sizes (γ_k)_{k≥1}, and for the second chain we use the sequence of step sizes (η_k)_{k≥1} defined by η_{2k-1} = η_{2k} = γ_k/2 for k ≥ 1. These two chains are started at the same point θ_0 ∈ R^d and are run according to (2), but the chain with the smaller step size is run twice as many times as the other one. In other words, these two discretizations are run until the same time horizon ∑_{k=1}^K γ_k, where K is the number of iterations. Finally, we extrapolate the two SGLD estimators in order to construct the new one. Each iteration of SGRRLD consists of one step of the first SGLD chain with (γ_k)_{k≥1} and two steps of the second SGLD chain with (η_k)_{k≥1}. More formally, the proposed algorithm is defined as follows: consider a starting point θ^{(γ)}_0 = θ^{(γ/2)}_0 = θ_0 and for k ≥ 0,\n\nChain 1 : θ^{(γ)}_{k+1} = θ^{(γ)}_k - γ_{k+1} ∇Ũ^{(γ)}_{k+1}(θ^{(γ)}_k) + √(2 γ_{k+1}) Z^{(γ)}_{k+1} , (4)\n\nChain 2 : θ^{(γ/2)}_{2k+1} = θ^{(γ/2)}_{2k} - (γ_{k+1}/2) ∇Ũ^{(γ/2)}_{2k+1}(θ^{(γ/2)}_{2k}) + √(γ_{k+1}) Z^{(γ/2)}_{2k+1} ,\nθ^{(γ/2)}_{2k+2} = θ^{(γ/2)}_{2k+1} - (γ_{k+1}/2) ∇Ũ^{(γ/2)}_{2k+2}(θ^{(γ/2)}_{2k+1}) + √(γ_{k+1}) Z^{(γ/2)}_{2k+2} , (5)\n\nwhere (Z^{(γ/2)}_k)_{k≥1} and (Z^{(γ)}_k)_{k≥1} are two sequences of d-dimensional i.i.d. standard Gaussian random variables and (∇Ũ^{(γ/2)}_k)_{k≥1}, (∇Ũ^{(γ)}_k)_{k≥1} are two sequences of i.i.d. unbiased estimators of ∇U with the same common distribution L, meaning that the mini-batch size has to be the same.\nFor a test function f : R^d → R, we then define the estimator of π(f) based on RR extrapolation as follows (for all K ∈ N*):\n\nπ̂^R_K(f) = (∑_{k=2}^{K+1} γ_k)^{-1} ∑_{k=1}^{K} γ_{k+1} [{f(θ^{(γ/2)}_{2k-1}) + f(θ^{(γ/2)}_{2k})} - f(θ^{(γ)}_k)] . (6)\n\nWe provide a pseudo-code of SGRRLD in the supplementary document.\nUnder mild assumptions on ∇U and the law L (see the conditions in the Supplement), by [19, Theorem 7] we can show that π̂^R_K(f) is a consistent estimator of π(f): when lim_{k→+∞} γ_k = 0 and lim_{K→+∞} ∑_{k=1}^K γ_{k+1} = +∞, then lim_{K→+∞} π̂^R_K(f) = π(f) almost surely. However, it is not immediately clear whether applying an RR extrapolation would provide any advantage over SGLD in terms of the rate of convergence. Even if RR extrapolation were to reduce the bias of the SGLD estimator, this improvement could be offset by an increase of variance. In the context of a general class of SDEs, it has been shown in [13] that the variance of an estimator based on RR extrapolation can be controlled by using correlated Brownian increments, and the best choice in this sense is in fact taking the two sequences (Z^{(γ/2)}_k)_{k≥1} and (Z^{(γ)}_k)_{k≥1} perfectly correlated, i.e. for all k ≥ 1,\n\nZ^{(γ)}_k = (Z^{(γ/2)}_{2k-1} + Z^{(γ/2)}_{2k}) / √2 . (7)\n\nThis choice has also been justified in the context of the sampling of the stationary distribution of a diffusion in [23] through a central limit theorem.\nInspired by [23], in order to be able to control the variance of the SGRRLD estimator, we consider correlated Brownian increments. In particular, we assume that the Brownian increments in (4) and (5) satisfy the following relationship: there exist a matrix Σ ∈ R^{d×d} and a sequence (W_k)_{k≥1} of d-dimensional i.i.d. standard Gaussian random variables, independent of (Z^{(γ/2)}_k)_{k≥1}, such that I_d - Σ^T Σ is a positive semidefinite matrix and for all k ≥ 0,\n\nZ^{(γ)}_{k+1} = Σ^T (Z^{(γ/2)}_{2k+1} + Z^{(γ/2)}_{2(k+1)}) / √2 + (I_d - Σ^T Σ)^{1/2} W_{k+1} , (8)\n\nwhere I_d denotes the identity matrix. In Section 4, we will show that the properly scaled SGRRLD estimator converges to a Gaussian random variable whose variance is minimal when Σ = I_d, and therefore Z^{(γ)}_{k+1} should be chosen as in (7). Accordingly, (8) justifies the choice of using the same Brownian motion in the two discretizations, extending the results of [23] to SG-MCMC. On the other hand, regarding the sequences of estimators of ∇U, we assume that they can also be correlated, but we do not assume an explicit form for their relation. However, it is important to note that if the two sequences (∇Ũ^{(γ/2)}_k)_{k≥1} and (∇Ũ^{(γ)}_k)_{k≥1} do not have the same common distribution, then the SGRRLD estimator can have a bias of the same order as that of vanilla SGLD (with the same sequence of step sizes).
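A sketch of ours of one SGRRLD iteration implementing (4)-(5) with the coupling (7), i.e. Σ = I_d in (8); the stochastic gradient is left as an abstract callable, and the standard Gaussian target with exact gradients in the check below is an illustrative choice:

```python
import numpy as np

def sgrrld_iter(th_c, th_f, gamma, stoch_grad, rng, d):
    # Fine-chain increments; the coarse increment reuses them through (7):
    # Z^(gamma)_k = (Z^(gamma/2)_{2k-1} + Z^(gamma/2)_{2k}) / sqrt(2).
    z1, z2 = rng.standard_normal(d), rng.standard_normal(d)
    zc = (z1 + z2) / np.sqrt(2.0)
    # Chain 2: two half-steps of size gamma/2, as in (5).
    th_mid = th_f - 0.5 * gamma * stoch_grad(th_f) + np.sqrt(gamma) * z1
    th_f = th_mid - 0.5 * gamma * stoch_grad(th_mid) + np.sqrt(gamma) * z2
    # Chain 1: one full step of size gamma, as in (4).
    th_c = th_c - gamma * stoch_grad(th_c) + np.sqrt(2.0 * gamma) * zc
    return th_c, th_mid, th_f

# Constant-step RR estimate (6) of pi(f) for f(theta) = theta^2 on a standard
# Gaussian target (exact gradients here, so each chain is a coupled ULA):
rng = np.random.default_rng(1)
gamma, K, d = 0.1, 20000, 1
th_c = np.zeros(d)
th_f = np.zeros(d)
acc = 0.0
for _ in range(K):
    th_c, th_mid, th_f = sgrrld_iter(th_c, th_f, gamma, lambda th: th, rng, d)
    acc += float(th_mid[0] ** 2 + th_f[0] ** 2 - th_c[0] ** 2)
print(acc / K)   # close to pi(f) = 1
```

With constant γ the weights in (6) reduce to 1/K, and the shared increments keep the two chains close, which is what keeps the variance of the difference under control.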
In the particular case of (3), in order for SGRRLD to gain efficiency compared to SGLD, the mini-batch size has to be the same for the two chains.\n\n4 Convergence Analysis\n\nWe analyze asymptotic and non-asymptotic properties of SGRRLD. In order to save space and avoid obscuring the results, we present the technical conditions under which the theorems hold, and the full proofs, in the supplementary document.\nWe first present a central limit theorem for the estimator π̂^R_K(f) of π(f) (see (6)) for a smooth function f. Let us define Γ^{(n)}_K = ∑_{k=1}^K γ^n_{k+1} and Γ_K = Γ^{(1)}_K, for all n ∈ N.\nTheorem 1. Let f : R^d → R be a smooth function and (γ_k)_{k≥1} be a nonincreasing sequence satisfying lim_{k→+∞} γ_k = 0 and lim_{K→+∞} Γ_K = +∞. Let (θ^{(γ)}_k, θ^{(γ/2)}_k)_{k≥0} be defined by (4)-(5), started at θ_0 ∈ R^d, and assume that the relation (8) holds for Σ ∈ R^{d×d}. Under appropriate conditions on U, f and L, the following statements hold:\na) If lim_{K→+∞} Γ^{(3)}_K / √(Γ_K) = 0, then √(Γ_K) (π̂^R_K(f) - π(f)) converges in law as K goes to infinity to a zero-mean Gaussian random variable with variance σ^2_R, which is minimized when Σ = I_d.\nb) If lim_{K→+∞} Γ^{(3)}_K / √(Γ_K) = κ ∈ (0, +∞), then √(Γ_K) (π̂^R_K(f) - π(f)) converges in law as K goes to infinity to a Gaussian random variable with variance σ^2_R and mean κ μ_R.\nc) If lim_{K→+∞} Γ^{(3)}_K / √(Γ_K) = +∞, then (Γ_K / Γ^{(3)}_K) (π̂^R_K(f) - π(f)) converges in probability as K goes to infinity to μ_R.\nThe expressions of σ^2_R and μ_R are given in the supplementary document.\nProof (Sketch). The proof follows the same strategy as the one in [23, Theorem 4.3] for ULA. We assume that the Poisson equation associated with f has a solution g ∈ C^9(R^d). Then, the proof consists in making a 7th-order Taylor expansion of g(θ^{(γ)}_{k+1}), g(θ^{(γ/2)}_{2k+1}) and g(θ^{(γ/2)}_{2k+2}) at θ^{(γ)}_k, θ^{(γ/2)}_{2k} and θ^{(γ/2)}_{2k+1}, respectively. Then π̂^R_K(f) - π(f) is decomposed as a sum of three terms A_{1,K} + A_{2,K} + A_{3,K}. A_{1,K} is the fluctuation term, and Γ_K^{1/2} A_{1,K} converges to a zero-mean Gaussian random variable with variance σ^2_R. A_{2,K} is the bias term, and Γ_K A_{2,K} / Γ^{(3)}_K converges in probability to μ_R as K goes to +∞ if lim_{K→+∞} Γ^{(3)}_K = +∞. Finally, the last term Γ_K^{1/2} A_{3,K} goes to 0 as K goes to +∞. The detailed proof is given in the supplementary document.\n\nThese results state that the Gaussian noise dominates the stochastic gradient noise. Moreover, we also observe that the correlation between the two sequences of Gaussian random variables (Z^{(γ)}_k)_{k≥1} and (Z^{(γ/2)}_k)_{k≥1} has an important impact on the asymptotic convergence of π̂^R(f), whereas the correlation of the two sequences of stochastic gradients does not.\nA typical choice of decreasing sequence (γ_k)_{k≥1} is of the form γ_k = γ_1 k^{-α} for α ∈ (0, 1]. With such a choice, Theorem 1 states that π̂^R(f) converges to π(f) at a rate of convergence of order O(K^{-((1-α)/2) ∧ (2α)}), where a ∧ b = min(a, b). Therefore, the optimal choice of the exponent α for obtaining the fastest convergence turns out to be α = 1/5, which implies a rate of convergence of order O(K^{-2/5}). Note that this rate is higher than that of SGLD, whose optimal rate is of order O(K^{-1/3}). Besides, α = 1/5 corresponds to the second point of Theorem 1, in which there is an equal contribution of the bias and the fluctuation at an asymptotic level. Further discussions and detailed calculations can be found in the supplementary document.\nWe now derive non-asymptotic bounds for the bias and the MSE of the estimator π̂^R(f).\nTheorem 2. Let f : R^d → R be a smooth function and (γ_k)_{k≥1} be a nonincreasing sequence such that there exists K_1 ≥ 1 with γ_{K_1} ≤ 1, and lim_{K→+∞} Γ_K = +∞. Let (θ^{(γ)}_k, θ^{(γ/2)}_k)_{k≥0} be defined by (4)-(5), started at θ_0 ∈ R^d. Under appropriate conditions on U, f and L, there exists C ≥ 0 such that for all K ∈ N, K ≥ 1:\n\nBIAS: |E[π̂^R_K(f)] - π(f)| ≤ (C/Γ_K) {Γ^{(3)}_K + 1} ,\nMSE: E[{π̂^R_K(f) - π(f)}^2] ≤ C {(Γ^{(3)}_K/Γ_K)^2 + 1/Γ_K} .\n\nProof (Sketch). The proof follows the same strategy as that of Theorem 1, but instead of establishing the exact convergence of the fluctuation and the bias terms, we just give an upper bound for these two terms. The detailed proof is given in the supplementary document.\n\nIt is important to observe that the constant C which appears in Theorem 2 depends on moments of the estimator of the gradient. For fixed step size γ_k = γ for all k ≥ 1, Theorem 2 shows that the bias is of order O(γ^2 + 1/(Kγ)). Therefore, if the number of iterations K is fixed, then the choice of γ which minimizes this bound is γ ∝ K^{-1/3}, obtained by differentiating x ↦ x^2 + (xK)^{-1}. Choosing this value for γ leads to the optimal rate for the bias of order O(K^{-2/3}). Note that this bound is better than that of SGLD, for which the optimal bound on the bias at fixed K is of order O(K^{-1/2}). The same approach can be applied to the MSE, which is of order O(γ^4 + 1/(Kγ)). Then, the optimal choice of the step size is γ = O(K^{-1/5}), leading to a bound of order O(K^{-4/5}). Similar to the previous case, this bound is smaller than the bound obtained with SGLD, which is O(K^{-2/3}).\nIf we choose γ_k = γ_1 k^{-α} for α ∈ (0, 1], Theorem 2 shows that the bias and the MSE go to 0 as K goes to infinity.
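Before examining decreasing step sizes in more detail, the fixed-step tradeoffs above can be checked numerically; with unit constants (our simplification of the bounds in Theorem 2), minimizing γ^2 + 1/(γK) and γ^4 + 1/(γK) over γ recovers the K^{-1/3} and K^{-1/5} scalings:

```python
import numpy as np

def argmin_step(bound, K, grid=np.logspace(-6, 0, 4000)):
    # Grid search for the step size minimizing a fixed-step bound shape.
    vals = bound(grid, K)
    i = int(np.argmin(vals))
    return grid[i], vals[i]

bias_bound = lambda g, K: g ** 2 + 1.0 / (g * K)   # fixed-step bias shape
mse_bound = lambda g, K: g ** 4 + 1.0 / (g * K)    # fixed-step MSE shape

for K in (10 ** 4, 10 ** 6):
    gb, _ = argmin_step(bias_bound, K)
    gm, _ = argmin_step(mse_bound, K)
    print(K, gb, gm)   # gb ~ K**(-1/3), gm ~ K**(-1/5)
```

Multiplying K by 100 shrinks the bias-optimal step by about 100^{1/3} ≈ 4.6 and the MSE-optimal step by about 100^{1/5} ≈ 2.5, matching the analytic minimizers.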
More precisely, for α ∈ (0, 1), the bound for the bias is O(K^{-(2α) ∧ (1-α)}), and is therefore minimal for α = 1/3. As for the MSE, the bound provided by Theorem 2 is O(K^{-(4α) ∧ (1-α)}), which is consistent with Theorem 1, leading to an optimal bound of order O(K^{-4/5}) for α = 1/5.\n\nFigure 1: The performance of SGRRLD on synthetic data. (a) The true posterior and the estimated posteriors. (b) The MSE for different problem sizes.\n\n5 Experiments\n\n5.1 Linear Gaussian Model\n\nWe conduct our first set of experiments on synthetic data, where we consider a simple Gaussian model whose posterior distribution is analytically available. The model is given as follows:\n\nθ ∼ N(0, σ^2_θ I_d) , x_n | θ ∼ N(a_n^T θ, σ^2_x) , for all n . (9)\n\nHere, we assume that the explanatory variables {a_n}_{n=1}^N ∈ R^{N×d}, σ^2_θ and σ^2_x are known, and we aim to draw samples from the posterior distribution p(θ|x). In all the experiments, we first randomly generate a_n ∼ N(0, 0.5 I_d) and we generate the true θ and the response variables x by using the generative model given in (9). All our experiments are conducted on a standard laptop computer with a 2.5GHz quad-core Intel Core i7 CPU, and in all settings the two chains of SGRRLD are run in parallel.\nIn our first experiment, we set d = 1, σ^2_θ = 10, σ^2_x = 1, N = 1000, and the size of each minibatch B = N/10. We fix the step size to γ = 10^{-3}.
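For reference, the ground-truth posterior of the conjugate model (9) is Gaussian in closed form; a sketch of ours, with small illustrative numbers rather than the experiment's actual configuration:

```python
import numpy as np

def exact_posterior(A, x, var_theta, var_x):
    # Model (9): theta ~ N(0, var_theta I_d), x_n | theta ~ N(a_n^T theta, var_x).
    # Conjugacy gives theta | x ~ N(mu, C) with
    #   C = (I_d / var_theta + A^T A / var_x)^(-1),  mu = C A^T x / var_x.
    d = A.shape[1]
    precision = np.eye(d) / var_theta + A.T @ A / var_x
    C = np.linalg.inv(precision)
    mu = C @ (A.T @ x) / var_x
    return mu, C

A = np.array([[1.0], [2.0], [3.0]])   # three 1-d explanatory variables (illustrative)
x = np.array([1.0, 2.0, 3.0])
mu, C = exact_posterior(A, x, var_theta=10.0, var_x=1.0)
print(mu, C)
```

Comparing sampler output against `mu` and `C` is what makes the bias and MSE of the estimated posterior moments directly measurable in this setting.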
In order to ensure that both algorithms are run for a fixed computation time, we run SGLD for K = 21000 iterations, where we discard the first 1000 samples as burn-in, and we run SGRRLD for K = 10500 iterations accordingly, where we discard the samples generated in the first 500 iterations as burn-in. Figure 1(a) shows the typical results of this experiment. In particular, in the left figure, we illustrate the true posterior distribution and the Gaussian density N(μ̂_post, σ̂^2_post) for both algorithms, where μ̂_post and σ̂^2_post denote the empirical posterior mean and variance, respectively. In the right figure, we monitor the bias of the estimated variance as a function of computation time. The results show that SGLD overestimates the posterior variance, whereas SGRRLD is able to reduce this error significantly. We also observe that the results support our theory: the bias of the estimated variance is ≈ 10^{-2} for SGLD, whereas this bias is reduced to ≈ 10^{-4} with SGRRLD.\n\nFigure 2: Bias and MSE of SGLD and SGRRLD for different step sizes.\n\nIn our second experiment, we fix γ and K and monitor the MSE of the posterior covariance as a function of the dimension d of the problem. In order to measure the MSE, we compute the squared Frobenius norm of the difference between the true posterior covariance and the estimated covariance. Similarly to the previous experiment, we average 100 runs that are initialized randomly. The results are shown in Figure 1(b). They clearly show that SGRRLD provides a significant performance improvement over SGLD, where the MSE of SGRRLD is in the order of the square of the MSE of SGLD for all values of d.\nIn our next experiment, we use the same setting as in the first experiment and monitor the bias and the MSE of the estimated variance as a function of the step size γ. For evaluation, we average 100 runs that are initialized randomly. As depicted in Figure 2, the results show that SGRRLD yields significantly better results than SGLD in terms of both the bias and MSE. Note that for very small γ, the bias and MSE increase. This is because the term 1/(Kγ) in the bounds of Theorem 2 dominates both the bias and the MSE, as expected, since K is fixed. Therefore, we observe a drop in the bias and the MSE as we increase γ up to ≈ 8 × 10^{-5}, after which they gradually increase along with γ.\n\nFigure 3: Bias and MSE of SGRRLD with different rates for step size (α).\n\nWe conduct the next experiment in order to check the rates of convergence that we have derived in Theorem 2 for fixed step size γ_k = γ for all k ≥ 1. We observe that the optimal choice of the step size is of the form γ = γ*_b K^{-1/3} and γ = γ*_M K^{-0.2} for the bias and the MSE, respectively. To confirm our findings, we first need to determine the constants γ*_b and γ*_M, which can be done by using the results from the previous experiment. Accordingly, we observe that γ*_b ≈ 8.5 · 10^{-5} · (20000)^{1/3} ≈ 2 · 10^{-3} and γ*_M ≈ 1.7 · 10^{-4} · (20000)^{0.2} ≈ 10^{-3}. Then, to confirm the right dependency of γ on K, we fix K = 10^6 and monitor the bias with the sequence of step sizes γ = γ*_b K^{-α} and the MSE with γ = γ*_M K^{-α} for several values of α, as given in Figure 3.
It can be observed that\nthe optimal convergence rate is still obtained for \u03b1 = 1/3 for the bias and \u03b1 = 0.2 for the MSE,\nwhich con\ufb01rms the results of Theorem 2. For a decreasing sequence of step sizes \u03b3k = \u03b3(cid:63)\n1 k\u03b1 for\n\u03b1 \u2208 (0, 1], we conduct a similar experiment to con\ufb01rm that the best convergence rate is achieved\nchoosing \u03b1 = 1/3 in the case of the bias and \u03b1 = 0.2 in the case of the MSE. The resulting \ufb01gures\ncan be found in the supplementary document.\nIn our last synthetic data experi-\nment, instead of SGLD, we con-\nsider another SG-MCMC algo-\nrithm, namely the Stochastic Gra-\ndient Hamiltonian Monte Carlo\n(SGHMC) [6]. We apply the pro-\nposed extrapolation scheme de-\nscribed in Section 3 to SGHMC\nand call the resulting algorithm\nStochastic Gradient Richardson-\nRomberg Hamiltonian Monte\nCarlo (SGRRHMC). In this ex-\nperiment, we use the same set-\nting as we use in Figure 2, and\nwe monitor the bias and the MSE of the estimated variance as a function of \u03b3. We compare SGR-\nRHMC against SGHMC with Euler discretization [6] and SGHMC with an higher-order splitting\nintegrator (SGHMC-s) [10] (we describe SGHMC, SGHMC-s, and SGRRHMC in more detail in the\nsupplementary document). We average 100 runs that are initialized randomly. As given in Figure 4,\nthe results are similar to the ones obtained in Figure 2: for large enough \u03b3, SGRRHMC yields\nsigni\ufb01cantly better results than SGHMC. For small \u03b3, the term 1/(K\u03b3) in the bound derived in\nTheorem 2 dominates the MSE and therefore SGRRHMC requires a larger K for improving over\nSGHMC. 
For large enough values of γ, we observe that SGRRHMC obtains an MSE similar to that of SGHMC-s with small γ, which confirms our claim that the proposed approach can achieve the accuracy of the methods that are based on higher-order integrators.

Figure 4: The performance of RR extrapolation on SGHMC.

5.2 Large-Scale Matrix Factorization

In our second set of experiments, we evaluate our approach on a large-scale matrix factorization problem for a link prediction application, where we consider the following probabilistic model: W_ip ∼ N(0, σ_w^2), H_pj ∼ N(0, σ_h^2), X_ij | W, H ∼ N(∑_p W_ip H_pj, σ_x^2), where X ∈ R^(I×J) is the observed data matrix with missing entries, and W ∈ R^(I×P) and H ∈ R^(P×J) are the latent factors, whose entries are i.i.d. distributed. The aim in this application is to predict the missing values of X by using a low-rank approximation. This model is similar to the Bayesian probabilistic matrix factorization model [24] and is often used in large-scale matrix factorization problems [25], in which SG-MCMC has been shown to outperform optimization methods such as SGD [26].

Figure 5: The performance of SGRRLD on large-scale matrix factorization problems. (a) MovieLens-1Million, (b) MovieLens-10Million, (c) MovieLens-20Million.

In this experiment, we compare SGRRLD against SGLD on three large movie ratings datasets, namely MovieLens 1Million (ML-1M), MovieLens 10Million (ML-10M), and MovieLens 20Million (ML-20M) (grouplens.org). The ML-1M dataset contains about 1 million ratings applied to I = 3883 movies by J = 6040 users, resulting in a sparse observed matrix X with 4.3% non-zero entries.
The ML-10M dataset contains about 10 million ratings applied to I = 10681 movies by J = 71567 users, resulting in a sparse observed matrix X with 1.3% non-zero entries. Finally, the ML-20M dataset contains about 20 million ratings applied to I = 27278 movies by J = 138493 users, resulting in a sparse observed matrix X with 0.5% non-zero entries. We randomly select 10% of the data as the test set and use the remaining data for generating the samples. The rank of the factorization is chosen as P = 10. We set σ_w^2 = σ_h^2 = σ_x^2 = 1. For all datasets, we use a constant step size. We run SGLD for K = 10500 iterations, where we discard the first 500 samples as burn-in. In order to keep the computation time the same, we run SGRRLD for K = 5250 iterations, where we discard the first 250 iterations as burn-in. For ML-1M we set γ = 2 × 10^(-6), and for ML-10M and ML-20M we set γ = 2 × 10^(-5). The size of the subsamples B is selected as N/10, N/50, and N/500 for ML-1M, ML-10M, and ML-20M, respectively. We have implemented SGLD and SGRRLD in C by using the GNU Scientific Library for efficient matrix computations. We fully exploit the inherently parallel structure of SGRRLD by running its two chains as two independent processes, whereas SGLD cannot benefit from this parallel computation architecture due to its inherently sequential nature; therefore, their wall-clock times are nearly identical.

Figure 5 shows the comparison of SGLD and SGRRLD in terms of the root mean squared-errors (RMSE) obtained on the test sets as a function of wall-clock time. The results clearly show that on all datasets SGRRLD yields significant performance improvements. We observe that in the ML-1M experiment SGRRLD requires only ≈ 200 seconds to achieve the accuracy that SGLD provides after ≈ 400 seconds.
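For concreteness, a single SGLD update for the Gaussian matrix-factorization model above can be sketched as follows. This is an illustrative NumPy re-implementation under the stated priors, not the C/GSL code used in the experiments; the minibatch is a random subset of B observed entries, N denotes the total number of observed ratings, and all sizes in the usage example are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

def sgld_mf_step(W, H, rows, cols, vals, N, gamma,
                 s2_w=1.0, s2_h=1.0, s2_x=1.0):
    """One SGLD step for W_ip ~ N(0, s2_w), H_pj ~ N(0, s2_h),
    X_ij | W, H ~ N(sum_p W_ip H_pj, s2_x), using a minibatch of the
    B = len(vals) observed entries (rows[b], cols[b]) out of N in total."""
    B = len(vals)
    err = vals - np.einsum('bp,pb->b', W[rows], H[:, cols])   # minibatch residuals
    scale = N / (B * s2_x)                                    # reweights the subsampled likelihood
    gW = -W / s2_w                                            # log-prior gradients
    gH = -H / s2_h
    np.add.at(gW, rows, scale * err[:, None] * H[:, cols].T)  # likelihood part, d/dW
    np.add.at(gH.T, cols, scale * err[:, None] * W[rows])     # likelihood part, d/dH
    W = W + gamma * gW + rng.normal(scale=np.sqrt(2 * gamma), size=W.shape)
    H = H + gamma * gH + rng.normal(scale=np.sqrt(2 * gamma), size=H.shape)
    return W, H

# Tiny synthetic usage example (I=20 movies, J=30 users, rank P=5, N=200 ratings).
I, J, P, N = 20, 30, 5, 200
rows = rng.integers(0, I, size=N)
cols = rng.integers(0, J, size=N)
vals = rng.normal(size=N)
W = 0.1 * rng.normal(size=(I, P))
H = 0.1 * rng.normal(size=(P, J))
for _ in range(100):
    idx = rng.choice(N, size=20, replace=False)               # B = N/10, as above
    W, H = sgld_mf_step(W, H, rows[idx], cols[idx], vals[idx], N, gamma=1e-3)
```

Running this step in two independent chains with step sizes γ and γ/2 and combining their posterior averages gives the SGRRLD estimator evaluated in Figure 5.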
We see similar behaviors in the ML-10M and ML-20M experiments: SGRRLD appears to be more efficient than SGLD. The results indicate that by using our approach, we either obtain the same accuracy as SGLD in a shorter time, or obtain a better accuracy by spending the same amount of time as SGLD.

6 Conclusion

We presented SGRRLD, a novel scalable sampling algorithm that aims to reduce the bias of SG-MCMC while keeping the variance at a reasonable level by using RR extrapolation. We provided a formal theoretical analysis and showed that SGRRLD is asymptotically consistent and satisfies a central limit theorem. We further derived bounds for its non-asymptotic bias and mean squared-error, and showed that SGRRLD attains higher rates of convergence than all known SG-MCMC methods with first-order integrators, both in finite time and asymptotically. We supported our findings using both synthetic and real data experiments, where SGRRLD proved more efficient than SGLD in terms of computation time on a large-scale matrix factorization application. As a next step, we plan to explore the use of multi-level Monte Carlo approaches [27] in our framework.

Acknowledgements: This work is partly supported by the French National Research Agency (ANR) as a part of the EDISON 3D project (ANR-13-CORD-0008-02).

References

[1] M. Welling and Y. W. Teh, “Bayesian learning via Stochastic Gradient Langevin Dynamics,” in ICML, 2011, pp. 681–688.
[2] G. O. Roberts and R. L. Tweedie, “Exponential convergence of Langevin distributions and their discrete approximations,” Bernoulli, vol. 2, no. 4, pp. 341–363, 1996.
[3] H. Robbins and S. Monro, “A stochastic approximation method,” Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, 1951.
[4] S. Ahn, A. Korattikara, and M.
Welling, \u201cBayesian posterior sampling via stochastic gradient Fisher\n\nscoring,\u201d in ICML, 2012.\n\n[5] S. Patterson and Y. W. Teh, \u201cStochastic gradient Riemannian Langevin dynamics on the probability\n\nsimplex,\u201d in NIPS, 2013.\n\n[6] T. Chen, E. B. Fox, and C. Guestrin, \u201cStochastic gradient Hamiltonian Monte Carlo,\u201d in ICML, 2014.\n[7] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven, \u201cBayesian sampling using stochastic\n\ngradient thermostats,\u201d in NIPS, 2014, pp. 3203\u20133211.\n\n[8] X. Shang, Z. Zhu, B. Leimkuhler, and A. J. Storkey, \u201cCovariance-controlled adaptive Langevin thermostat\n\nfor large-scale Bayesian sampling,\u201d in NIPS, 2015, pp. 37\u201345.\n\n[9] Y. A. Ma, T. Chen, and E. Fox, \u201cA complete recipe for stochastic gradient MCMC,\u201d in NIPS, 2015, pp.\n\n2899\u20132907.\n\n[10] C. Chen, N. Ding, and L. Carin, \u201cOn the convergence of stochastic gradient MCMC algorithms with\n\nhigh-order integrators,\u201d in NIPS, 2015, pp. 2269\u20132277.\n\n[11] C. Li, C. Chen, D. Carlson, and L. Carin, \u201cPreconditioned stochastic gradient Langevin dynamics for deep\n\nneural networks,\u201d in AAAI Conference on Arti\ufb01cial Intelligence, 2016.\n\n[12] U. S\u00b8ims\u00b8ekli, R. Badeau, A. T. Cemgil, and G. Richard, \u201cStochastic quasi-Newton Langevin Monte Carlo,\u201d\n\nin ICML, 2016.\n\n[13] G. Pages, \u201cMulti-step Richardson-Romberg extrapolation: remarks on variance control and complexity,\u201d\n\nMonte Carlo Methods and Applications, vol. 13, no. 1, pp. 37, 2007.\n\n[14] U. Grenander, \u201cTutorial in pattern theory,\u201d Division of Applied Mathematics, Brown University, Provi-\n\ndence, 1983.\n\n[15] D. Lamberton and G. Pag`es, \u201cRecursive computation of the invariant distribution of a diffusion: the case\n\nof a weakly mean reverting drift,\u201d Stoch. Dyn., vol. 3, no. 4, pp. 435\u2013451, 2003.\n\n[16] V. 
Lemaire, Estimation de la mesure invariante d’un processus de diffusion, Ph.D. thesis, Université Paris-Est, 2005.
[17] D. Lamberton and G. Pagès, “Recursive computation of the invariant distribution of a diffusion,” Bernoulli, vol. 8, no. 3, pp. 367–405, 2002.
[18] I. Sato and H. Nakagawa, “Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process,” in ICML, 2014, pp. 982–990.
[19] Y. W. Teh, A. H. Thiéry, and S. J. Vollmer, “Consistency and fluctuations for stochastic gradient Langevin dynamics,” Journal of Machine Learning Research, vol. 17, no. 7, pp. 1–33, 2016.
[20] Y. W. Teh, S. J. Vollmer, and K. C. Zygalakis, “(Non-)asymptotic properties of Stochastic Gradient Langevin Dynamics,” arXiv preprint arXiv:1501.00438, 2015.
[21] D. Talay and L. Tubaro, “Expansion of the global error for numerical schemes solving stochastic differential equations,” Stochastic Anal. Appl., vol. 8, no. 4, pp. 483–509, 1990.
[22] J. C. Mattingly, A. M. Stuart, and D. J. Higham, “Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise,” Stochastic Process. Appl., vol. 101, no. 2, pp. 185–232, 2002.
[23] V. Lemaire, G. Pagès, and F. Panloup, “Invariant measure of duplicated diffusions and application to Richardson-Romberg extrapolation,” Ann. Inst. H. Poincaré Probab. Statist., vol. 51, no. 4, pp. 1562–1596, 2015.
[24] R. Salakhutdinov and A. Mnih, “Bayesian probabilistic matrix factorization using Markov Chain Monte Carlo,” in ICML, 2008, pp. 880–887.
[25] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix factorization with distributed stochastic gradient descent,” in ACM SIGKDD, 2011.
[26] S. Ahn, A. Korattikara, N. Liu, S. Rajan, and M.
Welling, \u201cLarge-scale distributed Bayesian matrix\n\nfactorization using stochastic gradient MCMC,\u201d in KDD, 2015.\n\n[27] V. Lemaire and G. Pages,\n\narXiv:1401.1177, 2014.\n\n\u201cMultilevel Richardson-Romberg extrapolation,\u201d\n\narXiv preprint\n\n9\n\n\f", "award": [], "sourceid": 1089, "authors": [{"given_name": "Alain", "family_name": "Durmus", "institution": "Telecom ParisTech"}, {"given_name": "Umut", "family_name": "Simsekli", "institution": "T\u00e9l\u00e9com ParisTech"}, {"given_name": "Eric", "family_name": "Moulines", "institution": "Ecole Polytechnique"}, {"given_name": "Roland", "family_name": "Badeau", "institution": "Telecom ParisTech"}, {"given_name": "Ga\u00ebl", "family_name": "RICHARD", "institution": "Telecom ParisTech"}]}