{"title": "On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators", "book": "Advances in Neural Information Processing Systems", "page_first": 2278, "page_last": 2286, "abstract": "Recent advances in Bayesian learning with large-scale data have witnessed emergence of stochastic gradient MCMC algorithms (SG-MCMC), such as stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian MCMC (SGHMC), and the stochastic gradient thermostat. While finite-time convergence properties of the SGLD with a 1st-order Euler integrator have recently been studied, corresponding theory for general SG-MCMCs has not been explored. In this paper we consider general SG-MCMCs with high-order integrators, and develop theory to analyze finite-time convergence properties and their asymptotic invariant measures. Our theoretical results show faster convergence rates and more accurate invariant measures for SG-MCMCs with higher-order integrators. For example, with the proposed efficient 2nd-order symmetric splitting integrator, the mean square error (MSE) of the posterior average for the SGHMC achieves an optimal convergence rate of $L^{-4/5}$ at $L$ iterations, compared to $L^{-2/3}$ for the SGHMC and SGLD with 1st-order Euler integrators. Furthermore, convergence results of decreasing-step-size SG-MCMCs are also developed, with the same convergence rates as their fixed-step-size counterparts for a specific decreasing sequence. Experiments on both synthetic and real datasets verify our theory, and show advantages of the proposed method in two large-scale real applications.", "full_text": "On the Convergence of Stochastic Gradient MCMC\n\nAlgorithms with High-Order Integrators\n\nChangyou Chen\u2020\n\nNan Ding\u2021\n\nLawrence Carin\u2020\n\n\u2020Dept. 
of Electrical and Computer Engineering, Duke University, Durham, NC, USA\n\n‡Google Inc., Venice, CA, USA\n\ncchangyou@gmail.com; dingnan@google.com; lcarin@duke.edu\n\nAbstract\n\nRecent advances in Bayesian learning with large-scale data have witnessed emergence of stochastic gradient MCMC algorithms (SG-MCMC), such as stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian MCMC (SGHMC), and the stochastic gradient thermostat. While finite-time convergence properties of the SGLD with a 1st-order Euler integrator have recently been studied, corresponding theory for general SG-MCMCs has not been explored. In this paper we consider general SG-MCMCs with high-order integrators, and develop theory to analyze finite-time convergence properties and their asymptotic invariant measures. Our theoretical results show faster convergence rates and more accurate invariant measures for SG-MCMCs with higher-order integrators. For example, with the proposed efficient 2nd-order symmetric splitting integrator, the mean square error (MSE) of the posterior average for the SGHMC achieves an optimal convergence rate of L^{-4/5} at L iterations, compared to L^{-2/3} for the SGHMC and SGLD with 1st-order Euler integrators. Furthermore, convergence results of decreasing-step-size SG-MCMCs are also developed, with the same convergence rates as their fixed-step-size counterparts for a specific decreasing sequence. Experiments on both synthetic and real datasets verify our theory, and show advantages of the proposed method in two large-scale real applications.\n\n1 Introduction\n\nIn large-scale Bayesian learning, diffusion-based sampling methods have become increasingly popular. 
Most of these methods are based on Itô diffusions, defined as:\n\nd X_t = F(X_t) dt + σ(X_t) dW_t .   (1)\n\nHere X_t ∈ R^n represents model states, t the time index, W_t is Brownian motion, and the functions F : R^n → R^n and σ : R^n → R^{n×m} (m not necessarily equal to n) are assumed to satisfy the usual Lipschitz continuity condition.\n\nIn a Bayesian setting, the goal is to design appropriate functions F and σ, so that the stationary distribution, ρ(X), of the Itô diffusion has a marginal distribution that is equal to the posterior distribution of interest. For example, 1st-order Langevin dynamics (LD) correspond to X = θ, F = −∇_θ U and σ = √2 I_n, with I_n being the n×n identity matrix; 2nd-order Langevin dynamics correspond to X = (θ, p), F = (p, −D p − ∇_θ U), and σ = √(2D) (0, 0; 0, I_n) for some D > 0. Here U is the unnormalized negative log-posterior, and p is known as the momentum [1, 2]. Based on the Fokker–Planck equation [3], the stationary distributions of these dynamics exist and their marginals over θ are equal to ρ(θ) ∝ exp(−U(θ)), the posterior distribution we are interested in.\n\nSince Itô diffusions are continuous-time Markov processes, exact sampling is in general infeasible. As a result, the following two approximations have been introduced in the machine learning literature [1, 2, 4], to make the sampling numerically feasible and practically scalable: 1) Instead of analytically integrating infinitesimal increments dt, numerical integration over a small step h is used to approximate the integration of the true dynamics. 
Although many numerical schemes have been studied in the SDE literature, in machine learning only the 1st-order Euler scheme is widely applied. 2) During every integration, instead of working with the gradient of the full negative log-posterior U(θ), a stochastic-gradient version of it, Ũ_l(θ), is calculated from the l-th minibatch of data, important when considering problems with massive data. In this paper, we call algorithms based on 1) and 2) SG-MCMC algorithms. To be complete, some recently proposed SG-MCMC algorithms are briefly reviewed in Appendix A. SG-MCMC algorithms often work well in practice; however, some theoretical concerns about their convergence properties have been raised [5–7].\n\nRecently, [5, 6, 8] showed that the SGLD [4] converges weakly to the true posterior. In [7], the author studied the sample-path inconsistency of the Hamiltonian PDE with stochastic gradients (but not the SGHMC), and pointed out its incompatibility with data subsampling. However, real applications only require convergence in the weak sense, i.e., instead of requiring sample-wise convergence as in [7], only laws of sample paths are of concern∗. Very recently, the invariant measure of an SG-MCMC with a specific stochastic gradient noise was studied in [9]. However, the technique is not readily applicable to our general setting.\n\nIn this paper we focus on general SG-MCMCs, and study the role of their numerical integrators. Our main contributions include: i) From a theoretical viewpoint, we prove weak convergence results for general SG-MCMCs, which are of practical interest. Specifically, for a Kth-order numerical integrator, the bias of the expected sample average of an SG-MCMC at iteration L is upper bounded by L^{-K/(K+1)} with optimal step size h ∝ L^{-1/(K+1)}, and the MSE by L^{-2K/(2K+1)} with optimal h ∝ L^{-1/(2K+1)}. 
This generalizes the results of the SGLD with an Euler integrator (K = 1) in [5, 6, 8], and is better when K ≥ 2; ii) From a practical perspective, we introduce a numerically efficient 2nd-order integrator, based on symmetric splitting schemes [9]. When applied to the SGHMC, it outperforms existing algorithms, including the SGLD and SGHMC with Euler integrators, considering both synthetic and large real datasets.\n\n2 Preliminaries & Two Approximation Errors in SG-MCMCs\n\nIn weak convergence analysis, instead of working directly with sample-paths in (1), we study how the expected value of any suitably smooth statistic of X_t evolves in time. This motivates the introduction of an (infinitesimal) generator. Formally, the generator L of the diffusion (1) is defined for any compactly supported twice differentiable function f : R^n → R, such that\n\nL f(X_t) ≜ lim_{h→0⁺} (E[f(X_{t+h})] − f(X_t)) / h = ( F(X_t) · ∇ + ½ (σ(X_t)σ(X_t)^T) : ∇∇^T ) f(X_t) ,\n\nwhere a · b ≜ a^T b, A : B ≜ tr(A^T B), and h → 0⁺ means h approaches zero along the positive real axis. L is associated with an integrated form via Kolmogorov's backward equation†: E[f(X^e_T)] = e^{T L} f(X_0), where X^e_T denotes the exact solution of the diffusion (1) at time T. The operator e^{T L} is called the Kolmogorov operator for the diffusion (1). Since diffusion (1) is continuous, it is generally infeasible to solve analytically (so is e^{T L}). In practice, a local numerical integrator is used for every small step h, with the corresponding Kolmogorov operator P_h approximating e^{h L}. Let X^n_{lh} denote the approximate sample path from such a numerical integrator; similarly, we have E[f(X^n_{lh})] = P_h f(X^n_{(l−1)h}). Let A ∘ B denote the composition of two operators A and B, i.e., A is evaluated on the output of B. 
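To make the generator concrete, here is a small numerical sanity check (our own illustration, not from the paper): for 1st-order Langevin dynamics with U(θ) = θ²/2, the diffusion (1) is an Ornstein–Uhlenbeck process whose transition moments are known in closed form, so the limit defining L f can be compared directly against the closed-form generator for f(θ) = θ².

```python
import math

# 1st-order Langevin dynamics with U(theta) = theta^2 / 2:
#   d theta = -theta dt + sqrt(2) dW,
# an Ornstein-Uhlenbeck process with exact transition
#   theta_{t+h} | theta_t ~ N(theta_t * exp(-h), 1 - exp(-2h)).

def generator_f(theta):
    # L f = F * f' + 0.5 * sigma^2 * f''  for f(theta) = theta^2:
    #     = (-theta)(2 theta) + 0.5 * 2 * 2 = -2 theta^2 + 2
    return -2.0 * theta ** 2 + 2.0

def finite_difference(theta, h):
    # (E[f(theta_{t+h})] - f(theta_t)) / h, using the exact OU second moment:
    # E[theta_{t+h}^2] = theta^2 * exp(-2h) + (1 - exp(-2h))
    expected_f = theta ** 2 * math.exp(-2 * h) + (1 - math.exp(-2 * h))
    return (expected_f - theta ** 2) / h

theta = 1.3
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    print(h, finite_difference(theta, h))  # approaches the generator value as h -> 0+
print("L f =", generator_f(theta))
```

The finite-difference quotient converges to L f(θ) = −2θ² + 2 at rate O(h), matching the definition above.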
For time T = Lh, we have the following approximation:\n\nE[f(X^e_T)] =_{A1} e^{h L} ∘ … ∘ e^{h L} f(X_0) ≃_{A2} P_h ∘ … ∘ P_h f(X_0) = E[f(X^n_T)] ,\n\nwith L compositions, where A1 is obtained by decomposing T L into L sub-operators, each for a minibatch of data, while approximation A2 is manifested by approximating the infeasible e^{h L} with P_h from a feasible integrator, e.g., the symmetric splitting integrator proposed later, such that E[f(X^n_T)] is close to the exact expectation E[f(X^e_T)]. The latter is the first approximation error introduced in SG-MCMCs. Formally, to characterize the degree of approximation accuracy of different numerical methods, we use the following definition.\n\n∗For completeness, we provide mean sample-path properties of the SGHMC (similar to [7]) in Appendix J.\n†More details of the equation are provided in Appendix B. Specifically, under mild conditions on F, we can expand the operator e^{h L} up to the mth order (m ≥ 1) such that the remainder terms are bounded by O(h^{m+1}). Refer to [10] for more details. We will assume these conditions to hold for the F's in this paper.\n\nDefinition 1. An integrator is said to be a Kth-order local integrator if for any smooth and bounded function f, the corresponding Kolmogorov operator P_h satisfies the following relation:\n\nP_h f(x) = e^{h L} f(x) + O(h^{K+1}) .   (2)\n\nThe second approximation error is manifested when handling large data. Specifically, the SGLD and SGHMC use stochastic gradients in the 1st and 2nd-order LDs, respectively, by replacing in F and L the full negative log-posterior U with a scaled log-posterior, Ũ_l, from the l-th minibatch. We denote the corresponding generators with stochastic gradients as L̃_l, e.g., the generator in the l-th minibatch for the SGHMC becomes L̃_l = L + ΔV_l, where ΔV_l = (∇_θ Ũ_l − ∇_θ U) · ∇_p. 
As a result, in SG-MCMC algorithms, we use the noisy operator P̃^l_h to approximate e^{h L̃_l}, such that E[f(X^{n,s}_{lh})] = P̃^l_h f(X_{(l−1)h}), where X^{n,s}_{lh} denotes the numerical sample-path with stochastic gradient noise, i.e.,\n\nE[f(X^e_T)] ≃_{B1} e^{h L̃_L} ∘ … ∘ e^{h L̃_1} f(X_0) ≃_{B2} P̃^L_h ∘ … ∘ P̃^1_h f(X_0) = E[f(X^{n,s}_T)] .   (3)\n\nApproximations B1 and B2 in (3) are from the stochastic gradient and numerical integrator approximations, respectively. Similarly, we say P̃^l_h corresponds to a Kth-order local integrator of L̃_l if P̃^l_h f(x) = e^{h L̃_l} f(x) + O(h^{K+1}). In the following sections, we focus on SG-MCMCs which use numerical integrators with stochastic gradients, and for the first time analyze how the two introduced errors affect their convergence behaviors. For notational simplicity, we henceforth use X_t to represent the approximate sample-path X^{n,s}_t.\n\n3 Convergence Analysis\n\nThis section develops theory to analyze finite-time convergence properties of general SG-MCMCs with both fixed and decreasing step sizes, as well as their asymptotic invariant measures.\n\n3.1 Finite-time error analysis\n\nGiven an ergodic‡ Itô diffusion (1) with an invariant measure ρ(x), the posterior average is defined as φ̄ ≜ ∫_X φ(x) ρ(x) d x for some test function φ(x) of interest. For a given numerical method with generated samples (X_{lh})_{l=1}^L, we use the sample average φ̂, defined as φ̂ = (1/L) ∑_{l=1}^L φ(X_{lh}), to approximate φ̄. 
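As a non-paper illustration of how the sample average φ̂ approximates the posterior average φ̄, the following sketch runs SGLD (a 1st-order Euler integrator with minibatch stochastic gradients, K = 1) on a conjugate Gaussian toy model where φ̄ is available in closed form; all names and constants here are our own choices.

```python
import math
import random

random.seed(0)

# Toy conjugate model: x_i ~ N(theta, 1), prior theta ~ N(0, 1).
N = 1000
data = [random.gauss(0.5, 1.0) for _ in range(N)]

# Exact posterior: theta | x ~ N(sum(x) / (N + 1), 1 / (N + 1)),
# so the posterior average of phi(theta) = theta^2 is variance + mean^2.
post_mean = sum(data) / (N + 1)
post_var = 1.0 / (N + 1)
phi_bar = post_var + post_mean ** 2

# Minibatch stochastic gradient of U(theta) = -log posterior (up to a constant):
# grad U = theta + sum_i (theta - x_i), estimated from m points rescaled by N/m.
def stoch_grad(theta, batch):
    return theta + (N / len(batch)) * sum(theta - x for x in batch)

# SGLD update: theta <- theta - h * stoch_grad + sqrt(2h) * standard normal noise.
h, L, m = 1e-4, 20000, 10
theta, phi_sum = 0.0, 0.0
for _ in range(L):
    batch = random.sample(data, m)
    theta += -h * stoch_grad(theta, batch) + math.sqrt(2 * h) * random.gauss(0.0, 1.0)
    phi_sum += theta ** 2

phi_hat = phi_sum / L  # sample average, approximating phi_bar
print(phi_hat, phi_bar)
```

With a small fixed h the two printed values are close; the residual gap is exactly the finite-time bias analyzed below.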
In the analysis, we define a functional ψ that solves the following Poisson equation:\n\nL ψ(X_{lh}) = φ(X_{lh}) − φ̄ , or equivalently, (1/L) ∑_{l=1}^L L ψ(X_{lh}) = φ̂ − φ̄ .   (4)\n\nThe solution functional ψ(X_{lh}) characterizes the difference between φ(X_{lh}) and the posterior average φ̄ for every X_{lh}, and thus would typically possess a unique solution, which is at least as smooth as φ under the elliptic or hypoelliptic settings [12]. In the unbounded domain of X_{lh} ∈ R^n, to make the presentation simple, we follow [6] and make certain assumptions on the solution functional, ψ, of the Poisson equation (4), which are used in the detailed proofs. Extensive empirical results have indicated the assumptions to hold in many real applications, though extra work is needed for theoretical verification for different models, which is beyond the scope of this paper.\n\nAssumption 1. ψ and its derivatives up to 3rd order, D^k ψ, are bounded by a function§ V, i.e., ‖D^k ψ‖ ≤ C_k V^{p_k} for k = (0, 1, 2, 3), C_k, p_k > 0. Furthermore, the expectation of V on {X_{lh}} is bounded: sup_l E V^p(X_{lh}) < ∞, and V is smooth such that sup_{s∈(0,1)} V^p(s X + (1 − s) Y) ≤ C (V^p(X) + V^p(Y)), ∀ X, Y, p ≤ max{2 p_k}, for some C > 0.\n\n‡See [6, 11] for conditions to ensure (1) is ergodic.\n§The existence of such a function can be translated into finding a Lyapunov function for the corresponding SDEs, an important topic in the PDE literature [13]. 
See Assumption 4.1 in [6] and Appendix C for more details.\n\nWe emphasize that our proof techniques are related to those of the SGLD [6, 12], but with significant distinctions in that, instead of expanding the function ψ(X_{lh}) [6], whose parameter X_{lh} does not endow an explicit form in general SG-MCMCs, we start from expanding Kolmogorov's backward equation for each minibatch. Moreover, our techniques apply to general SG-MCMCs, instead of one specific algorithm. More specifically, given a Kth-order local integrator with the corresponding Kolmogorov operator P̃^l_h, according to Definition 1 and (3), Kolmogorov's backward equation for the l-th minibatch can be expanded as:\n\nE[ψ(X_{lh})] = P̃^l_h ψ(X_{(l−1)h}) = e^{h L̃_l} ψ(X_{(l−1)h}) + O(h^{K+1}) = (I + h L̃_l) ψ(X_{(l−1)h}) + ∑_{k=2}^K (h^k / k!) L̃^k_l ψ(X_{(l−1)h}) + O(h^{K+1}) ,   (5)\n\nwhere I is the identity map. Recall that L̃_l = L + ΔV_l, e.g., ΔV_l = (∇_θ Ũ_l − ∇_θ U) · ∇_p in the SGHMC. By further using the Poisson equation (4) to simplify related terms associated with L, after some algebra shown in Appendix D, the bias can be derived from (5) as:\n\n|E φ̂ − φ̄| = | (E[ψ(X_{Lh})] − ψ(X_0)) / (Lh) − (1/L) ∑_l E[ΔV_l ψ(X_{(l−1)h})] − ∑_{k=2}^K (h^{k−1} / (k! L)) ∑_{l=1}^L E[L̃^k_l ψ(X_{(l−1)h})] | + O(h^K) .\n\nAll terms in the above equation can be bounded, with details provided in Appendix D. This gives us a bound on the bias of an SG-MCMC algorithm in Theorem 2.\n\nTheorem 2. Under Assumption 1, let ‖·‖ be the operator norm. 
The bias of an SG-MCMC with a Kth-order integrator at time T = hL can be bounded as:\n\n|E φ̂ − φ̄| = O( 1/(Lh) + ∑_l ‖E ΔV_l‖ / L + h^K ) .\n\nNote the bound above includes the term ∑_l ‖E ΔV_l‖ / L, measuring the difference between the expectation of the stochastic gradients and the true gradient. It vanishes when the stochastic gradient is an unbiased estimate of the exact gradient, an assumption made in the SGLD. This on the other hand indicates that if the stochastic gradient is biased, |E φ̂ − φ̄| might diverge when the growth of ∑_l ‖E ΔV_l‖ is faster than O(L). We point this out to show our result to be more informative than that of the SGLD [6], though this case might not happen in real applications. By expanding the proof for the bias, we are also able to bound the MSE of SG-MCMC algorithms, given in Theorem 3.\n\nTheorem 3. Under Assumption 1, and assuming Ũ_l is an unbiased estimate of U, for a smooth test function φ, the MSE of an SG-MCMC with a Kth-order integrator at time T = hL is bounded, for some C > 0 independent of (L, h), as\n\nE(φ̂ − φ̄)² ≤ C( (1/L) ∑_l E‖ΔV_l‖² / L + 1/(Lh) + h^{2K} ) .\n\nCompared to the SGLD [6], the extra term (1/L²) ∑_l E‖ΔV_l‖² relates to the variance of the noisy gradients. As long as the variance is bounded, the MSE still converges at the same rate. Specifically, when optimizing the bounds for the bias and MSE, the optimal bias decreases at a rate of L^{-K/(K+1)} with step size h ∝ L^{-1/(K+1)}, while this is L^{-2K/(2K+1)} with step size h ∝ L^{-1/(2K+1)} for the MSE¶. These rates decrease faster than those of the SGLD [6] when K ≥ 2. 
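The optimal rates quoted above follow from balancing the h-dependent terms in the bounds. As a quick check (our own illustration, with all constants dropped), minimizing the MSE-style bound 1/(Lh) + h^{2K} in closed form reproduces the stated scalings:

```python
# MSE-style bound with constants dropped: B(h) = 1/(L*h) + h^(2K).
# Setting dB/dh = -1/(L h^2) + 2K h^(2K-1) = 0 gives h* = (2K L)^(-1/(2K+1)),
# i.e. h* scales as L^(-1/(2K+1)) and B(h*) scales as L^(-2K/(2K+1)).

def optimal_h(L, K):
    return (2 * K * L) ** (-1.0 / (2 * K + 1))

def bound(L, K, h):
    return 1.0 / (L * h) + h ** (2 * K)

for K in (1, 2):
    L1, L2 = 10_000, 100_000
    ratio = bound(L2, K, optimal_h(L2, K)) / bound(L1, K, optimal_h(L1, K))
    predicted = (L2 / L1) ** (-2.0 * K / (2 * K + 1))
    print(K, ratio, predicted)  # the two numbers agree for each K
```

The same balancing with the bias bound 1/(Lh) + h^K yields h ∝ L^{-1/(K+1)} and the L^{-K/(K+1)} bias rate.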
The case of K = 2 for the SGHMC with our proposed symmetric splitting integrator is discussed in Section 4.\n\n¶To compare with the standard MCMC convergence rate of 1/2, a square root needs to be taken of the rate.\n\n3.2 Stationary invariant measures\n\nThe asymptotic invariant measures of SG-MCMCs correspond to L approaching infinity in the above analysis. According to the bias and MSE above, asymptotically (L → ∞) the sample average φ̂ is a random variable with mean E φ̂ = φ̄ + O(h^K), and variance E(φ̂ − E φ̂)² ≤ E(φ̂ − φ̄)² + E(φ̄ − E φ̂)² = O(h^{2K}), close to the true φ̄. This section defines a distance between measures, and studies more formally how the approximation errors affect the invariant measures of SG-MCMC algorithms.\n\nFirst we note that under mild conditions, the existence of a stationary invariant measure for an SG-MCMC can be guaranteed by application of the Krylov–Bogolyubov theorem [14]. Examining the conditions is beyond the scope of this paper. For simplicity, we follow [12] and assume stationary invariant measures do exist for SG-MCMCs. We denote the corresponding invariant measure as ρ̃_h, and the true posterior of a model as ρ. Similar to [12], we assume our numerical solver is geometrically ergodic, meaning that for a test function φ, we have ∫_X φ(x) ρ̃_h(d x) = ∫_X E_x φ(X_{lh}) ρ̃_h(d x) for any l ≥ 0 from the ergodic theorem, where E_x denotes the expectation conditional on X_0 = x. 
The geometric ergodicity implies that the integration is independent of the starting point of an algorithm. Given this, we have the following theorem on invariant measures of SG-MCMCs.\n\nTheorem 4. Assume that a Kth-order integrator is geometrically ergodic and its invariant measure ρ̃_h exists. Define the distance between the invariant measures ρ̃_h and ρ as: d(ρ̃_h, ρ) ≜ sup_φ | ∫_X φ(x) ρ̃_h(d x) − ∫_X φ(x) ρ(d x) |. Then any invariant measure ρ̃_h of an SG-MCMC is close to ρ with an error up to an order of O(h^K), i.e., there exists some C ≥ 0 such that: d(ρ̃_h, ρ) ≤ C h^K.\n\nFor a Kth-order integrator with full gradients, the corresponding invariant measure has been shown to be bounded by an order of O(h^K) [9, 12]. As a result, Theorem 4 suggests that only the order of the numerical approximation, and not the stochastic gradient approximation, affects the asymptotic invariant measure of an SG-MCMC algorithm. This is also reflected by the experiments presented in Section 5.\n\n3.3 SG-MCMCs with decreasing step sizes\n\nThe original SGLD was first proposed with a decreasing-step-size sequence [4], instead of fixed step sizes, as analyzed in [6]. In [5], the authors provide theoretical foundations for its asymptotic convergence properties. We demonstrate in this section that for general SG-MCMC algorithms, decreasing step sizes for each minibatch are also feasible. Note our techniques here are different from those used for the decreasing-step-size SGLD [5], yet interestingly result in similar convergence patterns. Specifically, by adapting the same techniques used in the previous sections, we establish conditions on the step size sequence to ensure asymptotic convergence, and develop theory on the finite-time ergodic error as well. 
To guarantee asymptotic consistency, the following conditions on decreasing step size sequences are required.\n\nAssumption 2. The step sizes {h_l} are decreasing‖, i.e., 0 < h_{l+1} < h_l, and satisfy 1) ∑_{l=1}^∞ h_l = ∞; and 2) lim_{L→∞} (∑_{l=1}^L h_l^{K+1}) / (∑_{l=1}^L h_l) = 0.\n\nDenote the finite sum of step sizes as S_L ≜ ∑_{l=1}^L h_l. Under Assumption 2, we need to modify the sample average φ̂ defined in Section 3.1 to a weighted summation of {φ(X_{lh})}: φ̃ = ∑_{l=1}^L (h_l / S_L) φ(X_{lh}). For simplicity, we assume Ũ_l to be an unbiased estimate of U, such that E ΔV_l = 0. Extending the techniques of the previous sections, we develop the following bounds for the bias and MSE.\n\nTheorem 5. Under Assumptions 1 and 2, for a smooth test function φ, the bias and MSE of a decreasing-step-size SG-MCMC with a Kth-order integrator at time S_L are bounded as:\n\nBIAS: |E φ̃ − φ̄| = O( 1/S_L + (∑_{l=1}^L h_l^{K+1}) / S_L )   (6)\n\nMSE: E(φ̃ − φ̄)² ≤ C( ∑_l (h_l² / S_L²) E‖ΔV_l‖² + 1/S_L + (∑_{l=1}^L h_l^{K+1})² / S_L² ) .   (7)\n\nAs a result, the asymptotic bias approaches 0 according to the assumptions. If further assuming∗∗ lim_{L→∞} ∑_{l=1}^L h_l² / S_L² = 0, the MSE also goes to 0. In words, decreasing-step-size SG-MCMCs are consistent. Among the kinds of decreasing step size sequences, a commonly recognized one is h_l ∝ l^{-α} for 0 < α < 1. We show in the following corollary that such a sequence is valid.\n\nCorollary 6. Using the step size sequence h_l ∝ l^{-α} for 0 < α < 1, all the step size assumptions in Theorem 5 are satisfied. 
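The conditions of Assumption 2 and the weighted average φ̃ are easy to check numerically for the h_l ∝ l^{-α} sequences of Corollary 6; a small sketch of our own, with K = 2 and α = 1/3 as an example:

```python
# Step sizes h_l = l^(-alpha), 0 < alpha < 1, checked against Assumption 2
# (here with K = 2 and alpha = 1/3 as an example):
#   1) S_L = sum_l h_l grows without bound;
#   2) (sum_l h_l^(K+1)) / S_L shrinks toward 0.

def step_size_sums(L, alpha, K):
    s = sum(l ** -alpha for l in range(1, L + 1))
    s_high = sum(l ** (-alpha * (K + 1)) for l in range(1, L + 1))
    return s, s_high / s

# Weighted sample average from Theorem 5 (the phi values here are placeholders):
def weighted_average(phi_values, step_sizes):
    s_L = sum(step_sizes)
    return sum(h * p for h, p in zip(step_sizes, phi_values)) / s_L

alpha, K = 1.0 / 3.0, 2
for L in (10 ** 3, 10 ** 4, 10 ** 5):
    s, r = step_size_sums(L, alpha, K)
    print(L, s, r)  # s grows, r decays
print(weighted_average([1.0, 3.0], [0.5, 0.5]))  # equal weights -> plain mean
```

With α = 1/3 and K = 2 the ratio behaves like log L / L^{2/3}, which vanishes as required.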
As a result, the bias and MSE approach zero asymptotically, i.e., the sample average φ̃ is asymptotically consistent with the posterior average φ̄.\n\n‖Actually the sequence need not be decreasing; we assume it is decreasing for simplicity.\n∗∗The assumption of ∑_{l=1}^∞ h_l² < ∞ satisfies this requirement, but is weaker than the original assumption.\n\nRemark 7. Theorem 5 indicates that the sample average φ̃ asymptotically converges to the true posterior average φ̄. It is possible to find the optimal decreasing rates for the specific decreasing sequence h_l ∝ l^{-α}. Specifically, using the bounds for ∑_{l=1}^L l^{-α} (see the proof of Corollary 6), for the two terms in the bias (6) in Theorem 5, 1/S_L decreases at a rate of O(L^{α−1}), whereas (∑_{l=1}^L h_l^{K+1}) / S_L decreases as O(L^{-Kα}). The balance between these two terms is achieved when α = 1/(K+1), which agrees with Theorem 2 on the optimal rate of fixed-step-size SG-MCMCs. Similarly, for the MSE (7), the first term decreases as L^{-1}, independent of α, while the second and third terms decrease as O(L^{α−1}) and O(L^{-2Kα}), respectively; thus the balance is achieved when α = 1/(2K+1), which also agrees with the optimal rate for the fixed-step-size MSE in Theorem 3.\n\nAccording to Theorem 5, one theoretical advantage of decreasing-step-size SG-MCMCs over fixed-step-size variants is the asymptotically unbiased estimation of posterior averages, though the benefit might not be significant in large-scale real applications where the asymptotic regime is not reached.\n\n4 Practical Numerical Integrators\n\nGiven the theory for SG-MCMCs with high-order integrators, we here propose a 2nd-order symmetric splitting integrator for practical use. 
The Euler integrator is known to be a 1st-order integrator; the proof and its detailed applications to the SGLD and SGHMC are given in Appendix I.\n\nThe main idea of the symmetric splitting scheme is to split the local generator L̃_l into several sub-generators that can be solved analytically††. Unfortunately, one cannot easily apply a splitting scheme to the SGLD. However, for the SGHMC, it can be readily split into: L̃_l = L_A + L_B + L_{O_l}, where\n\nL_A = p · ∇_θ , L_B = −D p · ∇_p , L_{O_l} = −∇_θ Ũ_l(θ) · ∇_p + 2D I_n : ∇_p ∇_p^T .   (8)\n\nThese sub-generators correspond to the following SDEs, which are all analytically solvable:\n\nA: { dθ = p dt, d p = 0 } ,  B: { dθ = 0, d p = −D p dt } ,  O: { dθ = 0, d p = −∇_θ Ũ_l(θ) dt + √(2D) dW } .   (9)\n\nBased on these sub-SDEs, the local Kolmogorov operator P̃^l_h is defined as: E[f(X_{lh})] = P̃^l_h f(X_{(l−1)h}), where P̃^l_h ≜ e^{(h/2) L_A} ∘ e^{(h/2) L_B} ∘ e^{h L_{O_l}} ∘ e^{(h/2) L_B} ∘ e^{(h/2) L_A}, so that the corresponding updates for X_{lh} = (θ_{lh}, p_{lh}) consist of the following 5 steps:\n\nθ^{(1)}_{lh} = θ_{(l−1)h} + p_{(l−1)h} h/2 ⇒ p^{(1)}_{lh} = e^{−Dh/2} p_{(l−1)h} ⇒ p^{(2)}_{lh} = p^{(1)}_{lh} − ∇_θ Ũ_l(θ^{(1)}_{lh}) h + √(2Dh) ζ_l ⇒ p_{lh} = e^{−Dh/2} p^{(2)}_{lh} ⇒ θ_{lh} = θ^{(1)}_{lh} + p_{lh} h/2 ,\n\nwhere (θ^{(1)}_{lh}, p^{(1)}_{lh}, p^{(2)}_{lh}) are intermediate variables. We denote such a splitting method the ABOBA scheme. From the Markovian property of a Kolmogorov operator, it is readily seen that all such symmetric splitting schemes (with different orders of 'A', 'B' and 'O') are equivalent [15]. Lemma 8 below shows the symmetric splitting scheme is a 2nd-order local integrator.\n\nLemma 8. 
The symmetric splitting scheme is a 2nd-order local integrator, i.e., the corresponding Kolmogorov operator P̃^l_h satisfies: P̃^l_h = e^{h L̃_l} + O(h³).\n\nWhen this integrator is applied to the SGHMC, the following properties can be obtained.\n\nRemark 9. Applying Theorem 2 to the SGHMC with the symmetric splitting scheme (K = 2), the bias is bounded as: |E φ̂ − φ̄| = O( 1/(Lh) + ∑_l ‖E ΔV_l‖ / L + h² ). The optimal bias decreasing rate is L^{-2/3}, compared to L^{-1/2} for the SGLD [6]. Similarly, the MSE is bounded by: E(φ̂ − φ̄)² ≤ C( (1/L) ∑_l E‖ΔV_l‖² / L + 1/(Lh) + h⁴ ), decreasing optimally as L^{-4/5} with step size h ∝ L^{-1/5}, compared to the MSE rate of L^{-2/3} for the SGLD [6]. This indicates that the SGHMC with the splitting integrator converges faster than the SGLD and SGHMC with 1st-order Euler integrators.\n\nRemark 10. For a decreasing-step-size SGHMC, based on Remark 7, the optimal step size decreasing rate for the bias is α = 1/3, and α = 1/5 for the MSE. 
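The five ABOBA sub-steps above translate directly into code. Below is a minimal, unofficial sketch of the resulting SGHMC update on a conjugate Gaussian toy model, where the posterior average of φ(θ) = θ² is known exactly; the model, constants, and the Gaussian noise ζ are our own illustrative choices, not the paper's experimental setup.

```python
import math
import random

random.seed(1)

# Toy conjugate model (same spirit as the synthetic experiment in Section 5):
# x_i ~ N(theta, 1), prior theta ~ N(0, 1); the posterior is Gaussian, so the
# posterior average of phi(theta) = theta^2 is known exactly.
N, m = 1000, 10
data = [random.gauss(0.5, 1.0) for _ in range(N)]
post_mean = sum(data) / (N + 1)
post_var = 1.0 / (N + 1)
phi_bar = post_var + post_mean ** 2

def stoch_grad(theta):
    # Minibatch estimate of grad U, rescaled by N/m.
    batch = random.sample(data, m)
    return theta + (N / m) * sum(theta - x for x in batch)

def aboba_step(theta, p, h, D):
    # The five sub-steps A, B, O, B, A of the symmetric splitting integrator.
    theta += p * h / 2                               # A: half-step in theta
    p *= math.exp(-D * h / 2)                        # B: exact half-step friction
    p += -stoch_grad(theta) * h + math.sqrt(2 * D * h) * random.gauss(0.0, 1.0)  # O
    p *= math.exp(-D * h / 2)                        # B
    theta += p * h / 2                               # A
    return theta, p

h, D, L = 1e-3, 30.0, 50000
theta, p, phi_sum = 0.0, 0.0, 0.0
for _ in range(L):
    theta, p = aboba_step(theta, p, h, D)
    phi_sum += theta ** 2

phi_hat = phi_sum / L
print(phi_hat, phi_bar)
```

Note that the B and O sub-steps use the exact solutions of their sub-SDEs (an exponential friction decay and a Gaussian increment), which is what makes the composition 2nd-order accurate.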
These agree with their fixed-step-size counterparts in Remark 9, and thus are faster than the SGLD/SGHMC with 1st-order Euler integrators.\n\n††This is different from the traditional splitting in the SDE literature [9, 15], where L instead of L̃_l is split.\n\nFigure 1: Comparisons of symmetric splitting and Euler integrators.\n\nFigure 2: Bias of SGHMC-D (left) and MSE of SGHMC-F (right) with different step size rates α. Thick red curves correspond to theoretically optimal rates.\n\n5 Experiments\n\nWe here verify our theory and compare with related algorithms on both synthetic data and large-scale machine learning applications.\n\nSynthetic data. We consider a standard Gaussian model where x_i ∼ N(θ, 1), θ ∼ N(0, 1). 1000 data samples {x_i} are generated, and every minibatch in the stochastic gradient is of size 10. The test function is defined as φ(θ) ≜ θ², with an explicit expression for the posterior average. To evaluate the expectations in the bias and MSE, we average over 200 runs with random initializations.\n\nFirst we compare the invariant measures (with L = 10⁶) of the proposed splitting integrator and the Euler integrator for the SGHMC. Results of the SGLD are omitted since they are not as competitive. Figure 1 plots the biases with different step sizes. It is clear that the Euler integrator has larger biases in the invariant measure, and quickly explodes when the step size becomes large, which does not happen for the splitting integrator. In real applications we also find this happens frequently (shown in the next section), making the Euler scheme an unstable integrator.\n\nNext we examine the asymptotically optimal step size rates for the SGHMC. From the theory we know these are α = 1/3 for the bias and α = 1/5 for the MSE, in both the fixed-step-size SGHMC (SGHMC-F) and the decreasing-step-size SGHMC (SGHMC-D). 
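The "explicit expression for the posterior average" in this conjugate setup is straightforward to spell out; for reference, a short computation (our own code and constants):

```python
import random

random.seed(2)

# For x_i ~ N(theta, 1) with prior theta ~ N(0, 1), conjugacy gives
#   theta | x ~ N( sum(x) / (N + 1), 1 / (N + 1) ),
# so the posterior average of phi(theta) = theta^2 is
#   E[theta^2 | x] = Var[theta | x] + E[theta | x]^2.
N = 1000
data = [random.gauss(0.0, 1.0) for _ in range(N)]  # illustrative synthetic draws

post_mean = sum(data) / (N + 1)
post_var = 1.0 / (N + 1)
phi_bar = post_var + post_mean ** 2
print(post_mean, post_var, phi_bar)
```

Having φ̄ in closed form is what makes the bias |E φ̂ − φ̄| and the MSE directly measurable in this synthetic study.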
For the step sizes, we did a grid search to select the best prefactors, which resulted in h = 0.033 × L^{-α} for the SGHMC-F and h_l = 0.045 × l^{-α} for the SGHMC-D, with different α values. We plot the traces of the bias for the SGHMC-D and the MSE for the SGHMC-F in Figure 2. Similar results for the bias of the SGHMC-F and the MSE of the SGHMC-D are plotted in Appendix K. We find that when the rates are smaller than the theoretically optimal ones, i.e., α = 1/3 (bias) and α = 1/5 (MSE), the bias and MSE tend to decrease faster than at the optimal rates at the beginning (especially for the SGHMC-F), but eventually they slow down and are surpassed by the optimal rates, consistent with the asymptotic theory. This also suggests that if only a small number of iterations is feasible, setting a larger step size than the theoretically optimal one might be beneficial in practice.\n\nFinally, we study the relative convergence speed of the SGHMC and SGLD. We test both fixed-step-size and decreasing-step-size versions. For the fixed-step-size experiments, the step sizes are set to h = C L^{-α}, with α chosen according to the theory for the SGLD and SGHMC. To provide a fair comparison, the constant C is selected via a grid search from 10^{-3} to 0.5 with an interval of 0.002 for L = 500; it is then fixed in the other runs with different L values. The parameter D in the SGHMC is selected within (10, 20, 30) as well. For the decreasing-step-size experiments, an initial step size is chosen within [0.003, 0.05] with an interval of 0.002 for the different algorithms‡‡, and then it decreases according to their theoretical optimal rates. Figure 3 shows a comparison of the biases for the SGHMC and SGLD. 
As indicated by both theory and experiments, the SGHMC with the splitting integrator yields a faster convergence speed than the SGLD with an Euler integrator.

Large-scale machine learning applications For real applications, we test the SGLD with an Euler integrator, the SGHMC with the splitting integrator (SGHMC-S), and the SGHMC with an Euler integrator (SGHMC-E).

‡‡Using the same initial step size is not fair because the SGLD requires much smaller step sizes.

Figure 3: Biases for the fixed-step-size (left) and decreasing-step-size (right) SGHMC and SGLD.

First we test them on the latent Dirichlet allocation model (LDA) [16]. The data consist of 10M documents randomly downloaded from Wikipedia, using scripts provided in [17]. We randomly select 1K documents for testing and validation, respectively. As in [17, 18], the vocabulary size is 7,702. We use the Expanded-Natural reparametrization trick to sample from the probabilistic simplex [19]. The step sizes are chosen from {2, 5, 8, 20, 50, 80} × 10^−5, and the parameter D from {20, 40, 80}. The minibatch size is set to 100, with one pass over the whole data in the experiments (and therefore L = 100K). We collect 300 posterior samples to calculate test perplexities, with a standard holdout technique as described in [18].

Next a recently studied sigmoid belief network model (SBN) [20] is tested, which is a directed counterpart of the popular RBM model. We use a one-layer model where the bottom layer corresponds to binary observed data, which is generated from the hidden (also binary) layer via a sigmoid function. As shown in [18], the SBN is readily learned by SG-MCMCs.
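The one-layer SBN generative process just described (binary hidden layer, binary visible layer linked by a sigmoid) can be sketched as follows; the layer sizes match the experiment (784 = 28 × 28 visible units), but the parameters W, b, c are hypothetical placeholders for values that would be learned by SG-MCMC.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden, n_visible = 100, 784           # sizes from the MNIST experiment

# Hypothetical parameters; in the paper these are learned by SG-MCMC
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))  # hidden-to-visible weights
b = np.zeros(n_hidden)                    # hidden biases
c = np.zeros(n_visible)                   # visible biases

# Generative pass: sample binary hiddens, then binary visibles through a sigmoid
h = (rng.random(n_hidden) < sigmoid(b)).astype(float)
v = (rng.random(n_visible) < sigmoid(W @ h + c)).astype(float)
print(v.shape)
```

Learning adapts W, b, c so that samples of v match the binarized digit data; test likelihoods are then estimated with AIS as described below.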
We test the model on the MNIST dataset, which consists of 60K handwritten digits of size 28 × 28 for training, and 10K for testing. Again the step sizes are chosen from {3, 4, 5, 6} × 10^−4, and D from {0.9, 1, 5}/√h. The minibatch size is set to 200, with 5000 iterations for training. As for the RBM [21], an advanced technique called the annealed importance sampler (AIS) is adopted to calculate test likelihoods.

We briefly describe the results here; more details are provided in Appendix K. For LDA with 200 topics, the best test perplexities for the SGHMC-S, SGHMC-E and SGLD are 1168, 1180 and 2496, respectively; these are 1157, 1187 and 2511, respectively, for 500 topics. Similar to the synthetic experiments, we also observed that the SGHMC-E crashed when using large step sizes. This is illustrated more clearly in Figure 4. For the SBN with 100 hidden units, we obtain negative test log-likelihoods of 103, 105 and 126 for the SGHMC-S, SGHMC-E and SGLD, respectively; these are 98, 100, and 110 for 200 hidden units. Note the SGHMC-S on the SBN yields state-of-the-art test likelihoods compared to [22], which reported 113 for 200 hidden units. A decrease of 2 units in the negative log-likelihood with AIS is considered a reasonable gain [20], approximately equal to the gain from a shallow to a deep model [22]. The SGHMC-S is more accurate and robust than the SGHMC-E due to its 2nd-order splitting integrator.

6 Conclusion
For the first time, we develop theory to analyze finite-time ergodic errors, as well as asymptotic invariant measures, of general SG-MCMCs with high-order integrators. Our theory applies to both fixed-step-size and decreasing-step-size SG-MCMCs, which are shown to be equivalent in terms of convergence rates, and are faster with our proposed 2nd-order integrator than previous SG-MCMCs with 1st-order Euler integrators.
Experiments on both synthetic and large real datasets validate our theory. The theory also indicates that, with increasing order of the numerical integrator, the convergence rate of an SG-MCMC can theoretically approach the standard MCMC convergence rate. Given the theoretical convergence results, SG-MCMCs can be used effectively in real applications.

Acknowledgments Supported in part by ARO, DARPA, DOE, NGA and ONR. We acknowledge Jonathan C. Mattingly and Chunyuan Li for inspiring discussions, and David Carlson for the AIS code.

Figure 4: SGHMC with 200 topics. The Euler integrator explodes with large step sizes.

References
[1] T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.
[2] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.
[3] H. Risken. The Fokker-Planck equation. Springer-Verlag, New York, 1989.
[4] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
[5] Y. W. Teh, A. H. Thiery, and S. J. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. Technical Report arXiv:1409.0578, University of Oxford, UK, Sep. 2014. URL http://arxiv.org/abs/1409.0578.
[6] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. (Non-)asymptotic properties of stochastic gradient Langevin dynamics. Technical Report arXiv:1501.00438, University of Oxford, UK, January 2015. URL http://arxiv.org/abs/1501.00438.
[7] M. Betancourt. The fundamental incompatibility of Hamiltonian Monte Carlo and data subsampling. In ICML, 2015.
[8] I. Sato and H. Nakagawa.
Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Itô process. In ICML, 2014.
[9] B. Leimkuhler and X. Shang. Adaptive thermostats for noisy gradient systems. Technical Report arXiv:1505.06889v1, University of Edinburgh, UK, May 2015. URL http://arxiv.org/abs/1505.06889.
[10] A. Abdulle, G. Vilmart, and K. C. Zygalakis. Long time accuracy of Lie–Trotter splitting methods for Langevin dynamics. SIAM J. Numer. Anal., 53(1):1–16, 2015.
[11] R. Hasminskii. Stochastic stability of differential equations. Springer-Verlag Berlin Heidelberg, 2012.
[12] J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Construction of numerical time-average and stationary measures via Poisson equations. SIAM J. Numer. Anal., 48(2):552–577, 2010.
[13] P. Giesl. Construction of global Lyapunov functions using radial basis functions. Springer Berlin Heidelberg, 2007.
[14] N. N. Bogoliubov and N. M. Krylov. La théorie générale de la mesure dans son application à l'étude des systèmes dynamiques de la mécanique non linéaire. Ann. Math. II (in French), 38(1):65–113, 1937.
[15] B. Leimkuhler and C. Matthews. Rational construction of stochastic numerical methods for molecular sampling. AMRX, 2013(1):34–56, 2013.
[16] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[17] M. D. Hoffman, D. M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[18] Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML, 2015.
[19] S. Patterson and Y. W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, 2013.
[20] Z. Gan, R. Henao, D. Carlson, and L. Carin. Learning deep sigmoid belief networks with data augmentation. In AISTATS, 2015.
[21] R. Salakhutdinov and I. Murray.
On the quantitative analysis of deep belief networks. In ICML, 2008.
[22] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.
[23] A. Debussche and E. Faou. Weak backward error analysis for SDEs. SIAM J. Numer. Anal., 50(3):1734–1752, 2012.
[24] M. Kopec. Weak backward error analysis for overdamped Langevin processes. IMA J. Numer. Anal., 2014.