{"title": "Stochastic Gradient Hamiltonian Monte Carlo Methods with Recursive Variance Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 3835, "page_last": 3846, "abstract": "Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithms have received increasing attention in both theory and practice. In this paper, we propose a Stochastic Recursive Variance-Reduced gradient HMC (SRVR-HMC) algorithm. It makes use of a semi-stochastic gradient estimator that recursively accumulates the gradient information to reduce the variance of the stochastic gradient. We provide a convergence analysis of SRVR-HMC for sampling from a class of non-log-concave distributions and show that SRVR-HMC converges faster than all existing HMC-type algorithms based on underdamped Langevin dynamics. Thorough experiments on synthetic and real-world datasets validate our theory and demonstrate the superiority of SRVR-HMC.", "full_text": "Stochastic Gradient Hamiltonian Monte Carlo Methods with Recursive Variance Reduction

Difan Zou
Department of Computer Science
University of California, Los Angeles
Los Angeles, CA 90095
knowzou@cs.ucla.edu

Pan Xu
Department of Computer Science
University of California, Los Angeles
Los Angeles, CA 90095
panxu@cs.ucla.edu

Quanquan Gu
Department of Computer Science
University of California, Los Angeles
Los Angeles, CA 90095
qgu@cs.ucla.edu

Abstract

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithms have received increasing attention in both theory and practice. In this paper, we propose a Stochastic Recursive Variance-Reduced gradient HMC (SRVR-HMC) algorithm. It makes use of a semi-stochastic gradient estimator that recursively accumulates the gradient information to reduce the variance of the stochastic gradient.
We provide a convergence analysis of SRVR-HMC for sampling from a class of non-log-concave distributions and show that SRVR-HMC converges faster than all existing HMC-type algorithms based on underdamped Langevin dynamics. Thorough experiments on synthetic and real-world datasets validate our theory and demonstrate the superiority of SRVR-HMC.

1 Introduction

Markov chain Monte Carlo (MCMC) has been widely used in Bayesian learning [1] as a powerful tool for posterior sampling, inference and decision making. More recently, Hamiltonian MCMC approaches based on the Hamiltonian Langevin dynamics [24, 43] have received extensive attention in both theory and practice [16, 5, 40, 14, 6, 18, 55, 28] due to their widespread empirical successes. Hamiltonian Langevin dynamics (a.k.a. underdamped Langevin dynamics) [19] is described by the following stochastic differential equation:

dV_t = -γ V_t dt - u ∇f(X_t) dt + √(2γu) dB_t,    dX_t = V_t dt,    (1.1)

where γ > 0 is called the friction parameter, u > 0 is the inverse mass, X_t, V_t ∈ R^d are the position and velocity variables of the continuous-time dynamics respectively, and B_t ∈ R^d is the standard Brownian motion. Under mild assumptions on the function f(x), the Markov process (X_t, V_t) has a unique stationary distribution which is proportional to exp{-f(x) - ||v||_2^2/(2u)}, and the marginal distribution of X_t converges to a stationary distribution π ∝ exp{-f(x)}. Hence, we can apply numerical integrators to discretize the continuous-time dynamics (1.1) in order to sample from the target distribution π. Direct Euler-Maruyama discretization [34] of (1.1) gives rise to

v_{k+1} = v_k - γη v_k - uη ∇f(x_k) + √(2γuη) ε_k,    x_{k+1} = x_k + η v_k,    (1.2)

where η > 0 is the step size and ε_k is a standard Gaussian random vector.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The update (1.2) is known as underdamped Langevin MCMC (UL-MCMC) and can also be viewed as a type of Hamiltonian Monte Carlo (HMC) method [43, 6]. Cheng et al.
[18] studied a modified version of UL-MCMC in (1.2) and proved its convergence rate to the stationary distribution in 2-Wasserstein distance for sampling from strongly log-concave densities. When the target distribution is non-log-concave but admits certain good properties, the convergence guarantees of UL-MCMC in Wasserstein metric have also been established in [27, 17, 8, 30].

In practice, f(x) in (1.2) can be chosen as the negative log-likelihood function on the training data:

f(x) = n^{-1} Σ_{i=1}^n f_i(x),    (1.3)

where n is the size of the training data and f_i(x): R^d → R is the negative log-likelihood on the i-th data point. For a large dataset, it can be extremely inefficient to compute the full gradient ∇f(x), which consists of the gradients ∇f_i(x) for all data points. To alleviate this computational burden, stochastic gradient Hamiltonian Monte Carlo (SGHMC) methods [16, 40] and stochastic gradient UL-MCMC (SG-UL-MCMC) [18] were proposed, which replace the full gradient in (1.2) with a mini-batch stochastic gradient. While SGHMC is much more efficient than HMC methods, it comes at the cost of a slower mixing rate due to the large variance caused by stochastic gradients [5, 6, 23]. To resolve this dilemma, Zou et al. [55] and Li et al. [37] proposed stochastic variance-reduced gradient HMC methods using variance reduction techniques [33, 36] and proved that variance reduction can accelerate the convergence of both HMC and SGHMC for sampling and Bayesian inference. For sampling from a class of non-log-concave densities, Gao et al. [30] showed that SGHMC converges to the stationary distribution of (1.1) up to an ε-error in 2-Wasserstein distance with Õ(ε^{-8} μ_*^{-5})^1 gradient complexity^2, where μ_* is a lower bound on the spectral gap of the Markov process generated by (1.1) and is of the order exp(-Õ(d)) in the worst case [27].
This gradient complexity of SGHMC is very high even for a moderate sampling error ε.

In this paper, we aim to reduce the gradient complexity of SGHMC for sampling from non-log-concave densities. The fundamental challenge in speeding up HMC-type methods lies in the control of the discretization error between the Hamiltonian Langevin dynamics (1.1) and discrete algorithms. We propose a novel algorithm, namely stochastic recursive variance-reduced gradient HMC (SRVR-HMC), which employs a recursively updated semi-stochastic gradient estimator to reduce the variance of the stochastic gradient and improve the discretization error. Note that such a recursively updated semi-stochastic gradient estimator was originally proposed in [44, 29] for finding stationary points in stochastic nonconvex optimization. Nevertheless, our analysis is fundamentally different from that in [44, 29], since their goal is just to find a stationary point of f(x), while we aim to sample from the target distribution π ∝ exp(-f(x)), which concentrates on the global minimizer of f(x) and is substantially more challenging.

1.1 Our contributions

We summarize our major contributions as follows.

• We propose a new HMC algorithm called SRVR-HMC for approximate sampling, which is built on a recursively updated semi-stochastic gradient estimator that significantly decreases the discretization error and speeds up the sampling process.

• We establish the convergence guarantee of SRVR-HMC for sampling from non-log-concave densities satisfying a certain dissipativeness condition.
Specifically, we show that its gradient complexity for achieving ε-error in 2-Wasserstein distance is Õ((n + ε^{-2} n^{1/2} μ_*^{-3/2}) ∧ ε^{-4} μ_*^{-2}). Remarkably, the convergence guarantee of SRVR-HMC is better than the Õ(ε^{-4} μ_*^{-3} n) gradient complexity of HMC [30] by a factor of at least Õ(ε^{-2} μ_*^{-3/2} n^{1/2}), and better than the Õ(ε^{-8} μ_*^{-5}) gradient complexity of SGHMC [30] by a factor of at least Õ(ε^{-4} μ_*^{-3}).

• With a proper choice of parameters, our algorithm reduces to UL-MCMC [18] and SG-UL-MCMC [18], which were originally proposed for sampling from strongly-log-concave distributions. Our theoretical analysis shows that these two algorithms can be used for sampling from non-log-concave distributions as well, and that they enjoy lower gradient complexities than HMC and SGHMC [30], which is of independent interest.

• We compare our algorithm with many state-of-the-art baselines through experiments on sampling from Gaussian mixture distributions, independent component analysis (ICA) and Bayesian logistic regression, which further validates the superiority of our algorithm.

^1 Õ(·) hides constant and logarithmic factors.
^2 Gradient complexity is the total number of stochastic gradients ∇f_i(x) an algorithm needs to compute in order to achieve ε-error in terms of a certain measurement.

1.2 Additional related work

There is also a vast literature on MCMC methods based on the overdamped Langevin dynamics [35]:

dX_t = -∇f(X_t) dt + √(2β) dB_t,    (1.4)

where β > 0 is the temperature parameter and B_t is Brownian motion. The convergence analysis of Langevin-based algorithms dates back to [46]. Mattingly et al. [41] established convergence rates for a class of discrete approximations of Langevin dynamics.
When the target distribution is smooth and strongly log-concave, the convergence of Langevin Monte Carlo (LMC) based on the discretization of (1.4) has been widely studied in terms of both total variation (TV) distance [21, 26] and 2-Wasserstein distance [22, 20]. Welling and Teh [50] proposed the stochastic gradient Langevin dynamics (SGLD) algorithm to avoid full gradient computation. Teh et al. [47] proposed to apply a decreasing step size in SGLD and proved its convergence in terms of mean square error (MSE). Vollmer et al. [48] characterized the bias of SGLD and further proposed a modified SGLD algorithm that removes the bias. The authors of [10] established a link between LMC, SGLD, SGLDFP (a variant of SGLD) and SGD, which shows that the stationary distributions of LMC and SGLDFP can be closer to the target density π as the sample size increases, while the dynamics of SGLD is more similar to that of SGD. Barkhagen et al. [4] and Chau et al. [13] studied the convergence of SGLD when the training data in (1.3) are dependent. In order to reduce the variance of SGLD, SVRG-LD and SAGA-LD were proposed by Dubey et al. [25], and their convergence has been studied in terms of MSE [25, 15] and 2-Wasserstein distance [56, 12]. Baker et al. [2] proposed to use control variates in SGLD, which can also reduce the variance and improve the convergence rate. Mou et al. [42] studied the generalization performance of SGLD from both stability and PAC-Bayesian perspectives. For nonconvex optimization, Raginsky et al. [45] proved the non-asymptotic convergence rate of SGLD and Zhang et al. [52] analyzed the hitting time of SGLD to local minima. Xu et al.
[51] further studied the global convergence of a class of Langevin dynamics based algorithms.

Table 1: Gradient complexity of different methods to achieve ε-error in 2-Wasserstein distance for sampling from non-log-concave densities.

Methods     | Gradient Complexity                                        |
LMC         | Õ(ε^{-4} λ_*^{-5} n)                                       | [45]
SGLD        | Õ(ε^{-8} λ_*^{-9})                                         | [45]
SVRG-LD     | Õ(n + ε^{-2} λ_*^{-4} n^{3/4} + ε^{-4} λ_*^{-4} n^{1/2})   | [57]
HMC         | Õ(ε^{-4} μ_*^{-3} n)                                       | [30]
UL-MCMC     | Õ(ε^{-2} μ_*^{-3/2} n)                                     | Corollary 3.9
SGHMC       | Õ(ε^{-8} μ_*^{-5})                                         | [30]
SG-UL-MCMC  | Õ(ε^{-6} μ_*^{-5/2})                                       | Corollary 3.9
SRVR-HMC    | Õ((n + ε^{-2} n^{1/2} μ_*^{-3/2}) ∧ ε^{-4} μ_*^{-2})       | Corollary 3.5

In Table 1, we compare the gradient complexity of different methods to achieve ε-error in 2-Wasserstein distance for sampling from non-log-concave densities^3. LMC, SGLD and SVRG-LD are based on the overdamped Langevin dynamics (1.4), while HMC, UL-MCMC, SGHMC, SG-UL-MCMC and SRVR-HMC are based on the underdamped Langevin dynamics (1.1). The HMC/SGHMC algorithms studied in [30] and the UL-MCMC/SG-UL-MCMC algorithms [18] analyzed in this paper are slightly different, since they rely on different discretization methods for the Hamiltonian Langevin dynamics (1.1). In addition, note that λ_* denotes the spectral gap of the Markov process generated by the overdamped Langevin dynamics (1.4), which is also of the order exp(-Õ(d)) [9, 45] in the worst case.

^3 The original results for LMC/SGLD in [45] and for HMC/SGHMC in [30] are about the global convergence in nonconvex optimization. Yet their results can be adapted to sampling from non-log-concave distributions, and the corresponding gradient complexities can be spelled out from their convergence rates.

From Table 1, we can see that the proposed SRVR-HMC algorithm strictly outperforms HMC, UL-MCMC, SGHMC and SG-UL-MCMC, and also outperforms LMC, SGLD and SVRG-LD in terms of the dependency on the target accuracy ε and the training sample size n. We remark that for a general non-log-concave target density, λ_* and μ_* are not directly comparable, though both of them are exponential in the dimension d. However, it has been shown that for a class of target densities, μ_* can be of the order O(λ_*^{1/2}) [27, 30], which suggests that SRVR-HMC is also strictly better than LMC, SGLD and SVRG-LD for sampling from such densities.

Notation. We denote a discrete update by a lower-case bold symbol x_k and the continuous-time dynamics by an upper-case italicized bold symbol X_t. For a vector x ∈ R^d, we denote by ||x||_2 the Euclidean norm. For random vectors x_k, X_t ∈ R^d, we denote their probability distribution functions by P(x_k) and P(X_t) respectively. For a probability measure μ, we denote by E_μ[X] the expectation of X under μ. The 2-Wasserstein distance between two probability measures u and v is

W_2(u, v) = ( inf_{ζ ∈ Γ(u,v)} ∫_{R^d × R^d} ||X_u - X_v||_2^2 dζ(X_u, X_v) )^{1/2},

where the infimum is taken over all joint distributions ζ with u and v as its marginal distributions. 1(·) denotes the indicator function. We denote the index set [n] = {1, 2, ..., n} for an integer n. We use a_n = O(b_n) to denote that a_n ≤ C b_n for some constant C > 0 independent of n, and use a_n = Õ(b_n) to hide the logarithmic factors in b_n. The Vinogradov notation a_n ≲ b_n is used synonymously with a_n = O(b_n).
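In one dimension the infimum above has a closed form: the optimal coupling matches order statistics, so W_2 between two equal-size empirical measures can be estimated from sorted samples. The sketch below is an illustration of the metric only, not part of the paper's algorithm:

```python
def empirical_w2_1d(xs, ys):
    """W_2 between two equal-size 1-d empirical measures.

    In 1-d the optimal coupling pairs the sorted samples, so
    W_2^2 is the mean squared gap between order statistics.
    """
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return (sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)) ** 0.5

samples = [0.1, -0.5, 2.0, 0.7]
# Identical samples have zero distance; a constant shift c gives W_2 = |c|.
assert empirical_w2_1d(samples, samples) == 0.0
assert abs(empirical_w2_1d(samples, [s + 1.5 for s in samples]) - 1.5) < 1e-12
```

In higher dimensions no such sorting shortcut exists, which is why the experiments later resort to an MSE surrogate instead of computing W_2 directly.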
We denote min{a, b} and max{a, b} by a ∧ b and a ∨ b respectively. The ceiling function ⌈x⌉ outputs the least integer greater than or equal to x.

2 The proposed algorithm

In this section, we present our algorithm, SRVR-HMC, for sampling from a target distribution of the form π ∝ exp{-f(x)}. Our algorithm is shown in Algorithm 1, which has a multi-epoch structure. In detail, there are ⌈K/L⌉ epochs, where K is the total number of iterations and L denotes the epoch length, i.e., the number of iterations within each inner loop.

Recall that the update rule of HMC in (1.2) requires the computation of the full gradient ∇f(x_k) at each iteration, which is the average of n stochastic gradients. This causes a high per-iteration complexity when n is large. Therefore, we propose to leverage stochastic gradients to offset the computational burden. At the beginning of the j-th epoch, we compute a stochastic gradient g̃_j based on a batch of training data (uniformly sampled from [n] without replacement), as shown in Line 4 of Algorithm 1, where the batch is denoted by B̃_j with batch size |B̃_j| = B_0. In each epoch, we make use of the stochastic path-integrated differential estimator [29] to compute the following semi-stochastic gradient:

g_k = 1/B Σ_{i ∈ B_k} [∇f_i(x_k) - ∇f_i(x_{k-1})] + g_{k-1},    (2.1)

where B_k is another uniformly sampled (without replacement) mini-batch from [n] with mini-batch size |B_k| = B. Unlike the unbiased stochastic gradient estimators in SGHMC [16] and SVR-HMC [55], g_k is a biased estimator of the full gradient ∇f(x_k) conditioned on x_k. However, we can show that while being biased, the variance of g_k is substantially smaller than that of the unbiased ones. This is the key reason why our algorithm can achieve a faster convergence rate than existing HMC-type algorithms.
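As a concrete illustration, the recursion in (2.1) can be sketched in a few lines of Python. The toy quadratic finite sum below is an assumption made purely for this sketch; note that when the full batch is used at every step, the correction term is exact and the recursion reproduces the full gradient:

```python
import random

# Toy finite sum: f_i(x) = (x - a_i)^2 / 2, so grad f_i(x) = x - a_i (1-d sketch).
a = [0.5, -1.0, 2.0, 0.3]
n = len(a)

def grad_fi(i, x):
    return x - a[i]

def full_grad(x):
    return sum(grad_fi(i, x) for i in range(n)) / n

def srvr_estimator(xs, batch_size):
    """Recursive semi-stochastic estimator (2.1) along the iterates xs.

    g_0 is the full gradient at xs[0] (i.e., B_0 = n); afterwards
    g_k = (1/B) * sum_{i in B_k} [grad f_i(x_k) - grad f_i(x_{k-1})] + g_{k-1}.
    """
    g = full_grad(xs[0])
    gs = [g]
    for k in range(1, len(xs)):
        batch = random.sample(range(n), batch_size)
        g = sum(grad_fi(i, xs[k]) - grad_fi(i, xs[k - 1])
                for i in batch) / batch_size + g
        gs.append(g)
    return gs

xs = [0.0, 0.1, 0.25, 0.4]
# With batch_size = n the gradient differences are averaged exactly,
# so every g_k coincides with the full gradient at x_k.
gs = srvr_estimator(xs, batch_size=n)
assert all(abs(g - full_grad(x)) < 1e-12 for g, x in zip(gs, xs))
```

With B < n the estimator is biased conditioned on x_k, but its error only enters through the small increments ∇f_i(x_k) - ∇f_i(x_{k-1}), which is what keeps its variance low when consecutive iterates are close.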
Based on the semi-stochastic gradient in (2.1), we update the position and velocity variables as follows:

v_{k+1} = v_k e^{-γη} - uγ^{-1}(1 - e^{-γη}) g_k + ε_k^v,
x_{k+1} = x_k + γ^{-1}(1 - e^{-γη}) v_k + uγ^{-2}(γη + e^{-γη} - 1) g_k + ε_k^x,    (2.2)

where η is the step size, and u, γ are the inverse mass and friction parameter defined in (1.1), which are usually treated as tunable hyperparameters in practice. Moreover, ε_k^v, ε_k^x ∈ R^d are zero-mean Gaussian random vectors with covariance matrices satisfying

E[ε_k^v (ε_k^v)^T] = u(1 - e^{-2γη}) · I,
E[ε_k^x (ε_k^x)^T] = uγ^{-2}(2γη + 4e^{-γη} - e^{-2γη} - 3) · I,
E[ε_k^v (ε_k^x)^T] = uγ^{-1}(1 - 2e^{-γη} + e^{-2γη}) · I,    (2.3)

where I ∈ R^{d×d} is the identity matrix.

Algorithm 1 Stochastic Recursive Variance-Reduced gradient HMC (SRVR-HMC)
1: input: initial points x̃_0 = x_0, v_0; step size η; batch sizes B_0 and B; total number of iterations K; epoch length L
2: for j = 0, ..., ⌈K/L⌉ do
3:   Uniformly sample a subset of indices B̃_j ⊆ [n] with |B̃_j| = B_0
4:   Compute g̃_j = 1/B_0 Σ_{i ∈ B̃_j} ∇f_i(x̃_j)
5:   for l = 0, ..., L-1 do
6:     k = jL + l
7:     if l = 0 then
8:       g_k = g̃_j
9:     else
10:      Uniformly sample a subset of indices B_k ⊆ [n] with |B_k| = B
11:      Compute g_k = 1/B Σ_{i ∈ B_k} (∇f_i(x_k) - ∇f_i(x_{k-1})) + g_{k-1}
12:    end if
13:    x_{k+1} = x_k + γ^{-1}(1 - e^{-γη}) v_k + uγ^{-2}(γη + e^{-γη} - 1) g_k + ε_k^x
14:    v_{k+1} = v_k e^{-γη} - uγ^{-1}(1 - e^{-γη}) g_k + ε_k^v
15:  end for
16:  x̃_{j+1} = x_{(j+1)L}
17: end for
18: output: x_K

The covariance of the Gaussian noises in (2.3) is obtained by integrating the Hamiltonian Langevin dynamics (1.1) over a time period of length η. It is worth noting that our update rule in (2.2) and the construction of the Gaussian noises in (2.3) follow Cheng et al. [18], Zou et al. [55] and Cheng et al. [17], except that we use a different semi-stochastic gradient estimator, as shown in (2.1). In contrast, Cheng et al.
[18] uses the full gradient and a noisy gradient, and Zou et al. [55] use an unbiased semi-stochastic gradient based on SVRG [33].

We remark here that the semi-stochastic gradient estimator in (2.1) was originally proposed for finding stationary points in finite-sum optimization [44, 29] and further extended in [49, 32]. In addition, another semi-stochastic gradient estimator called SNVRG [54, 53] has also been demonstrated to achieve a similar convergence rate in finite-sum optimization. Despite using the same semi-stochastic gradient estimator, our work differs from [44, 29] in at least two aspects: (1) the sampling problem studied in this paper is different from the optimization problem studied in [44, 29]: our goal is to sample from a target distribution concentrating on the global minimizer of f(x) such that the sample distribution is close to the target distribution in 2-Wasserstein distance, whereas Nguyen et al. [44] and Fang et al. [29] aim at finding a stationary point of f(x) with a small gradient; and (2) the algorithms in [44, 29] have only one update variable, while our SRVR-HMC algorithm has an additional Hamiltonian momentum term and therefore has two update variables (i.e., the velocity and position variables). The Hamiltonian momentum is essential for underdamped Langevin Monte Carlo methods to achieve a smaller discretization error than overdamped methods such as SGLD [50] and SVRG-LD [25]. At the same time, it also introduces a great technical challenge in our theoretical analysis and requires nontrivial efforts.

3 Main theory

In this section, we provide the convergence guarantee for Algorithm 1. In particular, we characterize the 2-Wasserstein distance between the distribution of the output of Algorithm 1 and the target distribution π ∝ e^{-f(x)}.
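Before turning to the analysis, one step of the update (2.2), with the correlated Gaussian noise drawn according to (2.3), can be sketched as follows. The one-dimensional setting, the hand-rolled 2×2 Cholesky factor, and the parameter values below are assumptions of this sketch, not prescriptions from the paper:

```python
import math
import random

def noise_covariance(u, gamma, eta):
    """Per-coordinate covariance entries of (eps_x, eps_v) from (2.3)."""
    e1, e2 = math.exp(-gamma * eta), math.exp(-2.0 * gamma * eta)
    s_vv = u * (1.0 - e2)
    s_xx = u / gamma**2 * (2.0 * gamma * eta + 4.0 * e1 - e2 - 3.0)
    s_xv = u / gamma * (1.0 - 2.0 * e1 + e2)
    return s_xx, s_xv, s_vv

def srvr_hmc_step(x, v, g, u, gamma, eta):
    """One update of (2.2) in one dimension, with g the semi-stochastic gradient."""
    s_xx, s_xv, s_vv = noise_covariance(u, gamma, eta)
    # Draw correlated (eps_x, eps_v) via a 2x2 Cholesky factor of the covariance.
    l11 = math.sqrt(s_xx)
    l21 = s_xv / l11
    l22 = math.sqrt(s_vv - l21**2)
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    eps_x, eps_v = l11 * z1, l21 * z1 + l22 * z2
    e1 = math.exp(-gamma * eta)
    x_new = (x + (1.0 - e1) / gamma * v
             + u / gamma**2 * (gamma * eta + e1 - 1.0) * g + eps_x)
    v_new = v * e1 - u / gamma * (1.0 - e1) * g + eps_v
    return x_new, v_new

# The 2x2 covariance must be positive definite for the Cholesky step to exist.
s_xx, s_xv, s_vv = noise_covariance(u=1.0, gamma=2.0, eta=0.1)
assert s_xx > 0 and s_vv > 0 and s_xx * s_vv - s_xv**2 > 0

x, v = srvr_hmc_step(0.0, 0.0, g=0.3, u=1.0, gamma=2.0, eta=0.1)
assert math.isfinite(x) and math.isfinite(v)
```

Note the -g_k (rather than +∇f) sign convention: the gradient of f enters with a minus sign because the chain targets π ∝ e^{-f(x)}.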
We focus on sampling from non-log-concave densities that satisfy smoothness and dissipativeness conditions, which are formally defined as follows.

Assumption 3.1 (Smoothness). Each f_i in (1.3) is M-smooth, i.e., there exists a positive constant M > 0 such that for any x, y ∈ R^d,

||∇f_i(x) - ∇f_i(y)||_2 ≤ M ||x - y||_2.

Note that Assumption 3.1 directly implies that the function f(x) is also M-smooth.

Assumption 3.2 (Dissipativeness). There exist constants m, b > 0 such that for any x ∈ R^d,

⟨∇f(x), x⟩ ≥ m ||x||_2^2 - b.

Different from the smoothness assumption, Assumption 3.2 is only required for f(x) rather than each f_i(x). The dissipativeness assumption is standard in the analysis of sampling from non-log-concave densities and is essential to guarantee the convergence of underdamped Langevin dynamics [46, 41].

3.1 Convergence analysis of the proposed algorithm

Now we state our main theorem, which establishes the convergence rate of Algorithm 1.

Theorem 3.3. Suppose Assumptions 3.1 and 3.2 hold and the initial points are x_0 = v_0 = 0. If we set γ ≤ 2√(Mu) and the step size η ≤ O(mM^{-3} ∧ m^{1/2} M^{-3/2} L^{-1/2}), the output x_K of Algorithm 1 satisfies

W_2(P(x_K), π) ≤ Γ_1 ( (1 + L/B) Kη^3 + Kη/(2B_0) · 1(B_0 < n) )^{1/4} + Γ_0 e^{-μ_* Kη},

where B_0, B are the batch and mini-batch sizes, L is the epoch length, and μ_* = exp(-Õ(d)) is a lower bound on the spectral gap of the Markov process generated by (1.1). Γ_0 = Õ(μ_*^{-1}) and Γ_1 = 2D_1 (M^2 γ^{-3} u D_2)^{1/4} are problem-dependent parameters, where D_1, D_2 are constants depending only on u, γ, m, b, M, d, f(0) - f(x*), ||x*||_2 and max_{i∈[n]} ||∇f_i(0)||_2, and x* = argmin_{x∈R^d} f(x) is the global minimizer of f.

Theorem 3.3 states that the 2-Wasserstein distance between the output of SRVR-HMC and the target distribution is upper bounded by two terms: the first term is the discretization error between the discrete-time Algorithm 1 and the continuous-time dynamics (1.1), which goes to zero as the step size η goes to zero; the second term represents the ergodicity of the Markov process generated by (1.1), which converges to zero exponentially fast.

Remark 3.4. The result in Theorem 3.3 involves a term μ_* with an exponential dependence on the dimension d, which is a lower bound on the spectral gap of the Markov process generated by (1.1). When f is nonconvex, the exponential dependence of μ_* on the dimension is unavoidable under the dissipativeness assumption [9]. However, this exponential dependence on d can be weakened by imposing stronger assumptions on f(x). For instance, Eberle et al. [27] and Gao et al. [30] showed that for a symmetric double-well potential f(x), μ_* is of the order Ω(1/a), where a is the distance between the two wells and is typically polynomial in the dimension d. Another example is given by Cheng et al. [17]: when f(x) is strongly convex outside an ℓ_2 ball centered at the origin with radius R, μ_* is of the order exp(-O(M R^2)), where M is the smoothness parameter.

From Theorem 3.3, we can obtain the gradient complexity of SRVR-HMC by optimizing the choice of the mini-batch size B and batch size B_0, as stated in the following corollary.

Corollary 3.5.
Under the same assumptions as in Theorem 3.3, if we set B_0 = Õ(ε^{-4} μ_*^{-1} ∧ n), B ≲ B_0^{1/2}, L = O(B_0/B), and η = Õ(ε^2 μ_*^{1/2} B / B_0^{1/2}), then Algorithm 1 requires Õ((n + ε^{-2} n^{1/2} μ_*^{-3/2}) ∧ ε^{-4} μ_*^{-2}) stochastic gradient evaluations to achieve ε-error in 2-Wasserstein distance.

Remark 3.6. Recalling the gradient complexities of HMC and SGHMC in Table 1, it is evident that the gradient complexity of Algorithm 1 is lower than that of HMC [30] by a factor of Õ(ε^{-2} n^{1/2} μ_*^{-3/2} ∨ n μ_*^{-1}) and lower than that of SGHMC [30] by a factor of Õ(ε^{-6} n^{-1/2} μ_*^{-7/2} ∨ ε^{-4} μ_*^{-3}).

Remark 3.7. As shown in Table 1, the gradient complexities of the overdamped Langevin dynamics based algorithms, including LMC, SGLD and SVRG-LD, depend on the spectral gap λ_* of the Markov process generated by (1.4). Although the magnitudes of μ_* and λ_* are not directly comparable, they are generally of the same order in the worst case [9, 45, 27], so we treat them as equal in the following comparison. Specifically, the gradient complexity of SRVR-HMC is better than those of LMC [45], SGLD [45] and SVRG-LD [57] by factors of Õ(ε^{-2} n^{1/2} ∨ n), Õ(ε^{-6} n^{-1/2} ∨ ε^{-4}) and Õ(ε^{-2} ∨ n^{1/2}) respectively.

3.2 Implication for UL-MCMC and SG-UL-MCMC

Recall the proposed SRVR-HMC algorithm in Algorithm 1: if we set the epoch length to L = 1, Algorithm 1 degenerates to SG-UL-MCMC [18], with the following update rule:

v_{k+1} = v_k e^{-γη} - uγ^{-1}(1 - e^{-γη}) g̃_k + ε_k^v,
x_{k+1} = x_k + γ^{-1}(1 - e^{-γη}) v_k + uγ^{-2}(γη + e^{-γη} - 1) g̃_k + ε_k^x,    (3.1)

where g̃_k = |B̃_k|^{-1} Σ_{i ∈ B̃_k} ∇f_i(x_k) denotes the stochastic gradient computed in the k-th iteration. In addition, if we replace g̃_k with the full gradient ∇f(x_k), SG-UL-MCMC in (3.1) further reduces to UL-MCMC [18].
Although these two algorithms were originally proposed for sampling from strongly-log-concave densities [18], in this subsection we show that our analysis of SRVR-HMC can be easily adapted to derive the gradient complexity of UL-MCMC/SG-UL-MCMC for sampling from non-log-concave densities. We first state the convergence of SG-UL-MCMC in the following theorem.

Theorem 3.8. Under the same assumptions as in Theorem 3.3, the output x_K of the SG-UL-MCMC algorithm in (3.1) satisfies

W_2(P(x_K), π) ≤ Γ_1 [ 2Kη^3 + Kη/(2B_0) · 1(B_0 < n) ]^{1/4} + Γ_0 e^{-μ_* Kη},

where B_0 denotes the mini-batch size, and μ_*, Γ_0 and Γ_1 are defined in Theorem 3.3.

Similar to the result in Theorem 3.3, the sampling error of SG-UL-MCMC in 2-Wasserstein distance is also controlled by the discretization error of the discrete algorithm (3.1) and the ergodicity rate of the Hamiltonian Langevin dynamics (1.1). In particular, the main difference between the convergence results of SG-UL-MCMC and SRVR-HMC lies in the discretization error term, which leads to a different gradient complexity for SG-UL-MCMC.

Corollary 3.9. Under the same assumptions as in Theorem 3.3, if we set η = Õ(ε^2 μ_*^{1/2}) and B_0 = Õ(ε^{-4} μ_*^{-1}), SG-UL-MCMC in (3.1) requires Õ(ε^{-6} μ_*^{-5/2}) stochastic gradient evaluations to achieve ε-error in 2-Wasserstein distance. Moreover, UL-MCMC requires Õ(ε^{-2} μ_*^{-3/2} n) stochastic gradient evaluations to achieve ε-error in 2-Wasserstein distance.

Remark 3.10. Our theoretical analysis suggests that the gradient complexity of UL-MCMC is better than that of HMC [30] by a factor of O(ε^{-2} μ_*^{-3/2}), and the gradient complexity of SG-UL-MCMC is better than that of SGHMC [30] by a factor of O(ε^{-2} μ_*^{-5/2}). We note that Cheng et al.
[17] proved an O(1/ε) convergence rate of UL-MCMC for sampling from a smaller class of non-log-concave densities in 1-Wasserstein distance. Their result is not directly comparable to ours, since the 1-Wasserstein distance is strictly smaller than the 2-Wasserstein distance and, more importantly, their result relies on a stronger assumption than the dissipativeness assumption used in our paper, as we commented in Remark 3.4.

4 Experiments

In this section, we evaluate the empirical performance of SRVR-HMC on both synthetic and real datasets. We compare our proposed algorithm with existing overdamped and underdamped Langevin based stochastic gradient algorithms, including SGLD [50], SVRG-LD [25], SGHMC [16], SG-UL-MCMC [18] and SVR-HMC [55].

4.1 Sampling from Gaussian mixture distributions

We first demonstrate the performance of SRVR-HMC for fitting a Gaussian mixture model on synthetic data. In this case, the density on each data point is defined as

e^{-f_i(x)} = 2 e^{-||x - a_i||_2^2 / 2} + e^{-||x + a_i||_2^2 / 2},

which is proportional to the probability density function (PDF) of a two-component Gaussian mixture density with weights 1/3 and 2/3. By simple calculation, it can be verified that when ||a_i||_2 ≥ 1, f_i(x) is nonconvex but satisfies Assumption 3.2, and so does f(x) = 1/n Σ_{i=1}^n f_i(x).

Figure 1: Kernel density estimation for the Gaussian mixture distribution (SRVR-HMC vs. the true density).

We generated n = 500 vectors {a_i}_{i=1,...,n} ⊂ R^2 to construct the target density function. We first show that the proposed algorithm can well approximate the target distribution.
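As a sanity check on this construction, the per-example potential f_i and the corresponding unnormalized density can be evaluated directly. The one-dimensional setting and the specific value of a_i below are assumptions made purely for this sketch:

```python
import math

def neg_log_density_i(x, a):
    """f_i(x) for the mixture e^{-f_i(x)} = 2 e^{-|x-a|^2/2} + e^{-|x+a|^2/2} (1-d sketch)."""
    return -math.log(2.0 * math.exp(-((x - a) ** 2) / 2.0)
                     + math.exp(-((x + a) ** 2) / 2.0))

a = 2.0  # assumed component location with |a| >= 1, so f_i is nonconvex
# The mode near +a carries twice the weight of the mode near -a,
# so the unnormalized density is larger at +a than at -a.
assert math.exp(-neg_log_density_i(a, a)) > math.exp(-neg_log_density_i(-a, a))

# Dissipativeness spot-check (Assumption 3.2): far from the origin the
# dominant component makes f_i'(x) * x grow like x^2, up to a constant.
h = 1e-6
for x in (5.0, -5.0, 10.0):
    grad = (neg_log_density_i(x + h, a) - neg_log_density_i(x - h, a)) / (2 * h)
    assert grad * x > 0.5 * x**2 - 10.0
```

The numerical gradient here is only for illustration; in the experiments the analytic gradients ∇f_i are what Algorithm 1 feeds into its semi-stochastic estimator.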
Specifically, we run SRVR-HMC for 10^4 data passes and use the last 10^5 iterates to visualize the estimated distribution, where the batch size, mini-batch size and epoch length are set to B_0 = n, B = 1 and L = n respectively. As a reference, we run MCMC with Metropolis-Hastings (MH) correction to represent the underlying distribution. Following [3], we display the kernel densities of the random samples generated by SRVR-HMC in Figure 1, which shows that they approximate the Gaussian mixture distribution well.

In Figure 2(a), we compare the performance of SRVR-HMC with the baseline algorithms for sampling from the Gaussian mixture distribution. Since directly computing the 2-Wasserstein distance is expensive, we resort to the mean square error (MSE) E[||x̂ - x̄||_2^2], where x̄ = E_π[x] is obtained by running MCMC with MH correction, and x̂ = Σ_{s=1001}^k x_s / (k - 1000) is the sample path average, where x_s denotes the s-th position iterate of the algorithm and we discard the first 1000 iterates as burn-in. We report the MSE results of all algorithms in Figure 2(a), repeating each algorithm 20 times. It can be seen that SRVR-HMC converges faster than all baseline algorithms, which is well aligned with our theory. In addition, it can be seen that SG-UL-MCMC outperforms SGHMC, which is consistent with our results in Table 1. We also compare the convergence of SRVR-HMC with different batch sizes in Figure 2(b). It can be observed that SRVR-HMC works well for all small batch sizes (B < 20) but becomes significantly worse when B is large (B = 50). This observation is consistent with Corollary 3.5, where we prove that the gradient complexity remains the same as long as B ≲ B_0^{1/2}.

Figure 2: Experiment results for sampling from the Gaussian mixture distribution, where the X-axis represents the number of data passes and the Y-axis represents the MSE: (a) Comparison with baseline algorithms.
(b) Convergence of SRVR-HMC with\nvarying batch size B.\n\nmean square error (MSE) E[kbx\u00afxk2\nbx = Pk\n\n500 1000 1500 2000 2500 3000 3500 4000 4500 5000\n\n500 1000 1500 2000 2500 3000 3500 4000 4500 5000\n\n(b)\n\n(a)\n\n0\n\n0\n\n0\n\n0\n\n3\n\n2\n\n3\n\n2\n\n1\n\n1.5\n\n1\n\n0.5\n\n0\n\n2.5\n\n2.5\n\n1.5\n\n0.5\n\nIndependent components analysis\n\n4.2\nIn\nWe further run the sampling algorithms for independent components analysis (ICA) tasks.\nthe ICA model, the input are examples {xi}n\ni=1, and the likelihood function can be written as\np(x|W) = |det(W)|Ql\nj=1 p(w>j x), where W 2 Rd\u21e5l is the model matrix, d is the problem dimen-\nsion, l denotes the number of independent components and wj denotes the j-th column of W. Fol-\nlowing [50, 25] we set p(w>j x) = 1/(4 cosh2(w>j x/2)) with a Gaussian prior p(W) \u21e0N (0, 1I).\nThen the negative log-posterior can be written as f (W) = 1/nPn\n\ni=1 fi(W), where\n\nF /2.\n\nWe compare the performance of SRVR-HMC with all the baseline algorithms on MEG dataset4,\nwhich consists of 17730 time-points in 122 channels. 
In order to explore the performance of our\n\nfi(W) = n log(|det(W)|) 2nPl\n\nj=1 log cosh(w>j xi/2) + kWk2\n\n4http://research.ics.aalto.fi/ica/eegmeg/MEG_data.html\n\n8\n\n\f8\n\n6\n\n4\n\n2\n\n0\n\n-2\n\n0\n\n2\n\n4\n\n6\n\n8\n\n10\n\n12\n\n14\n\n16\n\n18\n\n20\n\n8\n\n6\n\n4\n\n2\n\n0\n\n-2\n\n-4\n\n0\n\n2\n\n4\n\n6\n\n8\n\n10\n\n12\n\n14\n\n16\n\n18\n\n20\n\n8\n\n6\n\n4\n\n2\n\n0\n\n-2\n\n0\n\n2\n\n4\n\n6\n\n8\n\n10\n\n12\n\n14\n\n16\n\n18\n\n20\n\n8\n\n6\n\n4\n\n2\n\n0\n\n-2\n\n-4\n\n0\n\n1\n\n2\n\n3\n\n4\n\n5\n\n6\n\n7\n\n8\n\n9\n\n10\n\n(a) n=500, B0=100\n\n(b) n=5000, B0=1000\n\n(c) n=500, B0=100\n\n(d) n=5000, B0=1000\n\nFigure 3: Experiment results for ICA, where X-axis represents the number of data passes, and Y-axis\nrepresents the negative log likelihood on the test dataset: (a)-(b) Comparison with different baselines\n(c)-(d) Convergence of SRVR-HMC with varying batch size B.\n\nalgorithm for different sample size, we extract two subset with sizes n = 500 and n = 5000 from the\noriginal dataset for training, and regard the rest 12730 examples as test dataset. For inference, we\ncompute the sample path average while discarding the \ufb01rst 100 iterates as burn-in. We \ufb01rst compare\nthe convergence performance of SRVR-HMC with baseline algorithms and report the negative log\nlikelihood on test dataset in Figures 3(a)-3(b), where the batch size, minibatch size and epoch length\nare set to be B0 = n/5, B = 10 and L = B0/B, and the rest hyper parameters are tuned to achieve\nthe best performance. It is worth noting that we do not perform the normalization when evaluating\nthe test likelihood, thus the negative log likelihood results may be smaller than 0. From Figures\n3(a)-3(b) it can be clearly seen that SRVR-HMC outperforms all baseline algorithms, which validates\nits superior theoretical properties. Again, we can see that SG-UL-MCMC can decrease the negative\nlog likelihood much faster than SGHMC, which is well aligned with our theory. 
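To make the ICA objective above concrete, the following is a minimal NumPy sketch of the per-example negative log-posterior $f_i(W)$ and the full objective $f(W)$ (assuming a square model matrix, i.e. $d = l$, so that $\det(W)$ is defined; the function names and random data are illustrative, not taken from the authors' code):

```python
import numpy as np

def f_i(W, x_i, n):
    """Per-example negative log-posterior for the ICA model:
    f_i(W) = -n log|det(W)| + 2n sum_j log cosh(w_j^T x_i / 2) + ||W||_F^2 / 2,
    assuming W is square (d = l)."""
    _, logabsdet = np.linalg.slogdet(W)        # numerically stable log|det(W)|
    s = W.T @ x_i                              # s_j = w_j^T x_i for each column w_j
    return -n * logabsdet + 2 * n * np.sum(np.log(np.cosh(s / 2))) + 0.5 * np.sum(W**2)

def f(W, X):
    """Full negative log-posterior f(W) = (1/n) sum_i f_i(W), with rows of X as examples."""
    n = X.shape[0]
    return np.mean([f_i(W, x, n) for x in X])
```

In the experiments above, the recursive variance-reduced gradient estimator would be built from minibatch gradients of these $f_i$'s, with batch size $B_0 = n/5$ and minibatch size $B = 10$.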
Furthermore, we evaluate the convergence for different minibatch sizes, which is displayed in Figures 3(c)-3(d), where the batch size $B_0$ is fixed as $n/5$ for both scenarios. It can be seen that SRVR-HMC attains similar convergence performance for all small minibatch sizes ($B \leq 10$ when $B_0 = 100$ and $B \leq 20$ when $B_0 = 1000$), which again corroborates our theory that the gradient complexity remains the same when $B \lesssim B_0^{1/2}$.

We also evaluate our proposed algorithm SRVR-HMC on Bayesian logistic regression. We defer the additional experimental results to Appendix E due to the space limit.

5 Conclusions

We propose a novel algorithm, SRVR-HMC, based on Hamiltonian Langevin dynamics for sampling from a class of non-log-concave target densities. We show that SRVR-HMC achieves a lower gradient complexity in 2-Wasserstein distance than all existing HMC-type algorithms. In addition, we show that our algorithm reduces to UL-MCMC and SG-UL-MCMC with properly chosen parameters. Our analysis of SRVR-HMC directly applies to these two algorithms and suggests that UL-MCMC/SG-UL-MCMC are faster than HMC/SGHMC for sampling from non-log-concave densities.

Acknowledgement

We would like to thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation BIGDATA IIS-1855099 and CAREER Award IIS-1906169. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

[1] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.

[2] Jack Baker, Paul Fearnhead, Emily B Fox, and Christopher Nemeth. Control variates for stochastic gradient MCMC. Statistics and Computing, 2018. ISSN 1573-1375. doi: 10.1007/s11222-018-9826-2.

[3] Rémi Bardenet, Arnaud Doucet, and Chris Holmes.
On Markov chain Monte Carlo methods for tall data. The Journal of Machine Learning Research, 18(1):1515–1557, 2017.

[4] M Barkhagen, NH Chau, É Moulines, M Rásonyi, S Sabanis, and Y Zhang. On stochastic gradient Langevin dynamics with dependent data streams in the logconcave case. arXiv preprint arXiv:1812.02709, 2018.

[5] Michael Betancourt. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In International Conference on Machine Learning, pages 533–540, 2015.

[6] Michael Betancourt, Simon Byrne, Sam Livingstone, Mark Girolami, et al. The geometric foundations of Hamiltonian Monte Carlo. Bernoulli, 23(4A):2257–2298, 2017.

[7] François Bolley and Cédric Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse. Série VI. Mathématiques, 14, 2005. doi: 10.5802/afst.1095.

[8] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. arXiv preprint arXiv:1805.00452, 2018.

[9] Anton Bovier, Michael Eckhoff, Véronique Gayrard, and Markus Klein. Metastability in reversible diffusion processes I: Sharp asymptotics for capacities and exit times. Journal of the European Mathematical Society, 6(4):399–424, 2004.

[10] Nicolas Brosse, Alain Durmus, and Eric Moulines. The promises and pitfalls of stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pages 8268–8278, 2018.

[11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[12] Niladri S Chatterji, Nicolas Flammarion, Yi-An Ma, Peter L Bartlett, and Michael I Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo.
arXiv preprint arXiv:1802.05431, 2018.

[13] Ngoc Huy Chau, Éric Moulines, Miklos Rásonyi, Sotirios Sabanis, and Ying Zhang. On stochastic gradient Langevin dynamics with dependent data streams: the fully non-convex case. arXiv preprint arXiv:1905.13142, 2019.

[14] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2278–2286, 2015.

[15] Changyou Chen, Wenlin Wang, Yizhe Zhang, Qinliang Su, and Lawrence Carin. A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. arXiv preprint arXiv:1709.01180, 2017.

[16] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.

[17] Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett, and Michael I Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018.

[18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 300–323, 2018.

[19] William Coffey and Yu P Kalmykov. The Langevin equation: with applications to stochastic problems in physics, chemistry and electrical engineering, volume 27. World Scientific, 2012.

[20] Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Conference on Learning Theory, pages 678–689, 2017.

[21] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.

[22] Arnak S Dalalyan and Avetik G Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.

[23] Khue-Dung Dang, Matias Quiroz, Robert Kohn, Minh-Ngoc Tran, and Mattias Villani. Hamiltonian Monte Carlo with energy conserving subsampling. Journal of Machine Learning Research, 20(100):1–31, 2019.

[24] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

[25] Kumar Avinava Dubey, Sashank J Reddi, Sinead A Williamson, Barnabas Poczos, Alexander J Smola, and Eric P Xing. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pages 1154–1162, 2016.

[26] Alain Durmus, Eric Moulines, et al. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.

[27] Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Couplings and quantitative contraction rates for Langevin dynamics. arXiv preprint arXiv:1703.01617, 2017.

[28] Murat A Erdogdu, Lester Mackey, and Ohad Shamir. Global non-convex optimization with discretized diffusions. In Advances in Neural Information Processing Systems, pages 9671–9680, 2018.

[29] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 686–696, 2018.

[30] Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global convergence of stochastic gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration.
arXiv preprint arXiv:1809.04618, 2018.

[31] István Gyöngy. Mimicking the one-dimensional marginal distributions of processes having an Itô differential. Probability Theory and Related Fields, 71(4):501–516, 1986.

[32] Kaiyi Ji, Zhe Wang, Yi Zhou, and Yingbin Liang. Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. In International Conference on Machine Learning, pages 3100–3109, 2019.

[33] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[34] Peter E Kloeden and Eckhard Platen. Higher-order implicit strong numerical schemes for stochastic differential equations. Journal of Statistical Physics, 66(1):283–314, 1992.

[35] Paul Langevin. On the theory of Brownian motion. CR Acad. Sci. Paris, 146:530–533, 1908.

[36] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355, 2017.

[37] Zhize Li, Tianyi Zhang, and Jian Li. Stochastic gradient Hamiltonian Monte Carlo with variance reduction for Bayesian inference. arXiv preprint arXiv:1803.11159, 2018.

[38] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[39] Robert S Liptser and Albert N Shiryaev. Statistics of random processes: I. General theory, volume 5. Springer Science & Business Media, 2013.

[40] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.

[41] Jonathan C Mattingly, Andrew M Stuart, and Desmond J Higham. Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise.
Stochastic Processes and their Applications, 101(2):185–232, 2002.

[42] Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference on Learning Theory, pages 605–638, 2018.

[43] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.

[44] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. arXiv preprint arXiv:1703.00102, 2017.

[45] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703, 2017.

[46] Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.

[47] Yee Whye Teh, Alexandre H Thiery, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):193–225, 2016.

[48] Sebastian J Vollmer, Konstantinos C Zygalakis, and Yee Whye Teh. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):5504–5548, 2016.

[49] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.

[50] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pages 681–688, 2011.

[51] Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization.
In Advances in Neural Information Processing Systems, pages 3126–3137, 2018.

[52] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory, pages 1980–2022, 2017.

[53] Dongruo Zhou, Pan Xu, and Quanquan Gu. Finding local minima via stochastic nested variance reduction. arXiv preprint arXiv:1806.08782, 2018.

[54] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3925–3936, 2018.

[55] Difan Zou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced Hamilton Monte Carlo methods. In Proceedings of the 35th International Conference on Machine Learning, pages 6028–6037, 2018.

[56] Difan Zou, Pan Xu, and Quanquan Gu. Subsampled stochastic variance-reduced gradient Langevin dynamics. In Proceedings of International Conference on Uncertainty in Artificial Intelligence, 2018.

[57] Difan Zou, Pan Xu, and Quanquan Gu. Sampling from non-log-concave distributions via variance-reduced gradient Langevin dynamics. In Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2936–2945. PMLR, 2019.