{"title": "Variance Reduction in Stochastic Gradient Langevin Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 1154, "page_last": 1162, "abstract": "Stochastic gradient-based Monte Carlo methods such as stochastic gradient Langevin dynamics are useful tools for posterior inference on large scale datasets in many machine learning applications. These methods scale to large datasets by using noisy gradients calculated using a mini-batch or subset of the dataset. However, the high variance inherent in these noisy gradients degrades performance and leads to slower mixing. In this paper, we present techniques for reducing variance in stochastic gradient Langevin dynamics, yielding novel stochastic Monte Carlo methods that improve performance by reducing the variance in the stochastic gradient. We show that our proposed method has better theoretical guarantees on convergence rate than stochastic Langevin dynamics. This is complemented by impressive empirical results obtained on a variety of real world datasets, and on four different machine learning tasks (regression, classification, independent component analysis and mixture modeling). These theoretical and empirical contributions combine to make a compelling case for using variance reduction in stochastic Monte Carlo methods.", "full_text": "Variance Reduction in Stochastic Gradient\n\nLangevin Dynamics\n\nAvinava Dubey\u2217, Sashank J. Reddi\u2217, Barnab\u00b4as P\u00b4oczos, Alexander J. Smola, Eric P. Xing\n\nDepartment of Machine Learning\n\nCarnegie-Mellon University\n\n{akdubey, sjakkamr, bapoczos, alex, epxing}@cs.cmu.edu\n\nPittsburgh, PA 15213\n\nSinead A. 
Williamson
IROM/Statistics and Data Science
University of Texas at Austin
Austin, TX 78712
sinead.williamson@mccombs.utexas.edu

Abstract

Stochastic gradient-based Monte Carlo methods such as stochastic gradient Langevin dynamics are useful tools for posterior inference on large scale datasets in many machine learning applications. These methods scale to large datasets by using noisy gradients calculated using a mini-batch or subset of the dataset. However, the high variance inherent in these noisy gradients degrades performance and leads to slower mixing. In this paper, we present techniques for reducing variance in stochastic gradient Langevin dynamics, yielding novel stochastic Monte Carlo methods that improve performance by reducing the variance in the stochastic gradient. We show that our proposed method has better theoretical guarantees on convergence rate than stochastic gradient Langevin dynamics. This is complemented by impressive empirical results obtained on a variety of real world datasets, and on four different machine learning tasks (regression, classification, independent component analysis and mixture modeling). These theoretical and empirical contributions combine to make a compelling case for using variance reduction in stochastic Monte Carlo methods.

1 Introduction

Monte Carlo methods are the gold standard in Bayesian posterior inference due to their asymptotic convergence properties; however, convergence can be slow in large models due to poor mixing. Gradient-based Monte Carlo methods such as Langevin dynamics and Hamiltonian Monte Carlo [10] allow us to use gradient information to more efficiently explore posterior distributions over continuous-valued parameters. By traversing contours of a potential energy function based on the posterior distribution, these methods allow us to make large moves in the sample space. 
Although gradient-based methods are efficient in exploring the posterior distribution, they are limited by the computational cost of computing the gradient and evaluating the likelihood on large datasets. As a result, stochastic variants are a popular choice when working with large datasets [15].

Stochastic gradient methods [13] have long been used in the optimization community to decrease the computational cost of gradient-based optimization algorithms such as gradient descent. These methods replace the (expensive, but accurate) gradient evaluation with a noisy (but computationally cheap) gradient evaluation on a random subset of the data. With appropriate scaling, this gradient evaluated on a random subset of the data acts as a proxy for the true gradient. A carefully designed schedule of step sizes ensures convergence of the stochastic algorithm.

A similar idea has been employed to design stochastic versions of gradient-based Monte Carlo methods [15, 1, 2, 9]. By evaluating the derivative of the log-likelihood on only a small subset of data points, we can drastically reduce computational costs. However, using stochastic gradients comes at a cost: while the resulting estimates are unbiased, they have very high variance. This leads to an increased probability of selecting paths with high deviation from the true gradient, leading to slower convergence.

There have been a number of variations proposed on the basic stochastic gradient Langevin dynamics (SGLD) model of [15]: [4] incorporates a momentum term to improve posterior exploration; [6] proposes using additional variables to stabilize fluctuations; [12] proposes modifications to facilitate exploration of the simplex; [7] provides sampling solutions for correlated data.

* denotes equal contribution

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

However, none of these methods directly tries to reduce the variance in the computed stochastic gradient.

As was the case with the original SGLD algorithm, we look to the optimization community for inspiration, since high variance is also detrimental in stochastic gradient-based optimization. A plethora of variance reduction techniques have recently been proposed to alleviate this issue for the stochastic gradient descent (SGD) algorithm [8, 5, 14]. By incorporating a carefully designed (usually unbiased) term into the update sequence of SGD, these methods reduce the variance that arises due to the stochastic gradients in SGD, thereby providing strong theoretical and empirical performance.

Inspired by these successes in the optimization community, we propose methods for reducing the variance in stochastic gradient Langevin dynamics. Our approach bridges the gap between the faster (in terms of iterations) convergence of non-stochastic Langevin dynamics and the faster per-iteration speed of SGLD. While our approach draws its motivation from the stochastic optimization literature, it is to our knowledge the first approach that aims to directly reduce variance in a gradient-based Monte Carlo method. While our focus is on Langevin dynamics, our approach is easily applicable to other gradient-based Monte Carlo methods.

Main Contributions: We propose a new Langevin algorithm designed to reduce variance in the stochastic gradient, with minimal additional computational overhead. We also provide a memory-efficient variant of our algorithm. We demonstrate theoretical convergence to the true posterior under reasonable assumptions, and show that the rate of convergence has a tighter bound than one previously shown for SGLD. 
We complement these theoretical results with empirical evaluation showing impressive speed-ups versus a standard SGLD algorithm, on a variety of machine learning tasks such as regression, classification, independent component analysis and mixture modeling.

2 Preliminaries

Let X = {x_i}_{i=1}^N be a set of data items modeled using a likelihood function p(X|θ) = ∏_{i=1}^N p(x_i|θ), where the parameter θ has prior distribution p(θ). We are interested in sampling from the posterior distribution p(θ|X) ∝ p(θ) ∏_{i=1}^N p(x_i|θ). If N is large, standard Langevin dynamics is not feasible due to the high cost of repeated gradient evaluations; a more scalable approach is to use a stochastic variant [15], which we will refer to as stochastic gradient Langevin dynamics, or SGLD. SGLD uses a classical Robbins-Monro stochastic approximation to the true gradient [13]. At each iteration t of the algorithm, a subset X_t = {x_{t1}, . . . , x_{tn}} of the data is sampled, and the parameters are updated using only this subset of data, according to

    Δθ_t = (h_t/2) (∇ log p(θ_t) + (N/n) Σ_{i=1}^n ∇ log p(x_{ti}|θ_t)) + η_t,    (1)

where η_t ∼ N(0, h_t) and h_t is the learning rate, set in such a fashion that Σ_{t=1}^∞ h_t = ∞ and Σ_{t=1}^∞ h_t² < ∞. This provides an approximation to a first-order Langevin diffusion, with dynamics

    dθ = −(1/2) ∇_θ U dt + dW,    (2)

where U is the unnormalized negative log posterior. Equation 2 has stationary distribution ρ(θ) ∝ exp{−U(θ)}. Let φ̄ = ∫ φ(θ)ρ(θ) dθ, where φ represents a test function of interest. For a numerical method that generates samples {θ_t}_{t=0}^{T−1}, let φ̂ denote the empirical average (1/T) Σ_{t=0}^{T−1} φ(θ_t). 
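To make the SGLD update in Equation 1 concrete, here is a minimal sketch on a toy one-dimensional Gaussian model with a conjugate prior. The model, constants and names are illustrative, not from the paper:

```python
import math
import random

random.seed(0)

# Toy model (illustrative): x_i ~ N(theta, 1) with a N(0, 10) prior on theta,
# so the exact posterior mean is available for comparison.
N = 1000
data = [2.0 + random.gauss(0.0, 1.0) for _ in range(N)]

def grad_log_prior(theta):
    return -theta / 10.0                      # d/dtheta of log N(theta; 0, 10)

def grad_log_lik(x, theta):
    return x - theta                          # d/dtheta of log N(x; theta, 1)

def sgld(T=3000, n=10, a=1e-4, b=1.0, gamma=0.33):
    """Equation 1: theta += (h_t/2)(prior grad + (N/n) * minibatch grad sum) + noise."""
    theta, samples = 0.0, []
    for t in range(T):
        h = a * (b + t) ** (-gamma)           # decreasing step size h_t
        batch = [random.choice(data) for _ in range(n)]
        grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(x, theta) for x in batch)
        theta += (h / 2.0) * grad + random.gauss(0.0, math.sqrt(h))
        samples.append(theta)
    return samples

samples = sgld()
exact_posterior_mean = sum(data) / (N + 0.1)  # conjugate Gaussian posterior mean
est = sum(samples[1500:]) / 1500              # average of post-burn-in samples
```

Because the toy posterior is conjugate, the sample average can be compared directly against the exact posterior mean.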
Furthermore, let ψ denote the solution to the Poisson equation Lψ = φ − φ̄, where L is the generator of the diffusion, given by

    Lψ = ⟨∇_θ ψ, ∇_θ U⟩ + (1/2) Σ_i ∇²_i ψ.    (3)

The decreasing step size h_t in our approximation (Equation 1) means we do not have to incorporate a Metropolis-Hastings step to correct for the discretization error relative to Equation 2; however, it comes at the cost of slowing the mixing rate of the algorithm. We note that, while the discretized Langevin diffusion is Markovian, its convergence guarantees rely on the quality of the approximation, rather than from standard Markov chain Monte Carlo analyses that rely on this Markovian property.

A second source of error comes from the use of stochastic approximations to the true gradients. This is equivalent to using an approximate generator L̃_t = L + ΔV_t, where ΔV_t = (∇U_t − ∇U) · ∇ and ∇U_t is the current stochastic approximation to ∇U. The key contribution of this paper will be replacing the Robbins-Monro approximation to U with a lower-variance approximation, thus reducing the error.

To see more clearly the effect of the variance of our stochastic approximation on the estimator error, we present a result derived for SGLD by [3]:

Theorem 1 [3]. Let U_t be an unbiased estimate of U and h_t = h for all t ∈ {1, . . . , T}. 
Then, under certain reasonable assumptions (concretely, assumption [A1] in Section 4), for a smooth test function φ, the MSE of SGLD at time K = hT is bounded, for some C > 0 independent of (T, h), in the following manner:

    E(φ̂ − φ̄)² ≤ C ( (1/T) Σ_t E[‖ΔV_t‖²] / T + 1/(Th) + h² ).    (4)

We refer to the first term inside the parentheses as T1; here ‖·‖ represents the operator norm.

We clearly see that the MSE depends on the variance term E[‖ΔV_t‖²], which in turn depends on the variance of the noisy stochastic gradients. Since, for consistency, we require h → 0 as T → ∞,¹ provided E[‖ΔV_t‖²] is bounded by a constant τ, the term T1 ceases to dominate as T → ∞, meaning that the effect of noise in the stochastic gradient becomes negligible. However, outside this asymptotic regime, the effect of the variance term in Equation 4 remains significant. This motivates our efforts in this paper to decrease the variance of the approximate gradient, while maintaining an unbiased estimator.

An easy way to decrease the variance is to use larger minibatches. However, this comes at a considerable computational cost, undermining the whole benefit of using SGLD. Inspired by the recent success of variance reduction techniques in stochastic optimization [14, 8, 5], we take a rather different approach to reduce the effect of noisy gradients.

3 Variance Reduction for Langevin Dynamics

As we have seen in Section 2, reducing the variance of our stochastic approximation can reduce our estimation error. In this section, we introduce two approaches for variance reduction, based on recent variance reduction algorithms for gradient descent [5, 8]. 
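Before detailing the two algorithms, here is a small self-contained numeric sketch (the toy model and constants are illustrative, not from the paper) of how a control-variate correction built from stored per-point gradients shrinks the variance of the minibatch gradient estimate at a fixed parameter value:

```python
import random
import statistics

random.seed(1)

# Toy 1-D linear regression (illustrative): y_i = s_i * theta + noise.
N, n = 1000, 10
theta_true = 2.0
s = [random.uniform(0.5, 1.5) for _ in range(N)]
y = [s[i] * theta_true + random.gauss(0.0, 1.0) for i in range(N)]

def g(i, theta):
    """Per-point gradient d/dtheta of log N(y_i; s_i * theta, 1)."""
    return s[i] * (y[i] - s[i] * theta)

theta, alpha = 2.0, 1.9                      # current state; point of the stored gradients
g_alpha = [g(i, alpha) for i in range(N)]    # stored per-point gradients
g_sum = sum(g_alpha)

def plain(batch):                            # standard SGLD-style estimate
    return (N / n) * sum(g(i, theta) for i in batch)

def corrected(batch):                        # control-variate (SAGA/SVRG-style) estimate
    return (N / n) * sum(g(i, theta) - g_alpha[i] for i in batch) + g_sum

batches = [[random.randrange(N) for _ in range(n)] for _ in range(2000)]
v_plain = statistics.pvariance([plain(b) for b in batches])
v_corr = statistics.pvariance([corrected(b) for b in batches])
```

Both estimators are unbiased, but the corrected one only has to estimate the (small) difference between current and stored gradients, so its variance collapses when the stored point is close to the current state.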
The first algorithm, SAGA-LD, is appropriate when our bottleneck is computation; it yields improved convergence with minimal additional computational cost over SGLD. The second algorithm, SVRG-LD, is appropriate when our bottleneck is memory; while its computational cost is generally higher than that of SAGA-LD, its memory overhead beyond that of SGLD scales only as O(d). In practice, we found that computation was the greater bottleneck in the examples considered, so our experimental section focuses on SAGA-LD; however, on larger datasets with easily computable gradients, SVRG-LD may be the optimal choice.

¹In particular, if h ∝ T^{−1/3}, we obtain the optimal convergence rate for the above upper bound.

Algorithm 1: SAGA-LD
1: Input: α_0^i = θ_0 ∈ R^d for i ∈ {1, . . . , N}, step sizes {h_t > 0}_{t=0}^{T−1}
2: g_α = Σ_{i=1}^N ∇ log p(x_i|α_0^i)
3: for t = 0 to T − 1 do
4:   Uniformly randomly pick a set I_t from {1, . . . , N} (with replacement) such that |I_t| = b
5:   Randomly draw η_t ∼ N(0, h_t)
6:   θ_{t+1} = θ_t + (h_t/2) (∇ log p(θ_t) + (N/b) Σ_{i∈I_t} (∇ log p(x_i|θ_t) − ∇ log p(x_i|α_t^i)) + g_α) + η_t
7:   α_{t+1}^i = θ_t for i ∈ I_t, and α_{t+1}^i = α_t^i for i ∉ I_t
8:   g_α = g_α + Σ_{i∈I_t} (∇ log p(x_i|α_{t+1}^i) − ∇ log p(x_i|α_t^i))
9: end for
10: Output: Iterates {θ_t}_{t=0}^{T−1}

3.1 SAGA-LD

The increased variance in SGLD is due to the fact that we only have information from n ≪ N data points at each iteration. 
However, inspired by a minibatch version of the SAGA algorithm [5], we can include information from the remaining data points via an approximate gradient, and partially update the average gradient in each iteration. We call this approach SAGA-LD.

Under SAGA-LD, we explicitly store N approximate gradients {g_{α_i}}_{i=1}^N, corresponding to the N data points. Concretely, let α_t = (α_t^i)_{i=1}^N be a set of vectors, initialized as α_0^i = θ_0 for all i ∈ [N], and initialize g_{α_i} = ∇ log p(x_i|α_0^i) and g_α = Σ_{i=1}^N g_{α_i}. As we iterate through the data, if a data point is not selected in the current minibatch, we approximate its gradient with g_{α_i}. If I_t = {i_{t1}, . . . , i_{tn}} is the minibatch selected at iteration t, this means we approximate the gradient as

    Σ_{i=1}^N ∇ log p(x_i|θ_t) ≈ (N/n) Σ_{i∈I_t} (∇ log p(x_i|θ_t) − g_{α_i}) + g_α.    (5)

When Equation 5 is used for MAP estimation, it corresponds to SAGA [5]. However, by injecting noise into the parameter update in the following manner,

    Δθ_t = (h_t/2) (∇ log p(θ_t) + (N/n) Σ_{i∈I_t} (∇ log p(x_i|θ_t) − g_{α_i}) + g_α) + η_t, where η_t ∼ N(0, h_t),    (6)

we can adapt it for sampling from the posterior. After updating θ_{t+1} = θ_t + Δθ_t, we let α_{t+1}^i = θ_t for i ∈ I_t. Note that we do not need to explicitly store the α_t^i; instead, we just update the corresponding gradients g_{α_i} and the overall approximate gradient g_α. The SAGA-LD algorithm is summarized in Algorithm 1.

The approximation in Equation 6 gives an unbiased estimate of the true gradient, since the minibatch I_t is sampled uniformly at random from [N] and the α_t^i are independent of I_t. 
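A minimal sketch of Algorithm 1 on a toy one-dimensional Gaussian model follows; all names and constants are illustrative, and the real algorithm applies to arbitrary differentiable models:

```python
import math
import random

random.seed(0)

# Toy model (illustrative): x_i ~ N(theta, 1), prior theta ~ N(0, 10).
N, b = 1000, 10
data = [2.0 + random.gauss(0.0, 1.0) for _ in range(N)]

def grad_i(x, theta):
    return x - theta                          # d/dtheta of log N(x; theta, 1)

def saga_ld(T=3000, h=1e-4):
    theta = 0.0
    # The alpha_i are implicit: we only keep the per-point gradients evaluated
    # at the last theta at which each point was visited, plus their running sum.
    g_store = [grad_i(x, theta) for x in data]
    g_sum = sum(g_store)                      # this is g_alpha in the paper's notation
    samples = []
    for t in range(T):
        I = [random.randrange(N) for _ in range(b)]   # minibatch, with replacement
        grad = (-theta / 10.0) + (N / b) * sum(
            grad_i(data[i], theta) - g_store[i] for i in I) + g_sum
        theta_old = theta
        theta += (h / 2.0) * grad + random.gauss(0.0, math.sqrt(h))
        for i in I:                           # refresh stored gradients for visited points
            new_g = grad_i(data[i], theta_old)
            g_sum += new_g - g_store[i]
            g_store[i] = new_g
        samples.append(theta)
    return samples, g_sum, g_store

samples, g_sum, g_store = saga_ld()
posterior_mean = sum(data) / (N + 0.1)        # exact conjugate posterior mean
est = sum(samples[1500:]) / 1500
```

Note that the running sum `g_sum` is updated incrementally, so no full pass over the data is needed after initialization.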
SAGA-LD offers two key properties: (i) as shown in Section 4, SAGA-LD has better convergence properties than SGLD; (ii) the computational overhead is minimal, since SAGA-LD does not require explicit calculation of the full gradient. Instead, it simply makes use of gradients that are already being calculated in the current minibatch. Combined, we end up with a computational complexity similar to that of SGLD, with a much better convergence rate.

The only downside of SAGA-LD, when compared with SGLD, is in terms of memory storage. Since we need to store N individual gradients g_{α_i}, we typically have a storage overhead of O(Nd) relative to SGLD. Fortunately, in many applications of interest to machine learning, the cost can be reduced to O(N) (please refer to [5] for more details), and in practice the cost of the higher memory requirements is typically outweighed by the improved convergence and low computational cost.

3.2 SVRG-LD

If the memory overhead of SAGA-LD is not acceptable, we can use a variant that reduces storage requirements at the cost of higher computational demands. The memory complexity of SAGA-LD is high because the approximate gradient g_α is updated at each step. This can be avoided by updating the approximate gradient only every m iterations in a single evaluation, and never storing the individual gradients g_{α_i}. Concretely, every m iterations we evaluate the gradient on the entire data set, obtaining g̃ = Σ_{i=1}^N g̃_i, where g̃_i = ∇ log p(x_i|θ̃) is the per-point gradient at the current snapshot θ̃. g̃ then serves as an approximate gradient until the next global evaluation. This yields an update of the form

    Δθ_t = (h_t/2) (∇ log p(θ_t) + (N/n) Σ_{i∈I_t} (∇ log p(x_i|θ_t) − g̃_i) + g̃) + η_t, where η_t ∼ N(0, h_t).    (7)

Without the added noise η_t, the update sequence in Equation 7 corresponds to the stochastic variance reduced gradient (SVRG) descent algorithm [8]. Pseudocode for this procedure is given in Algorithm 2.

Algorithm 2: SVRG-LD
1: Input: θ̃ = θ_0 ∈ R^d, epoch length m, step sizes {h_t > 0}_{t=0}^{T−1}
2: for t = 0 to T − 1 do
3:   if (t mod m = 0) then
4:     θ̃ = θ_t
5:     g̃ = Σ_{i=1}^N ∇ log p(x_i|θ̃)
6:   end if
7:   Uniformly randomly pick a set I_t from {1, . . . , N} (with replacement) such that |I_t| = n
8:   Randomly draw η_t ∼ N(0, h_t)
9:   θ_{t+1} = θ_t + (h_t/2) (∇ log p(θ_t) + (N/n) Σ_{i∈I_t} (∇ log p(x_i|θ_t) − ∇ log p(x_i|θ̃)) + g̃) + η_t
10: end for
11: Output: Iterates {θ_t}_{t=0}^{T−1}

While the memory requirements are lower than those of SAGA-LD, the computational cost is higher, due to the cost of the full update of g̃. Further, convergence may be negatively affected by the fact that, as we move further from θ̃, g̃ will be further from the true gradient. In practice, we found SAGA-LD to be the more effective algorithm on the datasets considered, so in the interest of space we relegate further details about SVRG-LD to the appendix.

4 Analysis

Our motivation in this paper was to improve the convergence of SGLD by reducing the variance of the gradient estimate. As we saw in Theorem 1, a high variance E[‖ΔV_t‖²], corresponding to noisy stochastic gradients, leads to a large bound on the MSE of a test function. We expand this analysis to show that the algorithms introduced in this paper yield a tighter bound.

Theorem 1 required a number of assumptions, given below in [A1]. 
Discussion of the reasonableness of these assumptions is provided in [3].

[A1] We assume the functional ψ that solves the Poisson equation Lψ = φ − φ̄ is bounded up to 3rd-order derivatives by some function Γ, i.e., ‖D^k ψ‖ ≤ C_k Γ^{p_k}, where D^k is the kth-order derivative (for k = 0, 1, 2, 3) and C_k, p_k > 0. We also assume that the expectation of Γ on {θ_t} is bounded (sup_t E[Γ^p(θ_t)] < ∞) and that Γ is smooth, in the sense that sup_{s∈(0,1)} Γ^p(sθ + (1 − s)θ′) ≤ C(Γ^p(θ) + Γ^p(θ′)) for all θ, θ′ and p ≤ max_k 2p_k, for some C > 0.

In our analysis of SAGA-LD and SVRG-LD, we make the assumptions in [A1] and add the following further assumptions about the smoothness of our gradients:

[A2] We assume that the functions log p(x_i|θ) are Lipschitz smooth with constant L for all i ∈ [N], i.e., ‖∇ log p(x_i|θ) − ∇ log p(x_i|θ′)‖ ≤ L‖θ − θ′‖ for all i ∈ [N] and θ, θ′ ∈ R^d. We assume that (ΔV_t ψ(θ))² ≤ C′ ‖∇U_t(θ) − ∇U(θ)‖² for some constant C′ > 0 and all θ ∈ R^d, where ψ is the solution to the Poisson equation for our test function. We also assume that ‖∇ log p(θ)‖ ≤ σ and ‖∇ log p(x_i|θ)‖ ≤ σ for some σ, for all i ∈ [N] and θ ∈ R^d.

The Lipschitz smoothness assumption is very common both in the optimization literature [11] and when working with Itô diffusions [3]. The bound on (ΔV_t ψ(θ))² holds when the gradient ‖∇ψ‖ is bounded.

Loosely, these assumptions encode the idea that the gradients do not change too quickly, so that we limit the errors introduced by incorporating gradients based on previous values of θ. 
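For concreteness, the SVRG-LD update of Algorithm 2 can be sketched on a toy one-dimensional Gaussian model (names and constants are illustrative, not from the paper):

```python
import math
import random

random.seed(0)

# Toy model (illustrative): x_i ~ N(theta, 1), prior theta ~ N(0, 10).
N, n, m = 1000, 10, 100                       # data size, minibatch size, epoch length
data = [2.0 + random.gauss(0.0, 1.0) for _ in range(N)]

def grad_i(x, theta):
    return x - theta                          # d/dtheta of log N(x; theta, 1)

def svrg_ld(T=3000, h=1e-4):
    theta, samples = 0.0, []
    for t in range(T):
        if t % m == 0:                        # refresh the snapshot and full gradient
            snap = theta
            g_snap = [grad_i(x, snap) for x in data]
            g_full = sum(g_snap)
        I = [random.randrange(N) for _ in range(n)]   # minibatch, with replacement
        grad = (-theta / 10.0) + (N / n) * sum(
            grad_i(data[i], theta) - g_snap[i] for i in I) + g_full
        theta += (h / 2.0) * grad + random.gauss(0.0, math.sqrt(h))
        samples.append(theta)
    return samples

samples = svrg_ld()
posterior_mean = sum(data) / (N + 0.1)        # exact conjugate posterior mean
est = sum(samples[1500:]) / 1500
```

Unlike the SAGA variant, only the snapshot and its full gradient are kept between refreshes, at the price of a full pass over the data every m iterations.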
With these assumptions, we state the following key results for SAGA-LD and SVRG-LD, which are proved in the supplement.

Theorem 2. Let h_t = h for all t ∈ {1, . . . , T}. Under assumptions [A1] and [A2], for a smooth test function φ, the MSE of SAGA-LD (Algorithm 1) at time K = hT is bounded, for some C > 0 independent of (T, h), in the following manner:

    E(φ̂ − φ̄)² ≤ C ( N² min{σ², (N²/n²)(L²h²σ² + hd)} / (nT) + 1/(Th) + h² ).    (8)

A similar result can be shown for SVRG-LD:

Theorem 3. Let h_t = h for all t ∈ {1, . . . , T}. Under assumptions [A1] and [A2], for a smooth test function φ, the MSE of SVRG-LD (Algorithm 2) at time K = hT is bounded, for some C > 0 independent of (T, h), in the following manner:

    E(φ̂ − φ̄)² ≤ C ( N² min{σ², m²(L²h²σ² + hd)} / (nT) + 1/(Th) + h² ).    (9)

The result in Theorem 3 is qualitatively equivalent to that in Theorem 2 when m = ⌈N/n⌉. In general, such a choice of m is preferable because, in this case, the overall cost of calculating the full gradient in Algorithm 2 becomes insignificant.

In order to assess the theoretical convergence of our proposed algorithms, we compare the bounds for SVRG-LD (Theorem 3) and SAGA-LD (Theorem 2) with those obtained for SGLD (Theorem 1). Under the assumptions in this section, it is easy to show that the term T1 in Theorem 1 becomes O(N²σ²/(Tn)). 
In contrast, both Theorems 2 and 3 show that, due to the reduction in variance, SVRG-LD and SAGA-LD exhibit a much weaker dependence. More specifically, this is manifested in the form of the following bound:

    N² min{σ², (N²/n²)(L²h²σ² + hd)} / (nT).

Note that this is tighter than the corresponding bound for SGLD. We also note that, similar to SGLD, SAGA-LD and SVRG-LD require h → 0 as T → ∞. In such a scenario, the convergence becomes significantly faster relative to SGLD as h → 0.

5 Experiments

We present our empirical results in this section. We focus on applying our stochastic gradient method to four different machine learning tasks, carried out on benchmark datasets: (i) Bayesian linear regression, (ii) Bayesian logistic regression, (iii) independent component analysis, and (iv) mixture modeling. We focus on SAGA-LD since, in the applications considered, the convergence and computational benefits of SAGA-LD outweigh the memory benefits of SVRG-LD.

In order to reduce the initial computational costs associated with calculating the initial average gradient, we use a variant of Algorithm 1 that calculates g_α (line 2 of Algorithm 1) in an online fashion and reweights the updates accordingly. Note that such a heuristic is also commonly used in implementations of SAG and SAGA in the context of optimization [14, 5].

In all our experiments, we use a decreasing step size for SGLD, as suggested by [15]. In particular, we use ε_t = a(b + t)^{−γ}, where the parameters a, b and γ are chosen for each dataset to give the best performance of the algorithm on that particular dataset. For SAGA-LD, due to the benefit of variance reduction, we use a simple two-phase constant step size selection strategy. 
In each of these phases, a constant step size is chosen such that SAGA-LD gives the best performance on the particular dataset. The minibatch size, n, in both SGLD and SAGA-LD is held at a constant value of 10 throughout our experiments. All algorithms are initialized to the same point, and the same sequence of minibatches is pre-generated and used in both algorithms.

5.1 Regression

We first demonstrate the performance of our algorithm on Bayesian regression. Formally, we are provided with inputs Z = {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^d and y_i ∈ R. The distribution of the ith output y_i is given by p(y_i|x_i) = N(β^T x_i, σ_e), where p(β) = N(0, λ^{−1}I). Due to conjugacy, the posterior distribution over β is also normal, and the gradients of the log-likelihood and the log-prior are given by ∇_β log P(y_i|x_i, β) = (y_i − β^T x_i)x_i and ∇_β log P(β) = −λβ, respectively.

[Figure 1: Performance comparison of SGLD and SAGA-LD on a regression task (average test MSE versus number of passes through the entire data, on the concrete, noise, parkinsons, toms and 3dRoad datasets). Additional experiments are provided in the appendix.]

[Figure 2: Comparison of the performance of SGLD and SAGA-LD for Bayesian logistic regression (average test log-likelihood versus number of effective passes through the dataset, on the pima, diabetic, eeg, space and susy datasets).]

We ran experiments on 11 standard UCI regression datasets, summarized in Table 1.² In each case, we set the prior precision λ = 1, and we partitioned our dataset into training (70%), validation (10%) and test (20%) sets. The validation set is used to select the step size parameters, and we report the mean square error (MSE) evaluated on the test set, using 5-fold cross-validation. The average test MSE on a subset of the datasets is reported in Figure 1. 
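The closed-form gradients of the regression model above can be sketched and sanity-checked against central finite differences (σ_e = 1 and λ = 1 here, purely for illustration; this is not the paper's code):

```python
import math

# Bayesian linear regression gradients from Section 5.1 (sigma_e = 1, lambda = 1).
lam = 1.0

def log_lik(x, y, beta):                  # log N(y; beta^T x, 1), up to a constant
    r = y - sum(b * xi for b, xi in zip(beta, x))
    return -0.5 * r * r

def grad_log_lik(x, y, beta):             # (y - beta^T x) x
    r = y - sum(b * xi for b, xi in zip(beta, x))
    return [r * xi for xi in x]

def grad_log_prior(beta):                 # -lambda * beta
    return [-lam * b for b in beta]

beta, x, y, eps = [0.3, -1.2], [1.0, 2.0], 0.7, 1e-6
g = grad_log_lik(x, y, beta)
for j in range(len(beta)):
    bp, bm = list(beta), list(beta)
    bp[j] += eps
    bm[j] -= eps
    fd = (log_lik(x, y, bp) - log_lik(x, y, bm)) / (2 * eps)
    assert abs(fd - g[j]) < 1e-5          # closed form matches finite differences
```

The finite-difference check makes the sign convention explicit: the gradient of the log-likelihood points toward higher likelihood.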
Due to space constraints, we relegate the remaining experimental results to the appendix. As shown in Figure 1, SAGA-LD converges much faster than SGLD (taking less than one pass through the whole dataset in many cases). This performance gain is consistent across all the datasets. Furthermore, step size selection was much simpler for SAGA-LD than for SGLD.

Table 1: Summary of datasets used for regression.

Datasets  concrete  noise  parkinson  bike   toms   protein  casp   kegg   3droad  music   twitter
N         1030      1503   5875       17379  45730  45730    53500  64608  434874  515345  583250
d         8         5      21         12     96     9        9      27     2       90      77

5.2 Classification

We next turn our attention to classification, using Bayesian logistic regression. In this case, the input is the set Z = {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^d and y_i ∈ {0, 1}. The distribution of the output y_i for a given sample x_i is given by P(y_i = 1) = φ(β^T x_i), where p(β) = N(0, λ^{−1}I) and φ(z) = 1/(1 + exp(−z)). Here, the gradients of the log-likelihood and the log-prior are given by ∇_β log P(y_i|x_i, β) = (y_i − φ(β^T x_i))x_i and ∇_β log P(β) = −λβ, respectively. Again, λ is set to 1 for all experiments, and the dataset split and parameter selection method are exactly the same as in our regression experiments. We run experiments on five binary classification datasets from the UCI repository, summarized in Table 2, and report the test set log-likelihood for each dataset, using 5-fold cross-validation. Figure 2 shows the performance of SGLD and SAGA-LD on the classification datasets. 
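The logistic-regression gradients above admit the same kind of finite-difference sanity check (illustrative values; this is not the paper's code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_lik(x, y, beta):                  # Bernoulli log-likelihood, y in {0, 1}
    p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
    return math.log(p) if y == 1 else math.log(1.0 - p)

def grad_log_lik(x, y, beta):             # (y - sigmoid(beta^T x)) x
    p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
    return [(y - p) * xi for xi in x]

beta, x, y, eps = [0.5, -0.25], [1.0, 3.0], 1, 1e-6
g = grad_log_lik(x, y, beta)
for j in range(len(beta)):
    bp, bm = list(beta), list(beta)
    bp[j] += eps
    bm[j] -= eps
    fd = (log_lik(x, y, bp) - log_lik(x, y, bm)) / (2 * eps)
    assert abs(fd - g[j]) < 1e-5          # closed form matches finite differences
```

The same gradient functions can be plugged directly into the SGLD or SAGA-LD update in place of the Gaussian toy gradients.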
As we saw with the regression task, SAGA-LD converges faster than SGLD on all the datasets, demonstrating the efficiency of our algorithm in this setting.

Table 2: Summary of the datasets used for classification.

Datasets  pima  diabetic  eeg    space  susy
N         768   1151      14980  58000  100000
d         8     20        15     9      18

²The datasets can be downloaded from https://archive.ics.uci.edu/ml/index.html

[Figure 3: The left plot shows the performance of SGLD and SAGA-LD for the ICA task (MEG dataset). The next two plots show the variance of SGLD and SAGA-LD for regression (concrete) and classification (pima). The rightmost two plots show the true and estimated posteriors obtained using SAGA-LD for the mixture modeling task.]

5.3 Bayesian Independent Component Analysis

To evaluate performance under a Bayesian independent component analysis (ICA) model, we assume our dataset x = {x_i}_{i=1}^N is distributed according to

    p(x|W) ∝ |det(W)| ∏_{i=1}^d p(y_i),  W_ij ∼ N(0, λ),    (10)

where W ∈ R^{d×d}, y_i = w_i^T x, and p(y_i) = 1/(4 cosh²(y_i/2)). The gradients of the log-likelihood and the log-prior are ∇_W log p(x_i|W) = (W^{−1})^T − Y_i x_i^T, where Y_ij = tanh(y_{ij}/2) for all j ∈ [d], and ∇_W log p(W) = −λW, respectively. 
All other parameters are set as before. We used a standard ICA dataset for our experiment³, comprising 17730 time-points with 122 channels, from which we extracted the first 10 channels. Further experimental details are similar to those for regression and classification. The performance (in terms of test set log-likelihood) of SGLD and SAGA-LD for the ICA task is shown in Figure 3. As seen in Figure 3, and similar to the regression and classification tasks, SAGA-LD outperforms SGLD on the ICA task.

5.4 Mixture Model
Finally, we evaluate how well SAGA-LD estimates the true posterior of the parameters of mixture models. We generated 20,000 data points from a mixture of two Gaussians, given by p(x|μ, σ₁, σ₂, γ) = ½ N(x; μ, σ₁²) + ½ N(x; −μ + γ, σ₂²), where μ = −5, γ = 20, and σ₁ = σ₂ = 5. We estimate the posterior distribution over μ, holding the other variables fixed. The two plots on the right of Figure 3 show that we are able to estimate the true posterior correctly.

Discussion: Our experiments provide a very compelling reason to use variance reduction techniques for SGLD, complementing the theoretical justification given in Section 4. The hypothesized variance reduction is demonstrated in Figure 3, where we compare the variances of SGLD and SAGA-LD with respect to the true gradient on the regression and classification tasks. As we see from all of the experimental results in this section, SAGA-LD converges with relatively few samples compared with SGLD. This is especially important in hierarchical Bayesian models where, typically, the size of the model is proportional to the number of observations. Thus, with SAGA-LD, we can achieve better performance with very few samples.
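The variance reduction driving these results comes from a SAGA-style gradient table. A minimal sketch of a SAGA-LD loop, following the SAGA estimator of Defazio et al. [5] (the function signature, defaults, and the generic per-term gradient callback grad_i are illustrative, not the paper's exact pseudocode):

```python
import numpy as np

def saga_ld(theta0, grad_i, N, step, n_iters, batch_size=32, rng=None):
    """Illustrative SAGA-LD sketch.

    grad_i(theta, i) returns the gradient of the i-th log-likelihood term.
    A table alpha stores the most recent gradient evaluated at each data
    point; the mini-batch estimate is corrected by these stored gradients,
    reducing its variance, and Langevin noise is injected as in SGLD.
    """
    rng = rng or np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    d = theta.size
    alpha = np.zeros((N, d))           # stored per-point gradients
    alpha_sum = alpha.sum(axis=0)      # running sum of the table
    samples = []
    for _ in range(n_iters):
        idx = rng.choice(N, size=batch_size, replace=False)
        g_new = np.stack([grad_i(theta, i) for i in idx])
        # Variance-reduced estimate of the full-data gradient:
        # table sum plus a rescaled correction on the current mini-batch.
        g_hat = alpha_sum + (N / batch_size) * (g_new - alpha[idx]).sum(axis=0)
        # Update the table and its running sum before assigning new entries.
        alpha_sum += (g_new - alpha[idx]).sum(axis=0)
        alpha[idx] = g_new
        # Langevin update with injected Gaussian noise.
        theta = theta + 0.5 * step * g_hat + np.sqrt(step) * rng.standard_normal(d)
        samples.append(theta.copy())
    return np.array(samples)
```

The memory cost of the table is what motivates the SVRG-LD alternative discussed next: SVRG-style variance reduction stores only a single anchor gradient instead of one entry per data point.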
Another advantage is that, while we require the step size to tend to zero, we can use a much simpler schedule than SGLD.

6 Discussion and Future Work
SAGA-LD is a new stochastic Langevin method that obtains improved convergence by reducing the variance in the stochastic gradient. An alternative method, SVRG-LD, can be used when memory is at a premium. For both SAGA-LD and SVRG-LD, we proved a tighter convergence bound than the one previously shown for stochastic gradient Langevin dynamics. We also showed, on a variety of machine learning tasks, that SAGA-LD converges to the true posterior faster than SGLD, suggesting the widespread use of SAGA-LD in place of SGLD.

We note that, unlike other stochastic Langevin methods, our sampler is non-Markovian. Since our convergence guarantees are based on bounding the error relative to the full Langevin diffusion, rather than on properties of a Markov chain, this does not impact the validity of our sampler.

While we showed the efficacy of applying our proposed variance reduction technique to SGLD, the strategy is generic and can also be applied to other gradient-based MCMC techniques such as [1, 2, 9, 6, 12]. We leave this as future work.

³The dataset can be downloaded from https://www.cis.hut.fi/projects/ica/eegmeg/MEG_data.html.

References
[1] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.
[2] Sungjin Ahn, Babak Shahbaba, and Max Welling. Distributed stochastic gradient MCMC. In ICML, 2014.
[3] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In NIPS, 2015.
[4] Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.
[5] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien.
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.
[6] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.
[7] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2011.
[8] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
[9] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In NIPS, 2015.
[10] Radford Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, 2010.
[11] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.
[12] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, 2013.
[13] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, Sep 1951.
[14] Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.
[15] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.