{"title": "Online sampling from log-concave distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 1228, "page_last": 1239, "abstract": "Given a sequence of convex functions $f_0, f_1, \\ldots, f_T$, we study the problem of sampling from the Gibbs distribution $\\pi_t \\propto e^{-\\sum_{k=0}^t f_k}$ for each epoch $t$ in an {\\em online} manner. Interest in this problem derives from applications in machine learning, Bayesian statistics, and optimization where, rather than obtaining all the observations at once, one constantly acquires new data, and must continuously update the distribution. Our main result is an algorithm that generates roughly independent samples from $\\pi_t$ for every epoch $t$ and, under mild assumptions, makes $\\mathrm{polylog}(T)$ gradient evaluations per epoch. All previous results imply a bound on the number of gradient or function evaluations which is at least linear in $T$. Motivated by real-world applications, we assume that functions are smooth, their associated distributions have a bounded second moment, and their minimizer drifts in a bounded manner, but do not assume they are strongly convex. In particular, our assumptions hold for online Bayesian logistic regression, when the data satisfy natural regularity properties, giving a sampling algorithm with updates that are poly-logarithmic in $T$. In simulations, our algorithm achieves accuracy comparable to an algorithm specialized to logistic regression. Key to our algorithm is a novel stochastic gradient Langevin dynamics Markov chain with a carefully designed variance reduction step and constant batch size. 
Technically, lack of strong convexity is a significant barrier to analysis and, here, our main contribution is a martingale exit time argument that shows our Markov chain remains in a ball of radius roughly poly-logarithmic in $T$ for enough time to reach within $\\epsilon$ of $\\pi_t$.", "full_text": "Online Sampling from Log-Concave Distributions

Holden Lee (Duke University), Oren Mangoubi (Worcester Polytechnic Institute), Nisheeth K. Vishnoi (Yale University)

Abstract

Given a sequence of convex functions $f_0, f_1, \ldots, f_T$, we study the problem of sampling from the Gibbs distribution $\pi_t \propto e^{-\sum_{k=0}^t f_k}$ for each epoch $t$ in an online manner. Interest in this problem derives from applications in machine learning, Bayesian statistics, and optimization where, rather than obtaining all the observations at once, one constantly acquires new data, and must continuously update the distribution. Our main result is an algorithm that generates roughly independent samples from $\pi_t$ for every epoch $t$ and, under mild assumptions, makes $\mathrm{polylog}(T)$ gradient evaluations per epoch. All previous results imply a bound on the number of gradient or function evaluations which is at least linear in $T$. Motivated by real-world applications, we assume that the functions are smooth, their associated distributions have a bounded second moment, and their minimizer drifts in a bounded manner, but we do not assume they are strongly convex. In particular, our assumptions hold for online Bayesian logistic regression when the data satisfy natural regularity properties, giving a sampling algorithm with updates that are poly-logarithmic in $T$. In simulations, our algorithm achieves accuracy comparable to an algorithm specialized to logistic regression. Key to our algorithm is a novel stochastic gradient Langevin dynamics Markov chain with a carefully designed variance reduction step and constant batch size. Technically, lack of strong convexity is a significant barrier to analysis and, here, our main contribution is a martingale exit time argument that shows our Markov chain remains in a ball of radius roughly poly-logarithmic in $T$ for enough time to reach within $\epsilon$ of $\pi_t$.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

In this paper, we study the following online sampling problem:

Problem 1.1. Consider a sequence of convex functions $f_0, f_1, \ldots, f_T : \mathbb{R}^d \to \mathbb{R}$ for some $T \in \mathbb{N}$, and let $\epsilon > 0$. At each epoch $t \in \{1, \ldots, T\}$, the function $f_t$ is given to us, so that we have oracle access to the gradients of the first $t+1$ functions $f_0, f_1, \ldots, f_t$. The goal at each epoch $t$ is to generate a sample from the distribution $\pi_t(x) \propto e^{-\sum_{k=0}^t f_k(x)}$ with fixed total-variation (TV) error $\epsilon$. The samples at different time steps should be almost independent.

Various versions of this problem have been considered in the literature, with applications in Bayesian statistics, optimization, and theoretical computer science; see [NR17, DDFMR00, ADH10] and references therein. If $f$ is convex, then a distribution $p \propto e^{-f}$ is logconcave; this captures a large class of useful distributions such as the Gaussian, exponential, Laplace, Dirichlet, gamma, beta, and chi-squared distributions. We give some settings where online sampling can be used:

• Online posterior sampling. In Bayesian statistics, the goal is to infer the probability distribution (the posterior) of a parameter, based on observations; however, rather than obtaining all the observations at once, one constantly acquires new data, and must continuously update the posterior distribution, rather than only after all the data is collected. Suppose $\theta \sim p_0 \propto e^{-f_0}$ for a given prior distribution, and samples $y_t$ drawn from the conditional distribution $p(\cdot \mid \theta, y_1, \ldots, y_{t-1})$ arrive in a streaming manner. By Bayes's rule, letting $p_t(\theta) := p(\theta \mid y_1, \ldots, y_t)$ be the posterior distribution, we have the following recursion: $p_t(\theta) \propto p_{t-1}(\theta)\, p(y_t \mid \theta, y_1, \ldots, y_{t-1})$. Hence, writing $e^{-f_t(\theta)} := p(y_t \mid \theta, y_1, \ldots, y_{t-1})$, we have $p_t(\theta) \propto e^{-\sum_{k=0}^t f_k(\theta)}$. The goal is to sample from $p_t(\theta)$ for each $t$. This fits the setting of Problem 1.1 if $p_0$ and all updates $p(y_t \mid \theta, y_1, \ldots, y_{t-1})$ are logconcave. One practical application is online logistic regression; logistic regression is a common model for binary classification. Another is inference for Gaussian processes, which are used in many Bayesian models because of their flexibility, and where stochastic gradient Langevin algorithms have been applied [FE15]. A third application is latent Dirichlet allocation (LDA), often used for document classification [BNJ03]. As new documents are published, it is desirable to update the distribution of topics without excessive re-computation.[1]

• Optimization. One online optimization method is to sample a point from the exponential of the (weighted) negative loss ([CBL06, HAK07], Lemma 10 in [NR17]). There are settings such as online logistic regression where the only known way to achieve optimal regret is a Bayesian sampling approach [FKL+18], with lower bounds known for the naive convex optimization approach [HKL14].

• Reinforcement learning (RL). Thompson sampling [RVRK+18, DFE18] solves RL problems by maximizing the expected reward at each period with respect to a sample from the Bayesian posterior for the environment parameters, reducing it to the online posterior sampling problem.

In all of these applications, because a sample is needed at every epoch $t$, it is desirable to have a fast online sampling algorithm.
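The posterior recursion in the first bullet can be made concrete in code. The following is a minimal illustrative sketch (our own; the class and function names, and the choice of a logistic likelihood as the running example, are not from the paper) that maintains the unnormalized log-posterior $\log p_t(\theta) = -\sum_{k=0}^t f_k(\theta)$ as observations stream in:

```python
import numpy as np

def log_prior(theta):
    # -f_0: log of a standard Gaussian prior, up to an additive constant
    return -0.5 * np.dot(theta, theta)

def make_log_likelihood(u, y):
    # -f_t for one logistic observation, y in {-1, +1}:
    # log sigma(y u^T theta) = -log(1 + exp(-y u^T theta))
    return lambda theta: -np.logaddexp(0.0, -y * np.dot(u, theta))

class OnlinePosterior:
    """Maintains log p_t(theta) = log p_0(theta) + sum_k log p(y_k | theta)."""
    def __init__(self):
        self.log_liks = []

    def observe(self, u, y):
        # Bayes's rule: multiply in (add, in log space) the new likelihood term
        self.log_liks.append(make_log_likelihood(u, y))

    def log_density(self, theta):
        # Unnormalized log-posterior after t observations
        return log_prior(theta) + sum(ll(theta) for ll in self.log_liks)
```

Note that evaluating the density after $t$ observations touches all $t$ terms; this is exactly the per-epoch cost that motivates the gradient-reuse techniques discussed in this paper.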
In particular, the ultimate goal is to design an algorithm for Problem 1.1 such that the number of gradient evaluations is almost constant at each epoch $t$, so that the computational requirements at each epoch do not increase over time. This is challenging because at epoch $t$, one has to incorporate information from all $t+1$ functions $f_0, \ldots, f_t$ in roughly $O(1)$ time.

Our main contribution is an algorithm for Problem 1.1 that computes $\tilde{O}_T(1)$ gradients per epoch, under mild assumptions on the functions.[2] All previous rigorous results (even with comparable assumptions) imply a bound on the number of gradient or function evaluations which is at least linear in $T$; see Table 1. Our assumptions are motivated by real-world considerations and hold in the setting of online Bayesian logistic regression when the data vectors satisfy natural regularity properties. In the offline setting, our result also implies the first algorithm to sample from a $d$-dimensional log-concave distribution $\propto e^{-\sum_{t=1}^T f_t}$ where the $f_t$'s are not assumed strongly convex and the total number of gradient evaluations is roughly $T\log(T) + \mathrm{poly}(d)$, instead of the $T \times \mathrm{poly}(d)$ implied by prior works (Table 1).

A natural approach to online sampling is to design a Markov chain with the right steady state distribution [NR17, DMM19, DCWY18, CFM+18]. The main difficulty is that running a step of a Markov chain that incorporates all previous functions takes time $\Omega(t)$ at epoch $t$; all previous algorithms with provable guarantees suffer from this. To overcome this, one must use stochasticity, for example by sampling a subset of the previous functions. However, this fails because of the large variance of the gradient. Our result relies on a stochastic gradient Langevin dynamics (SGLD) Markov chain with a carefully designed variance reduction step and fixed batch size.

We emphasize that we do not assume that the functions $f_t$ are strongly convex. This is important for applications such as logistic regression. Even if the negative log-prior $f_0$ is strongly convex, we cannot obtain the same bounds by using existing results on strongly convex $f$, because the bounds depend on the condition number of $\sum_{t=0}^T f_t$, which grows with $T$. Lack of strong convexity is a technical barrier to analyzing our Markov chain and, here, our main contribution is a martingale exit time argument that shows that our Markov chain is constrained to a ball of radius roughly $1/\sqrt{t}$ for time that is sufficient for it to reach within $\epsilon$ of $\pi_t$.

[1] Note that LDA requires sampling from non-logconcave distributions. Our algorithm can be used for non-logconcave distributions, but our theoretical guarantees are only for logconcave distributions.

[2] The subscript $T$ in $\tilde{O}_T$ means that we only show the dependence on the parameters $t, T$, and exclude dependence on non-$T,t$ parameters such as the dimension $d$, the sampling accuracy $\epsilon$, and the regularity parameters $C, D, L$, which we define in Section 2.1.

2 Our algorithm and results

2.1 Assumptions

Denote by $\mathcal{L}(Y)$ the distribution of a random variable $Y$. For any two probability measures $\mu, \nu$, denote the 2-Wasserstein distance by $W_2(\mu, \nu) := \inf_{(X,Y)\sim\Pi(\mu,\nu)} \sqrt{\mathbb{E}[\|X - Y\|^2]}$, where $\Pi(\mu,\nu)$ denotes the set of all possible couplings of random vectors $(\hat{X}, \hat{Y})$ with marginals $\hat{X} \sim \mu$ and $\hat{Y} \sim \nu$. For every $t \in \{0, \ldots, T\}$, define $F_t := \sum_{k=0}^t f_k$, and let $x^\star_t$ be a minimizer of $F_t(x)$ on $\mathbb{R}^d$. For any $x \in \mathbb{R}^d$, let $\delta_x$ be the Dirac delta distribution centered at $x$. We make the following assumptions:

Assumption 1 (Smoothness/Lipschitz gradient (with constants $L_0, L > 0$)). For all $1 \le t \le T$ and $x, y \in \mathbb{R}^d$, $\|\nabla f_t(y) - \nabla f_t(x)\| \le L\|x - y\|$. For $t = 0$, $\|\nabla f_0(y) - \nabla f_0(x)\| \le L_0\|x - y\|$.

We allow $f_0$ to satisfy our assumptions with a different parameter value, since in Bayesian applications $f_0$ models a "prior" which has different scaling from $f_1, f_2, \ldots, f_T$.

Assumption 2 (Bounded second moment with exponential concentration (with constants $A, k > 0$, $c \ge 0$)). For all $0 \le t \le T$ and all $s \ge 0$, $\mathbb{P}_{X\sim\pi_t}(\|X - x^\star_t\| \ge s/\sqrt{t+c}) \le A e^{-ks}$.

Note that Assumption 2 implies a bound on the second moment, $m_t^{1/2} := (\mathbb{E}_{x\sim\pi_t}\|x - x^\star_t\|_2^2)^{1/2} \le C/\sqrt{t+c}$ for $C := (2 + 1/k)\log(A/k^2)$. For conciseness, we write bounds in terms of this parameter $C$.[3]

Assumption 3 (Drift of mode (with constants $D \ge 0$, $c \ge 0$)). For all $0 \le t, \tau \le T$ such that $\tau \in [t, \max\{2t, 1\}]$, $\|x^\star_t - x^\star_\tau\| \le D/\sqrt{t+c}$.

Assumption 2 says that the "data is informative enough": the current distribution $\pi_t$ (the posterior) concentrates near the mode $x^\star_t$ as $t$ increases. The $1/\sqrt{t}$ decrease in the second moment is what one would expect based on central limit theorems such as the Bernstein-von Mises theorem. Assumption 2 is a weaker condition than strong convexity: if the $f_t$'s are $\alpha$-strongly convex, then $\pi_t(x) \propto e^{-\sum_{k=0}^t f_k(x)}$ concentrates to within $\sqrt{d}/\sqrt{\alpha(t+1)}$; however, many distributions satisfy Assumption 2 without being strongly log-concave. For instance, posterior distributions used in Bayesian logistic regression satisfy Assumption 2 under natural conditions on the data, but are not strongly log-concave with comparable parameters (Section 2.4). Hence, together Assumptions 1 and 2 are a weaker condition than strong convexity and gradient Lipschitzness, the typical assumptions under which the offline algorithm is analyzed. Similar to the typical assumptions, our assumptions avoid the "ill-conditioned" case where the distribution becomes more concentrated in one direction than another as the number of functions $t$ increases.

Assumption 3 is typically satisfied in the setting where the $f_t$'s are i.i.d. This is the case when we observe i.i.d. random variables and define functions $f_t$ based on them, as will be the case for our application to Bayesian logistic regression (Problem 2.2). To help with intuition, note that Assumption 3 is satisfied for the problem of Gaussian mean estimation: the mode is the same as the mean, and the assumption reduces to the fact that a random walk drifts on the order of $\sqrt{t}$, and hence the mean of the posterior drifts by $O_T(1/\sqrt{t})$, after $t$ time steps. We need this assumption because our algorithm uses cached gradients computed $\Theta_T(t)$ time steps ago, and in order for the past gradients to be close in value to the gradient at the current point, the points where the gradients were last calculated should be at distance $O_T(1/\sqrt{t})$ from the current point. We give a simple example where the assumptions hold (Appendix G of the supplement).

In Section 2.4 we show these assumptions hold for functions arising in online Bayesian logistic regression; unlike previous work on related techniques [NDH+17, CFM+18], our assumptions are weak enough to hold in such applications, as they do not require $f_0, \ldots, f_T$ to be strongly convex.

2.2 Algorithm for online sampling

At every epoch $t = 1, \ldots, T$, given gradient access to the functions $f_0, \ldots, f_t$, Algorithm 2 generates a point $X_t$ approximately distributed according to $\pi_t \propto e^{-\sum_{k=0}^t f_k(x)}$. It does so by running SAGA-LD (Algorithm 1), with a step size $\eta_t$ that decreases with the epoch, and a given number of steps $i_{\max}$. Our main theorem, Theorem 2.1, says that for each sample to have fixed TV error $\epsilon$, at each epoch the number of steps $i_{\max}$ only needs to be poly-logarithmic in $T$.

Algorithm 1 makes the following update at each step of the SGLD Markov chain $X_i$, for a certain choice of stochastic gradient $g_i$ with $\mathbb{E}[g_i] = \sum_{k=0}^t \nabla f_k(X_i)$:

$X_{i+1} = X_i - \eta_t g_i + \sqrt{2\eta_t}\,\xi_i, \qquad \xi_i \sim N(0, I_d).$    (1)

Key to our algorithm is the construction of the variance-reduced stochastic gradient $g_i$. It is constructed by taking the sum of the cached gradients at previous points in the chain and correcting it with a batch of constant size $b$. This variance reduction is only effective when the points where the cached gradients were computed stay within $\tilde{O}_T(1/\sqrt{t})$ of the current mode $x^\star_t$. Algorithm 2 ensures that this holds with high probability by resetting to the sample at the previous power of 2 if the sample has drifted too far.

The step size $\eta_t$ is determined by an input parameter $\eta_0 > 0$. We set $\eta_t = \eta_0/(t+c)$ for the following reason: Assumption 2 says that the variance of the target distribution $\pi_t$ decreases at the rate $C^2/(t+c)$, and we want to ensure that the variance of each step of Langevin dynamics decreases at roughly the same rate. With the step size $\eta_t = \eta_0/(t+c)$, the Markov chain can travel across a sub-level set containing most of the probability measure of $\pi_t$ in roughly the same number $i_{\max} = \tilde{O}_T(1)$ of steps at each epoch $t$. We will take the acceptance radius to be $C_0 = 2.5(C_1 + D)$, where $C_1$ is given by (65) in the supplement, and show that with good probability this choice of $C_0$ ensures $\|X_{t-1} - X_{t_0}\| \le 4(C_1+D)/\sqrt{t+c}$ in Algorithm 2.

[3] Having a bounded second moment suffices to obtain (weaker) polynomial bounds (by replacing the use of the concentration inequality with Chebyshev's inequality). We use this slightly stronger condition because exponential concentration improves the dependence on $\epsilon$, and is typically satisfied in practice.
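The update (1) with its variance-reduced gradient estimate can be sketched as follows. This is an illustrative rendering under our own naming conventions (a cache `grads[k]` of $G_k = \nabla f_k(u_k)$ and its running sum), not the authors' implementation:

```python
import numpy as np

def saga_ld_step(x, grads, grad_sum, grad_f0, grad_fk, eta, b, rng):
    """One variance-reduced SGLD step in the style of Equation (1).

    grads: dict mapping k in {1..t} to the cached gradient G_k = grad f_k(u_k)
    grad_sum: sum of all cached gradients
    grad_fk(k, x): evaluates grad f_k at x; grad_f0(x): evaluates grad f_0 at x
    """
    t = len(grads)
    # Sample a batch S of size b with replacement from {1, ..., t}
    S = rng.integers(1, t + 1, size=b)
    correction = np.zeros_like(x)
    for k in S:
        correction += grad_fk(k, x) - grads[k]
    # Unbiased estimate of sum_{k=0}^t grad f_k(x)
    g = grad_f0(x) + grad_sum + (t / b) * correction
    # Langevin step with Gaussian noise
    x_next = x - eta * g + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
    # Refresh the cache (each distinct k in S once)
    for k in set(S.tolist()):
        g_new = grad_fk(k, x)
        grad_sum = grad_sum + (g_new - grads[k])
        grads[k] = g_new
    return x_next, grads, grad_sum
```

The invariant maintained is that `grad_sum` always equals the sum of the cached gradients, so the correction term is the only part that must be recomputed each step.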
Note that in practice, one need not know the values of the regularity constants in Assumptions 1-3, but can instead use heuristics to "tune" the Markov chain's parameters.

Algorithm 1 SAGA-LD
Input: oracles for $\nabla f_k$ for $k \in \{0, \ldots, t\}$, step size $\eta > 0$, batch size $b \in \mathbb{N}$, number of steps $i_{\max}$, initial point $X_0$, cached gradients $G_k = \nabla f_k(u_k)$ for some points $u_k$, and $s = \sum_{k=1}^t G_k$.
Output: $X_{i_{\max}}$
1: for $i$ from 0 to $i_{\max} - 1$ do
2:   (Sample batch) Sample with replacement a (multi)set $S$ of size $b$ from $\{1, \ldots, t\}$.
3:   (Calculate gradients) For each $k \in S$, let $G_k^{\mathrm{new}} = \nabla f_k(X_i)$.
4:   (Variance-reduced gradient estimate) Let $g_i = \nabla f_0(X_i) + s + \frac{t}{b}\sum_{k\in S}(G_k^{\mathrm{new}} - G_k)$.
5:   (Langevin step) Let $X_{i+1} = X_i - \eta g_i + \sqrt{2\eta}\,\xi_i$ where $\xi_i \sim N(0, I)$.
6:   (Update sum) Update $s \leftarrow s + \sum_{k\in\mathrm{set}(S)}(G_k^{\mathrm{new}} - G_k)$.
7:   (Update gradients) For each $k \in S$, update $G_k \leftarrow G_k^{\mathrm{new}}$.
8: end for

2.3 Result in the online setting

In this section we give our main result for the online sampling problem; for additional results on the offline sampling problem, see Appendix A in the supplement.

Theorem 2.1 (Online variance-reduced SGLD). Suppose that $f_0, \ldots, f_T : \mathbb{R}^d \to \mathbb{R}$ are (weakly) convex and satisfy Assumptions 1-3 with $c = L_0/L$. Let $C = (2 + 1/k)\log(A/k^2)$. Then there exist parameters $b = 9$, $\eta_0 = \tilde{\Theta}\left(\frac{\epsilon^4}{L^2\log^6(T)(C+D)^2 d}\right)$, and $i_{\max} = \tilde{O}\left(\frac{(C+D)^2\log^2(T)}{\eta_0\epsilon^2}\right)$, such that at each epoch $t$, Algorithm 2 generates an $\epsilon$-approximate independent sample $X_t$ from $\pi_t$.[4] The total number of gradient evaluations $i_{\max}$ required at each epoch $t$ is polynomial in $d, L, C, D, \epsilon^{-1}$ and $\log(T)$.

Here, $\tilde{\Theta}$ and $\tilde{O}$ hide polylogarithmic factors in $d, L, C, D, \epsilon^{-1}$ and $\log(T)$. Note that the dependence of $i_{\max}$ on $\epsilon$ is $i_{\max} = \tilde{O}_\epsilon(\epsilon^{-6})$. See Section B.4 in the supplement for the proof of Theorem 2.1. Note that the algorithm needs to know the parameters, but bounds are enough.

Previous results all imply a bound on the number of gradient or function evaluations[5] at each epoch which is at least linear in $T$. Our result is the first to obtain bounds on the number of gradient evaluations which are poly-logarithmic, rather than linear, in $T$ at each epoch. We are able to do better by exploiting the sum structure of $\sum_{k=0}^t f_k$ and the fact that the $\pi_t$ evolve slowly. See Section 4 for a detailed comparison.

[4] See Definition B.1 in the supplement for the formal definition. Necessarily, $\|\mathcal{L}(X_t) - \pi_t\|_{TV} \le \epsilon$.

[5] In our setting a gradient can be computed in at worst $2d$ function evaluations. In many applications (including logistic regression) gradient evaluation takes the same number of operations as function evaluation.

Algorithm 2 Online SAGA-LD
Input: $T \in \mathbb{N}$ and gradient oracles for functions $f_t : \mathbb{R}^d \to \mathbb{R}$ for all $t \in \{0, \ldots, T\}$, where only the gradient oracles $\nabla f_0, \ldots, \nabla f_t$ are available at epoch $t$; an initial point $X_0 \in \mathbb{R}^d$.
Input: step size $\eta_0 > 0$, batch size $b > 0$, $i_{\max} > 0$, constant offset $c$, acceptance radius $C_0$.
Output: At each epoch $t$, a sample $X_t$
1: Set $s = 0$.  ▷ Initial gradient sum
2: for epoch $t = 1$ to $T$ do
3:   Set $t_0 = 2^{\lfloor\log_2(t-1)\rfloor}$ if $t > 1$, and $t_0 = 0$ if $t = 1$.  ▷ The previous power of 2
4:   if $\|X_{t-1} - X_{t_0}\| \le C_0/\sqrt{t+c}$ then $X^t_0 \leftarrow X_{t-1}$  ▷ If the previous sample hasn't drifted too far, use it as a warm start
5:   else $X^t_0 \leftarrow X_{t_0}$  ▷ If the previous sample has drifted too far, reset to the sample at time $t_0$
6:   end if
7:   Set $G_t \leftarrow \nabla f_t(X^t_0)$.
8:   Set $s \leftarrow s + G_t$.
9:   For all gradients $G_k = \nabla f_k(u_k)$ which were last updated at time $t/2$, replace them by $\nabla f_k(X^t_0)$ and update $s$ accordingly.
10:  Draw $i_t$ uniformly from $\{1, \ldots, i_{\max}\}$.
11:  Run Algorithm 1 with step size $\eta_0/(t+c)$, batch size $b$, number of steps $i_t$, initial point $X^t_0$, and precomputed gradients $G_k$ with sum $s$. Keep track of when the gradients are updated.
12:  Return the output $X_t = X^t_{i_t}$ of Algorithm 1.
13: end for

2.4 Application to Bayesian logistic regression

Next, we show that Assumptions 1-3, and therefore Theorem 2.1, hold in the setting of online Bayesian logistic regression, when the data satisfy certain regularity properties. Logistic regression is a fundamental and widely used model in Bayesian statistics [AC93]. It has served as a model problem for methods in scalable Bayesian inference [WT11, HCB16, CB19, CB18], of which online sampling is one approach. Additionally, sampling from the logistic regression posterior is the key step in the optimal algorithm for online logistic regret minimization [FKL+18].

In Bayesian logistic regression, one models the data $(u_t \in \mathbb{R}^d, y_t \in \{-1, 1\})$ as follows: there is some unknown $\theta_0 \in \mathbb{R}^d$ such that given $u_t$ (the "independent variable"), for all $t \in \{1, \ldots, T\}$ the "dependent variable" $y_t$ follows a Bernoulli distribution with "success" probability $\sigma(u_t^\top\theta_0)$ ($y_t = 1$ with probability $\sigma(u_t^\top\theta_0)$ and $y_t = -1$ otherwise), where $\sigma(x) := 1/(1+e^{-x})$. The problem we consider is:

Problem 2.2 (Bayesian logistic regression). Suppose the $y_t$'s are generated from the $u_t$'s as Bernoulli random variables with "success" probability $\sigma(u_t^\top\theta_0)$. At every epoch $t \in \{1, \ldots, T\}$, after observing $(u_k, y_k)_{k=1}^t$, return a sample from the posterior distribution[6] $\hat\pi_t(\theta) \propto e^{-\sum_{k=0}^t \hat f_k(\theta)}$, where $\hat f_0(\theta) := \alpha\|\theta\|^2/2$ and $\hat f_k(\theta) := -\log[\sigma(y_k u_k^\top\theta)]$.

We show that Algorithm 2 succeeds for Bayesian logistic regression under reasonable conditions on the data-generating distribution, namely, that the inputs are bounded and we see data in all directions.[7]

Theorem 2.3 (Online Bayesian logistic regression). Suppose that for some $B, M, \gamma > 0$ we have $\|\theta_0\| \le B$, and that the $u_t \sim P_u$ are i.i.d., where $P_u$ is a distribution satisfying the following: for $u \sim P_u$, (1) $\|u\| \le M$ ("bounded"), and (2) $\mathbb{E}_u[u u^\top \mathbf{1}_{|u^\top\theta_0| \le 2}] \succeq \gamma I_d$ (the "restricted" covariance matrix is bounded away from 0). Then for the functions $\hat f_0, \ldots, \hat f_T$ in Problem 2.2 and any $\epsilon > 0$, there exist parameters $L, \log(A), k^{-1}, D = \mathrm{poly}(M, \gamma^{-1}, \alpha, B, d, \epsilon^{-1}, \log(T))$ such that Assumptions 1, 2, and 3 hold for all $t$ with probability at least $1 - \epsilon$. Therefore Algorithm 2 gives $\epsilon$-approximate samples from $\pi_t$ for $t \in [1, T]$ with $\mathrm{poly}(M, \gamma^{-1}, \alpha, B, d, \epsilon^{-1}, \log(T))$ gradient evaluations at each epoch.

[6] Here we use a Gaussian prior, but this can be replaced by any $e^{-f_0}$ where $f_0$ is strongly convex and smooth.

[7] For simplicity, we state the result (Theorem 2.3) in the case where the input variables $u$ are i.i.d., but note that the result holds more generally (see Lemma E.1 in the supplement for a more general statement of our result).

In Section 5 we show that in numerical simulations, our algorithm achieves accuracy competitive with that of an algorithm specialized to logistic regression, the Pólya-Gamma sampler, at the same runtime.
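For concreteness, the functions $\hat f_k$ of Problem 2.2 and their gradients are easy to write down; below is a short numpy sketch (function names are our own), using the standard identity $\nabla\hat f_k(\theta) = -y_k\,\sigma(-y_k u_k^\top\theta)\,u_k$:

```python
import numpy as np

def sigma(x):
    # Logistic function: sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def f_0(theta, alpha):
    # Gaussian prior term: alpha * ||theta||^2 / 2
    return 0.5 * alpha * np.dot(theta, theta)

def grad_f_0(theta, alpha):
    return alpha * theta

def f_k(theta, u, y):
    # -log sigma(y u^T theta) = log(1 + exp(-y u^T theta)), y in {-1, +1}
    return np.logaddexp(0.0, -y * np.dot(u, theta))

def grad_f_k(theta, u, y):
    # d/dtheta of the above: -y * sigma(-y u^T theta) * u
    return -y * sigma(-y * np.dot(u, theta)) * u
```

Note that each $\nabla\hat f_k$ is $\|u_k\|^2/4$-Lipschitz, which is why the bound $\|u\| \le M$ in Theorem 2.3 gives the smoothness constant in Assumption 1.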
However, the Pólya-Gamma sampler has two drawbacks: its running time at each epoch scales linearly in $t$ (our algorithm scales as $\mathrm{polylog}(t)$), and it is unknown whether the Pólya-Gamma sampler attains TV error $\epsilon$ in time polynomial in $\frac{1}{\epsilon}, t, d$, and the other problem parameters.

3 Proof overview for the online problem

For the online problem, information-theoretic constraints require us to use "information" from at least $\Omega(t)$ gradients to sample with fixed TV error at the $t$'th epoch (see Appendix H). Thus, to use only $\tilde{O}_T(1)$ gradients at each epoch, we must reuse gradient information from past epochs. We accomplish this by reusing gradients computed at points in the Markov chain, including points at past epochs. This saves a factor of $T$ over naive SGLD, but only if we can show that these past points in the chain track the distributions' mode, and that our chain stays close to the mode (Lemma B.2 in the supplement). The distribution is concentrated to $O_T(1/\sqrt{t})$ at the $t$th epoch (Assumption 2), and we need the Markov chain to stay within $\tilde{O}_T(1/\sqrt{t})$ of the mode. The bulk of the proof (Lemma B.3 in the supplement) is to show that with high probability (w.h.p.) the chain stays within this ball. Once we establish that the Markov chain stays close, we combine our bounds with existing results on SGLD from [DMM19] to show that we only need $\tilde{O}_T(1)$ steps per epoch (Lemma B.6). Finally, an induction with a careful choice of constants finishes the proof (Theorem 2.1). Details of each of these steps follow.

Bounding the variance of the stochastic gradient (see Lemma B.2). We reduce the variance of our stochastic gradient by using the gradient evaluated at a past point $u_k$ and estimating the difference in the gradients between our current point $X^t_i$ and the past point $u_k$. Using the $L$-Lipschitz property (Assumption 1) of the gradients, we show that the variance of this stochastic gradient is bounded by $\frac{t^2 L^2}{b}\max_k \|X^t_i - u_k\|^2$. To obtain this bound, observe that the individual components $\{\nabla f_k(X^t_i) - \nabla f_k(u_k)\}_{k\in S}$ of the stochastic gradient $g^t_i$ have variance at most $\sigma^2 = t^2 L^2 \max_k \|X^t_i - u_k\|^2$ by the Lipschitz property. Averaging over a batch saves a factor of $b$. For the number of gradient evaluations to stay nearly constant at each step, increasing the batch size is not a viable option to decrease our stochastic gradient's variance. Rather, showing that $\|X^t_i - u_k\| = \tilde{O}_T(1/\sqrt{t})$ implies that the variance of our stochastic gradient decreases at each epoch at the desired rate.

Bounding the escape time from a ball where the stochastic gradient has low variance (see Lemma B.3). Our main challenge is to bound the distance $\|X_i - u_k\|$. Because we do not assume strong convexity, we cannot use the proof techniques of past papers analyzing variance-reduced SGLD methods. [CFM+18, NDH+17] used strong convexity to show that w.h.p. the Markov chain does not travel too far from its initial point, implying a bound on the variance of their stochastic gradients. Unfortunately, many important applications, including logistic regression, lack strong convexity.

To deal with the lack of strong convexity, we instead use a martingale exit time argument to show that the Markov chain remains inside a ball of radius $r = \tilde{O}_T(1/\sqrt{t})$ w.h.p. for a time $i_{\max}$ large enough for the Markov chain to reach a point within TV distance $\epsilon$ of the target distribution. Towards this end, we would like to bound the distance $\|X^t_i - x^\star_t\|$ from the current state of the Markov chain to the mode by $\tilde{O}_T(1/\sqrt{t})$, and to bound $\|x^\star_t - u_k\|$ by $\tilde{O}_T(1/\sqrt{t})$. Together, this allows us to bound the distance $\|X^t_i - u_k\| = O_T(1/\sqrt{t})$. We can then use our bound $\|X^t_i - u_k\| = \tilde{O}_T(1/\sqrt{t})$ together with Lemma B.2 to bound the variance of the stochastic gradient by roughly $\tilde{O}_T(1/t)$.
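The variance effect described above can be checked numerically. The following toy experiment (our own, with quadratic $f_k$ so that $L = 1$, and made-up constants) compares the variance-reduced estimator against naive subsampling when the cached points $u_k$ are close to the current point:

```python
import numpy as np

rng = np.random.default_rng(1)
d, t, b = 2, 200, 10
mus = rng.normal(size=(t, d))            # f_k(x) = ||x - mu_k||^2 / 2
grad = lambda k, x: x - mus[k]           # so grad f_k is 1-Lipschitz (L = 1)
X = np.zeros(d)                          # current point of the chain
u = X + 0.01 * rng.normal(size=(t, d))   # cached points near X
s = sum(grad(k, u[k]) for k in range(t)) # sum of cached gradients

def vr_estimate():
    # Variance-reduced: s + (t/b) * sum over batch of gradient differences
    S = rng.integers(0, t, size=b)       # batch indices (0-based here)
    return s + (t / b) * sum(grad(k, X) - grad(k, u[k]) for k in S)

def naive_estimate():
    # Naive SGLD: rescale a subsampled batch of raw gradients
    S = rng.integers(0, t, size=b)
    return (t / b) * sum(grad(k, X) for k in S)

vr = np.array([vr_estimate() for _ in range(2000)])
nv = np.array([naive_estimate() for _ in range(2000)])
# Both estimators are unbiased for sum_k grad f_k(X), but the fluctuations of
# the variance-reduced one scale with max_k ||X - u_k||^2, which is tiny here.
```

With the cached points within $0.01$ of $X$, the variance-reduced estimator's total variance is smaller than the naive one's by several orders of magnitude, matching the $\frac{t^2L^2}{b}\max_k\|X - u_k\|^2$ bound.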
Bounding $\|x^\star_t - u_k\|$. Since $u_k$ is a point of the Markov chain, possibly at a previous epoch $\tau \le t$, roughly speaking we can bound this distance inductively by using bounds obtained at the previous epoch $\tau$ (Lemma B.6). Noting that $u_k = X^\tau_i$ for some $i \le i_{\max}$, we use our bound $\|u_k - x^\star_\tau\| = O_T(1/\sqrt{\tau}) = O_T(1/\sqrt{t})$ obtained at the previous epoch $\tau$, together with Assumption 3, which says that $\|x^\star_t - x^\star_\tau\| = O_T(1/\sqrt{t})$, to bound $\|x^\star_t - u_k\|$.

Bounding $\|X^t_i - x^\star_t\|$. To bound the distance $\rho_i := \|X^t_i - x^\star_t\|$ to the mode, we would like to bound the increase $\rho_{i+1} - \rho_i$ at each step $i$ of the Markov chain. Unfortunately, the expected increase in the distance $\|X^t_i - x^\star_t\|$ is much larger when the Markov chain is close to the mode than when it is far away from the mode, making it difficult to get a tight bound on the increase in the distance at each step. To get around this problem, we instead use a martingale exit time argument on $\|X^t_i - x^\star_t\|^2$, the squared distance from the current state of the Markov chain to the mode. The advantage of using the squared distance is that the expected increase in the squared distance due to the Gaussian noise term $\sqrt{2\eta_t}\,\xi_i$ in the Markov chain update rule (Equation (1)) is the same regardless of the position of the chain, allowing us to obtain tighter bounds on the increase regardless of the Markov chain's current position. We then use weak convexity to bound the component of the increase in $\|X^t_i - x^\star_t\|^2$ that is due to the gradient term $\eta_t g_i$, and apply Azuma's martingale concentration inequality to bound the exit time from the ball, showing the chain remains at distance roughly $\tilde{O}_T(1/\sqrt{t})$ from the mode.

Bounding the TV error (Lemma B.6). We now show that if $u_k$ is close to $x^\star_\tau$, then $X_t$ will be a good sample from $\pi_t$. More precisely, we show that if at epoch $t$ the Markov chain starts at $X^t_0$ such that $\|X^t_0 - x^\star_\tau\| \le R/\sqrt{t+c}$ ($R$ to be chosen later), then $\|\mathcal{L}(X^t_{i_{\max}}) - \pi_t\|_{TV} \le O(\epsilon/\log^2(T))$. To do this, we use two bounds: a bound on the Wasserstein distance between the initial point $X^t_0$ and the target density $\pi_t$, and a bound on the variance of the stochastic gradient. We then plug these bounds into Corollary 18 of [DMM19] (reproduced as Theorem B.4 in the supplementary material) to show that $i_{\max} = \tilde{O}_{\epsilon,T}(\mathrm{poly}(1/\epsilon))$ steps per epoch are sufficient to obtain a bound of $\epsilon$ on the TV error.

Bounding the number of gradient evaluations at each epoch. Working out the constants, we see that $i_{\max} = \mathrm{poly}(d, L, C, D, \epsilon^{-1}, \log(T))$ suffices to obtain TV error $\epsilon$ at each epoch. A constant batch size suffices, so the total number of evaluations is $O(i_{\max} b) = \mathrm{poly}(d, L, C, D, \epsilon^{-1}, \log(T))$.

4 Related work

Online convex optimization. Our motivation for studying the online sampling problem comes partly from the successes of online (convex) optimization [Haz16]. In online convex optimization, one chooses a point $x_t \in K$ at each step and suffers a loss $f_t(x_t)$, where $K$ is a compact convex set and $f_t : K \to \mathbb{R}$ is a convex function [Zin03]. The aim is to minimize the regret compared to the best point in hindsight, $\mathrm{Regret}_T = \sum_{t=1}^T f_t(x_t) - \min_{x^*}\sum_{t=1}^T f_t(x^*)$. The same offline convex optimization algorithms, such as gradient descent and Newton's method, can be adapted to the online setting [Zin03, HAK07].

Online sampling. To the best of our knowledge, all previous algorithms with provable guarantees in our setting require computation time that grows polynomially with $t$.
This is because any Markov chain taking all previous data into account needs $\Omega_T(t)$ gradient (or function) evaluations per step. On the other hand, there are many streaming algorithms used in practice which lack provable guarantees, or which rely on properties of the data (such as compressibility [HCB16, CB19]).

The most relevant theoretical work in our direction is [NR17]. The authors consider a changing log-concave distribution on a convex body, and show that under certain conditions, they can use the previous sample as a warm start and take only a constant number of steps of their Dikin walk chain at each stage. They consider the online sampling problem in the more general setting where the distribution is restricted to a convex body. However, [NR17] do not achieve optimal results in our setting, since they do not separately consider the case where $F_t = \sum_{k=0}^t f_k$ has a sum structure, and therefore require $\Omega(t)$ function evaluations at epoch $t$. Moreover, they do not consider how concentration properties of the distribution translate into more efficient sampling. When the $f_t$ are linear, they need $O_T(1)$ steps and $O_T(t)$ evaluations per epoch. However, in the general convex setting with smooth $f_t$'s, they need $O_T(t)$ steps per epoch and $O_T(t^2)$ evaluations per epoch.

There are many other online sampling algorithms, and other approaches to estimating changing distributions, used in practice. The Laplace approximation, perhaps the simplest, approximates the posterior distribution with a Gaussian [BDT16]; however, most distributions cannot be well-approximated by Gaussians. Stochastic gradient Langevin dynamics [WT11] can be used in an online setting; however, it suffers from large variance, which we address in this work. The particle filter [DMHW+12, GDM+17] is a general algorithm to track changing distributions. Another approach (besides sampling) is variational inference, which has also been considered in an online setting [WPB11, BBW+13].

Algorithm                        | Oracle calls per epoch | Other assumptions
Online Dikin walk [NR17, §5.1]   | $O_T(T)$               | Strong convexity; bounded ratio of densities
Langevin [DMM19, DCWY18]         | $O_T(T)$               | —
SGLD [DMM19]                     | $O_T(T)$               | —
SAGA-LD [CFM+18]                 | $O_T(T)$               | Strong convexity; Lipschitz Hessian
CV-ULD [CFM+18]                  | $O_T(T)$               | Strong convexity
This work                        | $\mathrm{polylog}(T)$  | Bounded second moment; bounded drift of minimizer

Table 1: Bounds on the number of gradient (or function) evaluations required by different algorithms to solve the online sampling problem. A Lipschitz gradient is assumed for all algorithms. [NR17] analyzed the online Dikin walk for a different setting where the target has compact support; here we give the result one should obtain for support $\mathbb{R}^d$, where it reduces to the ball walk. Thus it is possible that the assumptions we give for the online Dikin walk can be weakened. Note that the number of gradient or function evaluations for the basic Langevin and SGLD algorithms and the online Dikin walk depends multiplicatively on $T$ (i.e., $T \times \mathrm{poly}(d, L, \text{other parameters})$), while the number of evaluations for the variance-reduced SGLD methods depends only additively on $T$ (i.e., $T + \mathrm{poly}(d, L, \text{other parameters})$).

Variance reduction techniques. Variance reduction techniques for SGLD were initially proposed in [DRW+16], for sampling from a fixed distribution $\pi \propto e^{-\sum_{t=0}^T f_t}$. [DRW+16] propose two variance-reduced SGLD techniques, CV-ULD and SAGA-LD. CV-ULD re-computes the full gradient $\nabla F$ at an "anchor" point every $r$ steps and updates the gradient at intermediate steps by subsampling the difference in the gradients between the current point and the anchor point. SAGA-LD, on the other hand, keeps track of when each gradient $\nabla f_t$ was computed, and updates individual gradients with respect to when they were last computed.
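As a rough illustration of this SAGA-style bookkeeping (a simplified sketch of the general idea, not the exact estimator of [DRW+16] or of our algorithm), the following maintains the last-computed gradient of each component $f_k$ together with their running sum, so that each estimate of $\nabla F$ costs only a small batch of fresh gradient evaluations:

```python
import numpy as np

def make_saga_grad(grads, x0):
    """SAGA-style variance-reduced estimator of the gradient of
    F(x) = sum_k f_k(x).

    A table stores the gradient of each component f_k at the point where it
    was last evaluated, together with the running sum of the table, so an
    estimate costs only len(batch) fresh gradient evaluations."""
    n = len(grads)
    table = [g(x0) for g in grads]       # last-computed gradient of each f_k
    table_sum = np.sum(table, axis=0)    # kept in sync incrementally

    def estimate(x, batch):
        nonlocal table_sum
        est = table_sum.copy()           # stale full gradient
        for k in batch:                  # assume distinct indices in batch
            fresh = grads[k](x)
            # unbiased correction: reweight the freshly observed differences
            est += (n / len(batch)) * (fresh - table[k])
            table_sum += fresh - table[k]
            table[k] = fresh
        return est

    return estimate
```

When the batch covers every index, the estimate coincides with the exact gradient; for a small uniformly random batch it is unbiased, and its variance shrinks as the stored gradients are refreshed.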
[CFM+18] show that CV-ULD can sample in the offline setting in roughly $T + \frac{d^2}{\varepsilon}(L/m)^6$ gradient evaluations, and that SAGA-LD can sample in $T + \frac{T\sqrt{d}}{\varepsilon}(L/m)^{3/2}(1 + L_H)$ evaluations, where $L_H$ is the Lipschitz constant of the Hessian of $-\log(\pi)$.^8

^8 The bounds of [CFM+18] are given for sampling within a specified Wasserstein error, not TV error. The bounds we give here are the number of gradient evaluations one would need to sample with Wasserstein error $\tilde{\varepsilon}$, which roughly corresponds to TV error $\varepsilon$; roughly, one requires $\tilde{\varepsilon} = O(\varepsilon/\sqrt{T})$ to sample with TV error $\varepsilon$.

5 Simulations

We test our algorithm against other sampling algorithms on a synthetic dataset for logistic regression. The dataset consists of $T = 1000$ data points in dimension $d = 20$. We compare the marginal accuracies of the algorithms.

The data are generated as follows. First, $\theta \sim N(0, I_d)$ and $b \sim N(0, 1)$ are randomly generated. For each $1 \le t \le T$, a feature vector $x_t \in \mathbb{R}^d$ and output $y_t \in \{0, 1\}$ are generated by
$x_{t,i} \sim \mathrm{Bernoulli}(s/d)$, for $1 \le i \le d$,  (2)
$y_t \sim \mathrm{Bernoulli}(\sigma(\theta^\top x_t + b))$,  (3)
where the sparsity is $s = 5$ in our simulations, and $\sigma(x) = \frac{1}{1+e^{-x}}$ is the logistic function. We chose $x_t \in \{0, 1\}^d$ because in applications, features are often indicators.

The algorithms are tested in an online setting as follows. At epoch $t$, each algorithm has access to $x_{s,i}, y_s$ for $s \le t$, and attempts to generate a sample from the posterior distribution $p_t(\theta) \propto e^{-\|\theta\|^2/2} e^{-b^2/2} \prod_{s=1}^t \sigma(\theta^\top x_s + b)$; the time is limited to 0.1 seconds per epoch. We estimate the quality of the samples at $t = T = 1000$ by saving the state of the algorithm at $t = T - 1$ and re-running it 1000 times to collect 1000 samples. We replicate this entire simulation 8 times, and the marginal accuracies of the runs are given in Figure 1.

The marginal accuracy (MA) is a heuristic to compare the accuracy of samplers (see e.g. [DMS17], [FOW11], and [CR+17]). The marginal accuracy between the measure $\mu$ of a sample and the target $\pi$ is $MA(\mu, \pi) := 1 - \frac{1}{2d}\sum_{i=1}^d \|\mu_i - \pi_i\|_{\mathrm{TV}}$, where $\mu_i$ and $\pi_i$ are the marginal distributions of $\mu$ and $\pi$ for the coordinate $x_i$. Since MALA is known to sample from the correct stationary distribution for the class of distributions analyzed in this paper, we let $\pi$ be the estimate of the true distribution obtained from 1000 samples generated by running MALA for a long time (1000 steps). We estimate the TV distance by the TV distance between the histograms when the bin widths are 0.25 times the sample standard deviation of the corresponding coordinate of $\pi$.

Algorithm                        Mean marginal accuracy
SGLD                             0.442
Online Laplace                   0.571
MALA                             0.901
Pólya-Gamma                      0.921
Online SAGA-LD (our algorithm)   0.921
Full Laplace                     0.924

Figure 1: Marginal accuracies of 5 different sampling algorithms on online logistic regression, with $T = 1000$ data points, dimension $d = 20$, and time 0.1 seconds per epoch, averaged over 8 runs. SGLD and online Laplace perform much worse and are not pictured.

We compare our online SAGA-LD algorithm with SGLD, full and online Laplace approximation, Pólya-Gamma, and MALA. The Laplace method approximates the target distribution with a multivariate Gaussian distribution. Here, one first finds the mode of the target distribution using a deterministic optimization technique and then computes the Hessian $\nabla^2 F_t$ of the negative log-posterior at the mode. The inverse of this Hessian is the covariance matrix of the Gaussian.
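A minimal sketch of this full Laplace step for a Bayesian logistic regression posterior with a $N(0, I)$ prior (illustrative only; the choice of plain gradient descent, the step count, and the step size are our own hypothetical choices, not those used in the experiments, and the intercept is absorbed into $\theta$ for simplicity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_logistic(X, y, steps=2000, lr=0.05):
    """Full Laplace approximation for Bayesian logistic regression with a
    N(0, I) prior: find the posterior mode by gradient descent on the
    negative log-posterior, then return the mode together with the inverse
    Hessian at the mode as the covariance of the approximating Gaussian."""
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ theta)
        theta -= lr * (theta + X.T @ (p - y))   # gradient of -log posterior
    p = sigmoid(X @ theta)
    # Hessian of -log posterior: prior term I plus sum_i p_i(1-p_i) x_i x_i^T
    H = np.eye(d) + X.T @ (X * (p * (1.0 - p))[:, None])
    return theta, np.linalg.inv(H)
```

The returned pair (mode, covariance) defines the approximating Gaussian; any other optimizer for the mode would do.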
In the online version of the algorithm, given in [CL11], to speed up optimization, only a quadratic approximation (with diagonal Hessian) to the log-posterior is maintained. The Pólya-Gamma chain [DFE18] is a Markov chain specialized to sample from the posterior for logistic regression. Note that, in contrast, our algorithm works more generally for any smooth probability distribution over $\mathbb{R}^d$.

Our results show that our online SAGA-LD algorithm is competitive with the best samplers for logistic regression, namely the Pólya-Gamma Markov chain and the full Laplace approximation. We note that the full Laplace approximation requires optimizing a sum of $t$ functions, which has runtime that scales linearly with $t$ at each epoch, while our method scales only as $\mathrm{polylog}(t)$.

The parameters are as follows. The step size at epoch $t$ is $\frac{0.1}{1+0.5t}$ for MALA, $\frac{0.01}{1+0.5t}$ for SGLD, and $\frac{0.05}{1+0.5t}$ for online SAGA-LD. A smaller step size must be used with SGLD because of the increased variance. For MALA, a larger step size can be used because the Metropolis-Hastings acceptance step ensures the stationary distribution is correct. The batch size for SGLD and online SAGA-LD is 64. The step sizes $\eta_0$ were chosen by hand from testing various values in the range from 0.001 to 1.0. We found the reset step of our online SAGA-LD algorithm, and the random number of steps, to be unnecessary in practice, so the results are reported for our online SAGA-LD algorithm without these features. The experiments were run on Fujitsu CX2570 M2 servers with dual 14-core 2.4GHz Intel Xeon E5 2680 v4 processors with 384GB RAM, running the Springdale distribution of Linux.

6 Conclusion and future work

In this paper we obtain logarithmic-in-$T$ bounds at each epoch when sampling from a sequence of log-concave distributions $\pi_t \propto e^{-\sum_{k=0}^t f_k}$, improving on previous results which are linear in $T$ in the online setting.
Since we do not assume the $f_t$'s are strongly convex, we also obtain bounds which have an improved dependence on $T$ for a wider range of applications, including Bayesian logistic regression. While our assumption of Lipschitz gradients requires the target to have full support on $\mathbb{R}^d$, one can also consider extending our $\mathrm{polylog}(T)$ bounds to log-densities supported on a compact set. Restricting the distribution to have compact support can cause the target distribution's covariance matrix to become increasingly ill-conditioned as the number of functions $t$ increases. To overcome this, we could modify our algorithm by including an adaptive pre-conditioner which changes along with the target distribution.

Acknowledgments

This research was partially supported by NSF CCF-1908347 and SNSF 200021_182527 grants.

References

[AC93] James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993.

[ADH10] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

[AWBR09] Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

[BBW+13] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.

[BDT16] Rina Foygel Barber, Mathias Drton, and Kean Ming Tan. Laplace approximation in high-dimensional Bayesian regression. In Statistical Analysis for High-Dimensional Data, pages 15–36. Springer, 2016.

[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan.
Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[CB18] Trevor Campbell and Tamara Broderick. Bayesian coreset construction via greedy iterative geodesic ascent. In International Conference on Machine Learning, pages 697–705, 2018.

[CB19] Trevor Campbell and Tamara Broderick. Automated scalable Bayesian inference via Hilbert coresets. The Journal of Machine Learning Research, 20(1):551–588, 2019.

[CBL06] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

[CFM+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 764–773, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[CL11] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

[CR+17] Nicolas Chopin, James Ridgway, et al. Leave Pima Indians alone: Binary regression as a benchmark for Bayesian computation. Statistical Science, 32(1):64–87, 2017.

[DCWY18] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Proceedings of the 2018 Conference on Learning Theory, PMLR 75, 2018.

[DDFMR00] Arnaud Doucet, Nando De Freitas, Kevin Murphy, and Stuart Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 176–183. Morgan Kaufmann Publishers Inc., 2000.

[DFE18] Bianca Dumitrascu, Karen Feng, and Barbara E Engelhardt.
PG-TS: Improved Thompson sampling for logistic contextual bandits. In Advances in Neural Information Processing Systems, 2018.

[DMHW+12] Pierre Del Moral, Peng Hu, Liming Wu, et al. On the concentration properties of interacting particle processes. Foundations and Trends® in Machine Learning, 3(3–4):225–389, 2012.

[DMM19] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.

[DMS17] Alain Durmus, Eric Moulines, and Eero Saksman. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166, 2017.

[DRW+16] Kumar Avinava Dubey, Sashank J Reddi, Sinead A Williamson, Barnabas Poczos, Alexander J Smola, and Eric P Xing. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pages 1154–1162, 2016.

[DSCW18] Chris De Sa, Vincent Chen, and Wing Wong. Minibatch Gibbs sampling on large graphical models. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1165–1173, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[FE15] Maurizio Filippone and Raphael Engler. Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the unbiased linear system solver (ULISSE). In International Conference on Machine Learning, pages 1015–1024, 2015.

[FKL+18] Dylan J Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik Sridharan. Logistic regression: The importance of being improper. Proceedings of Machine Learning Research, 75:1–42, 2018.

[FOW11] Christel Faes, John T Ormerod, and Matt P Wand. Variational Bayesian inference for parametric and nonparametric regression with missing data.
Journal of the American Statistical Association, 106(495):959–971, 2011.

[GDM+17] François Giraud, Pierre Del Moral, et al. Nonasymptotic analysis of adaptive and annealed Feynman–Kac particle models. Bernoulli, 23(1):670–709, 2017.

[GLR18] Rong Ge, Holden Lee, and Andrej Risteski. Simulated tempering Langevin Monte Carlo II: An improved proof using soft Markov chain decomposition. arXiv preprint arXiv:1812.00793, 2018.

[HAK07] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2–3):169–192, 2007.

[Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016.

[HCB16] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.

[HKL14] Elad Hazan, Tomer Koren, and Kfir Y Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Conference on Learning Theory, pages 197–209, 2014.

[KM15] Vladimir Koltchinskii and Shahar Mendelson. Bounding the smallest singular value of a random matrix without concentration. International Mathematics Research Notices, 2015(23):12991–13008, 2015.

[Men14] Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25–39, 2014.

[NDH+17] Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian J Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.

[Nic12] Richard Nickl. Statistical theory. Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, 2012.

[NR17] Hariharan Narayanan and Alexander Rakhlin.
Efficient sampling from time-varying log-concave distributions. The Journal of Machine Learning Research, 18(1):4017–4045, 2017.

[RVRK+18] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.

[WPB11] Chong Wang, John Paisley, and David Blei. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 752–760, 2011.

[WT11] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

[Zin03] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.