{"title": "Finite-Time Analysis of Projected Langevin Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 1243, "page_last": 1251, "abstract": "We analyze the projected Langevin Monte Carlo (LMC) algorithm, a close cousin of projected Stochastic Gradient Descent (SGD). We show that LMC allows one to sample in polynomial time from a posterior distribution restricted to a convex body and with concave log-likelihood. This gives the first Markov chain to sample from a log-concave distribution with a first-order oracle, as the existing chains with provable guarantees (lattice walk, ball walk and hit-and-run) require a zeroth-order oracle. Our proof uses elementary concepts from stochastic calculus which could be useful more generally to understand SGD and its variants.", "full_text": "Finite-Time Analysis of Projected Langevin Monte Carlo\n\nSébastien Bubeck\nMicrosoft Research\nsebubeck@microsoft.com\n\nRonen Eldan\nWeizmann Institute\nroneneldan@gmail.com\n\nJoseph Lehec\nUniversité Paris-Dauphine\nlehec@ceremade.dauphine.fr\n\nAbstract\n\nWe analyze the projected Langevin Monte Carlo (LMC) algorithm, a close cousin of projected Stochastic Gradient Descent (SGD). We show that LMC allows one to sample in polynomial time from a posterior distribution restricted to a convex body and with concave log-likelihood. This gives the first Markov chain to sample from a log-concave distribution with a first-order oracle, as the existing chains with provable guarantees (lattice walk, ball walk and hit-and-run) require a zeroth-order oracle. 
Our proof uses elementary concepts from stochastic calculus which could be useful more generally to understand SGD and its variants.\n\n1 Introduction\n\nA fundamental primitive in Bayesian learning is the ability to sample from the posterior distribution. Similarly to the situation in optimization, convexity is a key property to obtain algorithms with provable guarantees for this task. Indeed several Markov Chain Monte Carlo methods have been analyzed for the case where the posterior distribution is supported on a convex set, and the negative log-likelihood is convex. This is usually referred to as the problem of sampling from a log-concave distribution. In this paper we propose and analyze a new Markov chain for this problem which could have several advantages over existing chains for machine learning applications. We describe our contribution formally in Section 1.1. Then in Section 1.2 we explain how this contribution relates to various lines of work in different fields such as theoretical computer science, statistics, stochastic approximation, and machine learning.\n\n1.1 Main result\n\nLet K ⊂ Rn be a convex set such that 0 ∈ K, K contains a Euclidean ball of radius r > 0 and is contained in a Euclidean ball of radius R. Denote PK the Euclidean projection on K (i.e., PK(x) = argmin_{y∈K} |x − y|, where | · | denotes the Euclidean norm in Rn), and ‖ · ‖K the gauge of K defined by\n\n‖x‖K = inf{t ≥ 0; x ∈ tK}, x ∈ Rn.\n\nLet f : K → R be an L-Lipschitz and β-smooth convex function, that is f is differentiable and satisfies, ∀x, y ∈ K, |∇f(x) − ∇f(y)| ≤ β|x − y| and |∇f(x)| ≤ L. We are interested in the problem of sampling from the probability measure µ on Rn whose density with respect to the Lebesgue measure is given by\n\ndµ/dx = (1/Z) exp(−f(x)) 1{x ∈ K}, where Z = ∫_K exp(−f(y)) dy.\n\nWe denote m = Eµ|X|, and M = E[‖θ‖K], where θ is uniform on the sphere S^{n−1} = {x ∈ Rn : |x| = 1}.\nIn this paper we study the following Markov chain, which depends on a parameter η > 0, and where ξ1, ξ2, . . . is an i.i.d. sequence of standard Gaussian random variables in Rn, and X^0 = 0,\n\nX^{k+1} = PK( X^k − (η/2) ∇f(X^k) + √η ξk ). (1)\n\nWe call the chain (1) projected Langevin Monte Carlo (LMC).\nRecall that the total variation distance between two measures µ, ν is defined as TV(µ, ν) = sup_A |µ(A) − ν(A)| where the supremum is over all measurable sets A. With a slight abuse of notation we sometimes write TV(X, ν) where X is a random variable distributed according to µ. The notation vn = Õ(un) (respectively Ω̃) means that there exist c ∈ R, C > 0 such that vn ≤ C un log^c(un) (respectively ≥).\nOur main result shows that for an appropriately chosen step-size and number of iterations, one has convergence in total variation distance of the iterates (X^k) to the target distribution µ.\n\nTheorem 1 Let ε > 0. 
One has TV(X^N, µ) ≤ ε provided that η = (1/N)(m/ε)^2 and\n\nN = Ω̃( (n + RL)^2 (M + L/r)^2 n m^6 max( (1/ε^16) ((n + RL)/ε)^6 , (1/ε^22) (βm(L + √R))^8 ) ).\n\nNote that by viewing β, L, r as numerical constants, using M ≤ 1/r, and assuming R ≤ n and m ≤ n^{3/4}, the bound reads\n\nN = Ω̃( n^9 m^6 / ε^22 ).\n\nObserve also that if f is constant, that is µ is the uniform measure on K, then L = 0, m ≤ √n, and one can show that M = Õ(1/√n), which yields the bound:\n\nN = Ω̃( (n/ε^2)^{11} ).\n\n1.2 Context and related works\n\nThere is a long line of works in theoretical computer science proving results similar to Theorem 1, starting with the breakthrough result of Dyer et al. [1991] who showed that the lattice walk mixes in Õ(n^23) steps. The current record for the mixing time is obtained by Lovász and Vempala [2007], who show a bound of Õ(n^4) for the hit-and-run walk. These chains (as well as other popular chains such as the ball walk or the Dikin walk, see e.g. Kannan and Narayanan [2012] and references therein) all require a zeroth-order oracle for the potential f, that is given x one can calculate the value f(x). On the other hand our proposed chain (1) works with a first-order oracle, that is given x one can calculate the value of ∇f(x). 
The difference between zeroth-order and first-order oracles has been extensively studied in the optimization literature (e.g., Nemirovski and Yudin [1983]), but it has been largely ignored in the literature on polynomial-time sampling algorithms.\nWe also note that hit-and-run and LMC are the only chains which are rapidly mixing from any starting point (see Lovász and Vempala [2006]), though they have this property for seemingly very different reasons. When initialized in a corner of the convex body, hit-and-run might take a long time to take a step, but once it moves it escapes very far (while a chain such as the ball walk would only take a small step). On the other hand LMC keeps moving at every step, even when initialized in a corner, thanks to the projection step in (1).\nOur main motivation to study the chain (1) stems from its connection with the ubiquitous stochastic gradient descent (SGD) algorithm. In general this algorithm takes the form x_{k+1} = PK(x_k − η∇f(x_k) + ε_k) where ε1, ε2, . . . is a centered i.i.d. sequence. Standard results in stochastic approximation, such as Robbins and Monro [1951], show that if the variance of the noise Var(ε1) is of smaller order than the step-size η then the iterates (x_k) converge to the minimum of f on K (for a step-size decreasing sufficiently fast as a function of the number of iterations). For the specific noise sequence that we study in (1), the variance is exactly equal to the step-size, which is why the chain deviates from its standard and well-understood behavior. 
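For concreteness, the update (1) can be sketched in a few lines of Python. This is a toy illustration only, not the experimental setup of Section 3: K is taken to be a Euclidean ball (so that PK has a closed form), the potential is f(x) = |x|^2/2, and all function names are ours.

```python
import numpy as np

def proj_ball(x, R=1.0):
    # Euclidean projection P_K onto the centered ball of radius R
    nrm = np.linalg.norm(x)
    return x if nrm <= R else (R / nrm) * x

def projected_lmc(grad_f, n, eta, n_steps, proj, rng):
    # Iterates X_{k+1} = P_K(X_k - (eta/2) grad_f(X_k) + sqrt(eta) xi_k), X_0 = 0
    x = np.zeros(n)
    samples = []
    for _ in range(n_steps):
        xi = rng.standard_normal(n)
        x = proj(x - 0.5 * eta * grad_f(x) + np.sqrt(eta) * xi)
        samples.append(x.copy())
    return np.array(samples)

rng = np.random.default_rng(0)
traj = projected_lmc(lambda x: x, n=5, eta=0.01, n_steps=2000,
                     proj=proj_ball, rng=rng)
# every iterate lies in K by construction of the projection
assert max(np.linalg.norm(v) for v in traj) <= 1.0 + 1e-9
```

Note that each coordinate of the injected noise √η ξk has variance exactly η, i.e. the variance equals the step-size, which is the regime discussed above.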
We also note that other regimes where SGD does not converge to the minimum of f have been studied in the optimization literature, such as the constant step-size case investigated in Pflug [1986] and Bach and Moulines [2013].\nThe chain (1) is also closely related to a line of works in Bayesian statistics on Langevin Monte Carlo algorithms, starting essentially with Tweedie and Roberts [1996]. The focus there is on the unconstrained case, that is K = Rn. In this simpler situation, a variant of Theorem 1 was proven in the recent paper Dalalyan [2014]. The latter result is the starting point of our work. A straightforward way to extend the analysis of Dalalyan to the constrained case is to run the unconstrained chain with an additional potential that diverges quickly as the distance from x to K increases. However it seems much more natural to study the chain (1) directly. Unfortunately the techniques used in Dalalyan [2014] cannot deal with the singularities in the diffusion process which are introduced by the projection. As we explain in Section 1.3, our main contribution is to develop the appropriate machinery to study (1).\nIn the machine learning literature it was recently observed that Langevin Monte Carlo algorithms are particularly well-suited for large-scale applications because of the close connection to SGD. For instance Welling and Teh [2011] suggest using mini-batches to compute approximate gradients instead of exact gradients in (1), and they call the resulting algorithm SGLD (Stochastic Gradient Langevin Dynamics). It is conceivable that the techniques developed in this paper could be used to analyze SGLD and its refinements introduced in Ahn et al. [2012]. We leave this as an open problem for future work. 
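The SGLD idea just mentioned replaces ∇f in (1) by an unbiased mini-batch estimate. The following is a minimal sketch of that idea, assuming a sum-structured potential f(x) = Σ_i |x − d_i|^2/2 over a toy dataset; the estimator and all names are ours, not the implementation of Welling and Teh [2011].

```python
import numpy as np

def minibatch_grad(x, data, batch_size, rng):
    # unbiased estimate of grad f(x) for f(x) = sum_i |x - d_i|^2 / 2:
    # rescaling by len(data)/batch_size keeps the estimate unbiased
    idx = rng.choice(len(data), size=batch_size, replace=False)
    return (len(data) / batch_size) * np.sum(x - data[idx], axis=0)

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 3))
x = np.zeros(3)
eta = 1e-3
for _ in range(500):
    # SGLD-style step: noisy gradient plus Gaussian injection
    g = minibatch_grad(x, data, batch_size=32, rng=rng)
    x = x - 0.5 * eta * g + np.sqrt(eta) * rng.standard_normal(3)
# the iterates hover around the mode of the target, here the data mean
assert np.linalg.norm(x - data.mean(axis=0)) < 1.0
```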
Another interesting direction for future work is to improve the polynomial dependency on the dimension and the inverse accuracy in Theorem 1 (our main goal here was to provide the simplest polynomial-time analysis).\n\n1.3 Contribution and paper organization\n\nAs we pointed out above, Dalalyan [2014] proves the equivalent of Theorem 1 in the unconstrained case. His elegant approach is based on viewing LMC as a discretization of the diffusion process dXt = dWt − (1/2)∇f(Xt) dt, where (Wt) is a Brownian motion. The analysis then proceeds in two steps, by deriving first the mixing time of the diffusion process, and then showing that the discretized process is ‘close’ to its continuous version. In Dalalyan [2014] the first step is particularly transparent as he assumes α-strong convexity for the potential f, which in turn directly gives a mixing time of order 1/α. The second step is also simple once one realizes that LMC (without projection) can be viewed as the diffusion process dX̄t = dWt − (1/2)∇f(X̄_{η⌊t/η⌋}) dt. Using Pinsker’s inequality and Girsanov’s formula it is then a short calculation to show that the total variation distance between X̄t and Xt is small.\nThe constrained case presents several challenges, arising from the reflection of the diffusion process on the boundary of K, and from the lack of curvature in the potential (indeed the constant potential case is particularly important for us as it corresponds to µ being the uniform distribution on K). Rather than a simple Brownian motion with drift, LMC with projection can be viewed as the discretization of reflected Brownian motion with drift, which is a process of the form\n\ndXt = dWt − (1/2)∇f(Xt) dt − νt L(dt), where Xt ∈ K, ∀t ≥ 0,\n\nL is a measure supported on {t ≥ 0 : Xt ∈ ∂K}, and νt is an outer normal unit vector of K at Xt. 
The term νt L(dt) is referred to as the Tanaka drift. Following Dalalyan [2014] the analysis is again decomposed into two steps. We study the mixing time of the continuous process via a simple coupling argument, which crucially uses the convexity of K and of the potential f. The main difficulty is in showing that the discretized process (X̄t) is close to the continuous version (Xt), as the Tanaka drift prevents us from a straightforward application of Girsanov’s formula. Our approach around this issue is to first use a geometric argument to prove that the two processes are close in Wasserstein distance, and then to show that in fact for a reflected Brownian motion with drift one can deduce a total variation bound from a Wasserstein bound.\nIn this extended abstract we focus on the special case where f is a constant function, that is µ is uniform on the convex body K. The generalization to an arbitrary smooth potential can be found in the supplementary material. The rest of the paper is organized as follows. Section 2 contains the main technical arguments. We first remind the reader of Tanaka’s construction (Tanaka [1979]) of reflected Brownian motion in Section 2.1. We present our geometric argument to bound the Wasserstein distance between (Xt) and (X̄t) in Section 2.2, and we use our coupling argument to bound the mixing time of (Xt) in Section 2.3. The derivation of a total variation bound from the Wasserstein bound is discussed in Section 2.4. Finally we conclude the paper in Section 3 with some preliminary experimental comparison between LMC and hit-and-run.\n\n2 The constant potential case\n\nIn this section we derive the main arguments to prove Theorem 1 when f is a constant function, that is ∇f = 0. 
For a point x ∈ ∂K we say that ν is an outer unit normal vector at x if |ν| = 1 and\n\n⟨x − x′, ν⟩ ≥ 0, ∀x′ ∈ K.\n\nFor x ∉ ∂K we say that 0 is an outer unit normal at x. We define the support function hK of K by\n\nhK(y) = sup{⟨x, y⟩; x ∈ K}, y ∈ Rn.\n\nNote that hK is also the gauge function of the polar body of K.\n\n2.1 The Skorokhod problem\n\nLet T ∈ R+ ∪ {+∞} and w : [0, T) → Rn be a piecewise continuous path with w(0) ∈ K. We say that x : [0, T) → Rn and ϕ : [0, T) → Rn solve the Skorokhod problem for w if one has\n\nx(t) ∈ K and x(t) = w(t) + ϕ(t), ∀t ∈ [0, T),\n\nand furthermore ϕ is of the form\n\nϕ(t) = −∫_0^t νs L(ds), ∀t ∈ [0, T),\n\nwhere νs is an outer unit normal at x(s), and L is a measure on [0, T] supported on the set {t ∈ [0, T) : x(t) ∈ ∂K}.\nThe path x is called the reflection of w at the boundary of K, and the measure L is called the local time of x at the boundary of K. Skorokhod showed the existence of such a pair (x, ϕ) in dimension 1 in Skorokhod [1961], and Tanaka extended this result to convex sets in higher dimensions in Tanaka [1979]. Furthermore Tanaka also showed that the solution is unique, and that if w is continuous then so are x and ϕ. In particular the reflected Brownian motion in K, denoted (Xt), is defined as the reflection of the standard Brownian motion (Wt) at the boundary of K (existence follows by continuity of Wt). 
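To make the gauge and the support function concrete, consider the cube K = [−1, 1]^n: its gauge is the sup-norm, and its support function is the ℓ1-norm, which is the gauge of the polar body (the cross-polytope). A quick numerical check of these standard facts (the function names are ours):

```python
import numpy as np

def gauge_cube(x):
    # ||x||_K = inf{t >= 0 : x in tK} for K = [-1,1]^n is max_i |x_i|
    return np.max(np.abs(x))

def support_cube(y):
    # h_K(y) = sup{<x, y> : x in K}, attained at the vertex x = sign(y)
    return np.sum(np.abs(y))

rng = np.random.default_rng(2)
y = rng.standard_normal(4)
# brute-force the supremum over the extreme points (vertices) of the cube
vertices = np.array(np.meshgrid(*([[-1.0, 1.0]] * 4))).reshape(4, -1).T
assert np.isclose(support_cube(y), max(float(v @ y) for v in vertices))
# the inequality <x, y> <= ||x||_K h_K(y) used in Section 2.2
x = rng.standard_normal(4)
assert x @ y <= gauge_cube(x) * support_cube(y) + 1e-12
```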
Observe that by Itô’s formula, for any smooth function g on Rn,\n\ng(Xt) − g(X0) = ∫_0^t ⟨∇g(Xs), dWs⟩ + (1/2) ∫_0^t ∆g(Xs) ds − ∫_0^t ⟨∇g(Xs), νs⟩ L(ds). (2)\n\nTo get a sense of what a solution typically looks like, let us work out the case where w is piecewise constant (this will also be useful to realize that LMC can be viewed as the solution to a Skorokhod problem). For a sequence g1, . . . , gN ∈ Rn, and for η > 0, we consider the path:\n\nw(t) = Σ_{k=1}^N gk 1{t ≥ kη}, t ∈ [0, (N + 1)η).\n\nDefine (xk)_{k=0,...,N} inductively by x0 = 0 and\n\nx_{k+1} = PK(x_k + g_k).\n\nIt is easy to verify that the solution to the Skorokhod problem for w is given by x(t) = x_{⌊t/η⌋} and ϕ(t) = −∫_0^t νs L(ds), where the measure L is defined by (denoting δs for a Dirac mass at s)\n\nL = Σ_{k=1}^N |x_k + g_k − PK(x_k + g_k)| δ_{kη}, and, for s = kη, νs = (x_k + g_k − PK(x_k + g_k)) / |x_k + g_k − PK(x_k + g_k)|.\n\n2.2 Discretization of reflected Brownian motion\n\nGiven the discussion above, it is clear that when f is a constant function, the chain (1) can be viewed as the reflection (X̄t) of the discretized Brownian motion W̄t := W_{η⌊t/η⌋} at the boundary of K (more precisely the value of X̄_{kη} coincides with the value of X^k as defined by (1)). It is rather clear that the discretized Brownian motion (W̄t) is “close” to the path (Wt), and we would like to carry this to the reflected paths (X̄t) and (Xt). 
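The piecewise constant case above can also be checked numerically: iterating the projected recursion and recording the projection residuals reproduces the decomposition x(t) = w(t) + ϕ(t) at the grid points. A small sketch with K a Euclidean ball (the choice of K and the names are ours):

```python
import numpy as np

def proj_ball(z, R=1.0):
    # Euclidean projection onto the centered ball of radius R
    nrm = np.linalg.norm(z)
    return z if nrm <= R else (R / nrm) * z

rng = np.random.default_rng(3)
R, N = 1.0, 50
g = 0.5 * rng.standard_normal((N, 2))  # jumps g_k of the piecewise constant path
w = np.cumsum(g, axis=0)               # w at the grid points: g_1 + ... + g_k
x, phi = np.zeros(2), np.zeros(2)
xs, phis = [], []
for k in range(N):
    y = x + g[k]
    x = proj_ball(y, R)
    phi = phi + (x - y)  # increment -nu_s L({s}); nonzero only when y leaves K
    xs.append(x.copy())
    phis.append(phi.copy())
xs, phis = np.array(xs), np.array(phis)
# Skorokhod decomposition x = w + phi at the grid points, and x stays in K
assert np.allclose(xs, w + phis)
assert max(np.linalg.norm(p) for p in xs) <= R + 1e-9
```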
The following lemma, extracted from Tanaka [1979], allows us to do exactly that.\n\nLemma 1 Let w and w̄ be piecewise continuous paths and assume that (x, ϕ) and (x̄, ϕ̄) solve the Skorokhod problems for w and w̄, respectively. Then for all times t we have\n\n|x(t) − x̄(t)|^2 ≤ |w(t) − w̄(t)|^2 + 2 ∫_0^t ⟨w(t) − w̄(t) − w(s) + w̄(s), ϕ(ds) − ϕ̄(ds)⟩.\n\nApplying the above lemma to the processes (Wt) and (W̄t) at time T = Nη yields (note that WT = W̄T)\n\n|XT − X̄T|^2 ≤ 2 ∫_0^T ⟨Wt − W̄t, νt⟩ L(dt) − 2 ∫_0^T ⟨Wt − W̄t, ν̄t⟩ L̄(dt).\n\nWe claim that the second integral is equal to 0. Indeed, since the discretized process is constant on the intervals [kη, (k + 1)η) the local time L̄ is a positive combination of Dirac point masses at η, 2η, . . . , Nη. On the other hand Wkη = W̄kη for all integers k, hence the claim. Therefore\n\n|XT − X̄T|^2 ≤ 2 ∫_0^T ⟨Wt − W̄t, νt⟩ L(dt).\n\nUsing the inequality ⟨x, y⟩ ≤ ‖x‖K hK(y) we get\n\n|XT − X̄T|^2 ≤ 2 sup_{[0,T]} ‖Wt − W̄t‖K ∫_0^T hK(νt) L(dt).\n\nTaking the square root, expectation and using Cauchy–Schwarz we get\n\nE[|XT − X̄T|]^2 ≤ 2 E[ sup_{[0,T]} ‖Wt − W̄t‖K ∫_0^T hK(νt) L(dt) ]. (3)\n\nThe next two lemmas deal with each term in the right hand side of the above equation, and they will show that there exists a universal constant C such that\n\nE[|XT − X̄T|] ≤ C (η log(T/η))^{1/4} n^{3/4} T^{1/2} M^{1/2}. (4)\n\nWe discuss why the above bound implies a total variation bound in Section 2.4.\n\nLemma 2 We have, for all t > 0,\n\nE[ ∫_0^t hK(νs) L(ds) ] ≤ nt/2.\n\nProof By Itô’s formula\n\nd|Xt|^2 = 2⟨Xt, dWt⟩ + n dt − 2⟨Xt, νt⟩ L(dt).\n\nNow observe that by definition of the reflection, if t is in the support of L then\n\n⟨Xt, νt⟩ ≥ ⟨x, νt⟩, ∀x ∈ K.\n\nIn other words ⟨Xt, νt⟩ ≥ hK(νt). Therefore\n\n2 ∫_0^t hK(νs) L(ds) ≤ 2 ∫_0^t ⟨Xs, dWs⟩ + nt + |X0|^2 − |Xt|^2.\n\nThe first term of the right-hand side is a martingale, so using that X0 = 0 and taking expectation we get the result.\n\nLemma 3 There exists a universal constant C such that\n\nE[ sup_{[0,T]} ‖Wt − W̄t‖K ] ≤ C M √(nη log(T/η)).\n\nProof Note that\n\nE[ sup_{[0,T]} ‖Wt − W̄t‖K ] = E[ max_{0≤i≤N−1} Yi ], where Yi = sup_{t∈[iη,(i+1)η)} ‖Wt − Wiη‖K.\n\nObserve that the variables (Yi) are identically distributed; let p ≥ 1 and write\n\nE[ max_{i≤N−1} Yi ] ≤ E[ (Σ_{i=0}^{N−1} |Yi|^p)^{1/p} ] ≤ N^{1/p} ‖Y0‖_p.\n\nWe claim that\n\n‖Y0‖_p ≤ C √(p n η) M (5)\n\nfor some constant C, and for all p ≥ 2. Taking this for granted and choosing p = log(N) in the previous inequality yields the result (recall that N = T/η). So it is enough to prove (5). Observe that since (Wt) is a martingale, the process Mt = ‖Wt‖K is a sub-martingale. By Doob’s maximal inequality\n\n‖Y0‖_p = ‖ sup_{[0,η]} Mt ‖_p ≤ 2 ‖Mη‖_p,\n\nfor every p ≥ 2. 
Letting γn be the standard Gaussian measure on Rn and using Khintchine’s inequality we get\n\n‖Mη‖_p = √η ( ∫_{Rn} ‖x‖K^p γn(dx) )^{1/p} ≤ C √(pη) ∫_{Rn} ‖x‖K γn(dx).\n\nLastly, integrating in polar coordinates, it is easily seen that\n\n∫_{Rn} ‖x‖K γn(dx) ≤ C √n M.\n\n2.3 A mixing time estimate for the reflected Brownian motion\n\nGiven a probability measure ν supported on K, we let νPt be the law of Xt when X0 has law ν. The following lemma is the key result to estimate the mixing time of the process (Xt).\n\nLemma 4 Let x, x′ ∈ K. Then\n\nTV(δxPt, δx′Pt) ≤ |x − x′| / √(2πt).\n\nThe above result clearly implies that for a probability measure ν on K, TV(δ0Pt, νPt) ≤ (∫_K |x| ν(dx)) / √(2πt). Since µ (the uniform measure on K) is stationary for reflected Brownian motion, we obtain\n\nTV(δ0Pt, µ) ≤ m / √(2πt). (6)\n\nIn other words, starting from 0, the mixing time of (Xt) is of order m^2. We now turn to the proof of the above lemma.\n\nProof The proof is based on a coupling argument. Let (Wt) be a Brownian motion starting from 0 and let (Xt) be a reflected Brownian motion starting from x:\n\nX0 = x, dXt = dWt − νt L(dt),\n\nwhere (νt) and L satisfy the appropriate conditions. We construct a reflected Brownian motion (X′t) starting from x′ as follows. Let τ = inf{t ≥ 0; Xt = X′t}, and for t < τ let St be the orthogonal reflection with respect to the hyperplane (Xt − X′t)⊥. 
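In coordinates, the orthogonal reflection St with respect to the hyperplane (Xt − X′t)⊥ is the Householder-type map I − 2VtVt⊤ with Vt = (Xt − X′t)/|Xt − X′t|. A quick numerical check of the two properties used in the proof, namely that St is an orthogonal map and that (I − St) dWt = 2⟨Vt, dWt⟩Vt (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
x, x_prime = rng.standard_normal(3), rng.standard_normal(3)
v = (x - x_prime) / np.linalg.norm(x - x_prime)
S = np.eye(3) - 2.0 * np.outer(v, v)  # reflection across the hyperplane v-perp

dW = 0.1 * rng.standard_normal(3)  # a Brownian increment
dW_mirror = S @ dW                 # the mirrored increment driving X'

assert np.allclose(S @ S.T, np.eye(3))                # S is orthogonal
assert np.allclose(dW - dW_mirror, 2 * (v @ dW) * v)  # (I - S) dW = 2 <v, dW> v
```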
Then up to time τ, the process (X′t) is defined by\n\nX′0 = x′, dX′t = dW′t − ν′t L′(dt), dW′t = St(dWt),\n\nwhere L′ is a measure supported on {t ≤ τ; X′t ∈ ∂K}, and ν′t is an outer unit normal at X′t for all such t. After time τ we just set X′t = Xt. Since St is an orthogonal map, (W′t) is a Brownian motion and thus (X′t) is a reflected Brownian motion starting from x′. Therefore\n\nTV(δxPt, δx′Pt) ≤ P(Xt ≠ X′t) = P(τ > t).\n\nObserve that on [0, τ)\n\ndWt − dW′t = (I − St)(dWt) = 2⟨Vt, dWt⟩Vt,\n\nwhere Vt = (Xt − X′t)/|Xt − X′t|. So\n\nd(Xt − X′t) = 2⟨Vt, dWt⟩Vt − νt L(dt) + ν′t L′(dt) = 2(dBt) Vt − νt L(dt) + ν′t L′(dt),\n\nwhere\n\nBt = ∫_0^t ⟨Vs, dWs⟩, on [0, τ).\n\nObserve that (Bt) is a one-dimensional Brownian motion. Itô’s formula then gives\n\ndg(Xt − X′t) = 2⟨∇g(Xt − X′t), Vt⟩ dBt − ⟨∇g(Xt − X′t), νt⟩ L(dt) + ⟨∇g(Xt − X′t), ν′t⟩ L′(dt) + 2∇²g(Xt − X′t)(Vt, Vt) dt,\n\nfor every smooth function g on Rn. Now if g(x) = |x| then\n\n∇g(Xt − X′t) = Vt,\n\nso ⟨∇g(Xt − X′t), Vt⟩ = 1, ⟨∇g(Xt − X′t), νt⟩ ≥ 0 on the support of L, and ⟨∇g(Xt − X′t), ν′t⟩ ≤ 0 on the support of L′. Moreover ∇²g(x) = (1/|x|) P_{x⊥}, where P_{x⊥} denotes the orthogonal projection on x⊥; in particular ∇²g(Xt − X′t)(Vt, Vt) = 0. We obtain\n\n|Xt − X′t| ≤ |x − x′| + 2Bt, on [0, τ).\n\nTherefore P(τ > t) ≤ P(τ′ > t) where τ′ is the first time the Brownian motion (Bt) hits the value −|x − x′|/2. Now by the reflection principle\n\nP(τ′ > t) = 2 P(0 ≤ 2Bt < |x − x′|) ≤ |x − x′| / √(2πt).\n\n2.4 From Wasserstein distance to total variation\n\nTo conclude it remains to derive a total variation bound between XT and X̄T using (4). The details of this step are deferred to the supplementary material, where we consider the case of a general log-concave distribution. The intuition goes as follows: the processes (X_{T+s})_{s≥0} and (X̄_{T+s})_{s≥0} both evolve according to a Brownian motion until the first time s that one process undergoes a reflection. But if T is large enough and η is small enough then one can easily get from (4) (and the fact that the uniform measure does not put too much mass close to the boundary) that XT and X̄T are much closer to each other than they are to the boundary of K. This implies that one can couple them (just as in Section 2.3) so that they meet before one of them hits the boundary.\n\n3 Experiments\n\nComparing different Markov Chain Monte Carlo algorithms is a challenging problem in and of itself. Here we choose the following simple comparison procedure based on the volume algorithm developed in Cousins and Vempala [2014]. This algorithm, whose objective is to compute the volume of a given convex set K, proceeds in phases. In each phase ℓ it estimates the mean of a certain function under a multivariate Gaussian restricted to K with (unrestricted) covariance σ_ℓ^2 In. 
Cousins and Vempala provide a Matlab implementation of the entire algorithm, where in each phase the target mean is estimated by sampling from the truncated Gaussian using the hit-and-run (H&R) chain. We implemented the same procedure with LMC instead of H&R, and we choose the step-size η = 1/(βn^2), where β is the smoothness parameter of the underlying log-concave distribution (in particular here β = 1/σ_ℓ^2). The intuition for the choice of the step-size is as follows: the scaling in inverse smoothness comes from the optimization literature, while the scaling in inverse dimension squared comes from the analysis in the unconstrained case in Dalalyan [2014].\n\n[Figure: estimated normalized volume, and clock time in seconds, for the volume algorithm run with H&R and with LMC on the “Box” and the “Box and Ball”, n = 10, 20, . . . , 100.]\n\nWe ran the volume algorithm with both H&R and LMC on the following set of convex bodies: K = [−1, 1]^n (referred to as the “Box”) and K = [−1, 1]^n ∩ (√n/2) Bn (referred to as the “Box and Ball”), where n = 10 × k, k = 1, . . . , 10. The computed volume (normalized by 2^n for the “Box” and by 0.2 × 2^n for the “Box and Ball”) as well as the clock time (in seconds) to terminate are reported in the figure above. From these experiments it seems that LMC and H&R roughly compute similar values for the volume (with H&R being slightly more accurate), and LMC is almost always a bit faster. These results are encouraging, but much more extensive experiments are needed to decide if LMC is indeed a competitor to H&R in practice.\n\nReferences\n\nS. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML 2012, 2012.\n\nF. Bach and E. Moulines. 
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26 (NIPS), pages 773–781, 2013.\n\nB. Cousins and S. Vempala. Bypassing KLS: Gaussian cooling and an O*(n^3) volume algorithm. arXiv preprint arXiv:1409.6011, 2014.\n\nA. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. arXiv preprint arXiv:1412.7392, 2014.\n\nM. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM (JACM), 38(1):1–17, 1991.\n\nR. Kannan and H. Narayanan. Random walks on polytopes and an affine interior point method for linear programming. Mathematics of Operations Research, 37:1–20, 2012.\n\nL. Lovász and S. Vempala. Hit-and-run from a corner. SIAM J. Comput., 35(4):985–1005, 2006.\n\nL. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.\n\nA. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.\n\nG. Pflug. Stochastic minimization with constant step-size: asymptotic laws. SIAM J. Control and Optimization, 24(4):655–666, 1986.\n\nH. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.\n\nA. Skorokhod. Stochastic equations for diffusion processes in a bounded region. Theory of Probability & Its Applications, 6(3):264–274, 1961.\n\nH. Tanaka. Stochastic differential equations with reflecting boundary condition in convex regions. Hiroshima Mathematical Journal, 9(1):163–177, 1979.\n\nR. Tweedie and G. Roberts. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.\n\nM. Welling and Y.W. Teh. 
Bayesian learning via stochastic gradient Langevin dynamics. In ICML 2011, 2011.\n", "award": [], "sourceid": 775, "authors": [{"given_name": "Sebastien", "family_name": "Bubeck", "institution": "MSR"}, {"given_name": "Ronen", "family_name": "Eldan", "institution": null}, {"given_name": "Joseph", "family_name": "Lehec", "institution": null}]}