{"title": "Mirrored Langevin Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 2878, "page_last": 2887, "abstract": "We consider the problem of sampling from constrained distributions, which has posed significant challenges to both non-asymptotic analysis and algorithmic design. We propose a unified framework, which is inspired by the classical mirror descent, to derive novel first-order sampling schemes. We prove that, for a general target distribution with strongly convex potential, our framework implies the existence of a first-order algorithm achieving O~(\\epsilon^{-2}d) convergence, suggesting that the state-of-the-art O~(\\epsilon^{-6}d^5) can be vastly improved. With the important Latent Dirichlet Allocation (LDA) application in mind, we specialize our algorithm to sample from Dirichlet posteriors, and derive the first non-asymptotic O~(\\epsilon^{-2}d^2) rate for first-order sampling. We further extend our framework to the mini-batch setting and prove convergence rates when only stochastic gradients are available. Finally, we report promising experimental results for LDA on real datasets.", "full_text": "Mirrored Langevin Dynamics\n\nYa-Ping Hsieh\n\nAli Kavis\n\nPaul Rolland\n\nVolkan Cevher\n\nLaboratory for Information and Inference Systems (LIONS),\n\nEPFL, Lausanne, Switzerland\n\n{ya-ping.hsieh, ali.kavis, paul.rolland, volkan.cevher}@epfl.ch\n\nAbstract\n\nWe consider the problem of sampling from constrained distributions, which\nhas posed signi\ufb01cant challenges to both non-asymptotic analysis and algo-\nrithmic design. We propose a uni\ufb01ed framework, which is inspired by the\nclassical mirror descent, to derive novel \ufb01rst-order sampling schemes. 
We prove that, for a general target distribution with strongly convex potential, our framework implies the existence of a first-order algorithm achieving Õ(ε⁻²d) convergence, suggesting that the state-of-the-art Õ(ε⁻⁶d⁵) can be vastly improved. With the important Latent Dirichlet Allocation (LDA) application in mind, we specialize our algorithm to sample from Dirichlet posteriors, and derive the first non-asymptotic Õ(ε⁻²d²) rate for first-order sampling. We further extend our framework to the mini-batch setting and prove convergence rates when only stochastic gradients are available. Finally, we report promising experimental results for LDA on real datasets.

1 Introduction

Many modern learning tasks involve sampling from a high-dimensional and large-scale distribution, which calls for algorithms that are scalable with respect to both the dimension and the data size. One approach [32] that has found wide success is to discretize the Langevin Dynamics:

dX_t = −∇V(X_t)dt + √2 dB_t,   (1.1)

where e^{−V(x)}dx represents a target distribution and B_t is a d-dimensional Brownian motion. Such a framework has inspired numerous first-order sampling algorithms [1, 7, 13, 15, 18, 19, 26, 29], and the convergence rates are by now well-understood for unconstrained and log-concave distributions [8, 12, 14].

However, applying (1.1) to sampling from constrained distributions (i.e., when V has a bounded convex domain) remains a difficult challenge. From the theoretical perspective, there are only two existing algorithms [4, 5] that possess non-asymptotic guarantees, and their rates are significantly worse than the unconstrained scenario under the same assumptions; cf., Table 1. Furthermore, many important constrained distributions are inherently non-log-concave. 
A prominent instance is the Dirichlet posterior, which, in spite of the presence of several tailor-made first-order algorithms [18, 26], still lacks a non-asymptotic guarantee.

In this paper, we aim to bridge these two gaps at the same time. For general constrained distributions with a strongly convex potential V, we prove the existence of a first-order algorithm that achieves the same convergence rates as if there were no constraint at all, suggesting that the state-of-the-art Õ(ε⁻⁶d⁵) can be brought down to Õ(ε⁻²d). When specialized to the important case of the simplex constraint, we provide the first non-asymptotic guarantees for Dirichlet posteriors: Õ(ε⁻²d²R₀) for the deterministic and Õ(ε⁻²(Nd + σ²)R₀) for the stochastic version of our algorithms; cf., Examples 1 and 2 for the involved parameters.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our framework combines ideas from the Mirror Descent [2, 25] algorithm for optimization and the theory of Optimal Transport [31]. Concretely, for constrained sampling problems, we propose to use the mirror map to transform the target into an unconstrained distribution, whereby many existing methods apply. Optimal Transport theory then comes in handy to relate the convergence rates between the original and transformed problems. For simplex constraints, we use the entropic mirror map to design practical first-order algorithms that possess rigorous guarantees and are amenable to mini-batch extensions.

The rest of the paper is organized as follows. We briefly review the notion of push-forward measures in Section 2. In Section 3, we propose the Mirrored Langevin Dynamics and prove its convergence rates for constrained sampling problems. Mini-batch extensions are derived in Section 4. 
Finally, in Section 5, we provide synthetic and real-world experiments to demonstrate the empirical efficiency of our algorithms.

1.1 Related Work

First-Order Sampling Schemes with Langevin Dynamics: There exists a large body of literature on (stochastic) first-order sampling schemes derived from Langevin Dynamics or its variants [1, 4–6, 8, 9, 12, 14, 16, 20, 26, 32]. However, to our knowledge, this work is the first to consider mirror descent extensions of the Langevin Dynamics.

The authors in [21] proposed a formalism that can, in principle, incorporate any variant of Langevin Dynamics for a given distribution e^{−V(x)}dx. The Mirrored Langevin Dynamics, however, targets the push-forward measure e^{−W(y)}dy (see Section 3.1), and hence our framework is not covered by [21].

For Dirichlet posteriors, there is a variable transformation similar to our entropic mirror map in [26] (see the "reduced-natural parametrization" therein). The dynamics in [26] is nonetheless drastically different from ours, as there is a position-dependent matrix multiplying the Brownian motion, whereas our dynamics has no such feature; see (3.2).

Mirror Descent-Type Dynamics for Stochastic Optimization: Although there is some existing work on mirror descent-type dynamics for stochastic optimization [17, 24, 27, 33], we are unaware of any prior result on sampling.

2 Preliminaries

2.1 Notation

In this paper, all Lipschitzness and strong convexity are with respect to the Euclidean norm ‖·‖. We use C^k to denote k-times differentiable functions with continuous kth derivative. The Fenchel dual [28] of a function h is denoted by h⋆. Given two mappings T, F of proper dimensions, we denote their composite map by T ∘ F. 
For a probability measure µ, we write X ∼ µ to mean that "X is a random variable whose probability law is µ".

2.2 Push-Forward and Optimal Transport

Let dµ = e^{−V(x)}dx be a probability measure with support X := dom(V) = {x ∈ R^d | V(x) < +∞}, and let h be a convex function on X. Throughout the paper we assume:

Assumption 1. h is closed, proper, h ∈ C², and ∇²h ≻ 0 on X ⊂ R^d.
Assumption 2. All measures have finite second moments.
Assumption 3. All measures vanish on sets with Hausdorff dimension [22] at most d − 1.

The gradient map ∇h induces a new probability measure dν := e^{−W(y)}dy through ν(E) = µ(∇h⁻¹(E)) for every Borel set E on R^d. We say that ν is the push-forward measure of µ under ∇h, and we denote it by ∇h#µ = ν. If X ∼ µ and Y ∼ ν, we will sometimes abuse the notation by writing ∇h#X = Y to mean ∇h#µ = ν.

If ∇h#µ = ν, the triplet (µ, ν, h) must satisfy the Monge-Ampère equation:

e^{−V} = e^{−W∘∇h} det ∇²h.   (2.1)

Using (∇h)⁻¹ = ∇h⋆ and ∇²h ∘ ∇h⋆ = (∇²h⋆)⁻¹, we see that (2.1) is equivalent to

e^{−W} = e^{−V∘∇h⋆} det ∇²h⋆,   (2.2)

which implies ∇h⋆#ν = µ.

The 2-Wasserstein distance between µ₁ and µ₂ is defined by¹

W₂²(µ₁, µ₂) := inf_{T : T#µ₁=µ₂} ∫ ‖x − T(x)‖² dµ₁(x).   (2.3)

3 Mirrored Langevin Dynamics

This section demonstrates a framework for transforming constrained sampling problems into unconstrained ones. 
We then focus on applications to sampling from strongly log-concave distributions and simplex-constrained distributions, even though the framework is more general.

3.1 Motivation and Algorithm

We begin by briefly recalling the mirror descent (MD) algorithm for optimization. In order to minimize a function over a bounded domain, say min_{x∈X} f(x), MD uses a mirror map h to transform the primal variable x into the dual space y := ∇h(x), and then performs gradient updates in the dual: y⁺ = y − β∇f(x) for some step-size β. The mirror map h is chosen to adapt to the geometry of the constraint X, which can often lead to faster convergence [25] or, more pivotal to this work, an unconstrained optimization problem [2].

Inspired by the MD framework, we would like to use the mirror map idea to remove the constraint for sampling problems. Toward this end, we first establish a simple fact [30]:

Theorem 1. Let h satisfy Assumption 1. Suppose that X ∼ µ and Y = ∇h(X). Then Y ∼ ν := ∇h#µ and ∇h⋆(Y) ∼ µ.

Proof. For any Borel set E, we have ν(E) = P(Y ∈ E) = P(X ∈ ∇h⁻¹(E)) = µ(∇h⁻¹(E)). Since ∇h is one-to-one, Y = ∇h(X) if and only if X = ∇h⁻¹(Y) = ∇h⋆(Y).

In the context of sampling, Theorem 1 suggests the following simple procedure: For any target distribution e^{−V(x)}dx with support X, we choose a mirror map h on X satisfying Assumption 1, and we consider the dual distribution associated with e^{−V(x)}dx and h:

e^{−W(y)}dy := ∇h#e^{−V(x)}dx.   (3.1)

Theorem 1 dictates that if we are able to draw a sample Y from e^{−W(y)}dy, then ∇h⋆(Y) immediately gives a sample from the desired distribution e^{−V(x)}dx. 
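To make this procedure concrete, the following minimal Python sketch uses the entropic mirror map of Section 3.2.2 as h and checks numerically that ∇h and ∇h⋆ are inverse maps, so that samples can be pushed back and forth between the primal and dual spaces (the names `grad_h` and `grad_h_star` are ours, purely for illustration):

```python
import numpy as np

def grad_h(x):
    # Entropic mirror map on the simplex: (∇h(x))_l = log(x_l / x_{d+1}),
    # where x_{d+1} := 1 - sum(x).  Maps int(Δ_d) onto R^d.
    return np.log(x / (1.0 - x.sum()))

def grad_h_star(y):
    # Fenchel-dual gradient (∇h⋆(y))_l = e^{y_l} / (1 + Σ_k e^{y_k}),
    # the inverse of ∇h, mapping R^d back into int(Δ_d).
    m = y.max()
    z = np.exp(y - m)                     # shift for numerical stability
    return z / (np.exp(-m) + z.sum())

# Theorem 1 in action: push a simplex point to the dual space and back.
x = np.array([0.2, 0.3, 0.1])             # a point in int(Δ_3)
y = grad_h(x)
assert np.allclose(grad_h_star(y), x)     # ∇h⋆ ∘ ∇h = identity
```

In particular, drawing Y from the dual distribution and returning `grad_h_star(Y)` produces a sample from the original constrained target, exactly as the theorem states.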
Furthermore, suppose for the moment that dom(h⋆) = R^d, so that e^{−W(y)}dy is unconstrained. Then we can simply exploit the classical Langevin Dynamics (1.1) to efficiently take samples from e^{−W(y)}dy. The above reasoning leads us to set up the Mirrored Langevin Dynamics (MLD):

MLD ≡ { dY_t = −(∇W ∘ ∇h)(X_t)dt + √2 dB_t,   X_t = ∇h⋆(Y_t) }.   (3.2)

Notice that the stationary distribution of Y_t in MLD is e^{−W(y)}dy, since dY_t is nothing but the Langevin Dynamics (1.1) with ∇V ← ∇W. As a result, we have X_t → X_∞ ∼ e^{−V(x)}dx.

¹In general, (2.3) is ill-defined; see [30]. The validity of (2.3) is guaranteed by McCann's theorem [23] under Assumptions 2 and 3.

Table 1: Convergence rates for sampling from e^{−V(x)}dx with dom(V) bounded

Algorithm                     | Assumption                      | D(·‖·)   | dTV        | W₂
MYULA [4]                     | LI ⪰ ∇²V ⪰ mI                   | unknown  | Õ(ε⁻⁶d⁵)   | unknown
PLMC [5]                      | LI ⪰ ∇²V ⪰ 0                    | unknown  | Õ(ε⁻¹²d¹²) | unknown
MLD; this work                | ∇²V ⪰ mI                        | Õ(ε⁻¹d)  | Õ(ε⁻²d)    | Õ(ε⁻²d)
Langevin Dynamics [8, 11, 14] | LI ⪰ ∇²V ⪰ mI, V unconstrained  | Õ(ε⁻¹d)  | Õ(ε⁻²d)    | Õ(ε⁻²d)

Using (2.1), we can equivalently write the dY_t term in (3.2) as

dY_t = −∇²h(X_t)⁻¹(∇V(X_t) + ∇ log det ∇²h(X_t))dt + √2 dB_t.

In order to arrive at a practical algorithm, we then discretize the MLD, giving rise to the following equivalent iterations:

y_{t+1} − y_t = −β_t ∇W(y_t) + √(2β_t) ξ_t,   or
y_{t+1} − y_t = −β_t ∇²h(x_t)⁻¹(∇V(x_t) + ∇ log det ∇²h(x_t)) + √(2β_t) ξ_t,   (3.3)

where in both cases x_{t+1} = ∇h⋆(y_{t+1}), the ξ_t's are i.i.d. standard Gaussian, and the β_t's are step-sizes. The first formulation in (3.3) is useful when ∇W has a tractable form, while the second one can be computed using solely the information of V and h.

Next, we turn to the convergence of discretized MLD. Since dY_t in (3.2) is the classical Langevin Dynamics, and since we have assumed that W is unconstrained, it is typically not difficult to prove the convergence of y_t to Y_∞ ∼ e^{−W(y)}dy. However, what we ultimately care about is the guarantee on the primal distribution e^{−V(x)}dx. The purpose of the next theorem is to fill the gap between primal and dual convergence.

We consider the three most common metrics for evaluating approximate sampling schemes, namely the 2-Wasserstein distance W₂, the total variation dTV, and the relative entropy D(·‖·).

Theorem 2 (Convergence in y_t implies convergence in x_t). For any h satisfying Assumption 1, we have dTV(∇h#µ₁, ∇h#µ₂) = dTV(µ₁, µ₂) and D(∇h#µ₁ ‖ ∇h#µ₂) = D(µ₁ ‖ µ₂). In particular, we have dTV(y_t, Y_∞) = dTV(x_t, X_∞) and D(y_t ‖ Y_∞) = D(x_t ‖ X_∞) in (3.3). If, furthermore, h is ρ-strongly convex (∇²h ⪰ ρI), then W₂(x_t, X_∞) ≤ (1/ρ) W₂(y_t, Y_∞).

Proof. See Appendix A.

3.2 Applications to Sampling from Constrained Distributions

We now consider applications of MLD. For strongly log-concave distributions with general constraints, we prove rates matching those of the unconstrained case; see Section 3.2.1. 
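Before specializing, note that the first formulation of (3.3) is only a few lines of code. The sketch below is a hedged illustration under our own naming: `grad_W` stands for an assumed oracle for ∇W, and the toy run uses W(y) = ‖y‖²/2 (so ∇W(y) = y) together with the entropic ∇h⋆ of Section 3.2.2:

```python
import numpy as np

def mld_step(y, grad_W, beta, rng):
    # One discretized MLD iteration, first form of (3.3):
    #   y_{t+1} = y_t - beta * ∇W(y_t) + sqrt(2*beta) * xi_t
    #   x_{t+1} = ∇h⋆(y_{t+1})
    xi = rng.standard_normal(y.shape)
    y_next = y - beta * grad_W(y) + np.sqrt(2.0 * beta) * xi
    x_next = np.exp(y_next) / (1.0 + np.exp(y_next).sum())  # entropic ∇h⋆
    return y_next, x_next

# Toy run with a strongly convex dual potential W(y) = ||y||^2 / 2.
rng = np.random.default_rng(0)
y = np.zeros(3)
for _ in range(1000):
    y, x = mld_step(y, lambda v: v, beta=0.1, rng=rng)
assert x.min() > 0 and x.sum() < 1    # every iterate stays in int(Δ_3)
```

The dual iterate y_t is a plain unconstrained Langevin step; the constraint is enforced for free by the map x_{t+1} = ∇h⋆(y_{t+1}).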
In Section 3.2.2, we consider the important case where the constraint is a probability simplex².

3.2.1 Sampling from a strongly log-concave distribution with constraint

As alluded to in the introduction, the existing convergence rates for constrained distributions are significantly worse than their unconstrained counterparts; see Table 1 for a comparison. The main result of this subsection is the existence of a "good" mirror map for an arbitrary constraint, with which the dual distribution e^{−W(y)}dy becomes unconstrained:

Theorem 3 (Existence of a good mirror map for MLD). Let dµ(x) = e^{−V(x)}dx be a probability measure with bounded convex support such that V ∈ C², ∇²V ⪰ mI ≻ 0, and V is bounded away from +∞ in the interior of the support. Then there exists a mirror map h ∈ C² such that the discretized MLD (3.3) yields

D(x_T ‖ X_∞) = Õ(d/T),   W₂(x_T, X_∞) = Õ(√(d/T)),   dTV(x_T, X_∞) = Õ(√(d/T)).

²More examples of mirror maps can be found in Appendix B.

Proof. See Appendix C.

Remark 1. 
We remark that Theorem 3 is only an existential result, not an actual algorithm. Practical algorithms are considered in the next subsection.

3.2.2 Sampling Algorithms on the Simplex

We apply the discretized MLD (3.3) to the task of sampling from distributions on the probability simplex Δ_d := {x ∈ R^d | Σ_{ℓ=1}^d x_ℓ ≤ 1, x_ℓ ≥ 0}, which is instrumental in many fields of machine learning and statistics.

On the simplex, the most natural choice of h is the entropic mirror map [2], which is well-known to be 1-strongly convex:

h(x) = Σ_{ℓ=1}^d x_ℓ log x_ℓ + (1 − Σ_{ℓ=1}^d x_ℓ) log(1 − Σ_{ℓ=1}^d x_ℓ),   where 0 log 0 := 0.   (3.4)

In this case, the associated dual distribution can be computed explicitly.

Lemma 1 (Sampling on a simplex with entropic mirror map). Let e^{−V(x)}dx be the target distribution on Δ_d, h be the entropic mirror map (3.4), and e^{−W(y)}dy := ∇h#e^{−V(x)}dx. Then the potential W of the push-forward measure admits the expression

W(y) = V ∘ ∇h⋆(y) − Σ_{ℓ=1}^d y_ℓ + (d + 1)h⋆(y),   (3.5)

where h⋆(y) = log(1 + Σ_{ℓ=1}^d e^{y_ℓ}) is the Fenchel dual of h, which is strictly convex and has a 1-Lipschitz gradient.

Proof. See Appendix D.

Crucially, we have dom(h⋆) = R^d, so that the Langevin Dynamics for e^{−W(y)}dy is unconstrained.

Based on Lemma 1, we now present the surprising case of the non-log-concave Dirichlet posteriors, a distribution of central importance in topic modeling [3], for which the dual distribution e^{−W(y)}dy becomes strictly log-concave.

Example 1 (Dirichlet Posteriors). 
Given parameters α₁, α₂, ..., α_{d+1} > 0 and observations n₁, n₂, ..., n_{d+1}, where n_ℓ is the number of appearances of category ℓ, the probability density function of the Dirichlet posterior is

p(x) = (1/C) Π_{ℓ=1}^{d+1} x_ℓ^{n_ℓ+α_ℓ−1},   x ∈ int(Δ_d),   (3.6)

where C is a normalizing constant and x_{d+1} := 1 − Σ_{ℓ=1}^d x_ℓ. The corresponding V is

V(x) = −log p(x) = log C − Σ_{ℓ=1}^{d+1} (n_ℓ + α_ℓ − 1) log x_ℓ,   x ∈ int(Δ_d).

The interesting regime of the Dirichlet posterior is when it is sparse, meaning that the majority of the n_ℓ's are zero and a few n_k's are large, say of order O(d). It is also common to set α_ℓ < 1 for all ℓ in practice. Evidently, V is neither convex nor concave in this case, and no existing non-asymptotic rate applies. However, plugging V into (3.5) gives

W(y) = log C − Σ_{ℓ=1}^d (n_ℓ + α_ℓ)y_ℓ + (Σ_{ℓ=1}^{d+1} (n_ℓ + α_ℓ)) h⋆(y),   (3.7)

which, remarkably, becomes strictly convex with an O(d)-Lipschitz gradient no matter what the observations and parameters are! In view of Theorem 2 and [14, Corollary 7], one can then apply (3.3) to obtain Õ(ε⁻²d²R₀) convergence in relative entropy, where R₀ := W₂²(y₀, e^{−W(y)}dy) is the initial Wasserstein distance to the target.

Algorithm 1 Stochastic Mirrored Langevin Dynamics (SMLD)
Require: Target distribution e^{−V(x)}dx where V = Σ_{i=1}^N V_i, step-sizes β_t, batch-size b
1: Find W_i such that e^{−NW_i} ∝ ∇h#e^{−NV_i} for all i.
2: for t ← 0, 1, ..., T − 1 do
3:   Pick a mini-batch B of size b uniformly at random.
4:   Update y_{t+1} = y_t − (β_t N/b) Σ_{i∈B} ∇W_i(y_t) + √(2β_t) ξ_t
5:   x_{t+1} = ∇h⋆(y_{t+1})   ▷ Update only when necessary.
6: end for
return x_T

4 Stochastic Mirrored Langevin Dynamics

We have thus far only considered deterministic methods based on exact gradients. In practice, however, evaluating gradients typically involves one pass over the full data, which can be time-consuming in large-scale applications. In this section, we turn our attention to the mini-batch setting, where one can use a small subset of data to form stochastic gradients.

Toward this end, we assume:

Assumption 4 (Primal Decomposability). The target distribution e^{−V(x)}dx admits a decomposable structure V = Σ_{i=1}^N V_i for some functions V_i.

Consider the following common scheme for obtaining stochastic gradients. Given a batch-size b, we randomly pick a mini-batch B from {1, 2, ..., N} with |B| = b, and form an unbiased estimate of ∇V by computing

∇̃V := (N/b) Σ_{i∈B} ∇V_i.   (4.1)

The following lemma asserts that exactly the same procedure can be carried out in the dual.

Lemma 2. Assume that h is 1-strongly convex. For i = 1, 2, ..., N, let W_i be such that

e^{−NW_i} = ∇h#( e^{−NV_i} / ∫ e^{−NV_i} ).   (4.2)

Define W := Σ_{i=1}^N W_i and ∇̃W := (N/b) Σ_{i∈B} ∇W_i, where B is chosen as in (4.1). Then:

1. Primal decomposability implies dual decomposability: There is a constant C such that e^{−(W+C)} = ∇h#e^{−V}.
2. For each i, the gradient ∇W_i depends only on ∇V_i and the mirror map h.
3. The gradient estimate is unbiased: E∇̃W = ∇W.
4. The dual stochastic gradient is more accurate: E‖∇̃W − ∇W‖² ≤ E‖∇̃V − ∇V‖².

Proof. See Appendix E.

Lemma 2 furnishes a template for the mini-batch extension of MLD. The pseudocode is detailed in Algorithm 1, whose convergence rate is given by the next theorem.

Theorem 4. Let e^{−V(x)}dx be a distribution satisfying Assumption 4, and h a 1-strongly convex mirror map. Let σ² := E‖∇̃V − ∇V‖² be the variance of the stochastic gradient of V in (4.1). Suppose that the corresponding dual distribution e^{−W(y)}dy = ∇h#e^{−V(x)}dx satisfies LI ⪰ ∇²W ⪰ 0. Then, applying SMLD with constant step-size β_t = β yields³

D(x_T ‖ e^{−V(x)}dx) ≤ √( 2W₂²(y₀, e^{−W(y)}dy)(Ld + σ²) / T ) = O(√((Ld + σ²)/T)),   (4.3)

provided that β ≤ min{ [2T W₂²(y₀, e^{−W(y)}dy)(Ld + σ²)]^{−1/2}, 1/L }.

³Our guarantee is given for a uniformly randomly chosen iterate from {x₁, x₂, ..., x_T}, instead of the final iterate x_T. In practice, we observe that the final iterate always gives the best performance, and we will ignore this minor difference in the theorem statement.

Proof. See Appendix F.

Example 2 (SMLD for Dirichlet Posteriors). 
For the case of Dirichlet posteriors, we have seen in (3.7) that the corresponding dual distribution satisfies (N + Γ)I ⪰ ∇²W ≻ 0, where N := Σ_{ℓ=1}^{d+1} n_ℓ and Γ := Σ_{ℓ=1}^{d+1} α_ℓ. Furthermore, it is easy to see that the stochastic gradient ∇̃W can be efficiently computed (see Appendix G):

∇̃W(y)_ℓ := (N/b) Σ_{i∈B} ∇W_i(y)_ℓ = −( (N m_ℓ)/b + α_ℓ ) + (N + Γ) e^{y_ℓ}/(1 + Σ_{k=1}^d e^{y_k}),   (4.4)

where m_ℓ is the number of observations of category ℓ in the mini-batch B. As a result, Theorem 4 states that SMLD achieves

D(x_T ‖ e^{−V(x)}dx) ≤ √( 2W₂²(y₀, e^{−W(y)}dy)((N + Γ)(d + 1) + σ²) / T ) = O(√(((N + Γ)d + σ²)/T))

with a constant step-size.

5 Experiments

We conduct experiments with a two-fold purpose. First, we use low-dimensional synthetic data, where we can evaluate the total variation error by comparing histograms, to verify the convergence rates in our theory. Second, we demonstrate that SMLD, modulo a necessary modification to resolve numerical issues, outperforms state-of-the-art first-order methods on the Latent Dirichlet Allocation (LDA) application with the Wikipedia corpus.

5.1 Synthetic Experiment for Dirichlet Posterior

We implement the deterministic MLD for sampling from an 11-dimensional Dirichlet posterior (3.6) with n₁ = 10,000, n₂ = n₃ = 10, and n₄ = n₅ = ··· = n₁₁ = 0, which aims to capture the sparse nature of real observations in topic modeling. 
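The stochastic dual gradient (4.4) is simple enough to sketch directly; the snippet below uses our own naming and verifies the full-batch sanity check (with m_ℓ = n_ℓ and b = N, the estimate coincides with the exact ∇W of (3.7)):

```python
import numpy as np

def smld_grad(y, m, b, N, alpha):
    # Stochastic dual gradient (4.4) for the Dirichlet posterior:
    #   ∇~W(y)_l = -(N*m_l/b + alpha_l) + (N + Gamma) * e^{y_l} / (1 + Σ_k e^{y_k}),
    # where m_l counts category-l observations in a mini-batch of size b,
    # N = Σ_l n_l and Gamma = Σ_l alpha_l (sums over all d+1 categories).
    Gamma = alpha.sum()
    soft = np.exp(y) / (1.0 + np.exp(y).sum())
    return -(N * m / b + alpha[:-1]) + (N + Gamma) * soft

# Full-batch sanity check against the exact gradient of (3.7).
n = np.array([10.0, 5.0, 0.0])       # observations for d+1 = 3 categories
alpha = np.array([0.1, 0.1, 0.1])
y = np.zeros(2)                      # dual variable lives in R^d, d = 2
g = smld_grad(y, m=n[:-1], b=n.sum(), N=n.sum(), alpha=alpha)
exact = -(n[:-1] + alpha[:-1]) \
        + (n.sum() + alpha.sum()) * np.exp(y) / (1.0 + np.exp(y).sum())
assert np.allclose(g, exact)
```

Each SMLD step then plugs this estimate into the dual update of Algorithm 1.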
We set α_ℓ = 0.1 for all ℓ.

As a baseline comparison, we include the Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) [26] with the expanded-mean parametrization. SGRLD is a tailor-made first-order scheme for simplex constraints, and it remains one of the state-of-the-art algorithms for LDA. For a fair comparison, we use deterministic gradients for SGRLD.

We perform a grid search over the constant step-size for both algorithms, and we keep the best three for MLD and SGRLD. For each iteration, we build an empirical distribution by running 2,000,000 independent trials, and we compute its total variation with respect to the histogram generated by the true distribution.

Figure 1(a) reports the total variation error along the first dimension, where we can see that MLD outperforms SGRLD by a substantial margin. As dictated by our theory, all the MLD curves decay at the O(T^{−1/2}) rate until they saturate at the discretization error level. In contrast, SGRLD lacks non-asymptotic guarantees, and there is no clear convergence rate we can infer from Figure 1(a).

The improvement along all other dimensions (i.e., topics with fewer observations) is even more significant; see Appendix H.1.

5.2 Latent Dirichlet Allocation with Wikipedia Corpus

An influential framework for topic modeling is the Latent Dirichlet Allocation (LDA) [3], which, given a text collection, requires inferring the posterior word distributions without knowing the exact topic of each word. The full model description is standard but somewhat convoluted; we refer to the classic [3] for details.

Each topic k in LDA determines a word distribution π_k, and suppose there are in total K topics and W + 1 words. The variable of interest is therefore π := (π₁, π₂, ..., π_K) ∈ Δ_W × Δ_W × ··· × Δ_W. 
Since this domain is a Cartesian product of simplices, we propose to use h̃(π) := Σ_{k=1}^K h(π_k), where h is the entropic mirror map (3.4), for SMLD. It is easy to see that all of our computations for Dirichlet posteriors generalize to this setting.

Figure 1: (a) Synthetic data, first dimension. (b) LDA on Wikipedia corpus.

5.2.1 Experimental Setup

We implement the SMLD for LDA on the Wikipedia corpus with 100,000 documents, and we compare the performance against SGRLD [26]. In order to keep the comparison fair, we adopt exactly the same setting as in [26], including the model parameters, the batch-size, the Gibbs sampler steps, etc. See Sections 4 and 5 of [26] for omitted details.

Another state-of-the-art first-order algorithm for LDA is the SGRHMC of [21], for which we skip the implementation, as it is not known how the B̂_t was chosen in [21]. Instead, we repeat the same experimental setting as [21] and directly compare our results against the ones reported in [21]. See Appendix H.2 for the comparison against SGRHMC.

5.2.2 A Numerical Trick and the SMLD-approximate Algorithm

A major drawback of the SMLD in practice is that the stochastic gradients (4.4) involve exponential functions, which are unstable for large-scale problems. For instance, in Python, np.exp(800) = inf, whereas the relevant variable regime in this experiment extends to 1600. To resolve such numerical issues, we appeal to the linear approximation⁴ exp(y) ≈ max{0, 1 + y}. Admittedly, our theory no longer holds under such numerical tricks, and we shall not claim that our algorithm is provably convergent for LDA. Instead, the contribution of MLD here is to identify the dual dynamics associated with (3.7), which would have been otherwise difficult to perceive. 
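A minimal sketch of this substitution (function names are ours; this is the heuristic itself, not a provably convergent scheme):

```python
import numpy as np

def approx_exp(y):
    # Linear surrogate used in place of exp(y): max(0, 1 + y) never overflows.
    return np.maximum(0.0, 1.0 + y)

def approx_softmax_term(y):
    # The factor e^{y_l} / (1 + Σ_k e^{y_k}) from (4.4), with the surrogate
    # substituted; well-defined even when y is on the order of 1e3.
    z = approx_exp(y)
    return z / (1.0 + z.sum())

y = np.array([800.0, -2.0, 1600.0])    # np.exp(800) would already overflow
s = approx_softmax_term(y)
assert np.isfinite(s).all() and s.sum() < 1
```
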
We name the resulting algorithm "SMLD-approximate" to indicate its heuristic nature.

5.2.3 Results

Figure 1(b) reports the perplexity on the test data up to 100,000 documents, with the five best step-sizes we found via grid search for SMLD-approximate. For SGRLD, we use the best step-sizes reported in [26].

From the figure, we can see a clear improvement, both in terms of convergence speed and saturation level, of SMLD-approximate over SGRLD. One plausible explanation for this phenomenon is that our MLD, as a simple unconstrained Langevin Dynamics, is less sensitive to discretization. On the other hand, the underlying dynamics of SGRLD is a more sophisticated Riemannian diffusion, which requires finer discretization than MLD to achieve the same level of approximation to the original continuous-time dynamics, and this is true even in the presence of noisy gradients and our numerical heuristics.

⁴One can also use a higher-order Taylor approximation of exp(y), or add a small threshold, exp(y) ≈ max{ε, 1 + y}, to prevent the iterates from going to the boundary. In practice, we observe that these variants do not make a huge impact on performance.

Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement n° 725594 - time-data).

References

[1] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, pages 1771–1778, 2012.

[2] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[3] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. 
Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[4] Nicolas Brosse, Alain Durmus, Éric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 319–342. PMLR, 07–10 Jul 2017.

[5] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. arXiv preprint arXiv:1507.02564, 2015.

[6] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2278–2286, 2015.

[7] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.

[8] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, pages 186–211. PMLR, 07–09 Apr 2018.

[9] Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.

[10] Bo Dai, Niao He, Hanjun Dai, and Le Song. Provable Bayesian inference via particle mirror descent. In Artificial Intelligence and Statistics, pages 985–994, 2016.

[11] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.

[12] Arnak S Dalalyan and Avetik G Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.

[13] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.

[14] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. arXiv preprint arXiv:1802.09188, 2018.

[15] Alain Durmus, Umut Simsekli, Eric Moulines, Roland Badeau, and Gaël Richard. Stochastic gradient Richardson-Romberg Markov chain Monte Carlo. In Advances in Neural Information Processing Systems, pages 2047–2055, 2016.

[16] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! arXiv preprint arXiv:1801.02309, 2018.

[17] Walid Krichene and Peter L Bartlett. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pages 6799–6809, 2017.

[18] Shiwei Lan and Babak Shahbaba. Sampling constrained probability distributions using spherical augmentation. In Algorithmic Advances in Riemannian Geometry and Applications, pages 25–71. Springer, 2016.

[19] Chang Liu, Jun Zhu, and Yang Song. Stochastic gradient geodesic MCMC methods. In Advances in Neural Information Processing Systems, pages 3009–3017, 2016.

[20] Tung Luu, Jalal Fadili, and Christophe Chesneau. Sampling from non-smooth distribution through Langevin diffusion. 2017.

[21] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.

[22] Benoit B Mandelbrot. The fractal geometry of nature, volume 173. WH Freeman, New York, 1983.

[23] Robert J McCann. Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2):309–324, 1995.

[24] Panayotis Mertikopoulos and Mathias Staudigl. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, 2018.

[25] AS Nemirovsky and DB Yudin. Problem complexity and method efficiency in optimization. 1983.

[26] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.

[27] Maxim Raginsky and Jake Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 6793–6800. IEEE, 2012.

[28] Ralph Tyrrell Rockafellar. Convex analysis. Princeton University Press, 1970.

[29] Umut Simsekli, Roland Badeau, Taylan Cemgil, and Gaël Richard. Stochastic quasi-Newton Langevin Monte Carlo. In International Conference on Machine Learning, pages 642–651, 2016.

[30] Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.

[31] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[32] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

[33] Pan Xu, Tianhao Wang, and Quanquan Gu. Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1087–1096, 2018.