{"title": "Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates", "book": "Advances in Neural Information Processing Systems", "page_first": 6653, "page_last": 6664, "abstract": "We propose a new algorithm---Stochastic Proximal Langevin Algorithm (SPLA)---for sampling from a log concave distribution. Our method is a generalization of the Langevin algorithm to potentials expressed as the sum of one stochastic smooth term and multiple stochastic nonsmooth terms. In each iteration, our splitting technique only requires access to a stochastic gradient of the smooth term and a stochastic proximal operator for each of the nonsmooth terms. We establish nonasymptotic sublinear and linear convergence rates under convexity and strong convexity of the smooth term, respectively, expressed in terms of the KL divergence and Wasserstein distance. We illustrate the efficiency of our sampling technique through numerical simulations on a Bayesian learning task.", "full_text": "Stochastic Proximal Langevin Algorithm:\n\nPotential Splitting and Nonasymptotic Rates\n\nAdil Salim\n\nDmitry Kovalev\n\nPeter Richt\u00e1rik\u2217\n\nKing Abdullah University of Science and Technology, Thuwal, Saudi Arabia\n\nAbstract\n\nWe propose a new algorithm\u2014Stochastic Proximal Langevin Algorithm\n(SPLA)\u2014for sampling from a log concave distribution. Our method is a gen-\neralization of the Langevin algorithm to potentials expressed as the sum of one\nstochastic smooth term and multiple stochastic nonsmooth terms. In each itera-\ntion, our splitting technique only requires access to a stochastic gradient of the\nsmooth term and a stochastic proximal operator for each of the nonsmooth terms.\nWe establish nonasymptotic sublinear and linear convergence rates under convex-\nity and strong convexity of the smooth term, respectively, expressed in terms of\nthe KL divergence and Wasserstein distance. 
We illustrate the efficiency of our sampling technique through numerical simulations on a Bayesian learning task.

1 Introduction

Many applications in the field of Bayesian machine learning require sampling from a probability distribution µ⋆ with density µ⋆(x), x ∈ Rd. Due to their scalability, Markov Chain Monte Carlo (MCMC) methods such as Langevin Monte Carlo [48] or Hamiltonian Monte Carlo [28] are popular algorithms for solving such problems. Monte Carlo methods typically generate a sequence of random variables (x^k)_{k≥0} with the property that the distribution of x^k approaches µ⋆ as k grows.

While the theory of MCMC algorithms has remained mainly asymptotic, in recent years the exploration of non-asymptotic properties of such algorithms has led to a renaissance in the field [14, 26, 39, 15, 16, 19, 22, 12, 53, 10, 52]. In particular, if µ⋆(x) ∝ exp(−U(x)), where U is a smooth convex function, [14, 19] provide explicit convergence rates for the Langevin algorithm (LA)

    x^{k+1} = x^k − γ∇U(x^k) + √(2γ) W^k,

where γ > 0 and (W^k)_{k≥0} is a sequence of i.i.d. standard Gaussian random variables. The function U, also called the potential, enters the algorithm through its gradient.

In optimization, the problem min U where U is composite, i.e., a sum of nonsmooth terms which must be handled separately, has many instances; see [17, Section 2]. These optimization problems can be seen as a Maximum A Posteriori (MAP) computation in some Bayesian model. Sampling a posteriori in these models allows for better Bayesian inference [20]. In these cases, the task of sampling a posteriori takes the form of sampling from the target distribution µ⋆, where U has a composite form.

In this work we study the setting where the potential U is the sum of a single smooth and a potentially large number of nonsmooth convex functions.
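For concreteness, the plain LA iteration above can be sketched in a few lines of Python. This is an illustrative rendering only; the quadratic test potential U(x) = ‖x‖²/2, the step size, and all names are our own choices, not the authors' code:

```python
import numpy as np

def langevin_step(x, grad_U, gamma, rng):
    """One LA update: x - gamma * grad U(x) + sqrt(2 * gamma) * W."""
    return x - gamma * grad_U(x) + np.sqrt(2 * gamma) * rng.standard_normal(x.shape)

# Toy example: U(x) = ||x||^2 / 2, so the target mu* is a standard Gaussian.
rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(1000):
    x = langevin_step(x, lambda v: v, gamma=0.1, rng=rng)
```

After many iterations, x is (approximately) a draw from the stationary distribution of the discretized Langevin diffusion.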
In particular, we consider the problem

    Sample from µ⋆(x) ∝ exp(−U(x)), where U(x) := F(x) + Σ_{i=1}^n G_i(x),    (1)

∗Also affiliated with Moscow Institute of Physics and Technology, Dolgoprudny, Russia.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Complexity results obtained in Corollaries 2, 3 and 4 of our main result (Theorem 1).

F                  | Stepsize γ | Rate                                                                              | Theorem
convex             | O(ε)       | KL(µ_{x̂^k} | µ⋆) ≤ (1/(2γ(k+1))) W²(µ_{x⁰}, µ⋆) + O(γ)                           | Cor 2
α-strongly convex  | O(εα)      | W²(µ_{x^k}, µ⋆) ≤ (1 − γα)^k W²(µ_{x⁰}, µ⋆) + O(γ/α)                              | Cor 3
α-strongly convex  | O(εα)      | KL(µ_{x̃^k} | µ⋆) ≤ α(1 − γα)^{k+1} W²(µ_{x⁰}, µ⋆) + O(γ)                         | Cor 4

where F : Rd → R is a smooth convex function and G_1, . . . , G_n : Rd → R are (possibly nonsmooth) convex functions. The additive model for U offers ample flexibility, as typically there are multiple decompositions of U in the form (1).

2 Contributions

We now briefly comment on some of the key contributions of this work.

⋄ A splitting technique for the Langevin algorithm. We propose a new variant of LA for solving (1), which we call the Stochastic Proximal Langevin Algorithm (SPLA). We assume that F and the G_i can be written as expectations over some simpler functions f(·, ξ) and g_i(·, ξ):

    F(x) = E_ξ(f(x, ξ))  and  G_i(x) = E_ξ(g_i(x, ξ)).    (2)

SPLA (see Algorithm 1 in Section 4) only requires access to the gradient of f(·, ξ) and to proximity operators of the functions g_i(·, ξ). SPLA can be seen as a Langevin version of the stochastic Passty algorithm [30, 36].
To the best of our knowledge, this is the first time a splitting technique that involves multiple (stochastic) proximity operators is used in a Langevin algorithm.

Remarks: Current forms of LA tackle problem (1) using stochastic subgradients [18]. If n = 1 and G_1 is proximable (i.e., the learner has access to the full proximity operator of G_1), it has recently been proposed to use proximity operators instead of (sub)gradients [20, 18], as is done in the optimization literature [29, 2]. Indeed, in this case, the proximal stochastic gradient method is an efficient method to minimize U. If n > 1 and the functions G_i are proximable (but not U), the minimization of U is usually tackled using the operator splitting framework: the (stochastic) three-operator splitting [51, 17] or (stochastic) primal-dual algorithms [13, 46, 9, 35]. These algorithms involve the computation of (stochastic) gradients and (full) proximity operators and enjoy numerical stability properties. However, proximity operators are sometimes difficult to implement. In this case, stochastic proximity operators are cheaper² than full proximity operators and numerically more stable than stochastic subgradients for handling nonsmooth [32, 31, 4, 5, 6] but also smooth [43] terms. In this paper, we bring together the advantages of operator splitting and stochastic proximity operators for sampling purposes.

⋄ Theory. We perform a nonasymptotic convergence analysis of SPLA. Our main result, Theorem 1, gives a tractable recursion involving the Kullback-Leibler divergence and Wasserstein distance (when U is strongly convex) between µ⋆ and the distribution of certain samples generated by our method. We use this result to show that the KL divergence is lower than ε after O(1/ε²) iterations if the constant stepsize γ = O(ε) is used (Corollary 2). Assuming F is α-strongly convex, we show that the Wasserstein distance (resp. the KL divergence) decreases exponentially, up to an oscillation region of size O(γ/α) (resp. O(γ)), as shown in Corollary 3 (resp. Corollary 4). If we wish to push the Wasserstein distance below ε (resp. the KL divergence below αε), this can be achieved by setting γ = O(εα), and it is sufficient to take O(1/ε log 1/ε) iterations. These results are summarized in Table 1. The obtained convergence rates match the previously known results obtained in simpler settings [18]. Note that convergence rates of optimization methods involving multiple stochastic proximity operators have not been established yet.

²See www.proximity-operator.net

Remarks: Our proof technique is inspired by [38], which is itself based on [18]. In [38], the authors consider the n = 1 case and assume that the smooth function F is proximable. In [18], a proximal stochastic (sub)gradient Langevin algorithm is studied. In this paper, convergence rates are established by showing that the probability distributions of the iterates shadow some discretized gradient flow defined on a measure space. Hence, our work is a contribution to recent efforts to understand the Langevin algorithm as an optimization algorithm in a space of probability measures [26, 49, 3].

⋄ Online setting. In online settings, U is unknown but revealed across time. Our approach provides a reasonable algorithm for such situations, especially in cases when the information revealed about U is stationary in time. In particular, this includes online Bayesian learning with structured priors or nonsmooth log likelihood [50, 23, 40, 47]. In this context, the learner is required to sample from some posterior distribution µ⋆ that takes the form (1) where F, G_1, . . . , G_n are intractable. However, these functions can be cheaply sampled, or are revealed across time through i.i.d. streaming data.

⋄ Simulations.
We illustrate the promise of our approach numerically by performing experiments with SPLA. We first apply SPLA to a stochastic and nonsmooth toy model with a ground truth. Then, we consider the problem of sampling from the posterior distribution in the Graph Trend Filtering context [47]. For this nonsmooth large-scale simulation problem, SPLA performs better than the state-of-the-art method that uses stochastic subgradients instead of stochastic proximity operators. Indeed, in the optimization literature [2], proximity operators are already known to be more stable than subgradients.

3 Technical Preliminaries

In this section, we recall certain notions from convex analysis and probability theory which are key to the developments in this paper, state our main assumptions, and introduce needed notation.

3.1 Subdifferential, minimal section and proximity operator

Given a convex function g : Rd → R, its subdifferential at x, ∂g(x), is the set

    ∂g(x) := {d ∈ Rd : g(x) + ⟨d, y − x⟩ ≤ g(y) for every y ∈ Rd}.

Since ∂g(x) is a nonempty closed convex set [2], the projection of 0 onto ∂g(x), i.e., the least-norm element of the set ∂g(x), is well defined, and we call this element ∇⁰g(x). The function ∇⁰g : Rd → Rd is called the minimal section of ∂g. The proximity operator associated with g is the mapping prox_g : Rd → Rd defined by

    prox_g(x) := arg min_{y∈Rd} { (1/2)‖x − y‖² + g(y) }.

Due to its implicit definition, prox_g can be hard to evaluate.

3.2 Stochastic structure of F and G_i: integrability, smoothness and convexity

Here we detail the assumptions behind the stochastic structure (2) of the functions F = E_ξ(f(x, ξ)) and G_i = E_ξ(g_i(x, ξ)) defining the potential U. Let (Ω, F, P) be a probability space and denote by E the mathematical expectation and by V the variance.
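As a concrete instance of the proximity operator defined above (our own example, not from the paper), for g(y) = λ|y| on R the prox has the well-known closed-form soft-thresholding solution, which a few lines of Python can check against a brute-force minimization of the defining objective:

```python
import numpy as np

def prox_abs(x, lam):
    """prox of g(y) = lam * |y|: soft-thresholding, sign(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Brute-force check of the prox definition (1/2)(x - y)^2 + lam * |y| on a grid.
x, lam = 1.3, 0.5
grid = np.linspace(-3, 3, 60001)
objective = 0.5 * (x - grid) ** 2 + lam * np.abs(grid)
assert abs(grid[np.argmin(objective)] - prox_abs(x, lam)) < 1e-3
```

For most g no such closed form exists, which is exactly why the implicit definition makes prox_g hard to evaluate in general.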
Consider a random variable ξ from Ω to another probability space (Ξ, G) with distribution µ.

Assumption 1 (Integrability). The functions f : Rd × Ξ → R and g_i : Rd × Ξ → R, i = 1, . . . , n, are µ-integrable for every x ∈ Rd.

Furthermore, we will make the following convexity and smoothness assumptions.

Assumption 2 (Convexity and differentiability). The function f(·, s) is convex and differentiable for every s ∈ Ξ. The functions g_i(·, s) are convex for every i ∈ {1, 2, . . . , n}.

The gradient of f(·, s) is denoted ∇f(·, s), the subdifferential of g_i(·, s) is denoted ∂g_i(·, s), and its minimal section is denoted ∇⁰g_i(·, s). Under Assumption 2, it is known that F is convex and differentiable and that ∇F(x) = E_ξ(∇f(x, ξ)) [34]. Next, we assume that F is smooth and α-strongly convex. However, we allow α = 0 if F is not strongly convex. We will only assume that α > 0 in Corollaries 3 and 4.

Assumption 3 (Convexity and smoothness of F). The gradient of F is L-Lipschitz continuous, where L ≥ 0. Moreover, F is α-strongly convex, where α ≥ 0.

Under Assumption 2, the second part of the above holds for α = 0. Finally, we will introduce two noise conditions on the stochastic (sub)gradients of f(·, s) and g_i(·, s).

Assumption 4 (Bounded variance of ∇f(x, ·)). There exists σ_F ≥ 0 such that V_ξ(‖∇f(x, ξ)‖) ≤ σ_F² for every x ∈ Rd.

Assumption 5 (Bounded second moment of ∇⁰g_i(x, ·)). For every i ∈ {1, 2, . . .
, n}, there exists L_{G_i} ≥ 0 such that E_ξ(‖∇⁰g_i(x, ξ)‖²) ≤ L_{G_i}² for every x ∈ Rd.

Note that if g_i(·, s) is ℓ_i(s)-Lipschitz continuous for every s ∈ Ξ, and if ℓ_i(s) is µ-square-integrable, then Assumption 5 holds.

3.3 KL divergence, entropy and potential energy

Recall from (1) that U := F + Σ_{i=1}^n G_i and assume that ∫ exp(−U(x)) dx < ∞. Our goal is to sample from the unique distribution µ⋆ over Rd with density µ⋆(x) (w.r.t. the Lebesgue measure, denoted L) proportional to exp(−U(x)), for which we write µ⋆(x) ∝ exp(−U(x)). The closeness between the samples of our algorithm and the target distribution µ⋆ will be evaluated in terms of information-theoretic and optimal-transport-theoretic quantities.

Let B(Rd) be the Borel σ-field of Rd. Given two nonnegative measures µ and ν on (Rd, B(Rd)), we write µ ≪ ν if µ is absolutely continuous w.r.t. ν, and denote dµ/dν its density. The Kullback-Leibler (KL) divergence between µ and ν, KL(µ | ν), quantifies the closeness between µ and ν. If µ ≪ ν, then the KL divergence is defined by

    KL(µ | ν) := ∫ log( (dµ/dν)(x) ) dµ(x),

and otherwise we set KL(µ | ν) = +∞. Up to an additive constant, KL(· | µ⋆) can be seen as the sum of two terms [37]: the entropy H(µ) and the potential energy E_U(µ). The entropy of µ is given by H(µ) := KL(µ | L), and the potential energy of µ is defined by E_U(µ) := ∫ U dµ(x).

3.4 Wasserstein distance

Although the KL divergence is equal to zero if and only if µ = ν, it is not a mathematical distance (metric).
The Wasserstein distance, defined below, metrizes the space P₂(Rd) of probability measures over Rd with a finite second moment. Consider µ, ν ∈ P₂(Rd). A transference plan of (µ, ν) is a probability measure υ over (Rd × Rd, B(Rd × Rd)) with marginals µ, ν: for every A ∈ B(Rd), υ(A × Rd) = µ(A) and υ(Rd × A) = ν(A). In particular, the product measure µ ⊗ ν is a transference plan. We denote by Γ(µ, ν) the set of transference plans. A coupling of (µ, ν) is a random variable (X, Y) over some probability space with values in (Rd × Rd, B(Rd × Rd)) (i.e., X and Y are random variables with values in Rd) such that the distribution of X is µ and the distribution of Y is ν. In other words, (X, Y) is a coupling of (µ, ν) if the distribution of (X, Y) is a transference plan of (µ, ν). The Wasserstein distance of order 2 between µ and ν is defined by

    W²(µ, ν) := inf { ∫_{Rd×Rd} ‖x − y‖² dυ(x, y) : υ ∈ Γ(µ, ν) }.

One can see that W²(µ, ν) = inf E(‖X − Y‖²), where the inf is taken over all couplings (X, Y) of (µ, ν) defined on some probability space with expectation E.

4 The SPLA Algorithm and its Convergence Rates

4.1 The algorithm

To solve the sampling problem (1), our Stochastic Proximal Langevin Algorithm (SPLA) generates a sequence of random variables (x^k)_{k≥0} from (Ω, F, P) to (Rd, B(Rd)) defined as follows:

    z^k = x^k − γ∇f(x^k, ξ^k)
    y^k_0 = z^k + √(2γ) W^k
    y^k_i = prox_{γ g_i(·, ξ^k)}(y^k_{i−1})   for i = 1, . . . , n
    x^{k+1} = y^k_n,

where (W^k)_{k≥0} is a sequence of i.i.d. standard Gaussian random variables, (ξ^k)_{k≥0} is a sequence of i.i.d. copies of ξ, and γ > 0 is a positive step size.
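The recursion above can be sketched in Python. This is an illustrative rendering under our own assumptions (a quadratic f and a single absolute-value g with closed-form prox), not the authors' implementation:

```python
import numpy as np

def spla_step(x, grad_f, proxes, gamma, rng):
    """One SPLA iteration: stochastic gradient step, Langevin noise,
    then the stochastic proximity operators applied sequentially."""
    z = x - gamma * grad_f(x)                                   # z^k
    y = z + np.sqrt(2 * gamma) * rng.standard_normal(x.shape)   # y^k_0
    for prox in proxes:                                         # y^k_i, i = 1..n
        y = prox(y, gamma)
    return y                                                    # x^{k+1}

# Toy instance: f(x) = ||x||^2 / 2 and a single g_1(y) = |y|,
# whose prox is soft-thresholding.
rng = np.random.default_rng(0)
soft = lambda y, g: np.sign(y) * np.maximum(np.abs(y) - g, 0.0)
x = np.zeros(1)
for _ in range(500):
    x = spla_step(x, lambda v: v, [soft], gamma=0.05, rng=rng)
```

Note how the n prox calls are chained in order, matching the sequential structure of the y^k_i updates.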
Our SPLA method is formalized as Algorithm 1; its steps are explained therein.

Algorithm 1 Stochastic Proximal Langevin Algorithm (SPLA)

  Initialize: x⁰ ∈ Rd
  for k = 0, 1, 2, . . . do
    Sample random ξ^k    ⊲ used for stoch. approximation: F ≈ f(·, ξ^k) and G_i ≈ g_i(·, ξ^k)
    z^k = x^k − γ∇f(x^k, ξ^k)    ⊲ a stochastic gradient descent step in F
    Sample random W^k    ⊲ a standard Gaussian vector in Rd
    y^k_0 = z^k + √(2γ) W^k    ⊲ a Langevin step w.r.t. F
    for i = 1, . . . , n do
      y^k_i = prox_{γ g_i(·, ξ^k)}(y^k_{i−1})    ⊲ prox step to handle the term G_i(·) = E_ξ g_i(·, ξ)
    end for
    x^{k+1} = y^k_n    ⊲ the final SPLA step, accounting for F and G_1, G_2, . . . , G_n
  end for

4.2 Main theorem

We now state our main results in terms of the Kullback-Leibler divergence and the Wasserstein distance. We denote by µ_x the distribution of every random variable x defined on (Ω, F, P).

Theorem 1. Let Assumptions 1–5 hold and assume that γ ≤ 1/L. There exists C ≥ 0 such that

    2γ KL(µ_{y^k_0} | µ⋆) ≤ (1 − γα)W²(µ_{x^k}, µ⋆) − W²(µ_{x^{k+1}}, µ⋆) + γ²(2σ_F² + 2Ld + C).    (3)

The constant C can be expressed as a linear combination of L²_{G_1}, . . . , L²_{G_n} with integer coefficients. Moreover, if n = 2, then C := 2(L²_{G_1} + L²_{G_2}). More generally, if for every i ∈ {2, . . . , n}, g_i(·, ξ) admits almost surely the representation g_i(·, ξ) = g̃_i(·, ξ_i), where ξ_2, . . . , ξ_n are independent random variables, then C := Σ_{i=1}^n L²_{G_i}.

Proof. A full proof can be found in the Supplementary material. We only sketch the main steps here. For every µ-integrable function g : Rd → R, we denote E_g(µ) = ∫ g dµ. Moreover, we denote F = E_U + H.
First, using [18, Lemma 1], µ⋆ ∈ P₂(Rd), E_U(µ⋆), H(µ⋆) < ∞, and if µ ∈ P₂(Rd), then

    KL(µ | µ⋆) = E_U(µ) + H(µ) − (E_U(µ⋆) + H(µ⋆)) = F(µ) − F(µ⋆),

provided that E_U(µ) < ∞. Then, we decompose E_U(µ) = E_F(µ) + E_G(µ), where G = Σ_{i=1}^n G_i. Using [18] again, we can establish the inequality

    2γ[H(µ_{y^k_0}) − H(µ⋆)] ≤ W²(µ_{z^k}, µ⋆) − W²(µ_{y^k_0}, µ⋆).    (4)

Then, if γ ≤ 1/L we obtain, for every random variable a with distribution µ⋆,

    E[‖z^k − a‖²] ≤ (1 − γα)E[‖x^k − a‖²] + 2γ[E_F(µ⋆) − E_F(µ_{z^k})] + 2γ²σ_F²,    (5)

using standard computations regarding the stochastic gradient descent algorithm. Using the smoothness of F and the definition of the Wasserstein distance, this implies

    2γ[E_F(µ_{y^k_0}) − E_F(µ⋆)] ≤ (1 − γα)W²(µ_{x^k}, µ⋆) − W²(µ_{z^k}, µ⋆) + γ²(2σ_F² + 2Ld).

It remains to establish 2γ[E_G(µ_{y^k_0}) − E_G(µ⋆)] ≤ W²(µ_{y^k_0}, µ⋆) − W²(µ_{x^{k+1}}, µ⋆) + γ²C, which is the main technical challenge of the proof. This is done using the frameworks of Yosida approximation of random subdifferentials and Moreau regularizations of random convex functions [2]. Equation (3) is obtained by summing the obtained inequalities.

4.3 Link with Wasserstein Gradient Flows

Equation (3) is reminiscent of the fact that SPLA shadows the gradient flow of KL(· | µ⋆) in the metric space (P₂(Rd), W). To see this, first consider the gradient flow associated to F.
By definition, it is the flow of the differential equation [7]

    (d/dt) x(t) = −∇F(x(t)),    t > 0.    (6)

The function x can alternatively be defined as a solution of the variational inequalities

    2(F(x(t)) − F(a)) ≤ −(d/dt)‖x(t) − a‖²,    t > 0, ∀a ∈ Rd.    (7)

The iterates (u^k)_{k≥0} of the stochastic gradient descent (SGD) algorithm applied to F can be seen as a (noisy) Euler discretization of (6) with a step size γ > 0. This idea has been used successfully in the stochastic approximation literature [33, 24]. The analogy goes further, since a fundamental inequality used to analyze SGD applied to F is ([27])

    2γE(F(u^{k+1}) − F(a)) ≤ E‖u^k − a‖² − E‖u^{k+1} − a‖² + γ²K,    k ≥ 0,

where K ≥ 0 is some constant, which can be seen as a discrete counterpart of (7). Note that this inequality is similar to (5), which is used in the proof of Theorem 1.

In optimal transport theory, the point of view of (7) is used to define the gradient flow of a (geodesically) convex function F defined on P₂(Rd) (see [37] or [1, Page 280]). Indeed, the gradient flow (ν_t)_{t≥0} of F in the space (P₂(Rd), W) satisfies, for every t > 0 and µ ∈ P₂(Rd),

    2(F(ν_t) − F(µ)) ≤ −(d/dt) W²(ν_t, µ),    (8)

which can be seen as a continuous-time counterpart of Equation (3) by setting F = KL(· | µ⋆). Furthermore, Equation (4) in the proof of Theorem 1 is also related to (8). It is obtained by applying Equation (8) with F = H and ν₀ = µ_{z^k} (see e.g. [18, Lemma 5]).

4.4 Explicit convergence rates for convex and strongly convex F

Corollaries 2, 3 and 4 below are obtained by unrolling the recursion provided by Theorem 1. The results are summarized in Table 1.

Corollary 2 (Convex F).
Consider a sequence of independent random variables (j_k)_{k≥0} such that (j_k)_{k≥0} is independent of (W^k)_k and (ξ^k)_k, and the distribution of j_k is uniform over {0, . . . , k}. Denote x̂^k = y^{j_k}_0. If γ ≤ 1/L, then

    KL(µ_{x̂^k} | µ⋆) ≤ (1/(2γ(k+1))) W²(µ_{x⁰}, µ⋆) + (γ/2)(2σ_F² + 2Ld + C).

Hence, given any ε > 0, choosing stepsize γ = min{1/L, ε/(2σ_F² + 2Ld + C)} and a number of iterations

    k + 1 ≥ max{L/ε, (2σ_F² + 2Ld + C)/ε²} W²(µ_{x⁰}, µ⋆)

implies KL(µ_{x̂^k} | µ⋆) ≤ ε.

Corollary 3 (Strongly convex F). If α > 0 and γ ≤ 1/L, then

    W²(µ_{x^k}, µ⋆) ≤ (1 − γα)^k W²(µ_{x⁰}, µ⋆) + γ(2σ_F² + 2Ld + C)/α.

Hence, given any ε > 0, choosing stepsize γ = min{1/L, εα/(2(2σ_F² + 2Ld + C))} and a number of iterations

    k ≥ max{L/α, 2(2σ_F² + 2Ld + C)/(εα²)} log(2W²(µ_{x⁰}, µ⋆)/ε)

implies W²(µ_{x^k}, µ⋆) ≤ ε.

Corollary 4 (Strongly convex F). Consider a sequence of independent random variables (j_k)_{k≥0} such that (j_k)_k is independent of (W^k)_k and (ξ^k)_k. Assume that the distribution of j_k is geometric over {0, . . . , k}: P(j_k = r) ∝ (1 − γα)^{−r}. Denote x̃^k = x^{j_k}.
If α > 0 and γ ≤ 1/L, then

    KL(µ_{x̃^k} | µ⋆) ≤ (α W²(µ_{x⁰}, µ⋆)/2) · (1 − γα)^{k+1}/(1 − (1 − γα)^{k+1}) + (γ/2)(2σ_F² + 2Ld + C).

Hence, given any ε > 0, choosing stepsize γ = min{1/L, εα/(2σ_F² + 2Ld + C)} and a number of iterations

    k ≥ max{L/α, (2σ_F² + 2Ld + C)/(εα²)} log(2 max{1, W²(µ_{x⁰}, µ⋆)/ε})

implies KL(µ_{x̃^k} | µ⋆) ≤ αε.

We can compare these bounds with those of [18]. First, in the particular case n = 1 and g_1(·, s) ≡ G_1, SPLA boils down to the algorithm of [18, Section 4.2]; Corollary 2 matches exactly [18, Corollary 18] and Corollary 3 matches [18, Corollary 22]. To our knowledge, Corollary 4 has no counterpart in the literature. We now focus on the case F ≡ 0 and n = 1 of SPLA, as it concentrates the innovations of our paper. In this case, L = 0 and σ_F = 0. Compared to the Stochastic Subgradient Langevin Algorithm (SSLA) [18, Section 4.1], Corollary 2 matches [18, Corollary 14].

5 Numerical experiments

5.1 Simulations using a ground truth

We first concentrate on the case F ≡ 0 and n = 1. Let U = |x| = E_ξ(|x| + xξ) (i.e., g_1(x, s) = |x| + xs), where ξ is standard Gaussian. The target µ⋆ ∝ exp(−U) is a standard Laplace distribution on R. In this case, L = α = σ_F = 0 and C = L²_{G_1} = 2. We shall illustrate the bound on KL(µ_{x̂^k} | µ⋆) (Corollary 2 for SPLA and [18, Corollary 14] for SSLA) for both algorithms using histograms. Note that the distribution µ_{x̂^k} of x̂^k is a (deterministic) mixture of the µ_{x^j}: µ_{x̂^k} = (1/k) Σ_{j=1}^k µ_{x^j}.
Using Pinsker's inequality, we can bound the total variation distance between µ_{x̂^k} and µ⋆ from the bound on KL, and illustrate this with histograms. In Figure 1, we take γ = 10 and do 10⁵ iterations of both algorithms. Note that here the complexity of one iteration of SPLA or SSLA is the same.

Figure 1: Comparison between histograms of SPLA and SSLA and the true density 0.5 exp(−|x|).

One can see that SPLA enjoys the well-known advantages of stochastic proximal methods [43]: precision, numerical stability (fewer outliers), and robustness to step size.

Algorithm 2 SPLA for the Graph Trend Filtering

  Initialize: x⁰ ∈ R^V
  for k = 0, 1, 2, . . . do
    z^k = x^k − (γ/σ²)(x^k − Y)
    Sample random W^k    ⊲ standard Gaussian vector in R^V
    y^k_0 = z^k + √(2γ) W^k
    for i = 1, . . . , n do
      Sample uniform random edges e_i
      y^k_i = prox_{γ g_{e_i}}(y^k_{i−1})
    end for
    x^{k+1} = y^k_n
  end for

5.2 Application to Trend Filtering on Graphs

In this section we consider the following Bayesian point of view on trend filtering on graphs [42]. Consider a finite undirected graph G = (V, E), where V is the set of vertices and E is the set of edges. Denote d the cardinality of V and |E| the cardinality of E. A realization of a random vector Y ∈ R^V is observed. In a Bayesian framework, the distribution of Y is parametrized by a vector X ∈ R^V which is itself random and whose distribution p is proportional to exp(−λ TV(x, G)), where λ > 0 is a scaling parameter and where, for every x ∈ R^V,

    TV(x, G) = Σ_{i,j∈V, {i,j}∈E} |x(i) − x(j)|

is the Total Variation regularization over G. The goal is to learn X after an observation of Y. The paper [47] considers the case where the distribution of Y given X (a.k.a. the likelihood) is proportional to exp(−(1/(2σ²))‖X − y‖²), where σ ≥ 0 is another scaling parameter.
In other words, the distribution of Y given X is N(X, σ²I), a normal distribution centered at X with covariance σ²I (where I is the d × d identity matrix). Denoting by

    π(x | y) ∝ exp(−U(x)),    U(x) = (1/(2σ²))‖x − y‖² + λ TV(x, G),

the posterior distribution of X given Y, the maximum a posteriori estimator in this Bayesian framework is called the Graph Trend Filtering estimate [47]. It can be written

    x⋆ = arg max_{x∈R^V} π(x | Y) = arg min_{x∈R^V} (1/(2σ²))‖x − Y‖² + λ TV(x, G).

Although maximum a posteriori estimators carry some information, they are not able to capture uncertainty in the learned parameters. Samples a posteriori provide a better understanding of the posterior distribution and allow one to compute other Bayesian estimates such as confidence intervals. This helps to avoid overfitting, among other things. In our context, sampling a posteriori requires sampling from the target distribution µ⋆(x) = π(x | Y).

In the case where G is a 2D grid (which can be identified with an image), the proximity operator of TV(·, G) can be computed using a subroutine [8], and the proximal stochastic gradient Langevin algorithm can be used to sample from π(· | Y) [20, 18]. However, on a large/general graph, the proximity operator of TV(·, G) is hard to evaluate [41, 36]. Since TV(·, G) is written as a sum, we shall instead select a batch of random edges and compute the proximity operators over these randomly chosen edges. More precisely, we write the potential U defining π(x | Y) in the form (1) by setting

    U(x) = F(x) + Σ_{i=1}^n G_i(x),    F(x) = (1/(2σ²))‖x − Y‖²,    G_i(x) = λ(|E|/n) E_{e_i}(|x(v_i) − x(w_i)|),

where for every i ∈ {1, . . . , n}, e_i = {v_i, w_i} ∈ E is a uniform random edge and the e_i are independent.
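Note that the prox over a single edge term only touches the two endpoints of the sampled edge and has a closed form (a standard fused-lasso-type computation). A hedged Python sketch of this step, under our own derivation for the prox of c·|x(v) − x(w)| (not code from the paper):

```python
import numpy as np

def prox_edge(x, v, w, c):
    """prox of g(x) = c * |x[v] - x[w]|: shrink the difference across the
    edge, keeping the mean of the two endpoint values fixed."""
    y = x.copy()
    delta = x[v] - x[w]
    # Each endpoint moves toward the other by at most c; if the gap is
    # small enough (|delta| <= 2c), the two values merge at their mean.
    shift = np.sign(delta) * min(c, abs(delta) / 2.0)
    y[v] -= shift
    y[w] += shift
    return y

x = np.array([3.0, 0.0, 1.0])
y = prox_edge(x, 0, 1, c=0.5)   # gap of 3.0 across edge {0, 1} shrinks to 2.0
```

All other coordinates are untouched, which is what makes this stochastic prox so much cheaper than the full prox of TV(·, G).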
For every edge e = {v, w} ∈ E (where v, w are vertices), denote g_e(x) = λ(|E|/n)|x(v) − x(w)| and note that G_i(x) = E_{e_i}(g_{e_i}(x)). The parameter n can be seen as a batch parameter: Σ_{i=1}^n g_{e_i}(x) is an unbiased approximation of TV(x, G). Also note that we set f(·, s) ≡ F. The SPLA applied to sample from π(· | Y) is presented as Algorithm 2. In our simulations, SPLA is compared to two different versions of the Langevin algorithm. In the Stochastic Subgradient Langevin Algorithm (SSLA) [18], stochastic subgradients of the g_{e_i} are used instead of stochastic proximity operators. In the Proximal Langevin Algorithm (ProxLA) [18], the full proximity operator of Σ_{i=1}^n G_i is computed using a subroutine. As mentioned in [36, 47], we use the gradient algorithm for the dual problem. The plots in Figure 2 provide simulations of the algorithms on our machine (using one thread of a 2,800 MHz CPU and 256GB RAM). Additional numerical experiments are available in the Appendix. Four real-life graphs from the dataset [25] are considered: the Facebook graph (4,039 nodes and 88,234 edges, extracted from the Facebook social network), the Youtube graph (1,134,890 nodes and 2,987,624 edges, extracted from the social network included in the Youtube website), the Amazon graph (334,863 nodes representing products, linked by 925,872 edges) and the DBLP graph (a co-authorship network of 317,080 nodes and 1,049,866 edges). On the larger graphs, we only simulate SPLA and SSLA, since the computation of a full proximity operator becomes prohibitive. Numerical experiments over the Amazon and the DBLP graphs are available in the Supplementary material.

Figure 2: Top row: The functional F = H + E_U as a function of CPU time for the three algorithms over the Facebook graph. Left: Y ∼ N(0, I). Right: Y ∼ N(0, I) and then half of the coordinates of Y are put to zero.
Bottom row: The functional F = H + EU as a function of CPU time for the two algorithms over the Youtube graph. Left: Y ∼ N(0, I). Right: Y ∼ N(0, I) and then half of the coordinates of Y are set to zero.

In our simulations, we plot the functional F = H + EU as a function of CPU time while running the algorithms. The parameters λ and σ are chosen such that the log-likelihood term and the Total Variation regularization term have the same weight. The functionals H and EU are estimated using five random realizations of each iterate x̂^k (H is estimated using a kernel density estimator). The batch parameter n is equal to 400. We consider cases where Y has a standard Gaussian distribution and cases where half of the components of Y are standard Gaussians and half are equal to zero (this corresponds to the graph inpainting task [11]). SPLA and SSLA are always simulated with the same step size.
As expected, the numerical experiments show the advantage of using stochastic proximity operators over stochastic subgradients. It is a standard fact that proximity operators are better suited than subgradients for handling ℓ1-norm terms [2]. Our figures show that stochastic proximity operators are numerically more stable than the alternatives [43]. Our figures also show the advantage of stochastic methods (SSLA or SPLA) over deterministic ones for large-scale problems: the SSLA and the SPLA produce iterates about one hundred times more frequently than ProxLA, and are faster in the first iterations.

References

[1] L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.

[2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011.

[3] E. Bernton.
Langevin Monte Carlo and JKO splitting. arXiv preprint arXiv:1802.08671, 2018.

[4] P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26(4):2235–2260, 2016.

[5] P. Bianchi and W. Hachem. Dynamical behavior of a stochastic Forward-Backward algorithm using random monotone operators. Journal of Optimization Theory and Applications, 171(1):90–120, 2016.

[6] P. Bianchi, W. Hachem, and A. Salim. A constant step Forward-Backward algorithm involving random maximal monotone operators. Journal of Convex Analysis, 26(2):397–436, 2019.

[7] H. Brézis. Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert. North-Holland Mathematics Studies. Elsevier Science, Burlington, MA, 1973.

[8] A. Chambolle, V. Caselles, D. Cremers, M. Novaga, and T. Pock. An introduction to total variation for image analysis. Theoretical Foundations and Numerical Methods for Sparse Recovery, 9:263–340, 2010.

[9] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

[10] N. S. Chatterji, N. Flammarion, Y.-A. Ma, P. L. Bartlett, and M. I. Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. arXiv preprint arXiv:1802.05431, 2018.

[11] S. Chen, A. Sandryhaila, G. Lederman, Z. Wang, J. M. F. Moura, P. Rizzo, J. Bielak, J. H. Garrett, and J. Kovačević. Signal inpainting on graphs via total variation minimization. In International Conference on Acoustics, Speech, and Signal Processing, pages 8267–8271, 2014.

[12] X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.

[13] L. Condat.
A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479, 2013.

[14] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.

[15] A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.

[16] A. S. Dalalyan and L. Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. arXiv preprint arXiv:1807.09382, 2018.

[17] D. Davis and W. Yin. A three-operator splitting scheme and its optimization applications. Set-Valued and Variational Analysis, 25(4):829–858, 2017.

[18] A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.

[19] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.

[20] A. Durmus, E. Moulines, and M. Pereyra. Efficient Bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.

[21] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In International Conference on Machine Learning, pages 1833–1841, 2016.

[22] Y.-P. Hsieh, A. Kavis, P. Rolland, and V. Cevher. Mirrored Langevin dynamics. In Advances in Neural Information Processing Systems, pages 2878–2887, 2018.

[23] S. Kotz, T. Kozubowski, and K. Podgorski.
The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Springer Science & Business Media, 2012.

[24] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability.

[25] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[26] Y.-A. Ma, N. Chatterji, X. Cheng, N. Flammarion, P. L. Bartlett, and M. I. Jordan. Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996, 2019.

[27] E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

[28] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

[29] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

[30] G. B. Passty. Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 72(2):383–390, 1979.

[31] A. Patrascu and I. Necoara. Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. Journal of Machine Learning Research, 18:1–42, 2018.

[32] P. Richtárik and M. Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv preprint arXiv:1706.01108, 2017.

[33] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[34] R. T. Rockafellar and R. J.-B. Wets.
On the interchange of subdifferentiation and conditional expectations for convex functionals. Stochastics, 7(3):173–182, 1982.

[35] L. Rosasco, S. Villa, and B. C. Vũ. Stochastic inertial primal-dual algorithms. arXiv preprint arXiv:1507.00852, 2015.

[36] A. Salim, P. Bianchi, and W. Hachem. Snake: a stochastic proximal gradient algorithm for regularized problems over large graphs. IEEE Transactions on Automatic Control, 2019.

[37] F. Santambrogio. {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.

[38] S. Schechtman, A. Salim, and P. Bianchi. Passty Langevin. In CAp, 2019.

[39] U. Şimşekli. Fractional Langevin Monte Carlo: Exploring Lévy driven stochastic differential equations for Markov Chain Monte Carlo. In International Conference on Machine Learning, pages 3200–3209, 2017.

[40] P. Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46(1-3):21–52, 2002.

[41] W. Tansey and J. G. Scott. A fast and flexible algorithm for the graph-fused lasso. arXiv preprint arXiv:1505.06475, 2015.

[42] R. J. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.

[43] P. Toulis, T. Horel, and E. M. Airoldi. Stable Robbins-Monro approximations through stochastic proximal updates. arXiv preprint arXiv:1510.00967, 2015.

[44] T. Van Erven and P. Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

[45] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[46] B. C. Vũ. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Applied Mathematics, 38(3):667–681, 2013.

[47] Y.-X.
Wang, J. Sharpnack, A. J. Smola, and R. J. Tibshirani. Trend filtering on graphs. The Journal of Machine Learning Research, 17(1):3651–3691, 2016.

[48] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pages 681–688, 2011.

[49] A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. arXiv preprint arXiv:1802.08089, 2018.

[50] X. Xu and M. Ghosh. Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909–936, 2015.

[51] A. Yurtsever, B. C. Vũ, and V. Cevher. Stochastic three-composite convex minimization. In Advances in Neural Information Processing Systems, pages 4329–4337, 2016.

[52] D. Zou, P. Xu, and Q. Gu. Subsampled stochastic variance-reduced gradient Langevin dynamics. In International Conference on Uncertainty in Artificial Intelligence, 2018.

[53] D. Zou, P. Xu, and Q. Gu. Sampling from non-log-concave distributions via variance-reduced gradient Langevin dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2936–2945, 2019.