{"title": "Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex", "book": "Advances in Neural Information Processing Systems", "page_first": 3102, "page_last": 3110, "abstract": "In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied online. We apply this method to latent Dirichlet allocation in an online setting, and demonstrate that it achieves substantial performance improvements to the state of the art online variational Bayesian methods.", "full_text": "Stochastic Gradient Riemannian Langevin Dynamics\n\non the Probability Simplex\n\nSam Patterson\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\n\nspatterson@gatsby.ucl.ac.uk\n\nYee Whye Teh\n\nDepartment of Statistics\n\nUniversity of Oxford\n\ny.w.teh@stats.ox.ac.uk\n\nAbstract\n\nIn this paper we investigate the use of Langevin Monte Carlo methods on the\nprobability simplex and propose a new method, Stochastic gradient Riemannian\nLangevin dynamics, which is simple to implement and can be applied to large\nscale data. We apply this method to latent Dirichlet allocation in an online mini-\nbatch setting, and demonstrate that it achieves substantial performance improve-\nments over the state of the art online variational Bayesian methods.\n\n1\n\nIntroduction\n\n(cid:88)\n\nIn recent years there has been increasing interest in probabilistic models where the latent variables\nor parameters of interest are discrete probability distributions over K items, i.e. vectors lying in the\nprobability simplex\n\n\u2206K = {(\u03c01, . . . 
, \u03c0K) : \u03c0k \u2265 0,\n\n\u03c0k = 1} \u2282 RK\n\n(1)\n\nk\n\nImportant examples include topic models like latent Dirichlet allocation (LDA) [BNJ03], admixture\nmodels in genetics like Structure [PSD00], and discrete directed graphical models with a Bayesian\nprior over the conditional probability tables [Hec99].\nStandard approaches to inference over the probability simplex include variational inference [Bea03,\nWJ08] and Markov chain Monte Carlo methods (MCMC) like Gibbs sampling [GRS96]. In the\ncontext of LDA, many methods have been developed, e.g. variational inference [BNJ03], collapsed\nvariational inference [TNW07, AWST09] and collapsed Gibbs sampling [GS04]. With the increas-\ningly large scale document corpora to which LDA and other topic models are applied, there has\nalso been developments of specialised and highly scalable algorithms [NASW09]. Most proposed\nalgorithms are based on a batch learning framework, where the whole document corpus needs to be\nstored and accessed for every iteration. For very large corpora, this framework can be impractical.\nMost recently, [Sat01, HBB10, MHB12] proposed online Bayesian variational inference algorithms\n(OVB), where on each iteration only a small subset (a mini-batch) of the documents is processed\nto give a noisy estimate of the gradient, and a stochastic gradient descent algorithm [RM51] is\nemployed to update the parameters of interest. These algorithms have shown impressive results on\nvery large corpora like Wikipedia articles, where it is not even feasible to store the whole dataset in\nmemory. This is achieved by simply fetching the mini-batch articles in an online manner, processing,\nand then discarding them after the mini-batch.\nIn this paper, we are interested in developing scalable MCMC algorithms for models de\ufb01ned over\nthe probability simplex. 
In some scenarios, and particularly in LDA, MCMC algorithms have been shown to work extremely well, and in fact achieve better results faster than variational inference on small to medium corpora [GS04, TNW07, AWST09]. However, current MCMC methodology has mostly operated in the batch framework which, as argued above, cannot scale to the very large corpora of interest. We will make use of a recently developed MCMC method called stochastic gradient Langevin dynamics (SGLD) [WT11, ABW12] which operates in a similar online mini-batch framework as OVB. Unlike OVB and other stochastic gradient descent algorithms, SGLD is not a gradient descent algorithm. Rather, it is a Hamiltonian MCMC [Nea10] algorithm which will asymptotically produce samples from the posterior distribution. It achieves this by updating parameters according to both the stochastic gradients as well as additional noise which forces it to explore the full posterior instead of simply converging to a MAP configuration.

There are three difficulties that have to be addressed, however, to successfully apply SGLD to LDA and other models defined on probability simplices. Firstly, the probability simplex (1) is compact and has boundaries that have to be accounted for when an update proposes a step that brings the vector outside the simplex. Secondly, the typical Dirichlet priors over the probability simplex place most of their mass close to the boundaries and corners of the simplex. This is particularly the case for LDA and other linguistic models, where probability vectors parameterise distributions over a large number of words, and it is often desirable to use distributions that place significant mass on only a few words, i.e. we want distributions over \Delta_K which place most of their mass near the boundaries and corners. 
This also causes a problem as, depending on the parameterisation used, the gradient required for Langevin dynamics is inversely proportional to entries in \pi and hence can blow up when components of \pi are close to zero. Finally, again for LDA and other linguistic models, we would like algorithms that work well in high-dimensional simplices.

These considerations lead us to the first contribution of this paper in Section 3, which is an investigation into different ways to parameterise the probability simplex. This section shows that the choice of a good parameterisation is not obvious, and that the use of the Riemannian geometry of the simplex [Ama95, GC11] is important in designing Langevin MCMC algorithms. In particular, we show that an unnormalised parameterisation, using a mirroring trick to remove boundaries, coupled with a natural gradient update, achieves the best mixing performance. In Section 4, we then show that the SGLD algorithm, using this parameterisation and natural gradient updates, performs significantly better than OVB algorithms [HBB10, MHB12]. Section 2 reviews Langevin dynamics, natural gradients and SGLD to set up the framework used in the paper, and Section 6 concludes.

2 Review

2.1 Langevin dynamics

Suppose we model a data set x = x_1, \ldots, x_N with a generative model p(x | \theta) = \prod_{i=1}^N p(x_i | \theta) parameterised by \theta \in \mathbb{R}^D with prior p(\theta), and that our aim is to compute the posterior p(\theta | x). 
Langevin dynamics [Ken90, Nea10] is an MCMC scheme which produces samples from the posterior by means of gradient updates plus Gaussian noise, resulting in a proposal distribution q(\theta^* | \theta) as described by Equation 2.

\theta^* = \theta + \frac{\epsilon}{2} \left( \nabla_\theta \log p(\theta) + \sum_{i=1}^N \nabla_\theta \log p(x_i | \theta) \right) + \zeta, \qquad \zeta \sim N(0, \epsilon I)    (2)

The mean of the proposal distribution is in the direction of increasing log posterior due to the gradient, while the added noise will prevent the samples from collapsing to a single (local) maximum. A Metropolis-Hastings correction step is required to correct for discretisation error, with proposals accepted with probability \min\left(1, \frac{p(\theta^* | x)\, q(\theta | \theta^*)}{p(\theta | x)\, q(\theta^* | \theta)}\right) [RS02]. As \epsilon tends to zero, the acceptance ratio tends to one as the Markov chain tends to a stochastic differential equation which has p(\theta | x) as its stationary distribution [Ken78].

2.2 Riemannian Langevin dynamics

Langevin dynamics has an isotropic proposal distribution, leading to slow mixing if the components of \theta have very different scales or if they are highly correlated. Preconditioning can help with this. A recent approach, the Riemann manifold Metropolis adjusted Langevin algorithm [GC11], uses a user-chosen matrix G(\theta) to precondition in a locally adaptive manner. We will refer to their algorithm as Riemannian Langevin dynamics (RLD) in this paper. The Riemannian manifold in question is the family of probability distributions p(x | \theta) parameterised by \theta, for which the expected Fisher information matrix I_\theta defines a natural Riemannian metric tensor. In fact any positive definite matrix G(\theta) defines a valid Riemannian manifold and hence we are not restricted to using G(\theta) = I_\theta. 
This is important in practice as for many models of interest the expected Fisher information is intractable. As in Langevin dynamics, RLD consists of a Gaussian proposal q(\theta^* | \theta), along with a Metropolis-Hastings correction step. The proposal distribution can be written as

\theta^* = \theta + \frac{\epsilon}{2} \mu(\theta) + G^{-\frac{1}{2}}(\theta) \zeta, \qquad \zeta \sim N(0, \epsilon I)    (3)

where the jth component of \mu(\theta) is given by

\mu(\theta)_j = \left( G^{-1}(\theta) \left( \nabla_\theta \log p(\theta) + \sum_{i=1}^N \nabla_\theta \log p(x_i | \theta) \right) \right)_j - 2 \sum_{k=1}^D \left( G^{-1}(\theta) \frac{\partial G(\theta)}{\partial \theta_k} G^{-1}(\theta) \right)_{jk} + \sum_{k=1}^D \left( G^{-1}(\theta) \right)_{jk} \mathrm{Tr}\left( G^{-1}(\theta) \frac{\partial G(\theta)}{\partial \theta_k} \right)    (4)

The first term in Equation 4 is now the natural gradient of the log posterior. Whereas the standard gradient gives the direction of steepest ascent in Euclidean space, the natural gradient gives the direction of steepest ascent taking into account the geometry implied by G(\theta). The remaining terms in Equation 4 describe how the curvature of the manifold defined by G(\theta) changes for small changes in \theta. The Gaussian noise in Equation 3 also takes the geometry of the manifold into account, having scale defined by G^{-\frac{1}{2}}(\theta).

2.3 Stochastic gradient Riemannian Langevin dynamics

In the Langevin dynamics and RLD algorithms, the proposal distribution requires calculation of the gradient of the log likelihood w.r.t. \theta, which means processing all N items in the data set. For large data sets this is infeasible, and even for small data sets it may not be the most efficient use of computation. 
The stochastic gradient Langevin dynamics (SGLD) algorithm [WT11] replaces the calculation of the gradient over the full data set with a stochastic approximation based on a subset of data. Specifically, at iteration t we sample n data items indexed by D_t uniformly from the full data set and replace the exact gradient in Equation 2 with the approximation

\nabla_\theta \log p(x | \theta) \approx \frac{N}{|D_t|} \sum_{i \in D_t} \nabla_\theta \log p(x_i | \theta)    (5)

Also, SGLD does not use a Metropolis-Hastings correction step, as calculating the acceptance probability would require use of the full data set, hence defeating the purpose of the stochastic gradient approximation. Convergence to the posterior is still guaranteed as long as decaying step sizes satisfying \sum_{t=1}^\infty \epsilon_t = \infty and \sum_{t=1}^\infty \epsilon_t^2 < \infty are used.

In this paper we combine the use of a preconditioning matrix G(\theta) as in RLD with this stochastic gradient approximation, by replacing the exact gradient in Equation 4 with the approximation from Equation 5. The resulting algorithm, stochastic gradient Riemannian Langevin dynamics (SGRLD), avoids the slow mixing problems of Langevin dynamics, while still being applicable in a large scale online setting due to its use of stochastic gradients and lack of Metropolis-Hastings correction steps.

3 Riemannian Langevin dynamics on the probability simplex

In this section, we investigate the issues which arise when applying Langevin Monte Carlo methods, specifically the Langevin dynamics and Riemannian Langevin dynamics algorithms, to models whose parameters lie on the probability simplex. In these experiments, a Metropolis-Hastings correction step was used. Consider the simplest possible model: a K dimensional probability vector \pi with Dirichlet prior p(\pi) \propto \prod_{k=1}^K \pi_k^{\alpha_k - 1}, and data x = x_1, \ldots, x_N with p(x_i = k | \pi) = \pi_k. This results in a Dirichlet posterior p(\pi | x) \propto \prod_{k=1}^K \pi_k^{n_k + \alpha_k - 1}, where n_k = \sum_{i=1}^N \delta(x_i = k). In our experiments we use a sparse, symmetric prior with \alpha_k = 0.1 for all k, and sparse count data, setting K = 10, n_1 = 90, n_2 = n_3 = 5 and the remaining n_k to zero. This is to replicate the sparse nature of the posterior in many models of interest.

Table 1: Parameterisation details. For each parameterisation we list \theta, the jth component of the log-posterior gradient \nabla_\theta \log p(\theta | x), the metric G(\theta), its inverse, and the two curvature terms of Equation 4, \sum_k (G^{-1} \frac{\partial G}{\partial \theta_k} G^{-1})_{jk} and \sum_k (G^{-1})_{jk} \mathrm{Tr}(G^{-1} \frac{\partial G}{\partial \theta_k}), which here coincide within each parameterisation.

Reduced-Mean: \theta_k = \pi_k; gradient \frac{n_j + \alpha - 1}{\theta_j} - \frac{n_K + \alpha - 1}{1 - \sum_k \theta_k}; G(\theta) = n_\cdot \left( \mathrm{diag}(\theta)^{-1} + \frac{1}{1 - \sum_k \theta_k} \mathbf{1}\mathbf{1}^T \right); G^{-1}(\theta) = \frac{1}{n_\cdot} \left( \mathrm{diag}(\theta) - \theta\theta^T \right); curvature terms \frac{1}{n_\cdot}(K\theta_j - 1).

Reduced-Natural: \theta_k = \log \frac{\pi_k}{1 - \sum_{k'=1}^{K-1} \pi_{k'}}; gradient n_j + \alpha - (n_\cdot + K\alpha)\pi_j; G(\theta) = n_\cdot \left( \mathrm{diag}(\pi) - \pi\pi^T \right); G^{-1}(\theta) = \frac{1}{n_\cdot} \left( \mathrm{diag}(\pi)^{-1} + \frac{1}{1 - \sum_k \pi_k} \mathbf{1}\mathbf{1}^T \right); curvature terms \frac{1}{n_\cdot} \left( \frac{1}{\pi_j} - \frac{1}{1 - \sum_k \pi_k} \right).

Expanded-Mean: \pi_k = |\theta_k| / \sum_{k'} |\theta_{k'}|; gradient \frac{n_j + \alpha - 1}{\theta_j} - \frac{n_\cdot}{\theta_\cdot} - 1; G(\theta) = \mathrm{diag}(\theta)^{-1}; G^{-1}(\theta) = \mathrm{diag}(\theta); curvature terms -1.

Expanded-Natural: \pi_k = e^{\theta_k} / \sum_{k'} e^{\theta_{k'}}; gradient n_j + \alpha - n_\cdot \pi_j - e^{\theta_j}; G(\theta) = \mathrm{diag}(e^{\theta}); G^{-1}(\theta) = \mathrm{diag}(e^{-\theta}); curvature terms e^{-\theta_j}.
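Because the Dirichlet prior is conjugate to the multinomial likelihood, this toy posterior can be sampled exactly, which provides a ground truth against which the Langevin samplers can be checked. A minimal sketch in NumPy (variable names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
alpha = np.full(K, 0.1)                 # sparse symmetric Dirichlet prior
counts = np.zeros(K)
counts[:3] = [90, 5, 5]                 # sparse data counts n_k

# Conjugacy: the posterior over pi is Dirichlet(n + alpha), so exact
# samples are available to benchmark approximate samplers against.
exact = rng.dirichlet(counts + alpha, size=1000)
```

Each row of `exact` is an independent draw from the posterior; summary statistics of these draws (e.g. the posterior mean of \pi_1, roughly 90.1/101) can be compared with the output of the Langevin chains below.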
The qualitative conclusions we draw are not sensitive to the precise choice of hyperparameters and data here.

There are various possible ways to parameterise the probability simplex, and the performance of Langevin Monte Carlo depends strongly on the choice of parameterisation. We consider both the mean and natural parameter spaces, and in each of these we try both a reduced (K - 1 dimensional) and expanded (K dimensional) parameterisation, with details as follows.

Reduced-Mean: in the mean parameter space, the most obvious approach is to set \theta = \pi directly, but there are two problems with this. Though \pi has K components, it must lie on the simplex, a K - 1 dimensional space. Running Langevin dynamics or RLD on the full K dimensional parameterisation will result in proposals that are off the simplex with probability one. We can incorporate the constraint that \sum_{k=1}^K \pi_k = 1 by using the first K - 1 components as the parameter \theta, and setting \pi_K = 1 - \sum_{k=1}^{K-1} \pi_k. Note however that the proposals can still violate the boundary constraint 0 < \pi_k < 1, and this is particularly problematic when the posterior has mass close to the boundaries.

Expanded-Mean: we can simplify boundary considerations using a redundant parameterisation. We take as our parameter \theta \in \mathbb{R}_+^K with prior a product of independent Gamma(\alpha_k, 1) distributions, p(\theta) \propto \prod_{k=1}^K \theta_k^{\alpha_k - 1} e^{-\theta_k}. \pi is then given by \pi_k = \theta_k / \sum_{k'} \theta_{k'}, and so the prior on \pi is still Dirichlet(\alpha). The boundary conditions 0 < \theta_k can be handled by simply taking the absolute value of the proposed \theta^*. This is equivalent to letting \theta take values in the whole of \mathbb{R}^K, with prior given by Gammas mirrored at 0, p(\theta) \propto \prod_{k=1}^K |\theta_k|^{\alpha_k - 1} e^{-|\theta_k|}, and \pi_k = |\theta_k| / \sum_{k'} |\theta_{k'}|, which again results in a Dirichlet(\alpha) prior on \pi. This approach allows us to bypass boundary issues altogether.

Reduced-Natural: in the natural parameter space, the reduced parameterisation takes the form \pi_k = e^{\theta_k} / (1 + \sum_{k'=1}^{K-1} e^{\theta_{k'}}) for k = 1, \ldots, K - 1. The prior on \theta can be obtained from the Dirichlet(\alpha) prior on \pi using a change of variables. There are no boundary constraints as the range of \theta_k is \mathbb{R}.

Expanded-Natural: finally, the expanded-natural parameterisation takes the form \pi_k = e^{\theta_k} / \sum_{k'=1}^K e^{\theta_{k'}} for k = 1, \ldots, K. As in the expanded-mean parameterisation, we use a product of Gamma priors, in this case for e^{\theta_k}, so that the prior for \pi remains Dirichlet(\alpha).

For all parameterisations, we run both Langevin dynamics and RLD. When applying RLD, we must choose a metric G(\theta). For the reduced parameterisations, we can use the expected Fisher information matrix, but the redundancy in the full parameterisations means that this matrix has rank K - 1 and hence is not invertible. For these parameterisations we use the expected Fisher information matrix for a Gamma/Poisson model, which is equivalent to the Dirichlet/Multinomial apart from the fact that the total number of data items is considered to be random as well.

The details for each parameterisation are summarised in Table 1. In all cases we are interested in sampling from the posterior distribution on \pi, while \theta is the specific parameterisation being used. 
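As a concrete companion to these definitions, the four maps from \theta to \pi can be written directly. A sketch in NumPy (function names are ours):

```python
import numpy as np

def pi_reduced_mean(theta):
    # theta in R^{K-1} with 0 < theta_k and sum(theta) < 1;
    # the last component is pi_K = 1 - sum(theta).
    return np.append(theta, 1.0 - theta.sum())

def pi_expanded_mean(theta):
    # theta in R^K under the mirrored Gamma prior; pi_k = |theta_k| / sum |theta|.
    t = np.abs(theta)
    return t / t.sum()

def pi_reduced_natural(theta):
    # theta in R^{K-1}; pi_k = exp(theta_k) / (1 + sum exp(theta)).
    e = np.exp(theta)
    return np.append(e, 1.0) / (1.0 + e.sum())

def pi_expanded_natural(theta):
    # theta in R^K; softmax, shifted by max(theta) for numerical stability.
    e = np.exp(theta - theta.max())
    return e / e.sum()
```

Each function returns a length-K vector on the simplex; the reduced maps take a (K - 1)-dimensional input while the expanded maps take a K-dimensional one.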
For the mean parameterisations, the \theta^{-1} term in the gradient of the log-posterior means that for components of \theta which are close to zero, the proposal distribution for Langevin dynamics (Equation 2) has a large mean, resulting in unstable proposals with a small acceptance probability. Due to the form of G(\theta)^{-1}, the same argument holds for the RLD proposal distribution for the natural parameterisations. This leaves us with three possible combinations: RLD on the expanded-mean parameterisation, and Langevin dynamics on each of the natural parameterisations.

Figure 1: (a) Effective sample size. (b) Samples. Burn-in was 10,000 iterations, with a thinning factor of 100.

To investigate their relative performances we run a small experiment, producing 110,000 samples from each of the three remaining parameterisations, discarding 10,000 burn-in samples and thinning the remaining samples by a factor of 100. For the resulting 1000 thinned samples of \theta, we calculate the corresponding samples of \pi, and compute the effective sample size for each component of \pi. This was done for a range of step sizes \epsilon, and the mean and median effective sample sizes for the components of \pi are shown in Figure 1(a).

Figure 1(b) shows the samples from each sampler at their optimal step size of 0.1. The samples from Langevin dynamics on both natural parameterisations display higher auto-correlation than the RLD samples produced using the expanded-mean parameterisation, as would be expected from their lower effective sample sizes. In addition to the increased effective sample size, the expanded-mean parameterisation RLD sampler has the advantage that it is computationally efficient as G(\theta) is a diagonal matrix. 
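For the toy Dirichlet model, the resulting expanded-mean update with a diagonal metric is only a few lines. The following is a sketch of ours, not the authors' code: it uses the Table 1 quantities, after which the curvature terms cancel the -1 in the gradient, and it applies the mirroring step; for brevity it omits the Metropolis-Hastings correction that was used in these Section 3 experiments.

```python
import numpy as np

def expanded_mean_step(theta, n, alpha, eps, rng):
    """One mirrored natural-gradient Langevin step for the Dirichlet toy
    model in the expanded-mean parameterisation, G(theta) = diag(theta)^-1.
    Drift n + alpha - theta - n_tot * pi: the gradient's -1 term cancels
    against the curvature terms (both -1, see Table 1)."""
    pi = theta / theta.sum()
    drift = n + alpha - theta - n.sum() * pi
    noise = np.sqrt(eps * theta) * rng.standard_normal(theta.shape)
    return np.abs(theta + 0.5 * eps * drift + noise)  # mirror at zero

rng = np.random.default_rng(1)
alpha = np.full(10, 0.1)
n = np.zeros(10)
n[:3] = [90, 5, 5]
theta = np.ones(10)
for _ in range(2000):
    theta = expanded_mean_step(theta, n, alpha, eps=0.1, rng=rng)
pi = theta / theta.sum()  # approximate posterior sample of pi
```

Because the metric is diagonal, each step costs O(K); the mirroring `np.abs` keeps every component strictly positive without any boundary bookkeeping.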
Hence it is this algorithm that we use when applying these techniques to latent Dirichlet allocation in Section 4.

4 Applying Riemannian Langevin dynamics to latent Dirichlet allocation

Latent Dirichlet Allocation (LDA) [BNJ03] is a hierarchical Bayesian model, most frequently used to model topics arising in collections of text documents. The model consists of K topics \pi_k, which are distributions over the words in the collection, drawn from a symmetric Dirichlet prior with hyper-parameter \beta. A document d is then modelled by a mixture of topics, with mixing proportion \eta_d, drawn from a symmetric Dirichlet prior with hyper-parameter \alpha. The model corresponds to a generative process where documents are produced by drawing a topic assignment z_{di} i.i.d. from \eta_d for each word w_{di} in document d, and then drawing the word w_{di} from the corresponding topic \pi_{z_{di}}. We integrate out \eta analytically, resulting in the semi-collapsed distribution:

p(w, z, \pi | \alpha, \beta) = \prod_{k=1}^K \frac{\Gamma(W\beta)}{\Gamma(\beta)^W} \prod_{w=1}^W \pi_{kw}^{\beta + n_{\cdot kw} - 1} \cdot \prod_{d=1}^D \frac{\Gamma(K\alpha)}{\Gamma(K\alpha + n_{d\cdot\cdot})} \prod_{k=1}^K \frac{\Gamma(\alpha + n_{dk\cdot})}{\Gamma(\alpha)}    (6)

where as in [TNW07], n_{dkw} = \sum_{i=1}^{N_d} \delta(w_{di} = w, z_{di} = k) and \cdot denotes summation over the corresponding index. 
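The count notation above can be made concrete in a few lines; a sketch (toy sizes and names are ours):

```python
import numpy as np

def topic_word_counts(z_d, w_d, K, W):
    """n_dkw[k, w] = number of positions i in document d with z_di = k
    and w_di = w. Dot-indices are sums over the corresponding axis,
    e.g. n_dk. = n_dkw.sum(axis=1)."""
    n = np.zeros((K, W), dtype=int)
    for k, w in zip(z_d, w_d):
        n[k, w] += 1
    return n

z_d = [0, 2, 0, 1]          # topic assignments for a 4-word document
w_d = [3, 1, 3, 0]          # the words at those positions
n = topic_word_counts(z_d, w_d, K=3, W=5)
n_dk_dot = n.sum(axis=1)    # n_dk. : words assigned to topic k in d
```

Summing `n` over documents instead gives n_{\cdot kw}, the topic-word counts appearing in Equation 6.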
Conditional on \pi, the documents are i.i.d., and we can factorise Equation 6 as

p(w, z, \pi | \alpha, \beta) = p(\pi | \beta) \prod_{d=1}^D p(w_d, z_d | \alpha, \pi)    (7)

where

p(w_d, z_d | \alpha, \pi) = \prod_{k=1}^K \frac{\Gamma(\alpha + n_{dk\cdot})}{\Gamma(\alpha)} \prod_{w=1}^W \pi_{kw}^{n_{dkw}}    (8)

4.1 Stochastic gradient Riemannian Langevin dynamics for LDA

As we would like to apply these techniques to large document collections, we use the stochastic gradient version of the Riemannian Langevin dynamics algorithm, as detailed in Section 2.3. Following the investigation in Section 3 we use the expanded-mean parameterisation. For each of the K topics \pi_k, we introduce a W-dimensional unnormalised parameter \theta_k with an independent Gamma prior p(\theta_k) \propto \prod_{w=1}^W \theta_{kw}^{\beta - 1} e^{-\theta_{kw}}, for w = 1, \ldots, W, and set \pi_{kw} = \theta_{kw} / \sum_{w'} \theta_{kw'}. We use the mirroring idea as well. The metric G(\theta) is then the diagonal matrix G(\theta) = \mathrm{diag}(\theta_{11}, \ldots, \theta_{1W}, \ldots, \theta_{K1}, \ldots, \theta_{KW})^{-1}.

The algorithm runs on mini-batches of documents: at time t it receives a mini-batch of documents indexed by D_t, drawn at random from the full corpus D. 
The stochastic gradient of the log posterior of \theta on D_t is shown in Equation 9.

\frac{\partial \log p(\theta | w, \beta, \alpha)}{\partial \theta_{kw}} \approx \frac{\beta - 1}{\theta_{kw}} - 1 + \frac{|D|}{|D_t|} \sum_{d \in D_t} \mathbb{E}_{z_d | w_d, \theta, \alpha} \left[ \frac{n_{dkw}}{\theta_{kw}} - \frac{n_{dk\cdot}}{\theta_{k\cdot}} \right]    (9)

For this choice of \theta and G(\theta), we use Equations 3, 4 to give the SGRLD update for \theta,

\theta^*_{kw} = \left| \theta_{kw} + \frac{\epsilon}{2} \left( \beta - \theta_{kw} + \frac{|D|}{|D_t|} \sum_{d \in D_t} \mathbb{E}_{z_d | w_d, \theta, \alpha} \left[ n_{dkw} - \pi_{kw} n_{dk\cdot} \right] \right) + (\theta_{kw})^{\frac{1}{2}} \zeta_{kw} \right|    (10)

where \zeta_{kw} \sim N(0, \epsilon). Note that the \beta - 1 term in Equation 9 has been replaced with \beta in Equation 10, as the -1 cancels with the curvature terms as detailed in Table 1. As discussed in Section 3, we reflect moves across the boundary 0 < \theta_{kw} by taking the absolute value of the proposed update. Comparing Equation 9 to the gradient for the simple model from Section 3, the observed counts n_k for the simple model have been replaced with the expectation of the latent topic assignment counts n_{dkw}. 
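Equation 10 vectorises naturally over the K x W topic matrix. A sketch of ours (not the released code); the expected counts are passed in as an argument, since how they are obtained is a separate step:

```python
import numpy as np

def sgrld_lda_step(theta, E_ndkw, beta, eps, scale, rng):
    """One SGRLD update (Equation 10) of the K x W unnormalised topic
    matrix theta. E_ndkw: expected topic-word counts summed over the
    mini-batch; scale = |D| / |D_t|. Mirroring keeps theta positive."""
    pi = theta / theta.sum(axis=1, keepdims=True)
    E_ndk = E_ndkw.sum(axis=1, keepdims=True)          # E[n_dk.]
    drift = beta - theta + scale * (E_ndkw - pi * E_ndk)
    noise = np.sqrt(eps * theta) * rng.standard_normal(theta.shape)
    return np.abs(theta + 0.5 * eps * drift + noise)
```

Because G(\theta) is diagonal, the whole update is elementwise and costs O(KW) per mini-batch, regardless of corpus size.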
To calculate this expectation we use Gibbs sampling on the topic assignments in each document separately, using the conditional distributions

p(z_{di} = k | w_d, \theta, \alpha) = \frac{\left( \alpha + n^{\backslash i}_{dk\cdot} \right) \theta_{k w_{di}}}{\sum_{k'} \left( \alpha + n^{\backslash i}_{dk'\cdot} \right) \theta_{k' w_{di}}}    (11)

where \backslash i represents a count excluding the topic assignment variable we are updating.

5 Experiments

We investigate the performance of SGRLD, with no Metropolis-Hastings correction step, on two real-world data sets. We compare it to two online variational Bayesian algorithms developed for latent Dirichlet allocation: online variational Bayes (OVB) [HBB10] and hybrid stochastic variational-Gibbs (HSVG) [MHB12]. The difference between these two methods is the form of variational assumption made. OVB assumes a mean-field variational posterior, q(\eta_{1:D}, z_{1:D}, \pi_{1:K}) = \prod_{d,i} q(z_{di}) \prod_d q(\eta_d) \prod_k q(\pi_k); in particular this means topic assignment variables within the same document are assumed to be independent, when in reality they will be strongly coupled. In contrast, HSVG collapses \eta_d analytically and uses a variational posterior of the form q(z_{1:D}, \pi_{1:K}) = \prod_d q(z_d) \prod_k q(\pi_k), which allows dependence within the components of z_d. This more complicated posterior requires Gibbs sampling in the variational update step for z_d, and we combined the code for OVB [HBB10] with the Gibbs sampling routine from our SGRLD code to implement HSVG.

5.1 Evaluation Method

The predictive performance of the algorithms can be measured by looking at the probability they assign to unseen data. A metric frequently used for this purpose is perplexity, the exponentiated cross entropy between the trained model probability distribution and the empirical distribution of the test data. 
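In other words, perplexity is the exponential of the negative average log-probability the model assigns to the held-out words; a minimal sketch:

```python
import numpy as np

def perplexity(word_probs):
    """Exponentiated cross entropy: exp of the negative mean
    log-probability assigned to each held-out word."""
    return float(np.exp(-np.mean(np.log(word_probs))))

# A model that assigns 1/4 to every word in a 4-word vocabulary
# has perplexity exactly 4.
uniform_perp = perplexity(np.full(8, 0.25))
```

Lower is better: a perplexity of P means the model is, on average, as uncertain about each word as a uniform choice among P words.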
For a held-out document w_d and a training set W, the perplexity is given by

\mathrm{perp}(w_d | W, \alpha, \beta) = \exp\left( -\frac{\sum_{i=1}^{n_{d\cdot\cdot}} \log p(w_{di} | W, \alpha, \beta)}{n_{d\cdot\cdot}} \right)    (12)

This requires calculating p(w_{di} | W, \alpha, \beta), which is done by marginalising out the parameters \eta_d, \pi_1, \ldots, \pi_K and topic assignments z_d, to give

p(w_{di} | W, \alpha, \beta) = \mathbb{E}_{\eta_d, \pi} \left[ \sum_k \eta_{dk} \pi_{k w_{di}} \right]    (13)

We use a document completion approach [WMSM09], partitioning the test document w_d into two sets of words, w_d^{train} and w_d^{test}, using w_d^{train} to estimate \eta_d for the test document, then calculating the perplexity on w_d^{test} using this estimate.

To calculate the perplexity for SGRLD, we integrate \eta analytically, so Equation 13 is replaced by

p(w_{di} | w_d^{train}, W, \alpha, \beta) = \mathbb{E}_{\pi | W, \beta} \left[ \mathbb{E}_{z_d^{train} | \pi, \alpha} \left[ \sum_k \hat{\eta}_{dk} \pi_{k w_{di}} \right] \right]    (14)

where

\hat{\eta}_{dk} := p(z^{test}_{di} = k | z_d^{train}, \alpha) = \frac{n^{train}_{dk\cdot} + \alpha}{n^{train}_{d\cdot\cdot} + K\alpha}    (15)

We estimate these expectations using the samples we obtain for \pi from the Markov chain produced by SGRLD, and samples for z_d^{train} produced by Gibbs sampling the topic assignments on w_d^{train}.

For OVB and HSVG, we estimate Equation 13 by replacing the true posterior p(\eta, \pi) with q(\eta, \pi):

p(w_{di} | W, \alpha, \beta) = \mathbb{E}_{p(\eta_d, \pi | W, \alpha, \beta)} \left[ \sum_k \eta_{dk} \pi_{k w_{di}} \right] \approx \sum_k \mathbb{E}_{q(\eta_d)}[\eta_{dk}] \, \mathbb{E}_{q(\pi_k)}[\pi_{k w_{di}}]    (16)

We estimate the perplexity directly rather than use a variational bound [HBB10] so that we can compare results of the variational algorithms to those of SGRLD.

5.2 Results on NIPS 
corpus

The first experiment was carried out on the collection of NIPS papers from 1988-2003 [GCPT07]. This corpus contains 2483 documents, which is small enough to run all three algorithms in batch mode and compare their performance to that of collapsed Gibbs sampling (CGS) on the full collection. Each document was split 80/20 into training and test sets, the training portion of all 2483 documents was used in each update step, and the perplexity was calculated on the test portion of all documents. Hyper-parameters \alpha and \beta were both fixed to 0.01, and 50 topics were used. A step-size schedule of the form \epsilon_t = (a(1 + t/b))^{-c} was used. Perplexities were estimated for a range of step size parameters, and for 1, 5 and 10 document updates per topic parameter update. For OVB the document updates are fixed point iterations of q(z_d), while for HSVG and SGRLD they are Gibbs updates of z_d, the first half of which were discarded as burn-in. These numbers of document updates were chosen as a previous investigation of the performance of HSVG for varying numbers of Gibbs updates has shown that 6-10 updates are sufficient [MHB12] to achieve good performance.

Figure 2(a) shows the lowest perplexities achieved along with the corresponding parameter settings. As expected, CGS achieves the lowest perplexities. It is surprising that HSVG performs slightly worse than OVB on this data set; as it uses a less restricted variational distribution it should perform at least as well. SGRLD improves on the performance of OVB and HSVG, but does not match the performance of Gibbs sampling.

5.3 Results on Wikipedia corpus

The algorithms' performance in an online scenario was assessed on a set of articles downloaded at random from Wikipedia, as in [HBB10]. 
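The document completion estimate of Equations 14 and 15, used throughout these evaluations, can be sketched with a single posterior sample of \pi (the actual procedure averages over many samples; names and toy sizes are ours):

```python
import numpy as np

def doc_completion_perplexity(train_topic_counts, n_train, w_test, pi, alpha):
    """Perplexity of held-out words w_test given one posterior sample pi
    (K x W) and the training-half topic counts n_dk.^train.
    eta_hat follows Equation 15; p(w_di) = sum_k eta_hat_k * pi_k,w."""
    K = pi.shape[0]
    eta_hat = (train_topic_counts + alpha) / (n_train + K * alpha)
    p = eta_hat @ pi[:, w_test]
    return float(np.exp(-np.mean(np.log(p))))

# Toy check: with a uniform topic-word matrix over W = 4 words,
# the perplexity must equal 4 whatever eta_hat is.
pi_uniform = np.full((2, 4), 0.25)
perp = doc_completion_perplexity(np.array([3.0, 1.0]), 4.0,
                                 np.array([0, 2, 3]), pi_uniform, alpha=0.5)
```

Note that \hat{\eta} sums to one by construction, so `p` is a proper mixture of the per-topic word probabilities.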
The vocabulary used is again as per [HBB10]; it is not created from the Wikipedia data set, but is instead taken from the top 10,000 words in Project Gutenberg texts, excluding all words of less than three characters. This results in a vocabulary size W of approximately 8000 words. 150,000 documents from Wikipedia were used in total, in mini-batches of 50 documents each. The perplexities were estimated using the methods discussed in Section 5.1 on a separate holdout set of 1000 documents, split 90/10 training/test. As the corpus size is large, collapsed Gibbs sampling was not run on this data set.

Figure 2: Test-set perplexities on NIPS and Wikipedia corpora. (a) NIPS corpus. (b) Wikipedia corpus.

For each algorithm a grid-search was run on the hyper-parameters, step-size parameters, and number of Gibbs sampling sweeps / variational fixed point iterations per \pi update. The lowest three perplexities attained for each algorithm are shown in Figure 2(b). Corresponding parameters are given in the supplementary material. HSVG achieves better performance than OVB, as expected. The performance of SGRLD is a substantial improvement on both the variational algorithms.

6 Discussion

We have explored the issues involved in applying Langevin Monte Carlo techniques to a constrained parameter space such as the probability simplex, and developed a novel online sampling algorithm which addresses those issues. 
Using an expanded parametrisation with a reflection trick for negative proposals removed the need to deal with boundary constraints, and using the Riemannian geometry of the parameter space dealt with the problem of parameters with differing scales.

Applying the method to Latent Dirichlet Allocation on two data sets produced state of the art predictive performance for the same computational budget as competing methods, demonstrating that full Bayesian inference using MCMC can be practically applied to models of interest, even when the data set is large. Python code for our method is available at http://www.stats.ox.ac.uk/~teh/sgrld.html.

Due to the widespread use of models defined on the probability simplex, we believe the methods developed here for Langevin dynamics on the probability simplex will find further uses beyond latent Dirichlet allocation and stochastic gradient Monte Carlo methods. A drawback of SGLD algorithms is the need for decreasing step sizes; it would be interesting to investigate adaptive step sizes and the approximation entailed when using fixed step sizes (but see [AKW12] for a recent development).

Acknowledgements

We thank the Gatsby Charitable Foundation and EPSRC (grant EP/K009362/1) for generous funding, reviewers and area chair for feedback and support, and [HBB10] for use of their excellent publicly available source code.

References

[ABW12] Sungjin Ahn, Anoop Korattikara Balan, and Max Welling, Bayesian posterior sampling via stochastic gradient Fisher scoring, ICML, 2012.
[AKW12] S. Ahn, A. Korattikara, and M. Welling, Bayesian posterior sampling via stochastic gradient Fisher scoring, Proceedings of the International Conference on Machine Learning, 2012.
[Ama95] S. 
Amari, Information geometry of the EM and em algorithms for neural networks, Neural Networks 8 (1995), no. 9, 1379–1408.
[AWST09] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, On smoothing and inference for topic models, Proceedings of the International Conference on Uncertainty in Artificial Intelligence, vol. 25, 2009.
[Bea03] M. J. Beal, Variational algorithms for approximate Bayesian inference, Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[BNJ03] D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.
[GC11] M. Girolami and B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, Journal of the Royal Statistical Society B 73 (2011), 1–37.
[GCPT07] A. Globerson, G. Chechik, F. Pereira, and N. Tishby, Euclidean embedding of co-occurrence data, Journal of Machine Learning Research 8 (2007), 2265–2295.
[GRS96] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Markov chain Monte Carlo in practice, Chapman and Hall, 1996.
[GS04] T. L. Griffiths and M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences, 2004.
[HBB10] M. D. Hoffman, D. M. Blei, and F. Bach, Online learning for latent Dirichlet allocation, Advances in Neural Information Processing Systems, 2010.
[Hec99] D. Heckerman, A tutorial on learning with Bayesian networks, Learning in Graphical Models (M. I. Jordan, ed.), Kluwer Academic Publishers, 1999.
[Ken78] J. Kent, Time-reversible diffusions, Advances in Applied Probability 10 (1978), 819–835.
[Ken90] A. D. Kennedy, The theory of hybrid stochastic algorithms, Probabilistic Methods in Quantum Field Theory and Quantum Gravity, Plenum Press, 1990.
[MHB12] D. Mimno, M. Hoffman, and D. Blei, Sparse stochastic inference for latent Dirichlet allocation, Proceedings of the International Conference on Machine Learning, 2012.
[NASW09] D. Newman, A. Asuncion, P. Smyth, and M. Welling, Distributed algorithms for topic models, Journal of Machine Learning Research (2009).
[Nea10] R. M. Neal, MCMC using Hamiltonian dynamics, Handbook of Markov Chain Monte Carlo (S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, eds.), Chapman & Hall / CRC Press, 2010.
[PSD00] J. K. Pritchard, M. Stephens, and P. Donnelly, Inference of population structure using multilocus genotype data, Genetics 155 (2000), 945–959.
[RM51] H. Robbins and S. Monro, A stochastic approximation method, Annals of Mathematical Statistics 22 (1951), no. 3, 400–407.
[RS02] G. O. Roberts and O. Stramer, Langevin diffusions and Metropolis-Hastings algorithms, Methodology and Computing in Applied Probability 4 (2002), 337–357.
[Sat01] M. Sato, Online model selection based on the variational Bayes, Neural Computation 13 (2001), 1649–1681.
[TNW07] Y. W. Teh, D. Newman, and M. Welling, A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, Advances in Neural Information Processing Systems, vol. 19, 2007, pp. 1353–1360.
[WJ08] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning 1 (2008), no. 1-2, 1–305.
[WMSM09] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno, Evaluation methods for topic models, Proceedings of the 26th International Conference on Machine Learning (ICML) (L. Bottou and M. Littman, eds.), Omnipress, June 2009, pp. 1105–1112.
[WT11] M. Welling and Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, Proceedings of the International Conference on Machine Learning, 2011.