{"title": "Large-Scale Stochastic Sampling from the Probability Simplex", "book": "Advances in Neural Information Processing Systems", "page_first": 6721, "page_last": 6731, "abstract": "Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space the time-discretization error can dominate when we are near the boundary of the space. We demonstrate that because of this, current SGMCMC methods for the simplex struggle with sparse simplex spaces; when many of the components are close to zero. Unfortunately, many popular large-scale Bayesian models, such as network or topic models, require inference on sparse simplex spaces. To avoid the biases caused by this discretization error, we propose the stochastic Cox-Ingersoll-Ross process (SCIR), which removes all discretization error and we prove that samples from the SCIR process are asymptotically unbiased. We discuss how this idea can be extended to target other constrained spaces. Use of the SCIR process within a SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.", "full_text": "Large-Scale Stochastic Sampling from the Probability\n\nSimplex\n\nJack Baker\n\nSTOR-i CDT, Mathematics and Statistics\n\nLancaster University\n\nj.baker1@lancaster.ac.uk\n\nEmily B. 
Fox\n\nComputer Science & Engineering and Statistics\n\nUniversity of Washington\n\nebfox@uw.edu\n\nPaul Fearnhead\n\nMathematics and Statistics\n\nLancaster University\n\np.fearnhead@lancaster.ac.uk\n\nChristopher Nemeth\n\nMathematics and Statistics\n\nLancaster University\n\nc.nemeth@lancaster.ac.uk\n\nAbstract\n\nStochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space the time-discretization error can dominate when we are near the boundary of the space. We demonstrate that because of this, current SGMCMC methods for the simplex struggle with sparse simplex spaces; when many of the components are close to zero. Unfortunately, many popular large-scale Bayesian models, such as network or topic models, require inference on sparse simplex spaces. To avoid the biases caused by this discretization error, we propose the stochastic Cox-Ingersoll-Ross process (SCIR), which removes all discretization error and we prove that samples from the SCIR process are asymptotically unbiased. We discuss how this idea can be extended to target other constrained spaces. Use of the SCIR process within a SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.\n\nStochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference (Welling and Teh, 2011; Chen et al., 2014; Ding et al., 2014; Ma et al., 2015). The foundation of SGMCMC methods is a class of continuous processes that explore a target distribution (e.g., the posterior) using gradient information. These processes converge to a Markov chain which samples from the posterior distribution exactly. 
SGMCMC methods replace the costly full-data gradients with minibatch-based stochastic gradients, which provides one source of error. Another source of error arises from the fact that the continuous processes are almost never tractable to simulate; instead, discretizations are relied upon. In the non-SG scenario, the discretization errors are corrected for using Metropolis-Hastings. However, this is not (generically) feasible in the SG setting. The result of these two sources of error is that SGMCMC targets an approximate posterior (Welling and Teh, 2011; Teh et al., 2016; Vollmer et al., 2016).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nAnother significant limitation of SGMCMC methods is that they struggle to sample from constrained spaces. Naively applying SGMCMC can lead to invalid, or inaccurate, values being proposed. The result is large errors near the boundary of the space (Patterson and Teh, 2013; Ma et al., 2015; Li et al., 2016). A particularly important constrained space is the simplex space, which is used to model discrete probability distributions. A parameter ω of dimension d lies in the simplex if it satisfies the following conditions: ω_j ≥ 0 for all j = 1, . . . , d and ∑_{j=1}^d ω_j = 1. Many popular models contain simplex parameters. For example, latent Dirichlet allocation (LDA) is defined by a set of topic-specific distributions on words and document-specific distributions on topics. Probabilistic network models often define a link probability between nodes. More generally, mixture and mixed membership models have simplex-constrained mixture weights; even the hidden Markov model can be cast in this framework with simplex-constrained transition distributions. As models become large-scale, these vectors ω often become sparse (i.e., many ω_j are close to zero), pushing them to the boundaries of the simplex. 
All the models mentioned have this tendency. For example, in network data, nodes often have relatively few links compared to the size of the network, e.g., the number of friends the average social network user has will be small compared with the size of the whole social network. In these cases the problem of sampling from the simplex space becomes even harder, since many values will be very close to the boundary of the space.\nPatterson and Teh (2013) develop an improved SGMCMC method for sampling from the probability simplex: stochastic gradient Riemannian Langevin dynamics (SGRLD). The improvements are achieved through an astute transformation of the simplex parameters, as well as by developing a Riemannian (see Girolami and Calderhead, 2011) variant of SGMCMC. This method achieved state-of-the-art results on an LDA model. However, we show that despite the improvements over standard SGMCMC, the discretization error of SGRLD still causes problems on the simplex. In particular, it leads to asymptotic biases which dominate at the boundary of the space and cause significant inaccuracy. To counteract this, we design an SGMCMC method based on the Cox-Ingersoll-Ross (CIR) process. The resulting process, which we refer to as the stochastic CIR process (SCIR), has no discretization error. This process can be used to simulate from gamma random variables directly, which can then be moved into the simplex space using a well-known transformation. The CIR process has many nice properties. One is that the transition equation is known exactly, which is what allows us to simulate from the process without discretization error. We are also able to characterize important theoretical properties of the SCIR algorithm, such as the non-asymptotic moment generating function, and thus its mean and variance. 
We discuss how these ideas can be used to simulate efficiently from other constrained spaces, such as (0, ∞).\nWe demonstrate the impact of this SCIR method on a broad class of models. Included in these experiments is the development of a scalable sampler for Dirichlet processes, based on the slice sampler of Walker (2007); Papaspiliopoulos (2008); Kalli et al. (2011). To our knowledge, the application of SGMCMC methods to Bayesian nonparametric models has not been explored. All proofs in this article are relegated to the Supplementary Material. All code for the experiments is available online [1], and full details of hyperparameter and tuning-constant choices are given in the Supplementary Material.\n\n1 Stochastic Gradient MCMC on the Probability Simplex\n\n1.1 Stochastic Gradient MCMC\n\nConsider Bayesian inference for continuous parameters θ ∈ R^d based on data x = {x_i}_{i=1}^N. Denote the density of x_i as p(x_i | θ) and assign a prior on θ with density p(θ). The posterior is then defined, up to a constant of proportionality, as p(θ | x) ∝ p(θ) ∏_{i=1}^N p(x_i | θ), and has distribution π. We define f(θ) := −log p(θ | x). Whilst MCMC can be used to sample from π, such algorithms require access to the full data set at each iteration. Stochastic gradient MCMC (SGMCMC) is an approximate MCMC algorithm that reduces this per-iteration computational and memory cost by using only a small subset of data points at each step.\nThe most common SGMCMC algorithm is stochastic gradient Langevin dynamics (SGLD), first introduced by Welling and Teh (2011). This sampler uses the Langevin diffusion, defined as the solution to the stochastic differential equation\n\ndθ_t = −∇f(θ_t) dt + √2 dW_t,   (1.1)\n\nwhere W_t is a d-dimensional Wiener process. Similar to MCMC, the Langevin diffusion defines a Markov chain whose stationary distribution is π.\nUnfortunately, simulating from (1.1) is rarely possible, and the cost of calculating ∇f is O(N) since it involves a sum over all data points. The idea of SGLD is to introduce two approximations to circumvent these issues. First, the continuous dynamics are approximated by discretizing them, in a similar way to Euler's method for ODEs. This approximation is known as the Euler-Maruyama method. Next, in order to reduce the cost of calculating ∇f, it is replaced with a cheap, unbiased estimate. This leads to the following update equation, with user-chosen stepsize h:\n\nθ_{m+1} = θ_m − h ∇f̂(θ_m) + √(2h) η_m,   η_m ∼ N(0, 1).   (1.2)\n\nHere, ∇f̂ is an unbiased estimate of ∇f whose computational cost is O(n), where n ≪ N. Typically, we set ∇f̂(θ) := −∇log p(θ) − N/n ∑_{i∈S_m} ∇log p(x_i | θ), where S_m ⊂ {1, . . . , N} is resampled at each iteration with |S_m| = n. Applying (1.2) repeatedly defines a Markov chain that approximately targets π (Welling and Teh, 2011).\n\n[1] Code available at https://github.com/jbaker92/scir. 
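As a concrete illustration of update (1.2), the following is a minimal SGLD sketch for a toy one-dimensional Gaussian model; the model, function names, and tuning constants are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik, data, n, h, rng):
    """One SGLD update (1.2): theta <- theta - h*grad_fhat + sqrt(2h)*eta."""
    N = len(data)
    batch = rng.choice(N, size=n, replace=False)
    # Unbiased minibatch estimate of grad f(theta) = -grad log p(theta | x).
    grad_fhat = -grad_log_prior(theta) - (N / n) * sum(
        grad_log_lik(theta, data[i]) for i in batch)
    return theta - h * grad_fhat + np.sqrt(2 * h) * rng.standard_normal()

# Toy example: x_i ~ N(theta, 1) with a flat prior, so the posterior is
# N(mean(x), 1/N) and grad log p(x_i | theta) = x_i - theta.
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=1000)
theta = 0.0
for _ in range(2000):
    theta = sgld_step(theta, lambda t: 0.0, lambda t, x: x - t,
                      data, n=32, h=1e-4, rng=rng)
# theta now fluctuates around the posterior mean, mean(data).
```

Each step touches only n = 32 of the N = 1000 points, which is the O(n) per-iteration cost described above.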
There are a number of alternative SGMCMC algorithms to SGLD, based on approximations to other diffusions that also target the posterior distribution (Chen et al., 2014; Ding et al., 2014; Ma et al., 2015).\nRecent work has investigated reducing the error introduced by approximating the gradient using minibatches (Dubey et al., 2016; Nagapetyan et al., 2017; Baker et al., 2017; Chatterji et al., 2018). While, by comparison, the discretization error is generally smaller, in this work we investigate an important situation where it degrades performance considerably.\n\n1.2 SGMCMC on the Probability Simplex\n\nWe aim to make inference on the simplex parameter ω of dimension d, where ω_j ≥ 0 for all j = 1, . . . , d and ∑_{j=1}^d ω_j = 1. We assume we have categorical data z_i of dimension d for i = 1, . . . , N, so z_ij will be 1 if data point i belongs to category j and z_ik will be zero for all k ≠ j. We assume a Dirichlet prior Dir(α) on ω, with density p(ω) ∝ ∏_{j=1}^d ω_j^{α_j − 1}, where α = (α_1, . . . , α_d), and that the data is drawn from z_i | ω ∼ Categorical(ω), leading to a Dir(α + ∑_{i=1}^N z_i) posterior. An important transformation we will use repeatedly throughout this article is as follows: if we have d random gamma variables X_j ∼ Gamma(α_j, 1), then (X_1, . . . , X_d)/∑_j X_j will have Dir(α) distribution.\nIn this simple case the posterior of ω can be calculated exactly. However, in the applications we consider the z_i are latent variables, and they are also simulated as part of a larger Gibbs sampler. Thus the z_i will change at each iteration of the algorithm. We are interested in the situation where this is the case, and N is large, so that standard MCMC runs prohibitively slowly. 
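The gamma-to-Dirichlet transformation above is easy to check empirically. A minimal sketch (the values of α here are our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([0.1, 1.0, 5.0])

# Draw X_j ~ Gamma(alpha_j, 1) and normalize: the result is Dir(alpha).
X = rng.gamma(shape=alpha, scale=1.0, size=(100000, 3))
omega = X / X.sum(axis=1, keepdims=True)

# The mean of Dir(alpha) is alpha / sum(alpha) = [0.0164, 0.1639, 0.8197].
print(omega.mean(axis=0))  # close to the theoretical mean above
```

The same construction applied to X_j ~ Gamma(α_j + ∑_i z_ij, 1) yields the Dirichlet posterior, which is the route the SCIR method takes in Section 2.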
The idea of SGMCMC in this situation is to use subsamples of z to propose appropriate moves to ω.\nApplying SGMCMC to models which contain simplex parameters is challenging due to their constraints. Naively applying SGMCMC can lead to invalid values being proposed. The first SGMCMC algorithm developed specifically for the probability simplex was the SGRLD algorithm of Patterson and Teh (2013). Patterson and Teh (2013) try a variety of transformations for ω which move the problem onto a space in R^d, where standard SGMCMC can be applied. They also build upon standard SGLD by developing a Riemannian variant. Riemannian MCMC (Girolami and Calderhead, 2011) takes the geometry of the space into account, which assists with errors at the boundary of the space. The parameterization Patterson and Teh (2013) find numerically performs best is ω_j = |θ_j| / ∑_{k=1}^d |θ_k|. They use a mirrored gamma prior for θ_j, which has density p(θ_j) ∝ |θ_j|^{α_j − 1} e^{−|θ_j|}. This means the prior for ω remains the required Dirichlet distribution. They calculate the density of z_i given θ using a change of variables, and use a (Riemannian) SGLD update to update θ.\n\n1.3 SGRLD on Sparse Simplex Spaces\n\nPatterson and Teh (2013) suggested that the boundary of the space is where most problems occur using these kinds of samplers, motivating their introduction of Riemannian ideas for SGLD. In many popular applications, such as LDA and modeling sparse networks, many of the components ω_j will be close to 0. We refer to such ω as being sparse. In other words, there are many j for which ∑_{i=1}^N z_ij = 0.\nIn order to demonstrate the problems with using SGRLD in this case, we provide a similar experiment to Patterson and Teh (2013). 
We use SGRLD to simulate from a sparse simplex parameter ω of dimension d = 10 with N = 1000. We set ∑_{i=1}^N z_i1 = 800, ∑_{i=1}^N z_i2 = ∑_{i=1}^N z_i3 = 100, and ∑_{i=1}^N z_ij = 0 for 3 < j ≤ 10. The prior parameter α was set to 0.1 for all components, leading to a highly sparse Dirichlet posterior. We will refer back to this experiment as the running experiment.\n\nFigure 1: Boxplots of a 1000 iteration sample from SGRLD and SCIR fit to a sparse Dirichlet posterior, compared to 1000 exact independent samples. On the log scale.\n\nIn Figure 1 we provide boxplots from a sample of the fifth component of ω using SGRLD after 1000 iterations with 1000 iterations of burn-in, compared with boxplots from an exact sample. The method SCIR will be introduced later. We can see from Figure 1 that SGRLD rarely proposes small values of ω. This becomes a significant issue for sparse Dirichlet distributions, since the lack of small values leads to a poor approximation to the posterior, as we can see from the boxplots.\nWe hypothesize that the reason SGRLD struggles when ω_j is near the boundary is due to the discretization by h, and we now try to diagnose this issue in detail. The problem relates to the bias of SGLD caused by the discretization of the algorithm. We use the results of Vollmer et al. (2016) to characterize this bias for a fixed stepsize h. For similar results when the stepsize scheme is decreasing, we refer the reader to Teh et al. (2016). Proposition 1.1 is a simple application of Vollmer et al. (2016, Theorem 3.3), so we refer the reader to that article for full details of the assumptions. For simplicity of the statement, we assume that θ is 1-dimensional, but the results are easily adapted to the d-dimensional case.\nProposition 1.1. (Vollmer et al., 2016) Under Vollmer et al. (2016, Assumptions 3.1 and 3.2), assume θ is 1-dimensional. Let θ_m be iteration m of an SGLD algorithm for m = 1, . . . , M. Then the asymptotic bias, defined by lim_{M→∞} |(1/M) ∑_{m=1}^M E[θ_m] − E_π[θ]|, has leading term O(h).\nWhile ordinarily this asymptotic bias is hard to disentangle from other sources of error, as E_π[θ] gets close to zero h has to be set prohibitively small to give a good approximation to θ. The crux of the issue is that, while the absolute error remains the same, at the boundary of the space the relative error is large since θ is small, and biased upwards due to the positivity constraint. To counteract this, in the next section we introduce a method which has no discretization error. This allows us to prove that the asymptotic bias, as defined in Proposition 1.1, will be zero for any choice of stepsize h.\n\n2 The Stochastic Cox-Ingersoll-Ross Algorithm\n\nWe now wish to counteract the problems with SGRLD on sparse simplex spaces. First, we make the following observation: rather than applying a reparameterization of the prior for ω, we can model the posterior for each θ_j directly and independently as θ_j | z ∼ Gamma(α_j + ∑_{i=1}^N z_ij, 1). Then using the gamma reparameterization ω = θ/∑_j θ_j still leads to the desired Dirichlet posterior. This leaves the θ_j in a much simpler form, and this simpler form enables us to remove all discretization error. We do this by using an alternative underlying process to the Langevin diffusion, known as the Cox-Ingersoll-Ross (CIR) process, commonly used in mathematical finance. 
A CIR process θ_t with parameter a and stationary distribution Gamma(a, 1) has the following form:\n\ndθ_t = (a − θ_t) dt + √(2θ_t) dW_t.   (2.1)\n\nThe standard CIR process has more parameters, but we found changing these made no difference to the properties of our proposed scalable sampler, so we omit them (for exact details see the Supplementary Material).\nThe CIR process has many nice properties. One that is particularly useful for us is that the transition density is known exactly. Define χ²(ν, µ) to be the non-central chi-squared distribution with ν degrees of freedom and non-centrality parameter µ. If at time t we are at state ϑ_t, then the probability distribution of θ_{t+h} is given by\n\nθ_{t+h} | θ_t = ϑ_t ∼ [(1 − e^{−h})/2] W,   W ∼ χ²(2a, 2ϑ_t e^{−h}/(1 − e^{−h})).   (2.2)\n\nThis transition density allows us to simulate directly from the CIR process with no discretization error. Furthermore, it has been proven that the CIR process is negative with probability zero (Cox et al., 1985), meaning we will not need to take absolute values as is required for the SGRLD algorithm.\n\n2.1 Adapting for Large Datasets\n\nThe next issue we need to address is how to sample from this process when the dataset is large. Suppose that z_i is data for i = 1, . . . , N, for some large N, and that our target distribution is Gamma(a, 1), where a = α + ∑_{i=1}^N z_i. We want to approximate the target by simulating from the CIR process using only a subset of z at each iteration. 
A natural thing to do would be, at each iteration, to replace a in the transition density equation (2.2) with an unbiased estimate â = α + N/n ∑_{i∈S} z_i, where S ⊂ {1, . . . , N}, similar to SGLD. We will refer to a CIR process using unbiased estimates in this way as the stochastic CIR process (SCIR). Fix some stepsize h, which now determines how often â is resampled rather than the granularity of the discretization. Suppose θ̂_m follows the SCIR process; then it will have the following update:\n\nθ̂_{m+1} | θ̂_m = ϑ_m ∼ [(1 − e^{−h})/2] W,   W ∼ χ²(2â_m, 2ϑ_m e^{−h}/(1 − e^{−h})),   (2.3)\n\nwhere â_m = α + N/n ∑_{i∈S_m} z_i.\nWe can show that this algorithm will approximately target the true posterior distribution in the same sense as SGLD. To do this, we draw a connection between the SCIR process and an SGLD algorithm, which allows us to use the arguments of SGLD to show that the SCIR process will target the desired distribution. More formally, we have the following relationship:\nTheorem 2.1. Let θ_t be a CIR process with transition (2.2). Then U_t := g(θ_t) = 2√θ_t follows the Langevin diffusion for a generalized gamma distribution.\nTheorem 2.1 allows us to show that applying the transformation g(·) to the approximate SCIR process leads to a discretization-free SGLD algorithm for a generalized gamma distribution. Similarly, applying g⁻¹(·) to the approximate target of this SGLD algorithm leads to the desired Gamma(a, 1) distribution. Full details are given after the proof of Theorem 2.1. The result means that, similar to SGLD, we can replace the CIR parameter a with an unbiased estimate â created from a minibatch of data. Provided we re-estimate a from one iteration to the next using different minibatches, the approximate target distribution will still be Gamma(a, 1). As in SGLD, there will be added error based on the noise in the estimate â. 
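A single SCIR transition (2.3) only requires a scaled non-central chi-squared draw, which NumPy provides directly. The sketch below is ours (the synthetic data and names are illustrative assumptions), applying the update above to one gamma-distributed component:

```python
import numpy as np

def scir_step(theta, z, alpha, n, h, rng):
    """One SCIR update (2.3) for a single gamma-distributed component.

    theta: current state; z: length-N array of counts for this component.
    """
    N = len(z)
    batch = rng.choice(N, size=n, replace=False)
    a_hat = alpha + (N / n) * z[batch].sum()            # unbiased estimate of a
    scale = (1.0 - np.exp(-h)) / 2.0
    nonc = 2.0 * theta * np.exp(-h) / (1.0 - np.exp(-h))  # non-centrality
    return scale * rng.noncentral_chisquare(2.0 * a_hat, nonc)

# Sanity check on sparse synthetic counts: by Corollary 3.2 the long-run
# mean of the chain approaches a = alpha + sum(z).
rng = np.random.default_rng(2)
z = rng.binomial(1, 0.01, size=1000)
theta, samples = 1.0, []
for _ in range(5000):
    theta = scir_step(theta, z, alpha=0.1, n=100, h=0.5, rng=rng)
    samples.append(theta)
```

Note that the draw is nonnegative by construction, so no absolute values are needed, matching the positivity property of the CIR process quoted above.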
However, from the desirable properties of the CIR process we are able to quantify this error more easily than for the SGLD algorithm, and we do this in Section 3.\nAlgorithm 1 below summarizes how SCIR can be used to sample from the simplex parameter ω | z ∼ Dir(α + ∑_{i=1}^N z_i). This can be done in a similar way to SGRLD, with the same per-iteration computational cost, so the improvements we demonstrate later are essentially for free.\n\nAlgorithm 1: Stochastic Cox-Ingersoll-Ross (SCIR) for sampling from the probability simplex.\nInput: Starting points θ_0, stepsize h, minibatch size n.\nResult: Approximate sample from ω | z.\nfor m = 1 to M do\n  Sample minibatch S_m from {1, . . . , N}.\n  for j = 1 to d do\n    Set â_j ← α + N/n ∑_{i∈S_m} z_ij.\n    Sample θ̂_mj | θ̂_(m−1)j using (2.3) with parameter â_j and stepsize h.\n  end\n  Set ω_m ← θ_m / ∑_j θ_mj.\nend\n\nFigure 2: Kolmogorov-Smirnov distance for SGRLD and SCIR at different minibatch sizes when used to sample from (a) a sparse Dirichlet posterior and (b) a dense Dirichlet posterior.\n\n2.2 SCIR on Sparse Data\n\nWe test the SCIR process on two synthetic experiments. The first experiment is the running experiment on the sparse Dirichlet posterior of Section 1.3. The second experiment allocates 1000 datapoints equally to each component, leading to a highly dense Dirichlet posterior. For both experiments, we run 1000 iterations of optimally tuned SGRLD and SCIR algorithms and compare to an exact sampler. For the sparse experiment, Figure 1 shows boxplots of samples from the fifth component of ω, which is sparse. For both experiments, Figure 2 plots the Kolmogorov-Smirnov distance (d_KS) between the approximate samples and the true posterior (full details of the distance measure are given in the Supplementary Material). 
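The paper's exact distance measure is detailed in the Supplementary Material; as a rough illustration, a two-sample Kolmogorov-Smirnov statistic on one simplex component can be computed with SciPy. The setup below is our stand-in (using the running experiment's counts), not the paper's code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Exact sampler for the running experiment's posterior Dir(alpha + counts),
# built via normalized gammas.
alpha = 0.1
counts = np.array([800, 100, 100] + [0] * 7)
post = alpha + counts
X = rng.gamma(post, 1.0, size=(1000, 10))
omega_exact = X / X.sum(axis=1, keepdims=True)

# Stand-in "approximate" sample; in the paper this would come from SGRLD/SCIR.
omega_approx = rng.dirichlet(post, size=1000)

# KS distance on the sparse fifth component (index 4).
d_ks = stats.ks_2samp(omega_exact[:, 4], omega_approx[:, 4]).statistic
print(d_ks)  # small here, since both samples target the same posterior
```

Because the KS statistic is rank-based, it remains informative even when the component's values are extremely close to zero, which is exactly the sparse regime of interest.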
For the boxplots, a minibatch of size 10 is used; for the d_KS plots, the proportion of data in the minibatch is varied from 0.001 to 0.5. The d_KS plots show the runs of five different seeds, which gives some idea of variability.\nThe boxplots of Figure 1 demonstrate that the SCIR process is able to handle smaller values of ω much more readily than SGRLD. The impact of this is demonstrated in Figure 2a, the sparse d_KS plot. Here the SCIR process is achieving much better results than SGRLD, and converging towards the exact sampler at larger minibatch sizes. The dense d_KS plot of Figure 2b shows that as we move to the dense setting the samplers have similar properties. The conclusion is that the SCIR algorithm is a good choice of simplex sampler in either the dense or the sparse case.\n\n2.3 Extensions\n\nFor simplicity, in this article we have focused on a popular use case of SCIR: sampling from a Dir(α + ∑_{i=1}^N z_i) distribution, with z categorical. This method can be easily generalized, though. For a start, the SCIR algorithm is not limited to z being categorical, and it can be used to sample from most constructions that use Dirichlet distributions, provided the z are not integrated out. The method can also be used to sample from constrained spaces on (0, ∞) that are gamma distributed, by just sampling from the SCIR process itself (since the stationary distribution of the CIR process is gamma). There are other diffusion processes that have tractable transition densities. These can be exploited in a similar way to create other discretization-free SGMCMC samplers. One such process is geometric Brownian motion, which has a lognormal stationary distribution. This process can be adapted to create a stochastic sampler for the lognormal distribution on (0, ∞).\n\n3 Theoretical Analysis\n\nIn the following theoretical analysis we wish to target a Gamma(a, 1) distribution, where a = α + ∑_{i=1}^N z_i for some data z. 
We run an SCIR algorithm with stepsize h for M iterations, yielding the sample θ̂_m for m = 1, . . . , M. We compare this to an exact CIR process with stationary distribution Gamma(a, 1), defined by the transition equation in (2.2). We do this by deriving the moment generating function (MGF) of θ̂_m in terms of the MGF of the exact CIR process. For simplicity, we fix a stepsize h and, abusing notation slightly, set θ_m to be a CIR process that has been run for time mh.\nTheorem 3.1. Let θ̂_M be the SCIR process defined in (2.3), starting from θ_0, after M steps with stepsize h. Let θ_M be the corresponding exact CIR process, also starting from θ_0, run for time Mh, and with coupled noise. Then the MGF of θ̂_M is given by\n\nM_{θ̂_M}(s) = M_{θ_M}(s) ∏_{m=1}^M [ (1 − s(1 − e^{−mh})) / (1 − s(1 − e^{−(m−1)h})) ]^{−(â_m − a)},   (3.1)\n\nwhere we have\n\nM_{θ_M}(s) = [1 − s(1 − e^{−Mh})]^{−a} exp[ θ_0 s e^{−Mh} / (1 − s(1 − e^{−Mh})) ].\n\nThe proof of this result follows by induction from the properties of the non-central chi-squared distribution. The result shows that the MGF of the SCIR process can be written as the MGF of the exact underlying CIR process, multiplied by an error term in the form of a product. Deriving the MGF enables us to find the non-asymptotic bias and variance of the SCIR process, which are more interpretable than the MGF itself. The results are stated formally in the following Corollary.\nCorollary 3.2. 
Given the setup of Theorem 3.1,\n\nE[θ̂_M] = E[θ_M] = θ_0 e^{−Mh} + a(1 − e^{−Mh}).\n\nSince E_π[θ] = a, then lim_{M→∞} |(1/M) ∑_{m=1}^M E[θ̂_m] − E_π[θ]| = 0, and SCIR is asymptotically unbiased. Similarly,\n\nVar[θ̂_M] = Var[θ_M] + (1 − e^{−2Mh}) [(1 − e^{−h})/(1 + e^{−h})] Var[â],\n\nwhere Var[â] = Var[â_m] for m = 1, . . . , M, and\n\nVar[θ_M] = 2θ_0(e^{−Mh} − e^{−2Mh}) + a(1 − e^{−Mh})².\n\nThe results show that the approximate process is asymptotically unbiased. We believe this explains the improvements the method has over SGRLD in the experiments of Sections 2.2 and 4. We also obtain the non-asymptotic variance as a simple sum of the variance of the exact underlying CIR process and a quantity involving the variance of the estimate â. This is of a similar form to the strong error of SGLD (Sato and Nakagawa, 2014), though without the contribution from the discretization. The variance of the SCIR process is somewhat inflated over the variance of the CIR process. Reducing this variance would improve the properties of the SCIR process and would be an interesting avenue for further work. Control variate ideas could be applied for this purpose (Nagapetyan et al., 2017; Baker et al., 2017; Chatterji et al., 2018), and they may prove especially effective since the mode of a gamma distribution is known exactly.\n\n4 Experiments\n\nIn this section we empirically compare SCIR to SGRLD on two challenging models: latent Dirichlet allocation (LDA) and a Bayesian nonparametric mixture. Performance is evaluated by measuring the predictive performance of the trained model on a held-out test set over five different seeds. Stepsizes and hyperparameters are tuned using a grid search over the predictive performance of the method. The minibatch size is kept fixed for both experiments. 
In the Supplementary Material, we provide a comparison of the methods to a Gibbs sampler. This sampler is non-scalable, but will converge to the true posterior rather than an approximation. The aim of the comparison to Gibbs is to give the reader an idea of how the stochastic gradient methods compare to exact methods for the different models considered.\n\n4.1 Latent Dirichlet Allocation\n\nLatent Dirichlet allocation (LDA, see Blei et al., 2003) is a popular model used to summarize a collection of documents by clustering them based on underlying topics. The data for the model is a matrix of word frequencies, with a row for each document. LDA is based on a generative procedure. For each document l, a discrete distribution over the K potential topics, θ_l, is drawn as θ_l ∼ Dir(α) for some suitably chosen hyperparameter α. Each topic k is associated with a discrete distribution φ_k over all the words in a corpus, meant to represent the common words associated with particular topics. This is drawn as φ_k ∼ Dir(β), for some suitable β. Finally, each word in document l is assigned a topic k drawn from θ_l, and then the word itself is drawn from φ_k. LDA is a good example for this method because φ_k is likely to be very sparse: there are many words which will not be associated with a given topic at all.\n\nFigure 3: (a) plots the perplexity of SGRLD and SCIR when used to sample from the LDA model of Section 4.1 applied to Wikipedia documents; (b) plots the log predictive on a test set of the anonymous Microsoft user dataset, sampling the mixture model defined in Section 4.2 using SCIR and SGRLD.\n\nWe apply SCIR and SGRLD to LDA on a dataset of scraped Wikipedia documents, by adapting the code released by Patterson and Teh (2013). At each iteration a minibatch of 50 documents is sampled in an online manner. 
We use the same vocabulary set as in Patterson and Teh (2013), which consists of approximately 8000 words. The exponential of the average log-predictive on a held-out set of 1000 documents is calculated every 5 iterations to evaluate the model. This quantity is known as the perplexity, and we use a document completion approach to calculate it (Wallach et al., 2009). The perplexity is plotted for five runs using different seeds, which gives an idea of variability. Similar to Patterson and Teh (2013), for both methods we use a decreasing stepsize scheme of the form h_m = h[1 + m/τ]^{−κ}. The results are plotted in Figure 3a. While the initial convergence rate is similar, SCIR keeps descending past where SGRLD begins to converge. This experiment illustrates the impact of removing the discretization error. We would expect to see further improvements of SCIR over SGRLD if a larger vocabulary size were used, as this would lead to sparser topic vectors. In real-world applications of LDA, it is quite common to use vocabulary sizes above 8000. The comparison to a collapsed Gibbs sampler, provided in the Supplementary Material, shows the methods are quite competitive with exact, non-scalable methods.\n\n4.2 Bayesian Nonparametric Mixture Model\n\nWe apply SCIR to sample from a Bayesian nonparametric mixture model of categorical data, proposed by Dunson and Xing (2009). To the best of our knowledge, the development of SGMCMC methods for Bayesian nonparametric models has not been considered before. In particular, we develop a truncation-free, scalable sampler based on SGMCMC for Dirichlet processes (DP, see Ferguson, 1973). For more thorough details of DPs and the stochastic sampler developed, the reader is referred to the Supplementary Material. 
The model can be expressed as follows:

xi | θ, zi ∼ Multi(ni, θzi),    θ, zi ∼ DP(Dir(a), α).    (4.1)

Here Multi(m, φ) is a multinomial distribution with m trials and associated discrete probability distribution φ; DP(G0, α) is a DP with base distribution G0 and concentration parameter α. The DP component parameters and allocations are denoted by θ and zi respectively. We define the number of observations N by N := Σ_i ni, and let L be the number of instances of xi, i = 1, . . . , L. This type of mixture model is commonly used to model the dependence structure of categorical data, such as for genetic or natural language data (Dunson and Xing, 2009). The use of DPs means we can account for the fact that we do not know the true dependence structure. DPs allow us to learn the number of mixture components in a penalized way during the inference procedure itself.

We apply this model to the anonymous Microsoft user dataset (Breese et al., 1998). This dataset consists of approximately N = 10^5 instances of L = 30000 anonymized users. Each instance details part of the website the user visits, which is one of d = 294 categories (here d denotes the dimension of xi). We use the model to try to characterize the typical usage patterns of the website. Since there are a lot of categories and only an average of three observations for any one user, these data are expected to be sparse.

To infer the model, we devise a novel minibatched version of the slice sampler (Walker, 2007; Papaspiliopoulos, 2008; Kalli et al., 2011). We assign an uninformative gamma prior on α, and this is inferred similarly to Escobar and West (1995). We minibatch the users at each iteration using n = 1000.
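As an illustration of the generative model (4.1), the following sketch simulates from a truncated stick-breaking representation of the DP (Sethuraman, 1994). The truncation level T is for simulation only (the sampler described above is truncation free), and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d, T = 294, 50       # categories per observation; truncation level (illustrative)
a, conc = 0.5, 1.0   # Dir(a) base-measure parameter; DP concentration alpha

# Stick-breaking weights: v_k ~ Beta(1, alpha), w_k = v_k * prod_{j<k}(1 - v_j).
v = rng.beta(1.0, conc, size=T)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
w /= w.sum()         # renormalize the truncated weights

# Component parameters theta_k drawn from the base distribution Dir(a).
theta = rng.dirichlet(np.full(d, a), size=T)    # shape (T, d)

def generate_user(n_visits):
    """Draw one user's category counts x_i ~ Multi(n_i, theta_{z_i})."""
    z = rng.choice(T, p=w)        # cluster allocation z_i
    return rng.multinomial(n_visits, theta[z])

x = generate_user(3)  # an average of three observations per user
```

With a small concentration α, most of the stick-breaking mass falls on the first few components, which is how the DP penalizes the effective number of clusters.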
For multimodal mixture models such as this, SGMCMC methods are known to get stuck in local modes (Baker et al., 2017), so we use a fixed stepsize for both SGRLD and SCIR. Once again, we plot runs over 5 seeds to give an idea of variability. The results are plotted in Figure 3b. They show that SCIR consistently converges to a lower log predictive test score, and appears to have lower variance than SGRLD. SGRLD also appears to produce worse scores as the number of iterations increases. We found that SGRLD had a tendency to propose many more clusters than were required. This is probably due to the asymptotic bias of Proposition 1.1, since this would lead to an inferred model that has a higher α parameter than is set, meaning more clusters would be proposed than are needed. In fact, setting a higher α parameter appeared to alleviate this problem, but led to a worse fit, which is further evidence that this is the case.

In the Supplementary Material we provide plots comparing the stochastic gradient methods to the exact, but non-scalable, Gibbs slice sampler (Walker, 2007; Papaspiliopoulos, 2008; Kalli et al., 2011). The comparison shows that, while SCIR outperforms SGRLD, the scalable stochastic gradient approximation itself does not perform well in this case compared to the exact Gibbs sampler. This is to be expected for such a complicated model; the reason appears to be that the stochastic gradient methods get stuck in local stationary points. Improving the performance of stochastic gradient based samplers for Bayesian nonparametric problems is an important direction for future work.

5 Discussion

We presented an SGMCMC method, the SCIR algorithm, for simplex spaces. We showed that the method has no discretization error and is asymptotically unbiased. Our experiments demonstrate that these properties give the sampler improved performance over other SGMCMC methods for sampling from sparse simplex spaces.
Many important large-scale models are sparse, so this is an important contribution. A number of useful theoretical properties for the sampler were derived, including the non-asymptotic variance and moment generating function. We discussed how this sampler can be extended to target other constrained spaces in a discretization-free manner. Finally, we demonstrated the impact of the sampler on a variety of interesting problems. An interesting line of further work would be reducing the non-asymptotic variance, which could be done by means of control variates.

6 Acknowledgments

Jack Baker gratefully acknowledges the support of the EPSRC funded EP/L015692/1 STOR-i Centre for Doctoral Training. Paul Fearnhead was supported by EPSRC grants EP/K014463/1 and EP/R018561/1. Christopher Nemeth acknowledges the support of EPSRC grants EP/S00159X/1 and EP/R01860X/1. Emily Fox acknowledges the support of ONR Grant N00014-15-1-2380 and NSF CAREER Award IIS-1350133.

References

Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. (2017). Control variates for stochastic gradient MCMC. Available from https://arxiv.org/abs/1706.05439.

Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1(2):353–355.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Breese, J. S., Heckerman, D., and Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 43–52.

Chatterji, N. S., Flammarion, N., Ma, Y.-A., Bartlett, P. L., and Jordan, M. I. (2018). On the theory of variance reduction for stochastic gradient Monte Carlo. Available at https://arxiv.org/abs/1802.05431v1.

Chen, T., Fox, E., and Guestrin, C. (2014).
Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, pages 1683–1691. PMLR.

Cox, J. C., Ingersoll, J. E., and Ross, S. A. (1985). A theory of the term structure of interest rates. Econometrica, 53(2):385–407.

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems 27, pages 3203–3211.

Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. (2016). Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems 29, pages 1154–1162.

Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104(487):1042–1051.

Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.

Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101:5228–5235.

Kalli, M., Griffin, J. E., and Walker, S. G. (2011). Slice sampling mixture models. Statistics and Computing, 21(1):93–105.

Li, W., Ahn, S., and Welling, M. (2016). Scalable MCMC for mixed membership stochastic blockmodels.
In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 723–731.

Liverani, S., Hastie, D., Azizi, L., Papathomas, M., and Richardson, S. (2015). PReMiuM: An R package for profile regression mixture models using Dirichlet processes. Journal of Statistical Software, 64(7):1–30.

Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925.

Nagapetyan, T., Duncan, A., Hasenclever, L., Vollmer, S. J., Szpruch, L., and Zygalakis, K. (2017). The true cost of stochastic gradient Langevin dynamics. Available at https://arxiv.org/abs/1706.02692.

Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet mixture models. Technical Report. Available at http://wrap.warwick.ac.uk/35493/1/WRAP_papaspliiopoulos_08-20wv2.pdf.

Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95(1):169–186.

Patterson, S. and Teh, Y. W. (2013). Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems 26, pages 3102–3110.

Rosenblatt, M. (1952). Remarks on a multivariate transformation. The Annals of Mathematical Statistics, 23(3):470–472.

Sato, I. and Nakagawa, H. (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In Proceedings of the 31st International Conference on Machine Learning, pages 982–990. PMLR.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639–650.

Teh, Y. W., Thiéry, A. H., and Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics.
Journal of Machine Learning Research, 17(7):1–33.

Vollmer, S. J., Zygalakis, K. C., and Teh, Y. W. (2016). Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(159):1–48.

Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics, 36(1):45–54.

Wallach, H. M., Murray, I., Salakhutdinov, R., and Mimno, D. (2009). Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1105–1112. PMLR.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pages 681–688. PMLR.

Zygalakis, K. C. (2011). On the existence and the applications of modified equations for stochastic differential equations. SIAM Journal on Scientific Computing, 33(1):102–130.