{"title": "Poisson-Minibatching for Gibbs Sampling with Convergence Rate Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 4922, "page_last": 4931, "abstract": "Gibbs sampling is a Markov chain Monte Carlo method that is often used for learning and inference on graphical models.\nMinibatching, in which a small random subset of the graph is used at each iteration, can help make Gibbs sampling scale to large graphical models by reducing its computational cost.\nIn this paper, we propose a new auxiliary-variable minibatched Gibbs sampling method, {\\it Poisson-minibatching Gibbs}, which both produces unbiased samples and has a theoretical guarantee on its convergence rate. \nIn comparison to previous minibatched Gibbs algorithms, Poisson-minibatching Gibbs supports fast sampling from continuous state spaces and avoids the need for a Metropolis-Hastings correction on discrete state spaces.\nWe demonstrate the effectiveness of our method on multiple applications and in comparison with both plain Gibbs and previous minibatched methods.", "full_text": "Poisson-Minibatching for Gibbs Sampling with\n\nConvergence Rate Guarantees\n\nRuqi Zhang\n\nCornell University\n\nrz297@cornell.edu\n\nChristopher De Sa\nCornell University\n\ncdesa@cs.cornell.edu\n\nAbstract\n\nGibbs sampling is a Markov chain Monte Carlo method that is often used for learn-\ning and inference on graphical models. Minibatching, in which a small random\nsubset of the graph is used at each iteration, can help make Gibbs sampling scale\nto large graphical models by reducing its computational cost. In this paper, we\npropose a new auxiliary-variable minibatched Gibbs sampling method, Poisson-\nminibatching Gibbs, which both produces unbiased samples and has a theoretical\nguarantee on its convergence rate. 
In comparison to previous minibatched Gibbs algorithms, Poisson-minibatching Gibbs supports fast sampling from continuous state spaces and avoids the need for a Metropolis-Hastings correction on discrete state spaces. We demonstrate the effectiveness of our method on multiple applications and in comparison with both plain Gibbs and previous minibatched methods.

1 Introduction

Gibbs sampling is a Markov chain Monte Carlo (MCMC) method which is widely used for inference on graphical models [7]. Gibbs sampling works by iteratively resampling a variable from its conditional distribution with the remaining variables fixed. Although Gibbs sampling is a powerful method, its utility can be limited by its computational cost when the model is large. One way to address this is to use stochastic methods, which use a subsample of the dataset or model, called a minibatch, to approximate the dataset or model used in an MCMC algorithm. Minibatched variants of many classical MCMC algorithms have been explored [18, 10, 3, 9], including the MIN-Gibbs algorithm for Gibbs sampling [3].

In this paper, we propose a new minibatched variant of Gibbs sampling on factor graphs called Poisson-minibatching Gibbs (Poisson-Gibbs). Like other minibatched MCMC methods, Poisson-minibatching Gibbs improves Gibbs sampling by reducing its computational cost. In comparison to prior work, our method improves upon MIN-Gibbs in two ways. First, it eliminates the need for a potentially expensive Metropolis-Hastings (M-H) acceptance step, giving it a better asymptotic per-iteration time complexity than MIN-Gibbs. Poisson-minibatching Gibbs is able to do this by choosing a minibatch in a way that depends on the current state of the variables, rather than choosing one that is independent of the current state as is usually done in stochastic algorithms.
We show that such state-dependent minibatches can still be sampled quickly, and that an appropriately chosen state-dependent minibatch can result in a reversible Markov chain with the correct stationary distribution even without a Metropolis-Hastings correction step.

The second way that our method improves upon previous work is that it supports sampling over continuous state spaces, which are common in machine learning applications (in comparison, the previous work only supported sampling over discrete state spaces). The main difficulty here for Gibbs sampling is that resampling a continuous-valued variable from its conditional distribution requires sampling from a continuous distribution, and this is a nontrivial task (as compared with a discrete random variable, which can be sampled from by explicitly computing its probability mass function). Our approach is based on the fast inverse transform sampling method, which works by approximating the probability density function (PDF) of a distribution with a polynomial [13].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

State Space   Algorithm                                          Computational Cost/Iter
Discrete      Gibbs sampling                                     O(ΔD)
              MIN-Gibbs [3]                                      O(Λ²D)
              MGPMH [3]                                          O(L²D + Δ)
              DoubleMIN-Gibbs [3]                                O(L²D + Λ²)
              Poisson-Gibbs                                      O(L²D)
Continuous    Gibbs with rejection sampling                      O(NΔ)
              PGITS: Poisson-Gibbs with ITS                      O(L³)
              PGDA: Poisson-Gibbs with double approximation      O(L² log L)

Table 1: Computational cost of a single iteration of Gibbs sampling variants. Here, N is the number of steps rejection sampling requires to accept a sample; the remaining parameters are defined in Section 1.1.

In addition to these two new capabilities, we prove bounds on the convergence rate of Poisson-minibatching Gibbs in comparison to plain (i.e. not minibatched) Gibbs sampling.
These bounds provide a recipe for how to set the minibatch size in order to come close to the convergence rate of plain Gibbs sampling. If we set the minibatch size in this way, we can derive expressions for the per-iteration computational cost of our method compared with others; these bounds are summarized in Table 1. In summary, the contributions of this paper are as follows:

• We introduce Poisson-minibatching Gibbs, a variant of Gibbs sampling which can reduce computational cost without adding bias or needing a Metropolis-Hastings correction step.
• We extend our method to sample from continuous-valued distributions.
• We prove bounds on the convergence rate of our algorithm, as measured by the spectral gap, on both discrete and continuous state spaces.
• We evaluate Poisson-minibatching Gibbs empirically, and show that its performance can match that of plain Gibbs sampling while using less computation at each iteration.

1.1 Preliminaries and Definitions

In this section, we present some background about Gibbs sampling and graphical models and give the definitions which will be used throughout the paper. We consider Gibbs sampling on a factor graph [7], a type of graphical model that defines a probability distribution in terms of its factors. Explicitly, a factor graph consists of a set of variables V (each of which can take on values in some set X) and a set of factors Φ, and it defines a probability distribution π over a state space Ω = X^V, where the probability of some x ∈ Ω is

π(x) = (1/Z) · exp( Σ_{φ∈Φ} φ(x) ) = (1/Z) · Π_{φ∈Φ} exp(φ(x)).

Here, Z denotes the scalar factor necessary for π to be a distribution. Equivalently, we can think of this as the Gibbs measure with energy function

U(x) = Σ_{φ∈Φ} φ(x), where π(x) ∝ exp(U(x));

this formulation will prove to be useful in many of the derivations later in the paper.
(Here, the ∝ notation denotes that the expression on the left is a distribution proportional to the expression on the right, with the appropriate constant of proportionality to make it a distribution.) In a factor graph, the factors typically depend on only a subset of the variables; we can represent this as a bipartite graph where the node sets are V and Φ and where we draw an edge between a variable i ∈ V and a factor φ ∈ Φ if φ depends on i. For simplicity, in this paper we assume that the variables are indexed with natural numbers V = {1, . . . , n}. We denote the set of factors that depend on the ith variable as

A[i] = {φ ∈ Φ | φ depends on variable i}.

An important property of a factor graph is that the conditional distribution of a variable can be computed using only the factors that depend on that variable. This lends itself to a particularly efficient implementation of Gibbs sampling, in which only these adjacent factors are used at each iteration (rather than needing to evaluate the whole energy function U): this is illustrated in Algorithm 1.

Algorithm 1 Gibbs Sampling
  Input: initial point x
  loop
    sample variable i ∼ Unif{1, . . . , n}
    for all v ∈ X do
      x(i) ← v
      U_v ← Σ_{φ∈A[i]} φ(x)
    end for
    construct distribution ρ where ρ(v) ∝ exp(U_v)
    sample v from ρ
    update x(i) ← v
    output sample x
  end loop

The performance of our algorithm will depend on several parameters of the graphical model, which we now restate from previous work on MIN-Gibbs [3]. If the variables take on discrete values, we let D = |X| denote the number of values each can take on. We let Δ = max_i |A[i]| denote the maximum degree of the graph. We assume that the magnitudes of the factor functions are all bounded, and for any φ we let M_φ denote this bound:

M_φ = ( sup_{x∈Ω} φ(x) ) − ( inf_{x∈Ω} φ(x) ).

Without loss of generality (and as was done in previous work [3]), we will assume that 0 ≤ φ(x) ≤ M_φ, because we can always add a constant to any factor without changing the distribution π. We define the local maximum energy L and total maximum energy Λ of the graph as bounds on the sum of M_φ over the set of factors associated with a single variable i and over the whole graph, respectively:

L = max_{i∈{1,...,n}} Σ_{φ∈A[i]} M_φ and Λ = Σ_{φ∈Φ} M_φ.

If the graph is very large and has many low-energy factors, the maximum energy of the graph can be much smaller than the maximum degree of the graph. All runtime analyses in this paper assume that evaluating a factor and sampling from a small discrete distribution can be done in constant time.

2 Poisson-Minibatching Gibbs Sampling

In this section, we introduce the idea of Poisson-minibatching in the setting where we can sample from the conditional distribution of x(i) exactly; one such example is when the state space of x is discrete. We consider how to sample from the conditional distribution when exact sampling is impossible in the next section.

In plain Gibbs sampling, we have to compute the sum over all the factors in A[i] to get the energy at every step. When the graph is large, this computation can be expensive; for example, in the discrete case this cost is proportional to ΔD. The main idea of Poisson-minibatching is to augment a desired distribution with extra Poisson random variables, which control how and whether a factor is used in the minibatch for a particular iteration.
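For concreteness, the baseline that minibatching accelerates, one step of plain Gibbs sampling on a factor graph (Algorithm 1), can be sketched as follows. This is a minimal illustration, not the paper's released code; all names here are ours.

```python
import math
import random

def gibbs_step(x, values, factors_of, rng=random):
    """One step of plain Gibbs sampling (Algorithm 1) on a factor graph.

    x          : list, x[i] is the current value of variable i
    values     : the finite set X of values each variable can take
    factors_of : factors_of[i] is the list A[i] of factor functions
                 phi(x) adjacent to variable i
    """
    i = rng.randrange(len(x))                 # i ~ Unif{1, ..., n}
    energies = []
    for v in values:                          # for all v in X
        x[i] = v
        energies.append(sum(phi(x) for phi in factors_of[i]))
    # construct rho(v) proportional to exp(U_v); subtract max for stability
    top = max(energies)
    weights = [math.exp(u - top) for u in energies]
    x[i] = rng.choices(values, weights=weights)[0]
    return x
```

Note that every step sums over all of A[i]; Poisson-minibatching replaces that full sum with a sparse, state-dependent subset of the adjacent factors.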
Maclaurin and Adams [10] used a similar idea, with augmented Bernoulli variables, to control whether a data point is included in the minibatch. However, that method has been shown to be very inefficient when only a small fraction of the Bernoulli variables is updated in each iteration [15]. Our method does not suffer from the same issue because it uses Poisson variables, as we explain later in this section.

We define the conditional distribution of the additional variable s_φ for each factor φ as

s_φ | x ∼ Poisson( λ M_φ / L + φ(x) ),

where λ > 0 is a hyperparameter that controls the minibatch size. Then the joint distribution of the variables x and s, where s is a vector containing all the s_φ, is π(x, s) = π(x) · P(s | x), and so

π(x, s) ∝ exp( Σ_{φ∈Φ} ( s_φ log(1 + L φ(x)/(λ M_φ)) + s_φ log(λ M_φ / L) − log(s_φ!) ) ).   (1)

Using (1) allows us to compute conditional distributions (of the variables x(i)) using only a subset of the factors. This is because a factor φ will not contribute to the energy unless s_φ is greater than zero. If many s_φ are zero, then we only need to compute the energy over a small set of factors. Since

E[ |{φ ∈ A[i] | s_φ > 0}| ] ≤ E[ Σ_{φ∈A[i]} s_φ ] = Σ_{φ∈A[i]} ( λ M_φ / L + φ(x) ) ≤ λ + L,

λ + L is an upper bound on the expected number of nonzero s_φ. When the graph is very large and has many low-energy factors, λ + L can be much smaller than the size of the factor set, in which case only a small set of factors will contribute to the energy while most factor terms disappear because s_φ is zero.

Using Poisson auxiliary variables has two benefits. First, compared with the Bernoulli auxiliary variables described in FlyMC [10], there is a simple method for sampling n Poisson random variables in total expected time proportional to the sum of their parameters, which can be much smaller than n [3].
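The auxiliary-variable construction above can be sketched in a few lines. This is an illustrative sketch (names are ours), assuming the factors are normalized so that 0 ≤ φ(x) ≤ M_φ:

```python
import numpy as np

def poisson_minibatch_energy(x, factors, M, lam, L, rng):
    """Sample s_phi | x ~ Poisson(lam * M_phi / L + phi(x)) for each factor
    and return the minibatch energy

        U = sum over the kept factors of s_phi * log(1 + L*phi(x)/(lam*M_phi)),

    which is the x-dependent part of the joint distribution (1). Only the
    factors with s_phi > 0 are kept (and hence evaluated in the energy).
    """
    U, kept = 0.0, 0
    for phi, M_phi in zip(factors, M):
        s = rng.poisson(lam * M_phi / L + phi(x))
        if s > 0:
            U += s * np.log1p(L * phi(x) / (lam * M_phi))
            kept += 1
    return U, kept
```

Averaged over draws, the number of kept factors is at most λ + L, matching the bound above.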
This means that sampling n Poisson variables can be much more efficient than sampling n Bernoulli variables, which allows our method to avoid the inefficiencies that sampling Bernoulli variables causes in FlyMC. Second, compared with a fixed-minibatch-size method such as the one used in [18], Poisson-minibatching has the important property that the variables s_φ are independent: whether one factor is contained in the minibatch is independent of whether any other is. This property is necessary for proving the convergence rate theorems in this paper.

In Poisson-Gibbs, we sample from the joint distribution alternately. At each iteration we (1) first re-sample all the s_φ, then (2) choose a variable index i and re-sample x(i). Here, we can reduce the state back to only x, since the future distribution never depends on the current value of s. Essentially, we only bother to re-sample the s_φ on which our eventual re-sampling of x(i) depends: statistically, this is equivalent to re-sampling all of s. Doing this corresponds to Algorithm 2.

However, minibatching by itself does not mean that the method must be more effective than plain Gibbs sampling. It is possible that the convergence rate of the minibatched chain becomes so much slower than the original rate that the total cost of the minibatch method is larger than that of the baseline method, even if the cost of each step is smaller. To rule out this undesirable situation, we prove that the convergence speed of our chain is not slowed down, or at least not by too much, after applying minibatching. To do this, we bound the convergence rate of our algorithm as measured by the spectral gap [8], which is the gap between the largest and second-largest eigenvalues of the chain's transition operator. This gap has been used previously to measure the convergence rate of minibatched MCMC [3].

Theorem 1. Poisson-Gibbs (Algorithm 2) is reversible and has stationary distribution π.
Let γ̄ denote its spectral gap, and let γ denote the spectral gap of plain Gibbs sampling. If we use a minibatch size parameter λ ≥ 2L, then

γ̄ ≥ exp(−4L²/λ) · γ.

This theorem guarantees that the convergence rate of Poisson-Gibbs will not be slowed down by more than a factor of exp(4L²/λ). If we set λ = Θ(L²), then this factor becomes O(1), which is independent of the size of the problem. We proved Theorem 1 and the other theorems in this paper using the technique of Dirichlet forms, which is a standard way of comparing the spectral gaps of two chains by comparing their transition probabilities (more details are in the supplemental material).

Next, we derive expressions for the overall computational cost of Algorithm 2, supposing that we set λ = Θ(L²) as suggested by Theorem 1. First, we need to evaluate the cost of sampling all the Poisson-distributed s_φ. While a naïve approach to sampling these would take O(Δ) time, we can do it substantially faster. For brevity, and because much of the technique is already described in previous work [3], we defer an explicit analysis to the supplementary material and just state the following.

Statement 1. Sampling all the auxiliary variables s_φ for φ ∈ A[i] can be done in average time O(λ + L), resulting in a sparse vector s.

Now, to get an overall cost when assuming exact sampling from the conditional distribution, we consider discrete state spaces, in which we can sample from the conditional distribution of x(i) exactly. In this case, the cost of a single iteration of Poisson-Gibbs is dominated by the loop over v. This loop runs D times, and each iteration takes O(|S|) time. On average, this gives an overall runtime of O((λ + L) · D) = O(L²D) for Poisson-Gibbs.
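Putting the pieces together, one iteration of the discrete-state sampler (Algorithm 2) might look like the following sketch: regenerate the auxiliary variables for the factors adjacent to a random variable i, then resample x(i) using only the surviving minibatch. This is an illustrative implementation under the paper's normalization assumption 0 ≤ φ(x) ≤ M_φ, not the released code.

```python
import numpy as np

def poisson_gibbs_step(x, values, factors_of, M, lam, L, rng):
    """One iteration of Poisson-Gibbs on a discrete state space.

    factors_of[i] : list A[i] of (factor_id, phi) pairs adjacent to variable i
    M[factor_id]  : bound with 0 <= phi(x) <= M[factor_id]
    lam           : minibatch-size parameter (Theorem 1 wants lam >= 2L)
    """
    i = rng.integers(len(x))                  # i ~ Unif{1, ..., n}
    # regenerate auxiliary variables for the adjacent factors only
    minibatch = []                            # S = {phi : s_phi > 0}
    for fid, phi in factors_of[i]:
        s = rng.poisson(lam * M[fid] / L + phi(x))
        if s > 0:
            minibatch.append((s, fid, phi))
    # resample x(i): each of the D values costs O(|S|) factor evaluations
    energies = []
    for v in values:
        x[i] = v
        energies.append(sum(s * np.log1p(L * phi(x) / (lam * M[fid]))
                            for s, fid, phi in minibatch))
    w = np.exp(np.array(energies) - max(energies))
    x[i] = values[rng.choice(len(values), p=w / w.sum())]
    return x
```

On a two-variable toy model this chain leaves the target distribution invariant without any M-H correction, as Theorem 1 asserts.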
Note that due to the fast way we sample the Poisson variables, the cost of sampling them is negligible compared to the other costs.

In comparison, the costs of the previous algorithms MIN-Gibbs, MGPMH, and DoubleMIN-Gibbs [3] are all larger in big-O than that of Poisson-Gibbs, as shown in Table 1. MGPMH and DoubleMIN-Gibbs need to conduct an M-H correction, which adds to the cost, and the costs of MIN-Gibbs and DoubleMIN-Gibbs depend on Λ, which is a global statistic. By contrast, our method needs no additional M-H step and does not depend on global statistics. Thus Poisson-minibatching can reduce the total cost of Gibbs sampling further than the previous methods can.

Application of Poisson-Minibatching to Metropolis-Hastings. The Poisson-minibatching method can be applied to other MCMC methods, not just Gibbs sampling. To illustrate its general applicability, we applied Poisson-minibatching to Metropolis-Hastings sampling; we call the result Poisson-MH (details of this algorithm and a demonstration on a mixture of Gaussians are given in the supplemental material). We get the following convergence rate bound.

Theorem 2. Poisson-MH is reversible and has stationary distribution π. If we let γ̄ denote its spectral gap, and let γ denote the spectral gap of plain M-H sampling with the same proposal and target distributions, then

γ̄ ≥ (1/2) · exp( −L²/(λ + L) ) · γ.

3 Poisson-Gibbs on Continuous State Spaces

In this section, we consider how to sample from a continuous conditional distribution, i.e. when X = [a, b] ⊂ ℝ, without sacrificing the benefits of Poisson-minibatching. The main difficulty is that sampling from an arbitrary continuous conditional distribution is not trivial in the way that sampling from an arbitrary discrete conditional distribution is. Some additional sampling method is required.
In principle, we can combine any sampling method with Poisson-minibatching, such as the rejection sampling that is commonly used in Gibbs sampling. However, rejection sampling needs to evaluate the energy multiple times per sample, so even if we reduce the cost of evaluating the energy by minibatching, the total cost can still be large; besides which, there is no good guarantee on the convergence rate of rejection sampling.

In order to sample from the conditional distribution efficiently, we propose a new sampling method based on inverse transform sampling (ITS). The main idea is to approximate the continuous distribution with a polynomial; this requires only a number of energy function evaluations proportional to the degree of the polynomial. We provide the overall cost and a theoretical analysis of the convergence rate of our method.

Poisson-Gibbs with Double Chebyshev Approximation. Inverse transform sampling is a classical method that generates samples from a uniform distribution and then transforms them by the inverse of the cumulative distribution function (CDF) of the desired distribution. Since the CDF is often intractable in practice, Fast Inverse Transform Sampling (FITS) [13] uses a Chebyshev polynomial approximation to estimate the PDF quickly and then obtains the CDF by computing the integral of a polynomial. Inspired by FITS, we propose Poisson-Gibbs with double Chebyshev approximation (PGDA).

The main idea of double Chebyshev approximation is to approximate first the energy function and then the PDF, using Chebyshev approximation twice. Specifically, we first get a polynomial approximation to the energy function U on [a, b], denoted Ũ, the Chebyshev interpolant [17]

Ũ(x) = Σ_{k=0}^{m} α_k T_k( 2(x − a)/(b − a) − 1 ),  α_k ∈ ℝ, x ∈ [a, b],   (2)

where T_k(x) = cos(k cos⁻¹ x) is the degree-k Chebyshev polynomial.
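An interpolant of the form (2) can be built with standard tools; a sketch using NumPy's Chebyshev class on an illustrative analytic energy (not one from the paper):

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Degree-m Chebyshev interpolant of an energy U on [a, b], as in Eq. (2).
# Only m + 1 evaluations of U (at the Chebyshev nodes) are needed.
a, b, m = 0.0, 1.0, 12
U = lambda t: np.cos(3.0 * t) + 0.5 * t ** 2   # illustrative analytic energy
U_tilde = Chebyshev.interpolate(U, m, domain=[a, b])

# For analytic U the interpolation error decays geometrically in m:
ts = np.linspace(a, b, 201)
max_err = float(np.max(np.abs(U_tilde(ts) - U(ts))))
```

Sampling then uses a second interpolant, of exp(Ũ), whose antiderivative yields a polynomial CDF.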
Although the domain is continuous, we only need to evaluate U at m + 1 Chebyshev nodes to construct the interpolant, and the expansion coefficients α_k can be computed stably in O(m log m) time. The following theorem shows that the error of a Chebyshev approximation can be made arbitrarily small by taking m large. (Although stated for the case [a, b] = [−1, 1], it easily generalizes to arbitrary [a, b].)

Theorem 3 (Theorem 8.2 from Trefethen [17]). Assume U is analytic in the open Bernstein ellipse B([−1, 1], ρ), where the Bernstein ellipse is a region in the complex plane bounded by an ellipse with foci at ±1 and semimajor-plus-semiminor axis length ρ > 1. If |U(x)| ≤ V for some constant V > 0 and all x ∈ B([−1, 1], ρ), the error of the Chebyshev interpolant on [−1, 1] is bounded by

|Ũ(x) − U(x)| ≤ ε_m, where ε_m = 4Vρ⁻ᵐ/(ρ − 1).

Algorithm 2 Poisson-Gibbs
  given: initial state x ∈ Ω
  loop
    sample variable i ∼ Unif{1, . . . , n}
    for all φ in A[i] do
      sample s_φ ∼ Poisson( λ M_φ / L + φ(x) )
    end for
    S ← {φ | s_φ > 0}
    for all v ∈ X do
      x(i) ← v
      U_v ← Σ_{φ∈S} s_φ log(1 + L φ(x)/(λ M_φ))
    end for
    construct distribution ρ where ρ(v) ∝ exp(U_v)
    sample v from ρ
    update x(i) ← v
    output sample x
  end loop

Algorithm 3 PGDA: Poisson-Gibbs with Double Chebyshev Approximation
  given: state x ∈ Ω, degrees m and k, domain [a, b]
  loop
    set i, s, S, and U as in Algorithm 2
    construct a degree-m Chebyshev polynomial approximation Ũ_v of the energy U_v on [a, b]
    construct a degree-k Chebyshev polynomial approximation f̃(v) ≈ exp(Ũ_v)
    compute the CDF polynomial

      F̃(v) = ( ∫_a^b f̃(y) dy )⁻¹ · ∫_a^v f̃(y) dy

    sample u ∼ Unif[0, 1]
    solve the root-finding problem F̃(v) = u for v
    ▷ Metropolis-Hastings correction:
    p ← ( exp(U_v) · f̃(x(i)) ) / ( exp(U_{x(i)}) · f̃(v) )
    with probability min(1, p), set x(i) ← v
    output sample x
  end loop

After getting the approximation of the energy, we can obtain the PDF as exp(Ũ). However, it is generally hard to obtain the CDF at this point, since the integral of exp(Ũ) for a polynomial Ũ is usually intractable. So we use another Chebyshev approximation f̃ to estimate exp(Ũ). Constructing this second Chebyshev approximation requires no additional evaluations of the energy function; its total computational cost is Õ(mk), because we need to evaluate a degree-m polynomial k times to compute the coefficients. After doing this, we are able to compute the CDF directly, since it is the integral of a polynomial.

With the CDF F̃(x) in hand, inverse transform sampling is used to generate samples. First, a pseudorandom sample u is generated from the uniform distribution on [0, 1], and then we solve the root-finding problem F̃(x) = u for x. Since F̃(x) is a polynomial, this root-finding problem can be solved by many standard methods; we use the bisection method to ensure the robustness of the algorithm [13].

Importantly, the sample we get here is actually drawn from an approximation of the CDF. To correct the error introduced by the polynomial approximation, we add an M-H correction as the final step, to make sure the samples come from the target distribution. Our algorithm is given in Algorithm 3. As before, we prove a bound on PGDA in terms of the spectral gap, under the additional assumption that the factors are analytic.

Theorem 4. PGDA (Algorithm 3) is reversible and has stationary distribution π. Let γ̄ denote its spectral gap, and let γ denote the spectral gap of plain Gibbs sampling.
Assume ρ > 1 is some constant such that every factor function φ, treated as a function of any single variable x(i), can be analytically continued to the Bernstein ellipse with radius parameter ρ, shifted and scaled so that its foci are at a and b, and satisfies |φ(z)| ≤ M_φ anywhere in that ellipse. Then, if λ log 2 ≥ 4L, and if m is set large enough that 4ρ^(−m/2) ≤ √ρ − 1, it will hold that

γ̄ ≥ (1 − 4√z) · exp(−4L²/λ) · γ, where z = 4 · exp(8L) · ρ^(−k/2)/(√ρ − 1) + exp( 16L · ρ^(−m/2)/(√ρ − 1) ) − 1.

Similar to Theorem 1, this theorem implies that the convergence rate of PGDA can be slowed down by at most a constant factor relative to plain Gibbs. If we set m = Θ(log L), k = Θ(L), and λ = Θ(L²), then the ratio of the spectral gaps will also be O(1), which is independent of the problem parameters.

Figure 1: (a) Marginal error comparison among Poisson-Gibbs and previous methods on a Potts model. (b) Marginal error of Poisson-Gibbs for varying values of λ on a Potts model. (c) Symmetric KL divergence comparison among PGITS, PGDA, and previous methods on a continuous spin model.

Figure 2: Runtime comparisons with the same experimental setting as in Figure 1.

Note that it is possible to combine FITS with Poisson-Gibbs directly (i.e. use only one polynomial approximation to estimate the PDF directly); we call this method Poisson-Gibbs with fast inverse transform sampling (PGITS). It turns out that PGDA is more efficient than PGITS, since PGDA requires fewer evaluations of U to achieve the same convergence rate. If we set the parameters as above, the total computational cost of PGDA is O(m · (λ + L) + m · k) = O(log L · (L² + L)) = O(L² log L). On the other hand, the cost of PGITS to achieve the same constant-factor spectral gap ratio is O(L³).
A derivation of this is given in the supplemental material.

4 Experiments

We demonstrate our methods on three tasks, Potts models, continuous spin models, and a truncated Gaussian mixture, in comparison with plain Gibbs sampling and previous minibatched Gibbs sampling. We release the code at https://github.com/ruqizhang/poisson-gibbs.

4.1 Potts Models

We first test the performance of Poisson-minibatching Gibbs sampling on the Potts model [14], as in De Sa et al. [3]. The Potts model is a generalization of the Ising model [6] with domain {1, . . . , D} over an N × N lattice. The energy of a configuration is

U(x) = Σ_{i=1}^{n} Σ_{j=1}^{n} β · A_ij · 𝟙(x(i), x(j)),

where the function 𝟙 equals one only when x(i) = x(j) and zero otherwise, A_ij is the interaction between sites i and j, and β is the inverse temperature. As was done in previous work, we set the model to be fully connected, and the interaction A_ij is determined by the distance between site i and site j based on a Gaussian kernel [3]. The graph has n = N² = 400 variables in total, β = 4.6, and D = 10. On this model, L = 5.09.

We first compare our method with two other methods: plain Gibbs sampling and the most efficient of the MIN-Gibbs methods on this task, DoubleMIN-Gibbs. Note that, in comparison to our method, DoubleMIN-Gibbs needs an additional M-H correction step, which requires a second minibatch to be sampled. We set λ = 1 · L² for all minibatch methods. We tried two values for the second minibatch size of DoubleMIN-Gibbs: λ₂ = 1 · L² and 10⁴ · L². We compute running-average marginal distributions for each variable by collecting samples. By symmetry, the marginal for each variable in the stationary distribution is uniform, so the ℓ₂-distance between the estimated marginals and the uniform distribution can be used to evaluate the convergence of the Markov chain.
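The convergence metric just described can be computed as below. This is a sketch of the metric, not the evaluation code used in the experiments, assuming samples are encoded as integers in {0, ..., D−1}:

```python
import numpy as np

def marginal_error(samples, D):
    """Average, over variables, of the l2 distance between each variable's
    empirical marginal distribution and the uniform distribution on D values.

    samples : integer array of shape (num_samples, n), entries in {0,...,D-1}
    """
    samples = np.asarray(samples)
    errs = []
    for i in range(samples.shape[1]):
        marginal = np.bincount(samples[:, i], minlength=D) / samples.shape[0]
        errs.append(np.linalg.norm(marginal - 1.0 / D))
    return float(np.mean(errs))
```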
We report this\nmarginal error averaged over three runs.\nFigure 1a shows the `2-distance marginal error as a function of iterations. We observe that Poisson-\nGibbs performs comparably with plain Gibbs and it outperforms DoubleMIN-Gibbs signi\ufb01cantly\nespecially when 2 is not large enough. The performance of DoubleMIN-Gibbs is highly in\ufb02uenced\nby the size of the second minibatch. We have to increase the second minibatch to 104 \u00b7 L2 in order to\nmake it converge. This is because the variance of M-H correction will be very large when the second\nminibatch is not large enough. On the other hand, Poisson-Gibbs does not require an additional M-H\ncorrection which not only reduces the computational cost but also improves stability. In Figure 1b,\nwe show the performance of our method with different values of . When we increase the minibatch\nsize, the convergence speed of Poisson-Gibbs approaches plain Gibbs, which validates our theory.\nThe number of factors being evaluated of Poisson-Gibbs varies each iteration, thus we report the\naverage number which are 7, 28 and 132 respectively for = 0.1 \u00b7 L2, 1 \u00b7 L2 and 5 \u00b7 L2.\nThe runtime comparisons with the same setup are reported in Figure 2a and 2b to demonstrate\nthe computational speed-up of Poisson-Gibbs empirically. We can see that the results align with\nour theoretical analysis: Poisson-Gibbs is signi\ufb01cantly faster than plain Gibbs samping and faster\nthan previous minibatched Gibbs sampling methods. Compared to plain Gibbs, Poisson-Gibbs\nspeeds up the computation by evaluating only a subset of factors in each iteration. Compared\nto DoubleMIN-Gibbs, Poisson-Gibbs is faster because it removes the need of an additional M-H\ncorrection step.\n\n4.2 Continuous Spin Models\n\nIn this section, we study a more general setting of spin models where spins can take continuous values.\nContinuous spin models are of interest in both the statistics and physics communities [11, 2, 4]. 
This random graph model can also be used to describe complex networks such as social, information, and biological networks [12]. We consider the energy of a configuration to be

U(x) = Σ_{i=1}^{n} Σ_{j=1}^{n} β · A_ij · (x(i) · x(j) + 1),

where x(i) ∈ [0, 1] and β = 1. Notice that the existing minibatched Gibbs sampling methods [3] are not applicable to this task, since they can be used only on discrete state spaces. We compare PGITS and PGDA with: (1) Gibbs sampling with FITS (Gibbs-ITS); (2) Gibbs sampling with double Chebyshev approximation (Gibbs-DA); (3) Gibbs with rejection sampling (Gibbs-rejection); and (4) Poisson-Gibbs with rejection sampling (PG-rejection). We use the symmetric KL divergence to quantitatively evaluate convergence. On this model, L = 13.71 and we set λ = L². The degree of the polynomial is m = 3 for PGITS and for the first approximation in PGDA, and k = 10 for the second approximation in PGDA. In rejection sampling, we set the proposal distribution to be wg, where g is the uniform distribution on [0, 1] and w is a constant tuned for best performance. The ground-truth stationary distribution is obtained by running Gibbs-ITS for 10⁷ iterations.

On this task, the average number of evaluated factors per iteration of Poisson-Gibbs is 190. Figure 1c shows the symmetric KL divergence as a function of iterations, with results averaged over three runs. Observe that our methods achieve performance comparable to Gibbs sampling while using only a fraction of the factors. For rejection sampling, the average number of steps needed for a sample to be accepted is greater than 300, which means that its cost is much larger than that of PGITS and PGDA. Given the same time budget, it can run for many fewer iterations (we run it for 10⁴ iterations). On the other hand, the two Chebyshev approximation methods are much more efficient for both Poisson-Gibbs and plain Gibbs.
The advantage of FITS over rejection sampling has also been discussed in previous work [13]. Also notice that PGDA converges faster than PGITS given the same polynomial degree. This empirical result validates our theoretical results, which suggest that PGDA is more efficient than PGITS. We also report the symmetric KL divergence as a function of runtime in Figure 2c. As in the previous section, the two Poisson-Gibbs methods are faster than plain Gibbs sampling.

Figure 3: A visualization of the estimated density on a truncated Gaussian mixture model. Panels: (a) True, (b) PGITS, (c) PGDA, (d) Gibbs-rejection.

4.3 Truncated Gaussian Mixture

We further demonstrate PGITS and PGDA on a truncated Gaussian mixture model. We consider the following Gaussian mixture with tied means, as done in previous work [18, 9]:

x1 ∼ N(0, σ1^2), x2 ∼ N(0, σ2^2), yi ∼ (1/2) · N(x1, σy^2) + (1/2) · N(x1 + x2, σy^2).

We used the same parameters as in Welling and Teh [18]: σ1^2 = 10, σ2^2 = 1, σy^2 = 2, x1 = 0, and x2 = 1. This posterior has two modes, at (x1, x2) = (0, 1) and (x1, x2) = (1, −1). We truncate the posterior by bounding the variables x1 and x2 to [−6, 6]. The energy can be written as

U(x) = log p(x1) + log p(x2) + Σ_{i=1}^N log p(yi | x1, x2),

which can be regarded as a factor graph with N factors. We add a positive constant to the energy to ensure each factor is non-negative; this does not change the underlying distribution. As in Li and Wong [9], we set N = 10^6. For this model L = 1581.14, and we set λ = 500, m = 20, and k = 25. We also considered higher values of λ and found the results to be very similar. We generate 10^6 samples for all methods. A uniform distribution on [−6, 6] is used as the proposal distribution in Gibbs with rejection sampling. We tried varying values of w, but none of them resulted in a reasonable density estimate, which may be due to the inefficiency of rejection sampling [13].
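The tied-means model above can be illustrated with a small self-contained sketch. This is our own illustration, not code from the paper: the helper names `log_normal`, `log_mixture`, and `energy` are hypothetical, the observations are simulated from the true parameters x1 = 0, x2 = 1, and the per-factor non-negativity constant is omitted since it only shifts the energy.

```python
import math
import random

SIGMA1_SQ, SIGMA2_SQ, SIGMAY_SQ = 10.0, 1.0, 2.0  # parameters from Welling and Teh

def log_normal(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def log_mixture(y, x1, x2):
    """log p(y | x1, x2) for the tied-means mixture
    y ~ (1/2) N(x1, sigma_y^2) + (1/2) N(x1 + x2, sigma_y^2)."""
    a = log_normal(y, x1, SIGMAY_SQ)
    b = log_normal(y, x1 + x2, SIGMAY_SQ)
    m = max(a, b)  # log-sum-exp trick for numerical stability
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))

def energy(x1, x2, ys):
    """U(x) = log p(x1) + log p(x2) + sum_i log p(y_i | x1, x2)."""
    return (log_normal(x1, 0.0, SIGMA1_SQ) + log_normal(x2, 0.0, SIGMA2_SQ)
            + sum(log_mixture(y, x1, x2) for y in ys))

# Simulate a few observations: each y_i comes from a component with mean
# x1 = 0 or x1 + x2 = 1, chosen with probability 1/2.
random.seed(0)
ys = [random.gauss(0.0 if random.random() < 0.5 else 1.0, math.sqrt(SIGMAY_SQ))
      for _ in range(100)]
print(energy(0.0, 1.0, ys))
```

The bimodality is visible in this parameterization: (x1, x2) = (0, 1) and (1, −1) produce the same pair of component means {0, 1}, so both settings explain the data equally well.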
We report the results when the average number of steps needed for a sample to be accepted is around 1000. The average number of factors evaluated per iteration by Poisson-Gibbs is 1802. Our results are reported in Figure 3, where we observe visually that the density estimates of PGITS and PGDA are very accurate. In contrast, rejection sampling completely fails to estimate the density given the budget.

5 Conclusion

We propose Poisson-minibatching Gibbs sampling to generate unbiased samples with theoretical guarantees on the convergence rate. Our method provably converges to the desired stationary distribution at a rate that is at most a constant factor slower than that of the full-batch method, as measured by the spectral gap. We provide guidance on how to set the hyperparameters of our method to make the convergence speed arbitrarily close to that of the full-batch method. On continuous state spaces, we propose two variants of Poisson-Gibbs based on fast inverse transform sampling and provide convergence analysis for both of them. We hope that our work will help inspire more exploration into unbiased and guaranteed-fast stochastic MCMC methods.

Acknowledgements

This work was supported by a gift from Huawei. We thank Wing Wong for the helpful discussion.

References

[1] Shigeki Aida. Uniform positivity improving property, Sobolev inequalities, and spectral gaps. Journal of Functional Analysis, 158(1):152–185, 1998.

[2] AD Bruce. Universality in the two-dimensional continuous spin model. Journal of Physics A: Mathematical and General, 18(14):L873, 1985.

[3] Christopher De Sa, Vincent Chen, and Wing Wong. Minibatch Gibbs sampling on large graphical models. arXiv preprint arXiv:1806.06086, 2018.

[4] Sander Dommers, Christof Kuelske, and Philipp Schriever. Continuous spin models on annealed generalized random graphs.
Stochastic Processes and their Applications, 127(11):3719–3753, 2017.

[5] Masatoshi Fukushima, Yoichi Oshima, and Masayoshi Takeda. Dirichlet Forms and Symmetric Markov Processes, volume 19. Walter de Gruyter, 2010.

[6] Ernst Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei, 31(1):253–258, 1925.

[7] Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[8] David A Levin and Yuval Peres. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

[9] Dangna Li and Wing H Wong. Mini-batch tempered MCMC. arXiv preprint arXiv:1707.09705, 2017.

[10] Dougal Maclaurin and Ryan P Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. In UAI, pages 543–552, 2014.

[11] Manon Michel, Johannes Mayer, and Werner Krauth. Event-chain Monte Carlo for classical continuous spin models. EPL (Europhysics Letters), 112(2):20003, 2015.

[12] Mark EJ Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.

[13] Sheehan Olver and Alex Townsend. Fast inverse transform sampling in one and two dimensions. arXiv preprint arXiv:1307.1223, 2013.

[14] Renfrey Burnard Potts. Some generalized order-disorder transformations. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 48, pages 106–109. Cambridge University Press, 1952.

[15] Matias Quiroz, Minh-Ngoc Tran, Mattias Villani, Robert Kohn, and Khue-Dung Dang. The block-Poisson estimator for optimally tuned exact subsampling MCMC. arXiv preprint arXiv:1603.08232, 2016.

[16] Daniel Rudolf. Explicit error bounds for Markov chain Monte Carlo. arXiv preprint arXiv:1108.3201, 2011.

[17] Lloyd N Trefethen. Approximation Theory and Approximation Practice, volume 128. SIAM, 2013.

[18] Max Welling and Yee W Teh.
Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.