{"title": "Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 9, "abstract": "Gibbs sampling is a Markov Chain Monte Carlo sampling technique that iteratively samples variables from their conditional distributions. There are two common scan orders for the variables: random scan and systematic scan. Due to the benefits of locality in hardware, systematic scan is commonly used, even though most statistical guarantees are only for random scan. While it has been conjectured that the mixing times of random scan and systematic scan do not differ by more than a logarithmic factor, we show by counterexample that this is not the case, and we prove that that the mixing times do not differ by more than a polynomial factor under mild conditions. To prove these relative bounds, we introduce a method of augmenting the state space to study systematic scan using conductance.", "full_text": "Scan Order in Gibbs Sampling: Models in Which it\n\nMatters and Bounds on How Much\n\nBryan He, Christopher De Sa, Ioannis Mitliagkas, and Christopher R\u00e9\n\nStanford University\n\n{bryanhe,cdesa,imit,chrismre}@stanford.edu\n\nAbstract\n\nGibbs sampling is a Markov Chain Monte Carlo sampling technique that iteratively\nsamples variables from their conditional distributions. There are two common scan\norders for the variables: random scan and systematic scan. Due to the bene\ufb01ts\nof locality in hardware, systematic scan is commonly used, even though most\nstatistical guarantees are only for random scan. While it has been conjectured that\nthe mixing times of random scan and systematic scan do not differ by more than a\nlogarithmic factor, we show by counterexample that this is not the case, and we\nprove that that the mixing times do not differ by more than a polynomial factor\nunder mild conditions. 
To prove these relative bounds, we introduce a method of\naugmenting the state space to study systematic scan using conductance.\n\n1\n\nIntroduction\n\nGibbs sampling, or Glauber dynamics, is a Markov chain Monte Carlo method that draws approximate\nsamples from multivariate distributions that are dif\ufb01cult to sample directly [9; 15, p. 40]. A major use\nof Gibbs sampling is marginal inference: the estimation of the marginal distributions of some variables\nof interest [8]. Some applications include various computer vision tasks [9, 23, 24], information\nextraction [7], and latent Dirichlet allocation for topic modeling [11]. Gibbs sampling is simple to\nimplement and quickly produces accurate samples for many models, so it is widely used and available\nin popular libraries such as OpenBUGS [16], FACTORIE [17], JAGS [18], and MADlib [14].\nGibbs sampling (Algorithm 1) iteratively selects a single variable and resamples it from its conditional\ndistribution, given the other variables in the model. The method that selects the variable index to\nsample (s in Algorithm 1) is called the scan order. Two scan orders are commonly used: random scan\nand systematic scan (also known as deterministic or sequential scan). In random scan, the variable to\nsample is selected uniformly and independently at random at each iteration. In systematic scan, a\n\ufb01xed permutation is selected, and the variables are repeatedly selected in that order. The existence of\nthese two distinct options raises an obvious question\u2014which scan order produces accurate samples\nmore quickly? This question has two components: hardware ef\ufb01ciency (how long does each iteration\ntake?) and statistical ef\ufb01ciency (how many iterations are needed to produce an accurate sample?).\nFrom the hardware ef\ufb01ciency perspective, systematic scans are clearly superior [21, 22]. 
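The iterative procedure of Algorithm 1 is short to implement with either scan order; the following is a minimal Python sketch, where the generic `log_weight` interface and all identifiers are ours, for illustration only (a real sampler would exploit the model's factor structure rather than re-evaluate the full joint):

```python
import math
import random

def gibbs(log_weight, n, steps, scan="systematic", seed=0):
    """Gibbs sampler over n binary variables x_1, ..., x_n.

    `log_weight(x)` returns the unnormalized log-probability of a state
    (a hypothetical interface standing in for a concrete model).
    `scan` selects the order: "random" draws a uniform index each step;
    "systematic" cycles through the fixed order 0, 1, ..., n-1.
    """
    rng = random.Random(seed)
    x = [0] * n
    for t in range(steps):
        s = rng.randrange(n) if scan == "random" else t % n
        # Resample x_s from its conditional given all other variables,
        # by evaluating the unnormalized joint at both values of x_s.
        w = []
        for v in (0, 1):
            x[s] = v
            w.append(math.exp(log_weight(x)))
        x[s] = 1 if rng.random() * (w[0] + w[1]) < w[1] else 0
    return x
```

With a log-weight that strongly favors true variables, both scan orders quickly drive most variables to 1; the two orders differ only in how `s` is chosen, which is exactly the design axis studied in this paper.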
Systematic\nscans have good spatial locality because they access the variables in linear order, which makes their\niterations run faster on hardware. As a result, systematic scans are commonly used in practice.\nComparing the two scan orders is much more interesting from the perspective of statistical ef\ufb01ciency,\nwhich we focus on for the rest of this paper. Statistical ef\ufb01ciency is measured by the mixing\ntime, which is the number of iterations needed to obtain an accurate sample [15, p. 55]. The\nmixing times of random scan and systematic scan have been studied, and there is a longstanding\nconjecture [3; 15, p. 300] that systematic scan (1) never mixes more than a constant factor slower\nthan random scan and (2) never mixes more than a logarithmic factor faster than random scan. This\nconjecture implies that the choice of scan order does not have a large effect on performance.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fAlgorithm 1 Gibbs sampler\ninput Variables xi for 1 \u2264 i \u2264 n, and target distribution \u03c0\n\nInitialize x1, . . . , xn\nloop\n\nSelect variable index s from {1, . . . , n}\nSample xs from the conditional distribution P\u03c0(Xs | X{1,...,n}\\{s})\n\nend loop\n\nRecently, Roberts and Rosenthal [20] described a model in which systematic scan mixes more\nslowly than random scan by a polynomial factor; this disproves direction (1) of this conjecture.\nIndependently, we constructed other models for which the scan order has a signi\ufb01cant effect on\nmixing time. This raises the question: what are the true bounds on the difference between these\nmixing times? In this paper, we address this question and make the following contributions.\n\n\u2022 In Section 3, we study the effect of the variable permutation chosen for systematic scan on\nthe mixing time. 
In particular, in Section 3.1, we construct a model for which a systematic\nscan mixes a polynomial factor faster than random scan, disproving direction (2) of the\nconjecture, and in Section 3.2, we construct a model for which the systematic scan with the\nworst-case permutation results in a mixing time that is slower by a polynomial factor than\nboth the best-case systematic scan permutation and random scan.\n\u2022 In Section 4, we empirically verify the mixing times of the models we construct, and we\nanalyze how the mixing time changes as a function of the permutation.\n\u2022 In Section 5, we prove a weaker version of the conjecture described above, providing\nrelative bounds on the mixing times of random and systematic scan. Speci\ufb01cally, under\nmild conditions, different scan orders can only change the mixing time by a polynomial\nfactor. To obtain these bounds, we introduce a method of augmenting the state space of\nGibbs sampling so that the method of conductance can be applied to analyze its dynamics.\n\n2 Related Work\n\nRecent work has made progress on analyzing the mixing time of Gibbs sampling, but there are still\nmajor limitations to our understanding. In particular, most results are only for speci\ufb01c models or for\nrandom scan. For example, mixing times are known for the Mallows model [1, 4], and colorings of a\ngraph [5] for both random and systematic scan, but these are not applicable to general models. On the\nother hand, random scan has been shown to mix in polynomial time for models that satisfy structural\nconditions \u2013 such as having close-to-modular energy functions [10] or having bounded hierarchy\nwidth and factor weights [2] \u2013 but corresponding results for systematic scan are not known.\nThe major exception to these limitations is Dobrushin\u2019s condition, which guarantees O(n log n)\nmixing for both random scan and systematic scan [6, 13]. 
However, many models of interest with\nclose-to-modular energy functions or bounded hierarchy width do not satisfy Dobrushin\u2019s condition.\nA similar choice of scan order appears in stochastic gradient descent (SGD), where the standard SGD\nalgorithm uses random scan, and the incremental gradient method (IGM) uses systematic scan. In\ncontrast to Gibbs sampling, avoiding \u201cbad permutations\u201d in the IGM is known to be important to\nensure fast convergence [12, 19]. In this paper, we bring some intuition about the existence of bad\npermutations from SGD to Gibbs sampling.\n\n3 Models in Which Scan Order Matters\n\nDespite a lack of theoretical results regarding the effect of scan order on mixing times, it is generally\nbelieved that scan order only has a small effect on mixing time. In this section, we \ufb01rst de\ufb01ne\nrelevant terms and state some common conjectures regarding scan order. Afterwards, we give several\ncounterexamples showing that the scan order can have asymptotic effects on the mixing time.\nThe total variation distance between two probability distributions \u00b5 and \u03bd on \u2126 is [15, p. 47]\n\n‖\u00b5 - \u03bd‖TV = max_{A ⊆ \u2126} |\u00b5(A) - \u03bd(A)|.\n\n\fTable 1: Models and Approximate Mixing Times\n\nModel                    | tmix(R) | min_α tmix(Sα) | max_α tmix(Sα)\nSequence of Dependencies | n^2     | n              | n^2\nTwo Islands              | 2^n     | 2^n            | n · 2^n\nDiscrete Pyramid [20]    | n       | n^3            | n^3\nMemorize and Repeat      | n^3     | n^2            | n^2\nSoft Dependencies        | n^{3/2} | n              | n^2\n\nThe mixing time is the minimum number of steps needed to guarantee that the total variation distance\nbetween the true and estimated distributions is below a given threshold ε from any starting distribution.\nFormally, the mixing time of a stochastic process P with transition matrix P^(t) after t steps and\nstationary distribution \u03c0 is [15, p. 
55]\n\ntmix(P, ε) = min { t : max_µ ‖P^(t)µ - \u03c0‖TV ≤ ε },\n\nwhere the maximum is taken over the distribution \u00b5 of the initial state of the process. When comparing\nthe statistical ef\ufb01ciency of systematic scan and random scan, it would be useful to establish, for any\nsystematic scan process S and random scan process R on the same n-variable model, a relative bound\nof the form\n\nF1(ε, n, tmix(R, ε)) ≤ tmix(S, ε) ≤ F2(ε, n, tmix(R, ε))   (1)\n\nfor some functions F1 and F2. Similarly, to bound the effect that the choice of permutation can have\non the mixing time, it would be useful to know, for any two systematic scan processes Sα and Sβ\nwith different permutations on the same model, that for some function F3,\n\ntmix(Sα, ε) ≤ F3(ε, n, tmix(Sβ, ε)).   (2)\n\nDiaconis [3] and Levin et al. [15, p. 300] conjecture that systematic scan is never more than a\nconstant factor slower or a logarithmic factor faster than random scan. This is equivalent to choosing\nF1(ε, n, t) = C1(ε) · t · (log n)^{-1} and F2(ε, n, t) = C2(ε) · t in the inequality in (1), for some functions\nC1 and C2. It is also commonly believed that all systematic scans mix at the same asymptotic rate,\nwhich is equivalent to choosing F3(ε, n, t) = C3(ε) · t in (2).\nThese conjectures imply that using systematic scan instead of random scan will not result in signi\ufb01cant\nconsequences, at least asymptotically, and that the particular permutation used for systematic scan is\nnot important. 
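Both definitions above can be evaluated by brute force for small chains. The following is a minimal numpy sketch (function names are ours); it uses the row-stochastic convention, so rows of P^t are the state distributions after t steps from each point-mass start, which suffices for the maximum over starting distributions by convexity:

```python
import numpy as np

def tv_distance(mu, nu):
    # ||mu - nu||_TV = max_A |mu(A) - nu(A)| = (1/2) * sum_x |mu(x) - nu(x)|
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

def mixing_time(P, pi, eps=0.25):
    """Smallest t with max over starts of ||P^t mu - pi||_TV <= eps.

    The maximum over initial distributions mu is attained at a point
    mass, so checking each start state (each row of P^t) is enough.
    """
    P = np.asarray(P, dtype=float)
    d = len(pi)
    Pt = np.eye(d)
    for t in range(1, 10_000):
        Pt = Pt @ P
        if max(tv_distance(Pt[s], pi) for s in range(d)) <= eps:
            return t
    raise RuntimeError("did not mix within step budget")
```

For example, the two-state chain that jumps to a uniform state every step has mixing time 1 for any ε ≥ 0.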
However, we show that neither conjecture is true by constructing models (listed in\nTable 1) in which the scan order has substantial asymptotic effects on mixing time.\nIn the rest of this section, we go through two models in detail to highlight the diversity of behaviors\nthat different scan orders can have. First, we construct the sequence of dependencies model, for\nwhich a single \u201cgood permutation\u201d of systematic scan mixes faster, by a polynomial factor, than both\nrandom scan and systematic scans using most other permutations. This serves as a counterexample\nto the conjectured lower bounds (i.e. the choice of F1 and F3) on the mixing time of systematic\nscan. Second, we construct the two islands model, for which a small set of \u201cbad permutations\u201d mix\nvery slowly in comparison to random scan and most other systematic scans. This contradicts the\nconjectured upper bounds (i.e. the choice of F2 and F3). For completeness, we also discuss the\ndiscrete pyramid model introduced by Roberts and Rosenthal [20], which contradicts the conjectured\nchoice of F2. Table 1 lists several additional models we constructed: these models further explore the\nspace of asymptotic comparisons among scan orders, but for brevity we defer them to the appendix.\n\n3.1 Sequence of Dependencies\n\nThe \ufb01rst model we describe is the sequence of dependencies model (Figure 1a), where we explore\nhow fast systematic scan can be by allowing a speci\ufb01c good permutation to mix rapidly. The sequence\nof dependencies model achieves this by having the property that, at any time, progress towards mixing\nis only made if a particular variable is sampled; this variable is always the one that is chosen by the\ngood permutation. 
As a result, while a systematic scan using the good permutation makes progress at every step, both\nrandom scan and other systematic scans often fail to progress, which leads to a gap between their\nmixing times. Thus, this model exhibits two surprising behaviors: (1) one systematic scan is\npolynomially better than random scan and (2) systematic scans using different permutations have\npolynomial differences in mixing times. We now describe this model in detail.\n\n\f[Figure 1: State space of the models. (a) Sequence of Dependencies Model: a chain of states s0, . . . , sn connected by transitions x1, . . . , xn. (b) Two Islands Model: islands x and y joined by the bridge state b. (c) Discrete Pyramid Model.]\n\nVariables There are n binary variables x1, . . . , xn. Independently, each variable has a very strong\nprior of being true. However, variable xi is never true unless xi\u22121 is also true. The unnormalized\nprobability distribution is the following, where M is a very large constant.1\n\nP(x) ∝ 0 if xi is true and xi\u22121 is false for some i \u2208 {2, . . . , n}, and P(x) ∝ M^(x1 + ··· + xn) otherwise.\n\nState Space There are n + 1 states with non-zero probability: s0, . . . , sn, where si is the state\nwhere the \ufb01rst i variables are true and the remaining n \u2212 i variables are false. In the stationary\ndistribution, sn has almost all of the mass due to the strong priors on the variables, so reaching sn is\nessentially equivalent to mixing because the total variation distance from the stationary distribution is\nequal to the mass not on sn. 
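In the large-M limit, the behavior of this model under any scan order can be checked by simulating the hitting time of sn directly. A small sketch (idealized, assuming a sample of xi+1 at state si always succeeds and that the vanishingly rare backward moves never occur):

```python
import random

def steps_to_mix(n, order_fn):
    """Steps until the sequence-of-dependencies chain reaches s_n from
    s_0, in the idealized large-M limit: sampling x_{i+1} at state s_i
    always moves to s_{i+1}; all other samples leave the state unchanged.
    `order_fn(t)` returns the (1-indexed) variable sampled at step t."""
    state, t = 0, 0
    while state < n:
        t += 1
        if order_fn(t) == state + 1:
            state += 1
    return t

n = 20
best = steps_to_mix(n, lambda t: (t - 1) % n + 1)          # scan x1, ..., xn
worst = steps_to_mix(n, lambda t: n - (t - 1) % n)         # scan xn, ..., x1
rng = random.Random(0)
rand = steps_to_mix(n, lambda t: rng.randrange(1, n + 1))  # random scan
```

With n = 20 this reproduces the analysis below: the good permutation needs exactly n = 20 samples, the reversed permutation needs n(n - 1) + 1 = 381, and random scan typically lands near its n^2 = 400 expectation.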
Notice that sampling xi will almost always move the state from si\u22121 to\nsi, very rarely move it from si to si\u22121, and can have no other effect. The worst-case starting state is\ns0, where the variables must be sampled in the order x1, . . . , xn for this model to mix.\n\nRandom Scan The number of steps needed to transition from s0 to s1 is distributed as a geometric\nrandom variable with mean n (variables are randomly selected, and speci\ufb01cally x1 must be selected).\nSimilarly, the number of steps needed to transition from si\u22121 to si is distributed as a geometric\nrandom variable with mean n. In total, there are n transitions, so O(n2) steps are needed to mix.\n\nBest Systematic Scan The best systematic scan uses the order x1, x2, . . . , xn. For this scan, one\nsweep will reach sn no matter what the starting state is, so the mixing time is n.\n\nWorst Systematic Scan The worst systematic scan uses the order xn, xn\u22121, . . . , x1. The \ufb01rst\nsweep only uses x1, the second sweep only uses x2, and in general, any sweep only makes progress\nusing one transition. Finally, in the n-th sweep, xn is used in the \ufb01rst step. Thus, this process mixes\nin n(n \u2212 1) + 1 steps, which is O(n2).\n\n3.2 Two Islands\n\nWith the sequence of dependencies model, we showed that a single good permutation can mix much\nfaster than other scan orders. Next, we describe the two islands model (Figure 1b), which has the\n\n1We discuss the necessary magnitude of M in Appendix B\n\n4\n\n\fopposite behavior: it has bad permutations that yield much slower mixing times. The two islands\nmodel achieves this by having two disjoint blocks of variables such that consecutively sampling two\nvariables from the same block accomplishes very little. As a result, a systematic scan that uses a\npermutation that frequently consecutively samples from the same block mixes a polynomial factor\nslower than both random scan and most other systematic scans. 
We now describe this model in detail.\n\nVariables There are 2n binary variables grouped into two blocks: x1, . . . , xn and y1, . . . , yn.\nConditioned on all other variables being false, each variable is equally likely to be true or false.\nHowever, the x variables and the y variables contradict each other. As a result, if any of the x\u2019s are\ntrue, then all of the y\u2019s must be false, and if any of the y\u2019s are true, then all of the x\u2019s must be false.\nThe unnormalized probability distribution for this model is the following.\n\nP(x, y) ∝ 0 if \u2203xi true and \u2203yj true, and P(x, y) ∝ 1 otherwise.   (3)\n\nThis model can be interpreted as a machine learning inference problem in the following way. Each\nvariable represents whether the reasoning in some sentence is sound. The sentences corresponding\nto x1, . . . , xn and the sentences corresponding to y1, . . . , yn reach contradicting conclusions. If any\nvariable is true, its conclusion is correct, so all of the sentences that reached the opposite conclusion\nmust not be sound, and their corresponding variables must be false. However, this does not\nguarantee that all other sentences that reached the same conclusion have sound reasoning, so it is\npossible for some variables in a block to be true while others are false. Under these assumptions\nalone, the natural way to model this system is with the two islands distribution in (3).\n\nState Space The states are divided into three groups: states in island x (at least one x variable is\ntrue), states in island y (at least one y variable is true), and a single bridge state b (all variables are\nfalse). The islands are well-connected internally, so the islands mix rapidly, but it is impossible to\nmove directly from one island to the other \u2013 the only way to cross is\nthrough the bridge. 
To simplify the analysis, we assume that the bridge state has very low mass.2\nThis allows us to assume that the chains always move off of the bridge when a variable is sampled.\nThe bridge is the only way to move from one island to the other, so it acts as a bottleneck. As a result,\nthe ef\ufb01ciency of bridge usage is critical to the mixing time. We will use bridge ef\ufb01ciency to refer to\nthe probability that the chain moves to the other island when it reaches the bridge. Because mixing\nwithin the islands is rapid in comparison to the time needed to move onto the bridge, the mixing time\nis inversely proportional to the bridge ef\ufb01ciency of the chain.\n\nRandom Scan In random scan, the variable selected after getting on the bridge is independent of\nthe previous variable. As a result, with probability 1/2, the chain will move onto the other island,\nand with probability 1/2, the chain will return to the same island, so the bridge ef\ufb01ciency is 1/2.\n\nBest Systematic Scan Several different systematic scans achieve the fastest mixing time. One such\nscan is x1, y1, x2, y2, . . . , xn, yn. Since the sampled variables alternate between the blocks, if the\nchain moves onto the bridge (necessarily by sampling a variable from the island it was previously on),\nit will always proceed to sample a variable from the other block, which will cause it to move onto\nthe other island. Thus, the bridge ef\ufb01ciency is 1. More generally, any systematic scan that alternates\nbetween sampling from x variables and sampling from y variables will have a bridge ef\ufb01ciency of 1.\n\nWorst Systematic Scan Several different systematic scans achieve the slowest mixing time. One\nsuch scan is x1, . . . , xn, y1, . . . , yn. In this case, if the chain moves onto the bridge, it will almost\nalways proceed to sample a variable from the same block, and return to the same island. 
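Counting bridge crossings over one cyclic pass of the permutation gives the bridge efficiency directly. A small sketch of this computation (the block-letter encoding of variables, e.g. "x3" or "y1", is ours):

```python
def bridge_efficiency(perm):
    """Fraction of cyclically consecutive pairs in `perm` that switch
    blocks: the chain steps onto the bridge while sampling one variable
    and crosses only if the *next* variable sampled is from the other
    block. Each variable is encoded by its block letter plus an index."""
    m = len(perm)
    switches = sum(perm[i][0] != perm[(i + 1) % m][0] for i in range(m))
    return switches / m
```

An alternating order such as x1, y1, x2, y2, . . . gives efficiency 1, while the blocked order x1, . . . , xn, y1, . . . , yn gives 2/(2n) = 1/n, matching the best and worst cases analyzed here.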
In fact,\nthe only way for this chain to move across islands is if it moves from island x to the bridge using\ntransition xn and then moves to island y using transition y1, or if it moves from island y to the bridge\nusing transition yn and then moves to island x using transition x1. Thus, only 2 of the 2n transitions\nwill cross the bridge, and the bridge ef\ufb01ciency is 1/n. More generally, any systematic scan that\nconsecutively samples all x variables and then all y variables will have a bridge ef\ufb01ciency of 1/n.\n\nComparison of Mixing Times The mixing times of the chains are inversely proportional to the\nbridge ef\ufb01ciency. As a result, random scan takes twice as long to mix as the best systematic scan, and\nmixes n/2 times faster than the worst systematic scan.\n\n2We show that the same asymptotic result holds without this assumption in Appendix C.\n\n5\n\n\f3.3 Discrete Pyramid\n\nIn the discrete pyramid model (Figure 1c) introduced by Roberts and Rosenthal [20], there are n\nbinary variables xi, and the mass is uniformly distributed over all states where at most one xi is true.\nIn this model, the mixing time of random scan, O(n), is asymptotically better than that of systematic\nscan for any permutation, which all have the same mixing time, O(n3).\n\n4 Experiments\n\nIn this section, we run several experiments to illustrate the effect of scan order on mixing times. First,\nin Figure 2a, we plot the mixing times of the models from Section 3 as a function of the number of\nvariables. These experiments validate our results about the asymptotic scaling of the mixing time,\nas well as show that the scan order can have a signi\ufb01cant effect on the mixing time for even small\nmodels. 
(Due to the exponential state space of the two islands model, we modify it slightly to make\nthe computation of mixing times feasible: we simplify the model by only considering the states that\nare adjacent to the bridge, and assume that the states on each individual island mix instantly.)\nIn the following experiments, we consider a modi\ufb01ed version of the two islands model, in which the\nmass of the bridge state is set to 0.1 of the mass of the other states to allow the effect of scan order to\nbe clear even for a small number of variables. Figure 2b illustrates the rate at which different scan\norders explore this modi\ufb01ed model. Due to symmetry, we know that half of the mass should be on\neach island in the stationary distribution, so getting half of the mass onto the other island is necessary\nfor mixing. This experiment illustrates that random scan and a good systematic scan move to the\nother island quickly, while a bad systematic scan requires many more iterations.\nFigure 2c illustrates the effect that the permutation chosen for systematic scan can have on the mixing\ntime. In this experiment, the mixing time for each permutation was found and plotted in sorted order.\nFor the sequence of dependencies model, there are a small number of good permutations which mix\nvery quickly compared to the other permutations and random scan. However, no permutation is bad\ncompared to random scan. In the two islands model, as we would expect based on the analysis in\nSection 3, there are a small number of bad permutations which mix very slowly compared to the\nother permutations and random scan. Some permutations are slightly better than random scan, but\nnone of the scan orders are substantially better. 
In addition, the mixing times for systematic scan are approximately discretized, due to the fact that\nthe mixing time depends so heavily on the bridge ef\ufb01ciency.\n\n5 Relative Bounds on Mixing Times via Conductance\n\nIn Section 3, we described two models for which a systematic scan can mix a polynomial factor\nfaster or slower than random scan, thus invalidating conventional wisdom that the scan order does not\nhave an asymptotically signi\ufb01cant effect on mixing times. This raises the question of how different the\nmixing times of different scans can be. In this section, we derive the following weaker \u2013 but correct \u2013\nversion of the conjecture stated by Diaconis [3] and Levin et al. [15].\nOne of the obstacles to proving this result is that the systematic scan chain is not reversible. A\nstandard method of handling non-reversible Markov chains is to study a lazy version of the Markov\nchain instead [15, p. 9]. In the lazy version of a Markov chain, each step has a probability of 1/2 of\nstaying at the current state, and acts as a normal step otherwise. This is equivalent to stopping at a\nrandom time that is distributed as a binomial random variable. Due to the fact that systematic scan is\nnot reversible, our bounds are on the lazy systematic scan, rather than the standard systematic scan.\nTheorem 1. 
For any random scan Gibbs sampler R and lazy systematic scan sampler S with the\nsame stationary distribution \u03c0, their relative mixing times are bounded as follows.\n\n(1/2 - ε)^2 tmix(R, ε) ≤ 2 tmix(S, ε)^2 log(1/(ε \u03c0min)),\n\n(1/2 - ε)^2 tmix(S, ε) ≤ (8n^2 / (min_{x,i} Pi(x, x))^2) tmix(R, ε)^2 log(1/(ε \u03c0min)),\n\nwhere Pi is the transition matrix corresponding to resampling just variable i, and \u03c0min is the\nprobability of the least likely state in \u03c0.\n\n\f[Figure 2: Empirical analysis of the models. (a) Mixing times for ε = 1/4 on the sequence of dependencies, two islands, and discrete pyramid models, comparing best/worst/other systematic scans, random scan, and the true value. (b) Marginal island mass over time (two islands, n = 10). (c) Sorted mixing times of different permutations for ε = 1/4 (sequence of dependencies, n = 10; two islands, n = 6).]\n\nUnder mild conditions, namely ε being \ufb01xed and the quantities log(1/\u03c0min) and (min_{x,i} Pi(x, x))^{-1}\nbeing at most polynomial in n, this theorem implies that the choice of scan order can only affect the\nmixing time by up to polynomial factors in n and tmix. We now outline the proof of this theorem and\ninclude full proofs in Appendix D.\nIn the two islands models, the mixing time of a scan order was determined by its ability to move\nthrough a single bridge state that restricted \ufb02ow. 
This suggests that a technique with the ability to\nmodel the behavior of this bridge state is needed to bound the relative mixing times of different scans.\nConductance, also known as the bottleneck ratio, is a topological property of Markov chains used to\nbound mixing times by considering the \ufb02ow of mass around the model [15, p. 88]. This ability to\nmodel bottlenecks in a Markov chain makes conductance a natural technique both for studying the\ntwo islands model and bounding mixing times in general.\nMore formally, consider a Markov chain on state space \u2126 with transition matrix P and stationary\ndistribution \u03c0. The conductance of a set S and of the whole chain are respectively de\ufb01ned as\n\nΦ(S) = (Σ_{x∈S, y∉S} \u03c0(x)P(x, y)) / \u03c0(S),   Φ⋆ = min_{S : \u03c0(S) ≤ 1/2} Φ(S).\n\nConductance can be directly applied to analyze random scan. Let Pi be the transition matrix\ncorresponding to sampling variable i. The state space \u2126 is used without modi\ufb01cation, and the\ntransition matrix is P = (1/n) Σ_{i=1}^{n} Pi. The stationary distribution is the target distribution \u03c0.\nOn the other hand, conductance cannot be directly applied to systematic scan. Systematic scan is not\na homogeneous Markov chain because it uses a sequence of transition matrices rather than a single\ntransition matrix. One standard method of converting systematic scan into a homogeneous Markov\nchain is to consider each full scan as one step of a Markov chain. However, this makes it dif\ufb01cult\nto compare with random scan because it completely changes which states are connected by single\nsteps of the transition matrix. To allow systematic and random scan to be compared more easily,\nwe introduce an alternative way of converting systematic scan to a homogeneous Markov chain by\naugmenting the state space. 
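For chains small enough to enumerate, the two definitions above can be computed by brute force over all subsets S. A sketch (exponential in the number of states, so illustration only; the function name is ours):

```python
import numpy as np
from itertools import combinations

def conductance(P, pi):
    """Bottleneck ratio: Phi(S) = sum_{x in S, y not in S} pi(x)P(x,y) / pi(S),
    minimized over all S with pi(S) <= 1/2. Brute force, small chains only."""
    P, pi = np.asarray(P, float), np.asarray(pi, float)
    d = len(pi)
    best = np.inf
    for r in range(1, d):
        for S in combinations(range(d), r):
            S = list(S)
            if pi[S].sum() <= 0.5 + 1e-12:
                out = [y for y in range(d) if y not in S]
                # Mass flowing out of S in one step, weighted by pi.
                flow = (P[np.ix_(S, out)] * pi[S][:, None]).sum()
                best = min(best, flow / pi[S].sum())
    return best
```

On the two-state chain with uniform stationary distribution and all transition probabilities 1/2, each singleton set gives Φ(S) = (1/2 · 1/2) / (1/2) = 1/2, so Φ⋆ = 1/2.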
The augmented state space is \u03a8 = \u2126 \u00d7 [n], which represents an ordered\npair of the normal state and the index of the variable to be sampled. The transition probability is\nP((x, i), (y, j)) = Pi(x, y) s(i, j), where s(i, j) = I[i + 1 ≡ j (mod n)] is an indicator that shows\nif the correct variable will be sampled next.\n\n\fAdditionally, augmenting the state space for random scan allows easier comparison with systematic\nscan in some cases. For augmented random scan, the state space is also \u03a8 = \u2126 \u00d7 [n], the same\nas for systematic scan. The transition probability is P((x, i), (y, j)) = (1/n) Pi(x, y), which means\nthat the next variable to sample is selected uniformly. The stationary distributions of the augmented\nrandom scan and systematic scan chains are both \u03c0((x, i)) = n^{-1} \u03c0(x). Because the state space and\nstationary distribution are the same, augmented random scan and augmented systematic scan can be\ncompared directly, which lets us prove the following lemma.\nLemma 1. For any random scan Gibbs sampler and systematic scan sampler with the same stationary\ndistribution \u03c0, let \u03a6RS denote the conductance of the random scan process, let \u03a6RS-A denote the\nconductance of the augmented random scan process, and let \u03a6SS-A denote the conductance of the\naugmented systematic scan process. Then,\n\n(1/(2n)) · min_{x,i} Pi(x, x) · \u03a6RS-A ≤ \u03a6SS-A ≤ \u03a6RS.\n\nIn Lemma 1, the upper bound states that the conductance of systematic scan is no larger than the\nconductance of random scan. We use this result in the proof of Theorem 1 to show that systematic\nscan cannot mix too much more quickly than random scan. 
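The augmented systematic scan chain can be assembled explicitly for small models. A sketch (function name ours), assuming each Pi is given as a row-stochastic matrix over the full state space Ω and states (x, i) are flattened to index x·n + i:

```python
import numpy as np

def augmented_systematic(Ps):
    """Homogeneous chain on Psi = Omega x [n] with transition probability
    P((x, i), (y, j)) = P_i(x, y) * 1[j == i + 1 mod n], where Ps[i] is
    the row-stochastic matrix for resampling variable i on Omega."""
    n, d = len(Ps), Ps[0].shape[0]
    Q = np.zeros((d * n, d * n))
    for i, Pi in enumerate(Ps):
        j = (i + 1) % n  # after sampling variable i, variable j is next
        for x in range(d):
            for y in range(d):
                Q[x * n + i, y * n + j] = Pi[x, y]
    return Q
```

Each row of Q sums to 1 because only the single index j = i + 1 (mod n) receives the mass of row x of Pi, and when the Pi share the stationary distribution π, the augmented chain is stationary at π((x, i)) = π(x)/n, as stated above.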
To prove this upper bound, we show\nthat for any set S under random scan, the set Ŝ containing the corresponding augmented states for\nsystematic scan will have the same conductance under systematic scan as S had under random scan.\nThe lower bound in Lemma 1 states that the conductance of systematic scan is no smaller than a\nfunction of the conductance of augmented random scan. This function depends on the number of\nvariables n and min_{x,i} Pi(x, x), which is the minimum holding probability of any state. To prove\nthis lower bound, we show that for any set S under augmented systematic scan, we can bound its\nconductance under augmented random scan.\nThere are well-known bounds on the mixing time of a Markov chain in terms of its conductance,\nwhich we state in Theorem 2 [15, pp. 89, 235].\nTheorem 2. For any lazy or reversible Markov chain,\n\n(1/2 - ε)/Φ⋆ ≤ tmix(ε) ≤ (2/Φ⋆^2) log(1/(ε \u03c0min)).\n\nIt is straightforward to prove the result of Theorem 1 by combining the bounds from Theorem 2 with\nthe conductance bounds from Lemma 1.\n\n6 Conclusion\n\nWe studied the effect of scan order on mixing times of Gibbs samplers, and found that for particular\nmodels, the scan order can have an asymptotic effect on the mixing times. These models invalidate\nconventional wisdom about scan order and show that we cannot freely change scan orders without\nconsidering the resulting changes in mixing times. 
In addition, we proved bounds relating the mixing times of different scan orders, which replace a common conjecture about the mixing times of random scan and systematic scan.

Acknowledgments

The authors acknowledge the support of: DARPA FA8750-12-2-0335; NSF IIS-1247701; NSF CCF-1111943; DOE 108845; NSF CCF-1337375; DARPA FA8750-13-2-0039; NSF IIS-1353606; ONR N000141210041 and N000141310129; NIH U54EB020405; NSF DGE-114747; DARPA's SIMPLEX program; Oracle; NVIDIA; Huawei; SAP Labs; Sloan Research Fellowship; Moore Foundation; American Family Insurance; Google; and Toshiba. The views and conclusions expressed in this material are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, AFRL, NSF, ONR, NIH, or the U.S. Government.

References

[1] I. Benjamini, N. Berger, C. Hoffman, and E. Mossel. Mixing times of the biased card shuffling and the asymmetric exclusion process. Transactions of the American Mathematical Society, 357(8):3013–3029, 2005.

[2] C. De Sa, C. Zhang, K. Olukotun, and C. Ré. Rapidly mixing Gibbs sampling for a class of factor graphs using hierarchy width. In Advances in Neural Information Processing Systems, 2015.

[3] P. Diaconis. Some things we've learned (about Markov chain Monte Carlo). Bernoulli, 19(4):1294–1305, 2013.

[4] P. Diaconis and A. Ram. Analysis of systematic scan Metropolis algorithms using Iwahori–Hecke algebra techniques. The Michigan Mathematical Journal, 48(1):157–190, 2000.

[5] M. Dyer, L. A. Goldberg, and M. Jerrum. Systematic scan for sampling colorings. The Annals of Applied Probability, 16(1):185–230, 2006.

[6] M. Dyer, L. A. Goldberg, and M. Jerrum. Dobrushin conditions and systematic scan. Combinatorics, Probability and Computing, 17(06):761–779, 2008.

[7] J. R. Finkel, T. Grenager, and C. Manning.
Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 2005.

[8] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.

[9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

[10] A. Gotovos, H. Hassani, and A. Krause. Sampling from probabilistic submodular models. In Advances in Neural Information Processing Systems, 2015.

[11] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.

[12] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Convergence rate of incremental gradient and Newton methods. arXiv preprint arXiv:1510.08562, 2015.

[13] T. P. Hayes. A simple condition implying rapid mixing of single-site dynamics on spin systems. In 47th Annual IEEE Symposium on Foundations of Computer Science, 2006.

[14] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, et al. The MADlib analytics library: or MAD skills, the SQL. Proceedings of the VLDB Endowment, 5(12):1700–1711, 2012.

[15] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2009.

[16] D. Lunn, D. Spiegelhalter, A. Thomas, and N. Best. The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28(25):3049–3067, 2009.

[17] A. McCallum, K. Schultz, and S. Singh. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In Advances in Neural Information Processing Systems, 2009.

[18] M.
Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 2003.

[19] B. Recht and C. Ré. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In Proceedings of the 25th Annual Conference on Learning Theory, 2012.

[20] G. O. Roberts and J. S. Rosenthal. Surprising convergence properties of some simple Gibbs samplers under various scans. International Journal of Statistics and Probability, 5(1):51–60, 2015.

[21] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1):703–710, 2010.

[22] C. Zhang and C. Ré. Towards high-throughput Gibbs sampling at scale: A study across storage managers. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013.

[23] Y. Zhang, M. Brady, and S. Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 20(1):45–57, 2001.

[24] S. C. Zhu, Y. Wu, and D. Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.