{"title": "Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices", "book": "Advances in Neural Information Processing Systems", "page_first": 2235, "page_last": 2243, "abstract": "We present a new way of converting a reversible finite Markov chain into a nonreversible one, with a theoretical guarantee that the asymptotic variance of the MCMC estimator based on the non-reversible chain is reduced. The method is applicable to any reversible chain whose states are not connected through a tree, and can be interpreted graphically as inserting vortices into the state transition graph. Our result confirms that non-reversible chains are fundamentally better than reversible ones in terms of asymptotic performance, and suggests interesting directions for further improving MCMC.", "full_text": "Improving the Asymptotic Performance of Markov\n\nChain Monte-Carlo by Inserting Vortices\n\nGalleria 2, Manno CH-6928, Switzerland\n\nGalleria 2, Manno CH-6928, Switzerland\n\nFaustino Gomez\n\nIDSIA\n\ntino@idsia.ch\n\nYi Sun\nIDSIA\n\nyi@idsia.ch\n\nJ\u00a8urgen Schmidhuber\n\nIDSIA\n\nGalleria 2, Manno CH-6928, Switzerland\n\njuergen@idsia.ch\n\nAbstract\n\nWe present a new way of converting a reversible \ufb01nite Markov chain into a non-\nreversible one, with a theoretical guarantee that the asymptotic variance of the\nMCMC estimator based on the non-reversible chain is reduced. The method is\napplicable to any reversible chain whose states are not connected through a tree,\nand can be interpreted graphically as inserting vortices into the state transition\ngraph. 
Our result confirms that non-reversible chains are fundamentally better than reversible ones in terms of asymptotic performance, and suggests interesting directions for further improving MCMC.

1 Introduction

Markov Chain Monte Carlo (MCMC) methods have gained enormous popularity across a wide variety of research fields [6, 8], owing to their ability to compute expectations with respect to complex, high-dimensional probability distributions. An MCMC estimator can be based on any ergodic Markov chain with the distribution of interest as its stationary distribution. However, the choice of Markov chain greatly affects the performance of the estimator, in particular the accuracy achieved with a pre-specified number of samples [4].

In general, the efficiency of an MCMC estimator is determined by two factors: i) how fast the chain converges to its stationary distribution, i.e., the mixing rate [9], and ii) once the chain reaches its stationary distribution, how much the estimates fluctuate based on trajectories of finite length, which is characterized by the asymptotic variance. In this paper, we consider the latter criterion.

Previous theory concerned with reducing asymptotic variance has followed two main tracks. The first focuses on reversible chains, and is mostly based on the theorems of Peskun [10] and Tierney [11], which state that if a reversible Markov chain is modified so that the probability of staying in the same state is reduced, then the asymptotic variance can be decreased. A number of methods have been proposed, particularly in the context of the Metropolis-Hastings method, to encourage the Markov chain to move away from the current state, or from its neighborhood in the continuous case [12, 13]. The second track, which was explored only recently, studies non-reversible chains. 
Neal proved in [4] that, starting from any finite-state reversible chain, the asymptotic variance of a related non-reversible chain, with reduced probability of back-tracking to the immediately previous state, will not increase, and will typically decrease. Several methods based on this idea have been proposed by Murray [5].

Neal's result suggests that non-reversible chains may be fundamentally better than reversible ones in terms of asymptotic performance. In this paper, we follow up on this idea by proposing a new way of converting reversible chains into non-reversible ones which, unlike in Neal's method, are defined on the state space of the reversible chain, with the theoretical guarantee that the asymptotic variance of the associated MCMC estimator is reduced. Our method is applicable to any reversible chain whose state transition graph contains loops, including those whose probability of staying in the same state is zero and which thus cannot be improved using Peskun's theorem. The method also admits an interesting graphical interpretation which amounts to inserting 'vortices' into the state transition graph of the original chain. 
Our result suggests a new and interesting direction for improving the asymptotic performance of MCMC.

The rest of the paper is organized as follows: section 2 reviews some background concepts and results; section 3 presents the main theoretical results, together with the graphical interpretation; section 4 provides a simple yet illustrative example and explains the intuition behind the results; section 5 concludes the paper.

2 Preliminaries

Suppose we wish to estimate the expectation of some real-valued function f over domain S, with respect to a probability distribution π, whose value may be known only up to a multiplicative constant. Let A be a transition operator of an ergodic¹ Markov chain with stationary distribution π, i.e.,

    π(x) A(x → y) = π(y) B(y → x),  ∀x, y ∈ S,    (1)

where B is the reverse operator as defined in [5]. The expectation can then be estimated through the MCMC estimator

    μ_T = (1/T) Σ_{t=1}^{T} f(x_t),    (2)

where x_1, ..., x_T is a trajectory sampled from the Markov chain. The asymptotic variance of μ_T, with respect to transition operator A and function f, is defined as

    σ²_A(f) = lim_{T→∞} T · V[μ_T],    (3)

where V[μ_T] denotes the variance of μ_T. Since the chain is ergodic, σ²_A(f) is well-defined following the central limit theorem, and does not depend on the distribution of the initial point. Roughly speaking, the asymptotic variance means that the mean square error of the estimate based on T consecutive states of the chain is approximately (1/T) σ²_A(f), after a sufficiently long period of "burn in" such that the chain is close enough to its stationary distribution.
Asymptotic variance can be used to compare the asymptotic performance of MCMC estimators based on different chains with the same stationary distribution: smaller asymptotic variance indicates that, asymptotically, the MCMC estimator requires fewer samples to reach a specified accuracy.

Under the ergodic assumption, the asymptotic variance can be written as

    σ²_A(f) = V[f] + Σ_{τ=1}^{∞} (c_{A,f}(τ) + c_{B,f}(τ)),    (4)

where

    c_{A,f}(τ) = E_A[f(x_t) f(x_{t+τ})] − E_A[f(x_t)] E_A[f(x_{t+τ})]

is the covariance of the function values at two states τ time steps apart in the trajectory of the Markov chain with transition operator A. Note that σ²_A(f) depends on both A and its reverse operator B, and σ²_A(f) = σ²_B(f), since A is also the reverse operator of B by definition.

In this paper, we consider only the case where S is finite, i.e., S = {1, ..., S}, so that the transition operators A and B, the stationary distribution π, and the function f can all be written in matrix form. Let π = [π(1), ..., π(S)]^T, f = [f(1), ..., f(S)]^T, A_{i,j} = A(i → j), B_{i,j} = B(i → j). The asymptotic variance can thus be written as

    σ²_A(f) = V[f] + Σ_{τ=1}^{∞} f^T (Q A^τ + Q B^τ − 2ππ^T) f,

with Q = diag{π}. Since B is the reverse operator of A, QA = B^T Q. Also, from the ergodic assumption,

    lim_{τ→∞} A^τ = lim_{τ→∞} B^τ = R,

where R = 1π^T is a square matrix in which every row is π^T.

¹Strictly speaking, the ergodic assumption is not necessary for the MCMC estimator to work; see [4]. However, we make the assumption to simplify the analysis.
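As a concrete illustration of the series above (a sketch of our own, not code from the paper; all function names are ours), the truncated sum can be evaluated directly for a small chain using plain Python lists as matrices:

```python
# Our own sketch (not the paper's code): evaluate the truncated series of
# Eq. 4 for a small finite chain,
#   sigma^2_A(f) ~ V[f] + sum_{tau=1}^{K} f^T (Q A^tau + Q B^tau - 2 pi pi^T) f.

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def asymptotic_variance(A, pi, f, n_terms=500):
    n = len(pi)
    # reverse operator, Eq. 1:  B(y -> x) = pi(x) A(x -> y) / pi(y)
    B = [[pi[x] * A[x][y] / pi[y] for x in range(n)] for y in range(n)]
    mean = sum(pi[i] * f[i] for i in range(n))
    total = sum(pi[i] * (f[i] - mean) ** 2 for i in range(n))   # V[f]
    At, Bt = A, B
    for _ in range(n_terms):
        # f^T Q A^tau f = sum_{i,j} pi_i f_i (A^tau)_{ij} f_j, likewise for B
        total += sum(pi[i] * f[i] * (At[i][j] + Bt[i][j]) * f[j]
                     for i in range(n) for j in range(n)) - 2 * mean * mean
        At, Bt = matmul(At, A), matmul(Bt, B)
    return total
```

For the degenerate chain that draws i.i.d. samples from π (every row of A equal to π^T) every term of the series vanishes and the routine returns exactly V[f]; for a symmetric two-state chain with flip probability p it reproduces the closed form V[f](1 + λ)/(1 − λ) with λ = 1 − 2p.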
It follows that the asymptotic variance can be represented by Kenney's formula [7] in the non-reversible case:

    σ²_A(f) = V[f] + 2 (Qf)^T [Λ^{-1}]_H (Qf) − 2 f^T Q f,    (5)

where [·]_H denotes the Hermitian (symmetric) part of a matrix, and Λ = Q + ππ^T − J, with J = QA being the joint distribution of two consecutive states.

3 Improving the asymptotic variance

It is clear from Eq. 5 that the transition operator A affects the asymptotic variance only through the term [Λ^{-1}]_H. If the chain is reversible, then J is symmetric, so that Λ is also symmetric, and therefore comparing the asymptotic variance of two MCMC estimators becomes a matter of comparing their J, namely, if² J ≼ J′ = QA′, then σ²_A(f) ≤ σ²_{A′}(f) for any f. This leads to a simple proof of Peskun's theorem in the discrete case [3].

In the case where the Markov chain is non-reversible, i.e., J is asymmetric, the analysis becomes much more complicated. We start by providing a sufficient and necessary condition in section 3.1, which transforms the comparison of asymptotic variance based on arbitrary finite Markov chains into a matrix ordering problem, using a result from matrix analysis. In section 3.2, a special case is identified, in which the asymptotic variance of a reversible chain is compared to that of a non-reversible one whose joint distribution over consecutive states is that of the reversible chain plus a skew-Hermitian matrix. We prove that the resulting non-reversible chain has smaller asymptotic variance, and provide a necessary and sufficient condition for the existence of such non-zero skew-Hermitian matrices. Finally, in section 3.3, we provide a graphical interpretation of the result.

3.1 The general case

From Eq. 5 we know that comparing the asymptotic variances of two MCMC estimators is equivalent to comparing their [Λ^{-1}]_H. 
The following result from [1, 2] allows us to write [Λ^{-1}]_H in terms of the symmetric and skew-symmetric parts of Λ.

Lemma 1 If a matrix X is invertible, then ([X^{-1}]_H)^{-1} = [X]_H + [X]_S^T [X]_H^{-1} [X]_S, where [X]_S is the skew-Hermitian part of X.

From Lemma 1, it follows immediately that in the discrete case, the comparison of MCMC estimators based on two Markov chains with the same stationary distribution can be cast as a different problem of matrix comparison, as stated in the following proposition.

Proposition 1 Let A, A′ be two transition operators of ergodic Markov chains with stationary distribution π. Let J = QA, J′ = QA′, Λ = Q + ππ^T − J, Λ′ = Q + ππ^T − J′. Then the following three conditions are equivalent:

1) σ²_A(f) ≤ σ²_{A′}(f) for any f;
2) [Λ^{-1}]_H ≼ [(Λ′)^{-1}]_H;
3) [J]_H − [J]_S^T [Λ]_H^{-1} [J]_S ≼ [J′]_H − [J′]_S^T [Λ′]_H^{-1} [J′]_S.

Proof. First we show that Λ is invertible. Following the steps in [3], for any f ≠ 0,

    f^T Λ f = f^T [Λ]_H f = f^T (Q + ππ^T − J) f = (1/2) E[(f(x_t) − f(x_{t+1}))²] + (E[f(x_t)])² > 0,

thus [Λ]_H ≻ 0, and Λ is invertible since Λf ≠ 0 for any f ≠ 0.

Conditions 1) and 2) are equivalent by definition. We now prove that 2) is equivalent to 3). 

²For symmetric matrices X and Y, we write X ≼ Y if Y − X is positive semi-definite, and X ≺ Y if Y − X is positive definite.
By Lemma 1,

    [Λ^{-1}]_H ≼ [(Λ′)^{-1}]_H  ⟺  [Λ]_H + [Λ]_S^T [Λ]_H^{-1} [Λ]_S ≽ [Λ′]_H + [Λ′]_S^T [Λ′]_H^{-1} [Λ′]_S;

the result follows by noticing that [Λ]_H = Q + ππ^T − [J]_H and [Λ]_S = −[J]_S.

3.2 A special case

Generally speaking, the conditions in Proposition 1 are very hard to verify, particularly because of the term [J]_S^T [Λ]_H^{-1} [J]_S. Here we focus on a special case where [J′]_S = 0 and [J′]_H = J′ = [J]_H. This amounts to the case where the second chain is reversible, and its transition operator is the average of the transition operator of the first chain and the associated reverse operator. The result is formalized in the following corollary.

Corollary 1 Let T be a reversible transition operator of a Markov chain with stationary distribution π. Assume there is some H that satisfies

Condition I. 1^T H = 0, H1 = 0, H = −H^T,³ and
Condition II. T ± Q^{-1}H are valid transition matrices.

Denote A = T + Q^{-1}H, B = T − Q^{-1}H. Then

1) A preserves π, and B is the reverse operator of A.
2) σ²_A(f) = σ²_B(f) ≤ σ²_T(f) for any f.
3) If H ≠ 0, then there is some f such that σ²_A(f) < σ²_T(f).
4) If A_ε = T + (1 + ε) Q^{-1}H is a valid transition matrix, ε > 0, then σ²_{A_ε}(f) ≤ σ²_A(f) for any f.

Proof. For 1), notice that π^T T = π^T, so

    π^T A = π^T T + π^T Q^{-1} H = π^T + 1^T H = π^T,

and similarly for B. Moreover,

    QA = QT + H = (QT − H)^T = (Q(T − Q^{-1}H))^T = (QB)^T,

thus B is the reverse operator of A.

For 2), σ²_A(f) = σ²_B(f) follows from Eq. 5. Let J′ = QT, J = QA. 
Note that [J]_S = H, and

    J′ = QT = (1/2)(QA + QB) = [QA]_H = [J]_H,

and [Λ]_H ≻ 0, thus H^T [Λ]_H^{-1} H ≽ 0. It follows from Proposition 1 that σ²_A(f) ≤ σ²_T(f) for any f.

For 3), write X = [Λ]_H = Σ_{s=1}^{S} λ_s e_s e_s^T, with λ_s > 0 for all s. Since X ≻ 0 and H X^{-1} H^T ≽ 0, one can write

    [Λ^{-1}]_H = (X + H^T X^{-1} H)^{-1} = X^{-1} − X^{-1} H^T (X + H X^{-1} H^T)^{-1} H X^{-1}.

Since H ≠ 0, there is at least one s* such that H e_{s*} ≠ 0. Let f = Q^{-1} X H e_{s*}. Then

    (1/2)[σ²_T(f) − σ²_A(f)] = (Qf)^T [X^{-1} − (X + H^T X^{-1} H)^{-1}] (Qf)
        = (Qf)^T X^{-1} H^T (X + H X^{-1} H^T)^{-1} H X^{-1} (Qf)
        = (H H e_{s*})^T (X + H X^{-1} H^T)^{-1} (H H e_{s*}) > 0,

where the last step uses X + H X^{-1} H^T ≻ 0 and H H e_{s*} = −H^T H e_{s*} ≠ 0, since e_{s*}^T H^T H e_{s*} = ||H e_{s*}||² > 0.

For 4), let Λ_ε = Q + ππ^T − QA_ε. Then for ε > 0,

    [Λ_ε^{-1}]_H = (X + (1 + ε)² H^T X^{-1} H)^{-1} ≼ (X + H^T X^{-1} H)^{-1} = [Λ^{-1}]_H,

so by Eq. 5, σ²_{A_ε}(f) ≤ σ²_A(f) for any f.

Corollary 1 shows that, starting from a reversible Markov chain, as long as one can find a non-zero H satisfying Conditions I and II, the asymptotic performance of the MCMC estimator is guaranteed to improve. The next question to ask is whether such an H exists, and, if so, how to find one. We answer this question by first looking at Condition I. 

³We write 1 for the S-dimensional column vector of 1's.
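Before turning to the construction of H, the claims of Corollary 1 can be checked numerically on a toy chain. The following sketch is our own (the 3-state Metropolis chain, the choice ε = 0.05, and all names are illustrative assumptions, not code from the paper): it builds A = T + Q⁻¹H for the 3-state vortex, verifies Conditions I-II and claim 1, and checks the variance inequality of claim 2 via the truncated series of Eq. 4.

```python
# Our own numerical check of Corollary 1; the chain and constants are
# illustrative only.

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def asym_var(A, pi, f, n_terms=800):
    # truncated series of Eq. 4, with B the reverse operator of Eq. 1
    n = len(pi)
    B = [[pi[x] * A[x][y] / pi[y] for x in range(n)] for y in range(n)]
    mean = sum(pi[i] * f[i] for i in range(n))
    total = sum(pi[i] * (f[i] - mean) ** 2 for i in range(n))
    At, Bt = A, B
    for _ in range(n_terms):
        total += sum(pi[i] * f[i] * (At[i][j] + Bt[i][j]) * f[j]
                     for i in range(n) for j in range(n)) - 2 * mean * mean
        At, Bt = matmul(At, A), matmul(Bt, B)
    return total

pi = [0.2, 0.3, 0.5]
T = [[0.0] * 3 for _ in range(3)]           # reversible Metropolis chain
for i in range(3):
    for j in range(3):
        if i != j:
            T[i][j] = 0.5 * min(1.0, pi[j] / pi[i])
    T[i][i] = 1.0 - sum(T[i])

eps = 0.05
U = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]    # vortex 1 -> 2 -> 3 -> 1
A = [[T[i][j] + eps * U[i][j] / pi[i] for j in range(3)] for i in range(3)]

# Conditions I-II and claim 1: A is a valid transition matrix preserving pi.
assert all(a >= -1e-12 for row in A for a in row)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A)
piA = [sum(pi[i] * A[i][j] for i in range(3)) for j in range(3)]
assert all(abs(piA[j] - pi[j]) < 1e-12 for j in range(3))

# Claim 2: the vortex never increases the asymptotic variance.
for f in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 2.0, 4.0]):
    assert asym_var(A, pi, f) <= asym_var(T, pi, f) + 1e-9
```

Doubling the vortex strength (claim 4) decreases the variance further as long as T + 2Q⁻¹H remains a valid transition matrix, which it is for this toy chain.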
The following proposition shows that any H satisfying this condition can be constructed systematically.

Proposition 2 Let H be an S-by-S matrix. H satisfies Condition I if and only if H can be written as a linear combination of the (1/2)(S − 1)(S − 2) matrices

    U_{i,j} = u_i u_j^T − u_j u_i^T,  1 ≤ i < j ≤ S − 1,

where u_1, ..., u_{S−1} are S − 1 non-zero linearly independent vectors satisfying u_s^T 1 = 0.

Proof. Sufficiency. It is straightforward to verify that each U_{i,j} is skew-Hermitian and satisfies U_{i,j} 1 = 0. These properties are inherited by any linear combination of the U_{i,j}.

Necessity. We show that there are at most (1/2)(S − 1)(S − 2) linearly independent bases for all H such that H = −H^T and H1 = 0. On one hand, any S-by-S skew-Hermitian matrix can be written as a linear combination of the (1/2)S(S − 1) matrices of the form

    V_{i,j}: {V_{i,j}}_{m,n} = δ(m, i) δ(n, j) − δ(n, i) δ(m, j),

where δ is the standard delta function such that δ(i, j) = 1 if i = j and 0 otherwise. However, the constraint H1 = 0 imposes S − 1 linearly independent constraints, which means that out of the (1/2)S(S − 1) parameters, only

    (1/2)S(S − 1) − (S − 1) = (1/2)(S − 1)(S − 2)

are independent. On the other hand, selecting two non-identical vectors from u_1, ..., u_{S−1} results in (1/2)(S − 1)(S − 2) different U_{i,j}. It remains to be shown that these U_{i,j} are linearly independent. Assume

    Σ_{1≤i<j≤S−1} κ_{i,j} (u_i u_j^T − u_j u_i^T) = 0.

Consider two cases. First, assume u_1, ..., u_{S−1} are orthogonal, i.e., u_i^T u_j = 0 for i ≠ j. 
For a particular u_s, multiplying the equation on the right by u_s and using orthogonality gives

    ||u_s||² ( Σ_{i<s} κ_{i,s} u_i − Σ_{j>s} κ_{s,j} u_j ) = 0.

Since ||u_s||² ≠ 0 and the u_i are linearly independent, it follows that κ_{i,s} = κ_{s,j} = 0 for all 1 ≤ i < s < j ≤ S − 1. This holds for every s, so all coefficients vanish and the U_{i,j} are linearly independent. Second, for general (not necessarily orthogonal) linearly independent u_1, ..., u_{S−1}, write u_s = M v_s with v_1, ..., v_{S−1} orthogonal and M invertible; then U_{i,j} = M (v_i v_j^T − v_j v_i^T) M^T, and linear independence follows from the first case.

Turning to Condition II: since (T ± Q^{-1}H) 1 = T1 ± Q^{-1}H1 = 1 by Condition I, only the non-negativity constraint needs to be considered.

It turns out that not all reversible Markov chains admit a non-zero H satisfying both Conditions I and II. For example, consider a Markov chain with only two states. It is impossible to find a non-zero skew-Hermitian H such that H1 = 0, because all 2-by-2 skew-Hermitian matrices are proportional to

    [ 0 −1 ; 1 0 ].

The next proposition gives the sufficient and necessary condition for the existence of a non-zero H satisfying both I and II. In particular, it shows an interesting link between the existence of such an H and the connectivity of the states in the reversible chain.

Proposition 3 Assume a reversible ergodic Markov chain with transition matrix T and let J = QT. The state transition graph G_T is defined as the undirected graph with node set S = {1, ..., S} and edge set {(i, j) : J_{i,j} > 0, 1 ≤ i < j ≤ S}. Then there exists some non-zero H satisfying Conditions I and II if and only if there is a loop in G_T.

Proof. Sufficiency: Without loss of generality, assume the loop is made of states 1, 2, ..., N and edges (1, 2), ..., (N − 1, N), (N, 1), with N ≥ 3. By definition, J_{1,N} > 0, and J_{n,n+1} > 0 for all 1 ≤ n ≤ N − 1. 
A non-zero H can then be constructed as

    H_{i,j} =  ε,   if 1 ≤ i ≤ N − 1 and j = i + 1,
              −ε,   if 2 ≤ i ≤ N and j = i − 1,
               ε,   if i = N and j = 1,
              −ε,   if i = 1 and j = N,
               0,   otherwise,

where

    ε = min_{1≤n≤N−1} { J_{n,n+1}, 1 − J_{n,n+1}, J_{1,N}, 1 − J_{1,N} }.

Clearly, ε > 0, since all the items in the minimum are above 0. It is trivial to verify that H = −H^T and H1 = 0.

Necessity: Assume there are no loops in G_T. Then all states in the chain must be organized in a tree, following the ergodic assumption. In other words, there are exactly 2(S − 1) non-zero off-diagonal elements in J, and these 2(S − 1) elements are arranged symmetrically about the diagonal, spanning every column and row of J.

Because the states are organized in a tree, there is at least one leaf node s in G_T, with a single neighbor s′. Row s and column s of J thus look like r_s = [..., p_{s,s}, ..., p_{s,s′}, ...] and its transpose, respectively, with p_{s,s} ≥ 0, p_{s,s′} > 0, and all other entries being 0.

Assume one wants to construct some H such that J ± H ≥ 0. Let h_s be the s-th row of H. Since r_s ± h_s ≥ 0, all except the s′-th element of h_s must be 0. But since h_s 1 = 0, the whole s-th row, and thus the s-th column, of H must be 0.

Having set the s-th column and row of H to 0, one can consider the reduced Markov chain with one state fewer, and repeat the argument with another leaf node. Working progressively along the tree, it follows that all rows and columns of H must be 0.

Together, Propositions 2 and 3 indicate that all reversible chains can be improved in terms of asymptotic variance using Corollary 1, except those whose transition graphs are trees. 
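The sufficiency half of the proof is constructive and easy to implement. A minimal sketch (our own code and naming; `J` is the joint matrix QT):

```python
# Our own sketch of the constructive step in Proposition 3: given the joint
# matrix J = QT of a reversible chain and a loop in its transition graph,
# build a vortex matrix H and step size eps satisfying Conditions I and II.

def vortex_H(J, loop):
    """loop lists the states of a cycle, e.g. [0, 1, 2] for 0 -> 1 -> 2 -> 0."""
    n = len(J)
    edges = list(zip(loop, loop[1:] + loop[:1]))     # directed cycle edges
    eps = min(min(J[i][j], 1.0 - J[i][j]) for i, j in edges)
    H = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        H[i][j] += eps      # push probability flow along the loop...
        H[j][i] -= eps      # ...and remove it from the reverse direction
    return H, eps

# Demo: uniform 3-state chain, J[i][j] = 1/6 off the diagonal.
J = [[0.0 if i == j else 1 / 6 for j in range(3)] for i in range(3)]
H, eps = vortex_H(J, [0, 1, 2])
assert all(abs(H[i][j] + H[j][i]) < 1e-12 for i in range(3) for j in range(3))
assert all(abs(sum(row)) < 1e-12 for row in H)       # Condition I
assert min(J[i][j] + H[i][j] for i in range(3) for j in range(3)) >= -1e-12
assert min(J[i][j] - H[i][j] for i in range(3) for j in range(3)) >= -1e-12  # Condition II
```

Since the rows of T ± Q⁻¹H automatically sum to 1 when H1 = 0, the non-negativity of J ± H checked above is the only binding constraint, exactly as argued in the text.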
In practice, the non-tree constraint is not a problem, because almost all current methods of constructing reversible chains generate chains with loops.

3.3 Graphical interpretation

In this subsection we provide a graphical interpretation of the results in the previous sections. Starting from a simple case, consider a reversible Markov chain with three states forming a loop. Let u_1 = [1, 0, −1]^T and u_2 = [0, 1, −1]^T. Clearly, u_1 and u_2 are linearly independent and u_1^T 1 = u_2^T 1 = 0. By Propositions 2 and 3, there exists some ε > 0 such that H = εU_{1,2} satisfies Conditions I and II, with U_{1,2} = u_1 u_2^T − u_2 u_1^T.

Figure 1: Illustration of the construction of larger vortices. The left-hand side is a state transition graph of a reversible Markov chain with S = 9 states, with a vortex 3 → 8 → 6 → 5 → 4 of strength ε inserted. The corresponding H can be expressed as a linear combination of U_{i,j}, as shown on the right-hand side of the graph. We start from the vortex 8 → 6 → 9 → 8, and add one vortex at a time. The dotted lines correspond to edges on which the flows cancel out when a new vortex is added. For example, when vortex 6 → 5 → 9 → 6 is added, edge 9 → 6 cancels edge 6 → 9 in the previous vortex, resulting in a larger vortex with four states. Note that in this way one can construct vortices which do not include state 9, although each U_{i,j} is a vortex involving 9.

Writing U_{1,2} and J + H in explicit form,

    U_{1,2} = [  0   1  −1 ;  −1   0   1 ;   1  −1   0 ],

    J + H = [ p_{1,1}       p_{1,2} + ε   p_{1,3} − ε ;
              p_{2,1} − ε   p_{2,2}       p_{2,3} + ε ;
              p_{3,1} + ε   p_{3,2} − ε   p_{3,3}     ],

with p_{i,j} being the probability of the consecutive states being i, j. 
It is clear that in J + H, the probability of the jumps 1 → 2, 2 → 3, and 3 → 1 is increased, while the probability of jumps in the opposite direction is decreased. Intuitively, this amounts to adding a 'vortex' of direction 1 → 2 → 3 → 1 to the state transitions. Similarly, the joint probability matrix of the reverse operator is J − H, which adds a vortex in the opposite direction. This simple case also explains why adding or subtracting a non-zero H can only be done where a loop already exists, since the operation requires subtracting ε from all entries of J corresponding to edges in the loop.

In the general case, define S − 1 vectors u_1, ..., u_{S−1} as

    u_s = [0, ..., 0, 1, 0, ..., 0, −1]^T,

with the 1 in the s-th position. It is straightforward to see that u_1, ..., u_{S−1} are linearly independent and u_s^T 1 = 0 for all s; thus any H satisfying Condition I can be represented as a linear combination of the U_{i,j} = u_i u_j^T − u_j u_i^T, with each U_{i,j} containing 1's at positions (i, j), (j, S), (S, i), and −1's at positions (i, S), (S, j), (j, i). It is easy to verify that adding εU_{i,j} to J amounts to introducing a vortex of direction i → j → S → i, and any vortex over N states (N ≥ 3), s_1 → s_2 → ··· → s_N → s_1, can be represented by the linear combination Σ_{n=1}^{N−2} U_{s_n,s_{n+1}} in the case of state S being in the vortex (assuming s_N = S without loss of generality), or by U_{s_N,s_1} + Σ_{n=1}^{N−1} U_{s_n,s_{n+1}} if S is not in the vortex, as demonstrated in Figure 1. Therefore, adding or subtracting an H to J is equivalent to inserting a number of vortices into the state transition map.

4 An example

Adding vortices to the state transition graph forces the Markov chain to move in loops following pre-specified directions. 
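The decomposition described in section 3.3 can be verified mechanically. In this sketch (our own, not the paper's code; indices are 0-based, and we adopt the convenient convention u_S = 0, so that terms U_{i,S} vanish), summing U_{i,j} over the directed edges of a cycle, including the closing edge, reproduces the cycle's vortex matrix whether or not state S lies on the cycle:

```python
# Our own sketch: verify that the basis matrices U_{i,j} = u_i u_j^T - u_j u_i^T
# with u_s = e_s - e_{S-1} (0-based) sum to the vortex matrix of any cycle.

def u_vec(s, S):
    v = [0] * S
    v[s] += 1
    v[S - 1] -= 1          # note: u_{S-1} becomes the zero vector
    return v

def U(i, j, S):
    ui, uj = u_vec(i, S), u_vec(j, S)
    return [[ui[a] * uj[b] - uj[a] * ui[b] for b in range(S)]
            for a in range(S)]

def cycle(loop, S):
    # +1 on each directed edge of the cycle, -1 on the reversed edge
    C = [[0] * S for _ in range(S)]
    for i, j in zip(loop, loop[1:] + loop[:1]):
        C[i][j] += 1
        C[j][i] -= 1
    return C

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

S = 5
for loop in ([0, 1, 3], [0, 1, 4], [2, 3, 1, 0]):
    total = [[0] * S for _ in range(S)]
    for i, j in zip(loop, loop[1:] + loop[:1]):
        total = add(total, U(i, j, S))
    assert total == cycle(loop, S)   # holds with or without state S-1 in the loop
```

Each U(i, j) is itself the 3-cycle i → j → S → i, and the edges through S cancel telescopically around the loop, which is exactly the cancellation shown by the dotted lines in Figure 1.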
The benefit of this can be illustrated by the following example. Consider a reversible Markov chain with S states forming a ring, namely, from state s one can only jump to s ⊕ 1 or s ⊖ 1, with ⊕ and ⊖ being mod-S addition and subtraction. The only possible non-zero H in this example is of the form ε Σ_{s=1}^{S−1} U_{s,s+1}, corresponding to vortices on the large ring.

We assume the uniform stationary distribution π(s) = 1/S. In this case, any reversible chain behaves like a random walk. The chain which achieves minimal asymptotic variance is the one with the probability of jumping forward and backward both equal to 1/2. The expected number of steps for this chain to reach the state S/2 edges away is S²/4. However, adding the vortex reduces this number to roughly S/(2ε) for large S, suggesting that it is much easier for the non-reversible chain to reach faraway states, especially for large S. In the extreme case, when ε = 1/2, the chain cycles deterministically, reducing the asymptotic variance to zero. Also note that the reversible chain here has zero probability of staying in the current state, and thus cannot be further improved using Peskun's theorem.

Our intuition about why adding vortices helps is that chains with vortices move faster than reversible ones, making the function values along the trajectories less correlated. This effect is demonstrated in Figure 2.

Figure 2: Demonstration of the vortex effect: (a) and (b) show two different reversible Markov chains, each containing 128 states connected in a ring. The equilibrium distribution of the chains is depicted by the gray inner circles; darker shades correspond to higher probability. The equilibrium distribution of chain (a) is uniform, while that of (b) contains two peaks half a ring apart. In addition, the chains are constructed such that the probability of staying in the same state is zero. In each case, two trajectories of length 1000 are generated from the chain with and without the vortex, starting from the state pointed to by the arrow. The length of the bar radiating out from a given state represents the relative frequency of visits to that state, with red and blue bars corresponding to chains with and without the vortex, respectively. It is clear from the graph that trajectories sampled from the reversible chains spread much more slowly, with only 1/5 of the states reached in (a) and 1/3 in (b), and the trajectory in (b) does not escape from the current peak. On the other hand, with vortices added, trajectories of the same length spread over all the states, and effectively explore both peaks of the stationary distribution in (b). Plot (c) shows the correlation of function values (normalized by variance) between two states τ time steps apart, with τ ranging from 1 to 600. Here we take the Markov chains from (b) and use the function f(s) = cos(4π · s/128). When vortices are added, not only do the absolute values of the correlations go down significantly, but their signs also alternate, indicating that these correlations tend to cancel out in the sum of Eq. 4.

5 Conclusion

In this paper, we have presented a new way of converting a reversible finite Markov chain into a non-reversible one, with the theoretical guarantee that the asymptotic variance of the MCMC estimator based on the non-reversible chain is reduced. 
The method is applicable to any reversible chain whose states are not connected through a tree, and can be interpreted graphically as inserting vortices into the state transition graph.

The results confirm that non-reversible chains are fundamentally better than reversible ones. The general framework of Proposition 1 suggests further improvements of MCMC's asymptotic performance, by applying other results from matrix analysis to asymptotic variance reduction. The combined results of Corollary 1 and Propositions 2 and 3 provide a specific way of doing so, and pose interesting research questions. Which combinations of vortices yield optimal improvements for a given chain? Finding one of them is a combinatorial optimization problem. How can a good combination be constructed in practice, using limited history and computational resources?

References

[1] R.P. Wen, "Properties of the Matrix Inequality", Journal of Taiyuan Teachers College, 2005.
[2] R. Mathias, "Matrices with Positive Definite Hermitian Part: Inequalities and Linear Systems", http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.1768, 1992.
[3] L.H. Li, "A New Proof of Peskun's and Tierney's Theorems using Matrix Method", Joint Graduate Students Seminar of Department of Statistics and Department of Biostatistics, Univ. of Toronto, 2005.
[4] R.M. Neal, "Improving Asymptotic Variance of MCMC Estimators: Non-reversible Chains are Better", Technical Report No. 0406, Department of Statistics, Univ. of Toronto, 2004.
[5] I. Murray, "Advances in Markov Chain Monte Carlo Methods", M.Sci. thesis, University College London, 2007.
[6] R.M. Neal, "Bayesian Learning for Neural Networks", Springer, 1996.
[7] J. Kenney and E.S. 
Keeping, "Mathematics of Statistics", van Nostrand, 1963.
[8] C. Andrieu, N. de Freitas, A. Doucet, and M.I. Jordan, "An Introduction to MCMC for Machine Learning", Machine Learning, 50, 5-43, 2003.
[9] Szakdolgozat, "The Mixing Rate of Markov Chain Monte Carlo Methods and some Applications of MCMC Simulation in Bioinformatics", M.Sci. thesis, Eotvos Lorand University, 2006.
[10] P.H. Peskun, "Optimum Monte-Carlo Sampling using Markov Chains", Biometrika, vol. 60, pp. 607-612, 1973.
[11] L. Tierney, "A Note on Metropolis-Hastings Kernels for General State Spaces", Ann. Appl. Probab., 8, 1-9, 1998.
[12] S. Duane, A.D. Kennedy, B.J. Pendleton, and D. Roweth, "Hybrid Monte Carlo", Physics Letters B, vol. 195-2, 1987.
[13] J.S. Liu, "Peskun's Theorem and a Modified Discrete-State Gibbs Sampler", Biometrika, vol. 83, pp. 681-682, 1996.
", "award": [], "sourceid": 338, "authors": [{"given_name": "Yi", "family_name": "Sun", "institution": null}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}, {"given_name": "Faustino", "family_name": "Gomez", "institution": null}]}