{"title": "Projecting Markov Random Field Parameters for Fast Mixing", "book": "Advances in Neural Information Processing Systems", "page_first": 1377, "page_last": 1385, "abstract": "Markov chain Monte Carlo (MCMC) algorithms are simple and extremely powerful techniques to sample from almost arbitrary distributions. The flaw in practice is that it can take a large and/or unknown amount of time to converge to the stationary distribution. This paper gives sufficient conditions to guarantee that univariate Gibbs sampling on Markov Random Fields (MRFs) will be fast mixing, in a precise sense. Further, an algorithm is given to project onto this set of fast-mixing parameters in the Euclidean norm. Following recent work, we give an example use of this to project in various divergence measures, comparing of univariate marginals obtained by sampling after projection to common variational methods and Gibbs sampling on the original parameters.", "full_text": "Projecting Markov Random Field Parameters for\n\nFast Mixing\n\nXianghang Liu\n\nJustin Domke\n\nNICTA, The University of New South Wales\n\nNICTA, The Australian National University\n\nxianghang.liu@nicta.com.au\n\njustin.domke@nicta.com.au\n\nAbstract\n\nMarkov chain Monte Carlo (MCMC) algorithms are simple and extremely power-\nful techniques to sample from almost arbitrary distributions. The \ufb02aw in practice\nis that it can take a large and/or unknown amount of time to converge to the station-\nary distribution. This paper gives suf\ufb01cient conditions to guarantee that univariate\nGibbs sampling on Markov Random Fields (MRFs) will be fast mixing, in a pre-\ncise sense. Further, an algorithm is given to project onto this set of fast-mixing\nparameters in the Euclidean norm. 
Following recent work, we give an example use of this to project in various divergence measures, comparing univariate marginals obtained by sampling after projection to common variational methods and Gibbs sampling on the original parameters.\n\n1 Introduction\n\nExact inference in Markov Random Fields (MRFs) is generally intractable, motivating approximate algorithms. There are two main classes of approximate inference algorithms: variational methods and Markov chain Monte Carlo (MCMC) algorithms [13].\n\nAmong variational methods, mean-field approximations [9] are based on a “tractable” family of distributions, such as the fully-factorized distributions. Inference finds a distribution in the tractable set to minimize the KL-divergence from the true distribution. Other methods, such as loopy belief propagation (LBP), generalized belief propagation [14] and expectation propagation [10] use a less restricted family of target distributions, but approximate the KL-divergence. Variational methods are typically fast, and often produce high-quality approximations. However, when the variational approximations are poor, estimates can be correspondingly worse.\n\nMCMC strategies, such as Gibbs sampling, simulate a Markov chain whose stationary distribution is the target distribution. Inference queries are then answered by the samples drawn from the Markov chain. In principle, MCMC will be arbitrarily accurate if run long enough. The principal difficulty is that the time for the Markov chain to converge to its stationary distribution, or the “mixing time”, can be exponential in the number of variables.\n\nThis paper is inspired by a recent hybrid approach for Ising models [3]. This approach minimizes the divergence from the true distribution to one in a tractable family. However, the tractable family is a “fast mixing” family where Gibbs sampling is guaranteed to quickly converge to the stationary distribution. 
They observe that an Ising model will be fast mixing if the spectral norm of a matrix containing the absolute values of all interaction strengths is controlled. An algorithm projects onto this fast-mixing parameter set in the Euclidean norm, and projected gradient descent (PGD) can minimize various divergence measures. This often leads to inference results that are better than either simple variational methods or univariate Gibbs sampling (with a limited time budget). However, this approach is limited to Ising models, and scales poorly in the size of the model, due to the difficulty of projecting onto the spectral norm.\n\nThe principal contributions of this paper are, first, a set of sufficient conditions to guarantee that univariate Gibbs sampling on an MRF will be fast-mixing (Section 4), and an algorithm to project onto this set in the Euclidean norm (Section 5). A secondary contribution of this paper is considering an alternative matrix norm (the induced ∞-norm) that is somewhat looser than the spectral norm, but more computationally efficient. Following previous work [3], these ideas are experimentally validated via a projected gradient descent algorithm to minimize other divergences, and looking at the accuracy of the resulting marginals. The ability to project onto a fast-mixing parameter set may also be of independent interest. For example, it might be used during maximum likelihood learning to ensure that the gradients estimated through sampling are more accurate.\n\n2 Notation\n\nWe consider discrete pairwise MRFs with n variables, where the i-th variable takes values in {1, ..., Li}, E is the set of edges, and θ are the potentials on each edge. Each edge in E is an ordered pair (i, j) with i ≤ j. The parameters are a set of matrices θ := {θij | θij ∈ R^{Li×Lj}, ∀(i, j) ∈ E}. When i > j and (j, i) ∈ E, we let θij denote the transpose of θji. 
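As a concrete sketch of this parameter layout (the class and method names here are ours, not from the paper), the edge matrices and the transpose convention can be stored as:

```python
import numpy as np

# Minimal container for the parameter set theta = {theta^{ij}}: one L_i x L_j
# matrix per ordered edge (i, j) with i <= j; a lookup with i > j returns the
# transpose, matching the convention theta^{ij} = (theta^{ji})^T in the text.
class PairwiseParams:
    def __init__(self):
        self.edges = {}  # maps (i, j) with i <= j to an L_i x L_j array

    def set_edge(self, i, j, mat):
        assert i <= j, "edges are stored as ordered pairs (i, j) with i <= j"
        self.edges[(i, j)] = np.asarray(mat, dtype=float)

    def get(self, i, j):
        # theta^{ij} for stored edges; the transpose of theta^{ji} otherwise.
        if i <= j:
            return self.edges[(i, j)]
        return self.edges[(j, i)].T

theta = PairwiseParams()
theta.set_edge(0, 1, np.array([[0.2, -0.4], [0.6, 0.8]]))
```

Here `theta.get(1, 0)` silently returns the transpose of the matrix stored for the edge (0, 1), so both orientations of an edge can be read uniformly.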
The corresponding distribution is\n\np(x; θ) = exp( Σ_{(i,j)∈E} θij(xi, xj) − A(θ) ),    (1)\n\nwhere A(θ) := log Σ_x exp( Σ_{(i,j)∈E} θij(xi, xj) ) is the log-partition function, and θij(xi, xj) denotes the entry in the xi-th row and xj-th column of θij. It is easy to show that any parametrization of a pairwise MRF can be converted into this form. “Self-edges” (i, i) can be included in E if one wishes to explicitly represent univariate terms.\n\nIt is sometimes convenient to work with the exponential family representation\n\np(x; θ) = exp{f(x) · θ − A(θ)},    (2)\n\nwhere f(x) is the sufficient statistics for configuration x. If these are indicator functions for all configurations of all pairs in E, then the two representations are equivalent.\n\n3 Background Theory on Rapid Mixing\n\nThis section reviews background on mixing times that will be used later in the paper.\n\nDefinition 1. Given two finite distributions p and q, the total variation distance ∥·∥_TV is defined as ∥p(X) − q(X)∥_TV = (1/2) Σ_x |p(X = x) − q(X = x)|.\n\nNext, one must define a measure of how fast a Markov chain converges to the stationary distribution. Let the state of the Markov chain after t iterations be X^t. Given a constant ϵ, this is done by finding some number of iterations τ(ϵ) such that the induced distribution p(X^t | X^0 = x) will always have a distance of less than ϵ from the stationary distribution, irrespective of the starting state x.\n\nDefinition 2. Let {X^t} be the sequence of random variables corresponding to running Gibbs sampling on a distribution p. 
The mixing time τ(ϵ) is defined as τ(ϵ) = min{t : d(t) < ϵ}, where d(t) = max_x ∥P(X^t | X^0 = x) − p(X)∥_TV is the maximum distance at time t when considering all possible starting states x.\n\nNow, we are interested in when Gibbs sampling on a distribution p can be shown to have a fast mixing time. The central property we use is the dependency of one variable on another, defined informally as how much the conditional distribution over Xi can change when all variables other than Xj are the same.\n\nDefinition 3. Given a distribution p, the dependency matrix R is defined by\n\nRij = max_{x, x′ : x_{−j} = x′_{−j}} ∥p(Xi | x_{−i}) − p(Xi | x′_{−i})∥_TV.\n\nHere, the constraint x_{−j} = x′_{−j} indicates that all variables in x and x′ are identical except xj. The central result on rapid mixing is given by the following theorem, due to Dyer et al. [5], generalizing the work of Hayes [7]. Informally, it states that if ∥R∥ < 1 for any sub-multiplicative norm ∥·∥, then mixing will take on the order of n ln n iterations, where n is the number of variables.\n\nTheorem 4. [5, Lemma 17] If ∥·∥ is any sub-multiplicative matrix norm and ∥R∥ < 1, the mixing time of univariate Gibbs sampling on a system with n variables with random updates is bounded by\n\nτ(ϵ) ≤ n / (1 − ∥R∥) · ln( n ∥1_n∥ ∥1_n^T∥ / ϵ ).\n\nHere, ∥1_n∥ denotes the same matrix norm applied to a matrix of ones of size n × 1, and similarly for 1_n^T. In particular, if ∥·∥ is induced by a vector p-norm, then ∥1_n∥ ∥1_n^T∥ = n.\n\nSince this result is true for a variety of norms, it is natural to ask, for a given matrix R, which norm will give the strongest result. 
It can be shown that for symmetric matrices (such as the dependency matrix), the spectral norm ∥·∥2 is always superior.\n\nTheorem 5. [5, Lemma 13] If A is a symmetric matrix and ∥·∥ is any sub-multiplicative norm, then ∥A∥2 ≤ ∥A∥.\n\nUnfortunately, as will be discussed below, the spectral norm can be more computationally expensive than other norms. As such, we will also consider the use of the ∞-norm ∥·∥∞. This leads to additional looseness in the bound in general, but the looseness is limited in some cases. In particular, if R = rG where G is the adjacency matrix for some regular graph with degree d, then for all induced p-norms, ∥R∥ = rd, since ∥R∥ = max_{x≠0} ∥Rx∥/∥x∥ = r max_{x≠0} ∥Gx∥/∥x∥ = r∥Go∥/∥o∥ = rd, where o is a vector of ones. Thus, the extra looseness from using, say, ∥·∥∞ instead of ∥·∥2 will tend to be minimal when the graph is close to regular and the dependency is close to a constant value. For irregular graphs with highly variable dependency, the looseness can be much larger.\n\n4 Dependency for Markov Random Fields\n\nIn order to establish that Gibbs sampling on a given MRF will be fast mixing, it is necessary to compute (a bound on) the dependency matrix R, as done in the following result. The proof of this result is fairly long, and so it is postponed to the Appendix. Note that it follows from several bounds on the dependency that are tighter, but less computationally convenient.\n\nTheorem 6. The dependency matrix for a pairwise Markov random field is bounded by\n\nRij(θ) ≤ max_{a,b} (1/2) ∥θij_{·a} − θij_{·b}∥∞.\n\nHere, θij_{·a} indicates the a-th column of θij. 
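As a sketch (in NumPy, with function names of our own choosing), the bound of Theorem 6 and the resulting Theorem 4 mixing-time estimate under the induced ∞-norm can be computed as:

```python
import numpy as np

def dependency_bound(theta_ij):
    # Theorem 6: R_ij <= max over column pairs (a, b) of
    # (1/2) * max-abs entry of the column difference of theta^{ij}.
    cols = np.asarray(theta_ij).T          # row k of `cols` is column k of theta_ij
    diffs = cols[:, None, :] - cols[None, :, :]
    return 0.5 * np.abs(diffs).max()

def mixing_time_bound(R, eps):
    # Theorem 4 with the induced infinity-norm, for which ||1_n|| ||1_n^T|| = n:
    # tau(eps) <= n / (1 - ||R||) * ln(n * n / eps), valid only when ||R|| < 1.
    n = R.shape[0]
    norm = np.abs(R).sum(axis=1).max()     # induced infinity-norm: max row l1 norm
    if norm >= 1.0:
        raise ValueError("bound requires ||R|| < 1")
    return n / (1.0 - norm) * np.log(n * n / eps)

# A 3-variable chain with identical 2x2 Ising-style potentials of strength 0.4:
pot = np.array([[0.4, -0.4], [-0.4, 0.4]])
r = dependency_bound(pot)                  # equals 0.4 for this potential
R = np.zeros((3, 3))
for i, j in [(0, 1), (1, 2)]:
    R[i, j] = R[j, i] = r
```

Here ∥R∥∞ = 0.8 < 1, so by Theorem 4 Gibbs sampling on this chain is fast mixing.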
Note that the MRF can include univariate terms as self-edges with no impact on the dependency bound, regardless of the strength of the univariate terms. It can be seen easily from the definition of R (Definition 3) that for any i the entry Rii for self-edges (i, i) should always be zero. One can, without loss of generality, set each column of θii to be the same, meaning that Rii = 0 in the above bound.\n\n5 Euclidean Projection Operator\n\nThe Euclidean distance between two MRFs parameterized respectively by ψ and θ is ∥θ − ψ∥² := Σ_{(i,j)∈E} ∥θij − ψij∥²_F. This section considers projecting a given vector ψ onto the fast-mixing set or, formally, finding a vector θ with minimum Euclidean distance to ψ, subject to the constraint that a norm ∥·∥∗ applied to the bound on the dependency matrix R is less than some constant c. Euclidean projection is considered because, first, it is a straightforward measure of the closeness between two parameters and, second, it is the building block of projected gradient descent for projection in other distance measures. To begin with, we do not specify the matrix norm ∥·∥∗, as it could be any sub-multiplicative norm (Section 3).\n\nThus, in principle, we would like to find θ to solve\n\nproj_c(ψ) := argmin_{θ : ∥R(θ)∥∗ ≤ c} ∥θ − ψ∥².    (3)\n\nUnfortunately, while convex, this optimization turns out to be somewhat expensive to solve, due to a lack of smoothness. Instead, we introduce a matrix Z, and constrain that Zij ≥ Rij(θ), where Rij(θ) is the bound on dependency in Thm. 6 (as an equality). We add an extra quadratic term α∥Z − Y∥²_F to the objective, where Y is an arbitrarily given matrix and α > 0 trades off between the smoothness and the closeness to the original problem (3). The smoothed projection operator is\n\nproj_C(ψ, Y) := argmin_{(θ,Z)∈C} ∥θ − ψ∥² + α∥Z − Y∥²_F,    C = {(θ, Z) : Zij ≥ Rij(θ), ∥Z∥∗ ≤ c}.    (4)\n\nIf α = 0, this yields a solution that is identical to that of Eq. 3. However, when α = 0, the objective in Eq. 4 is not strongly convex as a function of Z, which results in a dual function which is non-smooth, meaning it must be solved with a method like subgradient descent, with a slow convergence rate. In general, of course, the optimal point of Eq. 4 is different to that of Eq. 3. However, the main usage of the Euclidean projection operator is the projection step in the projected gradient descent algorithm for divergence minimization. In these tasks the smoothed projection operator can be directly used in place of the non-smoothed one without changing the final result. In situations when the exact Euclidean projection is required, it can be done by initializing Y1 arbitrarily and repeating (θ_{k+1}, Y_{k+1}) ← proj_C(ψ, Y_k) for k = 1, 2, . . . until convergence.\n\n5.1 Dual Representation\n\nTheorem 7. Eq. 
4 has the dual representation\n\nmaximize_{σ,φ,Δ,Γ} g(σ, φ, Δ, Γ) subject to σij(a, b, c) ≥ 0, φij(a, b, c) ≥ 0, ∀(i, j) ∈ E, a, b, c,    (5)\n\nwhere\n\ng(σ, φ, Δ, Γ) = min_Z h1(Z; σ, φ, Δ, Γ) + min_θ h2(θ; σ, φ),\n\nh1(Z; σ, φ, Δ, Γ) = −tr(ZΛ^T) + I(∥Z∥∗ ≤ c) + α∥Z − Y∥²_F,\n\nh2(θ; σ, φ) = ∥θ − ψ∥² + (1/2) Σ_{(i,j)∈E} Σ_{a,b,c} (σij(a, b, c) − φij(a, b, c)) (θij_{c,a} − θij_{c,b}),\n\nin which Λij := Δij Dij + Γ̂ij + Σ_{a,b,c} (σij(a, b, c) + φij(a, b, c)), where Γ̂ij := Γij if (i, j) ∈ E and Γ̂ij := −Γij if (j, i) ∈ E, and D is an indicator matrix with Dij = 0 if (i, j) ∈ E or (j, i) ∈ E, and Dij = 1 otherwise. The dual variables σij and φij are arrays of size Lj × Li × Li for all pairs (i, j) ∈ E, while Δ and Γ are of size n × n.\n\nThe proof of this is in the Appendix. Here, I(·) is the indicator function with I(x) = 0 when x is true and I(x) = ∞ otherwise.\n\nBeing a smooth optimization problem with simple bound constraints, Eq. 5 can be solved with LBFGS-B [2]. For a gradient-based method like this to be practical, it must be possible to quickly evaluate g and its gradient. This is complicated by the fact that g is defined in terms of the minimization of h1 with respect to Z and h2 with respect to θ. We discuss how to solve these problems now. We first consider the minimization of h2. This is a quadratic function of θ and can be solved analytically via the condition that ∂h2(θ; σ, φ)/∂θ = 0. The closed form solution is\n\nθij_{c,a} = ψij_{c,a} − (1/4) ( Σ_b σij(a, b, c) − Σ_b σij(b, a, c) − Σ_b φij(a, b, c) + Σ_b φij(b, a, c) ),  ∀(i, j) ∈ E, 1 ≤ a, c ≤ m.\n\nThe time complexity is linear in the size of ψ.\n\nMinimizing h1 is more involved. We assume to start that there exists an algorithm to quickly project a matrix onto the set {Z : ∥Z∥∗ ≤ c}, i.e. to solve the optimization problem\n\nmin_{∥Z∥∗ ≤ c} ∥Z − A∥²_F.    (6)\n\nThen, we observe that argmin_Z h1 is equal to\n\nargmin_Z −tr(ZΛ^T) + I(∥Z∥∗ ≤ c) + α∥Z − Y∥²_F = argmin_{∥Z∥∗ ≤ c} ∥Z − (Y + (1/(2α)) Λ)∥²_F.\n\nFor different norms ∥·∥∗, the projection algorithm will be different and can have a large impact on efficiency. We will discuss in the following sections the choices of ∥·∥∗ and an algorithm for the ∞-norm.\n\nFinally, once h1 and h2 have been solved, the gradient of g is (by Danskin’s theorem [1])\n\n∂g/∂Δij = −Dij Ẑij,  ∂g/∂Γij = Ẑji − Ẑij,  ∂g/∂σij(a, b, c) = (1/2)(θ̂ij_{c,a} − θ̂ij_{c,b}) − Ẑij,  ∂g/∂φij(a, b, c) = −∂g/∂σij(a, b, c),\n\nwhere Ẑ and θ̂ represent the solutions to the subproblems.\n\n5.2 Spectral Norm\n\nWhen ∥·∥∗ is set to the spectral norm, i.e. the largest singular value of a matrix, the projection in Eq. 6 can be performed by thresholding the singular values of A [3]. Theoretically, using the spectral norm will give a tighter bound on Z than other norms (Section 3). 
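A minimal sketch of this singular-value thresholding (assuming a dense matrix; the function name is ours):

```python
import numpy as np

def project_spectral(A, c):
    # Frobenius-norm projection onto the spectral-norm ball {Z : ||Z||_2 <= c},
    # done by clipping the singular values of A at c.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.minimum(s, c)) @ Vt

A = np.diag([3.0, 1.0])
Z = project_spectral(A, 2.0)   # clips the singular value 3 down to 2
```

Each call costs a full SVD, which is what makes this choice of norm expensive for large models.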
However, computing a full singular value decomposition can be impractically slow for a graph with a large number of variables.\n\n5.3 ∞-norm\n\nHere, we consider setting ∥·∥∗ to the ∞-norm, ∥A∥∞ = max_i Σ_j |Aij|, which measures the maximum l1 norm of the rows of A. This norm has several computational advantages. Firstly, to project a matrix onto an ∞-norm ball {A : ∥A∥∞ ≤ c}, we can simply project each row a_i of the matrix onto the l1-norm ball {a : ∥a∥1 ≤ c}. Duchi et al. [4] provide a method linear in the number of nonzeros in a and logarithmic in the length of a. Thus, if Z is an n × n matrix, Eq. 6 for the ∞-norm can be solved in time n² and, for sufficiently sparse matrices, in time n log n.\n\nA second advantage of the ∞-norm is that (unlike the spectral norm) projection in Eq. 6 preserves the sparsity of the matrix. Thus, one can disregard the matrix D and dual variables Δ when solving the optimization in Theorem 7. This means that Z itself can be represented sparsely, i.e. we only need variables for those (i, j) ∈ E. These simplifications significantly improve the efficiency of projection, with some tradeoff in accuracy.\n\n6 Projection in Divergences\n\nIn this section, we want to find a distribution p(x; θ) in the fast-mixing family closest to a target distribution p(x; ψ) in some divergence D(ψ, θ). The choice of divergence depends on convenience of projection, the approximate family and the inference task. 
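The row-wise l1 projection behind the ∞-norm case of Eq. 6 (Section 5.3) can be sketched as follows; this is a simple O(n log n) sort-based variant rather than the linear-time method of Duchi et al., and the function names are ours:

```python
import numpy as np

def project_l1_ball(v, c):
    # Euclidean projection of vector v onto {a : ||a||_1 <= c}.
    if np.abs(v).sum() <= c:
        return v.astype(float).copy()
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - c)[0][-1]     # largest feasible support size
    tau = (css[rho] - c) / (rho + 1.0)           # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def project_inf_norm(A, c):
    # Projection onto {Z : ||Z||_inf <= c}: project each row onto the l1 ball.
    return np.vstack([project_l1_ball(row, c) for row in A])
```

For example, projecting the row [3, 1] with c = 2 soft-thresholds it to [2, 0], while rows already inside the ball pass through unchanged, which is why sparsity is preserved.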
We will first present a general algorithmic framework based on projected gradient descent (Algorithm 1), and then discuss the details of several previously proposed divergences [11, 3].\n\n6.1 General algorithm framework for divergence minimization\n\nThe problem of projection in divergences is formulated as\n\nmin_{θ ∈ C̄} D(ψ, θ),    (7)\n\nwhere D(·, ·) is some divergence measure and C̄ := {θ : ∃Z s.t. (θ, Z) ∈ C}, where C is the feasible set in Eq. 4. Our general strategy for this is to use projected gradient descent to solve the optimization\n\nmin_{(θ,Z) ∈ C} D(ψ, θ),    (8)\n\nusing the joint operator to project onto C described in Section 5.\n\nFor different divergences, the only difference in the projection algorithm is the evaluation of the gradient ∇θ D(ψ, θ). It is clear that if (θ∗, Z∗) is the solution of Eq. 8, then θ∗ is the solution of Eq. 7.\n\nAlgorithm 1 Projected gradient descent for divergence projection\n\nInitialize (θ1, Z1), k ← 1.\nrepeat\n  θ′ ← θk − λ∇θ D(ψ, θk)\n  (θ_{k+1}, Z_{k+1}) ← proj_C(θ′, Zk)\n  k ← k + 1\nuntil convergence\n\n6.2 Divergences\n\nIn this section, we will discuss the different choices of divergences and corresponding projection algorithms.\n\n6.2.1 KL-divergence\n\nThe KL-divergence KL(ψ∥θ) := Σ_x p(x; ψ) log [p(x; ψ)/p(x; θ)] is arguably the optimal divergence for marginal inference because it strives to preserve the marginals of p(x; θ) and p(x; ψ). 
However, projection in KL-divergence is intractable here because the evaluation of the gradient ∇θ KL(ψ∥θ) requires the marginals of the distribution ψ.\n\n6.2.2 Piecewise KL-divergence\n\nOne tractable surrogate of KL(ψ∥θ) is the piecewise KL-divergence [3], defined over some tractable subgraphs. Here, D(ψ, θ) := max_{T∈T} KL(ψ_T∥θ_T), where T is a set of low-treewidth subgraphs. The gradient can be evaluated as ∇θ D(ψ, θ) = ∇θ KL(ψ_{T∗}∥θ_{T∗}), where T∗ = argmax_{T∈T} KL(ψ_T∥θ_T). For any T in T, KL(ψ_T∥θ_T) and its gradient can be evaluated by the junction-tree algorithm.\n\n6.2.3 Reversed KL-divergence\n\nThe “reversed” KL-divergence KL(θ∥ψ) is minimized by mean-field methods. In general, KL(θ∥ψ) is inferior to KL(ψ∥θ) for marginal inference, since it tends to underestimate the support of the distribution [11]. Still, it often works well in practice. ∇θ KL(θ∥ψ) can be computed as\n\n∇θ KL(θ∥ψ) = Σ_x p(x; θ) ((θ − ψ) · f(x)) (f(x) − µ(θ)),\n\nwhich can be approximated by samples generated from p(x; θ) [3]. In implementation, we maintain a “pool” of samples, each of which is updated by a single Gibbs step after each iteration of Algorithm 1.\n\nFigure 1: Mean univariate marginal error on 16 × 16 grids (top) with attractive interactions and median-density random graphs (bottom) with mixed interactions, comparing 30k iterations of Gibbs sampling after projection (onto the l∞ norm) to variational methods. The original parameters also show a lower curve with 10^6 samples.\n\n7 Experiments\n\nThe experiments below take two stages: first, the parameters are projected (in some divergence) and then we compare the accuracy of sampling with the resulting marginals. We focus on this second aspect. However, we provide a comparison of the computation time for various projection algorithms in Table 1, and when comparing the accuracy of sampling with a given amount of time, provide two curves for sampling with the original parameters, where one curve has an extra amount of sampling effort roughly approximating the time to perform projection in the reversed KL divergence.\n\n7.1 Synthetic MRFs\n\nOur first experiment follows that of [8, 3] in evaluating the accuracy of approximation methods in marginal inference. In the experiments, we approximate randomly generated MRF models with rapid-mixing distributions using the projection algorithms described previously. Then, the marginals of the fast-mixing approximate distributions are estimated by running a Gibbs chain on each distribution. These are compared against exact marginals as computed by the junction tree algorithm. 
We use the mean absolute difference of the marginals |p(Xi = 1) − q(Xi = 1)| as the accuracy measure. We compare to naive mean-field (MF), Gibbs sampling on the original parameters (Gibbs), and loopy belief propagation (LBP). Many other methods have been compared against a similar benchmark [6, 8].\n\nWhile our methods are for general MRFs, we test on Ising potentials because this is a standard benchmark. Two graph topologies are used: two-dimensional 16 × 16 grids and 10-node random graphs, where each edge is independently present with probability pe ∈ {0.3, 0.5, 0.7}. Node parameters θi are uniform from [−dn, dn] with fixed field strength dn = 1.0. Edge parameters θij are uniform from [−de, de] or [0, de] to obtain mixed or attractive interactions respectively, with interaction strengths de ∈ {0, 0.5, . . . , 4}. Figure 1 shows the average marginal error at different interaction strengths. Error bars show the standard error normalized by the number of samples, which can be interpreted as a 68.27% confidence interval. We also include time-accuracy comparisons in Figure 2. All results are averaged over 50 random trials. 
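The random Ising instances described above can be generated as in the following sketch (the function names are ours; the paper does not give its sampling code):

```python
import numpy as np

def grid_edges(rows, cols):
    # Edges of a rows x cols grid as ordered pairs (i, j) with i < j.
    idx = lambda r, c: r * cols + c
    E = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                E.append((idx(r, c), idx(r, c + 1)))
            if r + 1 < rows:
                E.append((idx(r, c), idx(r + 1, c)))
    return E

def random_ising(n_nodes, edges, dn=1.0, de=2.0, attractive=False, seed=0):
    # Node parameters uniform on [-dn, dn]; edge parameters uniform on
    # [0, de] (attractive) or [-de, de] (mixed), as in the setup above.
    rng = np.random.default_rng(seed)
    unary = rng.uniform(-dn, dn, size=n_nodes)
    lo = 0.0 if attractive else -de
    pairwise = {e: rng.uniform(lo, de) for e in edges}
    return unary, pairwise

E = grid_edges(16, 16)
unary, pairwise = random_ising(256, E, de=2.0, attractive=True)
```

A 16 × 16 grid has 2 · 16 · 15 = 480 edges; for random graphs one would instead keep each of the 45 possible edges of a 10-node graph independently with probability pe.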
We run Gibbs long enough (10^6 samples) to get a fair comparison in terms of running time.\n\nFigure 2: Examples of the accuracy of obtained marginals vs. the number of samples. Top: grid graphs. Bottom: median-density random graphs.\n\nExcept where otherwise stated, parameters are projected onto the ball {θ : ∥R(θ)∥∞ ≤ c}, where c = 2.5 is larger than the value of c = 1 suggested by the proofs above. Better results are obtained by using this larger constraint set, presumably because of looseness in the bound. For piecewise projection, grids use simple vertical and horizontal chains of treewidth either one or two. For random graphs, we randomly generate spanning trees until all edges are covered. Gradient descent uses a fixed step size of λ = 0.1. A Gibbs step is one “systematic-scan” pass over all variables. The reversed KL divergence maintains a pool of 500 samples, each of which is updated by a single Gibbs step in each iteration.\n\nWe wish to compare the trade-off between computation time and accuracy represented by the choice between the use of the ∞ and spectral norms. We measure the running time on 16 × 16 grids in Table 1, and compare the accuracy in Figure 3.\n\nThe appendix contains results for a three-state Potts model on an 8 × 8 grid, as a test of the multivariate setting. Here, the intractable divergence KL(ψ∥θ) is included for reference, with the projection computed with the help of the junction tree algorithm for inference.\n\nTable 1: Running times on 16 × 16 grids with attractive interactions. Euclidean projection converges in around 5 LBFGS-B iterations. Piecewise projection (with a treewidth of 1) and reversed KL projection use 60 gradient descent steps. 
All results use a single core of an Intel i7 860 processor.\n\nde = 1.5: Gibbs (30k steps) 0.67s, Gibbs (10^6 steps) 22.42s; Euclidean l∞ 1.50s, l2 25.63s; Piecewise l∞ 13.13s, l2 45.26s; Reversed-KL l∞ 12.87s, l2 66.81s.\nde = 3.0: Gibbs (30k steps) 0.67s, Gibbs (10^6 steps) 22.42s; Euclidean l∞ 3.26s, l2 164.34s; Piecewise l∞ 20.12s, l2 211.08s; Reversed-KL l∞ 20.73s, l2 254.25s.\n\n7.2 Berkeley binary image denoising\n\nThis experiment evaluates various methods for denoising binary images from the Berkeley segmentation dataset, downscaled from 300 × 200 to 120 × 80. The images are binarized by setting Yi = 1 if pixel i is above the average gray scale in the image, and Yi = −1 otherwise. The noisy image X is created by setting Xi = ((Yi + 1)/2)(1 − ti^1.25) + ((1 − Yi)/2) ti^1.25, in which ti is sampled uniformly from [0, 1]. For inference purposes, the conditional distribution of Y is modeled as P(Y|X) ∝ exp(β Σ_{ij} YiYj + (α/2) Σ_i (2Xi − 1)Yi), where the pairwise strength β > 0 encourages smoothness. On this attractive-only Ising potential, the Swendsen-Wang method [12] mixes rapidly, and so we use the resulting samples to estimate the ground truth. The parameters α and β are heuristically chosen to be 0.5 and 0.7 respectively.\n\nFigure 3: The marginal error using ∞-norm projection (solid lines) and spectral-norm projection (dotted lines) on 16 × 16 Ising grids with mixed interactions.\n\nFigure 4 shows the decrease of average marginal error. 
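The binarization and noise model of this experiment can be sketched as follows (the function names are ours):

```python
import numpy as np

def binarize(gray):
    # Y_i = 1 if pixel i is above the image's average gray level, else -1.
    return np.where(gray > gray.mean(), 1.0, -1.0)

def add_noise(Y, seed=0):
    # X_i = (Y_i + 1)/2 * (1 - t_i^1.25) + (1 - Y_i)/2 * t_i^1.25, t_i ~ U[0, 1],
    # so clean pixels with Y_i = 1 map to 1 - t^1.25 and Y_i = -1 to t^1.25.
    t = np.random.default_rng(seed).uniform(size=Y.shape) ** 1.25
    return (Y + 1) / 2 * (1 - t) + (1 - Y) / 2 * t

gray = np.array([[0.1, 0.9], [0.4, 0.6]])
Y = binarize(gray)    # [[-1, 1], [-1, 1]] for this toy image
X = add_noise(Y)      # noisy observation with entries in [0, 1]
```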
To compare running time, Euclidean and KL(θ∥ψ) projection cost approximately the same as sampling 10^5 and 4.8 × 10^5 samples respectively. Gibbs sampling on the original parameters converges very slowly. Sampling from the approximate distributions given by our projection algorithms converges quickly, in less than 10^4 samples.\n\nFigure 4: Average marginal error on the Berkeley segmentation dataset.\n\n8 Conclusions\n\nWe derived sufficient conditions on the parameters of an MRF to ensure fast mixing of univariate Gibbs sampling, along with an algorithm to project onto this set in the Euclidean norm. As an example use, we explored the accuracy of samples obtained by projecting parameters and then sampling, which is competitive with simple variational methods as well as traditional Gibbs sampling. Other possible applications of fast-mixing parameter sets include constraining parameters during learning.\n\nAcknowledgments\n\nNICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.\n\nReferences\n\n[1] Dimitri Bertsekas. Nonlinear Programming. Athena Scientific, 2004.\n\n[2] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput., 16(5):1190–1208, 1995.\n\n[3] Justin Domke and Xianghang Liu. Projecting Ising model parameters for fast mixing. In NIPS, 2013.\n\n[4] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. 
Efficient projections onto the l1-ball for learning in high dimensions. In ICML, 2008.\n\n[5] Martin E. Dyer, Leslie Ann Goldberg, and Mark Jerrum. Matrix norms and rapid mixing for spin systems. Ann. Appl. Probab., 19:71–107, 2009.\n\n[6] Amir Globerson and Tommi Jaakkola. Approximate inference using conditional entropy decompositions. In UAI, 2007.\n\n[7] Thomas P. Hayes. A simple condition implying rapid mixing of single-site dynamics on spin systems. In FOCS, pages 39–46, 2006.\n\n[8] Tamir Hazan and Amnon Shashua. Convergent message-passing algorithms for inference over general graphs with convex free energies. In UAI, pages 264–273, 2008.\n\n[9] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.\n\n[10] Thomas Minka. Expectation propagation for approximate Bayesian inference. In UAI, 2001.\n\n[11] Thomas Minka. Divergence measures and message passing. Technical report, 2005.\n\n[12] Robert H. Swendsen and Jian-Sheng Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett., 58:86–88, Jan 1987.\n\n[13] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, 2008.\n\n[14] Jonathan Yedidia, William Freeman, and Yair Weiss. Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2005.", "award": [], "sourceid": 756, "authors": [{"given_name": "Xianghang", "family_name": "Liu", "institution": "NICTA/UNSW"}, {"given_name": "Justin", "family_name": "Domke", "institution": "NICTA and Australian National University"}]}