{"title": "Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width", "book": "Advances in Neural Information Processing Systems", "page_first": 3097, "page_last": 3105, "abstract": "Gibbs sampling on factor graphs is a widely used inference technique, which often produces good empirical results. Theoretical guarantees for its performance are weak: even for tree structured graphs, the mixing time of Gibbs may be exponential in the number of variables. To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. We show that under suitable conditions on the weights, bounded hierarchy width ensures polynomial mixing time. Our study of hierarchy width is in part motivated by a class of factor graph templates, hierarchical templates, which have bounded hierarchy width\u2014regardless of the data used to instantiate them. We demonstrate a rich application from natural language processing in which Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human volunteers.", "full_text": "Rapidly Mixing Gibbs Sampling for a Class of Factor\n\nGraphs Using Hierarchy Width\n\nChristopher De Sa, Ce Zhang, Kunle Olukotun, and Christopher R\u00b4e\n\nStanford University, Stanford, CA 94309\n\ncdesa@stanford.edu, czhang@cs.wisc.edu,\n\nkunle@stanford.edu, chrismre@stanford.edu\nDepartments of Electrical Engineering and Computer Science\n\nAbstract\n\nGibbs sampling on factor graphs is a widely used inference technique, which of-\nten produces good empirical results. Theoretical guarantees for its performance\nare weak: even for tree structured graphs, the mixing time of Gibbs may be expo-\nnential in the number of variables. To help understand the behavior of Gibbs sam-\npling, we introduce a new (hyper)graph property, called hierarchy width. We show\nthat under suitable conditions on the weights, bounded hierarchy width ensures\npolynomial mixing time. 
Our study of hierarchy width is in part motivated by a\nclass of factor graph templates, hierarchical templates, which have bounded hier-\narchy width\u2014regardless of the data used to instantiate them. We demonstrate a\nrich application from natural language processing in which Gibbs sampling prov-\nably mixes rapidly and achieves accuracy that exceeds human volunteers.\n\n1\n\nIntroduction\n\nWe study inference on factor graphs using Gibbs sampling, the de facto Markov Chain Monte Carlo\n(MCMC) method [8, p. 505]. Speci\ufb01cally, our goal is to compute the marginal distribution of some\nquery variables using Gibbs sampling, given evidence about some other variables and a set of factor\nweights. We focus on the case where all variables are discrete. In this situation, a Gibbs sampler\nrandomly updates a single variable at each iteration by sampling from its conditional distribution\ngiven the values of all the other variables in the model. Many systems\u2014such as Factorie [14],\nOpenBugs [12], PGibbs [5], DimmWitted [28], and others [15, 22, 25]\u2014use Gibbs sampling for\ninference because it is fast to run, simple to implement, and often produces high quality empirical\nresults. However, theoretical guarantees about Gibbs are lacking. The aim of the technical result of\nthis paper is to provide new cases in which one can guarantee that Gibbs gives accurate results.\nFor an MCMC sampler like Gibbs sampling, the standard measure of ef\ufb01ciency is the mixing time\nof the underlying Markov chain. We say that a Gibbs sampler mixes rapidly over a class of models\nif its mixing time is at most polynomial in the number of variables in the model. Gibbs sampling\nis known to mix rapidly for some models. For example, Gibbs sampling on the Ising model on a\ngraph with bounded degree is known to mix in quasilinear time for high temperatures [10, p. 201].\nRecent work has outlined conditions under which Gibbs sampling of Markov Random Fields mixes\nrapidly [11]. 
Continuous-valued Gibbs sampling over models with exponential-family distributions\nis also known to mix rapidly [2, 3]. Each of these celebrated results still leaves a gap: there are\nmany classes of factor graphs on which Gibbs sampling seems to work very well (including as part\nof systems that have won quality competitions [24]) for which there are no theoretical guarantees\nof rapid mixing.\nMany graph algorithms that take exponential time in general can be shown to run in polynomial\ntime as long as some graph property is bounded. For inference on factor graphs, the most commonly\nused property is hypertree width, which bounds the complexity of dynamic programming algorithms\non the graph. Many problems, including variable elimination for exact inference, can be solved in\npolynomial time on graphs with bounded hypertree width [8, p. 1000]. In some sense, bounded\nhypertree width is a necessary and sufficient condition for tractability of inference in graphical models\n[1, 9]. Unfortunately, it is not hard to construct examples of factor graphs with bounded weights and\nhypertree width 1 for which Gibbs sampling takes exponential time to mix. Therefore, bounding\nhypertree width is insufficient to ensure rapid mixing of Gibbs sampling. To analyze the behavior\nof Gibbs sampling, we define a new graph property, called the hierarchy width. This is a stronger\ncondition than hypertree width; the hierarchy width of a graph is always at least as large as its\nhypertree width. We show that for graphs with bounded hierarchy width and bounded weights, Gibbs\nsampling mixes rapidly.\nOur interest in hierarchy width is motivated by so-called factor graph templates, which are common\nin practice [8, p. 213]. Several types of models, such as Markov Logic Networks (MLN) and\nRelational Markov Networks (RMN), can be represented as factor graph templates. 
Many state-of-the-art\nsystems use Gibbs sampling on factor graph templates and achieve better results than competitors\nusing other algorithms [14, 27]. We exhibit a class of factor graph templates, called hierarchical\ntemplates, which, when instantiated, have a hierarchy width that is bounded independently of the\ndataset used; Gibbs sampling on models instantiated from these factor graph templates will mix in\npolynomial time. This is a kind of sampling analog to tractable Markov logic [4] or so-called \u201csafe\nplans\u201d in probabilistic databases [23]. We exhibit a real-world templated program that outperforms\nhuman annotators at a complex text extraction task and provably mixes in polynomial time.\nIn summary, this work makes the following contributions:\n\n\u2022 We introduce a new notion of width, hierarchy width, and show that Gibbs sampling mixes\nin polynomial time for all factor graphs with bounded hierarchy width and factor weight.\n\u2022 We describe a new class of factor graph templates, hierarchical factor graph templates,\nsuch that Gibbs sampling on instantiations of these templates mixes in polynomial time.\n\u2022 We validate our results experimentally and exhibit factor graph templates that achieve high\nquality on tasks but for which our new theory is able to provide mixing time guarantees.\n\n1.1 Related Work\n\nGibbs sampling is just one of several algorithms proposed for use in factor graph inference. The\nvariable elimination algorithm [8] is an exact inference method that runs in polynomial time for\ngraphs of bounded hypertree width. Belief propagation is another widely-used inference algorithm\nthat produces an exact result for trees and, although it does not converge in all cases, converges to a\ngood approximation under known conditions [7]. 
Lifted inference [18] is one way to take advantage\nof the structural symmetry of factor graphs that are instantiated from a template; there are lifted\nversions of many common algorithms, such as variable elimination [16], belief propagation [21], and\nGibbs sampling [26]. It is also possible to leverage a template for fast computation: Venugopal et al.\n[27] achieve orders of magnitude of speedup of Gibbs sampling on MLNs. Compared with Gibbs\nsampling, these inference algorithms typically have better theoretical results; despite this, Gibbs\nsampling is a ubiquitous algorithm that performs practically well, far outstripping its guarantees.\nOur approach of characterizing runtime in terms of a graph property is typical for the analysis of\ngraph algorithms. Many algorithms are known to run in polynomial time on graphs of bounded\ntreewidth [19], despite being otherwise NP-hard. Sometimes, using a stronger or weaker property\nthan treewidth will produce a better result; for example, the submodular width used for constraint\nsatisfaction problems [13].\n\n2 Main Result\n\nIn this section, we describe our main contribution. We analyze some simple example graphs, and\nuse them to show that bounded hypertree width is not sufficient to guarantee rapid mixing of Gibbs\nsampling. Drawing intuition from this, we define the hierarchy width graph property, and prove that\nGibbs sampling mixes in polynomial time for graphs with bounded hierarchy width.\n\n[Figure 1: Factor graph diagrams for the voting model, with panels (a) linear semantics and (b) logical/ratio semantics; single-variable prior factors are omitted.]\n\nFirst, we state some basic definitions. 
A factor graph G is a graphical model that consists of a set of\nvariables V and factors \u03a6, and determines a distribution over those variables. If I is a world for G\n(an assignment of a value to each variable in V), then \u03f5, the energy of the world, is defined as\n\n\u03f5(I) = \u2211_{\u03c6\u2208\u03a6} \u03c6(I).    (1)\n\nThe probability of world I is \u03c0(I) = (1/Z) exp(\u03f5(I)), where Z is the normalization constant necessary\nfor this to be a distribution. Typically, each \u03c6 depends only on a subset of the variables; we can draw\nG as a bipartite graph where a variable v \u2208 V is connected to a factor \u03c6 \u2208 \u03a6 if \u03c6 depends on v.\nDefinition 1 (Mixing Time). The mixing time of a Markov chain is the first time t at which the\nestimated distribution \u00b5_t is within statistical distance 1/4 of the true distribution [10, p. 55]. That is,\n\nt_mix = min{ t : max_{A\u2282\u03a9} |\u00b5_t(A) \u2212 \u03c0(A)| \u2264 1/4 }.\n\n2.1 Voting Example\n\nWe start by considering a simple example model [20], called the voting model, that models the sign\nof a particular \u201cquery\u201d variable Q \u2208 {\u22121, 1} in the presence of other \u201cvoter\u201d variables T_i \u2208 {0, 1}\nand F_i \u2208 {0, 1}, for i \u2208 {1, ..., n}, that suggest that Q is positive and negative (true and false),\nrespectively. We consider three versions of this model. The first, the voting model with linear\nsemantics, has energy function\n\n\u03f5(Q, T, F) = wQ \u2211_{i=1}^{n} T_i \u2212 wQ \u2211_{i=1}^{n} F_i + \u2211_{i=1}^{n} w_{T_i} T_i + \u2211_{i=1}^{n} w_{F_i} F_i,\n\nwhere w_{T_i}, w_{F_i}, and w > 0 are constant weights. This model has a factor connecting each voter\nvariable to the query, which represents the value of that vote, and an additional factor that gives a\nprior for each voter. It corresponds to the factor graph in Figure 1(a). 
The second version, the voting\nmodel with logical semantics, has energy function\n\n\u03f5(Q, T, F) = wQ max_i T_i \u2212 wQ max_i F_i + \u2211_{i=1}^{n} w_{T_i} T_i + \u2211_{i=1}^{n} w_{F_i} F_i.\n\nHere, in addition to the prior factors, there are only two other factors, one of which (which we call\n\u03c6_T) connects all the true-voters to the query, and the other of which (\u03c6_F) connects all the false-voters\nto the query. The third version, the voting model with ratio semantics, is an intermediate between\nthese two models, and has energy function\n\n\u03f5(Q, T, F) = wQ log(1 + \u2211_{i=1}^{n} T_i) \u2212 wQ log(1 + \u2211_{i=1}^{n} F_i) + \u2211_{i=1}^{n} w_{T_i} T_i + \u2211_{i=1}^{n} w_{F_i} F_i.\n\nWith either logical or ratio semantics, this model can be drawn as the factor graph in Figure 1(b).\nThese three cases model different distributions and therefore different ways of representing the\npower of a vote; the choice of names is motivated by considering the marginal odds of Q given\nthe other variables. For linear semantics, the odds of Q depend linearly on the difference between\nthe number of nonzero positive-voters T_i and nonzero negative-voters F_i. For ratio semantics, the\nodds of Q depend roughly on their ratio. For logical semantics, only the presence of nonzero voters\nmatters, not the number of voters.\nWe instantiated this model with random weights w_{T_i} and w_{F_i}, ran Gibbs sampling on it, and\ncomputed the variance of the estimated marginal probability of Q for the different models (Figure 2).\nThe results show that the models with logical and ratio semantics produce much lower-variance\nestimates than the model with linear semantics. This experiment motivates us to try to prove a bound\non the mixing time of Gibbs sampling on this model.\nTheorem 1. 
Fix any constant \u03c9 > 0, and run Gibbs sampling on the voting model with bounded\nfactor weights {w_{T_i}, w_{F_i}, w} \u2282 [\u2212\u03c9, \u03c9]. For the voting model with linear semantics, the largest\npossible mixing time t_mix of any such model is t_mix = 2^{\u0398(n)}. For the voting model with either\nlogical or ratio semantics, the largest possible mixing time is t_mix = \u0398(n log n).\n\n[Figure 2: Convergence for the voting model with w = 0.5, and random prior weights in (\u22121, 0). Two panels, for n = 50 and n = 500, plot the variance of the marginal estimate for Q against the number of iterations (in thousands and in millions, respectively) for linear, ratio, and logical semantics.]\n\nThis result validates our observation that linear semantics mix poorly compared to logical and ratio\nsemantics. Intuitively, the reason why linear semantics performs worse is that the Gibbs sampler will\nswitch the state of Q only very infrequently, in fact exponentially so. This is because the energy\nroughly depends linearly on the number of voters n, and therefore the probability of switching Q\ndepends exponentially on n. This does not happen in either the logical or ratio models.\n\n2.2 Hypertree Width\n\nIn this section, we describe the commonly-used graph property of hypertree width, and show using\nthe voting example that bounding it is insufficient to ensure rapid Gibbs sampling. 
Hypertree width\nis typically used to bound the complexity of dynamic programming algorithms on a graph; in partic-\nular, variable elimination for exact inference runs in polynomial time on factor graphs with bounded\nhypertree width [8, p. 1000]. The hypertree width of a hypergraph, which we denote tw(G), is a\ngeneralization of the notion of acyclicity; since the de\ufb01nition of hypertree width is technical, we\ninstead state the de\ufb01nition of an acyclic hypergraph, which is suf\ufb01cient for our analysis. In order to\napply these notions to factor graphs, we can represent a factor graph as a hypergraph that has one\nvertex for each node of the factor graph, and one hyperedge for each factor, where that hyperedge\ncontains all variables the factor depends on.\nDe\ufb01nition 2 (Acyclic Factor Graph [6]). A join tree, also called a junction tree, of a factor graph G\nis a tree T such that the nodes of T are the factors of G and, if two factors \u03c6 and \u03c1 both depend on\nthe same variable x in G, then every factor on the unique path between \u03c6 and \u03c1 in T also depends on\nx. A factor graph is acyclic if it has a join tree. All acyclic graphs have hypertree width tw(G) = 1.\n\nNote that all trees are acyclic; in particular the voting model (with any semantics) has hypertree\nwidth 1. Since the voting model with linear semantics and bounded weights mixes in exponential\ntime (Theorem 1), this means that bounding the hypertree width and the factor weights is insuf\ufb01cient\nto ensure rapid mixing of Gibbs sampling.\n\n2.3 Hierarchy Width\n\nSince the hypertree width is insuf\ufb01cient, we de\ufb01ne a new graph property, the hierarchy width, which,\nwhen bounded, ensures rapid mixing of Gibbs sampling. This result is our main contribution.\nDe\ufb01nition 3 (Hierarchy Width). 
The hierarchy width hw(G) of a factor graph G is defined recursively\nsuch that, for any connected factor graph G = \u27e8V, \u03a6\u27e9,\n\nhw(G) = 1 + min_{\u03c6^*\u2208\u03a6} hw(\u27e8V, \u03a6 \u2212 {\u03c6^*}\u27e9),    (2)\n\nand for any disconnected factor graph G with connected components G_1, G_2, ...,\n\nhw(G) = max_i hw(G_i).    (3)\n\nAs a base case, all factor graphs G with no factors have\n\nhw(\u27e8V, \u2205\u27e9) = 0.    (4)\n\nTo develop some intuition about how to use the definition of hierarchy width, we derive the hierarchy\nwidth of the path graph drawn in Figure 3.\n\n[Figure 3: Factor graph diagram for an n-variable path graph, in which consecutive variables v_1, v_2, ..., v_n are joined by factors \u03c6_1, \u03c6_2, ....]\n\nLemma 1. The path graph model has hierarchy width hw(G) = \u2308log_2 n\u2309.\n\nProof. Let G_n denote the path graph with n variables. For n = 1, the lemma follows from (4). For\nn > 1, G_n is connected, so we must compute its hierarchy width by applying (2). It turns out that\nthe factor that minimizes this expression is the factor in the middle, and so applying (2) followed by\n(3) shows that hw(G_n) = 1 + hw(G_{\u2308n/2\u2309}). Applying this inductively proves the lemma.\n\nSimilarly, we are able to compute the hierarchy width of the voting model factor graphs.\nLemma 2. The voting model with logical or ratio semantics has hierarchy width hw(G) = 3.\nLemma 3. The voting model with linear semantics has hierarchy width hw(G) = 2n + 1.\n\nThese results are promising, since they separate our polynomially-mixing examples from our\nexponentially-mixing examples. However, the hierarchy width of a factor graph says nothing about\nthe factors themselves and the functions they compute. 
This means that it, alone, tells us nothing\nabout the model; for example, any distribution can be represented by a trivial factor graph with a\nsingle factor that contains all the variables. Therefore, in order to use hierarchy width to produce a\nresult about the mixing time of Gibbs sampling, we constrain the maximum weight of the factors.\nDefinition 4 (Maximum Factor Weight). A factor graph has maximum factor weight M, where\n\nM = max_{\u03c6\u2208\u03a6} ( max_I \u03c6(I) \u2212 min_I \u03c6(I) ).\n\nFor example, the maximum factor weight of the voting example with linear semantics is M = 2w;\nwith logical semantics, it is M = 2w; and with ratio semantics, it is M = 2w log(n + 1). We now\nshow that graphs with bounded hierarchy width and maximum factor weight mix rapidly.\nTheorem 2 (Polynomial Mixing Time). If G is a factor graph with n variables, at most s states per\nvariable, e factors, maximum factor weight M, and hierarchy width h, then\n\nt_mix \u2264 (log(4) + n log(s) + eM) n exp(3hM).\n\nIn particular, if e is polynomial in n, the number of values for each variable is bounded, and hM =\nO(log n), then t_mix(\u03f5) = O(n^{O(1)}).\n\nTo show why bounding the hierarchy width is necessary for this result, we outline the proof of\nTheorem 2. Our technique involves bounding the absolute spectral gap \u03b3(G) of the transition matrix\nof Gibbs sampling on graph G; there are standard results that use the absolute spectral gap to bound\nthe mixing time of a process [10, p. 155]. Our proof proceeds via induction using the definition of\nhierarchy width and the following three lemmas.\nLemma 4 (Connected Case). Let G and G\u0304 be two factor graphs with maximum factor weight M,\nwhich differ only inasmuch as G contains a single additional factor \u03c6^*. Then,\n\n\u03b3(G) \u2265 \u03b3(G\u0304) exp(\u22123M).\n\nLemma 5 (Disconnected Case). 
Let G be a disconnected factor graph with n variables and m\nconnected components G_1, G_2, ..., G_m with n_1, n_2, ..., n_m variables, respectively. Then,\n\n\u03b3(G) \u2265 min_{i\u2264m} (n_i / n) \u03b3(G_i).\n\nLemma 6 (Base Case). Let G be a factor graph with one variable and no factors. The absolute\nspectral gap of Gibbs sampling running on G will be \u03b3(G) = 1.\n\nUsing these Lemmas inductively, it is not hard to show that, under the conditions of Theorem 2,\n\n\u03b3(G) \u2265 (1/n) exp(\u22123hM);\n\nconverting this to a bound on the mixing time produces the result of Theorem 2.\nTo gain more intuition about the hierarchy width, we compare its properties to those of the hypertree\nwidth. First, we note that, when the hierarchy width is bounded, the hypertree width is also bounded.\nStatement 1. For any factor graph G, tw(G) \u2264 hw(G).\n\nOne of the useful properties of the hypertree width is that, for any fixed k, computing whether a\ngraph G has hypertree width tw(G) \u2264 k can be done in polynomial time in the size of G. We show\nthe same is true for the hierarchy width.\nStatement 2. For any fixed k, computing whether hw(G) \u2264 k can be done in time polynomial in\nthe number of factors of G.\n\nFinally, we note that we can also bound the hierarchy width using the degree of the factor graph.\nNotice that a graph with unbounded node degree contains the voting program with linear semantics\nas a subgraph. This statement shows that bounding the hierarchy width disallows such graphs.\nStatement 3. Let d be the maximum degree of a variable in factor graph G. Then, hw(G) \u2265 d.\n\n3 Factor Graph Templates\n\nOur study of hierarchy width is in part motivated by the desire to analyze the behavior of Gibbs\nsampling on factor graph templates, which are common in practice and used by many\nstate-of-the-art systems. A factor graph template is an abstract model that can be instantiated on a dataset to\nproduce a factor graph. 
The dataset consists of objects, each of which represents a thing we want to\nreason about, which are divided into classes. For example, the object Bart could have class Person\nand the object Twilight could have class Movie. (There are many ways to define templates; here, we\nfollow the formulation in Koller and Friedman [8, p. 213].)\nA factor graph template consists of a set of template variables and template factors. A template\nvariable represents a property of a tuple of zero or more objects of particular classes. For example,\nwe could have an IsPopular(x) template, which takes a single argument of class Movie. In\nthe instantiated graph, this would take the form of multiple variables like IsPopular(Twilight) or\nIsPopular(Avengers). Template factors are replicated similarly to produce multiple factors in the\ninstantiated graph. For example, we can have a template factor\n\n\u03c6(TweetedAbout(x, y), IsPopular(x))\n\nfor some factor function \u03c6. This would be instantiated to factors like\n\n\u03c6(TweetedAbout(Avengers, Bart), IsPopular(Avengers)).\n\nWe call the x and y in a template factor object symbols. For an instantiated factor graph with template\nfactors \u03a6, if we let A_\u03c6 denote the set of possible assignments to the object symbols in a template\nfactor \u03c6, and let \u03c6(a, I) denote the value of its factor function in world I under the object symbol\nassignment a, then the standard way to define the energy function is with\n\n\u03f5(I) = \u2211_{\u03c6\u2208\u03a6} \u2211_{a\u2208A_\u03c6} w_\u03c6 \u03c6(a, I),    (5)\n\nwhere w_\u03c6 is the weight of template factor \u03c6. This energy function results from the creation of\na single factor \u03c6_a(I) = \u03c6(a, I) for each object symbol assignment a of \u03c6. Unfortunately, this\nstandard energy definition is not suitable for all applications. To deal with this, Shin et al. 
[20]\nintroduce the notion of a semantic function g, which counts the energy of instances of the factor\ntemplate in a non-standard way. In order to do this, they first divide the object symbols of each\ntemplate factor into two groups, the head symbols and the body symbols. When writing out factor\ntemplates, we distinguish head symbols by writing them with a hat (like x\u0302). If we let H_\u03c6 denote\nthe set of possible assignments to the head symbols, let B_\u03c6 denote the set of possible assignments\nto the body symbols, and let \u03c6(h, b, I) denote the value of its factor function in world I under the\nassignment (h, b), then the energy of a world is defined as\n\n\u03f5(I) = \u2211_{\u03c6\u2208\u03a6} \u2211_{h\u2208H_\u03c6} w_\u03c6(h) g( \u2211_{b\u2208B_\u03c6} \u03c6(h, b, I) ).    (6)\n\nThis results in the creation of a single factor \u03c6_h(I) = g(\u2211_b \u03c6(h, b, I)) for each assignment of the\ntemplate\u2019s head symbols. We focus on three semantic functions in particular [20]. For the first,\nlinear semantics, g(x) = x. This is identical to the standard semantics in (5). For the second,\nlogical semantics, g(x) = sgn(x). For the third, ratio semantics, g(x) = sgn(x) log(1 + |x|). These\nsemantics are analogous to the different semantics used in our voting example. Shin et al. [20]\nexhibit several classification problems where using logical or ratio semantics gives better F1 scores.\n\n[Figure 4: Subset relationships among classes of factor graphs, and locations of examples. Labeled regions: bounded factor weight, bounded hypertree width, polynomial mixing time, bounded hierarchy width, hierarchical templates. Labeled examples: voting (linear), voting (logical), voting (ratio).]\n\n3.1 Hierarchical Factor Graphs\n\nIn this section, we outline a class of templates, hierarchical templates, that have bounded hierarchy\nwidth. 
We focus on models that have hierarchical structure in their template factors; for example,\n\n\u03c6(A(x\u0302, y\u0302, z), B(x\u0302, y\u0302), Q(x\u0302, y\u0302))    (7)\n\nshould have hierarchical structure, while\n\n\u03c6(A(z), B(x\u0302), Q(x\u0302, y))    (8)\n\nshould not. Armed with this intuition, we give the following definitions.\nDefinition 5 (Hierarchy Depth). A template factor \u03c6 has hierarchy depth d if the first d object\nsymbols that appear in each of its terms are the same. We call these symbols hierarchical symbols.\nFor example, (7) has hierarchy depth 2, and x\u0302 and y\u0302 are hierarchical symbols; also, (8) has hierarchy\ndepth 0, and no hierarchical symbols.\nDefinition 6 (Hierarchical). We say that a template factor is hierarchical if all of its head symbols\nare hierarchical symbols. For example, (7) is hierarchical, while (8) is not. We say that a factor\ngraph template is hierarchical if all its template factors are hierarchical.\n\nWe can explicitly bound the hierarchy width of instances of hierarchical factor graphs.\nLemma 7. If G is an instance of a hierarchical template with E template factors, then hw(G) \u2264 E.\nWe would now like to use Theorem 2 to prove a bound on the mixing time; this requires us to\nbound the maximum factor weight of the graph. Unfortunately, for linear semantics, the maximum\nfactor weight of a graph is potentially O(n), so applying Theorem 2 won\u2019t get us useful results.\nFortunately, for logical or ratio semantics, hierarchical factor graphs do mix in polynomial time.\nStatement 4. For any fixed hierarchical factor graph template \ud835\udca2, if G is an instance of \ud835\udca2 with\nbounded weights using either logical or ratio semantics, then the mixing time of Gibbs sampling on\nG is polynomial in the number of objects n in its dataset. 
That is, t_mix = O(n^{O(1)}).\n\nSo, if we want to construct models with Gibbs samplers that mix rapidly, one way to do it is with\nhierarchical factor graph templates using logical or ratio semantics.\n\n4 Experiments\n\nSynthetic Data We constructed a synthetic dataset by using an ensemble of Ising model graphs\neach with 360 nodes, 359 edges, and treewidth 1, but with different hierarchy widths. These graphs\nranged from the star graph (like in Figure 1(a)) to the path graph; and each had different hierarchy\nwidth. For each graph, we were able to calculate the exact true marginal of each variable because\nof the small treewidth. We then ran Gibbs sampling on each graph, and calculated the error of the\nmarginal estimate of a single arbitrarily-chosen query variable. Figure 5(a) shows the result with\ndifferent weights and hierarchy width. It shows that, even for tree graphs with the same number of\nnodes and edges, the mixing time can still vary depending on the hierarchy width of the model.\n\n[Figure 5: Experiments illustrate how convergence is affected by hierarchy width and semantics. (a) Error of marginal estimates for synthetic Ising model after 10^5 samples: square error against hierarchy width, for w = 0.5, w = 0.7, and w = 0.9. (b) Maximum error marginal estimates for KBP dataset after some number of samples: mean square error against iterations per variable, for linear, ratio, and logical semantics.]\n\nReal-World Applications We observed that the hierarchical templates that we focus on in this\nwork appear frequently in real applications. For example, all five knowledge base population (KBP)\nsystems illustrated by Shin et al. 
[20] contain subgraphs that are grounded by hierarchical templates.\nMoreover, sometimes a factor graph is solely grounded by hierarchical templates, and thus provably\nmixes rapidly by our theorem while achieving high quality. To validate this, we constructed a\nhierarchical template for the Paleontology application used by Peters et al. [17]. We found that when\nusing the ratio semantics, we were able to get an F1 score of 0.86 with precision of 0.96. On the\nsame task, this quality is actually higher than that of professional human volunteers [17]. For comparison,\nthe linear semantics achieved an F1 score of 0.76 and the logical semantics achieved 0.73.\nThe factor graph we used in this Paleontology application is large enough that it is intractable, using\nexact inference, to estimate the true marginal to investigate the mixing behavior. Therefore, we\nchose a subgraph of a KBP system used by Shin et al. [20] that can be grounded by a hierarchical\ntemplate and chose a setting of the weight such that the true marginal was 0.5 for all variables. We\nthen ran Gibbs sampling on this subgraph and report the average error of the marginal estimation in\nFigure 5(b). Our results illustrate the effect of changing the semantics on a more complicated model\nfrom a real application, and show similar behavior to our simple voting example.\n\n5 Conclusion\n\nThis paper showed that for a class of factor graph templates, hierarchical templates, Gibbs sampling\nmixes in polynomial time. It also introduced the graph property hierarchy width, and showed that\nfor graphs of bounded factor weight and hierarchy width, Gibbs sampling converges rapidly. 
These\nresults may aid in better understanding the behavior of Gibbs sampling for both template and general\nfactor graphs.\n\nAcknowledgments\n\nThanks to Stefano Ermon and Percy Liang for helpful conversations.\nThe authors acknowledge the support of: DARPA FA8750-12-2-0335; NSF IIS-1247701; NSF CCF-1111943;\nDOE 108845; NSF CCF-1337375; DARPA FA8750-13-2-0039; NSF IIS-1353606; ONR N000141210041\nand N000141310129; NIH U54EB020405; Oracle; NVIDIA; Huawei; SAP Labs; Sloan Research Fellowship;\nMoore Foundation; American Family Insurance; Google; and Toshiba.\n\n8\n\n\fReferences\n[1] Venkat Chandrasekaran, Nathan Srebro, and Prahladh Harsha. Complexity of inference in graphical\n\nmodels. arXiv preprint arXiv:1206.3240, 2012.\n\n[2] Persi Diaconis, Kshitij Khare, and Laurent Saloff-Coste. Gibbs sampling, exponential families and or-\n\nthogonal polynomials. Statist. Sci., 23(2):151\u2013178, May 2008.\n\n[3] Persi Diaconis, Kshitij Khare, and Laurent Saloff-Coste. Gibbs sampling, conjugate priors and coupling.\n\nSankhya A, (1):136\u2013169, 2010.\n\n[4] Pedro Domingos and William Austin Webb. A tractable \ufb01rst-order probabilistic logic. In AAAI, 2012.\n[5] Joseph Gonzalez, Yucheng Low, Arthur Gretton, and Carlos Guestrin. Parallel gibbs sampling: From\n\ncolored \ufb01elds to thin junction trees. In AISTATS, pages 324\u2013332, 2011.\n\n[6] Georg Gottlob, Gianluigi Greco, and Francesco Scarcello. Treewidth and hypertree width. Tractability:\n\nPractical Approaches to Hard Problems, page 1, 2014.\n\n[7] Alexander T Ihler, John Iii, and Alan S Willsky. Loopy belief propagation: Convergence and effects of\n\nmessage errors. In Journal of Machine Learning Research, pages 905\u2013936, 2005.\n\n[8] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press,\n\n2009.\n\n[9] Johan Kwisthout, Hans L Bodlaender, and Linda C van der Gaag. The necessity of bounded treewidth for\n\nef\ufb01cient inference in bayesian networks. 
In ECAI, pages 237\u2013242, 2010.\n[10] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and mixing times. American Mathematical Soc., 2009.\n[11] Xianghang Liu and Justin Domke. Projecting Markov random field parameters for fast mixing. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1377\u20131385. Curran Associates, Inc., 2014.\n[12] David Lunn, David Spiegelhalter, Andrew Thomas, and Nicky Best. The BUGS project: evolution, critique and future directions. Statistics in Medicine, (25):3049\u20133067, 2009.\n[13] D\u00e1niel Marx. Tractable hypergraph properties for constraint satisfaction and conjunctive queries. Journal of the ACM (JACM), (6):42, 2013.\n[14] Andrew McCallum, Karl Schultz, and Sameer Singh. Factorie: Probabilistic programming via imperatively defined factor graphs. In NIPS, pages 1249\u20131257, 2009.\n[15] David Newman, Padhraic Smyth, Max Welling, and Arthur U Asuncion. Distributed inference for latent Dirichlet allocation. In NIPS, pages 1081\u20131088, 2007.\n[16] Kee Siong Ng, John W Lloyd, and William TB Uther. Probabilistic modelling, inference and learning using logical theories. Annals of Mathematics and Artificial Intelligence, (1-3):159\u2013205, 2008.\n[17] Shanan E Peters, Ce Zhang, Miron Livny, and Christopher R\u00e9. A machine reading system for assembling synthetic Paleontological databases. PLoS ONE, 2014.\n[18] David Poole. First-order probabilistic inference. In IJCAI, pages 985\u2013991. Citeseer, 2003.\n[19] Neil Robertson and Paul D. Seymour. Graph minors. II. Algorithmic aspects of tree-width. Journal of Algorithms, (3):309\u2013322, 1986.\n[20] Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher R\u00e9. Incremental knowledge base construction using DeepDive. 
PVLDB, 2015.\n[21] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In AAAI, pages 1094\u20131099, 2008.\n[22] Alexander Smola and Shravan Narayanamurthy. An architecture for parallel topic models. PVLDB, 2010.\n[23] Dan Suciu, Dan Olteanu, Christopher R\u00e9, and Christoph Koch. Probabilistic databases. Synthesis Lectures on Data Management, (2):1\u2013180, 2011.\n[24] Mihai Surdeanu and Heng Ji. Overview of the English slot filling track at the TAC2014 knowledge base population evaluation.\n[25] Lucas Theis, Jascha Sohl-Dickstein, and Matthias Bethge. Training sparse natural image models with a fast Gibbs sampler of an extended state space. In NIPS, pages 1124\u20131132, 2012.\n[26] Deepak Venugopal and Vibhav Gogate. On lifting the Gibbs sampling algorithm. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, NIPS, pages 1655\u20131663. Curran Associates, Inc., 2012.\n[27] Deepak Venugopal, Somdeb Sarkhel, and Vibhav Gogate. Just count the satisfied groundings: Scalable local-search and sampling based inference in MLNs. In AAAI Conference on Artificial Intelligence, 2015.\n[28] Ce Zhang and Christopher R\u00e9. DimmWitted: A study of main-memory statistical analytics. PVLDB, 2014.\n", "award": [], "sourceid": 1731, "authors": [{"given_name": "Christopher", "family_name": "De Sa", "institution": "Stanford"}, {"given_name": "Ce", "family_name": "Zhang", "institution": "Wisconsin"}, {"given_name": "Kunle", "family_name": "Olukotun", "institution": "Stanford"}, {"given_name": "Christopher", "family_name": "R\u00e9", "institution": null}]}