{"title": "On Lifting the Gibbs Sampling Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1655, "page_last": 1663, "abstract": "Statistical relational learning models combine the power of first-order logic, the de facto tool for handling relational structure, with that of probabilistic graphical models, the de facto tool for handling uncertainty. Lifted probabilistic inference algorithms for them have been the subject of much recent research. The main idea in these algorithms is to improve the speed, accuracy and scalability of existing graphical models' inference algorithms by exploiting symmetry in the first-order representation. In this paper, we consider blocked Gibbs sampling, an advanced variation of the classic Gibbs sampling algorithm and lift it to the first-order level. We propose to achieve this by partitioning the first-order atoms in the relational model into a set of disjoint clusters such that exact lifted inference is polynomial in each cluster given an assignment to all other atoms not in the cluster. We propose an approach for constructing such clusters and determining their complexity and show how it can be used to trade accuracy with computational complexity in a principled manner. Our experimental evaluation shows that lifted Gibbs sampling is superior to the propositional algorithm in terms of accuracy and convergence.", "full_text": "On Lifting the Gibbs Sampling Algorithm\n\nDeepak Venugopal\n\nDepartment of Computer Science\nThe University of Texas at Dallas\n\nRichardson, TX, 75080, USA\n\nVibhav Gogate\n\nDepartment of Computer Science\nThe University of Texas at Dallas\n\nRichardson, TX, 75080, USA\n\ndxv021000@utdallas.edu\n\nvgogate@hlt.utdallas.edu\n\nAbstract\n\nFirst-order probabilistic models combine the power of \ufb01rst-order logic, the de\nfacto tool for handling relational structure, with probabilistic graphical models,\nthe de facto tool for handling uncertainty. 
Lifted probabilistic inference algorithms\nfor them have been the subject of much recent research. The main idea in these\nalgorithms is to improve the accuracy and scalability of existing graphical models\u2019\ninference algorithms by exploiting symmetry in the \ufb01rst-order representation. In\nthis paper, we consider blocked Gibbs sampling, an advanced MCMC scheme,\nand lift it to the \ufb01rst-order level. We propose to achieve this by partitioning the\n\ufb01rst-order atoms in the model into a set of disjoint clusters such that exact lifted\ninference is polynomial in each cluster given an assignment to all other atoms not\nin the cluster. We propose an approach for constructing the clusters and show how\nit can be used to trade accuracy with computational complexity in a principled\nmanner. Our experimental evaluation shows that lifted Gibbs sampling is superior\nto the propositional algorithm in terms of accuracy, scalability and convergence.\n\n1 Introduction\n\nModeling large, complex, real-world domains requires the ability to handle both rich relational struc-\nture and large amount of uncertainty. Unfortunately, the two existing representation and reasoning\ntools of choice \u2013 probabilistic graphical models (PGMs) and \ufb01rst-order logic \u2013 are unable to effec-\ntively handle both. PGMs can compactly represent and reason about uncertainty. However, they are\npropositional and thus ill-equipped to handle relational structure. First-order logic can effectively\nhandle relational structure. However, it has no representation for uncertainty. Therefore, combin-\ning the representation and reasoning power of \ufb01rst-order logic with PGMs is a worthwhile goal.\nStatistical relational learning (SRL) [7] is an emerging \ufb01eld which attempts to do just that.\nThe key task in SRL is inference - the problem of answering a query given an SRL model. 
In\nprinciple, we can simply ground (propositionalize) the given SRL model to yield a PGM and thereby\nsolve the inference problem in SRL by reducing it to inference over PGMs. This approach is\nproblematic and impractical, however, because the PGMs obtained by grounding an SRL model can be\nvery large, having millions of variables and billions of features; existing inference\nalgorithms for PGMs are unable to handle problems at this scale. An alternative approach, which has\ngained prominence since the work of Poole [25], is lifted or first-order inference. The main idea,\nwhich is similar to theorem proving in first-order logic, is to take a propositional inference\nalgorithm and exploit symmetry in its execution by performing inference over a group of identical or\ninterchangeable random variables. The algorithms are called lifted algorithms because they identify\nsymmetry by consulting the first-order representation without grounding the model.\nSeveral lifted algorithms have been proposed to date. Prominent exact algorithms are first-order\nvariable elimination [25] and its extensions [2, 23], which lift the variable elimination algorithm, and\nprobabilistic theorem proving (PTP) [8] which lifts the weighted model counting algorithm [1, 29].\nNotable approximate inference algorithms are lifted belief propagation [30] and lifted importance\nsampling [8, 9], which lift belief propagation [20] and importance sampling respectively.\nIn this paper, we lift blocked Gibbs sampling, an advanced MCMC technique. Blocked Gibbs\nsampling improves upon the Gibbs sampling algorithm by grouping variables (each group is called\na block) and then jointly sampling all variables in the block [10, 16]. Blocking improves the mixing\ntime and as a result improves both the accuracy and convergence of Gibbs sampling. The difficulty\nis that to jointly sample variables in a block, we need to compute a joint distribution over them. 
This\nis typically exponential in the treewidth of the ground network projected on the block.\nSeveral earlier papers have attempted to exploit relational or \ufb01rst-order structure in MCMC sam-\npling. Notable examples are lazy MC-SAT [27], Metropolis-Hastings MCMC for Bayesian logic\n(BLOG) [18], typed MCMC [14] and orbital MCMC [21]. Unfortunately, none of the aforemen-\ntioned techniques are truly lifted. In particular, they do not exploit \ufb01rst-order structure to the fullest\nextent. In fact, lifting a generic MCMC technique is dif\ufb01cult because at each point in order to ensure\nconvergence to the desired stationary distribution one has to maintain an assignment to all random\nvariables in the ground network. We circumvent these issues by lifting the blocked Gibbs sampling\nalgorithm, which as we show is more amenable to lifting.\nOur main idea in applying the blocking approach to SRL models is to partition the set of \ufb01rst-order\natoms in the model into disjoint clusters such that PTP (an exact lifted inference scheme) is feasible\nin each cluster given an assignment to all other atoms not in the cluster. Given such a set of clusters,\nwe show that Gibbs sampling is essentially a message passing algorithm over the cluster graph\nformed by connecting clusters that have atoms that are in the Markov blanket of each other. Each\nmessage from a sender to a receiving cluster is a truth assignment to all ground atoms that are in the\nMarkov blanket of the receiving cluster. We show how to store this message compactly by taking\nadvantage of the \ufb01rst-order representation yielding a lifted MCMC algorithm.\nWe present experimental results comparing the performance of lifted blocked Gibbs sampling with\n(propositional) blocked Gibbs sampling, MC-SAT [26, 27] and Lifted BP [30] on various bench-\nmark SRL models. 
Our experiments show that lifted Gibbs sampling is superior to blocked Gibbs\nsampling and MC-SAT in terms of convergence, accuracy and scalability. It is also more accurate\nthan lifted BP on some instances.\n\n2 Notation and Preliminaries\n\nIn this section, we describe notation and preliminaries on propositional logic, \ufb01rst-order logic,\nMarkov logic networks and Gibbs sampling. For more details, refer to [3, 13, 15].\nThe language of propositional logic consists of atomic sentences called propositions or atoms, and\nlogical connectives such as \u2227 (conjunction), \u2228 (disjunction), \u00ac (negation), \u21d2 (implication) and \u21d4\n(equivalence). Each proposition takes values from the binary domain {False, True} (or {0, 1}).\nA propositional formula f is an atom, or any complex formula that can be constructed from atoms\nusing logical connectives. For example, A, B and C are propositional atoms and f = A \u2228 \u00acB \u2227 C is a\npropositional formula. A knowledge base (KB) is a set of formulas. A world is a truth assignment\nto all atoms in the KB.\nFirst-order logic (FOL) generalizes propositional logic by allowing atoms to have internal structure;\nan atom in FOL is a predicate that represents relations between objects. A predicate consists of a\npredicate symbol, denoted by Monospace fonts, e.g., Friends, Smokes, etc., followed by a paren-\nthesized list of arguments called terms. A term is a logical variable, denoted by lower case letters\nsuch as x, y, z, etc., or a constant, denoted by upper case letters such as X, Y , Z, etc. We assume\nthat each logical variable, e.g., x is typed and takes values over a \ufb01nite set \u2206x. The language of FOL\nalso includes two quanti\ufb01ers: \u2200 (universal) and \u2203 (existential) which express properties of an entire\ncollection of objects. 
A formula in first order logic is a predicate (atom), or any complex sentence\nthat can be constructed from atoms using logical connectives and quantifiers. For example, the\nformula \u2200x Smokes(x) \u21d2 Asthma(x) states that all persons who smoke have asthma. \u2203x Cancer(x)\nstates that there exists a person x who has cancer. A first-order KB is a set of first-order formulas.\nIn this paper, we use a subset of FOL which has no function symbols, equality constraints or\nexistential quantifiers. We also assume that domains are finite (and therefore function-free) and that there is\na one-to-one mapping between constants and objects in the domain (Herbrand interpretations). We\nassume that each formula f is of the form \u2200x f , where x is the set of variables in f and f is a\nconjunction or disjunction of literals; each literal being an atom or its negation. For brevity, we will\ndrop \u2200 from all the formulas. Given variables x = {x1, . . . , xn} and constants X = {X1, . . . , Xn}\nwhere Xi \u2208 \u2206xi, f [X/x] is obtained by substituting every occurrence of variable xi in f with Xi.\nA ground formula is a formula obtained by substituting each of its variables with a constant. A ground\nKB is a KB containing all possible groundings of all of its formulas. For example, the grounding\nof a KB containing one formula, Smokes(x) \u21d2 Asthma(x) where \u2206x = {Ana, Bob}, is a KB\ncontaining two formulas: Smokes(Ana) \u21d2 Asthma(Ana) and Smokes(Bob) \u21d2 Asthma(Bob). A\nworld in FOL is a truth assignment to all atoms in its grounding.\nMarkov logic [3] extends FOL by softening the hard constraints expressed by the formulas and\nis arguably the most popular modeling language for SRL. A soft formula or a weighted formula\nis a pair (f, w) where f is a formula in FOL and w is a real number. 
A Markov logic network\n(MLN), denoted by M, is a set of weighted formulas (fi, wi). Given a set of constants that represent\nobjects in the domain, a Markov logic network defines a Markov network or a log-linear model. The\nMarkov network is obtained by grounding the weighted first-order knowledge base and represents\nthe following probability distribution:\n\n$$P_M(\\omega) = \\frac{1}{Z(M)} \\exp\\left(\\sum_i w_i N(f_i, \\omega)\\right) \\qquad (1)$$\n\nwhere \u03c9 is a world, N (fi, \u03c9) is the number of groundings of fi that evaluate to True in the world\n\u03c9 and Z(M) is a normalization constant or the partition function.\nIn this paper, we assume that the input MLN to our algorithm is in normal form [11, 19]. We\nrequire this for simplicity of exposition. Our main algorithm can be easily modified to work with\nother canonical forms such as parfactors [25] and first order CNFs with substitution constraints [8].\nHowever, its specification becomes much more complicated and messy. A normal MLN [11] is an\nMLN that satisfies the following two properties: (1) There are no constants in any formula, and (2)\nIf two distinct atoms with the same predicate symbol have variables x and y in the same position\nthen \u2206x = \u2206y. Note that in a normal MLN, we assume that the terms in each atom are ordered and\ntherefore we can identify each term by its position in the order.\n\n2.1 Gibbs Sampling and Blocking\n\nGiven an MLN, a set of query atoms and evidence, we can adapt the basic (propositional) Gibbs\nsampling [6] algorithm for computing the marginal probabilities of query atoms given evidence as\nfollows. First, we ground all the formulas in the MLN, yielding a Markov network. Second, we\ninstantiate all the evidence atoms in the network. Assume that the resulting evidence-instantiated network\nis defined over a set of variables X. Third, we generate N samples $(\\bar{x}^{(1)}, \\ldots, \\bar{x}^{(N)})$ (a sample is a\ntruth assignment to all random variables in the Markov network) as follows. 
We begin with a random\nassignment to all variables, yielding $\\bar{x}^{(0)}$. Then for t = 1, . . . , N , we perform the following steps.\nLet (X1, . . . , Xn) be an arbitrary ordering of variables in X. Then, for i = 1 to n, we generate a new\nvalue $\\bar{x}_i^{(t)}$ for $X_i$ by sampling a value from the distribution\n$P(X_i \\mid \\bar{x}_1^{(t)}, \\ldots, \\bar{x}_{i-1}^{(t)}, \\bar{x}_{i+1}^{(t-1)}, \\ldots, \\bar{x}_n^{(t-1)})$.\n(This is often called systematic scan Gibbs sampling. An alternative approach is random scan Gibbs\nsampling which often converges faster than systematic scan Gibbs sampling). For conciseness, we\nwill write $P(X_i \\mid \\bar{x}_{-i}^{(t)}) = P(X_i \\mid \\bar{x}_1^{(t)}, \\ldots, \\bar{x}_{i-1}^{(t)}, \\bar{x}_{i+1}^{(t-1)}, \\ldots, \\bar{x}_n^{(t-1)})$. Once the required N samples\nare generated, we can use them to answer any query over the model. In particular, the marginal\nprobability for each variable can be estimated by averaging the conditional marginals:\n\n$$\\hat{P}(\\bar{x}_i) = \\frac{1}{N} \\sum_{t=1}^{N} P(\\bar{x}_i \\mid \\bar{x}_{-i}^{(t)})$$\n\nNote that in Markov networks, $P(X_i \\mid \\bar{x}_{-i}^{(t)}) = P(X_i \\mid \\bar{x}_{-i,MB(X_i)}^{(t)})$ where $MB(X_i)$ is the Markov\nblanket (the set of variables that share a function with $X_i$) of $X_i$ and $\\bar{x}_{-i,MB(X_i)}^{(t)}$ is the projection\nof $\\bar{x}_{-i}^{(t)}$ on $MB(X_i)$.\nThe sampling distribution of Gibbs sampling converges to the posterior distribution (the\ndistribution associated with the evidence instantiated Markov network) as the number of samples increases\nbecause the resulting Markov chain is guaranteed to be aperiodic and ergodic (see [15] for details).\nThe main idea in blocked Gibbs sampling [10] is grouping variables to form a block, and then\njointly sampling all variables in a block given an assignment to all other variables not in the block.\nBlocking improves mixing yielding a more accurate sampling algorithm [15]. 
However, the\ncomputational complexity of jointly sampling all variables in a block typically increases with the treewidth\nof the Markov network projected on the block. Thus, in practice, given time and memory resource\nconstraints, the main issue in blocked Gibbs sampling is finding the right balance between\ncomputational complexity and accuracy.\n\n3 Our Approach\n\nWe illustrate the key ideas in our approach using an example MLN having two weighted formulas:\nR(x, y) \u2228 S(y, z), w1 and S(y, z) \u2228 T(z, u), w2. Note that the problem of computing the partition\nfunction of this MLN for arbitrary domain sizes is non-trivial; it cannot be polynomially solved\nusing existing exact lifted approaches such as PTP [8] and lifted VE [2].\nOur main idea is to partition the set of atoms into disjoint blocks (clusters) such that PTP is\npolynomial in each cluster and then sample all atoms in the cluster jointly. PTP is polynomial if we can\nrecursively apply its two lifting rules (defined next), the power rule and the generalized binomial\nrule, until the treewidth of the remaining ground network is bounded by a constant.\nThe power rule is based on the concept of a decomposer. Given a normal MLN M, a set of logical\nvariables, denoted by x, is called a decomposer if it satisfies the following two conditions: (i) Every\natom in M contains exactly one variable from x, and (ii) For any predicate symbol R, there exists a\nposition s.t. variables from x only appear at that position in atoms of R. Given a decomposer x, it\nis easy to show that $Z(M) = [Z(M[X/x])]^{|\\Delta_x|}$ where x \u2208 x and M[X/x] is the MLN obtained\nby substituting all logical variables x in M by the same constant X \u2208 \u2206x and then converting the\nresulting MLN to a normal MLN. 
Note that for any two variables x, y in x, \u2206x = \u2206y by normality.\nThe generalized binomial rule is used to sample singleton atoms efficiently (the rule also\nrequires that the atom is not involved in self-joins, i.e., it does not appear more than once in\nthe same formula). Given a normal MLN M having a singleton atom R(x), we can show that\n\n$$Z(M) = \\sum_{i=0}^{|\\Delta_x|} \\binom{|\\Delta_x|}{i} Z(M|\\bar{R}_i)\\, w(i)\\, 2^{p(i)}$$\n\nwhere \u00afRi is a sample of R s.t. exactly i tuples are set to\nTrue. M|\u00afRi is the MLN obtained from M by performing the following steps in order: (i) Ground\nall R(x) and set its groundings to have the same assignment as \u00afRi, (ii) Delete formulas that evaluate\nto either True or False, (iii) Delete all groundings of R(x) and (iv) Convert the resulting MLN\nto a normal MLN. w(i) is the exponentiated sum of the weights of formulas that evaluate to True\nand p(i) is the number of ground atoms that are removed from the MLN as a result of removing\nformulas (these are essentially don\u2019t care atoms which can be assigned to either True or False).\nNow, let us apply the clustering idea to our example MLN. Let us put each first-order atom in a cluster by\nitself, namely we have three clusters: R(x, y), S(y, z) and T(z, u) (see Figure 1(a)). Note that each (first-order)\ncluster represents all groundings of all atoms in the cluster. To perform Gibbs sampling over this clustering,\nwe need to compute three conditional distributions: P (R(x, y)|\u00afS(y, z), \u00afT(z, u)), P (S(y, z)|\u00afR(x, y), \u00afT(z, u))\nand P (T(z, u)|\u00afR(x, y), \u00afS(y, z)) where \u00afR(x, y) denotes a truth assignment to all possible groundings of R. Let\nthe domain size of each variable be n. Naively, given an assignment to all other atoms not in the cluster, we will\nneed $O(2^{n^2})$ time and space for computing and specifying\nthe joint distribution at each cluster. 
This is because there are $n^2$ ground atoms associated with each\ncluster. Notice however that all groundings of each first-order atom are conditionally independent\nof each other given a truth assignment to all other atoms. In other words, we can apply PTP here\nand compute each conditional distribution in $O(n^3)$ time and space (since there are $n^3$ groundings\nof each formula and we need to process each ground formula at least once). Thus, the complexity\nof sampling all atoms in all clusters is $O(n^3)$. Note that the complexity of sampling all variables\nusing propositional Gibbs sampling is also $O(n^3)$.\n\nFigure 1: Two possible clusterings for lifted blocked Gibbs sampling on the example MLN having two weighted formulas. (a) Clustering 1: clusters R(x, y), S(y, z) and T(z, u), with edges labeled y and z. (b) Clustering 2: clusters {R(x, y), S(y, z)} and T(z, u), with an edge labeled z.\n\nNow, let us consider an alternative clustering in which we have two clusters as shown in Figure\n1(b). Intuitively, this clustering is likely to yield better accuracy than the previous one because more\natoms will be sampled jointly. Counter-intuitively, however, as we show next, Clustering 2 will yield\na blocked sampler having smaller complexity than the one based on Clustering 1.\nTo perform blocked Gibbs sampling over Clustering 2, we need to compute two\ndistributions P (R(x, y), S(y, z)|\u00afT(z, u)), P (T(z, u)|\u00afR(x, y), \u00afS(y, z)). Let us see how PTP will compute\nP (R(x, y), S(y, z)|\u00afT(z, u)). If we instantiate all groundings of T, we get the following reduced\nMLN $\\{R(x, y) \\vee S(y, Z_i), w_1\\}_{i=1}^{n}$ and $\\{S(y, Z_i), k_i w_2\\}_{i=1}^{n}$ where Zi \u2208 \u2206z and ki is the number\nof False groundings of T(Zi, u). This MLN contains a decomposer y. PTP will now apply the\npower rule, yielding formulas of the form $\\{R(x, Y) \\vee S(Y, Z_i), w_1\\}_{i=1}^{n}$ and $\\{S(Y, Z_i), k_i w_2\\}_{i=1}^{n}$\nwhere Y \u2208 \u2206y. 
R(x, Y ) is a singleton atom and therefore applying the generalized binomial rule,\nwe will get n + 1 reduced MLNs, each containing n atoms of the form $\\{S(Y, Z_i)\\}_{i=1}^{n}$. These\natoms are conditionally independent of each other and a distribution over them can be computed\nin O(n) time. Thus, the complexity of computing P (R(x, y), S(y, z)|\u00afT(z, u)) is $O(n^2)$. Samples\nfor R and S can be generated from P (R(x, y), S(y, z)|\u00afT(z, u)) in $O(n^2)$ time as well. Notice that\nP (T(z, u)|\u00afR(x, y), \u00afS(y, z)) = P (T(z, u)|\u00afS(y, z)) because R is not in the Markov blanket of T. This\ndistribution can also be computed in $O(n^2)$ time. Therefore, the complexity of sampling all atoms\nusing the clustering shown in Figure 1(b) is $O(n^2)$.\nSpace Complexity: For Clustering 2, notice that to compute the conditional distribution\nP (R(x, y), S(y, z)|\u00afT(z, u)), we only need to know how many groundings of T(Zi, u) are True in\n\u00afT(z, u) for all Zi \u2208 \u2206z. Cluster T(z, u) can share this information with its neighbor using only\nO(n) space. Similarly, to compute P (T(z, u)|\u00afS(y, z)) we only need to know how many groundings\nof S(y, Zi) are True in \u00afS(y, z) for all Zi \u2208 \u2206z. This requires O(n) space and thus the overall space\ncomplexity of Clustering 2 is O(n). On the other hand, the space complexity of Gibbs sampling\nover Clustering 1 is $O(n^2)$.\n\n4 The Lifted Blocked Gibbs Sampling Algorithm\n\nNext, we will formalize the discussion in the previous section yielding a lifted blocked Gibbs\nsampling algorithm. We begin with some required definitions.\nWe define a cluster as a set of first order atoms (these atoms will be sampled jointly in a lifted Gibbs\nsampling iteration). Given a set of disjoint clusters {C1, . . . 
, Cm}, the Markov blanket of a cluster\nCi is the set of clusters that have at least one atom that is in the Markov blanket of an atom in Ci.\nGiven an MLN M, the Gibbs cluster graph is a graph G (each vertex of G is a cluster) such that: (i)\nEach atom in the MLN is in exactly one cluster of G (ii) Two clusters Ci and Cj in G are connected\nby an edge if Cj is in the Markov blanket of Ci. Note that by definition if Ci is in the Markov\nblanket of Cj, then Cj is in the Markov blanket of Ci.\n\nAlgorithm 1: Lifted Blocked Gibbs Sampling\nInput: A normal MLN M, a Gibbs cluster graph G, an integer N and a set of query atoms\nOutput: A Marginal Distribution over the query atoms\n1  begin\n2    Let (C1, . . . , Cm) be an arbitrary ordering of clusters of G\n3    for t = 1 to N do\n       // Gibbs iteration\n4      for i = 1 to m do\n5        M(Ci) = MLN obtained by instantiating the Markov Blanket of Ci based on the incoming messages\n6        Compute P (Ci) by running PTP on M(Ci)\n7        Sample a truth assignment to all atoms in Ci from P (Ci)\n8        Update the estimate of all query atoms in Ci\n9        Update all outgoing messages from Ci\n10 end\n\nThe lifted blocked Gibbs sampling algorithm (see Algorithm 1) can be envisioned as a message\npassing algorithm over a Gibbs cluster graph G. Each edge (Ci, Cj) in G stores two messages, one in\neach direction. The message from Ci to Cj contains the current truth assignment to all groundings\nof all atoms (we will discuss how to represent the truth assignment in a lifted manner\nshortly) that are in the Markov blanket of one or more atoms in Ci. We initialize the messages\nrandomly. Then at each Gibbs iteration, we generate a sample over all atoms by sampling the clusters\nalong an ordering (C1, . . . , Cm) (Steps 3-10). At each cluster, we first use PTP to compute a\nconditional joint distribution over all atoms in the cluster given an assignment to atoms in their Markov\nblanket. This assignment is derived using the incoming messages. Then, we sample all atoms in\nthe cluster from the joint distribution and update the estimate for query atoms in the cluster as well\nas all outgoing messages. We can prove that:\nTheorem 1. 
The Markov chain induced by Algorithm 1 is ergodic and aperiodic and its stationary\ndistribution is the distribution represented by the input normal MLN.\n\n4.1 Lifted Message Representation\n\nWe say that a representation of truth assignments to the groundings of an atom is lifted if we only\nspecify the number of true (or false) assignments to its full or partial grounding.\nExample 1. Consider an atom R(x, y), where \u2206x = {X1, X2} and \u2206y = {Y1, Y2}. We can\nrepresent the truth assignment (R(X1, Y1) = 1, R(X1, Y2) = 0, R(X2, Y1) = 1, R(X2, Y2) = 0) in a\nlifted manner using either an integer 2 or a vector ([Y1, 2], [Y2, 0]). The first representation says that\n2 groundings of R(x, y) are true while the second representation says that 2 groundings of R(x, Y1)\nand 0 groundings of R(x, Y2) are true.\nNext, we state sufficient conditions for representing a message in a lifted manner while\nensuring correctness, summarized in Theorem 2. We begin with a required definition. Given an atom\nR(x1, . . . , xp) and a subset of atoms {S1, . . . , Sk} from its Markov blanket, we say that a term at\nposition i in R is a shared term w.r.t. {S1, . . . , Sk} if there exists a formula f such that in f , a logical\nvariable appears at position i in R and in one or more atoms in {S1, . . . , Sk}. For instance, in our\nrunning example, y (position 2) is a shared term of R w.r.t. {S} but x (position 1) is not.\nTheorem 2 (Sufficient Conditions for a Lifted Message Representation). Given a Gibbs cluster\ngraph G and an MLN M, let R be an atom in Ci and let Cj be a neighbor of Ci in G. 
Let SR,Cj be\nthe set of atoms formed by taking an intersection between the Markov blanket of R and the union of\nthe Markov blanket of atoms in Cj. Let x be the set of shared terms of R w.r.t. SR,Cj \u222a Cj and y\nbe the set of remaining terms in R. Let the outgoing message from Ci to Cj be represented using a\nvector of |\u2206x| pairs of the form [Xk, rk] where \u2206x is the Cartesian product of the domains of all\nterms in x, Xk \u2208 \u2206x is the k-th element in \u2206x and rk is the number of groundings of R(Xk, y) that\nare true in the current assignment. If all messages in the lifted Blocked Gibbs sampling algorithm\n(Algorithm 1) use the aforementioned representation, then the stationary distribution of the Markov\nchain induced by the algorithm is the distribution represented by the input normal MLN.\n\nProof. (Sketch). The generalized Binomial rule states that all MLNs obtained by conditioning on a\nsingleton atom S with exactly k of its groundings set to true are equivalent to each other. In other\nwords, in order to compute the distribution represented by the MLN conditioned on S, we only need\nto know how many groundings of S are set to true. Next, we will show that the atom obtained by\n(partially) grounding the shared terms x of an atom R in cluster Ci, namely R(Xk, y) (where y is\nthe set of terms of R that are not shared) is equivalent to a singleton atom and therefore knowing the\nnumber of groundings of R(Xk, y) that are set to true is suf\ufb01cient to compute the joint distribution\nover the atoms in cluster Cj, where Ci and Cj are neighbors in G.\nConsider the MLN M\u2032 which is obtained from M by \ufb01rst removing all formulas that do not mention\natoms in Cj and then (partially) grounding all the shared terms of R. 
Let y\u2032 be a logical variable such\nthat its domain \u2206y\u2032 = \u2206y, where \u2206y is the Cartesian product of the domains of all variables in y,\nand let R\u2032k(y\u2032) = R(Xk, y), where Xk \u2208 \u2206x is the k-th element in \u2206x. Notice that we can replace\neach atom R(Xk, y) in M\u2032 by R\u2032k(y\u2032) without changing the associated distribution. Moreover, each\natom R\u2032k(y\u2032) is a singleton and therefore it follows from the generalized binomial rule that in order\nto compute the distribution associated with M\u2032 conditioned on R\u2032k(y\u2032), we only need to know how\nmany of its possible groundings are true. Since Ci sends precisely this information to Cj using the\nmessage defined in the statement of this theorem, it follows that the lifted Blocked Gibbs sampling\nalgorithm, which uses a lifted message representation, is equivalent to the algorithm (Algorithm 1)\nthat uses a propositional representation. Since Algorithm 1 converges to the distribution represented\nby the MLN (Theorem 1), the proof follows.\n\n4.2 Complexity\n\nTheorem 2 provides a method for representing the messages succinctly by taking advantage of the\nsymmetry at inference time. It also generalizes the ideas presented in the previous section (last\nparagraph) and helps us bound the space complexity of each message. Formally,\n\nTheorem 3 (Space Complexity of a Message). Given a Gibbs cluster graph G and an MLN M,\nlet the outgoing message from cluster Ci to cluster Cj in G be defined over the set {R1, ..., Rk} of\natoms. Let xi denote the set of shared terms of Ri that satisfy the conditions outlined in Theorem 2.\nThen, the space complexity of representing the message is O(\u2211_{i=1}^{k} |\u2206xi|).\n\nNote that the time/space requirements of the algorithm are the sum of the time/space required to run\nPTP for a cluster and the time/space for the message from the cluster. We can compute the time\nand space complexity of PTP at a cluster by running it schematically as follows. We apply the\npower rule as before but explore only one randomly selected branch in the search tree induced by\nthe generalized binomial rule. Recall that applying the generalized binomial rule will result in n + 1\nrecursive calls (i.e., the search tree node has a branching factor of n + 1) where n is the domain size of\nthe singleton atom. If neither the power rule nor the generalized binomial rule can be applied at any\npoint during search, the complexity of PTP is exponential in the treewidth of the remaining ground\nnetwork. More precisely, the complexity of PTP is O(exp(g) \u00d7 exp(w + 1)) where g is the number\nof times the generalized binomial rule is applied and w is the treewidth (computed heuristically) of\nthe remaining ground network.\n\n4.3 Constructing the Gibbs Cluster Graph\n\nAlgorithm 2: Construct Gibbs Cluster Graph\n\nInput: A normal MLN M, complexity bounds \u03b1 and \u03b2\nOutput: A Gibbs cluster graph G\nbegin\n    Initialization: Construct a Gibbs cluster graph G with exactly one atom in each cluster\n    while True do\n        F = \u2205 // F: set of feasible cluster graphs\n        for all pairs of clusters Ci and Cj in G do\n            Merge Ci and Cj yielding a cluster graph G\u2032\n            if T(G\u2032) \u2264 T(G) and S(G\u2032) \u2264 S(G) then\n                Add G\u2032 to F\n            else if T(G\u2032) \u2264 \u03b1 and S(G\u2032) \u2264 \u03b2 then\n                Add G\u2032 to F\n        If F is empty return G\n        G = Cluster graph in F that has the maximum \u2211_i \u03b6(Ci)\nend\n\nNext, we present a heuristic algorithm for constructing the Gibbs cluster graph. From a computational\nviewpoint, we want its time and space requirements to be as small as possible.\nFrom an approximation quality viewpoint, to improve mixing, we want to jointly sample, i.e.,\ncluster together, highly coupled/correlated variables. Formally, we want to\n\nMaximize: \u2211_i \u03b6(Ci)\nSubject to: T(G) \u2264 \u03b1, S(G) \u2264 \u03b2\n\nwhere T(G) and S(G) denote the time and\nspace requirements of the Gibbs cluster graph G, \u03b6(Ci) measures the amount of coupling in\nthe cluster Ci of G, and parameters \u03b1 and \u03b2 are used to bound the time and space complexity\nrespectively. In our implementation, we measure coupling using the number of times two atoms\nappear together in a formula.\n\nThe optimization problem is NP-hard in general and therefore we propose to use the greedy approach\ngiven in Algorithm 2 for solving it. The algorithm begins by constructing a Gibbs cluster graph in\nwhich each first-order atom is in a cluster by itself. Then, in the while loop, the algorithm tries\nto iteratively improve the cluster graph. At each iteration, given the current cluster graph G, for\nevery possible pair of clusters (Ci, Cj) of G, the algorithm creates a new cluster graph G\u2032 from G\nby merging Ci and Cj. Among these graphs, the algorithm selects the graph that yields the most\ncoupling and at the same time either has complexity no larger than that of G or satisfies the input complexity\nbounds \u03b1 and \u03b2. It then replaces G with the selected graph and iterates until the graph cannot be\nimproved. Note that increasing the cluster size may decrease the complexity of the cluster graph in\nsome cases and therefore we require the first test in the inner loop, which adds G\u2032 to the feasible set if its complexity is\nno larger than that of G. Also note that the algorithm is not guaranteed to return a cluster graph that satisfies\nthe input complexity bounds, even if such a cluster graph exists. 
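
As an illustration only, the greedy merging loop of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the C++ implementation used in the experiments: the names greedy_cluster_graph, coupling, toy_time and toy_space are hypothetical, formulas are represented simply as sets of atom names, and the two toy cost functions stand in for T(G) and S(G), which in the paper are obtained by running PTP schematically.

```python
from itertools import combinations

def coupling(cluster, formulas):
    # zeta(C): number of times two atoms of the cluster appear together in a formula
    return sum(
        1
        for f in formulas
        for a, b in combinations(sorted(cluster), 2)
        if a in f and b in f
    )

def total_coupling(graph, formulas):
    # Objective of the optimization problem: sum of zeta(Ci) over all clusters
    return sum(coupling(c, formulas) for c in graph)

def greedy_cluster_graph(atoms, formulas, time_cost, space_cost, alpha, beta):
    # Initialization: exactly one atom in each cluster
    graph = [frozenset([a]) for a in atoms]
    while True:
        feasible = []
        t_cur, s_cur = time_cost(graph), space_cost(graph)
        for ci, cj in combinations(graph, 2):
            merged = [c for c in graph if c not in (ci, cj)] + [ci | cj]
            t_new, s_new = time_cost(merged), space_cost(merged)
            no_worse = t_new <= t_cur and s_new <= s_cur
            within_bounds = t_new <= alpha and s_new <= beta
            if no_worse or within_bounds:
                feasible.append(merged)
        if not feasible:
            return graph
        # Keep the feasible graph with maximum total coupling
        graph = max(feasible, key=lambda g: total_coupling(g, formulas))

# Toy usage (assumed data): atoms R, S, T with formulas over {R, S} and {S, T}
atoms = ['R', 'S', 'T']
formulas = [{'R', 'S'}, {'S', 'T'}]
toy_time = lambda g: max(2 ** len(c) for c in g)   # hypothetical stand-in for T(G)
toy_space = lambda g: max(len(c) for c in g)       # hypothetical stand-in for S(G)
result = greedy_cluster_graph(atoms, formulas, toy_time, toy_space, alpha=4, beta=2)
```

With these toy costs, the bounds allow clusters of at most two atoms, so the loop stops after a single merge of two coupled atoms. Each iteration reduces the number of clusters by one, which is why the loop always terminates.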
If Algorithm 2 fails to return a cluster graph within the complexity bounds, we may\nhave to use local search or dynamic programming; both are computationally expensive.\n\n5 Experiments\n\nIn this section, we compare the performance of lifted blocked Gibbs sampling (LBG) with (propositional)\nblocked Gibbs sampling (BG), lazy MC-SAT [26, 27] and lifted belief propagation (LBP)\n[30]. We experimented with the following four MLNs: (i) an RST MLN having two formulas,\nM1: [R(x) \u2228 S(x, y), w1]; [S(x, y) \u2228 T(y, z)], (ii) a toy Smoker-Asthma-Cancer MLN\nhaving three formulas, M2: [Asthma(x) \u2192 \u00acSmokes(x)], [Asthma(x) \u2227 Friends(x, y) \u2192\n\u00acSmokes(y)], [Smokes(x) \u2192 Cancer(x)], (iii) the example R, S, T MLN defined in Section 3, M3,\nand (iv) the WEBKB MLN, M4, used in [17]. Note that the first two MLNs can be solved in polynomial\ntime using PTP while PTP is exponential on M3 and M4. For each MLN, we set 10% randomly\nselected ground atoms as evidence. We varied the number of objects in the domain from 5 to 200.\nWe used a time-bound of 1000 seconds for all algorithms.\n\n[Figure 2 panels: (a), (b) average KL divergence vs. time (seconds) for BG, MC-SAT, LBP and LBG; (c), (d) log(R) vs. time (s) for LBG and BG; (e), (f) time (s) vs. number of objects for LBG and BG.]\n\nFigure 2: KL divergence as a function of time for: (a) M1 with 50 objects and (b) M2 with 50 objects.\nConvergence diagnostic using Gelman-Rubin statistic (R) for (c) M3 with 25 objects and (d) M4 with 25\nobjects. Note that for lifted BP, the values displayed are the ones obtained after the algorithm has converged.\nTime required by 100 Gibbs iterations as a function of the number of objects for (e) M3 and (f) M4.\n\nWe implemented LBG and BG in C++ and used Alchemy [12] to implement MC-SAT and LBP.\nFor LBG, BG and MC-SAT, we used a burn-in of 100 samples to negate the effects of initialization.\nFor M1 and M2, we measure the accuracy using the KL divergence between the estimated\nmarginal probabilities and the true marginal probabilities computed using PTP. Since computing exact\nmarginals of M3 and M4 is not feasible, we perform convergence diagnostics for LBG and BG\nusing the Gelman-Rubin statistic [5], denoted by R. R measures the disagreement between chains\nby comparing the between-chain variances with the within-chain variances. The closer the value of\nR to 1, the better the mixing.\n\nFigure 2 shows the results. Figures 2(a) and 2(b) show the KL divergence as a function of time for\nM1 and M2 respectively. In both cases, LBG converges much faster than BG and MC-SAT and\nhas smaller error. LBP is more accurate than LBG on M1 while LBG is more accurate than LBP on\nM2. Figures 2(c) and 2(d) show log(R) as a function of time for M3 and M4 respectively. We see\nthat the Markov chain associated with LBG mixes much faster than the one associated with BG. To\nmeasure scalability, we use running time per Gibbs iteration as a performance metric. Figures 2(e)\nand 2(f) show the time required by 100 Gibbs iterations as a function of the number of objects for M3\nand M4 respectively. 
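
For readers unfamiliar with the convergence diagnostic used above, the following Python sketch shows one common way to compute the Gelman-Rubin statistic R from several chains. It is a sketch under assumptions rather than the code used in the experiments: the function name gelman_rubin is hypothetical, each chain is assumed to be an equal-length list of scalar values (for example, successive estimates of one marginal), and the textbook between-chain and within-chain variance formulas are used.

```python
from math import sqrt
from statistics import mean, variance

def gelman_rubin(chains):
    # chains: list of m equal-length lists, each holding n scalar samples
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return sqrt(pooled / within)

# Toy usage (assumed data): two chains that agree, and two that disagree
r_mixed = gelman_rubin([[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]])
r_split = gelman_rubin([[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.0]])
```

In this toy example the chains whose means disagree yield the larger value of R, which matches the reading that values of R further above 1 indicate poorer mixing.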
The timing curves in Figures 2(e) and 2(f) clearly demonstrate that LBG is more scalable than BG.\n\n6 Summary and Future Work\n\nIn this paper, we proposed lifted Blocked Gibbs sampling, a new algorithm that improves blocked\nGibbs sampling by exploiting relational or first-order structure. Our algorithm operates by constructing\na Gibbs cluster graph, which represents a partitioning of atoms into clusters, and then performs\nmessage passing over the graph. Each message is a truth assignment to the Markov blanket of\nthe cluster and we showed how to represent it in a lifted manner. We proposed an algorithm for\nconstructing the Gibbs cluster graph and showed that it can be used to trade accuracy with computational\ncomplexity. Our experiments demonstrate clearly that lifted blocked Gibbs sampling is more\naccurate and scalable than propositional blocked Gibbs sampling as well as MC-SAT.\n\nFuture work includes: lifting Rao-Blackwellised Gibbs sampling; applying our lifting rules to slice\nsampling [22] and flat histogram MCMC [4]; developing new clustering strategies; etc.\n\nAcknowledgements: This research was partly funded by the ARO MURI grant W911NF-08-1-0242.\nThe views and conclusions contained in this document are those of the authors and should not\nbe interpreted as necessarily representing the official policies, either expressed or implied, of ARO\nor the U.S. Government.\n\nReferences\n\n[1] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7):772\u2013799, 2008.\n[2] R. de Salvo Braz. Lifted First-Order Probabilistic Inference. PhD thesis, University of Illinois, Urbana-Champaign, IL, 2007.\n[3] P. Domingos and D. Lowd. Markov Logic: An Interface Layer for Artificial Intelligence. Morgan & Claypool, San Rafael, CA, 2009.\n[4] S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. Accelerated Adaptive Markov Chain for Partition Function Computation. 
In NIPS, pages 2744\u20132752, 2011.\n[5] A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457\u2013472, 1992.\n[6] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721\u2013741, 1984.\n[7] L. Getoor and B. Taskar, editors. Introduction to Statistical Relational Learning. MIT Press, 2007.\n[8] V. Gogate and P. Domingos. Probabilistic theorem proving. In UAI, pages 256\u2013265, 2011.\n[9] V. Gogate, A. Jha, and D. Venugopal. Advances in Lifted Importance Sampling. In AAAI, pages 1910\u20131916, 2012.\n[10] C. S. Jensen, U. Kjaerulff, and A. Kong. Blocking Gibbs sampling in very large probabilistic expert systems. International Journal of Human Computer Studies. Special Issue on Real-World Applications of Uncertain Reasoning, 42:647\u2013666, 1993.\n[11] A. Jha, V. Gogate, A. Meliou, and D. Suciu. Lifted inference from the other side: The tractable features. In NIPS, pages 973\u2013981, 2010.\n[12] S. Kok, M. Sumner, M. Richardson, P. Singla, H. Poon, and P. Domingos. The Alchemy system for statistical relational AI. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2006. http://alchemy.cs.washington.edu.\n[13] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.\n[14] P. Liang, M. I. Jordan, and D. Klein. Type-based MCMC. In HLT-NAACL, pages 573\u2013581, 2010.\n[15] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Publishing Company, Incorporated, 2001.\n[16] J. S. Liu, W. H. Wong, and A. Kong. Covariance structure of the Gibbs sampler with applications to the comparison of estimators and augmentation schemes. Biometrika, 81:27\u201340, 1994.\n[17] D. Lowd and P. Domingos. Recursive random fields. In IJCAI, pages 950\u2013955, 2007.\n[18] B. Milch and S. J. Russell. General-purpose MCMC inference over relational structures. In UAI, pages 349\u2013358, 2006.\n[19] B. Milch, L. S. Zettlemoyer, K. Kersting, M. Haimes, and L. P. Kaelbling. Lifted probabilistic inference with counting formulas. In AAAI, pages 1062\u20131068, 2008.\n[20] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In UAI, pages 467\u2013475, 1999.\n[21] M. Niepert. Markov Chains on Orbits of Permutation Groups. In UAI, pages 624\u2013633, 2012.\n[22] R. Neal. Slice sampling. Annals of Statistics, 31:705\u2013767, 2000.\n[23] K. S. Ng, J. W. Lloyd, and W. T. Uther. Probabilistic modelling, inference and learning using logical theories. Annals of Mathematics and Artificial Intelligence, 54(1-3):159\u2013205, 2008.\n[24] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 1988.\n[25] D. Poole. First-order probabilistic inference. In IJCAI, pages 985\u2013991, 2003.\n[26] H. Poon and P. Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In AAAI, pages 458\u2013463, 2006.\n[27] H. Poon, P. Domingos, and M. Sumner. A general method for reducing the complexity of relational inference and its application to MCMC. In AAAI, pages 1075\u20131080, 2008.\n[28] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62:107\u2013136, 2006.\n[29] T. Sang, P. Beame, and H. Kautz. Solving Bayesian networks by weighted model counting. In AAAI, pages 475\u2013482, 2005.\n[30] P. Singla and P. Domingos. Lifted first-order belief propagation. In AAAI, pages 1094\u20131099, Chicago, IL, 2008. 
AAAI Press.\n", "award": [], "sourceid": 784, "authors": [{"given_name": "Deepak", "family_name": "Venugopal", "institution": null}, {"given_name": "Vibhav", "family_name": "Gogate", "institution": null}]}