{"title": "Lifted Inference Seen from the Other Side : The Tractable Features", "book": "Advances in Neural Information Processing Systems", "page_first": 973, "page_last": 981, "abstract": "Lifted inference algorithms for representations that combine first-order logic and probabilistic graphical models have been the focus of much recent research. All lifted algorithms developed to date are based on the same underlying idea: take a standard probabilistic inference algorithm (e.g., variable elimination, belief propagation etc.) and improve its efficiency by exploiting repeated structure in the first-order model. In this paper, we propose an approach from the other side in that we use techniques from logic for probabilistic inference. In particular, we define a set of rules that look only at the logical representation to identify models for which exact efficient inference is possible. We show that our rules yield several new tractable classes that cannot be solved efficiently by any of the existing techniques.", "full_text": "Lifted Inference Seen from the Other Side : The\n\nTractable Features\n\nAbhay Jha Vibhav Gogate Alexandra Meliou Dan Suciu\n\n{abhaykj,vgogate,ameli,suciu}@cs.washington.edu\n\nComputer Science & Engineering\n\nUniversity of Washington\nWashington, WA 98195\n\nAbstract\n\nLifted Inference algorithms for representations that combine \ufb01rst-order logic and\ngraphical models have been the focus of much recent research. All lifted algo-\nrithms developed to date are based on the same underlying idea: take a standard\nprobabilistic inference algorithm (e.g., variable elimination, belief propagation\netc.) and improve its ef\ufb01ciency by exploiting repeated structure in the \ufb01rst-order\nmodel. In this paper, we propose an approach from the other side in that we use\ntechniques from logic for probabilistic inference. 
In particular, we de\ufb01ne a set of\nrules that look only at the logical representation to identify models for which exact\nef\ufb01cient inference is possible. Our rules yield new tractable classes that could not\nbe solved ef\ufb01ciently by any of the existing techniques.\n\n1\n\nIntroduction\n\nRecently, there has been a push towards combining logical and probabilistic approaches in Arti\ufb01cial\nIntelligence. It is motivated in large part by the representation and reasoning challenges in real world\napplications: many domains such as natural language processing, entity resolution, target tracking\nand Bio-informatics contain both rich relational structure, and uncertain and incomplete information.\nLogic is good at handling the former but lacks the representation power to model the latter. On the\nother hand, probability theory is good at modeling uncertainty but inadequate at handling relational\nstructure.\nMany representations that combine logic and graphical models, a popular probabilistic represen-\ntation [1, 2], have been proposed over the last few years. Among them, Markov logic networks\n(MLNs) [2, 3] are arguably the most popular one. In its simplest form, an MLN is a set of weighted\n\ufb01rst-order logic formulas, and can be viewed as a template for generating a Markov network. Specif-\nically, given a set of constants that model objects in the domain, it represents a ground Markov\nnetwork that has one (propositional) feature for each grounding of each (\ufb01rst-order) formula with\nconstants in the domain.\nUntil recently, most inference schemes for MLNs were propositional: inference was carried out\nby \ufb01rst constructing a ground Markov network and then running a standard probabilistic inference\nalgorithm over it. Unfortunately, the ground Markov network is typically quite large, containing\nmillions and sometimes even billions of inter-related variables. 
This precludes the use of existing\nprobabilistic inference algorithms, as they are unable to handle networks at this scale. Fortunately,\nin some cases, one can perform lifted inference in MLNs without grounding out the domain. Lifted\ninference treats sets of indistinguishable objects as one, and can yield exponential speed-ups over\npropositional inference.\nMany lifted inference algorithms have been proposed over the last few years (cf. [4, 5, 6, 7]). All\nof them are based on the same principle: take an existing probabilistic inference algorithm and try\nto lift it by carrying out inference over groups of random variables that behave similarly during\nthe algorithm\u2019s execution. In other words, these algorithms are basically lifted versions of standard\nprobabilistic inference algorithms. For example, first-order variable elimination [4, 5, 7] lifts the\nstandard variable elimination algorithm [8, 9], while lifted belief propagation [10] lifts Pearl\u2019s belief\npropagation [11, 12].\n\nInterpretation in English | Feature | Weight\nMost people don\u2019t smoke | \u00acSmokes(X) | 1.4\nMost people don\u2019t have asthma | \u00acAsthma(X) | 2.3\nMost people aren\u2019t friends | \u00acFriends(X,Y) | 4.6\nPeople who have asthma don\u2019t smoke | Asthma(X) \u21d2 \u00acSmokes(X) | 1.5\nAsthmatics don\u2019t have smoker friends | Asthma(X) \u2227 Friends(X,Y) \u21d2 \u00acSmokes(Y) | 1.1\n\nTable 1: An example MLN (modified from [10]).\n\nIn this paper, we depart from existing approaches, and present a new approach to lifted inference\nfrom the other, logical side. In particular, we propose a set of rewriting rules that exploit the structure\nof the logical formulas for inference. Each rule takes an MLN as input and expresses its partition\nfunction as a combination of partition functions of simpler MLNs (if the preconditions of the rule\nare satisfied). Inference is tractable if we can evaluate an MLN using this set of rules. 
We analyze\nthe time complexity of our algorithm and identify new tractable classes of MLNs, which have not\nbeen previously identi\ufb01ed.\nOur work derives heavily from database literature in which inference techniques based on manipu-\nlating logical formulas (queries) have been investigated rigorously [13, 14]. However, the techniques\nthat they propose are not lifted. Our algorithm extends their techniques to lifted inference, and thus\ncan be applied to a strictly larger class of probabilistic models.\nTo summarize, our algorithm is truly lifted, namely we never ground the model, and it offers guar-\nantees on the running time. This comes at a cost that we do not allow arbitrary MLNs. However, the\nset of tractable MLNs is quite large, and includes MLNs that cannot be solved in PTIME by any of\nthe existing lifted approaches. The small toy MLN given in Table 1 is one such example. This MLN\nis also out of reach of state-of-the-art propositional inference approaches such as variable elimina-\ntion [8, 9], which are exponential in treewidth. This is because the treewidth of the ground Markov\nnetwork is polynomial in the number of constants in the domain.\n\n2 Preliminaries\n\nIn this section we will cover some preliminaries and notation used in the rest of the paper. A feature\n(fi) is constructed using constants, variables, and predicates. Constants, denoted with small-case\nletters (e.g. a), are used to represent a particular object. An upper-case letter (e.g. X) indicates a\nvariable associated with a particular domain (\u2206X), ranging over all objects in its domain. Predicate\nsymbols (e.g. Friends) are used to represent relationships between the objects. For example,\nFriends(bob,alice) denotes that Alice (represented by constant alice) and Bob (constant\nbob) are friends. An atom is a predicate symbol applied to a tuple of variables or constants. 
For\nexample, Friends(bob,X) and Friends(bob,alice) are atoms.\nA conjunctive feature is of the form $\\forall \\bar{X}\\; r_1 \\wedge r_2 \\wedge \\cdots \\wedge r_k$, where each $r_i$ is an atom or the negation\nof an atom, and $\\bar{X}$ are the variables used in the atoms. Similarly, a disjunctive feature is of the form\n$\\forall \\bar{X}\\; r_1 \\vee r_2 \\vee \\cdots \\vee r_k$. For example, $f_c$ : \u2200X \u00acSmokes(X) \u2227 Asthma(X) is a conjunctive feature,\nwhile $f_d$ : \u2200X \u00acSmokes(X) \u2228 \u00acFriends(bob,X) is a disjunctive feature. The former asserts that\neveryone in the domain of X has asthma and does not smoke. The latter says that if a person smokes,\nhe/she cannot be friends with Bob. A grounding of a feature is an assignment of the variables to\nconstants from their domain. For example, \u00acSmokes(alice) \u2228 \u00acFriends(bob,alice) is\na grounding of the disjunctive feature $f_d$. We assume that no predicate symbol occurs more than\nonce in a feature, i.e., we don\u2019t allow for self-joins. In this work we focus on features containing only\nuniversal quantifiers (\u2200), and will from now on drop the quantification symbol \u2200 from the notation.\nGiven a set $(w_i, f_i)_{i=1,\\ldots,k}$ where each $f_i$ is a conjunctive or disjunctive feature and $w_i \\in \\mathbb{R}$ is a\nweight assigned to that feature, we define the following probability distribution over a possible\nworld \u03c9 in accordance with Markov Logic Networks (MLN):\n\n$$\\Pr(\\omega) = \\frac{1}{Z} \\exp\\left(\\sum_i w_i N(f_i, \\omega)\\right) \\qquad (1)$$\n\nIn Equation (1), a possible world \u03c9 can be any subset of tuples from the domain of predicates, Z,\nthe normalizing constant, is called the partition function, and $N(f_i, \\omega)$ is the number of groundings\nof feature $f_i$ that are true in the world \u03c9.\nTable 1 gives an example of an MLN that has been modified from [10]. 
There is an implicit type-safety\nassumption in the MLNs, that if a predicate symbol occurs in more than one feature, then the\nvariables used at the same position must have the same domain. In the MLN of Table 1, if \u2206X = \u2206Y =\n{alice, bob}, then predicates Smokes and Asthma each have two tuples, while Friends has four.\nHence, the total number of possible worlds is $2^{2+2+4} = 256$. Consider the possible world \u03c9 below:\n\nSmokes: bob\nAsthma: bob, alice\nFriends: (bob,bob), (bob,alice), (alice,alice)\n\nThen from Equation (1): $\\Pr(\\omega) = \\frac{1}{Z} \\exp(1.4 \\cdot 1 + 2.3 \\cdot 0 + 4.6 \\cdot 0 + 1.5 \\cdot 1 + 1.1 \\cdot 2)$. In this\npaper we focus on MLNs, but our algorithm is applicable to other first order probabilistic models as\nwell.\n\n3 Problem Statement\n\nIn this paper, we are interested in computing the partition function Z(M) of an MLN M. We\nformulate the partition function in a parametrized form, using the notion of Generating Functions of\nCounting Programs (CP). A Counting Program is a set of features $\\bar{f}$ along with indeterminates $\\bar{\\alpha}$,\nwhere $\\alpha_i$ is the indeterminate for $f_i$. Given a counting program $P = (f_i, \\alpha_i)_{i=1,\\ldots,k}$, we define its\ngenerating function (GF) $F_P$ as follows:\n\n$$F_P(\\bar{\\alpha}) = \\sum_{\\omega} \\prod_i \\alpha_i^{N(f_i, \\omega)} \\qquad (2)$$\n\nThe generating function as expressed in Eq. 2 is in general of exponential size in the domain of\nobjects. We want to characterize cases where we can express it more succinctly, and hence compute\nthe partition function faster. Let n be the size of the object domain, and k be the size of our program.\nThen we are interested in the cases where $F_P$ can be computed with the following number of arithmetic\noperations.\n\nClosed Form: polynomial in log(n), k\nPolynomial Expression: polynomial in n, k\nPseudo-Polynomial Expression: polynomial in n for bounded k\n\nComputing $F_P$ refers to evaluating it for one instantiation of parameters $\\bar{\\alpha}$. 
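For intuition, and assuming the weighted features $(w_i, f_i)$ of the MLN are taken directly as the counting program (a simplification of the reduction, not a step spelled out in the text), the instantiation of interest would be $\\alpha_i = e^{w_i}$, since comparing Equation (1) with Eq. 2 gives:\n\n$$Z = \\sum_{\\omega} \\exp\\Big(\\sum_i w_i N(f_i, \\omega)\\Big) = \\sum_{\\omega} \\prod_i \\big(e^{w_i}\\big)^{N(f_i, \\omega)} = F_P\\big(e^{w_1}, \\ldots, e^{w_k}\\big).$$\n\n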
To illustrate the above\ncases, let k = 1. Then the pseudo-polynomial and polynomial expression are equivalent. The\nprogram (R(X, Y), \u03b1) has GF $(1 + \\alpha)^{|\\Delta_X||\\Delta_Y|}$, which is in closed form. The program\n(R(X) \u2227 S(X, Y) \u2227 T(Y), \u03b1), in contrast, has GF\n\n$$2^{|\\Delta_X||\\Delta_Y|} \\sum_{i=0}^{|\\Delta_X|} \\binom{|\\Delta_X|}{i} \\left(1 + \\left(\\frac{1+\\alpha}{2}\\right)^{i}\\right)^{|\\Delta_Y|},$$\n\nwhich is a polynomial expression. This polynomial does not have a closed form.\nIn the following section we demonstrate an algorithm that computes the generating function, and\nallows us to identify cases where the generating function falls under one of the above categories.\n\n4 Computing the Generating Function\n\nAssume a Counting Program $P = (f_i, \\alpha_i)_{i=1,\\ldots,k}$. In this section, we present some rules that can be\nused to compute the GF of a CP from simpler CPs. We can then upper bound the size of $F_P$ by the\nchoice of rules used. The cases which cannot be evaluated by these rules are still open and we don\u2019t\nknow if the GF in those cases can be expressed succinctly.\nWe will require that all CPs are in normal form to simplify our analysis. Note that the normality\nrequirement does not change the class of CPs that can be solved in PTIME by our algorithm. This\nis because every CP can be converted to an equivalent normal CP in PTIME.\n\n4.1 Normal Counting Programs\nDefinition 4.1 A counting program is called normal if it satisfies the following properties:\n\n1. There are no constants in any feature.\n2. 
If two distinct atoms with the same predicate symbol have variables X and Y in the same\n\nposition, then \u2206X = \u2206Y .\n\nIt is easy to show that:\n\nProposition 4.2 Computing the partition function of an MLN can be reduced in PTIME to comput-\ning the generating function of a normal CP.\nThe following example demonstrates how to normalize a set of features.\n\nExample 4.3 Consider a CP containing two features Friends(X, Y ) and Friends(bob, Y ).\nClearly, it is not in normal form because the second feature contains a constant. To normal-\nize it, we can replace the two features by: (i) Friends1(Y ) \u2261 Friends(bob, Y ), and (ii)\nFriends2(Z, Y ) \u2261 Friends(X, Y ), X (cid:54)= bob, where the domain of Z is \u2206Z = \u2206X \\ bob.\nNote that we assume criterion 2 is satis\ufb01ed in MLNs. During the course of algorithm, it may\nget violated when we replace variables with constants as we\u2019ll see, but we can use the above\ntransformation whenever that happens. So from now on we assume that our CP is normalized.\n4.2 Preliminaries and Operators\n\nWe proceed to establish notation and operators used by our algorithm. Given a feature f, we denote\nby V ars(f) the set of variables used in its atoms. We assume that variables used in different features\nmust be different. Furthermore, without loss of generality, we assume numeric domains for each\nlogical variable, namely \u2206X = {1, . . . ,|\u2206X|}. We de\ufb01ne a substitution f[a/X], where X \u2208\nV ars(f) and a \u2208 \u2206X, as the replacement of X with a in every atom of f. P [a/X] applies the\nsubstitution fi[a/X] to every feature fi in P . Note that after a substitution, the CP is no longer\nnormal and therefore, we may have to normalize it.\nDe\ufb01ne a relation U among the variables of a CP as follows : U(X, Y ) iff there exist two atoms ri, rj\nwith the same predicate, such that X \u2208 V ars(ri), Y \u2208 V ars(rj), and X and Y appear at the same\nposition in ri and rj respectively. 
Let $\\bar{U}$ be the transitive closure of U. Note that $\\bar{U}$ is an equivalence\nrelation. For a variable X, denote by Unify(X) its equivalence class under $\\bar{U}$. For example, given\ntwo features Smokes(X) \u2227 \u00acAsthma(X) and \u00acSmokes(Y) \u2228 \u00acFriends(Z,Y), we have\nUnify(X) = Unify(Y) = {X, Y}. Given a feature, a variable is a root variable iff it appears in\nevery atom of the feature. For some variable X, the set $\\bar{X}$ = Unify(X) is a separator if for all $Y \\in \\bar{X}$,\n$Y \\in Vars(f_i)$ implies Y must be a root variable for $f_i$. In the last example, the set {X, Y} is a\nseparator. Notice that, since the program is normal, we have \u2206X = \u2206Y whenever Y \u2208 Unify(X);\nthus, if $\\bar{X}$ is a separator, then we write $\\Delta_{\\bar{X}}$ for \u2206Y for any Y \u2208 Unify(X). Two variables X and Y are called\nequivalent if there is a bijection from Unify(X) to Unify(Y) such that for any Z1 \u2208 Unify(X) and\nits image Z2 \u2208 Unify(Y), Z1 and Z2 always occur together.\nNext, we define three operators used by our algorithm: splitting, conditioning and Dirichlet convolution.\nWe define a process Split(Y, k) that splits every feature in the CP that contains the variable Y\ninto two features with disjoint domains: one with \u2206Y = {k} and the other with \u2206Y^c = \u2206Y \u2212 {k}.\nBoth features retain the same indeterminate. Also, Cond(i, r, k) defines a process that removes an\natom r from feature $f_i$. Denote $f'_i = f_i \\setminus \\{r\\}$; then Cond(i, r, k) replaces $f_i$ with (i) two features\n$(TRUE, \\alpha_i^k)$ and $(f'_i, 1)$ if $r \\Rightarrow f_i$, (ii) one feature $(f'_i, 1)$ if $r \\Rightarrow \\neg f_i$, and (iii) $(f'_i, \\alpha_i)$ otherwise.\nGiven two polynomials $P = \\sum_{i}^{n} a_i \\alpha^i$ and $Q = \\sum_{i}^{m} b_i \\alpha^i$, their Dirichlet convolution, $P \\ast Q$, is\ndefined as:\n\n$$P \\ast Q = \\sum_{i,j} a_i b_j \\alpha^{ij}$$\n\nWe define a new variant of this operator, $P \\ast_c Q$, as: $P \\ast_c Q = \\alpha^{mn} P'\\left(\\frac{1}{\\alpha}\\right) \\ast Q'\\left(\\frac{1}{\\alpha}\\right)$, where $P'\\left(\\frac{1}{\\alpha}\\right) = \\frac{P(\\alpha)}{\\alpha^n}$ and $Q'\\left(\\frac{1}{\\alpha}\\right) = \\frac{Q(\\alpha)}{\\alpha^m}$.\n\n4.3 The Algorithm\nOur algorithm is basically a recursive application of a series of rewriting rules (see rules R1-R6\ngiven below). Each (non-trivial) rule takes a CP as input and if the preconditions for applying it are\nsatisfied, then it expresses the generating function of the input CP as a combination of generating\nfunctions of a few simpler CPs. The generating function of the resulting CPs can then be computed\n(independently) by recursively calling the algorithm on each. The recursion terminates when the\ngenerating function of the CP is trivial to compute (SUCCESS) or when none of the rules can be\napplied (FAILURE). In the case when the algorithm succeeds, we analyze whether the GF is in closed\nform or is a polynomial expression.\nNext, we present our algorithm which is essentially a sequence of rules. Given a CP, we go through\nthe rules in order and apply the first applicable rule, which may require us to recursively compute\nthe GF of simpler CPs, for which we continue in the same way.\nOur first rule uses feature and variable equivalence to reduce the size of the CP. Formally,\nRule R1 (Variable and Feature Equivalence Rule) If variables X and Y are equivalent, replace\nthe pair with a single new variable Z in every atom where they occur. 
Do the same for every pair of\nvariables in Unify(X), Unify(Y ).\nIf two features fi, fj are identical, then we replace them with a single feature fi with indeterminate\n\u03b1i\u03b1j that is the product of their individual indeterminates.\n\nThe correctness of Rule R1 is immediate from the fact that the CP after the transformation is equal\nto the CP before the transformation.\nOur second rule speci\ufb01es some trivial manipulations.\nRule R2 (Trivial manipulations)\n\n1. Eliminate FALSE features.\n2. If a feature fi is T RU E, then FP = \u03b1iFP\u2212fi.\n3. If a program P is just a tuple then FP = 1 + \u03b1, where \u03b1 is the indeterminate.\n4. If some feature fi has indeterminate \u03b1i = 1 (due to R6), then remove all the atoms in fi\nof a predicate symbol that is present in some other feature. Let N be the product of the\ndomain of the rest of the atoms, then FP = 2N FP\u2212fi.\n\nOur third rule utilizes the independence property. Intuitively, given two CPs which are independent,\nnamely they have no atoms in common, the generating function of the joint CP is simply the product\nof the generating function of the two CPs. Formally,\n\nRule R3 (Independence Rule) If a CP P can be split into two programs P1 and P2 such that the\ntwo programs don\u2019t have any predicate symbols in common, then FP = FP1 \u00b7 FP2.\nThe correctness of Rule R3 follows from the fact that every world \u03c9 of P can be written as a\nconcatenation of two disjoint worlds, namely \u03c9 = (\u03c91 \u222a \u03c92) where \u03c91 and \u03c92 are the worlds from\nP1 and P2 respectively. 
Hence the GF can be written as:\n\n$$F_P = \\sum_{\\omega_1 \\cup \\omega_2} \\prod_{f_i \\in P_1} \\alpha_i^{N(f_i, \\omega_1)} \\prod_{f_i \\in P_2} \\alpha_i^{N(f_i, \\omega_2)} = \\left(\\sum_{\\omega_1} \\prod_{f_i \\in P_1} \\alpha_i^{N(f_i, \\omega_1)}\\right) \\left(\\sum_{\\omega_2} \\prod_{f_i \\in P_2} \\alpha_i^{N(f_i, \\omega_2)}\\right) = F_{P_1} \\cdot F_{P_2} \\qquad (3)$$\n\nThe next rule allows us to split a feature if it has a component that is independent of the rest of the\nprogram. Note that while the previous rule splits the program into two independent sets of features,\nthis rule enables us to split a single feature.\nRule R4 (Dirichlet Convolution Rule) If the program contains feature f = f1 \u2227 f2, s.t. f1 doesn\u2019t\nshare any variables or symbols with any atom in the program, then $F_P = F_{f_1} \\ast F_{P - f + f_2}$. Similarly\nif f = f1 \u2228 f2, then $F_P = F_{f_1} \\ast_c F_{P - f + f_2}$.\n\nWe show the proof for a single feature f; the extension is straightforward. For this, we write GF in\na different form as\n\n$$F_P(\\alpha) = \\sum_i C(f, i)\\, \\alpha^i$$\n\nwhere the coefficient C(f, i) is exactly the number of worlds where the feature f is satisfied i times.\nNow assume f = f1 \u2227 f2, then in any given world \u03c9, if f1 is satisfied n1 times and f2 is satisfied\nn2 times, then f is satisfied n1n2 times. Hence\n\n$$F_f(\\alpha) = \\sum_i C(f, i)\\, \\alpha^i = \\sum_{i_1, i_2 \\mid i_1 i_2 = i} C(f_1, i_1)\\, C(f_2, i_2)\\, \\alpha^i = F_{f_1} \\ast F_{f_2}$$\n\nOur next rule utilizes the similarity property in addition to the independence property. Given a set\nP of independent but equivalent CPs, the generating function of the joint CP equals the generating\nfunction of any CP, Pi \u2208 P raised to the power |P|. By definition, every instantiation $\\bar{a}$ of a separator\n$\\bar{X}$ defines a CP that has no tuple in common with other programs for $\\bar{X} = \\bar{b}$, $\\bar{a} \\neq \\bar{b}$. 
Moreover, all\nsuch CPs are equivalent (subject to a renaming of the variables and constants). Thus, we have the\nfollowing rule:\n\nRule R5 (Power Rule) Let $\\bar{X}$ be a separator. Then $F_P = \\left(F_{P[\\bar{a}/\\bar{X}]}\\right)^{|\\Delta_{\\bar{X}}|}$.\n\nRule R5 generalizes the inversion and partial inversion operators given in [4, 5]. Its correctness\nfollows in a straight-forward manner from the correctness of the independence rule.\nOur final rule generalizes the counting arguments presented in [5, 7]. Consider a singleton atom\nR(X). Conditioning over all possible truth assignments to all groundings of R(X) will yield $2^{|\\Delta_X|}$\nindependent CPs. Thus, the GF can be written as a sum over the generating functions of $2^{|\\Delta_X|}$\nindependent CPs. However, the resulting GF has exponential complexity. In some cases, however,\nthe sum can be written efficiently by grouping together GFs that are equivalent.\nRule R6 (Generalized Binomial Rule) Let Pred(X) be a singleton atom in some feature. For\nevery Y \u2208 Unify(X) apply Split(Y, k). Then for every feature $f_i$ in the new program containing an\natom r = Pred(Y) apply $(f_i, \\alpha_i) \\leftarrow$ Cond(i, r, k) and similarly $(f_i, \\alpha_i) \\leftarrow$ Cond(i, \u00acr, \u2206Y^c \u2212 k)\nfor those containing r = Pred(Y^c). Let the resulting program be $P_k$. 
Then $F_P = \\sum_{k=0}^{|\\Delta_X|} \\binom{|\\Delta_X|}{k} F_{P_k}$.\n\nNote that $P_k$ is just one CP whose GF has a parameter k.\n\nThe proof is a little involved and omitted here for lack of space.\nHaving specified the rules and established their correctness, we now present the main result of this\npaper:\n\nTheorem 4.4 Let P be a Counting Program (CP).\n\n\u2022 If P can be evaluated using only rules R1, R2, R3 and R5, then it has a closed form.\n\u2022 If P can be evaluated using only rules R1, R2, R3, R4, and R5, then it has a polynomial\nexpression.\n\u2022 If P can be evaluated using rules R1 to R6, then it admits a pseudo-polynomial expression.\n\nComputing the Dirichlet convolution (Rule R4) requires going through all the coefficients, hence it\ntakes linear time. Thus, we do not have a closed form solution when we apply Rule R4. Rule R6\nimplies that we have to recurse over more than one program, hence its repeated application can\nmean we have to solve a number of programs that is exponential in the size of the program. Therefore,\nwe can only guarantee a pseudo-polynomial expression if we use this rule.\nWe can now see the effectiveness of generating functions. When we want to recurse over a set of\nfeatures, simply keeping the partition function for smaller features is not enough; we need more\ninformation than that. In particular we need all the coefficients of the generating function. For example,\nwe can\u2019t compute the partition function for R(X) \u2227 S(Y) with just the partition functions of R(X)\nand S(Y). However, if we have their GF, the GF of f = R(X) \u2227 S(Y) is just a Dirichlet convolution\nof the GF of R(X) and S(Y). One could also compute the GF of f using a dynamic programming\nalgorithm, which keeps all the coefficients of the generating function. Generating functions let us\nstore this information in a very succinct way. For example, 
if the GF is $(1 + \\alpha)^n$, then it is much simpler\nto use this representation than keeping all n + 1 binomial coefficients $\\binom{n}{k}$, k = 0, ..., n.\n\n[Figure 1 plot: Time (sec) vs. Domain Size; series: Counting Program (evidence 30%), FOVE (evidence 30%), FOVE extrapolation]\n\nFigure 1: Our approach vs FOVE for increasing\ndomain sizes. X,Y-axes drawn on a log-scale.\n\n[Figure 2 plot: Time (sec) vs. Percentage of Evidence; series: Counting Program (domain size 13), Counting Program (domain size 100), FOVE (domain size 13)]\n\nFigure 2: Our approach vs FOVE as the evidence\nincreases. Y-axis is drawn on a log scale.\n\n4.4 Examples\n\nWe illustrate our approach through examples. We will use simple predicate symbols like R, S, T and\nassume the domain of all variables as [n]. Note that for a single tuple, say R(a) with indeterminate\n\u03b1, GF = 1 + \u03b1 from rule R2. Now suppose we have a simple program like P = {(R(X), \u03b1)} (a\nsingle feature R(X) with indeterminate \u03b1). Then from rule R5: $F_P = \\left(F_{P[a/X]}\\right)^n = (1 + \\alpha)^n$.\nThese are both examples of programs with closed form GF. We can evaluate $F_P$ with O(log(n))\narithmetic operations, while if we were to write the same GF as $\\sum_k \\binom{n}{k} \\alpha^k$ it would require\nO(n log(n)) operations. The key insight of our approach is representing GFs succinctly. Now\nassume the following program P with multiple features:\n\nR(X1) \u2227 S(X1, Y1)    \u03b1\nS(X2, Y2) \u2227 T(X2)    \u03b2\n\nNote that (X1, X2) form a separator. Hence using R5, $F_P = \\left(F_{P[(a,a)/(X_1,X_2)]}\\right)^n$. Now consider\nprogram $P' = P[(a, a)/(X_1, X_2)]$:\n\nR(a) \u2227 S(a, Y1)    \u03b1\nS(a, Y2) \u2227 T(a)    \u03b2\n\nUsing R4 twice, for R(a) and T(a) along with R2 (to get the GF for R(a), T(a)), we get\n$F_{P'} = (1 + \\alpha) \\ast (1 + \\beta) \\ast F_{P''}$, where $P''$ is\n\nS(a, Y1)    \u03b1\nS(a, Y2)    \u03b2\n\nwhich is same as (S(a, Y), \u03b1\u03b2) using R1. The GF for this program, as shown earlier, is $(1 + \\alpha\\beta)^n$.\nNow putting values back together, we get:\n\n$$F_{P'} = (1 + \\alpha) \\ast (1 + \\beta) \\ast (1 + \\alpha\\beta)^n = \\left(2^{n+1} + (1 + \\alpha\\beta)^n\\right)$$\n\nFinally, for the original program: $F_P = (F_{P'})^n = \\left(2^{n+1} + (1 + \\alpha\\beta)^n\\right)^n$. Note that this is also in\nclosed form.\n\n5 Experiments\n\nThe algorithm that we described is based on computing the generating functions of counting programs\nto perform lifted inference, which approaches the problem from a completely different angle\nthan existing techniques. Due to this novelty, we can solve MLNs that are intractable for other existing\nlifted algorithms such as first-order variable elimination (FOVE) [5, 6, 7]. Specifically, we\ndemonstrate with our experiments that on some MLNs we indeed outperform FOVE by orders of\nmagnitude.\nWe ran our algorithm on the MLN given in Table 1. The set of features used in this MLN falls into\nthe class of counting programs having a pseudo-polynomial generating function. This is the most\ngeneral class of features our approach covers, and here our algorithm does not give any guarantees\nas evidence increases. The evidence in our experiments is randomly generated for the two tables\nAsthma and Smokes. In our experiments we study the influence of two factors on the runtime:\n\nSize of Domain: Identifying tractable features is particularly important for inference in first order\nmodels, because (i) grounding can produce very big graphical models and (ii) the treewidth of these\nmodels could be very high. As the size of domain increases, our approach should scale better than\nthe existing techniques which can\u2019t do lifted inference on this MLN. 
All the predicates in this MLN\nare only defined on one domain, that of persons.\nEvidence: Since this MLN falls into the class of features for which we give no guarantees as\nevidence increases, we want to study the behavior of our algorithm in the presence of increasingly\nmore evidence.\nFig. 1 displays the execution time of our CP algorithm vs the FOVE approach for domain sizes\nvarying from 5 to 100, in the presence of 30% evidence. All results display average runtimes over\n15 repetitions with the same parameter settings. FOVE cannot do lifted inference on this MLN\nand resorts to grounding. Thus, it could only execute up to the domain size of 18; after that it\nconsistently ran out of memory. The figure also displays the extrapolated data points for FOVE\u2019s\nbehavior in larger domain sizes, and shows its runtime growing exponentially. Our approach on the\nother hand dominates FOVE by orders of magnitude for those small domains, and finishes within\nseconds even for domains of size 100. Note that the complexity of our algorithm for this MLN is\nquadratic. Hence it looks linear on the log-scale.\nFig. 2 demonstrates the behavior of the algorithms as the amount of evidence is increased from 0 to\n100%. We chose a domain size of 13 to run FOVE, since it couldn\u2019t terminate for higher domain\nsizes. The figure displays the runtime of our algorithm for domain sizes of 13 and 100. Although for\nthis class of features we do not give guarantees on the running time for large evidence, our algorithm\nstill performs well as the evidence increases. In fact after a point the algorithm gets faster. This is\nbecause the main time-consuming rule used in this MLN is R6. R6 chooses a singleton atom in\nthe last feature, say Asthma, and eliminates it. This involves time complexity proportional to the\ndomain of the atom and the running time of the smaller MLN obtained after removing that atom. 
As\nevidence increases, the atom corresponding to Asthma may be split into many smaller predicates;\nbut the domain size of each predicate also keeps getting smaller. In particular with 100% evidence,\nthe domain is just 1 and therefore R6 takes constant time!\n\n6 Conclusion and Future Work\n\nWe have presented a novel approach to lifted inference that uses the theory of generating functions\nto do ef\ufb01cient inference. We also give guarantees on the theoretical complexity of our approach.\nThis is the \ufb01rst work that tries to address the complexity of lifted inference in terms of only the\nfeatures (formulas). This is bene\ufb01cial because using a set of tractable features ensures that inference\nis always ef\ufb01cient and hence it will scale to large domains.\nSeveral avenues remain for future work. For instance, a feature such as transitive closure ( e.g.,\nFriends(X,Y) \u2227 Friends(Y,Z) \u21d2 Friends(X,Z)), which occurs quite often in many\nreal world applications, is intractable for our algorithm. In future, we would like to address the\ncomplexity of such features by characterizing the completeness of our approach. Another avenue for\nfuture work is extending other lifted inference approaches [5, 7] with rules that we have developed\nin this paper. Unlike our algorithm, the aforementioned algorithms are complete. Namely, when\nlifted inference is not possible, they ground the domain and resort to propositional inference. But\neven in those cases, just running a propositional algorithm that does not exploit symmetry is not very\nef\ufb01cient. In particular, ground networks generated by logical formulas have some repetition in their\nstructure that is dif\ufb01cult to capture after grounding. Take for example R(X,Y) \u2227 S(Z,Y). This\nfeature is in PTIME by our algorithm, but if we create a ground markov network by grounding this\nfeature then it can have unbounded treewidth (as big as the domain itself). 
We think our approach\ncan provide insight into how best to construct a graphical model from the groundings of a\nlogical formula. This is another interesting piece of future work that our algorithm motivates.\n\nReferences\n\n[1] Lise Getoor and Ben Taskar. Introduction to Statistical Relational Learning. The MIT Press, 2007.\n\n[2] Pedro Domingos and Daniel Lowd. Markov Logic: An Interface Layer for Artificial Intelligence. Morgan and Claypool, 2009.\n\n[3] Matthew Richardson and Pedro Domingos. Markov logic networks. In Machine Learning, page 2006, 2006.\n\n[4] David Poole. First-order probabilistic inference. In IJCAI\u201903: Proceedings of the 18th international joint conference on Artificial intelligence, pages 985\u2013991, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc.\n\n[5] Rodrigo De Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. In IJCAI\u201905: Proceedings of the 19th international joint conference on Artificial intelligence, pages 1319\u20131325, San Francisco, CA, USA, 2005. Morgan Kaufmann Publishers Inc.\n\n[6] Brian Milch, Luke S. Zettlemoyer, Kristian Kersting, Michael Haimes, and Leslie Pack Kaelbling. Lifted probabilistic inference with counting formulas. In AAAI\u201908: Proceedings of the 23rd national conference on Artificial intelligence, pages 1062\u20131068. AAAI Press, 2008.\n\n[7] K. S. Ng, J. W. Lloyd, and W. T. Uther. Probabilistic modelling, inference and learning using logical theories. Annals of Mathematics and Artificial Intelligence, 54(1-3):159\u2013205, 2008.\n\n[8] Nevin Zhang and David Poole. A simple approach to bayesian network computations. In Proceedings of the Tenth Canadian Conference on Artificial Intelligence, pages 171\u2013178, 1994.\n\n[9] R. Dechter. Bucket elimination: A unifying framework for reasoning. 
Artificial Intelligence, 113:41\u201385, 1999.\n\n[10] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In AAAI\u201908: Proceedings of the 23rd national conference on Artificial intelligence, pages 1094\u20131099. AAAI Press, 2008.\n\n[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.\n\n[12] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 467\u2013475, 1999.\n\n[13] Nilesh Dalvi and Dan Suciu. Management of probabilistic data: foundations and challenges. In PODS, pages 1\u201312, New York, NY, USA, 2007. ACM Press.\n\n[14] Karl Schnaitter, Nilesh Dalvi, and Dan Suciu. Computing query probability with incidence algebras. In PODS, 2007.", "award": [], "sourceid": 535, "authors": [{"given_name": "Abhay", "family_name": "Jha", "institution": null}, {"given_name": "Vibhav", "family_name": "Gogate", "institution": null}, {"given_name": "Alexandra", "family_name": "Meliou", "institution": null}, {"given_name": "Dan", "family_name": "Suciu", "institution": null}]}