{"title": "Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 316, "page_last": 324, "abstract": "Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence.", "full_text": "Computing Marginal Distributions over Continuous\nMarkov Networks for Statistical Relational Learning\n\nMatthias Br\u00a8ocheler, Lise Getoor\nUniversity of Maryland, College Park\n\nCollege Park, MD 20742\n\n{matthias, getoor}@cs.umd.edu\n\nAbstract\n\nContinuous Markov random \ufb01elds are a general formalism to model joint proba-\nbility distributions over events with continuous outcomes. We prove that marginal\ncomputation for constrained continuous MRFs is #P-hard in general and present\na polynomial-time approximation scheme under mild assumptions on the struc-\nture of the random \ufb01eld. Moreover, we introduce a sampling algorithm to com-\npute marginal distributions and develop novel techniques to increase its ef\ufb01-\nciency. Continuous MRFs are a general purpose probabilistic modeling tool and\nwe demonstrate how they can be applied to statistical relational learning. 
On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence.

1 Introduction

Continuous Markov random fields are a general and expressive formalism to model complex probability distributions over multiple continuous random variables. Potential functions, which map the values of sets (cliques) of random variables to real numbers, capture the dependencies between variables and induce an exponential family density function as follows: Given a finite set of n random variables X = {X_1, . . . , X_n} with associated bounded interval domains D_i ⊂ R, let φ = {φ_1, . . . , φ_m} be a finite set of m continuous potential functions defined over the interval domains, i.e., φ_j : D → [0, M] for some bound M ∈ R_+, where D = D_1 × D_2 × . . . × D_n. For a set of free parameters Λ = {λ_1, . . . , λ_m}, we then define the probability measure P over X with respect to φ through its density function f as:

f(x) = (1/Z(Λ)) exp(−Σ_{j=1}^m λ_j φ_j(x)) ;  Z(Λ) = ∫_D exp(−Σ_{j=1}^m λ_j φ_j(x)) dx    (1)

where Z is the normalization constant. The definition is analogous to that of the popular discrete Markov random fields (MRFs), but uses integration over the bounded domain rather than summation for the partition function Z.

In addition, we assume the existence of a set of k_A equality and k_B inequality constraints on the random variables, that is, A(x) = a, where A : D → R^{k_A}, a ∈ R^{k_A}, and B(x) ≤ b, where B : D → R^{k_B}, b ∈ R^{k_B}. Both equality and inequality constraints restrict the possible combinations of values the random variables X can assume.
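As a concrete illustration of Equation (1), the following one-variable toy sketch (our example, not from the paper) evaluates the unnormalized density and approximates the partition function by numeric integration over the bounded domain:

```python
import math

# Illustrative sketch of Equation (1): a single variable on D = [0, 1] with
# one potential phi(x) = x and weight lambda = 2 (toy values, our choice).

def unnormalized_density(x, potentials, lams):
    """exp(-sum_j lam_j * phi_j(x)), the numerator of Equation (1)."""
    return math.exp(-sum(lam * phi(x) for phi, lam in zip(potentials, lams)))

def partition_function(potentials, lams, lo=0.0, hi=1.0, steps=100000):
    """Z(Lambda) via a simple midpoint rule over the bounded domain."""
    h = (hi - lo) / steps
    return h * sum(unnormalized_density(lo + (i + 0.5) * h, potentials, lams)
                   for i in range(steps))

potentials = [lambda x: x]   # phi(x) = x
lams = [2.0]
Z = partition_function(potentials, lams)
# analytically, Z = (1 - exp(-2)) / 2 for this toy model
density_at_half = unnormalized_density(0.5, potentials, lams) / Z
```

For hinge potentials of the form used later in the paper, the same scheme applies unchanged; only the `potentials` list differs.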
That is, we set f(x) = 0 whenever any of the constraints is violated and restrict the domain of integration for the normalization constant, denoted D̃, correspondingly. Constraints are useful in probabilistic modeling to exclude inconsistent outcomes based on prior knowledge about the distribution. We call this class of MRFs constrained continuous Markov random fields (CCMRFs).

Probabilistic inference often requires the computation of marginal distributions for all or a subset of the random variables X. Marginal computation for discrete MRFs has been extensively studied due to its wide applicability in probabilistic reasoning. In this work, we study the theoretical and practical aspects of computing marginal density functions over CCMRFs. General continuous MRFs can be used in a variety of probabilistic modeling scenarios and have been studied for applications with continuous domains such as computer vision. Gaussian random fields are a type of continuous MRF which assume normality. In this work, we make no restrictive assumptions about the marginal distributions other than boundedness. For general continuous MRFs, non-parametric belief propagation (NBP) [1] has been proposed as a method to estimate marginals. NBP represents the "belief" as a combination of kernel densities which are propagated according to the structure of the MRF. In contrast to NBP, our approach provides polynomial-time approximation guarantees and avoids the representational choice of kernel densities.

λ1 : A.text ∼= B.text ∧̃ class(A, C) ⇒̃ class(B, C)
λ2 : link(A, B) ∧̃ class(A, C) ⇒̃ class(B, C)
Constraint : functional(class)

Table 1: Example PSL program for collective classification.

The main contributions of this work are described in Section 3. We begin by showing that computing marginals in CCMRFs is #P-hard in the number of random variables n.
We then discuss a Markov chain Monte Carlo (MCMC) sampling scheme that can approximate the exact distribution to within ε error in polynomial time, under the general assumption that the potential functions and inequality constraints are convex. Based on this result, we propose a tractable sampling algorithm and present a novel approach to increasing its effectiveness by detecting and counteracting slow convergence. Our theoretical results are based on recent advances in computational geometry and the study of log-concave functions [2]. In Section 4, we investigate the performance, scalability, and convergence of the sampling algorithm on the probabilistic inference problem of collective classification on a set of Wikipedia documents. In particular, we show that the standard deviation of the marginal density function can serve as a strong indicator of the "confidence" in the classification prediction, thereby demonstrating a useful qualitative aspect of marginals over continuous MRFs. Before turning to the main contributions of the paper, in the next section we give background motivation for the form of CCMRFs we study.

2 Motivation

Our treatment of CCMRFs is motivated by probabilistic similarity logic (PSL) [3]. PSL is a relational language that provides support for probabilistic reasoning about similarities. PSL is similar to existing SRL models, e.g., MLNs [4], BLPs [5], RMNs [6], in that it defines a probabilistic graphical model over the properties and relations of the entities in a domain as a grounding of a set of rules that have attached parameters. However, PSL supports reasoning about "soft" truth values, which can be seen as similarities between entities or sets of entities, degrees of belief, or strengths of relationships.
PSL uses annotated logic rules to capture the dependency structure of the domain, based on which it builds a joint continuous probabilistic model over all decision atoms, which can be expressed as a CCMRF as defined above. PSL has been used to reason about the similarity between concepts from different ontologies as well as articles from Wikipedia. Table 1 shows a simple PSL program for collective classification. The first rule states that documents with similar text are likely to have the same class. The second rule says that two documents which are linked to each other are also likely to be assigned the same class. Finally, we express the constraint that each document can have at most one class, that is, the class predicate is functional and can only map to one value. Such domain-specific constraints motivate our introduction of equality and inequality constraints for CCMRFs. Rules and constraints are written in first-order logic formalism and are grounded out against the observed data such that each ground rule constitutes one potential function or constraint computing the truth value of the formula. Rules have an associated weight λ_i which is used as the parameter for each associated potential function. The weights can be learned from training data.

In the following, we make some assumptions about the nature of the constraints and the potential functions, motivated by the requirements of the PSL framework and the types of CCMRFs modeled therein. Firstly, we assume all domains are the [0, 1] interval, which corresponds to the domain of similarity truth values in PSL. Secondly, all constraints are assumed to be linear. Thirdly, the potential functions φ_j are of the form φ_j(x) = max(0, o_j · x + q_j), where o_j^T ∈ R^n is an n-dimensional row vector and q_j ∈ R.
The particular form of the potential functions is motivated by the way similarity truth values are combined in PSL using t-norms (see [3] for details).

While the techniques presented in this work are not specific to PSL, a brief outline of the PSL framework helps in understanding the assumptions about the CCMRFs of interest made in our algorithm and experiments. In Section 3.5 we show how our assumptions can be relaxed while maintaining polynomial-time guarantees for applications outside the PSL framework.

[Figure: a) Example of geometric marginal computation; b) Hit-and-run and random ball walk illustration]

3 Computing continuous marginals

This section contains the main technical contributions of this paper. We start our study of marginal computation for CCMRFs by proving that computing the exact density function is #P-hard (3.1). In Section 3.2, we discuss how to approximate the marginal distribution using an MCMC sampling scheme which produces a guaranteed ε-approximation in polynomial time under suitable conditions. We show how to improve the sampling scheme by detecting phases of slow convergence and present a technique to counteract them (3.3). Finally, we describe an algorithm based on the sampling scheme and its improvements (3.4). In addition, we discuss how to relax the linearity conditions in Section 3.5.

Throughout this discussion we use the following simple example for illustration:

Example 1 Let X = {X1, X2, X3} be subject to the inequality constraint x1 + x3 ≤ 1.
Let φ1(x) = x1, φ2(x) = max(0, x1 − x2), φ3(x) = max(0, x2 − x3), where λ = (1, 2, 1) are the associated free parameters.

3.1 Exact marginal computation

Theorem 1 Computing the marginal probability density function f_{X'}(x') = ∫_{y ∈ × D̃_i, s.t. X_i ∉ X'} f(x', y) dy for a subset X' ⊂ X under a probability measure P defined by a CCMRF is #P-hard in the worst case.

We prove this statement by a simple reduction from the problem of computing the volume of an n-dimensional polytope defined by linear inequality constraints. To see the relationship to computational geometry, note that the domain D is an n-dimensional unit hypercube.¹ Each linear inequality constraint B_i from the system B can be represented by a hyperplane which "cuts off" part of the hypercube D. Finally, the potential functions induce a probability distribution over the resulting convex polytope. Figure 3a) visualizes the domain for our running example in 3-dimensional Euclidean space. The constraint domain is shown as a wedge. The highlighted area marks the region of probability mass that is equal to the probability P(0.4 ≤ X2 ≤ 0.6).

Proof 1 (Sketch) For any random variable X ∈ X, the marginal probability P(l ≤ X ≤ u) under the uniform probability distribution defined by a single potential function φ = 0 corresponds to the volume of the "slice" defined by the bounds l < u ∈ [0, 1] relative to the volume of the entire polytope.
In [7] it was shown that computing the volume of such slices is at least as hard as computing the volume of the entire polytope, which is known to be #P-hard [8].

3.2 Approximate marginal computation and sampling scheme

Despite this hardness result, efficient approximation algorithms for convex volume computation based on MCMC techniques have been devised and yield polynomial-time approximation guarantees. We will review the techniques and then relate them to our problem of marginal computation. The first provably polynomial-time approximation algorithm for volume computation was based on "random ball walks". Starting from some initial point p inside the polytope, one samples from the local density function of f restricted to the inside of a ball of radius r around the point p. If the newly sampled point p' lies inside the polytope, we move to p'; otherwise we stay at p and repeat the sampling. If P is the uniform distribution (as typically chosen for volume computation), the resulting Markov chain converges to P over the polytope in O*(n³) steps, assuming that the starting distribution is not "too far" from P [9].²

More recently, the hit-and-run sampling scheme [10] was rediscovered, which has the advantage that no strong assumptions about the initial distribution need to be made. As in the random ball walk, we start at some interior point p. Next, we generate a direction d (i.e., an n-dimensional vector of length 1) uniformly at random and compute the line segment l of the line p + αd that resides inside the polytope. We then compute the distribution of P over the segment l, sample a new point from it, and move to the new sample point p' to repeat the process.

¹We ignore equality constraints for now until the discussion of the algorithm in Section 3.4.
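The hit-and-run scheme just described can be sketched for the uniform target distribution; the sampler below is our minimal illustration (names and the triangle example are ours, not the paper's), with box bounds encoded as extra rows of B, b:

```python
import random, math

# Minimal hit-and-run sampler for the UNIFORM distribution over a polytope
# {x : Bx <= b} inside [0,1]^n. A sketch of the scheme described above; for
# the uniform target, sampling P restricted to the chord is uniform on it.

def hit_and_run(B, b, x0, steps, rng=random.Random(42)):
    n = len(x0)
    x = list(x0)
    samples = []
    for _ in range(steps):
        # random unit direction d
        d = [rng.gauss(0, 1) for _ in range(n)]
        nrm = math.sqrt(sum(di * di for di in d))
        d = [di / nrm for di in d]
        # chord [alpha_low, alpha_high] where B(x + alpha*d) <= b holds
        alo, ahi = -float("inf"), float("inf")
        for Bi, bi in zip(B, b):
            bd = sum(c * di for c, di in zip(Bi, d))
            bx = sum(c * xi for c, xi in zip(Bi, x))
            if abs(bd) > 1e-12:
                a = (bi - bx) / bd
                if bd > 0:
                    ahi = min(ahi, a)
                else:
                    alo = max(alo, a)
        alpha = rng.uniform(alo, ahi)   # uniform target => uniform on chord
        x = [xi + alpha * di for xi, di in zip(x, d)]
        samples.append(list(x))
    return samples

# triangle x1 + x2 <= 1 inside the unit square
B = [[1, 1], [-1, 0], [0, -1], [1, 0], [0, 1]]
b = [1, 0, 0, 1, 1]
pts = hit_and_run(B, b, [0.25, 0.25], 5000)
mean_x1 = sum(p[0] for p in pts) / len(pts)   # should approach 1/3
```

Every iterate stays inside the polytope by construction, which is the property that lets hit-and-run avoid the rejected moves of the random ball walk.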
For P being the uniform distribution, the Markov chain also converges after O*(n³) steps, but for hit-and-run we only need to assume that the starting point p does not lie on the boundary of the polytope [2]. In [7], the authors show that hit-and-run significantly outperforms random ball walk sampling in practice, because it (1) does not get easily stuck in corners, since each sample is guaranteed to be drawn from inside the polytope, and (2) does not require parameter settings like the radius r, which greatly influences the performance of the random ball walk. Figure 3 b) shows an iteration of the random ball walk and the hit-and-run sampling schemes for our running example restricted to just two dimensions to simplify the presentation. We can see that, depending on the radius of the ball, a significant portion may not intersect with the feasible region.

Lovász and Vempala [2] have proven a stronger result which shows that hit-and-run sampling converges for general log-concave distributions. Based on their result, we get a polynomial-time approximation guarantee for distributions induced by CCMRFs as defined above.

Theorem 2 The complexity of computing an approximate distribution σ* using the hit-and-run sampling scheme such that the total variation distance of σ* and P is less than ε is O*(ñ³(k_B + ñ + m)), where ñ = n − k_A, under the assumption that we start from an initial distribution σ such that the density function dσ/dP is bounded by M except on a set S with σ(S) ≤ ε/2.

Proof 2 (Sketch) Since A, B are linear, D̃ is an ñ = n − k_A dimensional convex polytope after dimensionality reduction through A. By definition, f is from the exponential family, and since all factors are linear or maxima of linear functions, f is a log-concave function (maxima and sums of convex functions are convex). More specifically, f is a log-concave and log-piecewise-linear function. Let σ_s be the distribution of the current point after s steps of hit-and-run have been applied to f starting from σ. Now, according to Theorem 1.3 from [2], for s > 10³⁰ (n²R²/r²) ln⁵(MnR/(rε)), the total variation distance of σ_s and P is less than ε, where r is such that the level set of f of probability 1/8 contains a ball of radius r, and R² ≥ E_f(|x − z_f|²), where z_f is the centroid of f. Now, each hit-and-run step requires us to iterate over the random variable domain boundaries, O(ñ), compute intersections with the inequality constraints, O(ñ k_B), and integrate over the line segment involving all factors, O(ñ m).

3.3 Improved sampling scheme

Our proposed sampling algorithm is an implementation of the hit-and-run MCMC scheme. However, the theoretical treatment presented above leaves two questions unaddressed: 1) How do we obtain the initial distribution σ? 2) The hit-and-run algorithm assumes that all sample points are strictly inside the polytope and bounded away from its boundary. How can we get out of corners if we do get stuck?

The theorem above assumes a suitable initial distribution σ; however, in practice, no such distribution is given. Lovász and Vempala also show that the hit-and-run scheme converges from a single starting point on uniform distributions under the condition that it does not lie on the boundary, at the expense of an additional factor of n in the number of steps to be taken (compare Theorem 1.1 and Corollary 1.2 in [2]). We follow this approach and use a MAP state x_MAP of the distribution P as the single starting point for the sampling algorithm.
Choosing a MAP state as the starting point has two advantages: 1) we are guaranteed that x_MAP is an interior point, and 2) it is the point with the highest probability density and therefore the highest probability mass in a small local neighborhood. However, starting from a MAP state elevates the importance of the second question, since the MAP state often lies exactly on the boundary of the polytope, and therefore we are likely to start the sampling algorithm from a vertex of the polytope. The problem with corner points p is that most of the directions sampled uniformly at random will lead to line segments of zero length, and hence we do not move between iterations. Let W be the subset of inequality constraints B that are "active" at the corner point p and b the corresponding entries in b, i.e., W p = b (since all constraints are linear, we abuse notation and consider B, W to be matrices). In other words, the hyperplanes corresponding to the constraints in W intersect in p. Now, for all directions d ∈ R^n such that there exist active constraints W_i, W_j with W_i d < 0 and W_j d > 0, the line segment through p induced by d must necessarily have length 0. It also follows that more active constraints increase the likelihood of getting stuck in a corner.

For example, in Figure 3 b) the point x_MAP in the upper left-hand corner denotes the MAP state of the distribution defined in our running example. If we generate a direction uniformly at random, only 1/4 of those will be feasible; that is, for all others we won't be able to move away from x_MAP. To avoid the problem of repeatedly sampling infeasible directions at corner points, we propose to restrict the sampling of directions to feasible directions only when we determine that a corner point has been reached.

²The O* notation ignores logarithmic factors and the dependence on other parameters like error bounds.
We define a corner point p as a point inside the polytope where the number of active constraints is above some threshold θ.³ A direction d is feasible if W d < 0. Assuming that there are a active constraints at corner point p (i.e., W has a rows), we sample each entry of the a-dimensional vector z from −|N(0, 1)|, where N(0, 1) is the standard Gaussian distribution with zero mean and unit variance. Now, we try to find directions d such that W d ≤ z.

A number of algorithms have been proposed to solve such systems of linear inequalities for feasible points d. In our sampling algorithm we implement the relaxation method introduced by Agmon [11] and Motzkin and Schoenberg [12] due to its simplicity. The relaxation method proceeds as follows: We start with d_0 = 0. At each iteration we check if W d_i ≤ z; if so, we have found a solution and terminate. If not, we choose the most "violated" inequality constraint W_k from W, i.e., the row vector W_k from W which maximizes (W_k d_i − z_k)/||W_k||, and update the direction,

d_{i+1} = d_i + 2 (z_k − W_k d_i)/||W_k||² W_k^T

The relaxation method is guaranteed to terminate, since a feasible direction d always exists [12].

3.4 Sampling algorithm

Algorithm CCMRF Sampling
Input: CCMRF specified by RVs X with domains D = [0,1]^n, equality constraints A(x) = a, inequality constraints B(x) ≤ b, potential functions φ, parameters Λ
Output: Marginal probability density histograms H[X_i] : [0,1] → R_+, ∀X_i ∈ X

1   if A = ∅
2     P ← 1_|X|
3     n' ← n
4   else
5     r ← rank(A)
6     [U, Σ, V] ← svd(A)
7     P ← V|columns:[r+1,n]
8     n' ← n − r
9   x_0 ← MAP(A(x) = a, B(x) ≤ b, φ)
10  cornered ← FALSE
11  for j = 0 to ρ
12    if cornered
13      d ← 0
14      W ← B|rows:active × P
15      z ← z_i ∼ −|N(0,1)| ∀i = 1…n'
16      while ∃k : W_k d − z_k > 0
17        v ← argmax_k (W_k d − z_k)/||W_k||
18        d ← d + 2 (z_v − W_v d)/||W_v||² W_v^T
19      cornered ← FALSE
20    else
21      d ← d_i ∼ N(0,1) ∀i = 1…n'
22      d ← d/||d||
23    d ← P × d
24    active ← ∅
25    α_low ← −∞, α_high ← ∞
26    c_d ← B × d ; c_x ← B × x_j
27    for i = 1…|rows(B)|
28      if c_d,i ≠ 0
29        a ← (b_i − c_x,i)/c_d,i
30        if c_d,i > 0 then α_high ← min(α_high, a)
31        if c_d,i < 0 then α_low ← max(α_low, a)
32        if a = 0 then active ← active ∪ {i}
33    if α_high − α_low = 0 ∧ |active| > θ
34      cornered ← TRUE
35      continue
36    M ← map : [0,1] → R × R
37    for φ_i = max(0, o_i · x + q_i) ∈ φ
38      r ← λ_i (o_i · d)
39      c ← λ_i (o_i · x_j + q_i)
40      a ← −c/r
41      if r > 0 ∧ a < α_high
42        M(max(a, α_low)) ← M(max(a, α_low)) + [r, c]
43      else if r < 0 ∧ a > α_low
44        M(α_low) ← M(α_low) + [r, c]
45        if a < α_high then M(a) ← M(a) + [−r, −c]
46      else M(α_low) ← M(α_low) + [0, c]
47    [r_a, c_a] ← Σ_{a' ≤ a} M(a') for each breakpoint a
48    Σ_a ← unnormalized CDF at a, by integrating exp(−(r_a' t + c_a')) over each subinterval and summing
49    s ∼ U[0, Σ_{α_high}]
50    find the last breakpoint a with Σ_a ≤ s
51    α ← −(log(−s r_a + r_a Σ_a + e^{−c_a − r_a a}) + c_a)/r_a
52    x_{j+1} ← x_j + α d
53    if j > ρ/100
54      H[i][x_{j+1,i}] ← H[i][x_{j+1,i}] + 1 ∀i = 1…n

Figure 1: Constrained continuous MRF sampling algorithm

³We used θ = 2 in our experiments.

Putting the pieces together, we present the marginal distribution sampling algorithm in Figure 1. The inputs to the algorithm were discussed in Section 1.
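The relaxation step used to find feasible directions at corner points (the W d ≤ z subproblem) can be sketched as follows; this is our illustration of the Agmon / Motzkin-Schoenberg iteration described above, with names and the example system chosen by us:

```python
# Sketch of the relaxation method: repeatedly reflect d across the most
# violated inequality until W d <= z holds. A feasible d always exists in
# the setting of the paper, so the loop terminates.

def relaxation_direction(W, z, max_iter=10000):
    n = len(W[0])
    d = [0.0] * n                       # d_0 = 0
    for _ in range(max_iter):
        # most violated row: maximize (W_k d - z_k) / ||W_k||
        worst, worst_val = None, 0.0
        for k, (Wk, zk) in enumerate(zip(W, z)):
            nrm = sum(w * w for w in Wk) ** 0.5
            val = (sum(w * di for w, di in zip(Wk, d)) - zk) / nrm
            if val > worst_val:
                worst, worst_val = k, val
        if worst is None:               # W d <= z holds: feasible direction
            return d
        Wk, zk = W[worst], z[worst]
        nrm2 = sum(w * w for w in Wk)
        step = 2 * (zk - sum(w * di for w, di in zip(Wk, d))) / nrm2
        d = [di + step * w for di, w in zip(d, Wk)]   # d_{i+1} update rule
    return d

# toy active-constraint system with negative right-hand sides, as in the text
W = [[1.0, 1.0], [-1.0, 0.5]]
z = [-0.5, -0.25]
d = relaxation_direction(W, z)
feasible = all(sum(w * di for w, di in zip(Wk, d)) <= zk + 1e-9
               for Wk, zk in zip(W, z))
```

On this toy system the iteration terminates after two reflections; the returned d satisfies both inequalities strictly.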
In addition, we assume that the domain restrictions D_i = [l, u] for the random variables X_i are encoded as pairs of linear inequality constraints l ≤ x_i ≤ u in B, b. The algorithm first analyzes the equality constraints A to determine the number of "free" random variables and reduce the dimensionality accordingly. The singular value decomposition of A is used to determine the n × n' projection matrix P which maps from the null space of A to the original space D, where n' = n − rank(A) is the dimensionality of the null space. If no equality constraints have been specified, P is the n-dimensional identity matrix. Next, the algorithm determines a MAP state x_0 of the density function defined by the CCMRF, which is the point with the highest probability density, that is, x_0 = argmax_{x ∈ D̃} f(x). Since Z(Λ) is constant and the logarithm monotonic, this is identical to x_0 = argmin_{x ∈ D̃} Σ_{j=1}^m λ_j φ_j(x). Hence, computing a MAP state can be cast as a linear optimization problem, since all constraints are linear and the potential functions are maxima of two linear functions. Linear optimization problems can be solved efficiently in time O(n^3.5) and are very fast in practice.

After determining the null space and starting point, we begin collecting ρ samples. If we detected being stuck in a corner during the previous iteration, we sample a direction d from the feasible subspace of all possible directions in the reduced null space using the adapted relaxation method described above (lines 13-19). Otherwise, we sample a direction uniformly at random from the null space of A. We then normalize the direction and project it back into our original domain D by matrix multiplication with P. The projection ensures that all equality constraints remain satisfied as we move along the direction d.
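Because each potential is a hinge max(0, o·x + q), the MAP objective is convex piecewise linear; the standard LP reformulation introduces one slack variable per hinge. The sketch below (ours, not the paper's code) verifies the MAP objective for the running example by coarse grid search instead of calling an LP solver:

```python
# MAP objective of the running example (Example 1):
# minimize  1*x1 + 2*max(0, x1 - x2) + 1*max(0, x2 - x3)
# subject to x1 + x3 <= 1 on [0,1]^3. A real implementation would instead
# solve the equivalent LP:  min sum_j lambda_j * t_j
#                           s.t. t_j >= 0, t_j >= o_j . x + q_j.

def objective(x1, x2, x3):
    return 1 * x1 + 2 * max(0.0, x1 - x2) + 1 * max(0.0, x2 - x3)

best, best_x = float("inf"), None
G = 21   # grid resolution on [0, 1]
for i in range(G):
    for j in range(G):
        for k in range(G):
            x1, x2, x3 = i / (G - 1), j / (G - 1), k / (G - 1)
            if x1 + x3 <= 1:                 # inequality constraint
                v = objective(x1, x2, x3)
                if v < best:
                    best, best_x = v, (x1, x2, x3)
# the minimum value 0 is attained, e.g. at x = (0, 0, 0)
```

The optimum is not unique here (any point with x1 = 0 and x2 ≤ x3 attains 0), which matches the observation that MAP states often lie on the boundary of the polytope.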
Next, we compute the segment of the line l : x_j + αd inside the polytope defined by the inequality constraints B (lines 25-32). Iterating over all inequality constraints, we determine the value of α where l intersects constraint i. We keep track of the largest negative and smallest positive values to define the bounds [α_low, α_high] such that the line segment is defined exactly by those values of α inside this interval. In addition, we determine all active constraints, i.e., those constraints where the current sample point x_j is the point of intersection and hence α = 0. If the interval [α_low, α_high] is 0, then we are currently sitting in a corner. If, in addition, the number of active constraints exceeds some threshold θ, we are stuck in a corner and abort the current iteration to start over with restricted direction sampling.

In lines 36-48 we compute the cumulative density function of the probability P over the line segment l with α ∈ [α_low, α_high]. Based on our assumption in Section 2, the sum of potential functions S = Σ_{i=1}^m λ_i φ_i restricted to the line l is a continuous piecewise linear function. In order to integrate the density function, we need to segment S into its differentiable parts, so we start by determining the subintervals of [α_low, α_high] where S is linear and differentiable and can therefore be described by S = rα + c. We compute the slope r and y-intercept c for each potential function individually, as well as the point of undifferentiability a where the line crosses 0. We use a map M to store the line description [r, c] with the point of intersection a (lines 36-46). Then, we compute the aggregate slope r_α and y-intercept c_α of the sum of all potentials for each point of undifferentiability a (line 47) and use this information to compute the unnormalized cumulative density function by integrating over each subinterval and summing those up in Σ_α (line 48).
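On a single subinterval where S(α) = rα + c, both the integral of exp(−S) and its inverse have closed forms; the sketch below is our own derivation of this per-segment step (names ours), checked by a round trip:

```python
import math

# On one differentiable subinterval [a0, a1] with S(alpha) = r*alpha + c,
# the unnormalized CDF increment is the closed-form integral of exp(-S),
# and alpha is recovered from a uniform draw s by inverting it analytically.

def seg_mass(r, c, a0, a1):
    """integral of exp(-(r*t + c)) over [a0, a1]."""
    if r == 0:
        return (a1 - a0) * math.exp(-c)
    return (math.exp(-(r * a0 + c)) - math.exp(-(r * a1 + c))) / r

def invert(r, c, a0, s):
    """solve integral_{a0}^{alpha} exp(-(r*t + c)) dt = s for alpha."""
    if r == 0:
        return a0 + s * math.exp(c)
    # exp(-(r*alpha + c)) = exp(-(r*a0 + c)) - r*s
    return -(math.log(math.exp(-(r * a0 + c)) - r * s) + c) / r

# sanity round trip on one exponential segment (toy coefficients)
r, c, a0, a1 = 2.0, 0.5, -0.25, 0.75
total = seg_mass(r, c, a0, a1)
s = 0.3 * total
alpha = invert(r, c, a0, s)
recovered = seg_mass(r, c, a0, alpha)   # should equal s
```

Summing `seg_mass` over consecutive subintervals yields the Σ_α accumulator of line 48, and `invert` applied within the subinterval containing s realizes the inversion in lines 50-51.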
Now, Σ_a/Σ_{α_high} gives the cumulative probability mass for all points of undifferentiability a, which define the subintervals. Next, we sample a number s from the interval [0, Σ_{α_high}] uniformly at random (line 49) and compute α such that Σ_α = s (lines 50-51). Finally, we move to the new sample point x_{j+1} = x_j + αd and add it to the histogram which approximates the marginal densities, if the number of steps taken so far exceeds the burn-in period, which we configured to be 1% of the total number of steps.

3.5 Generalizing to convex continuous MRFs

In our treatment so far, we made specific assumptions about the constraints and potential functions. More generally, Theorem 2 holds when the inequality constraints as well as the potential functions are convex. A system of inequality constraints is convex if the set of all points that satisfy the constraints is convex, that is, any line connecting two points in the set is completely contained in the set.

Our algorithm needs to be modified where we currently assume linearity. Firstly, computing a MAP state requires general convex optimization. Secondly, our method for finding feasible directions when caught in a corner of the polytope needs to be adapted to the case of arbitrary convex constraints. One simple approach is to use the tangent hyperplane at the point x_j as an approximation to the actual constraint and proceed as is. Similarly, we need to modify the computation of intersection points between the line and the convex constraints, as well as how we determine the points of undifferentiability. Lastly, the computation of integrals over subintervals for the potential functions requires knowledge of the form of the potential functions to be solved analytically, or they need to be approximated efficiently.
The algorithm can handle arbitrary domains for the random variables as long as they are connected subintervals of R.

4 Experiments

This section presents an empirical evaluation of the proposed sampling algorithm on the problem of category prediction for Wikipedia documents based on similarity. After describing the data and the experimental methodology, we demonstrate that the computed marginal distributions effectively predict document categories. Moreover, we show that analysis of the marginal distribution provides an indicator of the confidence in those predictions. Finally, we investigate the convergence rate and runtime performance of the algorithm in detail.

For our evaluation dataset, we collected all Wikipedia articles that appeared in the featured list⁴ for a two-week period in Oct. 2009, thus obtaining 2460 documents. Of these, we considered a subset of 1717 documents assigned to the 7 most popular categories. After stemming and stop-word removal, we represented the text of each document as a tf/idf-weighted word vector. To measure the similarity between documents, we used the popular cosine metric on the weighted word vectors. The data contains the relation Link(fromDoc, toDoc), which establishes a hyperlink between two documents. We used K-fold cross-validation for K = 20, 25, 30, 35 by splitting the dataset into K non-overlapping subsets, each of which is determined using snowball sampling over the link structure from a randomly chosen initial document. For each training and test data subset, we randomly designate 20% of the documents as "seed documents" of which the category is observed, and the goal is to predict the categories of the remaining documents.
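The tf/idf and cosine-similarity representation described above can be sketched minimally as follows (a toy tokenization of our own; the paper's pipeline additionally stems and removes stop words):

```python
import math
from collections import Counter

# Minimal tf/idf-weighted word vectors and cosine similarity between them.

def tfidf_vectors(docs):
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for doc in tokenized for t in set(doc))   # document freq.
    N = len(docs)
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["markov random field model",
        "continuous markov field",
        "wikipedia article links"]
v = tfidf_vectors(docs)
sim01 = cosine(v[0], v[1])   # overlapping vocabulary -> positive similarity
sim02 = cosine(v[0], v[2])   # disjoint vocabulary -> zero similarity
```

These pairwise similarities are exactly the kind of soft truth values that instantiate the first rule of the PSL program in Table 1.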
All experiments were executed on identical hardware powered by two Intel Xeon Quad Core 2.3 GHz processors and 8 GB of RAM.

4.1 Classification results

K | Baseline | Marginals | Improvement
20 | 39.5% | 55.8% | 41.4%
25 | 39.1% | 51.5% | 31.7%
30 | 36.7% | 51.1% | 39.1%
35 | 38.8% | 56.6% | 46.1%

K | P(Null Hypothesis) | Relative Difference Δ(σ)
20 | 1.95E-09 | 38.3%
25 | 2.40E-13 | 41.2%
30 | <1.00E-16 | 43.5%
35 | 4.54E-08 | 39.0%

Table 2: a) Classification accuracy; b) standard deviation as an indicator for confidence.

The baseline method uses only the document content by propagating document categories via textual similarity measured by the cosine distance. Using rules and constraints similar to those presented in Table 1, we create a joint probabilistic model for collective classification of Wikipedia documents. We use PSL twofold in this process: firstly, PSL constructs the CCMRF by grounding the rules and constraints against the given data as described in Section 2, and secondly, we use the perceptron weight learning method provided by PSL to learn the free parameters of the CCMRF from the training data (see [3] for more detail). The sampling algorithm takes the constructed CCMRF and learned parameters as input and computes the marginal distributions for all random variables from 3 million samples. We have one random variable to represent the similarity for each possible document-category pair, that is, one RV for each grounding of the category predicate. For each document D we pick the category C with the highest expected similarity as our prediction. The accuracy in prediction of both methods is compared in Table 2 a) over the 4 different splits of the data. We observe that the collective probabilistic model outperforms the baseline by up to 46%.
All results are statistically significant at p = 0.02.

4 http://en.wikipedia.org/wiki/Wikipedia:Featured_lists; see [3] for more information on the dataset

Figure 3: a) KL Divergence by sample size  b) Runtime for 1000 samples

While this result suggests that the sampling algorithm works in practice, it is neither surprising nor novel, since similar results for collective classification have been produced before using other approaches in statistical relational learning (e.g., compare [13]). However, the marginal distributions we obtain provide additional information beyond a simple point estimate of the expected value. In particular, we show that the standard deviation of the marginals can serve as an indicator of the confidence in a particular classification prediction. In order to show this, we compute the standard deviation of the marginal distributions for those random variables picked during the prediction stage for each fold. We separate those values into two sets, S+, S−, based on whether the prediction turned out to be correct (+) or incorrect (−) when evaluated against the ground truth. Let σ+, σ− denote the average standard deviation for those values in S+, S− respectively. Our hypothesis is that we have higher confidence in the correct predictions, that is, σ+ will typically be smaller than σ−. In other words, we hypothesize that the relative difference between the average deviations, ∆(σ) = 2(σ− − σ+)/(σ+ + σ−), is larger than 0. Under the corresponding null hypothesis, we would expect any difference in average standard deviation, and therefore any nonzero ∆(σ), to be purely coincidental or noise.
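The confidence statistic ∆(σ) = 2(σ− − σ+)/(σ+ + σ−) can be computed directly from the two sets of standard deviations; the following is a minimal sketch with illustrative names.

```python
# S+ holds the marginal standard deviations of correct predictions,
# S- those of incorrect predictions; Delta(sigma) is the relative
# difference of their averages and is positive when incorrect
# predictions have wider (less confident) marginals.
def delta_sigma(stds_correct, stds_incorrect):
    """Relative difference Delta(sigma) between average std. deviations."""
    sp = sum(stds_correct) / len(stds_correct)      # sigma_+
    sm = sum(stds_incorrect) / len(stds_incorrect)  # sigma_-
    return 2 * (sm - sp) / (sp + sm)
```

By construction the statistic is symmetric in scale: doubling all standard deviations leaves ∆(σ) unchanged, so it compares relative rather than absolute spread.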
Assuming that such noise in the ∆(σ)'s, which we computed for each fold, can be approximated by a Gaussian distribution with 0 mean and unknown variance5, we test the null hypothesis using a two-tailed Z-test with the observed sample variance. The Z-test scores on the 4 differently sized splits are reported in Table 2 b) and allow us to reject the null hypothesis with very high confidence. Table 2 b) also lists ∆(σ) for each split, averaged across the multiple folds, and shows that σ− is about 40% larger than σ+ on average.

4.2 Algorithm performance

In investigating the performance of the sampling algorithm we are mainly interested in two questions: 1) How many samples does it take to converge on the marginal density functions? and 2) What is the computational cost of sampling? To answer the first question, we collect independent samples of varying size from 31 thousand to 2 million, as well as one reference sample with 3 million steps, for all folds. For each of the former samples we compare the marginals thus obtained to those of the reference sample by measuring their KL divergence. To compute the KL divergence we discretize the density function using a histogram with 10 bins. The center line in Figure 3 a) shows the average KL divergence with respect to the sample size across all folds. To study the impact of dimensionality on convergence, we order the folds by the number of random variables n and show the average KL divergence for the lowest and highest quartiles, which contain 174–224 and 322–413 random variables respectively. The plot is drawn in log-log scale and suggests that each order-of-magnitude increase in sample size yields an order-of-magnitude improvement in KL divergence. To answer the second question, Figure 3 b) displays the time needed to generate 1000 samples with respect to the number of potential functions in the CCMRF.
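The convergence measurement above can be sketched as follows: both sample sets are discretized into 10-bin histograms and compared via KL divergence. The [0, 1] domain default and the additive smoothing used to avoid empty bins are our own assumptions; the paper does not specify how zero counts are handled.

```python
# Histogram-based KL divergence between a marginal estimated from a smaller
# sample and the reference marginal estimated from the 3-million-step sample.
import math

def histogram(samples, lo, hi, bins=10, eps=1e-9):
    """Normalized bin frequencies with a small smoothing constant eps."""
    counts = [eps] * bins
    width = (hi - lo) / bins
    for x in samples:
        i = min(int((x - lo) / width), bins - 1)  # clamp x == hi into last bin
        counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(samples, reference, lo=0.0, hi=1.0, bins=10):
    """KL(P || Q) where P, Q are binned densities of the two sample sets."""
    p = histogram(samples, lo, hi, bins)
    q = histogram(reference, lo, hi, bins)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical sample sets yield a divergence of zero, and the divergence shrinks as the smaller sample's histogram approaches the reference histogram, which is the quantity plotted against sample size in Figure 3 a).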
Computing the induced probability density function along the sampled line segment dominates the cost of each sampling step, and the graph shows that this cost grows linearly with the number of potential functions.

5 Conclusion

We have presented a novel approximation scheme for computing marginal probabilities over constrained continuous MRFs based on recent results in computational geometry and discussed techniques to improve its efficiency. We introduced an effective sampling algorithm and verified its performance in an empirical evaluation. To our knowledge, this is the first study of the theoretical, practical, and empirical aspects of marginal computation in general constrained continuous MRFs. While our initial results are quite promising, there are still many further directions for research, including improved scalability, applications to other probabilistic inference problems, and using the confidence values to improve the prediction accuracy.

5 Even if the standard deviations in S+, S− are not normally distributed, the central limit theorem implies that their averages will eventually follow a normal distribution under independence assumptions.

Acknowledgment

We thank Stanley Kok, Stephan Bach, and the anonymous reviewers for their helpful comments and suggestions.
This material is based upon work supported by the National Science Foundation under Grant No. 0937094. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

[1] E. B. Sudderth. Graphical models for visual object recognition and tracking. Ph.D. thesis, Massachusetts Institute of Technology, 2006.

[2] L. Lovasz and S. Vempala. Hit-and-run from a corner. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 310-314, Chicago, IL, USA, 2004. ACM.

[3] M. Broecheler, L. Mihalkova, and L. Getoor. Probabilistic similarity logic. In Conference on Uncertainty in Artificial Intelligence, 2010.

[4] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1):107-136, 2006.

[5] K. Kersting and L. De Raedt. Bayesian logic programs. Technical report, Albert-Ludwigs University, 2001.

[6] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proceedings of UAI-02, 2002.

[7] M. Broecheler, G. Simari, and V. S. Subrahmanian. Using histograms to better answer queries to probabilistic logic programs. In Logic Programming, pages 40-54, 2009.

[8] M. E. Dyer and A. M. Frieze. On the complexity of computing the volume of a polyhedron. SIAM Journal on Computing, 17(5):967-974, October 1988.

[9] R. Kannan, L. Lovasz, and M. Simonovits. Random walks and an O*(n^5) volume algorithm for convex bodies. Random Structures and Algorithms, 11(1):1-50, 1997.

[10] R. L. Smith. Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions. Operations Research, 32(6):1296-1308, 1984.

[11] S. Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3):382-392, 1954.

[12] T. S. Motzkin and I. J. Schoenberg.
The relaxation method for linear inequalities. In I. J. Schoenberg: Selected Papers, page 75, 1988.

[13] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93-106, 2008.