{"title": "Preconditioner Approximations for Probabilistic Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1113, "page_last": 1120, "abstract": null, "full_text": "Preconditioner Approximations for Probabilistic Graphical Models\n\nPradeep Ravikumar John Lafferty School of Computer Science Carnegie Mellon University\n\nAbstract\nWe present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods.\n\n1\n\nIntroduction\n\nApproximate inference techniques are enabling sophisticated new probabilistic models to be developed and applied to a range of practical problems. One of the primary uses of approximate inference is to estimate the partition function and event probabilities for undirected graphical models, which are natural tools in many domains, from image processing to social network modeling. A central challenge is to improve the accuracy of existing approximation methods, and to derive rigorous rather than heuristic bounds on probabilities in such graphical models. In this paper, we present a simple new approach to the approximate inference problem, based upon non-iterative procedures that have low time complexity. 
We follow the variational mean field intuition of focusing on tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters as a generalized linear system problem. In this way, the task of deriving a tractable distribution conveniently reduces to the well-studied problem of obtaining a good preconditioner for a matrix (Boman and Hendrickson, 2003). This framework has the added advantage that tighter bounds can be obtained by reducing the sparsity of the preconditioners, at the expense of increasing the time complexity of computing the approximation. In the following section we establish notation and background. In Section 3, we outline the basic idea of our proposed framework, and explain how preconditioners can be used to derive tractable approximate distributions. In Sections 3.1 and 4, we then describe the underlying theory, which we call the generalized support theory for graphical models. In Section 5 we present experiments that compare the new approximation schemes to some of the standard variational and optimization based methods.\n\n2 Notation and Background\n\nConsider a graph G = (V, E), where V denotes the set of nodes and E denotes the set of edges. Let Xi be a random variable associated with node i, for i ∈ V, yielding a random vector X = {X1, ..., Xn}. Let φ = {φ_α, α ∈ I} denote the set of potential functions or sufficient statistics, for a set I of cliques in G. Associated with φ is a vector of parameters θ = {θ_α, α ∈ I}. With this notation, the exponential family of distributions of X, associated with φ and G, is given by\n\n    p(x; θ) = exp( Σ_{α ∈ I} θ_α φ_α(x) − Ψ(θ) ).    (1)\n\nFor historical reasons stemming from connections with statistical physics, Z = exp Ψ(θ) is called the partition function. As discussed in (Yedidia et al., 2001), at the expense of increasing the state space, one can assume without loss of generality that the graphical model is a pairwise Markov random field; that is, the set of cliques I is the set of edges {(s, t) ∈ E}.
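For concreteness, the partition function of a small model can be computed by direct enumeration. The sketch below assumes binary variables and the simple pairwise sufficient statistics φ_st(x_s, x_t) = x_s x_t, so that the exponent is the quadratic form xᵀθx; this is one illustrative choice of potentials, not the only one the framework allows:

```python
import itertools

import numpy as np

def log_partition(theta):
    """Brute-force log Z(theta) for a pairwise MRF over binary variables,
    with sufficient statistics phi_st(x_s, x_t) = x_s * x_t, so that
    <theta, phi(x)> = x^T theta x."""
    n = theta.shape[0]
    energies = []
    for bits in itertools.product([0.0, 1.0], repeat=n):
        x = np.array(bits)
        energies.append(x @ theta @ x)
    energies = np.array(energies)
    m = energies.max()
    return m + np.log(np.exp(energies - m).sum())  # stable log-sum-exp

# With theta = 0 every configuration has energy 0, so Z = 2^n.
print(log_partition(np.zeros((3, 3))))  # log 8, about 2.079
```

The exponential cost of this enumeration over all 2^n configurations is exactly the intractability that the bounds in later sections are designed to avoid.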
We shall assume a pairwise random field, and can thus express the potential function and parameter vectors more compactly as matrices:\n\n    Φ(x) := [ φ_11(x1, x1)  ...  φ_1n(x1, xn) ;  ...  ; φ_n1(xn, x1)  ...  φ_nn(xn, xn) ],    Θ := [ θ_11  ...  θ_1n ;  ...  ; θ_n1  ...  θ_nn ].    (2)\n\nIn the following we denote the trace of the product of two matrices A and B by the inner product ⟨A, B⟩. Assuming that each Xi is finite-valued, the partition function Z(Θ) is then given by Z(Θ) = Σ_x exp⟨Θ, Φ(x)⟩. The computation of Z(Θ) has complexity exponential in the tree-width of the graph G, and hence is intractable for large graphs. Our goal is to obtain rigorous upper and lower bounds on this partition function, which can then be used to obtain rigorous upper and lower bounds on general event probabilities; this is discussed further in (Ravikumar and Lafferty, 2004).\n\n2.1 Preconditioners in Linear Systems\n\nConsider a linear system Ax = c, where the variable x is n-dimensional and A is an n × n matrix with m non-zero entries. Solving for x via direct methods such as Gaussian elimination has computational complexity O(n^3), which is impractical for large n. Multiplying both sides of the linear system by the inverse of an invertible matrix B, we get an equivalent "preconditioned" system, B^{-1}Ax = B^{-1}c. If B is similar to A, then B^{-1}A is in turn similar to the identity matrix I, making the preconditioned system easier to solve. Such an approximating matrix B is called a preconditioner.
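The effect of preconditioning is easy to see numerically. The sketch below uses a simple diagonal (Jacobi) preconditioner, chosen purely for illustration; the tree-based preconditioners discussed later are more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# A symmetric positive definite system matrix whose diagonal varies over
# several orders of magnitude, making A badly conditioned.
d = np.logspace(0, 4, n)
E = 0.01 * rng.standard_normal((n, n))
A = np.diag(d) + (E + E.T) / 2

# Jacobi preconditioner: B = diag(A).  The preconditioned matrix B^{-1} A
# is close to the identity, so its condition number is far smaller.
B = np.diag(np.diag(A))
print(np.linalg.cond(A))                      # large (order 10^4)
print(np.linalg.cond(np.linalg.solve(B, A)))  # close to 1
```

Solving with B must itself be cheap for preconditioning to pay off, which is why sparse (diagonal or tree-structured) preconditioners are the ones of practical interest.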
The computational complexity of preconditioned conjugate gradient is given by\n\n    T(A) = √κ(A, B) · (m + T(B)) · log(1/ε),    (3)\n\nwhere T(A) is the time required for an ε-approximate solution; κ(A, B) is the condition number of the pair (A, B), which intuitively corresponds to the quality of the approximation B; and T(B) is the time required to solve By = c.\n\nRecent developments in the theory of preconditioners are in part based on support graph theory, where the linear system matrix is viewed as the Laplacian of a graph, and graph-based techniques can be used to obtain good approximations. While these methods require diagonally dominant matrices (A_ii ≥ Σ_{j≠i} |A_ij|), they yield "ultra-sparse" (tree plus a constant number of edges) preconditioners with a low condition number. In our experiments, we use two elementary tree-based preconditioners in this family: Vaidya's spanning tree preconditioner (Vaidya, 1990) and Gremban and Miller's support tree preconditioner (Gremban, 1996).\n\n3 Graphical Model Preconditioners\n\nOur proposed framework follows the generalized mean field intuition of looking at sparse graph approximations of the original graph, but solves a different optimization problem. We begin by outlining the basic idea, and then develop the underlying theory. Consider the graphical model with graph G, potential-function matrix Φ(x), and parameter matrix Θ. For purposes of intuition, think of the graphical model "energy" ⟨Θ, Φ(x)⟩ as the quadratic form xᵀΘx. We would like to obtain a sparse approximation B of Θ. If B approximates Θ well, then the condition number\n\n    κ(Θ, B) = λ_max(Θ, B) / λ_min(Θ, B) = max_x (xᵀΘx / xᵀBx) / min_x (xᵀΘx / xᵀBx)    (4)\n\nis small. This suggests the following procedure for approximate inference. First, choose a matrix B that minimizes the condition number with Θ (rather than the KL divergence, as in mean field). Then, scale B appropriately, as detailed in the following sections. Finally, use the scaled matrix B as the parameter matrix for approximate inference.
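Vaidya's preconditioner is built from a maximum-weight spanning tree of the graph. A minimal sketch of that combinatorial step (Kruskal's algorithm with union-find; the edge list below is a made-up example, not data from the paper):

```python
def max_spanning_tree(n, edges):
    """Return a maximum-weight spanning tree of an n-node graph.

    edges: iterable of (weight, u, v) tuples.  Kruskal's algorithm:
    scan edges from heaviest to lightest, keeping an edge whenever it
    joins two previously unconnected components.
    """
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# A 4-cycle with one chord; the tree keeps the three heaviest usable edges.
edges = [(5, 0, 1), (1, 1, 2), (4, 2, 3), (3, 3, 0), (2, 0, 2)]
print(max_spanning_tree(4, edges))  # [(5, 0, 1), (4, 2, 3), (3, 3, 0)]
```

In the graphical model setting, the edge weights would come from the parameter matrix Θ, and the returned tree defines the sparsity pattern of the candidate matrix B.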
Note that if B corresponds to a tree, approximate inference has linear time complexity.\n\n3.1 Generalized Eigenvalue Bounds\n\nGiven a graphical model with graph G, potential-function matrix Φ(x), and parameter matrix Θ, our goal is to obtain parameter matrices Θ_U and Θ_L, corresponding to sparse graph approximations of G, such that\n\n    Z(Θ_L) ≤ Z(Θ) ≤ Z(Θ_U).    (5)\n\nThat is, the partition functions of the sparse graph parameter matrices Θ_U and Θ_L are upper and lower bounds, respectively, on the partition function of the original graph. We will, however, focus on a seemingly much stronger condition; in particular, we will look for Θ_L and Θ_U that satisfy\n\n    ⟨Θ_L, Φ(x)⟩ ≤ ⟨Θ, Φ(x)⟩ ≤ ⟨Θ_U, Φ(x)⟩    (6)\n\nfor all x. By monotonicity of exp, this stronger condition implies condition (5) on the partition function, by summing over the values of X. Moreover, this stronger condition gives us greater flexibility, and yields rigorous bounds on general event probabilities, since then\n\n    exp⟨Θ_L, Φ(x)⟩ / Z(Θ_U) ≤ p(x; Θ) ≤ exp⟨Θ_U, Φ(x)⟩ / Z(Θ_L).    (7)\n\nIn contrast, while variational methods give bounds on the log partition function, the derived bounds on general event probabilities via the variational parameters are only heuristic. Let S be a set of sparse graphs; for example, S may be the set of all trees. Focusing on the upper bound, we would like to obtain a graph in S with parameter matrix B, which approximates G, and whose partition function upper bounds the partition function of the original graph. Following (6), we require\n\n    ⟨Θ, Φ(x)⟩ ≤ ⟨B, Φ(x)⟩ for all x,  such that G(B) ∈ S,    (8)\n\nwhere G(B) denotes the graph corresponding to the parameter matrix B. Now, we would like the distribution corresponding to B to be as close as possible to the distribution corresponding to Θ; that is, ⟨B, Φ(x)⟩ should not only upper bound ⟨Θ, Φ(x)⟩ but should be close to it. The distance measure we use for this is the minimax distance.
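The implication from the pointwise condition (6) to the partition function bounds (5) and the probability bounds (7) can be sanity-checked by enumeration on a tiny model. The sketch below again assumes binary variables with φ_st(x_s, x_t) = x_s x_t, and builds Θ_U and Θ_L by adding and subtracting an entrywise-nonnegative perturbation (which can only raise or lower xᵀΘx when x ≥ 0); these are illustrative choices, not the optimized matrices derived below:

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
n = 4
theta = rng.standard_normal((n, n))
delta = rng.random((n, n))          # entrywise nonnegative perturbation
theta_U, theta_L = theta + delta, theta - delta

def energies(t):
    # <Theta, Phi(x)> = x^T Theta x over all binary configurations x
    return np.array([np.array(x) @ t @ np.array(x)
                     for x in itertools.product([0.0, 1.0], repeat=n)])

eL, e, eU = energies(theta_L), energies(theta), energies(theta_U)
assert np.all(eL <= e) and np.all(e <= eU)          # condition (6)

ZL, Z, ZU = np.exp(eL).sum(), np.exp(e).sum(), np.exp(eU).sum()
assert ZL <= Z <= ZU                                # condition (5)

p = np.exp(e) / Z
assert np.all(np.exp(eL) / ZU <= p)                 # condition (7), lower
assert np.all(p <= np.exp(eU) / ZL)                 # condition (7), upper
print("bounds verified")
```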
In other words, while the upper bound requires that\n\n    ⟨B, Φ(x)⟩ / ⟨Θ, Φ(x)⟩ ≥ 1 for all x,    (9)\n\nwe would like\n\n    min_x ⟨Θ, Φ(x)⟩ / ⟨B, Φ(x)⟩    (10)\n\nto be as high as possible. Expressing these desiderata in the form of an optimization problem, we have\n\n    B* = argmax_{B : G(B) ∈ S} min_x ⟨Θ, Φ(x)⟩ / ⟨B, Φ(x)⟩,  such that ⟨Θ, Φ(x)⟩ / ⟨B, Φ(x)⟩ ≤ 1 for all x.\n\nBefore solving this problem, we first make some definitions, which are generalized versions of standard concepts in linear systems theory.\n\nDefinition 3.1. For a pairwise Markov random field with potential function matrix Φ(x), the generalized eigenvalues of a pair of parameter matrices (A, B) are defined as\n\n    λ_max(A, B) = max_{x : ⟨B, Φ(x)⟩ ≠ 0} ⟨A, Φ(x)⟩ / ⟨B, Φ(x)⟩,    (11)\n    λ_min(A, B) = min_{x : ⟨B, Φ(x)⟩ ≠ 0} ⟨A, Φ(x)⟩ / ⟨B, Φ(x)⟩.    (12)\n\nNote that, for a scalar α > 0,\n\n    λ_max(A, αB) = max_{x : ⟨B, Φ(x)⟩ ≠ 0} ⟨A, Φ(x)⟩ / (α ⟨B, Φ(x)⟩)    (13)\n               = α^{-1} λ_max(A, B).    (14)\n\nWe state the basic properties of the generalized eigenvalues in the following lemma.\n\nLemma 3.2. The generalized eigenvalues satisfy\n\n    λ_min(A, B) ≤ ⟨A, Φ(x)⟩ / ⟨B, Φ(x)⟩ ≤ λ_max(A, B),    (15)\n    λ_max(A, αB) = α^{-1} λ_max(A, B),    (16)\n    λ_min(A, αB) = α^{-1} λ_min(A, B),    (17)\n    λ_min(A, B) = 1 / λ_max(B, A).    (18)\n\nIn the following, we use A to generically denote the parameter matrix of the model. We can now rewrite the optimization problem for the upper bound as\n\n    (Problem 1)    max_{B : G(B) ∈ S} λ_min(A, B),  such that λ_max(A, B) ≤ 1.    (19)\n\nWe shall express the optimal solution of Problem 1 in terms of the optimal solution of a companion problem. Toward that end, consider the optimization problem\n\n    (Problem 2)    min_{C : G(C) ∈ S} λ_max(A, C) / λ_min(A, C).    (20)\n\nThe following proposition shows the sense in which these problems are equivalent.\n\nProposition 3.3. If C* attains the optimum in Problem 2, then C̃ = λ_max(A, C*) C* attains the optimum of Problem 1.\n\nProof. For any feasible solution B of Problem 1, we have\n\n    λ_min(A, B) ≤ λ_min(A, B) / λ_max(A, B)    (since λ_max(A, B) ≤ 1)    (21)\n               ≤ λ_min(A, C*) / λ_max(A, C*)    (since C* is the optimum of Problem 2)    (22)\n               = λ_min(A, λ_max(A, C*) C*)    (from Lemma 3.2)    (23)\n               = λ_min(A, C̃).    (24)\n\nThus, C̃ upper bounds all feasible solutions of Problem 1. However, C̃ is itself feasible, since, from Lemma 3.2,\n\n    λ_max(A, C̃) = λ_max(A, λ_max(A, C*) C*) = λ_max(A, C*) / λ_max(A, C*) = 1.    (25)\n\nThus, C̃ attains the maximum in the upper bound Problem 1.\n\nThe analysis for obtaining an upper bound parameter matrix B for a given parameter matrix A carries over to the lower bound; we need only replace the maximin problem with a minimax problem. For the lower bound, we want a matrix B such that\n\n    B* = argmin_{B : G(B) ∈ S} max_{x : ⟨B, Φ(x)⟩ ≠ 0} ⟨A, Φ(x)⟩ / ⟨B, Φ(x)⟩,  such that ⟨A, Φ(x)⟩ / ⟨B, Φ(x)⟩ ≥ 1.    (26)\n\nThis leads to the following lower bound optimization problem.\n\n    (Problem 3)    min_{B : G(B) ∈ S} λ_max(A, B),  such that λ_min(A, B) ≥ 1.    (27)\n\nThe proof of the following statement closely parallels the proof of Proposition 3.3.\n\nProposition 3.4. If C* attains the optimum in Problem 2, then Ĉ = λ_min(A, C*) C* attains the optimum of the lower bound Problem 3.\n\nFinally, we state the following basic lemma, whose proof is easily verified.\n\nLemma 3.5. For any pair of parameter matrices (A, B), we have\n\n    λ_min(A, B) ⟨B, Φ(x)⟩ ≤ ⟨A, Φ(x)⟩ ≤ λ_max(A, B) ⟨B, Φ(x)⟩.    (28)\n\n3.2 Main Procedure\n\nWe now have in place the machinery necessary to describe the procedure for solving the main problem in equation (6), that is, obtaining upper and lower bound matrices for a graphical model. Lemma 3.5 shows how to obtain upper and lower bound parameter matrices with respect to any matrix B, given a parameter matrix A, by solving a generalized eigenvalue problem. Propositions 3.3 and 3.4 tell us, in principle, how to obtain the optimal such upper and lower bound matrices. We thus have the following procedure. First, obtain a parameter matrix C such that G(C) ∈ S, which minimizes λ_max(Θ, C) / λ_min(Θ, C).
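The generalized eigenvalues of Definition 3.1 can be computed by direct enumeration on a small model, which also lets one check the identities of Lemma 3.2. The sketch below assumes binary variables with φ_st(x_s, x_t) = x_s x_t, so that ⟨M, Φ(x)⟩ = xᵀMx; this choice of potentials is illustrative:

```python
import itertools

import numpy as np

def gen_eigs(A, B):
    """Generalized eigenvalues lambda_max(A, B) and lambda_min(A, B) of
    Definition 3.1, by brute force over binary x with <M, Phi(x)> = x^T M x.
    Configurations with <B, Phi(x)> = 0 are excluded, as in the definition."""
    n = A.shape[0]
    ratios = []
    for bits in itertools.product([0.0, 1.0], repeat=n):
        x = np.array(bits)
        den = x @ B @ x
        if den != 0:
            ratios.append((x @ A @ x) / den)
    return max(ratios), min(ratios)

rng = np.random.default_rng(2)
A = rng.random((4, 4)) + 0.1   # entrywise positive, so x^T A x > 0 for x != 0
B = rng.random((4, 4)) + 0.1

lmax_AB, lmin_AB = gen_eigs(A, B)
lmax_BA, _ = gen_eigs(B, A)
assert np.isclose(lmin_AB, 1 / lmax_BA)          # Lemma 3.2, eq. (18)

alpha = 2.5
lmax_scaled, _ = gen_eigs(A, alpha * B)
assert np.isclose(lmax_scaled, lmax_AB / alpha)  # eqs. (13)-(14)
print(lmax_AB, lmin_AB)
```

The enumeration is exponential in n; the point of the support-number relaxations in Section 4 is precisely to avoid this maximization over configurations.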
Then λ_max(Θ, C) C gives the optimal upper bound parameter matrix, and λ_min(Θ, C) C gives the optimal lower bound parameter matrix. However, as things stand, this recipe appears to be even more challenging to work with than the generalized mean field procedures. The difficulty lies in obtaining the matrix C. In the following section we offer a series of relaxations that simplify this task.\n\n4 Generalized Support Theory for Graphical Models\n\nIn what follows, we begin by assuming that the potential function matrix is positive semi-definite, Φ(x) ⪰ 0, and later extend our results to general Φ.\n\nDefinition 4.1. For a pairwise MRF with potential function matrix Φ(x) ⪰ 0, the generalized support number of a pair of parameter matrices (A, B), where B ⪰ 0, is\n\n    σ(A, B) = min { τ ∈ R | ⟨τB, Φ(x)⟩ ≥ ⟨A, Φ(x)⟩ for all x }.    (29)\n\nThe generalized support number can be thought of as the "number of copies" of B required to "support" A, so that ⟨σ(A, B) B − A, Φ(x)⟩ ≥ 0. The usefulness of this definition is demonstrated by the following result.\n\nProposition 4.2. If B ⪰ 0, then λ_max(A, B) ≤ σ(A, B).\n\nProof. From the definition of the generalized support number for a graphical model, we have that ⟨σ(A, B) B − A, Φ(x)⟩ ≥ 0. Now, since we assume that Φ(x) ⪰ 0, if B ⪰ 0 then ⟨B, Φ(x)⟩ ≥ 0. Therefore, it follows that ⟨A, Φ(x)⟩ / ⟨B, Φ(x)⟩ ≤ σ(A, B), and thus\n\n    λ_max(A, B) = max_x ⟨A, Φ(x)⟩ / ⟨B, Φ(x)⟩ ≤ σ(A, B),    (30)\n\ngiving the statement of the proposition.\n\nThis leads to our first relaxation of the generalized eigenvalue bound for a model.
From Lemma 3.2 and Proposition 4.2 we see that\n\n    λ_max(A, B) / λ_min(A, B) = λ_max(A, B) λ_max(B, A) ≤ σ(A, B) σ(B, A).    (31)\n\nThus, this result suggests that to approximate the graphical model (G, Θ) we can search for a parameter matrix B, with corresponding simple graph G(B) ∈ S, such that\n\n    B* = argmin_B σ(Θ, B) σ(B, Θ).    (32)\n\nWhile this relaxation may already lead to effective bounds, we will now go further, and derive an additional relaxation that relates our generalized graphical model support number to the "classical" support number.\n\nProposition 4.3. For a potential function matrix Φ(x) ⪰ 0, we have σ(A, B) ≤ σ̄(A, B), where σ̄(A, B) = min { τ | τB − A ⪰ 0 } is the classical support number.\n\nProof. Since σ̄(A, B) B − A ⪰ 0 by definition, and Φ(x) ⪰ 0 by assumption, we have that ⟨σ̄(A, B) B − A, Φ(x)⟩ ≥ 0. Therefore, σ(A, B) ≤ σ̄(A, B) from the definition of the generalized support number.\n\nThe above result reduces the problem of approximating a graphical model to the problem of minimizing classical support numbers, a problem that is well studied in the scientific computing literature (Boman and Hendrickson, 2003; Bern et al., 2001), where the expression σ̄(A, C) σ̄(C, A) is called the condition number, and a matrix that minimizes it within a simple family of graphs is called a preconditioner. We can thus plug in any algorithm for finding a sparse preconditioner for Θ, carrying out the optimization\n\n    B* = argmin_B σ̄(Θ, B) σ̄(B, Θ),    (33)\n\nand then use that matrix B* in our basic procedure. One example is Vaidya's preconditioner (Vaidya, 1990), which is essentially the maximum spanning tree of the graph. Another is the support tree of Gremban (1996), which introduces Steiner nodes, that is, auxiliary nodes introduced via a recursive partitioning of the graph. We present experiments with these basic preconditioners in the following section. Before turning to the experiments, we comment that our generalized support number analysis assumed that the potential function matrix Φ(x) was positive semi-definite.
The case when it is not can be handled as follows. We first add a sufficiently large positive diagonal matrix D so that Φ'(x) = Φ(x) + D ⪰ 0. Then, for a given parameter matrix A, we use the above machinery to get an upper bound parameter matrix B satisfying ⟨A, Φ(x) + D⟩ ≤ ⟨B, Φ(x) + D⟩, that is,\n\n    ⟨A, Φ(x)⟩ ≤ ⟨B, Φ(x)⟩ + ⟨B − A, D⟩.    (34)\n\nExponentiating and summing both sides over x, we then get the required upper bound for the parameter matrix A; the same can be done for the lower bound.\n\n5 Experiments\n\nAs the previous sections detailed, the preconditioner based bounds are in principle quite easy to compute: we compute a sparse preconditioner for the parameter matrix (typically O(n) to O(n^3) time) and use the preconditioner as the parameter matrix for the bound computation (which is linear if the preconditioner matrix corresponds to a tree). This yields a simple, non-iterative, deterministic procedure, in contrast to the more complex propagation-based or iterative update procedures. In this section we evaluate these bounds on small graphical models for which exact answers can be readily computed, and compare the bounds to variational approximations. We show simulation results averaged over a randomly generated set of graphical models. The graphs used were 2D grid graphs, and the edge potentials were selected according to a uniform distribution Uniform(−2 d_coup, 0) for various coupling strengths d_coup. We report the relative error, (bound − log-partition-function) / log-partition-function. As a baseline, we use the mean field and structured mean field methods for the lower bound, and the tree-reweighted belief propagation approximation of Wainwright et al. (2003) for the upper bound.
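A toy version of this bound computation (not the paper's experimental code) can be assembled from the machinery above: take a small model, zero out edges to leave a path-structured "tree" matrix C, scale C by λ_min or λ_max as in Propositions 3.3 and 3.4, and measure the relative error of the resulting log-partition-function bounds. Binary variables with φ_st(x_s, x_t) = x_s x_t are again illustrative assumptions:

```python
import itertools

import numpy as np

def log_Z(theta):
    # exact log partition function by enumeration over binary configurations
    n = theta.shape[0]
    e = np.array([np.array(x) @ theta @ np.array(x)
                  for x in itertools.product([0.0, 1.0], repeat=n)])
    return np.log(np.exp(e).sum())

rng = np.random.default_rng(3)
n = 4
# Dense symmetric parameter matrix with positive entries (a stand-in for a
# coupled grid model); C keeps the diagonal plus the path edges 0-1, 1-2, 2-3.
theta = rng.random((n, n)) + 0.1
theta = (theta + theta.T) / 2
mask = np.eye(n)
for s, t in [(0, 1), (1, 2), (2, 3)]:
    mask[s, t] = mask[t, s] = 1
C = theta * mask

# Generalized eigenvalues of (theta, C) by enumeration (Definition 3.1).
ratios = []
for bits in itertools.product([0.0, 1.0], repeat=n):
    x = np.array(bits)
    den = x @ C @ x
    if den != 0:
        ratios.append((x @ theta @ x) / den)
lmax, lmin = max(ratios), min(ratios)

# Scaled tree matrices give rigorous bounds (Propositions 3.3 and 3.4).
lower, exact, upper = log_Z(lmin * C), log_Z(theta), log_Z(lmax * C)
assert lower <= exact <= upper
print((exact - lower) / exact, (upper - exact) / exact)  # relative errors
```

In the actual experiments the sparse matrix C comes from a spanning tree or support tree preconditioner rather than a fixed path, and the bound computation on the tree replaces the exact enumeration used here for checking.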
For the preconditioner based bounds, we use two very simple preconditioners: (a) Vaidya's maximum spanning tree preconditioner (Vaidya, 1990), which assumes the input parameter matrix is a Laplacian, and (b) the support tree preconditioner of Gremban (1996), which also gives a sparse parameter matrix corresponding to a tree, but with additional Steiner (auxiliary) nodes. To compute bounds over these larger graphs with Steiner nodes, we average each internal node over its children; this is the technique used with such preconditioners for solving linear systems. We note that these preconditioners are quite basic, and the use of better preconditioners (yielding a lower condition number) has the potential to achieve much tighter bounds, as shown by Propositions 3.3 and 3.4. We also reiterate that while our approach can be used to derive bounds on event probabilities, the variational methods yield bounds only on the partition function, and apply only heuristically to estimating even simple event probabilities such as marginals. As the plots in Figure 1 show, even for the simple preconditioners used, the new bounds are quite close to the actual values, outperforming the mean field method and giving results comparable to the tree-reweighted belief propagation method. The spanning tree preconditioner provides a good lower bound, while the support tree preconditioner provides a good upper bound, though not as tight as the bound obtained using tree-reweighted belief propagation.
Although we cannot compute the exact solution for large graphs, we can still compare bounds.\n\n[Figure 1: Comparison of lower bounds (top left) and upper bounds (top right) for small grid graphs, and lower bounds for grid graphs of increasing size (bottom). The top panels plot average relative error against coupling strength for the spanning tree preconditioner, structured mean field, support tree preconditioner, mean field, and tree-reweighted BP; the bottom panel plots the lower bound on the partition function against the number of nodes in the graph.]\n\nThe bottom plot of Figure 1 compares lower bounds for graphs with up to 900 nodes; a larger bound is necessarily tighter, and the preconditioner bounds are seen to outperform mean field.\n\nAcknowledgments\n\nWe thank Gary Miller for helpful discussions. Research supported in part by NSF grants IIS-0312814 and IIS-0427206.\n\nReferences\n\nM. Bern, J. R. Gilbert, B. Hendrickson, N. Nguyen, and S. Toledo. Support-graph preconditioners. Submitted to SIAM J. Matrix Anal. Appl., 2001.\n\nE. G. Boman and B. Hendrickson. Support theory for preconditioning. SIAM Journal on Matrix Analysis and Applications, 25, 2003.\n\nK. Gremban. Combinatorial preconditioners for sparse, symmetric, diagonally dominant linear systems. Ph.D. thesis, Carnegie Mellon University, 1996.\n\nP. Ravikumar and J. Lafferty. Variational Chernoff bounds for graphical models. Proceedings of Uncertainty in Artificial Intelligence (UAI), 2004.\n\nP. M. Vaidya. Solving linear equations with symmetric diagonally dominant matrices by constructing good preconditioners. Unpublished manuscript, UIUC, 1990.\n\nM. J. Wainwright, T. Jaakkola, and A. S. Willsky.
Tree-reweighted belief propagation and approximate ML estimation by pseudo-moment matching. 9th Workshop on Artificial Intelligence and Statistics, 2003.\n\nJ. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. IJCAI 2001 Distinguished Lecture track, 2001.\n", "award": [], "sourceid": 2953, "authors": [{"given_name": "John", "family_name": "Lafferty", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}]}