{"title": "Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1285, "page_last": 1293, "abstract": "In this paper we make several contributions towards accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call \"Neighborhood Fusion\" (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our final contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.", "full_text": "Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models\n\nBaback Moghaddam\nJet Propulsion Laboratory\nCalifornia Institute of Technology\nbaback@jpl.nasa.gov\n\nMohammad Emtiyaz Khan\nDepartment of Computer Science\nUniversity of British Columbia\nemtiyaz@cs.ubc.ca\n\nBenjamin M. Marlin\nDepartment of Computer Science\nUniversity of British Columbia\nbmarlin@cs.ubc.ca\n\nKevin P.
Murphy\nDepartment of Computer Science\nUniversity of British Columbia\nmurphyk@cs.ubc.ca\n\nAbstract\n\nWe make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call \"Neighborhood Fusion\" (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.\n\n1 Introduction\n\nThere are two main reasons to learn the structure of graphical models: knowledge discovery (to interpret the learned topology) and density estimation (to compute log-likelihoods and make predictions). The main difficulty in graphical model structure learning is that the hypothesis space is extremely large, containing up to 2^{d(d-1)/2} graphs on d nodes. When the sample size n is small, there can be significant uncertainty with respect to the graph structure.
It is therefore advantageous to adopt a Bayesian approach and maintain an approximate posterior over graphs instead of using a single \"best\" graph, especially since Bayesian model averaging (BMA) can improve predictions.\n\nThere has been much work on Bayesian inference for directed acyclic graphical model (DAG) structure, mostly based on Markov chain Monte Carlo (MCMC) or stochastic local search (SLS) [22, 19, 16, 14]. MCMC and SLS methods for DAGs exploit the important fact that the marginal likelihood of a DAG, or an approximation such as the Bayesian Information Criterion (BIC) score, can be computed very efficiently under standard assumptions, including independent conjugate priors and complete data. An equally important property in the DAG setting is that the score can be quickly updated when small local changes are made to the graph. This conveniently allows one to move rapidly through the very large graph space of DAGs.\n\nHowever, for knowledge discovery, a DAG may be an unsuitable representation for several reasons. First, it does not allow directed cycles, which may be an unnatural restriction in certain domains. Second, DAGs can only be identified up to Markov equivalence in the general case. In contrast, undirected graphs (UGs) avoid these issues and may be a more natural representation for some problems. Also, for UGs there are fast methods available for identifying the local connectivity at each node (the node's Markov blanket). We note that while the UG and DAG representations have different properties and enable different inference and structure learning algorithms, the distinction between UGs and DAGs from a density estimation perspective may be less important [12].\n\nMost prior work on Bayesian inference for Gaussian Graphical Models (GGMs) has focused on the special case of decomposable graphs (e.g., [17, 2, 29]).
The popularity of decomposable GGMs is mostly due to the fact that one can compute the marginal likelihood in closed form using assumptions similar to the DAG case. In addition, one can update the marginal likelihood in constant time after single-edge moves in graph space [17]. However, the space of decomposable graphs is much smaller than the space of general undirected graphs. For example, the number of decomposable graphs on d nodes for d = 2, ..., 8 is 2, 8, 61, 822, 18154, 617675, 30888596 [1, p. 158]. If we divide the number of decomposable graphs by the number of general undirected graphs, we get the \"volume\" ratios: 1, 1, 0.95, 0.80, 0.55, 0.29, 0.12. This means that decomposability significantly limits the subclass of UGs available for modeling purposes, even for small d. Several authors have studied Bayesian inference for GGM structure in the general case using approximations to the marginal likelihood based on Monte Carlo methods (e.g., [8, 31, 20, 3]). However, these methods cannot scale to large graphs because of the high computational cost of Monte Carlo approximation.\n\nIn this paper, we propose several techniques to help accelerate approximate Bayesian structural inference for non-decomposable GGMs. In Section 2, we show how to efficiently compute BIC and Laplace approximations to the marginal likelihood p(D|G) by using recent convex optimization methods for estimating the precision matrix of a GGM. In Section 3, we present a novel framework for generating large sets of high-quality graphs, which we call \"Neighborhood Fusion\" (NF). This framework is quite general in scope and can use any Markov blanket finding method to devise a set of probability distributions (proposal densities) over the local topology at each node. It then specifies rules for \"fusing\" these local densities (via sampling) into an approximate posterior over whole graphs p(G|D).
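The quoted "volume" ratios are easy to check numerically. A minimal sketch in Python, using only the decomposable-graph counts listed above (the variable names are our own):

```python
# Fraction of all undirected graphs on d nodes that are decomposable,
# using the counts for d = 2..8 quoted in the text.
decomposable_counts = {2: 2, 3: 8, 4: 61, 5: 822, 6: 18154, 7: 617675, 8: 30888596}

volume_ratios = {}
for d, n_dec in decomposable_counts.items():
    n_all = 2 ** (d * (d - 1) // 2)  # total number of undirected graphs on d nodes
    volume_ratios[d] = n_dec / n_all

print({d: round(r, 2) for d, r in volume_ratios.items()})
# -> {2: 1.0, 3: 1.0, 4: 0.95, 5: 0.8, 6: 0.55, 7: 0.29, 8: 0.12}
```

The rapid decay of the ratio illustrates the point made above: already at d = 8, only about 12% of undirected graphs are decomposable.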
In Section 4, we combine the complementary strengths of NF and existing SLS methods to obtain even higher quality posterior distributions in certain cases. In Section 5, we present an empirical evaluation of both knowledge discovery and predictive performance of our methods. For knowledge discovery, we measure structural recovery in terms of accuracy of finding true edges in synthetic GGMs (with known structure). For predictive performance, we evaluate test set log-likelihood as well as missing-data imputation on real data (with unknown structure). We show that the proposed NF and hybrid NF/SLS methods for general graphs outperform current approaches to GGM learning for both decomposable and general (non-decomposable) graphs.\n\nThroughout this paper we will view the marginal likelihood p(D|G) as the key to structural inference and as being equivalent to the graph posterior p(G|D) by adopting a flat structural prior p(G) w.l.o.g.\n\n2 Marginal Likelihood for General Graphs\n\nIn this section we review the G-Wishart distribution and discuss approximations to the marginal likelihood of a non-decomposable GGM under the G-Wishart prior. Unlike the decomposable case, here the marginal likelihood cannot be found in closed form. Our main contribution is the insight that recently proposed convex optimization methods for precision matrix estimation can be used to efficiently find the mode of a G-Wishart distribution, which in turn allows for more efficient computation of BIC and Laplace modal approximations to the marginal likelihood.\n\nWe begin with some notation. We define n to be the number of data cases and d to be the number of data dimensions. We denote the ith data case by x_i and a complete data set D with the n x d matrix X, with the corresponding scatter matrix S = X^T X (we assume centered data). We use G to denote an undirected graph, or more precisely its adjacency matrix.
Graph edges are denoted by unordered pairs (i, j), and the edge (i, j) is in the graph G if G_ij = 1. The space of all positive definite matrices having the same zero-pattern as G is denoted by S++_G. The covariance matrix is denoted by Σ and its inverse, the precision matrix, by Ω = Σ^{-1}. We also define <A, B> = Trace(AB).\n\nThe Gaussian likelihood p(D|Ω) is expressed in terms of the data scatter matrix S in Equation 1. We denote the prior distribution over precision matrices given a graph G by p(Ω|G). The standard measure of model quality in the Bayesian model selection setting is the marginal likelihood p(D|G), which is obtained by integrating p(D|Ω) p(Ω|G) over the space S++_G as shown in Equation 2.\n\np(D|Ω) = ∏_{i=1}^n N(x_i | 0, Ω^{-1}) ∝ |Ω|^{n/2} exp(-(1/2) <Ω, S>)   (1)\n\np(D|G) = ∫_{S++_G} p(D|Ω) p(Ω|G) dΩ   (2)\n\nThe G-Wishart density in Equation 3 is the Diaconis-Ylvisaker conjugate form [10] for the GGM likelihood, as shown in [27]. The indicator function I[Ω ∈ S++_G] in Equation 3 restricts the density's support to S++_G. The G-Wishart generalizes the hyper inverse Wishart (HIW) distribution to general non-decomposable graphs.
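The log of the likelihood in Equation 1 is cheap to evaluate from the scatter matrix alone. A minimal sketch in Python/NumPy (the function name and the explicit constant term are our own choices):

```python
import numpy as np

def gauss_loglik(Omega, S, n):
    """log p(D | Omega) = (n/2) log|Omega| - (1/2) <Omega, S> + const,
    where <A, B> = Trace(AB) and S = X^T X is the scatter of centered data.
    The additive constant -(n*d/2) log(2*pi) is included here."""
    d = Omega.shape[0]
    sign, logdet = np.linalg.slogdet(Omega)  # numerically stable log-determinant
    assert sign > 0, "Omega must be positive definite"
    return 0.5 * n * logdet - 0.5 * np.trace(Omega @ S) - 0.5 * n * d * np.log(2 * np.pi)
```

Note that only log|Ω| and Trace(ΩS) depend on Ω, which is why fast scoring reduces to estimating the (constrained) precision matrix.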
The G-Wishart normalization constant Z is shown in Equation 4.\n\nW(Ω | G, δ_0, S_0) = (I[Ω ∈ S++_G] / Z(G, δ_0, S_0)) |Ω|^{(δ_0 - 2)/2} exp(-(1/2) <Ω, S_0>)   (3)\n\nZ(G, δ_0, S_0) = ∫_{S++_G} |Ω|^{(δ_0 - 2)/2} exp(-(1/2) <Ω, S_0>) dΩ   (4)\n\np(D|G) = ∫_{S++_G} p(D|Ω) W(Ω | G, δ_0, S_0) dΩ ∝ Z(G, δ_n, S_n) / Z(G, δ_0, S_0)   (5)\n\nBecause of the conjugate prior in Equation 3, the Ω posterior has a similar form W(Ω | G, δ_n, S_n), where δ_n = δ_0 + n is the posterior degrees of freedom and the posterior scatter matrix is S_n = S + S_0. The resulting marginal likelihood is then the ratio of the two normalizing terms shown in Equation 5 (which we refer to as Z_n and Z_0 for short).\n\nThe main drawback of the G-Wishart for general graphs, compared to the HIW for decomposable graphs, is that one cannot compute the normalization terms Z_n and Z_0 in closed form. As a result, Bayesian model selection for non-decomposable GGMs relies on approximating the marginal likelihood p(D|G). The existing literature focuses on Monte Carlo and Laplace approximations. One strategy that makes use of Monte Carlo estimates of both Z_n and Z_0 is given by [3]. However, the computation time required to find accurate estimates can be extremely high [20] (see Section 6). An effective approximation strategy based on using a Laplace approximation to Z_n and a Monte Carlo approximation to Z_0 is given in [21]. This requires finding the mode of the G-Wishart, from which a closed-form expression for the Hessian is derived [21]. We consider a simpler method, which we call full-Laplace, that applies the Laplace approximation to both Z_n and Z_0 for greater speed. Nevertheless, computing the Hessian determinant has a computational complexity of O(E^3), where E is the number of edges in G.
Since E = O(d^2) in the worst case, computing a full Hessian determinant becomes infeasible for large d in all but the sparsest of graphs.\n\nDue to the high computational cost of Monte Carlo and Laplace approximation in high dimensions, we consider two alternative marginal likelihood approximations that are significantly more efficient. The first alternative is to approximate Z_n and Z_0 by Laplace computations in which the Hessian matrix is replaced by its diagonal (by setting off-diagonal elements to zero). We refer to this method as the diagonal-Laplace score. The other alternative is the Bayesian Information Criterion (BIC) score shown in Equation 6, which is another large-sample Laplace approximation:\n\nBIC(G) = log p(D | Ω̂_G) - (1/2) dof(G) log n,   dof(G) = d + ∑_{i<j} G_ij   (6)
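A sketch of the BIC score in Equation 6 (the function name is our own; we take dof(G) to be the d free diagonal entries plus one parameter per edge, and we drop the constant term of the log-likelihood since it is the same for every graph):

```python
import numpy as np

def bic_score(Omega_hat, S, G, n):
    """BIC(G) = log p(D | Omega_hat_G) - (1/2) dof(G) log n.
    Omega_hat is assumed to be the precision matrix estimate constrained to the
    zero-pattern of adjacency matrix G (e.g. from a convex solver)."""
    d = Omega_hat.shape[0]
    _, logdet = np.linalg.slogdet(Omega_hat)
    loglik = 0.5 * n * logdet - 0.5 * np.trace(Omega_hat @ S)  # constant dropped
    dof = d + int(np.triu(G, k=1).sum())  # free diagonal entries + edges
    return loglik - 0.5 * dof * np.log(n)
```

Because the penalty depends only on the edge count, scoring a candidate graph costs little beyond the constrained precision estimate itself.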