{"title": "Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses", "book": "Advances in Neural Information Processing Systems", "page_first": 2087, "page_last": 2095, "abstract": "", "full_text": "Structure estimation for discrete graphical models:\nGeneralized covariance matrices and their inverses\n\nPo-Ling Loh\n\nDepartment of Statistics\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nploh@berkeley.edu\n\nMartin J. Wainwright\n\nDepartments of Statistics and EECS\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nwainwrig@stat.berkeley.edu\n\nAbstract\n\nWe investigate a curious relationship between the structure of a discrete graphical\nmodel and the support of the inverse of a generalized covariance matrix. We show\nthat for certain graph structures, the support of the inverse covariance matrix of\nindicator variables on the vertices of a graph re\ufb02ects the conditional independence\nstructure of the graph. Our work extends results that have previously been es-\ntablished only in the context of multivariate Gaussian graphical models, thereby\naddressing an open question about the signi\ufb01cance of the inverse covariance ma-\ntrix of a non-Gaussian distribution. Based on our population-level results, we\nshow how the graphical Lasso may be used to recover the edge structure of cer-\ntain classes of discrete graphical models, and present simulations to verify our\ntheoretical results.\n\n1\n\nIntroduction\n\nGraphical model inference is now prevalent in many \ufb01elds, running the gamut from computer vision\nand civil engineering to political science and epidemiology.\nIn many applications, learning the\nedge structure of an underlying graphical model is of great importance\u2014for instance, a graphical\nmodel may be used to represent friendships between people in a social network, or links between\norganisms with the propensity to spread an infectious disease [1]. 
It is well known that zeros in the\ninverse covariance matrix of a multivariate Gaussian distribution indicate the absence of an edge\nin the corresponding graphical model. This fact, combined with techniques in high-dimensional\nstatistical inference, has been leveraged by many authors to recover the structure of a Gaussian\ngraphical model when the edge set is sparse (e.g., see the papers [2, 3, 4, 5] and references therein).\nRecently, Liu et al. [6, 7] introduced the notion of a nonparanormal distribution, which generalizes\nthe Gaussian distribution by allowing for univariate monotonic transformations, and argued that the\nsame structural properties of the inverse covariance matrix carry over to the nonparanormal.\nHowever, the question of whether a relationship exists between conditional independence and the\nstructure of the inverse covariance matrix in a general graph remains unresolved. In this paper, we\nfocus on discrete graphical models and establish a number of interesting links between covariance\nmatrices and the edge structure of an underlying graph. Instead of only analyzing the standard co-\nvariance matrix, we show that it is often fruitful to augment the usual covariance matrix with higher-\norder interaction terms. Our main result has a striking corollary in the context of tree-structured\ngraphs: for any discrete graphical model, the inverse of a generalized covariance matrix is always\n(block) graph-structured. In particular, for binary variables, the inverse of the usual covariance ma-\ntrix corresponds exactly to the edge structure of the tree. We also establish several corollaries that\napply to more general discrete graphs. 
Our methods are capable of handling noisy or missing data\nin a seamless manner.\nOther related work on graphical model selection for discrete graphs includes the classic Chow-Liu\nalgorithm for trees [8]; nodewise logistic regression for discrete models with pairwise\ninteractions [9, 10]; and techniques based on conditional entropy or mutual information [11, 12]. Our\nmain contribution is to present a clean and surprising result on a simple link between the inverse\ncovariance matrix and edge structure of a discrete model, which may be used to derive inference\nalgorithms applicable even to data with systematic corruptions.\nThe remainder of the paper is organized as follows: In Section 2, we provide brief background and\nnotation on graphical models, and describe the classes of augmented covariance matrices we will\nconsider. In Section 3, we state our main results on the relationship between the support of\ngeneralized inverse covariance matrices and the edge structure of a discrete graphical model. We relate our\npopulation-level results to concrete algorithms that are guaranteed to recover the edge structure of a\ndiscrete graph with high probability. In Section 4, we report the results of simulations used to verify\nour theoretical claims. For detailed proofs, we refer the reader to the technical report [13].\n\n2 Background and problem setup\n\nIn this section, we provide background on graphical models and exponential families. We then work\nthrough a simple example that illustrates the phenomena and methodology studied in this paper.\n\n2.1 Graphical models\nAn undirected graph G = (V, E) consists of a collection of vertices V = {1, 2, . . . , p} and a\ncollection of unordered vertex pairs E \u2286 V \u00d7 V , meaning no distinction is made between edges\n(s, t) and (t, s). We associate to each vertex s \u2208 V a random variable Xs taking values in some\nspace X . The random vector X := (X1, . . . 
, Xp) is a Markov random field with respect to G if\nXA \u22a5\u22a5 XB | XS whenever S is a cutset of A and B, meaning every path from A to B in G must pass\nthrough S. We have used the shorthand XA := {Xs : s \u2208 A}. In particular, Xs \u22a5\u22a5 Xt | X\\{s,t}\nwhenever (s, t) \u2209 E.\nBy the Hammersley-Clifford theorem for strictly positive distributions [14], the Markov properties\nimply a factorization of the distribution of X:\n\n$$\\mathbb{P}(x_1, \\ldots, x_p) \\propto \\prod_{C \\in \\mathcal{C}} \\psi_C(x_C), \\tag{1}$$\n\nwhere C is the set of all cliques (fully-connected subsets of V ) and \u03c8C(xC) are the corresponding\nclique potentials. The factorization (1) may alternatively be represented in terms of an exponential\nfamily associated with the clique structure of G. For each clique C \u2208 C, we define a family of\nsufficient statistics {\u03c6C;\u03b1 : X^|C| \u2192 R, \u03b1 \u2208 IC} associated with variables in C, where IC indexes\nthe sufficient statistics corresponding to C. We also introduce a canonical parameter \u03b8C;\u03b1 \u2208 R\nassociated with each sufficient statistic \u03c6C;\u03b1. For a given assignment of canonical parameters \u03b8, we\nmay express the clique potentials as\n\n$$\\psi_C(x_C) = \\sum_{\\alpha \\in I_C} \\theta_{C;\\alpha} \\phi_{C;\\alpha}(x_C) := \\langle \\theta_C, \\phi_C \\rangle,$$\n\nso equation (1) may be rewritten as\n\n$$\\mathbb{P}_\\theta(x_1, \\ldots, x_p) = \\exp\\Big\\{ \\sum_{C \\in \\mathcal{C}} \\langle \\theta_C, \\phi_C \\rangle - A(\\theta) \\Big\\}, \\tag{2}$$\n\nwhere $A(\\theta) := \\log \\sum_{x \\in \\mathcal{X}^p} \\exp\\big( \\sum_{C \\in \\mathcal{C}} \\langle \\theta_C, \\phi_C \\rangle \\big)$ is the (log) partition function.\nNote that for a graph with only pairwise interactions, we have C = V \u222a E. 
If we associate the\nfunction \u03c6s(xs) = xs with clique {s} and the function \u03c6st(xs, xt) = xsxt with edge (s, t), the\nfactorization (2) becomes\n\n$$\\mathbb{P}_\\theta(x_1, \\ldots, x_p) = \\exp\\Big\\{ \\sum_{s \\in V} \\theta_s x_s + \\sum_{(s,t) \\in E} \\theta_{st} x_s x_t - A(\\theta) \\Big\\}. \\tag{3}$$\n\nWhen X = {0, 1}, this family of distributions corresponds to the inhomogeneous Ising model.\nWhen X = R (and with certain additional restrictions on the weights), the family (3) corresponds\nto a Gauss-Markov random field. Both of these models are minimal exponential families, meaning\nthe sufficient statistics are linearly independent [15].\nFor a discrete graphical model with X = {0, 1, . . . , m\u22121}, it is convenient to make use of sufficient\nstatistics involving indicator functions. For clique C, define the subset of configurations\n\n$$\\mathcal{X}_0^{|C|} = \\big\\{ J = (j_1, \\ldots, j_{|C|}) \\mid j_\\ell \\neq 0 \\text{ for all } \\ell = 1, \\ldots, |C| \\big\\},$$\n\nfor which no variables take the value 0. Then $|\\mathcal{X}_0^{|C|}| = (m-1)^{|C|}$. For any configuration $J \\in \\mathcal{X}_0^{|C|}$,\nwe define the indicator function\n\n$$\\phi_{C;J}(x_C) = \\begin{cases} 1 & \\text{if } x_C = J, \\\\ 0 & \\text{otherwise,} \\end{cases}$$\n\nand consider the family of models\n\n$$\\mathbb{P}_\\theta(x_1, \\ldots, x_p) = \\exp\\Big\\{ \\sum_{C \\in \\mathcal{C}} \\langle \\theta_C, \\phi_C \\rangle - A(\\theta) \\Big\\}, \\quad \\text{where } x_j \\in \\mathcal{X} = \\{0, 1, \\ldots, m-1\\}, \\tag{4}$$\n\nwith $\\langle \\theta_C, \\phi_C \\rangle = \\sum_{J \\in \\mathcal{X}_0^{|C|}} \\theta_{C;J} \\phi_{C;J}(x_C)$. 
Note in particular that when m = 2, $\\mathcal{X}_0^{|C|}$ is a singleton\nset containing the vector of all ones, and the sufficient statistics are given by\n\n$$\\phi_{C;J}(x_C) = \\prod_{s \\in C} x_s, \\quad \\text{for } C \\in \\mathcal{C} \\text{ and } J = \\{1\\}^{|C|};$$\n\ni.e., the indicator functions may simply be expressed as products of variables appearing in the clique.\nWhen the graphical model has only pairwise interactions, elements of C have cardinality at most\ntwo, and the model (4) clearly reduces to the Ising model (3). 
Finally, as with the family (3), the\nfamily (4) is a minimal exponential family.\n\n2.2 Covariance matrices and beyond\n\nConsider the usual covariance matrix \u03a3 = cov(X1, . . . , Xp). When X is Gaussian, it is a well-known\nconsequence of the Hammersley-Clifford theorem that the entries of the precision matrix\n\u0393 = \u03a3^{\u22121} correspond to rescaled conditional correlations [14]. The magnitude of \u0393st is a scalar\nmultiple of the correlation of Xs and Xt conditioned on X\\{s,t}, and encodes the strength of the\nedge (s, t). In particular, the sparsity pattern of \u0393 reflects the edge structure of the graph: \u0393st = 0\nif and only if Xs \u22a5\u22a5 Xt | X\\{s,t}. For more general distributions, however, Corr(Xs, Xt | X\\{s,t})\nis a function of X\\{s,t}, and it is not known whether the entries of \u0393 have any relationship with the\nstrengths of edges in the graph.\nNonetheless, it is tempting to conjecture that inverse covariance matrices, and more generally,\ninverses of higher-order moment matrices, might be related to graph structure. Let us explore this\npossibility by considering a simple example, namely the binary Ising model (3) with X = {0, 1}.\nExample 1. Consider a simple chain graph on four nodes, as illustrated in Figure 1(a). In terms\nof the factorization (3), let the node potentials be \u03b8s = 0.1 for all s \u2208 V and the edge potentials\nbe \u03b8st = 2 for all (s, t) \u2208 E. For a multivariate Gaussian graphical model defined on G, standard\ntheory predicts that the inverse covariance matrix \u0393 = \u03a3^{\u22121} of the distribution is graph-structured:\n\u0393st = 0 if and only if (s, t) \u2209 E. Surprisingly, this is also the case for the chain graph with binary\nvariables: a little computation shows that \u0393 takes the form shown in panel (f). However, this statement\nis not true for the single-cycle graph shown in panel (b), with added edge (1, 4). 
Indeed, as shown\nin panel (g), the inverse covariance matrix has no zero entries at all. But for a more complicated\ngraph, say the one in (e), we again observe a graph-structured inverse covariance matrix.\n\n[Figure 1 panels: (a) Chain; (b) Single cycle; (c) Edge augmented; (d) With 3-cliques; (e) Dino.]\n\n$$\\text{(f)} \\quad \\Gamma_{\\text{chain}} = \\begin{bmatrix} 9.80 & -3.59 & 0 & 0 \\\\ -3.59 & 34.30 & -4.77 & 0 \\\\ 0 & -4.77 & 34.30 & -3.59 \\\\ 0 & 0 & -3.59 & 9.80 \\end{bmatrix}, \\qquad \\text{(g)} \\quad \\Gamma_{\\text{loop}} = \\begin{bmatrix} 51.37 & -5.37 & -0.17 & -5.37 \\\\ -5.37 & 51.37 & -5.37 & -0.17 \\\\ -0.17 & -5.37 & 51.37 & -5.37 \\\\ -5.37 & -0.17 & -5.37 & 51.37 \\end{bmatrix}.$$\n\nFigure 1. (a)\u2013(e) Different examples of graphical models. (f) Inverse covariance for chain-structured\ngraph in (a). (g) Inverse covariance for single-cycle graph in (b).\n\nStill focusing on the single-cycle graph in panel (b), suppose that instead of considering the\nordinary covariance matrix, we compute the covariance matrix of the augmented random vector\n(X1, X2, X3, X4, X1X3), where the extra term X1X3 is represented by the dotted edge shown\nin panel (c). The 5 \u00d7 5 inverse of this generalized covariance matrix takes the form\n\n$$\\Gamma_{\\text{aug}} = 10^3 \\times \\begin{bmatrix} 1.15 & -0.02 & 1.09 & -0.02 & -1.14 \\\\ -0.02 & 0.05 & -0.02 & 0 & 0.01 \\\\ 1.09 & -0.02 & 1.14 & -0.02 & -1.14 \\\\ -0.02 & 0 & -0.02 & 0.05 & 0.01 \\\\ -1.14 & 0.01 & -1.14 & 0.01 & 1.19 \\end{bmatrix}.$$\n\nThis matrix safely separates nodes 2 and 4, but the entry corresponding to the phantom edge (1, 3) is\nnot equal to zero. Indeed, we would observe a similar phenomenon if we chose to augment the graph\nby including the edge (2, 4) rather than (1, 3). 
Note that the relationship between entries of \u0393aug and\nthe edge strength is not direct; although the factorization (3) has no potential corresponding to the\naugmented \u201cedge\u201d (1, 3), the (1, 3) entry of \u0393aug is noticeably larger in magnitude than the entries\ncorresponding to actual edges with nonzero potentials. This example shows that the usual inverse\ncovariance matrix is not always graph-structured, but computing generalized covariance matrices\ninvolving higher-order interaction terms may indicate graph structure.\nNow let us consider a more general graphical model that adds the 3-clique interaction terms shown\nin panel (d) to the usual Ising terms. We compute the covariance matrix of the augmented vector\n\n$$\\Phi(X) = \\big\\{ X_1, X_2, X_3, X_4, X_1X_2, X_2X_3, X_3X_4, X_1X_4, X_1X_3, X_1X_2X_3, X_1X_3X_4 \\big\\} \\in \\{0, 1\\}^{11}.$$\n\nEmpirically, we find that the 11 \u00d7 11 inverse of the matrix cov(\u03a6(X)) continues to respect aspects of\nthe graph structure: in particular, there are zeros in position (\u03b1, \u03b2), corresponding to the associated\nfunctions $X_\\alpha = \\prod_{s \\in \\alpha} X_s$ and $X_\\beta = \\prod_{s \\in \\beta} X_s$, whenever \u03b1 and \u03b2 do not lie within the same\nmaximal clique. (For instance, this applies to the pairs (\u03b1, \u03b2) = ({2},{4}) and (\u03b1, \u03b2) = ({2},{1, 4}).)\n\nThe goal of this paper is to understand when certain inverse covariances do (and do not) capture\nthe structure of a graphical model. The underlying principles behind the behavior demonstrated in\nExample 1 will be made concrete in Theorem 1 and its corollaries in the next section.\n\n3 Main results and consequences\n\nWe now state our main results on the structure of generalized inverse covariance matrices and graph\nstructure. 
We present our results in two parts: one concerning statements at the population level,\nand the other concerning statements at the level of statistical consistency based on random samples.\n\n3.1 Population-level results\n\nOur main result concerns a connection between the inverses of generalized covariance matrices\nassociated with the model (4) and the structure of the graph. We begin with some notation.\n\nRecall that a triangulation of a graph G = (V, E) is an augmented graph G\u0303 = (V, E\u0303) with no\nchordless cycles of length four or more. (For instance, the single cycle in panel (b) is a chordless 4-cycle, whereas panel\n(c) shows a triangulated graph. The dinosaur graph in panel (e) is also triangulated.) The edge set E\u0303\ncorresponds to the original edge set E plus the additional edges added to form the triangulation. In\ngeneral, G admits many different triangulations; the results we prove below will hold for any fixed\ntriangulation of G.\nWe also require some notation for defining generalized covariance matrices. Let S be a collection\nof subsets of vertices, and define the random vector\n\n$$\\Phi(X; \\mathcal{S}) = \\big\\{ \\phi_{S;J}, \\; J \\in \\mathcal{X}_0^{|S|}, \\; S \\in \\mathcal{S} \\cap \\mathcal{C} \\big\\}, \\tag{5}$$\n\nconsisting of all sufficient statistics over cliques in S. We will often be interested in situations where\nS contains all subsets of a given set. For a subset A \u2286 V , we let pow(A) denote the set of all\nnon-empty subsets of A. (For instance, pow({1, 2}) = {{1}, {2}, {1, 2}}.) Furthermore, for a collection\nof subsets S, we let pow(S) be the set of all subsets {pow(S), S \u2208 S}, discarding any duplicates\nthat arise. We are now ready to state our main theorem regarding the support of a certain type of\ngeneralized inverse covariance matrix.\nTheorem 1. [Triangulation and block graph-structure.] Consider an arbitrary discrete graphical\nmodel of the form (4), and let T be the set of maximal cliques in any triangulation of G. 
Then the\ninverse \u0393 of the augmented covariance matrix cov(\u03a6(X; pow(T ))) is block graph-structured in the\nfollowing sense:\n\n(a) For any two subsets A and B which are not subsets of the same maximal clique, the block\n\n\u0393(pow(A), pow(B)) is zero.\n\n(b) For almost all parameters \u03b8, the entire block \u0393(pow(A), pow(B)) is nonzero whenever A\n\nand B belong to a common maximal clique.\n\nThe proof of this result relies on convex analysis and the geometry of exponential families [15, 16].\nIn particular, in any minimal exponential family, there is a one-to-one correspondence between\nexponential parameters (\u03b8\u03b1 in our notation) and mean parameters (\u00b5\u03b1 = E[\u03c6\u03b1(X)]). This corre-\nspondence is induced by the Fenchel-Legendre duality between the log partition function A and its\ndual A\u2217, and allows us to relate \u0393 to the graph structure.\nNote that when the original graph G is a tree, the graph is already triangulated and the set T in\nTheorem 1 is equal to the edge set E. Hence, Theorem 1 implies that the inverse \u0393 of the augmented\ncovariance matrix with suf\ufb01cient statistics for all vertices and edges is graph-structured, and blocks\nof nonzeros in \u0393 correspond to edges in the graph. In particular, the (m\u2212 1)p\u00d7 (m\u2212 1)p submatrix\n\u0393V,V corresponding to suf\ufb01cient statistics of vertices is block graph-structured; in the case when\nm = 2, the submatrix \u0393V,V is simply the p \u00d7 p block corresponding to the vector (X1, . . . , Xp).\nWhen G is not triangulated, however, we may need to invert a larger augmented covariance matrix\nand include suf\ufb01cient statistics over pairs (s, t) /\u2208 E, as well.\nIn fact, it is not necessary to take the set of suf\ufb01cient statistics over all maximal cliques, and we\nmay consider a slightly smaller augmented covariance matrix. 
Recall that any triangulation T gives\nrise to a junction tree representation of G, where nodes of the junction tree are subsets of V\ncorresponding to maximal cliques in T , and the edges are intersections of adjacent cliques known as\nseparator sets [15]. The following corollary involves the generalized covariance matrix containing\nonly sufficient statistics for nodes and separator sets of T :\nCorollary 1. Let S be the set of separator sets in any triangulation of G, and let \u0393 be the inverse of\ncov(\u03a6(X; V \u222a pow(S))). Then \u0393V,V is block graph-structured: \u0393s,t = 0 whenever (s, t) \u2209 E\u0303.\n\nThe proof of this corollary is based on applying the block matrix inversion formula [17] to express\n\u0393V,V in terms of the matrix \u0393 from Theorem 1. Panel (c) of Example 1 and the associated matrix\n\u0393aug provide a concrete instance of this corollary in action. In panel (c), the single separator set in\nthe triangulation is {1, 3}, so augmenting the usual covariance matrix with the additional sufficient\nstatistic X1X3 and taking the inverse should yield a graph-structured matrix. Indeed, edge (2, 4)\ndoes not belong to E\u0303, and as predicted by Corollary 1, we observe that \u0393aug(2, 4) = 0.\n\nNote that V \u222a pow(S) \u2286 pow(T ), and the set of sufficient statistics considered in Corollary 1 is\ngenerally much smaller than the set of sufficient statistics considered in Theorem 1. Hence, the\ngeneralized covariance matrix of Corollary 1 has a smaller dimension than the generalized covariance\nmatrix of Theorem 1, and is much more tractable for estimation.\n\nAlthough Theorem 1 and Corollary 1 are clean results at the population level, forming the\nproper augmented covariance matrix requires some prior knowledge of the graph, namely, which\nedges are involved in a suitable triangulation. 
In the case of a graph with only singleton separator\nsets, Corollary 1 specializes to the following useful corollary, which only involves the covariance\nmatrix over indicators of vertices of G:\nCorollary 2. For any graph with singleton separator sets, the inverse matrix \u0393 of the ordinary\ncovariance matrix cov(\u03a6(X; V )) is graph-structured. (This class includes trees as a special case.)\n\nAgain, we may relate this corollary to Example 1\u2014the inverse covariance matrices for the tree graph\nin panel (a) and the dinosaur graph in panel (e) are exactly graph-structured. Indeed, although the\ndinosaur graph is not a tree, it possesses the nice property that the only separator sets in its junction\ntree are singletons.\nCorollary 1 also guarantees that inverse covariances may be partially graph-structured, in the sense\nthat (\u0393V,V )st = 0 for any pair of vertices (s, t) separable by a singleton separator set. This is\nbecause for any such pair (s, t), we form a junction tree with two nodes, one containing s and one\ncontaining t, and apply Corollary 1 to conclude that (\u0393V,V )st = 0. Indeed, the matrix \u0393V,V over\nsingleton vertices is agnostic to which triangulation we choose for the graph.\nIn settings where there exists a junction tree representation of the graph with only singleton separator\nsets, Corollary 2 has a number of useful implications for the consistency of methods that have\ntraditionally only been applied for edge recovery in Gaussian graphical models. In such settings,\nCorollary 2 implies that it suf\ufb01ces to estimate the support of \u0393V,V from the data.\n\n3.2 Consequences for graphical Lasso for trees\n\nMoving beyond the population level, we now establish results concerning the statistical consistency\nof methods for graph selection in discrete graphical models, based on i.i.d. draws from a discrete\ngraph. 
We describe how a combination of our population-level results and some concentration\ninequalities may be leveraged to analyze the statistical behavior of log-determinant methods for\ndiscrete tree-structured graphical models, and suggest extensions of these methods when observations\nare systematically corrupted by noise or missing data.\nGiven p-dimensional random variables (X1, . . . , Xp) with covariance \u03a3\u2217, consider the estimator\n\n$$\\widehat{\\Theta} \\in \\arg\\min_{\\Theta \\succeq 0} \\Big\\{ \\operatorname{trace}(\\widehat{\\Sigma} \\Theta) - \\log\\det(\\Theta) + \\lambda_n \\sum_{s \\neq t} |\\Theta_{st}| \\Big\\}, \\tag{6}$$\n\nwhere \u03a3\u0302 is an estimator for \u03a3\u2217. For multivariate Gaussian data, this program is an \u21131-regularized\nmaximum likelihood estimate known as the graphical Lasso, and is a well-studied method for\nrecovering the edge structure in a Gaussian graphical model [18, 19, 20]. Although the program (6)\nhas no relation to the MLE in the case of a discrete graphical model, it is still useful for estimating\n\u0398\u2217 := (\u03a3\u2217)^{\u22121}, and our analysis shows the surprising result that the program is consistent for\nrecovering the structure of any tree-structured Ising model. We consider a general estimate \u03a3\u0302 of the\ncovariance matrix \u03a3\u2217 such that\n\n$$\\mathbb{P}\\Big[ \\|\\widehat{\\Sigma} - \\Sigma^*\\|_{\\max} \\geq \\varphi(\\Sigma^*) \\sqrt{\\frac{\\log p}{n}} \\Big] \\leq c \\exp(-\\psi(n, p)) \\tag{7}$$\n\nfor functions \u03d5 and \u03c8, where \u2016 \u00b7 \u2016max denotes the elementwise \u2113\u221e-norm. 
In the case of fully-observed i.i.d. data with sub-Gaussian parameter \u03c3^2, where $\\widehat{\\Sigma} = \\frac{1}{n} \\sum_{i=1}^n x_i x_i^T - \\bar{x} \\bar{x}^T$ is the usual\nsample covariance, this bound holds with \u03d5(\u03a3\u2217) = \u03c3^2 and \u03c8(n, p) = c\u2032 log p.\nIn addition, we require a certain mutual incoherence condition on the true covariance matrix \u03a3\u2217 to\ncontrol the correlation of non-edge variables with edge variables in the graph. Let \u0393\u2217 = \u03a3\u2217 \u2297 \u03a3\u2217,\nwhere \u2297 denotes the Kronecker product. Then \u0393\u2217 is a p^2 \u00d7 p^2 matrix indexed by vertex pairs. The\nincoherence condition is given by\n\n$$\\max_{e \\in S^c} \\big\\| \\Gamma^*_{eS} (\\Gamma^*_{SS})^{-1} \\big\\|_1 \\leq 1 - \\alpha, \\qquad \\alpha \\in (0, 1], \\tag{8}$$\n\nwhere $S := \\{(s, t) : \\Theta^*_{st} \\neq 0\\}$ is the set of vertex pairs corresponding to nonzero elements of\nthe precision matrix \u0398\u2217 (equivalently, the edge set of the graph, by our theory on tree-structured\ndiscrete graphs). For more intuition on the mutual incoherence condition, see Ravikumar et al. [4].\n\nOur global edge recovery algorithm proceeds as follows:\nAlgorithm 1 (Graphical Lasso).\n\n1. Form a suitable estimate \u03a3\u0302 of the true covariance matrix \u03a3\u2217.\n2. Optimize the graphical Lasso program (6) with parameter \u03bbn, denoting the solution by \u0398\u0302.\n3. Threshold the entries of \u0398\u0302 at level \u03c4n to obtain an estimate of \u0398\u2217.\n\nWe then have the following consistency result, a straightforward consequence of the graph structure\nof \u0398\u2217 and concentration properties of \u03a3\u0302:\nCorollary 3. Suppose we have a tree-structured Ising model with degree at most d, satisfying the\nmutual incoherence condition (8). 
If n \u2273 d^2 log p, then Algorithm 1 with \u03a3\u0302 the sample covariance\nmatrix and parameters $\\lambda_n \\geq \\frac{c_1}{\\alpha} \\sqrt{\\frac{\\log p}{n}}$ and $\\tau_n = c_2 \\big\\{ \\frac{c_1}{\\alpha} \\sqrt{\\frac{\\log p}{n}} + \\lambda_n \\big\\}$ recovers all edges (s, t) with\n$|\\Theta^*_{st}| > \\tau_n/2$, with probability at least 1 \u2212 c exp(\u2212c\u2032 log p).\n\nHence, if $|\\Theta^*_{st}| > \\tau_n/2$ for all edges (s, t) \u2208 E, Corollary 3 guarantees that the log-determinant\nmethod plus thresholding recovers the full graph exactly. In the case of the standard sample\ncovariance matrix, this method has been implemented by Banerjee et al. [18]; our analysis\nestablishes consistency of their method for discrete trees. The scaling n \u2273 d^2 log p is unavoidable, as\nshown by information-theoretic analysis [21], and also appears in other past work on Ising\nmodels [10, 9, 11]. Our analysis also has a cautionary message: the proof of Corollary 3 relies heavily\non the population-level result in Corollary 2, which ensures that \u0398\u2217 is tree-structured. For a general\ngraph, we have no guarantees that \u0398\u2217 will be graph-structured (e.g., see panel (b) in Figure 1), so\nthe graphical Lasso (6) is inconsistent in general.\nOn the positive side, if we restrict ourselves to tree-structured graphs, the estimator (6) is attractive,\nsince it relies only on an estimate \u03a3\u0302 of the population covariance \u03a3\u2217 that satisfies the deviation\ncondition (7). In particular, when the samples $\\{x_i\\}_{i=1}^n$ are contaminated by noise or missing data,\nall we require is a sufficiently good estimate \u03a3\u0302 of \u03a3\u2217. 
Furthermore, the program (6) is always convex\neven when the estimator \u03a3\u0302 is not positive semidefinite (as will often be the case for missing/corrupted\ndata).\nAs a concrete example of how we may correct the program (6) to handle corrupted data, consider\nthe case when each entry of xi is missing independently with probability \u03c1, and the corresponding\nobservations zi are zero-filled for missing entries. A natural estimator is\n\n$$\\widehat{\\Sigma} = \\Big( \\frac{1}{n} \\sum_{i=1}^n z_i z_i^T \\Big) \\div M - \\frac{1}{(1-\\rho)^2} \\bar{z} \\bar{z}^T, \\tag{9}$$\n\nwhere \u00f7 denotes elementwise division by the matrix M with diagonal entries (1 \u2212 \u03c1) and\noff-diagonal entries (1 \u2212 \u03c1)^2, correcting for the bias in both the mean and second moment terms. The\ndeviation condition (7) may be shown to hold w.h.p., where \u03d5(\u03a3\u2217) scales with (1 \u2212 \u03c1) (cf. Loh and\nWainwright [22]). Similarly, we may derive an appropriate estimator \u03a3\u0302 and a subsequent version of\nAlgorithm 1 in situations when the data are systematically contaminated by other forms of additive\nor multiplicative corruption.\nGeneralizing to the case of m-ary discrete graphical models with m > 2, we may easily modify\nthe program (6) by replacing the elementwise \u21131-penalty by the corresponding group \u21131-penalty,\nwhere the groups are the indicator variables for a given vertex. Precise theoretical guarantees may\nbe derived from results on the group graphical Lasso [23].\n\n4 Simulations\n\nFigure 2 depicts the results of simulations we performed to test our theoretical predictions. In all\ncases, we generated binary Ising models with node weights 0.1 and edge weights 0.3 (using spin\n{\u22121, 1} variables). The five curves show the results of our graphical Lasso method applied to\nthe dinosaur graph in Figure 1. 
Each curve plots the probability of success in recovering the 15\nedges of the graph, as a function of the rescaled sample size n/log p, where p = 13. The leftmost\n(red) curve corresponds to the case of fully-observed covariates (\u03c1 = 0), whereas the remaining\nfour curves correspond to increasing missing data fractions \u03c1 \u2208 {0.05, 0.1, 0.15, 0.2}, using the\ncorrected estimator (9). We observe that all five runs display a transition from success probability 0\nto success probability 1 in roughly the same range of the rescaled sample size, as predicted by our\ntheory. Indeed, since the dinosaur graph has only singleton separators, Corollary 2 ensures that the\ninverse covariance matrix is exactly graph-structured. Note that the curves shift right as the fraction\n\u03c1 of missing data increases, since the problem becomes harder.\n\nFigure 2. Simulation results for our graphical Lasso method on binary Ising models, allowing for\nmissing data in the observations. The figure shows simulation results for the dinosaur graph. Each\npoint represents an average over 1000 trials. The horizontal axis gives the rescaled sample size n/log p.\n\n5 Discussion\n\nThe correspondence between the inverse covariance matrix and graph structure of a Gauss-Markov\nrandom field is a classical fact, with many useful consequences for efficient estimation of Gaussian\ngraphical models. It has long been an open question as to whether or not similar properties extend\nto a broader class of graphical models. In this paper, we have provided a partial affirmative answer\nto this question and developed theoretical results extending such relationships to discrete undirected\ngraphical models.\nAs shown by our results, the inverse of the ordinary covariance matrix is graph-structured for special\nsubclasses of graphs with singleton separator sets. 
More generally, we have shown that it is worthwhile\nto consider the inverses of generalized covariance matrices, formed by introducing indicator\nfunctions for larger subsets of variables. When these subsets are chosen to reflect the structure\nof an underlying junction tree, the edge structure is reflected in the inverse covariance matrix. Our\npopulation-level results have a number of statistical consequences for graphical model selection. We\nhave shown how our results may be used to establish consistency (or inconsistency) of the standard\ngraphical Lasso applied to discrete graphs, even when observations are systematically corrupted\nby mechanisms such as additive noise and missing data. As noted by an anonymous reviewer, the\nChow-Liu algorithm might also potentially be modified to allow for missing or corrupted\nobservations. However, our proposed method and further offshoots of our population-level result may be\napplied even in cases of non-tree graphs, which is beyond the scope of the Chow-Liu algorithm.\n\nAcknowledgments\n\nPL acknowledges support from a Hertz Foundation Fellowship and an NDSEG Fellowship. MJW\nand PL were also partially supported by grants NSF-DMS-0907632 and AFOSR-09NL184. The\nauthors thank the anonymous reviewers for helpful feedback.\n\nReferences\n\n[1] M.E.J. Newman and D.J. Watts. Scaling and percolation in the small-world network model. Phys. Rev. E, 60(6):7332\u20137342, December 1999.\n\n[2] T. Cai, W. Liu, and X. Luo. A constrained \u21131 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106:594\u2013607, 2011.\n\n[3] N. Meinshausen and P. B\u00fchlmann. High-dimensional graphs and variable selection with the Lasso. 
Annals of Statistics, 34:1436\u20131462, 2006.\n\n[4] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation\nby minimizing \u21131-penalized log-determinant divergence. Electronic Journal of Statistics,\n4:935\u2013980, 2011.\n\n[5] M. Yuan. High-dimensional inverse covariance matrix estimation via linear programming.\nJournal of Machine Learning Research, 99:2261\u20132286, August 2010.\n\n[6] H. Liu, F. Han, M. Yuan, J.D. Lafferty, and L.A. Wasserman. High dimensional semiparametric\nGaussian copula graphical models. arXiv e-prints, March 2012. Available at\nhttp://arxiv.org/abs/1202.2169.\n\n[7] H. Liu, J.D. Lafferty, and L.A. Wasserman. The nonparanormal: Semiparametric estimation of\nhigh dimensional undirected graphs. Journal of Machine Learning Research, 10:2295\u20132328,\n2009.\n\n[8] C.I. Chow and C.N. Liu. Approximating discrete probability distributions with dependence\ntrees. IEEE Transactions on Information Theory, 14:462\u2013467, 1968.\n\n[9] A. Jalali, P.D. Ravikumar, V. Vasuki, and S. Sanghavi. On learning discrete graphical models\nusing group-sparse regularization. Journal of Machine Learning Research - Proceedings Track,\n15:378\u2013387, 2011.\n\n[10] P. Ravikumar, M.J. Wainwright, and J.D. Lafferty. High-dimensional Ising model selection\nusing \u21131-regularized logistic regression. Annals of Statistics, 38:1287, 2010.\n\n[11] A. Anandkumar, V.Y.F. Tan, and A.S. Willsky. High-dimensional structure learning of Ising\nmodels: Local separation criterion. Annals of Statistics, 40(3):1346\u20131375, 2012.\n\n[12] G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov random fields from samples:\nSome observations and algorithms. In APPROX-RANDOM, pages 343\u2013356, 2008.\n\n[13] P. Loh and M.J. Wainwright. Structure estimation for discrete graphical models: Generalized\ncovariance matrices and their inverses. arXiv e-prints, November 2012.\n\n[14] S.L. Lauritzen. 
Graphical Models. Oxford University Press, 1996.\n\n[15] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational\ninference. Found. Trends Mach. Learn., 1(1-2):1\u2013305, January 2008.\n\n[16] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.\n\n[17] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1990.\n\n[18] O. Banerjee, L. El Ghaoui, and A. d\u2019Aspremont. Model selection through sparse maximum\nlikelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning\nResearch, 9:485\u2013516, 2008.\n\n[19] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical\nLasso. Biostatistics, 9(3):432\u2013441, July 2008.\n\n[20] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model.\nBiometrika, 94(1):19\u201335, 2007.\n\n[21] N. P. Santhanam and M. J. Wainwright. Information-theoretic limits of selecting\nbinary graphical models in high dimensions. IEEE Transactions on Information Theory,\n58(7):4117\u20134134, 2012.\n\n[22] P. Loh and M.J. Wainwright. High-dimensional regression with noisy and missing data: Provable\nguarantees with non-convexity. Annals of Statistics, 40(3):1637\u20131664, 2012.\n\n[23] L. Jacob, G. Obozinski, and J. P. Vert. Group Lasso with Overlap and Graph Lasso. In\nInternational Conference on Machine Learning (ICML), pages 433\u2013440, 2009.\n", "award": [], "sourceid": 4584, "authors": [{"given_name": "Po-ling", "family_name": "Loh", "institution": null}, {"given_name": "Martin", "family_name": "Wainwright", "institution": null}]}