{"title": "High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions", "book": "Advances in Neural Information Processing Systems", "page_first": 1863, "page_last": 1871, "abstract": "We consider the problem of Ising and Gaussian graphical model selection given n i.i.d. samples from the model. We propose an efficient threshold-based algorithm   for structure estimation based known as  conditional mutual information test. This simple local algorithm    requires only low-order statistics of the data and decides    whether  two nodes   are neighbors in the unknown graph. Under some transparent assumptions, we establish that the proposed algorithm is structurally consistent (or sparsistent)  when the number of samples scales as n= Omega(J_{min}^{-4} log p), where p is the number of nodes and J_{min} is the minimum edge potential.  We also prove novel non-asymptotic necessary conditions for graphical model selection.", "full_text": "High-Dimensional Graphical Model Selection:\n\nTractable Graph Families and Necessary Conditions\n\nAnima Anandkumar\n\nDept. of EECS,\n\nUniv. of California\nIrvine, CA, 92697\n\na.anandkumar@uci.edu\n\nVincent Y.F. Tan\n\nDept. of ECE,\n\nUniv. of Wisconsin\nMadison, WI, 53706.\nvtan@wisc.edu\n\nAlan S. Willsky\nDept. of EECS\n\nMassachusetts Inst. of Technology,\n\nCambridge, MA, 02139.\nwillsky@mit.edu\n\nAbstract\n\nWe consider the problem of Ising and Gaussian graphical model selection given n i.i.d. samples\nfrom the model. We propose an ef\ufb01cient threshold-based algorithm for structure estimation based\non conditional mutual information thresholding. This simple local algorithm requires only low-\norder statistics of the data and decides whether two nodes are neighbors in the unknown graph.\nWe identify graph families for which the proposed algorithm has low sample and computational\ncomplexities. 
Under some transparent assumptions, we establish that the proposed algorithm is structurally consistent (or sparsistent) when the number of samples scales as n = Ω(J_min^{−4} log p), where p is the number of nodes and Jmin is the minimum edge potential. We also develop novel non-asymptotic techniques for obtaining necessary conditions for graphical model selection.

Keywords: Graphical model selection, high-dimensional learning, local-separation property, necessary conditions, typical sets, Fano's inequality.

1 Introduction

The formalism of probabilistic graphical models can be employed to represent dependencies among a large set of random variables in the form of a graph [1]. An important challenge in the study of graphical models is to learn the unknown graph using samples drawn from the graphical model. The general structure estimation problem is NP-hard [2]. In the high-dimensional regime, structure estimation is even more difficult since the number of available observations is typically much smaller than the number of dimensions (or variables). One of the goals is to characterize tractable model classes for which consistent graphical model selection can be guaranteed with low computational and sample complexities.

The seminal work by Chow and Liu [3] proposed an efficient algorithm for maximum-likelihood structure estimation in tree-structured graphical models by reducing the problem to a maximum weight spanning tree problem. A more recent approach for efficient structure estimation is based on convex relaxation [4–6]. The success of such methods typically requires certain "incoherence" conditions to hold. 
However, these conditions are NP-hard to verify for general graphical models.

We adopt an alternative paradigm in this paper and instead analyze a simple local algorithm which requires only low-order statistics of the data and makes decisions on whether two nodes are neighbors in the unknown graph. We characterize the class of Ising and Gaussian graphical models for which we can guarantee efficient and consistent structure estimation using this simple algorithm. The class of graphs is based on a local-separation property and includes many well-known random graph families: locally tree-like graphs such as large-girth graphs, Erdős-Rényi random graphs [7] and power-law graphs [8], as well as graphs with short cycles such as bounded-degree graphs and small-world graphs [9]. These graphs are especially relevant in modeling social networks [10, 11].

1.1 Summary of Results

We propose an algorithm for structure estimation, termed conditional mutual information thresholding (CMIT), which computes the minimum empirical conditional mutual information of a given node pair over conditioning sets of bounded cardinality η. If the minimum exceeds a given threshold (depending on the number of samples n and the number of nodes p), the node pair is declared as an edge. This test has a low computational complexity of O(p^{η+2}) and requires only low-order statistics (up to order η + 2) when η is small. The parameter η is an upper bound on the size of local vertex separators in the graph, and is small for many common graph families, as discussed earlier. 
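The O(p^{η+2}) complexity claim above can be sanity-checked by directly counting the (node pair, conditioning set) combinations the test examines. The following back-of-envelope sketch does so; the values p = 50 and η = 2 are illustrative choices of ours, not from the paper:

```python
from math import comb

# Count the conditional-mutual-information evaluations the thresholding test
# performs: one minimization per node pair, over all conditioning sets of
# size at most eta drawn from the remaining p - 2 nodes.
def num_tests(p, eta):
    return comb(p, 2) * sum(comb(p - 2, k) for k in range(eta + 1))

p, eta = 50, 2
# The count is dominated by the largest conditioning sets, hence O(p^(eta+2)).
assert num_tests(p, eta) <= p ** (eta + 2)
```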
We establish that, under a set of mild and transparent assumptions, structure learning via CMIT is consistent in high dimensions when the number of samples scales as n = Ω(J_min^{−4} log p) for a p-node graph, where Jmin is the minimum (absolute) edge potential in the model.

We also develop novel techniques to obtain necessary conditions for consistent structure estimation of Erdős-Rényi random graphs. We obtain non-asymptotic bounds on the number of samples n in terms of the expected degree and the number of nodes of the model. The techniques employed are information-theoretic in nature and combine the use of Fano's inequality and the so-called asymptotic equipartition property.

Our results have many ramifications: we explicitly characterize the tradeoff between various graph parameters, such as the maximum degree, the girth and the strength of edge potentials, for efficient and consistent structure estimation. We draw connections between structure learning and the statistical-physical properties of the model: learning is fundamentally related to the absence of long-range dependencies in the model, i.e., the regime of correlation decay. The notion of correlation decay on Ising models has been extensively characterized [12], but its connections to structure learning have only been explored in a few recent works (e.g., [13]). This work establishes that consistent structure learning is feasible under a slightly weaker condition than the usual notion of correlation decay for a rich class of graphs. Moreover, we show that the Gaussian analog of correlation decay is the so-called walk-summability condition [14]. This is a somewhat unexpected and surprising connection since walk-summability is a condition used to characterize the performance of inference algorithms such as loopy belief propagation (LBP). 
Our work demonstrates that both successful inference and learning hinge on similar properties of the Gaussian graphical model.

2 Preliminaries

2.1 Graphical Models

A p-dimensional graphical model is a family of p-dimensional multivariate distributions Markov on some undirected graph G = (V, E) [1]. Each node i ∈ V of the graph is associated with a random variable Xi taking values in a set X. We consider both discrete (in particular Ising) models, where X is a finite set, and Gaussian models, where X = R. The set of edges E captures the conditional-independence relationships among the random variables. More specifically, the vector of random variables X := (X1, . . . , Xp) with joint distribution P satisfies the global Markov property with respect to a graph G if, for all disjoint sets A, B ⊂ V, we have

P(xA, xB | xS) = P(xA | xS) P(xB | xS), (1)

where the set S is a separator¹ between A and B. The Hammersley-Clifford theorem states that under the positivity condition, given by P(x) > 0 for all x ∈ X^p [1], the model P satisfies the global Markov property according to a graph G if and only if it factorizes according to the cliques of G.

We consider the class of Ising models, i.e., binary pairwise models which factorize according to the edges of the graph. More precisely, the probability mass function (pmf) of an Ising model is

P(x) ∝ exp[ (1/2) x^T JG x + h^T x ], x ∈ {−1, 1}^p. (2)

For Gaussian graphical models, the probability density function (pdf) is of the form

f(x) ∝ exp[ −(1/2) x^T JG x + h^T x ], x ∈ R^p. (3)

¹A set S ⊂ V is a separator of sets A and B if the removal of nodes in S separates A and B into distinct components.

In both cases, the matrix JG is called the potential or information matrix and h the potential vector. 
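As a small numerical illustration of the Gaussian parameterization in (3) (not code from the paper; the 4-node chain graph and the edge value r = 0.3 are our own choices), the sparsity of JG encodes the graph, while the covariance JG^{-1} is generally dense:

```python
import numpy as np

# Gaussian graphical model on the chain 1-2-3-4, with diagonal of J_G
# normalized to unity and a common off-diagonal value -r on each edge.
p = 4
r = 0.3
J = np.eye(p)
for i, j in [(0, 1), (1, 2), (2, 3)]:   # edges of the chain
    J[i, j] = J[j, i] = -r              # off-diagonal entries of J_G
Sigma = np.linalg.inv(J)                # covariance of the model

# Sparsity of J_G matches the graph: J_G(i, j) = 0 for non-edges ...
assert J[0, 2] == 0 and J[0, 3] == 0
# ... yet non-adjacent nodes are still correlated: the covariance is dense.
assert abs(Sigma[0, 3]) > 1e-12
```

This is why thresholding raw correlations does not recover the graph, and conditional quantities are needed.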
For both Ising and Gaussian models, the sparsity pattern of the matrix JG corresponds to that of the graph G, i.e., JG(i, j) = 0 if and only if (i, j) ∉ G.

We assume that the potentials are uniformly bounded above and below as:

Jmin ≤ |JG(i, j)| ≤ Jmax, ∀ (i, j) ∈ G. (4)

Our results on structure learning depend on Jmin and Jmax, which is fairly natural: intuitively, models with edge potentials that are "too small" or "too large" are harder to learn than those with comparable potentials, i.e., homogeneous models.

Notice that the conventional parameterizations for the Ising models in (2) and the Gaussian models in (3) are slightly different. Without loss of generality, for the Ising model, we assume that J(i, i) = 0 for all i ∈ V. On the other hand, in the Gaussian setting, we assume that the diagonal elements of the inverse covariance (or information) matrix JG are normalized to unity (J(i, i) = 1, i ∈ V), and that JG can be decomposed as JG = I − RG, where RG is the matrix of partial correlation coefficients [14].

We consider the problem of structure learning, which involves the estimation of the edge set of the graph G given n i.i.d. samples X1, . . . , Xn drawn either from the Ising model in (2) or the Gaussian model in (3). We consider the high-dimensional regime, where both p and n grow simultaneously; typically, the growth of p is much faster than that of n.

2.2 Tractable Graph Families

We consider the class of graphical models Markov on a graph Gp belonging to some ensemble G(p) of graphs with p nodes. We emphasize that in our formulation the graph ensemble G(p) can either be deterministic or random; in the latter case, we also specify a probability measure over the set of graphs in G(p). In the random setting, we say that almost every (a.e.) 
graph G ∼ G(p) satisfies a certain property Q (for example, connectedness) if lim_{p→∞} P[Gp satisfies Q] = 1. In other words, the property Q holds asymptotically almost surely² (a.a.s.) with respect to the random graph ensemble G(p). Intuitively, this means that graphs that have a vanishing probability of occurrence as p → ∞ are ignored. Our conditions and theoretical guarantees will be based on this notion for random graph ensembles.

We now characterize the ensemble of graphs amenable to consistent structure estimation. For γ ∈ N, let Bγ(i; G) denote the set of vertices within distance γ from node i with respect to graph G. Let Hγ,i := G(Bγ(i; G)) denote the subgraph of G spanned by Bγ(i; G), but in addition, we retain the nodes not in Bγ(i; G) (and remove the corresponding edges).

Definition 1 (γ-Local Separator) Given a graph G, a γ-local separator Sγ(i, j) between i and j, for (i, j) ∉ G, is a minimal vertex separator³ with respect to the subgraph Hγ,i. The parameter γ is referred to as the path threshold for local separation.

In other words, the γ-local separator Sγ(i, j) separates nodes i and j with respect to paths in G of length at most γ. We now characterize the ensemble of graphs based on the size of local separators.

Definition 2 ((η, γ)-Local Separation Property) An ensemble of graphs G(p; η, γ) satisfies the (η, γ)-local separation property if for a.e. Gp ∈ G(p; η, γ),

max_{(i,j) ∉ Gp} |Sγ(i, j)| ≤ η. (5)

In Section 3, we propose an efficient algorithm for graphical model selection when the underlying graph belongs to a graph ensemble G(p; η, γ) with sparse local node separators (i.e., with small η). 
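Definition 1 can be made concrete on a toy graph. The following brute-force sketch (our own illustration, feasible only for small graphs) computes the size of a smallest vertex set that meets every path of length at most γ between two non-neighbors, i.e., a γ-local separator in the sense used above:

```python
from itertools import combinations

def paths_up_to(adj, i, j, gamma):
    """All simple i-j paths with at most gamma edges (depth-first search)."""
    out, stack = [], [(i, [i])]
    while stack:
        u, path = stack.pop()
        if u == j:
            out.append(path)
            continue
        if len(path) > gamma:          # path already has gamma edges; stop
            continue
        for v in adj[u]:
            if v not in path:
                stack.append((v, path + [v]))
    return out

def local_separator_size(adj, i, j, gamma):
    """Smallest vertex set (excluding i, j) hitting every short i-j path."""
    paths = paths_up_to(adj, i, j, gamma)
    interior = set(v for path in paths for v in path[1:-1])
    for k in range(len(interior) + 1):
        for S in combinations(interior, k):
            if all(set(S) & set(path[1:-1]) for path in paths):
                return k
    return 0

# 4-cycle 0-1-2-3-0: nodes 0 and 2 are non-neighbors joined by two
# vertex-disjoint paths of length 2, so the 2-local separator has size 2,
# consistent with eta = 2 in the "(eta, gamma)-local paths" view of Example 2.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
assert local_separator_size(adj, 0, 2, gamma=2) == 2
```

Note that for γ = 1 there are no short paths between 0 and 2 at all, so the empty set already separates them locally.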
Below we provide examples of three graph families which satisfy (5) for small η.

²Note that the term a.a.s. does not apply to deterministic graph ensembles G(p), where no randomness is assumed; in this setting, we assume that the property Q holds for every graph in the ensemble.
³A minimal separator is a separator of smallest cardinality.

(Example 1) Bounded Degree: Any (deterministic or random) ensemble of degree-bounded graphs GDeg(p, Δ) satisfies the (η, γ)-local separation property with η = Δ and every γ ∈ N. Thus, our algorithm consistently recovers graphs with small (bounded) degrees (Δ = O(1)). This case was considered previously in several works, e.g., [15, 16].

(Example 2) Bounded Local Paths: The (η, γ)-local separation property also holds when there are at most η paths of length at most γ in G between any two nodes (henceforth termed the (η, γ)-local paths property). In other words, there are at most η − 1 overlapping⁴ cycles of length smaller than 2γ. Thus, a graph with girth g (the length of its shortest cycle) satisfies the (η, γ)-local separation property with η = 1 and γ = g. For example, the bipartite Ramanujan graph [17, p. 107] and the random Cayley graphs [18] have large girths. The girth condition can be weakened to allow for a small number of short cycles, while not allowing for overlapping cycles. Such graphs are termed locally tree-like. For instance, the ensemble of Erdős-Rényi graphs GER(p, c/p), where an edge between any node pair appears with probability c/p, independently of other node pairs, is locally tree-like. It can be shown that GER(p, c/p) satisfies the (η, γ)-local separation property with η = 2 and γ ≤ log p / (4 log c) a.a.s. Similar observations apply for the more general scale-free or power-law graphs [8, 19]. 
Along similar lines, the ensemble of Δ-random regular graphs, denoted by GReg(p, Δ), which is the uniform ensemble of regular graphs with degree Δ, has no overlapping cycles of length at most Θ(log_{Δ−1} p) a.a.s. [20, Lemma 1].

(Example 3) Small-World Graphs: The class of hybrid graphs or augmented graphs [8, Ch. 12] consists of graphs which are the union of two graphs: a "local" graph having short cycles and a "global" graph having small average distances between nodes. Since the hybrid graph is the union of these local and global graphs, it simultaneously has large degrees and short cycles. The simplest model GWatts(p, d, c/p), first studied by Watts and Strogatz [9], consists of the union of a d-dimensional grid and an Erdős-Rényi random graph with parameter c. One can check that a.e. graph G ∼ GWatts(p, d, c/p) satisfies the (η, γ)-local separation property in (5) with η = d + 2 and γ ≤ log p / (4 log c). Similar observations apply for more general hybrid graphs studied in [8, Ch. 12].

3 Method and Guarantees

3.1 Assumptions

(A1) Scaling Requirements: We consider the asymptotic setting where both the number of variables (nodes) p and the number of samples n go to infinity. We assume that the parameters (n, p, Jmin) scale in the following fashion:⁵

n = ω(J_min^{−4} log p). (6)

We require that the number of nodes p → ∞ to exploit the local separation properties of the class of graphs under consideration.

(A2a) Strict Walk-summability for Gaussian Models: The Gaussian graphical model Markov on almost every Gp ∼ G(p) is α-walk summable, i.e.,

‖R̄Gp‖ ≤ α < 1, (7)

where α is a constant (i.e., not a function of p) and R̄Gp := [|RGp(i, j)|] is the entry-wise absolute value of the partial correlation matrix RGp. In addition, ‖·‖ denotes the spectral norm, which for symmetric matrices is given by the maximum absolute eigenvalue.

(A2b) Bounded Potentials for Ising Models: The Ising model Markov on a.e. Gp ∼ G(p) has its maximum absolute potential below a threshold J*. More precisely,

α := (tanh Jmax) / (tanh J*) < 1. (8)

Furthermore, the ratio α in (8) is not a function of p. See [21, 22] for an explicit characterization of J* for specific graph ensembles.

(A3) Local-Separation Property: We assume that the ensemble of graphs G(p; η, γ) satisfies the (η, γ)-local separation property with η, γ ∈ N satisfying

η = O(1), Jmin α^{−γ} = ω̃(1), (9)

where α is given by (7) for Gaussian models and by (8) for Ising models.⁶ We can weaken the second requirement in (9) to Jmin α^{−γ} = ω(1) for deterministic graph families (rather than random graph ensembles).

(A4) Edge Potentials: The edge potentials {Ji,j, (i, j) ∈ G} of the Ising model are assumed to be generically drawn from [−Jmax, −Jmin] ∪ [Jmin, Jmax], i.e., our results hold except for a set of Lebesgue measure zero. We also characterize specific classes of models where this assumption can be removed and we allow for all choices of edge potentials. See [21, 22] for details.

⁴Two cycles are said to overlap if they have common vertices.
⁵The notations ω(·), Ω(·), o(·) and O(·) refer to asymptotics as the number of variables p → ∞.

The above assumptions are very general and hold for a rich class of models. Assumption (A1) stipulates the scaling requirements on the number of samples for consistent structure estimation. Assumptions (A2) and (A4) impose constraints on the model parameters. 
Assumption (A3) requires the local-separation property described in Section 2.2 with the path threshold γ satisfying (9). We provide examples of graphs where the above assumptions are met.

Gaussian Models on Girth-bounded Graphs: Consider the ensemble of graphs GDeg,Girth(p; Δ, g) with maximum degree Δ and girth g. We now derive a relationship between Δ and g for the above assumptions to hold. It can be established that for the walk-summability condition in (A2a) to hold for Gaussian models, we require that Jmax = O(1/Δ). When the minimum edge potential achieves this bound (Jmin = Θ(1/Δ)), a sufficient condition for (A3) to hold is given by

Δ α^g = o(1). (10)

In (10), we notice a natural tradeoff between the girth and the maximum degree of the graph ensemble for successful estimation under our framework: graphs with large degrees can be learned efficiently if their girths are large. Indeed, in the extreme case of trees, which have infinite girth, in accordance with (10), there is no constraint on the node degrees for consistent graphical model selection; recall that the Chow-Liu algorithm [3] is an efficient method for model selection on tree-structured graphical models.

Note that the condition in (10) allows the maximum degree bound Δ to grow with the number of nodes as long as the girth g also grows appropriately. For example, if the maximum degree scales as Δ = O(poly(log p)) and the girth scales as g = O(log log p), then (10) is satisfied. This implies that graphs with fairly large degrees and short cycles can be recovered consistently using the algorithm in Section 3.2.

Gaussian Models on Erdős-Rényi and Small-World Graphs: We can also conclude that a.e. 
Erdős-Rényi graph G ∼ GER(p, c/p) satisfies (9) with η = 2 when c = O(poly(log p)) under the best possible scaling for Jmin subject to the walk-summability constraint in (7). Similarly, the small-world ensemble GWatts(p, d, c/p) satisfies (9) with η = d + 2 when d = O(1) and c = O(poly(log p)).

Ising Models: For Ising models, the best possible scaling of the minimum edge potential Jmin is when Jmin = Θ(J*), for the threshold J* defined in (8). For the ensemble of graphs GDeg,Girth(p; Δ, g) with degree Δ and girth g, we can establish that J* = Θ(1/Δ). When the minimum edge potential achieves the threshold, i.e., Jmin = Θ(1/Δ), we end up with a similar requirement as in (10) for Gaussian models. Similarly, for both the Erdős-Rényi graph ensemble GER(p, c/p) and the small-world ensemble GWatts(p, d, c/p), we can establish that the threshold J* = Θ(1/c), and thus the observations made for the Gaussian setting hold for the Ising model as well.

3.2 Conditional Mutual Information Threshold Test

Our structure learning procedure is known as the Conditional Mutual Information Threshold Test (CMIT). Let CMIT(x^n; ξn,p, η) be the output edge set from CMIT given n i.i.d. samples x^n, a threshold ξn,p and a constant η ∈ N. The conditional mutual information test proceeds as follows: one computes the empirical conditional mutual information⁷ for each node pair (i, j) ∈ V² and finds the conditioning set which achieves the minimum over all subsets of cardinality at most η,

min_{S ⊂ V \ {i,j}, |S| ≤ η} Î(Xi; Xj | XS), (11)

where Î(Xi; Xj | XS) denotes the empirical conditional mutual information of Xi and Xj given XS. If the above minimum value exceeds the given threshold ξn,p, then the node pair is declared as an edge. 
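The rule in (11) can be sketched for binary (Ising-type) samples as follows. This is an illustrative plug-in implementation of ours: the estimator details, threshold value and variable names are assumptions for the sketch, not a specification from the paper.

```python
import numpy as np
from itertools import combinations

def empirical_cmi(X, i, j, S):
    """Plug-in estimate of I(X_i; X_j | X_S) from samples X (n x p, +/-1)."""
    n = X.shape[0]
    cols = X[:, list(S)]
    cmi = 0.0
    strata = set(map(tuple, cols)) if S else {()}
    for s in strata:
        mask = np.all(cols == s, axis=1) if S else np.ones(n, bool)
        ns = mask.sum()
        for a in (-1, 1):
            for b in (-1, 1):
                pab = np.mean((X[mask, i] == a) & (X[mask, j] == b))
                pa = np.mean(X[mask, i] == a)
                pb = np.mean(X[mask, j] == b)
                if pab > 0:                       # skip empty cells
                    cmi += (ns / n) * pab * np.log(pab / (pa * pb))
    return cmi

def cmit(X, xi, eta):
    """Declare (i, j) an edge iff the minimum in (11) exceeds threshold xi."""
    n, p = X.shape
    edges = set()
    for i, j in combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        stat = min(empirical_cmi(X, i, j, S)
                   for k in range(eta + 1) for S in combinations(others, k))
        if stat > xi:
            edges.add((i, j))
    return edges
```

For instance, on samples from a 3-variable Markov chain of ±1 spins, cmit(X, xi=0.05, eta=1) should keep only the two chain edges, since Î(X1; X3 | X2) concentrates near zero while the edge statistics stay bounded away from the threshold.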
Recall that the conditional mutual information satisfies I(Xi; Xj | XS) = 0 iff, given XS, the random variables Xi and Xj are conditionally independent.

⁶We say that two sequences f(p), g(p) satisfy f(p) = ω̃(g(p)) if f(p)/(g(p) log p) → ∞ as p → ∞.
⁷The empirical conditional mutual information is obtained by first computing the empirical distribution and then computing its conditional mutual information.

Thus, (11) seeks to identify non-neighbors, i.e., node pairs which can be separated in the unknown graph G. However, since we constrain the conditioning set |S| ≤ η in (11), the optimal conditioning set may not form an exact separator. Despite this restriction, we establish that the above test can correctly classify the edges and non-neighbors using a suitable threshold ξn,p subject to the assumptions (A1)–(A4). The threshold ξn,p is chosen as a function of the number of nodes p, the number of samples n, and the minimum edge potential Jmin as follows:

ξn,p = O(J_min²), ξn,p = ω(α^{2γ}), ξn,p = Ω(log p / n), (12)

where γ is the path threshold in (5) for (η, γ)-local separation to hold and α is given by (7) and (8). The computational complexity of the CMIT algorithm is O(p^{η+2}). Thus the algorithm is computationally efficient for small η. Moreover, the algorithm only uses statistics of order η + 2, in contrast to the convex-relaxation approaches [4–6], which typically use higher-order statistics.

Theorem 1 (Structural consistency of CMIT) Assume that (A1)–(A4) hold. Given a Gaussian graphical model or an Ising model Markov on a graph Gp ∼ G(p; η, γ), CMIT(x^n; ξn,p, η) is structurally consistent. In other words,

lim_{n,p→∞} P[CMIT({x^n}; ξn,p, η) ≠ Gp] = 0. (13)

Consistency guarantee: The CMIT algorithm consistently recovers the structure of the graphical models with probability tending to one, and the probability measure in (13) is with respect to both the graph and the samples.

Sample complexity: The sample complexity of CMIT scales as Ω(J_min^{−4} log p) and is favorable when the minimum edge potential Jmin is large. This is intuitive since the edges have stronger potentials when Jmin is large. On the other hand, Jmin cannot be arbitrarily large due to assumption (A2). The minimum sample complexity is attained when Jmin achieves this upper bound.

It can be established that for both Gaussian and Ising models Markov on a degree-bounded graph ensemble GDeg(p, Δ) with maximum degree Δ and satisfying assumption (A3), the minimum sample complexity is given by n = Ω(Δ⁴ log p), i.e., when Jmin = Θ(1/Δ).

We can have improved guarantees for the Erdős-Rényi random graphs GER(p, c/p). In the Gaussian setting, the minimum sample complexity can be improved to n = Ω(Δ² log p), i.e., when Jmin = Θ(1/√Δ), where the maximum degree scales as Δ = Θ(log p log c) [7].

On the other hand, for Ising models, the minimum sample complexity can be further improved to n = Ω(c⁴ log p), i.e., when Jmin = Θ(J*) = Θ(1/c). Note that c/2 is the expected degree of the GER(p, c/p) ensemble. Specifically, when the Erdős-Rényi random graphs have a bounded average degree (c = O(1)), we can obtain a minimum sample complexity of n = Ω(log p) for structure estimation of Ising models. Recall that the sample complexity of learning tree models is Ω(log p) [23]. 
Thus, the complexity of learning sparse Erdős-Rényi random graphs is akin to learning trees in certain parameter regimes.

The sample complexity of structure estimation can be improved to n = Ω(J_min^{−2} log p) by employing empirical conditional covariances for Gaussian models and empirical conditional variation distances in place of empirical conditional mutual information. However, to present a unified framework for Gaussian and Ising models, we present the CMIT here. See [21, 22] for details.

Comparison with convex-relaxation approaches: We now compare our approach for structure learning with convex-relaxation methods. The work by Ravikumar et al. in [5] employs an ℓ1-penalized likelihood estimator and, under the so-called incoherence conditions, the sample complexity is n = Ω((Δ² + J_min^{−2}) log p). Our sample complexity (using conditional covariances) n = Ω(J_min^{−2} log p) is the same in terms of Jmin, while there is no explicit dependence on the maximum degree Δ. Similarly, we match the neighborhood-based regression method by Meinshausen and Bühlmann in [24] under more transparent conditions.

For structure estimation of Ising models, the work in [6] considers ℓ1-penalized logistic regression, which has a sample complexity of n = Ω(Δ³ log p) for a degree-bounded ensemble GDeg(p, Δ) satisfying certain "incoherence" conditions. The sample complexity of CMIT, given by n = Ω(Δ⁴ log p), is slightly worse, while the modified algorithm described previously has a sample complexity of n = Ω(Δ² log p) for general degree-bounded ensembles. Additionally, under the CMIT algorithm, we can guarantee an improved sample complexity of n = Ω(c⁴ log p) for Erdős-Rényi random graphs GER(p, c/p) and small-world graphs GWatts(p, d, c/p), since the average degree c/2 is typically much smaller than the maximum degree Δ. 
Moreover, note that the incoherence conditions stated in [6] are NP-hard to establish for general models since they involve the partition function of the model. In contrast, our conditions are transparent and relate to the statistical-physical properties of the model. Moreover, our algorithm is local and requires only low-order statistics, while the method in [6] requires full-order statistics.

Proof Outline: We first analyze the scenario when exact statistics are available. (i) We establish that for any two non-neighbors (i, j) ∉ G, the minimum conditional mutual information in (11) (based on exact statistics) does not exceed the threshold ξn,p. (ii) Similarly, we also establish that the conditional mutual information in (11) exceeds the threshold ξn,p for all neighbors (i, j) ∈ G. (iii) We then extend these results to empirical versions using concentration bounds. See [21, 22] for details.

The main challenge in our proof is step (i). To this end, we analyze the conditional mutual information when the conditioning set is a local separator between i and j, and establish that it decays as p → ∞. The techniques involved to establish this for Ising and Gaussian models are different: for Ising models, we employ the self-avoiding walk (SAW) tree construction [25]. For Gaussian models, we use the techniques from walk-sum analysis [14].

4 Necessary Conditions for Model Selection

In the previous sections, we proposed and analyzed efficient algorithms for learning the structure of graphical models. We now derive necessary conditions for consistent structure learning. We focus on the ensemble of Erdős-Rényi graphs GER(p, c/p).

For the class of degree-bounded graphs GDeg(p, Δ), necessary conditions on sample complexity have been characterized previously [26] by considering a certain (restricted) set of ensembles. 
However, a naïve application of such bounds (based on Fano's inequality [27, Ch. 2]) turns out to be too weak for the class of Erdős-Rényi graphs GER(p, c/p). We provide novel necessary conditions for structure learning of Erdős-Rényi graphs. Our techniques may also be applicable to other classes of random graphs.

Recall that a graph G is drawn from the ensemble of Erdős-Rényi graphs GER(p, c/p). Given n i.i.d. samples X^n := (X1, . . . , Xn) ∈ (X^p)^n, the task is to estimate G from X^n. Denote the estimated graph as Ĝ := Ĝ(X^n). It is desired to derive tight necessary conditions on the number of samples n (as a function of the average degree c/2 and the number of nodes p) so that the probability of error P_e^{(p)} := P(Ĝ(X^n) ≠ G) → 0 as the number of nodes p tends to infinity. Again, note that the probability measure P is with respect to both the Erdős-Rényi graph and the samples.

Discrete Graphical Models: Let Hb(q) := −q log2 q − (1 − q) log2(1 − q) be the binary entropy function. For the Ising model, or more generally any discrete model where each random variable Xi ∈ X = {1, . . . , |X|}, we can demonstrate the following:

Theorem 2 (Weak Converse for Discrete Models) For a discrete graphical model Markov on G ∼ GER(p, c/p), if P_e^{(p)} → 0, it is necessary for n to satisfy

n ≥ (1 / (p log2 |X|)) (p choose 2) Hb(c/p) ≥ (c log2 p) / (2 log2 |X|). (14)

The above bound does not involve any asymptotic notation and shows transparently how n has to depend on p, c and |X| for consistent structure learning. Note that if the cardinality |X| of the random variables is large, then the necessary sample complexity is small, which makes intuitive sense from a source-coding perspective. Moreover, the above bound states that more samples are required as the average degree c/2 increases. 
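The two sides of (14) can be checked numerically. In the following sketch the values p = 10^4, c = 2 and |X| = 2 are illustrative choices of ours, not parameters from the paper:

```python
import math

def Hb(q):
    """Binary entropy in bits."""
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def exact_bound(p, c, card):
    """Exact Fano-type bound of (14): binom(p,2) Hb(c/p) / (p log2|X|)."""
    return math.comb(p, 2) * Hb(c / p) / (p * math.log2(card))

def simple_bound(p, c, card):
    """Closed-form lower bound of (14): c log2(p) / (2 log2|X|)."""
    return c * math.log2(p) / (2 * math.log2(card))

p, c, card = 10_000, 2.0, 2     # 10^4 nodes, average degree c/2 = 1, binary
assert exact_bound(p, c, card) >= simple_bound(p, c, card)
```

Both bounds grow only logarithmically in p and linearly in c, matching the discussion above.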
Our bound involves only the average degree c/2 and not the maximum degree of the graph, which is typically much larger than c [7].

Gaussian Graphical Models: We now turn our attention to the Gaussian analogue of Theorem 2 under a similar setup. We assume that the α-walk-summability condition in assumption (A2a) holds. We are then able to demonstrate the following:

Theorem 3 (Weak Converse for Gaussian Models) For an α-walk summable Gaussian graphical model Markov on G ∼ GER(p, c/p) as p → ∞, if P_e^{(p)} → 0, we require

n ≥ (2 / (p log2[2πe(1/(1−α) + 1)])) (p choose 2) Hb(c/p) ≥ (c log2 p) / (log2[2πe(1/(1−α) + 1)]). (15)

As with Theorem 2, the above bound does not involve any asymptotic notation, and similar intuitions hold as before. There is a natural logarithmic dependence on p and a linear dependence on the average degree parameter c. Finally, the dependence on α can be explained as follows: any α-walk-summable model is also β-walk-summable for all β > α. Thus, the class of β-walk-summable models contains the class of α-walk-summable models. This results in a looser bound in (15) for large α.

Analysis tools: Our analysis tools are information-theoretic in nature. A common tool for deriving necessary conditions is to resort to Fano's inequality [27, Ch. 2], which (lower) bounds the probability of error P_e^{(p)} as a function of the conditional entropy H(G|X^n) and the size of the set of all graphs with p nodes. However, a naïve application of Fano's inequality results in a trivial lower bound, as the set of all graphs which can be realized by GER(p, c/p) is "too large".

To ameliorate this problem, we focus our attention on typical graphs when applying Fano's inequality, rather than all graphs. 
The set of typical graphs has small cardinality but high probability when p is large. The novelty of our proof lies in our use of both typicality and Fano's inequality to derive necessary conditions for structure learning. We can show that (i) the probability of the typical set tends to one as p → ∞, (ii) the graphs in the typical set are almost uniformly distributed (the asymptotic equipartition property), and (iii) the cardinality of the typical set is small relative to the set of all graphs. These properties are used to prove Theorems 2 and 3.

5 Conclusion

In this paper, we adopted a novel and unified paradigm for graphical model selection. We presented a simple local algorithm for structure estimation with low computational and sample complexities under a set of mild and transparent conditions. This algorithm succeeds on a wide range of graph ensembles such as the Erdős-Rényi ensemble and small-world networks. We also employed novel information-theoretic techniques for establishing necessary conditions for graphical model selection.

Acknowledgement

The first author is supported by the setup funds at UCI and in part by AFOSR under Grant FA9550-10-1-0310; the second author is supported by A*STAR, Singapore; and the third author is supported in part by AFOSR under Grant FA9550-08-1-1080.

References

[1] S. Lauritzen, Graphical Models. Clarendon Press, 1996.
[2] D. Karger and N. Srebro, "Learning Markov Networks: Maximum Bounded Tree-width Graphs," in Proc. of ACM-SIAM Symposium on Discrete Algorithms, 2001, pp. 392–401.
[3] C. Chow and C. Liu, "Approximating Discrete Probability Distributions with Dependence Trees," IEEE Tran. on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[4] A. d'Aspremont, O. Banerjee, and L. El Ghaoui, "First-order methods for sparse covariance selection," SIAM J. Matrix Anal. & Appl., vol. 30, no. 1, pp. 56–66, 2008.
[5] P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu, "High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence," Arxiv preprint arXiv:0811.3628, 2008.
[6] P. Ravikumar, M. Wainwright, and J. Lafferty, "High-dimensional Ising Model Selection Using ℓ1-Regularized Logistic Regression," Annals of Statistics, 2008.
[7] B. Bollobás, Random Graphs. Academic Press, 1985.
[8] F. Chung and L. Lu, Complex Graphs and Networks. Amer. Mathematical Society, 2006.
[9] D. Watts and S. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, pp. 440–442, 1998.
[10] M. Newman, D. Watts, and S. Strogatz, "Random graph models of social networks," Proc. of the National Academy of Sciences of the United States of America, vol. 99, no. Suppl 1, 2002.
[11] R. Albert and A. Barabási, "Statistical mechanics of complex networks," Reviews of Modern Physics, vol. 74, no. 1, p. 47, 2002.
[12] H. Georgii, Gibbs Measures and Phase Transitions. Walter de Gruyter, 1988.
[13] J. Bento and A. Montanari, "Which Graphical Models are Difficult to Learn?" in Proc. of Neural Information Processing Systems (NIPS), Vancouver, Canada, Dec. 2009.
[14] D. Malioutov, J. Johnson, and A. Willsky, "Walk-Sums and Belief Propagation in Gaussian Graphical Models," J. of Machine Learning Research, vol. 7, pp. 2031–2064, 2006.
[15] G. Bresler, E. Mossel, and A. Sly, "Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms," in Intl. Workshop APPROX: Approximation, Randomization and Combinatorial Optimization. Springer, 2008, pp. 343–356.
[16] P. Netrapalli, S. Banerjee, S. Sanghavi, and S. Shakkottai, "Greedy Learning of Markov Network Structure," in Proc. of Allerton Conf. on Communication, Control and Computing, Monticello, USA, Sept. 2010.
[17] F. Chung, Spectral Graph Theory. Amer. Mathematical Society, 1997.
[18] A. Gamburd, S. Hoory, M. Shahshahani, A. Shalev, and B. Virag, "On the girth of random Cayley graphs," Random Structures & Algorithms, vol. 35, no. 1, pp. 100–117, 2009.
[19] S. Dommers, C. Giardinà, and R. van der Hofstad, "Ising models on power-law random graphs," Journal of Statistical Physics, pp. 1–23, 2010.
[20] B. McKay, N. Wormald, and B. Wysocka, "Short cycles in random regular graphs," The Electronic Journal of Combinatorics, vol. 11, no. R66, p. 1, 2004.
[21] A. Anandkumar, V. Y. F. Tan, and A. S. Willsky, "High-Dimensional Structure Learning of Ising Models: Tractable Graph Families," Preprint, available on ArXiv 1107.1736, June 2011.
[22] ——, "High-Dimensional Gaussian Graphical Model Selection: Tractable Graph Families," Preprint, ArXiv 1107.1270, June 2011.
[23] V. Tan, A. Anandkumar, and A. Willsky, "Learning Markov Forest Models: Analysis of Error Rates," J. of Machine Learning Research, vol. 12, pp. 1617–1653, May 2011.
[24] N. Meinshausen and P. Buehlmann, "High Dimensional Graphs and Variable Selection With the Lasso," Annals of Statistics, vol. 34, no. 3, pp. 1436–1462, 2006.
[25] D. Weitz, "Counting independent sets up to the tree threshold," in Proc. of ACM Symp. on Theory of Computing, 2006, pp. 140–149.
[26] W. Wang, M. Wainwright, and K. Ramchandran, "Information-theoretic bounds on model selection for Gaussian Markov random fields," in IEEE International Symposium on Information Theory Proceedings (ISIT), Austin, TX, June 2010.
[27] T. Cover and J. Thomas, Elements of Information Theory. John Wiley & Sons, Inc., 2006.