{"title": "Which graphical models are difficult to learn?", "book": "Advances in Neural Information Processing Systems", "page_first": 1303, "page_last": 1311, "abstract": "We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it).", "full_text": "Which graphical models are dif\ufb01cult to learn?\n\nDepartment of Electrical Engineering\n\nDepartment of Electrical Engineering and\n\nAndrea Montanari\n\nDepartment of Statistics\n\nStanford University\n\nJos\u00b4e Bento\n\nStanford University\n\njbento@stanford.edu\n\nmontanari@stanford.edu\n\nAbstract\n\nWe consider the problem of learning the structure of Ising models (pairwise bi-\nnary Markov random \ufb01elds) from i.i.d. samples. While several methods have\nbeen proposed to accomplish this task, their relative merits and limitations remain\nsomewhat obscure. By analyzing a number of concrete examples, we show that\nlow-complexity algorithms systematically fail when the Markov random \ufb01eld de-\nvelops long-range correlations. More precisely, this phenomenon appears to be\nrelated to the Ising model phase transition (although it does not coincide with it).\n\n1\n\nIntroduction and main results\n\nGiven a graph G = (V = [p], E), and a positive parameter \u03b8 > 0 the ferromagnetic Ising model on\nG is the pairwise Markov random \ufb01eld\n\n\u00b5G,\u03b8(x) =\n\n1\n\nZG,\u03b8 Y(i,j)\u2208E\n\ne\u03b8xixj\n\n(1)\n\nover binary variables x = (x1, x2, . . . , xp). 
Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters \u03b8ij, (i, j) \u2208 E, is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.)\nIn this paper we study the following structural learning problem: Given n i.i.d. samples x(1), x(2), . . . , x(n) with distribution \u00b5G,\u03b8(\u00b7), reconstruct the graph G. For the sake of simplicity, we assume that the parameter \u03b8 is known, and that G has no double edges (it is a \u2018simple\u2019 graph).\nThe graph learning problem is solvable with unbounded sample complexity and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter \u03b8 is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value \u03b8 of the model parameter, and a small \u03b4 > 0, the sample complexity is defined as\n\nnAlg(G, \u03b8) \u2261 inf{n \u2208 N : Pn,G,\u03b8{Alg(x(1), . . . , x(n)) = G} \u2265 1 \u2212 \u03b4} ,   (2)\n\nwhere Pn,G,\u03b8 denotes probability with respect to n i.i.d. samples with distribution \u00b5G,\u03b8. Further, we let \u03c7Alg(G, \u03b8) denote the number of operations of the algorithm Alg, when run on nAlg(G, \u03b8) samples.1\n\n1For the algorithms analyzed in this paper, the behavior of nAlg and \u03c7Alg does not change significantly if we require only \u2018approximate\u2019 reconstruction (e.g. in graph distance).\n\nThe general problem is therefore to characterize the functions nAlg(G, \u03b8) and \u03c7Alg(G, \u03b8), in particular for an optimal choice of the algorithm. 
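Definition (2) can also be probed numerically for any concrete learner: fix G, draw repeated batches of n samples, and record the empirical success frequency. The sketch below is our own illustration of the definition; the `alg` and `sampler` callables are hypothetical placeholders standing in for a learning algorithm and a sampler from \u00b5G,\u03b8:

```python
def estimate_sample_complexity(alg, sampler, true_edges, delta, n_grid, trials, rng):
    """Empirical analogue of definition (2): smallest n in n_grid for which
    alg recovers the true edge set with frequency >= 1 - delta.

    alg(samples) -> set of edges; sampler(n, rng) -> list of n configurations.
    Returns None if no n in the grid reaches the target success rate."""
    target = set(true_edges)
    for n in sorted(n_grid):
        wins = sum(alg(sampler(n, rng)) == target for _ in range(trials))
        if wins / trials >= 1 - delta:
            return n
    return None
```

This Monte Carlo probe only upper-bounds the true nAlg up to the statistical error of `trials` repetitions; it is how the success-probability curves of Section 2 are obtained in spirit.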
General bounds on nAlg(G, \u03b8) have been given in [2, 3], under the assumption of unbounded computational resources. A general characterization of how well low-complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and \u03c7Alg for a number of graph models, as a function of \u03b8, and unveil a fascinating universal pattern: when the model (1) develops long-range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi}i\u2208V become strongly correlated for \u03b8 large. For a large class of graphs with degree bounded by \u2206, this phenomenon corresponds to a phase transition beyond some critical value of \u03b8 uniformly bounded in p, with typically \u03b8crit \u2264 const./\u2206. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it).\n\n1.1 A toy example: the thresholding algorithm\n\nIn order to illustrate the interplay between graph structure, sample complexity and interaction strength \u03b8, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations\n\n\u0108ij \u2261 (1/n) \u2211\u2113=1..n xi(\u2113) xj(\u2113)   for i, j \u2208 V .   (3)\n\nTHRESHOLDING( samples {x(\u2113)}, threshold \u03c4 )\n1: Compute the empirical correlations {\u0108ij}(i,j)\u2208V\u00d7V ;\n2: For each (i, j) \u2208 V \u00d7 V\n3:   If \u0108ij \u2265 \u03c4 , set (i, j) \u2208 E;\n\nWe will denote this algorithm by Thr(\u03c4 ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. \u03c7Thr(\u03c4) = O(p^2 n). The sample complexity nThr(\u03c4) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper).\nTheorem 1.1. 
If G has maximum degree \u2206 > 1 and if \u03b8 < atanh(1/(2\u2206)), then there exists \u03c4 = \u03c4 (\u03b8) such that\n\nnThr(\u03c4)(G, \u03b8) \u2264 [8/(tanh \u03b8 \u2212 1/(2\u2206))^2] log(2p/\u03b4) .   (4)\n\nFurther, the choice \u03c4 (\u03b8) = (tanh \u03b8 + (1/2\u2206))/2 achieves this bound.\nTheorem 1.2. There exists a numerical constant K such that the following is true. If \u2206 > 3 and \u03b8 > K/\u2206, there are graphs of bounded degree \u2206 such that for any \u03c4 , nThr(\u03c4) = \u221e, i.e. the thresholding algorithm always fails with high probability.\n\nThese results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far-apart variables xi, xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if \u03b8 \u2264 const./\u2206. However, for \u03b8 > const./\u2206, there exist families of bounded-degree graphs with long-range correlations.\n\n1.2 More sophisticated algorithms\n\nIn this section we characterize \u03c7Alg(G, \u03b8) and nAlg(G, \u03b8) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long-range correlations. Due to space limitations, we focus on two types of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6.\nIn the following we denote by \u2202i the neighborhood of a node i \u2208 G (i /\u2208 \u2202i), and assume the degree to be bounded: |\u2202i| \u2264 \u2206.\n\n1.2.1 Local Independence Test\n\nA recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6].\nLet us consider, to be definite, the approach of [4], specializing it to the model (1). 
Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2: \u00b5G,\u03b8(xr|x\u2202r). Any change of xi, i \u2208 \u2202r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U \u2286 \u2202r. Then changing the value of xj, j \u2208 U , will produce a noticeable change in the marginal of Xr, even if we condition on the remaining values in U and in any W , |W | \u2264 \u2206. On the other hand, if U \u2284 \u2202r, then it is possible to find W (with |W | \u2264 \u2206) and a node i \u2208 U such that changing its value, after fixing all other values in U \u222a W , will produce no noticeable change in the conditional marginal. (Just choose i \u2208 U\\\u2202r and W = \u2202r\\U .) This procedure allows us to distinguish subsets of \u2202r from other sets of vertices, thus motivating the following algorithm.\n\nLOCAL INDEPENDENCE TEST( samples {x(\u2113)}, thresholds (\u01eb, \u03b3) )\n1: Select a node r \u2208 V ;\n2: Set as its neighborhood the largest candidate neighborhood U of size at most \u2206 for which SCORE(U ) > \u01eb/2;\n3: Repeat for all nodes r \u2208 V ;\n\nThe score function SCORE(\u00b7) depends on ({x(\u2113)}, \u2206, \u03b3) and is defined as follows,\n\nSCORE(U ) \u2261 minW,j maxxi,xW,xU,xj |P\u0302n,G,\u03b8{Xi = xi|X W = xW , X U = xU} \u2212 P\u0302n,G,\u03b8{Xi = xi|X W = xW , X U\\j = xU\\j , Xj = xj}| .   (5)\n\nIn the minimum, |W | \u2264 \u2206 and j \u2208 U . In the maximum, the values must be such that\n\nP\u0302n,G,\u03b8{X W = xW , X U = xU} > \u03b3/2 ,   P\u0302n,G,\u03b8{X W = xW , X U\\j = xU\\j , Xj = xj} > \u03b3/2 .\n\nP\u0302n,G,\u03b8 is the empirical distribution calculated from the samples {x(\u2113)}. 
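Read literally, the score (5) can be evaluated by brute force from empirical conditional distributions. The sketch below is our own rendering (exponential in |U| + |W|, far from the optimized implementation of [4]) and simply makes the min/max structure explicit; all names are assumptions:

```python
from itertools import combinations, product
from math import inf

def cond_prob(samples, target, t_val, cond):
    """Empirical P{X_target = t_val | X_k = v for (k, v) in cond}, together
    with the empirical probability of the conditioning event itself."""
    match = [s for s in samples if all(s[k] == v for k, v in cond.items())]
    if not match:
        return 0.0, 0.0
    return sum(s[target] == t_val for s in match) / len(match), len(match) / len(samples)

def score(samples, r, U, others, Delta, gamma):
    """SCORE(U) for vertex r: min over W (subsets of `others`, |W| <= Delta)
    and j in U of the max change in the empirical conditional of X_r when
    x_j is flipped, restricted to conditioning events of empirical
    probability > gamma / 2."""
    best = inf
    for size in range(Delta + 1):
        for W in combinations(others, size):
            for j in U:
                worst = 0.0
                free = list(W) + list(U)
                for vals in product([-1, +1], repeat=len(free)):
                    cond = dict(zip(free, vals))
                    flipped = dict(cond)
                    flipped[j] = -cond[j]          # flip the candidate node j
                    for xr in (-1, +1):
                        p1, m1 = cond_prob(samples, r, xr, cond)
                        p2, m2 = cond_prob(samples, r, xr, flipped)
                        if m1 > gamma / 2 and m2 > gamma / 2:
                            worst = max(worst, abs(p1 - p2))
                best = min(best, worst)
    return best
```

A true neighbor j keeps the score large (flipping x_j moves the conditional of X_r), while a non-neighbor admits some W that drives it to zero, which is exactly the dichotomy the algorithm thresholds at \u01eb/2.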
We denote this algorithm by Ind(\u01eb, \u03b3). The search over candidate neighborhoods U , the search for minima and maxima in the computation of SCORE(U ), and the computation of P\u0302n,G,\u03b8 all contribute to \u03c7Ind(G, \u03b8).\nBoth theorems that follow are consequences of the analysis of [4].\nTheorem 1.3. Let G be a graph of bounded degree \u2206 \u2265 1. For every \u03b8 there exists (\u01eb, \u03b3), and a numerical constant K, such that\n\nnInd(\u01eb,\u03b3)(G, \u03b8) \u2264 (100\u2206/\u01eb^2\u03b3^4) log(2p/\u03b4) ,   \u03c7Ind(\u01eb,\u03b3)(G, \u03b8) \u2264 K (2p)^{2\u2206+1} log p .\n\nMore specifically, one can take \u01eb = (1/4) sinh(2\u03b8), \u03b3 = e^{\u22124\u2206\u03b8} 2^{\u22122\u2206}.\n\nThis first result implies in particular that G can be reconstructed with polynomial complexity for any bounded \u2206. However, the degree of such a polynomial is pretty high and non-uniform in \u2206. This makes the above approach impractical.\nA way out was proposed in [4]. The idea is to identify a set of \u2018potential neighbors\u2019 of vertex r via thresholding:\n\nB(r) = {i \u2208 V : \u0108ri > \u03ba/2} .   (6)\n\nFor each node r \u2208 V , we evaluate SCORE(U ) by restricting the minimum in Eq. (5) to W \u2286 B(r), and search only over U \u2286 B(r). We call this algorithm IndD(\u01eb, \u03b3, \u03ba). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small \u03b8.\nTheorem 1.4. Let G be a graph of bounded degree \u2206 \u2265 1. Assume that \u03b8 < K/\u2206 for some small enough constant K. 
Then there exist \u01eb, \u03b3, \u03ba such that\n\nnIndD(\u01eb,\u03b3,\u03ba)(G, \u03b8) \u2264 8(\u03ba^2 + 8\u2206) log(4p/\u03b4) ,   \u03c7IndD(\u01eb,\u03b3,\u03ba)(G, \u03b8) \u2264 K\u2032 p \u2206^{\u2206 log(4/\u03ba)/\u03b1} + K\u2032\u2206 p^2 log p .\n\nMore specifically, we can take \u03ba = tanh \u03b8, \u01eb = (1/4) sinh(2\u03b8) and \u03b3 = e^{\u22124\u2206\u03b8} 2^{\u22122\u2206}.\n\n2If a is a vector and R is a set of indices, then we denote by aR the vector formed by the components of a with index in R.\n\n1.2.2 Regularized Pseudo-Likelihoods\n\nA different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and to select sparse graphs, a regularization term is often added [7, 8, 9, 10, 11, 12, 13].\nAs a specific low-complexity implementation of this idea, we consider the \u21131-regularized pseudo-likelihood method of [7]. For each node r, the following likelihood function is considered\n\nL(\u03b8; {x(\u2113)}) = \u2212(1/n) \u2211\u2113=1..n log Pn,G,\u03b8(x(\u2113)r | x(\u2113)\\r)   (7)\n\nwhere x\\r = xV\\r = {xi : i \u2208 V \\ r} is the vector of all variables except xr, and Pn,G,\u03b8 is defined from the following extension of (1),\n\n\u00b5G,\u03b8(x) = (1/ZG,\u03b8) \u220fi,j\u2208V e^{\u03b8ij xi xj}   (8)\n\nwhere \u03b8 = {\u03b8ij}i,j\u2208V is a vector of real parameters. 
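For concreteness, the \u21131-penalized minimization of (7) can be sketched with plain proximal-gradient (ISTA) updates. This is our own minimal rendition, not the implementation used in the paper; it assumes \u00b11-valued samples and uses P(x_r | x_{\r}) = 1/(1 + exp(-2 x_r \u2211_j \u03b8_rj x_j)), which follows from (8):

```python
import math

def pseudo_lik_grad(theta, samples, r):
    """Gradient of the pseudo-likelihood (7) for node r, using
    P(x_r | x_rest) = 1 / (1 + exp(-2 * x_r * h)), h = sum_j theta_j x_j."""
    p = len(samples[0])
    n = len(samples)
    grad = [0.0] * p
    for x in samples:
        h = sum(theta[j] * x[j] for j in range(p) if j != r)
        s = 1.0 / (1.0 + math.exp(2.0 * x[r] * h))  # = sigma(-2 x_r h)
        for j in range(p):
            if j != r:
                grad[j] -= 2.0 * x[r] * x[j] * s / n
    return grad

def rlr_neighborhood(samples, r, lam, step=0.1, iters=500):
    """Sketch of Rlr(lambda): minimize (7) plus lam * l1-penalty by proximal
    gradient (soft-thresholding) and return {j : theta_j > 0}."""
    p = len(samples[0])
    theta = [0.0] * p
    for _ in range(iters):
        g = pseudo_lik_grad(theta, samples, r)
        for j in range(p):
            if j != r:
                t = theta[j] - step * g[j]
                theta[j] = math.copysign(max(abs(t) - step * lam, 0.0), t)
    return {j for j in range(p) if theta[j] > 0}
```

The soft-thresholding step is what sets weakly supported coordinates exactly to zero, so the estimated neighborhood is read off directly from the sign pattern of the minimizer.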
Model (1) corresponds to \u03b8ij = 0, \u2200(i, j) /\u2208 E, and \u03b8ij = \u03b8, \u2200(i, j) \u2208 E.\nThe function L(\u03b8; {x(\u2113)}) depends only on \u03b8r,\u00b7 = {\u03b8rj , j \u2208 \u2202r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(\u03bb),\n\nREGULARIZED LOGISTIC REGRESSION( samples {x(\u2113)}, regularization \u03bb )\n1: Select a node r \u2208 V ;\n2: Calculate \u02c6\u03b8r,\u00b7 = arg min\u03b8r,\u00b7\u2208R^{p\u22121} {L(\u03b8r,\u00b7; {x(\u2113)}) + \u03bb||\u03b8r,\u00b7||1};\n3: If \u02c6\u03b8rj > 0, set (r, j) \u2208 E;\n\nOur first result shows that Rlr(\u03bb) indeed reconstructs G if \u03b8 is sufficiently small.\nTheorem 1.5. There exist numerical constants K1, K2, K3, such that the following is true. Let G be a graph with degree bounded by \u2206 \u2265 3. If \u03b8 \u2264 K1/\u2206, then there exists \u03bb such that\n\nnRlr(\u03bb)(G, \u03b8) \u2264 K2 \u03b8^{\u22122} \u2206 log(8p^2/\u03b4) .   (9)\n\nFurther, the above holds with \u03bb = K3 \u03b8 \u2206^{\u22121/2}.\nThis theorem is proved by noting that for \u03b8 \u2264 K1/\u2206 correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strengthening the error bounds of [7].\nIn order to prove a converse to the above result, we need to make some assumptions on \u03bb. Given \u03b8 > 0, we say that \u03bb is \u2018reasonable\u2019 for that value of \u03b8 if the following conditions hold: (i) Rlr(\u03bb) is successful with probability larger than 1/2 on any star graph (a graph composed of a vertex r connected to \u2206 neighbors, plus isolated vertices); (ii) \u03bb \u2264 \u03b4(n) for some sequence \u03b4(n) \u2193 0.\nTheorem 1.6. There exists a numerical constant K such that the following happens. 
If \u2206 > 3 and \u03b8 > K/\u2206, then there exist graphs G of degree bounded by \u2206 such that for all reasonable \u03bb, nRlr(\u03bb)(G) = \u221e, i.e. regularized logistic regression fails with high probability.\n\n[Figure 1: Learning random subgraphs of a 7 \u00d7 7 (p = 49) two-dimensional grid from n = 4500 Ising model samples, using regularized logistic regression. Left: success probability as a function of the model parameter \u03b8 and of the regularization parameter \u03bb0 (darker corresponds to highest probability). Right: the same data plotted for several choices of \u03bb versus \u03b8. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different \u03bb, and should correspond to optimal regularization.]\n\nThe graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree \u2206.\nThe proof of Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising models, and requires significant computational effort even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structures are difficult to learn. 
We will instead show how to evaluate it on specific graph families.\nUnder the restriction \u03bb \u2192 0, the solutions given by Rlr converge to \u03b8\u2217 with n [7]. Thus, for large n we can expand L around \u03b8\u2217 to second order in (\u03b8 \u2212 \u03b8\u2217). When we add the regularization term to L we obtain a quadratic model analogous to the Lasso, plus the error term due to the quadratic approximation. It is thus not surprising that, when \u03bb \u2192 0, the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model.\n\n2 Numerical experiments\n\nIn order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(\u03bb). Among other learning algorithms, Rlr(\u03bb) strikes a good balance of complexity and performance. Samples from the Ising model (1) were generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for \u03b8 \u2265 \u03b8crit, and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(\u2113)} was indeed the bulk of our computational effort and took about 50 days of CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(\u03bb) had been tested in [7] only on tree graphs G, or in the weakly coupled regime \u03b8 < \u03b8crit. In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier.\nFigure 1 reports the success probability of Rlr(\u03bb) when applied to random subgraphs of a 7 \u00d7 7 two-dimensional grid. Each such graph was obtained by removing each edge independently with probability \u03c1 = 0.3. Success probability was estimated by applying Rlr(\u03bb) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(\u03bb)), using n = 4500 samples. 
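The grid experiment just described can be organized as below. This is only a sketch of the evaluation protocol in our own notation; `samples_for` and `learn_neighborhood` are hypothetical callbacks standing in for the Gibbs sampler and for Rlr(\u03bb):

```python
import random

def diluted_grid_edges(side, rho, rng):
    """Random subgraph of a side x side grid: each edge is removed
    independently with probability rho (rho = 0.3 in the experiment)."""
    edges = set()
    for i in range(side):
        for j in range(side):
            v = i * side + j
            if j + 1 < side and rng.random() >= rho:
                edges.add((v, v + 1))        # horizontal edge kept
            if i + 1 < side and rng.random() >= rho:
                edges.add((v, v + side))     # vertical edge kept
    return edges

def success_probability(edge_sets, samples_for, learn_neighborhood, p):
    """Fraction of (graph, vertex) pairs whose neighborhood is recovered
    exactly -- mirroring the averaging over 8 graphs x 49 vertices."""
    wins = total = 0
    for edges in edge_sets:
        nbrs = {r: set() for r in range(p)}
        for a, b in edges:
            nbrs[a].add(b)
            nbrs[b].add(a)
        samples = samples_for(edges)
        for r in range(p):
            wins += learn_neighborhood(samples, r) == nbrs[r]
            total += 1
    return wins / total
```

Scoring per vertex rather than per graph is what turns 8 diluted grids with p = 49 into 392 runs of the learner.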
We scaled the regularization parameter as \u03bb = 2\u03bb0\u03b8(log p/n)^{1/2} (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over \u03bb0.\nThe data clearly illustrate the phenomenon discussed. Despite the large number of samples n \u226b log p, when \u03b8 crosses a threshold, the algorithm starts performing poorly irrespective of \u03bb. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid, \u03b8crit(\u03c1 = 0.3) \u2248 0.7 [15, 16].\n\n[Figure 2: Learning uniformly random graphs of degree 4 from Ising model samples, using Rlr. Left: success probability as a function of the number of samples n for several values of \u03b8 (\u03b8 = 0.10, . . . , 0.65). Right: the same data plotted for several choices of \u03bb versus \u03b8, as in Fig. 1, right panel.]\n\nFigure 2 presents similar data when G is a uniformly random graph of degree \u2206 = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When \u03b8 is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold, even n = 10^4 samples are too few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get \u03b8thr(\u2206 = 4) \u2248 0.4203, which compares favorably with the data.\n\n3 Proofs\n\nIn order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. 
If M is a matrix and R, P are index sets, then MRP denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct, and define S = \u2202r, Sc = V \\ (\u2202r \u222a r). Since the cost function L(\u03b8; {x(\u2113)}) + \u03bb||\u03b8||1 depends on \u03b8 only through its components \u03b8r,\u00b7 = {\u03b8rj}, we will hereafter neglect all the other parameters and write \u03b8 as a shorthand for \u03b8r,\u00b7.\nLet \u02c6z\u2217 be a subgradient of ||\u03b8||1 evaluated at the true parameter values, \u03b8\u2217 = {\u03b8rj : \u03b8rj = 0, \u2200j /\u2208 \u2202r; \u03b8rj = \u03b8, \u2200j \u2208 \u2202r}. Let \u02c6\u03b8n be the parameter estimate returned by Rlr(\u03bb) when the number of samples is n. Note that, since we assumed \u03b8\u2217 \u2265 0, \u02c6z\u2217S = 1. Define Qn(\u03b8; {x(\u2113)}) to be the Hessian of L(\u03b8; {x(\u2113)}) and Q(\u03b8) = limn\u2192\u221e Qn(\u03b8; {x(\u2113)}). By the law of large numbers, Q(\u03b8) is the Hessian of EG,\u03b8 log PG,\u03b8(Xr|X\\r), where EG,\u03b8 is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalues of a symmetric matrix M by \u03c3max(M ) and \u03c3min(M ) respectively.\nWe will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a \u2217, e.g. Q\u2217 = Q(\u03b8\u2217). Quantities under a \u2227 depend on n. Throughout this section G is a graph of maximum degree \u2206.\n\n3.1 Proof of Theorem 1.6\n\nOur first auxiliary result establishes that, if \u03bb is small, then ||Q\u2217ScS(Q\u2217SS)^{\u22121}\u02c6z\u2217S||\u221e > 1 is a sufficient condition for the failure of Rlr(\u03bb).\nLemma 3.1. 
Assume [Q\u2217ScS(Q\u2217SS)^{\u22121}\u02c6z\u2217S]i \u2265 1 + \u01eb for some \u01eb > 0 and some row i \u2208 V , \u03c3min(Q\u2217SS) \u2265 Cmin > 0, and \u03bb < Cmin\u01eb/(2^9\u2206^4). Then the success probability of Rlr(\u03bb) is upper bounded as\n\nPsucc \u2264 4\u2206^2 e^{\u2212n\u03b4A^2} + 2\u2206 e^{\u2212n\u03bb^2\u03b4B^2} ,   (10)\n\nfor some \u03b4A, \u03b4B > 0.\nLemma 3.2. There exists M = M (K, \u03b8) > 0 for \u03b8 > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and n\u03bb^2 \u2264 K, then\n\nPsucc \u2264 e^{\u2212M (K,\u03b8)p} + e^{\u2212n(1\u2212tanh \u03b8)^2/32} .   (11)\n\nFinally, our key result shows that the condition ||Q\u2217ScS(Q\u2217SS)^{\u22121}\u02c6z\u2217S||\u221e \u2264 1 is violated with high probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17].\nLemma 3.3. Let G be a uniformly random regular graph of degree \u2206 > 3, and \u01eb > 0 be sufficiently small. Then, there exists \u03b8thr(\u2206, \u01eb) such that, for \u03b8 > \u03b8thr(\u2206, \u01eb), ||Q\u2217ScS(Q\u2217SS)^{\u22121}\u02c6z\u2217S||\u221e \u2265 1 + \u01eb with probability converging to 1 as p \u2192 \u221e.\nFurthermore, for large \u2206, \u03b8thr(\u2206, 0+) = \u02dc\u03b8 \u2206^{\u22121}(1 + o(1)). The constant \u02dc\u03b8 is given by \u02dc\u03b8 = (tanh \u00afh)/\u00afh, where \u00afh is the unique positive solution of \u00afh tanh \u00afh = (1 \u2212 tanh^2 \u00afh)^2. Finally, there exists Cmin > 0, dependent only on \u2206 and \u03b8, such that \u03c3min(Q\u2217SS) \u2265 Cmin with probability converging to 1 as p \u2192 \u221e.\nThe proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons.\n\nProof. (Theorem 1.6) Fix \u2206 > 3, \u03b8 > K/\u2206 (where K is a large enough constant independent of \u2206), and \u01eb, Cmin > 0 both small enough. 
By Lemma 3.3, for any p large enough we can choose a \u2206-regular graph Gp = (V = [p], Ep) and a vertex r \u2208 V such that |[Q\u2217ScS(Q\u2217SS)^{\u22121}1S]i| > 1 + \u01eb for some i \u2208 V \\ r.\nBy Theorem 1 in [4] we can assume, without loss of generality, n > K\u2032\u2206 log p for some small constant K\u2032. Further, by Lemma 3.2, n\u03bb^2 \u2265 F (p) for some F (p) \u2191 \u221e as p \u2192 \u221e, and the condition of Lemma 3.1 on \u03bb is satisfied, since by the \u201creasonable\u201d assumption \u03bb \u2192 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability\n\nPsucc(Gp) \u2264 4\u2206^2 p^{\u2212\u03b4A^2 K\u2032\u2206} + 2\u2206 e^{\u2212F (p)\u03b4B^2} .   (12)\n\nIn particular Psucc(Gp) \u2192 0 as p \u2192 \u221e.\n\n3.2 Proofs of auxiliary lemmas\n\nProof. (Lemma 3.1) We will show that, under the assumptions of the lemma and if \u02c6\u03b8 = (\u02c6\u03b8S, \u02c6\u03b8SC ) = (\u02c6\u03b8S, 0), the probability that the i-th component of any subgradient of L(\u03b8; {x(\u2113)}) + \u03bb||\u03b8||1 vanishes for any \u02c6\u03b8S > 0 (component-wise) is upper bounded as in Eq. (10). To simplify notation we will omit {x(\u2113)} in all the expressions derived from L.\nLet \u02c6z be a subgradient of ||\u03b8||1 at \u02c6\u03b8 and assume \u2207L(\u02c6\u03b8) + \u03bb\u02c6z = 0. An application of the mean value theorem yields\n\n\u22072L(\u03b8\u2217)[\u02c6\u03b8 \u2212 \u03b8\u2217] = W n \u2212 \u03bb\u02c6z + Rn ,   (13)\n\nwhere W n = \u2212\u2207L(\u03b8\u2217) and [Rn]j = [\u22072L(\u00af\u03b8(j)) \u2212 \u22072L(\u03b8\u2217)]Tj (\u02c6\u03b8 \u2212 \u03b8\u2217), with \u00af\u03b8(j) a point on the line from \u02c6\u03b8 to \u03b8\u2217. Notice that by definition \u22072L(\u03b8\u2217) = Qn\u2217 = Qn(\u03b8\u2217). To simplify notation we will omit the \u2217 in all Qn\u2217. 
All Qn in this proof are thus evaluated at \u03b8\u2217.\nBreaking this expression into its S and Sc components, and since \u02c6\u03b8SC = \u03b8\u2217SC = 0, we can eliminate \u02c6\u03b8S \u2212 \u03b8\u2217S from the two expressions obtained and write\n\n[W nSC \u2212 RnSC ] \u2212 QnSC S(QnSS)^{\u22121}[W nS \u2212 RnS] + \u03bbQnSC S(QnSS)^{\u22121}\u02c6zS = \u03bb\u02c6zSC .   (14)\n\nNow notice that QnSC S(QnSS)^{\u22121} = T1 + T2 + T3 + T4, where\n\nT1 = Q\u2217SC S[(QnSS)^{\u22121} \u2212 (Q\u2217SS)^{\u22121}] ,   T2 = [QnSC S \u2212 Q\u2217SC S](Q\u2217SS)^{\u22121} ,\nT3 = [QnSC S \u2212 Q\u2217SC S][(QnSS)^{\u22121} \u2212 (Q\u2217SS)^{\u22121}] ,   T4 = Q\u2217SC S(Q\u2217SS)^{\u22121} .\n\nWe will assume that the samples {x(\u2113)} are such that the following event holds\n\nE \u2261 {||QnSS \u2212 Q\u2217SS||\u221e < \u03beA , ||QnSC S \u2212 Q\u2217SC S||\u221e < \u03beB , ||W nS /\u03bb||\u221e < \u03beC} ,   (15)\n\nwhere \u03beA \u2261 Cmin^2\u01eb/(16\u2206), \u03beB \u2261 Cmin\u01eb/(8\u221a\u2206) and \u03beC \u2261 Cmin\u01eb/(8\u2206). Since EG,\u03b8(Qn) = Q\u2217 and EG,\u03b8(W n) = 0, and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of the Azuma-Hoeffding inequality upper bounds the probability of Ec as in (10). From E it follows that \u03c3min(QnSS) > \u03c3min(Q\u2217SS) \u2212 Cmin/2 > Cmin/2. We can therefore lower 
We can therefore lower\nbound the absolute value of the ith component of \u02c6zSC by\n|[Q\u2217\nwhere the subscript i denotes the i-th row of a matrix.\nThe proof is completed by showing that the event E and the assumptions of the theorem imply that\neach of last 7 terms in this expression is smaller than \u01eb/8. Since |[Q\u2217\nS| \u2265 1 + \u01eb by\nassumption, this implies |\u02c6zi| \u2265 1 + \u01eb/8 > 1 which cannot be since any subgradient of the 1-norm\nhas components of magnitude at most 1.\nThe last condition on E immediately bounds all terms involving W by \u01eb/8. Some straightforward\nmanipulations imply (See Lemma 7 from [7])\n\n\u221211S]i|\u2212||T1,i||\u221e\u2212||T2,i||\u221e\u2212||T3,i||\u221e\u2212(cid:12)(cid:12)(cid:12)\n\nCmin (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)\n\n\u03bb (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)\u221e\n\n\u03bb (cid:12)(cid:12)(cid:12)\u2212(cid:12)(cid:12)(cid:12)\n\n\u03bb (cid:12)(cid:12)(cid:12)\u2212\n\n+(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)\n\nSC SQ\u2217\n\n\u22121]T\n\ni \u02c6zn\n\nW n\nS\n\nW n\ni\n\nRn\nS\n\nRn\ni\n\n\u2206\n\nSS\n\nSS\n\n\u03bb (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)\u221e(cid:19) ,\n\n||T2,i||\u221e \u2264\n\n\u221a\u2206\nCmin||[Qn\n\nSC S \u2212 Q\u2217\n\nSC S]i||\u221e ,\n\n||T1,i||\u221e \u2264\n\n||T3,i||\u221e \u2264\n\n\u2206\nmin||Qn\nC 2\n2\u2206\nmin||Qn\nC 2\n\nSS \u2212 Q\u2217\nSS \u2212 Q\u2217\n\nSS||\u221e ,\nSS||\u221e||[Qn\n\nSC S \u2212 Q\u2217\n\nSC S]i||\u221e ,\n\nand thus all will be bounded by \u01eb/8 when E holds. The upper bound of Rn follows along similar\nlines via an mean value theorem, and is deferred to a longer version of this paper.\n\nProof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 
3.1.\nFor t \u2208 N, let T(t) = (VT, ET) be the regular rooted tree of t generations and define the associated Ising measure as\n\n\u00b5+T,\u03b8(x) = (1/ZT,\u03b8) \u220f(i,j)\u2208ET e^{\u03b8xixj} \u220fi\u2208\u2202T(t) e^{h\u2217xi} .   (16)\n\nHere \u2202T(t) is the set of leaves of T(t) and h\u2217 is the unique positive solution of h = (\u2206 \u2212 1) atanh{tanh \u03b8 tanh h}. It can be proved using [17] and uniform continuity with respect to the \u2018external field\u2019 that non-trivial local expectations with respect to \u00b5G,\u03b8(x) converge to local expectations with respect to \u00b5+T,\u03b8(x), as p \u2192 \u221e.\nMore precisely, let Br(t) denote a ball of radius t around node r \u2208 G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that Br(t) is not isomorphic to T(t) goes to 0 as p \u2192 \u221e. Let g(xBr(t)) be any function of the variables in Br(t) such that g(xBr(t)) = g(\u2212xBr(t)). Then, almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)),\n\nlimp\u2192\u221e EG,\u03b8{g(X Br(t))} = ET(t),\u03b8,+{g(X T(t))} .   (17)\n\nThe proof consists in considering [Q\u2217ScS(Q\u2217SS)^{\u22121}\u02c6z\u2217S]i for t = dist(r, i) finite. We then write (Q\u2217SS)lk = E{gl,k(X Br(t))} and (Q\u2217ScS)il = E{gi,l(X Br(t))} for some functions g\u00b7,\u00b7(X Br(t)), and apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q\u2217ScS(Q\u2217SS)^{\u22121}\u02c6z\u2217S]i to the calculation of expectations with respect to the tree measure (16). The latter can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t \u226b 1. 
The actual calculations consist in a (very) long exercise in\ncalculus and we omit them from this outline.\nThe lower bound on \u03c3min(Q\u2217\n\n)} for some functions g\u00b7,\u00b7(X\n\nSS) is proved by a similar calculation.\n\nBr (t)\n\nBr (t)\n\nAcknowledgments\n\nThis work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978\nand the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship.\n\n8\n\n\fReferences\n\n[1] P. Abbeel, D. Koller and A. Ng, \u201cLearning factor graphs in polynomial time and sample com-\n\nplexity\u201d. Journal of Machine Learning Research., 2006, Vol. 7, 1743\u20131788.\n\n[2] M. Wainwright, \u201cInformation-theoretic limits on sparsity recovery in the high-dimensional and\n\nnoisy setting\u201d, arXiv:math/0702301v2 [math.ST], 2007.\n\n[3] N. Santhanam, M. Wainwright, \u201cInformation-theoretic limits of selecting binary graphical\n\nmodels in high dimensions\u201d, arXiv:0905.2639v1 [cs.IT], 2009.\n\n[4] G. Bresler, E. Mossel and A. Sly, \u201cReconstruction of Markov Random Fields from Sam-\nples: Some Observations and Algorithms\u201d,Proceedings of the 11th international workshop,\nAPPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343\u2013356.\n\n[5] Csiszar and Z. Talata, \u201cConsistent estimation of the basic neighborhood structure of Markov\n\nrandom \ufb01elds\u201d, The Annals of Statistics, 2006, 34, Vol. 1, 123-145.\n\n[6] N. Friedman, I. Nachman, and D. Peer, \u201cLearning Bayesian network structure from massive\n\ndatasets: The sparse candidate algorithm\u201d. In UAI, 1999.\n\n[7] P. Ravikumar, M. Wainwright and J. Lafferty, \u201cHigh-Dimensional Ising Model Selection Using\n\nl1-Regularized Logistic Regression\u201d, arXiv:0804.4202v1 [math.ST], 2008.\n\n[8] M.Wainwright, P. Ravikumar, and J. Lafferty, \u201cInferring graphical model structure using l1-\n\nregularized pseudolikelihood\u201c, In NIPS, 2006.\n\n[9] H. H\u00a8o\ufb02ing and R. 
Tibshirani, \u201cEstimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods\u201d, Journal of Machine Learning Research, 2009, Vol. 10, 883\u2013906.\n\n[10] O. Banerjee, L. El Ghaoui and A. d\u2019Aspremont, \u201cModel Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data\u201d, Journal of Machine Learning Research, March 2008, Vol. 9, 485\u2013516.\n\n[11] M. Yuan and Y. Lin, \u201cModel Selection and Estimation in Regression with Grouped Variables\u201d, J. Royal Statist. Soc. B, 2006, Vol. 68, 49\u201367.\n\n[12] N. Meinshausen and P. B\u00fchlmann, \u201cHigh dimensional graphs and variable selection with the lasso\u201d, Annals of Statistics, 2006, Vol. 34, No. 3.\n\n[13] R. Tibshirani, \u201cRegression shrinkage and selection via the lasso\u201d, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267\u2013288.\n\n[14] P. Zhao and B. Yu, \u201cOn model selection consistency of Lasso\u201d, Journal of Machine Learning Research, 2006, Vol. 7, 2541\u20132563.\n\n[15] D. Zobin, \u201cCritical behavior of the bond-dilute two-dimensional Ising model\u201d, Phys. Rev., 1978, Vol. 18, No. 5, 2387\u20132390.\n\n[16] M. Fisher, \u201cCritical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds\u201d, Phys. Rev., Oct. 1967, Vol. 162, No. 2, 480\u2013485.\n\n[17] A. Dembo and A. Montanari, \u201cIsing Models on Locally Tree-Like Graphs\u201d, Ann. Appl. Prob., 2008, to appear, arXiv:0804.4726v2 [math.PR].\n", "award": [], "sourceid": 870, "authors": [{"given_name": "Andrea", "family_name": "Montanari", "institution": null}, {"given_name": "Jose", "family_name": "Pereira", "institution": null}]}