{"title": "On Learning Discrete Graphical Models using Greedy Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 1935, "page_last": 1943, "abstract": "In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node-degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Omega(d^2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Omega(d^3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.", "full_text": "On Learning Discrete Graphical Models Using\n\nGreedy Methods\n\nAli Jalali\n\nUniversity of Texas at Austin\nalij@mail.utexas.edu\n\nChristopher C. Johnson\n\nUniversity of Texas at Austin\n\ncjohnson@cs.utexas.edu\n\nPradeep Ravikumar\n\nUniversity of Texas at Austin\n\npradeepr@cs.utexas.edu\n\nAbstract\n\nIn this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. 
As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node-degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d^2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d^3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.\n\n1 Introduction\nUndirected graphical models, also known as Markov random fields, are used in a variety of domains, including statistical physics, natural language processing and image analysis among others. In this paper we are concerned with the task of estimating the graph structure G of a Markov random field (MRF) over a discrete random vector X = (X1, X2, . . . , Xp), given n independent and identically distributed samples {x(1), x(2), . . . , x(n)}. This underlying graph structure encodes conditional independence assumptions among subsets of the variables, and thus plays an important role in a broad range of applications of MRFs.\n\nExisting approaches: Neighborhood Estimation, Greedy Local Search. Methods for estimating such graph structure include those based on constraint and hypothesis testing [22], and those that estimate restricted classes of graph structures such as trees [8], polytrees [11], and hypertrees [23]. A recent class of successful approaches for graphical model structure learning are based on estimating the local neighborhood of each node. 
One subclass of these, for the special case of bounded degree graphs, involves the use of exhaustive search, so that their computational complexity grows at least as quickly as O(p^d), where d is the maximum neighborhood size in the graphical model [1, 4, 9]. Another subclass uses convex programs to learn the neighborhood structure: for instance, [20, 17, 16] estimate the neighborhood set for each vertex r ∈ V by optimizing its ℓ1-regularized conditional likelihood; [15, 10] use ℓ1/ℓ2-regularized conditional likelihood. Even these methods, however, need to solve regularized convex programs, with a typical polynomial computational cost of O(p^4) or O(p^6), and are thus still expensive for large problems. Another popular class of approaches is based on using a score metric and searching for the best scoring structure from a candidate set of graph structures. Exact search is typically NP-hard [7]; indeed for general discrete MRFs, not only is the search space intractably large, but calculation of typical score metrics itself is computationally intractable since they involve computing the partition function associated with the Markov random field [26]. Such methods thus have to use approximations and search heuristics for tractable computation. Question: Can one use local procedures that are as inexpensive as the heuristic greedy approaches, and yet come with the strong statistical guarantees of the regularized convex program based approaches?\n\nHigh-dimensional Estimation; Greedy Methods. There has been an increasing focus in recent years on high-dimensional statistical models where the number of parameters p is comparable to or even larger than the number of observations n. It is now well understood that consistent estimation is possible even under such high-dimensional scaling if some low-dimensional structure is imposed on the model space. 
Of relevance to graphical model structure learning is the structure of sparsity, where a sparse set of non-zero parameters entails a sparse set of edges. A surge of recent work [5, 12] has shown that ℓ1-regularization for learning such sparse models can lead to practical algorithms with strong theoretical guarantees. A line of recent work (cf. paragraph above) has thus leveraged this sparsity inducing nature of ℓ1-regularization, to propose and analyze convex programs based on regularized log-likelihood functions. A related line of recent work on learning sparse models has focused on “stagewise” greedy algorithms. These perform simple forward steps (adding parameters greedily), and possibly also backward steps (removing parameters greedily), and yet provide strong statistical guarantees for the estimate after a finite number of greedy steps. The forward greedy variant, which performs just the forward step, has appeared in various guises in multiple communities: in machine learning as boosting [13], in function approximation [24], and in signal processing as basis pursuit [6]. In the context of statistical model estimation, Zhang [28] analyzed the forward greedy algorithm for the case of sparse linear regression, and showed that the forward greedy algorithm is sparsistent (consistent for model selection recovery) under the same “irrepresentable” condition as that required for “sparsistency” of the Lasso. Zhang [27] analyzed a more general greedy algorithm for sparse linear regression that performs forward and backward steps, and showed that it is sparsistent under a weaker restricted eigenvalue condition. Here we ask the question: Can we provide an analysis of a general forward backward algorithm for parameter estimation in general statistical models? 
Specifically, we need to extend the sparsistency analysis of [28] to general non-linear models, which requires a subtler analysis due to a circular requirement: controlling the third order terms in the Taylor series expansion of the log-likelihood in turn requires the estimate to be well-behaved. Such extensions in the case of ℓ1-regularization occur for instance in [20, 25, 3].\n\nOur Contributions. In this paper, we address both questions above. In the first part, we analyze the forward backward greedy algorithm [28] for general statistical models. We note that even though we consider the general statistical model case, our analysis is much simpler and more accessible than [28], and would be of use even to a reader interested in just the linear model case of Zhang [28]. In the second part, we use this to show that when combined with neighborhood estimation, the forward backward variant applied to local conditional log-likelihoods provides a simple computationally tractable method that adds and deletes edges, but comes with strong sparsistency guarantees. We reiterate that our first result on the sparsistency of the forward backward greedy algorithm for general objectives is of independent interest even outside the context of graphical models. As we show, the greedy method is better than the ℓ1-regularized counterpart in [20] both theoretically and experimentally. The sufficient condition on the parameters imposed by the greedy algorithm is a restricted strong convexity condition [19], which is weaker than the irrepresentable condition required by [20]. Further, the number of samples required for sparsistent graph recovery scales as O(d^2 log p), where d is the maximum node degree, in contrast to O(d^3 log p) for the ℓ1-regularized counterpart. 
We corroborate this in our simulations, where we find that the greedy algorithm requires fewer observations than [20] for sparsistent graph recovery.\n\n2 Review, Setup and Notation\n2.1 Markov Random Fields\nLet X = (X1, . . . , Xp) be a random vector, each variable Xi taking values in a discrete set X of cardinality m. Let G = (V, E) denote a graph with p nodes, corresponding to the p variables {X1, . . . , Xp}. A pairwise Markov random field over X = (X1, . . . , Xp) is then specified by nodewise and pairwise functions θr : X → R for all r ∈ V, and θrt : X × X → R for all (r, t) ∈ E:\n\nP(x) ∝ exp{ Σ_{r∈V} θr(xr) + Σ_{(r,t)∈E} θrt(xr, xt) }.   (1)\n\nIn this paper, we largely focus on the case where the variables are binary with X = {−1, +1}, where we can rewrite (1) to the Ising model form [14] for some set of parameters {θr} and {θrt} as\n\nP(x) ∝ exp{ Σ_{r∈V} θr xr + Σ_{(r,t)∈E} θrt xr xt }.   (2)\n\n2.2 Graphical Model Selection\n\nLet D := {x(1), . . . , x(n)} denote the set of n samples, where each p-dimensional vector x(i) ∈ {1, . . . , m}^p is drawn i.i.d. from a distribution Pθ∗ of the form (1), for parameters θ∗ and graph G = (V, E∗) over the p variables. Note that the true edge set E∗ can also be expressed as a function of the parameters as\n\nE∗ = {(r, t) ∈ V × V : θ∗_rt ≠ 0}.   (3)\n\nThe graphical model selection task consists of inferring this edge set E∗ from the samples D. The goal is to construct an estimator Ê_n for which P[Ê_n = E∗] → 1 as n → ∞. Denote by N∗(r) the set of neighbors of a vertex r ∈ V, so that N∗(r) = {t : (r, t) ∈ E∗}. 
Then the graphical model selection problem is equivalent to that of estimating the neighborhoods N̂_n(r) ⊂ V, so that P[N̂_n(r) = N∗(r), ∀r ∈ V] → 1 as n → ∞.\n\nFor any pair of random variables Xr and Xt, the parameter θrt fully characterizes whether there is an edge between them, and can be estimated via its conditional likelihood. In particular, defining Θr := (θr1, . . . , θrp), our goal is to use the conditional likelihood of Xr conditioned on XV\r to estimate Θr and hence its neighborhood N(r). This conditional distribution of Xr conditioned on XV\r generated by (2) is given by the logistic model\n\nP(Xr = xr | XV\r = xV\r) = exp(θr xr + Σ_{t∈V\r} θrt xr xt) / (1 + exp(θr xr + Σ_{t∈V\r} θrt xr xt)).\n\nGiven the n samples D, the corresponding conditional log-likelihood is given by\n\nL(Θr; D) = (1/n) Σ_{i=1}^{n} { log(1 + exp(θr x(i)_r + Σ_{t∈V\r} θrt x(i)_r x(i)_t)) − θr x(i)_r − Σ_{t∈V\r} θrt x(i)_r x(i)_t }.   (4)\n\nIn Section 4, we study a greedy algorithm (Algorithm 2) that finds these node neighborhoods N̂_n(r) = Supp(Θ̂_r) of each random variable Xr separately, by a greedy stagewise optimization of the conditional log-likelihood of Xr conditioned on XV\r. The algorithm then combines these neighborhoods to obtain a graph estimate Ê_n using an “OR” rule: Ê_n = ∪_r {(r, t) : t ∈ N̂_n(r)}. Other rules, such as the “AND” rule that adds an edge only if it occurs in each of the respective node neighborhoods, could be used to combine the node-neighborhoods to a graph estimate. 
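To make the node-conditional objective in (4) concrete, here is a minimal numerical sketch for ±1 data; this is our own illustration rather than the authors' code, and the function and variable names are ours:

```python
import numpy as np

def node_conditional_nll(theta, X, r):
    """Average node-conditional negative log-likelihood of node r given the
    rest, for +/-1 samples X of shape (n, p); theta[0] plays the role of the
    node-wise parameter theta_r and theta[1:] holds theta_rt for t in V \\ {r}."""
    n, p = X.shape
    rest = [t for t in range(p) if t != r]
    # a_i = theta_r * x_r + sum_t theta_rt * x_r * x_t, as in (4)
    a = X[:, r] * (theta[0] + X[:, rest] @ theta[1:])
    # log(1 + exp(a)) - a == log(1 + exp(-a)); logaddexp keeps this stable
    return np.mean(np.logaddexp(0.0, -a))
```

Minimizing this objective over a sparse Θr and reading off the support of the minimizer yields the neighborhood estimate N̂_n(r); note that at Θr = 0 the loss equals log 2, the entropy of a fair ±1 coin.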
We show in Theorem 2 that the neighborhood selection by the greedy algorithm succeeds in recovering the exact node-neighborhoods with high probability, so that by a union bound, the graph estimates using either the AND or OR rules would be exact with high probability as well.\n\nBefore we describe this greedy algorithm and its analysis in Section 4, however, we first consider the general statistical model case in the next section. We first describe the forward backward greedy algorithm of Zhang [28] as applied to general statistical models, followed by a sparsistency analysis for this general case. We then specialize these general results in Section 4 to the graphical model case. The next section is thus of independent interest even outside the context of graphical models.\n\n3 Greedy Algorithm for General Losses\n\nConsider a random variable Z with distribution P, and let Z_1^n := {Z1, . . . , Zn} denote n observations drawn i.i.d. according to P. Suppose we are interested in estimating some parameter θ∗ ∈ R^p of the distribution P that is sparse; denote its number of non-zeroes by s∗ := ‖θ∗‖_0. Let L : R^p × Z^n → R be some loss function that assigns a cost to any parameter θ ∈ R^p, for a given set of observations Z_1^n. For ease of notation, in the sequel, we adopt the shorthand L(θ) for L(θ; Z_1^n). We assume that θ∗ satisfies E_Z[∇L(θ∗)] = 0.\n\nAlgorithm 1 Greedy forward-backward algorithm for finding a sparse optimizer of L(·)\nInput: Data D := {x(1), . . .
, x(n)}, Stopping Threshold εS, Backward Step Factor ν ∈ (0, 1)\nOutput: Sparse optimizer θ̂\n\nθ̂(0) ← 0 and Ŝ(0) ← ∅ and k ← 1\nwhile true do {Forward Step}\n  (j∗, α∗) ← arg min_{j∈(Ŝ(k−1))^c; α} L(θ̂(k−1) + α e_j; D)\n  δ_f^(k) ← L(θ̂(k−1); D) − L(θ̂(k−1) + α∗ e_{j∗}; D)\n  if δ_f^(k) ≤ εS then\n    break\n  end if\n  Ŝ(k) ← Ŝ(k−1) ∪ {j∗}\n  θ̂(k) ← arg min_θ L(θ_{Ŝ(k)}; D)\n  k ← k + 1\n  while true do {Backward Step}\n    j∗ ← arg min_{j∈Ŝ(k−1)} L(θ̂(k−1) − θ̂_j^(k−1) e_j; D)\n    if L(θ̂(k−1) − θ̂_{j∗}^(k−1) e_{j∗}; D) − L(θ̂(k−1); D) > ν δ_f^(k−1) then\n      break\n    end if\n    k ← k − 1\n    Ŝ(k−1) ← Ŝ(k) − {j∗}\n    θ̂(k−1) ← arg min_θ L(θ_{Ŝ(k−1)}; D)\n  end while\nend while\n\nWe now consider the forward backward greedy algorithm in Algorithm 1, which rewrites the algorithm in [27] to allow for general loss functions. The algorithm starts with an empty set of active variables Ŝ(0) and gradually adds (and removes) variables to the active set until it meets the stopping criterion. This algorithm has two major steps: the forward step and the backward step. In the forward step, the algorithm finds the best next candidate and adds it to the active set as long as it improves the loss function at least by εS; otherwise the stopping criterion is met and the algorithm terminates. Then, in the backward step, the algorithm checks the influence of all variables in the presence of the newly added variable. 
If one or more of the previously added variables do not contribute at least νεS to the loss function, then the algorithm removes them from the active set. This procedure ensures that at each round, the loss function is improved by at least (1 − ν)εS, and hence it terminates within a finite number of steps.\n\nWe state the assumptions on the loss function such that sparsistency is guaranteed. Let us first recall the definition of restricted strong convexity from Negahban et al. [18]. Specifically, for a given set S, the loss function is said to satisfy restricted strong convexity (RSC) with parameter κl with respect to the set S if\n\nL(θ + ∆; Z_1^n) − L(θ; Z_1^n) − ⟨∇L(θ; Z_1^n), ∆⟩ ≥ (κl/2) ‖∆‖_2^2 for all ∆ ∈ S.   (5)\n\nWe can now define sparsity restricted strong convexity as follows. Specifically, we say that the loss function L satisfies RSC(k) with parameter κl if it satisfies RSC with parameter κl for the set {∆ ∈ R^p : ‖∆‖_0 ≤ k}.\n\nIn contrast, we say that the loss function satisfies restricted strong smoothness (RSS) with parameter κu with respect to a set S if\n\nL(θ + ∆; Z_1^n) − L(θ; Z_1^n) − ⟨∇L(θ; Z_1^n), ∆⟩ ≤ (κu/2) ‖∆‖_2^2 for all ∆ ∈ S.\n\nWe can define RSS(k) similarly. The loss function L satisfies RSS(k) with parameter κu if it satisfies RSS with parameter κu for the set {∆ ∈ R^p : ‖∆‖_0 ≤ k}. 
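For intuition, the overall forward-backward scheme of Algorithm 1 can be sketched in code. The sketch below is our own illustration, not the authors' implementation: it instantiates the generic loss L with a least-squares loss so that the one-dimensional forward search and the restricted refits have closed forms, and all function names are ours.

```python
import numpy as np

def loss(theta, A, y):
    """Stand-in for the generic loss L(theta; D): here (1/2n)||y - A theta||^2."""
    r = y - A @ theta
    return 0.5 * (r @ r) / len(y)

def refit(S, A, y, p):
    """arg min_theta L(theta_S; D): least-squares refit restricted to support S."""
    theta = np.zeros(p)
    idx = sorted(S)
    if idx:
        theta[idx] = np.linalg.lstsq(A[:, idx], y, rcond=None)[0]
    return theta

def foba(A, y, eps_stop, nu=0.5):
    """Forward-backward greedy selection in the spirit of Algorithm 1."""
    n, p = A.shape
    S, theta = set(), np.zeros(p)
    while True:
        # Forward step: best single-coordinate update over j outside S
        base, resid = loss(theta, A, y), y - A @ theta
        gains = []
        for j in sorted(set(range(p)) - S):
            alpha = (A[:, j] @ resid) / (A[:, j] @ A[:, j])  # 1-d minimizer
            step = np.zeros(p)
            step[j] = alpha
            gains.append((base - loss(theta + step, A, y), j))
        if not gains:
            break
        delta_f, j_star = max(gains)
        if delta_f <= eps_stop:
            break                                  # stopping criterion
        S.add(j_star)
        theta = refit(S, A, y, p)
        # Backward step: drop variables contributing less than nu * delta_f
        while len(S) > 1:
            costs = []
            for j in sorted(S):
                t = theta.copy()
                t[j] = 0.0                         # L(theta - theta_j e_j; D)
                costs.append((loss(t, A, y) - loss(theta, A, y), j))
            delta_b, j_min = min(costs)
            if delta_b > nu * delta_f:
                break
            S.remove(j_min)
            theta = refit(S, A, y, p)
    return theta, S
```

On a well-conditioned noiseless sparse linear problem this loop recovers the true support; in the graphical model setting of Section 4, the same loop would be run with the node-conditional logistic loss (4) in place of the least-squares stand-in.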
Given any constants κl and κu, and a sample based loss function L, we can typically use concentration based arguments to obtain bounds on the sample size required so that the RSS and RSC conditions hold with high probability. Another property of the loss function that we require is an upper bound λn on the ℓ∞ norm of the gradient of the loss at the true parameter θ∗, i.e., λn ≥ ‖∇L(θ∗)‖∞. This captures the “noise level” of the samples with respect to the loss. Here too, we can typically use concentration arguments to show, for instance, that λn ≤ cn (log(p)/n)^{1/2} for some constant cn > 0, with high probability.\n\nTheorem 1 (Sparsistency). Suppose the loss function L(·) satisfies RSC(ηs∗) and RSS(ηs∗) with parameters κl and κu, for some η ≥ 2 + 4ρ^2 (√((ρ^2 − ρ)/s∗) + √2)^2 with ρ = κu/κl. Moreover, suppose that the true parameters θ∗ satisfy min_{j∈S∗} |θ∗_j| > √(32ρεS/κl). Then if we run Algorithm 1 with stopping threshold εS ≥ (8ρη/κl) s∗ λn^2, the output θ̂ with support Ŝ satisfies:\n\n(a) Error Bound: ‖θ̂ − θ∗‖_2 ≤ (2/κl) √s∗ (λn √η + √(2κu εS)).\n\n(b) No False Exclusions: S∗ − Ŝ = ∅.\n\n(c) No False Inclusions: Ŝ − S∗ = ∅.\n\nProof. The proof of the theorem hinges on three main lemmas: Lemmas 1 and 2, which are simple consequences of the forward and backward steps failing when the greedy algorithm stops, and Lemma 3, which uses these two lemmas and extends techniques from [21] and [19] to obtain an ℓ2 bound on the error. Provided these lemmas hold, we then show below that the greedy algorithm is sparsistent. 
However, these lemmas require a priori that the RSC and RSS conditions hold for sparsity size |S∗ ∪ Ŝ|. Thus, we use the result in Lemma 4 that if RSC(ηs∗) holds, then the solution when the algorithm terminates satisfies |Ŝ| ≤ (η − 1)s∗, and hence |Ŝ ∪ S∗| ≤ ηs∗. Thus, we can then apply Lemmas 1, 2 and Lemma 3 to complete the proof as detailed below.\n\n(a) The result follows directly from Lemma 3, and noting that |Ŝ ∪ S∗| ≤ ηs∗. In this Lemma, we show that the upper bound holds by drawing from fixed point techniques in [21] and [19], and by using a simple consequence of the forward step failing when the greedy algorithm stops.\n\n(b) We follow the chaining argument in [27]. For any τ ∈ R, we have\n\nτ |{j ∈ S∗ − Ŝ : |θ∗_j|^2 > τ}| ≤ ‖θ∗_{S∗−Ŝ}‖_2^2 ≤ ‖θ∗ − θ̂‖_2^2 ≤ (8ηs∗λn^2)/κl^2 + ((16κu εS)/κl^2) |S∗ − Ŝ|,\n\nwhere the last inequality follows from part (a) and the inequality (a + b)^2 ≤ 2a^2 + 2b^2. Now, setting τ = 32κu εS/κl^2 and dividing both sides by τ/2, we get\n\n2 |{j ∈ S∗ − Ŝ : |θ∗_j|^2 > τ}| ≤ (ηs∗λn^2)/(2κu εS) + |S∗ − Ŝ|.\n\nSubstituting |{j ∈ S∗ − Ŝ : |θ∗_j|^2 > τ}| = |S∗ − Ŝ| − |{j ∈ S∗ − Ŝ : |θ∗_j|^2 ≤ τ}|, we get\n\n|S∗ − Ŝ| ≤ 2 |{j ∈ S∗ − Ŝ : |θ∗_j|^2 ≤ τ}| + (ηs∗λn^2)/(2κu εS) ≤ 2 |{j ∈ S∗ − Ŝ : |θ∗_j|^2 ≤ τ}| + 1/2,\n\ndue to the setting of the stopping threshold εS. 
This in turn entails that |S∗ − Ŝ| ≤ 2 |{j ∈ S∗ − Ŝ : |θ∗_j|^2 ≤ τ}| = 0, by our assumption on the size of the minimum entry of θ∗.\n\n(c) From Lemma 2, which provides a simple consequence of the backward step failing when the greedy algorithm stops, for ∆̂ = θ̂ − θ∗ we have (εS/κu) |Ŝ − S∗| ≤ ‖∆̂_{Ŝ−S∗}‖_2^2 ≤ ‖∆̂‖_2^2, so that using Lemma 3 and that |S∗ − Ŝ| = 0, we obtain that |Ŝ − S∗| ≤ (4ηs∗ λn^2 κu)/(εS κl^2) ≤ 1/2, due to the setting of the stopping threshold εS.\n\nAlgorithm 2 Greedy forward-backward algorithm for pairwise discrete graphical model learning\nInput: Data D := {x(1), . . . , x(n)}, Stopping Threshold εS, Backward Step Factor ν ∈ (0, 1)\nOutput: Estimated Edges Ê\nfor r ∈ V do\n  Run Algorithm 1 with the loss L(·) set as in (4), to obtain Θ̂_r with support N̂_r\nend for\nOutput Ê = ∪_r {(r, t) : t ∈ N̂_r}\n\n3.1 Lemmas for Theorem 1\n\nWe list the simple lemmas that characterize the solution obtained when the algorithm terminates, and on which the proof of Theorem 1 hinges. In particular, Lemma 4 states that if RSC(ηs∗) holds and the stopping threshold is set suitably, then Algorithm 1 stops with k ≤ (η − 1)s∗. Notice that if εS ≥ (8ρη/κl)(η^2/(4ρ^2)) λn^2, then the assumption of this lemma is satisfied. Hence, for large values of s∗, with s∗ ≥ 8ρ^2 > η^2/(4ρ^2), it suffices to have εS ≥ (8ρη/κl) s∗ λn^2.\n\n4 Greedy Algorithm for Pairwise Graphical Models\n\nSuppose we are given a set of n i.i.d. samples D := {x(1), . . . , x(n)}, drawn from a pairwise Ising model as in (2), with parameters θ∗ and graph G = (V, E∗). It will be useful to denote the maximum node-degree in the graph E∗ by d. As we will show, our model selection performance depends critically on this parameter d. We propose Algorithm 2 for estimating the underlying graphical model from the n samples D.\n\nTheorem 2 (Pairwise Sparsistency). Suppose we run Algorithm 2 with stopping threshold εS ≥ c1 (d log p)/n, where d is the maximum node degree in the graphical model, the true parameters θ∗ satisfy c3/√d > min_{j∈S∗} |θ∗_j| > c2 √εS, and further the number of samples scales as\n\nn > c4 d^2 log p,\n\nfor some constants c1, c2, c3, c4. Then, with probability at least 1 − c′ exp(−c′′ n), the output θ̂ supported on Ŝ satisfies:\n\n(a) No False Exclusions: E∗ − Ê = ∅.\n\n(b) No False Inclusions: Ê − E∗ = ∅.\n\nProof. This theorem is a corollary to our general Theorem 1. We first show that the conditions of Theorem 1 hold under the assumptions in this corollary.\n\nRSC, RSS. We first note that the conditional log-likelihood loss function in (4) corresponds to a logistic likelihood. Moreover, the covariates are all binary, and bounded, and hence also sub-Gaussian. [19, 2] analyze the RSC and RSS properties of generalized linear models, of which logistic models are an instance, and show that the following result holds if the covariates are sub-Gaussian. 
Let ∂L(∆; θ∗) = L(θ∗ + ∆) − L(θ∗) − ⟨∇L(θ∗), ∆⟩ be the second order Taylor series remainder. Then, Proposition 2 in [19] states that there exist constants κ_{l1} and κ_{l2}, independent of n, p, such that with probability at least 1 − c1 exp(−c2 n), for some constants c1, c2 > 0,\n\n∂L(∆; θ∗) ≥ ‖∆‖_2 (κ_{l1} ‖∆‖_2 − κ_{l2} √(log(p)/n) ‖∆‖_1) for all ∆ : ‖∆‖_2 ≤ 1.\n\nThus, if ‖∆‖_0 ≤ k := ηd, then ‖∆‖_1 ≤ √k ‖∆‖_2, so that\n\n∂L(∆; θ∗) ≥ ‖∆‖_2^2 (κ_{l1} − κ_{l2} √(k log(p)/n)) ≥ (κ_{l1}/2) ‖∆‖_2^2,\n\nif n > 4(κ_{l2}/κ_{l1})^2 ηd log(p). In other words, with probability at least 1 − c1 exp(−c2 n), the loss function L satisfies RSC(k) with parameter κ_{l1} provided n > 4(κ_{l2}/κ_{l1})^2 ηd log(p). Similarly, it follows from [19, 2] that there exist constants κ_{u1} and κ_{u2} such that, with probability at least 1 − c′1 exp(−c′2 n),\n\n∂L(∆; θ∗) ≤ ‖∆‖_2 (κ_{u1} ‖∆‖_2 + κ_{u2} √(log(p)/n) ‖∆‖_1) for all ∆ : ‖∆‖_2 ≤ 1,\n\nso that by a similar argument, with probability at least 1 − c′1 exp(−c′2 n), the loss function L satisfies RSS(k) with parameter κ_{u1} provided n > 4(κ_{u2}/κ_{u1})^2 ηd log(p).\n\nNoise Level. Next, we obtain a bound on the noise level λn ≥ ‖∇L(θ∗)‖∞, following similar arguments to [20]. Let W denote the gradient ∇L(θ∗) of the loss function (4). Any entry of W has the form W_t = (1/n) Σ_{i=1}^n Z^{(i)}_{rt}, where the Z^{(i)}_{rt} = x^{(i)}_t (x^{(i)}_r − P(x_r = 1 | x^{(i)}_{V\r})) are zero-mean, i.i.d. and bounded, |Z^{(i)}_{rt}| ≤ 1. 
Thus, an application of Hoeffding's inequality yields that P[|W_t| > δ] ≤ 2 exp(−2nδ^2). Applying a union bound over the indices in W, we get P[‖W‖∞ > δ] ≤ 2 exp(−2nδ^2 + log(p)). Thus, if λn = (log(p)/n)^{1/2}, then ‖W‖∞ ≤ λn with probability at least 1 − exp(−nλn^2 + log(p)).\n\nWe can now verify that under the assumptions in the corollary, the conditions on the stopping size εS and the minimum absolute value of the non-zero parameters min_{j∈S∗} |θ∗_j| are satisfied. Moreover, from the discussion above, under the sample size scaling in the corollary, the required RSC and RSS conditions hold as well. Thus, Theorem 1 yields that each node neighborhood is recovered with no false exclusions or inclusions with probability at least 1 − c′ exp(−c′′ n). An application of a union bound over all nodes completes the proof.\n\nRemarks. The sufficient condition on the parameters imposed by the greedy algorithm is a restricted strong convexity condition [19], which is weaker than the irrepresentable condition required by [20]. Further, the number of samples required for sparsistent graph recovery scales as O(d^2 log p), where d is the maximum node degree, in contrast to O(d^3 log p) for the ℓ1-regularized counterpart. We corroborate this in our simulations, where we find that the greedy algorithm requires fewer observations than [20] for sparsistent graph recovery.\n\nWe also note that the result can also be extended to the general pairwise graphical model case, where each random variable takes values in the range {1, . . . , m}. 
In that case, the conditional likelihood of each node conditioned on the rest of the nodes takes the form of a multiclass logistic model, and the greedy algorithm would take the form of a “group” forward-backward greedy algorithm, which would add or remove all the parameters corresponding to an edge as a group. Our analysis, however, naturally extends to such a group greedy setting as well. The analysis for RSC and RSS remains the same, and for bounds on λn, see equation (12) in [15]. We defer further discussion on this due to the lack of space.\n\n[Figure 1 appears here, with four panels plotting probability of success versus the control parameter for the greedy algorithm and node-wise logistic regression, for p = 36, 64, 100: (a) Chain (Line Graph), (b) 4-Nearest Neighbor (Grid Graph), (c) Star, (d) Chain, 4-Nearest Neighbor and Star Graphs.]\n\nFig 1: Plots of success probability P[N̂±(r) = N∗(r), ∀r ∈ V] versus the control parameter β(n, p, d) = n/[20d log(p)] for the Ising model on (a) chain (d = 2), (b) 4-nearest neighbor (d = 4) and (c) star graph (d = 0.1p). The coupling parameters are chosen randomly from θ∗_st = ±0.50 for both greedy and node-wise ℓ1-regularized logistic regression methods. 
As our theorem suggests and these figures show, the greedy algorithm requires fewer samples to recover the exact structure of the graphical model.\n\n5 Experimental Results\n\nWe now present experimental results that illustrate the power of Algorithm 2 and support our theoretical guarantees. We simulated structure learning of different graph structures and compared the learning rates of our method to that of node-wise ℓ1-regularized logistic regression as outlined in [20].\n\nWe performed experiments using 3 different graph structures: (a) chain (line graph), (b) 4-nearest neighbor (grid graph) and (c) star graph. For each experiment, we assumed a pairwise binary Ising model in which each θ∗_rt = ±1 randomly. For each graph type, we generated a set of n i.i.d. samples {x(1), ..., x(n)} using Gibbs sampling. We then attempted to learn the structure of the model using both Algorithm 2 as well as node-wise ℓ1-regularized logistic regression, and compared the actual graph structure with the empirically learned graph structures. If the graph structures matched completely then we declared the result a success; otherwise we declared the result a failure. We compared these results over a range of sample sizes (n) and averaged the results for each sample size over a batch of size 10. For all greedy experiments we set the stopping threshold εS = c log(np)/n, where c is a tuning constant, as suggested by Theorem 2, and set the backward step factor ν = 0.5. For all logistic regression experiments we set the regularization parameter λn = c′ (log(p)/n)^{1/2}, where c′ was set via cross-validation.\n\nFigure 1 shows the results for the chain (d = 2), grid (d = 4) and star (d = 0.1p) graphs using both Algorithm 2 and node-wise ℓ1-regularized logistic regression for three different graph sizes p ∈ {36, 64, 100} with mixed (random sign) couplings. 
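The Gibbs sampling step used to generate such data can be sketched as follows; this is our own illustrative sketch of a standard Gibbs sweep for the Ising model in (2) with zero node-wise potentials, not the authors' code, and the function name and burn-in/thinning defaults are ours:

```python
import numpy as np

def gibbs_ising(theta, n_samples, burn_in=500, thin=10, seed=0):
    """Draw +/-1 samples from the Ising model (2); theta is a symmetric
    (p, p) coupling matrix with zero diagonal and zero node-wise terms."""
    rng = np.random.default_rng(seed)
    p = theta.shape[0]
    x = rng.choice([-1, 1], size=p)
    samples = []
    for sweep in range(burn_in + n_samples * thin):
        for r in range(p):
            # Logistic conditional of the Ising model:
            # P(x_r = +1 | x_rest) = 1 / (1 + exp(-2 * sum_t theta_rt x_t))
            eta = theta[r] @ x - theta[r, r] * x[r]
            x[r] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * eta)) else -1
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            samples.append(x.copy())
    return np.array(samples)
```

The burn-in and thinning are standard heuristics to reduce dependence on the initialization and between successive retained samples; for a chain with positive couplings, adjacent coordinates of the draws come out positively correlated, as expected.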
For each sample size, we generated a batch of 10 different graphical models and averaged the probability of success (complete structure learned) over the batch. Each curve then represents the probability of success versus the control parameter β(n, p, d) = n/[20d log(p)], which increases with the sample size n. These results support our theoretical claims and demonstrate the efficiency of the greedy method in comparison to node-wise ℓ1-regularized logistic regression [20].

6 Acknowledgements

We would like to acknowledge the support of NSF grant IIS-1018426.

References

[1] P. Abbeel, D. Koller, and A. Y. Ng. Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research, 7:1743–1788, 2006.
[2] A. Agarwal, S. Negahban, and M. Wainwright. Convergence rates of gradient methods for high-dimensional statistical recovery. In NIPS, 2010.
[3] F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
[4] G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov random fields from samples: Some easy observations and algorithms. In RANDOM, 2008.
[5] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 2006.
[6] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Computing, 20(1):33–61, 1998.
[7] D. Chickering. Learning Bayesian networks is NP-complete. Proceedings of AI and Statistics, 1995.
[8] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. Info. Theory, 14(3):462–467, 1968.
[9] I. Csiszár and Z. Talata. Consistent estimation of the basic neighborhood structure of Markov random fields. The Annals of Statistics, 34(1):123–145, 2006.
[10] C. Dahinden, M. Kalisch, and P. Bühlmann.
Decomposition and model selection for large contingency tables. Biometrical Journal, 52(2):233–252, 2010.
[11] S. Dasgupta. Learning polytrees. In Uncertainty in Artificial Intelligence, pages 134–141, 1999.
[12] D. Donoho and M. Elad. Maximal sparsity representation via ℓ1 minimization. Proc. Natl. Acad. Sci., 100:2197–2202, March 2003.
[13] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28:337–374, 2000.
[14] E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31:253–258, 1925.
[15] A. Jalali, P. Ravikumar, V. Vasuki, and S. Sanghavi. On learning discrete graphical models using group-sparse regularization. In Inter. Conf. on AI and Statistics (AISTATS) 14, 2011.
[16] S.-I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using ℓ1-regularization. In Neural Information Processing Systems (NIPS) 19, 2007.
[17] N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3), 2006.
[18] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Neural Information Processing Systems (NIPS) 22, 2009.
[19] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. arXiv preprint, 2010.
[20] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 38(3):1287–1319, 2010.
[21] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
[22] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search.
MIT Press, 2000.
[23] N. Srebro. Maximum likelihood bounded tree-width Markov networks. Artificial Intelligence, 143(1):123–138, 2003.
[24] V. N. Temlyakov. Greedy approximation. Acta Numerica, 17:235–409, 2008.
[25] S. van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36:614–645, 2008.
[26] D. J. A. Welsh. Complexity: Knots, Colourings, and Counting. LMS Lecture Note Series. Cambridge University Press, Cambridge, 1993.
[27] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Neural Information Processing Systems (NIPS) 21, 2008.
[28] T. Zhang. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555–568, 2009.