{"title": "Graphical Models via Generalized Linear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1358, "page_last": 1366, "abstract": "Undirected graphical models, or Markov networks, such as Gaussian graphical models and Ising models enjoy popularity in a variety of applications.  In many settings, however, data may not follow a Gaussian or binomial distribution assumed by these models. We introduce a new class of graphical models based on generalized linear models (GLM) by assuming that node-wise conditional distributions arise from exponential families.  Our models allow one to estimate networks for a wide class of exponential distributions, such as the Poisson, negative binomial, and exponential, by fitting penalized GLMs to select the neighborhood for each node. A major contribution of this paper is the rigorous statistical analysis showing that with high probability, the neighborhood of our graphical models can be recovered exactly. We provide examples of high-throughput genomic networks learned via our GLM graphical models for multinomial and Poisson distributed data.", "full_text": "Graphical Models via Generalized Linear Models\n\nEunho Yang\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\neunho@cs.utexas.edu\n\nPradeep Ravikumar\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\npradeepr@cs.utexas.edu\n\nGenevera I. Allen\n\nDepartment of Statistics\n\nRice University\n\ngallen@rice.edu\n\nZhandong Liu\n\nDepartment of Pediatrics-Neurology\n\nBaylor College of Medicine\n\nzhandonl@bcm.edu\n\nAbstract\n\nUndirected graphical models, also known as Markov networks, enjoy popularity\nin a variety of applications. The popular instances of these models such as Gaus-\nsian Markov Random Fields (GMRFs), Ising models, and multinomial discrete\nmodels, however do not capture the characteristics of data in many settings. 
We\nintroduce a new class of graphical models based on generalized linear models\n(GLMs) by assuming that node-wise conditional distributions arise from\nexponential families. Our models allow one to estimate multivariate Markov networks\ngiven any univariate exponential family distribution, such as Poisson, negative binomial,\nand exponential, by fitting penalized GLMs to select the neighborhood for each\nnode. A major contribution of this paper is the rigorous statistical analysis\nshowing that with high probability, the neighborhood of our graphical models can be\nrecovered exactly. We also provide examples of non-Gaussian high-throughput\ngenomic networks learned via our GLM graphical models.\n\n1 Introduction\n\nUndirected graphical models, also known as Markov random fields, are an important class of\nstatistical models that have been extensively used in a wide variety of domains, including statistical\nphysics, natural language processing, image analysis, and medicine. The key idea in this class of\nmodels is to represent the joint distribution as a product of clique-wise compatibility functions; given\nan underlying graph, each of these compatibility functions depends only on a subset of variables\nwithin any clique of the underlying graph. Such a factored graphical model distribution can also be\nrelated to an exponential family distribution [1], where the unnormalized probability is expressed\nas the exponential of a weighted linear combination of clique-wise sufficient statistics. Learning a\ngraphical model distribution from data within this exponential family framework can be reduced to\nlearning weights on these sufficient statistics. An important modeling question is then, how do we\nchoose suitable sufficient statistics? In the case of discrete random variables, sufficient statistics can\nbe taken as indicator functions as in the Ising or Potts model. 
These, however, are not suited to all\nkinds of discrete variables, such as non-negative integer counts. Similarly, in the case of\ncontinuous variables, Gaussian Markov Random Fields (GMRFs) are popular. The multivariate normal\ndistribution imposed by the GMRF, however, is a stringent assumption; the marginal distribution of\nany variable must also be Gaussian.\nIn this paper, we propose a general class of graphical models beyond the Ising model and the GMRF\nto encompass variables arising from all exponential family distributions. Our approach is motivated\nby recent state-of-the-art methods for learning the standard Ising and Gaussian MRFs [2, 3, 4].\nThe key idea in these recent methods is to learn the MRF graph structure by estimating\nnode-neighborhoods, which are estimated by maximizing the likelihood of each node conditioned on\nthe rest of the nodes. These node-wise fitting methods have been shown to be both computationally\nand statistically attractive. Here, we study the general class of models obtained by the following\nconstruction: suppose the node-conditional distributions of each node conditioned on the rest of the\nnodes are Generalized Linear Models (GLMs) [5]. By the Hammersley-Clifford Theorem [6] and\nsome algebra as derived in [7], these node-conditional distributions entail a global distribution that\nfactors according to cliques defined by the graph obtained from the node-neighborhoods. Moreover,\nthese have a particular set of potential functions specified by the GLM. The resulting class of MRFs\nbroadens the class of models available off-the-shelf beyond the standard Ising, indicator-discrete, and\nGaussian MRFs.\nBeyond our initial motivation of finding more general graphical model sufficient statistics, a broader\nclass of parametric graphical models is important for a number of reasons. 
First, our models pro-\nvide a principled approach to model multivariate distributions and network structures among a large\nnumber of variables. For many non-Gaussian exponential families, multivariate distributions typi-\ncally do not exist in an analytical or computationally tractable form. Graphical model GLMs provide\na way to \u201cextend\u201d univariate exponential families of distributions to the multivariate case and model\nand study relationships between variables for these families of distributions. Second, while some\nhave proposed to extend the GMRF to a non-parametric class of graphical models by \ufb01rst Gaussian-\nizing the data and then \ufb01tting a GMRF over the transformed variables [8], the sample complexity of\nsuch non-parametric methods is often inferior to parametric methods. Thus for modeling data that\nclosely follows a non-Gaussian distribution, statistical power for network recovery can be gained\nby directly \ufb01tting parametric GLM graphical models. Third, and speci\ufb01cally for multivariate count\ndata, others have suggested combinatorial approaches to \ufb01tting graphical models, mostly in the con-\ntext of contingency tables [6, 9, 1, 10]. These approaches, however, are computationally intractable\nfor even moderate numbers of variables.\nFinally, potential applications for our GLM graphical models abound. Networks of call-times, time\nspent on websites, diffusion processes, and life-cycles can be modeled with exponential graphical\nmodels; other skewed multivariate data can be modeled with gamma or chi-squared graphical mod-\nels. Perhaps the most interesting motivating applications are for multivariate count data such as from\nwebsite visits, user-ratings, crime and disease incident reports, bibliometrics, and next-generation\ngenomic sequencing technologies. The latter is a relatively new high-throughput technology to mea-\nsure gene expression that is rapidly replacing the microarray [11]. 
As Gaussian graphical models are\nwidely used to infer genomic regulatory networks from microarray data, Poisson and negative bino-\nmial graphical models may be important for inferring genomic networks from the multivariate count\ndata arising from this emerging technology. Beyond next generation sequencing, there has been a\nrecent proliferation of new high-throughput genomic technologies that produce non-Gaussian data.\nThus, our more general class of GLM graphical models can be used for inferring genomic networks\nfrom these new high-throughput technologies.\nThe construction of our GLM graphical models also suggests a natural method for learning such\nmodels: node-wise neighborhood estimation by \ufb01tting sparsity constrained GLMs. A main contri-\nbution of this paper is to provide a sparsistency analysis for the recovery of the underlying graph\nstructure of this new class of MRFs. The presence of non-linearities arising from the GLM poses\nsubtle technical issues not present in the linear case [2]. Indeed, for the speci\ufb01c cases of logistic, and\nmultinomial respectively, [3, 4] derive such a sparsistency analysis via fairly extensive arguments\nwhich were tuned to those speci\ufb01c cases. Here, we generalize their analysis to general GLMs, which\nrequires a slightly modi\ufb01ed M-estimator and a more subtle theoretical analysis. We note that this\nanalysis might be of independent interest even outside the context of modeling and recovering graph-\nical models. In recent years, there has been a trend towards uni\ufb01ed statistical analyses that provide\nstatistical guarantees for broad classes of models via general theorems [12]. Our result is in this vein\nand provides structure recovery for the class of sparsity constrained generalized linear models. 
We\nhope that the techniques we introduce might be of use to address the outstanding question of sparsity\nconstrained M-estimation in its full generality.\n\n2 A New Class of Graphical Models\n\nProblem Setup and Background. Suppose X = (X1, . . . , Xp) is a random vector, with each\nvariable Xi taking values in a set X . Suppose G = (V, E) is an undirected graph over p nodes\ncorresponding to the p variables; the corresponding graphical model is a set of distributions that\nsatisfy Markov independence assumptions with respect to the graph. By the Hammersley-Clifford\ntheorem, any such distribution also factors according to the graph in the following way. Let C be\na set of cliques (fully-connected subgraphs) of the graph G, and let {\u03c6c(Xc) : c \u2208 C} be a set of\nclique-wise sufficient statistics. With this notation, any distribution of X within the graphical model\nfamily represented by the graph G takes the form:\n\n    P(X) \u221d exp{ \u2211_{c\u2208C} \u03b8_c \u03c6_c(X_c) },    (1)\n\nwhere {\u03b8c} are weights over the sufficient statistics. With a pairwise graphical model distribution,\nthe set of cliques consists of the set of nodes V and the set of edges E, so that\n\n    P(X) \u221d exp{ \u2211_{s\u2208V} \u03b8_s \u03c6_s(X_s) + \u2211_{(s,t)\u2208E} \u03b8_{st} \u03c6_{st}(X_s, X_t) }.    (2)\n\nAs previously discussed, an important question is how to select the class of sufficient statistics, \u03c6, in\nparticular so as to obtain a multivariate extension of specified univariate parametric distributions. We\nnext outline a subclass of graphical models where the node-conditional distributions are exponential\nfamily distributions, with an important special case where these node-conditional distributions are\ngeneralized linear models (GLMs). 
Then, in Section 3, we will study how to learn the underlying\ngraph structure, or infer the edge set E, providing an M-estimator and sufficient conditions under\nwhich the estimator recovers the graph structure with high probability.\nGraphical Models via GLMs. In this section, we investigate the class of models that arise from\nspecifying the node-conditional distributions as exponential families. Specifically, suppose we are\ngiven a univariate exponential family distribution,\n\n    P(Z) = exp(\u03b8 B(Z) + C(Z) \u2212 D(\u03b8)),\n\nwith sufficient statistics B(Z), base measure C(Z), and D(\u03b8) as the log-normalization constant.\nLet X = (X1, X2, . . . , Xp) be a p-dimensional random vector; and let G = (V, E) be an\nundirected graph over p nodes corresponding to the p variables. Now suppose the distribution of Xs\ngiven the rest of nodes X_{V\\s} is given by the above exponential family, but with the canonical\nexponential family parameter set to a linear combination of k-th order products of univariate functions\n{B(Xt)}_{t\u2208N(s)}. This gives the following conditional distribution:\n\n    P(X_s | X_{V\\s}) = exp{ B(X_s) ( \u03b8_s + \u2211_{t\u2208N(s)} \u03b8_{st} B(X_t) + \u2211_{t2,t3\u2208N(s)} \u03b8_{s t2 t3} B(X_{t2}) B(X_{t3})\n        + \u2211_{t2,...,tk\u2208N(s)} \u03b8_{s t2...tk} \u220f_{j=2}^{k} B(X_{tj}) ) + C(X_s) \u2212 \u00afD(X_{V\\s}) },    (3)\n\nwhere C(Xs) is specified by the exponential family, and \u00afD(X_{V\\s}) is the log-normalization constant.\nBy the Hammersley-Clifford theorem, and some elementary calculation, this conditional distribution\ncan be shown to specify the following unique joint distribution P(X1, . . . , Xp):\nProposition 1. Suppose X = (X1, X2, . . . , Xp) is a p-dimensional random vector, and its\nnode-conditional distributions are specified by (3). Then its joint distribution P(X1, . . . , Xp) is given by:\n\n    P(X) = exp{ \u2211_s \u03b8_s B(X_s) + \u2211_{s\u2208V} \u2211_{t\u2208N(s)} \u03b8_{st} B(X_s) B(X_t)\n        + \u2211_{s\u2208V} \u2211_{t2,...,tk\u2208N(s)} \u03b8_{s...tk} B(X_s) \u220f_{j=2}^{k} B(X_{tj}) + \u2211_s C(X_s) \u2212 A(\u03b8) },    (4)\n\nwhere A(\u03b8) is the log-normalization constant.\n\nAn important question is whether the conditional and joint distributions specified above have the\nmost general form, under just the assumption of exponential family node-conditional distributions.\nIn particular, note that the canonical parameter in the previous proposition is a tensor factorization\nof the univariate sufficient statistic, with pair-wise and higher-order interactions, which seems a bit\nstringent. Interestingly, by extending the argument from [7] and the Hammersley-Clifford Theorem,\nwe can show that indeed (3) and (4) have the most general form.\nProposition 2. Suppose X = (X1, X2, . . . , Xp) is a p-dimensional random vector, and its\nnode-conditional distributions are specified by an exponential family,\n\n    P(X_s | X_{V\\s}) = exp{ E(X_{V\\s}) B(X_s) + C(X_s) \u2212 \u00afD(X_{V\\s}) },    (5)\n\nwhere the function E(X_{V\\s}) (and hence the log-normalization constant \u00afD(X_{V\\s})) only depends on\nvariables Xt in N(s). Further, suppose the corresponding joint distribution factors according to the\ngraph G = (V, E), with the factors over cliques of size at most k. 
Then, the conditional distribution\nin (5) has the tensor-factorized form in (3), and the corresponding joint distribution has the form in\n(4).\n\nThe proposition thus tells us that under the general assumptions that (a) the joint distribution is a\ngraphical model that factors according to a graph G, and has clique-factors of size at most k, and\n(b) its node-conditional distribution follows an exponential family, it necessarily follows that the\nconditional and joint distributions are given by (3) and (4) respectively.\nAn important special case is when the joint distribution has factors of size at most two. The\nconditional distribution then is given by:\n\n    P(X_s | X_{V\\s}) = exp{ \u03b8_s B(X_s) + \u2211_{t\u2208N(s)} \u03b8_{st} B(X_s) B(X_t) + C(X_s) \u2212 \u00afD(X_{V\\s}) },    (6)\n\nwhile the joint distribution is given as\n\n    P(X) = exp{ \u2211_s \u03b8_s B(X_s) + \u2211_{(s,t)\u2208E} \u03b8_{st} B(X_s) B(X_t) + \u2211_s C(X_s) \u2212 A(\u03b8) }.    (7)\n\nNote that when the univariate sufficient statistic function B(\u00b7) is a linear function B(Xs) = Xs,\nthen the conditional distribution in (6) is precisely a generalized linear model [5] in canonical form,\n\n    P(X_s | X_{V\\s}) = exp{ \u03b8_s X_s + \u2211_{t\u2208N(s)} \u03b8_{st} X_s X_t + C(X_s) \u2212 \u00afD(X_{V\\s}; \u03b8) },    (8)\n\nwhile the joint distribution has the form,\n\n    P(X) = exp{ \u2211_s \u03b8_s X_s + \u2211_{(s,t)\u2208E} \u03b8_{st} X_s X_t + \u2211_s C(X_s) \u2212 A(\u03b8) }.    (9)\n\nIn the subsequent sections, we will refer to the entire class of models in (7) as GLM graphical\nmodels, but focus on the case (9) with linear functions B(Xs) = Xs.\nExamples. The GLM graphical models provide multivariate or Markov network extensions of\nunivariate exponential family distributions. 
The popular Gaussian graphical model and Ising model can\nthus also be represented by (7). Consider the latter, for example, where for the Bernoulli distribution,\nwe have that B(X) = X, C(X) = 0, and A(\u03b8) is the log-partition function; plugging these into (9),\nwe have the form of the Ising model studied in [3]. The form of the multinomial graphical model,\nan extension of the Ising model, can also be represented by (7) and has been previously studied in\n[4] and others.\nIt is instructive to consider the domain of the set of all possible valid parameters in the GLM\ngraphical model (9); namely those that ensure that the density is normalizable, or equivalently, so that the\nlog-partition function satisfies A(\u03b8) < +\u221e. The Ising model imposes no constraint on its\nparameters, {\u03b8st}, for normalizability, since there are finitely many configurations of the binary random\nvector X. For other exponential families, with countable discrete or continuous valued variables, the\nGLM graphical model does impose additional constraints on valid parameters. Consider the examples\nof the Poisson and exponential distributions. The Poisson family has sufficient statistic B(X) = X\nand base measure C(X) = \u2212log(X!). With some algebra, we can show that A(\u03b8) < +\u221e implies\n\u03b8st \u2264 0 \u2200 s, t. Thus, the Poisson graphical model can only capture negative conditional relationships\nbetween variables. Next, consider the exponential distribution with sufficient statistic B(X) = \u2212X and base\nmeasure C(X) = 0. To ensure that the density is finitely integrable, so that A(\u03b8) < +\u221e, we then\nrequire that \u03b8st \u2265 0 \u2200 s, t. 
Similar constraints on the parameter space are necessary to ensure proper\ndensity functions for several other exponential family graphical models as well.\n\n3 Statistical Guarantees\n\nIn this section, we study the problem of learning the graph structure of an underlying GLM graphical\nmodel given iid samples. Specifically, we assume that we are given n samples X_1^n = {X^(i)}_{i=1}^{n}\nfrom a GLM graphical model:\n\n    P(X; \u03b8\u2217) = exp{ \u2211_{(s,t)\u2208E\u2217} \u03b8\u2217_{st} X_s X_t + \u2211_s C(X_s) \u2212 A(\u03b8) }.    (10)\n\nWe have removed node-wise terms for simplicity, noting that our analysis extends to the general\ncase. The goal in graphical model structure recovery is to recover the edges E\u2217 of the underlying\ngraph G = (V, E\u2217). Following [3, 4], we will approach this problem via neighborhood estimation,\nwhere we estimate the neighborhood of each node individually, and then stitch these together to\nform the global graph estimate. Specifically, if we have an estimate N\u0302(s) for the true neighborhood\nN\u2217(s), then we can estimate the overall graph structure as:\n\n    E\u0302 = \u222a_{s\u2208V} \u222a_{t\u2208N\u0302(s)} {(s, t)}.    (11)\n\nIn order to estimate the neighborhood of any node, we consider the sparsity constrained conditional\nMLE. Given the joint distribution in (10), the conditional distribution of Xs given the rest of the\nnodes is given by:\n\n    P(X_s | X_{V\\s}) = exp{ X_s ( \u2211_{t\u2208N(s)} \u03b8\u2217_{st} X_t ) + C(X_s) \u2212 D( \u2211_{t\u2208N(s)} \u03b8\u2217_{st} X_t ) }.    (12)\n\nLet \u03b8\u2217_{\\s} = {\u03b8\u2217_{st}}_{t\u2208V\\s} \u2208 R^{p\u22121} be a zero-padded vector, with entries \u03b8\u2217_{st} for t \u2208 N(s) and \u03b8\u2217_{st} = 0\nfor t \u2209 N(s). Given n samples X_1^n = {X^(i)}_{i=1}^{n}, we can write the conditional log-likelihood of the\ndistribution (12) as:\n\n    \u2113(\u03b8_{\\s}; X_1^n) := \u2212(1/n) log \u220f_{i=1}^{n} P(X_s^(i) | X_{\\s}^(i), \u03b8_{\\s})\n        = (1/n) \u2211_{i=1}^{n} [ \u2212X_s^(i) \u27e8\u03b8_{\\s}, X_{\\s}^(i)\u27e9 + D(\u27e8\u03b8_{\\s}, X_{\\s}^(i)\u27e9) ],\n\nwhere the term C(X_s^(i)), which does not depend on \u03b8_{\\s}, has been dropped.\nWe can then solve the \u21131 regularized conditional log-likelihood loss for each node Xs:\n\n    min_{\u03b8_{\\s} \u2208 R^{p\u22121}} \u2113(\u03b8_{\\s}; X_1^n) + \u03bb_n \u2016\u03b8_{\\s}\u2016_1.    (13)\n\nGiven the solution \u03b8\u0302_{\\s} of the M-estimation problem above, we then estimate the node-neighborhood\nof s as N\u0302(s) = {t \u2208 V\\s : \u03b8\u0302_{st} \u2260 0}. In the following, when we focus on a fixed node s \u2208 V,\nwe will overload notation, and use \u03b8 \u2208 R^{p\u22121} as the parameters of the conditional distribution,\nsuppressing the dependence on s.\nIn the rest of the section, we first discuss the assumptions we impose on the GLM graphical model\nparameters. The first set of assumptions are standard irrepresentable-type conditions imposed for\nstructure recovery in high-dimensional statistical estimators, and in particular, our assumptions\nmirror those in [3]. The second set of assumptions are key to our generalized analysis of the class of\nGLM graphical models as a whole. We then follow with our main theorem, that guarantees structure\nrecovery under these assumptions, with high probability even in high-dimensional regimes.\n\nOur first set of assumptions use the Fisher Information matrix, Q\u2217_s = \u2207\u00b2\u2113(\u03b8\u2217_s; X_1^n), which is the\nHessian of the node-conditional log-likelihood. In the following, we will simply use Q\u2217 instead of\nQ\u2217_s where the reference node s should be understood implicitly. We also use S = {(s, t) : t \u2208 N(s)}\nto denote the true neighborhood of node s, and S^c to denote its complement. We use Q\u2217_{SS} to denote\nthe d \u00d7 d sub-matrix indexed by S. Our first two assumptions are as follows:\nAssumption 1 (Dependency condition). There exists a constant \u03bbmin > 0 such that \u03bbmin(Q\u2217_{SS}) \u2265\n\u03bbmin. Moreover, there exists a constant \u03bbmax < \u221e such that \u03bbmax(E\u0302[X_{\\s} X_{\\s}^T]) \u2264 \u03bbmax.\nAssumption 2 (Incoherence condition). We also need an incoherence or irrepresentable condition\non the Fisher information matrix as in [3]. Specifically, there exists a constant \u03b1 > 0, such that\nmax_{t\u2208S^c} \u2016Q\u2217_{tS} (Q\u2217_{SS})^{\u22121}\u2016_1 \u2264 1 \u2212 \u03b1.\nA key technical facet of the linear, logistic, and multinomial models in [2, 3, 4] and used heavily in\ntheir proofs, is that the random variables {Xs} there were bounded with high probability.\nUnfortunately, in the general GLM distribution in (12), we cannot assume this explicitly. Nonetheless, we\nshow that we can analyze the corresponding regularized M-estimation problems, provided the first\nand second moments are bounded.\nAssumption 3. The first and second moments of the distribution in (10) are bounded as follows. The\nfirst moment \u00b5\u2217 := E[X] satisfies \u2016\u00b5\u2217\u2016_2 \u2264 \u03bam; the second moment satisfies max_{t\u2208V} E[X_t\u00b2] \u2264 \u03bav.\nWe also need smoothness assumptions on the log-normalization constants:\nAssumption 4. The log-normalization constant A(\u00b7) of the joint distribution (10) satisfies:\nmax_{u:\u2016u\u2016\u22641} \u03bbmax(\u2207\u00b2A(\u03b8\u2217 + u)) \u2264 \u03bah.\nAssumption 5. 
The log-partition function D(\u00b7) of the node-conditional distribution (12)\nsatisfies: There exist constants \u03ba1 and \u03ba2 (that depend on the exponential family) s.t.\nmax{|D\u2032\u2032(\u03ba1 log \u03b7)|, |D\u2032\u2032\u2032(\u03ba1 log \u03b7)|} \u2264 n^{\u03ba2} where \u03b7 = max{n, p}, \u03ba1 \u2265 (9/2) \u2016\u03b8\u2217\u2016_2 and \u03ba2 \u2208\n[0, 1/4].\n\nAssumptions 3 and 4 are the key technical conditions under which we can generalize the analyses\nin [2, 3, 4] to the general GLM case. In particular, we can show that the statements of the following\npropositions hold, which show that the random vectors X following the GLM graphical model in\n(10) are suitably well-behaved:\nProposition 3. Suppose X is a random vector with the distribution specified in (10). Then, for any\nvector u \u2208 R^p such that \u2016u\u2016_2 \u2264 c\u2032, any positive constant \u03b4, and some constant c > 0,\n\n    P( |\u27e8u, X\u27e9| \u2265 \u03b4 log \u03b7 ) \u2264 c \u03b7^{\u2212\u03b4/c\u2032}.\n\nProposition 4. Suppose X is a random vector with the distribution specified in (10). Then, for\n\u03b4 \u2264 min{2\u03bav/3, \u03bah + \u03bav}, and some constant c > 0,\n\n    P( (1/n) \u2211_{i=1}^{n} (X_s^(i))\u00b2 \u2265 \u03b4 ) \u2264 2 exp(\u2212c n \u03b4\u00b2).\n\nPutting these key technical results and assumptions together, we arrive at our main result:\nTheorem 1. Consider a GLM graphical model distribution as specified in (10), with true parameter\n\u03b8\u2217 and associated edge set E\u2217 that satisfies Assumptions 1-5. Suppose that\nmin_{(s,t)\u2208E\u2217} |\u03b8\u2217_{st}| \u2265 (10/\u03bbmin) \u221ad \u03bb_n, where d is the maximum neighborhood size. Suppose also that the\nregularization parameter is chosen such that \u03bb_n \u2265 M ((2\u2212\u03b1)/\u03b1) \u221a(log p / n^{1\u2212\u03ba2}) for some constant M > 0. Then, there exist\npositive constants L, K1 and K2 such that if n \u2265 L {d\u00b2 log p (max{log n, log p})\u00b2}^{1/(1\u22123\u03ba2)}, then with\nprobability at least 1 \u2212 exp(\u2212K1 \u03bb_n\u00b2 n) \u2212 K2 max{n, p}^{\u22125/4}, the following statements hold:\n(a) (Unique Solution) For each node s \u2208 V, the solution of the M-estimation problem in (13) is\nunique, and\n(b) (Correct Neighborhood Recovery) The M-estimate also recovers the true neighborhood exactly,\nso that N\u0302(s) = N(s).\n\nFigure 1: Probabilities of successful support recovery for a Poisson grid structure (\u03c9 = \u22120.1). The\nprobability of successful edge recovery vs. n (Left), and the probability of successful edge recovery\nvs. control parameter \u03b2 = n/(c log p) (Right).\n\nNote that if the neighborhood of each node is recovered with high probability, then by a simple\nunion bound, the estimate in (11), E\u0302 = \u222a_{s\u2208V} \u222a_{t\u2208N\u0302(s)} {(s, t)}, is equal to the true edge set E\u2217 with\nhigh probability.\nAlso note that \u03ba2 in the statement is a constant from Assumption 5. The Poisson family has one\nof the steepest log-partition functions: D(\u03b7) = exp(\u03b7). Hence, in order to satisfy Assumption 5,\nwe need \u2016\u03b8\u2217\u2016_2 \u2264 (1/18) (log n / log p) with \u03ba2 = 1/4. On the other hand, for the binomial, multinomial or\nGaussian cases studied in [2, 3, 4], we can recover their results with \u03ba2 = 0 since the log-partition\nfunction D(\u00b7) of these families are upper bounded by some constant for any input. Nevertheless, we\nneed to restrict \u03b8\u2217 to satisfy Assumption 4 so that the variables are bounded with high probability in\nPropositions 3 and 4 for any GLM case.\n\n4 Experiments\n\nExperiments on Simulated Networks. We provide a small simulation study that demonstrates the\nconsequences of Theorem 1 when the conditional distribution in (12) has the form of a Poisson\ndistribution. We performed experiments on lattice (4 nearest neighbor) graphs with identical edge weight\n\u03c9 for all edges. Simulating data via Gibbs sampling, we solved the sparsity-constrained optimization\nproblem with \u03bb_n set to a constant factor of \u221a(log p / n). The left panel of Figure 1 shows the probability of\nsuccessful edge recovery for different numbers of nodes, p \u2208 {64, 100, 169, 225}. In the right panel\nof Figure 1, we re-scale the sample size n using the \u201ccontrol parameter\u201d \u03b2 = n/(c log p) for some\nconstant c. Each point in the plot indicates the fraction of 50 trials in which all edges are successfully\nrecovered. We can see that the curves for different problem sizes are well aligned, consistent with the\nresults of Theorem 1.\nLearning Genomic Networks. Gaussian graphical models learned from microarray data have often\nbeen used to study high-throughput genomic regulatory networks. Our GLM graphical models will\nbe important for understanding genomic networks learned from other high-throughput technologies\nthat do not produce approximately Gaussian data. Here, we demonstrate the versatility of our model\nby learning two cancer genomic networks, a genomic copy number aberration network (from aCGH\ndata) for Glioblastoma learned by multinomial graphical models and a meta-miRNA inhibitory\nnetwork (from next generation sequencing data) for breast cancer learned by Poisson graphical models.\nLevel III data, breast cancer miRNA expression (next generation sequencing) [13] and copy number\nvariation (aCGH) Glioblastoma data [14], was obtained from the Cancer Genome Atlas (TCGA)\ndata portal (http://tcga-data.nci.nih.gov/tcga/), and processed according to standard techniques. 
Data\ndescriptions and processing details are given in the supplemental materials.\nA Poisson graphical model and a multinomial graphical model were fit to the processed miRNA\ndata and aberration data respectively by performing neighborhood selection with the sparsity of the\ngraph determined by stability selection [15]. Our GLM graphical models (Figure 2) reveal results\nconsistent with the cancer genomics literature. The meta-miRNA inhibitory network has three major\nhubs, two of which, mir-519 and mir-520, are known to be breast cancer tumor suppressors [16, 17].\nInterestingly, let-7, a well-known miRNA involved in tumor metastasis [18], plays a central role\nin our network, sharing edges with the five largest hubs; this suggests that our model has learned\nrelevant negative associations between tumor suppressors and enhancers. The Glioblastoma copy\nnumber aberration network reveals five major modules, color coded on the left panel in Figure 2,\nand three of these modules have been previously implicated in Glioblastoma: EGFR in the yellow\nmodule, PTEN in the purple module, and CDK2A in the blue module [19].\n\nFigure 2: Genomic copy number aberration network for Glioblastoma learned via multinomial\ngraphical models (left) and meta-miRNA inhibitory network for breast cancer learned via Poisson\ngraphical models (right).\n\n5 Discussion\n\nWe have introduced a new class of graphical models that arise when we assume that node-wise\nconditional distributions follow an exponential family distribution. We have also provided simple\nM-estimators for learning the network by fitting node-wise penalized GLMs that enjoy strong\nstatistical recovery properties. Our work has broadened the class of off-the-shelf graphical models to\nencompass a wide range of parametric distributions. 
These classes of graphical models may be of\nfurther interest to the statistical community as they provide closed form multivariate densities for\nseveral exponential family distributions (e.g. Poisson, exponential, negative binomial) where few\ncurrently exist. Furthermore, the statistical analysis of our M-estimator required subtle techniques\nthat may be of general interest in the analysis of sparse M-estimation.\nOur work outlines the general class of graphical models for exponential family distributions, but\nthere are many avenues for future work in studying this model for speci\ufb01c distributional families.\nIn particular, our model sometimes places restrictions on the parameter space. A question remains,\ncan these restrictions be relaxed for speci\ufb01c exponential family distributions? Additionally, we have\nfocused on families with linear suf\ufb01cient statistics (e.g. Gaussian, Bernoulli, Poisson, exponential,\nnegative binomial); our models can be studied with non-linear suf\ufb01cient statistics or multi-parameter\ndistributions as well. Overall, our work has opened the door for learning Markov Networks from\na broad class of distributions, the properties and applications of which leave much room for future\nresearch.\n\nAcknowledgments\n\nE.Y. and P.R. acknowledge support from NSF IIS-1149803. G.A. and Z.L. acknowledge sup-\nport from the Collaborative Advances in Biomedical Computing seed funding program at the Ken\nKennedy Institute for Information Technology at Rice University supported by the John and Ann\nDoerr Fund for Computational Biomedicine and by the Center for Computational and Integrative\nBiomedical Research seed funding program at Baylor College of Medicine. G.A. 
also acknowledges\nsupport from NSF DMS-1209017.\n\nReferences\n[1] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends\u00ae in Machine Learning, 1(1-2):1\u2013305, 2008.\n[2] N. Meinshausen and P. B\u00fchlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436\u20131462, 2006.\n[3] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using \u21131-regularized logistic regression. Annals of Statistics, 38(3):1287\u20131319, 2010.\n[4] A. Jalali, P. Ravikumar, V. Vasuki, and S. Sanghavi. On learning discrete graphical models using group-sparse regularization. In Inter. Conf. on AI and Statistics (AISTATS), 14, 2011.\n[5] P. McCullagh and J.A. Nelder. Generalized linear models. Monographs on statistics and applied probability 37. Chapman and Hall/CRC, New York, 1989.\n[6] S.L. Lauritzen. Graphical models, volume 17. Oxford University Press, USA, 1996.\n[7] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):192\u2013236, 1974.\n[8] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research, 10:2295\u20132328, 2009.\n[9] Y.M.M. Bishop, S.E. Fienberg, and P.W. Holland. Discrete multivariate analysis. Springer Verlag, 2007.\n[10] T. Hastie, R. Tibshirani, and J.H. Friedman. 
The elements of statistical learning. Springer, 2 edition, 2009.\n[11] J.C. Marioni, C.E. Mason, S.M. Mane, M. Stephens, and Y. Gilad. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 18(9):1509\u20131517, 2008.\n[12] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, 2010.\n[13] Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61\u201370, 2012.\n[14] Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061\u20131068, October 2008.\n[15] H. Liu, K. Roeder, and L. Wasserman. Stability approach to regularization selection (StARS) for high dimensional graphical models. arXiv preprint arXiv:1006.3316, 2010.\n[16] K. Abdelmohsen, M.M. Kim, S. Srikantan, E.M. Mercken, S.E. Brennan, G.M. Wilson, R. de Cabo, and M. Gorospe. mir-519 suppresses tumor growth by reducing HuR levels. Cell cycle (Georgetown, Tex.), 9(7):1354, 2010.\n[17] I. Keklikoglou, C. Koerner, C. Schmidt, JD Zhang, D. Heckmann, A. Shavinskaya, H. Allgayer, B. G\u00fcckel, T. Fehm, A. Schneeweiss, et al. MicroRNA-520/373 family functions as a tumor suppressor in estrogen receptor negative breast cancer by targeting NF-\u03baB and TGF-\u03b2 signaling pathways. Oncogene, 2011.\n[18] F. Yu, H. Yao, P. Zhu, X. Zhang, Q. Pan, C. Gong, Y. Huang, X. Hu, F. Su, J. Lieberman, et al. let-7 regulates self renewal and tumorigenicity of breast cancer cells. Cell, 131(6):1109\u20131123, 2007.\n[19] R. McLendon, A. Friedman, D. Bigner, E.G. Van Meir, D.J. Brat, G.M. Mastrogianakis, J.J. Olson, T. Mikkelsen, N. Lehman, K. Aldape, et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. 
Nature, 455(7216):1061\u20131068, 2008.\n[20] Jianhua Zhang. Convert segment data into a region by sample matrix to allow for other high level computational analyses, version 1.2.0 edition. Bioconductor package.\n[21] Gerald B W Wertheim, Thomas W Yang, Tien-chi Pan, Anna Ramne, Zhandong Liu, Heather P Gardner, Katherine D Dugan, Petra Kristel, Bas Kreike, Marc J van de Vijver, Robert D Cardiff, Carol Reynolds, and Lewis A Chodosh. The Snf1-related kinase, Hunk, is essential for mammary tumor metastasis. Proceedings of the National Academy of Sciences of the United States of America, 106(37):15855\u201315860, September 2009.\n[22] J.T. Leek, R.B. Scharpf, H.C. Bravo, D. Simcha, B. Langmead, W.E. Johnson, D. Geman, K. Baggerly, and R.A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10):733\u2013739, 2010.\n[23] J. Li, D.M. Witten, I.M. Johnstone, and R. Tibshirani. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics, 2011.\n[24] G. I. Allen and Z. Liu. A Log-Linear Graphical Model for Inferring Genetic Networks from High-Throughput Sequencing Data. IEEE International Conference on Bioinformatics and Biomedicine, 2012.\n[25] J. Bullard, E. Purdom, K. Hansen, and S. Dudoit. Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC bioinformatics, 11(1):94, 2010.", "award": [], "sourceid": 659, "authors": [{"given_name": "Eunho", "family_name": "Yang", "institution": null}, {"given_name": "Genevera", "family_name": "Allen", "institution": null}, {"given_name": "Zhandong", "family_name": "Liu", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}]}