{"title": "Conditional Random Fields via Univariate Exponential Families", "book": "Advances in Neural Information Processing Systems", "page_first": 683, "page_last": 691, "abstract": "Conditional random fields, which model the distribution of a multivariate response conditioned on a set of covariates using undirected graphs, are widely used in a variety of multivariate prediction applications. Popular instances of this class of models such as categorical-discrete CRFs, Ising CRFs, and conditional Gaussian based CRFs, are not however best suited to the varied types of response variables in many applications, including count-valued responses. We thus introduce a \u201cnovel subclass of CRFs\u201d, derived by imposing node-wise conditional distributions of response variables conditioned on the rest of the responses and the covariates as arising from univariate exponential families. This allows us to derive novel multivariate CRFs given any univariate exponential distribution, including the Poisson, negative binomial, and exponential distributions. Also in particular, it addresses the common CRF problem of specifying feature'' functions determining the interactions between response variables and covariates. We develop a class of tractable penalized $M$-estimators to learn these CRF distributions from data, as well as a unified sparsistency analysis for this general class of CRFs showing exact structure recovery can be achieved with high probability.\"", "full_text": "Conditional Random Fields via Univariate\n\nExponential Families\n\nEunho Yang\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\neunho@cs.utexas.edu\n\nPradeep Ravikumar\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\npradeepr@cs.utexas.edu\n\nGenevera I. 
Allen\n\nDepartment of Statistics and\n\nElectrical & Computer Engineering\n\nRice University\n\ngallen@rice.edu\n\nZhandong Liu\n\nDepartment of Pediatrics-Neurology\n\nBaylor College of Medicine\n\nzhandonl@bcm.edu\n\nAbstract\n\nConditional random \ufb01elds, which model the distribution of a multivariate response\nconditioned on a set of covariates using undirected graphs, are widely used in a\nvariety of multivariate prediction applications. Popular instances of this class of\nmodels, such as categorical-discrete CRFs, Ising CRFs, and conditional Gaus-\nsian based CRFs, are not well suited to the varied types of response variables in\nmany applications, including count-valued responses. We thus introduce a novel\nsubclass of CRFs, derived by imposing node-wise conditional distributions of re-\nsponse variables conditioned on the rest of the responses and the covariates as\narising from univariate exponential families. This allows us to derive novel multi-\nvariate CRFs given any univariate exponential distribution, including the Poisson,\nnegative binomial, and exponential distributions. Also in particular, it addresses\nthe common CRF problem of specifying \u201cfeature\u201d functions determining the inter-\nactions between response variables and covariates. We develop a class of tractable\npenalized M-estimators to learn these CRF distributions from data, as well as a\nuni\ufb01ed sparsistency analysis for this general class of CRFs showing exact struc-\nture recovery can be achieved with high probability.\n\n1\n\nIntroduction\n\nConditional random \ufb01elds (CRFs) are a popular class of models that combine the advantages of\ndiscriminative modeling and undirected graphical models. They are widely used across structured\nprediction domains such as natural language processing, computer vision, and bioinformatics. 
The key idea in this class of models is to represent the joint distribution of a set of response variables conditioned on a set of covariates using a product of clique-wise compatibility functions. Given an underlying graph over the response variables, each of these compatibility functions depends on all the covariates, but only on a subset of response variables within any clique of the underlying graph. They are thus a discriminative counterpart of undirected graphical models, where we have covariates that provide information about the multivariate response, and the underlying graph structure encodes conditional independence assumptions among the responses conditioned on the covariates.

There is a key model specification question that arises, however, in any application of CRFs: how do we specify the clique-wise sufficient statistics, or compatibility functions (sometimes also called feature functions), that characterize the conditional graphical model between responses? In particular, how do we tune these to the particular types of variables being modeled? Traditionally, these questions have been addressed either by hand-crafted feature functions, or more generally by discretizing the multivariate response vectors into a set of indicator vectors and then letting the compatibility functions be linear combinations of the product of indicator functions [1]. This approach, however, may not be natural for continuous, skewed continuous, or count-valued random variables. Recently, spurred in part by applications in bioinformatics, there has been much research on other sub-classes of CRFs. The Ising CRF, which models binary responses, was studied by [2] and extended to higher-order interactions by [3]. Several versions and extensions of Gaussian-based CRFs have also been proposed [4, 5, 6, 7, 8].
These sub-classes of CRFs, however, are specific to Gaussian and binary variable types, and may not be appropriate for multivariate count data or skewed continuous data, for example, which are increasingly seen in big-data settings such as high-throughput genomic sequencing.

In this paper, we seek to (a) formulate a novel subclass of CRFs that have the flexibility to model responses of varied types, (b) address how to specify compatibility functions for such a family of CRFs, and (c) develop a tractable procedure with strong statistical guarantees for learning this class of CRFs from data. We first show that when node-conditional distributions of responses conditioned on other responses and covariates are specified by univariate exponential family distributions, there exists a consistent joint CRF distribution, which necessarily has a specific form: with terms that are tensorial products of functions over the responses and functions over the covariates. This subclass of "exponential family" CRFs can be viewed as a conditional extension of the MRF framework of [9, 10]. As such, this broadens the class of off-the-shelf CRF models to encompass data that follows distributions other than the standard discrete, binary, or Gaussian instances. Given this new family of CRFs, we additionally show that if covariates also follow node-conditional univariate exponential family distributions, then the functions over features in turn are precisely specified by the exponential family sufficient statistics. Thus, our twin results definitively answer for the first time the key model specification question of specifying compatibility or feature functions for a broad family of CRF distributions. We then provide a unified M-estimation procedure, via penalized neighborhood estimation, to learn our family of CRFs from i.i.d.
observations that simultaneously addresses all three sub-tasks of CRF learning: feature selection (where we select a subset of the covariates for any response variable), structure recovery (where we learn the graph structure among the response variables), and parameter learning (where we learn the parameters specifying the CRF distribution). We also present a single theorem that gives statistical guarantees saying that, with high probability, our M-estimator achieves each of these three sub-tasks. Our result can be viewed as an extension of neighborhood selection results for MRFs [11, 12, 13]. Overall, this paper provides a family of CRFs that generalizes many of the sub-classes in the existing literature and broadens the utility and applicability of CRFs to model many other types of multivariate responses.

2 Conditional Graphical Models via Exponential Families

Suppose we have a p-variate random response vector $Y = (Y_1, \ldots, Y_p)$, with each response variable $Y_s$ taking values in a set $\mathcal{Y}_s$. Suppose we also have a set of covariates $X = (X_1, \ldots, X_q)$ associated with this response vector $Y$. Suppose $G = (V, E)$ is an undirected graph over $p$ nodes corresponding to the $p$ response variables. Given the underlying graph $G$, and the set of cliques (fully-connected sub-graphs) $C$ of the graph $G$, the corresponding conditional random field (CRF) is a set of distributions over the response conditioned on the covariates that satisfy Markov independence assumptions with respect to the graph $G$. Specifically, letting $\{\phi_c(Y_c, X)\}_{c \in C}$ denote a set of clique-wise sufficient statistics, any strictly positive distribution of $Y$ conditioned on $X$ within the conditional random field family takes the form: $P(Y|X) \propto \exp\{\sum_{c \in C} \phi_c(Y_c, X)\}$. With a pair-wise conditional random field distribution, the set of cliques consists of the set of nodes $V$ and the set of edges $E$, so that

$$P(Y|X) \propto \exp\Big\{ \sum_{s \in V} \phi_s(Y_s, X) + \sum_{(s,t) \in E} \phi_{st}(Y_s, Y_t, X) \Big\}.$$

A key model specification question is how to select the class of sufficient statistics, $\phi$. We have a considerable understanding of how to specify univariate distributions over various types of variables, as well as of how to model their conditional response through regression. Consider the univariate exponential family class of distributions: $P(Z) = \exp(\theta B(Z) + C(Z) - D(\theta))$, with sufficient statistics $B(Z)$, base measure $C(Z)$, and log-normalization constant $D(\theta)$. Such exponential family distributions include a wide variety of commonly used distributions, such as Gaussian, Bernoulli, multinomial, Poisson, exponential, gamma, chi-squared, and beta, any of which can be instantiated with particular choices of the functions $B(\cdot)$ and $C(\cdot)$. Such univariate exponential family distributions are thus used to model a wide variety of data types, including skewed continuous data and count data. Additionally, through generalized linear models, they are used to model the response of various data types conditional on a set of covariates. Here, we seek to use our understanding of univariate exponential families and generalized linear models to specify a conditional graphical model distribution. Consider the conditional extension of the construction in [14, 9, 10].
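To make the univariate exponential family form above concrete: the Poisson distribution with rate $\lambda$ fits it with canonical parameter $\theta = \log \lambda$, sufficient statistic $B(z) = z$, base measure $C(z) = -\log z!$, and log-normalizer $D(\theta) = \exp(\theta)$. A minimal numerical check (Python is used purely for illustration; it is not part of the original development):

```python
import math

def poisson_logpmf_expfam(z, theta):
    """Univariate exponential family form: theta*B(z) + C(z) - D(theta),
    with B(z) = z, C(z) = -log z!, and D(theta) = exp(theta)."""
    return theta * z - math.lgamma(z + 1) - math.exp(theta)

def poisson_logpmf_direct(z, lam):
    """Standard Poisson log-pmf: z*log(lam) - lam - log z!."""
    return z * math.log(lam) - lam - math.lgamma(z + 1)

lam = 3.5
theta = math.log(lam)  # canonical parameter
for z in range(10):
    assert abs(poisson_logpmf_expfam(z, theta) - poisson_logpmf_direct(z, lam)) < 1e-12
```

The same decomposition holds for the other listed families (Gaussian, Bernoulli, exponential, and so on) with their own choices of $B(\cdot)$, $C(\cdot)$, and $D(\cdot)$.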
Suppose that the node-conditional distributions of response variables, $Y_s$, conditioned on the rest of the response variables, $Y_{V \setminus s}$, and the covariates, $X$, are given by a univariate exponential family:

$$P(Y_s | Y_{V \setminus s}, X) = \exp\{ E_s(Y_{V \setminus s}, X)\, B_s(Y_s) + C_s(Y_s) - \bar{D}_s(Y_{V \setminus s}, X) \}. \qquad (1)$$

Here, the functions $B_s(\cdot)$, $C_s(\cdot)$ are specified by the choice of the exponential family, and the parameter $E_s(Y_{V \setminus s}, X)$ is an arbitrary function of the variables $Y_t$ in $N(s)$ and the covariates $X$; $N(s)$ is the set of neighbors of node $s$ according to an undirected graph $G = (V, E)$. Would these node-conditional distributions be consistent with a joint distribution? Would this joint distribution factor according to a conditional random field given by graph $G$? And would there be restrictions on the form of the functions $E_s(Y_{V \setminus s}, X)$? The following theorem answers these questions. We note that it generalizes the MRF framework of [9, 10] in two ways: it allows for the presence of conditional covariates, and moreover allows for heterogeneous types and domains of distributions with the different choices of $B_s(\cdot)$ and $C_s(\cdot)$ at each individual node.

Theorem 1. Consider a p-dimensional random vector $Y = (Y_1, Y_2, \ldots, Y_p)$ denoting the set of responses, and let $X = (X_1, \ldots, X_q)$ be a q-dimensional covariate vector. Consider the following two assertions: (a) the node-conditional distributions $P(Y_s | Y_{V \setminus s}, X)$ are specified by univariate exponential family distributions as detailed in (1); and (b) the joint multivariate conditional distribution $P(Y|X)$ factors according to the graph $G = (V, E)$ with clique-set $C$, but with factors over response-variable-cliques of size at most $k$. These assertions on the conditional and joint distributions respectively are consistent if and only if the conditional distribution in (1) has the tensor-factorized form:

$$P(Y_s | Y_{V \setminus s}, X; \theta) = \exp\Big\{ B_s(Y_s) \Big( \theta_s(X) + \sum_{t \in N(s)} \theta_{st}(X)\, B_t(Y_t) + \ldots + \sum_{t_2, \ldots, t_k \in N(s)} \theta_{s\, t_2 \ldots t_k}(X) \prod_{j=2}^{k} B_{t_j}(Y_{t_j}) \Big) + C_s(Y_s) - \bar{D}_s(Y_{V \setminus s}, X) \Big\}, \qquad (2)$$

where $\theta_{s \cdot}(X) := \{\theta_s(X), \theta_{st}(X), \ldots, \theta_{s\, t_2 \ldots t_k}(X)\}$ is a set of functions that depend only on the covariates $X$. Moreover, the corresponding joint conditional random field distribution has the form:

$$P(Y|X; \theta) = \exp\Big\{ \sum_{s} \theta_s(X) B_s(Y_s) + \sum_{s \in V} \sum_{t \in N(s)} \theta_{st}(X)\, B_s(Y_s) B_t(Y_t) + \ldots + \sum_{(t_1, \ldots, t_k) \in C} \theta_{t_1 \ldots t_k}(X) \prod_{j=1}^{k} B_{t_j}(Y_{t_j}) + \sum_{s} C_s(Y_s) - A_\theta(X) \Big\}, \qquad (3)$$

where $A_\theta(X)$ is the log-normalization constant.

Theorem 1 specifies the form of the function $E_s(Y_{V \setminus s}, X)$ defining the canonical parameter in the univariate exponential family distribution (1). This function is a tensor factorization of products of sufficient statistics of $Y_{V \setminus s}$ and "observation functions," $\theta(X)$, of the covariates $X$ alone. A key point to note is that the observation functions, $\theta(X)$, in the CRF distribution (3) should ensure that the density is normalizable, that is, $A_\theta(X) < +\infty$. We also note that we can allow different exponential families for each of the node-conditional distributions of the response variables, meaning that the domains, $\mathcal{Y}_s$, or the sufficient statistics functions, $B_s(\cdot)$, can vary across the response variables $Y_s$. A common setting of these sufficient statistics functions, however, for many popular distributions (Gaussian, Bernoulli, etc.), is a linear function, so that $B_s(Y_s) = Y_s$.

An important special case of the above result is when the joint CRF has response-variable-clique factors of size at most two.
The node conditional distributions (2) would then have the form:

$$P(Y_s | Y_{V \setminus s}, X; \theta) \propto \exp\Big\{ B_s(Y_s) \cdot \Big( \theta_s(X) + \sum_{t \in N(s)} \theta_{st}(X)\, B_t(Y_t) \Big) + C_s(Y_s) \Big\},$$

while the joint distribution in (3) has the form:

$$P(Y|X; \theta) = \exp\Big\{ \sum_{s \in V} \theta_s(X) B_s(Y_s) + \sum_{(s,t) \in E} \theta_{st}(X)\, B_s(Y_s)\, B_t(Y_t) + \sum_{s \in V} C_s(Y_s) - A_\theta(X) \Big\}, \qquad (4)$$

with the log-partition function, $A_\theta(X)$, given the covariates, $X$, defined as

$$A_\theta(X) := \log \int_{\mathcal{Y}^p} \exp\Big\{ \sum_{s \in V} \theta_s(X) B_s(Y_s) + \sum_{(s,t) \in E} \theta_{st}(X)\, B_s(Y_s)\, B_t(Y_t) + \sum_{s \in V} C_s(Y_s) \Big\}. \qquad (5)$$

Theorem 1 then addresses the model specification question of how to select the compatibility functions in CRFs for varied types of responses. Our framework permits arbitrary observation functions, $\theta(X)$, with the only stipulation that the log-partition function must be finite. (This only provides a restriction when the domain of the response variables is not finite.) In the next section, we address the second model specification question of how to set the covariate functions.

2.1 Setting Covariate Functions

A candidate approach to specifying the observation functions, $\theta(X)$, in the CRF distribution above would be to make distributional assumptions on $X$. Since Theorem 1 specifies the conditional distribution $P(Y|X)$, specifying the marginal distribution $P(X)$ would allow us to specify the joint distribution $P(Y, X)$ without further restrictions on $P(Y|X)$, using the simple product rule: $P(X, Y) = P(Y|X)\, P(X)$.
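Before turning to the covariate functions, the pair-wise family (4)–(5) can be checked numerically in small cases. A minimal sketch, assuming binary responses with $B_s(y) = y$ and $C_s(y) = 0$ (an Ising-like instance), so that the log-partition function is computable by brute-force enumeration; all parameter values are illustrative:

```python
import itertools
import math

def log_unnorm(y, theta_node, theta_edge):
    """Inner term of (4): sum_s theta_s(X) B_s(y_s) + sum_(s,t) theta_st(X) B_s(y_s) B_t(y_t),
    with B_s(y) = y and C_s(y) = 0 for binary y."""
    val = sum(theta_node[s] * y[s] for s in range(len(y)))
    val += sum(w * y[s] * y[t] for (s, t), w in theta_edge.items())
    return val

def log_partition(p, theta_node, theta_edge):
    """A_theta(X) in (5): for finite domains the integral is a sum,
    here over all 2^p binary configurations."""
    return math.log(sum(math.exp(log_unnorm(y, theta_node, theta_edge))
                        for y in itertools.product((0, 1), repeat=p)))

p = 3
theta_node = [0.2, -0.5, 0.1]              # theta_s(X), already evaluated at some X
theta_edge = {(0, 1): 0.8, (1, 2): -0.3}   # theta_st(X) on the edges of G
A = log_partition(p, theta_node, theta_edge)

# sanity check: the normalized probabilities sum to one
total = sum(math.exp(log_unnorm(y, theta_node, theta_edge) - A)
            for y in itertools.product((0, 1), repeat=p))
assert abs(total - 1.0) < 1e-12
```

For infinite domains (e.g., Poisson responses) the sum becomes an infinite series or integral, which is exactly why finiteness of $A_\theta(X)$ must be imposed as a constraint.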
As an example, suppose that the covariates $X$ follow an MRF distribution with graph $G' = (V', E')$ and parameters $\vartheta$:

$$P(X) = \exp\Big\{ \sum_{u \in V'} \vartheta_u \phi_u(X_u) + \sum_{(u,v) \in V' \times V'} \vartheta_{uv} \phi_{uv}(X_u, X_v) - A(\vartheta) \Big\}.$$

Then, for any CRF distribution $P(Y|X)$ in (4), we have

$$P(X, Y) = \exp\Big\{ \sum_{u} \vartheta_u \phi_u(X_u) + \sum_{(u,v)} \vartheta_{uv} \phi_{uv}(X_u, X_v) + \sum_{s} \theta_s(X) Y_s + \sum_{(s,t)} \theta_{st}(X) Y_s Y_t + \sum_{s} C_s(Y_s) - A(\vartheta) - A_\theta(X) \Big\}.$$

The joint distribution, $P(X, Y)$, is valid provided $P(Y|X)$ and $P(X)$ are valid distributions. Thus, a distributional assumption on $P(X)$ does not restrict the set of covariate functions in any way. On the other hand, specifying the conditional distribution, $P(X|Y)$, naturally entails restrictions on the form of $P(Y|X)$. Consider the case where the conditional distributions $P(X_u | X_{V' \setminus u}, Y)$ are also specified by univariate exponential families:

$$P(X_u | X_{V' \setminus u}, Y) = \exp\{ E_u(X_{V' \setminus u}, Y)\, B_u(X_u) + C_u(X_u) - \bar{D}_u(X_{V' \setminus u}, Y) \}, \qquad (6)$$

where $E_u(X_{V' \setminus u}, Y)$ is an arbitrary function of the rest of the variables, and $B_u(\cdot)$, $C_u(\cdot)$, $\bar{D}_u(\cdot)$ are specified by the univariate exponential family. Under these additional distributional assumptions in (6), what form would the CRF distribution in Theorem 1 take? Specifically, what would be the form of the observation functions $\theta(X)$? The following theorem provides an answer to this question. (In the following, we use the shorthand $s_1^m$ to denote the sequence $(s_1, \ldots, s_m)$.)

Theorem 2. Consider the following assertions: (a) the conditional CRF distribution of the responses $Y = (Y_1, \ldots, Y_p)$ given covariates $X = (X_1, \ldots, X_q)$ is given by the family (4); (b) the conditional distributions of individual covariates given the rest of the variables, $P(X_u | X_{V' \setminus u}, Y)$, are given by an exponential family of the form in (6); and (c) the joint distribution $P(X, Y)$ belongs to a graphical model with graph $\bar{G} = (V \cup V', \bar{E})$, with clique-set $C$, with factors of size at most $k$. These assertions are consistent if and only if the CRF distribution takes the form:

$$P(Y|X) = \exp\Big\{ \sum_{l=1}^{k} \sum_{\substack{t_1^r \in V,\; s_1^{l-r} \in V' \\ (t_1^r, s_1^{l-r}) \in C}} \alpha_{t_1^r, s_1^{l-r}} \prod_{j=1}^{l-r} B_{s_j}(X_{s_j}) \prod_{j=1}^{r} B_{t_j}(Y_{t_j}) + \sum_{t \in V} C_t(Y_t) - A(\alpha, X) \Big\}, \qquad (7)$$

so that the observation functions $\theta_{t_1, \ldots, t_r}(X)$ in the CRF distribution (4) are tensor products of univariate functions:

$$\theta_{t_1, \ldots, t_r}(X) = \sum_{l=1}^{k} \sum_{\substack{s_1^{l-r} \in V' \\ (t_1^r, s_1^{l-r}) \in C}} \alpha_{t_1^r, s_1^{l-r}} \prod_{j=1}^{l-r} B_{s_j}(X_{s_j}).$$

Let us examine the consequences of this theorem for the pair-wise CRF distributions (4). Theorem 2 then entails that the observation functions, $\{\theta_s(X), \theta_{st}(X)\}$, have the following form when the distribution has factors of size at most two:

$$\theta_s(X) = \theta_s + \sum_{u \in V'} \theta_{su} B_u(X_u), \qquad \theta_{st}(X) = \theta_{st}, \qquad (8)$$

for some constant parameters $\theta_s$, $\theta_{su}$, and $\theta_{st}$. Similarly, if the joint distribution has factors of size at most three, we have:

$$\theta_s(X) = \theta_s + \sum_{u \in V'} \theta_{su} B_u(X_u) + \sum_{(u,v) \in V' \times V'} \theta_{suv} B_u(X_u) B_v(X_v),$$
$$\theta_{st}(X) = \theta_{st} + \sum_{u \in V'} \theta_{stu} B_u(X_u). \qquad (9)$$

(Remark 1) While we have derived the covariate functions in Theorem 2 by assuming a distributional form on $X$, using the resulting covariate functions does not necessarily impose distributional assumptions on $X$.
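A small sketch contrasting the two parameterizations (8) and (9), assuming linear sufficient statistics $B_u(x) = x$; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
q = 5
x1 = rng.normal(size=q)  # two different covariate vectors X
x2 = rng.normal(size=q)

theta_s = 0.4
theta_su = rng.normal(size=q)
theta_st = -0.7
theta_stu = rng.normal(size=q)

def theta_s_of_x(x):
    """Node term, common to (8) and (9) up to second order:
    theta_s(X) = theta_s + sum_u theta_su * B_u(X_u), with B_u(x) = x."""
    return theta_s + theta_su @ x

def theta_st_mean_specified(x):
    """Under (8), the edge weight theta_st(X) is a constant in X."""
    return theta_st

def theta_st_cov_specified(x):
    """Under (9), theta_st(X) = theta_st + sum_u theta_stu * B_u(X_u)."""
    return theta_st + theta_stu @ x

# Under (8) the edge weights (and hence the conditional graph) are identical
# for every X; under (9) the edge weight varies with the covariates.
assert theta_st_mean_specified(x1) == theta_st_mean_specified(x2) == theta_st
assert abs(theta_st_cov_specified(x1) - theta_st_cov_specified(x2)) > 0
```

This is exactly the distinction drawn in Remark 2: sparsity in the constant $\theta_{st}$ governs the response graph, while sparsity in $\theta_{stu}$ governs third-order response-response-covariate interactions.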
This is similar to "generative-discriminative" pairs of models [15]: a "generative" naive Bayes distribution for $P(X|Y)$ corresponds to a "discriminative" logistic regression model for $P(Y|X)$, but the converse need not hold. We can thus leverage the parametric CRF distributional form in Theorem 2 without necessarily imposing stringent distributional assumptions on $X$.

(Remark 2) Consider the form of the covariate functions given by (8) compared to (9). What does sparsity in the parameters entail in terms of conditional independence assumptions? $\theta_{st} = 0$ in (8) entails that $Y_s$ is conditionally independent of $Y_t$ given the other responses and all the covariates. Thus, the parametrization in (8) corresponds to pair-wise conditional independence assumptions between the responses (structure learning) and between the responses and covariates (feature selection). In contrast, (9) lets the edge weights between the responses, $\theta_{st}(X)$, vary as a linear combination of the covariates. Letting $\theta_{stu} = 0$ entails the lack of a third-order interaction between the pair of responses $Y_s$ and $Y_t$ and the covariate $X_u$, conditioned on all other responses and covariates.

(Remark 3) Our general subclasses of CRFs specified by Theorems 1 and 2 encompass many existing CRF families as special cases, in addition to providing many novel forms of CRFs.

• The Gaussian CRF presented in [7], as well as the reparameterization in [8], can be viewed as an instance of our framework by substituting Gaussian sufficient statistics in (8): here the Gaussian mean of the CRF depends on the covariates, but not the covariance. We can correspondingly derive a novel Gaussian CRF formulation from (9), where the Gaussian covariance of $Y|X$ would also depend on $X$.

• By using the Bernoulli distribution as the node-conditional distribution, we can derive the Ising CRF, recently studied in [2] with an application to studying tumor suppressor genes.

• Several novel forms of CRFs can be derived by specifying node-conditional distributions as Poisson or exponential, for example. With certain distributions, such as the multivariate Poisson, we would have to enforce constraints on the parameters to ensure normalizability of the distribution. For the Poisson CRF distribution, it can be verified that for the log-partition function to be finite, $A_\theta(X) < \infty$, the observation functions are constrained to be non-positive, $\theta_{st}(X) \le 0$. Such restrictions are typically needed for cases where the variables have infinite domains.

3 Graphical Model Structure Learning

We now address the task of learning a CRF distribution from our general family given i.i.d. observations of the multivariate response vector and covariates. Structure recovery and estimation for CRFs has not attracted as much attention as that for MRFs. Schmidt et al. [16] and Torralba et al. [17] empirically study greedy methods and block $\ell_1$-regularized pseudo-likelihood, respectively, to learn the discrete CRF graph structure. Bradley and Guestrin [18] and Shahaf et al. [19] provide guarantees on structure recovery for low tree-width discrete CRFs using graph cuts and a maximum weight spanning tree based method, respectively. Cai et al. [4] and Liu et al. [6] provide structure recovery guarantees for their two-stage procedures for recovering (a reparameterization of) a conditional Gaussian based CRF and the semi-parametric partition-based Gaussian CRF, respectively.
Here, we provide a single theorem that gives structure recovery guarantees for any CRF from our class of exponential family CRFs, which encompasses not only Ising and Gaussian based CRFs, but all other instances within our class, such as Poisson CRFs, exponential CRFs, and so on.

We are given $n$ i.i.d. samples $Z := \{X^{(i)}, Y^{(i)}\}_{i=1}^{n}$ from a pair-wise CRF distribution, of the form specified by Theorems 1 and 2 with covariate functions as given in (8):

$$P(Y|X; \theta^*) \propto \exp\Big\{ \sum_{s \in V} \Big( \theta^*_s + \sum_{u \in N_0(s)} \theta^*_{su} B_u(X_u) \Big) B_s(Y_s) + \sum_{(s,t) \in E} \theta^*_{st}\, B_s(Y_s)\, B_t(Y_t) + \sum_{s} C_s(Y_s) \Big\}, \qquad (10)$$

with unknown parameters $\theta^*$. The task of CRF parameter learning corresponds to estimating the parameters $\theta^*$, structure learning corresponds to recovering the edge-set $E$, and feature selection corresponds to recovering the neighborhoods $N_0(s)$ in (10). Note that the log-partition function $A(\theta^*)$ is intractable to compute in general (other than in special cases such as Gaussian CRFs). Accordingly, we adopt the node-based neighborhood estimation approach of [12, 13, 9, 10]. Given the joint distribution in (10), the node-wise conditional distribution of $Y_s$ given the rest of the nodes and covariates is given by $P(Y_s | Y_{V \setminus s}, X; \theta^*) = \exp\{ \eta \cdot B_s(Y_s) + C_s(Y_s) - D_s(\eta) \}$, which is a univariate exponential family with parameter $\eta = \theta^*_s + \sum_{u \in V'} \theta^*_{su} B_u(X_u) + \sum_{t \in V \setminus s} \theta^*_{st} B_t(Y_t)$, as discussed in the previous section. The corresponding negative log-conditional-likelihood can be written as $\ell(\theta; Z) := -\frac{1}{n} \log \prod_{i=1}^{n} P(Y_s^{(i)} | Y_{V \setminus s}^{(i)}, X^{(i)}; \theta)$.

For each node $s$, we have three components of the parameter set, $\theta := (\theta_s, \theta_x, \theta_y)$: a scalar $\theta_s$, a length-$q$ vector $\theta_x := \bigcup_{u \in V'} \theta_{su}$, and a length-$(p-1)$ vector $\theta_y := \bigcup_{t \in V \setminus s} \theta_{st}$. Then, given samples $Z$, these parameters can be selected by the following $\ell_1$-regularized M-estimator:

$$\min_{\theta \in \mathbb{R}^{1 + (p-1) + q}} \; \ell(\theta; Z) + \lambda_{x,n} \|\theta_x\|_1 + \lambda_{y,n} \|\theta_y\|_1, \qquad (11)$$

where $\lambda_{x,n}$, $\lambda_{y,n}$ are the regularization constants. Note that $\lambda_{x,n}$ and $\lambda_{y,n}$ do not need to be the same, as $\lambda_{y,n}$ determines the degree of sparsity between $Y_s$ and $Y_{V \setminus s}$, and similarly $\lambda_{x,n}$ determines the degree of sparsity between $Y_s$ and the covariates $X$. Given this M-estimator, we can recover the response-variable-neighborhood of response $Y_s$ as $\widehat{N}(s) = \{ t \in V \setminus s \,|\, \widehat{\theta}^{y}_{st} \neq 0 \}$, and the feature-neighborhood of the response $Y_s$ as $\widehat{N}_0(s) = \{ u \in V' \,|\, \widehat{\theta}^{x}_{su} \neq 0 \}$.

Armed with this machinery, we can provide statistical guarantees on successful learning of all three sub-tasks of CRFs:

Theorem 3. Consider a CRF distribution as specified in (10). Suppose that the regularization parameters in (11) are chosen such that

$$\lambda_{x,n} \ge M_1 \sqrt{\frac{\log q}{n}}, \qquad \lambda_{y,n} \ge M_1 \sqrt{\frac{\log p}{n}}, \qquad \text{and} \qquad \max\{\lambda_{x,n}, \lambda_{y,n}\} \le M_2,$$

where $M_1$ and $M_2$ are constants depending on the node-conditional distribution in the form of the exponential family. Further suppose that $\min_{t \in N(s)} |\theta^*_{st}| \ge \frac{10}{\rho_{\min}} \max\{ \sqrt{d_x}\, \lambda_{x,n}, \sqrt{d_y}\, \lambda_{y,n} \}$, where $\rho_{\min}$ is the minimum eigenvalue of the Hessian of the loss function at $(\theta^{x*}, \theta^{y*})$, and $d_x$, $d_y$ are the number of nonzero elements in $\theta^{x*}$ and $\theta^{y*}$, respectively.
Then, for some positive constants $L$, $c_1$, $c_2$, and $c_3$, if $n \ge L (d_x + d_y)^2 (\log p + \log q)(\max\{\log n, \log(p+q)\})^2$, then with probability at least $1 - c_1 \max\{n, p+q\}^{-2} - \exp(-c_2 n) - \exp(-c_3 n)$, the following statements hold.

(a) (Parameter Error) For each node $s \in V$, the solution $\widehat{\theta}$ of the M-estimation problem in (11) is unique, with parameter error bound

$$\|\widehat{\theta}^{x} - \theta^{x*}\|_2 + \|\widehat{\theta}^{y} - \theta^{y*}\|_2 \le \frac{5}{\rho_{\min}} \max\{ \sqrt{d_x}\, \lambda_{x,n}, \sqrt{d_y}\, \lambda_{y,n} \}.$$

(b) (Feature Selection) The M-estimate recovers the response-feature neighborhoods exactly, so that $\widehat{N}_0(s) = N_0(s)$, for all $s \in V$.

(c) (Structure Recovery) The M-estimate recovers the true response neighborhoods exactly, so that $\widehat{N}(s) = N(s)$, for all $s \in V$.

The proof requires modifying that of Theorem 1 in [9, 10] to allow for two different regularization parameters, $\lambda_{x,n}$ and $\lambda_{y,n}$, and for two distinct sets of random variables (responses and covariates). This introduces subtleties related to interactions in the analyses. Extending our statistical analysis in Theorem 3 for pair-wise CRFs to general CRF distributions (3), as well as to general covariate functions such as in (9), is omitted for space reasons and left for future work.

Figure 1: [Six ROC panels, true positive rate vs. false positive rate.] (a) ROC curves averaged over 50 simulations from a Gaussian CRF with p = 50 responses, q = 50 covariates, and (left) n = 100 and (right) n = 250 samples. Our method (G-CRF) is compared to that of [7] (cGGM) and [8] (pGGM). (b) ROC curves for simulations from an Ising CRF with p = 100 responses, q = 10 covariates, and (left) n = 50 and (right) n = 150 samples. Our method (I-CRF) is compared to the unconditional Ising MRF (I-MRF). (c) ROC curves for simulations from a Poisson CRF with p = 100 responses, q = 10 covariates, and (left) n = 50 and (right) n = 150 samples. Our method (P-CRF) is compared to the Poisson MRF (P-MRF).

4 Experiments

Simulation Studies. In order to evaluate the generality of our framework, we simulate data from three different instances of our model: those given by Gaussian, Bernoulli (Ising), and Poisson node-conditional distributions. We assume the true conditional distribution, $P(Y|X)$, follows (7) with the parameters $\theta_s(X) = \theta_s + \sum_{u \in V'} \theta_{su} X_u$ and $\theta_{st}(X) = \theta_{st} + \sum_{u \in V'} \theta_{stu} X_u$ for some constant parameters $\theta_s$, $\theta_{su}$, $\theta_{st}$, and $\theta_{stu}$. In other words, we permit both the mean, $\theta_s(X)$, and the covariance or edge-weights, $\theta_{st}(X)$, to depend on the covariates.

For the Gaussian CRFs, our goal is to infer the precision (or inverse covariance) matrix. We first generate covariates as $X \sim U[-0.05, 0.05]$. Given $X$, the precision matrix of $Y$, $\Theta(X)$, is generated as follows. All the diagonal elements are set to 1. For each node $s$, the 4 nearest neighbors in the $\sqrt{p} \times \sqrt{p}$ lattice structure are selected, and $\theta_{st} = 0$ for non-neighboring nodes.
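This lattice construction, together with edge weights that depend linearly on the covariates, $\theta_{st}(X) = c + \langle \omega_{st}, X \rangle$, can be sketched as follows; the particular constants are illustrative, not those used in the experiments:

```python
import numpy as np

def lattice_edges(side):
    """Edges of the 4-nearest-neighbor grid graph on a side x side lattice,
    with nodes indexed 0..side*side-1 in row-major order."""
    edges = []
    for r in range(side):
        for c in range(side):
            s = r * side + c
            if c + 1 < side:
                edges.append((s, s + 1))      # right neighbor
            if r + 1 < side:
                edges.append((s, s + side))   # bottom neighbor
    return edges

side = 7                   # p = 49 responses
edges = lattice_edges(side)

rng = np.random.default_rng(0)
q = 5
X = rng.uniform(-0.05, 0.05, size=q)   # covariates, as in the simulation setup

bias = 0.25                              # constant bias term c (illustrative)
omega = {e: rng.normal(size=q) for e in edges}          # target vectors omega_st
theta = {e: bias + omega[e] @ X for e in edges}         # theta_st(X) on lattice edges

# interior lattice nodes have exactly 4 neighbors; corners have 2
deg = [0] * (side * side)
for s, t in edges:
    deg[s] += 1
    deg[t] += 1
assert deg[side + 1] == 4 and min(deg) == 2
```

Non-neighboring pairs simply receive no entry in `theta`, matching the restriction $\theta_{st} = 0$ off the lattice.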
For a given edge structure, the strength is now a function of the covariates, $X$: $\theta_{st}(X) = c + \langle \omega_{st}, X \rangle$, where $c$ is a constant bias term and $\omega_{st}$ is a target vector of length $q$. Data of size p = 50 responses and q = 50 covariates was generated for n = 100 and n = 250 samples. Figure 1(a) reports the receiver operating characteristic (ROC) curves averaged over 50 trials for three different methods: the model of [7] (denoted cGGM), the model of [8] (denoted pGGM), and our method (denoted G-CRF). Results show that our method outperforms competing methods, as their edge-weights are restricted to be constants, while our method allows them to depend linearly on the covariates. Data was similarly generated using a 4-nearest-neighbor lattice structure for Ising and Poisson CRFs with p = 100 responses, q = 10 covariates, and n = 50 or n = 150 samples. Figure 1(b) and Figure 1(c) report the ROC curves averaged over 50 trials for the Ising and Poisson CRFs, respectively. The performance of our method is compared to that of the unconditional Ising and Poisson MRFs of [9, 10].

Figure 2: From left to right: Gaussian MRF, mean-specified Gaussian CRF, and the set corresponding to the covariance-specified Gaussian CRF. The latter shows the third-order interactions between gene-pairs and each of the five common aberration covariates (EGFR, PTEN, CDKN2A, PDGFRA, and CDK4). The models were learned from gene expression array data of Glioblastoma samples, and the plots display the response neighborhoods of gene TWIST1.

Real Data Example: Genetic Networks of Glioblastoma. We demonstrate the performance of our CRF models by learning genetic networks of Glioblastoma conditioned on common copy number aberrations. Level III gene expression data measured by Agilent arrays for n = 465 Glioblastoma tumor samples, as well as copy number variation measured by CGH-arrays, were downloaded from the Cancer Genome Atlas data portal [20].
The data was processed according to standard techniques, and we only consider genes from the C2 Pathway Database. The five most common copy number aberrations across all subjects were taken as covariates. We fit our Gaussian "mean-specified" CRFs (with covariate functions given in (8)) and Gaussian "covariance-specified" CRFs (with covariate functions given in (9)) by penalized neighborhood estimation to learn the graph structure of gene expression responses, p = 876, conditional on q = 5 aberrations: EGFR, PTEN, CDKN2A, PDGFRA, and CDK4. Stability selection [21] was used to determine the sparsity of the network.

Due to space limitations, the entire network structures are not shown. Instead, we show the results of the mean- and covariance-specified Gaussian CRFs and that of the Gaussian graphical model (GGM) for one particularly important gene neighborhood: TWIST1 is a transcription factor for epithelial-to-mesenchymal transition [22] and has been shown to promote tumor invasion in multiple cancers including Glioblastoma [23]. The neighborhoods of TWIST1 learned by GGMs and mean-specified CRFs share many of the known interactors of TWIST1, such as SNAI2, MGP, and PMAIP1 [24]. The mean-specified CRF is sparser, as conditioning on copy number aberrations may explain many of the conditional dependencies with TWIST1 that are captured by GGMs, demonstrating the utility of conditional modeling via CRFs. For the covariance-specified Gaussian CRF, we plot the neighborhood given by θ_{stu} in (9) for the five values of u corresponding to each aberration. The results of this network denote third-order effects between gene-pairs and aberrations, and are thus even sparser, with no neighbors for the interactions between TWIST1 and PTEN, CDK4, and EGFR. TWIST1 has differing interactions with PDGFRA and CDKN2A, which occur at high frequency in proneural subtypes of Glioblastoma tumors.
Thus, our covariance-specified CRF network may indicate that these two aberrations are the most salient in interacting with pairs of genes that include the gene TWIST1. Overall, our analysis has demonstrated the applied advantages of our CRF models; namely, one can study the network structure between responses conditional on covariates and/or between pairs of responses that interact with particular covariates.

Acknowledgments
The authors acknowledge support from the following sources: ARO via W911NF-12-1-0390 and NSF via IIS-1149803 and DMS-1264033 to E.Y. and P.R.; Ken Kennedy Institute for Information Technology at Rice to G.A. and Z.L.; NSF DMS-1264058 and DMS-1209017 to G.A.; and NSF DMS-1263932 to Z.L.

References
[1] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, December 2008.
[2] J. Cheng, E. Levina, P. Wang, and J. Zhu. Sparse Ising models with covariates. arXiv preprint arXiv:1209.6419, 2012.
[3] S. Ding, G. Wahba, and X. Zhu. Learning higher-order graph structure with features by structure penalty. In NIPS, 2011.
[4] T. Cai, H. Li, W. Liu, and J. Xie. Covariate adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 2011.
[5] S. Kim and E. P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 2009.
[6] H. Liu, X. Chen, J. Lafferty, and L. Wasserman. Graph-valued regression. In NIPS, 2010.
[7] J. Yin and H. Li. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Annals of Applied Statistics, 5(4):2630–2650, 2011.
[8] X. Yuan and T. Zhang. Partial Gaussian graphical model estimation. arXiv preprint arXiv:1209.6419, 2012.
[9] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via generalized linear models.
In Neur. Info. Proc. Sys., 25, 2012.
[10] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. On graphical models via univariate exponential family distributions. arXiv preprint arXiv:1301.4183, 2013.
[11] A. Jalali, P. Ravikumar, V. Vasuki, and S. Sanghavi. On learning discrete graphical models using group-sparse regularization. In Inter. Conf. on AI and Statistics (AISTATS), 14, 2011.
[12] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[13] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 38(3):1287–1319, 2010.
[14] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B (Methodological), 36(2):192–236, 1974.
[15] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Neur. Info. Proc. Sys., 2002.
[16] M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion abnormality detection. In Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[17] A. Torralba, K. P. Murphy, and W. T. Freeman. Contextual models for object detection using boosted random fields. In NIPS, 2004.
[18] J. K. Bradley and C. Guestrin. Learning tree conditional random fields. In ICML, 2010.
[19] D. Shahaf, A. Chechetka, and C. Guestrin. Learning thin junction trees via graph cuts. In AISTATS, 2009.
[20] Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061–1068, October 2008.
[21] H. Liu, K. Roeder, and L. Wasserman.
Stability approach to regularization selection (StARS) for high dimensional graphical models. arXiv preprint arXiv:1006.3316, 2010.
[22] J. Yang, S. A. Mani, J. L. Donaher, S. Ramaswamy, R. A. Itzykson, C. Come, P. Savagner, I. Gitelman, A. Richardson, and R. A. Weinberg. Twist, a master regulator of morphogenesis, plays an essential role in tumor metastasis. Cell, 117(7):927–939, 2004.
[23] S. A. Mikheeva, A. M. Mikheev, A. Petit, R. Beyer, R. G. Oxford, L. Khorasani, J.-P. Maxwell, C. A. Glackin, H. Wakimoto, I. González-Herrero, et al. TWIST1 promotes invasion through mesenchymal change in human glioblastoma. Mol Cancer, 9:194, 2010.
[24] M. A. Smit, T. R. Geiger, J.-Y. Song, I. Gitelman, and D. S. Peeper. A Twist-Snail axis critical for TrkB-induced epithelial-mesenchymal transition-like transformation, anoikis resistance, and metastasis. Molecular and Cellular Biology, 29(13):3722–3737, 2009.
[25] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183–2202, May 2009.
[26] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. arXiv preprint arXiv:1010.2731, 2010.