{"title": "Learning Large-Scale Poisson DAG Models based on OverDispersion Scoring", "book": "Advances in Neural Information Processing Systems", "page_first": 631, "page_last": 639, "abstract": "In this paper, we address the question of identifiability and learning algorithms for large-scale Poisson Directed Acyclic Graphical (DAG) models. We define general Poisson DAG models as models where each node is a Poisson random variable with rate parameter depending on the values of the parents in the underlying DAG. First, we prove that Poisson DAG models are identifiable from observational data, and present a polynomial-time algorithm that learns the Poisson DAG model under suitable regularity conditions. The main idea behind our algorithm is based on overdispersion, in that variables that are conditionally Poisson are overdispersed relative to variables that are marginally Poisson. Our algorithm exploits overdispersion along with methods for learning sparse Poisson undirected graphical models for faster computation. We provide both theoretical guarantees and simulation results for both small and large-scale DAGs.", "full_text": "Learning Large-Scale Poisson DAG Models based on OverDispersion Scoring\n\nGunwoong Park\nDepartment of Statistics\nUniversity of Wisconsin-Madison\nMadison, WI 53706\nparkg@stat.wisc.edu\n\nGarvesh Raskutti\nDepartment of Statistics\nDepartment of Computer Science\nWisconsin Institute for Discovery, Optimization Group\nUniversity of Wisconsin-Madison\nMadison, WI 53706\nraskutti@cs.wisc.edu\n\nAbstract\n\nIn this paper, we address the question of identifiability and learning algorithms for large-scale Poisson Directed Acyclic Graphical (DAG) models. We define general Poisson DAG models as models where each node is a Poisson random variable with rate parameter depending on the values of the parents in the underlying DAG. 
First, we prove that Poisson DAG models are identifiable from observational data, and present a polynomial-time algorithm that learns the Poisson DAG model under suitable regularity conditions. The main idea behind our algorithm is based on overdispersion, in that variables that are conditionally Poisson are overdispersed relative to variables that are marginally Poisson. Our algorithm exploits overdispersion along with methods for learning sparse Poisson undirected graphical models for faster computation. We provide both theoretical guarantees and simulation results for both small and large-scale DAGs.\n\n1 Introduction\n\nModeling large-scale multivariate count data is an important challenge that arises in numerous applications such as neuroscience, systems biology and many others. One approach that has received significant attention is the graphical modeling framework since graphical models include a broad class of dependence models for different data types. Broadly speaking, there are two sets of graphical models: (1) undirected graphical models or Markov random fields and (2) directed acyclic graphical (DAG) models or Bayesian networks.\n\nBetween undirected graphical models and DAGs, undirected graphical models have generally received more attention in the large-scale data setting since both learning and inference algorithms scale to larger datasets. In particular, for multivariate count data Yang et al. [1] introduce undirected Poisson graphical models. Yang et al. [1] define undirected Poisson graphical models so that each node is a Poisson random variable with rate parameter depending only on its neighboring nodes in the graph. As pointed out in Yang et al. 
[1], one of the major challenges with Poisson undirected graphical models is ensuring global normalizability.\n\nDirected acyclic graphs (DAGs) or Bayesian networks are a different class of generative models that model directional or causal relationships (see e.g. [2, 3] for details). Such directional relationships naturally arise in most applications but are difficult to model based on observational data. One of the benefits of DAG models is that they have a straightforward factorization into conditional distributions [4], and hence no issues of normalizability arise as they do for undirected graphical models as mentioned earlier. However a number of challenges arise that make learning DAG models often impossible for large datasets even when variables have a natural causal or directional structure. These issues are: (1) identifiability, since inferring causal directions from data is often not possible; (2) computational complexity, since it is often computationally infeasible to search over the space of DAGs [5]; and (3) sample size guarantees, since fundamental identifiability assumptions such as faithfulness often require extremely large sample sizes to be satisfied even when the number of nodes p is small (see e.g. [6]).\n\nIn this paper, we define Poisson DAG models and address these 3 issues. In Section 3 we prove that Poisson DAG models are identifiable and in Section 4 we introduce a polynomial-time DAG learning algorithm for Poisson DAGs which we call OverDispersion Scoring (ODS). The main idea behind proving identifiability is based on the overdispersion of variables that are conditionally Poisson but not marginally Poisson. Using overdispersion, we prove that it is possible to learn the causal ordering of Poisson DAGs using a polynomial-time algorithm, and once the ordering is known, the problem of learning DAGs reduces to a simple set of neighborhood regression problems. 
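The overdispersion phenomenon underlying this idea is easy to reproduce in simulation. The following minimal sketch is our own illustration (not part of the paper's algorithm), with a hypothetical log-link rate function and made-up parameter values:

```python
import math
import random


def sample_poisson(lam, rng):
    # Knuth's method; adequate for the small rates used here.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


rng = random.Random(0)
n = 20000
# X1 ~ Poisson(2); X2 | X1 ~ Poisson(exp(0.5 - 0.4 * X1)), so X2 is
# conditionally (but not marginally) Poisson.
x1 = [sample_poisson(2.0, rng) for _ in range(n)]
x2 = [sample_poisson(math.exp(0.5 - 0.4 * v), rng) for v in x1]


def dispersion(xs):
    # Sample variance minus sample mean: near 0 for a Poisson variable,
    # clearly positive under overdispersion.
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return var - m


print(dispersion(x1))  # near 0: X1 is marginally Poisson
print(dispersion(x2))  # positive: X2 is overdispersed
```

Variance close to the mean flags a marginally Poisson node, while variance exceeding the mean flags a conditionally Poisson one; this gap is exactly the signal the overdispersion scores below test.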
While overdispersion with conditionally Poisson random variables is a well-known phenomenon that is exploited in many applications (see e.g. [7, 8]), to our knowledge overdispersion has never been exploited for DAG model learning.\n\nStatistical guarantees for learning the causal ordering are provided in Section 4.2 and we provide numerical experiments on both small DAGs and large-scale DAGs with up to 5000 nodes. Our theoretical guarantees prove that even in the setting where the number of nodes p is larger than the sample size n, it is possible to learn the causal ordering under the assumption that the so-called moralized graph of the DAG has small degree. Our numerical experiments support our theoretical results and show that our ODS algorithm performs well compared to other state-of-the-art DAG learning methods. Our numerical experiments confirm that our ODS algorithm is one of the few DAG-learning algorithms that performs well in terms of statistical and computational complexity in the high-dimensional p > n setting.\n\n2 Poisson DAG Models\n\nIn this section, we define general Poisson DAG models. A DAG G = (V, E) consists of a set of vertices V and a set of directed edges E with no directed cycle. We usually set V = {1, 2, ..., p} and associate a random vector (X1, X2, ..., Xp) with probability distribution P over the vertices in G. A directed edge from vertex j to k is denoted by (j, k) or j \u2192 k. The set Pa(k) of parents of a vertex k consists of all nodes j such that (j, k) \u2208 E. One of the convenient properties of DAG models is that the joint distribution f(X1, X2, ..., Xp) factorizes in terms of the conditional distributions as follows [4]:\n\nf(X1, X2, ..., Xp) = \u220f_{j=1}^{p} fj(Xj|XPa(j)),\n\nwhere fj(Xj|XPa(j)) refers to the conditional distribution of node Xj in terms of its parents. 
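This factorization makes forward sampling straightforward: draw each node from its conditional Poisson distribution in a topological order. A minimal sketch, assuming the GLM-type rate function used as the running example below (the graph, parameters, and function names here are illustrative, not from the paper):

```python
import math
import random


def sample_poisson(lam, rng):
    # Knuth's method; adequate for small rates.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_poisson_dag(parents, theta0, theta, n, seed=0):
    """Draw n samples from a Poisson DAG with a GLM (log) link.

    parents[j] lists Pa(j); nodes 0..p-1 are assumed already in causal order.
    The rate is gj(x_Pa(j)) = exp(theta0[j] + sum_k theta[(k, j)] * x_k).
    """
    rng = random.Random(seed)
    p = len(parents)
    samples = []
    for _ in range(n):
        x = [0] * p
        for j in range(p):  # causal order: parents are sampled before children
            rate = math.exp(theta0[j] + sum(theta[(k, j)] * x[k] for k in parents[j]))
            x[j] = sample_poisson(rate, rng)
        samples.append(x)
    return samples


# A 3-node DAG 0 -> 2 <- 1 with negative edge weights (keeps rates bounded).
parents = {0: [], 1: [], 2: [0, 1]}
theta0 = [0.7, 0.7, 0.5]
theta = {(0, 2): -0.8, (1, 2): -0.8}
print(sample_poisson_dag(parents, theta0, theta, n=5))  # five (X0, X1, X2) samples
```

Negative edge weights keep every conditional rate bounded above, which is also the regime used in the paper's experiments.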
The basic property of Poisson DAG models is that each conditional distribution fj(xj|xPa(j)) has a Poisson distribution. More precisely, for Poisson DAG models:\n\nXj|X{1,2,...,p}\\{j} \u223c Poisson(gj(XPa(j))),  (1)\n\nwhere gj(.) is an arbitrary function of XPa(j). To take a concrete example, gj(.) can represent the link function for the univariate Poisson generalized linear model (GLM), gj(XPa(j)) = exp(\u03b8j + \u2211_{k\u2208Pa(j)} \u03b8jk Xk), where (\u03b8jk)_{k\u2208Pa(j)} represent the linear weights.\n\nUsing the factorization (1), the overall joint distribution is:\n\nf(X1, X2, ..., Xp) = exp(\u2211_{j\u2208V} \u03b8j Xj + \u2211_{(k,j)\u2208E} \u03b8jk Xk Xj \u2212 \u2211_{j\u2208V} log Xj! \u2212 \u2211_{j\u2208V} e^{\u03b8j + \u2211_{k\u2208Pa(j)} \u03b8jk Xk}).  (2)\n\nTo contrast this formulation with the Poisson undirected graphical model in Yang et al. [1], the joint distribution for undirected graphical models has the form:\n\nf(X1, X2, ..., Xp) = exp(\u2211_{j\u2208V} \u03b8j Xj + \u2211_{(k,j)\u2208E} \u03b8jk Xk Xj \u2212 \u2211_{j\u2208V} log Xj! \u2212 A(\u03b8)),  (3)\n\nwhere A(\u03b8) is the log-partition function or the log of the normalization constant. While the two forms (2) and (3) look quite similar, the key difference is the normalization constant A(\u03b8) in (3) as opposed to the term \u2211_{j\u2208V} e^{\u03b8j + \u2211_{k\u2208Pa(j)} \u03b8jk Xk} in (2), which depends on X. To ensure the undirected graphical model representation in (3) is a valid distribution, A(\u03b8) must be finite, which guarantees the distribution is normalizable, and Yang et al. 
[1] prove that A(\u03b8) is normalizable if and only if all \u03b8 values are less than or equal to 0.\n\n3 Identifiability\n\nIn this section, we prove that Poisson DAG models are identifiable under a very mild condition. In general, DAG models can only be defined up to their Markov equivalence class (see e.g. [3]). However in some cases, it is possible to identify the DAG by exploiting specific properties of the distribution. For example, Peters and B\u00fchlmann prove that Gaussian DAGs based on structural equation models with known or equal variances are identifiable [9], Shimizu et al. [10] prove identifiability for linear non-Gaussian structural equation models, and Peters et al. [11] prove identifiability of non-parametric structural equation models with additive independent noise. Here we show that Poisson DAG models are also identifiable using the idea of overdispersion.\n\nTo provide intuition, we begin by showing the identifiability of a two-node Poisson DAG model. The basic idea is that the relationship between nodes X1 and X2 generates an overdispersed child variable. To be precise, consider all three models: M1: X1 \u223c Poisson(\u03bb1), X2 \u223c Poisson(\u03bb2), where X1 and X2 are independent; M2: X1 \u223c Poisson(\u03bb1) and X2|X1 \u223c Poisson(g2(X1)); and M3: X2 \u223c Poisson(\u03bb2) and X1|X2 \u223c Poisson(g1(X2)). Our goal is to determine whether the underlying DAG model is M1, M2 or M3.\n\nFigure 1: Directed graphs of M1, M2 and M3\n\nNow we exploit the fact that for a Poisson random variable X, Var(X) = E(X), while for a variable that is conditionally Poisson, the variance is overdispersed relative to the mean. Hence for M1, Var(X1) = E(X1) and Var(X2) = E(X2). 
For M2, Var(X1) = E(X1), while\n\nVar(X2) = E[Var(X2|X1)] + Var[E(X2|X1)] = E[g2(X1)] + Var[g2(X1)] > E[g2(X1)] = E(X2),\n\nas long as Var(g2(X1)) > 0. Similarly under M3, Var(X2) = E(X2) and Var(X1) > E(X1) as long as Var(g1(X2)) > 0. Hence we can identify models M1, M2, and M3 by testing, for each node, whether the variance exceeds the expectation or equals it. With finite sample size n, the quantities E(\u00b7) and Var(\u00b7) can be estimated from data, and we consider the finite sample setting in Sections 4 and 4.2. Now we extend this idea to provide an identifiability condition for general Poisson DAG models.\n\nThe key idea in extending identifiability from the bivariate to the multivariate scenario is to condition on the parents of each node and then test overdispersion. The general p-variate result is as follows:\n\nTheorem 3.1. Assume that for any j \u2208 V, K \u2282 Pa(j) and S \u2282 {1, 2, ..., p} \\ K,\n\nVar(gj(XPa(j))|XS) > 0;\n\nthen the Poisson DAG model is identifiable.\n\nWe defer the proof to the supplementary material. Once again, the main idea of the proof is overdispersion. To explain the required assumption, note that for any j \u2208 V and S \u2282 Pa(j), Var(Xj|XS) \u2212 E(Xj|XS) = Var(gj(XPa(j))|XS). Note that if S = Pa(j) or {1, ..., j \u2212 1}, Var(gj(XPa(j))|XS) = 0. Otherwise Var(gj(XPa(j))|XS) > 0 by our assumption.\n\nFigure 2: Moralized graph Gm for DAG G\n\n4 Algorithm\n\nOur algorithm, which we call OverDispersion Scoring (ODS), consists of three main steps: 1) estimating a candidate parents set using existing undirected graph learning algorithms [1, 12, 13]; 2) estimating a causal ordering using overdispersion scoring; and 3) estimating directed edges using standard regression algorithms such as the Lasso. Step 3) is a standard problem for which we use off-the-shelf algorithms. 
Step 1) allows us to reduce both computational and sample complexity by exploiting sparsity of the moralized or undirected graphical model representation of the DAG, which we introduce shortly. Step 2) exploits overdispersion to learn a causal ordering.\n\nAn important concept we need to introduce for Step 1) of our algorithm is the moral graph or undirected graphical model representation of the DAG (see e.g. [14]). The moralized graph Gm for a DAG G = (V, E) is an undirected graph Gm = (V, Eu), where Eu includes the edge set E without directions plus edges between any nodes that are parents of a common child. Fig. 2 illustrates the moralized graph for a simple 3-node example where E = {(1, 3), (2, 3)} for DAG G. Note that 1, 2 are parents of a common child 3. Hence Eu = {(1, 2), (1, 3), (2, 3)}, where the additional edge (1, 2) arises from the fact that nodes 1 and 2 are both parents of node 3. Further, let N(j) := {k \u2208 {1, 2, ..., p} | (j, k) or (k, j) \u2208 Eu} denote the neighborhood set of a node j in the moralized graph Gm.\n\nLet X^{(1)}, ..., X^{(n)} denote n samples drawn from the Poisson DAG model G. Let \u03c0 : {1, 2, ..., p} \u2192 {1, 2, ..., p} be a bijective function corresponding to a permutation or a causal ordering. We will also use the convenient notation \u02c6\u00b7 to denote an estimate based on the data. For ease of notation, for any j \u2208 {1, 2, ..., p} and S \u2282 {1, 2, ..., p}, let \u00b5_{j|S} and \u00b5_{j|S}(xS) represent E(Xj|XS) and E(Xj|XS = xS), respectively. Furthermore let \u03c3^2_{j|S} and \u03c3^2_{j|S}(xS) denote Var(Xj|XS) and Var(Xj|XS = xS), respectively. 
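The moralization described above is mechanical to compute from a directed edge list; a minimal sketch (function name and edge representation are our own):

```python
def moralize(num_nodes, directed_edges):
    """Return the undirected edge set Eu of the moralized graph.

    directed_edges contains pairs (j, k) meaning j -> k. The moral graph keeps
    every directed edge (without direction) and additionally "marries" any two
    nodes that share a common child.
    """
    undirected = {frozenset(e) for e in directed_edges}
    parents = {k: set() for k in range(1, num_nodes + 1)}
    for j, k in directed_edges:
        parents[k].add(j)
    for child, pa in parents.items():
        pa = sorted(pa)
        for a in range(len(pa)):
            for b in range(a + 1, len(pa)):
                undirected.add(frozenset((pa[a], pa[b])))
    return {tuple(sorted(e)) for e in undirected}


# The 3-node example from Fig. 2: E = {(1, 3), (2, 3)}.
print(moralize(3, [(1, 3), (2, 3)]))  # the set {(1, 2), (1, 3), (2, 3)}
```

The extra edge (1, 2) is exactly the married-parents edge from the example above.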
We also define n(x_S) = \u2211_{i=1}^{n} 1(X^{(i)}_S = x_S) and n_S = \u2211_{x_S} n(x_S) 1(n(x_S) \u2265 c0 \u00b7 n) for an arbitrary c0 \u2208 (0, 1).\n\nThe computation of the score \u02c6s_{jk} in Step 2) of our ODS Algorithm 1 involves the following equation:\n\n\u02c6s_{jk} = \u2211_{x\u2208X(\u02c6C_{jk})} [n(x)/n_{\u02c6C_{jk}}] (\u02c6\u03c3^2_{j|\u02c6C_{jk}}(x) \u2212 \u02c6\u00b5_{j|\u02c6C_{jk}}(x)),  (4)\n\nwhere \u02c6C_{jk} refers to an estimated candidate set of parents specified in Step 2) of our ODS Algorithm 1 and X(\u02c6C_{jk}) = {x \u2208 {X^{(1)}_{\u02c6C_{jk}}, X^{(2)}_{\u02c6C_{jk}}, ..., X^{(n)}_{\u02c6C_{jk}}} | n(x) \u2265 c0 \u00b7 n}, so that we ensure we have enough samples for each element we select. In addition, c0 is a tuning parameter of our algorithm that we specify in our main Theorem 4.2 and our numerical experiments.\n\nWe can use a number of standard algorithms for Step 1) of our ODS algorithm since it boils down to finding a candidate set of parents. The main purpose of Step 1) is to reduce both the computational complexity and the sample complexity by exploiting sparsity in the moralized graph. In Step 1) a candidate set of parents is generated for each node, which in principle could be the entire set of nodes. However since Step 2) requires computation of a conditional mean and variance, both the sample complexity and computational complexity depend significantly on the number of variables we condition on, as illustrated in Sections 4.1 and 4.2. Hence by making the set of candidate parents for each node as small as possible we gain significant computational and statistical improvements by exploiting the graph structure. A similar step is taken in the MMHC [15] and SC [16] algorithms. The way we choose a candidate set of parents is by learning the moralized graph Gm and then using the neighborhood set N(j) for each j. Hence Step 1) reduces to a standard undirected graphical model learning algorithm. A number of choices are available for Step 1), including the neighborhood regression approach of Yang et al. [1] as well as standard DAG learning algorithms which find a candidate parents set, such as HITON [13] and MMPC [15].\n\nAlgorithm 1: OverDispersion Scoring (ODS)\n\ninput: n samples X^{(1)}, ..., X^{(n)} \u2208 ({0} \u222a N)^p from the given Poisson DAG model\noutput: a causal ordering \u02c6\u03c0 \u2208 N^p and a graph structure \u02c6E \u2208 {0, 1}^{p\u00d7p}\n\nStep 1: Estimate the undirected edges \u02c6Eu corresponding to the moralized graph, with neighborhood sets \u02c6N(j).\nStep 2: Estimate a causal ordering using the overdispersion scores:\n  for i \u2208 {1, 2, ..., p}: \u02c6s_i = \u02c6\u03c3^2_i \u2212 \u02c6\u00b5_i;\n  the first element of the causal ordering is \u02c6\u03c01 = arg min_j \u02c6s_j;\n  for j = 2, 3, ..., p \u2212 1:\n    for k \u2208 \u02c6N(\u02c6\u03c0_{j\u22121}) \u2229 {1, 2, ..., p} \\ {\u02c6\u03c01, ..., \u02c6\u03c0_{j\u22121}}:\n      the candidate parents set is \u02c6C_{jk} = \u02c6N(k) \u2229 {\u02c6\u03c01, \u02c6\u03c02, ..., \u02c6\u03c0_{j\u22121}};\n      calculate \u02c6s_{jk} using (4);\n    the jth element of the causal ordering is \u02c6\u03c0_j = arg min_k \u02c6s_{jk};\n    Step 3: Estimate the directed edges toward \u02c6\u03c0_j, denoted by \u02c6D_j;\n  the pth element of the causal ordering is \u02c6\u03c0_p = {1, 2, ..., p} \\ {\u02c6\u03c01, \u02c6\u03c02, ..., \u02c6\u03c0_{p\u22121}};\n  the directed edges toward \u02c6\u03c0_p are \u02c6D_p = \u02c6N(\u02c6\u03c0_p).\nReturn the estimated causal ordering \u02c6\u03c0 = (\u02c6\u03c01, \u02c6\u03c02, ..., \u02c6\u03c0_p) and the estimated edge structure \u02c6E = {\u02c6D2, \u02c6D3, ..., \u02c6D_p}.\n\nStep 2) learns the causal ordering by assigning an overdispersion score to each node. The basic idea is to determine which nodes are overdispersed based on the sample conditional mean and conditional variance. 
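A simplified sketch of Step 2 in Python (our own pared-down version: the candidate neighborhoods from Step 1 are taken as given, the n(x) \u2265 c0\u00b7n threshold is replaced by a crude minimum group size, and every remaining node is scored each round):

```python
import math
import random
from collections import defaultdict

rng = random.Random(1)


def sample_poisson(lam):
    # Knuth's method; adequate for the small rates used here.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def dispersion_score(data, j, cond):
    """Empirical analogue of (4): average, over observed values x of X_cond, of
    Var-hat(Xj | X_cond = x) - Mean-hat(Xj | X_cond = x), weighted by counts."""
    groups = defaultdict(list)
    for row in data:
        groups[tuple(row[k] for k in cond)].append(row[j])
    score, total = 0.0, len(data)
    for vals in groups.values():
        if len(vals) < 2:  # crude stand-in for the n(x) >= c0*n threshold
            continue
        m = sum(vals) / len(vals)
        var = sum((v - m) ** 2 for v in vals) / (len(vals) - 1)
        score += (len(vals) / total) * (var - m)
    return score


def estimate_ordering(data, neighbors):
    """Greedily pick the least-overdispersed node each round, conditioning each
    candidate on its moral-graph neighbors that are already ordered."""
    p = len(data[0])
    remaining, ordering = set(range(p)), []
    while remaining:
        placed = set(ordering)
        best = min(
            remaining,
            key=lambda k: dispersion_score(
                data, k, [m for m in neighbors[k] if m in placed]
            ),
        )
        ordering.append(best)
        remaining.remove(best)
    return ordering


# Toy chain 0 -> 1 -> 2 with log-link rates and made-up negative weights.
data = []
for _ in range(4000):
    x0 = sample_poisson(1.5)
    x1 = sample_poisson(math.exp(0.6 - 0.6 * x0))
    x2 = sample_poisson(math.exp(0.6 - 0.6 * x1))
    data.append((x0, x1, x2))

neighbors = {0: [1], 1: [0, 2], 2: [1]}  # moralized graph of the chain
print(estimate_ordering(data, neighbors))  # recovers the true ordering [0, 1, 2]
```

Conditioning on already-ordered neighbors mirrors the candidate set \u02c6C_{jk} in Algorithm 1: a node whose true parents are all placed looks conditionally Poisson (score near 0), while a node with an unplaced parent stays overdispersed.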
The causal ordering is determined one node at a time by selecting the node with the smallest overdispersion score, that is, the node least likely to be conditionally Poisson and most likely to be marginally Poisson. Finding the causal ordering is usually the most challenging step of DAG learning, since once the causal ordering is learnt, all that remains is to find the edge set for the DAG. Step 3), the final step, finds the directed edge set of the DAG G by finding the parent set of each node. Using Steps 1) and 2), finding the parent set of node j boils down to selecting which variables are parents out of the candidate parents of node j generated in Step 1), intersected with all elements before node j in the causal ordering from Step 2). Hence we have p variable selection problems, which can be performed using GLMLasso [17] as well as standard DAG learning algorithms.\n\n4.1 Computational Complexity\n\nSteps 1) and 3) use existing algorithms with known computational complexity. Clearly the computational complexity of Steps 1) and 3) depends on the choice of algorithm. For example, if we use the neighborhood selection GLMLasso algorithm [17] as is used in Yang et al. [1], the worst-case complexity is O(min(n, p)np) for a single Lasso run, but since there are p nodes, the total worst-case complexity is O(min(n, p)np^2). Similarly if we use GLMLasso for Step 3) the computational complexity is also O(min(n, p)np^2). 
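Each of these variable selection problems is a Poisson regression of a node on its candidate parents. As a stand-in for GLMLasso, here is a minimal unpenalized Poisson GLM fit by gradient ascent, with parents selected by a hypothetical magnitude threshold (all names, constants, and the thresholding rule are illustrative, not the paper's procedure):

```python
import math
import random

rng = random.Random(2)


def sample_poisson(lam):
    # Knuth's method; adequate for small rates.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def fit_poisson_glm(xs, ys, lr=0.1, iters=1000):
    """Fit E[y | x] = exp(b0 + b . x) by gradient ascent on the (concave)
    Poisson log-likelihood; xs are feature tuples, ys are counts."""
    d, n = len(xs[0]), len(xs)
    beta = [0.0] * (d + 1)  # intercept first
    for _ in range(iters):
        grad = [0.0] * (d + 1)
        for x, y in zip(xs, ys):
            resid = y - math.exp(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
            grad[0] += resid
            for i, v in enumerate(x):
                grad[i + 1] += resid * v
        for i in range(d + 1):
            beta[i] += lr * grad[i] / n
    return beta


def select_parents(data, j, candidates, threshold=0.15):
    """Keep candidates whose fitted coefficient is large in magnitude."""
    xs = [tuple(row[k] for k in candidates) for row in data]
    ys = [row[j] for row in data]
    beta = fit_poisson_glm(xs, ys)
    return [k for k, b in zip(candidates, beta[1:]) if abs(b) > threshold]


# Node 2 depends on node 0 only; node 1 is an irrelevant candidate.
data = []
for _ in range(800):
    x0 = sample_poisson(2.0)
    x1 = sample_poisson(2.0)
    x2 = sample_poisson(math.exp(0.4 - 0.5 * x0))
    data.append((x0, x1, x2))

print(select_parents(data, 2, [0, 1]))
```

An L1-penalized fit, as in GLMLasso, would perform the selection through the penalty itself rather than through a post-hoc threshold.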
As we show in numerical experiments, DAG-based algorithms for Step 1) tend to run more slowly than neighborhood regression based on GLMLasso.\n\nStep 2), where we estimate the causal ordering, has (p \u2212 1) iterations, and each iteration computes a number of overdispersion scores \u02c6s_j and \u02c6s_{jk} bounded by O(|K|), where K is the set of candidates for each element of the causal ordering, \u02c6N(\u02c6\u03c0_{j\u22121}) \u2229 {1, 2, ..., p} \\ {\u02c6\u03c01, ..., \u02c6\u03c0_{j\u22121}}, which is in turn bounded by the maximum degree d of the moralized graph. Hence the total number of overdispersion scores that need to be computed is O(pd). Since the time for calculating each overdispersion score, which is the difference between a conditional variance and expectation, is proportional to n, the time complexity is O(npd). In the worst case, where the degree of the moralized graph is p, the computational complexity of Step 2) is O(np^2). As we discussed earlier there is a significant computational saving from exploiting a sparse moralized graph, which is why we perform Step 1) of the algorithm. Hence Steps 1) and 3) are the main computational bottlenecks of our ODS algorithm. The addition of Step 2), which estimates the causal ordering, does not significantly add to the computational bottleneck. Consequently our ODS algorithm, which is designed for learning DAGs, is almost as computationally efficient as standard methods for learning undirected graphical models.\n\n4.2 Statistical Guarantees\n\nIn this section, we show consistency of our ODS algorithm in recovering a valid causal ordering under suitable regularity conditions. 
We begin by stating the assumptions we impose on the functions gj(.).\n\nAssumption 4.1.\n\n(A1) For all j \u2208 V, K \u2282 Pa(j) and all S \u2282 {1, 2, ..., p} \\ K, there exists an m > 0 such that Var(gj(XPa(j))|XS) > m.\n\n(A2) For all j \u2208 V, there exists an M < \u221e such that E[exp(gj(XPa(j)))] < M.\n\n(A1) is a stronger version of the identifiability assumption Var(gj(XPa(j))|XS) > 0 in Theorem 3.1: since we are in the finite sample setting, we need the conditional variance to be lower bounded by a constant bounded away from 0. (A2) is a condition on the tail behavior of gj(XPa(j)), used for controlling the tails of the score \u02c6s_{jk} in Step 2 of our ODS algorithm. To take a concrete example for which (A1) and (A2) are satisfied, it is straightforward to show that the GLM DAG model (2) with non-positive values of {\u03b8kj} satisfies both (A1) and (A2). The non-positivity constraint on the \u03b8's is sufficient but not necessary and ensures that the parameters do not grow too large.\n\nNow we present the main result under Assumptions (A1) and (A2). For general DAGs, the true causal ordering \u03c0\u2217 is not unique. Therefore let E(\u03c0\u2217) denote all the causal orderings that are consistent with the true DAG G\u2217. Further recall that d denotes the maximum degree of the moralized graph G\u2217m.\n\nTheorem 4.2 (Recovery of a causal ordering). Consider a Poisson DAG model as specified in (1), with a set of true causal orderings E(\u03c0\u2217), whose rate functions gj(.) satisfy Assumption 4.1. If the sample size threshold parameter c0 \u2264 n^{\u22121/(5+d)}, then there exist positive constants C1, C2, C3 such that\n\nP(\u02c6\u03c0 \u2209 E(\u03c0\u2217)) \u2264 C1 exp(\u2212C2 n^{1/(5+d)} + C3 log max{n, p}).\n\nWe defer the proof to the supplementary material. 
The main idea behind the proof uses the overdispersion property exploited in Theorem 3.1 in combination with concentration bounds that exploit Assumption (A2). Note once again that the maximum degree d of the undirected graph plays an important role in the sample complexity, which is why Step 1) is so important. This is because the size of the conditioning set depends on the degree d of the moralized graph. Hence d plays an important role in both the sample complexity and the computational complexity.\n\nTheorem 4.2 can be used in combination with sample complexity guarantees for Steps 1) and 3) of our ODS algorithm to prove that our output DAG \u02c6G is the true DAG G\u2217 with high probability. Sample complexity guarantees for Steps 1) and 3) depend on the choice of algorithm, but for neighborhood regression based on the GLMLasso, provided n = \u2126(d log p), Steps 1) and 3) should be consistent.\n\nBy Theorem 4.2, if the triple (n, d, p) satisfies n = \u2126((log p)^{5+d}), then our ODS algorithm recovers the true DAG. Hence if the moralized graph is sparse, ODS recovers the true DAG in the high-dimensional p > n setting. DAG learning algorithms that apply to the high-dimensional setting are not common since they typically rely on faithfulness or similar assumptions or other restrictive conditions that are not satisfied in the p > n setting. Note that if the DAG is not sparse and d = \u2126(p), our sample complexity is extremely large when p is large. This makes intuitive sense since if the number of candidate parents is large, we would need to condition on a large set of variables, which is very sample-intensive. Our sample complexity is certainly not optimal given the choice of tuning parameter c0 \u2264 n^{\u22121/(5+d)}. 
Determining the optimal sample complexity remains an open question.\n\nFigure 3: Accuracy rates of successful recovery of a causal ordering via our ODS algorithm using different base algorithms, for (a) p = 10, (b) p = 50, (c) p = 100, and (d) p = 5000, each with d \u2265 3.\n\nThe larger sample complexity of our ODS algorithm relative to undirected graphical model learning is mainly due to the fact that DAG learning is an intrinsically harder problem than undirected graph learning when the causal ordering is unknown. Furthermore note that Theorem 4.2 does not require any additional identifiability assumptions such as faithfulness, which severely increases the sample complexity for large-scale DAGs [6].\n\n5 Numerical Experiments\n\nIn this section, we support our theoretical results with numerical experiments and show that our ODS algorithm performs favorably compared to state-of-the-art DAG learning methods. The simulation study was conducted using 50 realizations of a p-node random Poisson DAG that was generated as follows. The gj(.) functions for the general Poisson DAG model (1) were chosen using the standard GLM link function (i.e. gj(XPa(j)) = exp(\u03b8j + \u2211_{k\u2208Pa(j)} \u03b8jk Xk)), resulting in the GLM DAG model (2). We experimented with other choices of gj(.) but only present results for the GLM DAG model (2). Note that our ODS algorithm works well as long as Assumption 4.1 is satisfied, regardless of the choice of gj(.). In all results presented, the (\u03b8jk) parameters were chosen uniformly at random in the range \u03b8jk \u2208 [\u22121, \u22120.7], although any values far from zero and satisfying Assumption 4.1 work well. In fact, smaller values of \u03b8jk are more favorable to our ODS algorithm than to state-of-the-art DAG learning methods because of the weak dependency between nodes. 
DAGs are generated randomly with a fixed unique causal ordering {1, 2, ..., p}, with edges randomly generated while respecting the desired maximum degree constraints for the DAG. In our experiments, we always set the thresholding constant c0 = 0.005, although any value below 0.01 seems to work well.\n\nIn Fig. 3, we plot the proportion of simulations in which our ODS algorithm recovers the correct causal ordering, in order to validate Theorem 4.2. All graphs in Fig. 3 have exactly 2 parents for each node and we plot how the accuracy in recovering the true \u03c0\u2217 varies as a function of n for n \u2208 {500, 1000, 2500, 5000, 10000} and for different node sizes (a) p = 10, (b) p = 50, (c) p = 100, and (d) p = 5000. As we can see, even when p = 5000, our ODS algorithm recovers the true causal ordering about 40% of the time when n is approximately 5000, while for smaller DAGs accuracy is 100%. In each sub-figure, 3 different algorithms are used for Step 1) (GLMLasso [17], where we choose \u03bb = 0.1; MMPC [15] with \u03b1 = 0.005; and HITON [13], again with \u03b1 = 0.005), along with an oracle in which the edges of the true moralized graph are used. As Fig. 3 shows, the GLMLasso seems to be the best performing algorithm in terms of recovery, so we use the GLMLasso for Steps 1) and 3) in the remaining figures. GLMLasso was also the only algorithm that scaled to the p = 5000 setting. However, it should be pointed out that GLMLasso is not necessarily consistent and its performance depends heavily on the choice of gj(.). Recall that the degree d refers to the maximum degree of the moralized graph.\n\nFig. 4 provides a comparison of how our ODS algorithm performs in terms of Hamming distance compared to the state-of-the-art PC [3], MMHC [15], GES [18], and SC [16] algorithms. 
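The random-DAG generation protocol just described can be sketched as follows (the function name and the exact in-degree convention of two parents per non-root node are our assumptions):

```python
import random


def random_poisson_dag(p, num_parents=2, seed=0):
    """Generate a random DAG respecting the causal ordering 1, ..., p: each
    node j draws up to num_parents parents uniformly from its predecessors,
    and each edge weight theta_jk is drawn uniformly from [-1, -0.7]."""
    rng = random.Random(seed)
    edges = {}
    for j in range(2, p + 1):
        k_parents = min(num_parents, j - 1)
        for k in rng.sample(range(1, j), k_parents):
            edges[(k, j)] = rng.uniform(-1.0, -0.7)
    return edges


edges = random_poisson_dag(10)
print(sorted(edges))  # every pair (k, j) satisfies k < j, respecting the ordering
```

With this convention, node 2 gets one parent and every later node gets two, so the maximum in-degree matches the two-parent setting used in the figures.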
For the PC, MMHC and SC algorithms, we use \u03b1 = 0.005, while for the GES algorithm we use the mBDe [19] (modified Bayesian Dirichlet equivalent) score since it performs better than other score choices. We consider node sizes of p = 10 in (a) and (b) and p = 100 in (c) and (d), since many of these algorithms do not easily scale to larger node sizes. We consider two Hamming distance measures: in (a) and (c), we only measure the Hamming distance to the skeleton of the true DAG, which is the set of edges of the DAG without directions; for (b) and (d) we measure the Hamming distance for the edges with directions. The reason we consider the skeleton is that the PC algorithm does not recover all directions of the DAG. We normalize the Hamming distance by dividing by the total number of edges, p(p \u2212 1)/2 for skeletons and p(p \u2212 1) for directed edges, respectively, so that the overall score is a percentage.\n\nFigure 4: Comparison of our ODS algorithm (black) and the PC, GES, MMHC, SC algorithms in terms of Hamming distance to skeletons and directed edges, for (a)-(b) p = 10 and (c)-(d) p = 100, each with d \u2265 3.\n\nAs we can see, our ODS algorithm significantly out-performs the other algorithms. We can also see that as the sample size n grows, our algorithm recovers the true DAG, which is consistent with our theoretical results. It must be pointed out that the choice of DAG model is suited to our ODS algorithm, while these state-of-the-art algorithms apply to more general classes of DAG models.\n\nNow we consider the statistical performance for large-scale DAGs. Fig. 5 plots the statistical performance of ODS for large-scale DAGs in terms of (a) recovering the causal ordering; (b) Hamming distance to the true skeleton; and (c) Hamming distance to the true DAG with directions. All graphs in Fig. 5 have exactly 2 parents for each node and accuracy varies as a function of n for n \u2208 {500, 1000, 2500, 5000, 10000} and for different node sizes p \u2208 {1000, 2500, 5000}. Fig. 5 shows that our ODS algorithm accurately recovers the causal ordering and true DAG models even in the high-dimensional setting, supporting our theoretical results in Section 4.2.\n\nFig. 6 shows the run-time of our ODS algorithm. We measure the running time (a) by varying the node size p from 10 to 125 with fixed n = 100 and 2 parents; (b) by varying the sample size n from 100 to 2500 with fixed p = 20 and 2 parents; and (c) by varying the number of parents of each node |Pa| from 1 to 5 with fixed n = 5000 and p = 20. Fig. 6 (a) and (b) support Section 4.1, where the time complexity of our ODS algorithm is shown to be at most O(np^2). Fig. 6 (c) shows that the running time is proportional to the parent-set size, which lower bounds the degree of the graph; this agrees with the O(npd) time complexity of Step 2) of our ODS algorithm. 
We can also see that the GLMLasso has the fastest run-time amongst all
algorithms that determine the candidate parent set.

(a) d ≥ 3  (b) d ≥ 3  (c) d ≥ 3

Figure 5: Performance of our ODS algorithm for large-scale DAGs with p = 1000, 2500, 5000.

(a) n = 100, d ≥ 3  (b) p = 20, d ≥ 3  (c) n = 5000, p = 20

Figure 6: Time complexity of our ODS algorithm with respect to node size p, sample size n, and
parents size |Pa|.

References

[1] E. Yang, G. Allen, Z. Liu, and P. K. Ravikumar, “Graphical models via generalized linear
models,” in Advances in Neural Information Processing Systems, 2012, pp. 1358–1366.

[2] P. Bonissone, M. Henrion, L. Kanal, and J. Lemmer, “Equivalence and synthesis of causal
models,” in Uncertainty in Artificial Intelligence, vol. 6, 1991, p. 255.

[3] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction and Search. MIT Press, 2000.

[4] S. L. Lauritzen, Graphical Models. Oxford University Press, 1996.

[5] D. M. Chickering, “Learning Bayesian networks is NP-complete,” in Learning from Data.
Springer, 1996, pp. 121–130.

[6] C. Uhler, G. Raskutti, P. Bühlmann, B.
Yu et al., “Geometry of the faithfulness assumption in
causal inference,” The Annals of Statistics, vol. 41, no. 2, pp. 436–463, 2013.

[7] C. B. Dean, “Testing for overdispersion in Poisson and binomial regression models,” Journal
of the American Statistical Association, vol. 87, no. 418, pp. 451–457, 1992.

[8] T. Zheng, M. J. Salganik, and A. Gelman, “How many people do you know in prison? Using
overdispersion in count data to estimate social structure in networks,” Journal of the American
Statistical Association, vol. 101, no. 474, pp. 409–423, 2006.

[9] J. Peters and P. Bühlmann, “Identifiability of Gaussian structural equation models with equal
error variances,” Biometrika, p. ast043, 2013.

[10] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen, “A linear non-Gaussian acyclic
model for causal discovery,” The Journal of Machine Learning Research, vol. 7, pp. 2003–2030,
2006.

[11] J. Peters, J. Mooij, D. Janzing et al., “Identifiability of causal graphs using functional models,”
arXiv preprint arXiv:1202.3757, 2012.

[12] I. Tsamardinos, L. E. Brown, and C. F. Aliferis, “The max-min hill-climbing Bayesian network
structure learning algorithm,” Machine Learning, vol. 65, no. 1, pp. 31–78, 2006.

[13] C. F. Aliferis, I. Tsamardinos, and A. Statnikov, “HITON: a novel Markov blanket algorithm
for optimal variable selection,” in AMIA Annual Symposium Proceedings, vol. 2003. American
Medical Informatics Association, 2003, p. 21.

[14] R. G. Cowell, P. A. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter, Probabilistic Networks and
Expert Systems. Springer-Verlag, 1999.

[15] I. Tsamardinos and C. F. Aliferis, “Towards principled feature selection: Relevancy, filters and
wrappers,” in Proceedings of the Ninth International Workshop on Artificial Intelligence and
Statistics.
Morgan Kaufmann Publishers: Key West, FL, USA, 2003.

[16] N. Friedman, I. Nachman, and D. Pe'er, “Learning Bayesian network structure from massive
datasets: the sparse candidate algorithm,” in Proceedings of the Fifteenth Conference on
Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 206–215.

[17] J. Friedman, T. Hastie, and R. Tibshirani, “glmnet: Lasso and elastic-net regularized generalized
linear models,” R package version, vol. 1, 2009.

[18] D. M. Chickering, “Optimal structure identification with greedy search,” The Journal of
Machine Learning Research, vol. 3, pp. 507–554, 2003.

[19] D. Heckerman, D. Geiger, and D. M. Chickering, “Learning Bayesian networks: The
combination of knowledge and statistical data,” Machine Learning, vol. 20, no. 3, pp. 197–243,
1995.