{"title": "DAGs with NO TEARS: Continuous Optimization for Structure Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9472, "page_last": 9483, "abstract": "Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint. In this paper, we introduce a fundamentally different strategy: we formulate the structure learning problem as a purely continuous optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting problem can be efficiently solved by standard numerical algorithms, which also makes implementation effortless. The proposed method outperforms existing ones, without imposing any structural assumptions on the graph such as bounded treewidth or in-degree.", "full_text": "DAGs with NO TEARS: Continuous Optimization for Structure Learning

Xun Zheng1, Bryon Aragam1, Pradeep Ravikumar1, Eric P. Xing1,2
{xunzheng,naragam,pradeepr,epxing}@cs.cmu.edu
1Carnegie Mellon University  2Petuum Inc.

Abstract

Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint. In this paper, we introduce a fundamentally different strategy: we formulate the structure learning problem as a purely continuous optimization problem over real matrices that avoids this combinatorial constraint entirely.
This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting problem can be efficiently solved by standard numerical algorithms, which also makes implementation effortless. The proposed method outperforms existing ones, without imposing any structural assumptions on the graph such as bounded treewidth or in-degree.

1 Introduction

Learning directed acyclic graphs (DAGs) from data is an NP-hard problem [8, 11], owing mainly to the combinatorial acyclicity constraint that is difficult to enforce efficiently. At the same time, DAGs are popular models in practice, with applications in biology [33], genetics [49], machine learning [22], and causal inference [42]. For this reason, the development of new methods for learning DAGs remains a central challenge in machine learning and statistics.

In this paper, we propose a new approach for score-based learning of DAGs by converting the traditional combinatorial optimization problem (left) into a continuous program (right):

    min_{W ∈ R^{d×d}} F(W) subject to G(W) ∈ DAGs   ⟺   min_{W ∈ R^{d×d}} F(W) subject to h(W) = 0,    (1)

where G(W) is the d-node graph induced by the weighted adjacency matrix W, F : R^{d×d} → R is a score function (see Section 2.1 for details), and our key technical device h : R^{d×d} → R is a smooth function over real matrices, whose level set at zero exactly characterizes acyclic graphs. Although the two problems are equivalent, the continuous program on the right eliminates the need for specialized algorithms that are tailored to search over the combinatorial space of DAGs. Instead, we are able to leverage standard numerical algorithms for constrained problems, which makes implementation particularly easy, not requiring any knowledge about graphical models.
This is similar in spirit to the situation for undirected graphical models, in which the formulation of a continuous log-det program [4] sparked a series of remarkable advances in structure learning for undirected graphs (Section 2.2). Unlike undirected models, which can be reduced to a convex program, however, the program (1) is nonconvex. Nonetheless, as we will show, even naïve solutions to this program yield state-of-the-art results for learning DAGs.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Visual comparison of the learned weighted adjacency matrix on a 20-node graph with n = 1000 (large samples) and n = 20 (insufficient samples): (a) true graph; (b) estimate with n = 1000; (c) estimate with n = 20. W̃_ECP(λ) is the proposed NOTEARS algorithm with ℓ1-regularization λ, and FGS is the binary estimate of the baseline [31]. The proposed algorithm performs well on large samples, and remains accurate on small n with ℓ1-regularization.

Contributions. The main thrust of this work is to re-formulate score-based learning of DAGs so that standard smooth optimization schemes such as L-BFGS [28] can be leveraged. To accomplish this, we make the following specific contributions:

• We explicitly construct a smooth function over R^{d×d} with computable derivatives that encodes the acyclicity constraint. This allows us to replace the combinatorial constraint G ∈ D in (4) with a smooth equality constraint.
• We develop an equality-constrained program for simultaneously estimating the structure and parameters of a sparse DAG from possibly high-dimensional data, and show how standard numerical solvers can be used to find stationary points.
• We demonstrate the effectiveness of the resulting method in empirical evaluations against existing state-of-the-art methods. See Figure 1 for a quick illustration and Section 5 for details.
• We compare our output to the exact global minimizer [12], and show that our method attains scores that are comparable to the globally optimal score in practice, although our methods are only guaranteed to find stationary points.

Most interestingly, our approach is very simple and can be implemented in about 50 lines of Python code. As a result of its simplicity and the effortlessness of its implementation, we call the resulting method NOTEARS: Non-combinatorial Optimization via Trace Exponential and Augmented lagRangian for Structure learning. The implementation is publicly available at https://github.com/xunzheng/notears.

2 Background

The basic DAG learning problem is formulated as follows: Let X ∈ R^{n×d} be a data matrix consisting of n i.i.d. observations of the random vector X = (X1, ..., Xd) and let D denote the (discrete) space of DAGs G = (V, E) on d nodes. Given X, we seek to learn a DAG G ∈ D (also called a Bayesian network) for the joint distribution P(X) [22, 42]. We model X via a structural equation model (SEM) defined by a weighted adjacency matrix W ∈ R^{d×d}. Thus, instead of operating on the discrete space D, we will operate on R^{d×d}, the continuous space of d × d real matrices.

2.1 Score functions and SEM

Any W ∈ R^{d×d} defines a graph on d nodes in the following way: Let A(W) ∈ {0, 1}^{d×d} be the binary matrix such that [A(W)]_ij = 1 ⟺ w_ij ≠ 0 and zero otherwise; then A(W) defines the adjacency matrix of a directed graph G(W). In a slight abuse of notation, we will thus treat W as if it were a (weighted) graph. In addition to the graph G(W), W = [w1 | ··· | wd] defines a linear SEM by Xj = wj^T X + zj, where X = (X1, ..., Xd) is a random vector and z = (z1, ..., zd) is a random noise vector. We do not assume that z is Gaussian. More generally, we can model Xj via a generalized linear model (GLM) E(Xj | X_pa(Xj)) = f(wj^T X). For example, if Xj ∈ {0, 1}, we can model the conditional distribution of Xj given its parents via logistic regression.

In this paper, we focus on linear SEM and the least-squares (LS) loss ℓ(W; X) = (1/2n)‖X − XW‖_F^2, although everything in the sequel applies to any smooth loss function ℓ defined over R^{d×d}. The statistical properties of the LS loss in scoring DAGs have been extensively studied: The minimizer of the LS loss provably recovers a true DAG with high probability on finite samples and in high dimensions (d ≫ n), and hence is consistent for both Gaussian SEM [3, 45] and non-Gaussian SEM [24].¹ Note also that these results imply that the faithfulness assumption is not required in this set-up. Given this extensive previous work on statistical issues, our focus in this paper is entirely on the computational problem of finding an SEM that minimizes the LS loss.

This translation between graphs and SEM is central to our approach. Since we are interested in learning a sparse DAG, we add ℓ1-regularization λ‖W‖_1 = λ‖vec(W)‖_1, resulting in the regularized score function

    F(W) = ℓ(W; X) + λ‖W‖_1 = (1/2n)‖X − XW‖_F^2 + λ‖W‖_1.    (2)

Thus we seek to solve

    min_{W ∈ R^{d×d}} F(W) subject to G(W) ∈ D.    (3)

Unfortunately, although F(W) is continuous, the DAG constraint G(W) ∈ D remains a challenge to enforce. In Section 3, we show how this discrete constraint can be replaced by a smooth equality constraint.

2.2 Previous work

Traditionally, score-based learning seeks to optimize a discrete score Q : D →
 R over the set of DAGs D; note that this is distinct from our score F(W), whose domain is R^{d×d} instead of D. This can be written as the following combinatorial optimization problem:

    min_G Q(G) subject to G ∈ D.    (4)

Popular score functions include BDe(u) [20], BGe [23], BIC [10], and MDL [6]. Unfortunately, (4) is NP-hard to solve [8, 11], owing mainly to the nonconvex, combinatorial nature of the optimization problem. This is the main drawback of existing approaches for solving (4): The acyclicity constraint is a combinatorial constraint, with the number of acyclic structures increasing superexponentially in d [32]. Notwithstanding, there are algorithms for solving (4) to global optimality for small problems [12, 13, 29, 39, 40, 47]. There is also a wide literature on approximate algorithms based on order search [30, 34-36, 43], greedy search [9, 20, 31], and coordinate descent [2, 16, 18]. By searching over the space of topological orderings, the former order-based methods trade the difficult problem of enforcing acyclicity for a search over d! orderings, whereas the latter methods enforce acyclicity one edge at a time, explicitly checking for acyclicity violations each time an edge is added. Other approaches that avoid optimizing (4) directly include constraint-based methods [41, 42], hybrid methods [17, 44], and Bayesian methods [14, 27, 51].

It is instructive to compare this problem to a similar and well-understood problem: learning an undirected graph (Markov network) from data. Score-based methods based on discrete scores similar to (4) proliferated in the early days for learning undirected graphs [e.g. 22, §20.7]. More recently, the re-formulation of this problem as a convex program over real, symmetric matrices [4, 48] has led to extremely efficient algorithms for learning undirected graphs [15, 21, 37].
One of the key factors in this success was having a closed-form, tractable program for which existing techniques from the extensive optimization literature could be applied. Unfortunately, the general problem of DAG learning has not benefitted in this way, arguably due to the intractable form of the program (4). One of our main goals in the current work is to formulate score-based learning via a similar closed-form, continuous program. The key device in accomplishing this is a smooth characterization of acyclicity that will be introduced in the next section.

¹Due to nonconvexity, there may be more than one minimizer: These and other technical issues such as parameter identifiability are addressed in detail in the cited references.

3 A new characterization of acyclicity

In order to make (3) amenable to black-box optimization, we propose to replace the combinatorial acyclicity constraint G(W) ∈ D in (3) with a single smooth equality constraint h(W) = 0. Ideally, we would like a function h : R^{d×d} → R that satisfies the following desiderata:

(a) h(W) = 0 if and only if W is acyclic (i.e. G(W) ∈ D);
(b) The values of h quantify the "DAG-ness" of the graph;
(c) h is smooth;
(d) h and its derivatives are easy to compute.

Property (b) is useful in practice for diagnostics. By "DAG-ness", we mean some quantification of how severe violations from acyclicity become as W moves further from D. Although there are many ways to satisfy (b) by measuring some notion of "distance" to D, typical approaches would violate (c) and (d). For example, h might be the minimum ℓ2 distance to D, or it might be the sum of edge weights along all cyclic paths of W; however, these are either non-smooth (violating (c)) or hard to compute (violating (d)).
If a function that satisfies desiderata (a)-(d) exists, we can hope to apply existing machinery for constrained optimization such as Lagrange multipliers. Consequently, the DAG learning problem becomes equivalent to solving a numerical optimization problem, which is agnostic about the graph structure.

Our main result establishes the existence of such a function:

Theorem 1. A matrix W ∈ R^{d×d} is a DAG if and only if

    h(W) = tr(e^{W∘W}) − d = 0,    (5)

where ∘ is the Hadamard product and e^A is the matrix exponential of A. Moreover, h(W) has a simple gradient

    ∇h(W) = (e^{W∘W})^T ∘ 2W,    (6)

and satisfies all of the desiderata (a)-(d).

We sketch a proof of the first claim here; a formal proof of Theorem 1 can be found in Appendix A. Let S = W ∘ W; then S ∈ R_+^{d×d} while preserving the sparsity pattern of W. Recall that for any positive integer k, the entry (S^k)_ij of the matrix power is the sum of weight products along all k-step paths from node i to node j. Since S is nonnegative, tr(S^k) = 0 iff there are no k-cycles in the graph. Expanding the power series,

    tr(e^S) = tr(I) + tr(S) + (1/2!) tr(S^2) + ··· ≥ d,    (7)

and the equality is attained iff the underlying graph of S, equivalently W, has no cycles.

A key conclusion from Theorem 1 is that h and its gradient only involve evaluating the matrix exponential, which is a well-studied function in numerical analysis, and whose O(d^3) algorithm [1] is readily available in many scientific computing libraries. Although the connection between the trace of a matrix power and the number of cycles in a graph is well-known [19], to the best of our knowledge, this characterization of acyclicity has not appeared in the DAG learning literature previously. We defer the discussion of other possible characterizations to the appendix.
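As a concrete illustration of desiderata (c) and (d), h and its gradient from Theorem 1 can be evaluated in a few lines of Python. This is a minimal sketch of our own (built on scipy's matrix exponential); the released implementation may differ in details:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential, O(d^3)

def h(W):
    """Acyclicity measure of Theorem 1: h(W) = tr(e^{W o W}) - d."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # W * W is the Hadamard product

def grad_h(W):
    """Gradient from Theorem 1: grad h(W) = (e^{W o W})^T o 2W."""
    return expm(W * W).T * 2 * W

# A strictly upper-triangular W is a DAG, so h vanishes;
# adding the reverse edge 1 -> 0 creates a 2-cycle and h becomes positive.
W_dag = np.array([[0.0, 1.5,  0.0],
                  [0.0, 0.0, -0.7],
                  [0.0, 0.0,  0.0]])
W_cyc = W_dag.copy()
W_cyc[1, 0] = 2.0
```

Note that h(W_cyc) grows with the weights along the cycle, matching the "DAG-ness" interpretation in desideratum (b).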
In the next section, we apply Theorem 1 to solve the program (3) to stationarity by treating it as an equality-constrained program.

4 Optimization

Theorem 1 establishes a smooth, algebraic characterization of acyclicity that is also computable. As a consequence, the following equality-constrained program (ECP) is equivalent to (3):

    (ECP)    min_{W ∈ R^{d×d}} F(W) subject to h(W) = 0.    (8)

Algorithm 1 NOTEARS algorithm
1. Input: Initial guess (W0, α0), progress rate c ∈ (0, 1), tolerance ε > 0, threshold ω > 0.
2. For t = 0, 1, 2, ...:
   (a) Solve primal Wt+1 ← argmin_W L_ρ(W, αt) with ρ such that h(Wt+1) < c·h(Wt).
   (b) Dual ascent αt+1 ← αt + ρ·h(Wt+1).
   (c) If h(Wt+1) < ε, set W̃_ECP = Wt+1 and break.
3. Return the thresholded matrix Ŵ := W̃_ECP ∘ 1(|W̃_ECP| > ω).

The main advantage of (ECP) compared to both (3) and (4) is its amenability to classical techniques from the mathematical optimization literature. Nonetheless, since {W : h(W) = 0} is a nonconvex constraint, (8) is a nonconvex program, hence we still inherit the difficulties associated with nonconvex optimization. In particular, we will be content to find stationary points of (8); in Section 5.3 we compare our results to the global minimizer and show that the stationary points found by our method are close to global minima in practice.

In what follows, we outline the algorithm for solving (8). It consists of three steps: (i) converting the constrained problem into a sequence of unconstrained subproblems, (ii) optimizing the unconstrained subproblems, and (iii) thresholding. The full algorithm is outlined in Algorithm 1.

4.1 Solving the ECP with augmented Lagrangian

We will use the augmented Lagrangian method [e.g.
 25] to solve (ECP), which solves the original problem augmented by a quadratic penalty:

    min_{W ∈ R^{d×d}} F(W) + (ρ/2)|h(W)|^2 subject to h(W) = 0    (9)

with a penalty parameter ρ > 0. A nice property of the augmented Lagrangian method is that it approximates the solution of a constrained problem well by the solution of unconstrained problems, without increasing the penalty parameter ρ to infinity [25]. The algorithm is essentially a dual ascent method for (9). To begin with, the dual function with Lagrange multiplier α is given by

    D(α) = min_{W ∈ R^{d×d}} L_ρ(W, α),    (10)

    where L_ρ(W, α) = F(W) + (ρ/2)|h(W)|^2 + α·h(W)    (11)

is the augmented Lagrangian. The goal is to find a local solution to the dual problem

    max_{α ∈ R} D(α).    (12)

Let W*_α be the local minimizer of the Lagrangian (10) at α, i.e. D(α) = L_ρ(W*_α, α). Since the dual objective D(α) is linear in α, the derivative is simply given by ∇D(α) = h(W*_α). Therefore one can perform dual gradient ascent to optimize (12):

    α ← α + ρ·h(W*_α),    (13)

where the choice of step size ρ comes with the following convergence rate:

Proposition 1 (Corollary 11.2.1, [25]). For ρ large enough and a starting point α0 near the solution α*, the update (13) converges to α* linearly.

In our experiments, typically fewer than 10 steps of the augmented Lagrangian scheme are required.

4.2 Solving the unconstrained subproblem

The augmented Lagrangian converts a constrained problem (9) into a sequence of unconstrained problems (10). We now discuss how to solve these subproblems efficiently. Let w = vec(W) ∈ R^p,
The unconstrained subproblem (10) can be considered as a typical minimization\nproblem over real vectors:\n\nmin\nw2Rp\n\nf (w) + kwk1,\n\nwhere\n\nf (w) = `(W ; X) +\n\n\u21e2\n2|h(W )|2 + \u21b5h(W )\n\n(14)\n\n(15)\n\nis the smooth part of the objective. Our goal is to solve the above problem to high accuracy so that\nh(W ) can be suf\ufb01ciently suppressed.\nIn the special case of = 0, the nonsmooth term vanishes and the problem simply becomes an\nunconstrained smooth minimization, for which a number of ef\ufb01cient numerical algorithms are\navailable, for instance the L-BFGS [7]. To handle the nonconvexity, a slight modi\ufb01cation [28,\nProcedure 18.2] needs to be applied.\nWhen > 0, the problem becomes composite minimization, which can also be ef\ufb01ciently solved by\nthe proximal quasi-Newton (PQN) method [50]. At each step k, the key idea is to \ufb01nd the descent\ndirection through a quadratic approximation of the smooth term:\n\ndk = arg min\nd2Rp\n\ngT\nk d +\n\n1\n2\n\ndT Bkd + kwk + dk1,\n\n(16)\n\nwhere gk is the gradient of f (w) and Bk is the L-BFGS approximation of the Hessian. Note that for\neach coordinate j, problem (16) has a closed form update d d + z?ej given by\n+z| = c + S\u2713c \n\n)z + | wj + dj\n\nz2 + (gj + (Bd)j\n\nz? = arg min\n\n\n\na\u25c6 .\n\n(17)\n\nb\na\n\n1\n2\n\n,\n\nz\n\nBjj|{z}a\n\n|\n\nb\n\n{z\n\n}\n\nc\n\n| {z }\n\nMoreover, the low-rank structure of Bk enables fast computation for coordinate update. As we\ndescribe in Appendix B, the precomputation time is only O(m2p + m3) where m \u2327 p is the memory\nsize of L-BFGS, and each coordinate update is O(m). Furthermore, since we are using sparsity\nregularization, we can further speed up the algorithm by aggressively shrinking the active set of\ncoordinates based on their subgradients [50], and exclude the remaining dimensions from being\nupdated. 
With the updates restricted to the active set S, all dependencies of the complexity on O(p) become O(|S|), which is substantially smaller. Hence the overall complexity of the L-BFGS update is O(m^2|S| + m^3 + m|S|T), where T is the number of inner iterations, typically T = 10.

4.3 Thresholding

In regression problems, it is known that post-processing estimates of coefficients via hard thresholding provably reduces the number of false discoveries [46, 52]. Motivated by these encouraging results, we threshold the edge weights as follows: After obtaining a stationary point W̃_ECP of (9), given a fixed threshold ω > 0, set any weights smaller than ω in absolute value to zero. This strategy also has the important effect of "rounding" the numerical solution of the augmented Lagrangian (9), since due to finite numerical precision the solution satisfies h(W̃_ECP) ≤ ε for some small tolerance ε near machine precision (e.g. ε = 10^{-8}), rather than h(W̃_ECP) = 0 strictly. However, since h(W̃_ECP) explicitly quantifies the "DAG-ness" of W̃_ECP (see desideratum (b), Section 3), a small threshold ω suffices to rule out cycle-inducing edges.

5 Experiments

We compared our method against greedy equivalence search (GES) [9, 31], the PC algorithm [42], and LiNGAM [38]. For GES, we used the fast greedy search (FGS) implementation from Ramsey et al. [31]. Since the accuracy of PC and LiNGAM was significantly lower than either FGS or NOTEARS, we only report the results against FGS here. This is consistent with previous work on score-based learning [2], which also indicates that FGS outperforms other techniques such as hill-climbing and MMHC [44].
FGS was chosen since it is a state-of-the-art algorithm that scales to large problems. For brevity, we outline the basic set-up of our experiments here; precise details of our experimental set-up, including all parameter choices and more detailed evaluations, can be found in Appendix E.

Figure 2: Parameter estimates of W̃_ECP on a scale-free graph: (a) true graph; (b) estimate with n = 1000; (c) estimate with n = 20. Without the additional thresholding step in Algorithm 1, NOTEARS still produces consistent estimates of the true graph. The proposed method estimates the weights very well with large samples even without regularization, and remains accurate on insufficient samples when ℓ1-regularization is introduced. See also Figure 1.

In each experiment, a random graph G was generated from one of two random graph models, Erdös-Rényi (ER) or scale-free (SF). Given G, we assigned uniformly random edge weights to obtain a weight matrix W. Given W, we sampled X = W^T X + z ∈ R^d from three different noise models: Gaussian (Gauss), Exponential (Exp), and Gumbel (Gumbel). Based on these models, we generated random datasets X ∈ R^{n×d} by generating rows i.i.d. according to one of these three models with d ∈ {10, 20, 50, 100} and n ∈ {20, 1000}. Since FGS outputs a CPDAG instead of a DAG or weight matrix, some care needs to be taken in making comparisons; see Appendix E.1 for details.

5.1 Parameter estimation

We first performed a qualitative study of the solutions obtained by NOTEARS without thresholding by visualizing the weight matrix W̃_ECP obtained by solving (ECP) (i.e. ω = 0). This is illustrated in Figures 1 (ER-2) and 2 (SF-4). The key takeaway is that our method provides (empirically) consistent parameter estimates of the true weight matrix W. The final thresholding step in Algorithm 1 is only needed to ensure accuracy in structure learning.
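The data-generating process described above can be sketched as follows. This is our own minimal version (Gaussian noise only): an ER graph drawn through a strictly upper-triangular mask, uniformly random weights, and row-wise sampling of the linear SEM X = W^T X + z:

```python
import numpy as np

def simulate_er_sem(d=20, edge_prob=0.1, n=1000, seed=0):
    """Sample an Erdos-Renyi DAG, random edge weights, and n rows of X = W^T X + z."""
    rng = np.random.default_rng(seed)
    # A strictly upper-triangular mask fixes a topological order, hence a DAG.
    mask = np.triu(rng.random((d, d)) < edge_prob, k=1)
    signs = rng.choice([-1.0, 1.0], size=(d, d))
    W = mask * signs * rng.uniform(0.5, 2.0, size=(d, d))
    # Each sample solves x = W^T x + z, i.e. the rows of X are Z (I - W)^{-1}.
    Z = rng.normal(size=(n, d))          # Gaussian noise; Exp/Gumbel are analogous
    X = Z @ np.linalg.inv(np.eye(d) - W)
    return W, X, Z

W_true, X_sim, Z_sim = simulate_er_sem()
```

Since I − W is unit upper-triangular, it is always invertible, so every sampled row satisfies the SEM exactly.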
Figures 1 and 2 also show how effective ℓ1-regularization is in the small-n regime.

5.2 Structure learning

We now examine our method for structure recovery, which is shown in Figure 3. For brevity, we only report the numbers for the structural Hamming distance (SHD) here, but complete figures and tables for additional metrics can be found in the supplement. Consistent with previous work on greedy methods, FGS is very competitive when the number of edges is small (ER-2), but rapidly deteriorates for even modest numbers of edges (SF-4). In the latter regime, NOTEARS shows significant improvements. This is consistent across each metric we evaluated, and the difference grows as the number of nodes d gets larger. Also notice that our algorithm performs uniformly better for each noise model (Exp, Gauss, and Gumbel), without leveraging any specific knowledge about the noise type. Again, the ℓ1-regularizer helps significantly in the small-n setting.

5.3 Comparison to exact global minimizer

In order to assess the ability of our method to solve the original program given by (3), we used the GOBNILP program [12, 13] to find the exact minimizer of (3). Since this involves enumerating all possible parent sets for each node, these experiments are limited to small DAGs. Nonetheless, these small-scale experiments yield valuable insight into how well NOTEARS performs in actually solving the original problem. In our experiments we generated random graphs with d = 10, and then generated 10 simulated datasets containing n = 20 samples (for high dimensions) and n = 1000 (for low dimensions). We then compared the scores returned by our method to the exact global minimizer computed by GOBNILP along with the estimated parameters. The results are shown in Table 1. Surprisingly, although NOTEARS is only guaranteed to return a local minimizer, in many cases the obtained solution is very close to the global minimizer, as evidenced by the deviations ‖Ŵ − W_G‖.
[Figure 3 omitted: plots of structural Hamming distance (SHD) and false discovery rate (FDR) versus the number of nodes d, comparing FGS, NOTEARS, and NOTEARS-L1; panel (a) SHD with n = 1000, panel (b) SHD with n = 20.]

Figure 3: Structure recovery in terms of SHD and FDR relative to the true graph (lower is better). Rows: random graph types, {ER, SF}-k = {Erdös-Rényi, scale-free} graphs with kd expected edges. Columns: noise types of the SEM. Error bars represent standard errors over 10 simulations.

Table 1: Comparison of NOTEARS vs. the globally optimal solution, where δ(W_G, Ŵ) = F(W_G) − F(Ŵ).

Graph  n     λ    F(W)   F(W_G)  F(Ŵ)   F(W̃_ECP)  δ(W_G, Ŵ)  ‖Ŵ − W_G‖  ‖W − W_G‖
ER2    20    0     5.11   3.85    5.36   3.88     -1.52      0.07       3.38
ER2    20    0.5  16.04  12.81   13.49  12.90     -0.68      0.12       3.15
ER2    1000  0     4.99   4.97    5.02   4.95     -0.05      0.02       0.40
ER2    1000  0.5  15.93  13.32   14.03  13.46     -0.71      0.12       2.95
SF4    20    0     4.99   3.77    4.70   3.85     -0.93      0.08       3.31
SF4    20    0.5  23.33  16.19   17.31  16.69     -1.12      0.15       5.08
SF4    1000  0     4.96   4.94    5.05   4.99     -0.11      0.04       0.29
SF4    1000  0.5  23.29  17.56   19.70  18.43     -2.13      0.13       4.34

Since the general structure learning problem is NP-hard, we suspect that although the models we have tested (i.e. ER and SF) appear amenable to fast solution, in the worst case there are graphs which will still take exponential time to run or get stuck in a local minimum. Furthermore, the problem becomes more difficult as d increases.
Nonetheless, this is encouraging evidence that the nonconvexity of (8) is a minor issue in practice. We leave it to future work to investigate these problems further.

5.4 Real data

We also compared FGS and NOTEARS on a real dataset provided by Sachs et al. [33]. This dataset consists of continuous measurements of expression levels of proteins and phospholipids in human immune system cells (n = 7466, d = 11, 20 edges). This dataset is a common benchmark in graphical models since it comes with a known consensus network, that is, a gold-standard network based on experimental annotations that is widely accepted by the biological community. In our experiments, FGS estimated 17 total edges with an SHD of 22, compared to 16 for NOTEARS with an SHD of 22.

6 Discussion

We have proposed a new method for learning DAGs from data based on a continuous optimization program. This represents a significant departure from existing approaches that search over the discrete space of DAGs, resulting in a difficult optimization program. We also proposed two optimization schemes for solving the resulting program to stationarity, and illustrated its advantages over existing methods such as greedy equivalence search. Crucially, by performing global updates (e.g. all parameters at once) instead of local updates (e.g. one edge at a time) in each iteration, our method is able to avoid relying on assumptions about the local structure of the graph. To conclude, let us discuss some of the limitations of our method and possible directions for future work.

First, it is worth emphasizing once more that the equality-constrained program (8) is a nonconvex program. Thus, although we overcome the difficulties of combinatorial optimization, our formulation still inherits the difficulties associated with nonconvex optimization. In particular, black-box solvers can at best find stationary points of (8).
With the exception of exact methods, however, existing methods suffer from this drawback as well.2 The main advantage of NOTEARS then is smooth, global search, as opposed to combinatorial, local search; furthermore, the search is delegated to standard numerical solvers.

Second, the current work relies on the smoothness of the score function in order to make use of gradient-based numerical solvers to guide the graph search. However, it is also interesting to consider non-smooth, even discrete scores such as BDe [20]. Off-the-shelf techniques such as Nesterov's smoothing [26] could be useful; a more thorough investigation is left for future work.

Third, since the evaluation of the matrix exponential is O(d^3), the computational complexity of our method is cubic in the number of nodes, although the constant is small for sparse matrices. In fact, this is one of the key motivations for our use of second-order methods (as opposed to first-order), i.e. to reduce the number of matrix exponential computations. By using second-order methods, each iteration makes significantly more progress than first-order methods. Furthermore, although in practice not many iterations (t ≈ 10) are required, we have not established any worst-case iteration complexity results. In light of the results in Section 5.3, we expect there are exceptional cases where convergence is slow. Notwithstanding, NOTEARS already outperforms existing methods when the in-degree is large, which is a known difficult spot for existing methods. We leave it to future work to study these cases in more depth.

Lastly, in our experiments, we chose a fixed, suboptimal value of ω > 0 for thresholding (Section 4.3). Clearly, it would be preferable to find a data-driven choice of ω that adapts to different noise-to-signal ratios and graph types.
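To make the thresholding step and its interaction with acyclicity concrete, here is a minimal sketch of the post-processing (the helper names notears_h and threshold, the toy weight matrix, and the default ω = 0.3 are illustrative choices of ours, not the released implementation):

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential, O(d^3) per evaluation


def notears_h(W):
    """Acyclicity function h(W) = tr(exp(W o W)) - d, which is zero iff G(W) is a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # W * W is the elementwise (Hadamard) square


def threshold(W, omega=0.3):
    """Zero out edge weights with |w_ij| <= omega (fixed-threshold post-processing)."""
    W_t = W.copy()
    W_t[np.abs(W_t) <= omega] = 0.0
    return W_t


# A 3-node example: two strong edges plus one small spurious edge that closes a cycle.
W = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, 0.9],
              [0.1, 0.0, 0.0]])

print(notears_h(W) > 1e-8)                   # cycle present, so h(W) > 0
print(abs(notears_h(threshold(W))) < 1e-8)   # thresholding removes the weak edge; h = 0
```

A fixed ω works here because the spurious weight (0.1) is well separated from the true weights (0.8, 0.9); when the noise-to-signal ratio shrinks that gap, a single fixed cutoff can no longer separate the two, which is exactly why a data-driven ω would be preferable.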
It is an interesting direction for future work to study such choices.
The code is publicly available at https://github.com/xunzheng/notears.

Acknowledgments

We thank the anonymous reviewers for valuable feedback. P.R. acknowledges the support of NSF via IIS-1149803, IIS-1664720. E.X. and B.A. acknowledge the support of NIH R01GM114311, P30DA035778. X.Z. acknowledges the support of Dept of Health BD4BH4100070287, NSF IIS1563887, AFRL/DARPA FA87501720152.

2 GES [9] is known to find the global minimizer in the limit n → ∞ under certain assumptions, but this is not guaranteed for finite samples.

References

[1] Al-Mohy, Awad H., & Higham, Nicholas J. 2009. A New Scaling and Squaring Algorithm for the Matrix Exponential. SIAM Journal on Matrix Analysis and Applications.

[2] Aragam, Bryon, & Zhou, Qing. 2015. Concave Penalized Estimation of Sparse Gaussian Bayesian Networks. Journal of Machine Learning Research, 16, 2273–2328.

[3] Aragam, Bryon, Amini, Arash A., & Zhou, Qing. 2016. Learning directed acyclic graphs with penalized neighbourhood regression. Submitted, arXiv:1511.08963.

[4] Banerjee, Onureena, El Ghaoui, Laurent, & d'Aspremont, Alexandre. 2008. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485–516.

[5] Barabási, Albert-László, & Albert, Réka. 1999. Emergence of scaling in random networks. Science, 286(5439), 509–512.

[6] Bouckaert, Remco R. 1993. Probabilistic network construction using the minimum description length principle. In European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty. Springer, pp. 41–48.

[7] Byrd, Richard H., Lu, Peihuang, Nocedal, Jorge, & Zhu, Ciyou. 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing.

[8] Chickering, David Maxwell. 1996.
Learning Bayesian networks is NP-complete. In Learning from data. Springer.

[9] Chickering, David Maxwell. 2003. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 507–554.

[10] Chickering, David Maxwell, & Heckerman, David. 1997. Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29(2-3), 181–212.

[11] Chickering, David Maxwell, Heckerman, David, & Meek, Christopher. 2004. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5, 1287–1330.

[12] Cussens, James. 2012. Bayesian network learning with cutting planes. arXiv preprint arXiv:1202.3713.

[13] Cussens, James, Haws, David, & Studený, Milan. 2017. Polyhedral aspects of score equivalence in Bayesian network structure learning. Mathematical Programming, 164(1-2), 285–324.

[14] Ellis, Byron, & Wong, Wing Hung. 2008. Learning causal Bayesian network structures from experimental data. Journal of the American Statistical Association, 103(482).

[15] Friedman, Jerome, Hastie, Trevor, & Tibshirani, Robert. 2008. Sparse inverse covariance estimation with the Graphical Lasso. Biostatistics, 9(3), 432–441.

[16] Fu, Fei, & Zhou, Qing. 2013. Learning Sparse Causal Gaussian Networks With Experimental Intervention: Regularization and Coordinate Descent. Journal of the American Statistical Association, 108(501), 288–300.

[17] Gámez, José A, Mateo, Juan L, & Puerta, José M. 2011. Learning Bayesian networks by hill climbing: Efficient methods based on progressive restriction of the neighborhood. Data Mining and Knowledge Discovery, 22(1-2), 106–148.

[18] Gu, Jiayang, Fu, Fei, & Zhou, Qing. 2018. Penalized Estimation of Directed Acyclic Graphs From Discrete Data. Statistics and Computing, DOI: 10.1007/s11222-018-9801-y.

[19] Harary, Frank, & Manvel, Bennet. 1971.
On the number of cycles in a graph. Matematický časopis.

[20] Heckerman, David, Geiger, Dan, & Chickering, David M. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197–243.

[21] Hsieh, Cho-Jui, Sustik, Mátyás A, Dhillon, Inderjit S, & Ravikumar, Pradeep. 2014. QUIC: quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15(1), 2911–2947.

[22] Koller, Daphne, & Friedman, Nir. 2009. Probabilistic graphical models: principles and techniques. MIT Press.

[23] Kuipers, Jack, Moffa, Giusi, & Heckerman, David. 2014. Addendum on the scoring of Gaussian directed acyclic graphical models. The Annals of Statistics, pp. 1689–1691.

[24] Loh, Po-Ling, & Bühlmann, Peter. 2014. High-Dimensional Learning of Linear Causal Networks via Inverse Covariance Estimation. Journal of Machine Learning Research, 15, 3065–3105.

[25] Nemirovski, Arkadi. 1999. Optimization II: Standard Numerical Methods for Nonlinear Continuous Optimization.

[26] Nesterov, Yurii. 2005. Smooth minimization of non-smooth functions. Mathematical Programming.

[27] Niinimäki, Teppo, Parviainen, Pekka, & Koivisto, Mikko. 2016. Structure discovery in Bayesian networks by sampling partial orders. Journal of Machine Learning Research, 17(1), 2002–2048.

[28] Nocedal, Jorge, & Wright, Stephen J. 2006. Numerical Optimization.

[29] Ott, Sascha, & Miyano, Satoru. 2003. Finding optimal gene networks using biological constraints. Genome Informatics, 14, 124–133.

[30] Park, Young Woong, & Klabjan, Diego. 2017. Bayesian Network Learning via Topological Order. Journal of Machine Learning Research.

[31] Ramsey, Joseph, Glymour, Madelyn, Sanchez-Romero, Ruben, & Glymour, Clark. 2016.
A\nmillion variables and more: the Fast Greedy Equivalence Search algorithm for learning high-\ndimensional graphical causal models, with an application to functional magnetic resonance images.\nInternational Journal of Data Science and Analytics, pp. 1\u20139.\n\n[32] Robinson, Robert W. 1977. Counting unlabeled acyclic digraphs. In Combinatorial mathematics\n\nV. Springer.\n\n[33] Sachs, Karen, Perez, Omar, Pe\u2019er, Dana, Lauffenburger, Douglas A, & Nolan, Garry P.\n2005. Causal protein-signaling networks derived from multiparameter single-cell data. Sci-\nence, 308(5721), 523\u2013529.\n\n[34] Scanagatta, Mauro, de Campos, Cassio P, Corani, Giorgio, & Zaffalon, Marco. 2015. Learning\nBayesian networks with thousands of variables. In Advances in Neural Information Processing\nSystems. pp. 1864\u20131872.\n\n[35] Scanagatta, Mauro, Corani, Giorgio, de Campos, Cassio P, & Zaffalon, Marco. 2016. Learning\nTreewidth-Bounded Bayesian Networks with Thousands of Variables. In Advances in Neural\nInformation Processing Systems. pp. 1462\u20131470.\n\n[36] Schmidt, Mark, Niculescu-Mizil, Alexandru, & Murphy, Kevin. 2007. Learning graphical\n\nmodel structure using L1-regularization paths. In AAAI, vol. 7. pp. 1278\u20131283.\n\n[37] Schmidt, Mark, Berg, Ewout, Friedlander, Michael, & Murphy, Kevin. 2009. Optimizing\ncostly functions with simple constraints: A limited-memory projected quasi-newton algorithm. In\nArti\ufb01cial Intelligence and Statistics. pp. 456\u2013463.\n\n[38] Shimizu, Shohei, Hoyer, Patrik O, Hyv\u00e4rinen, Aapo, & Kerminen, Antti. 2006. A linear\nnon-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7,\n2003\u20132030.\n\n[39] Silander, Tomi, & Myllymaki, Petri. 2006. A simple approach for \ufb01nding the globally optimal\nBayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in Arti\ufb01cial\nIntelligence.\n\n[40] Singh, Ajit P, & Moore, Andrew W. 2005. 
Finding optimal Bayesian networks by dynamic programming.

[41] Spirtes, Peter, & Glymour, Clark. 1991. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1), 62–72.

[42] Spirtes, Peter, Glymour, Clark, & Scheines, Richard. 2000. Causation, prediction, and search. Vol. 81. The MIT Press.

[43] Teyssier, Marc, & Koller, Daphne. 2005. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Uncertainty in Artificial Intelligence (UAI).

[44] Tsamardinos, Ioannis, Brown, Laura E, & Aliferis, Constantin F. 2006. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), 31–78.

[45] van de Geer, Sara, & Bühlmann, Peter. 2013. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2), 536–567.

[46] Wang, Xiangyu, Dunson, David, & Leng, Chenlei. 2016. No penalty no tears: Least squares in high-dimensional linear models. In International Conference on Machine Learning. pp. 1814–1822.

[47] Xiang, Jing, & Kim, Seyoung. 2013. A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables. In Advances in Neural Information Processing Systems. pp. 2418–2426.

[48] Yuan, Ming, & Lin, Yi. 2007. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 19–35.

[49] Zhang, Bin, Gaiteri, Chris, Bodea, Liviu-Gabriel, Wang, Zhi, McElwee, Joshua, Podtelezhnikov, Alexei A, Zhang, Chunsheng, Xie, Tao, Tran, Linh, Dobrin, Radu, et al. 2013. Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell, 153(3), 707–720.

[50] Zhong, Kai, Yen, Ian En-Hsu, Dhillon, Inderjit S, & Ravikumar, Pradeep K. 2014. Proximal quasi-Newton for computationally intensive l1-regularized m-estimators. In Advances in Neural Information Processing Systems.
pp. 2375–2383.

[51] Zhou, Qing. 2011. Multi-Domain Sampling With Applications to Structural Inference of Bayesian Networks. Journal of the American Statistical Association, 106(496), 1317–1330.

[52] Zhou, Shuheng. 2009. Thresholding procedures for high dimensional variable selection and statistical estimation. In Advances in Neural Information Processing Systems. pp. 2304–2312.